Plotting and Analysing Weather Data with Edexcel Large Data Set (LDS)
Please use Google Chrome or Mozilla Firefox to see the animations properly.
Edexcel's Large Data Set (LDS) is a game-changer for A Level Statistics. By incorporating a substantial amount of real-world data into the curriculum, Edexcel has aligned the course with the demands of modern data science and artificial intelligence. This is a significant step forward, as it helps students develop practical data analysis skills that are highly sought after by industry.
Traditional statistics courses often rely on small, contrived datasets, limiting students' ability to explore complex patterns and relationships. The Edexcel LDS offers a much-needed departure from this norm. By providing access to a large, diverse dataset, it gives students hands-on experience with the types of data challenges faced by professionals in fields like data science and analytics.
The LDS covers weather data collected in 1987 from a variety of locations, including five towns in the UK, a city in China, Jacksonville (USA), and Perth (Australia). This global perspective gives students opportunities to analyse data from different climates and geographical regions.
 Plotting two sets of data on the same grid against the same time period
 You can change the time period interactively to see the changes
 During a given period, you can find the locations and spread of data
 You can check whether the two sets of data have a correlation
 You can take a random sample of a size of your choice
 The precautions you should take while taking samples from this data set
 Interactive box plots
 The units of oktas and knots are fully explained
 The areas where you should exercise restraint when forecasting island-wide weather from this particular data set
The Edexcel large data set covers the following data:
Please note that data in some cells in the Excel large data set is missing, represented by n/a, which is a serious challenge for a developer to overcome before plotting.
The large data set can be downloaded from the following link (2020):
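The missing cells can be handled in several ways. The following is a minimal sketch in Python, not the JavaScript behind this page, of one option: read a column and mark every n/a cell as missing before any calculation or plotting. The column names and the three sample rows are made up for illustration and are not copied from the real file.

```python
import csv
import io

# Hypothetical extract: the column names and values are illustrative only,
# not taken from the real LDS file.
SAMPLE_CSV = """Date,Daily Mean Temperature,Daily Mean Windspeed
01/05/1987,9.5,n/a
02/05/1987,n/a,7
03/05/1987,11.2,9
"""


def to_number(cell):
    """Return the cell as a float, or None when it is missing or not numeric."""
    text = cell.strip()
    if text.lower() in ("", "n/a"):
        return None
    try:
        return float(text)
    except ValueError:
        return None


def read_column(csv_text, column):
    """Read one column, keeping None as a placeholder for each missing cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [to_number(row[column]) for row in reader]


temperatures = read_column(SAMPLE_CSV, "Daily Mean Temperature")
usable = [t for t in temperatures if t is not None]
print(temperatures)  # [9.5, None, 11.2]
print(usable)        # [9.5, 11.2]
```

Keeping None as a placeholder preserves the position of each day, which matters when two columns are plotted against the same dates.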
Download Edexcel Large Data Set
The following measures of location and spread are updated interactively:
In the animation, you can change the period of data using a slider below the chart; not only does the chart get updated, but also locations of data and spread for that particular period are updated accordingly.
[Interactive table, driven by the Change Period slider: mean, mode, median, standard deviation, maximum, minimum and interquartile range for temperature (°C) and cloud cover (oktas)]
Please note that cloud cover, measured in oktas, is a discrete variable.
Formulae in Use
Mean: x̄ = Σx / n
Standard Deviation: σ = √(Σ(x − x̄)²/n) or σ = √(Σf(x − x̄)²/Σf)
Q1: 25% of data lies below this
Q3: 75% of data lies below this
Median: 50% of data lies below this
IQR: Q3 − Q1
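As a rough illustration of these formulae, here is a short Python sketch using the standard statistics module. The temperatures are made-up values, not LDS data. Quartile conventions differ between textbooks and software, so the "inclusive" method chosen here is an assumption and may not match the rule used in your course.

```python
import statistics

# Illustrative daily temperatures in °C (made-up values, not LDS data).
temps = [12.0, 14.0, 14.0, 15.0, 17.0, 18.0, 22.0]

mean = statistics.mean(temps)
sigma = statistics.pstdev(temps)   # population form: divides by n, as in the formula above
median = statistics.median(temps)
mode = statistics.mode(temps)

# quantiles() with n=4 returns the three cut points Q1, Q2 and Q3.
q1, _, q3 = statistics.quantiles(temps, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode)   # 16.0 15.0 14.0
print(q1, q3, iqr)          # 14.0 17.5 3.5
print(round(sigma, 2))      # 3.07
```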
Coding
When the data is too large or too small, we use coding to make calculations easier.
E.g.
x: 111, 121, 131, 141, 151
This data set can be turned into y as follows by coding:
Let y = (x − 1)/10
y: 11, 12, 13, 14, 15
ȳ = Σy/5 = (11 + 12 + 13 + 14 + 15)/5 = 13
Now, the locations of the data can be found in terms of y and then turned into corresponding x values.
The same process can be used if the data in question is too small.
Turning Coded Values into Original Values
Let y = (x − a)/b, where a and b are constants; x and y are the original and coded values respectively.
ȳ = Σy/n
= Σ(x − a)/(nb)
= Σx/(nb) − Σa/(nb)
= x̄/b − na/(nb)
= x̄/b − a/b
x̄ = bȳ + a
If ȳ, a and b are known,
x̄ can be calculated easily.
In the above example, ȳ = 13; a = 1; b = 10
x̄ = 10ȳ + 1
x̄ = 131
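The worked example can be checked with a few lines of Python. This is only a sketch of the arithmetic above. The last check uses the standard result that the standard deviation scales by b alone (σx = bσy), which is not derived in this tutorial.

```python
import math
from statistics import mean, pstdev

a, b = 1, 10
x = [111, 121, 131, 141, 151]

# Code the data: y = (x - a) / b
y = [(value - a) / b for value in x]

y_bar = mean(y)
x_bar = b * y_bar + a   # decode the mean: x̄ = bȳ + a

print(y)       # [11.0, 12.0, 13.0, 14.0, 15.0]
print(y_bar)   # 13.0
print(x_bar)   # 131.0

# The decoded mean agrees with the mean of the original data.
assert x_bar == mean(x)
# Subtracting a does not change the spread, so only b affects the standard deviation.
assert math.isclose(pstdev(x), b * pstdev(y))
```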
Variables and Units in the LDS
Variables are characteristics, numbers or quantities that can be counted or measured.
E.g. wind speed, cloud cover, number of fish in a lake, number of girls in a class with black hair
In the large data set, LDS, the following units are used to represent the wind speed and cloud cover.
Knots
The number of nautical miles per hour gives the speed of wind in knots.
Nautical miles are used for navigation.
1 knot ≈ 1.15 mph
You can convert knots into mph by using the following; just put the value in the text box and move the mouse out:
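Away from this page, the same conversion can be sketched in Python. The function name knots_to_mph is chosen for illustration. The factor comes from the definitions of the two units: a nautical mile is 1852 m and a statute mile is 1609.344 m.

```python
METRES_PER_NAUTICAL_MILE = 1852.0
METRES_PER_MILE = 1609.344


def knots_to_mph(knots):
    """Convert a speed in knots (nautical miles per hour) to miles per hour."""
    return knots * METRES_PER_NAUTICAL_MILE / METRES_PER_MILE


print(round(knots_to_mph(1), 2))    # 1.15
print(round(knots_to_mph(10), 2))   # 11.51
```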
Okta
This is the unit of measurement of cloud cover. It's a discrete unit that ranges from 0 to 8: the sky is divided into eighths, hence the name, which shares its root with octave.
◯ 0 oktas: clear sky
◔ 2 oktas: ¼ of the sky covered by clouds
◑ 4 oktas: ½ of the sky covered by clouds
◕ 6 oktas: ¾ of the sky covered by clouds
⬤ 8 oktas: the whole sky covered by clouds
Sampling with LDS
You can take random samples from the LDS, provided that you know how to avoid the cells with no data. For instance, there is no data in the first 16 cells of the Daily mean wind speed column. If you treat the whole column as the population and a random number falls in that region, the missing value is going to cause an error. The same applies to systematic sampling.
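One way to follow this precaution is to remove the empty cells first and sample only from what is left. The Python sketch below does that for both simple random and systematic sampling. The function names and the wind-speed values are invented for illustration, and None stands for a cell with no data.

```python
import random


def valid_values(column):
    """Drop cells with no data (None stands for an n/a cell)."""
    return [value for value in column if value is not None]


def simple_random_sample(column, size, seed=None):
    """Pick `size` distinct usable cells at random."""
    population = valid_values(column)
    if size > len(population):
        raise ValueError("sample size exceeds the number of usable cells")
    return random.Random(seed).sample(population, size)


def systematic_sample(column, size, seed=None):
    """Pick every k-th usable cell, starting from a random position."""
    population = valid_values(column)
    if size < 1 or size > len(population):
        raise ValueError("sample size must be between 1 and the number of usable cells")
    step = len(population) // size
    start = random.Random(seed).randrange(step)
    return [population[start + i * step] for i in range(size)]


# Made-up column: 16 empty cells followed by 12 wind speeds in knots.
wind = [None] * 16 + [5, 7, 9, 4, 6, 8, 10, 3, 7, 5, 6, 9]

print(simple_random_sample(wind, 5, seed=1))
print(systematic_sample(wind, 4, seed=1))
```

Because the empty cells are removed before any random number is drawn, neither function can ever land on a cell with no data.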
These samples from the LDS do not lead to an accurate or reliable forecast for the UK weather for the following reasons.
 The data does not cover the entire United Kingdom.
 The data covers just five areas of the country.
 The data covers a period of just 6 months of the year, taking in part of summer and autumn.
Scatter Graphs from Edexcel LDS
The following interactive chart checks whether there is any correlation between the daily temperature and the cloud cover in the Heathrow area in the United Kingdom. The temperature and cloud cover are plotted along the x-axis and y-axis respectively; the units are °C and oktas respectively.
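A scatter graph shows correlation visually. To put a number on it, the product moment correlation coefficient can be computed, as in the Python sketch below. The function name pmcc and the five data pairs are made up for illustration and are not LDS values.

```python
from math import sqrt


def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant.
    return s_xy / sqrt(s_xx * s_yy)


# Made-up pairs: temperature in °C and cloud cover in oktas.
temperature = [10, 12, 15, 18, 20]
cloud_cover = [8, 7, 5, 4, 2]

r = pmcc(temperature, cloud_cover)
print(round(r, 2))   # -0.99
```

A value of r close to −1 or 1 indicates strong linear correlation, and a value close to 0 indicates little or none.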
Histograms from Edexcel LDS: 9 shades of grey
The following histogram is based on the cloud cover data at Heathrow from May to October 1987. Cloud cover is measured in oktas, a discrete variable. The histogram is fully interactive.
Since the data in question is discrete, the above chart can also be described as a bar chart.
Histograms from Edexcel LDS: Relative Humidity
The following histogram is based on continuous data, collected over a period of six months in 1987, in the Heathrow area. The data shows that relative humidity stayed above 65% most of the time, which is why the histogram has been restricted to just 3 classes.
Boxplots from Edexcel Large Data Set: interactive
The boxplot below is based on the data collected from May 1987 to October 1987 in the Heathrow area in the UK, home to one of the busiest airports in the world. As the chart shows, the relative humidity fluctuated between 70% and 90% during the six months of the summer and autumn seasons. The boxplot is fully interactive.
Comparing Two Data Sets from Edexcel Large Data Set
The following interactive animation compares the daily average temperatures from Camborne and Heathrow.
Change the size of the sample and keep an eye on the boxplots and the frequency tables, as they are automatically updated.
When comparing two data sets, please note the following:
1) Compare median and interquartile range
2) Compare mean and standard deviation
3) Do not compare median and standard deviation
4) Do not compare mean and interquartile range
Solving Problems: Edexcel Large Data Set
The above histogram shows how to take a sample of daily mean temperatures in Heathrow in 1987, from May to October. Answer the following when the sample size is 110.
 A formula linking frequency to the area of a bar (height × class width)
 The frequency of the classes 6-8 and 18-20
1) Let's take a sample of 110 data values; you can change the sample size to whatever value you want.
Since the frequency of a bar of a histogram ∝ area,
f ∝ area
f = k × area
For the bar with frequency 48 and area 24:
48 = k × 24
24k = 48
k = 2
f = 2A, where A is the area of the bar
2) For the class 6-8, where the area is 1,
f = 2 × 1
= 2
For the class 18-20, where the area is 5,
f = 2 × 5
= 10
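The same two steps can be written as a short Python sketch: find k from one bar whose frequency and area are both known, then use f = k × area for any other bar. The function names are chosen for illustration.

```python
def scale_factor(known_frequency, known_area):
    """Find k in f = k × area from a bar whose frequency and area are known."""
    return known_frequency / known_area


def frequency(area, k):
    """Frequency represented by a bar of the given area."""
    return k * area


k = scale_factor(48, 24)
print(k)                 # 2.0
print(frequency(1, k))   # 2.0  (class 6-8)
print(frequency(5, k))   # 10.0 (class 18-20)
```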
Bonus: getting the perfect regression line, fully interactive
With the following animation, you can see how the residual sum of squares determines the perfect regression line. Move the data points closer to the line with your mouse and see the equation of the regression line. It's fun, isn't it?
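For readers who want to see the calculation behind the animation, here is a minimal Python sketch of least squares regression. It is not the code used on this page. The data points and function names are made up. The line y = a + bx uses b = Sxy/Sxx and a = ȳ − bx̄, which are the values that make the residual sum of squares as small as possible.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises the residual sum of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def residual_sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up points for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = residual_sum_of_squares(xs, ys, a, b)
print(round(a, 2), round(b, 2), round(best, 2))   # 2.2 0.6 2.4

# Any other line gives a larger residual sum of squares.
print(residual_sum_of_squares(xs, ys, a, b + 0.1) > best)   # True
```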
For Developers
I used the fetch() function from the Fetch API to load the data from a .csv file; the original file was an .xls file, an Excel document. In addition, the following technologies were used to produce the chart and the corresponding statistical values from the data.
 The data, which arrives as a promise, was parsed to extract the required values.
 In order to plot the data, the Chart.js library was used.
 In order to find the statistical values, the simple-statistics JavaScript library was used.
 Two JavaScript functions were created to turn knots into mph and to get a sample from the data set.