### Plotting and Analysing Weather Data with Edexcel Large Data Set - LDS

Please use Google Chrome or Mozilla Firefox to see the animations properly.

Edexcel's Large Data Set (LDS) is a game-changer for A Level Statistics. By incorporating a substantial amount of real-world data into the curriculum, Edexcel has aligned the course with the demands of modern data science and artificial intelligence. This is a significant step forward, as it empowers students to develop practical data analysis skills that are highly sought after by industry.

Traditional statistics courses often rely on small, contrived datasets, limiting students' ability to explore complex patterns and relationships. The Edexcel LDS offers a much-needed departure from this norm. By providing access to a large, diverse dataset, students can gain hands-on experience with the types of data challenges faced by professionals in fields like data science and analytics.

The LDS covers weather data collected in 1987 from a variety of locations, including five towns in the UK, a city in China, Jacksonville (USA), and Perth (Australia). This global perspective provides students with opportunities to analyze data from different climates and geographical regions.

• Plotting two sets of data on the same grid against the same time period
• You can change the time period interactively to see the changes
• During a given period, you can find the locations and spread of data
• You can check whether the two sets of data have a correlation
• You can take a random sample of a size of your choice
• The precautions you should take while taking samples from this data set
• Interactive box plots
• The units of oktas and knots are fully explained
• Where you should exercise restraint when forecasting nationwide weather from this particular data set

The Edexcel large data set covers the following data:

Please note that some cells in the Excel large data set have no data and are marked n/a; these cells must be handled before the data can be plotted.
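One way to deal with such cells is to convert every cell to a number and treat anything non-numeric as missing. The sketch below assumes missing cells arrive as the text "n/a" or as empty strings; the function names `toNumberOrNull` and `cleanColumn` and the sample values are illustrative, not taken from the original page.

```javascript
// Convert one spreadsheet cell to a number, or null when the cell holds no data.
// Assumption: missing cells appear as "n/a" (any case) or as empty strings.
function toNumberOrNull(cell) {
  const text = String(cell).trim();
  if (text === "" || text.toLowerCase() === "n/a") return null;
  const value = Number(text);
  return Number.isFinite(value) ? value : null;
}

// Keep only the usable readings of a column, remembering the original row index
// so that each value can still be matched to its date.
function cleanColumn(cells) {
  return cells
    .map((cell, row) => ({ row, value: toNumberOrNull(cell) }))
    .filter((entry) => entry.value !== null);
}

// Made-up column, not real LDS values.
const rawColumn = ["n/a", "n/a", "7", "12", "", "9"];
console.log(cleanColumn(rawColumn));
```

Keeping the row index alongside each value means the cleaned data can still be plotted against the correct dates.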

The following measures of location and spread are updated interactively:

In the animation, you can change the period of data using the slider below the chart; the chart, the measures of location and the measures of spread are all updated for that period.

[Interactive table, controlled by the Change Period slider: mean, mode, median, standard deviation, maximum, minimum and interquartile range for temperature (°C) and cloud cover (oktas)]

Please note that cloud cover, measured in oktas, is a discrete variable.

#### Formulae in Use

Mean: x̄ = Σx / n
Standard Deviation: σ = √(Σ(x - x̄)²/n) or, for a frequency table, σ = √(Σf(x - x̄)²/Σf)
Q1: 25% of data lies below this
Q3: 75% of data lies below this
Median: 50% of data lies below this
IQR: Q3 - Q1
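The mean and standard deviation formulae can be written out directly in JavaScript. This is a minimal sketch with made-up values, not LDS readings; the function names are illustrative, and a statistics library would normally do this work.

```javascript
// Mean of raw data: mean = (sum of x) / n
function mean(values) {
  return values.reduce((sum, x) => sum + x, 0) / values.length;
}

// Standard deviation of raw data: sqrt( sum of (x - mean)^2 / n )
function standardDeviation(values) {
  const m = mean(values);
  const sumOfSquares = values.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(sumOfSquares / values.length);
}

// The same measures from a frequency table given as [value, frequency] pairs:
// mean = (sum of f*x) / (sum of f), sd = sqrt( sum of f*(x - mean)^2 / sum of f )
function frequencyTableStats(table) {
  const totalF = table.reduce((sum, [, f]) => sum + f, 0);
  const m = table.reduce((sum, [x, f]) => sum + f * x, 0) / totalF;
  const variance =
    table.reduce((sum, [x, f]) => sum + f * (x - m) ** 2, 0) / totalF;
  return { mean: m, standardDeviation: Math.sqrt(variance) };
}

// Made-up values for illustration only.
const sample = [2, 4, 4, 4, 5, 5, 7, 9];
console.log(mean(sample), standardDeviation(sample)); // 5 2
console.log(frequencyTableStats([[2, 1], [4, 3], [5, 2], [7, 1], [9, 1]]));
```

Both routes give the same answer because the frequency table is just a compact way of writing the same eight values.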

#### Coding

When the data values are very large or very small, we use coding to make calculations easier.
E.g.
x: 111, 121, 131, 141, 151
This data set can be turned into y as follows by coding:
Let y = (x - 1)/10
y: 11, 12, 13, 14, 15
ȳ = Σy/5 = (11 + 12 + 13 + 14 + 15)/5 = 13
Now, the measures of location can be found in terms of y and then turned into the corresponding x values.
The same process can be used if the data values in question are very small.

Turning Coded Values into Original Values
Let y = (x - a)/b, where a and b are constants; x and y are the original and coded values respectively.
ȳ = Σy/n
= Σ(x - a)/(nb)
= Σx/(nb) - Σa/(nb)
= x̄/b - na/(nb)
= x̄/b - a/b
So x̄ = bȳ + a.
If ȳ, a and b are known, x̄ can be calculated easily.
In the above example, ȳ = 13; a = 1; b = 10
x̄ = 10ȳ + 1
x̄ = 131
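The worked example can be checked with a few lines of JavaScript. This is only a sketch; `codeValues` and `decodeMean` are illustrative names rather than functions from the original page.

```javascript
// Code the data with y = (x - a) / b.
function codeValues(xs, a, b) {
  return xs.map((x) => (x - a) / b);
}

// Recover the original mean from the coded mean: mean of x = b * (mean of y) + a.
function decodeMean(codedMean, a, b) {
  return b * codedMean + a;
}

const x = [111, 121, 131, 141, 151];
const y = codeValues(x, 1, 10); // [11, 12, 13, 14, 15]
const yBar = y.reduce((sum, value) => sum + value, 0) / y.length; // 13
console.log(decodeMean(yBar, 1, 10)); // 131
```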

#### Variables and Units in the LDS

Variables are characteristics, numbers or quantities that can be counted or measured.
E.g. wind speed, cloud cover, the number of fish in a lake, the number of girls in a class with black hair

In the large data set, LDS, the following units are used to represent the wind speed and cloud cover.

Knots

Wind speed in knots is the number of nautical miles travelled per hour.
Nautical miles are used for navigation.
1 knot ≈ 1.15 mph

You can convert knots into mph by using the following; just put the value in the text box and move the mouse out:
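The conversion behind such a text box is a single multiplication. The sketch below is an assumption about how it could be written, not the page's actual function; `knotsToMph` is an illustrative name.

```javascript
// 1 nautical mile = 1852 m and 1 statute mile = 1609.344 m,
// so 1 knot = 1852 / 1609.344 mph, which is roughly 1.15 mph.
const MPH_PER_KNOT = 1852 / 1609.344;

function knotsToMph(knots) {
  return knots * MPH_PER_KNOT;
}

console.log(knotsToMph(10).toFixed(1)); // "11.5"
```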

Okta

This is the unit of measurement of cloud cover. It's a discrete unit that ranges from 0 to 8, each okta representing one eighth of the sky; the name comes from the Greek word for eight.

◯ - 0 oktas: clear sky
◔ - 2 oktas: ¼ of the sky covered by clouds
◑ - 4 oktas: ½ of the sky covered by clouds
◕ - 6 oktas: ¾ of the sky covered by clouds
⬤ - 8 oktas: the whole sky covered by clouds

#### Sampling with LDS

You can take random samples from the LDS, provided that you know how to avoid the cells with no data. For instance, there is no data in the first 16 cells of the Daily mean wind speed column. If you treat the whole column as the population and a random number falls in that region, the sample will contain an invalid value. The same applies to systematic sampling.
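One way to take this precaution is to build the sampling frame from the usable rows only, and then sample from that frame. The sketch below assumes the column is an array of cells in which missing readings are "n/a" or empty; all function names and values are illustrative and are not the page's own code.

```javascript
// Row indices of the cells that actually hold data.
function usableRows(cells) {
  const rows = [];
  cells.forEach((cell, row) => {
    const text = String(cell).trim().toLowerCase();
    if (text !== "" && text !== "n/a") rows.push(row);
  });
  return rows;
}

// Simple random sample without replacement, drawn only from usable rows.
// `random` defaults to Math.random and can be replaced for repeatable results.
function randomSample(cells, size, random = Math.random) {
  const pool = usableRows(cells);
  if (size > pool.length) throw new RangeError("Sample size exceeds the usable data");
  // Partial Fisher-Yates shuffle: after `size` swaps the first `size` entries are the sample.
  for (let i = 0; i < size; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, size).map((row) => Number(cells[row]));
}

// Systematic sample: every k-th usable row, starting at `start` (0 <= start < k).
function systematicSample(cells, k, start = 0) {
  return usableRows(cells)
    .filter((_, i) => i % k === start)
    .map((row) => Number(cells[row]));
}

// Made-up column, not real LDS values.
const windSpeed = ["n/a", "n/a", "7", "12", "9", "n/a", "15", "6"];
console.log(randomSample(windSpeed, 3));
console.log(systematicSample(windSpeed, 2)); // [7, 9, 6]
```

In a proper systematic sample the starting point would itself be chosen at random from the first k usable rows.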

These samples from the LDS do not lead to an accurate or reliable forecast for the UK weather for the following reasons.

• The data does not cover the entire United Kingdom.
• The data covers just five areas of the country.
• The data covers only six months of the year, from May to October - a part of summer and autumn.

#### Scatter Graphs from Edexcel LDS

The following interactive chart checks whether there is any correlation between the daily temperature and the cloud cover in the Heathrow area in the United Kingdom. The temperature and cloud cover are plotted along the x-axis and y-axis respectively; the units are °C and oktas respectively.


Data Source: Edexcel
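A scatter graph shows correlation visually; the product moment correlation coefficient puts a number on it. The following is a minimal sketch with made-up pairs rather than LDS readings, and `correlation` is an illustrative name, not a function from the original page.

```javascript
// Product moment correlation coefficient:
// r = Sxy / sqrt(Sxx * Syy), where Sxy = sum of (x - mean of x)(y - mean of y),
// Sxx = sum of (x - mean of x)^2 and Syy = sum of (y - mean of y)^2.
function correlation(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new RangeError("Need two lists of equal length with at least 2 values");
  }
  const n = xs.length;
  const xBar = xs.reduce((sum, x) => sum + x, 0) / n;
  const yBar = ys.reduce((sum, y) => sum + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - xBar) * (ys[i] - yBar);
    sxx += (xs[i] - xBar) ** 2;
    syy += (ys[i] - yBar) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Made-up temperatures (°C) and cloud cover (oktas) for illustration only.
const temperature = [12.1, 14.3, 15.0, 17.8, 19.2, 16.4];
const cloudCover = [7, 6, 6, 3, 2, 5];
console.log(correlation(temperature, cloudCover).toFixed(2));
```

A value close to +1 or -1 suggests strong linear correlation, while a value close to 0 suggests little or none.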

#### Histograms from Edexcel LDS - 9 shades of grey

The following histogram is based on the cloud cover data at Heathrow from May to October 1987. Cloud cover is measured in oktas, a discrete variable. The histogram is fully interactive.


Since the data in question is discrete, the above chart can also be described as a bar chart.
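The bar heights in such a chart are simply the frequencies of each okta value from 0 to 8. A minimal sketch of that tally follows, using made-up readings rather than LDS values; the function names are illustrative.

```javascript
// Count how many days fall on each cloud-cover value from 0 to 8 oktas.
// Readings outside 0..8 or non-integers are ignored.
function oktaFrequencies(readings) {
  const counts = new Array(9).fill(0);
  for (const okta of readings) {
    if (Number.isInteger(okta) && okta >= 0 && okta <= 8) counts[okta] += 1;
  }
  return counts;
}

// The mode is the okta value with the highest frequency (the first one if tied).
function modalOkta(counts) {
  return counts.indexOf(Math.max(...counts));
}

// Made-up readings for illustration only.
const readings = [7, 8, 8, 5, 3, 8, 7, 0];
const counts = oktaFrequencies(readings);
console.log(counts); // [1, 0, 0, 1, 0, 1, 0, 2, 3]
console.log(modalOkta(counts)); // 8
```

The index of each entry in `counts` is the okta value, so the array can be passed straight to a charting library as the bar heights.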

#### Histograms from Edexcel LDS - Relative Humidity

The following histogram is based on continuous data collected over a period of six months in 1987 in the Heathrow area. The data shows that relative humidity stayed above 65% most of the time, which is why the histogram has been restricted to just 3 classes.


#### Boxplots from Edexcel Large Data set - interactive

The boxplot below is based on the data collected from May 1987 to October 1987 in the Heathrow area in the UK, home to one of the busiest airports in the world. As the chart shows, the relative humidity fluctuated between 70% and 90% during the six-month period in summer and autumn. The boxplot is fully interactive.
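A boxplot is drawn from five numbers: the minimum, Q1, the median, Q3 and the maximum. The sketch below assumes one common textbook convention for quartiles of listed data (take the position as a fraction of n; if it is a whole number, use the midpoint of that value and the next; otherwise round up). Statistics libraries may interpolate differently, so their quartiles can differ slightly. The names and values are illustrative.

```javascript
// Value at position p = fraction * n in sorted data (counting from 1):
// if p is a whole number, take the midpoint of the p-th and (p+1)-th values;
// otherwise round p up and take that value.
function quantileByPosition(sorted, fraction) {
  const p = fraction * sorted.length;
  if (Number.isInteger(p)) return (sorted[p - 1] + sorted[p]) / 2;
  return sorted[Math.ceil(p) - 1];
}

function fiveNumberSummary(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantileByPosition(sorted, 0.25);
  const median = quantileByPosition(sorted, 0.5);
  const q3 = quantileByPosition(sorted, 0.75);
  return {
    min: sorted[0],
    q1,
    median,
    q3,
    max: sorted[sorted.length - 1],
    iqr: q3 - q1,
  };
}

// Made-up relative humidity readings (%), for illustration only.
const humidity = [85, 70, 78, 90, 72, 83, 75, 80];
console.log(fiveNumberSummary(humidity));
// { min: 70, q1: 73.5, median: 79, q3: 84, max: 90, iqr: 10.5 }
```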


#### Comparing Two Data Sets from Edexcel Large Data Set

The following interactive animation has been made to compare the daily average temperatures from Camborne and Heathrow.
Change the size of the sample and keep an eye on the boxplots and the frequency tables, as they are automatically updated.

When comparing two data sets, please note the following:
1) Compare median and interquartile range
2) Compare mean and standard deviation
3) Do not compare median and standard deviation
4) Do not compare mean and interquartile range


#### Solving Problems: Edexcel Large Data Set

The above histogram is drawn from a sample of daily mean temperatures at Heathrow from May to October 1987. Find the following when the sample size is 110.

1. A formula connecting the frequency of a class and the area of its bar
2. The frequencies of the classes 6 - 8 and 18 - 20

1) Let's take a sample of 110 data values - you can change the sample size to whatever value you want.
Since the frequency of a bar of a histogram ∝ area,
f ∝ area
f = k × area
For a bar with frequency 48 and area 24,
48 = k × 24
k = 2
f = 2A, where A is the area of the bar.
2) For the class 6 - 8,
f = 2 × 1
= 2
For the class 18 - 20,
f = 2 × 5
= 10
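The same working can be sketched in JavaScript using the numbers from the solution above. The function names `scaleFactor` and `frequencyFromArea` are illustrative.

```javascript
// frequency = k * area, where k is fixed by one bar whose
// frequency and area are both known.
function scaleFactor(knownFrequency, knownArea) {
  return knownFrequency / knownArea;
}

function frequencyFromArea(area, k) {
  return k * area;
}

const k = scaleFactor(48, 24); // 2
console.log(frequencyFromArea(1, k)); // 2  (class 6 - 8)
console.log(frequencyFromArea(5, k)); // 10 (class 18 - 20)
```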

#### Bonus: getting the perfect regression line - fully interactive

With the following animation, you can see how minimising the residual sum of squares determines the line of best fit. Move the data points closer to the line with your mouse and watch the equation of the regression line change. It's fun, isn't it?
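The idea can be sketched in code: the least squares line is the one with the smallest residual sum of squares, so any other line through the same points gives a larger value. This is a minimal illustration with made-up points, not the animation's own code; the function names are assumptions.

```javascript
// Least squares regression line y = intercept + slope * x:
// slope = Sxy / Sxx and intercept = (mean of y) - slope * (mean of x).
function leastSquares(xs, ys) {
  const n = xs.length;
  const xBar = xs.reduce((sum, x) => sum + x, 0) / n;
  const yBar = ys.reduce((sum, y) => sum + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - xBar) * (ys[i] - yBar);
    sxx += (xs[i] - xBar) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: yBar - slope * xBar };
}

// Residual sum of squares for any candidate line.
function residualSumOfSquares(xs, ys, slope, intercept) {
  return xs.reduce(
    (sum, x, i) => sum + (ys[i] - (intercept + slope * x)) ** 2,
    0
  );
}

// Made-up points for illustration only.
const xs = [1, 2, 3, 4, 5];
const ys = [2, 4, 5, 4, 5];
const line = leastSquares(xs, ys);
console.log(line);
console.log(residualSumOfSquares(xs, ys, line.slope, line.intercept));
// A slightly different line always gives a larger residual sum of squares.
console.log(residualSumOfSquares(xs, ys, line.slope + 0.1, line.intercept));
```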


### For Developers

I used the fetch() function from the Fetch API to load the data from a .csv file; the original file was a .xls file, an Excel document. In addition, the following technologies were used to produce the chart and the corresponding statistical values from the data.

• The response, which arrives as a promise, was parsed to extract the required data.
• In order to plot the data, the Chart.js library was used.
• In order to find the statistical values, the simple-statistics JavaScript library was used.
• Two JavaScript functions were created to turn knots into mph and to get a sample from the data set.
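The loading step might look something like the sketch below. It is not the code used on the page: the function names, the file name and the column heading are placeholders, and the parser assumes a simple CSV with no quoted fields containing commas.

```javascript
// Split CSV text into rows of cells.
// Assumption: no quoted fields that contain commas or line breaks.
function parseCsv(text) {
  return text
    .trim()
    .split(/\r?\n/)
    .map((line) => line.split(",").map((cell) => cell.trim()));
}

// Pull one named column out of the parsed rows (the first row holds the headings).
function column(rows, heading) {
  const index = rows[0].indexOf(heading);
  if (index === -1) throw new Error(`No column called "${heading}"`);
  return rows.slice(1).map((row) => row[index]);
}

// Load the file with fetch(), which returns a promise for the response.
async function loadCsv(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return parseCsv(await response.text());
}

// Hypothetical usage; the file name and heading are placeholders:
// loadCsv("heathrow-1987.csv").then((rows) => {
//   const temperatures = column(rows, "Daily mean temperature");
// });
```

The cells come back as text, so they still need converting to numbers, with the n/a cells handled, before they are passed to a charting or statistics library.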