Unit 6: Maths and Science for Technicians – statistics part 1
Mean, median and mode: measures of ‘central tendency’

The mean, median and mode are three kinds of average. If you have a data set (say the angles of impact calculated from 5 blood stains that were all dropped at the same angle) you can find an average or typical value to represent those 5 results. The average can then be used in subsequent calculations. The phrase measure of central tendency is used as an alternative to the word average; more about this later.

Suppose you have the following set of data (perhaps lengths of 6 bloodstains in millimetres)

14.3   9.2   18.6   7.4   7.4   13.1

Average: Mean
How to find: Find the total of the data and then divide the total by the number of results
Example: Total is 70.0, and 70.0 ÷ 6 = 11.67 = 11.7 (3 s.f.)

Average: Median
How to find: Put the numbers in order of size. Odd number of results: pick the middle value. Even number of results: take the mean of the middle two values
Example: 7.4   7.4   9.2   13.1   14.3   18.6, so the median is (9.2 + 13.1) ÷ 2 = 11.15 = 11.2 (3 s.f.)

Average: Mode
How to find: Find the value that appears most commonly in the list of data
Example: Mode is 7.4
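The three averages for the bloodstain lengths can also be checked on a computer. The sketch below is one possible way of doing this, using Python's standard statistics module; the variable names are illustrative only and are not part of the handout.

```python
from statistics import mean, median, mode

# Bloodstain lengths in millimetres, taken from the example above.
lengths_mm = [14.3, 9.2, 18.6, 7.4, 7.4, 13.1]

mean_length = mean(lengths_mm)      # total divided by the number of results
median_length = median(lengths_mm)  # even count, so the mean of the two middle values
modal_length = mode(lengths_mm)     # the most common value

print(f"mean   = {mean_length:.1f} mm")
print(f"median = {median_length:.2f} mm")
print(f"mode   = {modal_length} mm")
```

The printed figures can be compared with the hand calculation above.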
Your turn

Work out the three averages for each of the following data sets…

• The annual gross salaries paid by Company A are as follows

£12 500   £13 500   £13 500   £17 250   £87 500

• Company B pays the following salaries

£30 750   £30 750   £29 850   £23 900   £28 950

Discussion:
• Compare the mean salaries for the two companies
• Compare the median salaries for the two companies
• How do the two kinds of average compare?

KPB 2004
http://bodmas.org/bnc/
Populations, samples and bias

If you were trying to make a definite statement about (say) the average height of adult males in England and Wales, you would have two main alternatives
• Conduct a census: measure the height of every adult male in England and Wales on a certain day
• Take a sample: pick at random a sample of (say) 1 000 adult males from England and Wales and measure their heights

The first alternative would be expensive, a logistical nightmare, and might not provide much more information than the second alternative. The key word here is random. One way of picking a sample of adult males from England and Wales would be to
• Define the population very carefully by a unique identifier, perhaps National Insurance number.
• The NI number provides the sampling frame for the population.
• Then a sample could be picked at random using a computer program to select NI numbers or by picking the digits of the NI numbers from a hat.
It is very important to avoid bias in the sampling process. People asked to pick numbers at random between 1 and 10 will tend to pick numbers near the middle of the range – unless they are deliberately compensating in a statistics lesson!
Estimating population parameters

When you calculate the mean height of the sample of 1 000 adult males from England and Wales, the result is called an estimate of the mean height of the population. The estimate is probably very close to the actual mean height of the population, but there may be a small difference called sampling error. One way to quantify the sampling error is to pick another sample of 1 000 adult males and measure their heights… You can treat the means of samples as data items. More on this later…
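The idea of sampling error can be tried out with a small simulation. The sketch below is only an illustration: the "population" is 50 000 made-up heights generated by the computer (assumed mean 175 cm, assumed spread 7 cm), not real survey data, and the names used are hypothetical.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so that the sketch gives the same numbers each run

# Hypothetical population: 50 000 simulated heights in cm (not real measurements).
population = [random.gauss(175, 7) for _ in range(50_000)]
population_mean = mean(population)

# Pick five random samples of 1 000 and record the mean of each sample.
sample_means = [mean(random.sample(population, 1_000)) for _ in range(5)]

for m in sample_means:
    error = m - population_mean  # sampling error of this particular sample
    print(f"sample mean = {m:.2f} cm, sampling error = {error:+.2f} cm")
```

Each sample mean lands close to, but not exactly on, the population mean; the small differences are the sampling errors.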
Your turn: discussion

The aliens have landed in the Bull Ring and they kidnap the next 5 people that walk past the statue of the bull. List as many different ways in which this sample of Homo sapiens could be biased as you can think up in two minutes…
Classification of data

What is the ‘average’ eye colour for the following data set?

Hazel   Blue   Grey   Brown   Green
There is not any kind of ‘typical’ eye colour for this group. If there had been two people with (say) brown eyes, you could assign a modal eye colour to the group.
Variables and data

For each of the bloodstains in Scenario 1, you measured the width and the length of the stain. The width and length are referred to as the variables you are measuring, and a third variable was the height of the dissection board for that particular bloodstain. Your table of results constitutes the data set. Variables can be classified as follows…

Variables / data
   Qualitative
   Quantitative
      Discrete
      Continuous

Examples of each of the three types of variable include

Type of variable           Example
Qualitative                Ethnic self-classification
Quantitative discrete      Number of people on the 97 bus at 0730 each day
Quantitative continuous    Heights of people picked at random in a football crowd
The distinction between discrete and continuous variables is often a bit fuzzy. For instance, we often treat money as a continuous variable but strictly speaking it is discrete and quantized in units of 1p. By the same token, height and weight of people are classic examples of continuous variables but we often round heights to the nearest 1 cm and weights to the nearest 1 kg, effectively making them discrete!
Your turn

Discussion exercise: classify the following variables…

Try to classify the following variables – if continuous state the quantization level or typical rounding precision.

Variable                                                  Classification
Mass of a suspected bag of cocaine
Length of a skid mark in a road accident
Number of people visiting a shop each hour of the day
Shirt collar/dress/shoe size

Discussion exercise: Bloodstain shape

Can you think of any qualitative variables that it might be possible to record for the bloodstains? Can you classify the appearance of bloodstains somehow?
Frequency Distributions

Finding an average gives you a ‘typical value’ that can represent your data. Part of choosing the best average for your data is to see how spread out the data is. One way of doing this is to draw a histogram or frequency bar chart. If you have a list of 100 numbers, say heights of people, you have to ‘bin’ the numbers into a series of intervals of height. The tally chart is a common way of doing this. Below are the heights in centimetres (rounded to the nearest centimetre) of 100 people.

138 139 139 144 145 145 145 145 146 147
149 149 149 149 149 150 150 150 150 151
152 152 153 153 153 153 153 153 153 153
153 154 154 154 154 154 154 154 154 155
155 156 156 156 156 157 157 158 159 159
160 160 160 160 160 160 161 161 161 161
162 162 162 162 162 163 164 164 164 164
165 165 166 166 166 166 166 167 167 167
167 167 167 168 168 168 168 169 169 169
170 171 171 172 172 172 173 173 173 176
The smallest height is 138 cm and the largest height is 176 cm, so the range is 176 – 138 = 38 cm. We can use 9 intervals as follows…

Interval          Tallies          Frequency
135 ≤ X < 140
140 ≤ X < 145
145 ≤ X < 150
150 ≤ X < 155
155 ≤ X < 160
160 ≤ X < 165
165 ≤ X < 170
170 ≤ X < 175
175 ≤ X < 180
Total of the frequencies           100
My frequency distribution for the data set above was as follows

Interval          Frequency
135 ≤ X < 140     3
140 ≤ X < 145     1
145 ≤ X < 150     11
150 ≤ X < 155     24
155 ≤ X < 160     11
160 ≤ X < 165     20
165 ≤ X < 170     20
170 ≤ X < 175     9
175 ≤ X < 180     1
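The tallying can also be done by computer. The sketch below is one possible way of binning the 100 heights into the same 5 cm intervals in Python; the names heights_cm, intervals and frequencies are illustrative and not part of the handout.

```python
# The 100 heights in cm from the list above.
heights_cm = [
    138, 139, 139, 144, 145, 145, 145, 145, 146, 147,
    149, 149, 149, 149, 149, 150, 150, 150, 150, 151,
    152, 152, 153, 153, 153, 153, 153, 153, 153, 153,
    153, 154, 154, 154, 154, 154, 154, 154, 154, 155,
    155, 156, 156, 156, 156, 157, 157, 158, 159, 159,
    160, 160, 160, 160, 160, 160, 161, 161, 161, 161,
    162, 162, 162, 162, 162, 163, 164, 164, 164, 164,
    165, 165, 166, 166, 166, 166, 166, 167, 167, 167,
    167, 167, 167, 168, 168, 168, 168, 169, 169, 169,
    170, 171, 171, 172, 172, 172, 173, 173, 173, 176,
]

# Nine intervals of width 5 cm: 135 <= X < 140, 140 <= X < 145, ... 175 <= X < 180.
intervals = [(low, low + 5) for low in range(135, 180, 5)]

# Count how many heights fall in each interval (lower limit included, upper excluded).
frequencies = [
    sum(1 for h in heights_cm if low <= h < high) for low, high in intervals
]

for (low, high), f in zip(intervals, frequencies):
    print(f"{low} <= X < {high}: {f}")
print("Total of the frequencies:", sum(frequencies))
```

Checking that the frequencies add up to the number of data items is a quick way of spotting a tallying mistake.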
Below is a histogram produced in MS Excel by means of a subterfuge – I used a bar chart with customised category axis labels. I used the midpoints of the intervals to label each bar.

[Figure: Histogram of heights. Horizontal axis: Height / cm, bars labelled with the interval midpoints 137.5 to 177.5. Vertical axis: frequency, 0 to 25.]
Your turn

Below is some data on the heights in mm of watercress seedlings grown on blotting paper…

h (mm)          Frequency
5 < h ≤ 10      12
10 < h ≤ 15     22
15 < h ≤ 20     10
20 < h ≤ 25     6
25 < h ≤ 30     4
30 < h ≤ 35     1

Plot a histogram of this frequency distribution on graph paper in ‘landscape’ format using suitable scales. Write a sentence describing the shape of the histogram. Where does the modal interval fall?
Illustrating small data sets: the dot plot

Histograms are often plotted from small numbers of results, but really this is not appropriate. A histogram of a sample of 100 heights changes a lot each time a new sample is drawn. A histogram of a sample of 1000 heights ‘bounces around’ much less. A better kind of diagram for small data sets is the dot plot. The dot plot is best illustrated by example…

Suppose two people (SB and MB for the sake of argument) performed a titration 10 times each and recorded the amount of acid needed to neutralise 100 cm³ of alkali. Data was recorded to the nearest 0.1 cm³.

SB   48.7  41.9  45.8  49.0  46.3  49.0  55.6  44.7  54.9  48.4
MB   48.8  51.2  50.5  51.3  50.7  51.2  50.5  50.5  49.1  49.6

These results could be compared using a dot plot as drawn below…

[Figure: dot plot of the MB and SB results, one row of dots per person, on a common horizontal axis. Axis: Acid volume / cm³, running from 41 to 55.]

As you can see, in a dot plot
• Each data value is represented by a dot
• Dots are plotted on a horizontal scale showing an appropriate range of values of the variable
• If two (or more) dots have the same value, then they are plotted at the correct value of the variable but vertically separated
• Two different samples can be plotted on the same axes to show differences between the samples. The samples are separated by a larger vertical space, plotted using different symbols for each sample, and identified with a key.
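The stacking rule for repeated values can be sketched in a few lines of Python. This is only an illustration of the rule (the function name dot_positions is made up for this example): it works out where each dot goes, and a plotting package or graph paper would then be used to draw them.

```python
from collections import Counter

def dot_positions(values):
    """Return (value, level) pairs; repeated values are stacked at levels 1, 2, 3..."""
    seen = Counter()
    positions = []
    for v in values:
        seen[v] += 1                   # how many dots already sit at this value
        positions.append((v, seen[v]))
    return positions

# Titration results in cm3 from the table above.
sb = [48.7, 41.9, 45.8, 49.0, 46.3, 49.0, 55.6, 44.7, 54.9, 48.4]
mb = [48.8, 51.2, 50.5, 51.3, 50.7, 51.2, 50.5, 50.5, 49.1, 49.6]

sb_dots = dot_positions(sb)
mb_dots = dot_positions(mb)

# A crude text version: one line per distinct value, one dot per result.
for name, data in (("SB", sb), ("MB", mb)):
    print(name)
    counts = Counter(data)
    for value in sorted(counts):
        print(f"  {value:5.1f} " + "•" * counts[value])
```

In the text version each repeated value shows up as a longer row of dots, which is the same information as the vertical stacking on a drawn dot plot.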
Your turn

Discussion point: Describe the differences between MB and SB’s results as displayed on the dot plot. Find two ways in which the distribution of dots is different for MB and SB.

Plot a dot plot on graph paper: Try plotting a dot plot for two more people (KB and PS) repeating this titration. Their results are shown below…

KB     PS
45.9   46.0
48.1   42.9
51.9   47.7
50.2   43.8
52.1   45.9
50.6   43.0
50.0   47.1
48.8   43.3
49.8   45.9
50.0   47.3

Write a few sentences about what the dot plot shows you.
• Compare the spread of each data set.
• Compare the middle of each data set (the ‘central tendency’)
• Plot a special point with a different symbol (perhaps ↓ ) marking the mean of each set of data.
Link with errors work

Recall that when we were discussing the errors work, we looked at precision and accuracy…

[Figure: four panels illustrating the combinations of High precision / Low precision and High accuracy / Low accuracy.]

Dot plots that show a low spread (tightly bunched dots) might be the result of more precise measurements than dot plots that show a wide spread. A dot plot can help you gauge the amount of random error in a measurement. Systematic error is harder to deal with and may not show in statistical analysis.
Measure of dispersion: Standard Deviation

Recall the errors work we looked at in the first session: we divided the error up into systematic and random components…

• Systematic error (unknown)
• Random error (knowable)

The standard deviation is a measure of how spread out your data is (the dispersion of your data) and can be used to calculate the size of random error. As we are working mostly with samples that are drawn from a larger distribution, we use a formula for the sample standard deviation…

$$S = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$$

Where
• the large Σ symbol means ‘find the total of all the following’
• x̄ stands for the mean
• x stands for each data value in turn
• N stands for the number of data items

What the formula is asking you to do is…
• Find the mean of the data (x̄)
• Subtract the mean from each data value in turn and then square the result
• Find the total of the squares of the deviations from the mean
• Divide this total by one less than the number of data items – you now have the mean square of the deviations from the mean
• Take the square root to find the sample standard deviation

The best way to learn to apply this formula is to work with an example and to work in columns to represent each of the stages in the calculation…
Typical question

Calculate the mean and sample standard deviation of the 10 acid volumes obtained by SB in the titration experiment: here are the results again in cm³…

SB   48.7  41.9  45.8  49.0  46.3  49.0  55.6  44.7  54.9  48.4

The first thing to do is to calculate the mean for this data. I get the total to be 484.3, and so the mean is 48.43 cm³, which we can conveniently round to 48.4. The next thing to do is arrange the data in the first column of a three column table – we will use the other two columns to record results as we go on…

x             x − x̄                    (x − x̄)²
Data value    Deviation                Square deviation
48.7          48.7 – 48.4 = 0.3        0.3² = 0.09
41.9          41.9 – 48.4 = −6.5       (−6.5)² = 42.25
45.8
49.0
46.3
49.0
55.6
44.7
54.9
48.4
Total square deviation                 162.01

Once you have your value for the total square deviation, you can complete the calculation as follows…

$$S = \sqrt{\frac{162.01}{10 - 1}} = \sqrt{18.0} = 4.24 \approx 4.2$$
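The column method can be mirrored step by step on a computer. The sketch below is one way of doing so in Python, with each list playing the part of one column of the table; the variable names are illustrative. Note that it uses the unrounded mean (48.43) rather than 48.4, so its total square deviation differs very slightly from the hand-worked 162.01, although the final answer agrees to two decimal places.

```python
from math import sqrt
from statistics import stdev

# SB's acid volumes in cm3 (column 1 of the table).
sb = [48.7, 41.9, 45.8, 49.0, 46.3, 49.0, 55.6, 44.7, 54.9, 48.4]

n = len(sb)
mean_x = sum(sb) / n                         # find the mean
deviations = [x - mean_x for x in sb]        # column 2: x minus the mean
squares = [d ** 2 for d in deviations]       # column 3: squared deviations
total_square_deviation = sum(squares)        # total of column 3
s = sqrt(total_square_deviation / (n - 1))   # divide by N - 1, then square root

print(f"mean = {mean_x:.2f} cm3")
print(f"total square deviation = {total_square_deviation:.2f}")
print(f"sample standard deviation = {s:.2f} cm3")

# The library routine stdev() also divides by N - 1, so it should agree.
print(f"statistics.stdev gives {stdev(sb):.2f} cm3")
```

Comparing the hand-built result with the library routine is the same kind of cross-check as comparing your table with your calculator.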
Your turn

Calculate the mean and sample standard deviation of the 10 acid volumes obtained by KB in the titration experiment: here are the results again in cm³…

KB   45.9  48.1  51.9  50.2  52.1  50.6  50.0  48.8  49.8  50.0

Use the blank table below to calculate your results…

x             x − x̄          (x − x̄)²
Data value    Deviation      Square deviation










Total square deviation
Remember to complete the calculation by dividing the total square deviation by one less than the number of data items and then taking the square root of the result. I found the answer to be 1.8 cm3 rounded to two significant figures. You might want to put marks on your dot plot for this data at ±S either side of the mean, and then ±2S and then ±3S where S is the sample standard deviation.
Standard deviation: calculating with MS Excel

Excel has functions built in that will calculate a wide range of statistics for you. To calculate the sample standard deviation, you just
• Type your data into a column (say the first number is typed in cell A1 and the 10th in cell A10)
• Click in cell A12 (or any cell outside the data range) and type =stdev(A1:A10) and press enter
• You should see the standard deviation figure appear

The command =stdev(A1:A10) is called a formula; the ‘=’ sign tells Excel that a calculation is needed. The cell range A1:A10 tells Excel which figures to include in the calculation.

Calculating with your scientific calculator

All scientific calculators have a routine for calculating the standard deviation.
• Learning how to use your particular calculator’s routine is best done as a ‘simon says’ exercise in the lesson as each make of calculator (and sometimes different models within a manufacturer’s range) will have a different logic.
• Make absolutely sure that you can get the same answers for mean and standard deviation using your calculator’s statistics mode as you do when using the table layout.
• Calculators often use the symbol σn-1 for what we have called the sample standard deviation. The symbol σn-1 will often be found as a legend above a key.
Sample standard deviation as an estimate of population standard deviation

The sample standard deviation of (say) the diameter of 30 airgun pellets taken at random from a box of 500 pellets provides you with an estimate of the standard deviation of the population in the box (in turn a sample of the whole production of pellets!). There will be some error on the sample standard deviation estimate. We use different symbols for statistics calculated from samples and the values we are trying to estimate for the population…

                      Sample        Population
Mean                  x̄             µ
Standard deviation    S or S.D.     σ
Homework: Week 4

If you work through the questions below in time for the next lesson, you will have all the mathematical techniques at your fingertips. Next week we look at the Normal Distribution and the standard error of the mean, and then go on to calculate confidence intervals of the mean.

Skill questions

1) Below is some data relating to the reaction time in milliseconds (thousandths of a second) of a person before (B) and after (A) ingesting 60 mL of a rather good malt whisky

B   476  524  511  527  514  524  510  510  482  491
A   839  785  816  842  820  849  895  808  890  837

a) Plot both sets of results on a single dot plot
b) Calculate the mean and standard deviation for both sets of results
c) Organise your statistics into a table to allow comparison

2) Below is a series of 20 measurements of the weight of mature weevils in mg…

346  303  371  417  414  438  262  349  409   311
284  277  316  325  265  334  342  366  1500  329

a) Calculate the mean and standard deviation of this data set
b) Which weevil has a weight that is very far from the mean?
c) Calculate the value of

$$\frac{x - \bar{x}}{S}$$

where (x − x̄) is the difference between the extreme weevil and the mean and S is the standard deviation.
Questions that need explanations: harder

3) Two science technicians are calibrating an electrophoresis gel. Technician C measures the position of the 4th calibration band on 7 gels. Technician D measures the position of the 4th calibration band on 9 gels. Their results are shown below in mm.

Tech C   10.9  13.1  16.9  15.2  17.1  15.6  15.0
Tech D   6.8   11.3  18.7  15.5  19.2  16.2  14.9  12.6  14.0

Draw diagrams and make calculations to compare the results of the two technicians. Write a short paragraph explaining which of the two technicians is generating the most consistent results. Explain your reasoning. Hint: a dot plot might be a good starting point

4) Below are the measurements of a sample of airgun pellets in mm to the nearest 0.1 mm.

2.3  2.4  2.4  2.5  2.5  2.5  2.6  2.6  2.6  2.6
2.7  2.7  2.7  2.8  2.8  2.9  3.0  3.0  3.1  3.2
3.2  3.3  3.4  3.4  3.5  3.8  4.2  4.2  4.4  6.5

a) Draw a dot plot showing the data
b) How would you describe the distribution of dots?
c) Find the median and the mean. Write a sentence comparing the two values of central tendency.
d) Calculate the standard deviation using a calculator routine.
e) Is the 6.5 value an outlier?
The five number summary

There is an alternative to the mean and standard deviation for describing distributions. The five number summary consists of the following set of numbers
• The minimum value in the data set
• The lower quartile (LQ) value (where 25% of the data are lower than the quartile)
• The median
• The upper quartile (UQ) (where 75% of the data are lower than the upper quartile)
• The maximum value in the data set

The maximum and minimum values are easily found. The median is found (as explained earlier) by finding the $\frac{1}{2}(n + 1)$th number in a list of the n data items sorted into order of size. There are a number of slightly different ways of finding the upper and lower quartiles. The method here is quoted in Armitage, 1994, and is best illustrated by an example. The numbers below could be the widths of 25 samples of steel bar in mm measured with a micrometer...

4.26  4.70  4.78  4.85  4.86  4.89  4.96  5.06  5.08  5.09
5.12  5.13  5.13  5.18  5.20  5.22  5.30  5.30  5.31  5.34
5.37  5.40  5.41  5.53  5.68

The list is sorted in order of size, and the median value (the 13th of the 25 values) is 5.13 mm. Use the formula $q_l = \frac{1}{4}(n + 1)$ to find the position of the lower quartile, and $q_u = \frac{3}{4}(n + 1)$ to find the position in the sorted list of the upper quartile. In this case, n = 25 so the positions are $q_l = \frac{1}{4}(25 + 1) = 0.25 \times 26 = 6.5$ and $q_u = \frac{3}{4}(25 + 1) = 19.5$. These figures suggest that we pick the mean of the 6th and 7th figures for the LQ, (4.89 + 4.96) ÷ 2 = 4.925, and the mean of the 19th and 20th figures for the UQ, (5.31 + 5.34) ÷ 2 = 5.325. My five number summary for this data is (all in mm)...

Min       4.26
LQ        4.93 (rounded up)
Median    5.13
UQ        5.33 (rounded up)
Max       5.68

If you had 10 data items, then the LQ has position $q_l = \frac{1}{4}(10 + 1) = 0.25 \times 11 = 2.75$, which we round up to be the 3rd data item in the list. The UQ has position $q_u = \frac{3}{4}(10 + 1) = 8.25$, which rounds down to the 8th.
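The position rule for quartiles can be written as a short routine. The sketch below is one possible reading of the method described above, assuming that a position ending in .5 means ‘take the mean of the two neighbouring values’ and any other fraction is rounded to the nearest whole position; the function names are made up for this example, and statistics packages often use slightly different quartile rules.

```python
from math import floor

def value_at(sorted_data, position):
    """Value at a 1-based position in a sorted list.

    A position ending in .5 gives the mean of its two neighbours;
    any other fraction is rounded to the nearest whole position.
    """
    whole = floor(position)
    if position - whole == 0.5:
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    return sorted_data[round(position) - 1]

def five_number_summary(data):
    """Return (min, LQ, median, UQ, max) using the (n + 1) position formulas."""
    ordered = sorted(data)
    n = len(ordered)
    return (
        ordered[0],
        value_at(ordered, (n + 1) / 4),      # lower quartile position
        value_at(ordered, (n + 1) / 2),      # median position
        value_at(ordered, 3 * (n + 1) / 4),  # upper quartile position
        ordered[-1],
    )

# Widths of the 25 steel bars in mm.
steel_mm = [
    4.26, 4.70, 4.78, 4.85, 4.86, 4.89, 4.96, 5.06, 5.08, 5.09,
    5.12, 5.13, 5.13, 5.18, 5.20, 5.22, 5.30, 5.30, 5.31, 5.34,
    5.37, 5.40, 5.41, 5.53, 5.68,
]

summary = five_number_summary(steel_mm)
for label, value in zip(("Min", "LQ", "Median", "UQ", "Max"), summary):
    print(f"{label:<7}{value:.3f}")
```

The routine prints the quartiles unrounded, so they can be rounded to whatever precision suits the data.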
Your turn

The airgun data from Homework Week 4 question 4 is reproduced below (all measurements in mm)...

2.3  2.4  2.4  2.5  2.5  2.5  2.6  2.6  2.6  2.6
2.7  2.7  2.7  2.8  2.8  2.9  3.0  3.0  3.1  3.2
3.2  3.3  3.4  3.4  3.5  3.8  4.2  4.2  4.4  6.5

Find the 5 number summary for this data....

Min
LQ
Median
UQ
Max
Box and whisker plot

Another kind of diagram can be drawn from the 5 number summary. The Box and Whisker Plot is very useful for comparing two or more small samples and trying to find similarities or differences in their distributions. You can include the Box and Whisker plot on the same axes as a dot plot. The data set for the steel bars from the previous page is shown below on a scale that is a little wider each end than the maximum and minimum data points...

[Figure: box and whisker plot of the steel bar data with MIN, LQ, MED, UQ and MAX marked. Horizontal axis: Diameter in mm, from 4.0 to 6.0.]

The central box tells you the range of the central 50% of the distribution.
The Range and the Semi-Inter-quartile Range

In statistics the range of a set of data is simply the difference between the largest and smallest item in the data set...

$$\text{Range} = \text{Max} - \text{Min}$$

The range is handy for scaling graphs, but as a measure of spread or dispersion, the range suffers from some problems...
• The range depends on only two values
• The range is sensitive to extreme outliers
• The range takes no account of the distribution of values between the maximum and minimum

The Semi-interquartile range (SIQR) is just half the difference between the upper and lower quartiles for a set of data...

$$\text{SIQR} = \frac{1}{2}(\text{UQ} - \text{LQ})$$

The advantages of the SIQR as a measure of dispersion include
• Depends on the quartiles so less influenced by extreme values
• Takes account of data distribution through position of the quartiles
• Can provide measure of spread when standard deviation may be problematic (e.g. highly skewed distributions)
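As a quick illustration, the range and the SIQR can be worked out for the 25 steel bar widths used in the five number summary. The sketch below assumes the quartiles are taken as the means of the 6th and 7th values and of the 19th and 20th values in the sorted list, as described for that data set; the function names are illustrative only.

```python
def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

def semi_interquartile_range(lower_quartile, upper_quartile):
    """SIQR: half the difference between the upper and lower quartiles."""
    return (upper_quartile - lower_quartile) / 2

# Widths of the 25 steel bars in mm, sorted in order of size.
steel_mm = [
    4.26, 4.70, 4.78, 4.85, 4.86, 4.89, 4.96, 5.06, 5.08, 5.09,
    5.12, 5.13, 5.13, 5.18, 5.20, 5.22, 5.30, 5.30, 5.31, 5.34,
    5.37, 5.40, 5.41, 5.53, 5.68,
]

lq = (steel_mm[5] + steel_mm[6]) / 2     # mean of the 6th and 7th values
uq = (steel_mm[18] + steel_mm[19]) / 2   # mean of the 19th and 20th values

steel_range = data_range(steel_mm)
steel_siqr = semi_interquartile_range(lq, uq)

print(f"Range = {steel_range:.2f} mm")
print(f"SIQR  = {steel_siqr:.2f} mm")
```

The range is set entirely by the two extreme bars, whereas the SIQR describes the spread of the middle half of the data.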
Your turn

The table below contains the five number summaries of two samples of the length of butterfly wings in mm. The sample sizes are also shown in the table as N.

            Min    LQ     Median   UQ     Max    N
Sample A    10.5   12.2   12.8     13.1   13.4   36
Sample B    11.3   12.7   13.4     13.8   14.3   40

Draw box and whisker plots representing these two samples on the same axes.

Discussion: What are the differences between the two samples?
The Normal Distribution

The distribution of heights of a sample of 1000 adults resident in England might look a bit like this:

[Figure: Height distribution of 1000 people. Horizontal axis: Height / cm, bars labelled with interval midpoints 132.5 to 197.5. Vertical axis: frequency, 0 to 250.]

As you can see, the distribution is
• Peaked in the middle of the height range
• Roughly symmetrical about the modal class
• Tends to have a ‘bell curved’ shape
Biological variables (height, weight, blood pressure, and so on) tend to have a distribution shape like this as do genuine random errors in measurement. This distribution shape became so common, it was referred to as the normal distribution. Other distribution shapes have become more common as statistics developed as a useful tool in industry and science - but the name has stuck.
Measures of central tendency for Normal distribution

For a sample of data that is drawn from a population that has a normal distribution for the variable of interest, the three averages will give approximately the same value. If the median and mean are appreciably different for a given sample, then this may indicate that the distribution is skewed one way or another, and the histogram of the data will look ‘pushed over’. If the median is less than the mean, you have positive skew and if the median is higher than the mean, you have negative skew.
Percentage of data items under the Normal curve

[Figure: Normal curve. Horizontal axis: SDs either side of mean, from −3.5 to 3.5. Vertical axis: 0 to 0.45. Annotations: ±1σ contains 68% of data; ±2σ contains 95.5% of data; ±3σ contains 99.7% of data.]

The theoretical normal curve above shows you that...
• 68% of data items should lie within ± 1 Standard Deviation from the mean
• 95.5% of data items should lie within ± 2 Standard Deviations from the mean
• 99.7% of data items should lie within ± 3 Standard Deviations from the mean

This means that if (say) you had a batch of steel washers with mean thickness 0.35 mm and a standard deviation 0.05 mm, then you would expect that 99.7% of the washers would have a thickness in the range 0.35 ± 0.15 mm, that is, between 0.20 mm and 0.50 mm. The remaining 0.3% is split equally between the two tails, so a washer 0.20 mm thick or thinner has a chance of only about 1.5 in 1000 (half of 0.3%) of being produced.
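The percentages quoted for the Normal curve can be checked numerically. The sketch below uses the NormalDist class from Python's standard statistics module; the helper name fraction_within is made up for this example, and the washer figures are the ones from the text.

```python
from statistics import NormalDist

# Standard Normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of the area under the Normal curve within ±k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"±{k} SD contains {fraction_within(k) * 100:.2f}% of data")

# Washer example: mean 0.35 mm, standard deviation 0.05 mm.
washers = NormalDist(mu=0.35, sigma=0.05)
p_thin = washers.cdf(0.20)  # chance of a washer 0.20 mm thick or thinner
print(f"Chance of a washer at or below 0.20 mm: about {p_thin * 1000:.1f} in 1000")
```

The computed figures carry more decimal places than the rounded percentages usually quoted, so small differences in the last digit are to be expected.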
Your turn

Suppose a batch of airgun pellets has a mean diameter of 0.177 inches with a standard deviation of 0.008 inches. What range of diameters could 99.7% of the pellets be expected to lie within? Does this range include a pellet of 0.200 inches?

Discussion point: Would you send someone to prison if there was a 3 out of 1000 chance that they might really be innocent?
Standard error of the mean (SEM)

Suppose you took a large series of samples (say each sample is the heights of 10 adult males resident in England). The means of the samples would show variation: the mean of the means would be a very close approximation to the mean height of the population, and the standard deviation of the sample means would tell you something about the sampling error. You can get an idea of the sampling error from a single sample by calculating something called the standard error of the mean (SEM). The standard error of the mean is calculated by dividing the sample standard deviation by the square root of the size of the sample…

$$\text{SEM} = \frac{S}{\sqrt{N}}$$

Your turn

Complete the following table

Sample size    Mean /cm    Standard deviation /cm    Standard error of the mean /cm
10             175         2.5
100            175         2.5
1000           175         2.5
10000          175         2.5

Interpreting the SEM

As you can see, as the sample size increases, the SEM decreases. This shows you that a larger sample has a smaller chance of having a mean that is different from the mean of the population than a small sample.
• The SEM does NOT say anything about an individual value for a given data item
• Samples must always be randomly selected from the population and must be free from bias
Central limit theorem

“The distribution of an average tends to be Normal, even when the distribution from which the average is computed is decidedly non-Normal.” - Charles Annis

As a concrete example, consider the RND function on your calculator or the =RAND() function in MS Excel. On a calculator this function typically returns a random number with 3 decimal places between 0.000 and 0.999 (Excel returns more decimal places), and each possible number is equally likely to be returned. The distribution is referred to as ‘uniform’, and a histogram would consist of bars of approximately equal height for a large enough sample. I built a spreadsheet to pick 1000 samples of 100 random numbers each, and the distribution of the means of the samples is shown below…

[Figure: Distribution of sample means from a population with uniform distribution. Horizontal axis: sample mean, from 0.38 to 0.61. Vertical axis: Frequency (1000 samples), 0 to 180.]

As you can see, the distribution looks to be Normal in shape.
• A typical mean for the means of 1000 samples was 0.500 and the standard deviation of the means was 0.029.
• A typical mean for a single sample is 0.478 and a typical sample standard deviation is 0.3
• As you can see, the standard deviation of the sample means is approximately the same as the standard deviation of a single sample divided by the square root of the sample size (0.3 ÷ √100 = 0.03)
• The standard error of the mean calculated from a single sample is a guide to the standard deviation of the distribution of the means.
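The spreadsheet experiment can be repeated in a few lines of code. The sketch below is an illustrative Python version, not the author's spreadsheet: it assumes 1000 samples of 100 uniform random numbers each, uses a fixed seed so the run is repeatable, and the variable names are made up for this example.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(2004)  # fixed seed so that the sketch is repeatable

# 1000 samples, each of 100 random numbers drawn uniformly between 0 and 1.
samples = [[random.random() for _ in range(100)] for _ in range(1000)]
sample_means = [mean(s) for s in samples]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)

# Standard error of the mean estimated from the first sample alone.
sem_first_sample = stdev(samples[0]) / sqrt(100)

print(f"mean of the 1000 sample means = {mean_of_means:.3f}")
print(f"SD of the 1000 sample means   = {sd_of_means:.3f}")
print(f"SEM from one sample           = {sem_first_sample:.3f}")
```

The last two figures should be close to one another, which is the point of the final bullet above: one sample's SEM is a guide to the spread of the sample means.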
Confidence intervals

Because the Standard Error of the Mean is essentially an estimate of the standard deviation of the sample means, we can use the SEM to estimate the chance that the population mean falls within a certain range. Recall that the area under the Normal curve within limits set by the standard deviation is as follows...

SD range either side of mean    Percentage of area under the Normal curve within that range
± 1 SD                          68%
± 2 SD                          95.5%
± 3 SD                          99.7%

Turning these facts round and applying the concept to the standard error of the mean results in the following rule for calculating a confidence interval of the mean for a sample...

“There is a 95.5% probability that the population mean falls within ± 2 standard errors of the sample mean.”

We can modify this rule slightly to give us a formula for the 95% confidence interval of the mean as follows:

$$95\%\ \text{CI} = \bar{x} \pm 1.96 \times \frac{S}{\sqrt{N}}$$

In words, the 95% confidence interval of the mean is given by the mean plus and minus 1.96 multiplied by the standard error of the mean. The factor 1.96 is used instead of 2 so that the probability is 95% (leaving a 1 in 20 chance of being wrong) rather than 95.5%.
Example calculation

Suppose you have the following information for the weight of a sample of sturgeon roe caught in an area of the Caspian Sea - weights were recorded in grammes...

N                            30
Mean                         1.56 g
Sample Standard Deviation    0.2 g

The 95% confidence interval of the mean for this sample is given by...

$$95\%\ \text{CI} = \bar{x} \pm 1.96 \times \frac{S}{\sqrt{N}} = 1.56 \pm 1.96 \times \frac{0.2}{\sqrt{30}} = 1.56 \pm 0.072$$

= 1.49 to 1.63 g

And so we can say that there is a 1 in 20 chance that the sample of sturgeon roe is actually an extremely unrepresentative one and that the population mean lies outside the range 1.49 to 1.63 g. Notice that if you took repeated samples in a perfectly random way and without any kind of bias, you would expect, 5% of the time, to get a sample with a mean lying outside this range. Try the MS Excel spreadsheet simulation at this point - repeated pressing of the F9 function key will draw a new sample. Try pressing F9 50 times. Note each time the sample mean falls outside the range. The 5% probability splits symmetrically so there is a 2.5% probability that the sample mean is below 1.49 g and a 2.5% probability that the sample mean is above 1.63 g.
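The arithmetic for the sturgeon roe interval can be laid out as a short routine. The sketch below is one way of writing the 95% confidence interval formula in Python; the function name confidence_interval_95 is made up for this example.

```python
from math import sqrt

def confidence_interval_95(sample_mean, sample_sd, n):
    """95% confidence interval of the mean: mean ± 1.96 × S / √N."""
    sem = sample_sd / sqrt(n)    # standard error of the mean
    half_width = 1.96 * sem
    return sample_mean - half_width, sample_mean + half_width

# Sturgeon roe sample: N = 30, mean 1.56 g, sample standard deviation 0.2 g.
lower, upper = confidence_interval_95(1.56, 0.2, 30)
print(f"95% CI = {lower:.2f} to {upper:.2f} g")
```

The same routine can be reused for any sample once its mean, sample standard deviation and size are known.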
Your turn

Here is an extended example designed to take you from raw data to the 95% confidence intervals for two sets of data. We will use this data in a later section so please do complete the calculation - ask your tutor if you need help. A calculator with a standard deviation routine or access to MS Excel will certainly help!

The data

The weight in kg of sacks of flour was checked with an accurate weighing machine. Random samples were obtained for the check, which was carried out in two mills. Samples of 20 sacks were taken in each case.

Mill A
9.00  9.05  9.05  9.10  9.10  9.10  9.15  9.15  9.20  9.20
9.20  9.25  9.25  9.30  9.40  9.50  9.60  9.90  9.95  10.00

Mill B
9.10  9.30  9.35  9.35  9.40  9.40  9.40  9.45  9.45  9.45
9.45  9.50  9.50  9.50  9.55  9.55  9.60  9.70  9.70  9.90

The initial data analysis

a) Draw dot plots for each of the samples from the two mills and describe the appearance of the data - use the same axes for both dot plots and include a key
b) Calculate the 5 number summary for each of the data sets and organise your results into a table
c) Draw box and whisker plots for each sample and comment on the appearance of the diagrams

A more analytical approach

d) Calculate the mean and the sample standard deviation for each of the samples
e) Calculate the 95% confidence interval of the sample mean for each of the samples
f) Plot the 95% confidence intervals for each sample on your dot plot. Do the intervals overlap?
g) Discussion: Is Mill A producing sacks that have a different weight from Mill B? Try to make a statement based on probabilities.
The idea of a significance test

• In statistics, it is normal to lay out the precise criteria that you will use to decide a question before designing an experiment, choosing samples and so on. This pre-defined way of working avoids any unconscious bias creeping in
• A standard set of steps where you are clear what you are trying to decide is called a significance test
• The table below shows the main steps involved in a significance test
• You can find much more detail by looking up the terms involved in a statistics text book or on the Web

Stage in general process: State the null hypothesis
Specific example in flour sack example: “There is no difference in weight between the flour sacks from Mill A and Mill B”

Stage in general process: State an alternate hypothesis
Specific example in flour sack example: “There is a difference between the weight of flour sacks produced in Mill A and in Mill B”. This is a two tailed test as we are not asking if sacks from Mill A weigh more than sacks from Mill B or vice versa

Stage in general process: Decide the significance level to adopt
Specific example in flour sack example: 5% corresponds to a 1 in 20 chance of a 'significant' result occurring when in fact there is no difference. Look up 'type I' and 'type II' errors in a textbook.

Stage in general process: Calculate the statistics as dictated by the type of data and number of variables: different situations call for different statistics
Specific example in flour sack example: Calculate the mean and standard deviation for the sacks from Mill A and from Mill B. Calculate the 95% confidence interval of the mean for each sample. See if the confidence intervals overlap

Stage in general process: Accept or reject the null hypothesis according to the statistical criteria adopted
Specific example in flour sack example: If the confidence intervals overlap then you have to accept the null hypothesis of no difference. If the confidence intervals do not overlap then you have to reject the null hypothesis and accept the alternate hypothesis.
This format is the one we shall adopt in scenario 6.
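The final two stages (calculate the confidence intervals, then see whether they overlap) can be sketched as a short routine. The example below is only an illustration of the overlap rule described above, not a formal statistical test; the function names are made up, and the summary statistics fed into it are invented for the purpose of the example and are not taken from any data set in this unit.

```python
from math import sqrt

def ci95(sample_mean, sample_sd, n):
    """95% confidence interval of the mean: mean ± 1.96 × S / √N."""
    half_width = 1.96 * sample_sd / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

def intervals_overlap(ci_a, ci_b):
    """True if two (lower, upper) intervals share any values."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical summary statistics (mean, sample SD, sample size), for illustration only.
ci_first = ci95(10.0, 0.5, 25)
ci_second = ci95(10.5, 0.5, 25)

if intervals_overlap(ci_first, ci_second):
    decision = "accept the null hypothesis"
else:
    decision = "reject the null hypothesis"

print(f"First sample 95% CI:  {ci_first[0]:.2f} to {ci_first[1]:.2f}")
print(f"Second sample 95% CI: {ci_second[0]:.2f} to {ci_second[1]:.2f}")
print("Decision:", decision)
```

Writing the decision rule down before looking at the data, as the routine forces you to do, is the same discipline as fixing the criteria before the experiment.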
Your turn: dry run

• Below are two sets of data on the girth of Betula pendula as sampled from two parts of a wood
• You suspect that the sample taken from the North segment may represent a different population from the sample taken from the South
• Perform a significance test to verify your experimental hypothesis
• Use a 5% significance level

Data

North section girth sample /cm
386 486 627 553 108 240 397 280 16 51 588 138 194 348 173 424 25 761 217 305

South section girth sample /cm
796 818 382 76 470 689 611 719 211 509 73 291 521 854 810 964 364 833 688 988
Note: the experimental hypothesis is what you are trying to find out. You then frame a more specific (usually negative) null hypothesis as part of the significance test.