Title: Reminder of 1st lecture
1Reminder of 1st lecture
- Data types
- Summarising qualitative data: counts
- Summarising quantitative data
- Location: mean, median
- Spread: standard deviation, interquartile range
- Graphical presentation of data
2Qualitative data
- Dichotomous (binary): the variable has only 2 possible categories (mutually exclusive)
- Nominal: the variable has more than 2 categories, mutually exclusive and unordered
- Ordinal: the variable has more than 2 categories, mutually exclusive and ordered
3Quantitative data
- Continuous: used for something measured on a scale. The variable can take any value within a range of values, e.g. height in cm, weight in kg.
- Discrete: the variable often represents counts (integer values), e.g. age to nearest year, number of children.
4Measures of central tendency
- Mean: the arithmetic average (μ, mu)
- add up all observations and divide by the number of observations
- Median: the central value of the distribution
- rank the observations; the median is the observation below which 50% of all values fall
5Histogram
6Appropriate measures of location and spread
7Reminder of 2nd lecture
- The probability of an event or outcome indicates how likely it is to happen.
- The probability of an event has to be between 0 and 1.
- A probability of 0 means that the event never happens.
- A probability of 1 means that the event always happens.
- Events that cannot happen together are mutually exclusive and their probabilities can be added together: P(A or B) = P(A) + P(B)
- Events that are independent do not affect each other and their probabilities can be multiplied together: P(A and B) = P(A) x P(B)
8Properties of Normal curve
P = 0.68, P = 0.95, P = 0.999
9Chi-Squared
- Another important distribution related to the Normal is the Chi-squared distribution.
- It is used when investigating categorical data.
10Crosstab Example
- If eye and hair colour are not associated then, for example, the Expected number with blue eyes and blond hair would be (total with blue eyes x total with blond hair) / overall total.
11
- So the chi-squared statistic is found by looking at a function of the discrepancy between observed and expected counts in each cell
- summed over all combinations of hair and eye colour.
- If this is large and in the tail of the distribution, then it may be said that the observed is not as expected!
- More of this later.
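As a rough illustration of the idea, the sketch below computes expected counts and a chi-squared statistic for a small table. The counts are hypothetical (they are not from the lecture), and the sum of (observed - expected)^2 / expected is assumed here as the usual Pearson form of the "function of the discrepancy"; the function names are made up for this example.

```python
# Hypothetical 2x2 table of observed counts (illustrative numbers only).
# Rows: eye colour (blue, other); columns: hair colour (blond, other).
observed = [
    [30, 20],
    [20, 30],
]


def expected_counts(table):
    """Expected count in each cell if rows and columns are not associated:
    row total x column total / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    total = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


print(expected_counts(observed))
print(chi_squared(observed))
```

A large value of the statistic, far into the tail of the chi-squared distribution, would suggest that the observed counts are not as expected under no association.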
12Summary
- Probabilities are integral to all things around us.
- We can derive and understand probabilities.
- We have seen that probabilities build together to form probability distributions.
- Some are theoretical distributions that are well understood, the most important being the Normal.
- Using these theoretical distributions we can begin to make inferences about the population on the basis of samples.
13Sampling, sampling distributions and statistical
inference
14Wednesday, 11 October 2006: Scots bar staff health 'improved'
- The health of Scotland's bar staff has improved dramatically since the introduction of a smoking ban, a medical study has found.
- Researchers at Dundee University found significant health improvements in the first two months after the March ban.
- The results have led to calls for the UK Government to speed the introduction of a similar ban south of the border.
- But smokers' rights group Forest said the link between passive smoking and ill health had not been proven.
- The team from the university's asthma and allergy research group began testing bar workers in and around Dundee in February, a month before the ban came into force.
- Using a series of indicators, they established symptoms attributable to passive smoking, measuring lung function and inflammation in the bloodstream.
- "This study provides compelling evidence that making workplaces smoke free can have a significant and speedy impact on people's health" (Peter Hollins, British Heart Foundation)
15Study of bar workers in Dundee before and after
the smoking ban
- JAMA. 2006;296:1742-1748
- To investigate the association of smoke-free legislation with symptoms, pulmonary function, and markers of inflammation of bar workers (n = 77)
- At 1 month: the percentage of bar workers with respiratory and sensory symptoms decreased from 79.2% (n = 61) before the smoke-free policy to 53.2% (n = 41)
- (total change 26%; 95% confidence interval [CI], 13.8% to 38.1%; P < 0.001)
- Forced expiratory volume in the first second increased from 96.6% predicted to 104.8%
- (change 8.2%; 95% CI, 3.9% to 12.4%; P < 0.001)
- Serum cotinine levels decreased from 5.15 to 3.22 ng/mL
- (change 1.93 ng/mL; 95% CI, 2.83 to 1.03 ng/mL; P < 0.001)
16Study of bar workers in Dundee before and after
the smoking ban
- JAMA. 2006;296:1742-1748
- At 2 months: total white blood cell count reduced from 7610 to 6980 cells/µL
- (630 cells/µL; 95% CI, 1010 to 260 cells/µL; P = 0.002)
- Neutrophil count was reduced from 4440 to 4030 cells/µL
- (410 cells/µL; 95% CI, 740 to 90 cells/µL; P = 0.03)
- Smoke-free legislation was associated with significant early improvements in symptoms, spirometry measurements, and systemic inflammation of bar workers
17Sampling theory
- A population is the totality of observations obtainable from all subjects possessing some common specified characteristic, e.g.
- male diabetics
- height of all females in the UK
- A sample is a set of observations which constitutes part of a population
18Random and biased samples
- Random sampling is a sampling technique where each member in the population is chosen entirely by chance, with a known chance of being included in the sample. Representative of the population.
- The most common random sample is obtained when any one individual or measurement in the population is as likely to be selected as any other.
- We can also have a biased sample, where some individuals or measurements have a greater chance of being included than others.
- With a non-random sampling technique not all members have a known chance of being selected, or some members have a zero chance of being selected. Unrepresentative of the population.
19Selection of sample
- Probability sampling
- Each item/person in the population has a calculable non-zero chance of being selected for the sample.
- Sampling error (degree to which the sample differs from the population) can be calculated.
- Convenience sampling
- The items/people are selected by the researcher from the population in a non-random manner.
- The sampling error is unknown and cannot be calculated.
20
- Random sampling
- Simple random sampling
- Systematic random sampling
- Stratified random sampling
- Cluster random sampling
- Multi-stage sampling
- Biased sampling
- Quota sampling
21Probability sampling
- Simple random sample
- Each item in the population has an equal chance
of being selected for the sample
22Random number table
- 84 42 56 53 87 75
- 78 87 77 03 57 09
- 85 86 48 86 12 39
- 65 37 93 76 46 11
- 09 49 41 73 76 49
- 64 06 71 99 37 06
- 46 69 31 24 33 52
- 67 85 07 75 56 96
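In practice a random number table is usually replaced by a software random number generator. The following is a minimal sketch, assuming a hypothetical population of 100 numbered items; the function name and the seed are illustrative choices, not part of the lecture.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement; every item is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical population: items numbered 1 to 100.
population_ids = list(range(1, 101))
sample = simple_random_sample(population_ids, 10, seed=1)
print(sorted(sample))
```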
23Systematic random sampling
- Choose one element at random from the population of size N
- For a sample of size n, choose every (N/n)th element thereafter
- Advantages: it is simpler and can be more representative than a simple random sample
- Disadvantages: possibility of implicit clustering; it is not a simple random sample
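A minimal sketch of systematic sampling follows. It assumes one common variant, in which the random starting point is chosen within the first N/n elements of an ordered list and every (N/n)th element is taken after it; the function name and the list of 300 numbered GPs are hypothetical.

```python
import random


def systematic_sample(population, n, seed=None):
    """Take every k-th element after a random start, where k = N // n."""
    N = len(population)
    k = N // n  # sampling interval
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start within the first interval
    return [population[start + i * k] for i in range(n)]


# Hypothetical ordered list of 300 GPs; a 1 in 3 sample gives n = 100.
gps = list(range(300))
chosen = systematic_sample(gps, 100, seed=0)
print(chosen[:5])
```

Because the list order is preserved, the sample is spread evenly across whatever the list is ordered by.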
24Example systematic random sampling
- A survey of GPs in Scotland to assess counselling services provided to patients is to be conducted.
- 1 in 3 GPs are to be included in the sample.
- List of GPs available is ordered by practice within Health Board.
- Systematic sample will ensure representative spread across health boards and practices.
25Stratified random sampling
- The strata are subgroups of the population which are chosen to minimize differences between members of the same stratum and maximize the differences between members of different strata.
- Main advantages
- Increases the representativeness of the sample
- Increases the precision of the resulting estimates
- Allows comparison between strata
26Example Stratified random sampling
- It is likely that size of practice and site of practice (rural/urban) may influence whether the practice employs a counsellor
- The list of GPs could be stratified by number of partners in the practice (< 4, > 4) and area (rural/urban)
- A simple random sample from each of the four lists would constitute the sample
- The sample would ensure that each stratum (or combination of strata) is represented in the sample
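The stratified design can be sketched as below. The GP records, the field names "size" and "area", and the 1 in 4 sampling fraction are all hypothetical, invented only to show a simple random sample being drawn within each of the four strata.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the given fraction within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical list of 120 GPs, each labelled by practice size and area.
gps = [
    {
        "id": i,
        "size": "small" if i % 2 else "large",
        "area": "rural" if i % 3 == 0 else "urban",
    }
    for i in range(120)
]

chosen = stratified_sample(gps, lambda gp: (gp["size"], gp["area"]), 0.25, seed=0)
strata_in_sample = {(gp["size"], gp["area"]) for gp in chosen}
print(len(chosen), sorted(strata_in_sample))
```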
27Cluster random sampling
- The clusters are subgroups of the population
which are chosen to maximize differences between
members of the same cluster and therefore
minimize the differences between members of
different clusters
- Advantages: cheaper and faster than a simple random sample
- Disadvantages: less representative than a simple random sample, and there is a danger of contamination between respondents
28Example Cluster random sampling
- For a nutritional survey to be carried out, 20 schools in Scotland will be randomly selected.
- All secondary year 4 children from those schools selected will be interviewed.
29Multi-stage sampling
- Different sampling units are sampled at different stages
- Example
- Geographical areas of the UK would be randomly selected, from which hospitals would be randomly selected, from which wards/patients would then be randomly selected.
30Convenience sampling
- Quota sampling
- In this type of sampling an individual is chosen by an interviewer.
- To avoid undue bias the quota is sub-divided into various categories, e.g. male/female, old/young and so on.
- The interviewer is given quotas for each category and uses discretion to select the interviewees.
31Statistical Inference
POPULATION
SAMPLE
INFERENCE
32
- Which sampling approach will lead to unbiased estimates?
33Statistical Inference
POPULATION
SAMPLE
INFERENCE
34Population parameters and sample statistics
- A population parameter is a measurable characteristic of the population, e.g. μ, σ
- Values obtained from a sample are estimates of the population parameters, e.g. x̄, s
35Estimation I
- A parameter is a numerical descriptive measure of a population. It is calculated from the observations in the population
- mean: μ
- standard deviation: σ
- A sample statistic is a numerical descriptive measure of a sample. It is calculated from the observations in the sample
- sample mean: x̄
- sample standard deviation: s
36Estimation II
- Population parameters are fixed so long as the population itself does not change
- Sample statistics will vary from sample to sample, even though samples may be random and the population does not change
37Sampling distribution
- In theory we could select all possible random samples from a population and obtain an estimate of the population parameter from each of the individual samples.
- If a histogram of the estimates from the individual samples were plotted, it would form the sampling distribution of the estimate of the population parameter (a probability distribution).
38Sampling Distribution
39Sampling distribution
- The sampling distribution of a sample statistic,
calculated from a sample of n measurements, is
the probability distribution of the statistic
40Example
- Imagine that a random sample of 100 individuals is to be selected from a population
- Their height in cm is measured
- The mean height is computed
- Another random sample of 100 individuals from the same population is taken
- Their height in cm is measured
- Their mean height is computed
- This is repeated until 20 random samples have been taken
41 20 samples of size 100
- The first sample of heights of 100 people gives a mean of 172.03 cm and a standard deviation (SD) of 6.03 cm.
- The second sample gives mean 173.50 cm, SD 6.74 cm.
- These figures represent the mean height (cm) for each of the 20 random samples
- 172.03 173.50 171.89 171.95 170.59
- 172.63 172.72 171.99 172.50 171.71
- 172.55 172.86 171.58 172.83 172.55
- 171.28 172.62 171.41 171.38 172.26
42Histogram of means of 20 samples
43Histogram of means of 100 samples
44Sampling error
- Each random sample may have a different estimate of the population parameter due to sampling variation
- Knowledge of the sampling distribution allows us to assess how close the estimate obtained from one individual sample is to the true population parameter. This is known as precision.
45Precision
- The larger the size of the sample, the greater the reduction in sampling error
- Taking a larger sample will reduce the sampling variation around the true population value that we are trying to estimate.
- This implies that our estimate would be more precise.
46Standard Error
- The standard deviation of the sampling distribution of the mean is known as the standard error of the mean.
- The standard error provides a measure of how far from the true population value the estimate is likely to be (the precision)
47Standard deviation/standard error
- The standard deviation, s, is a measure of the variability of individuals in a sample
- The standard error is a measure of the uncertainty in the sample statistic (e.g. mean, proportion)
48What does the standard error indicate?
- Consider again a random sample of 100 individuals selected from a population
- Their height in cm is measured
- The mean height is computed
- Another random sample of 100 individuals from the same population is taken
- Their height in cm is measured
- Their mean height is computed
- This is repeated until 20 random samples have been taken
49 20 samples of size 100
- These figures represent the mean height (cm) for each of the 20 random samples
- 172.03 173.50 171.89 171.95 170.59
- 172.63 172.72 171.99 172.50 171.71
- 172.55 172.86 171.58 172.83 172.55
- 171.28 172.62 171.41 171.38 172.26
- Mean of the 20 sample means = 172.14 cm
- SD of the 20 sample means = 0.689 cm ≈ SE (mean)
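The two summary figures can be checked directly from the 20 sample means listed on this slide. The sketch below uses Python's standard `statistics` module; the comparison with 6.03/√100 uses the SD of the first sample quoted earlier in the lecture, and the variable names are illustrative.

```python
import statistics

# Mean height (cm) of each of the 20 random samples of size 100.
sample_means = [
    172.03, 173.50, 171.89, 171.95, 170.59,
    172.63, 172.72, 171.99, 172.50, 171.71,
    172.55, 172.86, 171.58, 172.83, 172.55,
    171.28, 172.62, 171.41, 171.38, 172.26,
]

mean_of_means = statistics.mean(sample_means)
# stdev() uses the n - 1 divisor, i.e. the sample standard deviation.
sd_of_means = statistics.stdev(sample_means)

# For comparison: the standard error estimated from the first sample alone
# (SD 6.03 cm, n = 100).
se_from_one_sample = 6.03 / 100 ** 0.5

print(f"mean of sample means: {mean_of_means:.2f} cm")
print(f"SD of sample means:   {sd_of_means:.3f} cm")
print(f"SE from one sample:   {se_from_one_sample:.3f} cm")
```

The SD of the sample means and the standard error computed from a single sample are of similar size, which is the point of the slide.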
50Explanation of sampling error
- 5 students in population
- Ages are 22, 25, 28, 30, 35
- Sample of 3 students randomly selected to
estimate age in population (5 students)
51 22, 25, 28, 30, 35; μ = 28
- Sample 1: 22, 30, 35; mean = 29
- Sample 2: 25, 28, 35; mean = 29.3
- Sample 3: 25, 28, 30; mean = 27.7
- Sample 4: 25, 30, 35; mean = 30
- Sample 5: 28, 30, 35; mean = 31
NB: Variation in age 22 to 35; variation in mean age 27.7 to 31
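With only 5 students, every possible sample of 3 can be listed, which gives the complete sampling distribution of the mean for this small example. A minimal sketch using the five ages from the slide (variable names are illustrative):

```python
from itertools import combinations
from statistics import mean

ages = [22, 25, 28, 30, 35]
population_mean = mean(ages)

# All 10 possible samples of 3 students, and the mean age of each.
samples = list(combinations(ages, 3))
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(s, round(m, 1))

print("population mean:", population_mean)
print("mean of all sample means:", round(mean(sample_means), 1))
print("range of sample means:", min(sample_means), "to", max(sample_means))
```

The sample means vary less than the individual ages, and their average equals the population mean.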
52Relationship between sample size and precision of
sample estimate
- Heights (cm)

Sample size   Mean     SD     Standard error
100           172.03   6.03   0.60
200           172.77   6.42   0.45
500           171.99   6.85   0.31
1000          172.15   6.84   0.22
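The standard errors in the heights table follow from SD / √n. A short sketch that recomputes them from the sample sizes and SDs given in the table (the function name is illustrative):

```python
from math import sqrt

# (sample size, sample SD in cm) for the four samples of heights.
rows = [(100, 6.03), (200, 6.42), (500, 6.85), (1000, 6.84)]


def standard_error(sd, n):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return sd / sqrt(n)


standard_errors = [round(standard_error(sd, n), 2) for n, sd in rows]
for (n, sd), se in zip(rows, standard_errors):
    print(n, sd, se)
```

The SD stays at about 6 to 7 cm whatever the sample size, while the standard error shrinks as n grows.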
53Properties of the sampling distribution of the
mean
- The mean of the sampling distribution = the mean of the population distribution
- Standard deviation of the sampling distribution = σ / √n = standard error of the mean
- The sampling distribution of the mean is approximately Normal for large sample sizes
54Central Limit Theorem
- If a random sample of n observations is selected from a population, then when n is sufficiently large, the sampling distribution of x̄ (the sample mean) will be approximately a Normal distribution.
- The larger the sample size, the better the approximation to the Normal distribution will be.
- A sample size of at least 30 will usually be enough.
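The Central Limit Theorem can be explored by simulation. The sketch below is illustrative only: it assumes a skewed population (exponential with mean 1 and SD 1, chosen for this example and not taken from the lecture), draws 2000 samples of size 30, and compares the spread of the sample means with σ / √n.

```python
import random
import statistics


def simulate_sample_means(draw, n, num_samples, seed=0):
    """Return the means of num_samples random samples, each of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean([draw(rng) for _ in range(n)])
        for _ in range(num_samples)
    ]


# Assumed population: exponential with rate 1, so mean 1 and SD 1 (strongly skewed).
means = simulate_sample_means(lambda rng: rng.expovariate(1.0), n=30, num_samples=2000)

print("mean of sample means:", round(statistics.fmean(means), 3))
print("SD of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(1 / 30 ** 0.5, 3))
```

A histogram of `means` should look roughly symmetric and bell-shaped even though the individual observations are heavily skewed.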
55Statistical Inference
Representativeness Size
POPULATION
SAMPLE
INFERENCE
56Statistical Inference
- Concerned with how we draw conclusions from sample data about the larger population from which the sample is selected.
- There are two types of inference
- Confidence Intervals (Estimation)
- Hypothesis Testing (Significance Testing)
57Confidence intervals
- When we collect data on a sample of individuals we would not expect the results from our sample to be exactly the same as those we would get if we had data on the whole population.
- Using the variability in the sample data we can calculate a range of values in which the population value is likely to lie.
- We can vary the width of this range depending on how confident we want to be that we will have included the true population value (usually set at 95% confidence).
58Calculation of confidence intervals
- Confidence intervals can be calculated for most sample estimates using the following notation
- Sample estimate ± critical value x standard error(sample estimate)
59Calculation of confidence interval for a
population mean
- Sample estimate
- sample mean
- Standard error of mean
- sample standard deviation / √n
- For large samples, 5% critical value = 1.96
- Point estimate 172.03 cm, SE 0.603 cm
- 95% CI (170.85 to 173.21) cm
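The calculation for the height example can be written out as a short sketch. It uses the large-sample critical value of 1.96 and the sample mean, SD and size quoted in the lecture; the function name is illustrative.

```python
from math import sqrt


def mean_confidence_interval(sample_mean, sample_sd, n, critical_value=1.96):
    """Large-sample CI for a mean: estimate +/- critical value x standard error."""
    standard_error = sample_sd / sqrt(n)
    margin = critical_value * standard_error
    return sample_mean - margin, sample_mean + margin


# Heights: sample mean 172.03 cm, SD 6.03 cm, n = 100.
low, high = mean_confidence_interval(172.03, 6.03, 100)
print(f"95% CI: {low:.2f} to {high:.2f} cm")
```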
60Small samples
- For a small sample (n < 30) we use the t-distribution with (n-1) degrees of freedom
- 95% CI for small samples: sample mean ± t_{n-1}(5%) x standard error of the mean
- When n is large the t-distribution approximates to the Normal distribution
- For n-1 = 5, 10, 30, 60, 120 degrees of freedom,
- t_{n-1}(5%) = 2.57, 2.23, 2.04, 2.00, 1.98
61Interpretation of CIs
- Consider an infinite number of random samples of size n from a population
- A mean and 95% CI could be computed for each of the samples
- For 95% of the samples from the population, the 95% CI will include the population value, whilst in 5% of samples the 95% CI will not include the population value.
62Interpretation of confidence intervals
Figure: confidence intervals from Sample 2, Sample 3, Sample 4, Sample 5, ..., Sample n, compared with the true population value (e.g. μ)
63Example
- Suppose I want to know the average height of UK people. I have a random sample of 100 people with a mean of 172.03 cm and SD of 6.03 cm.
- The standard error of the mean is 6.03/√100 = 0.603
- The estimate of the population mean is 172.03
- The 95% confidence interval (CI) for the population mean is 172.03 - 1.96 x 0.603, 172.03 + 1.96 x 0.603
- 95% CI is 170.85, 173.21
- I am 95% confident that the average height of UK people is between 170.85 and 173.21
- A different 100 people would have a slightly different sample mean and confidence interval
- 95% of random samples of 100 would give a 95% confidence interval containing the true population mean
64 100 samples of 100 people from a population of 10000 with known mean 172 cm and SD 6.8 cm
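An experiment of this kind can be imitated by simulation. The sketch below is not the lecturer's data: it assumes a Normally distributed population of 10000 heights generated with mean 172 cm and SD 6.8 cm, draws 100 random samples of 100 people, and counts how many of the 95% CIs contain the population mean. The seed and variable names are arbitrary.

```python
import random
import statistics
from math import sqrt

rng = random.Random(2006)  # fixed seed so the run is repeatable

# Assumption: a Normally distributed population of 10000 heights (cm).
population = [rng.gauss(172, 6.8) for _ in range(10000)]
true_mean = statistics.fmean(population)

covered = 0
for _ in range(100):
    sample = rng.sample(population, 100)  # simple random sample of 100 people
    sample_mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / sqrt(100)
    low = sample_mean - 1.96 * standard_error
    high = sample_mean + 1.96 * standard_error
    if low <= true_mean <= high:
        covered += 1

print(f"{covered} of 100 intervals contain the population mean")
```

On average about 95 of the 100 intervals would be expected to contain the population mean, though any single run will vary around that figure.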
65Change in sample size
- If the sample size is larger, the standard error is smaller and therefore the CI is also narrower
- For my first sample of 100 the standard error of the mean was 0.603 cm and 95% CI was 170.85, 173.21 cm
- When I took a sample of 200 the mean was 172.77 cm and the SD was 6.42 cm.
- The standard error was 6.42/√200 = 0.45 (smaller)
- 95% CI was 172.77 - 1.96 x 0.45, 172.77 + 1.96 x 0.45
- 171.88, 173.65 cm
- For a sample of 1000 the 95% CI was 171.73, 172.58 cm
66Confidence intervals
- Raising the confidence level from 95% to 99% increases the assurance that the confidence interval contains the population mean, but it makes the estimate less precise, i.e. the width of the CI is wider
- Multiplier changes from 1.96 to 2.58.
- For height with 100 in the sample
- 95% CI (170.85, 173.21) cm
- 99% CI (170.47, 173.58) cm
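The multipliers 1.96 and 2.58 are quantiles of the standard Normal distribution. As a rough sketch, they can be obtained from `statistics.NormalDist` in the Python standard library (Python 3.8 or later) and applied to the height example; the helper names are illustrative, and small differences in the last decimal place compared with the slide come from rounding the multiplier.

```python
from statistics import NormalDist


def normal_multiplier(confidence):
    """Two-sided critical value of the standard Normal for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def confidence_interval(estimate, standard_error, confidence):
    margin = normal_multiplier(confidence) * standard_error
    return estimate - margin, estimate + margin


# Heights: mean 172.03 cm, standard error 0.603 cm (n = 100).
for level in (0.95, 0.99):
    low, high = confidence_interval(172.03, 0.603, level)
    print(f"{level:.0%} multiplier {normal_multiplier(level):.2f}: "
          f"{low:.2f} to {high:.2f} cm, width {high - low:.2f} cm")
```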
67Sample size
- It is possible to determine what sample size should be taken, if we wish to achieve a given level of precision
- This is because precision can be increased by reducing the size of the standard error
- The size of the standard error is based on the size of the sample
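One way to make this concrete is to rearrange margin = 1.96 x SD / √n to give n = (1.96 x SD / margin)². The sketch below applies this standard rearrangement, which the lecture does not state explicitly; the target margin of 0.5 cm is a hypothetical choice and the SD of 6.03 cm is taken from the first height sample.

```python
from math import ceil


def sample_size_for_margin(sd, margin, critical_value=1.96):
    """Smallest n for which critical_value x sd / sqrt(n) is no larger than margin."""
    return ceil((critical_value * sd / margin) ** 2)


# Hypothetical target: estimate mean height to within 0.5 cm with 95% confidence,
# assuming an SD of about 6.03 cm.
n_needed = sample_size_for_margin(6.03, 0.5)
print(n_needed)
```

Halving the margin roughly quadruples the required sample size, because the standard error falls with the square root of n.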
68Practicals
- There are no new computer practicals this week.
- Instead there is the opportunity to complete the data description practical from last week and to ask questions in the Friday time slot.
- The worked example for the data description practical will be on the web early next week.