Reminder of 1st lecture


Slides: 69

Provided by: moll2

1
Reminder of 1st lecture

Data types
Summarising qualitative data: counts
Summarising quantitative data
Location: mean, median
Spread: standard deviation, interquartile range
Graphical presentation of data

2
Qualitative data
Dichotomous (Binary): This variable has only 2
possible categories (mutually exclusive)
Nominal: This variable has more than 2 categories,
mutually exclusive and unordered
Ordinal: This variable has more than 2 categories,
mutually exclusive and ordered
3
Quantitative data
Continuous: This is used for something measured
on a scale. The variable can take any value
within a range of values, e.g. height in cm,
weight in kg.
Discrete: This variable often represents counts
(integer values), e.g. age to nearest year,
number of children
4
Measures of central tendency

Mean - arithmetic average (μ, mu)
add up all observations and divide by the number
of observations
Median - central value of the distribution
rank the observations; the median is the
observation below which 50% of all values fall
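As a small illustration of the two definitions above, the following sketch uses Python's standard `statistics` module. The heights are made-up values chosen for illustration, not data from the lecture.

```python
import statistics

# Hypothetical heights in cm (illustrative values only)
heights = [160, 165, 170, 172, 198]

# Mean: add up all observations and divide by the number of observations
mean_height = statistics.mean(heights)

# Median: rank the observations and take the central value
median_height = statistics.median(heights)

print("mean:", mean_height)
print("median:", median_height)
```

The one tall individual pulls the mean above the median, which is one reason the median is often preferred for skewed data.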

5
Histogram

Assessment of normality

6
Appropriate measures of location and spread
7
Reminder of 2nd lecture

The probability for an event or outcome indicates
how likely it is to happen.
The probability of an event has to be between 0
and 1.
A probability of 0 means that the event never
happens.
A probability of 1 means that an event always
happens.
Events that cannot happen together are mutually
exclusive and their probabilities can be added
together: P(A or B) = P(A) + P(B)
Events that are independent do not affect each
other and their probabilities can be multiplied
together: P(A and B) = P(A) x P(B)
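The addition and multiplication rules can be checked with a fair six-sided die. The die is an assumed example, not one from the lecture, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, each face has probability 1/6
p_face = Fraction(1, 6)

# Mutually exclusive events on ONE roll: "roll a 1" or "roll a 2"
p_one_or_two = p_face + p_face        # P(A or B) = P(A) + P(B)

# Independent events on TWO rolls: "six on first" and "six on second"
p_double_six = p_face * p_face        # P(A and B) = P(A) x P(B)

print(p_one_or_two)   # probability of a 1 or a 2 on one roll
print(p_double_six)   # probability of two sixes in two rolls
```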

8
Properties of Normal curve
P = 0.68, P = 0.95, P = 0.999
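The figure for this slide is not part of the transcript. As a hedged sketch, the probabilities listed are consistent with the area of a Normal curve within 1, 1.96 and 3.29 standard deviations of the mean; those three multipliers are an assumption here, not something stated on the slide.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Area under the Normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Assumed multipliers (not given in the transcript)
for k in (1.0, 1.96, 3.29):
    print(f"within {k} SD of the mean: {prob_within(k):.3f}")
```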
9
Chi-Squared

Another important distribution related to the
normal is the Chi-squared distribution.
It is used when investigating categorical
data.

10
Crosstab Example

If eye and hair colour are not associated then,
for example, the expected number with blue eyes
and blond hair would be
(total with blue eyes x total with blond hair) / overall total

So the chi-squared statistic is found by looking
at a function of the discrepancy between observed
and expected counts in each cell,
summed over all combinations of hair and eye
colour.
If this is large and in the tail of the
distribution, then it may be said that the
observed is not as expected!
More of this later.
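A minimal sketch of the calculation described above. The counts in the table are hypothetical, and the "function of the discrepancy" is taken to be the usual Pearson form, (observed - expected)^2 / expected, which the slide does not spell out.

```python
# Hypothetical observed counts (rows: eye colour, columns: hair colour)
observed = [
    [30, 20],  # blue eyes:  blond, dark
    [10, 40],  # brown eyes: blond, dark
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if eye and hair colour are not associated:
# row total x column total / overall total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared: discrepancy summed over every cell of the table
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print("expected blue eyes and blond hair:", expected[0][0])
print("chi-squared:", round(chi_squared, 2))
```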

12
Summary

Probabilities are integral to all things around
us.
We can derive and understand probabilities.
We have seen that probabilities build together to
form probability distributions.
Some are theoretical distributions that are well
understood, the most important being the Normal.
Using these theoretical distributions we can
begin to make inferences about the population on
the basis of samples.

13
Sampling, sampling distributions and statistical
inference

Gordon Prescott

14
Wednesday, 11 October 2006
Scots bar staff health 'improved'

The health of Scotland's bar staff has improved
dramatically since the introduction of a smoking
ban, a medical study has found.
Researchers at Dundee University found
significant health improvements in the first two
months after the March ban.
The results have led to calls for the UK
Government to speed the introduction of a similar
ban south of the border.
But smokers' rights group Forest said the link
between passive smoking and ill health had not
been proven.
The team from the university's asthma and allergy
research group began testing bar workers in and
around Dundee in February, a month before the ban
came into force.
Using a series of indicators, they established
symptoms attributable to passive smoking,
measuring lung function and inflammation in the
bloodstream.
"This study provides compelling evidence that
making workplaces smoke free can have a
significant and speedy impact on people's
health" - Peter Hollins, British Heart
Foundation

15
Study of bar workers in Dundee before and after
the smoking ban

JAMA. 2006;296:1742-1748
To investigate the association of smoke-free
legislation with symptoms, pulmonary function,
and markers of inflammation of bar workers (n = 77)
At 1 month:
The percentage of bar workers with respiratory
and sensory symptoms decreased from 79.2% (n = 61)
before the smoke-free policy to 53.2% (n = 41)
(total change -26%; 95% confidence interval [CI],
-13.8% to -38.1%; P < 0.001)
Forced expiratory volume in the first second
increased from 96.6% predicted to 104.8%
(change 8.2%; 95% CI, 3.9% to 12.4%; P < 0.001)
Serum cotinine levels decreased from 5.15 to 3.22
ng/mL
(change -1.93 ng/mL; 95% CI, -2.83 to -1.03
ng/mL; P < 0.001)

16
Study of bar workers in Dundee before and after
the smoking ban

JAMA. 2006;296:1742-1748
At 2 months:
Total white blood cell count reduced from
7610 to 6980 cells/µL
(-630 cells/µL; 95% CI, -1010 to -260 cells/µL;
P = 0.002)
Neutrophil count was reduced from 4440 to 4030
cells/µL
(-410 cells/µL; 95% CI, -740 to -90 cells/µL;
P = 0.03)
Smoke-free legislation was associated with
significant early improvements in symptoms,
spirometry measurements, and systemic
inflammation of bar workers

17
Sampling theory

A population is the totality of observations
obtainable from all subjects possessing some
common specified characteristic
male diabetics
height of all females in the UK
A sample is a set of observations which
constitutes part of a population

18
Random and biased samples

Random sampling is a sampling technique where
each member in the population is chosen entirely
by chance, with a known chance of being included
in the sample. Representative of the population.
The most common random sample is obtained when
any one individual or measurement in the
population is as likely to be selected as any
other.
Can also have a biased sample, where some
individuals or measurements have a greater chance
of being included than others.
With a non-random sampling technique not all
members have a known chance of being selected, or
some members have a zero chance of being
selected. Unrepresentative of the population.

19
Selection of sample

Probability sampling
Each item/person in the population has a
calculable non-zero chance of being selected for
the sample.
Sampling error (degree to which the sample
differs from the population) can be calculated.
Convenience sampling
The items/people are selected by the researcher
from the population in a non-random manner.
The sampling error is unknown and cannot be
calculated.

Random sampling
Simple Random sampling,
Systematic Random sampling,
Stratified Random sampling,
Cluster Random sampling,
Multi-Stage sampling.
Biased sampling
Quota sampling

21
Probability sampling

Simple random sample
Each item in the population has an equal chance
of being selected for the sample

22
Random number table

84 42 56 53 87 75
78 87 77 03 57 09
85 86 48 86 12 39
65 37 93 76 46 11
09 49 41 73 76 49
64 06 71 99 37 06
46 69 31 24 33 52
67 85 07 75 56 96
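In practice a random number generator usually replaces the printed table. A minimal sketch of a simple random sample, assuming a hypothetical population of 100 numbered individuals and a sample size of 10:

```python
import random

rng = random.Random(2006)  # fixed seed only so the sketch is repeatable

# Hypothetical population: individuals numbered 1 to 100
population = list(range(1, 101))

# Simple random sample without replacement: every individual has the
# same chance of being selected
sample = rng.sample(population, 10)

print(sorted(sample))
```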

23
Systematic random sampling

Choose one element at random from the population
of size N
For a sample of size n, choose every (N/n)th element
thereafter

Advantages - It is simpler and can be more
representative than a simple random sample
Disadvantages - possibility of implicit
clustering, not a simple random sample
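A sketch of systematic sampling under one common variant: the random start is drawn from the first N/n elements so that stepping by N/n stays inside the list. The function name and the population of 30 numbered items are hypothetical.

```python
import random

def systematic_sample(population, n, rng):
    """Take every (N/n)th element after a random start in the first interval."""
    interval = len(population) // n      # sampling interval N/n
    start = rng.randrange(interval)      # random start within the first interval
    return population[start::interval][:n]

rng = random.Random(23)                  # fixed seed only for repeatability
population = list(range(1, 31))          # hypothetical list of 30 items
sample = systematic_sample(population, 10, rng)

print(sample)
```

If the list is ordered in a way that repeats with the same period as the interval, the sample can be unrepresentative, which is the implicit clustering risk noted above.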

24
Example: systematic random sampling

A survey of GPs in Scotland to assess counselling
services provided to patients is to be conducted.
1 in 3 GPs are to be included in the sample.
List of GPs available is ordered by practice
within Health Board.
Systematic sample will ensure representative
spread across health boards and practices.

25
Stratified random sampling

The strata are subgroups of the population which
are chosen to minimize differences between
members of the same stratum and maximize the
differences between members of different strata.

Main advantages
Increases the representativeness of the sample
Increases the precision of the resulting
estimates
Allows comparison between strata

26
Example: stratified random sampling

It is likely that size of practice, site of
practice (rural/urban) may influence whether the
practice employs a counsellor
The list of GPs could be stratified by number
of partners (< 4, > 4) in the practice and area
(rural/urban)
A simple random sample from each of the four
lists would constitute the sample
The sample would ensure that each stratum (or
combination of strata) is represented in the
sample
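The GP example can be sketched as below. The list of practices, the way they are assigned to strata, and the sample size of 5 per stratum are all invented for illustration.

```python
import random

rng = random.Random(26)  # fixed seed only for repeatability

# Hypothetical GP list: (id, practice size, area)
gps = [
    (i, "small" if i % 2 else "large", "rural" if i % 3 == 0 else "urban")
    for i in range(120)
]

# Split the list into the four strata (size x area)
strata = {}
for gp in gps:
    strata.setdefault((gp[1], gp[2]), []).append(gp)

# Simple random sample of 5 GPs from each stratum
sample = []
for key, members in strata.items():
    sample.extend(rng.sample(members, 5))

print(len(strata), "strata,", len(sample), "GPs sampled")
```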

27
Cluster random sampling

The clusters are subgroups of the population
which are chosen to maximize differences between
members of the same cluster and therefore
minimize the differences between members of
different clusters

Advantages - Cheaper and faster than a simple
random sample
Disadvantages - Less representative than a simple
random sample and there is a danger of
contamination between respondents

28
Example: cluster random sampling

For a nutritional survey to be carried out, 20
schools in Scotland will be randomly selected.
All secondary year 4 children from those schools
selected will be interviewed.
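A sketch of the school example: clusters (schools) are sampled at random and then every member of a chosen cluster is included. The 100 schools with 30 pupils each are hypothetical.

```python
import random

rng = random.Random(28)  # fixed seed only for repeatability

# Hypothetical population: 100 schools, each with 30 year 4 pupils
schools = {
    f"school_{i}": [f"pupil_{i}_{j}" for j in range(30)]
    for i in range(100)
}

# Stage 1: randomly select 20 schools (the clusters)
chosen_schools = rng.sample(sorted(schools), 20)

# Stage 2: interview ALL pupils in the chosen schools
pupils = [p for school in chosen_schools for p in schools[school]]

print(len(chosen_schools), "schools,", len(pupils), "pupils")
```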

29
Multi-stage sampling

Different sampling units are sampled at different
stages
Example
Geographical areas of the UK would randomly be
selected, from which hospitals would be randomly
selected from which wards/patients would then be
randomly selected.

30
Convenience sampling

Quota sampling
In this type of sampling an individual is chosen
by an interviewer.
To avoid undue bias the quota is sub-divided into
various categories e.g. male/female, old/young
and so on.
The interviewer is given quotas for each category
and uses discretion to select the interviewees.

31
Statistical Inference
POPULATION
SAMPLE
INFERENCE
32

Which sampling approach will lead to unbiased
estimates?

33
Statistical Inference
POPULATION
SAMPLE
INFERENCE
34
Population parameters and sample statistics

A population parameter is a measurable
characteristic of the population
Values obtained from a sample are estimates of
the population parameters

35
Estimation I

A parameter is a numerical descriptive measure of
a population. It is calculated from the
observations in the population
mean - μ
standard deviation - σ
A sample statistic is a numerical descriptive
measure of a sample. It is calculated from the
observations in the sample
Sample mean - x̄
Sample standard deviation - s

36
Estimation II

Population parameters are fixed so long as the
population itself does not change
Sample statistics will vary from sample to
sample, even though samples may be random and the
population does not change

37
Sampling distribution

In theory we could select all possible random
samples from a population and gain an estimate of
the population parameter from each of the
individual samples.
If a histogram of the estimates from all of the
individual samples was plotted, this would form
the sampling distribution of the estimate
(a probability distribution).

38
Sampling Distribution
39
Sampling distribution

The sampling distribution of a sample statistic,
calculated from a sample of n measurements, is
the probability distribution of the statistic

40
Example

Imagine that a random sample of 100 individuals
is to be selected from a population
Their height in cm is measured
The mean height is computed
Another random sample of 100 individuals from the
same population is taken
Their height in cm is measured
Their mean height is computed
This is repeated until 20 random samples have
been taken

41
20 samples of size 100

The first sample of heights of 100 people gives a
mean of 172.03 cm and a standard deviation (SD)
of 6.03 cm.
The second sample gives mean 173.50 cm SD 6.74
cm.
These figures represent the mean height (cm) for
each of the 20 random samples
172.03 173.50 171.89 171.95 170.59
172.63 172.72 171.99 172.50 171.71
172.55 172.86 171.58 172.83 172.55
171.28 172.62 171.41 171.38 172.26

42
Histogram of means of 20 samples
43
Histogram of means of 100 samples
44
Sampling error

Each random sample may have a different estimate
of the population parameter due to sampling
variation
Knowledge of the sampling distribution allows us
to assess how close the estimate obtained from
one individual sample is to the true population
parameter. This is known as precision.

45
Precision

The larger the size of the sample the greater the
reduction in sampling error
Taking a larger sample will result in reducing
the sampling variation from the true
population value that we are trying to estimate.
This implies that our estimate would be more
precise.

46
Standard Error

The standard deviation of the sampling
distribution of the mean is known as the standard
error of the mean.
The standard error provides a measure of how far
from the true population value the estimate is
likely to be (the precision)

47
Standard deviation/standard error

The standard deviation, s, is a measure of the
variability of individuals in a sample
The standard error is a measure of the
uncertainty in the sample statistic (e.g. mean,
proportion)

48
What does the standard error indicate?

Consider again that a random sample of 100
individuals is to be selected from a population
Their height in cm is measured
The mean height is computed
Another random sample of 100 individuals from the
same population is taken
Their height in cm is measured
Their mean height is computed
This is repeated until 20 random samples have
been taken

49
20 samples of size 100

These figures represent the mean height (cm) for
each of the 20 random samples
172.03 173.50 171.89 171.95 170.59
172.63 172.72 171.99 172.50 171.71
172.55 172.86 171.58 172.83 172.55
171.28 172.62 171.41 171.38 172.26
Mean of the 20 sample means = 172.14 cm
SD of the 20 sample means = 0.689 cm ≈ SE (mean)
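The two summary figures can be reproduced from the 20 sample means listed on the slide. The sketch also compares them with the standard error estimated from the first sample alone (SD 6.03 cm, n = 100), using the formula SD / √n.

```python
from math import sqrt
from statistics import mean, stdev

# Mean height (cm) of each of the 20 random samples, as listed on the slide
sample_means = [
    172.03, 173.50, 171.89, 171.95, 170.59,
    172.63, 172.72, 171.99, 172.50, 171.71,
    172.55, 172.86, 171.58, 172.83, 172.55,
    171.28, 172.62, 171.41, 171.38, 172.26,
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)      # empirical spread of the sample means

# Standard error estimated from the first sample only: SD / sqrt(n)
se_first_sample = 6.03 / sqrt(100)

print(f"mean of sample means: {mean_of_means:.2f} cm")
print(f"SD of sample means:   {sd_of_means:.3f} cm")
print(f"SE from one sample:   {se_first_sample:.3f} cm")
```

The spread of the 20 sample means is close to the standard error computed from a single sample, which is the point of the slide.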

50
Explanation of sampling error

5 students in population
Ages are 22, 25, 28, 30, 35
Sample of 3 students randomly selected to
estimate age in population (5 students)

51
22, 25, 28, 30, 35: μ = 28

Sample 1: 22, 30, 35, mean 29
Sample 2: 25, 28, 35, mean 29.3
Sample 3: 25, 28, 30, mean 27.7
Sample 4: 25, 30, 35, mean 30
Sample 5: 28, 30, 35, mean 31

NB: Variation in age 22 to 35. Variation in mean
age 27.7 to 31.
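The slide shows five of the possible samples. As a sketch, all 10 possible samples of 3 students can be enumerated to give the complete sampling distribution of the mean age for this tiny population.

```python
from itertools import combinations
from statistics import mean

ages = [22, 25, 28, 30, 35]          # the whole population of 5 students
population_mean = mean(ages)

# Every possible sample of 3 students and its mean age
samples = list(combinations(ages, 3))
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(s, round(m, 1))

print("population mean:", population_mean)
print("mean of all sample means:", round(mean(sample_means), 1))
```

The sample means vary less than the individual ages, and their average equals the population mean.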
52
Relationship between sample size and precision of
sample estimate

Heights
Sample size   Mean (cm)   SD (cm)   Standard error (cm)
100           172.03      6.03      0.60
200           172.77      6.42      0.45
500           171.99      6.85      0.31
1000          172.15      6.84      0.22
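The standard error column can be recomputed from the sample size and SD columns using SE = SD / √n. A short sketch with the table's own figures:

```python
from math import sqrt

# (sample size, SD in cm) taken from the table above
rows = [(100, 6.03), (200, 6.42), (500, 6.85), (1000, 6.84)]

# Standard error of the mean = SD / sqrt(n)
standard_errors = [sd / sqrt(n) for n, sd in rows]

for (n, sd), se in zip(rows, standard_errors):
    print(f"n = {n:4d}  SD = {sd:.2f}  SE = {se:.2f}")
```

The SD stays roughly constant as n grows, while the standard error shrinks in proportion to 1/√n.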

53
Properties of the sampling distribution of the
mean

The mean of the sampling distribution = mean of
the population distribution
Standard deviation of the sampling distribution
= σ / √n = standard error of the mean
The sampling distribution of the mean is
approximately Normal for large sample sizes

54
Central Limit Theorem

If a random sample of n observations is selected
from a population, then when n is sufficiently
large, the sampling distribution of x̄ (the
mean) will be approximately a Normal
distribution.
The larger the sample size, the better the
approximation to the Normal distribution will be.
A sample size of at least 30 will usually be
enough.
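A small simulation can illustrate the theorem. The population here is an assumption chosen for the sketch: a strongly skewed exponential distribution with mean 1 and SD 1, sampled 2000 times with n = 30.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(54)   # fixed seed only for repeatability
n = 30                    # size of each sample
n_samples = 2000          # number of repeated samples

# Skewed (non-Normal) population: exponential with mean 1 and SD 1
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
theoretical_se = 1.0 / sqrt(n)    # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

A histogram of `sample_means` would look close to a Normal curve even though the individual observations are far from Normal.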

55
Statistical Inference
Representativeness Size
POPULATION
SAMPLE
INFERENCE
56
Statistical Inference

Concerned with how we draw conclusions from
sample data, about the larger population from
which the sample is selected.
There are two types of inference
Confidence Intervals (Estimation)
Hypothesis Testing (Significance Testing)

57
Confidence intervals

When we collect data on a sample of individuals
we would not expect the results from our sample
to be exactly the same as those we would get if
we had data on the whole population.
Using the variability in the sample data we can
calculate a range of values in which the
population value is likely to lie.
We can vary the width of this range depending on
how confident we want to be that we will have
included the true population value (usually set
at 95% confidence).

58
Calculation of confidence intervals

Confidence intervals can be calculated for most
sample estimates using the following formula
Sample estimate ± critical value x standard
error (sample estimate)

59
Calculation of confidence interval for a
population mean

Sample estimate = sample mean
Standard error of mean
= sample standard deviation / √n
For large samples, 5% critical value = 1.96
Point estimate = 172.03 cm, SE = 0.603 cm
95% CI (170.85 to 173.21) cm
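The calculation on this slide can be written out step by step. The helper name `confidence_interval` is hypothetical; the inputs are the slide's own figures.

```python
from math import sqrt

def confidence_interval(sample_mean, sample_sd, n, critical_value=1.96):
    """Large-sample CI: estimate +/- critical value x standard error."""
    se = sample_sd / sqrt(n)
    return sample_mean - critical_value * se, sample_mean + critical_value * se

# Figures from the slide: mean 172.03 cm, SD 6.03 cm, n = 100
lower, upper = confidence_interval(172.03, 6.03, 100)

print(f"95% CI: ({lower:.2f} to {upper:.2f}) cm")
```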

60
Small samples

For a small sample (n < 30) we use the
t-distribution with (n-1) degrees of freedom
95% CI for small samples:
sample mean ± t(n-1)(5%) x standard error of mean
When n is large the t-distribution approximates
to the Normal distribution
For n-1 = 5, 10, 30, 60, 120,
t(n-1)(5%) = 2.57, 2.23, 2.04, 2.00, 1.98
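The multipliers 2.57, 2.23, 2.04, 2.00 and 1.98 are the standard two-sided 5% t values for 5, 10, 30, 60 and 120 degrees of freedom. The sketch below looks one up for a small sample and compares the resulting interval with the large-sample one. The six heights are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 5% critical values of t, keyed by degrees of freedom
T_CRITICAL_5PCT = {5: 2.57, 10: 2.23, 30: 2.04, 60: 2.00, 120: 1.98}

# Hypothetical small sample of heights in cm (n = 6, so 5 degrees of freedom)
heights = [168.0, 171.5, 174.0, 169.5, 176.0, 173.0]

n = len(heights)
degrees_of_freedom = n - 1
sample_mean = mean(heights)
se = stdev(heights) / sqrt(n)

t_value = T_CRITICAL_5PCT[degrees_of_freedom]
t_interval = (sample_mean - t_value * se, sample_mean + t_value * se)
z_interval = (sample_mean - 1.96 * se, sample_mean + 1.96 * se)

print("t-based 95% CI:", tuple(round(x, 2) for x in t_interval))
print("z-based 95% CI:", tuple(round(x, 2) for x in z_interval))
```

The t-based interval is wider, reflecting the extra uncertainty in estimating the SD from only a few observations.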

61
Interpretation of CIs

Consider an infinite number of random samples of
size n from a population
A mean and 95% CI could be computed for each of
the samples
For 95% of the samples from the population, the
95% CI will include the population value, whilst
in 5% of samples the 95% CI will not include the
population value.

62
Interpretation of confidence intervals

[Figure: confidence intervals for Sample 1, Sample 2, ..., Sample n, plotted against the true population value (e.g. μ)]
63
Example

Suppose I want to know the average height of UK
people. I have a random sample of 100 people with
a mean of 172.03 cm and SD of 6.03 cm.
The standard error of the mean is 6.03/√100 = 0.603
The estimate of the population mean is 172.03

The 95% confidence interval (CI) for the
population mean is 172.03 - 1.96 x 0.603, 172.03
+ 1.96 x 0.603
95% CI is 170.85, 173.21
I am 95% confident that the average height of UK
people is between 170.85 and 173.21

A different 100 people would have a slightly
different sample mean and confidence interval
95% of random samples of 100 would give a 95%
confidence interval containing the true
population mean

64
100 samples of 100 people from a population of
10000 with known mean 172 cm and SD 6.8 cm
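The coverage idea can be checked by simulation. This sketch draws from a Normal distribution with mean 172 cm and SD 6.8 cm rather than from a finite population of 10000, and uses 1000 repeated samples instead of 100; both choices are simplifying assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(64)          # fixed seed only for repeatability
POP_MEAN, POP_SD = 172.0, 6.8    # known population values from the slide
n = 100                          # people per sample
n_samples = 1000                 # assumed number of repeated samples

covered = 0
for _ in range(n_samples):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
    m = mean(sample)
    se = stdev(sample) / sqrt(n)
    # Does this sample's 95% CI contain the true population mean?
    if m - 1.96 * se <= POP_MEAN <= m + 1.96 * se:
        covered += 1

coverage = covered / n_samples
print(f"proportion of 95% CIs containing the true mean: {coverage:.3f}")
```

The proportion should come out near 0.95, with some simulation noise.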
65
Change in sample size

If the sample size is larger, the standard error
is smaller and therefore the CI is also narrower
For my first sample of 100 the standard error of
the mean was 0.603 cm and 95% CI was 170.85,
173.21 cm

When I took a sample of 200 the mean was 172.77
cm and the SD was 6.42 cm.
The standard error was 6.42/√200 = 0.45 (smaller)
95% CI was 172.77 - 1.96 x 0.45, 172.77 + 1.96 x 0.45
= 171.88, 173.65 cm

For a sample of 1000 the 95% CI was 171.73,
172.58 cm

66
Confidence intervals

Raising the confidence level from 95% to 99%
increases the assurance that the confidence
interval contains the population mean, but it
makes the estimate less precise, i.e. the width of
the CI is wider
Multiplier changes from 1.96 to 2.58.
For height with 100 in the sample
95% CI (170.85, 173.21) cm
99% CI (170.47, 173.58) cm
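The multipliers come from the Normal distribution and can be recovered with the standard library. The helper name `z_multiplier` is hypothetical; the mean and standard error are the slide's figures for the sample of 100.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided Normal critical value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

z95 = z_multiplier(0.95)
z99 = z_multiplier(0.99)

sample_mean, se = 172.03, 0.603   # height sample of 100 from the lecture

width95 = 2 * z95 * se
width99 = 2 * z99 * se

print(f"95% multiplier {z95:.2f}, CI width {width95:.2f} cm")
print(f"99% multiplier {z99:.2f}, CI width {width99:.2f} cm")
```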

67
Sample size

It is possible to determine what sample size
should be taken, if we wish to achieve a given
level of precision
This is because precision can be increased by
reducing the size of the standard error
The size of the standard error is based on the
size of the sample
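One way to turn this into a calculation is to rearrange the confidence interval formula: if the 95% CI should extend no more than a chosen margin either side of the mean, then n must be at least (1.96 x SD / margin)^2. The function name, the margin of 0.5 cm and the use of the earlier SD of 6.03 cm as a planning value are assumptions for this sketch.

```python
from math import ceil, sqrt

def required_sample_size(sd, margin, critical_value=1.96):
    """Smallest n for which critical_value * sd / sqrt(n) <= margin."""
    return ceil((critical_value * sd / margin) ** 2)

# Assumed planning values: SD 6.03 cm, desired margin of +/- 0.5 cm
n_required = required_sample_size(sd=6.03, margin=0.5)
achieved_margin = 1.96 * 6.03 / sqrt(n_required)

print("required sample size:", n_required)
print(f"achieved margin: {achieved_margin:.3f} cm")
```

Because the standard error falls with √n, halving the margin requires roughly four times as many people.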

68
Practicals

There are no new computer practicals this week.
Instead there is the opportunity to complete the
data description practical from last week and
to ask questions in the Friday time slot.
The worked example for the data description
practical will be on the web early next week.
