Title: Classification, analysis
1Classification, analysis and interpretation of
quantitative data
2"There are three kinds of lies: lies, damned
lies, and statistics"
(quoted by B. Disraeli; attributed to Mark Twain)
3Session Overview
- To discuss ways in which quantitative data can be classified and analysed
- Identify some important considerations in the use of statistics in your dissertation
- Demonstrate SPSS
4Research Process
Research process
5Data analysis depends on..
- The number of variables being examined
- The level of measurement
- Whether the purpose is descriptive or inferential
6 7Number of variables
- Single variable (Univariate)
  - describe individual characteristics separately, e.g. age, sex, income level
- 2 variables (Bivariate)
  - sex and income level (do males earn more than females?)
- 3 or more variables (Multivariate)
  - sex, income and education (are differences in income level due to sex and/or education?)
8Methods of analysis
9Descriptive and Inferential Statistics
- Descriptive Statistics
  - Methods used to summarise or describe observations, e.g. frequency distributions, average, range, standard deviation
- Inferential Statistics
  - use observations from a sample(s) to make generalisations or predictions about a population(s)
  - statistical tests
10Frequency Count
- In order to see patterns in data it is useful to
classify the data. This might involve dividing
the range of measured values into groups (ten is
a reasonable number) and then placing the
subjects into a group. The number of subjects in
a category is the frequency count for that
category.
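The classification step described above can be sketched in a few lines of Python. This is only an illustration: the pulse readings are invented, the helper name `group_letter` is hypothetical, and the ten groups (A = 50-54 up to J = 95-99) are assumed to match the pulse-rate groups used in this session.

```python
from collections import Counter

# Hypothetical pulse readings (beats per minute), invented for illustration.
pulses = [62, 71, 75, 78, 78, 80, 83, 91, 58, 76]

def group_letter(pulse, lowest=50, width=5, n_groups=10):
    """Return the letter (A-J) of the 5-beat-wide group that contains pulse."""
    index = (pulse - lowest) // width
    if not 0 <= index < n_groups:
        raise ValueError(f"pulse {pulse} is outside the classified range")
    return "ABCDEFGHIJ"[index]

# The frequency count for a category is the number of subjects placed in it.
frequency = Counter(group_letter(p) for p in pulses)

for letter in "ABCDEFGHIJ":
    print(letter, frequency[letter])
```

A `Counter` returns 0 for groups that received no subjects, so empty categories still appear in the printed table.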
11Take your pulse. Identify which letter
corresponds to your pulse rate.
A 50-54
B 55-59
C 60-64
D 65-69
E 70-74
F 75-79
G 80-84
H 85-89
I 90-94
J 95-99
13Here is an example of what we would expect to see
from a reasonably homogeneous and representative
sample.
Group   Frequency
A       1
B       2
C       4
D       6
E       9
F       15
G       9
H       6
I       4
J       2
14Frequency Distribution Chart
- This refers to a graph which shows the frequency (number of subjects) in each category on a ranked scale.
- The categories are on the horizontal axis and the frequency count is on the vertical axis
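As a rough sketch, the example frequencies for pulse groups A to J can be drawn as a text chart in Python. The function name `text_histogram` is hypothetical, and for simplicity the categories run down the page rather than along the horizontal axis, so each bar grows to the right instead of upwards.

```python
# Frequencies copied from the example table of pulse-rate groups A-J.
frequencies = {"A": 1, "B": 2, "C": 4, "D": 6, "E": 9,
               "F": 15, "G": 9, "H": 6, "I": 4, "J": 2}

def text_histogram(freqs):
    """Return one line per category: label, a bar of '#' marks, and the count."""
    return [f"{label} {'#' * count} {count}" for label, count in freqs.items()]

for line in text_histogram(frequencies):
    print(line)
```

The longest bar sits in the middle group (F) and the bars shorten towards both ends, which is the roughly symmetrical shape expected from a homogeneous sample.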
15Descriptive statistics: measures of central tendency
- Mean (Average)
  - sum of the observations divided by sample size
- Mode
  - the value with the largest frequency
  - only for large samples in frequency form
- Median
  - middle value when values are placed in order
  - if n is an even number, take the average of the two values nearest to the middle
16Example: five pulses in ascending order: 60, 72, 72, 75, 81
Mean = 72 (360 divided by 5)
Mode = 72 (most common number)
Median = 72 (middle value of data set)
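The same three measures can be checked with Python's standard `statistics` module. This is a minimal sketch using the five pulses from the example; the four-value list at the end is invented, purely to show the rule for an even number of observations.

```python
import statistics

# The five pulses from the worked example, in ascending order.
pulses = [60, 72, 72, 75, 81]

mean = statistics.mean(pulses)      # sum of observations / sample size
mode = statistics.mode(pulses)      # value with the largest frequency
median = statistics.median(pulses)  # middle value of the ordered data

print("Mean:", mean)
print("Mode:", mode)
print("Median:", median)

# Hypothetical even-sized sample: the median is the average of the
# two values nearest the middle (72 and 75).
even_sample = [60, 72, 75, 81]
print("Median of even-sized sample:", statistics.median(even_sample))
```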
17Although the mean is the most commonly used
descriptive statistic, certainly by popular
media, there are situations when it is
inappropriate.
Scenario 1: The media reports that the average
income in a village is 100,000. This might sound
like a wealthy place with a high number of
well-paid jobs, until you discover that one
villager is a manager of an international oil
company and has an income of 1 million. In this
case the data are skewed and it would be more
useful to know the median income.
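A small Python sketch shows how a single extreme value can pull the mean away from the typical income. The village below is hypothetical: the figures are invented so that the mean comes out at the reported 100,000, and they are not taken from any real data.

```python
import statistics

# Hypothetical village: ten villagers earning 10,000 each and one
# oil-company manager earning 1,000,000 (invented figures).
incomes = [10_000] * 10 + [1_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print("Mean income:", mean_income)
print("Median income:", median_income)
```

With these invented figures the mean is ten times the median, and the median is what every villager except one actually earns.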
19Scenario 2: An Olympic Athletes' conference
accidentally mingles with the Kenco Coffee Club
AGM on a coffee break. Someone measures
everybody's pulse and plots the results on a
graph.
21Descriptive statistics: variation
- Range
  - difference between the largest and smallest values in your sample
- Standard deviation
  - a measure of the typical deviation of the observations from the mean
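Both measures of variation can be worked through in Python. This sketch reuses the five pulses from the earlier example (60, 72, 72, 75, 81) and assumes the sample form of the standard deviation (dividing by n - 1), which the slides do not specify.

```python
import math
import statistics

pulses = [60, 72, 72, 75, 81]

# Range: difference between the largest and smallest values.
value_range = max(pulses) - min(pulses)

# Standard deviation, step by step (sample form, dividing by n - 1;
# this choice is an assumption, not stated in the session).
mean = sum(pulses) / len(pulses)
squared_deviations = [(p - mean) ** 2 for p in pulses]
sample_variance = sum(squared_deviations) / (len(pulses) - 1)
sample_sd = math.sqrt(sample_variance)

print("Range:", value_range)
print("Standard deviation:", round(sample_sd, 2))

# Cross-check the hand calculation against the standard library.
print("Matches statistics.stdev:",
      math.isclose(sample_sd, statistics.stdev(pulses)))
```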
22Quantitative Analysis: Interpreting numerical data
Part 2: Inferential statistics
23Research Process
Inferential statistics
24Hypothesis
- Supposition about the data
- Is there a difference between data sets A and B?
- Null hypothesis: there is no difference between data sets A and B
- ACCEPT or REJECT?
- Significance Testing
25Significance testing
- to determine whether an observed difference (or association) between 2 or more sets of data is real or could have arisen by chance
- statistical tests have been derived to apply to different types of data; the end result is a significance level or probability (p) value.
26Group           Don't have disease   Have disease
Trial group       0                    100
Control group     0                    100
27Group           Don't have disease   Have disease
Trial group       100                  0
Control group     0                    100
28Group           Don't have disease   Have disease
Trial group       75                   25
Control group     55                   45
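When the split between groups is less clear-cut, as with the counts 75/25 and 55/45, a significance test is needed to judge whether the difference could have arisen by chance. The Python sketch below applies a chi-squared test, which is one possible choice and not necessarily the one used in the session. It assumes the table reads trial group 75 without and 25 with the disease, control group 55 without and 45 with; it applies no continuity correction; and the function name `chi_squared_2x2` is hypothetical.

```python
import math

# Observed counts, assuming rows = (trial, control) and
# columns = (don't have disease, have disease).
observed = [[75, 25],
            [55, 45]]

def chi_squared_2x2(table):
    """Chi-squared statistic and p value for a 2x2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if group and disease were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_squared_2x2(observed)
print("Chi-squared:", round(chi2, 2))
print("p value:", round(p_value, 4))
```

The function divides by the expected counts, so it cannot be used on a table in which a whole row or column is zero.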
29Probability
Probability = number of desired outcomes / number
of possible outcomes
Examples: Chance of getting a head on a coin is
1/2. Probability of obtaining the number three on
a die is 1/6.
30Coincidence
If the probability of each of two separate
(independent) events is known, the probability of
both events happening together is calculated by
multiplying the two probabilities together.
Example: the probability of obtaining both a head
on a coin and a three on a die in any one go is
1/2 x 1/6 = 1/12.
Below are all the possible outcomes:
H1 T1 H2 T2 H3 T3 H4 T4 H5 T5 H6 T6
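The multiplication rule can be checked against a direct count of the outcomes. A minimal Python sketch, using exact fractions so that the two results can be compared without rounding:

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every possible combined outcome of one coin toss and one die roll.
outcomes = [f"{c}{d}" for c, d in product(coin, die)]
print(outcomes)

# Multiplication rule for two independent events.
p_head = Fraction(1, len(coin))
p_three = Fraction(1, len(die))
p_both = p_head * p_three

# Direct count: desired outcomes / possible outcomes.
p_counted = Fraction(outcomes.count("H3"), len(outcomes))

print("By multiplication:", p_both)
print("By counting:", p_counted)
```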
31Comparing two sets of data
(Diagram: two overlapping circles labelled "People
with brown hair" and "People with blue eyes")
The area shaded is the overlap, commonly known as
the intersection. It represents a region in which
data from a subject is common to both groups.
In this case it represents people with both
blue eyes and brown hair.
32Using Probability to compare two sets of data
Statistical tests calculate a P value. This
indicates the probability of seeing a difference
at least as large as the one observed if groups A
and B really came from the same population. If P
is less than 0.05 we say there is a significant
difference between A and B.
P = 0.8
P = 0.05
P = 0
33Probability
- By convention, a significance level of 5% (p = 0.05) is considered to be acceptable
- 5% risk of rejecting the null hypothesis when it is in fact true
- put the other way round, if there were really no difference, a result as extreme as the one observed would arise by chance less than 5% of the time.
34Significance level
35Parametric v Non-parametric
- Parametric tests
  - used on normally distributed data
  - used on interval/ratio data, not nominal or ordinal
  - samples have equal variances
- Non-parametric tests
  - nominal data (Chi squared)
  - ordinal data, or interval/ratio data that do not meet the parametric assumptions
36Which tests?
37Correlation
So far we have considered how to make
comparisons of one variable between two sets of
data taken from either one group of subjects
or two groups of subjects. Sometimes we need to
examine the relationship between two different
variables. This is known as correlation.
38Scatter Plots
A scatter plot can illustrate how one variable
changes relative to another variable. Here
height and weight are shown together.
39Calculating Correlation
Data may fall within a range or band of values as
shown in the plot below, rather than on a perfect
straight line. We need to be able to quantify
the strength of the relationship between the two
variables in situations like this. We need to
know how close to a straight line our data falls.
40Correlation Coefficient
Formulae exist which can calculate this.
Pearson's formula calculates a value given the
symbol r; Spearman's formula calculates a
rank-based value called rho. Both can be between
-1 and 1. The closer the value is to -1 or 1, the
stronger the correlation; a value near 0
indicates little or no linear relationship.
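For illustration, Pearson's formula can be written out directly in Python. This is a sketch rather than a replacement for SPSS: the height and weight figures are invented, and the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of the deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and weights (kg), invented for illustration.
heights_cm = [150, 160, 165, 172, 180, 188]
weights_kg = [52, 58, 63, 70, 79, 85]

r = pearson_r(heights_cm, weights_kg)
print("r =", round(r, 3))
```

Points that lie exactly on a rising straight line give a value of 1, and points on a falling straight line give -1.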
41The Correlation Coefficient
The correlation coefficient can be between -1 and
1. The sign (+ or -) tells us the direction of
the gradient. The value tells us how close to a
straight line the data falls.
r = 1
r = -1
r = 0
42Some more examples
r = -0.9
r = 0.3
r = -0.6
43Causal and Non-causal Relationships
It is quite common for two variables to show a
strong correlation. But does this mean that a
change in one causes a change in the other? The
answer is that correlation does not necessarily
mean cause and effect. A strong correlation may
arise between two very different variables which
have changed for two very different reasons over
a period of time or under certain circumstances.
44Example of Non-Causal Correlation
Variable 1: Coronary heart disease exhibits a
winter peak and summer trough in incidence and
mortality, in countries both north and south of
the equator. In England and Wales, the winter
peak accounts for an additional 20,000 deaths per
annum. It is likely that this reflects seasonal
variations in risk factors. Seasonal variations
have been demonstrated in a number of lifestyle
risk factors such as physical activity and diet.
However, a number of studies have also suggested
a direct effect of environmental temperature on
physiological and rheological factors.
Pell JP, Cobbe SM. Seasonal variations in coronary
heart disease. QJM. 1999 Dec; 92(12): 689-96.
45Variable 2: Ice cream sales vary depending upon
the time of year. More ice creams are sold when
the weather is warmer, fewer when it is colder.
46If the numerical data from the ice cream company
is plotted against the numerical data for heart
attacks, the result seems to suggest that an
increase in ice cream consumption could cause a
decrease in heart attacks. Of course this is
wrong. There is, however, an associative
relationship: environmental temperature has an
effect on both variables.
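The effect of a shared third variable can be demonstrated with a small Python sketch. All three monthly series below are invented purely for illustration and are not real temperature, sales or mortality figures; `pearson_r` is a hypothetical helper implementing Pearson's formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly figures, January to December (all invented).
temperature_c = [4, 5, 7, 10, 13, 16, 18, 18, 15, 11, 7, 5]
ice_cream_sales = [20, 22, 30, 41, 55, 70, 82, 80, 64, 45, 29, 23]
heart_deaths = [130, 128, 120, 112, 104, 97, 92, 93, 100, 110, 121, 127]

r_temp_ice = pearson_r(temperature_c, ice_cream_sales)
r_temp_deaths = pearson_r(temperature_c, heart_deaths)
r_ice_deaths = pearson_r(ice_cream_sales, heart_deaths)

print("Temperature v ice cream sales:", round(r_temp_ice, 2))
print("Temperature v heart deaths:", round(r_temp_deaths, 2))
print("Ice cream sales v heart deaths:", round(r_ice_deaths, 2))
```

In this made-up data the ice cream and heart-death series are strongly negatively correlated even though neither was generated from the other; both simply follow the temperature column.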
47Non-Causal Correlation
A strong positive correlation has been
demonstrated between change in number of storks
and human population in European cities. But
does this prove that babies are delivered by
storks?
Adapted from Mould, R. (1989) Introductory
medical statistics. Bristol: Adam Hilger, p. 170
48Which test?
- type of data
- paired or unpaired?
- normally distributed?
- associations or differences?
- Plus..
- Use flow charts
- ask supervisor or statistician
49Questions
- What stats measures have been used?
- Are they appropriate and how do you know?
- Are the stats presented appropriately and in a way that enables you to understand?
- Are appropriate conclusions drawn?
50Finally...
- statistical significance may not mean clinical significance.
- significance testing relies on probabilities, so there is no definitive answer (there is a risk that you may wrongly accept or reject the null hypothesis)
- Make sure that conclusions you draw from your results are supported by statistical evidence
- don't make more of the data than is there
- a small sample size means less power, so you may not get any statistically significant findings