Title: Student lecture 1 1999
1Introduction and descriptive statistics 30th
August 2006 Tron Anders Moger
2New England Journal of Medicine, Editorial, Jan.
6, 2000, p. 42-49
- The eleven most important developments in
medicine in the past millennium:
- Elucidation of human anatomy and physiology
- Discovery of cells and their substructures
- Elucidation of the chemistry of life
- Application of statistics to medicine
- Development of anesthesia
- Discovery of the relation of microbes to disease
- Elucidation of inheritance and genetics
- Knowledge of the immune system
- Development of body imaging
- Discovery of antimicrobial agents
- Development of molecular pharmacotherapy
3Introduction
- A lot of knowledge appears through numbers and
quantitative data.
- Problems in interpreting statistical results are
often underestimated.
- It is important to learn numerical literacy: the
ability to understand numbers and quantitative
relationships.
4Number of births in former East Germany
5Mortality in Tanzania and Norway
6Research and numbers
- Numbers often appear in medical research.
- The numbers are often uncertain; they have
variability
- They must be organized in order to interpret them
- We wish to generalize the results to the general
population
7Statistical data
- Arise from
  - Numerical measurements with an instrument on a
continuous scale (continuous data). Examples:
    - Fever: 39.6 (unproblematic)
    - IQ: 116 (problematic)
  - Categorization (categorical data). Examples:
    - Man / woman (unproblematic)
    - Depressed / not depressed (problematic)
8Variability in the data
- Reliability: Precision of data? How much will
they differ if the measurements are repeated?
- Validity: Do we capture what we are really
interested in? Is the measurement relevant?
9Reliability of lung function measurements: 6
repeated measurements on 12 students.
10Reliability of questionnaire/interview
- Alcohol use (men 31-50 years)
- Mean number of times alcohol users say that they
have felt intoxicated:
  - 1993 (questionnaire): 14.1 times per year
  - 1994 (interview): 7.3 times per year
- In 1994 they used the word "drunk".
11Reliability of clinical study
- Sackett et al., Clinical Epidemiology (Little,
Brown and Company, 1985). Pictures of the eyes of
100 patients are studied by two clinicians to see
if there is evidence of retinopathy

                        Second clinician
                        No      Yes
First clinician   No    46      10
                  Yes   12      32

- Observed agreement: (46 + 32)/100 = 78%
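The observed agreement on this slide can be checked with a few lines of Python. This is only a sketch: the counts are the four cells of the table above, while the names `table` and `observed_agreement` are made up for the illustration.

```python
# Counts from the slide: key = (first clinician, second clinician).
table = {
    ("No", "No"): 46,
    ("No", "Yes"): 10,
    ("Yes", "No"): 12,
    ("Yes", "Yes"): 32,
}


def observed_agreement(counts):
    """Share of patients for whom both clinicians gave the same answer."""
    total = sum(counts.values())
    agree = sum(n for (first, second), n in counts.items() if first == second)
    return agree / total


agreement = observed_agreement(table)
print(f"Observed agreement: {agreement:.0%}")
```

The two clinicians disagree on the remaining 10 + 12 = 22 patients.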
12Sources of variation in data
- Laboratory variation
- Observer variation
- Instrument variation
- Measurement variation
- Biological variation between individuals
- Day to day variation within the same
individual/hospital
13Generalization
- Sample: The units, experiments, individuals etc.
that are in the study. E.g.
  - 15 patients with migraine
  - Neurophysiological study on rats
- Population: The collection of units etc. one
wishes the results to apply to
  - All patients with migraine
  - All repetitions of the neurophysiological
experiment
14Pairs of terms
Sample                             | Population
-----------------------------------|------------------------------------
Histogram                          | Probability distribution
Mean                               | Expectation
Proportion                         | Risk
Measurements of cholesterol level  | Cholesterol level in the population
Weather                            | Climate
15Types of data
- Continuous data. Data measured on a continuous
scale, e.g. height, weight, age. Can be truly
continuous (with decimals) or discrete (integers)
- Categorical data. Data in categories, e.g.
gender, education level, grouped age, hospital
department. Can be nominal or ordinal.
16Data in SPSS (and other statistical software)
- IMPORTANT: One line in the data file always
corresponds to one observation!
- It is common to have an id variable for each
observation
- If a measurement is missing, leave the cell empty
- To create a new variable in SPSS, choose
Data -> Insert variable in the Data View window, or
write the variable name in Name in the
Variable View window
17Data coding
- For continuous data, enter the value of the variable
- For categorical data, define a suitable coding,
e.g. 0 = male and 1 = female, or 0 = grammar school,
1 = high school and 2 = college/university degree
- In Variable View, the coding can be defined in
Values
- In Label you can write further information about
the variable
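The same layout and coding ideas can be sketched outside SPSS. The Python snippet below is only an illustration: the three observations are invented, and the names `GENDER_CODES`, `EDUCATION_CODES` and `describe` are hypothetical. It keeps one record per observation, an id variable, a missing measurement stored as `None` (the counterpart of an empty cell), and numeric codes with separate value labels.

```python
# Value labels for the numeric codes (the coding suggested on the slide).
GENDER_CODES = {0: "male", 1: "female"}
EDUCATION_CODES = {
    0: "grammar school",
    1: "high school",
    2: "college/university degree",
}

# Hypothetical data: one dict per observation, i.e. one line in the data file.
observations = [
    {"id": 1, "gender": 0, "education": 2, "age": 21},
    {"id": 2, "gender": 1, "education": 1, "age": None},  # age is missing
    {"id": 3, "gender": 1, "education": 0, "age": 24},
]


def describe(obs):
    """Translate the stored codes back to their labels for display."""
    age = "missing" if obs["age"] is None else f"{obs['age']} years"
    gender = GENDER_CODES[obs["gender"]]
    education = EDUCATION_CODES[obs["education"]]
    return f"id {obs['id']}: {gender}, {education}, age {age}"


lines = [describe(obs) for obs in observations]
for line in lines:
    print(line)
```

Storing codes and labels separately mirrors the split between the data cells and the Values column in Variable View.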
18Descriptive statistics
- Tables
- Graphs, plots
- Measures of central tendency
- Measures of variability
19Types of graphs
- Histogram
- Box-plot
- Scatter plot
- Line plot
- Bar plot
20The age of 100 medical students
21How can you get an overview of these data in
SPSS? Explore!
- Choose Analyze - Descriptive Statistics -
Explore. Select the relevant variables by
clicking them, and transferring them to Dependent
List. Choose Plots, remove the check on Stem
and leaf and check Histogram instead. Click
Continue and OK.
22Histogram: The distribution of age among the
students (n = 100)
23Box-plot: The distribution of age among the
students
24Measures of central tendency
- Mean
  - $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  - The students: 22.2 years
- Median
  - The middle observation when the observations
are arranged in increasing order
  - The students: 22.0 years
- The mean is influenced by extreme observations.
The median is robust
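A small Python sketch can show why the median is called robust. The ages below are invented for the illustration (they are not the 100 students from the lecture); one extreme value is added to see how each measure reacts.

```python
from statistics import mean, median

# Hypothetical ages, not the lecture data.
ages = [19, 20, 21, 22, 23]
ages_with_outlier = ages + [45]  # add one extreme observation

mean_before, median_before = mean(ages), median(ages)
mean_after, median_after = mean(ages_with_outlier), median(ages_with_outlier)

print(f"Without outlier: mean {mean_before}, median {median_before}")
print(f"With outlier:    mean {mean_after}, median {median_after}")
```

In this made-up example the single extreme value moves the mean by four years (21 to 25), while the median moves by half a year (21 to 21.5).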
25Measures of variability
- Standard deviation
  - The students: 3.06 years
- Coefficient of variation: $s / \bar{x} \cdot 100\%$
  - The students: 13.8%
- Quartiles: Arrange the data in increasing order.
The 25% quartile is at the observation where 25%
of the observations have lower values, and 75% of
the observations have higher values. (In SPSS:
Check Percentiles in the Statistics menu in
Explore)
  - The students: 25% quartile 20.0 years, 75%
quartile 23.0 years
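The measures on this slide can be reproduced with Python's standard library. This is a sketch under stated assumptions: the nine ages are invented, and `statistics.quantiles` uses its default interpolation rule, which may give slightly different quartiles than SPSS for the same data. The last lines recompute the coefficient of variation from the two summary figures reported for the students.

```python
from statistics import mean, quantiles, stdev

# Hypothetical ages, not the lecture data.
ages = [19, 20, 20, 21, 22, 22, 23, 24, 27]

s = stdev(ages)                    # sample standard deviation (n - 1 divisor)
cv = s / mean(ages) * 100          # coefficient of variation in percent
q1, q2, q3 = quantiles(ages, n=4)  # 25%, 50% and 75% quartiles

print(f"Standard deviation: {s:.2f} years")
print(f"Coefficient of variation: {cv:.1f}%")
print(f"Quartiles: {q1}, {q2}, {q3}")

# Check of the slide's figure: s = 3.06 and mean = 22.2 for the students.
reported_cv = 3.06 / 22.2 * 100
print(f"Students' coefficient of variation: {reported_cv:.1f}%")
```

Because the coefficient of variation is a ratio, it has no unit, which makes it convenient for comparing variability between variables measured on different scales.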
26How to get separate plots for each category of a
categorical variable, e.g. gender
- Click Analyze - Descriptive Statistics -
Explore. Move the continuous variable to
Dependent List.
- Move gender to Factor List
- That's it!
27Separate boxplots for each gender
28Relationship between two continuous variables
Scatter plot!
- Choose Graphs - Scatter - Define. Choose a
variable for the Y-axis and one for the X-axis
- Separate markers for separate groups are achieved
by transferring the categorical variable to Set
Markers by
- Can also include regression lines by choosing
Fit line at total, or a line for each category
by choosing Fit line at subgroups.
29Scatter plot, weight versus height for the
students
30Scatter plot, weight versus height, with
regression lines
- Will talk much more about regression later
31Correlation coefficient
- A numerical measure of the relationship between
two continuous variables x and y
- Ranges between -1 and 1
- Values close to 0: No linear relationship
- Values close to 1 or -1: Almost linear
relationship
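The slide does not say which correlation coefficient is meant; the sketch below assumes the usual Pearson coefficient and computes it by hand in Python. The function name `pearson_r` and all data are invented for the illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: exact increasing line, exact decreasing line, U-shape.
r_up = pearson_r([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
r_down = pearson_r([1, 2, 3, 4, 5], [11, 9, 7, 5, 3])
r_u_shape = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_u_shape)
```

The U-shaped example is worth noting: y is completely determined by x, yet the coefficient is 0, because the coefficient only measures linear relationships.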
32Descriptive statistics for categorical variables
- Not very useful to calculate the mean for e.g.
educational level
- Would like to find the percentages within each
category in the study
- Analyze -> Descriptive Statistics -> Frequencies
- Move the variable to Variable(s)
33Frequency table
The last column shows the cumulative distribution, which
always ends at 100%.
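A frequency table of this kind can be sketched in Python as follows. The ten education levels are invented, and the names `CATEGORIES` and `rows` are hypothetical; the categories are listed in their natural order so that the cumulative column makes sense.

```python
from collections import Counter

# Hypothetical data: education level of ten people.
education = (
    ["grammar school"] * 2 + ["high school"] * 3 + ["college/university"] * 5
)
CATEGORIES = ["grammar school", "high school", "college/university"]

counts = Counter(education)
n = len(education)

rows = []          # (category, frequency, percent, cumulative percent)
cumulative = 0.0
for category in CATEGORIES:
    percent = 100 * counts[category] / n
    cumulative += percent
    rows.append((category, counts[category], percent, cumulative))

for category, freq, percent, cum in rows:
    print(f"{category:<20} {freq:>3} {percent:>6.1f} {cum:>6.1f}")
```

The cumulative column is a running sum of the percent column, so its last entry is 100 whenever every observation falls in one of the listed categories.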
34Simple bar plot
35Relationships between categorical variables
- Choose Analyze -> Descriptive Statistics
-> Crosstabs
- Move one variable to Rows, and another to Columns
- Click Cells, and check relevant percentages
(Rows, Columns or Total)
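The idea of a cross table with row percentages can be sketched in Python. The data are invented (gender by smoking status for twenty people, not the data shown in the lecture), and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical data: (gender, smoker) for twenty people.
data = (
    [("male", "yes")] * 3 + [("male", "no")] * 7
    + [("female", "yes")] * 2 + [("female", "no")] * 8
)

cell_counts = Counter(data)                       # one count per table cell
row_totals = Counter(gender for gender, _ in data)

# Row percentages: each cell divided by the total of its row.
row_percent = {
    (gender, smoker): 100 * count / row_totals[gender]
    for (gender, smoker), count in cell_counts.items()
}

for gender in ("male", "female"):
    yes = row_percent[(gender, "yes")]
    no = row_percent[(gender, "no")]
    print(f"{gender:<7} smokers {yes:.0f}%  non-smokers {no:.0f}%")
```

Dividing by column totals or by the grand total instead gives the Columns and Total percentages; which one is relevant depends on the question being asked.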
36Crosstable: Relationship between race and smoking
37Bar plot: Relationship between race and smoking
38Line plot for ordinal categorical variables
(time-series plot)
39Conclusion
- Tons of different options on how to present
results
- You will (hopefully) learn to understand which
option is most relevant for each problem during
this course