Introduction to statistics and data - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to statistics and data

Description:

Title: Disordered Eating, Menstrual Irregularity, and Bone Mineral Density in Young Female Runners Author: John Last modified by: Kristin Created Date – PowerPoint PPT presentation

Slides: 129
Provided by: John492
Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

Title: Introduction to statistics and data


1
Introduction to statistics and data
2
Looking at numbers
  • Group exercise: What's the math problem in each
    of the four examples I've given you?

3
EXAMPLE 1.
Table 2. Outcome volume for the experimental and
standard groups: mean (SD).

Location            | Week 0 experimental | Week 0 standard | Week 12 experimental | Week 12 standard | Change (Week 0 - Week 12) experimental | Change (Week 0 - Week 12) standard
Affected side       | 3135 (748)          | 3333 (1368)     | 2982 (715)           | 3331 (1383)      | 154 (168)                              | 2 (306)
Contralateral side  | 2595 (672)          | 2654 (761)      | 2553 (606)           | 2631 (736)       | 42 (193)                               | 23 (219)

p < .05 greater than the contralateral side
4
EXAMPLE 2.
Objective: The study objective is to determine
the efficacy of a new treatment cream as a
therapeutic option for eczema.
Methods: Prospective study under institutional review
board approval of ten patients with eczema, who
were all treated with the experimental cream.
Three blinded independent investigators evaluated
overall improvement, as well as changes in
scaliness and redness, graded on a quartile (0-3)
scale: 0 = none, 1 = mild (1-33%), 2 = moderate
(34-66%), 3 = excellent (67-100%).
Results: All patients showed overall improvement as measured
by blinded investigators. Of patients showing
overall improvement, 78% were graded as having
either excellent or moderate improvement.
Ninety-six percent of subjects demonstrated
improvements in scaliness and redness.
Limitations: Small sample size
5
EXAMPLE 3.
Table 1: Baseline characteristics by height
and follow-up for incident cancer in the Million
Women Study

Height in cm | <155 | 155 | 160 | 165 | 170 | 175 | All women
Mean measured height (SD) | 152.8 (4.1) | 156.5 (2.3) | 160.4 (2.9) | 164.9 (2.9) | 169.0 (2.9) | 173.8 (4.3) | 160.9 (6.4)

Characteristics at recruitment
Number of women | 233 516 | 196 773 | 388 515 | 288 893 | 143 289 | 46 138 | 1 297 124
Mean age, years (SD) | 56.3 (4.9) | 56.2 (4.9) | 56.2 (4.9) | 56.0 (4.8) | 56.0 (4.8) | 55.8 (4.8) | 56.1 (4.9)
Socioeconomic status, n (%) in lowest quintile | 59 220 (26) | 42 862 (22) | 73 119 (19) | 48 190 (17) | 23 262 (16) | 7 664 (17) | 19.7
Current smokers, n (%) | 50 775 (23) | 40 500 (22) | 72 763 (20) | 51 678 (19) | 26 147 (19) | 8 369 (19) | 20.5
Alcohol intake, n (%) 7 units per week | 47 138 (20) | 43 324 (22) | 92 126 (24) | 73 597 (26) | 36 742 (26) | 11 734 (26) | 23.7
Body-mass index, n (%) BMI 30 | 54 550 (25) | 38 493 (20) | 65 622 (18) | 42 004 (15) | 18 370 (13) | 5 320 (12) | 18.0
Strenuous exercise, n (%) once a week or more | 76 917 (35) | 69 607 (37) | 147 103 (39) | 116 614 (42) | 58 339 (42) | 18 699 (42) | 39.0
Age at menarche, n (%) 14 years | 79 858 (35) | 69 718 (36) | 139 607 (37) | 108 550 (38) | 57 852 (41) | 20 176 (45) | 37.4
Parity, n (%) nulliparous | 22 827 (10) | 19 149 (10) | 40 296 (10) | 33 267 (12) | 17 985 (13) | 6 900 (15) | 10.8
Number of full-term pregnancies, n (%) with three or more | 82 436 (35) | 67 118 (34) | 127 826 (33) | 91 287 (32) | 44 074 (31) | 13 335 (29) | 32.9
Age at first birth, n (%) 25 years | 67 250 (33) | 61 042 (35) | 129 031 (38) | 103 017 (41) | 52 677 (43) | 17 492 (46) | 38.2
Postmenopausal, n (%) | 162 551 (81) | 136 544 (81) | 269 384 (81) | 197 618 (80) | 97 855 (80) | 30 900 (79) | 80.5
Ever use of oral contraceptives, n (%) | 133 979 (58) | 114 105 (59) | 228 669 (60) | 173 520 (61) | 85 522 (60) | 27 571 (60) | 59.5
Current use of HRT, n (%) | 75 151 (33) | 63 865 (33) | 128 891 (34) | 98 086 (34) | 48 516 (34) | 15 637 (34) | 33.6

Follow-up for cancer incidence
Woman-years, millions | 2.1 | 1.8 | 3.5 | 2.6 | 1.3 | 0.4 | 11.7
Number of incident cancers | 15 792 | 14 213 | 28 806 | 22 571 | 11 902 | 4 092 | 97 376

The categories of height are those
reported at recruitment, and mean values are
those measured in a randomly selected
sample.
Standardised to the distribution
of categories of self-reported height in our
whole analysis population.
6
EXAMPLE 4.
Original data
Data re-use
7
Clinical Data Example
  • 1. Kline et al. (2002)
  • The researchers analyzed data from 934 emergency
    room patients with suspected pulmonary embolism
    (PE). Only about 1 in 5 actually had PE. The
    researchers wanted to know what clinical factors
    predicted PE.
  • I will use four variables from their dataset
    today:
  • Pulmonary embolism (yes/no)
  • Age (years)
  • Shock index = heart rate/systolic BP
  • Shock index categories: take shock index and
    divide it into 10 groups (lowest to highest shock
    index)
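
A minimal sketch of how the last two variables could be derived. The patient values below are hypothetical and are not taken from the Kline et al. dataset; the function names are made up for illustration, and cutting at the observed deciles is only one plausible way to form 10 groups.

```python
from statistics import quantiles

# Hypothetical (heart rate in beats/min, systolic BP in mmHg) pairs.
# These are NOT values from the Kline et al. dataset.
patients = [(72, 120), (95, 110), (110, 90), (64, 130), (88, 118),
            (120, 85), (78, 125), (101, 100), (69, 140), (92, 105),
            (84, 115), (130, 80)]

def shock_index(heart_rate, systolic_bp):
    """Shock index = heart rate / systolic blood pressure."""
    return heart_rate / systolic_bp

si_values = [shock_index(hr, sbp) for hr, sbp in patients]

# Nine cut points that split the observed values into 10 groups.
cut_points = quantiles(si_values, n=10)

def si_category(si):
    """Return 1 (lowest shock index) through 10 (highest)."""
    return 1 + sum(si > cut for cut in cut_points)

categories = [si_category(si) for si in si_values]
print(categories)
```

Note the change in variable type: the shock index itself is continuous, while the 10 ordered groups form an ordinal variable.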

8
Descriptive Statistics
9
Types of Variables Overview
Categorical
  • binary: 2 categories
  • nominal: more categories
  • ordinal: order matters
Quantitative
  • discrete: numerical
  • continuous: uninterrupted
10
Categorical Variables
  • Also known as qualitative.
  • Categories.
  • treatment groups
  • exposure groups
  • disease status

11
Categorical Variables
  • Dichotomous (binary) two levels
  • Dead/alive
  • Treatment/placebo
  • Disease/no disease
  • Exposed/Unexposed
  • Heads/Tails
  • Pulmonary Embolism (yes/no)
  • Male/female

12
Categorical Variables
  • Nominal variables: Named categories. Order
    doesn't matter!
  • The blood type of a patient (O, A, B, AB)
  • Marital status
  • Occupation

13
Categorical Variables
  • Ordinal variable: Ordered categories. Order
    matters!
  • Staging in breast cancer as I, II, III, or IV
  • Birth order: 1st, 2nd, 3rd, etc.
  • Letter grades (A, B, C, D, F)
  • Ratings on a scale from 1-5
  • Ratings on: always; usually; many times; once in
    a while; almost never; never
  • Age in categories (10-20, 20-30, etc.)
  • Shock index categories (Kline et al.)

14
Quantitative Variables
  • Numerical variables may be arithmetically
    manipulated.
  • Counts
  • Time
  • Age
  • Height

15
Quantitative Variables
  • Discrete numbers: a limited set of distinct
    values, such as whole numbers.
  • Number of new AIDS cases in CA in a year (counts)
  • Years of school completed
  • The number of children in the family (cannot have
    a half a child!)
  • The number of deaths in a defined time period
    (cannot have a partial death!)
  • Roll of a die

16
Quantitative Variables
  • Continuous Variables - Can take on any number
    within a defined range.
  • Time-to-event (survival time)
  • Age
  • Blood pressure
  • Serum insulin
  • Speed of a car
  • Income
  • Shock index (Kline et al.)

17
Review Question 1
  • Which of the following variables would be
    considered a continuous variable?
  • Favorite fruit
  • Gender
  • Decade of birth
  • Age at first birth
  • Parity

18
Review Question 2
  • Which of the following variables would be
    considered a nominal (categorical) variable?
  • Favorite fruit
  • Gender
  • Decade of birth
  • Age at first birth
  • Parity

19
Looking at Data
  • How are the data distributed?
  • Where is the center?
  • What is the range?
  • What's the shape of the distribution (e.g.,
    Gaussian, binomial, exponential, skewed)?
  • Are there outliers?
  • Are there data points that don't make sense?

20
The first rule of statistics: USE COMMON
SENSE! 90% of the information is contained in
the graph.
21
Frequency Plots (univariate)
  • Categorical variables
  • Bar Chart
  • Continuous variables
  • Box Plot
  • Histogram

22
Bar Chart
  • Used for categorical variables to show frequency
    or proportion in each category.
  • Translate the data from frequency tables into a
    pictorial representation

23
Bar Chart categorical variables
no
yes
24
Note how much easier it is to extract information
from a bar chart than from a table!
25
Box plot and histograms
  • To show the distribution (shape, center, range,
    variation) of continuous variables.

26
Shape of a Distribution
  • Describes how data are distributed
  • Measures of shape
  • Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
27
(No Transcript)
28
Bins of size 0.1 (automatically generated)
Note the right skew
29
100 bins (too much detail)
30
2 bins (too little detail)
31
Also shows the right skew
32
Distribution Shape and Box-and-Whisker Plot
[Figure: box-and-whisker plots for left-skewed, symmetric, and right-skewed distributions, each marking Q1, Q2, and Q3]
33
Box Plot: Age
More symmetric
[Figure: box plot of AGE; y-axis in years, 0.0 to 100.0]
34
Histogram Age
Not skewed, but not bell-shaped either
35
Some histograms from your class (n=25)
Starting with politics
36
(No Transcript)
37
(No Transcript)
38
Health Care Law
39
Feelings about math and writing
40
Optimism
41
Diet
42
Habits
43
Homework and optimism? (bivariate)
44
Review Question 3
  • Which of the following graphics should be used
    for categorical variables?
  • Histogram
  • Box plot
  • Bar Chart
  • Stem-and-leaf plot

45
Review Question 4
  • What is the first thing you should do when you
    get new data?
  • Run a t-test
  • Calculate a p-value
  • Plot your data
  • Run multivariate regression

46
Review Question 5
  • Approximately what percent of subjects had
    pulses between 80 and 90?
  • 200
  • 100
  • 90
  • 50
  • 10

47
Review Question 6
  • What is the maximum pulse that any subject had?
  • 100
  • <100
  • >100
  • >100

48
Review Question 7
  • This distribution of the variable (pulse) would
    be described as?
  • Symmetric
  • Right-skewed
  • Left-skewed

49
Measures of central tendency
  • Mean
  • Median
  • Mode

50
Central Tendency
  • Mean: the average; the balancing point
  • Calculation: the sum of values divided by the
    sample size

In math shorthand: \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i
51
Mean example
  • Some data
  • Age of participants 17 19 21 22 23 23
    23 38
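
As a quick check of the calculation, here is a small sketch that applies "sum of values divided by the sample size" to the eight ages listed on this slide. The variable names are illustrative only.

```python
# Ages of the eight participants from the slide.
ages = [17, 19, 21, 22, 23, 23, 23, 38]

# Mean = sum of values / sample size.
mean_age = sum(ages) / len(ages)
print(mean_age)  # 23.25
```

The sum is 186 and n is 8, so the mean is 23.25. Seven of the eight participants are younger than 23.25, which already hints at the pull of the one 38-year-old.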

52
Mean of age in Kline's data
Descriptive Statistics Report
Page/Date/Time: 1  3/30/2006 10:25:14 AM
Database: C:\Program Files\NCSS97\Data\Dawson\kline.S0
Means Section of AGE
Parameter | Mean | Median | Geometric Mean | Harmonic Mean | Sum | Mode
Value | 50.19334 | 49 | 46.66865 | 43.00606 | 46730 | 49
556.9546
53
Mean of age in Kline's data
54
Mean of Pulmonary Embolism? (Binary variable?)
No: 80.56% (750)
Yes: 19.44% (181)
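
A short sketch of why a binary variable has a meaningful mean: if PE is coded 0 for "no" and 1 for "yes" (the coding is an assumption here, though it matches the later slides that line up 0s and 1s), the mean is simply the proportion of 1s. The counts 750 and 181 come from this slide.

```python
# PE coded as 0 = no, 1 = yes, using the counts shown on the slide.
pe = [0] * 750 + [1] * 181

mean_pe = sum(pe) / len(pe)
print(round(100 * mean_pe, 2))  # 19.44
```

So the "mean of PE" is 0.1944, which is the 19.44% of patients who had a pulmonary embolism.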
55
Mean
  • The mean is affected by extreme values (outliers)

[Figure: two dot plots on a 0-10 axis, one of them with an extreme value]
Mean = 3
Mean = 4
56
Central Tendency
  • Median: the exact middle value
  • Calculation:
  • If there are an odd number of observations, find
    the middle value
  • If there are an even number of observations, find
    the middle two values and average them.

57
Median example
  • Some data
  • Age of participants 17 19 21 22 23 23
    23 38

Median = (22 + 23)/2 = 22.5
58
Median of age in Kline's data
Means Section of AGE
Parameter | Mean | Median | Geometric Mean | Harmonic Mean | Sum | Mode
Value | 50.19334 | 49 | 46.66865 | 43.00606 | 46730 | 49
59
Median of age in Kline's data
60
Does PE have a median?
  • Yes, if you line up the 0s and 1s, the middle
    number is 0.

61
Median
  • The median is not affected by extreme values
    (outliers).

[Figure: two dot plots on a 0-10 axis, one of them with an extreme value]
Median = 3
Median = 3
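
The contrast between the mean and the median can be sketched with two small data sets. The values below are hypothetical: they were chosen to reproduce the slide's summary numbers (mean 3 then 4, median 3 both times) and are not necessarily the points drawn in the original figure.

```python
from statistics import mean, median

# Hypothetical data chosen to match the slide's summary values.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]   # largest value moved from 5 to 10

print(mean(without_outlier), median(without_outlier))  # 3 3
print(mean(with_outlier), median(with_outlier))        # 4 3
```

Moving a single point drags the mean from 3 to 4, while the middle value stays where it was.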
62
Central Tendency
  • Mode: the value that occurs most frequently

63
Mode example
  • Some data
  • Age of participants 17 19 21 22 23 23
    23 38

Mode = 23 (occurs 3 times)
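
The median and mode rules from these slides can be written out by hand. This is only an illustrative sketch (the helper names are made up, and Python's standard `statistics` module has ready-made equivalents); it uses the eight ages from the example.

```python
from collections import Counter

ages = [17, 19, 21, 22, 23, 23, 23, 38]

def median(values):
    """Middle value if n is odd; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All most-frequent values; empty list if every value occurs once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(median(ages))  # 22.5
print(modes(ages))   # [23]
```

Returning a list from `modes` is a design choice that covers both "no mode" and "several modes".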
64
Mode of age in Kline's data
Means Section of AGE
Parameter | Mean | Median | Geometric Mean | Harmonic Mean | Sum | Mode
Value | 50.19334 | 49 | 46.66865 | 43.00606 | 46730 | 49
65
Mode of PE?
  • 0 appears more than 1, so 0 is the mode.

66
Mode
  • Not affected by extreme values
  • Used for either numerical or categorical data
  • There may be no mode
  • There may be several modes

[Figure: two dot plots. On a 0-6 axis: No Mode. On a 0-14 axis: Mode = 9.]
67
Which measure of central tendency is best?
  • Mean is generally used, unless extreme values
    (outliers) exist
  • Then median is often used, since the median is
    not sensitive to extreme values.
  • Example: Median home prices may be reported for a
    region; the median is less sensitive to outliers

68
Measures of Variation/Dispersion
  • Range
  • Percentiles/quartiles
  • Interquartile range
  • Standard deviation/Variance

69
Range
  • Difference between the largest and the smallest
    observations.

70
Range of age: 94 years - 15 years = 79 years
[Figure: histogram of AGE (Years), x-axis 0.0 to 100.0; y-axis Percent, 0.0 to 14.0]
71
Range of PE?
  • 1 - 0 = 1

72
Quartiles
[Figure: distribution divided by Q1, Q2, and Q3 into four parts of 25% each]
  • The first quartile, Q1, is the value for which
    25% of the observations are smaller and 75% are
    larger
  • Q2 is the same as the median (50% are smaller,
    50% are larger)
  • Only 25% of the observations are greater than the
    third quartile

73
Interquartile Range
  • Interquartile range = 3rd quartile - 1st quartile
    = Q3 - Q1

74
Interquartile Range: age
[Figure: box plot of age; each of the four segments holds 25% of the observations]
minimum = 15, Q1 = 35, Median (Q2) = 49, Q3 = 65, maximum = 94
Interquartile range = 65 - 35 = 30
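
A small sketch of the quartile and interquartile-range calculation. Kline's raw ages are not available here, so it uses the eight example ages from the earlier slides instead. It relies on `statistics.quantiles` from the Python standard library with its default interpolation; other software may use slightly different conventions and give slightly different quartiles on small samples.

```python
from statistics import quantiles

ages = [17, 19, 21, 22, 23, 23, 23, 38]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 19.5 22.5 23.0
print(iqr)         # 3.5
```

Here Q2 agrees with the median found earlier (22.5), and the single 38-year-old has no effect on the interquartile range.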
75
Sample Variance
  • Average (roughly) of squared deviations of values
    from the mean

76
Why squared deviations?
  • Adding deviations will yield a sum of 0.
  • Absolute values are tricky!
  • Squares eliminate the negatives.
  • Result
  • Increasing contribution to the variance as you go
    farther from the mean.

77
Standard Deviation
  • Most commonly used measure of variation
  • Shows variation about the mean
  • Has the same units as the original data

78
Calculation Example: Sample Standard Deviation
Age data (n=8): 17 19 21 22 23 23 23 38
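
The calculation for this example can be sketched step by step. This assumes the usual sample formulas, with the sum of squared deviations divided by n - 1; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

ages = [17, 19, 21, 22, 23, 23, 23, 38]
n = len(ages)

mean_age = sum(ages) / n                            # 23.25
squared_devs = [(x - mean_age) ** 2 for x in ages]  # squared deviations from the mean
sample_variance = sum(squared_devs) / (n - 1)       # 281.5 / 7
sample_sd = sqrt(sample_variance)

print(round(sample_variance, 2), round(sample_sd, 2))  # 40.21 6.34

# Cross-check against the standard library's sample standard deviation.
assert abs(sample_sd - stdev(ages)) < 1e-9
```

The single value 38 contributes 217.5625 of the 281.5 total, which is the "increasing contribution as you go farther from the mean" effect of squaring.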
79
Std. dev is a measure of the average scatter
around the mean.
Estimation method: if the distribution is bell
shaped, the range is around 6 SD, so here a rough
guess for SD is 79/6 ≈ 13
[Figure: histogram of AGE (Years), x-axis 0.0 to 100.0; y-axis Percent, 0.0 to 14.0]
80
Std. Deviation age
  • Variation Section of AGE
  • Parameter: Variance, Standard Deviation
  • Value: 333.1884, 18.25345

81
Std Dev of Shock Index
Estimation method: if the distribution is bell
shaped, the range is around 6 SD, so here a rough
guess for SD is 1.4/6 ≈ 0.23
[Figure: histogram of SI, x-axis 0.0 to 2.0; y-axis Count, 0.0 to 250.0]
82
Std. Deviation SI
  • Variation Section of SI
  • Parameter: Variance, Standard Deviation, Std Error of Mean, Interquartile Range, Range
  • Value: 4.155749E-02, 0.2038566, 6.681129E-03, 0.2460432, 1.430856

83
Std. Dev of binary variable, PE
Std. dev is a measure of the average scatter
around the mean.
No: 80.56%
Yes: 19.44%
84
Std. Deviation PE
  • Variation Section of PE
  • Parameter: Variance, Standard Deviation
  • Value: 0.156786, 0.3959621
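
These two numbers can be reproduced from the counts alone. The sketch below assumes PE is coded 0/1 with 750 "no" and 181 "yes" (the counts shown on the earlier slide) and that the software reports the sample variance with an n - 1 denominator. For a 0/1 variable that works out to n/(n - 1) × p(1 - p), where p is the proportion of 1s.

```python
from math import sqrt

n_no, n_yes = 750, 181
n = n_no + n_yes
p = n_yes / n                        # proportion with PE (the mean of the 0/1 variable)

# Sample variance of a 0/1 variable: n/(n-1) * p * (1 - p)
sample_variance = n / (n - 1) * p * (1 - p)
sample_sd = sqrt(sample_variance)

print(round(sample_variance, 6))     # 0.156786
print(round(sample_sd, 6))           # 0.395962
```

Both values agree with the output on this slide, so the standard deviation of a binary variable is determined entirely by the proportion and the sample size.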

85
Comparing Standard Deviations
[Figure: three dot plots, each on an 11-21 axis]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
  • Slide from Statistics for Managers Using
    Microsoft Excel 4th Edition, 2004 Prentice-Hall
86
Bienaymé-Chebyshev Rule
  • Regardless of how the data are distributed, a
    certain percentage of values must fall within K
    standard deviations from the mean
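
The "certain percentage" is not spelled out in the transcript. The standard statement of the Bienaymé-Chebyshev inequality is that at least 1 - 1/K² of the values lie within K standard deviations of the mean, for any K > 1. A minimal sketch of that bound (the function name is made up for illustration):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 4))
# 2 0.75
# 3 0.8889
# 4 0.9375
```

These are guaranteed minimums for any distribution; bell-shaped data usually do much better.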

87
Symbol Clarification
  • S: Sample standard deviation (example of a
    sample statistic)
  • σ: Standard deviation of the entire population
    (example of a population parameter) or from a
    theoretical probability distribution
  • X̄: Sample mean
  • µ: Population or theoretical mean

88
The beauty of the normal curve
No matter what µ and σ are, the area between µ-σ
and µ+σ is about 68%; the area between µ-2σ and
µ+2σ is about 95%; and the area between µ-3σ and
µ+3σ is about 99.7%. Almost all values fall
within 3 standard deviations.
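
These areas can be checked numerically. The sketch below uses `NormalDist` from Python's standard library. The mean of 50 and standard deviation of 18 in the second call are arbitrary illustrative values, included only to show that the result does not depend on the parameters.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # Same answer for the standard normal and for an arbitrary mu, sigma.
    print(k, round(area_within(k), 4), round(area_within(k, mu=50, sigma=18), 4))
# 1 0.6827 0.6827
# 2 0.9545 0.9545
# 3 0.9973 0.9973
```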
89
68-95-99.7 Rule
90
Summary of Symbols
  • S²: Sample variance
  • S: Sample standard dev
  • σ²: Population (true or theoretical) variance
  • σ: Population standard dev.
  • X̄: Sample mean
  • µ: Population mean
  • IQR: interquartile range (middle 50%)

91
Review Question 8
  • All of the following are measures of data
    variation EXCEPT
  • Variance
  • Interquartile range
  • Standard deviation
  • Range
  • Mean

92
Review Question 9
  • All of the following are influenced by outliers
    EXCEPT
  • Variance
  • Interquartile range
  • Standard deviation
  • Range
  • Mean

93
Review Question 10
  • If you have right-skewed data, which of the
    following will be true?
  • Mean > median
  • Mean > median
  • Median > mean
  • Median > mean
  • Mean = median

94
Review Question 11
  • How much of your data is guaranteed to fall
    within 2 standard deviations of the mean?
  • None; there are no guarantees.
  • 95%
  • 99%
  • 75%
  • 89%

95
Examples of bad graphics
96
What's wrong with this graph?
97
From: Visual Revelations: Graphical Tales of Fate
and Deception from Napoleon Bonaparte to Ross
Perot. Wainer, H. 1997, p. 29.
98
Correctly scaled X-axis
99
Report of the Presidential Commission on the
Space Shuttle Challenger Accident, 1986 (vol 1,
p. 145). The graph excludes the observations
where no O-rings failed.
100
Smooth curve at least shows the trend toward
failure at high and low temperatures
  • http://www.math.yorku.ca/SCS/Gallery/

101
Even better: graph all the data (including
non-failures) using a logistic regression model
102
What's wrong with this graph?
103
(No Transcript)
104
What's the message here?
105
(No Transcript)
106
For more examples
  • http://www.math.yorku.ca/SCS/Gallery/

107
Class exercise
  • What's wrong with these graphs?

108
From Johnson R. Just the Essentials of
Statistics. Duxbury Press, 1995.
109
From Johnson R. Just the Essentials of
Statistics. Duxbury Press, 1995.
110
Lying with statistics
  • More accurately, misleading with statistics

111
Example 1 projected statistics
  • Lifetime risk of melanoma
  • 1935: 1/1500
  • 1960: 1/600
  • 1985: 1/150
  • 2000: 1/74
  • 2006: 1/60
  • http://www.melanoma.org/mrf_facts.pdf

112
Example 1 projected statistics
  • How do you think these statistics are
    calculated?
  • How do we know what the lifetime risk of a person
    born in 2006 will be?

113
Example 1 projected statistics
  • Interestingly, a clever clinical researcher
    recently went back and calculated (using SEER
    data) the actual lifetime risk (or risk up to 70
    years) of melanoma for a person born in 1935.
  • The answer?
  • Closer to 1/150 (one order of magnitude off)
  • (Martin Weinstock of Brown University, AAD
    conference 2006)

114
Example 2 propagation of statistics
  • In many papers and reviews of eating disorders in
    women athletes, authors cite the statistic that
    15% to 62% of female athletes have disordered
    eating.
  • I've found that this statistic is attributed to
    about 50 different sources in the literature and
    cited all over the place with or without
    citations...

115
For example
  • In a recent review (Hobart and Smucker, The
    Female Athlete Triad, American Family Physician,
    2000):
  • "Although the exact prevalence of the female
    athlete triad is unknown, studies have reported
    disordered eating behavior in 15 to 62 percent of
    female college athletes."
  • No citations given.

116
And
  • Fact Sheet on eating disorders:
  • "Among female athletes, the prevalence of eating
    disorders is reported to be between 15% and
    62%." Citation given: Costin, Carolyn. (1999) The
    Eating Disorder Source Book: A comprehensive
    guide to the causes, treatment, and prevention of
    eating disorders. 2nd edition. Lowell House: Los
    Angeles.

117
And
  • From a Fact Sheet on disordered eating from a
    college website
  • Eating disorders are significantly higher (15 to
    62 percent) in the athletic population than the
    general population.
  • No citation given.

118
And
  • "Studies report between 15% and 62% of college
    women engage in problematic weight control
    behaviors (Berry & Howe, 2000)." (in The Sport
    Journal, 2004)
  • Citation: Berry, T.R. & Howe, B.L. (2000, Sept).
    Risk factors for disordered eating in female
    university athletes. Journal of Sport Behavior,
    23(3), 207-219.

119
And
  • 1999 NY Times article
  • But informal surveys suggest that 15 percent to
    62 percent of female athletes are affected by
    disordered behavior that ranges from a
    preoccupation with losing weight to anorexia or
    bulimia.

120
And
  • "It has been estimated that the prevalence of
    disordered eating in female athletes ranges from
    15% to 62%." (in Journal of General Internal
    Medicine 15 (8), 577-590.) Citations:
  • Steen SN. The competitive athlete. In: Rickert
    VI, ed. Adolescent Nutrition: Assessment and
    Management. New York, NY: Chapman and Hall;
    1996:223-47.
  • Tofler IR, Stryer BK, Micheli LJ. Physical and
    emotional problems of elite female gymnasts. N
    Engl J Med. 1996;335:281-3.

121
Where did the statistics come from?
The 15%: Dummer GM, Rosen LW, Heusner WW, Roberts
PJ, and Counsilman JE. Pathogenic weight-control
behaviors of young competitive swimmers.
Physician Sportsmed 1987; 15: 75-84.
The "to": Rosen LW, McKeag DB, OHough D, Curley VC.
Pathogenic weight-control behaviors in female
athletes. Physician Sportsmed. 1986; 14:
79-86.
The 62%: Rosen LW, Hough DO. Pathogenic
weight-control behaviors of female college
gymnasts. Physician Sportsmed 1988; 16: 140-146.
122
Where did the statistics come from?
  • Study design? Control group?
  • Cross-sectional survey (all)
  • No non-athlete control groups
  • Population/sample size?
  • Convenience samples
  • Rosen et al. 1986: 182 varsity athletes from two
    midwestern universities (basketball, field
    hockey, golf, running, swimming, gymnastics,
    volleyball, etc.)
  • Dummer et al. 1987: 486 9-18 year old swimmers at
    a swim camp
  • Rosen et al. 1988: 42 college gymnasts from 5
    teams at an athletic conference

123
Where did the statistics come from?
  • Measurement?
  • Instrument: Michigan State University Weight
    Control Survey
  • Disordered eating = at least one pathogenic
    weight control behavior:
  • Self-induced vomiting
  • Fasting
  • Laxatives
  • Diet pills
  • Diuretics
  • In the 1986 survey, they required use 1/month; in
    the 1988 survey, they required use twice-weekly
  • In the 1988 survey, they added fluid restriction

124
Where did the statistics come from?
  • Findings?
  • Rosen et al. 1986: 32% used at least one
    pathogenic weight-control behavior (ranges: 8%
    of 13 basketball players to 73.7% of 19 gymnasts)
  • Dummer et al. 1987: 15.4% of swimmers used at
    least one of these behaviors
  • Rosen et al. 1988: 62% of gymnasts used at least
    one of these behaviors

125
Citation Tree
Figure 4A from Smith N P et al. J Exp Biol
2007;210:1576-1583.
126
Figure 4B from Smith N P et al. J Exp Biol
2007;210:1576-1583.
127
Homework
  • Problem Set 1
  • Reading: Chapters 1-6, Vickers.
  • Read weekly journal article
  • Fill out a Journal Article Review Sheet (on
    class website).
  • Who wants to lead journal article discussion next
    week?

128
References
  • http://www.math.yorku.ca/SCS/Gallery/
  • Kline et al. Annals of Emergency Medicine 2002
    39 144-152.
  • Statistics for Managers Using Microsoft Excel
    4th Edition, 2004 Prentice-Hall
  • Tappin, L. (1994). "Analyzing data relating to
    the Challenger disaster". Mathematics Teacher,
    87, 423-426
  • Tufte. The Visual Display of Quantitative
    Information. Graphics Press, Cheshire,
    Connecticut, 1983.
  • Wainer, H. Visual Revelations: Graphical Tales of
    Fate and Deception from Napoleon Bonaparte to
    Ross Perot. 1997.
  • Johnson R. Just the Essentials of Statistics.
    Duxbury Press, 1995.