Title: Introduction to Biostatistics: Data Collection. Descriptive Statistics
1. Introduction to Research Methods in the Internet Era
Introduction to Biostatistics
Data Collection and Descriptive Statistics
Thomas Songer, PhD, with acknowledgment to several slides provided by M Rahbar and Moataza Mahmoud Abdel Wahab
2. Key Lecture Concepts
- Distinguish between different strategies for obtaining a sample from a population
- Distinguish between different forms of data collection
- Identify key approaches to organize and portray your data
- Understand the measures of central tendency and variability in your data
3. Descriptive and Inferential Statistics
Descriptive statistics deal with the enumeration, organization, and graphical representation of data from a sample.
Inferential statistics deal with reaching conclusions from incomplete information, that is, generalizing from the specific sample. Inferential statistics use available information in a sample to draw inferences about the population from which the sample was selected.
Rahbar
4. Epidemiology is
- The study of disease and its treatment, control, and prevention in a population of individuals.
- Whole populations may be examined, but
- More frequently, samples of the population may be examined. Samples that are studied must be representative of the population for the results to be generalized to the total population.
Torrence 1997
5. Hypothetical Population
[Figure: a hypothetical population and three samples drawn from it (Sample 1, Sample 2, Sample 3). For each sample: Representative? Y / N]
6. Sampling Approaches
- Convenience sampling: select the most accessible and available subjects in the target population. Inexpensive and less time-consuming, but the sample is nearly always non-representative of the target population.
- Random sampling (simple): select subjects at random from the target population. Everyone in the target population must be identified first. Frequently provides a representative sample.
7. Sampling Approaches
- Systematic sampling: identify all in the target population, and select every xth person as a subject.
- Stratified sampling: identify important sub-groups in your target population. Sample from these groups randomly or by convenience. Ensures that important sub-groups are included in the sample. May not be representative.
- More complex sampling
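The simple random, systematic, and stratified approaches can be sketched in Python as follows. This is a minimal illustration, not a recommended procedure: the population of 100 numbered subjects, the two sub-groups, the sample sizes, and the function names are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Select n subjects at random, each subject equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Select every step-th subject, starting from a random offset."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size from each sub-group."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical subject IDs 1..100

# Hypothetical sub-groups: IDs 1-70 and IDs 71-100
strata = {"group_a": population[:70], "group_b": population[70:]}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 5, rng)

print(srs)
print(systematic)
print(stratified)
```

Note that all three functions need the full list of subjects up front, which is the practical cost of these approaches compared with convenience sampling.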
8. Sampling Error
- The discrepancy between the true population parameter and the sample statistic
- Sampling error likely exists in most studies, but can be reduced by using larger sample sizes
- Sampling error shrinks roughly in proportion to $1/\sqrt{n}$, where n is the sample size
- Note that larger sample sizes also require time and expense to obtain, and that large sample sizes do not eliminate sampling error
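To illustrate how sampling error falls as the sample grows, here is a small sketch that uses the standard error of the mean (standard deviation divided by the square root of n) as the measure of sampling error. The choice of that measure, the standard deviation of 10, and the sample sizes are assumptions for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 10.0  # assumed population standard deviation
for n in (25, 100, 400):
    print(n, standard_error(sd, n))
```

Each fourfold increase in n only halves the standard error, so the error gets smaller but never reaches zero.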
9. Research Process
Research question
Hypothesis
Identify research design
Data collection
Presentation of data
Data analysis
Interpretation of data
Polgar, Thomas
10. Types of Data Collection
- Surveys/Questionnaires
  - Self-report
  - Interviewer-administered
  - Proxy
- Direct medical examination
- Direct measurement (e.g., blood draws)
- Administrative records
11. Understanding and Presenting Data
12. Types of Data
- Categorical (e.g., sex, marital status, income category)
- Continuous (e.g., age, income, weight, height, time to achieve an outcome)
- Discrete (e.g., number of children in a family)
- Binary or dichotomous (e.g., the response to any Yes or No question)
13. Brain Size and IQ
What types of data do these variables represent?
Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI Count
Female   133  132  124     118    64.5     816932
Male     140  150  124     124    72.5    1001121
Male     139  123  150     143    73.3    1038437
Male     133  129  128     172    68.8     965353
Female   137  132  134     147    65       951545
Female    99   90  110     146    69       928799
Female   138  136  131     138    64.5     991305
Female    92   90   98     175    66       854258
Male      89   93   84     134    66.3     904858
Male     133  114  147     172    68.8     955466
Female   132  129  124     118    64.5     833868
14. Scale of Data
1. Nominal: these data do not represent an amount or quantity (e.g., marital status, sex)
2. Ordinal: these data represent an ordered series of relationships (e.g., level of education)
3. Interval: these data are measured on an interval scale having equal units but an arbitrary zero point (e.g., temperature in Fahrenheit)
4. Ratio: variables such as weight, which have a true zero point, so that one value can be compared meaningfully with another (say, 100 kg is twice 50 kg)
15. Organizing Data and Presentation
- Frequency Table
- Frequency Histogram
- Relative Frequency Histogram
- Frequency polygon
- Relative Frequency polygon
- Bar chart
- Pie chart
- Box plot
16. Frequency Table
- Generally, the first approach to examining your data
- Identifies distribution of variables overall
- Identifies potential outliers
- Investigate outliers as possible data entry errors
- Investigate a sample of others for data entry errors
17. Frequency Table
A research study has been conducted examining the number of children in the families living in a community. The following data have been collected from a random sample of n = 30 families in the community:
2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7, 3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6
Organize this data in a Frequency Table!
18.
X = No. of Children   Count (Frequency)   Relative Freq.
0                     2                   2/30 = 0.067
1                     3                   3/30 = 0.100
2                     5                   5/30 = 0.167
3                     5                   5/30 = 0.167
4                     6                   6/30 = 0.200
5                     4                   4/30 = 0.133
6                     2                   2/30 = 0.067
7                     2                   2/30 = 0.067
8                     1                   1/30 = 0.033
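As a hedged sketch, the frequency table for the number-of-children data can be built in Python with `collections.Counter`. The variable names are illustrative, and the data are the 30 values listed in the exercise.

```python
from collections import Counter

children = [2, 2, 5, 3, 0, 1, 3, 2, 3, 4, 1, 3, 4, 5, 7,
            3, 2, 4, 1, 0, 5, 8, 6, 5, 4, 2, 4, 4, 7, 6]

counts = Counter(children)  # maps each value to how often it occurs
n = len(children)

print("X  Count  Relative Freq.")
for value in sorted(counts):
    print(f"{value}  {counts[value]:5d}  {counts[value] / n:.3f}")
```

Sorting the distinct values before printing keeps the rows in the same order as a hand-built table.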
19. Frequency Table
Now, construct a similar frequency table for the age of patients with heart-related problems in a clinic. The following data have been collected from a random sample of n = 30 patients who went to the emergency room of the clinic for heart-related problems. The measurements are:
42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47, 63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69
20.
Age Groups   Frequency   Relative Frequency
32-36 yr     2           2/30 = 0.067
37-41 yr     3           3/30 = 0.100
42-46 yr     3           3/30 = 0.100
47-51 yr     3           3/30 = 0.100
52-56 yr     8           8/30 = 0.267
57-61 yr     3           3/30 = 0.100
62-66 yr     4           4/30 = 0.133
67-72 yr     4           4/30 = 0.133
Total        n = 30
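For continuous data such as age, the values are first grouped into intervals and then counted. The sketch below tallies the 30 emergency-room ages into the age groups used on this slide; the variable names are illustrative assumptions, and the group bounds are treated as inclusive.

```python
ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

# Age groups as (low, high) inclusive bounds; the last group is one year wider
bins = [(32, 36), (37, 41), (42, 46), (47, 51),
        (52, 56), (57, 61), (62, 66), (67, 72)]

n = len(ages)
# Count how many ages fall inside each group
grouped = {b: sum(b[0] <= age <= b[1] for age in ages) for b in bins}

for (low, high), count in grouped.items():
    print(f"{low}-{high} yr  {count:2d}  {count / n:.3f}")
```

Running a tally like this against the raw data is a quick way to check a hand-built table for counting slips.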
21. Frequency Polygon
- Use to identify the distribution of your data
22. Table 1 in a paper
Describe your study population in a frequency table.
Table Title
Name of variable (units of variable)   Frequency (n)   Mean (SD)
  Categories
Total
23. Measures of Central Tendency
Where is the heart of the distribution?
1. Mean
2. Median
3. Mode
24. Sample Mean
The arithmetic mean (or, simply, mean) is computed by summing all the observations in the sample and dividing the sum by the number of observations. For a sample of five household incomes, 6,000, 10,000, 10,000, 14,000, and 50,000, the sample mean is (6,000 + 10,000 + 10,000 + 14,000 + 50,000) / 5 = 90,000 / 5 = 18,000.
25. Median
In a list ranked from the smallest measurement to the highest, the median is the middle value. In our example of five household incomes, first we rank the measurements:
6,000  10,000  10,000  14,000  50,000
The sample median is 10,000.
26. Mode
- In nominal data
- The value which occurs with the greatest frequency
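The three measures of central tendency can be checked for the five household incomes with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

incomes = [6000, 10000, 10000, 14000, 50000]

mean = statistics.mean(incomes)      # sum of observations / number of observations
median = statistics.median(incomes)  # middle value of the ranked list
mode = statistics.mode(incomes)      # most frequent value

print(mean, median, mode)
```

The mean (18,000) is pulled upward by the single income of 50,000, while the median and mode both stay at 10,000.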
27. Measures of non-central locations
- Quartiles
- Quintiles
- Percentiles
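As an illustration, quartiles, quintiles, and percentiles can be computed with `statistics.quantiles`, here applied to the 30 emergency-room ages from the frequency table exercise. The function interpolates between neighbouring values using its default "exclusive" method, so other software may give slightly different cut points; the variable names are assumptions.

```python
import statistics

ages = [42, 38, 51, 53, 40, 68, 62, 36, 32, 45, 51, 67, 53, 59, 47,
        63, 52, 64, 61, 43, 56, 58, 66, 54, 56, 52, 40, 55, 72, 69]

quartiles = statistics.quantiles(ages, n=4)      # 3 cut points, 4 equal groups
quintiles = statistics.quantiles(ages, n=5)      # 4 cut points, 5 equal groups
percentiles = statistics.quantiles(ages, n=100)  # 99 cut points

print(quartiles)
print(quintiles)
print(percentiles[89])  # the 90th percentile
```

The middle quartile is the median, which ties these non-central measures back to the measures of central tendency.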
28. Measures of Dispersion or Variability
- Range (present the highest and lowest values in a distribution; the difference between these values is the range)
- Variance
- Standard deviation (the square root of the variance)
29. Sample Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
s = standard deviation (square root of the variance)
30. Calculation of Variance and Standard Deviation
31. Mean and Standard Deviation (SD)
Data            Mean   SD
7 8 7 7 7 6     7      0.63
7 7 7 7 7 7     7      0
3 2 7 8 13 9    7      4.05
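The calculation behind these values can be sketched as follows. The sketch assumes the sample formula with an n - 1 divisor, which reproduces the SD of 0.63 for the first data set; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

def sample_sd(values):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))

datasets = [
    [7, 8, 7, 7, 7, 6],
    [7, 7, 7, 7, 7, 7],
    [3, 2, 7, 8, 13, 9],
]

for data in datasets:
    mean = sum(data) / len(data)
    print(data, "mean:", mean, "SD:", round(sample_sd(data), 2))
```

All three data sets share a mean of 7, so the SD is what distinguishes them: the more the values spread around the mean, the larger it becomes.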
32. Empirical Rule
For a Normal distribution, approximately:
- a) 68% of the measurements fall within one standard deviation around the mean
- b) 95% of the measurements fall within two standard deviations around the mean
- c) 99.7% of the measurements fall within three standard deviations around the mean
33. Suppose the reaction time of a particular drug has a Normal distribution with a mean of 10 minutes and a standard deviation of 2 minutes.
Approximately:
- a) 68% of the subjects taking the drug will have a reaction time between 8 and 12 minutes
- b) 95% of the subjects taking the drug will have a reaction time between 6 and 14 minutes
- c) 99.7% of the subjects taking the drug will have a reaction time between 4 and 16 minutes
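These percentages can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a minimal sketch; the helper name `share_within` is an assumption made for the example.

```python
from statistics import NormalDist

reaction_time = NormalDist(mu=10, sigma=2)  # minutes

def share_within(dist, k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low = reaction_time.mean - k * reaction_time.stdev
    high = reaction_time.mean + k * reaction_time.stdev
    print(f"{low:.0f} to {high:.0f} minutes: {share_within(reaction_time, k):.1%}")
```

The computed proportions are slightly different from the rounded figures in the rule (for example, about 95.4% rather than 95% within two standard deviations), which is why the rule is stated as approximate.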