Title: Biostat 200 Introduction to Biostatistics
1Biostat 200: Introduction to Biostatistics
2Lecture 1
3Course instructors
- Judy Hahn, M.A., Ph.D.
- Judy.hahn_at_ucsf.edu
- (415) 206-4435
- TAs
- Michelle Odden, Ph.D., M.S.
- Megumi Okumura, M.D.
- Maya Vijayaraghavan, M.D.
- Robin Wallace, M.D.
4The details
- Lectures: Tuesdays 10:30-12:30
- Labs: Thursdays 10:30-12:00
- Lab 1: Room CB 6702
- Lab 2: Room CB 6704
- Office hours: Thursdays 12:00-1:00, Room CB 5715
- Course credits: 3
5The details
- Readings
- Required readings will be from Principles of Biostatistics by M. Pagano and K. Gauvreau. Duxbury. 2nd edition.
- Please read the assigned chapters before lecture, and review them after lecture
6The details
- Assignments will be posted on Thursdays, with due dates on Sunday at 5 p.m., 1.5 weeks later
- Data collection (Assignment 1 only)
- Data analysis and interpretation
- Exercises in the book
- Reading and interpretation of scientific publications
- You must attend Lab 1 to receive Assignment 1
7The details
- Grading
- Homework (75%)
- 5 Assignments
- Varying in length; each homework problem is worth (usually 10) points toward the final homework score
- Final exam (25%)
- LATE ASSIGNMENTS WILL NOT BE ACCEPTED!!!
8Assignments
- Send to your TAs
- Lab 1: Megan Okumura, Robin Wallace
- ticr.biostat200.1_at_gmail.com
- Lab 2: Michelle Odden, Maya Vijayaraghavan
- ticr.biostat200.2_at_gmail.com
9What I do and why
10Course goals
- Familiarity with basic biostatistics terms and nomenclature
- Ability to summarize data and do basic statistical analyses using STATA
- Ability to understand basic statistical analyses in published journals
- Understanding of key concepts, including statistical hypothesis testing and critical quantitative thinking
- Foundation for more advanced analyses
11Today's topics
- Variables: numerical versus categorical
- Tables (frequencies)
- Graphs (histograms, box plots, scatter plots, line graphs)
- Required reading: Pagano Chapter 2
12Types of data
- Data are made up of a set of variables
- Categorical variables: any variable that is not numerical (values have no numerical meaning) (e.g. gender, race, drug, disease status)
- Nominal variables
- Ordinal variables
Pagano and Gauvreau, Chapter 2
13Types of data
- Categorical variables
- Nominal variables
- The data are unordered (e.g. RACE: 1=Caucasian, 2=Asian American, 3=African American)
- A subset of these are binary (dichotomous) variables, which have only two categories (e.g. GENDER: 1=male, 2=female)
- Ordinal variables
- The data are ordered (e.g. AGE: 1=10-19 years, 2=20-29 years, 3=30-39 years; likelihood of participating in a vaccine trial)
Pagano and Gauvreau, Chapter 2
14Types of data
- Numerical (quantitative) variables: naturally measured as numbers for which meaningful arithmetic operations make sense (e.g. height, weight, age, salary, viral load, CD4 cell counts)
- Discrete variables: can be counted (e.g. number of children in household: 0, 1, 2, 3, etc.)
- Continuous variables: can take any value within a given range (e.g. weight: 2974.5 g, 3012.6 g)
Pagano and Gauvreau, Chapter 2
15Types of data
- Manipulation of variables
- Continuous variables can be discretized
- E.g., age can be rounded to whole numbers
- Continuous or discrete variables can be categorized
- E.g., age categories
- Categorical variables can be re-categorized
- E.g., lumping from 5 categories down to 2
Pagano and Gauvreau, Chapter 2
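The three manipulations above can be sketched in a few lines. This is a minimal illustration in Python rather than the course's Stata; the ages, the 10-year bands, and the two lumped groups are all hypothetical choices made for the example.

```python
# Hypothetical ages in years (illustrative values, not course data).
ages = [18.4, 23.9, 31.2, 36.7, 15.0]

# Discretize: round each continuous age to a whole number of years.
rounded_ages = [round(age) for age in ages]


def age_group(age):
    """Categorize an age into a 10-year band such as '20-29'."""
    low = int(age // 10) * 10
    return f"{low}-{low + 9}"


# Categorize: continuous ages become ordered categories.
age_groups = [age_group(age) for age in ages]

# Re-categorize: lump the 10-year bands into two broader groups.
lumped = {"10-19": "under 30", "20-29": "under 30", "30-39": "30 and over"}
two_groups = [lumped[group] for group in age_groups]

print(rounded_ages)
print(age_groups)
print(two_groups)
```

Each step discards detail: the rounded ages cannot be turned back into the original values, and the two-group version cannot be turned back into the 10-year bands.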
16Frequency tables
- Categorical variables are summarized by
- Frequency counts: how many are in each category
- Relative frequency or percent (a number from 0 to 100)
- Or proportion (a number from 0 to 1)
Gender of new HIV clinic patients, 2006-2007, Mbarara, Uganda.
Gender    n (%)
Male      415 (39)
Female    645 (61)
Total     1060 (100)
Pagano and Gauvreau, Chapter 2
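As a quick check on how counts, percents, and proportions relate, the gender table can be rebuilt from its two counts. This is a small Python sketch (not the course's Stata); the only inputs are the counts 415 and 645 from the table, and the variable names are made up for the example.

```python
from collections import Counter

# Rebuild one record per patient from the counts in the gender table.
genders = ["Male"] * 415 + ["Female"] * 645

counts = Counter(genders)          # frequency counts per category
total = sum(counts.values())       # total number of patients

for category, n in counts.items():
    proportion = n / total         # a number from 0 to 1
    percent = 100 * proportion     # a number from 0 to 100
    print(f"{category:<8}{n:>5} ({percent:.0f}%)  proportion={proportion:.2f}")
print(f"{'Total':<8}{total:>5} (100%)")

# Same information as a dictionary: category -> (count, rounded percent).
table = {category: (n, round(100 * n / total)) for category, n in counts.items()}
```

The percent is just the proportion multiplied by 100, so the two columns always carry the same information.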
17Frequency tables
- Continuous variables can be categorized in meaningful ways
- Choice of cutpoints
- Even intervals
- Meaningful cutpoints related to a health outcome or decision
- Equal percentage of the data falling into each category
Pagano and Gauvreau, Chapter 2
18Frequency tables
CD4 cell counts (cells/mm3) of newly diagnosed HIV positives at Mulago Hospital, Kampala (N=268)
CD4 count    n (%)
<50          40 (14.9)
50-200       72 (26.9)
201-350      58 (21.6)
>350         98 (36.6)
Pagano and Gauvreau, Chapter 2
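The CD4 table is an example of categorizing a numerical variable with cutpoints. The sketch below is Python, not the course's Stata, and it assumes the first and last rows of the table are "<50" and ">350". It assigns a count to a category and then recomputes the percentages from the category counts. The function name and the example counts are hypothetical.

```python
def cd4_category(count):
    """Assign a CD4 count (cells/mm3) to one of the table's four categories."""
    if count < 50:
        return "<50"
    if count <= 200:
        return "50-200"
    if count <= 350:
        return "201-350"
    return ">350"


# Hypothetical CD4 counts, chosen to sit on either side of each cutpoint.
examples = [49, 50, 200, 201, 350, 351]
labels = [cd4_category(c) for c in examples]

# Recompute the percentages from the category counts shown in the table.
table_counts = {"<50": 40, "50-200": 72, "201-350": 58, ">350": 98}
n_total = sum(table_counts.values())
percents = {k: round(100 * v / n_total, 1) for k, v in table_counts.items()}

print(labels)
print(n_total, percents)
```

Testing values right at the cutpoints is a simple way to confirm that every count lands in exactly one category.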
19Bar charts
- General graph for categorical variables
- Graphical equivalent of a frequency table
- The x-axis does not have to be numerical
Pagano and Gauvreau, Chapter 2
20Histograms
- Bar chart for numerical data
- The number of bins and the bin width will make a difference in the appearance of this plot and may affect interpretation
    * Histogram of CD4 count with 50-cell-wide bins, y-axis in percent
    histogram cd4count, fcolor(blue) lcolor(black) ///
        width(50) name(cd4_by50) ///
        title("CD4 among new HIV positives at Mulago") ///
        xtitle("CD4 cell count") percent
Pagano and Gauvreau, Chapter 2
21Histograms
- This histogram has less detail but gives us the % of persons with CD4 <350 cells/mm3
    * Same histogram with 350-cell-wide bins
    histogram cd4count, fcolor(blue) lcolor(black) ///
        width(350) name(cd4_by350) ///
        title("CD4 among new HIV positives at Mulago") ///
        xtitle("CD4 cell count") percent
Pagano and Gauvreau, Chapter 2
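To see why bin width matters, the same values can be binned twice. This is a rough Python sketch of the counting behind a percent histogram, not a reproduction of Stata's histogram command. The CD4 values are hypothetical, and bins are assumed to start at 0.

```python
from collections import Counter


def bin_percentages(values, width):
    """Percent of values in each bin [start, start + width), keyed by bin start."""
    counts = Counter((v // width) * width for v in values)
    n = len(values)
    return {start: 100 * counts[start] / n for start in sorted(counts)}


# Hypothetical CD4 counts (cells/mm3), not the Mulago data.
cd4 = [12, 48, 95, 130, 180, 234, 235, 310, 420, 610]

narrow = bin_percentages(cd4, 50)    # many bins, more detail
wide = bin_percentages(cd4, 350)     # few bins, first bar is CD4 < 350

print(narrow)
print(wide)
```

With a width of 350 the first bar directly gives the percent of values below 350, while the width of 50 shows where within that range the values actually fall.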
22- What does this graph tell us?
23Box plots
- Middle line = median (50th percentile)
- Middle box = 25th to 75th percentiles (interquartile range, IQR)
- Bottom whisker: lowest data point at or above (25th percentile - 1.5 x IQR)
- Top whisker: highest data point at or below (75th percentile + 1.5 x IQR)
Pagano and Gauvreau, Chapter 2
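The pieces of a box plot can be computed by hand. The Python sketch below assumes the whisker rules are the 25th percentile minus 1.5 x IQR and the 75th percentile plus 1.5 x IQR. The data are hypothetical, and Python's default percentile method may give slightly different quartiles than Stata does.

```python
import statistics

# Hypothetical data with one extreme value (not course data).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# 25th, 50th and 75th percentiles (Python's default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at actual data points that lie inside the fences.
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)

# Points beyond the fences are usually drawn as individual dots.
outside = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(lower_whisker, upper_whisker, outside)
```

Note that the whiskers stop at observed values, so they are generally shorter than the full 1.5 x IQR distance.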
24Box plots
    * Box plot of CD4 count with a solid blue box
    graph box cd4count, ///
        box(1, fcolor(blue) lcolor(black) fintensity(inten100)) ///
        title("CD4 count among new HIV positives at Mulago")
Pagano and Gauvreau, Chapter 2
25Box plots by another variable
- We can divide up our graphs by another variable
- What type of variable is gender?
26Histograms by another variable
27Numerical variable summaries
- Mode: the value (or range of values) that occurs most frequently
- Sometimes there is more than one mode, e.g. a bi-modal distribution (the two modes do not have to be the same height)
- The mode only makes sense when the values are discrete, rounded off, or binned
Pagano and Gauvreau, Chapter 3
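A small Python sketch of finding the mode of a discrete variable, using hypothetical household data. It relies on statistics.multimode, which needs Python 3.8 or later.

```python
import statistics
from collections import Counter

# Hypothetical number of children per household (discrete values).
children = [0, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5]

counts = Counter(children)                 # how often each value occurs
modes = statistics.multimode(children)     # all values tied for most frequent

print(sorted(counts.items()))
print(modes)
```

multimode only reports values tied for the highest count, so a bimodal distribution whose two peaks differ in height is easier to spot from the full counts or a histogram.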
28Scatter plots
Pagano and Gauvreau, Chapter 2
29The importance of good graphs
http://niemann.blogs.nytimes.com/2009/09/14/good-night-and-tough-luck/
30Numerical variable summaries
- Measures of central tendency: where is the center of the data?
- Median: the 50th percentile; the middle value
- If n is odd: the median is the ((n+1)/2)th observation (e.g. if n=31 then the median is the 16th highest observation)
- If n is even: the median is the average of the two middle observations (e.g. if n=30 then the median is the average of the 15th and 16th observations)
- Median CD4 cell count in previous data set: 234.5
Pagano and Gauvreau, Chapter 3
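The odd/even rule for the median can be written out directly. This Python sketch assumes the odd-n rule is "the ((n+1)/2)th observation". The function name and the sample values are hypothetical, chosen so that n = 31 and n = 30 match the slide's examples.

```python
import statistics


def median_by_rule(values):
    """Median using the odd/even rule for sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_sample = list(range(1, 32))     # n = 31, values 1..31
even_sample = list(range(1, 31))    # n = 30, values 1..30

odd_median = median_by_rule(odd_sample)      # the 16th observation
even_median = median_by_rule(even_sample)    # average of the 15th and 16th

print(odd_median, even_median)
```

Both results agree with Python's built-in statistics.median, which applies the same rule.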
31Numerical variable summaries
- Range
- Minimum to maximum or difference (e.g. age range: 15-58 or range=43)
- CD4 cell count range: (0-1368)
- Interquartile range (IQR)
- 25th and 75th percentiles (e.g. IQR for age: 23-36) or difference (e.g. 13)
- Less sensitive to extreme values
- CD4 cell count IQR: (92-422)
Pagano and Gauvreau, Chapter 3
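To illustrate why the IQR is less sensitive to extreme values than the range, here is a small Python sketch. The ages are hypothetical, picked so that the range and IQR happen to match the slide's example numbers. The quartiles use Python's default method, which may differ slightly from Stata's.

```python
import statistics


def spread(values):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return max(values) - min(values), q3 - q1


# Hypothetical ages in years.
ages = [15, 19, 23, 25, 27, 30, 32, 34, 36, 41, 58]

# Same ages, but with the oldest person replaced by an extreme value.
ages_with_extreme = ages[:-1] + [90]

age_range, age_iqr = spread(ages)
extreme_range, extreme_iqr = spread(ages_with_extreme)

print(age_range, age_iqr)
print(extreme_range, extreme_iqr)
```

Changing a single extreme observation moves the range from 43 to 75, while the IQR stays at 13 because the 25th and 75th percentiles do not depend on the largest value.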
32Numerical variable summaries
- Measures of central tendency: where is the center of the data?
- Mean: arithmetic average
- Means are sensitive to very large or small values
- Mean CD4 cell count: 296.9
- Mean age: 32.5
Pagano and Gauvreau, Chapter 3
33Interpreting the formula
- Σ is the symbol for the sum of the elements immediately to the right of the symbol
- These elements are indexed (i.e. subscripted) with the letter i
- (The index letter could be any letter, though i is commonly used)
- The elements are lined up in a list, and the first one in the list is denoted as x_1, the second one is x_2, the third one is x_3 and the last one is x_n.
- n is the number of elements in the list.
Pagano and Gauvreau, Chapter 3
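The formula itself did not survive in these notes. Presumably it is the sample mean, the sum of x_i for i = 1 to n, divided by n, and the sketch below rests on that assumption. It spells out the summation as an explicit loop in Python, with hypothetical data.

```python
import statistics

# Hypothetical list of observations x_1, x_2, ..., x_n.
x = [3, 5, 8, 12]
n = len(x)

# The summation: add up x_i for i = 1, 2, ..., n.
total = 0
for i in range(1, n + 1):
    total += x[i - 1]    # Python lists start at 0, so x_i is x[i - 1]

mean = total / n         # divide the sum by the number of elements

print(total, n, mean)
```

The loop variable plays the role of the index i. Renaming it would change nothing, which is the point of the remark that any index letter could be used.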
34Numerical variable summaries
- Sample variance
- Amount of spread around the mean, calculated in a sample by $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Sample standard deviation (SD) is the square root of the variance
- The standard deviation has the same units as the mean
- SD of CD4 cell count: 255.4
- SD of Age: 11.2
Pagano and Gauvreau, Chapter 3
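A step-by-step Python sketch of the sample variance and standard deviation, assuming the usual definition that divides the summed squared deviations from the mean by n - 1. The data are hypothetical.

```python
import math
import statistics

# Hypothetical observations.
x = [1, 3, 5, 7, 9]
n = len(x)

mean = sum(x) / n

# Squared distance of each observation from the mean.
squared_deviations = [(xi - mean) ** 2 for xi in x]

# Sample variance: divide by n - 1, not n.
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance, same units as the data.
sd = math.sqrt(variance)

print(mean, variance, round(sd, 2))
```

The variance is in squared units, which is why the standard deviation is the one usually reported alongside the mean.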
35Numerical variable summaries
- Coefficient of variation
- For the same relative spread around a mean, the variance will be larger for a larger mean
- Can be used to compare variability across measurements that are on different scales (e.g. IQ and head circumference)
- CV for CD4 cell count: 86.0%
- CV for age: 34.5%
Pagano and Gauvreau, Chapter 3
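The two CV values can be reproduced from the means and standard deviations reported earlier in the lecture, assuming the usual definition of CV as 100 x SD / mean. A short Python check follows; the function name is made up for the example.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation as a percentage: 100 * SD / mean."""
    return 100 * sd / mean


# SD and mean values as reported in the lecture.
cv_cd4 = coefficient_of_variation(255.4, 296.9)   # CD4 cell count
cv_age = coefficient_of_variation(11.2, 32.5)     # age

print(f"CV for CD4 cell count: {cv_cd4:.1f}%")
print(f"CV for age: {cv_age:.1f}%")
```

Because the units cancel, the CV lets CD4 count and age be compared directly: relative to its mean, CD4 count is far more variable than age.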
36Pocket/wallet change
- Histogram, boxplot
- Mode, Median, 25th percentile, 75th percentile
- Mean, SD
- Differ by gender?
37For next time
- Read Pagano and Gauvreau
- Chapters 1-3 (Review of today's material)
- Chapter 6