Title: Introduction to Biostatistics
1Introduction to Biostatistics
Prof Haroon Saloojee Division of Community
Paediatrics
2Introduction to Biostatistics: Lecture 1
Summarising your data 1
3The evidence-based clinician's motto
- In God we trust.
- All others must bring data.
4Challenges
- Statistical ideas can be difficult and intimidating
- Thus
- Statistical results are often skipped over when reading scientific literature
- Data is often misinterpreted
5Misinterpretation of Data
- Celebrating birthdays is healthy
- Statistics show that those who celebrate the most birthdays live the longest.
6You may think that
- A Bar Chart is a map of the locations of the nearest taverns
- A p-value is the result of a urinalysis
- A t-test is a taste test between rooibos tea and Five Roses tea
7Course Structure
- BIO-SADISTICS
- Four 45-minute lectures
- PowerPoint presentations on student web site
- Some text (content) also on web page
- Plus, additional internet links
8Syllabus for the Course
- SESSION 1 Summarizing your data 1
- Types of data (quantitative and categorical variables)
- Describing data: averages (mean, median, and mode)
- Displaying data graphically (box plots, histograms, bar charts, pie diagrams)
- Frequency distributions
- SESSION 2 Summarizing your data 2
- The normal distribution
- Describing data spread (range, variance, standard deviation, z score)
- Quartiles, percentiles
- Standard error of the mean
- Confidence intervals
- SESSION 3 Sampling principles
- Study Population
- The sample
- Random sampling
- Non random sampling
- Sampling bias
- Sample size and power
- SESSION 4 Statistical tests and the concept of significance
- Hypothesis testing
- p value
- Statistical versus clinical significance
- Parametric versus non-parametric methods
9Free textbook on-line
Statistics at Square One
http://bmj.bmjjournals.com/collections/statsbk/index.shtml
10http://www.medstatsaag.com/mcqs.asp
Relevant topics:
- Handling data: 1, 4, 5, 6, 7
- Sampling: 10, 11
- Hypothesis testing: 17, 18
11Today's Lecture
- What types of data are there?
- (numerical vs. categorical variables)
- Describing data - measures of central tendency (mean, median and mode)
- Summarising data graphically (histograms, box plots, bar charts, pie diagrams)
12Types of data
13Types of Data
- Numerical data
- Discrete
- Examples
- No. of children
- No. asthma attacks in a week
- No. of rooms in home
14Types of Data
- Numerical data
- Continuous
- Any value on the continuum is possible (even fractions or decimals)
- Examples
- Weight
- Age
- Temperature
- Heart rate
15Types of Data
- Categorical data
- Nominal
- Mutually exclusive unordered categories
- Examples
- Sex (male, female)
- Eye colour (brown, grey, green, blue)
- Are you happy? (Yes, No)
- Diarrhoea (Present, absent)
- Can summarize in
- Tables using counts and percentages
- Bar Chart
16Types of Data
- Categorical data
- Ordinal (ordered categories)
- Examples
- Degree of agreement
- (Strongly Agree, Agree, Disagree, Strongly disagree)
- Severity of injury
- Severe, Moderate, Mild
- Income level
- High, medium, low
17PRACTICE
Discrete or Continuous?
- mg of tar in cigarettes: Continuous
- number of people in a car: Discrete
- high to low temperature in any day: Continuous
- weight: Continuous
- time: Continuous
- number of children in the average family: Discrete
Nominal or Ordinal?
- Average / above avg / below average: Ordinal
- Colours of Smarties: Nominal
- Grades (A, B, C, D, F): Ordinal
19Data Summaries
- It is ALWAYS a good idea to summarise your data
- You become familiar with the data and the characteristics of the people that you are studying
- You can also identify problems or errors with the data (data management issues).
20Summarising and Describing Continuous Data
- Measures of the centre of data (central tendency)
- Mean
- Median
- Mode
21Definitions
- The arithmetic mean is what is commonly called the average. The mean is the sum of all the scores divided by the number of scores.
- The median is the middle of a distribution: half the scores are above the median and half are below the median.
- The mode is the most frequently occurring score in a distribution.
22- It has been said that a fellow with one leg
frozen in ice and the other leg in boiling water
is comfortable - on average.
- J.M. Yancy
23Sample Mean $\bar{X}$
- The Average or Arithmetic Mean
- Add up data, then divide by sample size (n)
- The sample size n is the number of observations (pieces of data)
- Example
- Systolic blood pressures (mmHg)
- $X_1 = 120$
- $X_2 = 80$
- $X_3 = 90$
- $X_4 = 110$
- $X_5 = 95$
- $n = 5$
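The arithmetic for this example can be checked with a few lines of Python. This is a minimal sketch and not part of the original slides; the variable names are illustrative only.

```python
# Systolic blood pressures (mmHg) from the example above
pressures = [120, 80, 90, 110, 95]

n = len(pressures)                # sample size
sample_mean = sum(pressures) / n  # add up the data, then divide by n

print(f"n = {n}, sample mean = {sample_mean} mmHg")
```

The sum of the five readings is 495, so the sample mean is 495 / 5 = 99 mmHg.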
24Notation
- $\Sigma$ (sigma) denotes the summation of a set of values
- $x$ is the variable usually used to represent the individual data values
- $n$ represents the number of data values in a sample
- $N$ represents the number of data values in a population
- $\bar{x}$ is pronounced x-bar and denotes the mean of a set of sample values
- $\mu$ is pronounced mu and denotes the mean of all values in a population
25Definitions
- Mean
- the value obtained by adding the scores and dividing the total by the number of scores
- Sample: $\bar{x} = \frac{\sum x}{n}$
- Population: $\mu = \frac{\sum x}{N}$
26Notes on Sample Mean
- Also called sample average or arithmetic mean
- Sensitive to extreme values
- One data point could make a great change in sample mean
- Why is it called the sample mean?
- To distinguish it from population mean
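To see how sensitive the mean is to a single extreme value, the sketch below compares the mean and the median before and after one reading is replaced by an extreme one. The value 300 is a hypothetical data-entry error invented for illustration; it does not come from the slides.

```python
from statistics import median

# Five systolic blood pressures (mmHg)
original = [120, 80, 90, 110, 95]

# Hypothetical: the first reading is mistyped as 300 instead of 120
with_outlier = [300, 80, 90, 110, 95]

mean_original = sum(original) / len(original)
mean_outlier = sum(with_outlier) / len(with_outlier)

print("mean:  ", mean_original, "->", mean_outlier)
print("median:", median(original), "->", median(with_outlier))
```

One changed data point moves the mean from 99 to 135, while the median stays at 95.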
27Population Versus Sample
- Population: the entire group you want information about
- For example: the blood pressure of all 20-year-old male university students in South Africa
- Sample: a part of the population from which we actually collect information and draw conclusions about the whole population
- For example: a sample of blood pressures (n = 50) of 20-year-old male university students in South Africa
- The sample mean $\bar{X}$ is not the population mean $\mu$
28Population Versus Sample
- We don't know the population mean $\mu$ but would like to know it
- We draw a sample from the population
- We calculate the sample mean $\bar{X}$
- How close is $\bar{X}$ to $\mu$?
- Statistical theory will tell us how close $\bar{X}$ is to $\mu$
- Statistical inference is the process of trying to draw conclusions about the population from the sample
29Weighted Mean
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
Your grades in many courses are weighted means
(averages). In other words, some things count
(are weighted) more than others.
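A weighted mean for a course grade might be computed as in the sketch below. The marks and weights are hypothetical values invented for illustration; only the formula (sum of weight times value, divided by the sum of the weights) comes from the slide.

```python
# Hypothetical course components: (mark out of 100, weight)
components = {
    "class test": (60, 20),
    "assignment": (70, 30),
    "final exam": (80, 50),
}

weighted_sum = sum(mark * weight for mark, weight in components.values())
total_weight = sum(weight for _, weight in components.values())
weighted_mean = weighted_sum / total_weight

# Unweighted mean of the same marks, for comparison
simple_mean = sum(mark for mark, _ in components.values()) / len(components)

print("weighted mean:", weighted_mean)
print("simple mean:  ", simple_mean)
```

Because the exam carries half the weight, the weighted mean (73) sits closer to the exam mark than the simple mean (70) does.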
30Geometric Means
These are histograms rotated 90º, and box
plots. Note how the log transformation gives a
symmetric distribution.
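The link between the log transformation and the geometric mean can be sketched as follows: the geometric mean is the antilog of the arithmetic mean of the logged values. The data are hypothetical right-skewed values chosen for illustration, not taken from the slide's figures.

```python
import math

# Hypothetical right-skewed measurements
values = [1, 10, 100]

# Arithmetic mean on the original scale is pulled up by the large value
arithmetic_mean = sum(values) / len(values)

# Geometric mean: average the natural logs, then transform back
logs = [math.log(v) for v in values]
geometric_mean = math.exp(sum(logs) / len(logs))

print("arithmetic mean:", arithmetic_mean)
print("geometric mean: ", round(geometric_mean, 6))
```

The geometric mean is only defined for positive values, since the logarithm of zero or a negative number does not exist.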
32- 5 5 5 3 1 5 1 4 3 5 2
- 1 1 2 3 3 4 5 5 5 5 5 (in order)
- exact middle: MEDIAN is 4
- 1 1 3 3 4 5 5 5 5 5
- no exact middle -- shared by two numbers
- $\frac{4 + 5}{2} = 4.5$
- MEDIAN is 4.5
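The two cases above (odd and even number of scores) can be written as a small function. This is a minimal sketch with an illustrative function name; Python's built-in `statistics.median` does the same job.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of values there is no exact middle,
    so the two central values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # exact middle
    return (ordered[mid - 1] + ordered[mid]) / 2   # shared by two numbers

odd_case = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5, 2]    # 11 scores
even_case = [1, 1, 3, 3, 4, 5, 5, 5, 5, 5]      # 10 scores

print(median(odd_case))
print(median(even_case))
```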
33Mode
- The score that occurs most frequently
- Bimodal
- Multimodal
- No Mode
- The only measure of central tendency that can be
used with nominal data
34Examples
- a. 5 5 5 3 1 5 1 4 3 5: Mode is 5
- b. 2 2 2 3 4 5 6 6 6 7 9: Bimodal (2 and 6)
- c. 2 3 6 7 8 9 10: No Mode
- d. 2 2 3 3 3 4
- e. 2 2 3 3 4 4 5 5
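Finding the mode amounts to counting how often each score occurs. The sketch below uses an illustrative helper, `modes`, and assumes one common convention: when every distinct value occurs equally often, the data are reported as having no mode. Other texts treat that situation differently.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if all values tie."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if every distinct value is equally frequent, there is no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

a = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5]
b = [2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
c = [2, 3, 6, 7, 8, 9, 10]

print(modes(a))  # one mode
print(modes(b))  # bimodal
print(modes(c))  # no mode
```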
35Shapes of the Distribution
36Shapes of the Distribution
37Distribution Characteristics
38Shapes of the Distribution
Example: Height of students in the class
39Shapes of the Distribution
Example: Serum cholesterol level
40Shapes of the Distribution
Example: Birth weight of newborn babies
41Shapes of the Distribution
43Some visual ways to summarize data
- Tables
- Frequency table
- Graphs
- Histograms
- Bar graphs
- Box plots
- Line plots
- Scatter graphs
- Charts
- Bar chart
- Pie diagram
44Frequency Tables
- Summarizes a variable with counts and percentages
- The variable is categorical
- Note that you can take a continuous variable and create categories with it
- How do you create categories for a continuous variable?
- Choose cutoffs that are biologically meaningful
- Natural breaks in the data
45Example of frequency table
When raw data are arranged with frequencies, they
are said to form a frequency table for ungrouped
data. When the data are divided into groups
(classes), they are called grouped data. The
classes have to be decided according to the range
of data and size of class. The number of
observations lying in a particular class is
called its frequency and the table showing
classes with frequencies is called a frequency
table. The total of frequencies of a particular
class and of all classes prior to that class is
called the cumulative frequency of that class.
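A frequency table with cumulative frequencies can be built by counting each value and keeping a running total. The sketch below uses hypothetical data (number of children in ten families) invented for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ungrouped data: number of children in 10 families
children = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]

counts = Counter(children)
values = sorted(counts)                      # the distinct values, in order
frequencies = [counts[v] for v in values]    # how often each value occurs
cumulative = list(accumulate(frequencies))   # running total of frequencies

print("value  frequency  cumulative  percent")
for v, f, c in zip(values, frequencies, cumulative):
    percent = 100 * f / len(children)
    print(f"{v:>5}  {f:>9}  {c:>10}  {percent:>6.1f}%")
```

The last cumulative frequency always equals the total number of observations, which is a useful check on the table.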
46Graphical Summaries
- Histograms
- Continuous or ordinal data on horizontal axis
- Bar Graphs
- Nominal data
- No order to horizontal axis
- Box Plots
- Continuous data
47Histogram
A histogram is a graphic representation of the
frequency distribution of a variable. Vertical
rectangles (bars) are drawn in such a way that
their bases lie on a linear scale representing
different intervals, and their heights are
proportional to the frequencies of the values
within each of the intervals.
48Bar Chart
A bar chart is a method of presenting discrete
data organized in such a way that each
observation can fall into one of mutually
exclusive categories. The frequencies (or
percentages) are listed along the Y axis and the
categories of the variable along the X axis. The
heights of the bars correspond to the
frequencies. The bars should be of equal width
and they should not touch the other bars.
49Difference between bar chart and histogram
- Bar charts for categories that are separate
- Histograms if you got categories by dividing up continuous data.
- Bars do not touch, histogram rectangles do touch.
50Line graph
If the mid-points of the top of the bars of a
histogram are connected together by a line and if
the bars were omitted from the display, the
resultant graph will be a line graph (also called
a frequency polygon). Line graphs are good at
showing trends over a period of time. When trends
of rates (e.g. death rate, Infant Mortality Rate,
etc.) are to be displayed it is better done with
line graphs rather than histograms.
51Scatter plot
Also called a scattergram. This is a method of
displaying the distribution of two variables in
relation to each other. The values of one
variable are measured on the X axis and the
values of the other on the Y axis. The variables
have to be on a continuous scale. Each point thus
has two values (coordinates), one from the X axis
scale and one from the Y axis scale. A wide
scatter of the points denotes poor correlation
between the two variables. If the two variables
are perfectly correlated, then all the points
will fall on a straight line (the regression line).
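How tightly the points cluster around a straight line is usually summarised by the Pearson correlation coefficient, which the slide does not define. The sketch below is an illustration under that assumption, with hypothetical data; `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired continuous data.

    Assumes both lists have the same length and neither is constant.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_on_line = [2, 4, 6, 8, 10]   # hypothetical: every point lies on one line
y_scattered = [3, 1, 4, 2, 5]  # hypothetical: points scattered more widely

print(pearson_r(x, y_on_line))
print(pearson_r(x, y_scattered))
```

A coefficient of 1 (or -1) corresponds to all points falling on a line; values nearer 0 correspond to a wider scatter.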
52Survival curve
53Pie chart
This is a circular diagram (can be shown as 2-D
or 3-D) divided into segments, each representing
a category or subset of data (part of the whole).
The amount for each category is proportional to
the area of the sector (slice of the pie). The
total area of the circle is 100% and it
represents the total population that is being
shown.
54Pictures of Data: Continuous Variables
- Histograms
- Means and medians do not tell whole story
- Differences in spread (variability)
- Differences in shape of the distribution
55How to Make a Histogram
- Divide range of data into intervals (bins) of equal width
- Count the number of observations in each class
- Draw the histogram
- Label scales
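The first two steps above (dividing the range into equal-width bins and counting the observations in each) can be sketched as follows. The ages are hypothetical data, and `histogram_counts` is an illustrative helper; the text "bars" stand in for a drawn histogram.

```python
def histogram_counts(values, n_bins):
    """Split the data range into n_bins equal-width bins and count each bin.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one
        index = min(int((v - lowest) / width), n_bins - 1)
        counts[index] += 1
    return lowest, width, counts

# Hypothetical ages (years) of nine patients
ages = [21, 23, 25, 30, 34, 35, 38, 41, 41]
lowest, width, counts = histogram_counts(ages, n_bins=4)

for i, count in enumerate(counts):
    start = lowest + i * width
    print(f"{start:>5.1f} to {start + width:<5.1f} {'#' * count}")
```

Changing the number of bins changes the apparent shape of the distribution, so it is worth trying more than one choice.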
56Pictures of Data Histograms
57Pictures of Data Histograms
58Pictures of Data Histograms
59Box plot
- Another common visual display tool is the box plot
- Gives good insight into distribution shape in terms of skewness and outlying values
- Very nice tool for easily comparing distributions of continuous data in multiple groups; the plots can be placed side by side
60Box plot
A box plot provides an excellent visual summary
of many important aspects of a distribution. The
box stretches from the lower hinge (defined as
the 25th percentile) to the upper hinge (the 75th
percentile) and therefore contains the middle
half of the scores in the distribution. The
median is shown as a line across the box.
Therefore 1/4 of the distribution is between this
line and the top of the box and 1/4 of the
distribution is between this line and the bottom
of the box.
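The three values that define the box (25th percentile, median, 75th percentile) can be computed with Python's `statistics.quantiles`. The lengths of stay below are hypothetical, and the `method="inclusive"` setting is one of several quartile conventions; other software may give slightly different hinges for the same data.

```python
from statistics import median, quantiles

# Hypothetical hospital lengths of stay (days) for nine patients
length_of_stay = [1, 2, 2, 3, 4, 5, 6, 8, 12]

# Cut the data into four equal parts: returns the three quartiles
lower_hinge, middle, upper_hinge = quantiles(length_of_stay, n=4, method="inclusive")

print("lower hinge (25th percentile):", lower_hinge)
print("median:", middle)
print("upper hinge (75th percentile):", upper_hinge)
print("box height (interquartile range):", upper_hinge - lower_hinge)
```

The middle quartile matches the median, and the box between the two hinges covers the middle half of the stays.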
61Hospital Length of Stay
62Box plot Length of Stay
63Box plot Length of Stay
64Misuse of graphics
- "It pays to be wide awake in studying any graph. The thing looks so simple, so frank, and so appealing that the careless are easily fooled." - M J Moroney.
- Graphs and charts are often misused. The honest researcher must have a good handle on how graphs can be used to deliberately mislead people so that such misadventures can be avoided.
- Common tricks used to mislead
- The problem of scaling
- The Advertiser's Graph
- The transformed graph
- The chart with too much data
65Which graph to use?
Statistical methods depend on the form of a set
of data, which can be assessed with some common
useful graphics:
Graph Name    Y-axis          X-axis
Histogram     Count           Category
Scatterplot   Continuous      Continuous
Dot Plot      Continuous      Category
Box Plot      Percentiles     Category
Line Plot     Mean or value   Category
66Example of MCQ 1
The arithmetic mean of a set of values
- a) Is a particular type of average.
- b) Is a useful summary measure of location if the data are skewed to the right.
- c) Coincides with the median if the distribution of the data is symmetrical.
- d) Is always greater than the median.
- e) Cannot be calculated if the data set contains both positive and negative values.
67Example of MCQ 2
A histogram
- a) Can be used instead of a pie chart to display categorical data.
- b) Is similar to a bar chart but there are no gaps between the bars.
- c) Contains contiguous bars, with the height of each bar being proportional to the frequency of the observations in the range specified by the bar.
- d) Can be used to display either a frequency or a relative frequency distribution.
- e) Is used to show the relationship between two variables.
68Any questions?