Title: Biostatistics and Computer Applications
1. Biostatistics and Computer Applications
Dafeng Hui
Introduction, Descriptive Statistics, SAS
2. Introduction
- Introduction to the basic concepts of statistics as applied to problems in biological science.
- Goals of the course
  - Understand statistical concepts (population, sample, t-test, slope, significance, etc.)
  - Identify appropriate methods for your data (e.g., paired or independent t-test; one-way, block, or two-way ANOVA)
  - Select the correct SAS procedures to analyze data (several SAS procedures may serve one purpose; choose the more suitable one)
  - Scientific reading and interpretation.
3. Biostatistics and Computer Applications
- Why study biostatistics?
  - Statistical methods are widely used in the biological field
  - Examples are from the biological field, practical and useful
  - Focus is on application instead of mathematical derivation
  - Helps to evaluate papers in an intelligent manner.
- Statistics: the science and art of obtaining reliable results and conclusions from data that are subject to variation.
- Biostatistics (biometry): the application of statistics to the biological sciences.
4. Biostatistics and Computer Applications
- Why computer applications?
  - Statistical methods are often difficult and complicated (ANOVA, regression, etc.)
  - Advances in computer technology and statistical software make applying statistical methods much easier today than before
  - Software such as SAS takes time to learn.
5. Is Biostatistics hard to study?
- Factors that make statistics hard for some students to learn
  - The terminology is deceptive. To understand statistics, you have to understand that the statistical meanings of terms such as significant, error, and hypothesis are distinct from the ordinary uses of these words.
  - Statistics requires mastering abstract concepts. It is not easy to think about theoretical concepts such as populations, probability distributions, and null hypotheses.
6. Is Biostatistics hard? (cont.)
- Statistics is at the interface of mathematics and science. To really grasp the concepts of statistics, you need to be able to think about them from both angles.
- The derivation of many statistical tests involves difficult math. However, you can learn to use statistical tests and interpret the results even if you do not fully understand how they work. You only need to know enough about how the tools work to avoid using them in inappropriate situations.
- Basically, you can calculate statistical tests and interpret results even if you don't understand how the equations were derived, as long as you know enough to use the tests appropriately.
7. Questions about this class
- Is this class going to be hard?
  - No. The concepts are easy and the procedures are clear.
- Why do we spend time on theory?
  - It helps in understanding the applications.
- Do we need to know all of it?
  - You may not need all of it, but be prepared.
8. Role of statistics in Biological Science
- Statistics
  - 1. Mathematical model / hypothesis
  - 2. Study design
  - 3. Descriptive statistics
  - 4. Inferential statistics
- Science
  - 1. Idea or question
  - 2. Collect data / make observations
  - 3. Describe data / observations
  - 4. Assess the strength of evidence for / against the hypothesis
9. Contents of the course
- Descriptive statistics
  - Graph, table, mean and standard deviation
- Inferential statistics
  - Probability and distribution
  - Hypothesis test
  - Analysis of variance
  - Correlation and regression analysis
  - Other special topics
10. Basic Concept
- Data
  - Numerical facts, measurements, or observations obtained from an investigation or experiment aimed at answering a question
  - Statistical analyses deal with numbers
- Variable
  - A characteristic that can take on different values for different persons, places, or things
  - Statistical analyses need variability; otherwise there is nothing to study
- Examples
  - Concentration of a substance, pH values obtained from atmospheric precipitation, birth weight of babies whose mothers are smokers, etc.
11. Basic Concept (cont.)
- Type of variable
  - Continuous variable
    - Between any two values of the variable, there is another possible value
    - Examples: height, weight, concentration
  - Discrete variable
    - Values can only be integers
    - Examples: number of people, number of plants, etc.
12. Basic Concept (cont.)
- Population
  - Population: a set or collection of objects we are interested in (finite or infinite).
  - Parameter: a descriptive measure associated with a variable of an entire population, usually unknown because the whole population cannot be enumerated.
  - For example,
    - Plant height under warming conditions
    - Graduates in the US; smokers in the world.
13. Basic Concept (cont.)
- Sample
  - Sample: a small number of subjects drawn from a population in order to make inferences about the population
  - Random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected.
  - Statistic: a descriptive measure associated with a random variable of a sample.
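The definition of a random sample can be illustrated with a minimal Python sketch. The population of 100 numbered plants and the helper name `simple_random_sample` are hypothetical, chosen only for illustration; the standard-library `random.sample` draws without replacement so that every subset of size n is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every subset of size n is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: plants numbered 1..100 (N = 100)
population = list(range(1, 101))
sample = simple_random_sample(population, n=10, seed=1)
print(sorted(sample))
```

A statistic such as the sample mean would then be computed from `sample`, while the corresponding parameter refers to all of `population`.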
14. Basic Concept (cont.)
- Population and Sample
  - Sample → Population, Statistic → Parameter
  - [Diagram: the population (parameter) is used to predict properties of a sample; the sample (statistic) is generalized to the population.]
15. Descriptive Statistics
- Graphical summaries
  - Frequency distribution
  - Histogram
  - Stem-and-leaf plot
  - (Bar plot, box plot)
- Numerical summaries
  - Location: mean, median, mode
  - Spread: range, variance, standard deviation
  - (Shape: skewness, kurtosis)
16. Frequency Distribution (discrete var.)
- Example: number of sedge plants, Carex flacca, found in 800 sample quadrats (1 m²) in an ecological study of grasses
  - 1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2,
  - ...
  - 1, 2, 3, 2, 1, 1, 0, 5, 0, 0, 1, 0, 1, 0, 2, 4, 7, 2, 1, 0
- How is the number of plants per quadrat distributed?
17. Frequency Distribution (discrete var.)
- Table 1. The frequency, relative frequency, and cumulative frequencies of sedge plants in a quadrat.
- Frequency: the number of times a value occurs in the data (corresponds to probability for a population).
- Relative frequency: the percentage of the time that the value occurs (frequency/n).
- Cumulative relative frequency: the percentage of the sample that is equal to or smaller than the value (cumulative frequency/n).
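The three definitions above can be sketched in Python as follows. This is only an illustration: it uses the first 19 quadrat counts shown on the previous slide, not the full set of 800, and the function name `frequency_table` is an assumption.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency, cumulative relative frequency)."""
    n = len(values)
    counts = Counter(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, freq / n, cumulative / n))
    return rows

# First visible row of the sedge data only (19 of the 800 quadrats)
plants = [1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2]
table = frequency_table(plants)
for value, freq, rel, cum in table:
    print(f"{value:>5} {freq:>4} {rel:6.3f} {cum:6.3f}")
```

Multiplying the last two columns by 100 gives the percentages used in the definitions.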
18. Frequency Distribution (cont. var.)
- Grouping of continuous outcomes
  - Examples: weight, height.
  - Grouped data give a better understanding of what the data show than individual values do
- Example: fiber length of a cotton variety (n = 106)
  - Data
    - 27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6
    - 31.8, 32.0, 27.8
19. Frequency Distribution
Table 2. Frequency and relative frequency distribution of fiber length (mm) of a cotton variety (n = 106)
20. Frequency Distribution (cont. var.)
- Calculate the range: R = max(X) - min(X) = 5.13
- Set the number of intervals g and the interval width i
  - Some rules exist, but generally create 8-15 equal-sized intervals; here g = 11
  - i = R/(g - 1) ≈ 0.5
- Set the interval limits
  - L1 = min(X) - i/2 = 27.0, L2 = L1 + i = 27.5, ...
- Count the number of observations in each interval
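The grouping steps above can be sketched in Python. This is a rough illustration under stated assumptions: it uses only the 12 fiber lengths visible on the earlier slide (not all 106), g = 10 is chosen so that the width comes out to exactly 0.5 for these values, and the function name `grouped_frequency` is hypothetical.

```python
def grouped_frequency(values, g):
    """Group values into g equal-width intervals following the slide's rule."""
    r = max(values) - min(values)        # range R
    width = r / (g - 1)                  # interval width i = R / (g - 1)
    first = min(values) - width / 2      # lower limit L1 = min(X) - i/2
    counts = [0] * g
    for x in values:
        k = int((x - first) / width)     # index of the interval containing x
        counts[min(k, g - 1)] += 1       # guard against rounding at the top edge
    limits = [first + k * width for k in range(g + 1)]
    return limits, counts

# Only the 12 visible observations; the slide's own table uses all 106 with g = 11
lengths = [27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6, 31.8, 32.0, 27.8]
limits, counts = grouped_frequency(lengths, g=10)
for lower, upper, count in zip(limits, limits[1:], counts):
    print(f"{lower:.2f}-{upper:.2f}: {count}")
```

Starting the first interval half a width below the minimum places the smallest and largest observations in the middle of the first and last intervals.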
21. Frequency Distribution
Table 2. Frequency and relative frequency distribution of fiber length (mm) of a cotton variety (n = 106)
22. Histogram (bar graph) and polygon
- Histogram: a graph of frequencies
  - Can be used to visually compare frequencies
  - Easier to assess the magnitude of differences than by trying to judge numbers
- Frequency polygon: similar to a histogram
Fig. 1. Frequency distribution of plants in a quadrat.
23. Histogram (bar graph) and polygon
Fig. 2. Frequency distribution of fiber length of a cotton variety.
24. Stem-and-Leaf Displays
- Another way to assess frequencies
- Preserves individual measurements, so it is not practical for large data sets
- The stem is the first digit(s) of a measurement; the leaf is its last digit
- Most useful for two-digit numbers; more cumbersome for three digits

Tally by tens:
20 | X
30 | XXX
40 | XXXX
50 | XX
60 | X

Stem | Leaf
2 | 1
3 | 244
4 | 2468
5 | 26
6 | 4
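A stem-and-leaf display like the one above can be built with a few lines of Python. The sketch assumes two-digit whole numbers; the data are the eleven values read back from the display itself, and the function name `stem_and_leaf` is an assumption.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits), both in ascending order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 46 -> stem 4, leaf 6
        stems[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(stems.items())}

data = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```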
25. Summary
- In practice, descriptive statistics play a major role
  - Always the first 1-2 tables/figures in a paper
  - A statistician needs to know about each variable before deciding how to analyze the data to answer research questions
- In any analysis, 90% of the effort goes into setting up the data
  - Descriptive statistics are part of that 90%
26. Descriptive Statistics: Measures of Location
- Descriptive measure computed from population data: parameter
- Descriptive measure computed from sample data: statistic
- Most common measures of location
  - Mean
  - Median
  - Mode
  - Geometric mean, harmonic mean
27. Arithmetic mean (population)
- Suppose we have N measurements of a particular variable in a population. We denote these N measurements as
  - $X_1, X_2, X_3, \ldots, X_N$
  - where $X_1$ is the first measurement, $X_2$ is the second, etc.
- Definition
  - $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
  - More accurately called the arithmetic mean, it is defined as the sum of the observed measurements divided by the number of observations.
28. Arithmetic mean (sample)
- Sample: suppose we have n measurements of a particular variable, drawn from a population with N measurements. The n measurements are
  - $X_1, X_2, X_3, \ldots, X_n$
  - where $X_1$ is the first measurement, $X_2$ is the second, etc.
- Definition
  - $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
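29. Arithmetic mean (sample)
- Some properties of the arithmetic mean
  - 1. The deviations from the mean sum to zero: $\sum_{i=1}^{n}(X_i - \bar{X}) = 0$
    - Proof of 1: $\sum_{i=1}^{n}(X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X} = n\bar{X} - n\bar{X} = 0$
  - 2. The sum of squared deviations from the mean is a minimum: $\sum_{i=1}^{n}(X_i - \bar{X})^2 \le \sum_{i=1}^{n}(X_i - a)^2$ for any value $a$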
30. Median
- Frequently used if there are extreme values in a distribution or if the distribution is non-normal
- Definition
  - The value that divides the ordered array into two equal parts
  - If there is an odd number of observations, the median Md is the (n+1)/2-th observation
    - e.g., the median of 11 observations is the 6th observation
  - If there is an even number of observations, the median Md is the midpoint between the middle two observations
    - e.g., the median of 12 observations is the midpoint between the 6th and 7th
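The odd/even rule for the median can be written out as a short Python sketch. The function name `median` and the test data (the integers 1 to 11 and 1 to 12) are illustrative assumptions; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Median by the odd/even rule: middle value, or midpoint of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (n+1)/2-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # midpoint of the middle two

print(median(range(1, 12)))  # 11 observations: the 6th value
print(median(range(1, 13)))  # 12 observations: midpoint of the 6th and 7th
```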
31. Mode
- Definition
  - The value that occurs most frequently in the data set
- Example
  - 2 3 4 5 3 4 5 6 7 5 3 2 5; mode Mo = 5
- If all values are different, there is no mode
- There may be more than one mode
  - Bimodal or multimodal
- Not used very frequently in practice
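The cases listed above (one mode, several modes, no mode) can be handled in one small Python sketch. The function name `modes` and the convention of returning an empty list when every value occurs once are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                    # all values different: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 4, 5, 3, 4, 5, 6, 7, 5, 3, 2, 5]))  # the slide's example
print(modes([1, 1, 2, 2, 3]))                          # bimodal
print(modes([1, 2, 3]))                                # no mode
```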
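32. Example: Central Location
Suppose the ages of the 10 trees you are studying are 34, 24, 56, 52, 21, 44, 64, 34, 42, 46.
Then the mean age of this group is 417/10 = 41.7 years.
To find the median, first order the data: 21, 24, 34, 34, 42, 44, 46, 52, 56, 64. With 10 observations, the median is the midpoint of the 5th and 6th values: Md = (42 + 44)/2 = 43 years.
The mode is 34 years, Mo = 34 (occurred twice).
The mean is the most commonly used of the three.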
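33. Geometric mean
- Used to calculate mean growth rate
- Definition
  - Antilog of the mean of the $\log x_i$
  - $G = \operatorname{antilog}\left(\frac{1}{n}\sum_{i=1}^{n}\log x_i\right) = \sqrt[n]{x_1 x_2 \cdots x_n}$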
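The definition "antilog of the mean of the logs" translates directly into Python. This is a minimal sketch; the function name `geometric_mean` and the growth factors in the example are hypothetical, since the slide's own data are not available. Recent Python versions also provide `statistics.geometric_mean`.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the logs; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Hypothetical growth factors over three periods (not from the slides)
factors = [1.10, 1.20, 1.30]
print(round(geometric_mean(factors), 4))
```

Working on the log scale avoids overflow when many values are multiplied, and gives the same result as taking the n-th root of the product.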
34. Geometric mean
- Example: root growth at 25 °C; calculate the mean growth rate (mm/d).
35. Descriptive Statistics: Measures of Dispersion
- Look at these two data sets
  - Set 1: 100, 30, 20, 7, -20, -30, -100
  - Set 2: 10, 3, 2, 7, -2, -3, -10
- If we calculate the mean
  - Set 1: mean = 7/7 = 1
  - Set 2: mean = 7/7 = 1
- How do we measure dispersion (spread, variability)?
36. Descriptive Statistics: Measures of Dispersion
- Common measures
  - Range
  - Variance and standard deviation
  - Coefficient of variation
- Many distributions are well described by a measure of location and a measure of dispersion
37. Range
- Range is the difference between the largest and smallest values in the data set
  - R = Max(Xi) - Min(Xi)
- Heavily influenced by the two most extreme values; ignores the rest of the distribution
  - Set 1: 100, 30, 20, 7, -20, -30, -100
  - Set 2: 10, 3, 2, 7, -2, -3, -10
  - R1 = 100 - (-100) = 200
  - R2 = 10 - (-10) = 20
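38. Variance and standard deviation (population)
- Suppose we have N measurements of a particular variable in a population: $X_1, X_2, X_3, \ldots, X_N$
- The mean is $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$. As $\sum_{i=1}^{N}(X_i - \mu) = 0$, we define
  - $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$ as the variance; its unit is the unit of X squared
  - $\sigma = \sqrt{\sigma^2}$ as the standard deviation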
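39. Variance and standard deviation (sample)
- Suppose we have n measurements of a particular variable in a sample: $X_1, X_2, X_3, \ldots, X_n$
- The mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$; we define
  - $s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$ as the mean square, or sample variance
  - $s = \sqrt{s^2}$ as the standard deviation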
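40. Variance and standard deviation
- $s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}{n - 1}$
- Corrected sum of squares (CSS): the numerator, $\sum_{i=1}^{n}(X_i - \bar{X})^2$
- Degrees of freedom: the denominator, $n - 1$
  - n - 1 is used because if we know n - 1 deviations, the nth deviation is known
  - Deviations have to sum to zero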
41. Example
- Suppose the ages of the 10 trees you are studying are 34, 24, 56, 52, 21, 44, 64, 34, 42, 46. We calculated the mean: 41.7 y.
- Calculate the range, variance, standard deviation, and CV.
R = 64 - 21 = 43 y, s² = 1692.1/9 = 188.01 y², s = 13.71 y, CV = 13.71/41.7 ≈ 0.329 (32.9%).
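The tree-age calculation can be checked with a short Python sketch. It computes the corrected sum of squares both from the deviations and from the usual shortcut form (sum of squares minus the squared total over n), then divides by the n - 1 degrees of freedom. The variable names are illustrative assumptions.

```python
import math

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
n = len(ages)
mean = sum(ages) / n                                      # 417 / 10

data_range = max(ages) - min(ages)                        # largest minus smallest
css = sum((x - mean) ** 2 for x in ages)                  # corrected sum of squares
css_shortcut = sum(x * x for x in ages) - sum(ages) ** 2 / n
df = n - 1                                                # degrees of freedom
variance = css / df                                       # sample variance s^2
sd = math.sqrt(variance)                                  # standard deviation s
cv = sd / mean                                            # coefficient of variation

print(data_range, round(css, 1), round(variance, 2), round(sd, 2), round(cv * 100, 1))
```

Both forms of the CSS agree up to floating-point rounding; the shortcut form is convenient for hand calculation, while the deviation form is usually the numerically safer choice in code.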
42. Coefficient of Variation
- Relative variation rather than absolute variation such as the standard deviation
- Definition of C.V.
  - $CV = s / \bar{X}$ (often multiplied by 100 and reported as a percentage)
- Useful in comparing variation between two distributions
- Used particularly in comparing laboratory measures to identify those determinations with more variation
43. Example
- Set 1: 100, 30, 20, 7, -20, -30, -100
- Set 2: 10, 3, 2, 7, -2, -3, -10
- Calculate the mean, s², s, and CV.

Set   Mean   s²       s      CV
1     1      3773.7   61.4   61.4
2     1      44.7     6.7    6.7
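The table can be reproduced with Python's standard `statistics` module. In this sketch Set 1 is taken as 100, 30, 20, 7, -20, -30, -100, the signs that reproduce R1 = 200 and the tabulated values; the helper name `describe` is an assumption, and CV is reported as the plain ratio s/mean to match the table.

```python
import statistics

def describe(values):
    """Return (mean, range, sample variance, sample SD, CV as the ratio s / mean)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # uses the n - 1 denominator
    return (mean, max(values) - min(values),
            statistics.variance(values), sd, sd / mean)

set1 = [100, 30, 20, 7, -20, -30, -100]
set2 = [10, 3, 2, 7, -2, -3, -10]
for name, data in (("1", set1), ("2", set2)):
    mean, r, var, sd, cv = describe(data)
    print(name, mean, r, round(var, 1), round(sd, 1), round(cv, 1))
```

Both sets share the same mean, so the range, variance, and standard deviation are what separate them. Because the mean is close to zero here, the CV is very large and should be read with caution.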
44. Descriptive Statistics (Summary)
- Graphical summaries
  - Frequency distribution
  - Histogram
  - Stem-and-leaf plot
  - Box plot
- Numerical summaries
  - Location: mean, median, mode
  - Dispersion: range, variance, standard deviation
  - Shape (lab)
45. Software
- Statistical software
  - SAS
  - SPSS
  - Stata
  - BMDP
  - MINITAB
- Graphical software
  - SigmaPlot
  - Harvard Graphics
  - PowerPoint
  - Excel
46. SAS
- Statistical Analysis System (SAS)
- World leader in business-intelligence software and services
- Founded in 1976, SAS serves more than 39,000 business, government, and university sites in 118 countries.
47. SAS introduction
- SAS has grown far beyond its origins as a "statistics package" and has positioned itself as "enterprise software", i.e., a complete system to manage, analyze, and present information, especially in a business environment.
- Standard statistical software
48. Design and function of SAS
- Provides tools to scientists so that they can spend their time on data collection and interpretation of results rather than on the mechanics of data analysis.
- Four data-driven tasks: data access, data management, data analysis, and data presentation
49. SAS programming
- SAS windows
- A simple SAS program
  - DATA step
  - PROC step
- SAS procedures for descriptive statistics
  - UNIVARIATE, MEANS, SUMMARY
51. Box Plots (explained later)
- Descriptive method to convey information about measures of location and dispersion
  - Box-and-whisker plots
- Construction of a box plot
  - Box is the IQR
  - Line at the median
  - Whiskers at the smallest and largest observations
  - Other conventions can be used, especially to represent extreme values
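The quantities needed to draw a box plot by this construction can be computed with the standard library. The sketch below uses the tree ages from the earlier example; the function name `box_plot_summary` is an assumption, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`. Other quartile conventions exist (SAS offers several), so values may differ slightly between packages.

```python
import statistics

def box_plot_summary(values):
    """Return the smallest value, Q1, median, Q3, largest value, and IQR."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),    # lower whisker
        "q1": q1,              # bottom of the box
        "median": med,         # line inside the box
        "q3": q3,              # top of the box
        "max": max(values),    # upper whisker
        "iqr": q3 - q1,        # height of the box
    }

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
summary = box_plot_summary(ages)
print(summary)
```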
52. Box Plots
[Figure: box plots; axis label: Drug]