Title: Canadian Bioinformatics Workshops
1. Canadian Bioinformatics Workshops
2. Module 5: Informatics and Statistics for Metabolomics
- David Wishart
- June 16-17, 2011
4. Distributions & Significance
5. Univariate Statistics
6. Univariate Statistics
- Univariate means a single variable
- If you measure a population using some single measure such as height, weight, test score, or IQ, you are measuring a single variable
- If you plot that single variable over the whole population, measuring the frequency with which each value occurs, you will get the following
7. A Bell Curve
[Figure: frequency (% of each) vs. height]
Also called a Gaussian or Normal Distribution
8. Features of a Normal Distribution
- Symmetric distribution
- Has an average or mean value (μ) at the centre
- Has a characteristic width called the standard deviation (σ)
- Most common type of distribution known
9. Normal Distribution
- Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution
- The larger the set of measurements, the more normal the curve
- The minimum number of measurements needed to get a normal distribution is about 30-40
10. Gaussian Distribution
11. Some Equations
Mean:                μ = (Σ xᵢ) / N
Variance:            σ² = Σ (xᵢ − μ)² / N
Standard deviation:  σ = √[ Σ (xᵢ − μ)² / N ]
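The mean, variance, and standard deviation formulas above can be sketched with the Python standard library alone; the height values are made up for illustration.

```python
# Population mean, variance, and standard deviation, exactly as on the
# slide (denominator N, not N - 1). Data are hypothetical heights.
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

heights = [64, 66, 67, 68, 70]   # hypothetical heights (inches)
print(mean(heights))             # 67.0
print(variance(heights))         # 4.0
print(std_dev(heights))          # 2.0
```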
12. Standard Deviations (Z-values)
13. Significance
- Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is ~32%
- Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is ~5%
- Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is ~0.3%
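These tail probabilities can be checked directly from the standard normal CDF, which the Python standard library exposes via math.erf:

```python
# P(|Z| > k) for k = 1, 2, 3 standard deviations, using the identity
# CDF(z) = 0.5 * (1 + erf(z / sqrt(2))) for a standard normal variable.
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_tail(k):
    """P(|Z| > k): probability of being more than k SDs from the mean."""
    return 2.0 * (1.0 - normal_cdf(k))

for k in (1, 2, 3):
    print(k, round(two_sided_tail(k), 4))
# 1 -> 0.3173, 2 -> 0.0455, 3 -> 0.0027
```

These match the slide's rounded figures of ~32%, ~5%, and ~0.3%.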
14. Significance
- In a test with a class of 400 students, if you score the average you typically receive a C
- In a test with a class of 400 students, if you score 1 SD above the average you typically receive a B
- In a test with a class of 400 students, if you score 2 SD above the average you typically receive an A
15. The P-value
- The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed
- One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01
- When the null hypothesis is rejected, the result is said to be statistically significant
16. P-value
- If the average height of an adult (M+F) human is 5'7" and the standard deviation is 5", what is the probability of finding someone who is more than 6'10"?
- If you choose an α of 0.05, is a 6'11" individual a member of the human species?
- If you choose an α of 0.01, is a 6'11" individual a member of the human species?
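A minimal sketch of the height example, converting 5'7" and 6'10" to inches (our reading of the slide) and computing a one-tailed p-value from the normal CDF:

```python
# One-tailed p-value for observing someone taller than 6'10" (82")
# given mean 5'7" (67") and SD 5", per the slide's numbers.
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_in, sd_in = 67.0, 5.0       # 5'7" and 5" from the slide
height = 82.0                    # 6'10"
z = (height - mean_in) / sd_in   # 3.0 SDs above the mean
p = 1.0 - normal_cdf(z)          # one-tailed upper-tail probability
print(z, round(p, 5))            # 3.0 0.00135
```

Since p ≈ 0.0013 is below both 0.05 and 0.01, a literal reading of the test would reject the null hypothesis at either α.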
17. P-value
- If you flip a coin 20 times and the coin turns up heads 14/20 times, the probability that this (or a more extreme result) would occur is 60,460/1,048,576 ≈ 0.058
- If you choose an α of 0.05, is this coin a fair coin?
- If you choose an α of 0.10, is this coin a fair coin?
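The coin-flip probability can be reproduced exactly with binomial counts from the standard library:

```python
# One-tailed probability of 14 or more heads in 20 flips of a fair coin,
# via exact binomial coefficients.
from math import comb

n = 20
tail_count = sum(comb(n, k) for k in range(14, n + 1))  # ways to get >= 14 heads
p = tail_count / 2 ** n
print(tail_count, 2 ** n, round(p, 3))  # 60460 1048576 0.058
```

So the coin would not be rejected at α = 0.05 (0.058 > 0.05) but would be at α = 0.10.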
18. Mean, Median & Mode
[Figure: skewed distribution with the mode, median, and mean marked]
19. Mean, Median, Mode
- In a Normal Distribution the mean, mode and median are all equal
- In skewed distributions they are unequal
- Mean - the average value, affected by extreme values in the distribution
- Median - the middlemost value, usually half way between the mode and the mean
- Mode - the most common value
20. Different Distributions
[Figure: unimodal vs. bimodal distributions]
21. Other Distributions
- Binomial Distribution
- Poisson Distribution
- Extreme Value Distribution
- Skewed or Exponential Distribution
22. Binomial Distribution
Pascal's triangle (the binomial coefficients):
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
P(x) = C(n, x) pˣ qⁿ⁻ˣ, the terms of the expansion of (p + q)ⁿ
23. Poisson Distribution
24. Extreme Value Distribution
- Arises from sampling the extreme end of a normal distribution
- A distribution which is skewed due to its selective sampling
- Skew can be either right or left
[Figure: Gaussian distribution with its sampled extreme tail highlighted]
25. Skewed Distribution
- Resembles an exponential or Poisson-like distribution
- Lots of extreme values far from the mean or mode
- Hard to do useful statistical tests with this type of distribution
[Figure: skewed distribution with outliers labelled]
26. Fixing a Skewed Distribution
- A skewed or exponentially decaying distribution can be transformed into a normal (Gaussian) distribution by applying a log transformation
- This brings the outliers a little closer to the mean because it rescales the x-variable; it also makes the distribution much more Gaussian
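A small sketch of this effect, assuming lognormal-like data and a moment-based skewness measure (both choices are ours, not the slide's):

```python
# A right-skewed (lognormal) sample becomes much more symmetric after
# taking logs. Skewness is computed from its third-moment definition.
import math
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / sd) ** 3 for x in xs) / n

random.seed(42)
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]  # skewed data
logged = [math.log(x) for x in raw]                           # log transform

print(round(skewness(raw), 2))     # strongly positive (right skew)
print(round(skewness(logged), 2))  # near 0 (roughly symmetric)
```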
27. Log Transformation
[Figure: skewed distribution before, normal distribution after the log transformation]
28. Log Transformation on Real Data
29. Distinguishing 2 Populations
[Figure: two populations - Normals and Leprechauns]
30. The Result
[Figure: overlaid height histograms, % of each vs. height]
Are they different?
31. What about these 2 Populations?
32. The Result
[Figure: overlaid height histograms, % of each vs. height]
Are they different?
33. Student's t-Test
- Also called the t-Test
- Used to determine if 2 populations are different
- Formally, allows you to calculate the probability that 2 sample means are the same
- If the t-Test statistic gives you p = 0.4 and the α is 0.05, then the 2 populations are the same
- If the t-Test statistic gives you p = 0.04 and the α is 0.05, then the 2 populations are different
- Paired and unpaired t-Tests are available; paired is used for "before & after" experiments, while unpaired is for 2 randomly chosen samples
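A minimal sketch of the unpaired t statistic (pooled variance); converting t to a p-value additionally needs the t distribution (e.g. from scipy), which is omitted here. The height data are hypothetical:

```python
# Two-sample, pooled-variance (unpaired) t statistic, standard library only.
import math

def t_statistic(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # sample variances (n - 1 denominators)
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

normals = [64, 66, 67, 68, 70]      # hypothetical heights (inches)
leprechauns = [33, 35, 36, 37, 39]  # hypothetical heights (inches)
print(round(t_statistic(normals, leprechauns), 1))  # 21.9
```

A t this large corresponds to a vanishingly small p-value, so the two populations would be declared different.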
34. Student's t-Test
- A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution
[Figure: two clusters plotted against Variable 1 and Variable 2]
35. Distinguishing 3 Populations
[Figure: three populations - Normals, Leprechauns, and Elves]
36. The Result
[Figure: overlaid height histograms, % of each vs. height]
Are they different?
37. Distinguishing 3 Populations
38. The Result
[Figure: overlaid height histograms, % of each vs. height]
Are they different?
39. ANOVA
- Also called Analysis of Variance
- Used to determine if 3 or more populations are different; it is a generalization of the t-Test
- Formally, ANOVA provides a statistical test (by looking at group variance) of whether or not the means of several groups are all equal
- Uses an F-measure to test for significance
- 1-way, 2-way, 3-way and n-way ANOVAs exist; the most common is 1-way, which is only concerned with whether any of the populations are different, not which pair is different
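A minimal sketch of the 1-way ANOVA F statistic (between-group vs. within-group variance); the group data are hypothetical, and converting F to a p-value via the F distribution is omitted:

```python
# One-way ANOVA F statistic from its sums-of-squares definition,
# standard library only.
def f_statistic(groups):
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total observations
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

normals = [64, 66, 68]        # hypothetical heights (inches)
leprechauns = [33, 35, 37]
elves = [45, 47, 49]
print(f_statistic([normals, leprechauns, elves]))  # very large F
```

A large F means the between-group variance dwarfs the within-group variance, so at least one group mean differs.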
40. ANOVA
- ANOVA can also be used to determine whether 3 clusters are different if the clusters follow a normal distribution
[Figure: three clusters plotted against Variable 1 and Variable 2]
41. Normalization
42. Normalization
- What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)? We would get the following result
[Figure: shifted height histogram, % of each vs. height]
43. Normalization
- Normalization adjusts for systematic bias in the measurement tool
- After normalization we would get
[Figure: corrected height histogram, % of each vs. height]
44. Data Comparisons & Dependencies
45. Data Comparisons
- In many kinds of experiments we want to know what happened to a population "before" and "after" some treatment or intervention
- In other situations we want to measure the dependency of one variable against another
- In still others we want to assess how the observed property matches the predicted property
- In all cases we will measure multiple samples or work with a population of subjects
- The best way to view this kind of data is through a scatter plot
46. A Scatter Plot
47. Scatter Plots
- If there is some dependency between the two variables, or if there is a relationship between the predicted and observed variable, or if the "before" and "after" treatments led to some effect, then it is possible to see some clear patterns in the scatter plot
- This pattern or relationship is called correlation
48. Correlation
[Figure: scatter plots showing + correlation, uncorrelated, and − correlation]
49. Correlation
[Figure: scatter plots showing high, low, and perfect correlation]
50. Correlation Coefficient
[Figure: scatter plots with r = 0.85, r = 0.4, and r = 1.0]
51. Correlation Coefficient
- Sometimes called the coefficient of linear correlation or the Pearson product-moment correlation coefficient
- A quantitative way of determining what model (or equation or type of line) best fits a set of data
- Commonly used to assess most kinds of predictions, simulations, comparisons or dependencies
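The Pearson coefficient can be sketched directly from its definition; the weight/height pairs below are hypothetical:

```python
# Pearson product-moment correlation: covariance divided by the product
# of the standard deviations (the shared 1/n factors cancel).
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

weight = [120, 140, 155, 170, 190]  # hypothetical weights (lb)
height = [62, 65, 67, 69, 72]       # hypothetical heights (in)
print(pearson_r(weight, height))    # close to 1 for this data
```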
52. Student's t-Test (Again)
- The t-Test can also be used to assess the statistical significance of a correlation
- It specifically determines whether the slope of the regression line is statistically different from 0
53. Correlation and Outliers
- Experimental error or something important?
- A single bad point can destroy a good correlation
54. Outliers
- Can be both good and bad
- When modeling data you don't like to see outliers (suggests the model is bad)
- Often a good indicator of experimental or measurement errors - only you can know!
- When plotting metabolite concentration data you do like to see outliers
- A good indicator of something significant
55. Detecting Clusters
[Figure: scatter plot of height vs. weight]
56. Is it Right to Calculate a Correlation Coefficient?
[Figure: height vs. weight scatter plot with r = 0.73]
57. Or is There More to This?
[Figure: height vs. weight scatter plot with separate male and female clusters]
58. Clustering Applications in Bioinformatics
- Metabolomics and Cheminformatics
- Microarray or GeneChip Analysis
- 2D Gel or ProteinChip Analysis
- Protein Interaction Analysis
- Phylogenetic and Evolutionary Analysis
- Structural Classification of Proteins
- Protein Sequence Families
59. Clustering
- Definition: a process by which objects that are logically similar in characteristics are grouped together
- Clustering is different than classification
- In classification the objects are assigned to pre-defined classes; in clustering the classes are yet to be defined
- Clustering helps in classification
60. Clustering Requires...
- A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
- A threshold value with which to decide whether an object belongs with a cluster
- A way of measuring the distance between two clusters
- A cluster seed (an object to begin the clustering process)
61. Clustering Algorithms
- K-means or Partitioning Methods - divides a set of N objects into M clusters, with or without overlap
- Hierarchical Methods - produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
- Self-Organizing Feature Maps - produces a cluster set through iterative training
62. K-means or Partitioning Methods
- Make the first object the centroid for the first cluster
- For the next object, calculate the similarity to each existing centroid
- If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; else use the object to start a new cluster
- Return to step 2 and repeat until done
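The steps above can be sketched as a simple "leader"-style partitioning on 1-D points; the distance threshold and the data are hypothetical choices of ours:

```python
# Sequential partitioning as described on the slide: assign each point to
# the nearest existing centroid if it is within a threshold, otherwise
# seed a new cluster. Centroids are recomputed as cluster means.
def leader_cluster(points, threshold):
    clusters = []  # each cluster is a list of points
    for p in points:
        best = None
        best_dist = threshold
        for c in clusters:
            centroid = sum(c) / len(c)
            d = abs(p - centroid)
            if d <= best_dist:       # within threshold and closest so far
                best, best_dist = c, d
        if best is None:
            clusters.append([p])     # seed a new cluster
        else:
            best.append(p)           # join; centroid updates implicitly
    return clusters

points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.1]
print(leader_cluster(points, threshold=1.5))
# three clusters: one around 1, one around 5, and the lone 9.1
```

Note that the result depends on the order of the points and on the threshold, which is one reason this family of methods needs a well-chosen cluster seed.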
63. K-means or Partitioning Methods
[Figure: worked example - initial cluster, choose objects 1 and 2, test against the centroid, join, recompute the centroid]
64. Hierarchical Clustering
- Find the two closest objects and merge them into a cluster
- Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
- If more than one cluster remains, return to step 2 until finished
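The merging procedure can be sketched with single-linkage distances on 1-D points (the linkage choice and the data are ours, not the slide's):

```python
# Agglomerative clustering: repeatedly merge the two closest clusters
# (single-linkage distance) until one cluster remains, recording merges.
def hierarchical_cluster(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

print(hierarchical_cluster([1.0, 1.1, 5.0, 5.4, 9.0]))
# the first merge pairs 1.0 with 1.1 (the closest objects)
```

The recorded merge order is exactly what a dendrogram draws, from the leaves up to the root.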
65. Hierarchical Clustering
[Figure: worked example - initial pairwise comparison of clusters, select the closest pair, then the next closest]
66. Hierarchical Clustering
[Figure: dendrogram being built from metabolite expression curves (A-F), with the resulting heat map]
- Find the 2 most similar metabolite expression levels or curves
- Find the next closest pair of levels or curves
- Iterate
67. Multivariate Statistics
68. Multivariate Statistics
- Multivariate means multiple variables
- If you measure a population using multiple measures at the same time, such as height, weight, hair colour, clothing colour, eye colour, etc., you are performing multivariate statistics
- Multivariate statistics requires more complex, multidimensional analyses or dimensional reduction methods
69. A Typical Metabolomics Experiment
70. A Metabolomics Experiment
- Metabolomics experiments typically measure many metabolites at once; in other words the instruments are measuring multiple variables, and so metabolomic data are inherently multivariate data
- Metabolomics requires multivariate statistics
71. Multivariate Statistics - The Trick
- The key trick in multivariate statistics is to find a way that effectively reduces the multivariate data to univariate data
- Once done, you can apply the same univariate concepts, such as p-values, t-Tests and ANOVA tests, to the data
- The trick is dimensional reduction
72. Dimension Reduction - PCA
- PCA = Principal Component Analysis
- A process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
- Reduces 1000s of variables to 2-3 key features
[Figure: scores plot]
73. Principal Component Analysis
[Figure: hundreds of peaks reduced to 2 components in a scores plot]
- PCA captures what should be visually detectable
- If you can't see it, PCA probably won't help
74. Visualizing PCA
- PCA of a bagel
- One projection produces a wiener
- Another projection produces an "O"
- The "O" projection captures most of the variation and has the largest eigenvector (PC1)
- The wiener projection is PC2 and gives depth info
75. PCA - The Details
- PCA involves the calculation of the eigenvalue (singular value) decomposition of a data covariance matrix
- PCA is an orthogonal linear transformation
- PCA transforms data to a new coordinate system so that the greatest variance of the data comes to lie on the first coordinate (1st PC), the second greatest variance on the 2nd PC, etc.
[Diagram: data matrix with variables x1, x2, x3, ..., xn (columns) and samples s1, s2, s3, ..., sk (rows), decomposed into scores t1, t2, ..., tm (uncorrelated, orthogonal eigenvectors) and loadings p1, p2, ..., pk]
scores = loadings × data:  t1 = p1x1 + p2x2 + p3x3 + ... + pnxn
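A minimal PCA sketch on 2-D data, using the closed-form eigendecomposition of the 2×2 covariance matrix (standard library only; the sample points are hypothetical):

```python
# PCA on 2-D data: center, build the covariance matrix, take its largest
# eigenvalue/eigenvector (PC1), and project the data onto PC1 (scores).
import math

def pca_2d(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # covariance matrix entries
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # largest eigenvalue of [[cxx, cxy], [cxy, cyy]]
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # its eigenvector is the direction of greatest variance (PC1)
    vx, vy = cxy, lam1 - cxx
    norm = math.hypot(vx, vy) or 1.0   # guard against a zero vector
    vx, vy = vx / norm, vy / norm
    scores = [x * vx + y * vy for x, y in centered]  # PC1 scores
    return lam1, (vx, vy), scores

# points lying near the line y = x, so PC1 should point roughly along (1, 1)
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]
lam1, pc1, scores = pca_2d(data)
print(pc1)
```

For high-dimensional data the same idea applies; in practice one uses an SVD routine rather than a closed form.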
76. Visualizing PCA
- Airport data from the USA
- 5000 samples
- X1 - latitude
- X2 - longitude
- X3 - altitude
- What should you expect?
Data from Roy Goodacre (U of Manchester)
77. Visualizing PCA
PCA is equivalent to K-means clustering
78. K-means Clustering
[Figure: worked example - initial cluster, choose objects 1 and 2, test against the centroid, join, recompute the centroid]
79. PCA Clusters
- Once dimensional reduction has been achieved, you obtain clusters of data that are mostly normally distributed, with means and variances (in PCA space)
- It is possible to use t-Tests and ANOVA tests to determine if these clusters or their means are significantly different or not
80. PCA and ANOVA
- ANOVA can also be used to determine whether 3 clusters are different if the clusters follow a normal distribution
[Figure: three clusters plotted against PC 1 and PC 2]
81. PCA Plot Nomenclature
- PCA generates 2 kinds of plots, the scores plot and the loadings plot
- The scores plot (on the right) plots the data using the main principal components
82. PCA Loadings Plot
- The loadings plot shows how much each of the variables (metabolites) contributed to the different principal components
- Variables at the extreme corners contribute most to the scores plot separation
83. PCA Details/Advice
- In some cases PCA will not succeed in identifying any clear clusters or obvious groupings no matter how many components are used. If this is the case, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished
- As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, then it is probably not worthwhile using other statistical techniques to try to separate them
84. PCA - Q2 and R2
- The performance of a PCA model can be quantitatively evaluated in terms of an R2 and/or a Q2 value
- R2 is the correlation index and refers to the goodness of fit or the explained variation (range 0-1)
- Q2 refers to the predicted variation or quality of prediction (range 0-1)
- Typically Q2 and R2 track very closely together
85. PCA - R2
- R2 is a quantitative measure (with a maximum value of 1) that indicates how well the PCA model is able to mathematically reproduce the data in the data set
- A poorly fit model will have an R2 of 0.2 or 0.3, while a well-fit model will have an R2 of 0.7 or 0.8
86. PCA - Q2
- To guard against over-fitting, the value Q2 is commonly determined. Q2 is usually estimated by cross validation or permutation testing to assess the predictive ability of the model relative to the number of principal components used in the model
- Generally a Q2 > 0.5 is considered good, while a Q2 of 0.9 is outstanding
87. PCA vs. PLS-DA
- PLS-DA = Partial Least Squares Discriminant Analysis
- PLS-DA is a supervised classification technique while PCA is an unsupervised clustering technique
- PLS-DA uses "labeled" data while PCA uses no prior knowledge
- PLS-DA enhances the separation between groups of observations by rotating PCA components such that a maximum separation among classes is obtained
88. Other Supervised Classification Methods
- SIMCA - Soft Independent Modeling of Class Analogy
- OPLS - Orthogonal Partial Least Squares
- Support Vector Machines
- Random Forest
- Naïve Bayes Classifiers
- Neural Networks
89. Breaching the Data Barrier
- Unsupervised Methods: PCA, K-means clustering, Factor Analysis
- Supervised Methods: PLS-DA, LDA, PLS-Regression
- Machine Learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets
90. Data Analysis Progression
- Unsupervised Methods
- PCA or cluster to see if natural clusters form or if the data separates well
- Data is unlabeled (no prior knowledge)
- Supervised Methods/Machine Learning
- Data is labeled (prior knowledge)
- Used to see if data can be classified
- Helps separate less obvious clusters or features
- Statistical Significance
- Supervised methods always generate clusters - this can be very misleading
- Check if clusters are real by label permutation
91. Testing Significance
[Figure: workflow - PCA on the labelled data, then PLS-DA/SVM on the labelled data compared against PLS-DA/SVM on permuted data]
92. Note of Caution
- Supervised classification methods are powerful
- They learn from experience
- They generalize from previous examples
- They perform pattern recognition
- Too many people skip the PCA or clustering steps and jump straight to supervised methods
- Some get great separation and think the job is done - this is where the errors begin
- Too many don't assess significance using permutation testing or n-fold cross validation
- If separation isn't partially obvious by eye-balling your data, you may be treading on thin ice