Title: Canadian Bioinformatics Workshops
1. Canadian Bioinformatics Workshops
2. (No transcript)
3. Module 2: Clustering, Classification and Feature Selection
Sohrab Shah, Centre for Translational and Applied Genomics, Molecular Oncology, Breast Cancer Research Program, BC Cancer Agency. sshah@bccrc.ca
4. Module Overview
- Introduction to clustering
  - distance metrics
  - hierarchical, partitioning and model based clustering
- Introduction to classification
  - building a classifier
  - avoiding overfitting
  - cross validation
- Feature selection in clustering and classification
5. Introduction to clustering
- What is clustering?
  - unsupervised learning
  - discovery of patterns in data
  - class discovery
- Grouping together objects that are most similar (or least dissimilar)
  - objects may be genes, or samples, or both
- Example question: are there samples in my cohort that can be subgrouped based on molecular profiling?
  - Do these groups correlate with clinical outcome?
6. Distance metrics
- In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are (see the sketch below)
- Euclidean distance
- Manhattan distance
- 1 - correlation
  - proportional to Euclidean distance, but invariant to the range of measurement from one sample to the next
(Figure: colour scale from dissimilar to similar)
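A minimal sketch of computing these three distances in R; the matrix `x` is an illustrative toy example with samples in rows, not data from the module.

  # toy data: 4 samples (rows) x 5 features (columns)
  x <- matrix(rnorm(20), nrow = 4)

  d_euclid <- dist(x, method = "euclidean")   # Euclidean distance
  d_manhat <- dist(x, method = "manhattan")   # Manhattan (city-block) distance
  d_onecor <- as.dist(1 - cor(t(x)))          # 1 - Pearson correlation between samples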
7. Distance metrics compared
(Figure: the same data clustered under Euclidean, Manhattan and 1 - Pearson distances)
Conclusion: distance matters!
8. Other distance metrics
- Hamming distance for ordinal, binary or categorical data
9. Approaches to clustering
- Partitioning methods
  - K-means
  - K-medoids (partitioning around medoids)
- Model based approaches
- Hierarchical methods
  - nested clusters
  - start with pairs
  - build a tree up to the root
10. Partitioning methods
- Anatomy of a partitioning based method
  - data matrix
  - distance function
  - number of groups
- Output
  - group assignment of every object
11. Partitioning based methods
- Choose K groups
- Initialise group centers
  - aka centroid, medoid
- Assign each object to the nearest centroid according to the distance metric
- Reassign (or recompute) centroids
- Repeat the last two steps until the assignment stabilizes (a minimal sketch of this loop follows below)
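A from-scratch sketch of the assign/recompute loop for K-means in R; the function name `simple_kmeans` and its inputs are illustrative, not code from the module.

  # x: numeric matrix with one object per row; K: desired number of groups
  simple_kmeans <- function(x, K, max_iter = 100) {
    centroids <- x[sample(nrow(x), K), , drop = FALSE]    # initialise centers
    assignment <- rep(0, nrow(x))
    for (iter in seq_len(max_iter)) {
      # assign each object to the nearest centroid (Euclidean distance)
      d <- as.matrix(dist(rbind(centroids, x)))[-(1:K), 1:K]
      new_assignment <- apply(d, 1, which.min)
      if (all(new_assignment == assignment)) break        # assignment stabilised
      assignment <- new_assignment
      # recompute each centroid as the mean of its cluster
      for (k in 1:K) {
        if (any(assignment == k))
          centroids[k, ] <- colMeans(x[assignment == k, , drop = FALSE])
      }
    }
    list(cluster = assignment, centers = centroids)
  }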
12. K-medoids in action
13. K-means vs K-medoids
K-means:
- Centroids are the mean of the clusters
- Centroids need to be recomputed every iteration
- Initialisation is difficult, as the notion of a centroid may be unclear before beginning
- R function: kmeans()
K-medoids:
- Centroids are actual objects that minimize the total within-cluster distance
- Centroids can be determined by a quick look-up into the distance matrix
- Initialisation is simply K randomly selected objects
- R function: pam()
(A brief usage sketch of both follows below.)
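A short sketch of running both in R, assuming the cluster package is installed; `x` is a toy matrix, not data from the module.

  library(cluster)                            # provides pam()
  x <- matrix(rnorm(200), nrow = 50)

  km <- kmeans(x, centers = 3, nstart = 10)   # K-means with 10 random restarts
  pm <- pam(dist(x), k = 3)                   # K-medoids on a distance matrix

  table(km$cluster, pm$clustering)            # compare the two group assignments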
14. Partitioning based methods
Advantages:
- Number of groups is well defined
- A clear, deterministic assignment of an object to a group
- Simple algorithms for inference
Disadvantages:
- Have to choose the number of groups
- Sometimes objects do not fit well into any cluster
- Can converge on locally optimal solutions, and often require multiple restarts with random initializations
15. Agglomerative hierarchical clustering
16. Hierarchical clustering
- Anatomy of hierarchical clustering
  - distance matrix
  - linkage method
- Output
  - dendrogram: a tree that defines the relationships between objects and the distance between clusters
  - a nested sequence of clusters
17. Linkage methods
(Figure illustrating four linkage criteria)
- single
- complete
- distance between centroids
- average
18. Linkage methods
- Ward (1963)
  - form partitions that minimize the loss associated with each grouping
  - loss defined as the error sum of squares (ESS)
- Consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0). Treated as a single group, the mean is 2.5 and
  ESS_onegroup = (2 - 2.5)^2 + (6 - 2.5)^2 + ... + (0 - 2.5)^2 = 50.5
  On the other hand, if the 10 objects are classified according to their scores into the four sets {0, 0, 0}, {2, 2, 2, 2}, {5}, {6, 6}, the ESS can be evaluated as the sum of four separate error sums of squares:
  ESS_fourgroups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.0
  Thus, clustering the 10 scores into 4 clusters results in no loss of information (verified in the sketch below).
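A quick check of these numbers in R; the helper `ess` is just for illustration.

  ess <- function(x) sum((x - mean(x))^2)   # error sum of squares of one group

  scores <- c(2, 6, 5, 6, 2, 2, 2, 0, 0, 0)
  ess(scores)                                                    # one group: 50.5
  ess(c(0, 0, 0)) + ess(c(2, 2, 2, 2)) + ess(5) + ess(c(6, 6))   # four groups: 0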
19. Linkage methods in action
- clustering based on single linkage
  single <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "single")
  plot(single)
20. Linkage methods in action
- clustering based on complete linkage
  complete <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "complete")
  plot(complete)
21. Linkage methods in action
- clustering based on centroid linkage
  centroid <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "centroid")
  plot(centroid)
22. Linkage methods in action
- clustering based on average linkage
  average <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "average")
  plot(average)
23. Linkage methods in action
- clustering based on Ward linkage
  ward <- hclust(dist(t(exprMatSub), method = "euclidean"), method = "ward")   # named "ward.D" in recent R versions
  plot(ward)
24. Linkage methods in action
Conclusion: linkage matters!
25. Hierarchical clustering analyzed
Advantages:
- There may be small clusters nested inside large ones
- No need to specify the number of groups ahead of time
- Flexible linkage methods
Disadvantages:
- Clusters might not be naturally represented by a hierarchical structure
- It is necessary to cut the dendrogram in order to produce clusters (see the cutree() sketch below)
- Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be undone
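A brief sketch of cutting a dendrogram into discrete clusters, reusing the `complete` object fit on the earlier slides; the cut points are illustrative.

  groups <- cutree(complete, k = 4)   # cut the tree into 4 clusters
  table(groups)                       # number of samples per cluster

  # alternatively, cut at a fixed height on the dendrogram
  groups_h <- cutree(complete, h = 10)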
26. Model based approaches
- Assume the data are generated from a mixture of K distributions
- What cluster assignment and parameters of the K distributions best explain the data?
- Fit a model to the data
  - try to get the best fit
- Classical example: mixture of Gaussians (mixture of normals); a brief sketch follows below
- Take advantage of probability theory and well-defined distributions in statistics
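A hedged sketch of fitting a mixture of Gaussians in R with the mclust package; this is an illustration, not the method used in the module, and `x` is a toy matrix.

  library(mclust)                      # Gaussian mixture model clustering

  x <- matrix(rnorm(200), ncol = 2)    # toy data: 100 objects, 2 features
  fit <- Mclust(x, G = 1:6)            # fit mixtures with 1 to 6 components

  summary(fit)                         # chosen number of groups and covariance model
  head(fit$classification)             # hard cluster assignment per object
  plot(fit, what = "BIC")              # BIC across models: model selection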
27. Model based clustering of array CGH
28. Model based clustering of aCGH
- Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect
- Approach: cluster the data by extending the profiling to the multi-group setting
  - a mixture of HMMs: HMM-Mix
(Figure: raw data, CNA calls, and the distribution of calls in a group)
Shah et al (Bioinformatics, 2009)
29. Advantages of model based approaches
- In addition to clustering patients into groups, we output a model that best represents the patients in a group
- We can then associate each model with clinical variables and simply output a classifier to be used on new patients
- Choosing the number of groups becomes a model selection problem (e.g. via the Bayesian Information Criterion)
  - see Yeung et al, Bioinformatics (2001)
30. Clustering 106 follicular lymphoma patients with HMM-Mix
(Figure: cluster assignments at initialisation and after convergence, with clinical annotation)
- Recapitulates known FL subgroups
- Subgroups have clinical relevance
31. Feature selection
- Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative
  - examples: unexpressed genes, housekeeping genes, passenger alterations
- Clustering (and classification) has a much higher chance of success if uninformative features are removed
- Simple approaches (a brief sketch follows below)
  - select intrinsically variable genes
  - require a minimum level of expression in a proportion of samples
  - genefilter package (Bioconductor): Lab 1
- Return to feature selection in the context of classification
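A hedged sketch of these simple filters using the Bioconductor genefilter package; `exprMat` (genes in rows, samples in columns) and the thresholds are illustrative assumptions.

  library(genefilter)

  f1 <- pOverA(p = 0.25, A = 100)     # expression above 100 in at least 25% of samples
  f2 <- function(x) IQR(x) > 0.5      # keep intrinsically variable genes
  ff <- filterfun(f1, f2)             # combine the two filters

  keep <- genefilter(exprMat, ff)     # logical vector, one entry per gene
  exprMatSub <- exprMat[keep, ]       # filtered matrix used downstream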
32. Advanced topics in clustering
- Top down clustering
- Bi-clustering or two-way clustering
- Principal components analysis
- Choosing the number of groups (a brief sketch follows below)
  - model selection: AIC, BIC
  - silhouette coefficient
  - the gap curve
- Joint clustering and feature selection
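A sketch of two of these criteria in R with the cluster package; `x` is a toy matrix and the parameter choices are illustrative.

  library(cluster)

  x <- matrix(rnorm(200), ncol = 2)

  # average silhouette width for a K-medoids solution with k = 3
  pm <- pam(x, k = 3)
  summary(silhouette(pm))$avg.width

  # gap statistic for k = 1..8 (B reference data sets)
  gap <- clusGap(x, FUNcluster = kmeans, nstart = 20, K.max = 8, B = 50)
  plot(gap)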
33. What Have We Learned?
- There are three main types of clustering approaches
  - hierarchical
  - partitioning
  - model based
- Feature selection is important
  - reduces computational time
  - more likely to identify well-separated groups
- The distance metric matters
- The linkage method matters in hierarchical clustering
- Model based approaches offer principled probabilistic methods
34. Module Overview
- Clustering
- Classification
- Feature Selection
35. Classification
- What is classification?
  - supervised learning
  - discriminant analysis
- Work from a set of objects with predefined classes
  - e.g. basal vs luminal, or good responder vs poor responder
- Task: learn, from the features of the objects, what the basis for discrimination is
- Statistically and mathematically heavy
36. Classification
(Figure: samples labelled as poor response and good response)
37. Example: DLBCL subtypes
Wright et al, PNAS (2003)
38. DLBCL subtypes
Wright et al, PNAS (2003)
39. Classification approaches
- Wright et al, PNAS (2003)
  - weighted features combined in a linear predictor score (LPS), LPS(X) = sum_j a_j * X_j
  - a_j: weight of gene j, determined by its t-test statistic
  - X_j: expression value of gene j
- Assume there are 2 distinct distributions of LPS: 1 for ABC, 1 for GCB
40. Wright et al, DLBCL, cont'd
- Use Bayes' rule to determine the probability that a sample comes from group 1, using a probability density function that represents group 1 (a sketch follows below)
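A hedged sketch of the idea in R; the weights, expression values and the Gaussian group densities are made-up values for illustration.

  # linear predictor score: weighted sum of gene expression values
  a <- c(1.2, -0.8, 0.5)                    # gene weights (e.g. t-statistics)
  x <- c(2.1, 0.3, 1.7)                     # expression values for one sample
  lps <- sum(a * x)

  # Bayes rule with Gaussian densities fitted to the LPS of each training group
  p1 <- dnorm(lps, mean = 2.0, sd = 1.0)    # density under group 1 (e.g. ABC)
  p2 <- dnorm(lps, mean = -1.5, sd = 1.2)   # density under group 2 (e.g. GCB)
  prob_group1 <- p1 / (p1 + p2)             # posterior probability of group 1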
41. Learning the classifier, Wright et al
- Choosing the genes (feature selection)
  - use cross validation
- Leave-one-out cross validation (a sketch follows below)
  - pick a set of samples
  - use all but one of the samples for training, leaving one out for testing
  - fit the model using the training data
  - can the classifier correctly pick the class of the remaining case?
  - repeat exhaustively, leaving out each sample in turn
  - repeat using different sets and numbers of genes based on the t-statistic
  - pick the set of genes that gives the highest accuracy
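A minimal sketch of leave-one-out cross validation in R; the nearest-centroid classifier, `exprMat` (genes in rows, samples in columns) and `labels` (a factor of known classes) are illustrative assumptions, not the classifier of Wright et al.

  loocv_accuracy <- function(exprMat, labels) {
    n <- ncol(exprMat)
    correct <- logical(n)
    for (i in seq_len(n)) {
      train <- exprMat[, -i, drop = FALSE]           # all samples but one
      test  <- exprMat[, i]                          # the held-out sample
      # class centroids estimated from the training data only
      centroids <- sapply(levels(labels), function(cl)
        rowMeans(train[, labels[-i] == cl, drop = FALSE]))
      pred <- levels(labels)[which.min(colSums((centroids - test)^2))]
      correct[i] <- (pred == labels[i])
    }
    mean(correct)                                    # LOOCV accuracy
  }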
42. Overfitting
- In many cases in biology, the number of features is much larger than the number of samples
- Important features may not be represented in the training data
- This can result in overfitting
  - when a classifier discriminates well on its training data, but does not generalise to orthogonally derived data sets
- Validation in at least one external cohort is required to believe the results
  - example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets
43. Overfitting
- To reduce the problem of overfitting, one can use Bayesian priors to regularize the parameter estimates of the model
- Some methods now integrate feature selection and classification in a unified analytical framework
  - see Law et al, IEEE (2005), Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
- Cross validation should always be used in training a classifier
44. Evaluating a classifier
- The receiver operating characteristic (ROC) curve
  - plots the true positive rate vs the false positive rate
- Given ground truth and a probabilistic classifier (a sketch follows below)
  - for some number of probability thresholds
    - compute the TPR: the proportion of positives that were predicted as positive
    - compute the FPR: the proportion of negatives that were incorrectly predicted as positive
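A minimal sketch of computing ROC points from ground truth and predicted probabilities; the labels and probabilities are made-up toy values.

  truth <- c(1, 1, 0, 1, 0, 0, 1, 0)                    # 1 = positive, 0 = negative
  prob  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1)   # classifier probabilities

  thresholds <- sort(unique(prob), decreasing = TRUE)
  roc <- t(sapply(thresholds, function(th) {
    pred <- as.numeric(prob >= th)
    c(TPR = sum(pred == 1 & truth == 1) / sum(truth == 1),   # true positive rate
      FPR = sum(pred == 1 & truth == 0) / sum(truth == 0))   # false positive rate
  }))
  plot(roc[, "FPR"], roc[, "TPR"], type = "b",
       xlab = "False positive rate", ylab = "True positive rate")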
45. Other methods for classification
- Support vector machines
- Linear discriminant analysis
- Logistic regression
- Random forests
- See:
  - Ma and Huang, Briefings in Bioinformatics (2008)
  - Saeys et al, Bioinformatics (2007)
46. Questions?
47. Lab: Clustering and feature selection
- Get familiar with clustering tools and plotting
- Feature selection methods
- Distance matrices
- Linkage methods
- Partition methods
- Try to reproduce some of the figures from Chin et al using the freely available data
48. Module 2 Lab
- Coffee break
- Back at 15:00