Title: MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
1. MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
- Elena Marchiori
- IBIVU
- Vrije Universiteit Amsterdam
2. Summary
- Machine Learning
- Supervised Learning: classification
- Unsupervised Learning: clustering
3. Machine Learning (ML)
- Construct a computational model from a dataset describing properties of an unknown (but existent) system.
[Diagram: the unknown system produces observations of its properties; ML builds a computational model from them, which is then used for prediction.]
4. Supervised Learning
- The dataset describes examples of input-output behaviour of an unknown (but existent) system.
- The algorithm tries to find a function equivalent to the system.
- ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.
5. Supervised Learning
[Diagram: a supervisor labels observations of the unknown system with the property of interest, producing training data; the ML algorithm builds a model that predicts the property for a new observation.]
6. Example: A Classification Problem
- Categorize images of fish, say Atlantic salmon vs. Pacific salmon.
- Use features such as length, width, lightness, fin shape and number, mouth position, etc.
- Steps:
  - Preprocessing (e.g., background subtraction)
  - Feature extraction
  - Classification
Example from Duda and Hart
7. Classification in Bioinformatics
- Computational diagnostics: early cancer detection
- Tumor biomarker discovery
- Protein folding prediction
- Protein-protein binding sites prediction
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
8. Classification Techniques
- Naïve Bayes
- K Nearest Neighbour
- Support Vector Machines (next lesson)
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
9. Bayesian Approach
- Each observed training example can incrementally decrease or increase the probability of a hypothesis, instead of eliminating the hypothesis.
- Prior knowledge can be combined with observed data to determine a hypothesis.
- Bayesian methods can accommodate hypotheses that make probabilistic predictions.
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
From Kathleen McKeown's slides
10. Bayesian Approach
- Assign the most probable target value, given the attribute values <a1, a2, ..., an>:
  v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)
- Using Bayes' theorem:
  v_MAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
        = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)
- Bayesian learning is optimal.
- It is easy to estimate P(vj) by counting in the training data.
- Estimating the different P(a1, a2, ..., an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes).
From Kathleen McKeown's slides
11. Bayes Rules
- Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
- Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
- In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
From Kathleen McKeown's slides
12. Naïve Bayes
- Assume independence of the attributes:
  P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)
- Substitute into the v_MAP formula:
  v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
From Kathleen McKeown's slides
13. v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)

    #   S-length  S-width  P-length  Class
    1   high      high     high      Versicolour
    2   low       high     low       Setosa
    3   low       high     low       Virginica
    4   low       high     med       Virginica
    5   high      high     high      Versicolour
    6   high      high     med       Setosa
    7   high      high     low       Setosa
    8   high      high     high      Versicolour
    9   high      high     high      Versicolour

From Kathleen McKeown's slides
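To make the v_NB formula concrete, here is a minimal Python sketch that estimates P(vj) and P(ai | vj) by counting on the toy table above and classifies a new flower. The table values are taken from the slide; the function and variable names are illustrative, not part of the original material.

from collections import Counter, defaultdict

# Toy training set from the table above: (S-length, S-width, P-length, class)
data = [
    ("high", "high", "high", "Versicolour"),
    ("low",  "high", "low",  "Setosa"),
    ("low",  "high", "low",  "Virginica"),
    ("low",  "high", "med",  "Virginica"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "med",  "Setosa"),
    ("high", "high", "low",  "Setosa"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "high", "Versicolour"),
]

def train_naive_bayes(rows):
    """Estimate P(class) and P(attribute=value | class) by counting."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)      # (attr_index, class) -> value counts
    for row in rows:
        *attrs, label = row
        for i, value in enumerate(attrs):
            cond_counts[(i, label)][value] += 1
    priors = {c: k / len(rows) for c, k in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    """Return argmax_c P(c) * prod_i P(a_i | c) over the classes."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, value in enumerate(x):
            p *= cond_counts[(i, c)][value] / class_counts[c]
        scores[c] = p
    return max(scores, key=scores.get), scores

priors, cond, counts = train_naive_bayes(data)
print(predict(("high", "high", "med"), priors, cond, counts))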
14. Estimating Probabilities
- What happens when the number of data elements is small?
- Suppose the true P(S-length=low | Virginica) = 0.05.
- There are only 2 instances with C = Virginica.
- We estimate the probability by nc/n using the training set.
- Then the estimate of P(S-length=low | Virginica) may well be 0.
- So, instead of 0.05, we use an estimated probability of 0.
- Two problems:
  - Biased underestimate of the probability.
  - This probability term will dominate if a future query contains S-length=low.
From Kathleen McKeown's slides
15. Instead use the m-estimate
- Use priors as well:
  (nc + m·p) / (n + m)
- Where p = prior estimate of P(S-length=low | Virginica).
- m is a constant called the equivalent sample size.
  - It determines how heavily to weight p relative to the observed data.
- Typical method: assume a uniform prior over the attribute values (e.g. if the values are low, med, high -> p = 1/3).
From Kathleen McKeown's slides
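A minimal sketch of the m-estimate, assuming the uniform prior p = 1/3 from the slide; the choice m = 3 is illustrative.

def m_estimate(n_c, n, p, m):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Hypothetical case from the slide: only n = 2 Virginica instances observed,
# none of them with S-length = low, so the raw estimate n_c/n would be 0/2 = 0.
p_uniform = 1 / 3                            # uniform prior over {low, med, high}
print(m_estimate(0, 2, p_uniform, m=3))      # pulled towards the prior instead of 0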
16. K-Nearest Neighbour
- Memorize the training data.
- Given a new example, find its k nearest neighbours and output the majority-vote class.
- Choices:
  - How many neighbours?
  - What distance measure?
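A minimal k-NN sketch in Python. It assumes numeric features and Euclidean distance; those are exactly the two choices listed above, and the tiny dataset is made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance
    neighbours = np.argsort(dists)[:k]                # indices of the k closest
    votes = Counter(y_train[i] for i in neighbours)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset (two classes in 2-D)
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, np.array([0.8, 0.9]), k=3))   # -> "B"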
17. Application in Bioinformatics
- A regression-based K-nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7.
- For each dataset k and for each pair of genes p, compute the similarity f_k(p) of p with respect to the k-th dataset.
- Construct a predictor of gene-pair similarity, e.g. logistic regression:
  H: f(p,1), ..., f(p,m) -> H(f(p,1), ..., f(p,m)), such that H has a high value if the genes of p have similar functions.
- Given a new gene g, find its kNN using H as the distance.
- Predict the functional classes C1, ..., Cn of g with confidence
  Confidence(Ci) = 1 - ∏_j (1 - Pij), with gj a neighbour of g and Ci in the set of classes of gj
  (the probability that at least one prediction is correct, that is, 1 minus the probability that all predictions are wrong).
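The confidence combination at the end of the slide is a "noisy-OR" over the neighbours; a small sketch follows, where the probabilities P_ij are made-up numbers for illustration.

def confidence(p_ij):
    """1 - prod_j (1 - P_ij): probability that at least one neighbour's
    annotation with class C_i gives a correct prediction."""
    prob_all_wrong = 1.0
    for p in p_ij:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong

# Three neighbours annotated with class C_i, with hypothetical probabilities
print(confidence([0.6, 0.5, 0.3]))   # 1 - 0.4*0.5*0.7 = 0.86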
18. Classification: CV error
- Training error
  - Empirical error
- Error on an independent test set
  - Test error
- Cross validation (CV) error
  - Leave-one-out (LOO)
  - n-fold CV
[Diagram: the N samples are repeatedly split, with 1/n of the samples used for testing and (n-1)/n for training; the errors are counted over the splits and summarized as the CV error rate.]
Supervised learning
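A sketch of leave-one-out and n-fold CV error estimation. The use of scikit-learn and of a 3-NN classifier on synthetic data is an assumption for illustration, not part of the original slides.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score

X, y = make_classification(n_samples=40, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

# Leave-one-out: N splits, one sample left out for testing each time
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOO error:", 1 - loo_acc.mean())

# n-fold CV: 1/n of the samples for testing, (n-1)/n for training, per fold
kfold_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold CV error:", 1 - kfold_acc.mean())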
19. Two schemes of cross validation
[Diagram, for N samples:
 CV1 - within each LOO fold, perform gene selection, then train and test the gene-selector and the classifier; count the errors over the folds.
 CV2 - perform gene selection once on all samples, then use LOO to train and test only the classifier; count the errors over the folds.]
Supervised learning
20. Difference between CV1 and CV2
- CV1: gene selection within LOOCV.
- CV2: gene selection before LOOCV.
- CV2 can yield an optimistic estimate of the true classification error.
- CV2 was used in the paper by Golub et al.:
  - 0 training errors
  - 2 CV errors (5.26%)
  - 5 test errors (14.7%)
  - CV error different from test error!
Supervised learning
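A sketch of the difference between the two schemes on random (uninformative) data: CV1 puts gene selection inside the LOO loop via a pipeline, while CV2 selects genes once using all samples before cross-validation and therefore looks optimistically good even though the true error is ~50%. The scikit-learn components and parameter values are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))          # 40 samples, 1000 random "genes"
y = rng.integers(0, 2, size=40)          # random labels: true error is ~50%

loo = LeaveOneOut()
clf = KNeighborsClassifier(n_neighbors=3)

# CV1: gene selection is refitted inside every LOO training fold
cv1 = cross_val_score(
    Pipeline([("select", SelectKBest(f_classif, k=20)), ("knn", clf)]),
    X, y, cv=loo)

# CV2: gene selection uses ALL samples (including each test sample) -> optimistic
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
cv2 = cross_val_score(clf, X_selected, y, cv=loo)

print("CV1 error (honest)    :", 1 - cv1.mean())
print("CV2 error (optimistic):", 1 - cv2.mean())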
21. Significance of classification results
- Permutation test:
  - Permute the class labels of the samples.
  - Compute the LOOCV error on the data with permuted labels.
  - Repeat the process a large number of times.
  - Compare with the LOOCV error on the original data.
  - P-value = (# times the LOOCV error on permuted data < the LOOCV error on the original data) / total number of permutations considered.
Supervised learning
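A sketch of the permutation test described above. The 3-NN classifier inside `loocv_error` is a stand-in for any classifier, the synthetic data are illustrative, and the comparison uses "at least as good" (<=), a common convention; the slide states a strict <.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loocv_error(X, y):
    """LOOCV error of a 3-NN classifier (stand-in for any classifier)."""
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
    return 1 - acc.mean()

def permutation_p_value(X, y, n_permutations=100, seed=0):
    rng = np.random.default_rng(seed)
    observed = loocv_error(X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)              # permute the class labels
        if loocv_error(X, y_perm) <= observed:   # permuted result at least as good
            count += 1
    return count / n_permutations

# Usage: the p-value should be small if X really separates the two classes
X = np.vstack([np.random.randn(15, 5) + 1, np.random.randn(15, 5) - 1])
y = np.array([0] * 15 + [1] * 15)
print(permutation_p_value(X, y, n_permutations=50))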
22. Unsupervised Learning
- ML for unsupervised learning attempts to discover interesting structure in the available data.
Unsupervised learning
23. Unsupervised Learning
- The dataset describes the structure of an unknown (but existent) system.
- The computer program tries to identify the structure of the system (clustering, data compression).
- ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).
24. Clustering
- Clustering is one of the most important unsupervised learning processes for organizing objects into groups whose members are similar in some way.
- Clustering finds structure in a collection of unlabeled data.
- A cluster is a collection of objects which are similar to each other and dissimilar to the objects belonging to other clusters.
25. Clustering Algorithms
- Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, ..., n.
- The goal is to associate the n objects to k clusters so that objects within a cluster are more similar than objects in different clusters. k is usually unknown.
- Popular methods: hierarchical, k-means, SOM, ...
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
26. Hierarchical Clustering
[Figure: Venn diagram of clustered data and the corresponding dendrogram.]
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
27. Hierarchical Clustering (Cont.)
- Multilevel clustering: level 1 has n clusters -> level n has one cluster.
- Agglomerative HC starts with singletons and merges clusters.
- Divisive HC starts with one cluster containing all samples and splits clusters.
28. Nearest Neighbor Algorithm
- The Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
- It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
29. Nearest Neighbor, Level 2, k = 7 clusters.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
30. Nearest Neighbor, Level 3, k = 6 clusters.
31. Nearest Neighbor, Level 4, k = 5 clusters.
32. Nearest Neighbor, Level 5, k = 4 clusters.
33. Nearest Neighbor, Level 6, k = 3 clusters.
34. Nearest Neighbor, Level 7, k = 2 clusters.
35. Nearest Neighbor, Level 8, k = 1 cluster.
36. Hierarchical Clustering
- Keys: similarity and clustering.
- Calculate the similarity between all possible combinations of two profiles.
- The two most similar clusters are grouped together to form a new cluster.
- Calculate the similarity between the new cluster and all remaining clusters, and repeat.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
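A sketch of the loop described above using scipy's agglomerative clustering; scipy and the random stand-in profiles are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = rng.normal(size=(12, 6))     # 12 expression profiles, 6 conditions

# linkage() performs exactly the loop above: compute pairwise similarities,
# merge the two most similar clusters, recompute similarities, and so on.
Z = linkage(profiles, method="average", metric="correlation")

# Cut the resulting dendrogram into, e.g., 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)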
37. Clustering in Bioinformatics
- Microarray data quality checking:
  - Do replicates cluster together?
  - Do similar conditions, time points, tissue types cluster together?
- Cluster genes -> prediction of the functions of unknown genes from known ones.
- Cluster samples -> discover clinical characteristics (e.g. survival, marker status) shared by samples.
- Promoter analysis of commonly regulated genes.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
38. Functionally significant gene clusters
- Two-way clustering: sample clusters and gene clusters.
39. Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.
40. Similarity Measurements
- Pearson correlation between two profiles (vectors) X = (x1, ..., xN) and Y = (y1, ..., yN):
  r(X, Y) = Σi (xi - mean(X)) (yi - mean(Y)) / sqrt( Σi (xi - mean(X))² · Σi (yi - mean(Y))² )
- -1 ≤ Pearson correlation ≤ 1
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
41. Similarity Measurements
- Pearson correlation: trend similarity
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
42. Similarity Measurements
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
43. Similarity Measurements
- Euclidean distance: absolute difference
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
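A small numpy sketch of the two similarity measures just discussed, Pearson correlation and Euclidean distance, on two made-up profiles.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 4.0, 5.0])      # same trend as x, shifted by +1

pearson = np.corrcoef(x, y)[0, 1]       # trend similarity: 1.0 here
euclid = np.linalg.norm(x - y)          # absolute difference: 2.0 here
print(pearson, euclid)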
44. Clustering
[Figure: three clusters C1, C2, C3. Merge which pair of clusters?]
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
45. Clustering: Single Linkage
- Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
- Tends to generate long chains.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
46. Clustering: Complete Linkage
- Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
- Tends to generate compact clumps.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
47. Clustering: Average Linkage
- Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
48. Clustering: Average Group Linkage
- Dissimilarity between two clusters = distance between the two cluster means.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
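The linkage rules above differ only in how the cluster-to-cluster dissimilarity is computed; with scipy they are just different `method` arguments. A sketch, with random data standing in for real profiles, and with scipy's "centroid" method used as a stand-in for average group linkage (distance between cluster means).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                 # centroid ~ average group linkage
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(method, labels)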
49. Considerations
- What genes are used to cluster samples?
- Expression variation
- Inherent variation
- Prior knowledge (irrelevant genes)
- Etc.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
50. K-means Clustering
- Initialize the K cluster representatives w_1, ..., w_K, e.g. to randomly chosen examples.
- Assign each input example x to the cluster c(x) with the nearest corresponding weight vector.
- Update the weights.
- Increment n by 1 and repeat until no noticeable changes of the cluster representatives occur.
Unsupervised learning
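A minimal numpy sketch of the k-means loop just described; the data, K and the stopping tolerance are illustrative assumptions.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize the K cluster representatives to randomly chosen examples
    w = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # 2. Assign each example to the cluster with the nearest representative
        dists = np.linalg.norm(X[:, None, :] - w[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # 3. Update the representatives as the means of their assigned examples
        new_w = np.array([X[c == k].mean(axis=0) if np.any(c == k) else w[k]
                          for k in range(K)])
        if np.allclose(new_w, w):        # 4. Stop when nothing noticeably changes
            break
        w = new_w
    return w, c

X = np.vstack([np.random.randn(20, 2) + 3, np.random.randn(20, 2) - 3])
print(kmeans(X, K=2)[0])                 # the two cluster representatives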
51. Example I
[Figure: initial data and seeds, and the final clustering.]
Unsupervised learning
52. Example II
[Figure: initial data and seeds, and the final clustering.]
Unsupervised learning
53. SOM: the brain's self-organization
- The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation. That is, the brain processes the external signals in a topology-preserving way.
- Mimicking the way the brain learns, our clustering system should be able to do the same thing.
Unsupervised learning
54. Self-Organized Map: idea
- Data vectors X^T = (X1, ..., Xd) from a d-dimensional space.
- A grid of nodes, with a local processor (called a neuron) in each node.
- Local processor j has d adaptive parameters W(j).
- Goal: change the W(j) parameters to recover the data clusters in X space.
Unsupervised learning
55. Training process
- Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Unsupervised learning
56. Concept of the SOM
[Figure: input space and reduced feature space. Cluster centers (code vectors, labelled e.g. Ba, Mn, Sr, s1, s2) in the input space are mapped to positions in the reduced space; the cluster centers are clustered and ordered on a two-dimensional grid.]
Unsupervised learning
57. Concept of the SOM
- We can use it for visualization.
- We can use it for classification.
- We can use it for clustering.
[Figure: SOM grid with labelled code vectors, e.g. Ba, Mn, Sr, Mg, SA3.]
Unsupervised learning
58. SOM learning algorithm
- Initialization: n = 0. Choose random small values for the weight vector components.
- Sampling: select an x from the input examples.
- Similarity matching: find the winning neuron i(x) at iteration n.
- Updating: adjust the weight vectors of all neurons using the update rule.
- Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.
Unsupervised learning
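A minimal SOM training sketch in numpy following the five steps above, with a Gaussian neighborhood on a 1-D lattice; the grid size, learning rate and neighborhood width are illustrative assumptions.

import numpy as np

def train_som(X, n_neurons=10, n_iter=2000, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Initialization: small random weights; neurons sit on a 1-D lattice 0..n_neurons-1
    W = rng.normal(scale=0.1, size=(n_neurons, d))
    positions = np.arange(n_neurons)
    for n in range(n_iter):
        x = X[rng.integers(len(X))]                        # Sampling
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # Similarity matching
        # Gaussian neighborhood around the winner (shrinks over time)
        sig = sigma * (1 - n / n_iter) + 1e-3
        h = np.exp(-(positions - winner) ** 2 / (2 * sig ** 2))
        # Updating: move every neuron towards x, weighted by its neighborhood value
        W += lr * (1 - n / n_iter) * h[:, None] * (x - W)
    return W

X = np.vstack([np.random.randn(50, 3) + 2, np.random.randn(50, 3) - 2])
print(train_som(X).round(2))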
59. Neighborhood Function
- Gaussian neighborhood function: h_{j,i}(n) = exp( -d_{j,i}² / (2 σ(n)²) )
- d_{j,i} = lateral distance of neurons i and j:
  - in a 1-dimensional lattice: |j - i|
  - in a 2-dimensional lattice: ||r_j - r_i||, where r_j is the position of neuron j in the lattice.
Unsupervised learning
60. Initial h function (example)
Unsupervised learning
61. Some examples of real-life applications
- Helsinki University of Technology web site: http://www.cis.hut.fi/research/refs/
  - Contains > 5000 papers on SOM and its applications.
- Brain research: modeling the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
- Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economical data, business and financial data, ...
- Data compression (images and audio), information filtering.
- Medical and technical diagnostics.
Unsupervised learning
62. Issues in Clustering
- How many clusters?
  - User parameter.
  - Use model selection criteria (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity. See e.g. X-means: http://www2.cs.cmu.edu/dpelleg/kmeans.html
- What similarity measure?
  - Euclidean distance
  - Correlation coefficient
  - Ad-hoc similarity measures
Unsupervised learning
63. Validation of clustering results
- External measures
  - According to some external knowledge
  - Consideration of bias and subjectivity
- Internal measures
  - Quality of clusters according to the data
  - Compactness and separation
  - Stability
  - ...
- See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005.
Unsupervised learning
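As an example of an internal measure, here is a sketch that scores candidate numbers of clusters by compactness vs. separation using the silhouette coefficient; the use of scikit-learn and of k-means on synthetic data is an assumption, not something taken from the cited paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 4)) for c in (-3, 0, 3)])

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # higher = better separated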
64. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring
- Bioinformatics application
- T.R. Golub et al., Science 286, 531 (1999)
Unsupervised learning
65. Identification of cancer types
- Why is identification of the cancer class (tumor sub-type) important?
- Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia vs. acute myeloid leukemia).
- Traditional methods:
  - Morphological appearance
  - Enzyme-based histochemical analyses
  - Immunophenotyping
  - Cytogenetic analysis
Golub et al. 1999
Unsupervised learning
66. Class Prediction
- How could one use an initial collection of samples belonging to known classes to create a class predictor?
- Identification of informative genes
- Weighted voting
Golub et al. slides
Unsupervised learning
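A hedged sketch of how informative genes and a weighted vote might be computed: the signal-to-noise statistic and the vote below follow common descriptions of the Golub et al. approach, but the exact formulas and thresholds in the paper may differ in detail, and the random data are purely illustrative.

import numpy as np

def signal_to_noise(X, y):
    """Per-gene statistic (mu_class0 - mu_class1) / (sigma_class0 + sigma_class1)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s0, s1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    return (mu0 - mu1) / (s0 + s1 + 1e-12)

def weighted_vote(X, y, x_new, n_genes=50):
    """Each informative gene votes for one class, weighted by its statistic."""
    p = signal_to_noise(X, y)
    informative = np.argsort(np.abs(p))[-n_genes:]          # most informative genes
    b = (X[y == 0].mean(axis=0) + X[y == 1].mean(axis=0)) / 2  # class-mean midpoint
    votes = p[informative] * (x_new[informative] - b[informative])
    return 0 if votes.sum() > 0 else 1       # positive total vote -> class 0

# Illustrative random data: 38 samples, 500 genes, labels 0 = ALL, 1 = AML
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 500)); y = np.array([0] * 27 + [1] * 11)
print(weighted_vote(X, y, X[0]))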
67. Data
- Initial sample: 38 bone marrow samples (27 ALL, 11 AML), obtained at the time of diagnosis.
- Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Golub et al. slides
Unsupervised learning
68. Validation of Gene Voting
- Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis.
- Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.
Golub et al. slides
Unsupervised learning
69. Class Discovery
- Can cancer classes be discovered automatically based on gene expression?
- Cluster tumors by gene expression.
- Determine whether the putative classes produced are meaningful.
Golub et al. slides
Unsupervised learning
70. Cluster tumors
- Self-Organizing Map (SOM)
  - Mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the k-means approach).
  - Chooses a geometry of nodes.
  - Nodes are mapped into the K-dimensional space, initially at random.
  - Iteratively adjusts the nodes.
Golub et al. slides
Unsupervised learning
71. Validation of SOM
- Prediction based on clusters A1 and A2:
  - 24/25 of the ALL samples from the initial dataset were clustered in group A1.
  - 10/13 of the AML samples from the initial dataset were clustered in group A2.
Golub et al. slides
Unsupervised learning
72. Validation of SOM
- How could one evaluate the putative clusters if the right answer were not known?
- Assumption: class discovery can be tested by class prediction.
- Testing the assumption:
  - Construct predictors based on clusters A1 and A2.
  - Construct predictors based on random clusters.
Golub et al. slides
Unsupervised learning
73. Validation of SOM
- Predictions using the predictors based on clusters A1 and A2 yield 34 accurate predictions, one error and three uncertain calls.
Golub et al. slides
Unsupervised learning
74. Validation of SOM
Golub et al. slides
Unsupervised learning
75. CONCLUSION
- In Machine Learning, every technique has its assumptions and constraints, advantages and limitations.
- My view:
  - First perform simple data analysis before applying fancy high-tech ML methods.
  - Possibly use different ML techniques and then ensemble the results.
  - Apply the correct cross validation method!
  - Check the significance of the results (permutation test, stability of the selected genes).
  - Work in collaboration with the data producers (biologists, pathologists) when possible!
ML in bioinformatics