Title: MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
1. MACHINE LEARNING TECHNIQUES IN BIO-INFORMATICS
- Elena Marchiori
- IBIVU
- Vrije Universiteit Amsterdam
2. Summary
- Machine Learning
- Supervised Learning: classification
- Unsupervised Learning: clustering
3. Machine Learning (ML)
- Construct a computational model from a dataset describing properties of an unknown (but existent) system.
[Diagram: the unknown system produces observations of its properties; ML builds a computational model from them, which is then used for prediction.]
4. Supervised Learning
- The dataset describes examples of input-output behaviour of an unknown (but existent) system.
- The algorithm tries to find a function equivalent to the system.
- ML techniques for classification: K-nearest neighbour, decision trees, Naïve Bayes, Support Vector Machines.
5. Supervised Learning
[Diagram: a supervisor labels observations of the unknown system with the property of interest, producing training data; the ML algorithm builds a model that predicts the property for a new observation.]
6. Example: A Classification Problem
- Categorize images of fish, say Atlantic salmon vs. Pacific salmon.
- Use features such as length, width, lightness, fin shape and number, mouth position, etc.
- Steps:
  - Preprocessing (e.g., background subtraction)
  - Feature extraction
  - Classification
Example from Duda and Hart
7. Classification in Bioinformatics
- Computational diagnostics: early cancer detection
- Tumor biomarker discovery
- Protein folding prediction
- Protein-protein binding sites prediction
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
8. Classification Techniques
- Naïve Bayes
- K Nearest Neighbour
- Support Vector Machines (next lesson)
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
9. Bayesian Approach
- Each observed training example can incrementally decrease or increase the probability of a hypothesis, instead of eliminating the hypothesis.
- Prior knowledge can be combined with observed data to determine a hypothesis.
- Bayesian methods can accommodate hypotheses that make probabilistic predictions.
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
From Kathleen McKeown's slides
10. Bayesian Approach
- Assign the most probable target value, given the attribute values <a1, a2, ..., an>:
  v_MAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)
- Using Bayes' theorem:
  v_MAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
        = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)
- Bayesian learning is optimal.
- It is easy to estimate P(vj) by counting in the training data.
- Estimating the different P(a1, a2, ..., an | vj) is not feasible (we would need a training set of size proportional to the number of possible instances times the number of classes).
From Kathleen McKeown's slides
11. Bayes Rules
- Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
- Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
- In distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
From Kathleen McKeown's slides
12. Naïve Bayes
- Assume independence of the attributes:
  P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)
- Substitute into the v_MAP formula:
  v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
From Kathleen McKeown's slides
13. v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)

    #   S-length  S-width  P-length  Class
    1   high      high     high      Versicolour
    2   low       high     low       Setosa
    3   low       high     low       Virginica
    4   low       high     med       Virginica
    5   high      high     high      Versicolour
    6   high      high     med       Setosa
    7   high      high     low       Setosa
    8   high      high     high      Versicolour
    9   high      high     high      Versicolour

From Kathleen McKeown's slides
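To make the v_NB formula concrete, here is a minimal Python sketch that estimates P(vj) and P(ai | vj) by counting on the toy table above and classifies a new flower. The table values are taken from the slide; the function and variable names are illustrative, not part of the original material.

from collections import Counter, defaultdict

# Toy training set from the table above: (S-length, S-width, P-length, class)
data = [
    ("high", "high", "high", "Versicolour"),
    ("low",  "high", "low",  "Setosa"),
    ("low",  "high", "low",  "Virginica"),
    ("low",  "high", "med",  "Virginica"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "med",  "Setosa"),
    ("high", "high", "low",  "Setosa"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "high", "Versicolour"),
]

def train_naive_bayes(rows):
    """Estimate P(class) and P(attribute=value | class) by counting."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)      # (attr_index, class) -> value counts
    for row in rows:
        *attrs, label = row
        for i, value in enumerate(attrs):
            cond_counts[(i, label)][value] += 1
    priors = {c: k / len(rows) for c, k in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    """Return argmax_c P(c) * prod_i P(a_i | c) over the classes."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, value in enumerate(x):
            p *= cond_counts[(i, c)][value] / class_counts[c]
        scores[c] = p
    return max(scores, key=scores.get), scores

priors, cond, counts = train_naive_bayes(data)
print(predict(("high", "high", "med"), priors, cond, counts))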
14. Estimating Probabilities
- What happens when the number of data elements is small?
- Suppose the true P(S-length=low | Virginica) = 0.05.
- There are only 2 instances with C = Virginica.
- We estimate the probability by nc/n using the training set.
- Then the estimate of P(S-length=low | Virginica) may well be 0.
- So, instead of 0.05, we use an estimated probability of 0.
- Two problems:
  - Biased underestimate of the probability.
  - This probability term will dominate if a future query contains S-length=low.
From Kathleen McKeown's slides
15. Instead use the m-estimate
- Use priors as well:
  (nc + m·p) / (n + m)
- Where p = prior estimate of P(S-length=low | Virginica).
- m is a constant called the equivalent sample size.
  - It determines how heavily to weight p relative to the observed data.
- Typical method: assume a uniform prior over the attribute values (e.g. if the values are low, med, high -> p = 1/3).
From Kathleen McKeown's slides
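A minimal sketch of the m-estimate, assuming the uniform prior p = 1/3 from the slide; the choice m = 3 is illustrative.

def m_estimate(n_c, n, p, m):
    """m-estimate of a conditional probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Hypothetical case from the slide: only n = 2 Virginica instances observed,
# none of them with S-length = low, so the raw estimate n_c/n would be 0/2 = 0.
p_uniform = 1 / 3                            # uniform prior over {low, med, high}
print(m_estimate(0, 2, p_uniform, m=3))      # pulled towards the prior instead of 0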
16. K-Nearest Neighbour
- Memorize the training data.
- Given a new example, find its k nearest neighbours and output the majority-vote class.
- Choices:
  - How many neighbours?
  - What distance measure?
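A minimal k-NN sketch in Python. It assumes numeric features and Euclidean distance; those are exactly the two choices listed above, and the tiny dataset is made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance
    neighbours = np.argsort(dists)[:k]                # indices of the k closest
    votes = Counter(y_train[i] for i in neighbours)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset (two classes in 2-D)
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["A", "A", "B", "B"])
print(knn_predict(X, y, np.array([0.8, 0.9]), k=3))   # -> "B"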
17. Application in Bioinformatics
- A regression-based K-nearest neighbor algorithm for gene function prediction from heterogeneous data, Z. Yao and W.L. Ruzzo, BMC Bioinformatics 2006, 7.
- For each dataset k and for each pair of genes p, compute the similarity f_k(p) of p with respect to the k-th dataset.
- Construct a predictor of gene-pair similarity, e.g. logistic regression:
  H: f(p,1), ..., f(p,m) -> H(f(p,1), ..., f(p,m)), such that H has a high value if the genes of p have similar functions.
- Given a new gene g, find its kNN using H as the distance.
- Predict the functional classes C1, ..., Cn of g with confidence
  Confidence(Ci) = 1 - ∏_j (1 - Pij), with gj a neighbour of g and Ci in the set of classes of gj
  (the probability that at least one prediction is correct, that is, 1 minus the probability that all predictions are wrong).
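The confidence combination at the end of the slide is a "noisy-OR" over the neighbours; a small sketch follows, where the probabilities P_ij are made-up numbers for illustration.

def confidence(p_ij):
    """1 - prod_j (1 - P_ij): probability that at least one neighbour's
    annotation with class C_i gives a correct prediction."""
    prob_all_wrong = 1.0
    for p in p_ij:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong

# Three neighbours annotated with class C_i, with hypothetical probabilities
print(confidence([0.6, 0.5, 0.3]))   # 1 - 0.4*0.5*0.7 = 0.86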
18. Classification: CV error
- Training error
  - Empirical error
- Error on an independent test set
  - Test error
- Cross validation (CV) error
  - Leave-one-out (LOO)
  - n-fold CV
[Diagram: the N samples are repeatedly split, with 1/n of the samples used for testing and (n-1)/n for training; the errors are counted over the splits and summarized as the CV error rate.]
Supervised learning
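A sketch of leave-one-out and n-fold CV error estimation. The use of scikit-learn and of a 3-NN classifier on synthetic data is an assumption for illustration, not part of the original slides.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score

X, y = make_classification(n_samples=40, n_features=10, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

# Leave-one-out: N splits, one sample left out for testing each time
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOO error:", 1 - loo_acc.mean())

# n-fold CV: 1/n of the samples for testing, (n-1)/n for training, per fold
kfold_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold CV error:", 1 - kfold_acc.mean())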
19. Two schemes of cross validation
[Diagram, for N samples:
 CV1 - within each LOO fold, perform gene selection, then train and test the gene-selector and the classifier; count the errors over the folds.
 CV2 - perform gene selection once on all samples, then use LOO to train and test only the classifier; count the errors over the folds.]
Supervised learning
20. Difference between CV1 and CV2
- CV1: gene selection within LOOCV.
- CV2: gene selection before LOOCV.
- CV2 can yield an optimistic estimate of the true classification error.
- CV2 was used in the paper by Golub et al.:
  - 0 training errors
  - 2 CV errors (5.26%)
  - 5 test errors (14.7%)
  - CV error different from test error!
Supervised learning
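A sketch of the difference between the two schemes on random (uninformative) data: CV1 puts gene selection inside the LOO loop via a pipeline, while CV2 selects genes once using all samples before cross-validation and therefore looks optimistically good even though the true error is ~50%. The scikit-learn components and parameter values are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))          # 40 samples, 1000 random "genes"
y = rng.integers(0, 2, size=40)          # random labels: true error is ~50%

loo = LeaveOneOut()
clf = KNeighborsClassifier(n_neighbors=3)

# CV1: gene selection is refitted inside every LOO training fold
cv1 = cross_val_score(
    Pipeline([("select", SelectKBest(f_classif, k=20)), ("knn", clf)]),
    X, y, cv=loo)

# CV2: gene selection uses ALL samples (including each test sample) -> optimistic
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
cv2 = cross_val_score(clf, X_selected, y, cv=loo)

print("CV1 error (honest)    :", 1 - cv1.mean())
print("CV2 error (optimistic):", 1 - cv2.mean())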
21. Significance of classification results
- Permutation test:
  - Permute the class labels of the samples.
  - Compute the LOOCV error on the data with permuted labels.
  - Repeat the process a large number of times.
  - Compare with the LOOCV error on the original data.
  - P-value = (# times the LOOCV error on permuted data < the LOOCV error on the original data) / total number of permutations considered.
Supervised learning
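A sketch of the permutation test described above. The 3-NN classifier inside `loocv_error` is a stand-in for any classifier, the synthetic data are illustrative, and the comparison uses "at least as good" (<=), a common convention; the slide states a strict <.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loocv_error(X, y):
    """LOOCV error of a 3-NN classifier (stand-in for any classifier)."""
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
    return 1 - acc.mean()

def permutation_p_value(X, y, n_permutations=100, seed=0):
    rng = np.random.default_rng(seed)
    observed = loocv_error(X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)              # permute the class labels
        if loocv_error(X, y_perm) <= observed:   # permuted result at least as good
            count += 1
    return count / n_permutations

# Usage: the p-value should be small if X really separates the two classes
X = np.vstack([np.random.randn(15, 5) + 1, np.random.randn(15, 5) - 1])
y = np.array([0] * 15 + [1] * 15)
print(permutation_p_value(X, y, n_permutations=50))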
22. Unsupervised Learning
- ML for unsupervised learning attempts to discover interesting structure in the available data.
Unsupervised learning
23. Unsupervised Learning
- The dataset describes the structure of an unknown (but existent) system.
- The computer program tries to identify the structure of the system (clustering, data compression).
- ML techniques: hierarchical clustering, k-means, Self Organizing Maps (SOM), fuzzy clustering (described in a future lesson).
24. Clustering
- Clustering is one of the most important unsupervised learning processes for organizing objects into groups whose members are similar in some way.
- Clustering finds structure in a collection of unlabeled data.
- A cluster is a collection of objects which are similar to each other and dissimilar to the objects belonging to other clusters.
25. Clustering Algorithms
- Start with a collection of n objects, each represented by a p-dimensional feature vector xi, i = 1, ..., n.
- The goal is to associate the n objects to k clusters so that objects within a cluster are more similar than objects in different clusters. k is usually unknown.
- Popular methods: hierarchical, k-means, SOM, ...
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
26. Hierarchical Clustering
[Figure: Venn diagram of clustered data and the corresponding dendrogram.]
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
27. Hierarchical Clustering (Cont.)
- Multilevel clustering: level 1 has n clusters -> level n has one cluster.
- Agglomerative HC starts with singletons and merges clusters.
- Divisive HC starts with one cluster containing all samples and splits clusters.
28. Nearest Neighbor Algorithm
- The Nearest Neighbor Algorithm is an agglomerative approach (bottom-up).
- It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
29. Nearest Neighbor, Level 2, k = 7 clusters.
From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
30. Nearest Neighbor, Level 3, k = 6 clusters.
31. Nearest Neighbor, Level 4, k = 5 clusters.
32. Nearest Neighbor, Level 5, k = 4 clusters.
33. Nearest Neighbor, Level 6, k = 3 clusters.
34. Nearest Neighbor, Level 7, k = 2 clusters.
35. Nearest Neighbor, Level 8, k = 1 cluster.
36. Hierarchical Clustering
- Keys: similarity and clustering.
- Calculate the similarity between all possible combinations of two profiles.
- The two most similar clusters are grouped together to form a new cluster.
- Calculate the similarity between the new cluster and all remaining clusters, and repeat.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
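A sketch of the loop described above using scipy's agglomerative clustering; scipy and the random stand-in profiles are assumptions for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = rng.normal(size=(12, 6))     # 12 expression profiles, 6 conditions

# linkage() performs exactly the loop above: compute pairwise similarities,
# merge the two most similar clusters, recompute similarities, and so on.
Z = linkage(profiles, method="average", metric="correlation")

# Cut the resulting dendrogram into, e.g., 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)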
37. Clustering in Bioinformatics
- Microarray data quality checking:
  - Do replicates cluster together?
  - Do similar conditions, time points, tissue types cluster together?
- Cluster genes -> prediction of the functions of unknown genes from known ones.
- Cluster samples -> discover clinical characteristics (e.g. survival, marker status) shared by samples.
- Promoter analysis of commonly regulated genes.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
38. Functionally significant gene clusters
- Two-way clustering: sample clusters and gene clusters.
39. Bhattacharjee et al. (2001) Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98, 13790-13795.
40. Similarity Measurements
- Pearson correlation between two profiles (vectors) X = (x1, ..., xN) and Y = (y1, ..., yN):
  r(X, Y) = Σi (xi - mean(X)) (yi - mean(Y)) / sqrt( Σi (xi - mean(X))² · Σi (yi - mean(Y))² )
- -1 ≤ Pearson correlation ≤ 1
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
41. Similarity Measurements
- Pearson correlation: trend similarity
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
42. Similarity Measurements
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
43. Similarity Measurements
- Euclidean distance: absolute difference
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
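A small numpy sketch of the two similarity measures just discussed, Pearson correlation and Euclidean distance, on two made-up profiles.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 4.0, 5.0])      # same trend as x, shifted by +1

pearson = np.corrcoef(x, y)[0, 1]       # trend similarity: 1.0 here
euclid = np.linalg.norm(x - y)          # absolute difference: 2.0 here
print(pearson, euclid)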
44. Clustering
[Figure: three clusters C1, C2, C3. Merge which pair of clusters?]
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
45. Clustering: Single Linkage
- Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
- Tends to generate long chains.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
46. Clustering: Complete Linkage
- Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
- Tends to generate compact clumps.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
47. Clustering: Average Linkage
- Dissimilarity between two clusters = average of the distances over all pairs of objects (one from each cluster).
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
48. Clustering: Average Group Linkage
- Dissimilarity between two clusters = distance between the two cluster means.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
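The linkage rules above differ only in how the cluster-to-cluster dissimilarity is computed; with scipy they are just different `method` arguments. A sketch, with random data standing in for real profiles, and with scipy's "centroid" method used as a stand-in for average group linkage (distance between cluster means).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                 # centroid ~ average group linkage
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(method, labels)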
49. Considerations
- What genes are used to cluster samples?
- Expression variation
- Inherent variation
- Prior knowledge (irrelevant genes)
- Etc.
From Introduction to Hierarchical Clustering
Analysis, Pengyu Hong
50. K-means Clustering
- Initialize the K cluster representatives w_1, ..., w_K, e.g. to randomly chosen examples.
- Assign each input example x to the cluster c(x) with the nearest corresponding weight vector.
- Update the weights.
- Increment n by 1 and repeat until no noticeable changes of the cluster representatives occur.
Unsupervised learning
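A minimal numpy sketch of the k-means loop just described; the data, K and the stopping tolerance are illustrative assumptions.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize the K cluster representatives to randomly chosen examples
    w = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # 2. Assign each example to the cluster with the nearest representative
        dists = np.linalg.norm(X[:, None, :] - w[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # 3. Update the representatives as the means of their assigned examples
        new_w = np.array([X[c == k].mean(axis=0) if np.any(c == k) else w[k]
                          for k in range(K)])
        if np.allclose(new_w, w):        # 4. Stop when nothing noticeably changes
            break
        w = new_w
    return w, c

X = np.vstack([np.random.randn(20, 2) + 3, np.random.randn(20, 2) - 3])
print(kmeans(X, K=2)[0])                 # the two cluster representatives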
51. Example I
[Figure: initial data and seeds, and the final clustering.]
Unsupervised learning
52. Example II
[Figure: initial data and seeds, and the final clustering.]
Unsupervised learning
53. SOM: the brain's self-organization
- The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation. That is, the brain processes the external signals in a topology-preserving way.
- Mimicking the way the brain learns, our clustering system should be able to do the same thing.
Unsupervised learning
54. Self-Organized Map: idea
- Data vectors X^T = (X1, ..., Xd) from a d-dimensional space.
- A grid of nodes, with a local processor (called a neuron) in each node.
- Local processor j has d adaptive parameters W(j).
- Goal: change the W(j) parameters to recover the data clusters in X space.
Unsupervised learning
55. Training process
- Java demos: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/DemoGNG/GNG.html
Unsupervised learning
56. Concept of the SOM
[Figure: input space and reduced feature space. Cluster centers (code vectors, labelled e.g. Ba, Mn, Sr, s1, s2) in the input space are mapped to positions in the reduced space; the cluster centers are clustered and ordered on a two-dimensional grid.]
Unsupervised learning
57. Concept of the SOM
- We can use it for visualization.
- We can use it for classification.
- We can use it for clustering.
[Figure: SOM grid with labelled code vectors, e.g. Ba, Mn, Sr, Mg, SA3.]
Unsupervised learning
58. SOM learning algorithm
- Initialization: n = 0. Choose random small values for the weight vector components.
- Sampling: select an x from the input examples.
- Similarity matching: find the winning neuron i(x) at iteration n.
- Updating: adjust the weight vectors of all neurons using the update rule.
- Continuation: n = n + 1. Go to the Sampling step until no noticeable changes in the weights are observed.
Unsupervised learning
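A minimal SOM training sketch in numpy following the five steps above, with a Gaussian neighborhood on a 1-D lattice; the grid size, learning rate and neighborhood width are illustrative assumptions.

import numpy as np

def train_som(X, n_neurons=10, n_iter=2000, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Initialization: small random weights; neurons sit on a 1-D lattice 0..n_neurons-1
    W = rng.normal(scale=0.1, size=(n_neurons, d))
    positions = np.arange(n_neurons)
    for n in range(n_iter):
        x = X[rng.integers(len(X))]                        # Sampling
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # Similarity matching
        # Gaussian neighborhood around the winner (shrinks over time)
        sig = sigma * (1 - n / n_iter) + 1e-3
        h = np.exp(-(positions - winner) ** 2 / (2 * sig ** 2))
        # Updating: move every neuron towards x, weighted by its neighborhood value
        W += lr * (1 - n / n_iter) * h[:, None] * (x - W)
    return W

X = np.vstack([np.random.randn(50, 3) + 2, np.random.randn(50, 3) - 2])
print(train_som(X).round(2))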
59. Neighborhood Function
- Gaussian neighborhood function: h_{j,i}(n) = exp( -d_{j,i}² / (2 σ(n)²) )
- d_{j,i} = lateral distance of neurons i and j:
  - in a 1-dimensional lattice: |j - i|
  - in a 2-dimensional lattice: ||r_j - r_i||, where r_j is the position of neuron j in the lattice.
Unsupervised learning
60. Initial h function (example)
Unsupervised learning
61. Some examples of real-life applications
- Helsinki University of Technology web site: http://www.cis.hut.fi/research/refs/
  - Contains > 5000 papers on SOM and its applications.
- Brain research: modeling the formation of various topographical maps in motor, auditory, visual and somatotopic areas.
- Clustering of genes, protein properties, chemical compounds, speech phonemes, sounds of birds and insects, astronomical objects, economical data, business and financial data, ...
- Data compression (images and audio), information filtering.
- Medical and technical diagnostics.
Unsupervised learning
62. Issues in Clustering
- How many clusters?
  - User parameter.
  - Use model selection criteria (e.g. the Bayesian Information Criterion) with a penalization term that accounts for model complexity. See e.g. X-means: http://www2.cs.cmu.edu/dpelleg/kmeans.html
- What similarity measure?
  - Euclidean distance
  - Correlation coefficient
  - Ad-hoc similarity measures
Unsupervised learning
63. Validation of clustering results
- External measures
  - According to some external knowledge
  - Consideration of bias and subjectivity
- Internal measures
  - Quality of clusters according to the data
  - Compactness and separation
  - Stability
  - ...
- See e.g. J. Handl, J. Knowles, D.B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21(15):3201-3212, 2005.
Unsupervised learning
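As an example of an internal measure, here is a sketch that scores candidate numbers of clusters by compactness vs. separation using the silhouette coefficient; the use of scikit-learn and of k-means on synthetic data is an assumption, not something taken from the cited paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(30, 4)) for c in (-3, 0, 3)])

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # higher = better separated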
64. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring
- Bioinformatics application
- T.R. Golub et al., Science 286, 531 (1999)
Unsupervised learning
65. Identification of cancer types
- Why is identification of the cancer class (tumor sub-type) important?
- Cancers of identical grade can have widely variable clinical courses (e.g. acute lymphoblastic leukemia vs. acute myeloid leukemia).
- Traditional methods:
  - Morphological appearance
  - Enzyme-based histochemical analyses
  - Immunophenotyping
  - Cytogenetic analysis
Golub et al. 1999
Unsupervised learning
66. Class Prediction
- How could one use an initial collection of samples belonging to known classes to create a class predictor?
- Identification of informative genes
- Weighted voting
Golub et al. slides
Unsupervised learning
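A hedged sketch of how informative genes and a weighted vote might be computed: the signal-to-noise statistic and the vote below follow common descriptions of the Golub et al. approach, but the exact formulas and thresholds in the paper may differ in detail, and the random data are purely illustrative.

import numpy as np

def signal_to_noise(X, y):
    """Per-gene statistic (mu_class0 - mu_class1) / (sigma_class0 + sigma_class1)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s0, s1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    return (mu0 - mu1) / (s0 + s1 + 1e-12)

def weighted_vote(X, y, x_new, n_genes=50):
    """Each informative gene votes for one class, weighted by its statistic."""
    p = signal_to_noise(X, y)
    informative = np.argsort(np.abs(p))[-n_genes:]          # most informative genes
    b = (X[y == 0].mean(axis=0) + X[y == 1].mean(axis=0)) / 2  # class-mean midpoint
    votes = p[informative] * (x_new[informative] - b[informative])
    return 0 if votes.sum() > 0 else 1       # positive total vote -> class 0

# Illustrative random data: 38 samples, 500 genes, labels 0 = ALL, 1 = AML
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 500)); y = np.array([0] * 27 + [1] * 11)
print(weighted_vote(X, y, X[0]))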
67. Data
- Initial sample: 38 bone marrow samples (27 ALL, 11 AML), obtained at the time of diagnosis.
- Independent sample: 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML).
Golub et al. slides
Unsupervised learning
68. Validation of Gene Voting
- Initial samples: 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis.
- Independent samples: 29 of the 34 samples are strongly predicted, with 100% accuracy.
Golub et al. slides
Unsupervised learning
69. Class Discovery
- Can cancer classes be discovered automatically based on gene expression?
- Cluster tumors by gene expression.
- Determine whether the putative classes produced are meaningful.
Golub et al. slides
Unsupervised learning
70. Cluster tumors
- Self-Organizing Map (SOM)
  - Mathematical cluster analysis for recognizing and classifying features in complex, multidimensional data (similar to the k-means approach).
  - Chooses a geometry of nodes.
  - Nodes are mapped into the K-dimensional space, initially at random.
  - Iteratively adjusts the nodes.
Golub et al. slides
Unsupervised learning
71. Validation of SOM
- Prediction based on clusters A1 and A2:
  - 24/25 of the ALL samples from the initial dataset were clustered in group A1.
  - 10/13 of the AML samples from the initial dataset were clustered in group A2.
Golub et al. slides
Unsupervised learning
72. Validation of SOM
- How could one evaluate the putative clusters if the right answer were not known?
- Assumption: class discovery can be tested by class prediction.
- Testing the assumption:
  - Construct predictors based on clusters A1 and A2.
  - Construct predictors based on random clusters.
Golub et al. slides
Unsupervised learning
73. Validation of SOM
- Predictions using the predictors based on clusters A1 and A2 yield 34 accurate predictions, one error and three uncertain calls.
Golub et al. slides
Unsupervised learning
74. Validation of SOM
Golub et al. slides
Unsupervised learning
75. CONCLUSION
- In Machine Learning, every technique has its assumptions and constraints, advantages and limitations.
- My view:
  - First perform simple data analysis before applying fancy high-tech ML methods.
  - Possibly use different ML techniques and then ensemble the results.
  - Apply the correct cross validation method!
  - Check the significance of the results (permutation test, stability of the selected genes).
  - Work in collaboration with the data producers (biologists, pathologists) when possible!
ML in bioinformatics