Title: Introduction to Bioinformatics, Lecture VIII: Classification and Supervised Learning
1 Introduction to Bioinformatics, Lecture VIII: Classification and Supervised Learning
- Jarek Meller
- Division of Biomedical Informatics,
- Children's Hospital Research Foundation
- Department of Biomedical Engineering, UC
2 Outline of the lecture
- Motivating story: correlating inputs and outputs
- Learning with a teacher
- Regression and classification problems
- Model selection, feature selection and generalization
- k-nearest neighbors and some other classification algorithms
- Phenotype fingerprints and their applications in medicine
3 Web watch: an on-line biology textbook by J. W. Kimball
Dr. J. W. Kimball's Biology Pages: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/
Story 1: B-cells and DNA editing, Apolipoprotein B and RNA editing: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene
Story 2: ApoB, cholesterol uptake, LDL and its endocytosis: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl
Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.
4 Correlations and fingerprints
Instead of an underlying molecular model, which is often difficult to decipher, one may simply try to find correlations between inputs and outputs. If measurements of certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states, etc., one can use such attributes as indicators of these hidden states and make predictions for new cases. Consider, for example, elevated levels of low-density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.
5 Correlations and fingerprints: LDL example
Healthy cases: blue; heart attack or stroke within 5 years of the exam: red (simulated data). Axes: x = LDL, y = HDL, z = age. (See the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549.)
6 LDL example: 2D projection
7 LDL example: regression with binary output and 1D projection for classification
8 Unsupervised vs. supervised learning
In the case of unsupervised learning, the goal is to discover structure in the data and to group (cluster) similar objects, given a similarity measure. In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.
(Figure: decision boundaries separating three classes, Class 1, Class 2 and Class 3.)
9 Choice of the model, problem representation and feature selection: another simple example
(Figure: two toy feature spaces: adults vs. children in a height/weight space, and F vs. M in an estrogen/testosterone space.)
10 Gene expression example again: JRA clinical classes
Picture courtesy of B. Aronow
11 Advantages of prior knowledge; on the other hand, problems with class assignment (e.g. in clinical practice)
(Figure: GLOBINS, FixL and PYP: no sequence similarity, yet the same class.)
Prior knowledge: the same class despite low sequence similarity suggests that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the good-model question again).
12 Three phases in supervised learning protocols
- Training data: examples with class assignment are given
- Learning:
  i) an appropriate model (or representation) of the problem needs to be selected, in terms of attributes, distance measure and classifier type;
  ii) adaptive parameters in the model need to be optimized to provide correct classification of training examples (e.g. by minimizing the number of misclassified training vectors)
- Validation: cross-validation, independent control sets and other measures of real accuracy and generalization should be used to assess the success of the model and the training phase (finding the trade-off between accuracy and generalization is not trivial)
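The three phases above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production protocol: the toy data, the 1-NN classifier used as the "model", and the choice of 5 folds are all assumptions made for the example.

```python
import random

def one_nn_predict(train, point):
    """Classify `point` by the label of its nearest training example (1-NN)."""
    nearest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)))
    return nearest[1]

def cross_validate(data, n_folds=5):
    """Validation phase: estimate generalization accuracy by k-fold cross-validation."""
    random.Random(0).shuffle(data)                    # fixed seed for reproducibility
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i in range(n_folds):
        held_out = folds[i]                           # independent control set
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        correct = sum(one_nn_predict(train, x) == c for x, c in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds

# Training data: (features, class) pairs on a toy grid; class 1 iff x + y > 10.
data = [((x, y), int(x + y > 10)) for x in range(10) for y in range(10)]
accuracy = cross_validate(data)
```

The held-out fold plays the role of the independent control set: the model never sees it during "learning", so the averaged accuracy is a less optimistic estimate than training accuracy.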
13 Training set: LDL example again
- A set of objects (here, patients) xi, i = 1, ..., N is given. For each patient, a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class Ck, k = 1, ..., K, he/she belongs to.

  Age  LDL  HDL  Sex  Class
  41   230  60   F    healthy (0)
  32   120  50   M    stroke within 5 years (1)
  45   90   70   M    heart attack within 5 years (1)

{(xi, Ck)}, i = 1, ..., N
14 Optimizing adaptable parameters in the model
- Find a model y(x; w) that describes the objects of each class as a function of the features and the adaptive parameters (weights) w.
- Prediction: given x (e.g. LDL = 240, age = 52, sex = male), assign the class C (e.g. if y(x; w) > 0.5 then C = 1, i.e. likely to suffer from a stroke or heart attack in the next 5 years).
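The prediction rule above can be sketched with a logistic model for y(x; w). The weight values below are hypothetical, chosen only to make the example run; they are not estimated from any real study.

```python
import math

def y(x, w):
    """Logistic model y(x; w): estimated probability of class 1 given features x."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))   # bias + weighted features
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: bias, then coefficients for LDL, age, sex (1 = male).
w = [-10.0, 0.03, 0.05, 1.0]
x = [240, 52, 1]                     # LDL = 240, age = 52, sex = male

prob = y(x, w)
label = 1 if prob > 0.5 else 0       # class 1: stroke or heart attack within 5 years
```

Thresholding at 0.5 turns the continuous model output into a hard class assignment; in practice the weights w would be fitted on the training set during the learning phase.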
15 Examples of machine learning algorithms for classification and regression problems
- Linear perceptron, Least Squares
- LDA/FDA (Linear/Fisher Discriminant Analysis): simple linear cuts, kernel non-linear generalizations
- SVM (Support Vector Machines): optimal, wide-margin linear cuts, kernel non-linear generalizations
- Decision trees: logical rules
- k-NN (k-Nearest Neighbors): simple, non-parametric
- Neural networks: general non-linear models, adaptivity, "artificial brain"
16 Training accuracy vs. generalization
17 Model complexity, training set size and generalization
18 Similarity measures
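Two common choices, sketched below, illustrate how much the measure itself matters: Euclidean distance compares absolute values, while correlation-based distance compares the shape of a profile (often preferred for gene expression data). Both functions are illustrative implementations, not taken from any particular library.

```python
import math

def euclidean(a, b):
    """Euclidean distance: sensitive to absolute magnitudes and feature scaling."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (sd_a * sd_b)

# Profiles with the same shape but different scale: far in Euclidean terms,
# identical (distance 0) in correlation terms.
d_euc = euclidean([1, 2, 3], [2, 4, 6])
d_cor = pearson_distance([1, 2, 3], [2, 4, 6])
```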
19 k-nearest neighbors as a simple algorithm for classification
- Given a training set of N objects with known class assignment and k < N, find an assignment of new objects (not included in the training set) to one of the classes, based on the assignments of their k nearest neighbors
- A simple, non-parametric method that works surprisingly well, especially in the case of low-dimensional problems
- Note, however, that the choice of the distance measure may again have a profound effect on the results
- The optimal k is found by trial and error
20 k-nearest neighbor algorithm
Step 1: Compute pairwise distances and take the k closest neighbors.
Step 2: Assign the class by simple majority voting: the new point belongs to the class with the most representatives among its k neighbors.
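The two steps above can be sketched directly. This minimal version assumes Euclidean distance and k = 3; the toy training set is made up for the example, and ties in the vote are resolved arbitrarily here.

```python
import math
from collections import Counter

def knn_classify(train, x_new, k=3):
    """train: list of (feature_vector, class_label) pairs."""
    # Step 1: rank training points by distance and keep the k closest neighbors.
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], x_new))[:k]
    # Step 2: simple majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes.
train = [((1, 1), 0), ((2, 1), 0), ((8, 9), 1), ((9, 8), 1), ((9, 9), 1)]
predicted = knn_classify(train, (8, 8), k=3)
```

Because the method is non-parametric, "training" amounts to storing the examples; all the work happens at prediction time, which is why the distance measure and the choice of k dominate the results.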