Machine Learning (presentation transcript)
1
Lecture 3: Machine Learning (Elena Marchiori's slides, adapted)
Bioinformatics Data Analysis and Tools
heringa@few.vu.nl
2
Supervised Learning
(Diagram: an unknown system produces observations; a supervisor labels them with the property of interest, yielding a training dataset; an ML algorithm learns a model from this dataset, and the model maps a new observation to a prediction, i.e. a classification.)
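A minimal sketch of this supervised workflow, assuming scikit-learn and a built-in toy dataset (not part of the original slides):

# Minimal supervised-learning workflow sketch (assumes scikit-learn is installed).
# Train a model on labelled observations, then predict the class of a new observation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)             # observations plus labels from the "supervisor"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)   # the ML algorithm
model.fit(X_train, y_train)                   # learn the model from the training dataset

new_observation = X_test[:1]
print("prediction:", model.predict(new_observation))   # classification of a new observation
print("test accuracy:", model.score(X_test, y_test))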
3
Unsupervised Learning

ML for unsupervised learning attempts to
discover interesting structure in the available
data
Data mining, Clustering
4
What is your question?
  • What are the target genes for my knock-out gene?
  • Look for genes that have different time profiles
    between different cell types.
  • Gene discovery, differential expression
  • Is a specified group of genes all up-regulated in
    a specified condition?
  • Gene set, differential expression
  • Can I use the expression profile of cancer
    patients to predict survival?
  • Identification of groups of genes that are
    predictive of a particular class of tumors?
  • Class prediction, classification
  • Are there tumor sub-types not previously
    identified?
  • Are there groups of co-expressed genes?
  • Class discovery, clustering
  • Detection of gene regulatory mechanisms.
  • Do my genes group into previously undiscovered
    pathways?
  • Clustering. Often expression data alone are not
    enough; functional and other information must be
    incorporated

5
  • Basic principles of discrimination
  • Each object is associated with a class label (or
    response) Y ∈ {1, 2, ..., K} and a feature vector
    (vector of predictor variables) of G
    measurements X = (X1, ..., XG)
  • Aim: predict Y from X.

(Diagram: objects with predefined classes 1, 2, ..., K; an example object has class label Y = 2 and feature vector X = (colour, shape); the classification rule must predict Y for a new object with X = (red, square).)
6
Discrimination and Prediction
(Diagram: a learning set of data with known classes is fed to a classification technique (discrimination), producing a classification rule; the rule is then applied to data with unknown classes to make class assignments (prediction).)
7
Example: A Classification Problem
  • Categorize images of fish, say Atlantic salmon
    vs. Pacific salmon
  • Use features such as length, width, lightness,
    fin shape and number, mouth position, etc.
  • Steps
  • Preprocessing (e.g., background subtraction)
  • Feature extraction/feature weighting
  • Classification

Example from Duda & Hart
8
Classification in Bioinformatics
  • Computational diagnostics: early cancer detection
  • Tumor biomarker discovery
  • Protein structure prediction (threading)
  • Protein-protein binding sites prediction
  • Gene function prediction

9
Learning set
(Diagram: predefined classes are clinical outcomes, good prognosis (recurrence > 5 yrs) vs. bad prognosis (recurrence < 5 yrs); objects are arrays and feature vectors are gene expression profiles; a classification rule assigns a new array to one of the classes.)
Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan. 2002.
10
Classification Techniques
  • K Nearest Neighbor classifier
  • Support Vector Machines

11
Instance Based Learning (IBL)
  • Key idea: just store all training examples
    <xi, f(xi)>
  • Nearest neighbor
  • Given query instance xq, first locate the nearest
    training example xn, then estimate f(xq) = f(xn)
  • K-nearest neighbor
  • Given xq, take a vote among its k nearest neighbors
    (if discrete-valued target function)
  • Take the mean of the values of the k nearest neighbors
    (if real-valued): f(xq) = (1/k) Σi=1..k f(xi)
    (see the sketch after this list)

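A minimal NumPy sketch of the nearest-neighbor estimate above (the function name knn_predict is illustrative):

# Minimal instance-based learning sketch (illustrative, NumPy only).
import numpy as np

def knn_predict(X_train, y_train, x_q, k=3, classification=True):
    """Predict f(x_q) from the k nearest stored training examples."""
    dists = np.linalg.norm(X_train - x_q, axis=1)      # Euclidean distance to every stored example
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbours
    if classification:
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]                # majority vote (discrete-valued target)
    return y_train[nearest].mean()                      # mean of neighbours (real-valued target)

# Toy usage
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))     # -> 0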
12
K-Nearest Neighbor
  • The k-nearest neighbor algorithm is amongst the
    simplest of all machine learning algorithms.
  • An object is classified by a majority vote of its
    neighbors, with the object being assigned to the
    class most common amongst its k nearest
    neighbors.
  • k is a positive integer, typically small. If k =
    1, then the object is simply assigned to the
    class of its nearest neighbor.
  • K-NN can do multiple class prediction (more than
    two cancer subtypes, etc.)
  • In binary (two class) classification problems, it
    is helpful to choose k to be an odd number as
    this avoids tied votes.

Adapted from Wikipedia
13
K-Nearest Neighbor
  • A lazy learner
  • Issues
  • How many neighbors?
  • What similarity measure?

Example of k-NN classification. The test sample
(green circle) should be classified either to the
first class of blue squares or to the second
class of red triangles. If k = 3 it is classified
to the second class because there are 2 triangles
and only 1 square inside the inner circle. If k =
5 it is classified to the first class (3 squares vs.
2 triangles inside the outer circle).
From Wikipedia
14
Which similarity or dissimilarity measure?
  • A metric is a measure of the similarity or
    dissimilarity between two data objects
  • Two main classes of metric
  • Correlation coefficients (similarity)
  • Compares shape of expression curves
  • Types of correlation
  • Centered.
  • Un-centered.
  • Rank-correlation
  • Distance metrics (dissimilarity)
  • City Block (Manhattan) distance
  • Euclidean distance

15
Correlation (a measure between -1 and 1)
  • Pearson Correlation Coefficient (centered
    correlation)
  • Sx = standard deviation of x
  • Sy = standard deviation of y (formula below)

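Restoring the formula the slide refers to (the standard centered Pearson correlation, written here with the population standard deviations S_x and S_y):

\[ r(x, y) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_x}\right) \left(\frac{y_i - \bar{y}}{S_y}\right) \]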
You can use absolute correlation to capture both
positive and negative correlation
(Plots: an example of positive correlation and an example of negative correlation.)
16
Potential pitfalls
Correlation = 1
17
Distance metrics
  • City Block (Manhattan) distance
  • Sum of differences across dimensions
  • Less sensitive to outliers
  • Diamond shaped clusters
  • Euclidean distance
  • Most commonly used distance
  • Sphere shaped cluster
  • Corresponds to the geometric distance in
    multidimensional space

(Plot: genes X and Y shown as points in the space spanned by condition 1 and condition 2.)
where gene X = (x1, ..., xn) and gene Y = (y1, ..., yn)
(the distance formulas are given below)
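Restoring the two distance definitions in the notation above (standard forms):

\[ d_{\mathrm{Manhattan}}(X, Y) = \sum_{i=1}^{n} |x_i - y_i|, \qquad d_{\mathrm{Euclidean}}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]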
18
Euclidean vs Correlation (I)
  • Euclidean distance
  • Correlation

19
When to Consider Nearest Neighbors
  • Instances map to points in R^N
  • Less than 20 attributes per instance
  • Lots of training data
  • Advantages
  • Training is very fast
  • Learn complex target functions
  • Do not lose information
  • Disadvantages
  • Slow at query time
  • Easily fooled by irrelevant attributes

20
Voronoi Diagrams
  • Voronoi diagrams partition a space with objects
    in the same way as happens when you throw a
    number of pebbles in water -- you get concentric
    circles that will start touching and by doing so
    delineate the area for each pebble (object).
  • The area assigned to each object can now be used
    for weighting purposes (a computational sketch
    follows after this list)
  • A nice example from sequence analysis is by
    Sibbald, Vingron and Argos (1990)
  • Sibbald, P. and Argos, P. (1990). Weighting
    aligned protein or nucleic acid sequences to
    correct for unequal representation. JMB
    216: 813-818.

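A minimal sketch of computing a Voronoi partition for a set of 2-D objects, assuming SciPy is available (illustrative; turning the regions into area-based weights would additionally require clipping the unbounded cells):

# Compute the Voronoi partition of a few 2-D points (sketch, requires SciPy).
import numpy as np
from scipy.spatial import Voronoi

points = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 1.0], [2.0, 2.0]])
vor = Voronoi(points)

print(vor.vertices)        # vertices of the Voronoi cells
print(vor.regions)         # vertex indices of each region (-1 marks an unbounded cell)
print(vor.point_region)    # which region belongs to which input point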
21
Voronoi Diagram
query point qf
nearest neighbor qi
22
3-Nearest Neighbors
query point qf
3 nearest neighbors: 2 x, 1 o
Can use Voronoi areas for weighting
23
7-Nearest Neighbors
query point qf
7 nearest neighbors: 3 x, 4 o
24
k-Nearest Neighbors
  • The best choice of k depends upon the data;
    generally, larger values of k reduce the effect
    of noise on the classification, but make
    boundaries between classes less distinct.
  • A good k can be selected by various heuristic
    techniques, for example cross-validation (see the
    sketch after this list). If k = 1, the algorithm
    is called the nearest neighbor algorithm.
  • The accuracy of the k-NN algorithm can be
    severely degraded by the presence of noisy or
    irrelevant features, or if the feature scales are
    not consistent with their importance.
  • Much research effort has been put into selecting
    or scaling features to improve classification,
    e.g. using evolutionary algorithms to optimize
    feature scaling.

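A minimal sketch of choosing k by cross-validation, assuming scikit-learn and one of its built-in datasets:

# Choose k for k-NN by cross-validation (sketch, assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,                       # 5-fold cross-validation
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("CV accuracy:", search.best_score_)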
25
Nearest Neighbor
  • Approximate the target function f(x) at the
    single query point x = xq
  • Locally weighted regression: a generalization of
    IBL

26
Curse of Dimensionality
  • Imagine instances are described by 20 attributes
    (features) but only 10 are relevant to the target
    function
  • Curse of dimensionality: nearest neighbor is
    easily misled when the instance space is
    high-dimensional
  • One approach: weight the features according to
    their relevance!
  • Stretch the j-th axis by weight zj, where z1, ..., zn
    are chosen to minimize prediction error
  • Use cross-validation to automatically choose the
    weights z1, ..., zn
  • Note: setting zj to zero eliminates this dimension
    altogether (feature subset selection; see the
    sketch below)

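A minimal sketch of the feature-subset-selection idea (setting some weights zj to zero), assuming scikit-learn; the number of features kept is itself chosen by cross-validation:

# Feature subset selection for k-NN, tuned by cross-validation (sketch, assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),          # keep only the most relevant features
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
search = GridSearchCV(pipe, {"select__k": [5, 10, 20, 30]}, cv=5)
search.fit(X, y)
print("features kept:", search.best_params_["select__k"])
print("CV accuracy:", search.best_score_)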
27
Practical implementations
  • Weka: IBk
  • TiMBL (an optimized implementation)

28
Example: Tumor Classification
  • Reliable and precise classification essential for
    successful cancer treatment
  • Current methods for classifying human
    malignancies rely on a variety of morphological,
    clinical and molecular variables
  • Uncertainties in diagnosis remain; it is likely
    that existing classes are heterogeneous
  • Characterize molecular variations among tumors by
    monitoring gene expression (microarray)
  • Hope that microarrays will lead to more reliable
    tumor classification (and therefore more
    appropriate treatments and better outcomes)

29
Tumor Classification Using Gene Expression Data
  • Three main types of ML problems associated with
    tumor classification
  • Identification of new/unknown tumor classes using
    gene expression profiles (unsupervised learning:
    clustering)
  • Classification of malignancies into known classes
    (supervised learning: discrimination)
  • Identification of marker genes that
    characterize the different tumor classes (feature
    or variable selection).

30
Example: Leukemia experiments (Golub et al. 1999)
  • Goal: to identify genes which are differentially
    expressed in acute lymphoblastic leukemia (ALL)
    tumours in comparison with acute myeloid
    leukemia (AML) tumours.
  • 38 tumour samples: 27 ALL, 11 AML.
  • Data from Affymetrix chips, after some
    pre-processing.
  • Originally 6,817 genes; 3,051 after reduction.
  • Data therefore a 3,051 × 38 array of expression
    values.

Acute lymphoblastic leukemia (ALL) is the most
common malignancy in children 2-5 years of age,
representing nearly one third of all pediatric
cancers. Acute Myeloid Leukemia (AML) is the
most common form of myeloid leukemia in adults
(chronic lymphocytic leukemia is the most common
form of leukemia in adults overall). In contrast,
acute myeloid leukemia is an uncommon variant of
leukemia in children. The median age at diagnosis
of acute myeloid leukemia is 65 years of age.
31
Learning set
(Diagram: predefined classes are tumor types, B-ALL, T-ALL and AML; objects are arrays and feature vectors are gene expression profiles; the classification rule assigns a new array to one of the classes, e.g. T-ALL.)
Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
32
Nearest neighbor rule
33
SVM
  • SVMs were originally proposed by Boser, Guyon and
    Vapnik in 1992 and gained increasing popularity
    in the late 1990s.
  • SVMs are currently among the best performers for
    a number of classification tasks ranging from
    text to genomic data.
  • SVM techniques have been extended to a number of
    tasks such as regression [Vapnik et al. 97],
    principal component analysis [Schölkopf et al.
    99], etc.
  • The most popular optimization algorithms for SVMs
    are SMO [Platt 99] and SVMlight [Joachims 99];
    both use decomposition to hill-climb over a
    subset of the αi's at a time.
  • Tuning SVMs remains a black art: selecting a
    specific kernel and parameters is usually done in
    a try-and-see manner.

34
SVM
  • In order to discriminate between two classes,
    given a training dataset
  • Map the data to a higher dimension space (feature
    space)
  • Separate the two classes using an optimal linear
    separator

35
Feature Space Mapping
  • Map the original data to some higher-dimensional
    feature space where the training set is linearly
    separable

Φ: x → φ(x)
36
The Kernel Trick
  • The linear classifier relies on the inner product
    between vectors: K(xi, xj) = xi^T xj
  • If every datapoint is mapped into a
    high-dimensional space via some transformation
    Φ: x → φ(x), the inner product becomes
    K(xi, xj) = φ(xi)^T φ(xj)
  • A kernel function is a function that
    corresponds to an inner product in some expanded
    feature space.
  • Example
  • 2-dimensional vectors x = [x1, x2]; let
    K(xi, xj) = (1 + xi^T xj)^2
  • Need to show that K(xi, xj) = φ(xi)^T φ(xj):
  • K(xi, xj) = (1 + xi^T xj)^2
    = 1 + xi1^2 xj1^2 + 2 xi1 xj1 xi2 xj2 + xi2^2 xj2^2
    + 2 xi1 xj1 + 2 xi2 xj2
  • = [1, xi1^2, √2 xi1 xi2, xi2^2, √2 xi1, √2 xi2]^T
    [1, xj1^2, √2 xj1 xj2, xj2^2, √2 xj1, √2 xj2]
  • = φ(xi)^T φ(xj), where
    φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]
    (a numerical check follows after this list)

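A quick numerical check of the identity above, using NumPy only (the feature-map name phi is illustrative):

# Verify that the polynomial kernel (1 + x·y)^2 equals the inner product in the mapped space.
import numpy as np

def phi(x):
    """Explicit feature map for the 2-D polynomial kernel of degree 2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi = np.array([0.7, -1.2])
xj = np.array([2.0, 0.5])

kernel = (1.0 + xi @ xj) ** 2          # kernel computed in the original 2-D space
mapped = phi(xi) @ phi(xj)             # inner product in the 6-D feature space
print(kernel, mapped)                  # the two values agree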
37
Linear Separators
Which one is the best?
38
Optimal hyperplane
Support vectors uniquely characterize the optimal
hyperplane
(Diagram: the margin, the optimal hyperplane, and the support vectors.)
39
Optimal hyperplane: geometric view
(Diagram: the two classes on either side of the optimal hyperplane.)
40
Soft Margin Classification
  • What if the training set is not linearly
    separable?
  • Slack variables ξi can be added to allow
    misclassification of difficult or noisy examples.

(Diagram: two examples violating the margin, with slack variables ξj and ξk.)
41
Weakening the constraints
Allow the objects to not strictly obey the
constraints by introducing slack variables ξi
42
Influence of C
Erroneous objects can still have a (large)
influence on the solution
C is the penalty parameter that weights the slack
variables in the objective (see the sketch below)
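A minimal sketch of the influence of C, assuming scikit-learn; a larger C penalizes slack more heavily and typically leaves fewer support vectors:

# Effect of the penalty parameter C in a soft-margin SVM (sketch, assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")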
43
SVM
  • Advantages
  • maximize the margin between two classes in the
    feature space characterized by a kernel function
  • are robust with respect to high input dimension
  • Disadvantages
  • difficult to incorporate background knowledge
  • Sensitive to outliers

44
SVM and outliers
(Diagram: an outlier strongly influences the separating hyperplane.)
45
Classifying new examples
  • Given a new point x, its class membership is
    sign(f(x, α, b)), where
    f(x, α, b) = Σi αi yi (xi · x) + b

The data enter only in the form of dot products!
In general, the dot product can be replaced by a
kernel function: f(x, α, b) = Σi αi yi K(xi, x) + b
(a sketch follows below)
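A minimal sketch showing that a fitted SVM classifies new points using only (kernel) products with its support vectors, assuming scikit-learn (SVC stores the products αi·yi in dual_coef_):

# Reconstruct the decision value f(x) = sum_i (a_i y_i) K(x_i, x) + b from a fitted SVC.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel="linear").fit(X, y)

x_new = X[:1]
K = clf.support_vectors_ @ x_new.T                  # dot products with the support vectors only
f = (clf.dual_coef_ @ K).ravel() + clf.intercept_   # sum_i (a_i y_i) K(x_i, x) + b
print(f, clf.decision_function(x_new))              # the two values agree
print("predicted class:", clf.predict(x_new))       # the sign of f decides the class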
46
Classification: CV error
  • Training error
  • Empirical error
  • Error on independent test set
  • Test error
  • Cross validation (CV) error
  • Leave-one-out (LOO)
  • n-fold CV

(Diagram: the N samples are repeatedly split, with N/n samples used for testing and N(n-1)/n samples for training in each round; the errors are counted and summarized as the CV error rate; see the sketch below.)
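A minimal sketch of n-fold and leave-one-out cross-validation, assuming scikit-learn:

# Estimate classifier error by n-fold and leave-one-out cross-validation (sketch, scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

cv10 = cross_val_score(clf, X, y, cv=10)                 # 10-fold CV: N/10 test, 9N/10 train per fold
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())       # leave-one-out: each sample tested once
print("10-fold CV error:", 1 - cv10.mean())
print("LOO CV error:", 1 - loo.mean())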