Introduction to data mining - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: Introduction to data mining


1
Introduction to data mining
  • Peter van der Putten
  • Leiden Institute of Advanced Computer Science
  • Leiden University
  • putten_at_liacs.nl
  • Transcriptomics and Proteomics in Zebrafish
    workshop, Leiden University
  • March 9, 2006

2
Presentation Outline
  • Objective
  • Present the basics of data mining
  • Gain understanding of the potential for applying
    it in the bioinformatics domain

3
Agenda Today
  • Data mining definitions
  • Before Starting to Mine...
  • Descriptive Data Mining
  • Dimension Reduction / Projection
  • Clustering
  • Association rules
  • Predictive data mining concepts
  • Classification and regression
  • Bioinformatics applications
  • Predictive data mining techniques
  • Logistic Regression
  • Nearest Neighbor
  • Decision Trees
  • Naive Bayes
  • Neural Networks
  • Evaluating predictive models
  • Demonstration (optional)

4
The Promise...

5
The Promise...

6
The Promise...
7
  • The Solution...
  • NCBI Tools for data mining
  • Nucleotide sequence analysis
  • Protein sequence analysis
  • Structures
  • Genome analysis
  • Gene expression
  • Data mining or not?

8
  • What is data mining?

9
Sources of (artificial) intelligence
  • Reasoning versus learning
  • Learning from data
  • Patient data
  • Genomics, proteomics
  • Customer records
  • Stock prices
  • Piano music
  • Criminal mug shots
  • Websites
  • Robot perceptions
  • Etc.

10
Some working definitions.
  • Data Mining and Knowledge Discovery in
    Databases (KDD) are used interchangeably
  • Data mining
  • The process of discovery of interesting,
    meaningful and actionable patterns hidden in
    large amounts of data
  • Multidisciplinary field originating from
    artificial intelligence, pattern recognition,
    statistics, machine learning, bioinformatics,
    econometrics, ...

11
Some working definitions.
  • Bioinformatics
  • Bioinformatics is the research, development, or
    application of computational tools and approaches
    for expanding the use of biological, medical,
    behavioral or health data, including those to
    acquire, store, organize, archive, analyze, or
    visualize such data (http://www.bisti.nih.gov/)
  • Or, more pragmatically: bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems (Wikipedia, Nov 2005)

12
Bioinformatics and data mining
  • From sequence to structure to function
  • Genomics (DNA), Transcriptomics (RNA), Proteomics
    (proteins), Metabolomics (metabolites)
  • Pattern matching and search
  • Sequence matching and alignment
  • Structure prediction
  • Predicting structure from sequence
  • Protein secondary structure prediction
  • Function prediction
  • Predicting function from structure
  • Protein localization
  • Expression analysis
  • Genes: microarray data analysis, etc.
  • Proteins
  • Regulation analysis

13
Bioinformatics and data mining
  • Classical medical and clinical studies
  • Medical decision support tools
  • Text mining on medical research literature
    (MEDLINE)
  • Spectrometry, Imaging
  • Systems biology and modeling biological systems
  • Population biology simulation
  • Spin-off: biologically inspired computational learning
  • Evolutionary algorithms, neural networks,
    artificial immune systems

14
Genomic Microarrays Case Study
  • Problem
  • Leukemia (different types of Leukemia cells look
    very similar)
  • Given data for a number of samples (patients),
    can we
  • Accurately diagnose the disease?
  • Predict outcome for given treatment?
  • Recommend best treatment?
  • Solution
  • Data mining on microarray data

15
Microarray data
  • 50 most important genes
  • Rows: genes
  • Columns: samples / patients

16
Example ALL/AML data
  • 38 training patients, 34 test patients, 7,000 patient attributes (microarray gene data)
  • 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
  • Use the training data to build a diagnostic model

(Figure labels: ALL, AML)
Results on test data: 33/34 correct; the 1 error may be a mislabeled sample
17
Some working definitions.
  • Data Mining and Knowledge Discovery in
    Databases (KDD) are used interchangeably
  • Data mining
  • The process of discovery of interesting,
    meaningful and actionable patterns hidden in
    large amounts of data
  • Multidisciplinary field originating from
    artificial intelligence, pattern recognition,
    statistics, machine learning, bioinformatics,
    econometrics, ...

18
The Knowledge Discovery Process
19
Some working definitions.
  • Concepts: kinds of things that can be learned
  • Aim: an intelligible and operational concept description
  • Example: the relation between patient characteristics and the probability of being diabetic
  • Instances: the individual, independent examples of a concept
  • Example: a patient, a candidate drug, etc.
  • Attributes: measured aspects of an instance
  • Example: age, weight, lab tests, microarray data, etc.
  • Pattern or attribute space

20
Data mining tasks
  • Descriptive data mining
  • Matching / search: finding instances similar to x
  • Clustering: discovering groups of similar instances
  • Association rule extraction: if a and b then c
  • Summarization: summarizing group descriptions
  • Link detection: finding relationships
  • Predictive data mining
  • Classification: classify an instance into a category
  • Regression: estimate some continuous value

21
Before starting to mine.
  • Pima Indians Diabetes Data
  • X: body mass index
  • Y: age

22
Before starting to mine.
23
Before starting to mine.
24
Before starting to mine.
  • Attribute Selection
  • This example: InfoGain (information gain) per attribute
  • Keep the most important ones

25
Before starting to mine.
  • Types of Attribute Selection
  • Univariate versus multivariate (subset selection)
  • The fact that attribute x is a strong uni-variate
    predictor does not necessarily mean it will add
    predictive power to a set of predictors already
    used by a model
  • Filter versus wrapper
  • Wrapper methods involve the subsequent learner
    (classifier or other)

26
Dimension Reduction
  • Projecting high dimensional data into a lower
    dimension
  • Principal Component Analysis (see the sketch below)
  • Independent Component Analysis
  • Fisher Mapping, Sammon's Mapping, etc.
  • Multi-Dimensional Scaling
  • ...
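A minimal sketch of such a projection with scikit-learn's PCA; the data shape and variable names are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 38 samples x 7000 gene-expression attributes
X = np.random.default_rng(0).random((38, 7000))

pca = PCA(n_components=2)               # keep the two directions of largest variance
X_2d = pca.fit_transform(X)             # shape (38, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)    # fraction of variance captured per component
```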

27
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances. Groups are different; instances within a group are similar. In 2- to 3-dimensional pattern space you could just visualise the data and leave the recognition to the human end user.
(Figure axes: e.g. weight vs. e.g. age)
28
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances. Groups are different; instances within a group are similar. In 2- to 3-dimensional pattern space you could just visualise the data and leave the recognition to the human end user. In >3 dimensions this is not possible.
(Figure axes: e.g. weight vs. e.g. age)
29
Clustering Techniques
  • Hierarchical algorithms
  • Agglomerative
  • Divisive
  • Partition based clustering
  • K-Means
  • Self Organizing Maps / Kohonen Networks
  • Probabilistic Model based
  • Expectation Maximization / Mixture Models

30
Hierarchical clustering
  • Agglomerative / Bottom up
  • Start with single-instance clusters
  • At each step, join the two closest clusters
  • Methods to compute the distance between clusters x and y: single linkage (distance between the closest points in x and y), average linkage (average distance between all point pairs), complete linkage (distance between the furthest points), centroid
  • Distance measures: Euclidean, correlation, etc.
  • Divisive / Top Down
  • Start with all data in one cluster
  • Split into two clusters based on distance measure
    / split utility
  • Proceed recursively on each subset
  • Both methods produce a dendrogram (see the sketch below)
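A minimal agglomerative sketch with SciPy, assuming a small illustrative data matrix; the linkage method and distance metric map onto the options listed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).random((20, 5))     # 20 instances, 5 attributes (illustrative)

# Bottom-up clustering: average linkage with correlation distance,
# the combination used in the microarray example that follows.
Z = linkage(X, method='average', metric='correlation')

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
# dendrogram(Z) would draw the tree (requires matplotlib)
```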

31
Levels of Clustering
Agglomerative
Divisive
Dunham, 2003
32
Hierarchical Clustering Example
  • Clustering Microarray Gene Expression Data
  • Gene expression measured using microarrays under a variety of conditions
  • On budding yeast Saccharomyces cerevisiae
  • Efficiently groups together genes of known similar function

33
Hierarchical Clustering Example
  • Method
  • Genes are the instances, samples the attributes!
  • Agglomerative
  • Distance measure: correlation

34
Simple Clustering: K-means
  • Pick a number (k) of cluster centers (at random)
  • Cluster centers are sometimes called codes, and the k codes a codebook
  • Assign every item to its nearest cluster center
  • E.g. by Euclidean distance
  • Move each cluster center to the mean of its assigned items
  • Repeat until convergence
  • E.g. until the change in cluster assignments is below a threshold (see the sketch below)

KDnuggets
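A minimal NumPy sketch of this loop (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n_instances, n_attributes) array X."""
    rng = np.random.default_rng(seed)
    codes = X[rng.choice(len(X), size=k, replace=False)]          # k random cluster centers
    for _ in range(n_iter):
        # assign every item to its nearest code (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - codes[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # move each code to the mean of its assigned items
        new_codes = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else codes[j]
                              for j in range(k)])
        if np.allclose(new_codes, codes):                         # converged: centers stable
            break
        codes = new_codes
    return codes, assign

# illustrative use on random 2-D data
centers, labels = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```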
35
K-means example, step 1
Initially distribute codes randomly in
pattern space
KDnuggets
36
K-means example, step 2
Assign each point to the closest code
KDnuggets
37
K-means example, step 3
Move each code to the mean of all its assigned
points
KDnuggets
38
K-means example, step 2
Repeat the process: reassign the data points to the codes. Q: Which points are reassigned?
KDnuggets
39
K-means example
Repeat the process: reassign the data points to the codes. Q: Which points are reassigned?
KDnuggets
40
K-means example
re-compute cluster means
KDnuggets
41
K-means example
move cluster centers to cluster means
KDnuggets
42
K-means clustering summary
  • Advantages
  • Simple, understandable
  • Items are automatically assigned to clusters
  • Disadvantages
  • Must pick the number of clusters beforehand
  • All items are forced into a cluster
  • Sensitive to outliers
  • Extensions
  • Adaptive k-means
  • K-medoids (based on the median instead of the mean)
  • 1, 2, 3, 4, 100 → average 22, median 3

43
Biological Example
  • Clustering of yeast cell images
  • Two clusters are found
  • Left cluster: primarily cells with a thick capsule; right cluster: thin capsule
  • Caused by media; a proxy for sick vs. healthy

44
Self Organizing Maps (Kohonen Maps)
  • Claim to fame
  • Simplified models of cortical maps in the brain
  • Things that are near in the outside world link to areas near in the cortex
  • For a variety of modalities: touch, motor, ... up to echolocation
  • Nice visualization
  • From a data mining perspective
  • SOMs are simple extensions of k-means clustering
  • Codes are connected in a lattice
  • In each iteration, codes neighboring the winning code in the lattice are also allowed to move (see the sketch below)
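A minimal NumPy sketch of that idea: a k-means-like update in which lattice neighbours of the winning code also move. The grid size, decay schedules and data are illustrative assumptions:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Tiny SOM: one code per lattice node; the winner and its lattice
    neighbours are pulled towards each presented instance."""
    rng = np.random.default_rng(seed)
    h, w = grid
    codes = rng.random((h * w, X.shape[1]))
    # lattice coordinates of each code, used for the neighbourhood function
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                         # random training instance
        winner = np.argmin(((codes - x) ** 2).sum(axis=1))  # best matching code
        lr = lr0 * (1 - t / n_iter)                         # decaying learning rate
        sigma = sigma0 * (1 - t / n_iter) + 0.5             # shrinking neighbourhood
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)   # lattice distance to winner
        g = np.exp(-d2 / (2 * sigma ** 2))                  # neighbourhood weights
        codes += lr * g[:, None] * (x - codes)              # move winner and neighbours
    return codes.reshape(h, w, -1)

codes = train_som(np.random.default_rng(1).random((500, 2)))   # cf. the 10x10 SOM slide
```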

45
SOM
10x10 SOM Gaussian Distribution
46
SOM
47
SOM
48
SOM
49
SOM example
50
Famous example: Phonetic Typewriter
  • The SOM lattice (below left) is trained on spoken letters; after convergence the codes are labeled
  • Creates a phonotopic map
  • A spoken word creates a sequence of labels

51
Famous example: Phonetic Typewriter
  • Criticism
  • The topology-preserving property is not used, so why use SOMs and not, for instance, adaptive k-means?
  • K-means could also create a sequence
  • This is true for most SOM applications!
  • Is using clustering for classification optimal?

52
Bioinformatics Example: Clustering GPCRs
  • Clustering G Protein Coupled Receptors (GPCRs) (Samsanova et al., 2003, 2004)
  • Important drug target, function often unknown

53
Bioinformatics Example: Clustering GPCRs
54
Association Rules Outline
  • What are frequent item sets and association rules?
  • Quality measures
  • Support, confidence, lift
  • How to find item sets efficiently?
  • APRIORI
  • How to generate association rules from an item
    set?
  • Biological examples

KDnuggets
55
Market Basket Example / Gene Expression Example
  • Frequent item set
  • {MILK, BREAD}: 4
  • Association rule
  • {MILK, BREAD} → EGGS
  • Frequency / importance: 2 (support)
  • Quality: 50% (confidence)
  • What genes are expressed (active) together?
  • Interaction / regulation
  • Similar function

56
Association Rule Definitions
  • Set of items I = {I1, I2, ..., Im}
  • Transactions D = {t1, t2, ..., tn}, tj ⊆ I
  • Itemset {Ii1, Ii2, ..., Iik} ⊆ I
  • Support of an itemset: the percentage of transactions which contain that itemset
  • Large (frequent) itemset: an itemset whose number of occurrences is above a threshold

Dunham, 2003
57
Frequent Item Set Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Dunham, 2003
58
Association Rule Definitions
  • Association Rule (AR): an implication X → Y where X, Y ⊆ I and X, Y are disjoint
  • Support of AR X → Y (s): the percentage of transactions that contain X ∪ Y
  • Confidence of AR X → Y (a): the ratio of the number of transactions that contain X ∪ Y to the number that contain X (see the sketch below)

Dunham, 2003
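A minimal sketch of these two measures on a small, made-up transaction list (the transactions are illustrative, not the table from the slides):

```python
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """Confidence of the rule X -> Y: support(X ∪ Y) / support(X)."""
    return support(set(X) | set(Y)) / support(set(X))

print(support({"Bread", "PeanutButter"}))        # 3/5 = 0.6, i.e. 60%
print(confidence({"Bread"}, {"PeanutButter"}))   # 3/4 = 0.75
```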
59
Association Rules Example (cont'd)
Dunham, 2003
60
Association Rule Problem
  • Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X → Y with a minimum support and confidence.
  • NOTE: the support of X → Y is the same as the support of X ∪ Y.

Dunham, 2003
61
Association Rules Example
  • Q: Given the frequent set {A, B, E}, which association rules have minsup = 2 and minconf = 50%?
  • {A, B} → E: conf = 2/4 = 50%
  • {A, E} → B: conf = 2/2 = 100%
  • {B, E} → A: conf = 2/2 = 100%
  • E → {A, B}: conf = 2/2 = 100%
  • Don't qualify:
  • A → {B, E}: conf = 2/6 = 33% < 50%
  • B → {A, E}: conf = 2/7 = 28% < 50%
  • ∅ → {A, B, E}: conf = 2/9 = 22% < 50%

KDnuggets
62
Solution: Association Rule Problem
  • First, find all frequent itemsets with sup > minsup
  • Exhaustive search won't work
  • Assume we have a set of m items → 2^m subsets!
  • Exploit the subset property (APRIORI algorithm)
  • For every frequent item set, derive rules with confidence > minconf

KDnuggets
63
Finding itemsets: next level
  • Apriori algorithm (Agrawal & Srikant)
  • Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
  • Subset property: if (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  • In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  • Compute k-item sets by merging (k-1)-item sets (see the sketch below)

KDnuggets
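A minimal sketch of that candidate-generation step (join plus subset-based pruning); the function name apriori_gen is illustrative, and the five three-item sets are the ones from the example on the next slide:

```python
from itertools import combinations

def apriori_gen(frequent):
    """Candidate k-item sets from a set of frequent (k-1)-item sets (frozensets)."""
    frequent = set(frequent)
    k = len(next(iter(frequent))) + 1
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == k:     # join: merge two sets that differ in exactly one item
            # prune: every (k-1)-subset of the candidate must itself be frequent
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# The five frequent three-item sets from the example on the next slide
freq3 = {frozenset(s) for s in [("A","B","C"), ("A","B","D"), ("A","C","D"),
                                ("A","C","E"), ("B","C","D")]}
print(apriori_gen(freq3))   # only {A, B, C, D} survives; (A C D E) is pruned
```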
64
An example
  • Given five three-item sets
  • (A B C), (A B D), (A C D), (A C E), (B C D)
  • Candidate four-item sets:
  • (A B C D) Q: OK?
  • A: yes, because all 3-item subsets are frequent
  • (A C D E) Q: OK?
  • A: no, because (C D E) is not frequent

KDnuggets
65
From Frequent Itemsets to Association Rules
  • Q: Given the frequent set {A, B, E}, what are the possible association rules?
  • A → {B, E}
  • {A, B} → E
  • {A, E} → B
  • B → {A, E}
  • {B, E} → A
  • E → {A, B}
  • ∅ → {A, B, E} (empty rule), or true → {A, B, E}

KDnuggets
66
Example: Generating Rules from an Itemset
  • Frequent itemset from golf data
  • Seven potential rules

Frequent itemset: Humidity = Normal, Windy = False, Play = Yes (support 4)

If Humidity = Normal and Windy = False then Play = Yes              4/4
If Humidity = Normal and Play = Yes then Windy = False              4/6
If Windy = False and Play = Yes then Humidity = Normal              4/6
If Humidity = Normal then Windy = False and Play = Yes              4/7
If Windy = False then Humidity = Normal and Play = Yes              4/8
If Play = Yes then Humidity = Normal and Windy = False              4/9
If True then Humidity = Normal and Windy = False and Play = Yes     4/12
KDnuggets
67
Example: Generating Rules
  • Rules with support > 1 and confidence = 100%
  • In total: 3 rules with support four, 5 with support three, and 50 with support two

    Association rule                                      Sup.  Conf.
 1  Humidity=Normal, Windy=False → Play=Yes                4    100%
 2  Temperature=Cool → Humidity=Normal                     4    100%
 3  Outlook=Overcast → Play=Yes                            4    100%
 4  Temperature=Cool, Play=Yes → Humidity=Normal           3    100%
    ...                                                   ...    ...
 58 Outlook=Sunny, Temperature=Hot → Humidity=High         2    100%
KDnuggets
68
Weka associations output
KDnuggets
69
Extensions and Challenges
  • Extra quality measure: lift
  • The lift of an association rule I → J is defined as
  • lift = P(J|I) / P(J) (see the sketch below)
  • Note: P(I) = (support of I) / (no. of transactions)
  • I.e. the ratio of confidence to expected confidence
  • Interpretation
  • If lift > 1, then I and J are positively correlated
  • If lift < 1, then I and J are negatively correlated
  • If lift = 1, then I and J are independent
  • Other measures of interestingness
  • A → B, B → C, but not A → C
  • Efficient algorithms
  • Known problem
  • What to do with all these rules? How to exploit them / make them useful / actionable?

KDnuggets
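Continuing the hypothetical transaction list and the support / confidence functions from the earlier sketch, lift is one extra line:

```python
def lift(X, Y):
    """Lift of X -> Y: confidence(X -> Y) divided by the expected confidence P(Y)."""
    return confidence(X, Y) / support(Y)

# > 1: positively correlated, < 1: negatively correlated, = 1: independent
print(lift({"Bread"}, {"PeanutButter"}))   # 0.75 / 0.6 = 1.25
```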
70
Biomedical Application: Head and Neck Cancer Example
  • 1. ace270 fiveyralive 381 → tumorbefore0 372 conf(0.98)
  • 2. genderM ace270 467 → tumorbefore0 455 conf(0.97)
  • 3. ace270 588 → tumorbefore0 572 conf(0.97)
  • 4. tnmT0N0M0 ace270 405 → tumorbefore0 391 conf(0.97)
  • 5. locLOC7 tumorbefore0 409 → tnmT0N0M0 391 conf(0.96)
  • 6. locLOC7 442 → tnmT0N0M0 422 conf(0.95)
  • 7. locLOC7 genderM tumorbefore0 374 → tnmT0N0M0 357 conf(0.95)
  • 8. locLOC7 genderM 406 → tnmT0N0M0 387 conf(0.95)
  • 9. genderM fiveyralive 633 → tumorbefore0 595 conf(0.94)
  • 10. fiveyralive 778 → tumorbefore0 726 conf(0.93)

71
Bioinformatics Application
  • The idea of association rules has been customized for bioinformatics applications
  • In biology it is often interesting to find frequent structures rather than items
  • For instance, protein or other chemical structures
  • Solution: mining frequent patterns
  • FSG (Kuramochi and Karypis, ICDM 2001)
  • gSpan (Yan and Han, ICDM 2002)
  • CloseGraph (Yan and Han, KDD 2002)

72
FSG Mining Frequent Patterns
73
FSG Mining Frequent Patterns
74
FSG Algorithm for finding frequent subgraphs
75
Frequent Subgraph Examples: AIDS Data
  • Compounds are active, inactive or moderately
    active (CA, CI, CM)

76
Predictive Subgraphs
  • The three most discriminating sub-structures for the PTC, AIDS, and Anthrax datasets

77
Experiments and Results: AIDS Data
78
FSG References
  • Frequent Sub-structure Based Approaches for Classifying Chemical Compounds. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. ICDM 2003
  • An Efficient Algorithm for Discovering Frequent Subgraphs. Michihiro Kuramochi and George Karypis. IEEE TKDE
  • Automated Approaches for Classifying Structures. Mukund Deshpande, Michihiro Kuramochi, and George Karypis. BIOKDD 2002
  • Discovering Frequent Geometric Subgraphs. Michihiro Kuramochi and George Karypis. ICDM 2002
  • Frequent Subgraph Discovery. Michihiro Kuramochi and George Karypis. 1st IEEE Conference on Data Mining, 2001

79
Recap
  • Before Starting to Mine...
  • Descriptive Data Mining
  • Dimension Reduction / Projection
  • Clustering
  • Hierarchical clustering
  • K-means
  • Self organizing maps
  • Association rules
  • Frequent item sets
  • Association Rules
  • APRIORI
  • Bioinformatics case: FSG for frequent subgraph discovery
  • Next
  • Predictive data mining

80
Data Mining Tasks: Classification
The goal of a classifier is to separate classes on the basis of known attributes. The classifier can then be applied to an instance with unknown class. For instance: the classes are healthy (circle) and sick (square); the attributes are age and weight.
(Figure axes: weight vs. age)
81
Data Preparation for Classification
  • On attributes
  • Attribute selection
  • Attribute construction
  • On attribute values
  • Outlier removal / clipping
  • Normalization
  • Creating dummies
  • Missing values imputation
  • ...

82
Examples of Classification Techniques
  • Majority class vote
  • Logistic Regression
  • Nearest Neighbor
  • Decision Trees, Decision Stumps
  • Naive Bayes
  • Neural Networks
  • Genetic algorithms
  • Artificial Immune Systems

83
Example classification algorithm: Logistic Regression
  • Linear regression
  • For regression, not classification (the outcome is numeric, not a symbolic class)
  • The predicted value is a linear combination of the inputs
  • Logistic regression
  • Apply the logistic function to the linear regression formula
  • Scales the output between 0 and 1
  • For binary classification, use thresholding (see the sketch below)
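A minimal scikit-learn sketch of this; the toy data and the two attributes are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: two attributes (say age and weight), binary class (0 = healthy, 1 = sick)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # a linearly separable toy problem

clf = LogisticRegression()                   # logistic function on top of a linear combination
clf.fit(X, y)

probs = clf.predict_proba(X[:5])[:, 1]       # outputs scaled between 0 and 1
labels = (probs > 0.5).astype(int)           # thresholding gives the binary class
```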

84
Example classification algorithm: Logistic Regression
Classification: linear decision boundaries can be represented well with linear classifiers like logistic regression.
(Figure axes: e.g. weight vs. e.g. age)
85
Logistic Regression in attribute space
Prediction: linear decision boundaries can be represented well with linear classifiers like logistic regression.
(Figure axes: e.g. weight vs. e.g. age)
86
Logistic Regression in attribute space
Prediction: non-linear decision boundaries cannot be represented well with linear classifiers like logistic regression.
(Figure axes: e.g. weight vs. e.g. age)
87
Logistic Regression in attribute space
Non-linear decision boundaries cannot be represented well with linear classifiers like logistic regression. A well-known example: the XOR problem.
(Figure axes: e.g. weight vs. e.g. age)
88
Example classification algorithm: Nearest Neighbour
  • The data itself is the classification model, so there is no model abstraction like a tree etc.
  • For a given instance x, search for the k instances that are most similar to x
  • Classify x as the most frequently occurring class among the k most similar instances (see the sketch below)
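A minimal scikit-learn sketch, on the same kind of illustrative two-attribute toy data as the logistic regression example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # illustrative attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # illustrative binary class

knn = KNeighborsClassifier(n_neighbors=5)    # k most similar instances, Euclidean by default
knn.fit(X, y)                                # "training" just stores the data
print(knn.predict(X[:5]))                    # majority class among the 5 nearest neighbours
```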

89
Nearest Neighbor in attribute space
Classification of a new instance: any decision area is possible, on the condition that enough data is available.
(Figure axes: e.g. weight vs. e.g. age)
90
Nearest Neighbor in attribute space
Prediction: any decision area is possible, on the condition that enough data is available.
(Figure axes: e.g. weight vs. e.g. age)
91
Example Classification Algorithm: Decision Trees
(Figure: an example decision tree. Root: 20,000 patients, split on age > 67 into 18,800 and 1,200 patients; further splits on gender = male? and weight > 85 kg lead to leaves such as 800 and 400 patients, labelled with the fraction of diabetics, e.g. Diabetic (10%) and Diabetic (50%).)
92
Building TreesWeather Data example
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
KDNuggets / Witten Frank, 2000
93
Building Trees
  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, e.g., Color = red.
  • A leaf node represents a class label or class
    label distribution.
  • At each node, one attribute is chosen to split
    training examples into distinct classes as much
    as possible
  • A new case is classified by following a matching
    path to a leaf node.

(Figure: the resulting tree)
Outlook?
  sunny    → Humidity? (high → No, normal → Yes)
  overcast → Yes
  rain     → Windy? (false → Yes, true → No)
KDNuggets / Witten Frank, 2000
94
Split on what attribute?
  • Which is the best attribute to split on?
  • The one which will result in the smallest tree
  • Heuristic: choose the attribute that produces the best separation of classes (the purest nodes)
  • Popular impurity measure: information
  • Measured in bits
  • At a given node, how much more information do you
    need to classify an instance correctly?
  • What if at a given node all instances belong to
    one class?
  • Strategy
  • choose attribute that results in greatest
    information gain

KDNuggets / Witten Frank, 2000
95
Which attribute to select?
  • Candidate: the outlook attribute
  • What is the info for the leaves?
  • info([2,3]) = 0.971 bits
  • info([4,0]) = 0 bits
  • info([3,2]) = 0.971 bits
  • Total: take the average, weighted by the number of instances
  • info([2,3], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
  • What was the info before the split?
  • info([9,5]) = 0.940 bits
  • What is the gain for a split on outlook?
  • gain(outlook) = 0.940 - 0.693 = 0.247 bits (see the sketch below)

Witten Frank, 2000
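A minimal sketch that reproduces these numbers (the function names are illustrative; the counts are the ones from the slide):

```python
import numpy as np

def info(counts):
    """Entropy in bits of a class-count vector, e.g. info([2, 3])."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def gain(parent_counts, child_counts):
    """Information gain of a split: info before minus weighted info after."""
    n = sum(sum(c) for c in child_counts)
    after = sum(sum(c) / n * info(c) for c in child_counts)
    return info(parent_counts) - after

# outlook split on the weather data: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2]
print(info([2, 3]))                              # ~0.971 bits
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))    # ~0.247 bits
```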
96
Which attribute to select?
Gain(outlook) = 0.247 bits
Gain(humidity) = 0.152 bits
Gain(windy) = 0.048 bits
Gain(temperature) = 0.029 bits
Witten Frank, 2000
97
Continuing to split
KDNuggets / Witten Frank, 2000
98
The final decision tree
  • Note: not all leaves need to be pure; sometimes identical instances have different classes
  • → Splitting stops when the data can't be split any further

KDNuggets / Witten Frank, 2000
99
Computing information
  • Information is measured in bits
  • When a leaf contains one class only, the information is 0 (pure)
  • When the number of instances is the same for all classes, the information reaches a maximum (impure)
  • Measure: information value, or entropy
  • Example (log base 2):
  • info([2,3,4]) = -2/9 × log(2/9) - 3/9 × log(3/9) - 4/9 × log(4/9)

KDNuggets / Witten Frank, 2000
100
Decision Trees in Pattern Space
The goal of the classifier is to separate the classes (circle, square) on the basis of the attributes age and weight. Each line corresponds to a split in the tree. Decision areas are "tiles" in pattern space.
(Figure axes: weight vs. age)
101
Decision Trees in attribute space
The goal of the classifier is to separate the classes (circle, square) on the basis of the attributes age and weight. Each line corresponds to a split in the tree. Decision areas are "tiles" in attribute space.
(Figure axes: weight vs. age)
102
Example classification algorithm: Naive Bayes
  • Naive Bayes: a probabilistic classifier based on Bayes' rule
  • Will produce a probability for each target / outcome class
  • "Naive" because it assumes independence between attributes (uncorrelated)

103
Bayes's rule
  • Probability of event H given evidence E: P(H|E) = P(E|H) × P(H) / P(E)
  • A priori probability of H: P(H)
  • The probability of the event before the evidence is seen
  • A posteriori probability of H: P(H|E)
  • The probability of the event after the evidence is seen
  • From Bayes' "Essay towards solving a problem in the doctrine of chances" (1763)
  • Thomas Bayes: born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England

KDNuggets / Witten Frank, 2000
104
Naïve Bayes for classification
  • Classification learning: what's the probability of the class given an instance?
  • Evidence E = the instance
  • Event H = the class value for the instance
  • Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent, so P(H|E) ∝ P(E1|H) × P(E2|H) × ... × P(En|H) × P(H)

KDNuggets / Witten Frank, 2000
105
Weather data example
Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
Evidence E
Probability of class yes
KDNuggets / Witten Frank, 2000
106
Probabilities for weather data
Counts (Yes, No):
  Outlook:     Sunny (2, 3), Overcast (4, 0), Rainy (3, 2)
  Temperature: Hot (2, 2), Mild (4, 2), Cool (3, 1)
  Humidity:    High (3, 4), Normal (6, 1)
  Windy:       False (6, 2), True (3, 3)
  Play:        Yes 9, No 5
As probabilities (Yes, No):
  Outlook:     Sunny (2/9, 3/5), Overcast (4/9, 0/5), Rainy (3/9, 2/5)
  Temperature: Hot (2/9, 2/5), Mild (4/9, 2/5), Cool (3/9, 1/5)
  Humidity:    High (3/9, 4/5), Normal (6/9, 1/5)
  Windy:       False (6/9, 2/5), True (3/9, 3/5)
  Play:        Yes 9/14, No 5/14
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
KDNuggets / Witten Frank, 2000
107
Probabilities for weather data
Counts (Yes, No):
  Outlook:     Sunny (2, 3), Overcast (4, 0), Rainy (3, 2)
  Temperature: Hot (2, 2), Mild (4, 2), Cool (3, 1)
  Humidity:    High (3, 4), Normal (6, 1)
  Windy:       False (6, 2), True (3, 3)
  Play:        Yes 9, No 5
As probabilities (Yes, No):
  Outlook:     Sunny (2/9, 3/5), Overcast (4/9, 0/5), Rainy (3/9, 2/5)
  Temperature: Hot (2/9, 2/5), Mild (4/9, 2/5), Cool (3/9, 1/5)
  Humidity:    High (3/9, 4/5), Normal (6/9, 1/5)
  Windy:       False (6/9, 2/5), True (3/9, 3/5)
  Play:        Yes 9/14, No 5/14
Outlook Temp. Humidity Windy Play
Sunny Cool High True ?
  • A new day

Likelihood of the two classes:
  For yes: 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For no:  3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
  P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
  P(no)  = 0.0206 / (0.0053 + 0.0206) = 0.795
KDNuggets / Witten Frank, 2000
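A quick numeric check of that computation in plain Python:

```python
# P(Sunny|yes) * P(Cool|yes) * P(High|yes) * P(True|yes) * P(yes), and the same for "no"
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

p_yes = like_yes / (like_yes + like_no)        # normalize so the two probabilities sum to 1
p_no  = like_no  / (like_yes + like_no)
print(round(like_yes, 4), round(like_no, 4))   # 0.0053 0.0206
print(round(p_yes, 3), round(p_no, 3))         # 0.205 0.795
```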
108
Extensions
  • Numeric attributes
  • Fit a normal distribution to calculate the probabilities
  • What if an attribute value doesn't occur with every class value? (e.g. Outlook = Overcast for class "no")
  • The probability will be zero!
  • The a posteriori probability will also be zero! (No matter how likely the other values are!)
  • Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
  • Result: probabilities will never be zero! (This also stabilizes probability estimates)

Witten & Eibe
109
Naïve Bayes discussion
  • Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
  • Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
  • However: adding too many redundant attributes will cause problems (e.g. identical attributes)

Witten & Eibe
110
Naive Bayes in attribute space
Classification: NB can model non-linear decision boundaries.
(Figure axes: e.g. weight vs. e.g. age)
111
Example classification algorithm: Neural Networks
  • Inspired by neuronal computation in the brain (McCulloch & Pitts, 1943 (!))
  • The input (attributes) is coded as activation on the input layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons (for instance diabetic yes/no)
  • The algorithm learns to find the optimal weights using the training instances and a general learning rule

112
Neural Networks
  • Example: a simple network (2 layers)
  • Probability of being diabetic = f(age × weight_age + body_mass_index × weight_body_mass_index)

(Figure: two input neurons, age and body_mass_index, connected by the weighted links weight_age and weight_body_mass_index to one output neuron, the probability of being diabetic)
113
Neural Networks in Pattern Space
Classification: a simple network has only a line available (why?) to separate the classes. A multilayer network can form any classification boundary.
(Figure axes: e.g. weight vs. e.g. age)
114
Evaluating Classifiers
  • Root mean squared error (RMSE), Area Under the ROC Curve (AUC), confusion matrices, classification accuracy
  • Accuracy of 78% → on the test set, 78% of the classifications were correct
  • Hold-out validation, n-fold cross validation, leave-one-out validation
  • Build a model on a training set, evaluate it on a test set
  • Hold out a single test set (e.g. one third of the data)
  • n-fold cross validation
  • Divide the data into n groups (folds)
  • Perform n cycles, each cycle with a different fold as the test set
  • Leave one out
  • Test set of one instance; cycle through all instances (see the sketch below)
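A minimal scikit-learn sketch of hold-out and n-fold cross validation; the classifier choice and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # illustrative attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # illustrative binary class

# Hold-out: build the model on a training set, evaluate on a held-out third of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))

# 10-fold cross validation: 10 cycles, each with a different fold as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```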

115
Evaluating Classifiers
  • Investigating the sources of error
  • Bias-variance decomposition
  • Informal definitions:
  • Bias: error due to limitations of the model representation (e.g. a linear classifier on a non-linear problem); even with infinite data there will be bias
  • Variance: error due to instability of the classifier over different samples; error due to small sample sizes, overfitting

116
Example Results: Predicting Survival for Head and Neck Cancer
  • TNM symbolic

TNM Numeric
Average and standard deviation (SD) on the
classification accuracy for all classifiers
117
Example Results, Head and Neck Cancer: Bias-Variance Decomposition
  • Quiz: what could be a strategy to improve these models?

118
Agenda Today
  • Data mining definitions
  • Before Starting to Mine...
  • Descriptive Data Mining
  • Dimension Reduction / Projection
  • Clustering
  • Association rules
  • Predictive data mining concepts
  • Classification and regression
  • Bioinformatics applications
  • Predictive data mining techniques
  • Logistic Regression
  • Nearest Neighbor
  • Decision Trees
  • Naive Bayes
  • Neural Networks
  • Evaluating predictive models
  • Demonstration (optional)

119
General Data Mining Resources
  • Best KDD website and mailing lists: the KDnuggets website (www.kdnuggets.com; check the bioinformatics sections)
  • Most popular open source data mining tool: WEKA (
  • Most important conference: KDD (see www.kdd2006.com; for instance, check BIOKDD, the bioinformatics data mining workshop)
  • You can contact me at putten_at_liacs.nl

120
  • Demo: WEKA

121
  • Questions and Answers