Title: L3S Overview Visit in Sweden
1 Introduction to Machine Learning
December 16, 2008
Avaré Stewart
2 Outline
- Motivation
- Learner Input
- Selected Learning Techniques
- Supervised Learning
- Unsupervised Learning
- Summary
- Tools
- Further Reading
3 Motivation
- Data Volume
- Terabytes of data
- How do we detect patterns?
- Structure, classify
- Too difficult for humans; how can automation help?
- Information Goal
- Recognize faces, objects in a picture
- Recognize speech
- Filter email
- Humans do this with ease ... but how do we encode such expertise in an algorithm?
4 Example Machine Learning Application - Text Mining
- Machine Learning
- has many practical applications
- can be applied to different domains
- e.g. personalized search, ranking
- Online Text Documents (e.g. blogs)
- Classify blogs into known classes
- Group blogs into clusters
- Discover salient terms
- Compact representations
5 What is Machine Learning?
- Machine Learning is
- programming computers to optimize a performance criterion using example data or past experience
- A machine has learnt when
- it changes its structure, program, or data, based on its inputs, in such a manner that its expected future performance improves
6 Supervised Learning
[Figure: training examples shown as attributes, attribute values, and a class variable]
- Technique for creating a target function relating the inputs (attribute values) to the output (class variable)
- Predict the value of the function for unseen test data (any valid input object) after having seen a number of training examples
7 Unsupervised Learning
[Figure: unlabeled input data mapped to discovered output groupings]
- Input: an a-priori partition of the data is not given
- Goal: discover intrinsic groupings of the input data based on a
- Similarity function
- Distance function
8 Semi-Supervised Learning
- Labeled Data Tradeoff
- Labeled data guides the machine learning algorithm
- Labeled data requires high manual effort
- Most real data is not labeled
- Semi-Supervised Learning
- Some labeled, mostly unlabeled data
- Labels: positive or negative
[Figure: semi-supervised setting with a few labeled and many unlabeled points]
9 Our Roadmap
- Learner Inputs
- Case: Text Documents
- Supervised Learning
- Case: Naïve Bayes
- Case: Bayesian Network
- Unsupervised Learning
- EM Algorithm
- Case: Latent Topic Model
10 Learner Preprocessing / Input (Text Documents)
- Regardless of the Learning Method
- text documents are not directly processable by the learner
- preprocessing influences the results of the learner
- Partitioning Data
- Training
- Testing
- Validation
- Indexing
- Interpretation of a term
- Interpretation of a weight
- Stop Words
- Stemming
- Pruning
[Diagram: the preprocessing steps span Information Retrieval and Machine Learning]
11 Data Partitioning
- Training Set: build the learner
- Validation Set: tune the learner
- Test Set: evaluate the learner, e.g. perplexity, k-fold cross-validation
[Figure: the data split into Training Set, Validation Set, and Test Set]
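A minimal sketch of this three-way partitioning in Python (the 60/20/20 proportions and the fixed seed are illustrative assumptions, not values from the slides):

import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle and split examples into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]               # build the learner
    val = shuffled[n_train:n_train + n_val]  # tune the learner
    test = shuffled[n_train + n_val:]        # evaluate the learner
    return train, val, test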
12 Filtering: Stop Words / Stemming
- Stop Word Removal: eliminate non-discriminating terms
- Prepositions: over, up
- Articles: the, a
- Conjunctions: and, thus
- Stemming: group words that share the same morphological root
- e.g. play, plays
- e.g. teacher, teaching
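A minimal, self-contained sketch of these two filtering steps (the tiny stop list and the crude suffix stripping are illustrative stand-ins for a real stop list and a real stemmer such as Porter's):

STOP_WORDS = {"the", "a", "and", "thus", "over", "up"}  # illustrative subset

def naive_stem(word):
    """Crude suffix stripping; a real system would use e.g. the Porter stemmer."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The teacher plays over and up"))  # ['teach', 'play']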
13 Terms and Weights
- Terms can be
- Bag of words (most popular)
- Syntactic phrases: grammatical sequences, e.g. noun phrases
- Phantom Limb, Retinal Ganglion, Eating-disorder
- Statistical phrases: sequences of significant, co-occurring words
- Asthma-Lungs, Cholesterol-Arteries
- Weights can be
- Computed: term frequency, tf-idf, etc.
- Valued according to the scheme
- binary, i.e. 0 or 1
- normalized / probabilistic, i.e. 0..1
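For reference, one common form of the tf-idf weight mentioned above (standard notation, not taken from the slides):

w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is the frequency of term t in document d, N is the number of documents, and df_t is the number of documents containing t.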
14 Pruning
- Term Space Reduction / Dimensionality Reduction
- Reduce the set of terms used in the learner
- Reasons
- Some learners don't scale to a huge number of terms
- Improved performance, less noise
- Reduced overfitting
- An overfitted learner cannot generalize; it idiosyncratically builds a model for the given training data
- Overfitting may be avoided even if a smaller amount of training examples is used
- Risk: removing potentially useful terms that elucidate the meaning of a document
15 Pruning Approaches
- Document Frequency: keep terms that receive the highest score according to a function that measures the importance of the term (a sketch follows this slide)
- Reduces dimensionality by a factor of 10
- e.g. terms occurring at most 1-3 times in training documents
- e.g. terms occurring at most 1-5 times in the training set
- Term Clustering
- Group words with a high degree of semantic relatedness
- Represent a set of words as an abstraction / concept
- Groups or centroids are used as the learner's dimensions
- Information Theoretic
- Select the best terms based on how differently they distribute across the classes
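As referenced above, a minimal sketch of document-frequency pruning (the min_df threshold of 3 documents is an illustrative assumption):

from collections import Counter

def prune_by_document_frequency(docs, min_df=3):
    """Keep only terms that appear in at least min_df training documents."""
    df = Counter()
    for doc in docs:                 # docs: list of token lists
        df.update(set(doc))          # count each term once per document
    vocab = {t for t, c in df.items() if c >= min_df}
    return [[t for t in doc if t in vocab] for doc in docs], vocab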
16 Information Theoretic Pruning Approaches
17 [image-only slide]
18 What Can We Learn? - Naïve Bayes Classifier
[Figure: the class node with arrows to the input attributes]
- Probabilistic classifier
- Goal: predict the class value, Pr(C | d)
- Simple yet effective
- Based on the Conditional Independence Assumption
- Joint probabilities can be rewritten as the product of individual probabilities
19 Bayes Theorem
[Equations - image only: Bayes rule, annotated with the posterior, likelihood, and prior; the Law of Total Probability (normalization) is substituted into the denominator, and a variant form noted as "sometimes used" also appears]
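The equations on this slide are images; assuming C is the class and d the observed document/attributes, the standard statement of the quantities named above is

\Pr(C \mid d) = \frac{\Pr(d \mid C)\,\Pr(C)}{\Pr(d)}, \qquad \Pr(d) = \sum_{c} \Pr(d \mid C = c)\,\Pr(C = c)

where Pr(C | d) is the posterior, Pr(d | C) the likelihood, Pr(C) the prior, and Pr(d) the normalization from the Law of Total Probability.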
20 Making a Prediction with Naïve Bayes (1 of 3)
- From Bayes Theorem
- Law of Total Probability
- Conditional Independence Assumption
- Looks like we need a bunch of products and sums ... for 2 terms.
21 Example Naïve Bayes (2 of 3)
[Table - image only: example training data with attributes A, B and class C]
22 Making a Prediction with Naïve Bayes (3 of 3)
From Slide 2 of 3:
- Given a new instance, assume
- A = m, B = q, C = ?
- "True" wins (the class value with the higher posterior)
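A minimal sketch of this prediction step in Python. The count tables below are illustrative placeholders, not the actual training data from slide 21; they merely reproduce the "True wins" outcome:

# Illustrative counts; the real values come from the training data on slide 21.
class_counts = {True: 6, False: 4}
attr_counts = {                     # attr_counts[attribute][(value, class)]
    "A": {("m", True): 4, ("m", False): 1},
    "B": {("q", True): 3, ("q", False): 2},
}

def naive_bayes_score(instance, c):
    """Unnormalized posterior: Pr(C=c) * prod over attributes of Pr(value | C=c)."""
    total = sum(class_counts.values())
    score = class_counts[c] / total
    for attr, value in instance.items():
        score *= attr_counts[attr].get((value, c), 0) / class_counts[c]
    return score

instance = {"A": "m", "B": "q"}
prediction = max(class_counts, key=lambda c: naive_bayes_score(instance, c))
print(prediction)  # True, for these illustrative counts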
23 Summary
- Advantages of Naïve Bayes
- Simple technique
- Results in high accuracy, especially when combined with other methods
- Disadvantages of Naïve Bayes
- Treats variables as independent and equally important, which can cause skewed results
- Does not allow for categorical output attributes
- Other Supervised Methods
- SVM (Support Vector Machines)
- Decision Trees
- Bayesian Networks
24 What Can We Learn? - A Bayesian Network
- A Bayesian Network can
- Overcome the independence assumption of Naïve Bayes
- Handle noise (misclassifications)
- Give optimal predictions with small or large data
(slide credit: Andrew W. Moore)
25 Bayesian Network
[Figure: a DAG with a Conditional Probability Table attached to each node]
26 Making a Prediction with a BN
- Given evidence / an unseen case
- Outlook = sunny
- Temperature = cool
- Windy = true
- Humidity = high
- Prediction Step
- (a) What is the probability that play = no?
- Pr(play = no | x) = .367 × .625 × .538 × .111 × .250
- (b) What is the probability that play = yes?
- Which probability, (a) or (b), is maximum?
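Multiplying out the factors given for (a):

0.367 \times 0.625 \times 0.538 \times 0.111 \times 0.250 \approx 0.0034

The product for (b) is formed the same way from the CPT entries for play = yes (not reproduced in this text); the larger of the two unnormalized products determines the prediction, and dividing each by their sum yields proper probabilities.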
27 Learning a Bayesian Network
- Nice! But how do I get the DAG?
- Nice! But how do I fill in the table?
- Bayesian Network Learning
- Structure Learning
- Parameter Learning
- Given a training set with no missing values
28 Parameter Learning
SELECT outlook, play, temperature, COUNT(*) AS count
FROM db.table
GROUP BY outlook, play, temperature
- Given the structure ...
- count occurrences in the database
Portion of the Conditional Probability Table for the Temperature node
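A minimal Python sketch of the same counting step (the record layout, attribute names, and function name are illustrative assumptions):

from collections import Counter

def cpt_counts(records, child, parents):
    """Count (parent-configuration, child-value) occurrences for one node's CPT."""
    counts = Counter()
    for r in records:                       # r: dict of attribute -> value
        key = tuple(r[p] for p in parents)  # e.g. (outlook, play)
        counts[(key, r[child])] += 1
    return counts

# e.g. cpt_counts(records, child="temperature", parents=["outlook", "play"])
# Normalizing each parent configuration's counts by their total gives the CPT entries.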
29 Learning Structure
- Structure Learning requires
- a Search Procedure
- a Scoring Mechanism
[Figure: the Data and a scoring function (entropy, degrees) evaluated over candidate Bayesian Network structures]
30 Search Procedure
- The search procedure produces different possible Bayesian Network structures
- Example: the K2 Algorithm
- Start with an empty DAG (or a Naïve Bayes net)
- Add, remove, or reverse edges
- Ensure no cycles are created
- Score, checking for improvement
- Keep the structure if the new score is higher than the previous score
- How do we score the current Bayesian Network?
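Before turning to scoring, here is a minimal sketch of the greedy search loop just described (the score and neighbors functions are placeholders supplied by the caller; a real implementation would use a Bayesian or MDL score and enforce acyclicity when generating neighbors):

def greedy_structure_search(initial_dag, neighbors, score):
    """Hill-climbing over DAGs: keep a candidate only if it improves the score."""
    current = initial_dag
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):  # add / remove / reverse one edge, no cycles
            s = score(candidate)
            if s > current_score:
                current, current_score = candidate, s
                improved = True
                break                         # restart from the improved structure
    return current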
31 Scoring Mechanism
[Worked example - image only: combining the parent values gives q_i = 6 parent configurations (i = 1..6, from 2 × 3 parent values); the child has r_i = 3 values (k = 1, 2, 3); the counts N_ijk are tallied from the data, e.g. N_52 = 4 splits into N_521 = 2, N_522 = 1, N_523 = 1]
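The scoring formula itself is an image on the slide; the counts N_ijk, q_i, and r_i above match the Bayesian (K2) scoring metric of Cooper and Herskovits, which for a single node i is (stated here as an assumption about the slide's formula):

\mathrm{score}(i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!, \qquad N_{ij} = \sum_{k=1}^{r_i} N_{ijk}

where N_ijk is the number of cases in which node i takes its k-th value while its parents are in their j-th configuration.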
32 Using Tags to Cluster Blogs
33 What Can We Learn? - Model Based Clustering
- Clustering divides the data into groups
- Has wide use, in many applications
- Why Cluster?
- Enhance understanding
- Group web search results
- Segment customers for targeted marketing
- Utility
- Summarization
- Compression
34 Model Based Clustering
- Model Based Learner
- Gives the probability with which a specific object belongs to a particular cluster
- Assumptions
- Data are generated by a mixture model
- There is a one-to-one correspondence between mixture components and classes (clusters)
- The probabilities of the components sum to 1
[Figure: a mixture model - a combination of probability distributions (Cluster 1 and Cluster 2) generating the Data]
35 What is a Model?
- A Model is
- Distribution
- Parameters
- Learning Process
- Decide on a statistical model for the data
- Learn the parameters of the model from the data
36 Maximum Likelihood Estimation
For the moment, assume a single Gaussian with parameters µ and σ.
[Equation - image only: the likelihood of the data, reformulated by taking the log]
- To find the MLE parameters, maximize the likelihood
- Take the derivative of the likelihood function w.r.t. the parameter
- Set the result to zero, and solve
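The equations are images on the slide; under the single-Gaussian assumption they take the standard form (written from the usual derivation, not copied from the slide):

\ell(\mu, \sigma) = \sum_{i=1}^{N} \log \mathcal{N}(x_i \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2

Setting the derivatives with respect to µ and σ² to zero gives \hat{\mu} = \frac{1}{N}\sum_i x_i and \hat{\sigma}^2 = \frac{1}{N}\sum_i (x_i - \hat{\mu})^2.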
37 Learning Parameters of the Mixture Model
[Figure: Data points generated by two clusters, Cluster 1 and Cluster 2]
- Now, let's go back to 2 clusters ...
- Which points were generated by which distribution?
- Here is where the unsupervised part comes in!
- Guess? Or, calculate the probability that each point belongs to each distribution
- Use these probabilities to compute a new estimate for the parameters
38 EM Algorithm - Step 1: Initialization
- Step 1: Select an initial set of model parameters
- Assume we know
- Make initial guesses for
39 EM Algorithm - Step 2: E-Step
Compute, for all points and distributions, the probability that a point came from a particular distribution.
[Equation - image only; the slide works the computation through for the case x = 0]
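The E-step formula itself is an image; for a Gaussian mixture it is the standard responsibility (the mixture weights w_j and the Gaussian densities are assumptions about the slide's notation):

\Pr(j \mid x_i) = \frac{w_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}{\sum_{l} w_l \, \mathcal{N}(x_i \mid \mu_l, \sigma_l^2)}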
40 EM Algorithm - Step 3: M-Step
- Given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
- Repeat from the E-Step until the values stabilize (a compact sketch of the full loop follows below)
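A compact sketch of the whole E/M loop for two 1-D Gaussian clusters (the initial guesses, fixed iteration count, and small variance floor are illustrative assumptions; no numerical safeguards beyond that):

import math

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=100):
    # Step 1: initial guesses for the parameters (illustrative)
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each cluster j for each point x
        resp = []
        for x in xs:
            p = [w[j] * gaussian(x, mu[j], var[j]) for j in range(2)]
            z = sum(p)
            resp.append([pj / z for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return mu, var, w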
41 Topic Model
[Figure: documents generated from latent topics]
- Generative Process for Document Creation
- Topic: a mixture of words
- Document: a mixture of topics
- Predefined distributions govern the selection process
http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
42 Example Topic Model: pLSA
[Figure - image only: an example, the pLSA model, and its E-Step and M-Step update equations]
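The pLSA updates on this slide are images; the standard EM equations for pLSA, with n(d, w) the count of word w in document d (notation assumed, not copied from the slide), are:

E-Step: \Pr(z \mid d, w) = \frac{\Pr(w \mid z)\,\Pr(z \mid d)}{\sum_{z'} \Pr(w \mid z')\,\Pr(z' \mid d)}

M-Step: \Pr(w \mid z) \propto \sum_{d} n(d, w)\,\Pr(z \mid d, w), \qquad \Pr(z \mid d) \propto \sum_{w} n(d, w)\,\Pr(z \mid d, w)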
43 Summary: Naïve Bayes
- Advantages
- Simple technique
- Results in high accuracy, especially when combined with other methods
- Disadvantages
- Treats variables as independent and equally important, which can cause skewed results
- Does not allow for categorical output attributes
44 Summary: Bayesian Belief Network
- Advantages
- Well suited to incomplete data
- Disadvantages
- Can be computationally intensive, especially when variables are not conditionally independent of one another
45 Summary: Generative Model
- Advantages
- Clear, well-studied probabilistic framework
- Can be extremely effective if the model is close to correct
- Disadvantages
- Often difficult to verify the correctness of the model
- EM can get stuck in local optima
- Unlabeled data may hurt if the generative model is wrong
46 Tools and Further Reading
- Tools
- Machine Learning: Weka, Lemur, Mallet
- Clustering Demo
- TCT - Text Clustering Toolkit
- CLUTO - Clustering Toolkit
- Kevin Murphy's Bayesian Network Toolbox for MATLAB
- Hugin - Bayesian Networks
- Bibliography
- Machine Learning. Mitchell
- Principles of Data Mining. Hand, Mannila, Smyth
- Introduction to Data Mining. Tan, Steinbach, Kumar
- Introduction to Information Retrieval. Manning, Raghavan, Schütze
- Data Mining: Practical Machine Learning Tools and Techniques (Weka). Witten, Frank
- Web Data Mining. Bing Liu
- Andrew Moore Tutorial Slides
47 [image-only slide]
48 What Can We Learn? - Semi-Supervised Learner for Text Classification
- Assumptions
- Documents are represented as a bag of words
- The probability of a word is independent of its position in the document
- Words are generated by a multinomial distribution
- A one-to-one correspondence between mixture components and classes
[Figure: per-class word distributions over example words such as tube, dry, Abbey, Columbian, Diana]
49 Making a Prediction w/ Naïve Bayes Text Classifier
[Figure: classes c1, c2, c3 and their document subsets D1, D2, D3, where Dj is the subset of the data for class cj]
- Looks like we need a bunch of products and sums ... for 2 terms.
50 Closer Look at Term 1
Term 1 is the prior probability on the classes, estimated from the data based on the number of documents appearing in each class and the total number of documents.
[Figure: document subsets D1, D2, D3]
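Written out (assuming |D_j| is the number of training documents in class c_j and |D| the total number of documents, matching the description above):

\Pr(c_j) = \frac{|D_j|}{|D|}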
51 Closer Look at Term 2
Term 2, after applying the model assumptions, is the number of times w_t occurs in the training data D_j (of class c_j) divided by the total number of word occurrences in the training data for that class.
- N_ti: the number of times that word w_t occurs in document d_i
- Pr(c_j | d_i) = 1 if d_i is in D_j, 0 otherwise
- V: the set of all distinct words
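With the notation above, a commonly used (Laplace-smoothed) estimate for this term is the following; the smoothing constants are an assumption, and the slide's image may instead use the unsmoothed ratio described in the text:

\Pr(w_t \mid c_j) = \frac{1 + \sum_{i} N_{ti}\,\Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i} N_{si}\,\Pr(c_j \mid d_i)}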
52 Recall: Making a Prediction with Naïve Bayes
- Given a new instance, assume
- A = m, B = q, C = ?
53 Overview: Semi-Supervised w/ EM Algorithm
[Figure (steps 1-6): the labeled data is used to estimate an initial Naïve Bayes classifier (Terms 1 and 2); the classifier probabilistically labels the unlabeled data D (E-Step); the classifier is re-estimated from all of the data (M-Step); the E-Step / M-Step loop then repeats]
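A high-level sketch of this loop (the function names and signatures are placeholders, assuming a Naïve Bayes trainer that accepts per-document class probabilities as soft labels):

def semi_supervised_em(labeled, unlabeled, train_nb, predict_proba, iters=10):
    """Schematic semi-supervised Naive Bayes with EM."""
    # Initialize: train on the labeled documents only
    model = train_nb(labeled, soft_labeled=[])
    for _ in range(iters):
        # E-step: probabilistically label the unlabeled documents with the current classifier
        soft_labeled = [(doc, predict_proba(model, doc)) for doc in unlabeled]
        # M-step: re-estimate the classifier (Terms 1 and 2) from labeled + soft-labeled data
        model = train_nb(labeled, soft_labeled=soft_labeled)
    return model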