Title: L3S Overview Visit in Sweden
1 Introduction to Machine Learning
December 16, 2008
Avaré Stewart
2 Outline
- Motivation
- Learner Input
- Selected Learning Techniques
- Supervised Learning
- Unsupervised Learning
- Summary
- Tools
- Further Reading
3 Motivation
- Data Volume
- Terabytes of data
- How do we detect patterns?
- Structure, classify
- Too difficult for humans; how can automation help?
- Information Goal
- Recognize faces, objects in a picture
- Recognize speech
- Filter email
- Humans do this with ease ... but how do we encode such expertise in an algorithm?
4 Example Machine Learning Application - Text Mining
- Machine Learning
- has many practical applications
- can be applied to different domains
- e.g. personalized search, ranking
- Online Text Documents (e.g. blogs)
- Classify blogs into known classes
- Group blogs into clusters
- Discover salient terms
- Compact representations
5 What is Machine Learning?
- Machine Learning is
- programming computers to optimize a performance criterion using example data or past experience
- A machine has learnt when
- it changes its structure, program, or data, based on its inputs, in such a manner that its expected future performance improves
6 Supervised Learning
[Figure: training examples shown as attributes, attribute values, and a class variable]
- Technique for creating a target function relating the inputs (attribute values) to the output (class variable)
- Predict the value of the function for unseen test data (any valid input object) after having seen a number of training examples
7 Unsupervised Learning
[Figure: unlabeled input data mapped to discovered output groupings]
- Input: an a-priori partition of the data is not given
- Goal: discover intrinsic groupings of the input data based on a
- Similarity function
- Distance function
8 Semi-Supervised Learning
- Labeled Data Tradeoff
- Labeled data guides the machine learning algorithm
- Labeled data requires high manual effort
- Most real data is not labeled
- Semi-Supervised Learning
- Some labeled, mostly unlabeled data
- Labels: positive or negative
[Figure: semi-supervised setting with a few labeled and many unlabeled points]
9 Our Roadmap
- Learner Inputs
- Case: Text Documents
- Supervised Learning
- Case: Naïve Bayes
- Case: Bayesian Network
- Unsupervised Learning
- EM Algorithm
- Case: Latent Topic Model
10 Learner Preprocessing / Input (Text Documents)
- Regardless of the Learning Method
- text documents are not directly processable by the learner
- preprocessing influences the results of the learner
- Partitioning Data
- Training
- Testing
- Validation
- Indexing
- Interpretation of a term
- Interpretation of a weight
- Stop Words
- Stemming
- Pruning
[Diagram: the preprocessing steps span Information Retrieval and Machine Learning]
11 Data Partitioning
- Training Set: build the learner
- Validation Set: tune the learner
- Test Set: evaluate the learner, e.g. perplexity, k-fold cross-validation
[Figure: the data split into Training Set, Validation Set, and Test Set]
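A minimal sketch of this three-way partitioning in Python (the 60/20/20 proportions and the fixed seed are illustrative assumptions, not values from the slides):

import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle and split examples into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]               # build the learner
    val = shuffled[n_train:n_train + n_val]  # tune the learner
    test = shuffled[n_train + n_val:]        # evaluate the learner
    return train, val, test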
12 Filtering: Stop Words / Stemming
- Stop Word Removal: eliminate non-discriminating terms
- Prepositions: over, up
- Articles: the, a
- Conjunctions: and, thus
- Stemming: group words that share the same morphological root
- e.g. play, plays
- e.g. teacher, teaching
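A minimal, self-contained sketch of these two filtering steps (the tiny stop list and the crude suffix stripping are illustrative stand-ins for a real stop list and a real stemmer such as Porter's):

STOP_WORDS = {"the", "a", "and", "thus", "over", "up"}  # illustrative subset

def naive_stem(word):
    """Crude suffix stripping; a real system would use e.g. the Porter stemmer."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The teacher plays over and up"))  # ['teach', 'play']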
13 Terms and Weights
- Terms can be
- Bag of words (most popular)
- Syntactic phrases: grammatical sequences, e.g. noun phrases
- Phantom Limb, Retinal Ganglion, Eating-disorder
- Statistical phrases: sequences of significant, co-occurring words
- Asthma-Lungs, Cholesterol-Arteries
- Weights can be
- Computed: term frequency, tf-idf, etc.
- Valued according to the scheme
- binary, i.e. 0 or 1
- normalized / probabilistic, i.e. 0..1
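For reference, one common form of the tf-idf weight mentioned above (standard notation, not taken from the slides):

w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is the frequency of term t in document d, N is the number of documents, and df_t is the number of documents containing t.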
14 Pruning
- Term Space Reduction / Dimensionality Reduction
- Reduce the set of terms used in the learner
- Reasons
- Some learners don't scale to a huge number of terms
- Improved performance, less noise
- Reduced overfitting
- An overfitted learner cannot generalize; it idiosyncratically builds a model for the given training data
- Overfitting may be avoided even if a smaller amount of training examples is used
- Risk: removing potentially useful terms that elucidate the meaning of a document
15 Pruning Approaches
- Document Frequency: keep terms that receive the highest score according to a function that measures the importance of the term (a sketch follows this slide)
- Reduces dimensionality by a factor of 10
- e.g. terms occurring at most 1-3 times in training documents
- e.g. terms occurring at most 1-5 times in the training set
- Term Clustering
- Group words with a high degree of semantic relatedness
- Represent a set of words as an abstraction / concept
- Groups or centroids are used as the learner's dimensions
- Information Theoretic
- Select the best terms based on how differently they distribute across the classes
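As referenced above, a minimal sketch of document-frequency pruning (the min_df threshold of 3 documents is an illustrative assumption):

from collections import Counter

def prune_by_document_frequency(docs, min_df=3):
    """Keep only terms that appear in at least min_df training documents."""
    df = Counter()
    for doc in docs:                 # docs: list of token lists
        df.update(set(doc))          # count each term once per document
    vocab = {t for t, c in df.items() if c >= min_df}
    return [[t for t in doc if t in vocab] for doc in docs], vocab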
16 Information Theoretic Pruning Approaches
17 [image-only slide]
18 What Can We Learn? - Naïve Bayes Classifier
[Figure: the class node with arrows to the input attributes]
- Probabilistic classifier
- Goal: predict the class value, Pr(C | d)
- Simple yet effective
- Based on the Conditional Independence Assumption
- Joint probabilities can be rewritten as the product of individual probabilities
19 Bayes Theorem
[Equations - image only: Bayes rule, annotated with the posterior, likelihood, and prior; the Law of Total Probability (normalization) is substituted into the denominator, and a variant form noted as "sometimes used" also appears]
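The equations on this slide are images; assuming C is the class and d the observed document/attributes, the standard statement of the quantities named above is

\Pr(C \mid d) = \frac{\Pr(d \mid C)\,\Pr(C)}{\Pr(d)}, \qquad \Pr(d) = \sum_{c} \Pr(d \mid C = c)\,\Pr(C = c)

where Pr(C | d) is the posterior, Pr(d | C) the likelihood, Pr(C) the prior, and Pr(d) the normalization from the Law of Total Probability.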
20 Making a Prediction with Naïve Bayes (1 of 3)
- From Bayes Theorem
- Law of Total Probability
- Conditional Independence Assumption
- Looks like we need a bunch of products and sums ... for 2 terms.
21 Example Naïve Bayes (2 of 3)
[Table - image only: example training data with attributes A, B and class C]
22 Making a Prediction with Naïve Bayes (3 of 3)
From Slide 2 of 3:
- Given a new instance, assume
- A = m, B = q, C = ?
- "True" wins (the class value with the higher posterior)
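A minimal sketch of this prediction step in Python. The count tables below are illustrative placeholders, not the actual training data from slide 21; they merely reproduce the "True wins" outcome:

# Illustrative counts; the real values come from the training data on slide 21.
class_counts = {True: 6, False: 4}
attr_counts = {                     # attr_counts[attribute][(value, class)]
    "A": {("m", True): 4, ("m", False): 1},
    "B": {("q", True): 3, ("q", False): 2},
}

def naive_bayes_score(instance, c):
    """Unnormalized posterior: Pr(C=c) * prod over attributes of Pr(value | C=c)."""
    total = sum(class_counts.values())
    score = class_counts[c] / total
    for attr, value in instance.items():
        score *= attr_counts[attr].get((value, c), 0) / class_counts[c]
    return score

instance = {"A": "m", "B": "q"}
prediction = max(class_counts, key=lambda c: naive_bayes_score(instance, c))
print(prediction)  # True, for these illustrative counts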
23 Summary
- Advantages of Naïve Bayes
- Simple technique
- Results in high accuracy, especially when combined with other methods
- Disadvantages of Naïve Bayes
- Treats variables as independent and equally important, which can cause skewed results
- Does not allow for categorical output attributes
- Other Supervised Methods
- SVM (Support Vector Machines)
- Decision Trees
- Bayesian Networks
24 What Can We Learn? - A Bayesian Network
- A Bayesian Network can
- Overcome the independence assumption of Naïve Bayes
- Handle noise (misclassifications)
- Give optimal predictions with small or large data
(slide credit: Andrew W. Moore)
25 Bayesian Network
[Figure: a DAG with a Conditional Probability Table attached to each node]
26 Making a Prediction with a BN
- Given evidence / an unseen case
- Outlook = sunny
- Temperature = cool
- Windy = true
- Humidity = high
- Prediction Step
- (a) What is the probability that play = no?
- Pr(play = no | x) = .367 × .625 × .538 × .111 × .250
- (b) What is the probability that play = yes?
- Which probability, (a) or (b), is maximum?
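Multiplying out the factors given for (a):

0.367 \times 0.625 \times 0.538 \times 0.111 \times 0.250 \approx 0.0034

The product for (b) is formed the same way from the CPT entries for play = yes (not reproduced in this text); the larger of the two unnormalized products determines the prediction, and dividing each by their sum yields proper probabilities.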
27 Learning a Bayesian Network
- Nice! But how do I get the DAG?
- Nice! But how do I fill in the table?
- Bayesian Network Learning
- Structure Learning
- Parameter Learning
- Given a training set with no missing values
28 Parameter Learning
SELECT outlook, play, temperature, COUNT(*) AS count
FROM db.table
GROUP BY outlook, play, temperature
- Given the structure ...
- count occurrences in the database
Portion of the Conditional Probability Table for the Temperature node
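A minimal Python sketch of the same counting step (the record layout, attribute names, and function name are illustrative assumptions):

from collections import Counter

def cpt_counts(records, child, parents):
    """Count (parent-configuration, child-value) occurrences for one node's CPT."""
    counts = Counter()
    for r in records:                       # r: dict of attribute -> value
        key = tuple(r[p] for p in parents)  # e.g. (outlook, play)
        counts[(key, r[child])] += 1
    return counts

# e.g. cpt_counts(records, child="temperature", parents=["outlook", "play"])
# Normalizing each parent configuration's counts by their total gives the CPT entries.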
29 Learning Structure
- Structure Learning requires
- a Search Procedure
- a Scoring Mechanism
[Figure: the Data and a scoring function (entropy, degrees) evaluated over candidate Bayesian Network structures]
30 Search Procedure
- The search procedure produces different possible Bayesian Network structures
- Example: the K2 Algorithm
- Start with an empty DAG (or a Naïve Bayes net)
- Add, remove, or reverse edges
- Ensure no cycles are created
- Score, checking for improvement
- Keep the structure if the new score is higher than the previous score
- How do we score the current Bayesian Network?
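Before turning to scoring, here is a minimal sketch of the greedy search loop just described (the score and neighbors functions are placeholders supplied by the caller; a real implementation would use a Bayesian or MDL score and enforce acyclicity when generating neighbors):

def greedy_structure_search(initial_dag, neighbors, score):
    """Hill-climbing over DAGs: keep a candidate only if it improves the score."""
    current = initial_dag
    current_score = score(current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):  # add / remove / reverse one edge, no cycles
            s = score(candidate)
            if s > current_score:
                current, current_score = candidate, s
                improved = True
                break                         # restart from the improved structure
    return current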
31 Scoring Mechanism
[Worked example - image only: combining the parent values gives q_i = 6 parent configurations (i = 1..6, from 2 × 3 parent values); the child has r_i = 3 values (k = 1, 2, 3); the counts N_ijk are tallied from the data, e.g. N_52 = 4 splits into N_521 = 2, N_522 = 1, N_523 = 1]
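The scoring formula itself is an image on the slide; the counts N_ijk, q_i, and r_i above match the Bayesian (K2) scoring metric of Cooper and Herskovits, which for a single node i is (stated here as an assumption about the slide's formula):

\mathrm{score}(i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!, \qquad N_{ij} = \sum_{k=1}^{r_i} N_{ijk}

where N_ijk is the number of cases in which node i takes its k-th value while its parents are in their j-th configuration.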
32 Using Tags to Cluster Blogs
33 What Can We Learn? - Model Based Clustering
- Clustering divides the data into groups
- Has wide use, in many applications
- Why Cluster?
- Enhance understanding
- Group web search results
- Segment customers for targeted marketing
- Utility
- Summarization
- Compression
34 Model Based Clustering
- Model Based Learner
- Gives the probability with which a specific object belongs to a particular cluster
- Assumptions
- Data are generated by a mixture model
- There is a one-to-one correspondence between mixture components and classes (clusters)
- The probabilities of the components sum to 1
[Figure: a mixture model - a combination of probability distributions (Cluster 1 and Cluster 2) generating the Data]
35 What is a Model?
- A Model is
- Distribution
- Parameters
- Learning Process
- Decide on a statistical model for the data
- Learn the parameters of the model from the data
36 Maximum Likelihood Estimation
For the moment, assume a single Gaussian with parameters µ and σ.
[Equation - image only: the likelihood of the data, reformulated by taking the log]
- To find the MLE parameters, maximize the likelihood
- Take the derivative of the likelihood function w.r.t. the parameter
- Set the result to zero, and solve
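The equations are images on the slide; under the single-Gaussian assumption they take the standard form (written from the usual derivation, not copied from the slide):

\ell(\mu, \sigma) = \sum_{i=1}^{N} \log \mathcal{N}(x_i \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2

Setting the derivatives with respect to µ and σ² to zero gives \hat{\mu} = \frac{1}{N}\sum_i x_i and \hat{\sigma}^2 = \frac{1}{N}\sum_i (x_i - \hat{\mu})^2.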
37 Learning Parameters of the Mixture Model
[Figure: Data points generated by two clusters, Cluster 1 and Cluster 2]
- Now, let's go back to 2 clusters ...
- Which points were generated by which distribution?
- Here is where the unsupervised part comes in!
- Guess? Or, calculate the probability that each point belongs to each distribution
- Use these probabilities to compute a new estimate for the parameters
38 EM Algorithm - Step 1: Initialization
- Step 1: Select an initial set of model parameters
- Assume we know
- Make initial guesses for
39 EM Algorithm - Step 2: E-Step
Compute, for all points and distributions, the probability that a point came from a particular distribution.
[Equation - image only; the slide works the computation through for the case x = 0]
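The E-step formula itself is an image; for a Gaussian mixture it is the standard responsibility (the mixture weights w_j and the Gaussian densities are assumptions about the slide's notation):

\Pr(j \mid x_i) = \frac{w_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}{\sum_{l} w_l \, \mathcal{N}(x_i \mid \mu_l, \sigma_l^2)}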
40 EM Algorithm - Step 3: M-Step
- Given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
- Repeat from the E-Step until the values stabilize (a compact sketch of the full loop follows below)
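A compact sketch of the whole E/M loop for two 1-D Gaussian clusters (the initial guesses, fixed iteration count, and small variance floor are illustrative assumptions; no numerical safeguards beyond that):

import math

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=100):
    # Step 1: initial guesses for the parameters (illustrative)
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each cluster j for each point x
        resp = []
        for x in xs:
            p = [w[j] * gaussian(x, mu[j], var[j]) for j in range(2)]
            z = sum(p)
            resp.append([pj / z for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return mu, var, w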
41 Topic Model
[Figure: documents generated from latent topics]
- Generative Process for Document Creation
- Topic: a mixture of words
- Document: a mixture of topics
- Predefined distributions govern the selection process
http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
42 Example Topic Model: pLSA
[Figure - image only: an example, the pLSA model, and its E-Step and M-Step update equations]
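The pLSA updates on this slide are images; the standard EM equations for pLSA, with n(d, w) the count of word w in document d (notation assumed, not copied from the slide), are:

E-Step: \Pr(z \mid d, w) = \frac{\Pr(w \mid z)\,\Pr(z \mid d)}{\sum_{z'} \Pr(w \mid z')\,\Pr(z' \mid d)}

M-Step: \Pr(w \mid z) \propto \sum_{d} n(d, w)\,\Pr(z \mid d, w), \qquad \Pr(z \mid d) \propto \sum_{w} n(d, w)\,\Pr(z \mid d, w)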
43 Summary: Naïve Bayes
- Advantages
- Simple technique
- Results in high accuracy, especially when combined with other methods
- Disadvantages
- Treats variables as independent and equally important, which can cause skewed results
- Does not allow for categorical output attributes
44 Summary: Bayesian Belief Network
- Advantages
- Well suited to incomplete data
- Disadvantages
- Can be computationally intensive, especially when variables are not conditionally independent of one another
45 Summary: Generative Model
- Advantages
- Clear, well-studied probabilistic framework
- Can be extremely effective if the model is close to correct
- Disadvantages
- Often difficult to verify the correctness of the model
- EM can get stuck in local optima
- Unlabeled data may hurt if the generative model is wrong
46 Tools and Further Reading
- Tools
- Machine Learning: Weka, Lemur, Mallet
- Clustering Demo
- TCT - Text Clustering Toolkit
- CLUTO - Clustering Toolkit
- Kevin Murphy's Bayesian Network Toolbox for MATLAB
- Hugin - Bayesian Networks
- Bibliography
- Machine Learning. Mitchell
- Principles of Data Mining. Hand, Mannila, Smyth
- Introduction to Data Mining. Tan, Steinbach, Kumar
- Introduction to Information Retrieval. Manning, Raghavan, Schütze
- Data Mining: Practical Machine Learning Tools and Techniques (Weka). Witten, Frank
- Web Data Mining. Bing Liu
- Andrew Moore Tutorial Slides
47 [image-only slide]
48 What Can We Learn? - Semi-Supervised Learner for Text Classification
- Assumptions
- Documents are represented as a bag of words
- The probability of a word is independent of its position in the document
- Words are generated by a multinomial distribution
- A one-to-one correspondence between mixture components and classes
[Figure: per-class word distributions over example words such as tube, dry, Abbey, Columbian, Diana]
49 Making a Prediction w/ Naïve Bayes Text Classifier
[Figure: classes c1, c2, c3 and their document subsets D1, D2, D3, where Dj is the subset of the data for class cj]
- Looks like we need a bunch of products and sums ... for 2 terms.
50 Closer Look at Term 1
Term 1 is the prior probability on the classes, estimated from the data based on the number of documents appearing in each class and the total number of documents.
[Figure: document subsets D1, D2, D3]
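Written out (assuming |D_j| is the number of training documents in class c_j and |D| the total number of documents, matching the description above):

\Pr(c_j) = \frac{|D_j|}{|D|}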
51 Closer Look at Term 2
Term 2, after applying the model assumptions, is the number of times w_t occurs in the training data D_j (of class c_j) divided by the total number of word occurrences in the training data for that class.
- N_ti: the number of times that word w_t occurs in document d_i
- Pr(c_j | d_i) = 1 if d_i is in D_j, 0 otherwise
- V: the set of all distinct words
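With the notation above, a commonly used (Laplace-smoothed) estimate for this term is the following; the smoothing constants are an assumption, and the slide's image may instead use the unsmoothed ratio described in the text:

\Pr(w_t \mid c_j) = \frac{1 + \sum_{i} N_{ti}\,\Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i} N_{si}\,\Pr(c_j \mid d_i)}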
52 Recall: Making a Prediction with Naïve Bayes
- Given a new instance, assume
- A = m, B = q, C = ?
53 Overview: Semi-Supervised w/ EM Algorithm
[Figure (steps 1-6): the labeled data is used to estimate an initial Naïve Bayes classifier (Terms 1 and 2); the classifier probabilistically labels the unlabeled data D (E-Step); the classifier is re-estimated from all of the data (M-Step); the E-Step / M-Step loop then repeats]
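A high-level sketch of this loop (the function names and signatures are placeholders, assuming a Naïve Bayes trainer that accepts per-document class probabilities as soft labels):

def semi_supervised_em(labeled, unlabeled, train_nb, predict_proba, iters=10):
    """Schematic semi-supervised Naive Bayes with EM."""
    # Initialize: train on the labeled documents only
    model = train_nb(labeled, soft_labeled=[])
    for _ in range(iters):
        # E-step: probabilistically label the unlabeled documents with the current classifier
        soft_labeled = [(doc, predict_proba(model, doc)) for doc in unlabeled]
        # M-step: re-estimate the classifier (Terms 1 and 2) from labeled + soft-labeled data
        model = train_nb(labeled, soft_labeled=soft_labeled)
    return model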