Automatic Text Classification through Machine Learning

David W. Miller, Semantic Web, Spring 2002, Department of Computer Science, University of Georgia

1
Automatic Text Classification through Machine Learning
David W. Miller
Semantic Web, Spring 2002
Department of Computer Science, University of Georgia
www.cs.uga.edu/miller/SemWeb
2
Query to General-Purpose Search Engine
"camp basketball north carolina two weeks"
Automatic Text Classification through Machine
Learning, McCallum, et. al.
3
Domain-Specific Search Engine
6
Domain-Specific Search Engine: Advantages
  • High precision.
  • Powerful searches on domain-specific features.
      • by location, time, price, institution.
  • Domain-specific presentation interfaces.
      • Topic hierarchies.
      • Specific fields shown in clear format.
      • Links for special relationships.

7
Domain-Specific Search Engine: Disadvantages
  • Much human effort to build and maintain!
      • e.g. Yahoo has hired many people to build their
        hierarchy, and maintain Full Coverage, etc.

8
Tough Tasks
  • Find pages that belong in the search engine.
  • Find specific fields (price, location, etc).
  • Organize the content for browsing.

9
Machine Learning to the Rescue!
  • Find pages that belong in the search engine.
      • Efficient spidering by reinforcement learning.
  • Find specific fields (price, location, etc.).
      • Information extraction with hidden Markov models.
  • Organize the content for browsing.
      • Populate a topic hierarchy by document
        classification.

10
Building Text Classifiers
  • Manual approach
      • Interactive query refinement
      • Expert system methodologies
  • Supervised learning
      1. Expert labels example texts with classes
      2. Machine learning algorithm produces a rule that
         tends to agree with the expert's classifications

Machine Learning for Text Classification, David
D. Lewis, AT&T Labs
11
Advantages of Using Machine Learning to Build
Classifiers
  • Requires no linguistic or computer skills
  • Competitive with manual rule-writing
  • Forces good practices
      • Looking at data
      • Estimating accuracy
  • Can be combined with manual engineering
      • ML research pays too little attention to this

12
Main Processes for a Machine-Learning System
Supervised Machine-Learning Based Text
Categorization, Ng Hong I
13
Preparation of Training Texts
  • Essential for a supervised machine learning text
    categorization system
  • Decide on the set of categories
  • A set of positive training texts is prepared for
    each of the categories
  • Assign subject code(s) to each of the training
    texts
  • More than one subject code may be assigned to one
    training text

14
Demonstration System: Cora
  • Find pages that belong in the search engine.
      • Spider CS departments for research papers.
  • Find specific fields (price, location, etc.).
      • Extract titles, authors, abstracts, institutions,
        etc. from paper headers and references.
  • Organize the content for browsing.
      • Populate a hand-built topic hierarchy by using
        text classification.

17
See also CiteSeer (Bollacker, Lawrence & Giles, 1998)
19
Automatic Text Classification via Statistical
Methods
  • Text categorization is the problem of assigning
    predefined categories to free-text documents.
  • A popular approach is to use statistical learning methods:
      • Bayes Method
      • Rocchio Method (most popular)
      • Decision Trees
      • K-Nearest Neighbor Classification
      • Support Vector Machines (fairly new concept)

20
A Probabilistic Generative Model
  • Define a probabilistic generative model for
    documents with classes (Bayes).

21
Bayes Method
Pick the most probable class, given the evidence
- c_j: a class (like Planning)
- d: a document (like language intelligence
proof...)
Bayes Rule
\[ P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)} \]
P(c_j | d): probability that category c_j should be assigned to
document d
22
Bayes Rule
- P(c_j | d): probability that document d belongs to category
c_j
- P(d): probability that a randomly picked document has
the same attributes as d
- P(c_j): probability that a randomly picked document
belongs to this category
- P(d | c_j): probability that category c_j contains document d
23
Bayes Method
  • Generates conditional probabilities of particular
    words occurring in a document, given that it belongs
    to a particular category.
  • Larger vocabularies generate better probability estimates.
  • Each category is given a threshold p against which it
    judges whether a document belongs in that
    category.
  • Documents may fall into one category, more than one,
    or none at all.
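
The Bayes method described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the presentation: it assumes a multinomial word model with add-one (Laplace) smoothing, normalises the posteriors over the known categories, and uses a single shared threshold p. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Estimate P(c) and add-one smoothed P(w | c) from (tokens, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    total_docs = sum(doc_counts.values())
    priors = {c: n / total_docs for c, n in doc_counts.items()}
    likelihoods = {}
    for c in doc_counts:
        denom = sum(word_counts[c].values()) + len(vocabulary)
        likelihoods[c] = {w: (word_counts[c][w] + 1) / denom for w in vocabulary}
    return priors, likelihoods


def posteriors(tokens, priors, likelihoods):
    """Return P(c | d) for every category, normalised so the values sum to 1."""
    log_scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            if w in likelihoods[c]:  # words never seen in training are ignored
                score += math.log(likelihoods[c][w])
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the max to avoid underflow
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def classify(tokens, priors, likelihoods, threshold=0.5):
    """Assign every category whose posterior reaches the threshold p."""
    post = posteriors(tokens, priors, likelihoods)
    return [c for c, p in post.items() if p >= threshold]


# Toy training set (hypothetical data, for illustration only).
training = [
    (["planning", "search", "agent"], "Planning"),
    (["planning", "goal"], "Planning"),
    (["language", "grammar", "parse"], "NLP"),
    (["language", "speech"], "NLP"),
]
priors, likelihoods = train_naive_bayes(training)
print(classify(["planning", "goal"], priors, likelihoods))
```

Lowering the threshold lets a document fall into more than one category, and raising it far enough leaves the document unassigned, which matches the three outcomes listed on the slide.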

24
Rocchio Method
  • Each document D is represented as a vector
    within a given vector space V
  • Documents with similar content have similar
    vectors
  • Each dimension of the vector space represents a
    word selected via a feature selection process

25
Rocchio Method
  • Values of d(i) for a document d are calculated as
    a combination of the statistics TF(w,d) and DF(w)
  • TF(w,d) (Term Frequency) is the number of times
    word w occurs in a document d.
  • DF(w) (Document Frequency) is the number of
    documents in which the word w occurs at least
    once.

26
Rocchio Method
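  • The inverse document frequency is calculated as
    \[ IDF(w) = \log\left(\frac{|D|}{DF(w)}\right) \]
    where |D| is the total number of documents.
  • The value d(i) of feature w_i for a document d is
    calculated as the product
    \[ d(i) = TF(w_i, d) \cdot IDF(w_i) \]
  • d(i) is called the weight of the word w_i in the
    document d.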
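
As a rough sketch of the weighting just described, the snippet below computes TF(w, d), DF(w), and the TF times IDF weight for a small tokenised corpus. It assumes the common IDF form log(N / DF(w)), where N is the number of documents, and skips the feature selection step. The function name and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenised document to {word: TF(w, d) * IDF(w)}."""
    n_docs = len(documents)
    # DF(w): number of documents in which w occurs at least once.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    # IDF(w) = log(N / DF(w)); a word found in every document gets weight 0.
    idf = {w: math.log(n_docs / count) for w, count in df.items()}
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # TF(w, d): number of times w occurs in d
        vectors.append({w: tf[w] * idf[w] for w in tf})
    return vectors


# Hypothetical three-document corpus.
corpus = [
    ["price", "price", "camp"],
    ["camp", "basketball"],
    ["camp", "price"],
]
for vector in tfidf_vectors(corpus):
    print(vector)
```

In this corpus "camp" occurs in every document, so its weight is zero everywhere, while "basketball" occurs in only one document and receives the largest IDF.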

27
Rocchio Method
  • Based on word weight heuristics, the word w_i is
    an important indexing term for a document d if it
    occurs frequently in that document
  • However, words that occur frequently in many
    documents spanning many categories are rated as
    less important

28
Decision Tree Learning Algorithm
  • Probabilistic methods have been criticized because
    they are not easily interpreted by humans; Decision
    Trees do not share this problem
  • Decision Trees fall into the category of symbolic
    (non-numeric) algorithms

29
Decision Trees
  • Internal nodes are labeled by terms
  • Branches (departing from a node) are labeled by
    tests on the weight that the term has in a test
    document
  • Leaves are labeled by categories

30
Decision Tree Example
31
Decision Tree
  • The classifier categorizes a test document d by
    recursively testing the weights that the terms
    labeling the internal nodes have in d, until a
    leaf node is reached.
  • The label of the leaf node is then assigned to
    the document
  • Most decision trees are binary trees
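
The classification walk described above can be sketched as below. This is an illustrative assumption, not code from the presentation: the tree is binary, each internal node tests whether a term's weight exceeds a threshold, and the recursion is written as an equivalent loop. The `Node` and `Leaf` classes and the example tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    category: str  # leaves are labeled by categories


@dataclass
class Node:
    term: str                   # internal nodes are labeled by terms
    threshold: float            # test applied to the term's weight
    low: Union["Node", Leaf]    # branch taken when weight <= threshold
    high: Union["Node", Leaf]   # branch taken when weight > threshold


def classify(tree: Union[Node, Leaf], weights: Dict[str, float]) -> str:
    """Follow branches from the root until a leaf is reached; return its label."""
    node = tree
    while isinstance(node, Node):
        weight = weights.get(node.term, 0.0)  # absent terms have weight 0
        node = node.high if weight > node.threshold else node.low
    return node.category


# Hypothetical hand-built tree for illustration.
tree = Node(
    term="wheat",
    threshold=0.0,
    low=Node(term="farm", threshold=0.5, low=Leaf("other"), high=Leaf("agriculture")),
    high=Leaf("wheat"),
)
print(classify(tree, {"farm": 0.9}))
```

Learning the tree (choosing terms and thresholds, then pruning) is a separate step that this sketch leaves out.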

32
Decision Tree
  • Fully grown trees tend to have decision rules
    that are overly specific and are therefore unable
    to categorize new documents reliably
  • Therefore, pruning and growing methods for such
    Decision Trees are normally a standard part of
    classification packages

33
K-Nearest Neighbor
  • Features
  • All instances correspond to points in an
    n-dimensional Euclidean space
  • Classification is delayed till a new instance
    arrives
  • Classification done by comparing feature vectors
    of the different points
  • Target function may be discrete or real-valued

K-Nearest Neighbor Learning, Dipanjan Chakraborty
34
1-Nearest Neighbor
35
K-Nearest Neighbor
  • An arbitrary instance is represented by (a_1(x),
    a_2(x), a_3(x), ..., a_n(x))
      • a_i(x) denotes the i-th feature
  • Euclidean distance between two instances:
    \[ d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2} \]
  • Find the k nearest neighbors whose distance from
    the test case falls within a threshold p.
  • If x of those k nearest neighbors are in category
    c_i, then assign the test case to c_i; otherwise it is
    unmatched.
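
A minimal sketch of the procedure above, under stated assumptions: instances are plain numeric tuples, the k nearest neighbors are found first and then filtered by the distance threshold, and the test case is assigned to the most common category only if it collects at least `min_votes` votes (the slide's "x"). The parameter names and sample points are hypothetical.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Square root of the summed squared feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_classify(query, training, k=3, max_distance=float("inf"), min_votes=2):
    """Return the winning category, or None if the query is unmatched."""
    # Rank all training instances by distance and keep the k closest.
    neighbours = sorted(
        ((euclidean(query, features), category) for features, category in training),
        key=lambda pair: pair[0],
    )[:k]
    # Only neighbours within the distance threshold may vote.
    votes = Counter(cat for dist, cat in neighbours if dist <= max_distance)
    if not votes:
        return None
    category, count = votes.most_common(1)[0]
    return category if count >= min_votes else None


# Hypothetical labeled points in a 2-dimensional feature space.
training = [
    ((0.0, 0.0), "A"),
    ((0.0, 1.0), "A"),
    ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"),
    ((5.0, 6.0), "B"),
]
print(knn_classify((0.2, 0.1), training))
```

No model is built ahead of time: all the work happens when a new instance arrives, which is what the earlier slide means by delayed classification.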

36
Support Vector Machines
  • Based on the Structural Risk Minimization
    principle from computational learning theory
  • Find a hypothesis h for which we can guarantee
    the lowest true error
  • The true error of h is the probability that h
    will make an error on an unseen and randomly
    selected test example

37
Evaluating Learning Algorithms and Software
  • How effective/accurate is classification?
  • Compatibility with operational environment
  • Resource usage
  • Persistence
  • Areas where learning algorithms need improvement

38
Effectiveness: Contingency Table
39
Effectiveness Measures
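  With a = true positives, b = false positives,
  c = false negatives, d = true negatives:
  • recall = a/(a+c)
  • precision = a/(a+b)
  • accuracy = (a+d)/(a+b+c+d)
  • utility = any weighted average of a, b, c, d
  • F-measure = 2a/(2a+b+c)
  • others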
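
The measures above can be computed directly from the four contingency-table counts. The sketch below assumes the usual reading of the cells (a = true positives, b = false positives, c = false negatives, d = true negatives), which is what the recall and precision formulas imply, and it assumes every denominator is non-zero. The function name and the sample counts are hypothetical.

```python
def effectiveness(a, b, c, d):
    """Compute effectiveness measures from contingency-table counts.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives (assumed cell layout).
    """
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "accuracy": (a + d) / (a + b + c + d),
        "f_measure": 2 * a / (2 * a + b + c),
    }


# Hypothetical counts for one category.
for name, value in effectiveness(a=40, b=10, c=20, d=30).items():
    print(f"{name}: {value:.3f}")
```

The F-measure here is the balanced form, the harmonic mean of precision and recall.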

40
Effectiveness: How to Predict
  • Theoretical guarantees are rarely useful
  • Test the system on manually classified data
  • Representativeness of the sample is important
  • Will the data vary over time?
  • Effectiveness varies widely across classes and
    data sets
  • Is interindexer agreement an upper bound?

41
Effectiveness: How to Improve
  • More training data
  • Better training data
  • Better text representation
  • Usual IR tricks (term weighting, etc.)
  • Manually construct good predictor features
      • e.g. capitalized letters for spam filtering
  • Hand off hard cases to a human being

42
Conclusions
  • Performance of classifier depends strongly on the
    choice of data used for evaluation.
  • Dense category spaces become problematic for
    unique categorization, because many documents share
    characteristics

43
Credits: This Presentation is Partially Based on
Those of Others Listed Below
  • Supervised Machine Learning Based Text
    Categorization
  • Machine Learning for Text Classification
  • Automatically Building Internet Portals using
    Machine Learning
  • Web Search
  • Machine Learning
  • K-Nearest Neighbor Learning

Full Presentations can be found at
http://webster.cs.uga.edu/miller/SemWeb/Presentation/ACT.html
44
Resources
  • Text Categorization Using Weight Adjusted
    k-Nearest Neighbor Classification
  • A Probabilistic Analysis of the Rocchio Alg. w/
    TFIDF for Text Categorization
  • Text Categorization w/ Support Vector Machines
  • Learning to Extract Symbolic Knowledge from the
    WWW
  • An Evaluation of Statistical Approaches to Text
    Categorization
  • A Comparison of Two Learning Algorithms for Text
    Categorization
  • Machine Learning in Automated Text Categorization

Full List of Resources can be found at
http://webster.cs.uga.edu/miller/SemWeb/Presentation/ACT.html