Machine Learning (ML) Classification - PowerPoint PPT Presentation

1
Machine Learning (ML) Classification
  • Tim Humphrey
  • LexisNexis
  • 21 June 2001

2
A little scientific humor
  • This atom says to his friend, "I'm really upset,
    I've just lost an electron."
  • His friend says to him, "Are you sure?"
  • "Yeah," he replies. "I'm Positive."
  • ____________________________________________________________________________
  • How many weeks are there in a light year?

3
Overview
  • Preface
  • Definition of ML Classifier
  • You as an ML Classifier
  • A little math
  • Examples of ML algorithms
  • How well do ML classifiers work?
  • Advantages & Disadvantages
  • Uses of ML classifiers
  • Challenges
  • Q & A

4
Preface
  • Mark talked about ways to make rules that
    classify documents. Examples of companies that
    have such systems are:
  • LexisNexis
  • Verity
  • SmartLogic
  • Interwoven
  • Machine learning is another way of getting
    computers to classify documents. Machine learning
    is normally not rule-based. Instead, it is
    normally statistically based.

5
Definition of ML Classifier
  • Definition of machine learning from
    dictionary.com:
  • "The ability of a machine to improve its
    performance based on previous results."
  • So, machine learning document classification is
    the ability of a machine to improve its document
    classification performance based on previous
    results of document classification.

6
You as an ML Classifier
  • Topic 1 words:
  • baseball, owners, sports, selig, ball, bill,
    indians, isringhausen, mets, minors, players,
    specter, stadium, power, send, new, bud, comes,
    compassion, game, headaches, lite, nfl, powerful,
    strawberry, urges, home, ambassadors, building,
    calendar, commish, costs, day, dolan, drive,
    hits, league, little, match, payments, pitch,
    play, player, red, stadiums, umpire, wife, youth,
    field, leads
  • Topic 2 words:
  • merger, business, bank, buy, announces, new,
    acquisition, finance, companies, com, company,
    disclosure, emm, news, us, acquire, chemical,
    inc, results, shares, takeover, corporation,
    european, financial, investment, market, quarter,
    two, acquires, bancorp, bids, communications,
    first, mln, purchase, record, stake, west, sale,
    bid, bn, brief, briefs, capital, control, europe,
    inculab

7
Use the previous slide's topic-related words
to classify the following titles:
  1. CYBEX-Trotter merger creates fitness equipment
    powerhouse
  2. WSU RECRUIT CHOOSES BASEBALL INSTEAD OF FOOTBALL
  3. FCC chief says merger may help pre-empt Internet
    regulation
  4. Vision of baseball stadium growing
  5. Regency Realty Corporation Completes Acquisition
    Of Branch properties
  6. Red Sox to punish All-Star scalpers
  7. Canadian high-tech firm poised to make
    415-million acquisition
  8. Futures-selling hits the Footsie for six
  9. A'S NOTEBOOK Another Young Arm Called Up
  10. All-American SportPark Reaches Agreement for
    Release of Corporate Guarantees

8
Titles & Their Classifications
  1. (2) CYBEX-Trotter merger creates fitness
    equipment powerhouse
  2. (1) WSU RECRUIT CHOOSES BASEBALL INSTEAD OF
    FOOTBALL
  3. (2) FCC chief says merger may help pre-empt
    Internet regulation
  4. (1) Vision of baseball stadium growing
  5. (2) Regency Realty Corporation Completes
    Acquisition Of Branch properties
  6. (1) Red Sox to punish All-Star scalpers
  7. (2) Canadian high-tech firm poised to make
    415-million acquisition
  8. (2) Futures-selling hits the Footsie for six
  9. (1) A'S NOTEBOOK Another Young Arm Called Up
  10. (1) All-American SportPark Reaches Agreement for
    Release of Corporate Guarantees

9
The Salary Theorem
  • Mathematical proof of "The less you know, the more
    you make."
  • Knowledge is Power
  • Time is Money
  • Power = Work / Time
  • Knowledge = Work / Money
  • Solving for Money, we get
  • Money = Work / Knowledge
  • Thus, as Knowledge approaches zero, Money
    approaches infinity,
  • regardless of the amount of work done.
  • Conclusion: The less you know, the more you make.

10
A little math
  • Example title: Canadian high-tech firm poised to
    make 415-million acquisition
  • Estimate the probability of a word in a topic by
    dividing the number of times the word appeared in
    the topic's training set by the total number of
    word occurrences in the topic's training set.
  • For each topic, T, sum the probability of finding
    each word of the title in a title that is
    classified as T.
  • The title is classified as the topic with the
    largest sum.
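  • Number of times each title word appears in each
    topic's training set:
      Word          Topic 2   Topic 1
      Canadian            1         0
      high                0         0
      tech                2         0
      firm                1         0
      poised              0         0
      make                0         0
      million             4         4
      acquisition        10         0
  • Number of words in Topic 2: 1563
  • Number of words in Topic 1: 429
  • Title's evidence of being in Topic 2: 18 / 1563 = 0.01152
  • Title's evidence of being in Topic 1: 4 / 429 = 0.00932
  • Topic 2 has the larger sum, so the title is
    classified as Topic 2.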
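
The calculation on this slide can be reproduced with a few lines of Python. This is only a sketch of the slide's arithmetic: it assumes that the two numbers listed after each word are its counts in the Topic 2 and Topic 1 training sets, in that order (a reading that reproduces the slide's totals), and the names `topic_counts`, `topic_totals`, and `title_evidence` are made up for the illustration.

```python
# Words of the example title, lowercased, with the number "415" left out
# as on the slide.
title_words = ["canadian", "high", "tech", "firm",
               "poised", "make", "million", "acquisition"]

# Occurrences of each title word in each topic's training set.
# Words that never occurred in a topic are simply omitted (count 0).
topic_counts = {
    "Topic 2": {"canadian": 1, "tech": 2, "firm": 1,
                "million": 4, "acquisition": 10},
    "Topic 1": {"million": 4},
}

# Total number of word occurrences in each topic's training set.
topic_totals = {"Topic 2": 1563, "Topic 1": 429}


def title_evidence(words, counts, total):
    """Sum the estimated per-word probabilities count / total."""
    return sum(counts.get(word, 0) / total for word in words)


scores = {
    topic: title_evidence(title_words, topic_counts[topic], topic_totals[topic])
    for topic in topic_counts
}
best_topic = max(scores, key=scores.get)

for topic, score in scores.items():
    print(f"{topic}: {score:.5f}")
print("Classified as:", best_topic)
```

Summing the probabilities keeps the example close to the slide. A word that never occurs in a topic contributes zero to that topic's sum rather than ruling the topic out.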

11
Examples of ML algorithms
  • Naïve Bayes: This method computes the
    probability that a document is about a particular
    topic, T, using a) the words of the document to
    be classified and b) the estimated probability of
    each of these words as they appeared in the set
    of training documents for the topic, T, as in the
    example previously given.
  • Neural networks: During training, a neural
    network looks at the patterns of features (e.g.,
    words, phrases, or N-grams) that appear in a
    document of the training set and attempts to
    produce classifications for the document. If its
    attempt doesn't match the set of desired
    classifications, it adjusts the weights of the
    connections between neurons. It repeats this
    process until the attempted classifications match
    the desired classifications.
  • Instance-based: Saves the documents of the
    training set and compares each new document to be
    classified with the saved documents. The document
    to be classified gets tagged with the
    highest-scoring classifications. One way to do
    this is to implement a search engine using the
    documents of the training set as the document
    collection. A document to be classified becomes a
    query/search. A classification, C, is picked if a
    large number of its training set documents are at
    the top of the returned answer set.
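
As a rough illustration of the instance-based approach, the sketch below treats the ten labeled titles from the earlier slide as the saved training set, ranks them against a new title, and lets the top matches vote. The scoring function (cosine similarity over word counts), the cutoff `k=3`, and the two query titles are assumptions chosen for the example; the slides do not say how the search engine scores documents.

```python
import math
from collections import Counter

# Saved training documents: (title, topic) pairs taken from the
# "Titles & Their Classifications" slide.
TRAINING_SET = [
    ("CYBEX-Trotter merger creates fitness equipment powerhouse", 2),
    ("WSU RECRUIT CHOOSES BASEBALL INSTEAD OF FOOTBALL", 1),
    ("FCC chief says merger may help pre-empt Internet regulation", 2),
    ("Vision of baseball stadium growing", 1),
    ("Regency Realty Corporation Completes Acquisition Of Branch properties", 2),
    ("Red Sox to punish All-Star scalpers", 1),
    ("Canadian high-tech firm poised to make 415-million acquisition", 2),
    ("Futures-selling hits the Footsie for six", 2),
    ("A'S NOTEBOOK Another Young Arm Called Up", 1),
    ("All-American SportPark Reaches Agreement for Release of Corporate Guarantees", 1),
]


def vectorize(text):
    """Turn a text into a bag of lowercased, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, training_set, k=3):
    """Use the text as a query and let the top-k matching documents vote."""
    query = vectorize(text)
    scored = [(cosine(query, vectorize(doc)), label) for doc, label in training_set]
    # Keep only documents that share at least one word with the query.
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    votes = Counter(label for _, label in hits[:k])
    return votes.most_common(1)[0][0] if votes else None


print(classify("Bank merger creates regional powerhouse", TRAINING_SET))
print(classify("Baseball stadium vote delayed", TRAINING_SET))
```

Returning `None` when no saved document shares a word with the query is one possible design choice; a real system would need some fallback for such documents.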

12
How well do ML classifiers work?
  • A good system will have an accuracy above 80%.
  • Strong evidence of how good these systems are is
    the number of companies in the marketplace with
    machine learning document classification systems.
  • Examples include Semio, Inxight, Purple Yogi,
    Hummingbird, Autonomy, 80-20 Inc., Dolphin Search,
    and Textology Inc.
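
For reference, accuracy here can be read as the fraction of documents whose predicted classification matches the human-assigned one. The sketch below assumes that reading. The `actual` labels are the ten classifications from the earlier titles slide, while the `predicted` labels are invented purely to show the calculation and are not results from any real system.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the human-assigned labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Human-assigned topics for the ten example titles.
actual = [2, 1, 2, 1, 2, 1, 2, 2, 1, 1]

# Hypothetical classifier output: title 8 is misclassified as Topic 1.
predicted = [2, 1, 2, 1, 2, 1, 2, 1, 1, 1]

print(f"Accuracy: {accuracy(predicted, actual):.0%}")
```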

13
Advantages & Disadvantages
  • Advantage over classification by humans: Once the
    system is trained, classification is done
    automatically with little or no human
    intervention, saving human resources.
  • Advantage over classification by humans:
    Consistent classification.
  • Advantage over rule-based classification: Human
    resources are not needed to make rules.
  • Disadvantage: It is not always obvious why the
    system classified a document in a certain way, and
    not obvious how to keep it from making the same
    type of classification in the future (i.e., we
    don't know how to modify it).
  • Disadvantage: Human resources must be used to
    manually classify documents for the training set.
    Furthermore, deciding on the number and types of
    documents that should be in the training set isn't
    straightforward.

14
Uses of ML classifiers
  • Automatically classify documents.
  • Suggest classifications that a human can pick
    from.
  • Classify paragraphs or even sentences.
  • Find important information in a document. For
    example, rules of law in a case law document, or
    the facts of the case.

15
Challenges
  • Labeling the documents of the training set.
  • What is the best way to pick documents for the
    training set so the machine-learning algorithm
    produces a classifier with high accuracy?
  • Which machine-learning algorithm works best on
    your classification problem?

16
Questions & Answers