CS188 Guest Lecture: Statistical Natural Language Processing

Prof. Marti Hearst, School of Information Management & Systems, www.sims.berkeley.edu/~hearst

1
CS188 Guest Lecture: Statistical Natural Language
Processing
Prof. Marti Hearst, School of Information
Management & Systems, www.sims.berkeley.edu/~hearst
 
2
School of Information Management & Systems
3
School of Information Management & Systems
Information economics and policy
Information design and architecture
SIMS
Human-computer interaction
Information assurance
Sociology of information
4
How do we Automatically Analyze Human Language?
  • The answer is: forget all that logic and
    inference stuff you've been learning all
    semester!
  • Instead, we do something entirely different.
  • Gather HUGE collections of text, and compute
    statistics over them. This allows us to make
    predictions.
  • Nearly always a VERY simple algorithm and a VERY
    large text collection do better than a smart
    algorithm using knowledge engineering.

5
Statistical Natural Language Processing
  • Chapter 23 of the textbook
  • Prof. Russell said it won't be on the final
  • Today: 3 applications
  • Author Identification
  • Speech Recognition (language models)
  • Spelling Correction

6
Author Identification
Problem Variations
  • Disputed authorship (choose among k known
    authors)
  • Document pair analysis: were two documents
    written by the same author?
  • Odd-person-out: were these documents written by
    one of this set of authors or by someone else?
  • Clustering of putative authors (e.g., internet
    handles termin8r, heyr, KaMaKaZie)

7
The Federalist Papers
  • Written in 1787-1788 by Alexander Hamilton, John
    Jay and James Madison to persuade the citizens of
    New York to ratify the Constitution.
  • The papers are short essays, 900 to 3500
    words in length.
  • The authorship of 12 of these papers has been in
    dispute (Madison or Hamilton). These papers are
    referred to as the disputed Federalist papers.

8
Stylometry
  • The use of metrics of literary style to analyze
    texts.
  • Sentence length
  • Paragraph length
  • Punctuation
  • Density of parts of speech
  • Vocabulary
  • Mosteller & Wallace, 1964
  • Federalist papers problem
  • Used Naïve Bayes and 30 marker words more
    typical of one or the other author
  • Concluded the disputed documents were written by
    Madison.
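The marker-word idea can be sketched in a few lines. This is only a minimal multinomial Naive Bayes with add-one smoothing and a uniform prior, not a reconstruction of Mosteller & Wallace's analysis. The four marker words, the function names, and the two training snippets are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical marker words; the study described above used about 30.
MARKERS = ["upon", "whilst", "on", "while"]

def marker_counts(text):
    """Count how often each marker word occurs in a text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in MARKERS]

def train(texts_by_author):
    """Return log P(marker | author) per author, with add-one smoothing."""
    model = {}
    for author, texts in texts_by_author.items():
        totals = [1] * len(MARKERS)  # add-one smoothing
        for text in texts:
            for i, c in enumerate(marker_counts(text)):
                totals[i] += c
        denom = sum(totals)
        model[author] = [math.log(t / denom) for t in totals]
    return model

def classify(text, model):
    """Pick the author with the highest summed log-likelihood (uniform prior)."""
    counts = marker_counts(text)
    scores = {author: sum(c * lp for c, lp in zip(counts, logps))
              for author, logps in model.items()}
    return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
training = {
    "Hamilton": ["upon reflection while upon the whole upon this"],
    "Madison": ["whilst on the other hand on this point whilst"],
}
model = train(training)
guess = classify("on this subject whilst the states deliberate", model)
```

With these toy texts, a document that uses "on" and "whilst" scores higher under the Madison model, and one that repeats "upon" scores higher under the Hamilton model.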

9
An Alternative Method (Fung)
  • Find a hyperplane based on 3 words
  • 0.5368 to + 24.6634 upon + 2.9532 would = 66.6159
  • All disputed papers end up on the Madison side
    of the plane.
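A rough sketch of how such a plane would be applied to one paper follows. Several things here are assumptions rather than facts from the slide: that the three terms are added and compared with 66.6159 (the operators did not survive in the transcript), that each word name stands for that word's rate per 1000 words, and that the Madison side is the low-scoring side. The two frequency profiles are hypothetical.

```python
# Coefficients as printed on the slide; the "+" and "=" are assumed.
WEIGHTS = {"to": 0.5368, "upon": 24.6634, "would": 2.9532}
THRESHOLD = 66.6159

def plane_score(freqs):
    """Weighted sum of the three word rates (assumed: occurrences per 1000 words)."""
    return sum(weight * freqs[word] for word, weight in WEIGHTS.items())

def side(freqs):
    """Report which side of the plane a paper falls on.

    Assumption: scores below the threshold are the Madison side.
    """
    return "Madison" if plane_score(freqs) < THRESHOLD else "Hamilton"

# Hypothetical word rates for two papers, for illustration only.
paper_a = {"to": 35.0, "upon": 0.2, "would": 6.0}
paper_b = {"to": 38.0, "upon": 3.0, "would": 8.0}
```

Under these assumptions `paper_a` scores about 41 and `paper_b` about 118, so the rate of "upon" dominates the decision.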

10
(No Transcript)
11
Features for Author ID
  • Typically seek a small number of textual
    characteristics that distinguish the texts of
    authors
  • (Burrows, Holmes, Binongo, Hoover, Mosteller &
    Wallace, McMenamin, Tweedie, etc.)
  • Typically use function words (a, with, as,
    were, all, would, etc.) followed by analysis
  • Function words are topic-independent
  • However, Hoover (2003) shows that using all
    high-frequency words does a better job than
    function words alone.
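Both feature sets mentioned above are simple to compute. The sketch below is one possible way to do it; the function-word list is just the handful of examples from the slide, and the tokenizer and function names are made up for illustration.

```python
from collections import Counter

# Small, hypothetical function-word list taken from the slide's examples.
FUNCTION_WORDS = ["a", "with", "as", "were", "all", "would"]

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t]

def function_word_profile(text):
    """Rate per 1000 tokens for each function word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in FUNCTION_WORDS}
    return {w: 1000.0 * counts[w] / total for w in FUNCTION_WORDS}

def top_frequent_words(text, k):
    """The k most frequent words overall, regardless of word class."""
    return [word for word, _ in Counter(tokenize(text)).most_common(k)]

profile = function_word_profile("All were as one, as a rule.")
```

Either output can serve as the feature vector for whatever analysis follows.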

12
Idiosyncratic Features
Idiosyncratic usage (misspellings, repeated
neologisms, etc.) is apparently also useful.
For example, Foster's unmasking of Klein as the
author of "Primary Colors":
  • Klein and Anonymous loved unusual adjectives
    ending in -y and -inous: cartoony, chunky,
    crackly, dorky, snarly, slimetudinous,
    vertiginous
  • Both Klein and Anonymous added letters to their
    interjections: ahh, aww, naww
  • Both Klein and Anonymous loved to coin words
    beginning in hyper-, mega-, post-, quasi-, and
    semi- more than all others put together
  • Klein and Anonymous use "riffle" to mean rifle
    or rustle, a usage for which the OED provides no
    instance in the past thousand years
13
Language Modeling
  • A fundamental concept in NLP
  • Main idea
  • For a given language, some words are more likely
    than others to follow each other, or
  • You can predict (with some degree of accuracy)
    the probability that, given a word, a particular
    other word will follow it.

14
Next Word Prediction
  • From a NY Times story...
  • Stocks ...
  • Stocks plunged this ...
  • Stocks plunged this morning, despite a cut in
    interest rates
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    ...
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began

15
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last Tuesday's terrorist attacks.

16
Next Word Prediction
  • Clearly, we have the ability to predict future
    words in an utterance to some degree of accuracy.
  • How?
  • Domain knowledge
  • Syntactic knowledge
  • Lexical knowledge
  • Claim
  • A useful part of the knowledge needed to allow
    word prediction can be captured using simple
    statistical techniques
  • In particular, we'll rely on the notion of the
    probability of a sequence (a phrase, a sentence)

17
Applications of Language Models
  • Why do we want to predict a word, given some
    preceding words?
  • Rank the likelihood of sequences containing
    various alternative hypotheses,
  • e.g. for spoken language recognition
  • Theatre owners say unicorn sales have
    doubled...
  • Theatre owners say popcorn sales have
    doubled...
  • Assess the likelihood/goodness of a sentence
  • for text generation or machine translation.
  • The doctor recommended a cat scan.
  • El doctor recomendó una exploración del gato.
    (Literally: "The doctor recommended an
    exploration of the cat.")

18
N-Gram Models of Language
  • Use the previous N-1 words in a sequence to
    predict the next word
  • Language Model (LM)
  • unigrams, bigrams, trigrams, ...
  • How do we train these models?
  • Very large corpora
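To make the terminology concrete, here is a small sketch that extracts n-grams from a token list and uses the previous N-1 words as the history for predicting the next word. The function names and the toy sentence are illustrative only; a real model would be built from a very large corpus.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def next_word_counts(tokens, n):
    """Map each (n-1)-word history to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        table[gram[:-1]][gram[-1]] += 1
    return table

def predict_next(history, table):
    """Most frequent follower of a history tuple, or None if it was never seen."""
    followers = table.get(tuple(history))
    return followers.most_common(1)[0][0] if followers else None

# Toy "corpus", invented for illustration.
tokens = "the cat sat on the mat the cat ran".split()
bigram_table = next_word_counts(tokens, 2)
prediction = predict_next(["the"], bigram_table)
```

With n = 1 the history is empty (a unigram model), with n = 2 it is one word (bigram), and with n = 3 it is two words (trigram).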

19
Notation
  • P(unicorn)
  • Read this as "the probability of seeing the token
    unicorn"
  • P(unicorn | mythical)
  • Called the conditional probability.
  • Read this as "the probability of seeing the token
    unicorn given that you've seen the token mythical"

20
Speech Recognition Example
  • From BeRP: the Berkeley Restaurant Project
    (Jurafsky et al.)
  • A testbed for a Speech Recognition project
  • System prompts user for information in order to
    fill in slots in a restaurant database.
  • Type of food, hours open, how expensive
  • After getting lots of input, can compute how
    likely it is that someone will say X given that
    they already said Y.
  • P(I want to eat Chinese food) = P(I | <start>)
    P(want | I) P(to | want) P(eat | to)
    P(Chinese | eat) P(food | Chinese)

21
A Bigram Grammar Fragment from BeRP
22
(No Transcript)
23
  • P(I want to eat British food) = P(I | <start>)
    P(want | I) P(to | want) P(eat | to)
    P(British | eat) P(food | British)
    = .25 × .32 × .65 × .26 × .001 × .60
    = .0000081
  • vs. P(I want to eat Chinese food) = .00015
  • Probabilities seem to capture "syntactic" facts,
    "world knowledge"
  • eat is often followed by an NP
  • British food is not too popular
  • N-gram models can be trained by counting and
    normalization
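"Counting and normalization" can be sketched as follows: count each (previous word, word) pair, divide by the count of the previous word, then multiply the resulting conditional probabilities along a sentence. The four-sentence corpus is invented and is not BeRP data; the six factors at the end are the ones quoted on the slide for the British food sentence. There is no smoothing, so any unseen bigram makes the whole product zero.

```python
from collections import Counter
from math import prod

def bigram_pairs(sentence):
    """Lowercased (previous, word) pairs, with a <start> marker in front."""
    tokens = ["<start>"] + sentence.lower().split()
    return list(zip(tokens, tokens[1:]))

def train_bigrams(sentences):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    pair_counts = Counter()
    history_counts = Counter()
    for sentence in sentences:
        for prev, word in bigram_pairs(sentence):
            pair_counts[(prev, word)] += 1
            history_counts[prev] += 1
    return {pair: n / history_counts[pair[0]] for pair, n in pair_counts.items()}

def sentence_probability(sentence, bigram_probs):
    """Product of bigram probabilities; an unseen bigram contributes 0."""
    return prod(bigram_probs.get(pair, 0.0) for pair in bigram_pairs(sentence))

# Toy corpus, invented for illustration only.
toy_corpus = [
    "I want to eat Chinese food",
    "I want to eat lunch",
    "I want Chinese food",
    "tell me about Chinese food",
]
probs = train_bigrams(toy_corpus)
p_toy = sentence_probability("I want to eat Chinese food", probs)

# The six factors quoted on the slide for "I want to eat British food".
slide_product = prod([.25, .32, .65, .26, .001, .60])  # about 8.1e-06
```

In the toy corpus "I" starts three of four sentences, "to" follows "want" two times out of three, and "Chinese" follows "eat" one time out of two, so the toy sentence probability works out to 0.25.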

24
Spelling Correction
  • How to do it?
  • Standard approach
  • Rely on a dictionary for comparison
  • Assume a single point change
  • Insertion, deletion, transposition, substitution
  • Doesn't handle word substitution
  • Problems
  • Might guess the wrong correction
  • Dictionary not comprehensive
  • Shrek, Britney Spears, nsync, p53, ground zero
  • May spell the word right but use it in the wrong
    place
  • principal, principle
  • read, red
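The standard approach can be sketched as: generate every string one point change away from the input and keep those found in the dictionary. The tiny dictionary and the function names below are hypothetical, and the alphabet is assumed to be lowercase ASCII.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, transposition, or substitution away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}

def suggest(word, dictionary):
    """Known words are accepted as-is; otherwise return dictionary words one edit away."""
    if word in dictionary:
        return {word}
    return single_edits(word) & dictionary

# Tiny hypothetical dictionary.
DICTIONARY = {"principal", "principle", "read", "red", "spelling"}
```

The listed problems show up directly: `suggest("raed", DICTIONARY)` returns both "read" and "red" with no way to choose between them, `suggest("shrek", DICTIONARY)` returns nothing because the dictionary is not comprehensive, and "principle" is accepted even where "principal" was meant.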

25
New Approach: Use Search Engine Query Logs!
  • Leverage the mistakes and corrections that
    millions of other people have already made!

26
Spelling Correction via Query Logs
  • Cucerzan and Brill 04
  • Main idea
  • Iteratively transform the query into other
    strings that correspond to more likely queries.
  • Use statistics from query logs to determine
    likelihood.
  • Despite the fact that many of these are
    misspelled
  • Assume that the less wrong a misspelling is, the
    more frequent it is, and correct > incorrect
  • Example
  • ditroitigers ->
  • detroittigers ->
  • detroit tigers
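The iteration can be illustrated with a toy version: at each round, jump to the most frequent logged query that is within a small edit distance of the current string and more frequent than it. This is a simplification for illustration, not the procedure from Cucerzan and Brill. The log counts, the distance limit of 2, and the function names are all assumptions.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def correct(query, log_counts, max_distance=2, max_rounds=5):
    """Return the chain of rewrites, each step moving to a more frequent nearby query."""
    path = [query]
    for _ in range(max_rounds):
        current = path[-1]
        current_count = log_counts.get(current, 0)
        better = [q for q, n in log_counts.items()
                  if n > current_count
                  and 0 < edit_distance(current, q) <= max_distance]
        if not better:
            break
        path.append(max(better, key=log_counts.get))
    return path

# Hypothetical query-log counts, invented for illustration.
QUERY_LOG = {
    "ditroitigers": 2,
    "detroittigers": 20,
    "detroit tigers": 500,
    "detroit lions": 300,
}
path = correct("ditroitigers", QUERY_LOG)
```

With these counts the chain matches the slide's example: "detroit tigers" is three edits from the original query and so out of reach in one step, but the less-wrong and more frequent "detroittigers" bridges the gap.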

27
Spelling Correction via Query Logs (Cucerzan and
Brill 04)
28
Spelling Correction Algorithm
  • Algorithm
  • Compute the set of all possible alternatives for
    each word in the query
  • Look at word unigrams and bigrams from the logs
  • This handles concatenation and splitting of words
  • Find the best possible alternative string to the
    input
  • Do this efficiently with a modified Viterbi
    algorithm
  • Constraints
  • No 2 adjacent in-vocabulary words can change
    simultaneously
  • Short queries have further (unstated)
    restrictions
  • In-vocabulary words can't be changed in the first
    round of iteration
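The search for the best alternative string can be sketched with an ordinary Viterbi pass over the per-word alternatives, scored with unigram and bigram probabilities. This is plain Viterbi, not the modified version the slide refers to: it ignores the constraints listed above and does not handle concatenation or splitting. The probability tables, the floor value for unseen events, and the function name are hypothetical.

```python
import math

def viterbi_best(alternatives, unigram, bigram, floor=1e-6):
    """Choose one alternative per position to maximize the bigram chain score.

    alternatives: one list of candidate words per query word.
    unigram, bigram: hypothetical probabilities estimated from query logs.
    floor: probability assigned to anything not in the tables.
    """
    # best[w] = (log score of the best path ending in w, that path)
    best = {w: (math.log(unigram.get(w, floor)), [w]) for w in alternatives[0]}
    for candidates in alternatives[1:]:
        new_best = {}
        for w in candidates:
            score, path = max(
                (prev_score + math.log(bigram.get((p, w), floor)), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values())[1]

# Hypothetical alternatives and probabilities for the query "dondal duck".
alternatives = [["dondal", "donald"], ["duck", "dock"]]
unigram = {"donald": 0.01, "dondal": 0.00001}
bigram = {("donald", "duck"): 0.3, ("donald", "dock"): 0.001}
best_query = viterbi_best(alternatives, unigram, bigram)
```

Only the best path into each candidate is kept at every position, so the cost grows with the number of alternatives per word rather than with the number of full alternative strings.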

29
Spelling Correction Algorithm
  • Comparing string similarity
  • Damerau-Levenshtein edit distance
  • The minimum number of point changes required to
    transform a string into another
  • Trading off distance function leniency
  • A rule that allows only one letter change can't
    fix
  • dondal duck -> donald duck
  • A too permissive rule makes too many errors
  • log wood -> dog food
  • Actual measure
  • A modified context-dependent weighted
    Damerau-Levenshtein edit function
  • Point changes: insertion, deletion, substitution,
    immediate transpositions, long-distance movement
    of letters
  • Weights interactively refined using statistics
    from query logs
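For reference, the unweighted baseline is short to write down. The sketch below is the restricted ("optimal string alignment") form of Damerau-Levenshtein distance, with unit cost for insertion, deletion, substitution, and adjacent transposition. It has no context-dependent weights and no long-distance movement of letters, so it is not the modified measure described above.

```python
def osa_distance(a, b):
    """Edit distance with unit-cost insert, delete, substitute, adjacent transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

Under this measure both slide examples come out at distance 2: "dondal duck" to "donald duck" and "log wood" to "dog food". A threshold of 1 misses the first and a threshold of 2 admits the second, which is the leniency trade-off the slide describes.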

30
Spelling Correction Evaluation
  • Emphasizing coverage
  • 1044 randomly chosen queries
  • Annotated by two people (91.3% agreement)
  • 180 misspelled; annotators provided corrections
  • 81.1% system agreement with annotators
  • 131 false positives
  • 2002 kawasaki ninja zx6e -> 2002 kawasaki ninja
    zx6r
  • 156 suggestions for the misspelled queries
  • 2 iterations were sufficient for most corrections
  • Problem: annotators were guessing user intent

31
Spell Checking Summary
  • Can use the collective knowledge stored in query
    logs
  • Works pretty well despite the noisiness of the
    data
  • Exploits the errors made by people
  • Might be further improved to incorporate text
    from other domains

32
Other Search Engine Applications
  • Many other applications apply to search engines
    and related topics.
  • One more example: automatic synonym and related
    word generation.

33
Synonym Generation
34
Synonym Generation
35
Synonym Generation
36
Speaking of Search Engines: Introducing a New
Course!
  • Search Engines: Technology, Society, and Business
  • IS141 (2 units)
  • Mondays 4-6pm, 1hr section
  • CCN 42702
  • No prerequisites
  • http://www.sims.berkeley.edu/courses/is141/f05/

37
A Great Line-up of World-Class Experts!
38
A Great Line-up of World-Class Experts!
39
Thank you!
Prof. Marti Hearst, School of Information
Management & Systems, www.sims.berkeley.edu/~hearst
 