CS188 Guest Lecture: Statistical Natural Language Processing

Prof. Marti Hearst, School of Information Management & Systems, www.sims.berkeley.edu/~hearst

1
CS188 Guest Lecture: Statistical Natural Language
Processing
Prof. Marti Hearst, School of Information
Management & Systems, www.sims.berkeley.edu/~hearst

2
School of Information Management & Systems
3
School of Information Management & Systems
Information economics and policy
Information design and architecture
SIMS
Human-computer interaction
Information assurance
Sociology of information
4
How do we Automatically Analyze Human Language?

The answer is: forget all that logic and
inference stuff you've been learning all
semester!
Instead, we do something entirely different.
Gather HUGE collections of text, and compute
statistics over them. This allows us to make
predictions.
Nearly always a VERY simple algorithm and a VERY
large text collection do better than a smart
algorithm using knowledge engineering.

5
Statistical Natural Language Processing

Chapter 23 of the textbook
Prof. Russell said it won't be on the final
Today: 3 Applications
Author Identification
Speech Recognition (language models)
Spelling Correction

6
Author Identification
Problem Variations

Disputed authorship (choose among k known
authors)
Document pair analysis: Were two documents
written by the same author?
Odd-person-out: Were these documents written by
one of this set of authors or by someone else?
Clustering of putative authors (e.g., internet
handles: termin8r, heyr, KaMaKaZie)

7
The Federalist Papers

Written in 1787-1788 by Alexander Hamilton, John
Jay and James Madison to persuade the citizens of
New York to ratify the Constitution.
Papers consisted of short essays, 900 to 3500
words in length.
Authorship of 12 of those papers has been in
dispute (Madison or Hamilton). These papers are
referred to as the disputed Federalist papers.

8
Stylometry

The use of metrics of literary style to analyze
texts.
Sentence length
Paragraph length
Punctuation
Density of parts of speech
Vocabulary
Mosteller & Wallace, 1964
Federalist papers problem
Used Naïve Bayes and 30 "marker" words more
typical of one or the other author
Concluded that the disputed documents were
written by Madison.
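
As a rough illustration of the marker-word idea, here is a minimal sketch of a generic multinomial Naive Bayes classifier with add-one smoothing and equal priors. It is not a reconstruction of Mosteller & Wallace's actual model; the function names, the four marker words, and the toy training texts are all assumptions made up for this example.

```python
import math
from collections import Counter


def train_marker_model(docs_by_author, marker_words):
    """Estimate log P(marker word | author) with add-one smoothing."""
    model = {}
    for author, docs in docs_by_author.items():
        counts = Counter(
            token
            for doc in docs
            for token in doc.lower().split()
            if token in marker_words
        )
        total = sum(counts.values())
        model[author] = {
            word: math.log((counts[word] + 1) / (total + len(marker_words)))
            for word in marker_words
        }
    return model


def classify(text, model):
    """Return the author with the highest summed log-probability (equal priors)."""
    tokens = text.lower().split()
    scores = {
        author: sum(log_probs[t] for t in tokens if t in log_probs)
        for author, log_probs in model.items()
    }
    return max(scores, key=scores.get)


# Toy data, invented for illustration; real work would use the undisputed papers.
MARKERS = {"upon", "on", "while", "whilst"}
TRAINING = {
    "Hamilton": ["upon the whole upon reflection while"],
    "Madison": ["whilst on the whole on reflection"],
}
MODEL = train_marker_model(TRAINING, MARKERS)
```

Calling `classify("upon reflection while we wait", MODEL)` picks the author whose marker-word profile best matches the text; words outside the marker set are ignored.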

9
An Alternative Method (Fung)

Find a hyperplane based on 3 words
0.5368 to + 24.6634 upon + 2.9532 would = 66.6159
All disputed papers end up on the Madison side
of the plane.
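
A minimal sketch of how a three-word separating plane could be applied, reading the slide's figures as coefficients for "to", "upon" and "would" and a threshold of 66.6159. Two things are assumptions here, not statements from the lecture: that the inputs are word rates per 1000 words, and that scores below the threshold fall on the Madison side. The sample rates are invented.

```python
# Coefficients and threshold as read from the slide (signs assumed positive).
WEIGHTS = {"to": 0.5368, "upon": 24.6634, "would": 2.9532}
THRESHOLD = 66.6159


def plane_score(rates):
    """Signed offset from the plane; `rates` maps word -> rate per 1000 words (assumed unit)."""
    return sum(weight * rates[word] for word, weight in WEIGHTS.items()) - THRESHOLD


def side_of_plane(rates):
    """Assumption: negative scores are the Madison side, positive the Hamilton side."""
    return "Madison" if plane_score(rates) < 0 else "Hamilton"


# Hypothetical rates for two papers, differing only in how often "upon" occurs.
low_upon = {"to": 40.0, "upon": 0.2, "would": 7.0}
high_upon = {"to": 40.0, "upon": 3.0, "would": 7.0}
```

Because the weight on "upon" dominates, the two hypothetical papers land on opposite sides even though their other rates are identical.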

10
11
Features for Author ID

Typically seek a small number of textual
characteristics that distinguish the texts of
authors
(Burrows, Holmes, Binongo, Hoover, Mosteller &
Wallace, McMenamin, Tweedie, etc.)
Typically use function words (a, with, as,
were, all, would, etc.) followed by analysis
Function words are topic-independent
However, Hoover (2003) shows that using all
high-frequency words does a better job than
function words alone.
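
One simple way to turn a text into function-word features is to count each word and normalize by text length. The sketch below assumes a rate-per-1000-words feature and a crude regex tokenizer; both are illustrative choices, and the word list is just the handful named on the slide.

```python
import re
from collections import Counter

# The function words listed on the slide; a real study would use a longer list.
FUNCTION_WORDS = ("a", "with", "as", "were", "all", "would")


def function_word_rates(text, words=FUNCTION_WORDS):
    """Return each word's rate per 1000 tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {word: 1000.0 * counts[word] / total for word in words}
```

Swapping `FUNCTION_WORDS` for the most frequent words in the corpus would give the all-high-frequency-words variant attributed to Hoover (2003).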

12
Idiosyncratic Features
Idiosyncratic usage (misspellings, repeated
neologisms, etc.) is apparently also useful.
For example, Foster's unmasking of Klein as the
author of Primary Colors:
Klein and Anonymous loved unusual adjectives
ending in -y and -inous: cartoony, chunky,
crackly, dorky, snarly, ..., slimetudinous,
vertiginous, ...
Both Klein and Anonymous added letters to their
interjections: ahh, aww, naww.
Both Klein and Anonymous loved to coin words
beginning in hyper-, mega-, post-, quasi-, and
semi- more than all others put together.
Klein and Anonymous use "riffle" to mean rifle or
rustle, a usage for which the OED provides no
instance in the past thousand years.
13
Language Modeling

A fundamental concept in NLP
Main idea
For a given language, some words are more likely
than others to follow each other, or
You can predict (with some degree of accuracy)
the probability that, given a word, a particular
other word will follow it.

14
Next Word Prediction

From a NY Times story...
Stocks ...
Stocks plunged this ...
Stocks plunged this morning, despite a cut in
interest rates
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall
...
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall
Street began

Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall
Street began trading for the first time since
last
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall
Street began trading for the first time since
last Tuesday's terrorist attacks.

16
Next Word Prediction

Clearly, we have the ability to predict future
words in an utterance to some degree of accuracy.
How?
Domain knowledge
Syntactic knowledge
Lexical knowledge
Claim
A useful part of the knowledge needed to allow
word prediction can be captured using simple
statistical techniques
In particular, we'll rely on the notion of the
probability of a sequence (a phrase, a sentence)

17
Applications of Language Models

Why do we want to predict a word, given some
preceding words?
Rank the likelihood of sequences containing
various alternative hypotheses,
e.g. for spoken language recognition
Theatre owners say unicorn sales have
doubled...
Theatre owners say popcorn sales have
doubled...
Assess the likelihood/goodness of a sentence
for text generation or machine translation.
The doctor recommended a cat scan.
El doctor recomendó una exploración del gato.

18
N-Gram Models of Language

Use the previous N-1 words in a sequence to
predict the next word
Language Model (LM)
unigrams, bigrams, trigrams, ...
How do we train these models?
Very large corpora

19
Notation

P(unicorn)
Read this as: "The probability of seeing the token
unicorn"
P(unicorn | mythical)
Called the conditional probability.
Read this as: "The probability of seeing the token
unicorn given that you've seen the token mythical"

20
Speech Recognition Example

From BeRP: The Berkeley Restaurant Project
(Jurafsky et al.)
A testbed for a Speech Recognition project
System prompts user for information in order to
fill in slots in a restaurant database.
Type of food, hours open, how expensive
After getting lots of input, can compute how
likely it is that someone will say X given that
they already said Y.
P(I want to eat Chinese food) =
P(I | <start>) P(want | I) P(to | want) P(eat |
to) P(Chinese | eat) P(food | Chinese)
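
The bigram decomposition above can be sketched as a product over adjacent word pairs. The probability table below is invented purely for illustration (it is not BeRP data), and the function name is an assumption. Unseen bigrams get probability 0 here because the sketch does no smoothing.

```python
# Hypothetical bigram probabilities P(word | previous word); not real BeRP values.
BIGRAM_PROBS = {
    ("<start>", "i"): 0.25,
    ("i", "want"): 0.30,
    ("want", "to"): 0.60,
    ("to", "eat"): 0.25,
    ("eat", "chinese"): 0.02,
    ("chinese", "food"): 0.50,
}


def sentence_probability(sentence, bigram_probs):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <start>."""
    tokens = ["<start>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        # Unsmoothed: any bigram missing from the table zeroes the whole product.
        prob *= bigram_probs.get((prev, word), 0.0)
    return prob
```

A speech recognizer could call `sentence_probability` on each candidate transcription and keep the highest-scoring one.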

21
A Bigram Grammar Fragment from BeRP
22
23

P(I want to eat British food) = P(I | <start>)
P(want | I) P(to | want) P(eat | to) P(British | eat)
P(food | British) = .25 × .32 × .65 × .26 × .001 × .60
≈ .0000081
vs. I want to eat Chinese food = .00015
Probabilities seem to capture "syntactic" facts,
"world knowledge"
eat is often followed by an NP
British food is not too popular
N-gram models can be trained by counting and
normalization
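
"Counting and normalization" can be sketched as follows: count each bigram, count how often each word appears as a context, and divide. This is the plain maximum-likelihood estimate with no smoothing; the function name, whitespace tokenization, and toy corpus are assumptions for illustration.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Return {(prev, word): P(word | prev)} estimated by relative frequency."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<start>"] + sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[(prev, word)] += 1  # counting
            context_counts[prev] += 1
    return {
        (prev, word): count / context_counts[prev]  # normalization
        for (prev, word), count in bigram_counts.items()
    }


# Toy corpus, invented for illustration.
CORPUS = [
    "I want to eat Chinese food",
    "I want to eat",
    "I want Chinese food",
]
MODEL = train_bigram_model(CORPUS)
```

In this toy corpus "want" is followed by "to" twice and by "chinese" once, so the model assigns them 2/3 and 1/3.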

24
Spelling Correction

How to do it?
Standard approach
Rely on a dictionary for comparison
Assume a single point change
Insertion, deletion, transposition, substitution
Doesn't handle word substitution
Problems
Might guess the wrong correction
Dictionary not comprehensive
Shrek, Britney Spears, nsync, p53, ground zero
May spell the word right but use it in the wrong
place
principal, principle
read, red
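
The standard dictionary approach can be sketched as: generate every string one point change away from the input and keep those found in the dictionary. The function names and the tiny dictionaries in the usage note are assumptions for illustration.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, transposition or substitution away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    insertions = {left + ch + right for left, right in splits for ch in alphabet}
    return deletions | transpositions | substitutions | insertions


def suggest(word, dictionary):
    """Return the word itself if known, else dictionary words one edit away."""
    if word in dictionary:
        return [word]
    return sorted(single_edits(word) & dictionary)
```

With the dictionary `{"principal", "principle"}`, `suggest("principel", ...)` returns both words, which shows how the approach might guess the wrong correction. A correctly spelled but misused word such as "red" is returned unchanged, since word substitution is not handled.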

25
New Approach: Use Search Engine Query Logs!

Leverage off of the mistakes and corrections that
millions of other people have already made!

26
Spelling Correction via Query Logs

Cucerzan and Brill '04
Main idea
Iteratively transform the query into other
strings that correspond to more likely queries.
Use statistics from query logs to determine
likelihood.
Despite the fact that many of these are
misspelled
Assume that the less wrong a misspelling is, the
more frequent it is, and correct > incorrect
Example
ditroitigers ->
detroittigers ->
detroit tigers
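
The iterative idea can be sketched very roughly: at each round, move to the most frequent logged query within a small edit distance, and stop when nothing more frequent is nearby. This is a simplification, not the Cucerzan and Brill algorithm: it compares whole queries with plain Levenshtein distance, and the log counts, distance limit, and function names are all hypothetical.

```python
def levenshtein(a, b):
    """Plain edit distance (insert, delete, substitute), two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def correct_query(query, log_counts, max_dist=2, max_iters=5):
    """Repeatedly replace the query with a more frequent query close to it."""
    current = query
    for _ in range(max_iters):
        best = current
        for candidate, count in log_counts.items():
            if (
                count > log_counts.get(best, 0)
                and levenshtein(current, candidate) <= max_dist
            ):
                best = candidate
        if best == current:
            break  # no more frequent neighbor: stop iterating
        current = best
    return current


# Hypothetical query-log frequencies, invented for illustration.
LOG_COUNTS = {"ditroitigers": 1, "detroittigers": 20, "detroit tigers": 5000}
```

With these counts, "ditroitigers" is too far from "detroit tigers" to reach in one step (distance 3), but it gets there in two rounds via the more frequent misspelling "detroittigers".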

27
Spelling Correction via Query Logs (Cucerzan and
Brill '04)
28
Spelling Correction Algorithm

Algorithm
Compute the set of all possible alternatives for
each word in the query
Look at word unigrams and bigrams from the logs
This handles concatenation and splitting of words
Find the best possible alternative string to the
input
Do this efficiently with a modified Viterbi
algorithm
Constraints
No 2 adjacent in-vocabulary words can change
simultaneously
Short queries have further (unstated)
restrictions
In-vocabulary words can't be changed in the first
round of iteration

29
Spelling Correction Algorithm

Comparing string similarity
Damerau-Levenshtein edit distance
The minimum number of point changes required to
transform a string into another
Trading off distance function leniency
A rule that allows only one letter change can't
fix
dondal duck -> donald duck
A too permissive rule makes too many errors
log wood -> dog food
Actual measure
A modified context-dependent weighted
Damerau-Levenshtein edit function
Point changes: insertion, deletion, substitution,
immediate transpositions, long-distance movement
of letters
Weights interactively refined using statistics
from query logs
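
For reference, here is a minimal sketch of the unweighted distance with insertion, deletion, substitution and adjacent transposition (the "optimal string alignment" variant of Damerau-Levenshtein). It is not the modified, context-dependent weighted function described above: every point change costs 1, and long-distance letter movement is not modeled. The function name is an assumption.

```python
def osa_distance(a, b):
    """Minimum number of unit-cost point changes turning string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]
```

Under this unweighted measure both "dondal duck" to "donald duck" and "log wood" to "dog food" are two changes apart, so a fixed threshold cannot accept the first and reject the second.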

30
Spelling Correction Evaluation

Emphasizing coverage
1044 randomly chosen queries
Annotated by two people (91.3% agreement)
180 misspelled; annotators provided corrections
81.1% system agreement with annotators
131 false positives
2002 kawasaki ninja zx6e -> 2002 kawasaki ninja
zx6r
156 suggestions for the misspelled queries
2 iterations were sufficient for most corrections
Problem: annotators were guessing user intent

31
Spell Checking Summary

Can use the collective knowledge stored in query
logs
Works pretty well despite the noisiness of the
data
Exploits the errors made by people
Might be further improved to incorporate text
from other domains

32
Other Search Engine Applications

Many other applications apply to search engines
and related topics.
One more example: automatic synonym and related
word generation.

33
Synonym Generation
34
Synonym Generation
35
Synonym Generation
36
Speaking of Search Engines: Introducing a New
Course!

Search Engines: Technology, Society, and Business
IS141 (2 units)
Mondays 4-6pm 1hr section
CCN 42702
No prerequisites
http://www.sims.berkeley.edu/courses/is141/f05/

37
A Great Line-up of World-Class Experts!
38
A Great Line-up of World-Class Experts!
39
Thank you!
Prof. Marti Hearst, School of Information
Management & Systems, www.sims.berkeley.edu/~hearst
