Title: Computational Linguistics
1Computational Linguistics
Based on Dan Jurafsky's textbook, Speech and Language Processing, Ch. 13, and slides from Preslav Nakov and Marti Hearst
2- Part III
- Parsing
- (structural disambiguation)
- based on Web as Corpus
3- Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution
- Preslav Nakov and Marti Hearst
- Computer Science Division and SIMS, University of California, Berkeley
4Motivation
- Huge datasets trump sophisticated algorithms
- "Scaling to Very Very Large Corpora for Natural Language Disambiguation", ACL 2001 (Banko & Brill, 2001)
- Task: spelling correction
- Raw text as training data
- Log-linear improvement even up to a billion words
- Getting more data is better than fine-tuning algorithms
- How to generalize to other problems?
5Web as a Baseline
- "Web as a baseline" (Lapata & Keller, 2004, 2005): applied simple n-gram models to
- machine translation candidate selection
- article generation
- noun compound interpretation
- noun compound bracketing
- adjective ordering
- spelling correction
- countability detection
- prepositional phrase attachment
- Unsupervised Web-based methods rival the best supervised approaches
- => Web n-grams should be used as a baseline.
6Our Contribution
- Potential of these ideas is not yet fully tapped
- We introduce new features
- paraphrases
- surface features
- Applied to structural ambiguity problems
- Data sparseness: need statistics for every possible word and for word combinations
- Problems (unsupervised)
- Noun compound bracketing
- PP attachment
- NP coordination
7 Noun Phrase Bracketing
8Noun Compound Bracketing
- (a) liver cell antibody (left bracketing)
- (b) liver cell line (right bracketing)
- In (a), the antibody targets the liver cell.
- In (b), the cell line is derived from the liver.
9Dependency vs. Adjacency
[Figure: left vs. right bracketing under the dependency model and the adjacency model]
10Related Work
- Dependency model vs. adjacency model
- dependency: P(w1|w2) and P(w1|w3) vs. adjacency: P(w1|w2) and P(w2|w3)
- Marcus (1980), Pustejovsky et al. (1993), Resnik (1993) (adjacency) vs. Lauer (1995) (dependency)
- Keller & Lapata (2004)
- Web-based unigrams and bigrams
- Nakov & Hearst (2005)
- n-grams, paraphrases, surface features
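The dependency and adjacency models differ only in which pair of words supplies the evidence for right bracketing. A minimal Python sketch of that comparison follows. The function name `bracket`, the `score` callback, the tie-break, and the toy counts are all assumptions made for illustration; the counts are invented and are not Web data.

```python
def bracket(w1, w2, w3, score, model="dependency"):
    """Return 'left' or 'right' for the noun compound w1 w2 w3.

    score(a, b) is any association measure between two words
    (e.g. a bigram frequency); the caller supplies it.
    """
    left_evidence = score(w1, w2)
    if model == "dependency":
        right_evidence = score(w1, w3)  # does w1 attach to w3?
    elif model == "adjacency":
        right_evidence = score(w2, w3)  # do w2 and w3 form a unit?
    else:
        raise ValueError(f"unknown model: {model}")
    # Ties go to 'left' here; this tie-break is an assumption.
    return "left" if left_evidence >= right_evidence else "right"


# Hypothetical counts, invented for illustration only.
toy_counts = {
    ("liver", "cell"): 900,
    ("liver", "antibody"): 40,
    ("cell", "antibody"): 300,
    ("cell", "line"): 5000,
}


def toy_score(a, b):
    return toy_counts.get((a, b), 0)


print(bracket("liver", "cell", "antibody", toy_score, "dependency"))
print(bracket("liver", "cell", "line", toy_score, "adjacency"))
```

Passing the association measure as a callback keeps the sketch independent of where the counts come from (a corpus, page hits, or smoothed probabilities).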
11Nakov & Hearst (2005)
- Web page hits: proxy for n-gram frequencies
- Sample surface features
- amino-acid sequence → left
- brain stem's cell → left
- brain's stem cell → right
- Majority vote to combine the different models
- Accuracy: 89.34% (on Lauer's set)
- Baseline: 66.70%, previous best result: 80.70%
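The majority vote that combines the individual models can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote`, the use of `None` for an abstaining model, and the policy of leaving ties undecided are assumptions, not details given on the slide.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions; None means the model abstains.

    Returns the most frequent label, or None when no model votes
    or when the two most frequent labels tie.
    """
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave undecided (assumed policy)
    return ranked[0][0]


print(majority_vote(["left", "left", "right", None]))
```

Letting models abstain matters here because high-precision, low-recall features fire on only a small share of the examples.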
12Web Counts Problems
- Page hits are inaccurate
- maybe not that bad (Keller & Lapata, 2003)
- The Web lacks linguistic annotation
- Pr(health|care) = #(health care) / #(care)
- health: noun
- care: both verb and noun
- can be adjacent by chance
- can come from different sentences
- Cannot search for part-of-speech tags or punctuation marks
- stem cells VERB PREPOSITION brain
- protein synthesis' inhibition
13Paraphrases (Warren, 1978)
- NC bracketing is related to how the NC is paraphrased
- E.g., human immunodeficiency virus: left or right?
- Prepositional paraphrase
- immunodeficiency virus in humans → right
- Verbal paraphrase
- virus causing human immunodeficiency → left
- immunodeficiency virus found in humans → right
- Copula paraphrase
- immunodeficiency virus that is human → right
- Answer: right
14Results
[Chart omitted; legend: correct, N/A, wrong]
15- Prepositional Phrase Attachment
16PP attachment
- (a) Peter spent millions of dollars. (noun): the PP combines with the NP to form another NP
- (b) Peter spent time with his family. (verb): the PP is an indirect object of the verb
- quadruple: (v, n1, p, n2)
- (a) (spent, millions, of, dollars)
- (b) (spent, time, with, family)
- Human performance
- quadruple: 88%
- whole sentence: 93%
17PP-attachment Web-based Approach
- Unsupervised
- (v,n1,p,n2) quadruples, Ratnaparkhi test set
- Google and MSN Search
- Exact phrase queries
- Inflections: WordNet 2.0
- Adding determiners where appropriate
- Models
- n-gram association models
- Web-derived surface features
- paraphrases
18Probability Estimation
- Using page hits as a proxy for n-gram counts
- Pr(w1|w2) = #(w1,w2) / #(w2)
- #(w2): word frequency, query for "w2"
- #(w1,w2): bigram frequency, query for "w1 w2"
- Pr(w1,w2|w3) = #(w1,w2,w3) / #(w3)
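The estimate Pr(w1|w2) = #(w1,w2) / #(w2) is a ratio of two page-hit counts. A minimal Python sketch, assuming the counts have already been collected into a dictionary keyed by the exact-phrase query string (`hits`, `conditional_prob`, and the numbers in `toy_hits` are hypothetical and chosen only for illustration):

```python
def conditional_prob(hits, w1, w2):
    """Estimate Pr(w1|w2) = #(w1,w2) / #(w2) from page-hit counts.

    hits maps a query string to its (assumed, pre-fetched) hit count.
    """
    denominator = hits.get(w2, 0)
    if denominator == 0:
        return 0.0  # no evidence for w2 at all
    return hits.get(f"{w1} {w2}", 0) / denominator


# Invented counts, not real search-engine results.
toy_hits = {"care": 2000, "health care": 500}
print(conditional_prob(toy_hits, "health", "care"))  # 0.25
```

The sketch deliberately leaves out how the counts are obtained, since that depends on the search engine being queried.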
19N-gram models
- (i) Pr(p|n1) vs. Pr(p|v)
- (ii) Pr(p,n2|n1) vs. Pr(p,n2|v)
- I eat/v spaghetti/n1 with/p a fork/n2.
- I eat/v spaghetti/n1 with/p sauce/n2.
- Pr or # (frequency)
- smoothing as in (Hindle & Rooth, 93)
- back-off from (ii) to (i)
- N-grams are unreliable if n1 or n2 is a pronoun.
- MSN Search: no rounding of n-gram estimates
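The two n-gram models and the back-off between them can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slide does not say when the back-off triggers, so the sketch backs off only when model (ii) gives no decision; the word order inside each count, the `count` callback, and the toy numbers are likewise hypothetical. Smoothing is omitted.

```python
def attach(v, n1, p, n2, count):
    """Decide 'noun' or 'verb' attachment for the quadruple (v, n1, p, n2).

    count(*words) returns a frequency for a word or phrase.
    Model (ii) is tried first, with back-off to model (i).
    """
    def cond(phrase, given):
        denominator = count(given)
        return count(*phrase) / denominator if denominator else 0.0

    # Model (ii): Pr(p,n2|n1) vs. Pr(p,n2|v)
    noun, verb = cond((n1, p, n2), n1), cond((v, p, n2), v)
    if noun == verb:
        # Back-off to model (i): Pr(p|n1) vs. Pr(p|v)
        noun, verb = cond((n1, p), n1), cond((v, p), v)
    if noun == verb:
        return None  # no evidence either way
    return "noun" if noun > verb else "verb"


# Invented counts, for illustration only.
toy = {
    ("spaghetti",): 1000, ("eat",): 5000,
    ("spaghetti", "with", "sauce"): 50, ("eat", "with", "sauce"): 5,
    ("eat", "with", "fork"): 100,
    ("spaghetti", "with"): 200, ("eat", "with"): 900,
}


def toy_count(*words):
    return toy.get(words, 0)


print(attach("eat", "spaghetti", "with", "sauce", toy_count))
print(attach("eat", "spaghetti", "with", "fork", toy_count))
```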
20Web-derived Surface Features
- Example features, each followed by (P, R): precision and recall in %
- open the door / with a key → verb (100, 0.13)
- open the door (with a key) → verb (74, 2.44)
- open the door with a key → verb (68, 2.03)
- open the door , with a key → verb (58, 7.09)
- eat Spaghetti with sauce → noun (100, 0.14)
- eat ? spaghetti with sauce → noun (83, 0.55)
- eat , spaghetti with sauce → noun (66, 5.11)
- eat spaghetti with sauce → noun (65, 1.57)
- high precision, low recall
- Decision: sum the verb evidence, sum the noun evidence, then compare
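The sum-and-compare step for the surface features can be sketched as below. The name `surface_vote`, the input shape (a list of label and hit-count pairs), and the choice to abstain on a tie are assumptions for illustration; the slide only indicates that the two groups of features are summed and compared.

```python
def surface_vote(feature_hits):
    """Sum the hits per attachment label and compare the totals.

    feature_hits is a list of (label, hits) pairs with label in
    {'verb', 'noun'}; each pair is one surface feature that fired.
    """
    totals = {"verb": 0, "noun": 0}
    for label, hits in feature_hits:
        totals[label] += hits
    if totals["verb"] == totals["noun"]:
        return None  # abstain (assumed policy)
    return max(totals, key=totals.get)


# Invented hit counts for three features that fired on one example.
print(surface_vote([("verb", 3), ("verb", 2), ("noun", 4)]))
```

Because each feature has low recall, most examples will yield an empty list and the function will abstain, which is why these features are combined with the other models.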
21Paraphrases
- v n1 p n2
- (1) v n2 n1 (noun)
- (2) v p n2 n1 (verb)
- (3) p n2 * v n1 (verb)
- (4) n1 p n2 v (noun)
- (5) v PRONOUN p n2 (verb)
- (6) BE n1 p n2 (noun)
22Paraphrases pattern (1)
- v n1 p n2 → v n2 n1 (noun)
- Can we turn n1 p n2 into a noun compound n2 n1?
- meet/v demands/n1 from/p customers/n2 → meet/v the customer/n2 demands/n1
- Problem: ditransitive verbs like give
- gave/v an apple/n1 to/p him/n2 → gave/v him/n2 an apple/n1
- Solution:
- no determiner before n1
- determiner before n2 is required
- the preposition cannot be "to"
23Paraphrases pattern (2)
- v n1 p n2 → v p n2 n1 (verb)
- If p n2 is an indirect object of v, then it could be switched with the direct object n1.
- had/v a program/n1 in/p place/n2 → had/v in/p place/n2 a program/n1
- A determiner before n1 is required to prevent n2 n1 from forming a noun compound.
24Paraphrases pattern (3)
- v n1 p n2 → p n2 * v n1 (verb)
- * indicates a wildcard position (up to three intervening words are allowed)
- Looks for appositions, where the PP has moved in front of the verb, e.g.
- I gave/v an apple/n1 to/p him/n2 → to/p him/n2 I gave/v an apple/n1
25Paraphrases pattern (4)
- v n1 p n2 → n1 p n2 v (noun)
- Looks for appositions, where n1 p n2 has moved in front of v
- shaken/v confidence/n1 in/p markets/n2 → confidence/n1 in/p markets/n2 shaken/v
26Paraphrases pattern (5)
- v n1 p n2 → v PRONOUN p n2 (verb)
- n1 is a pronoun → verb (Hindle & Rooth, 93)
- Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.
- put/v a client/n1 at/p odds/n2 → put/v him at/p odds/n2
27Paraphrases pattern (6)
- v n1 p n2 → BE n1 p n2 (noun)
- BE is typically used with a noun attachment
- Pattern (6) substitutes v with a form of "to be" (is or are), e.g.
- eat/v spaghetti/n1 with/p sauce/n2 → is spaghetti/n1 with/p sauce/n2
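The six paraphrase patterns for PP attachment can be written down as query templates. The Python sketch below is hypothetical throughout: the function name `pp_paraphrases`, the single determiner `the`, and the literal `*` for the wildcard are assumptions, and inflection handling is left out entirely. It does encode the stated constraints: pattern (1) needs a determiner before n2 and is skipped for the preposition "to", and pattern (2) needs a determiner before n1.

```python
def pp_paraphrases(v, n1, p, n2, det="the"):
    """Return (query, predicted attachment) pairs for patterns (1)-(6).

    '*' marks a wildcard position; how it is passed to a search
    engine is left open. `det` is one illustrative determiner.
    """
    queries = []
    if p != "to":  # pattern (1) does not apply to the preposition "to"
        queries.append((f"{v} {det} {n2} {n1}", "noun"))     # (1)
    queries.append((f"{v} {p} {n2} {det} {n1}", "verb"))     # (2)
    queries.append((f"{p} {n2} * {v} {n1}", "verb"))         # (3)
    queries.append((f"{n1} {p} {n2} {v}", "noun"))           # (4)
    for pronoun in ("him", "her"):
        queries.append((f"{v} {pronoun} {p} {n2}", "verb"))  # (5)
    for be in ("is", "are"):
        queries.append((f"{be} {n1} {p} {n2}", "noun"))      # (6)
    return queries


for query, label in pp_paraphrases("eat", "spaghetti", "with", "sauce"):
    print(f"{label}: {query}")
```

Each query's hit count would then serve as evidence for the attachment label paired with it.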
28Evaluation
- Ratnaparkhi dataset
- 3097 test examples (V = verb attachment, N = noun attachment), e.g.
- prepare dinner for family (V)
- shipped crabs from province (V)
- n1 or n2 is a bare determiner: 149 examples
- problem for unsupervised methods
- left chairmanship of the (N)
- is the of kind (N)
- acquire securities for an (N)
- special symbols (%, /, etc.): 230 examples
- problem for Web queries
- buy for 10 (V)
- beat S&P-down from (V)
- is 43%-owned by firm (N)
29Results
For prepositions other than "of" ("of" → noun attachment).
Smoothing is not needed on the Web.
Models in bold are combined in a majority vote.
Simpler but not significantly different from 84.3% (Pantel & Lin, 00).
Checking directly for...
31Coordination Problems
- (Modified) real sentence
- The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.
- Problems
- boundaries: words, constituents, clauses etc.
- interactions with PPs: [health and quality] of life vs. health and [quality of life]
- "or" meaning "and": chronic diseases or disabilities
- ellipsis
32NC Coordination: ellipsis
- Ellipsis
- car and truck production
- means car production and truck production
- No ellipsis
- president and chief executive
- All-way coordination
- Securities and Exchange Commission
33NC Coordination: ellipsis
- Quadruple (n1,c,n2,h)
- Penn Treebank annotations
- ellipsis
- (NP car/NN and/CC truck/NN production/NN).
- no ellipsis
- (NP (NP president/NN) and/CC (NP chief/NN executive/NN))
- all-way: can be annotated either way
- This is a problem a parser must deal with. Collins' parser always predicts ellipsis, but other parsers (e.g. Charniak's) try to solve it.
34Related Work
- (Resnik, 99): similarity of form and meaning, conceptual association, decision tree, P=80%, R=100%
- (Rus et al., 02): deterministic, rule-based bracketing in context, P=87.42%, R=71.05%
- (Chantree et al., 05): distributional similarities from BNC, Sketch Engine (freqs., object/modifier etc.), P=80.3%, R=53.8%
- (Goldberg, 99): different problem (n1,p,n2,c,n3), adapts Ratnaparkhi (99) algorithm, P=72%, R=100%
35N-gram models
- (n1,c,n2,h)
- (i) #(n1,h) vs. #(n2,h)
- (ii) #(n1,h) vs. #(n1,c,n2)
36Surface Features
[Table omitted; decision: sum the evidence for each reading, then compare]
37Paraphrases
- n1 c n2 h
- (1) n2 c n1 h (ellipsis)
- (2) n2 h c n1 (NO ellipsis)
- (3) n1 h c n2 h (ellipsis)
- (4) n2 h c n1 h (ellipsis)
38Paraphrases Pattern (1)
- n1 c n2 h → n2 c n1 h (ellipsis)
- Switch places of n1 and n2
- bar/n1 and/c pie/n2 graph/h → pie/n2 and/c bar/n1 graph/h
39Paraphrases Pattern (2)
- n1 c n2 h → n2 h c n1 (NO ellipsis)
- Switch places of n1 and n2 h
- president/n1 and/c chief/n2 executive/h → chief/n2 executive/h and/c president/n1
40Paraphrases Pattern (3)
- n1 c n2 h → n1 h c n2 h (ellipsis)
- Insert the elided head h
- bar/n1 and/c pie/n2 graph/h → bar/n1 graph/h and/c pie/n2 graph/h
41Paraphrases Pattern (4)
- n1 c n2 h → n2 h c n1 h (ellipsis)
- Insert the elided head h, but also switch n1 and n2
- bar/n1 and/c pie/n2 graph/h → pie/n2 graph/h and/c bar/n1 graph/h
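The four coordination paraphrase patterns are plain reorderings of the quadruple (n1, c, n2, h), so they are easy to sketch as query templates. The function name `coordination_paraphrases` and the returned pair format are assumptions for illustration; how the resulting queries are counted and combined is not shown.

```python
def coordination_paraphrases(n1, c, n2, h):
    """Return (query, prediction) pairs for coordination patterns (1)-(4)."""
    return [
        (f"{n2} {c} {n1} {h}", "ellipsis"),      # (1) switch n1 and n2
        (f"{n2} {h} {c} {n1}", "no ellipsis"),   # (2) switch n1 and n2 h
        (f"{n1} {h} {c} {n2} {h}", "ellipsis"),  # (3) insert the elided head
        (f"{n2} {h} {c} {n1} {h}", "ellipsis"),  # (4) insert head and switch
    ]


for query, label in coordination_paraphrases("bar", "and", "pie", "graph"):
    print(f"{label}: {query}")
```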
42(Rus et al., 02) Heuristics
- Heuristic 1: no ellipsis
- n1 = n2
- milk/n1 and/c milk/n2 products/h
- Heuristic 4: no ellipsis
- n1 and n2 are modified by an adjective
- Heuristic 5: ellipsis
- only n1 is modified by an adjective
- Heuristic 6: no ellipsis
- only n2 is modified by an adjective
We use a determiner.
43Number Agreement
- Introduced by Resnik (93)
- (a) n1 and n2 agree, but n1 and h do not → ellipsis
- (b) n1 and n2 don't agree, but n1 and h do → no ellipsis
- (c) otherwise leave undecided.
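The three number-agreement cases translate directly into a small decision function. In this Python sketch the name `number_agreement` and the boolean inputs are assumptions; deciding whether a noun is plural (for example with a morphological analyzer) is left to the caller.

```python
def number_agreement(n1_plural, n2_plural, h_plural):
    """Apply the number-agreement rule to (n1, c, n2, h).

    Each argument is True if the corresponding noun is plural.
    Returns 'ellipsis', 'no ellipsis', or None when undecided.
    """
    n1_n2_agree = n1_plural == n2_plural
    n1_h_agree = n1_plural == h_plural
    if n1_n2_agree and not n1_h_agree:
        return "ellipsis"      # case (a)
    if not n1_n2_agree and n1_h_agree:
        return "no ellipsis"   # case (b)
    return None                # case (c)


print(number_agreement(False, False, True))
```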
44Results: 428 examples from Penn TB
Bad: compares a bigram to a trigram (n-gram model (ii)).
Models in bold are combined in a majority vote.
Comparable to other researchers (but no standard dataset).
45Conclusions & Future Work
- Tapping the potential of very large corpora for unsupervised algorithms
- Go beyond n-grams
- Surface features
- Paraphrases
- Results competitive with the best unsupervised approaches
- Results can rival supervised algorithms
- Future Work
- other NLP tasks
- better evidence combination
There should be even more exciting features on the Web!
46Summary
- Context-Free Grammars
- Parsing
- Top Down, Bottom Up Metaphors
- Dynamic Programming Parsers: CKY, Earley
- Disambiguation
- PCFG
- Probabilistic Augmentations to Parsers
- Treebanks