Computational Linguistics - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: Computational Linguistics


1
Computational Linguistics
  • Lecture 5: Parsing

Based on Dan Jurafsky's textbook, Speech and
Language Processing, Ch. 13, and slides from
Preslav Nakov and Marti Hearst
2
  • Part III
  • Parsing
  • (structural disambiguation)
  • based on Web as Corpus

3
  • Using the Web as an Implicit Training Set:
    Application to Structural Ambiguity Resolution
  • Preslav Nakov and Marti Hearst, Computer Science
    Division and SIMS, University of California,
    Berkeley

4
Motivation
  • Huge datasets trump sophisticated algorithms
  • "Scaling to Very Very Large Corpora for Natural
    Language Disambiguation" (Banko & Brill, ACL 2001)
  • Task: spelling correction
  • Raw text as training data
  • Log-linear improvement even up to a billion words
  • Getting more data is better than fine-tuning
    algorithms
  • How can this generalize to other problems?

5
Web as a Baseline
  • Web as a baseline (Lapata & Keller, 2004; 2005):
    applied simple n-gram models to:
  • machine translation candidate selection
  • article generation
  • noun compound interpretation
  • noun compound bracketing
  • adjective ordering
  • spelling correction
  • countability detection
  • prepositional phrase attachment
  • Unsupervised Web-based methods rival the best
    supervised approaches
  • ⇒ Web n-grams should be used as a baseline.

6
Our Contribution
  • The potential of these ideas is not yet fully tapped
  • We introduce new features:
  • paraphrases
  • surface features
  • Applied to structural ambiguity problems
  • Data sparseness: we need statistics for every
    possible word and word combination
  • Problems addressed (unsupervised):
  • noun compound bracketing
  • PP attachment
  • NP coordination

7
Noun Phrase Bracketing
8
Noun Compound Bracketing
  • (a) [[liver cell] antibody]  (left bracketing)
  • (b) [liver [cell line]]      (right bracketing)
  • In (a), the antibody targets the liver cell.
  • In (b), the cell line is derived from the liver.

9
Dependency vs. Adjacency
(Figure: bracketing trees for the two analyses; the
dependency model and the adjacency model can each
prefer the left or the right bracketing.)
10
Related Work
  • Dependency model vs. adjacency model:
  • Pr(w1|w2) and Pr(w1|w3) vs. Pr(w1|w2) and
    Pr(w2|w3) (a sketch follows below)
  • Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
    vs. Lauer (1995)
  • Keller & Lapata (2004)
  • Web-based unigrams and bigrams
  • Nakov & Hearst (2005)
  • n-grams, paraphrases, surface features
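A minimal sketch of the two models, assuming a hypothetical count() association source (e.g. exact-phrase Web hit counts, as estimated later in these slides):

    # Sketch: dependency vs. adjacency model for bracketing "w1 w2 w3".
    def count(w1: str, w2: str) -> int:
        raise NotImplementedError  # e.g. page hits for the query "w1 w2"

    def bracket(w1: str, w2: str, w3: str, model: str = "dependency") -> str:
        if model == "dependency":
            # left if w1 associates more strongly with w2 than with w3
            left, right = count(w1, w2), count(w1, w3)
        else:  # adjacency
            # left if (w1, w2) is a tighter unit than (w2, w3)
            left, right = count(w1, w2), count(w2, w3)
        return "left" if left > right else "right"

    # e.g. bracket("liver", "cell", "antibody") should come out "left"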

11
Nakov & Hearst (2005)
  • Web page hits as a proxy for n-gram frequencies
  • Sample surface features:
  • amino-acid sequence → left
  • brain stem's cell → left
  • brain's stem cell → right
  • Majority vote to combine the different models
    (sketched below)
  • Accuracy: 89.34% (on Lauer's set)
  • Baseline: 66.70%; previous best result: 80.70%
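The combination step is a plain majority vote over the individual model predictions; a minimal sketch (the vote values and the abstention handling are assumptions):

    from collections import Counter

    def majority_vote(predictions):
        # predictions: one of "left" / "right" / None (abstain) per model
        votes = Counter(p for p in predictions if p is not None)
        return votes.most_common(1)[0][0] if votes else None

    # e.g. majority_vote(["left", "left", "right", None]) -> "left"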

12
Web Counts: Problems
  • Page hits are inaccurate
  • maybe not that bad (Keller & Lapata, 2003)
  • The Web lacks linguistic annotation:
  • Pr(health|care) = #(health care) / #(care)
  • health: noun
  • care: both verb and noun
  • the two words can be adjacent by chance
  • or can come from different sentences
  • Cannot search for POS tags and punctuation marks:
  • stem cells VERB PREPOSITION brain
  • protein synthesis inhibition

13
Paraphrases (Warren, 1978)
  • NC bracketing is related to how the NC can be paraphrased
  • E.g., human immunodeficiency virus: left or right?
  • Prepositional paraphrase:
  • immunodeficiency virus in humans → right
  • Verbal paraphrases:
  • virus causing human immunodeficiency → left
  • immunodeficiency virus found in humans → right
  • Copula paraphrase:
  • immunodeficiency virus that is human → right
  • Answer: right (a query sketch follows below)
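A sketch of how these paraphrase cues can be issued as exact-phrase queries; page_hits() is a hypothetical hit-count helper, and the trailing-s pluralization is a naive simplification:

    def page_hits(phrase: str) -> int:
        raise NotImplementedError  # exact-phrase Web query

    def paraphrase_bracketing(w1: str, w2: str, w3: str):
        right = (page_hits(f'"{w2} {w3} in {w1}s"')          # prepositional
                 + page_hits(f'"{w2} {w3} found in {w1}s"')  # verbal
                 + page_hits(f'"{w2} {w3} that is {w1}"'))   # copula
        left = page_hits(f'"{w3} causing {w1} {w2}"')        # verbal
        if left == right:
            return None
        return "left" if left > right else "right"

    # paraphrase_bracketing("human", "immunodeficiency", "virus") -> "right"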

14
Results
(Results chart: per-model counts of correct, N/A, and
wrong predictions; not reproduced here.)
15
  • Prepositional Phrase Attachment

16
PP attachment
  • (a) Peter spent millions of dollars.  (noun: the PP
    combines with the NP to form another NP)
  • (b) Peter spent time with his family.  (verb: the PP
    is an indirect object of the verb)
  • Quadruple: (v, n1, p, n2)
  • (a) (spent, millions, of, dollars)
  • (b) (spent, time, with, family)
  • Human performance:
  • quadruple only: 88%
  • whole sentence: 93%

17
PP Attachment: Web-based Approach
  • Unsupervised
  • (v, n1, p, n2) quadruples; Ratnaparkhi test set
  • Google and MSN Search
  • Exact-phrase queries
  • Inflections: WordNet 2.0
  • Adding determiners where appropriate
  • Models
  • n-gram association models
  • Web-derived surface features
  • paraphrases

18
Probability Estimation
  • Using page hits as a proxy for n-gram counts
    (a sketch follows below):
  • Pr(w1|w2) = #(w1,w2) / #(w2)
  • #(w2): word-frequency query for "w2"
  • #(w1,w2): bigram-frequency query for "w1 w2"
  • Pr(w1,w2|w3) = #(w1,w2,w3) / #(w3)
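A minimal sketch of these estimates, with a hypothetical page_hits() helper standing in for the search-engine hit counts (real APIs and their quoting rules vary):

    def page_hits(phrase: str) -> int:
        raise NotImplementedError  # hit count for an exact-phrase query

    def pr(w1: str, w2: str) -> float:
        # Pr(w1|w2) ~ #(w1,w2) / #(w2)
        return page_hits(f'"{w1} {w2}"') / max(page_hits(f'"{w2}"'), 1)

    def pr2(w1: str, w2: str, w3: str) -> float:
        # Pr(w1,w2|w3) ~ #(w1,w2,w3) / #(w3)
        return page_hits(f'"{w1} {w2} {w3}"') / max(page_hits(f'"{w3}"'), 1)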

19
N-gram models
  • (i) Pr(p|n1) vs. Pr(p|v)
  • (ii) Pr(p,n2|n1) vs. Pr(p,n2|v)
  • I eat/v spaghetti/n1 with/p a fork/n2.
  • I eat/v spaghetti/n1 with/p sauce/n2.
  • Pr or # (frequency)
  • smoothing as in (Hindle & Rooth, 93)
  • back-off from (ii) to (i) (a sketch follows below)
  • N-grams are unreliable if n1 or n2 is a pronoun
  • MSN Search: no rounding of n-gram estimates
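A minimal sketch of models (i) and (ii) with the back-off from (ii) to (i); hits() is a hypothetical exact-phrase counter, and triggering the back-off on zero trigram counts is an assumption:

    def hits(phrase: str) -> int:
        raise NotImplementedError  # exact-phrase Web query

    def pp_attach(v: str, n1: str, p: str, n2: str) -> str:
        # model (ii): Pr(p,n2|n1) vs. Pr(p,n2|v)
        n_tri, v_tri = hits(f'"{n1} {p} {n2}"'), hits(f'"{v} {p} {n2}"')
        if n_tri + v_tri > 0:
            noun = n_tri / max(hits(f'"{n1}"'), 1)
            verb = v_tri / max(hits(f'"{v}"'), 1)
        else:
            # back off to model (i): Pr(p|n1) vs. Pr(p|v)
            noun = hits(f'"{n1} {p}"') / max(hits(f'"{n1}"'), 1)
            verb = hits(f'"{v} {p}"') / max(hits(f'"{v}"'), 1)
        return "noun" if noun >= verb else "verb"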

20
Web-derived Surface Features
  • Example features (P %, R %):
  • open the door / with a key → verb (100, 0.13)
  • open the door (with a key) → verb ( 74, 2.44)
  • open the door - with a key → verb ( 68, 2.03)
  • open the door , with a key → verb ( 58, 7.09)
  • eat . Spaghetti with sauce → noun (100, 0.14)
  • eat : spaghetti with sauce → noun ( 83, 0.55)
  • eat , spaghetti with sauce → noun ( 66, 5.11)
  • eat - spaghetti with sauce → noun ( 65, 1.57)
  • high precision, low recall (a matching sketch
    follows below)

(Diagram: the verb-attachment feature counts are summed,
the noun-attachment counts are summed, and the two sums
are compared.)
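A sketch of how a few such cues could be matched; snippets() is a hypothetical helper returning raw text matches, and the cue regexes are illustrative (re.escape omitted for readability), not the full feature inventory:

    import re

    def snippets(query: str) -> list:
        raise NotImplementedError  # text snippets matching the query

    def surface_attach(v, n1, p, n2):
        text = " ".join(snippets(f"{v} {n1} {p} {n2}"))
        # punctuation separating n1 from the PP suggests verb attachment
        verb_hits = len(re.findall(rf"{n1} ?[/(,-] ?{p}\b", text, re.I))
        # punctuation separating v from n1 suggests noun attachment
        noun_hits = len(re.findall(rf"{v} ?[:.,-] ?{n1}\b", text, re.I))
        if verb_hits == noun_hits:
            return None  # the cues are precise but often do not fire
        return "verb" if verb_hits > noun_hits else "noun"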
21
Paraphrases
  • v n1 p n2 →
  • (1) v n2 n1 (noun)
  • (2) v p n2 n1 (verb)
  • (3) p n2 v n1 (verb)
  • (4) n1 p n2 v (noun)
  • (5) v PRONOUN p n2 (verb)
  • (6) BE n1 p n2 (noun)
  • (a query-template sketch follows below)
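A sketch that instantiates the six templates as query strings; the determiner and inflection constraints from the following slides are omitted, and a nonzero hit count would be one vote for the predicted attachment:

    def pp_paraphrases(v: str, n1: str, p: str, n2: str):
        # (pattern number, exact-phrase query, predicted attachment)
        return [
            (1, f"{v} {n2} {n1}",      "noun"),
            (2, f"{v} {p} {n2} {n1}",  "verb"),
            (3, f"{p} {n2} {v} {n1}",  "verb"),  # wildcard slot omitted
            (4, f"{n1} {p} {n2} {v}",  "noun"),
            (5, f"{v} him {p} {n2}",   "verb"),  # PRONOUN -> him
            (6, f"is {n1} {p} {n2}",   "noun"),  # BE -> is
        ]

    # e.g. pp_paraphrases("eat", "spaghetti", "with", "sauce")[5]
    #  -> (6, "is spaghetti with sauce", "noun")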

22
Paraphrases: pattern (1)
  • v n1 p n2 → v n2 n1 (noun)
  • Can we turn "n1 p n2" into a noun compound "n2 n1"?
  • meet/v demands/n1 from/p customers/n2 →
  • meet/v the customer/n2 demands/n1
  • Problem: ditransitive verbs like give:
  • gave/v an apple/n1 to/p him/n2 →
  • gave/v him/n2 an apple/n1
  • Solution:
  • no determiner before n1
  • determiner before n2 is required
  • the preposition cannot be to

23
Paraphrases: pattern (2)
  • v n1 p n2 → v p n2 n1 (verb)
  • If "p n2" is an indirect object of v, then it
    could be switched with the direct object n1:
  • had/v a program/n1 in/p place/n2 →
  • had/v in/p place/n2 a program/n1

A determiner before n1 is required to prevent "n2
n1" from forming a noun compound.
24
Paraphrases: pattern (3)
  • v n1 p n2 → p n2 * v n1 (verb)
  • * indicates a wildcard position (up to three
    intervening words are allowed)
  • Looks for appositions, where the PP has moved in
    front of the verb, e.g.:
  • I gave/v an apple/n1 to/p him/n2 →
  • to/p him/n2 I gave/v an apple/n1

25
Paraphrases: pattern (4)
  • v n1 p n2 → n1 p n2 v (noun)
  • Looks for appositions, where "n1 p n2" has moved
    in front of v:
  • shaken/v confidence/n1 in/p markets/n2 →
  • confidence/n1 in/p markets/n2 shaken/v

26
Paraphrases: pattern (5)
  • v n1 p n2 → v PRONOUN p n2 (verb)
  • If n1 is a pronoun → verb (Hindle & Rooth, 93)
  • Pattern (5) substitutes n1 with a dative pronoun
    (him or her), e.g.:
  • put/v a client/n1 at/p odds/n2 →
  • put/v him/PRONOUN at/p odds/n2
27
Paraphrases: pattern (6)
  • v n1 p n2 → BE n1 p n2 (noun)
  • BE is typically used with a noun attachment
  • Pattern (6) substitutes v with a form of to be
    (is or are), e.g.:
  • eat/v spaghetti/n1 with/p sauce/n2 →
  • is/BE spaghetti/n1 with/p sauce/n2
28
Evaluation
  • Ratnaparkhi dataset:
  • 3097 test examples, e.g.:
  • prepare dinner for family → V
  • shipped crabs from province → V
  • n1 or n2 is a bare determiner: 149 examples
  • a problem for unsupervised methods:
  • left chairmanship of the → N
  • is the of kind → N
  • acquire securities for an → N
  • special symbols (%, /, etc.): 230 examples
  • a problem for Web queries:
  • buy % for 10 → V
  • beat S&P-down from → V
  • is 43%-owned by firm → N

29
Results
For prepositions other than of (of → noun attachment).
Smoothing is not needed on the Web.
Models in bold are combined in a majority vote.
Simpler than, but not significantly different from,
the 84.3% of (Pantel & Lin, 00).
Checking directly for...
30
  • Coordination

31
Coordination: Problems
  • (Modified) real sentence
  • The Department of Chronic Diseases and Health
    Promotion leads and strengthens global efforts to
    prevent and control chronic diseases or
    disabilities and to promote health and quality of
    life.
  • Problems:
  • boundaries: words, constituents, clauses, etc.
  • interactions with PPs: [health and quality] of
    life vs. health and [quality of life]
  • or meaning and: chronic diseases or disabilities
  • ellipsis

32
NC Coordination: Ellipsis
  • Ellipsis:
  • car and truck production
  • means car production and truck production
  • No ellipsis:
  • president and chief executive
  • All-way coordination:
  • Securities and Exchange Commission

33
NC Coordination: Ellipsis
  • Quadruple: (n1, c, n2, h)
  • Penn Treebank annotations:
  • ellipsis:
  • (NP car/NN and/CC truck/NN production/NN)
  • no ellipsis:
  • (NP (NP president/NN) and/CC (NP chief/NN
    executive/NN))
  • all-way: can be annotated either way
  • This is a problem a parser must deal with.

The Collins parser always predicts ellipsis, but
other parsers (e.g. Charniak's) try to solve it.
34
Related Work
  • (Resnik, 99): similarity of form and meaning,
    conceptual association, decision tree; P=80%, R=100%
  • (Rus et al., 02): deterministic, rule-based
    bracketing in context; P=87.42%, R=71.05%
  • (Chantree et al., 05): distributional similarities
    from the BNC, Sketch Engine (frequencies,
    object/modifier relations, etc.); P=80.3%, R=53.8%
  • (Goldberg, 99): a different problem (n1, p, n2, c, n3);
    adapts the Ratnaparkhi (99) algorithm; P=72%, R=100%

35
N-gram models
  • (n1, c, n2, h)
  • (i) #(n1,h) vs. #(n2,h)
  • (ii) #(n1,h) vs. #(n1,c,n2) (a sketch follows below)
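A minimal sketch of the two association models; hits() is a hypothetical exact-phrase counter, and reading a stronger (n1, h) association as evidence for ellipsis is an assumption about the decision direction:

    def hits(phrase: str) -> int:
        raise NotImplementedError  # exact-phrase Web query

    def ellipsis_ngram(n1, c, n2, h, model="i"):
        if model == "i":   # (i): #(n1,h) vs. #(n2,h)
            ell, no_ell = hits(f'"{n1} {h}"'), hits(f'"{n2} {h}"')
        else:              # (ii): #(n1,h) vs. #(n1,c,n2)
            ell, no_ell = hits(f'"{n1} {h}"'), hits(f'"{n1} {c} {n2}"')
        return "ellipsis" if ell > no_ell else "no ellipsis"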

36
Surface Features
(Diagram: as with PP attachment, the feature counts for
each reading are summed and the two sums are compared.)
37
Paraphrases
  • n1 c n2 h →
  • (1) n2 c n1 h (ellipsis)
  • (2) n2 h c n1 (NO ellipsis)
  • (3) n1 h c n2 h (ellipsis)
  • (4) n2 h c n1 h (ellipsis)
  • (a query-template sketch follows below)
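As with the PP-attachment paraphrases, the four templates can be instantiated as exact-phrase queries; a sketch:

    def nc_paraphrases(n1: str, c: str, n2: str, h: str):
        # (pattern number, exact-phrase query, predicted reading)
        return [
            (1, f"{n2} {c} {n1} {h}",     "ellipsis"),
            (2, f"{n2} {h} {c} {n1}",     "no ellipsis"),
            (3, f"{n1} {h} {c} {n2} {h}", "ellipsis"),
            (4, f"{n2} {h} {c} {n1} {h}", "ellipsis"),
        ]

    # e.g. nc_paraphrases("bar", "and", "pie", "graph")[2]
    #  -> (3, "bar graph and pie graph", "ellipsis")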

38
Paraphrases: Pattern (1)
  • n1 c n2 h → n2 c n1 h (ellipsis)
  • Switch the places of n1 and n2:
  • bar/n1 and/c pie/n2 graph/h →
  • pie/n2 and/c bar/n1 graph/h

39
Paraphrases: Pattern (2)
  • n1 c n2 h → n2 h c n1 (NO ellipsis)
  • Switch the places of n1 and "n2 h":
  • president/n1 and/c chief/n2 executive/h →
  • chief/n2 executive/h and/c president/n1

40
Paraphrases: Pattern (3)
  • n1 c n2 h → n1 h c n2 h (ellipsis)
  • Insert the elided head h:
  • bar/n1 and/c pie/n2 graph/h →
  • bar/n1 graph/h and/c pie/n2 graph/h

41
Paraphrases: Pattern (4)
  • n1 c n2 h → n2 h c n1 h (ellipsis)
  • Insert the elided head h, but also switch n1 and n2:
  • bar/n1 and/c pie/n2 graph/h →
  • pie/n2 graph/h and/c bar/n1 graph/h

42
(Rus et al., 02) Heuristics
  • Heuristic 1 (no ellipsis):
  • n1 = n2
  • milk/n1 and/c milk/n2 products/h
  • Heuristic 4 (no ellipsis):
  • both n1 and n2 are modified by an adjective
  • Heuristic 5 (ellipsis):
  • only n1 is modified by an adjective
  • Heuristic 6 (no ellipsis):
  • only n2 is modified by an adjective

We use a determiner in place of the adjective.
43
Number Agreement
  • Introduced by Resnik (93):
  • (a) n1 and n2 agree in number, but n1 and h do not → ellipsis
  • (b) n1 and n2 do not agree, but n1 and h do → no ellipsis
  • (c) otherwise: leave undecided (a sketch follows below)
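A minimal sketch of the agreement test; is_plural() is a hypothetical helper (in practice it would come from POS tags or a morphological lexicon):

    def is_plural(noun: str) -> bool:
        raise NotImplementedError

    def number_agreement(n1: str, n2: str, h: str):
        n1_n2 = is_plural(n1) == is_plural(n2)  # do n1 and n2 agree?
        n1_h = is_plural(n1) == is_plural(h)    # do n1 and h agree?
        if n1_n2 and not n1_h:
            return "ellipsis"       # (a)
        if not n1_n2 and n1_h:
            return "no ellipsis"    # (b)
        return None                 # (c) leave undecided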

44
Results: 428 examples from the Penn Treebank
Model (ii) is bad: it compares a bigram against a trigram.
Models in bold are combined in a majority vote.
Comparable to other researchers' results (but there
is no standard dataset).
45
Conclusions & Future Work
  • Tapping the potential of very large corpora for
    unsupervised algorithms
  • Go beyond n-grams:
  • surface features
  • paraphrases
  • Results are competitive with the best unsupervised
    approaches
  • Results can rival supervised algorithms
  • Future work:
  • other NLP tasks
  • better evidence combination

There should be even more exciting features on
the Web!
46
Summary
  • Context-Free Grammars
  • Parsing
  • Top-Down and Bottom-Up Metaphors
  • Dynamic Programming Parsers: CKY, Earley
  • Disambiguation
  • PCFG
  • Probabilistic Augmentations to Parsers
  • Treebanks