Transcript and Presenter's Notes

Title: CS 114: Introduction to Computational Linguistics


1
CS 114: Introduction to Computational Linguistics
James Pustejovsky
Discourse Structure
April 1, 2008
2
Outline
  • Discourse Structure
  • TextTiling
  • Coherence
  • Hobbs coherence relations
  • Rhetorical Structure Theory
  • Coreference
  • Kinds of reference phenomena
  • Constraints on co-reference
  • Anaphora Resolution
  • Hobbs
  • Loglinear
  • Coreference

3
Part I: Discourse Structure
  • Conventional structures for different genres
  • Academic articles
  • Abstract, Introduction, Methodology, Results,
    Conclusion
  • Newspaper story
  • inverted pyramid structure (lead followed by
    expansion)

4
Discourse Segmentation
  • Simpler task
  • Discourse segmentation
  • Separating document into linear sequence of
    subtopics
  • Why?
  • Information retrieval
  • automatically segmenting a TV news broadcast or a
    long news story into sequence of stories
  • Text summarization
  • summarize different segments of a document
  • Information extraction
  • Extract info from inside a single discourse
    segment

5
Unsupervised Discourse Segmentation
  • Hearst (1997): 21-paragraph science news article
    called "Stargazers"
  • Goal: produce the following subtopic segments

6
Key intuition: cohesion
  • Halliday and Hasan (1976): the use of certain
    linguistic devices to link or tie together
    textual units
  • Lexical cohesion
  • Indicated by relations between words in the two
    units (identical word, synonym, hypernym)
  • Before winter I built a chimney, and shingled the
    sides of my house.
  • I thus have a tight shingled and plastered
    house.
  • Peel, core and slice the pears and the apples.
    Add the fruit to the skillet.

7
Key intuition: cohesion
  • Non-lexical anaphora
  • The Woodhouses were first in consequence there.
    All looked up to them.
  • Cohesion chain
  • Peel, core and slice the pears and the apples.
    Add the fruit to the skillet. When they are soft ...

8
Intuition of cohesion-based segmentation
  • Sentences or paragraphs in a subtopic are
    cohesive with each other
  • But not with paragraphs in a neighboring subtopic
  • Thus if we measured the cohesion between every
    pair of neighboring sentences
  • We might expect a dip in cohesion at subtopic
    boundaries.

9
What makes a text/dialogue coherent?
  • Consider, for example, the difference between
    passages (18.71) and (18.72). Almost certainly
    not. The reason is that these utterances, when
    juxtaposed, will not exhibit coherence. Do you
    have a discourse? Assume that you have collected
    an arbitrary set of well-formed and independently
    interpretable utterances, for instance, by
    randomly selecting one sentence from each of the
    previous chapters of this book.
  • vs.

10
What makes a text/dialogue coherent?
  • Assume that you have collected an arbitrary
    set of well-formed and independently
    interpretable utterances, for instance, by
    randomly selecting one sentence from each of the
    previous chapters of this book. Do you have a
    discourse? Almost certainly not. The reason is
    that these utterances, when juxtaposed, will not
    exhibit coherence. Consider, for example, the
    difference between passages (18.71) and (18.72).
    (J&M, p. 695)

11
What makes a text coherent?
  • Discourse/topic structure
  • Appropriate sequencing of subparts of the
    discourse
  • Rhetorical structure
  • Appropriate use of coherence relations between
    subparts of the discourse
  • Referring expressions
  • Words or phrases, the semantic interpretation of
    which is a discourse entity

12
Information Status
  • Contrast
  • John wanted a poodle but Becky preferred a corgi.
  • Topic/comment
  • The corgi they bought turned out to have fleas.
  • Theme/rheme
  • The corgi they bought turned out to have fleas.
  • Focus/presupposition
  • It was Becky who took him to the vet.
  • Given/new
  • Some wildcats bite, but this wildcat turned out
    to be a sweetheart.
  • Contrast: Speaker (S) and Hearer (H)

13
Determining Given vs. New
  • Entities when first introduced are new
  • Brand-new (H must create a new entity)
  • I saw a dinosaur today.
  • Unused (H already knows of this entity)
  • I saw your mother today.
  • Evoked entities are old -- already in the
    discourse
  • Textually evoked
  • The dinosaur was scaly and gray.
  • Situationally evoked
  • The light was red when you went through it.
  • Inferrables
  • Containing
  • I bought a carton of eggs. One of them was
    broken.
  • Non-containing
  • A bus pulled up beside me. The driver was a
    monkey.

14
Given/New and Definiteness/Indefiniteness
  • Subject NPs tend to be syntactically definite and
    old
  • Object NPs tend to be indefinite and new
  • I saw a black cat yesterday. The cat looked
    hungry.
  • Definite articles, demonstratives, possessives,
    personal pronouns, proper nouns, and quantifiers like
    all, every signal definiteness
  • Indefinite articles and quantifiers like some, any,
    one signal indefiniteness ... but:
  • This guy came into the room ...

15
Discourse/Topic Structure
  • Text Segmentation
  • Linear
  • TextTiling
  • Look for changes in content words
  • Hierarchical
  • Grosz & Sidner's Centering theory
  • Morris & Hirst's algorithm
  • Lexical chaining through Roget's thesaurus
  • Hierarchical Relations
  • Mann et al.'s Rhetorical Structure Theory
  • Marcu's algorithm

16
TextTiling (Hearst 94)
  • Goal: find multi-paragraph topics
  • Example: 21-paragraph article called "Stargazers"

17
TextTiling (Hearst 94)
  • Goal: find multi-paragraph topics
  • But it's difficult to define "topic" (Brown &
    Yule)
  • Focus instead on topic shift or change
  • Change in content, by contrast with setting,
    scene, characters
  • Mechanism
  • compare adjacent blocks of text
  • look for shifts in vocabulary

18
Intuition behind TextTiling
19
TextTiling Algorithm
  • Tokenization
  • Lexical Score Determination
  • Blocks
  • Vocabulary Introductions
  • Chains
  • Boundary Identification

20
Tokenization
  • Convert text stream into terms (words)
  • Remove stop words
  • Reduce to root (inflectional morphology)
  • Subdivide into token-sequences
  • (substitute for sentences)
  • Find potential boundary points
    (paragraph breaks)

21
Determining Scores
  • Compute a score at each token-sequence gap
  • Score based on lexical occurrences
  • Block algorithm

22
(No Transcript)
23
Boundary Identification
  • Smooth the plot (average smoothing)
  • Assign a depth score at each token-sequence gap
  • Deeper valleys score higher
  • Order boundaries by depth score
  • Choose the boundary cutoff: mean depth score minus
    half a standard deviation (avg - sd/2), as sketched
    below

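A minimal Python sketch of this boundary-identification step, assuming the per-gap lexical scores (e.g. similarities between the blocks on either side of each gap) have already been computed; the smoothing window and function names are illustrative, not Hearst's exact implementation:

    def depth_scores(gap_scores, smoothing_window=2):
        """Smooth the gap-score plot, then score each gap by how deep a
        valley it sits in (drop from the peak on its left plus the drop
        from the peak on its right)."""
        n = len(gap_scores)
        smoothed = []
        for i in range(n):                        # simple moving average
            lo, hi = max(0, i - smoothing_window), min(n, i + smoothing_window + 1)
            smoothed.append(sum(gap_scores[lo:hi]) / (hi - lo))
        depths = []
        for i, s in enumerate(smoothed):
            left, j = s, i                        # climb left while scores rise
            while j > 0 and smoothed[j - 1] >= left:
                left = smoothed[j - 1]
                j -= 1
            right, j = s, i                       # climb right while scores rise
            while j < n - 1 and smoothed[j + 1] >= right:
                right = smoothed[j + 1]
                j += 1
            depths.append((left - s) + (right - s))
        return depths

    def pick_boundaries(depths):
        """Keep the gaps whose depth exceeds the cutoff (avg - sd/2)."""
        mean = sum(depths) / len(depths)
        sd = (sum((d - mean) ** 2 for d in depths) / len(depths)) ** 0.5
        cutoff = mean - sd / 2
        return [i for i, d in enumerate(depths) if d > cutoff]
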
24
Evaluation
  • Data
  • Twelve news articles from Dialog
  • Seven human judges per article
  • major boundaries = those chosen by at least 3 judges
  • Avg. number of paragraphs: 26.75
  • Avg. number of boundaries: 10 (39%)
  • Results
  • Between upper and lower bounds
  • Upper bound: the judges' average agreement
  • Lower bound: a reasonable simple baseline algorithm

25
Assessing Agreement Among Judges
  • Kappa coefficient
  • Measures pairwise agreement
  • Takes expected chance agreement into account
  • P(A) = proportion of times judges agree
  • P(E) = proportion expected to agree by chance
  • .43 to .68 (Isard & Carletta '95, boundaries)
  • .65 to .90 (Rose '95, sentence segmentation)
  • Here, kappa = .647 (a small worked example follows)

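The kappa computation itself is a one-liner; a tiny sketch with made-up P(A) and P(E) values chosen only to land near the figure quoted above (they are not Hearst's actual counts):

    def kappa(p_agree, p_chance):
        """Agreement corrected for chance: kappa = (P(A) - P(E)) / (1 - P(E))."""
        return (p_agree - p_chance) / (1.0 - p_chance)

    # e.g. raw agreement of 0.80 against chance agreement of 0.43:
    print(round(kappa(0.80, 0.43), 3))   # 0.649
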
26
Part II: Text Coherence
  • What makes a discourse coherent?
  • The reason is that these utterances, when
    juxtaposed, will not exhibit coherence. Almost
    certainly not. Do you have a discourse? Assume
    that you have collected an arbitrary set of
    well-formed and independently interpretable
    utterances, for instance, by randomly selecting
    one sentence from each of the previous chapters
    of this book.

27
Better?
  • Assume that you have collected an arbitrary set
    of well-formed and independently interpretable
    utterances, for instance, by randomly selecting
    one sentence from each of the previous chapters
    of this book. Do you have a discourse? Almost
    certainly not. The reason is that these
    utterances, when juxtaposed, will not exhibit
    coherence.

28
Coherence
  • John hid Bill's car keys. He was drunk.
  • ??John hid Bill's car keys. He likes spinach.

29
What makes a text coherent?
  • Appropriate use of coherence relations between
    subparts of the discourse -- rhetorical structure
  • Appropriate sequencing of subparts of the
    discourse -- discourse/topic structure
  • Appropriate use of referring expressions

30
Hobbs 1979 Coherence Relations
  • Result
  • Infer that the state or event asserted by S0
    causes or could cause the state or event asserted
    by S1.
  • The Tin Woodman was caught in the rain. His
    joints rusted.

31
Hobbs: Explanation
  • Infer that the state or event asserted by S1
    causes or could cause the state or event asserted
    by S0.
  • John hid Bill's car keys. He was drunk.

32
Hobbs: Parallel
  • Infer p(a1, a2, ...) from the assertion of S0 and
    p(b1, b2, ...) from the assertion of S1, where ai and
    bi are similar, for all i.
  • The Scarecrow wanted some brains. The Tin Woodman
    wanted a heart.

33
Hobbs: Elaboration
  • Infer the same proposition P from the assertions
    of S0 and S1.
  • Dorothy was from Kansas. She lived in the midst
    of the great Kansas prairies.

34
Coherence relations impose a discourse structure
35
Rhetorical Structure Theory
  • Another theory of discourse structure, based on
    identifying relations between segments of the
    text
  • Nucleus/satellite notion encodes asymmetry
  • The nucleus is the part that, if you deleted it,
    the text wouldn't make sense.
  • Some rhetorical relations
  • Elaboration (set/member, class/instance,
    whole/part)
  • Contrast: multinuclear
  • Condition: Sat presents a precondition for N
  • Purpose: Sat presents the goal of the activity in N

36
One example of a rhetorical relation
  • A sample definition
  • Relation: Evidence
  • Constraints on N: H might not believe N as much
    as S thinks s/he should
  • Constraints on Sat: H already believes or will
    believe Sat
  • Effect: H's belief in N is increased
  • An example
  • Kevin must be here. (Nucleus)
  • His car is parked outside. (Satellite)
37
Automatic Rhetorical Structure Labeling
  • Supervised machine learning
  • Get a group of annotators to assign a set of RST
    relations to a text
  • Extract a set of surface features from the text
    that might signal the presence of the rhetorical
    relations in that text
  • Train a supervised ML system based on the
    training set

38
Features: cue phrases (used in the toy sketch after this slide)
  • Explicit markers: because, however, therefore,
    then, etc.
  • Tendency of certain syntactic structures to
    signal certain relations
  • Infinitives are often used to signal purpose
    relations: "Use rm to delete files."
  • Ordering
  • Tense/aspect
  • Intonation

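A toy sketch of the supervised pipeline from the previous slide, assuming scikit-learn; the feature names, labels, and three training pairs are invented for illustration and are not the feature set of any published RST labeler:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical annotated (features, relation) pairs: surface cues
    # extracted from a nucleus/satellite segment pair.
    train = [
        ({"cue_because": 1, "sat_second": 1, "sat_infinitive": 0}, "Evidence"),
        ({"cue_because": 0, "sat_second": 1, "sat_infinitive": 1}, "Purpose"),
        ({"cue_however": 1, "sat_second": 1, "sat_infinitive": 0}, "Contrast"),
    ]
    feats, labels = zip(*train)

    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(feats), labels)

    # Label an unseen segment pair described by the same surface features.
    test = vec.transform([{"cue_because": 1, "sat_second": 1, "sat_infinitive": 0}])
    print(clf.predict(test))   # e.g. ['Evidence'] on this toy data
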
39
Some Problems with RST
  • How many Rhetorical Relations are there?
  • How can we use RST in dialogue as well as
    monologue?
  • RST does not model overall structure of the
    discourse.
  • Difficult to get annotators to agree on labeling
    the same texts

40
Part III: Coreference
  • Victoria Chen, Chief Financial Officer of
    Megabucks Banking Corp since 2004, saw her pay
    jump 20%, to $1.3 million, as the 37-year-old
    also became the Denver-based financial-services
    company's president. It has been ten years since
    she came to Megabucks from rival Lotsabucks.
  • The Tin Woodman went to the Emerald City to see
    the Wizard of Oz and ask for a heart. After he
    asked for it, the Woodman waited for the Wizard's
    response.

41
Why reference resolution?
  • Information Extraction
  • President of which company is retiring?
  • First Union Corp. is continuing to wrestle with
    severe problems unleashed by a botched merger and
    a troubled business strategy. According to
    industry insiders at Paine Webber, their
    president, John R. Georgius, is planning to
    retire by the end of the year.

42
Some terminology
  • Reference
  • the process by which speakers use the words Victoria
    Chen and she to denote a particular person
  • Referring expressions: Victoria Chen, she
  • Referent: the actual entity (but as a shorthand
    we might call Victoria Chen the referent)
  • Victoria Chen and she corefer
  • Antecedent: Victoria Chen
  • Anaphor: she

43
Tasks
  • Pronominal anaphora resolution
  • Given a pronoun, find its antecedent
  • Coreference resolution
  • Find the coreference relations among all
    referring expressions
  • Each set of referring expressions is a
    coreference chain. What are the chains in our
    story?
  • Victoria Chen, Chief Financial Officer of
    Megabucks Banking Corp, her, the 37-year-old, the
    Denver-based financial-services company's
    president, she
  • Megabucks Banking Corp., the Denver-based
    financial-services company, Megabucks
  • her pay
  • Lotsabucks

44
Many types of reference
  • (after Webber 91)
  • According to Doug, Sue just bought a 1962 Ford
    Falcon
  • But that turned out to be a lie (a speech act)
  • But that was false (proposition)
  • That struck me as a funny way to describe the
    situation (manner of description)
  • That caused Sue to become rather poor (event)

45
4 types of referring expressions
  • Indefinite noun phrases: new to the hearer
  • Mrs. Martin was so very kind as to send Mrs.
    Goddard a beautiful goose
  • He had gone round one day to bring her some
    walnuts.
  • I am going to the butcher's to buy a goose
    (specific/non-specific)
  • I hope they still have it
  • I hope they still have one
  • Definite noun phrases: identifiable to the hearer
    because ...
  • Mentioned: It concerns a white stallion which I
    have sold to an officer. But the pedigree of the
    white stallion was not fully established.
  • Identifiable from beliefs, or unique: I read about
    it in The New York Times
  • Inherently unique: The fastest car in ...

46
Reference Phenomena: 3. Pronouns
  • Emma smiled and chatted as cheerfully as she
    could.
  • Compared to definite noun phrases, pronouns
    require more referent salience.
  • John went to Bob's party, and parked next to a
    classic Ford Falcon.
  • He went inside and talked to Bob for more than an
    hour.
  • Bob told him that he recently got engaged.
  • ??He also said that he bought it yesterday.
  • OK: He also said that he bought the Falcon
    yesterday.

47
More on Pronouns
  • Cataphora: pronoun appears before referent
  • Even before she saw it, Dorothy had been thinking
    about the Emerald City every day.

48
4. Names
  • Miss Woodhouse certainly had not done him
    justice.
  • International Business Machines sought patent
    compensation from Amazon. In fact, IBM had
    previously sued a number of other companies.

49
Complications: Inferrables and Generics
  • Inferrables (bridging inferences)
  • I almost bought a 1962 Ford Falcon today, but a
    door had a dent and the engine seemed noisy.
  • Generics
  • I'm interested in buying a Mac laptop. They are
    very stylish.
  • In March in Boulder you have to wear a jacket.

50
Features for pronominal anaphora resolution
  • Number agreement
  • John has a Ford Falcon. It is red.
  • ??John has three Ford Falcons. It is red.
  • But note
  • IBM is announcing a new machine translation
    product. They have been working on it for 20
    years.
  • Gender agreement
  • John has an Acura. He/it/she is attractive.
  • Syntactic constraints (Binding Theory)
  • John bought himself a new Ford (himself = John)
  • John bought him a new Ford (him ≠ John)

51
Pronoun Interpretation Features
  • Selectional Restrictions
  • John parked his Ford in the garage. He had
    driven it around for hours.
  • Recency
  • The doctor found an old map in the captain's
    chest. Jim found an even older map hidden on the
    shelf. It described an island full of redwood
    trees and sandy beaches.

52
Pronoun Interpretation Preferences
  • Grammatical Role: subject preference
  • Billy Bones went to the bar with Jim Hawkins.
  • He called for a glass of rum.
  • he = Billy
  • Jim Hawkins went to the bar with Billy Bones.
  • He called for a glass of rum.
  • he = Jim

53
Repeated Mention Preference
  • Billy Bones had been thinking about a glass of
    rum ever since the pirate ship docked. He hobbled
    over to the Old Parrot bar. Jim Hawkins went with
    him. He called for a glass of rum.
  • he = Billy

54
Parallelism Preference
  • Long John Silver went with Jim to the Old Parrot.
  • Billy Bones went with him to the Old Anchor Inn.
  • him = Jim

55
Verb Semantics Preferences
  • John telephoned Bill. He lost the laptop.
  • John criticized Bill. He lost the laptop.
  • Implicit causality
  • Implicit cause of criticizing is object.
  • Implicit cause of telephoning is subject.

56
Two algorithms for pronominal anaphora resolution
  • The Hobbs Algorithm
  • A Log-Linear Model

57
Hobbs algorithm
  • 1. Begin at the NP node immediately dominating the
    pronoun.
  • 2. Go up the tree to the first NP or S node. Call this
    X, and the path used to reach it p.
  • 3. Traverse all branches below X to the left of p,
    left-to-right, breadth-first. Propose as
    antecedent any NP that has an NP or S node between it
    and X.
  • 4. If X is the highest S node in the sentence, traverse
    the parse trees of the previous sentences in
    order of recency, each left-to-right,
    breadth-first. When an NP is encountered, propose it
    as antecedent. If X is not the highest S node, go to
    step 5.

58
  • 5. From node X, go up the tree to the first NP or S
    node. Call it X, and the path to it p.
  • 6. If X is an NP and the path to X did not pass
    through the nominal that X immediately dominates,
    propose X as the antecedent.
  • 7. Traverse all branches below X to the left of the
    path, in a left-to-right, breadth-first manner.
    Propose any NP encountered as the antecedent.
  • 8. If X is an S node, traverse all branches of X to
    the right of the path but do not go below any NP
    or S encountered. Propose any NP as the
    antecedent.
  • 9. Go to step 4.

59
Hobbs algorithm: walking through an example
  • John saw a Falcon at the dealership.
  • He showed it to Bob.
  • He bought it.
  • current sentence: right to left
  • previous sentences: left to right (a simplified sketch
    of this previous-sentence search follows)

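A heavily simplified Python sketch of only the previous-sentence part of the search (step 4), assuming NLTK parse trees written by hand; it omits the within-sentence steps and any gender/number filtering:

    from collections import deque
    from nltk import Tree

    def propose_from_previous_sentences(prev_trees):
        """Walk earlier sentences most-recent-first; within each tree,
        search breadth-first and left-to-right, proposing every NP."""
        candidates = []
        for tree in reversed(prev_trees):        # order of recency
            queue = deque([tree])
            while queue:                         # breadth-first
                node = queue.popleft()
                if isinstance(node, Tree):
                    if node.label() == "NP":
                        candidates.append(" ".join(node.leaves()))
                    queue.extend(node)           # children, left to right
        return candidates

    prev = [Tree.fromstring(
        "(S (NP (NNP John)) (VP (VBD saw) (NP (DT a) (NNP Falcon))"
        " (PP (IN at) (NP (DT the) (NN dealership)))))")]
    print(propose_from_previous_sentences(prev))
    # ['John', 'a Falcon', 'the dealership']
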
60
A log-linear model
  • Supervised machine learning
  • Train on a corpus in which each pronoun is
    labeled with the correct antecedent
  • In order to train, we need to extract:
  • Positive examples of referent-pronoun pairs
  • Negative examples of referent-pronoun pairs
  • Features for each pair
  • Then we train the model to predict 1 for the true
    antecedent and 0 for wrong antecedents

61
Features
62
Example: target He (U3) (used in the sketch below)
  • John saw a beautiful 1961 Ford Falcon at the used
    car dealership (U1)
  • He showed it to Bob (U2)
  • He bought it (U3)

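A minimal sketch of the training-and-scoring setup, assuming scikit-learn's LogisticRegression as the log-linear (maxent) classifier; the feature names and toy pairs below are illustrative stand-ins, not the actual feature table:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Hypothetical (pronoun, candidate antecedent) pairs; label 1 = true antecedent.
    pairs = [
        ({"number_agree": 1, "gender_agree": 1, "sent_dist": 0, "cand_subject": 1}, 1),
        ({"number_agree": 1, "gender_agree": 0, "sent_dist": 0, "cand_subject": 0}, 0),
        ({"number_agree": 1, "gender_agree": 1, "sent_dist": 1, "cand_subject": 1}, 1),
        ({"number_agree": 0, "gender_agree": 1, "sent_dist": 2, "cand_subject": 0}, 0),
    ]
    feats, labels = zip(*pairs)

    vec = DictVectorizer()
    model = LogisticRegression(max_iter=1000)
    model.fit(vec.fit_transform(feats), labels)

    # At resolution time: score every candidate for "He" in U3, keep the best.
    candidates = [
        {"number_agree": 1, "gender_agree": 1, "sent_dist": 2, "cand_subject": 1},  # John
        {"number_agree": 1, "gender_agree": 1, "sent_dist": 1, "cand_subject": 0},  # Bob
    ]
    probs = model.predict_proba(vec.transform(candidates))[:, 1]
    print(probs.argmax())    # index of the preferred antecedent
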
63
Coreference resolution
  • Victoria Chen, Chief Financial Officer of
    Megabucks Banking Corp since 2004, saw her pay
    jump 20%, to $1.3 million, as the 37-year-old
    also became the Denver-based financial-services
    company's president. It has been ten years since
    she came to Megabucks from rival Lotsabucks.
  • Victoria Chen, Chief Financial Officer of
    Megabucks Banking Corp, her, the 37-year-old, the
    Denver-based financial-services company's
    president, she
  • Megabucks Banking Corp., the Denver-based
    financial-services company, Megabucks
  • her pay
  • Lotsabucks

64
Coreference resolution
  • Victoria Chen, Chief Financial Officer of
    Megabucks Banking Corp since 2004, saw her pay
    jump 20%, to $1.3 million, as the 37-year-old
    also became the Denver-based financial-services
    company's president. It has been ten years since
    she came to Megabucks from rival Lotsabucks.
  • Have to deal with
  • Names
  • Non-referential pronouns
  • Definite NPs

65
Algorithm for coreference resolution
  • Based on a binary classifier: given an anaphor
    and a potential antecedent,
  • it returns true or false
  • Process a document from left to right
  • For each NPj we encounter
  • Search backwards through the document over preceding NPs
  • For each such potential antecedent NPi
  • Run our classifier
  • If it returns true, coindex NPi and NPj and
    stop the backward search
  • Terminate when we reach the beginning of the document
    (a sketch of this pass follows)

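A minimal sketch of this left-to-right linking pass, assuming a trained pairwise classifier is available; classifier_says_coref below is a stand-in for it, and the exact-string-match lambda in the usage line exists only to make the sketch runnable:

    def resolve_coreference(mentions, classifier_says_coref):
        """mentions: NPs in document order.  For each NP_j, search backwards
        over preceding NPs and coindex with the first NP_i the classifier
        accepts."""
        antecedent_of = {}                   # index of NP_j -> index of NP_i
        for j in range(len(mentions)):
            for i in range(j - 1, -1, -1):   # nearest preceding NP first
                if classifier_says_coref(mentions[i], mentions[j]):
                    antecedent_of[j] = i     # coindex NP_i and NP_j
                    break
        return antecedent_of

    links = resolve_coreference(
        ["Victoria Chen", "Megabucks", "her pay", "Megabucks"],
        lambda antecedent, anaphor: antecedent == anaphor)   # toy classifier
    print(links)    # {3: 1}
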
66
Features for coreference classifier
67
More features
68
Evaluation: Vilain et al. (1995)
  • Suppose A, B, and C are coreferent
  • Could represent this as A-B, B-C
  • Or as A-C, A-B
  • Or as A-C, B-C
  • Call any of these sets of correct links the
    reference set.
  • The output of the coreference algorithm is the set of
    hypothesis links.
  • Our goal: compute precision and recall of the
    hypothesis links against the reference links

69
Evaluation: Vilain et al. (1995)
  • Clever algorithm to deal with the fact that there
    are multiple possible reference links.
  • Suppose A, B, C, D are coreferent and (A-B, B-C, C-D)
    is the reference.
  • Algorithm returns A-B, C-D
  • Precision should be 1, recall should be 2/3
  • (since we need 3 links to make 4 things coreferent,
    and we got 2 of them; see the sketch below)

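A small sketch of the recall computation in the spirit of Vilain et al. (1995); precision is the same computation with key and response swapped. The union-find bookkeeping is my own phrasing of the idea, not the authors' code:

    def muc_recall(key_chains, response_links):
        """recall = sum(|S| - |p(S)|) / sum(|S| - 1), where p(S) partitions
        each key chain S according to the response links."""
        missing = needed = 0
        for chain in key_chains:
            parent = {m: m for m in chain}       # union-find over this chain

            def find(x):
                while parent[x] != x:
                    x = parent[x]
                return x

            for a, b in response_links:
                if a in parent and b in parent:
                    parent[find(a)] = find(b)
            partitions = len({find(m) for m in chain})
            missing += partitions - 1            # links still missing
            needed += len(chain) - 1             # links needed in total
        return 1 - missing / needed

    # A, B, C, D coreferent; the system returns only A-B and C-D:
    print(muc_recall([["A", "B", "C", "D"]], [("A", "B"), ("C", "D")]))   # 2/3
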
70
Coreference: further difficulties
  • Lots of other algorithms and other constraints
  • Hobbs: reference resolution as a by-product of
    general reasoning
  • The city council denied the demonstrators a
    permit because
  • they feared violence.
  • they advocated violence.
  • An axiom: for all X, Y, Z, W: fear(X,Z) ∧ advocate(Y,Z)
    ∧ enable_to_cause(W,Y,Z) → deny(X,Y,W)
  • First clause is deny(city_council, demonstrators,
    permit)
  • Second clause: Explanation

71
Coreference: further difficulties
  • The city council denied the demonstrators a
    permit because
  • they feared violence.
  • they advocated violence.
  • An axiom: for all X, Y, Z, W: fear(X,Z) ∧ advocate(Y,Z)
    ∧ enable_to_cause(W,Y,Z) → deny(X,Y,W)
  • from "they = city_council" we could correctly infer
    deny(X,Y,W) in the "feared violence" example
  • from "they = demonstrators" we could correctly
    infer deny(X,Y,W) in the "advocated violence"
    example

72
Summary
  • Discourse Structure
  • TextTiling
  • Coherence
  • Hobbs coherence relations
  • Rhetorical Structure Theory
  • Coreference
  • Kinds of reference phenomena
  • Constraints on co-reference
  • Anaphora Resolution
  • Hobbs
  • Loglinear
  • Coreference Resolution

73
TextTiling (Hearst 1997)
  • Tokenization
  • Each space-delimited word
  • Converted to lower case
  • Throw out stop-list words
  • Stem the rest
  • Group into pseudo-sentences of length w = 20
  • Lexical Score Determination: cohesion score
  • Average similarity (cosine measure) across each
    gap
  • Boundary Identification

74
TextTiling algorithm
75
Cosine
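The cosine measure referred to above, as a small Python sketch over bag-of-words block vectors (the dictionary representation is mine):

    def cosine(block1, block2):
        """sim(b1, b2) = sum_t w(t,b1) * w(t,b2)
                         / sqrt(sum_t w(t,b1)^2 * sum_t w(t,b2)^2),
        where each block is a {term: weight} dictionary."""
        num = sum(w * block2.get(t, 0) for t, w in block1.items())
        den = (sum(w * w for w in block1.values()) *
               sum(w * w for w in block2.values())) ** 0.5
        return num / den if den else 0.0

    print(cosine({"star": 2, "planet": 1}, {"star": 1, "moon": 1}))   # ~0.63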
76
Supervised discourse segmentation
  • Discourse markers or cue words
  • Broadcast news
  • "Good evening, I'm ..."
  • "... coming up."
  • Science articles
  • "First, ..."
  • "The next topic ..."

77
Supervised discourse segmentation
  • Supervised machine learning
  • Label segment boundaries in training and test set
  • Extract features in training
  • Learn a classifier
  • In testing, apply features to predict boundaries

78
Supervised discourse segmentation
  • Evaluation: WindowDiff (Pevzner and Hearst 2002)
  • assigns partial credit for near-miss boundaries (a
    sketch follows)
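A minimal sketch of the metric, assuming each segmentation is given as a list of 0/1 boundary indicators (one per gap); the window size k is typically set to half the average reference segment length:

    def window_diff(ref, hyp, k):
        """Fraction of length-k windows in which reference and hypothesis
        disagree on the number of boundaries (0 = perfect; lower is better)."""
        assert len(ref) == len(hyp)
        n = len(ref)
        errors = sum(1 for i in range(n - k)
                     if sum(ref[i:i + k]) != sum(hyp[i:i + k]))
        return errors / (n - k)

    ref = [0, 0, 1, 0, 0, 0, 1, 0]
    print(window_diff(ref, ref, 3))                        # 0.0: perfect
    print(window_diff(ref, [0, 1, 0, 0, 0, 0, 1, 0], 3))   # 0.2: a near miss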