Title: CS 114: Introduction to Computational Linguistics
1. CS 114: Introduction to Computational Linguistics
James Pustejovsky
Discourse Structure
April 1, 2008
2. Outline
- Discourse Structure
- TextTiling
- Coherence
- Hobbs coherence relations
- Rhetorical Structure Theory
- Coreference
- Kinds of reference phenomena
- Constraints on co-reference
- Anaphora Resolution
- Hobbs
- Loglinear
- Coreference
3. Part I: Discourse Structure
- Conventional structures for different genres
- Academic articles
- Abstract, Introduction, Methodology, Results, Conclusion
- Newspaper story
- Inverted pyramid structure (lead followed by expansion)
4. Discourse Segmentation
- Simpler task
- Discourse segmentation
- Separating a document into a linear sequence of subtopics
- Why?
- Information retrieval
- Automatically segmenting a TV news broadcast or a long news story into a sequence of stories
- Text summarization
- summarize different segments of a document
- Information extraction
- Extract info from inside a single discourse
segment
5. Unsupervised Discourse Segmentation
- Hearst (1997): a 21-paragraph science news article called "Stargazers"
- Goal: produce the following subtopic segments
6. Key intuition: cohesion
- Halliday and Hasan (1976): the use of certain linguistic devices to link or tie together textual units
- Lexical cohesion
- Indicated by relations between words in the two units (identical word, synonym, hypernym)
- Before winter I built a chimney, and shingled the sides of my house.
- I thus have a tight shingled and plastered house.
- Peel, core and slice the pears and the apples. Add the fruit to the skillet.
7. Key intuition: cohesion
- Non-lexical: anaphora
- The Woodhouses were first in consequence there. All looked up to them.
- Cohesion chain
- Peel, core and slice the pears and the apples. Add the fruit to the skillet. When they are soft ...
8. Intuition of cohesion-based segmentation
- Sentences or paragraphs in a subtopic are cohesive with each other
- But not with paragraphs in a neighboring subtopic
- Thus, if we measured the cohesion between every pair of neighboring sentences
- We might expect a dip in cohesion at subtopic boundaries.
9. What makes a text/dialogue coherent?
- Consider, for example, the difference between
passages (18.71) and (18.72). Almost certainly
not. The reason is that these utterances, when
juxtaposed, will not exhibit coherence. Do you
have a discourse? Assume that you have collected
an arbitrary set of well-formed and independently
interpretable utterances, for instance, by
randomly selecting one sentence from each of the
previous chapters of this book.
- vs.
10. What makes a text/dialogue coherent?
- Assume that you have collected an arbitrary
set of well-formed and independently
interpretable utterances, for instance, by
randomly selecting one sentence from each of the
previous chapters of this book. Do you have a
discourse? Almost certainly not. The reason is
that these utterances, when juxtaposed, will not
exhibit coherence. Consider, for example, the
difference between passages (18.71) and (18.72).
(Jurafsky & Martin, p. 695)
11. What makes a text coherent?
- Discourse/topic structure
- Appropriate sequencing of subparts of the discourse
- Rhetorical structure
- Appropriate use of coherence relations between subparts of the discourse
- Referring expressions
- Words or phrases, the semantic interpretation of
which is a discourse entity
12. Information Status
- Contrast
- John wanted a poodle but Becky preferred a corgi.
- Topic/comment
- The corgi they bought turned out to have fleas.
- Theme/rheme
- The corgi they bought turned out to have fleas.
- Focus/presupposition
- It was Becky who took him to the vet.
- Given/new
- Some wildcats bite, but this wildcat turned out
to be a sweetheart.
- Contrast Speaker (S) and Hearer (H)
13. Determining Given vs. New
- Entities when first introduced are new
- Brand-new (H must create a new entity)
- I saw a dinosaur today.
- Unused (H already knows of this entity)
- I saw your mother today.
- Evoked entities are old -- already in the discourse
- Textually evoked
- The dinosaur was scaly and gray.
- Situationally evoked
- The light was red when you went through it.
- Inferrables
- Containing
- I bought a carton of eggs. One of them was
broken.
- Non-containing
- A bus pulled up beside me. The driver was a
monkey.
14. Given/New and Definiteness/Indefiniteness
- Subject NPs tend to be syntactically definite and old
- Object NPs tend to be indefinite and new
- I saw a black cat yesterday. The cat looked hungry.
- Definite articles, demonstratives, possessives, personal pronouns, proper nouns, quantifiers like all, every
- Indefinite articles, quantifiers like some, any, one signal indefiniteness ... but:
- This guy came into the room
15. Discourse/Topic Structure
- Text Segmentation
- Linear
- TextTiling
- Look for changes in content words
- Hierarchical
- Grosz & Sidner's Centering theory
- Morris & Hirst's algorithm
- Lexical chaining through Roget's thesaurus
- Hierarchical Relations
- Mann et al.'s Rhetorical Structure Theory
- Marcu's algorithm
16. TextTiling (Hearst 1994)
- Goal: find multi-paragraph topics
- Example: 21-paragraph article called "Stargazers"
17. TextTiling (Hearst 1994)
- Goal: find multi-paragraph topics
- But it's difficult to define "topic" (Brown & Yule)
- Focus instead on topic shift or change
- Change in content, by contrast with setting, scene, characters
- Mechanism
- compare adjacent blocks of text
- look for shifts in vocabulary
18. Intuition behind TextTiling
19. TextTiling Algorithm
- Tokenization
- Lexical Score Determination
- Blocks
- Vocabulary Introductions
- Chains
- Boundary Identification
20. Tokenization
- Convert text stream into terms (words)
- Remove stop words
- Reduce to root (inflectional morphology)
- Subdivide into token-sequences
- (substitute for sentences)
- Find potential boundary points
- (paragraph breaks)
21. Determining Scores
- Compute a score at each token-sequence gap
- Score based on lexical occurrences
- Block algorithm
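A minimal sketch of the block comparison above, assuming the text has already been tokenized into pseudo-sentences (lists of lowercased, stopped, stemmed tokens): the score at a gap is the cosine similarity between term-frequency vectors built from the k pseudo-sentences on either side. The function name and the default block size are illustrative, not Hearst's exact settings.

```python
from collections import Counter
from math import sqrt

def block_score(pseudo_sentences, gap, k=10):
    """Cosine similarity between the k pseudo-sentences before and after a gap.

    pseudo_sentences: list of token lists (already lowercased, stopped, stemmed).
    gap: index of the boundary candidate between pseudo_sentences[gap - 1]
         and pseudo_sentences[gap].
    """
    left = Counter(t for ps in pseudo_sentences[max(0, gap - k):gap] for t in ps)
    right = Counter(t for ps in pseudo_sentences[gap:gap + k] for t in ps)
    dot = sum(freq * right[t] for t, freq in left.items())
    norm = sqrt(sum(v * v for v in left.values())) * sqrt(sum(v * v for v in right.values()))
    return dot / norm if norm else 0.0

# One lexical score per token-sequence gap:
# scores = [block_score(pseudo_sentences, g) for g in range(1, len(pseudo_sentences))]
```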
23. Boundary Identification
- Smooth the plot (average smoothing)
- Assign depth score at each token-sequence gap
- Deeper valleys score higher
- Order boundaries by depth score
- Choose a boundary cutoff (avg - sd/2)
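A sketch of the depth-scoring and cutoff steps under the same assumptions: each gap's depth measures how far its (smoothed) score dips below the nearest peaks to its left and right, and gaps whose depth exceeds the mean minus half a standard deviation (the avg - sd/2 cutoff above) are proposed as boundaries. Smoothing is omitted here.

```python
import statistics

def depth_scores(scores):
    """Depth at each gap: how far the score dips below the nearest peak on
    its left and on its right; deeper valleys get higher depth."""
    depths = []
    for i, s in enumerate(scores):
        left_peak = s
        for x in scores[i::-1]:        # climb leftward while scores keep rising
            if x >= left_peak:
                left_peak = x
            else:
                break
        right_peak = s
        for x in scores[i:]:           # climb rightward while scores keep rising
            if x >= right_peak:
                right_peak = x
            else:
                break
        depths.append((left_peak - s) + (right_peak - s))
    return depths

def pick_boundaries(scores):
    """Propose the gaps whose depth exceeds the avg - sd/2 cutoff."""
    depths = depth_scores(scores)
    cutoff = statistics.mean(depths) - statistics.stdev(depths) / 2
    return [i for i, d in enumerate(depths) if d > cutoff]
```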
24. Evaluation
- Data
- Twelve news articles from Dialog
- Seven human judges per article
- Major boundaries: chosen by 3 judges
- Avg number of paragraphs: 26.75
- Avg number of boundaries: 10 (39%)
- Results
- Between upper and lower bounds
- Upper bound: judges' averages
- Lower bound: reasonable simple algorithm
25. Assessing Agreement Among Judges
- Kappa coefficient
- Measures pairwise agreement
- Takes expected chance agreement into account
- P(A): proportion of times judges agree
- P(E): proportion of agreement expected by chance
- .43 to .68 (Isard & Carletta '95, boundaries)
- .65 to .90 (Rose '95, sentence segmentation)
- Here, k = .647
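The slide defines P(A) and P(E), but the formula itself did not survive the transcript; the standard kappa statistic is k = (P(A) - P(E)) / (1 - P(E)). A tiny illustration (the 0.80 / 0.43 inputs below are made-up numbers, not the figures from the study):

```python
def kappa(p_agree, p_chance):
    """Cohen's kappa: observed agreement above chance, normalized by the
    maximum possible agreement above chance."""
    return (p_agree - p_chance) / (1 - p_chance)

# Made-up illustration: if judges agree on 80% of gaps and 43% agreement
# is expected by chance, kappa(0.80, 0.43) is about 0.65.
```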
26. Part II: Text Coherence
- What makes a discourse coherent?
- The reason is that these utterances, when
juxtaposed, will not exhibit coherence. Almost
certainly not. Do you have a discourse? Assume
that you have collected an arbitrary set of
well-formed and independently interpretable
utterances, for instance, by randomly selecting
one sentence from each of the previous chapters
of this book.
27. Better?
- Assume that you have collected an arbitrary set
of well-formed and independently interpretable
utterances, for instance, by randomly selecting
one sentence from each of the previous chapters
of this book. Do you have a discourse? Almost
certainly not. The reason is that these
utterances, when juxtaposed, will not exhibit
coherence.
28. Coherence
- John hid Bill's car keys. He was drunk.
- ??John hid Bill's car keys. He likes spinach.
29. What makes a text coherent?
- Appropriate use of coherence relations between subparts of the discourse -- rhetorical structure
- Appropriate sequencing of subparts of the discourse -- discourse/topic structure
- Appropriate use of referring expressions
30. Hobbs (1979): Coherence Relations
- Result
- Infer that the state or event asserted by S0
causes or could cause the state or event asserted
by S1.
- The Tin Woodman was caught in the rain. His joints rusted.
31. Hobbs: Explanation
- Infer that the state or event asserted by S1
causes or could cause the state or event asserted
by S0.
- John hid Bill's car keys. He was drunk.
32. Hobbs: Parallel
- Infer p(a1, a2, ...) from the assertion of S0 and p(b1, b2, ...) from the assertion of S1, where ai and bi are similar, for all i.
- The Scarecrow wanted some brains. The Tin Woodman wanted a heart.
33. Hobbs: Elaboration
- Infer the same proposition P from the assertions
of S0 and S1.
- Dorothy was from Kansas. She lived in the midst of the great Kansas prairies.
34. Coherence relations impose a discourse structure
35. Rhetorical Structure Theory
- Another theory of discourse structure, based on identifying relations between segments of the text
- The nucleus/satellite notion encodes asymmetry
- The nucleus is the part that, if you deleted it, the text wouldn't make sense.
- Some rhetorical relations:
- Elaboration (set/member, class/instance, whole/part)
- Contrast: multinuclear
- Condition: Sat presents a precondition for N
- Purpose: Sat presents the goal of the activity in N
36. One example of a rhetorical relation
- A sample definition
- Relation: Evidence
- Constraints on N: H might not believe N as much as S thinks s/he should
- Constraints on Sat: H already believes or will believe Sat
- Effect: H's belief in N is increased
- An example
- Kevin must be here. (Nucleus)
- His car is parked outside. (Satellite)
37. Automatic Rhetorical Structure Labeling
- Supervised machine learning
- Get a group of annotators to assign a set of RST relations to a text
- Extract a set of surface features from the text that might signal the presence of the rhetorical relations in that text
- Train a supervised ML system based on the training set
38. Features: cue phrases
- Explicit markers: because, however, therefore, then, etc.
- Tendency of certain syntactic structures to signal certain relations
- Infinitives are often used to signal purpose relations: Use rm to delete files.
- Ordering
- Tense/aspect
- Intonation
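A hedged sketch of what cue-phrase and syntactic features might look like as input to the supervised labeler described on the previous slide. The cue-word list and feature names are illustrative placeholders, not the feature set actually used in the RST labeling work.

```python
# Hypothetical feature extractor for two adjacent text spans (lists of
# lowercased tokens); the cue-word inventory is illustrative only.
CUE_WORDS = ["because", "however", "therefore", "then", "but", "although", "so"]

def relation_features(span1, span2):
    """Surface features that might signal the rhetorical relation holding
    between span1 and span2."""
    feats = {}
    for cue in CUE_WORDS:
        feats["cue_%s_in_span2" % cue] = cue in span2            # explicit marker
    feats["span2_starts_with_infinitive_to"] = span2[:1] == ["to"]  # purpose-like
    feats["span1_length"] = len(span1)
    feats["span2_length"] = len(span2)
    return feats
```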
39. Some Problems with RST
- How many Rhetorical Relations are there?
- How can we use RST in dialogue as well as
monologue?
- RST does not model the overall structure of the discourse.
- Difficult to get annotators to agree on labeling the same texts
40. Part III: Coreference
- Victoria Chen, Chief Financial Officer of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based financial-services company's president. It has been ten years since she came to Megabucks from rival Lotsabucks.
- The Tin Woodman went to the Emerald City to see the Wizard of Oz and ask for a heart. After he asked for it, the Woodman waited for the Wizard's response.
41. Why reference resolution?
- Information Extraction
- President of which company is retiring?
- First Union Corp. is continuing to wrestle with
severe problems unleashed by a botched merger and
a troubled business strategy. According to
industry insiders at Paine Webber, their
president, John R. Georgius, is planning to
retire by the end of the year.
42. Some terminology
- Reference
- The process by which speakers use words ("Victoria Chen", "she") to denote a particular person
- Referring expressions: "Victoria Chen", "she"
- Referent: the actual entity (but as a shorthand we might call "Victoria Chen" the referent)
- "Victoria Chen" and "she" corefer
- Antecedent: "Victoria Chen"
- Anaphor: "she"
43. Tasks
- Pronominal anaphora resolution
- Given a pronoun, find its antecedent
- Coreference resolution
- Find the coreference relations among all
referring expressions
- Each set of referring expressions is a coreference chain. What are the chains in our story?
- Victoria Chen, Chief Financial Officer of Megabucks Banking Corp, her, the 37-year-old, the Denver-based financial-services company's president, she
- Megabucks Banking Corp., the Denver-based financial-services company, Megabucks
- her pay
- Lotsabucks
44. Many types of reference
- (after Webber 91)
- According to Doug, Sue just bought a 1962 Ford
Falcon
- But that turned out to be a lie (a speech act)
- But that was false (proposition)
- That struck me as a funny way to describe the
situation (manner of description)
- That caused Sue to become rather poor (event)
45. 4 types of referring expressions
- Indefinite noun phrases: new to hearer
- Mrs. Martin was so very kind as to send Mrs. Goddard a beautiful goose
- He had gone round one day to bring her some walnuts.
- I am going to the butcher's to buy a goose (specific/non-specific)
- I hope they still have it
- I hope they still have one
- Definite noun phrases: identifiable to hearer because
- Mentioned: It concerns a white stallion which I have sold to an officer. But the pedigree of the white stallion was not fully established.
- Identifiable from beliefs or unique: I read about it in The New York Times
- Inherently unique: The fastest car in ...
46. Reference Phenomena: 3. Pronouns
- Emma smiled and chatted as cheerfully as she could.
- Compared to definite noun phrases, pronouns require more referent salience.
- John went to Bob's party, and parked next to a classic Ford Falcon.
- He went inside and talked to Bob for more than an hour.
- Bob told him that he recently got engaged.
- ??He also said that he bought it yesterday.
- OK: He also said that he bought the Falcon yesterday.
47. More on Pronouns
- Cataphora: pronoun appears before referent
- Even before she saw it, Dorothy had been thinking
about the Emerald City every day.
48. 4. Names
- Miss Woodhouse certainly had not done him
justice.
- International Business Machines sought patent
compensation from Amazon. In fact, IBM had
previously sued a number of other companies.
49. Complications: Inferrables and Generics
- Inferrables (bridging inferences)
- I almost bought a 1962 Ford Falcon today, but a door had a dent and the engine seemed noisy.
- Generics
- I'm interested in buying a Mac laptop. They are very stylish.
- In March in Boulder you have to wear a jacket.
50. Features for pronominal anaphora resolution
- Number agreement
- John has a Ford Falcon. It is red.
- John has three Ford Falcons. It is red.
- But note:
- IBM is announcing a new machine translation product. They have been working on it for 20 years.
- Gender agreement
- John has an Acura. He/it/she is attractive.
- Syntactic constraints (Binding Theory)
- John bought himself a new Ford (himself = John)
- John bought him a new Ford (him ≠ John)
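A small sketch of how number and gender agreement could act as a hard filter on candidate antecedents. The tiny pronoun table is an illustrative toy, and, as the IBM/"they" example above shows, a real system sometimes has to relax these checks.

```python
# Illustrative toy lexicon; a real system would use a morphological analyzer
# and a gazetteer of names instead.
PRONOUN_FEATURES = {
    "he": ("sg", "masc"), "she": ("sg", "fem"),
    "it": ("sg", "neut"), "they": ("pl", None),
}

def agrees(pronoun, cand_number, cand_gender):
    """True if a candidate antecedent matches the pronoun in number and,
    when the pronoun specifies one, in gender."""
    number, gender = PRONOUN_FEATURES[pronoun.lower()]
    if number != cand_number:
        return False   # e.g. 'three Ford Falcons' (plural) is filtered out for 'it'
    return gender is None or cand_gender is None or gender == cand_gender

# agrees("it", "pl", "neut")  -> False   (number clash)
# agrees("he", "sg", "masc")  -> True
```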
51. Pronoun Interpretation Features
- Selectional Restrictions
- John parked his Ford in the garage. He had
driven it around for hours.
- Recency
- The doctor found an old map in the captain's
chest. Jim found an even older map hidden on the
shelf. It described an island full of redwood
trees and sandy beaches.
52. Pronoun Interpretation Preferences
- Grammatical Role: Subject preference
- Billy Bones went to the bar with Jim Hawkins.
- He called for a glass of rum.
- he = Billy
- Jim Hawkins went to the bar with Billy Bones.
- He called for a glass of rum.
- he = Jim
53. Repeated Mention Preference
- Billy Bones had been thinking about a glass of
rum ever since the pirate ship docked. He hobbled
over to the Old Parrot bar. Jim Hawkins went with
him. He called for a glass of rum.
- he = Billy
54. Parallelism Preference
- Long John Silver went with Jim to the Old Parrot.
- Billy Bones went with him to the Old Anchor Inn.
- him = Jim
55. Verb Semantics Preferences
- John telephoned Bill. He lost the laptop.
- John criticized Bill. He lost the laptop.
- Implicit causality
- Implicit cause of criticizing is object.
- Implicit cause of telephoning is subject.
56. Two algorithms for pronominal anaphora resolution
- The Hobbs Algorithm
- A Log-Linear Model
57. Hobbs algorithm
1. Begin at the NP node for the pronoun.
2. Go up the tree to the first NP or S node. Call this X, and the path to it p.
3. Traverse all branches below X to the left of p, left-to-right, breadth-first. Propose as antecedent any NP that has an NP or S between it and X.
4. If X is the highest S in the sentence, traverse the parse trees of the previous sentences in order of recency. Traverse each left-to-right, breadth-first. When an NP is encountered, propose it as antecedent. If X is not the highest node, go to step 5.
58. Hobbs algorithm (continued)
5. From node X, go up the tree to the first NP or S node. Call it X, and the path to it p.
6. If X is an NP and the path p did not pass through the nominal that X dominates, propose X as the antecedent.
7. Traverse all branches below X to the right of path p, in a left-to-right, breadth-first manner. Propose any NP encountered as the antecedent.
8. If X is an S node, traverse all branches of X to the right of path p, but do not go below any NP or S encountered. Propose any NP as the antecedent.
9. Go to step 4.
59. Hobbs algorithm: walking through an example
- John saw a Falcon at the dealership.
- He showed it to Bob.
- He bought it.
- Current sentence: right to left
- Previous sentences: left to right
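A much-simplified sketch of one piece of the search, the inter-sentential step 4: walk the parse trees of the previous sentences, most recent first, proposing NPs breadth-first, left to right. Trees are plain nested lists here, and the toy parse of the first sentence is hand-built for illustration; this is not a full implementation of steps 1 through 9.

```python
from collections import deque

# A parse tree as a nested list: [label, child1, child2, ...]; leaves are strings.
# Hand-built toy parse of the first sentence, for illustration only.
S1 = ["S",
      ["NP", ["NNP", "John"]],
      ["VP", ["VBD", "saw"],
             ["NP", ["DT", "a"], ["NNP", "Falcon"]],
             ["PP", ["IN", "at"], ["NP", ["DT", "the"], ["NN", "dealership"]]]]]

def nps_breadth_first(tree):
    """All NP nodes of a tree, breadth-first, left to right."""
    found, queue = [], deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, list):
            if node[0] == "NP":
                found.append(node)
            queue.extend(node[1:])
    return found

def candidates_from_previous_sentences(previous_trees):
    """Step 4, simplified: previous sentences in order of recency, NPs within
    each sentence proposed breadth-first, left to right."""
    for tree in reversed(previous_trees):
        for np in nps_breadth_first(tree):
            yield np

# For the pronoun in "He bought it.", candidates would be proposed first from
# the parse of "He showed it to Bob.", then from S1 above.
```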
60. A log-linear model
- Supervised machine learning
- Train on a corpus in which each pronoun is labeled with the correct antecedent
- In order to train, we need to extract:
- Positive examples of referent-pronoun pairs
- Negative examples of referent-pronoun pairs
- Features for each one
- Then we train the model to predict 1 for the true antecedent and 0 for wrong antecedents
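A minimal sketch of that training setup, with scikit-learn's logistic regression standing in for the log-linear model. The feature names and the three hand-made training pairs are placeholders, not the feature set from the lecture.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each training example is a (pronoun, candidate antecedent) pair:
# label 1 for the true antecedent, 0 for every other candidate.
pairs = [
    ({"number_agree": 1, "gender_agree": 1, "same_sentence": 0, "distance": 1}, 1),
    ({"number_agree": 1, "gender_agree": 0, "same_sentence": 0, "distance": 2}, 0),
    ({"number_agree": 0, "gender_agree": 1, "same_sentence": 1, "distance": 0}, 0),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in pairs])
y = [label for _, label in pairs]
model = LogisticRegression().fit(X, y)

# At test time, score every candidate antecedent for the pronoun and pick
# the candidate with the highest predicted probability of being correct:
# probs = model.predict_proba(vec.transform(candidate_feature_dicts))[:, 1]
```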
61. Features
62. Example: target He (U3)
- John saw a beautiful 1961 Ford Falcon at the used
car dealership (U1)
- He showed it to Bob (U2)
- He bought it (U3)
63. Coreference resolution
- Victoria Chen, Chief Financial Officer of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based financial-services company's president. It has been ten years since she came to Megabucks from rival Lotsabucks.
- Victoria Chen, Chief Financial Officer of Megabucks Banking Corp, her, the 37-year-old, the Denver-based financial-services company's president, she
- Megabucks Banking Corp., the Denver-based financial-services company, Megabucks
- her pay
- Lotsabucks
64. Coreference resolution
- Victoria Chen, Chief Financial Officer of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based financial-services company's president. It has been ten years since she came to Megabucks from rival Lotsabucks.
- Have to deal with:
- Names
- Non-referential pronouns
- Definite NPs
65. Algorithm for coreference resolution
- Based on a binary classifier: given an anaphor and a potential antecedent
- Returns true or false
- Process a document from left to right
- For each NPj we encounter
- Search backwards through the document for NPs
- For each such potential antecedent NPi
- Run our classifier
- If it returns true, coindex NPi and NPj and return
- Terminate when we reach the beginning of the document
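A sketch of that closest-first loop, assuming `mentions` holds the document's NPs in order and `classifier_says_coreferent` is the trained pairwise classifier (a placeholder name here):

```python
def link_mentions(mentions, classifier_says_coreferent):
    """Closest-first coreference linking.

    mentions: the document's NPs in left-to-right order.
    classifier_says_coreferent(antecedent, anaphor) -> bool is the trained
    pairwise classifier (a placeholder here).
    Returns one cluster id per mention.
    """
    cluster = list(range(len(mentions)))           # every mention starts alone
    for j in range(1, len(mentions)):
        for i in range(j - 1, -1, -1):             # search backwards, closest first
            if classifier_says_coreferent(mentions[i], mentions[j]):
                cluster[j] = cluster[i]            # coindex NP_i and NP_j
                break                              # stop at the first positive
    return cluster
```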
66. Features for coreference classifier
67. More features
68. Evaluation (Vilain et al., 1995)
- Suppose A, B, and C are coreferent
- Could represent this as A-B, B-C
- Or as A-C, A-B
- Or as A-C, B-C
- Call any of these sets of correct links the reference set.
- The output of the coref algorithm is the hypothesis links.
- Our goal: compute precision and recall from the hypothesis to the reference set of links
69. Evaluation (Vilain et al., 1995)
- A clever algorithm to deal with the fact that there are multiple possible reference links.
- Suppose A, B, C, D are coreferent and (A-B, B-C, C-D) is the reference.
- The algorithm returns A-B, C-D
- Precision should be 1, recall should be 2/3
- (since we need 3 links to make 4 things coreferent, and we got 2 of them)
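A sketch of the Vilain et al. link-based scorer, with key and response chains given as sets of mentions; on the slide's example (key chain {A, B, C, D}, response chains {A, B} and {C, D}) it returns precision 1 and recall 2/3. Function names are mine.

```python
def _partitions(chain, other_chains):
    """How many pieces `chain` is split into by `other_chains` (mentions that
    appear in no other chain count as singleton pieces)."""
    pieces = set()
    for mention in chain:
        home = next((i for i, c in enumerate(other_chains) if mention in c), None)
        pieces.add(home if home is not None else ("singleton", mention))
    return len(pieces)

def muc(key_chains, response_chains):
    """Vilain et al. (1995) link-based precision and recall."""
    recall = (sum(len(k) - _partitions(k, response_chains) for k in key_chains)
              / sum(len(k) - 1 for k in key_chains))
    precision = (sum(len(r) - _partitions(r, key_chains) for r in response_chains)
                 / sum(len(r) - 1 for r in response_chains))
    return precision, recall

# muc([{"A", "B", "C", "D"}], [{"A", "B"}, {"C", "D"}])  ->  (1.0, 0.666...)
```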
70. Coreference: further difficulties
- Lots of other algorithms and other constraints
- Hobbs: reference resolution as a by-product of general reasoning
- The city council denied the demonstrators a permit because
- they feared violence.
- they advocated violence.
- An axiom: for all X, Y, Z, W: fear(X,Z) ∧ advocate(Y,Z) ∧ enable_to_cause(W,Y,Z) → deny(X,Y,W)
- First clause is deny(city_council, demonstrators, permit)
- Second clause: Explanation
71. Coreference: further difficulties
- The city council denied the demonstrators a permit because
- they feared violence.
- they advocated violence.
- An axiom: for all X, Y, Z, W: fear(X,Z) ∧ advocate(Y,Z) ∧ enable_to_cause(W,Y,Z) → deny(X,Y,W)
- From "they = city_council" we could correctly infer deny(X,Y,W) in the "feared violence" example
- From "they = demonstrators" we could correctly infer deny(X,Y,W) in the "advocated violence" example
72. Summary
- Discourse Structure
- TextTiling
- Coherence
- Hobbs coherence relations
- Rhetorical Structure Theory
- Coreference
- Kinds of reference phenomena
- Constraints on co-reference
- Anaphora Resolution
- Hobbs
- Loglinear
- Coreference Resolution
73. TextTiling (Hearst 1997)
- Tokenization
- Each space-delimited word
- Converted to lower case
- Throw out stop-list words
- Stem the rest
- Group into pseudo-sentences of length w = 20
- Lexical Score Determination: cohesion score
- Average similarity (cosine measure) at each gap
- Boundary Identification
74. TextTiling algorithm
75. Cosine
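The formula presumably shown on this slide is the standard cosine: for two blocks with term frequencies w(t, b1) and w(t, b2), the similarity is the dot product of the two frequency vectors divided by the product of their lengths. A small version over term-frequency dictionaries, matching the block-score sketch earlier:

```python
from math import sqrt

def cosine(b1, b2):
    """Cosine similarity of two blocks, each a dict mapping term -> frequency."""
    dot = sum(freq * b2.get(term, 0) for term, freq in b1.items())
    norm = sqrt(sum(f * f for f in b1.values())) * sqrt(sum(f * f for f in b2.values()))
    return dot / norm if norm else 0.0

# cosine({"star": 2, "planet": 1}, {"star": 1, "galaxy": 1})  ->  about 0.63
```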
76. Supervised Discourse segmentation
- Discourse markers or cue words
- Broadcast news
- "Good evening, I'm ..."
- "... coming up."
- Science articles
- "First, ..."
- "The next topic ..."
77. Supervised discourse segmentation
- Supervised machine learning
- Label segment boundaries in training and test set
- Extract features in training
- Learn a classifier
- In testing, apply features to predict boundaries
78. Supervised discourse segmentation
- Evaluation: WindowDiff (Pevzner and Hearst 2000)
- Assigns partial credit
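A sketch of WindowDiff under one common encoding: each segmentation is a list with 1 at positions followed by a boundary and 0 elsewhere, and the window size k is usually set to about half the average reference segment length. The function name and the toy example are mine.

```python
def window_diff(reference, hypothesis, k):
    """WindowDiff: slide a window of size k over the text and count positions
    where reference and hypothesis disagree on how many boundaries fall inside
    the window; 0 is perfect, higher is worse."""
    n = len(reference)
    disagreements = sum(
        sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
        for i in range(n - k)
    )
    return disagreements / (n - k)

# reference  = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # boundaries after positions 2 and 6
# hypothesis = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]   # near-miss on the first boundary
# window_diff(reference, hypothesis, 3) penalizes the near-miss less than a
# complete miss would be, which is the partial credit mentioned above.
```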