Title: BREDT Processing Reference in Discourse
1 BREDT: Processing Reference in Discourse
- Christer Johansson, UiB
- Lars Johnsen, UiB
- Kaja Borthen, NTNU
2 Goals
- Develop statistical methods and resources for the discovery of referential chains in (arbitrary) text.
- Research training: post doc and graduate level coworkers.
3 We propose ...
- Discourse analysis is a fundamental (and separate) module of language processing (just like syntax, phonology, and morphology).
- Discourse analysis can be performed without full parsing (and it might help the parser make decisions).
4 Simple examples
- Pronouns
- The monkey(1) ate the banana(2) because ...
- it was hungry. it → monkey
- it was ripe. it → banana
- it was tea time. it → specification of the situation
5 Simple examples
- Definites
- Ola ødela armen. / Ola broke the arm.
- Ola broke his arm.
- The definite form indicates that the noun is known. In this case, it can be resolved by the common knowledge that a person has an arm (a has-a relation).
6 Simple examples
- Definites
- The definite signals that something has been mentioned before. It initiates a search for a referent.
- General reference
- The lion is a big cat.
- If there is no previous referent, then lion refers to the species.
- Cats are hungry.
- A link could be established to represent the knowledge that lions are a sub-group of cats, and cats are hungry, therefore lions are hungry (see the sketch below).
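A minimal sketch of how such a link could be represented and used: a tiny is-a table lets a property asserted for a superordinate (cats are hungry) be inherited by its sub-groups (lions). The table and property set are invented for illustration, not project data.

# Hypothetical is-a links and per-class properties (illustrative only).
IS_A = {"lion": "cat", "dachshund": "dog"}
PROPERTIES = {"cat": {"hungry"}}

def inherited_properties(noun: str) -> set:
    """Collect properties of a noun by walking its is-a chain upwards."""
    props = set(PROPERTIES.get(noun, set()))
    parent = IS_A.get(noun)
    while parent is not None:
        props |= PROPERTIES.get(parent, set())
        parent = IS_A.get(parent)
    return props

print(inherited_properties("lion"))   # {'hungry'}, inherited from "cat"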
7 Across Sentence Boundaries
- Unni was ill. A doctor came to see her. She said
that she must be hospitalized, and she wrote her
a prescription.
8 Decisions for representation
- Unni was ill. A doctor came to see her. She said that she must be hospitalized, and wrote her a prescription.
- The nearest referent is linked. The links can be followed back to the first mention, the anchor (a minimal sketch of such links follows below).
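A minimal sketch (not the project's implementation) of this representation: each mention links to its nearest antecedent, and a chain of links can be followed back to the anchor. The mention list and link table encode one possible reading of the example by hand.

# Hand-made links for one reading: mention index -> index of its nearest antecedent.
mentions = ["Unni", "her", "a doctor", "She", "she", "her"]
antecedent = {1: 0, 3: 2, 4: 1, 5: 4}

def anchor(i: int) -> int:
    """Follow the links back until a first mention (no antecedent) is reached."""
    while i in antecedent:
        i = antecedent[i]
    return i

for i, form in enumerate(mentions):
    print(form, "->", mentions[anchor(i)])
# "she" and the last "her" both resolve to the anchor "Unni" via the chain.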
9 Reference is important for ...
10 Machine Translation
- The correct translation of a pronoun depends on what it refers to.
- The translation of a definite noun may depend on its information status.
11 Prosody (e.g. in text-to-speech)
- Given information is seldom stressed.
12 New vs. Given (Horne & Johansson, 1991)
- John wants a dachshund, but I'm not sure he can take care of a dog.
- Dog is given information because a dachshund is a kind of dog.
- John wants a dog, but I'm not sure he can take care of a dachshund.
- Dachshund is a specification of dog, and therefore new information. (The supposition might be that a dachshund is more demanding than the typical dog. There is usually a reason why something is said.)
13 Applications
- Text-to-speech
- Given information should not be stressed.
- Information can count as given via semantic relations (a sketch of this decision follows after this list):
- superordinate/subordinate (x is-a y)
- part/whole (has-a)
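A hedged sketch of this stress decision, under assumptions: a word counts as given if it was mentioned before, or if something already mentioned stands in a subordinate (is-a) relation to it; a specification of something mentioned counts as new. The tiny is-a table is illustrative only.

IS_A = {"dachshund": "dog", "lion": "cat"}   # subordinate -> superordinate

def is_given(word: str, mentioned: set) -> bool:
    if word in mentioned:
        return True                                      # repeated -> given
    return any(IS_A.get(m) == word for m in mentioned)   # a subordinate was mentioned -> given

def should_stress(word: str, mentioned: set) -> bool:
    return not is_given(word, mentioned)                 # given information is not stressed

print(should_stress("dog", {"dachshund"}))   # False: "dog" is given after "dachshund"
print(should_stress("dachshund", {"dog"}))   # True: a specification is new information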
14 Information Retrieval
15 Why?
- Reference is important in information retrieval because ...
- Referring expressions may hide key words,
- which makes it hard to automatically find the relevant keywords.
16 A short example
- The lion is the king(1) of the jungle. She(2) hunts mostly at night. The females(3) live in groups. The male(4) is much larger, but _(5) lives alone.
- By word form only: lion is 1 of 26 words (as are king, jungle, night, females, groups, male).
- By reference: lion is 6 of 26 words.
- The significance of lion goes up (see the counting sketch below).
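A small sketch of the counting argument, with the coreference indices resolved by hand (this is not the output of any tool):

text = ("The lion is the king of the jungle She hunts mostly at night "
        "The females live in groups The male is much larger but lives alone").split()

corefer_with_lion = {1, 4, 8, 14, 19}   # lion, king, She, females, male
zero_anaphors = 1                       # the omitted subject of "lives alone"

surface = sum(1 for w in text if w.lower() == "lion")
resolved = len(corefer_with_lion) + zero_anaphors

print(surface, "of", len(text))    # 1 of 26 by word form only
print(resolved, "of", len(text))   # 6 of 26 once references are resolved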
17 Conclusion: IR
- The detection of central themes in a text is facilitated by reference detection.
- Assumption: themes are referred to often,
- via pronouns
- via semantic relations
18 There are plenty of applications for BREDT
19 Distribution of tasks
- Who is going to do the work?
20 Identification of needs
- 1) A need to inform
- in speech, important information is stressed.
- on the internet, a markup language could be used.
- 2) A need for information
- automatic tools for reference detection
- tools for detection of the markup
21 Who?
- 1) Producer of information
- Has a need for discourse tools.
- 2) Information consumer
- May need similar tools.
- 3) Both
- Have a need for standards.
- Global Document Annotation (via Cyber Assist).
22 BREDT
- Discover and determine chains of reference.
- Fairly simple statistical methods
- Partial goals
- Finding selectional restrictions
- Automatically generate useful semantic structure
from co-occurrence
23 Why it will work
- We have 18 million Norwegian words tagged for
- word class (95% accurate)
- functional roles (maybe 80% correct)
- lexical stem (not always correct)
- Soon: 100,000 running words tagged for discourse reference.
- Tools: TiMBL.
24 BREDT
- We have the tools
- We have the ingredients
- We can ask the baker:
- Tekstlaboratoriet in Oslo
- NTNU in Trondheim
- Induction of Linguistic Knowledge in Tilburg
- CyberAssist in Tokyo
- the Discourse and Prosody Group at Lund University.
25 BUT ...
- All the information we have available is also a source of errors.
- Word class is only 95% to 98% correct.
- Functional roles: maybe 80% correct.
- Word forms: spelling errors ...
26 Statistical Method
- One method is given in:
- Soon, Ng & Lim (2001). A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, 27(4).
- The core of the idea is to give each candidate a context vector.
27 We will attempt
- Matching based on two context vectors.
- Whether two vectors match or not depends on how the match function has been trained.
28 Start Algorithm
- For every possible referent (i.e., noun / pronoun):
- Construct a context vector.
- The context vector may represent information about the previous and following words.
- The information could be:
- word forms, word class tags, functional role tags, significant letters of the word (endings). A sketch of such a vector follows below.
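A minimal sketch, under assumed input, of such a context vector: for each candidate, collect word forms, tags, roles, and endings from a small window of surrounding tokens. The feature names, window size, and the toy token list are illustrative only.

def context_vector(tokens, i, window=2):
    """tokens: list of (word, word class, functional role) triples; i: candidate index."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            word, pos, role = tokens[j]
        else:
            word, pos, role = "_", "_", "_"              # padding outside the text
        feats[f"word_{offset}"] = word.lower()
        feats[f"pos_{offset}"] = pos
        feats[f"role_{offset}"] = role
        feats[f"suffix_{offset}"] = word[-3:].lower()    # significant endings
    return feats

tokens = [("The", "DT", "-"), ("monkey", "NN", "Subject"), ("ate", "VBD", "-"),
          ("the", "DT", "-"), ("banana", "NN", "Object"), ("because", "IN", "-"),
          ("it", "PRP", "Subject"), ("was", "VBD", "-"), ("hungry", "JJ", "-")]
print(context_vector(tokens, 6))   # features around the pronoun "it"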
29 Training the match function
- This is done by machine learning.
- Decision trees have proved useful.
- TiMBL: Memory Based Learning
- has been used on similar tasks with good results.
- Training is done with examples manually tagged for reference. A toy sketch of the memory-based idea follows below.
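A toy stand-in for the memory-based idea (TiMBL does this far more carefully): store manually tagged (anaphor, candidate) examples and classify a new pair by its most similar stored example. The features and training pairs are invented for illustration.

def overlap(a: dict, b: dict) -> int:
    """Number of features on which two instances agree."""
    return sum(1 for k in a if a.get(k) == b.get(k))

training = [   # manually tagged examples: (features, coreferent?)
    ({"same_number": True,  "same_gender": True,  "distance": 1}, True),
    ({"same_number": True,  "same_gender": False, "distance": 2}, False),
    ({"same_number": False, "same_gender": True,  "distance": 3}, False),
]

def match(instance: dict) -> bool:
    """Classify by the single most similar stored example (1-nearest neighbour)."""
    _, label = max(training, key=lambda ex: overlap(instance, ex[0]))
    return label

print(match({"same_number": True, "same_gender": True, "distance": 2}))   # True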
30 Training is incremental
- Easier to expand the training set.
- More and more of the task will become (slightly simpler) proof-reading.
- Goal 2005: some millions of words tagged for reference.
31 State of the art
- We have found very little research in Scandinavia on this topic.
- The Message Understanding Conferences (MUC 1-7) contained approaches to co-reference.
32 Research
- Reference is important, but how is it signaled?
- Many cues: a problem of integration
- Few cues: a problem of ambiguity
33 Common Projects
- Similar projects might be developed for Swedish.
The Prosody and Discourse Group at Lund
University has done some research in the area.
34 Publications (after 6 months)
- Christer Johansson
- On automatic word classification:
- A Memory Based Method for Inventing Features, Proc. of the Scandinavian Conference on Artificial Intelligence, Bergen, Nov. 2-4.
- Searching for Features using a Genetic Algorithm, Proc. of the Scandinavian Conference on Artificial Intelligence, Bergen, Nov. 2-4.
35 Publications (after 6 months)
- Kaja Borthen
- Semantics:
- A grammar component for semantic classes of nominals, in Bender et al. (Eds.), A Workshop on Ideas and Strategies for Multi-lingual Grammar Development, Vienna, Austria.
- The correspondence between attention states and the form of kind-referring NPs: general explanations for seemingly ad hoc facts. In Festschrift for Jeanette Gundel (under revision).
36 Publications (after 6 months)
- Lars Johnsen & Christer Johansson
- Analogy as a mechanism for generalization.
- Under development. (Royal Skousen's general model of analogy can be improved to work in linear time. It may also use hierarchically ordered features.)
37 Thanks
- http://ling.uib.no/BREDT/
- christer.johansson_at_lili.uib.no
- lars.johnsen_at_lili.uib.no
38 State of the Art: the Tilburg Memory Based Learner, http://pi0657.uvt.nl/
- Input: "Now is a tough time to be a computer maker."
- After 1) tagging, 2) chunking, 3) functional role detection (a sketch for reading such chunks follows below):
- [NP1Subject Now/RB ] [VP1 is/VBZ ] [NP1NP-PRD a/DT tough/JJ time/NN ] [VP2 to/TO be/VB ] [NP2NP-PRD a/DT computer/NN maker/NN ]
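A small sketch of how such chunked output could be turned into candidate NPs for reference resolution, assuming the bracketed format shown above:

import re

line = ("[NP1Subject Now/RB ] [VP1 is/VBZ ] [NP1NP-PRD a/DT tough/JJ time/NN ] "
        "[VP2 to/TO be/VB ] [NP2NP-PRD a/DT computer/NN maker/NN ]")

for label, body in re.findall(r"\[(\S+)\s+(.*?)\s*\]", line):
    if label.startswith("NP"):
        words = [tok.rsplit("/", 1)[0] for tok in body.split()]
        print(label, "->", " ".join(words))
# NP1Subject -> Now
# NP1NP-PRD -> a tough time
# NP2NP-PRD -> a computer maker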
39 An example of realistic input
- [NP1Subject Sun/NNP Microsystems/NNPS ] ,/, [P along/IN ] [PNP [P with/IN ] [NP its/PRP rivals/NNS ] ] ,/,
  [VP1 has/VBZ had/VBD to/TO go/VB ] to/TO / [NP1Object warp/NN speed/NN ] and/CC [VP2 then/RB back/VB ] "/UNKNOWN ,/,
  [NP3Subject Scott/NNP McNealy/NNP ] ,/, [NP4Subject its/PRP chief/JJ executive/NN ] ,/, [VP3 said/VBD ] [NP3NP-TMP last/JJ week/NN ] ,/,
  [C as/IN ] [NP4Subject Sun/NNP ] [VP4 announced/VBD ] [C that/IN ] [NP5Subject it/PRP ] [VP5 would/MD make/VB ]
  [NP5Object a/DT larger-than-expected/JJ loss/NN ] [PNP [P in/IN ] [NP the/DT current/JJ quarter/NN ] ] and/CC
  [VP6 would/MD lay/VB ] [PRT off/RP ] [NP6Object 3,900/CD workers/NNS ] ./.