Title: When Corpus Meets Theory
1When Corpus Meets Theory
Models and Data
- James Pustejovsky
- TSD 2002
- September 10, 2002
2Talk Outline
- Goals for Language Modeling
- The Role of Corpus in Theory
- Disambiguation
- Selection discovery
- Clustering
- Category modification and formation
- Grammar induction
- The Role of Theory in Corpus
3Goals of Language Modeling
- Statistically informed models improve application performance
- Speech
- Search
- Clustering
- Parsing
- Machine translation
- Summarization
- Question answering
4Theory Drives the Model
- The corpus behavior of words is determined by their type.
- You can't find what you can't model.
- But you don't want to find only what you model!
- Theory allows a model of reality, but
- Corpus brings reality to the model.
5Language Modeling with Generative Lexicon
- Selection integrates paradigmatics and syntagmatics
- Models the relationship between selectional contexts
- Coercion in typing
- Complex type (Dot Objects)
- All major categories behave functionally
- Qualia structure models much of this behavior
- Semantic Types are differentiated and ranked
- Grammatical behavior follows (generally) from type
6Quine's Gambit in Corpora
- Co-occurrence reveals surface relations.
- Paradigmatics is first order.
- Syntagmatics is first order.
- LSA and other techniques create non-superficial associations.
- Model Bias is necessary to create decision procedures
- Example: Complex Types
7Recognizing Selection
- 1. a. The man fell/died.
- b. The rock fell/!died.
- 2. a. John forced/!convinced the door to open.
- b. John forced/convinced the guests to leave.
- 3. a. John poured milk into/!on his coffee.
- b. John poured milk into/on the bowl.
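The contrasts above can be read as type checks: a verb selects an argument type, and a noun is acceptable when its type falls under the selected one. The sketch below is a minimal illustration only. The type hierarchy, the noun typings, and the verb requirements are all hypothetical and chosen to reproduce the judgments marked with "!" on this slide.

```python
# Toy type hierarchy (child -> parent). Hypothetical, for illustration only.
PARENT = {
    "human": "animate",
    "animate": "physical",
    "artifact": "physical",
    "physical": None,
}

# Hypothetical typing of the nouns used in the examples.
NOUN_TYPE = {"man": "human", "guests": "human", "rock": "physical", "door": "artifact"}

# Hypothetical selectional requirement of each verb on the relevant argument.
SELECTS = {"fall": "physical", "die": "animate", "force": "physical", "convince": "animate"}


def is_subtype(t, ancestor):
    """Walk up the hierarchy from t and report whether ancestor is reached."""
    while t is not None:
        if t == ancestor:
            return True
        t = PARENT[t]
    return False


def acceptable(verb, noun):
    """True when the noun's type satisfies the verb's selectional requirement."""
    return is_subtype(NOUN_TYPE[noun], SELECTS[verb])
```

For instance, acceptable("die", "rock") is False while acceptable("fall", "rock") is True, mirroring example 1b.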
8Modeling Paradigmatic Systems
9Integrating Selection into Grammars
10Qualia Structure
- Qualia are used to create new types
- They are generative coherence relations between
types.
11Three Ranks of Type
Entities
Events
12System of Generating Types
13Qualia are incorporated into Type Itself
14Qualia as Types
15Functional Selection
16Functional Type Coercion
17Co-composition
18Coercion in Function Composition
19Selection and Coercion
20Type Specification
21Type Determines Grammatical Behavior
Corpus Distribution of different types should
correlate strongly with their type.
Behavior is measurable in corpus
22Corpus Analysis provides probable values for Coercion
Drinking, sipping, cooling, ?pouring, ?spilling,
23Complements of begin in AP (Pustejovsky and Rooth, 1991 ms)
24Complements of veto in AP
25Limitations of this Approach: Fuzzy Selection
26Dependencies that require Models: Complex Types
27Complex Types
28Contexts Introducing Complex Types
- 1. a. John read the story/the book.
- b. John told the story/!the book.
- 2. Mary read the subway wall.
29When Paradigmatic systems are modeled,
Syntagmatic Processes are affected
- The specificity of argument selection by a predicate
- The treatment of verbal polysemy and multiple subcategorization
- The treatment of type mismatches and the semantics of solidarities
30Types of Properties
31Natural Binary Predicate
32Polar Predicates
- hot/cold
- big/small
- short/tall
- clean/dirty
33Lexical Asymmetries
- Preferences and Defaults
- clean/dirty, empty/full, pretty/ugly
- Lexical Gaps
- bald/(hairy), toothless/(toothed)
- Lexical Perfectives
- dead/alive
34Sortal Opposition
- External Negation points up in the Type system
- Internal Negation points down in the Type System
- (1) a. Rocks are not alive.
- b. !Rocks are dead.
- (2) a. The Pope is not married.
- b. !The Pope is a bachelor.
- (3) a. Bill did not run the race.
- b. Hence, Bill did not win the race.
- c. !Bill lost the race.
35Case Study I: Corpus Drives Lexical Acquisition
36Text Mining the Biobibliome
- 40,000 papers published each month in Medline
- 11 million abstracts currently in Medline Database
- 36 GB of text
37Robust Extraction of Relations from Biomedical
Texts
- Statistical techniques are too coarse-grained
- SU6656 does not inhibit the PDGF receptor.
- Local Named Entity Extraction is not informative enough
- This protein binds to Src.
- Bag of words and bag of entities approaches are too weak
- p16 inhibits Cdk4.
- Cdk4 is inhibited by p16.
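A small sketch may make the last point concrete. A bag of words cannot tell who inhibits whom, while even a crude pattern over the active and passive forms recovers the same directed relation from both sentences above. The two regular expressions and the function names are hypothetical and cover only these toy sentences. The reversed sentence "Cdk4 inhibits p16." is constructed here purely as a contrast and is not a claim from the talk.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Order-insensitive representation: lowercase token counts."""
    return Counter(re.findall(r"\w+", sentence.lower()))


# Hypothetical patterns covering only the simple active and passive forms.
ACTIVE = re.compile(r"^(\w+) inhibits (\w+)\.$")
PASSIVE = re.compile(r"^(\w+) is inhibited by (\w+)\.$")


def extract_inhibit(sentence):
    """Return (inhibitor, target) or None when no pattern applies."""
    m = ACTIVE.match(sentence)
    if m:
        return (m.group(1), m.group(2))
    m = PASSIVE.match(sentence)
    if m:
        # In the passive, the grammatical subject is the target.
        return (m.group(2), m.group(1))
    return None
```

Both "p16 inhibits Cdk4." and "Cdk4 is inhibited by p16." map to the pair ("p16", "Cdk4"), whereas the bag of words for "p16 inhibits Cdk4." is identical to that of the reversed sentence. The negated sentence about SU6656 matches neither pattern and yields None, so it is not mistaken for a positive relation.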
38Parsing Methodology
- Identify Targets of Interest
- Entities and relations to be extracted
- Perform Corpus Analysis over targets
- Cluster corpus occurrences by syntactic behavior and semantic type
- Generate Patterns for extraction
- Test and modify patterns against development corpus
39Possible Selectional Frames
- p16 inhibits Cdk4. (entity,entity)
- p16 inhibits cell growth. (entity,process)
- Methylation inhibits HDAC1. (process,entity)
- Cell growth inhibits apoptosis. (process,process)
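The four frames can be derived mechanically once each argument has a semantic type. The sketch below groups occurrences of a predicate by the (subject type, object type) pair. The type lexicon is a hypothetical stand-in for an ontology lookup, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical type lexicon; a real system would consult an ontology.
SEMANTIC_TYPE = {
    "p16": "entity",
    "cdk4": "entity",
    "hdac1": "entity",
    "cell growth": "process",
    "methylation": "process",
    "apoptosis": "process",
}


def frame(subject, obj):
    """Return the (subject type, object type) frame for one occurrence."""
    return (SEMANTIC_TYPE[subject.lower()], SEMANTIC_TYPE[obj.lower()])


def cluster_by_frame(occurrences):
    """Group (subject, object) pairs of one predicate by selectional frame."""
    clusters = defaultdict(list)
    for subject, obj in occurrences:
        clusters[frame(subject, obj)].append((subject, obj))
    return dict(clusters)


# The four occurrences of "inhibit" listed on this slide.
INHIBIT = [
    ("p16", "Cdk4"),
    ("p16", "cell growth"),
    ("Methylation", "HDAC1"),
    ("Cell growth", "apoptosis"),
]
clusters = cluster_by_frame(INHIBIT)
```

Each of the four example sentences lands in its own cluster, one per frame.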
40Corpus Pattern Analysis
- Create concordances over target elements
- Automatically cluster complementation patterns
- Semi-automatically verify patterns and amend
grammar rules accordingly.
41Getting the Lexicon out of the Corpus
- Preliminary examination of the text
- Sort concordances according to semantic patterns
- One-sense-per-domain doesn't cut it
- Complementation patterns emerge from the corpus, with and without realization
- Semantic patterns are a first step towards identifying lexical sets
- Semantic patterns identified with specific lexical sets yield co-specifications
- Implicatures can be identified with co-specifications for a very high proportion of uses of all predicators.
42Corpus-derived Grammars distinguish Textual Function
- Tensed Sentence-based relational information conveys new information.
- A peptide representing the carboxyl-terminal tail of the met receptor inhibits kinase activity.
- Nominalization functions to
- Allow further predication and modification
- Bridge the new information with acceptance as given.
- Provide economy of expression in text
- Agentive Nominal conveys a relation as a given fact.
- The protein kinase C inhibitor staurosporine, inhibited actin assembly
43Probable Syntactic Patterns: Sentential Forms
- A peptide representing the carboxyl-terminal tail of the met receptor inhibits kinase activity.
- Whereas phosphorylation of the IRK by ATP is inhibited by the nonhydrolyzable competitor adenylyl-imidodiphosphate, ...
- The Met tail peptide inhibits the closely related Ron receptor but does not affect
- Although the ability of individual trichothecenes to inhibit protein synthesis and activate JNK/p38 kinases are dissociable, both effects contribute to the induction of apoptosis.
44Probable Syntactic Patterns: Nominal Forms
- 12S E1A, an inhibitor of p300-dependent transcription, reduces the binding of TFIIB, but not that of cyclin E-Cdk2, to p300.
- The protein kinase C inhibitor staurosporine, inhibited actin assembly and platelet aggregation induced by thrombin or PMA.
45Probable Syntactic Patterns: Nominalizations
- Structural basis for inhibition of protein tyrosine phosphatases by Keggin compounds phosphomolybdate and phosphotungstate.
- Previous reports raised question as to whether 8-Cl-cAMP is a prodrug for its metabolite, 8-Cl-adenosine which exerts growth inhibition in a broad spectrum of cancer cells.
46Case Study II: Theory Drives Corpus Analysis
47Semantic Rerendering
- A general technique for adapting and modifying an existing ontology
- Types are extended and created through
- corpus analysis of patterns implicated with type structures
- Ad hoc database projections over a relational database
48Specialized Ontologies in the Biomedical Domain
- The UMLS from National Library of Medicine
- wide coverage
- shallow semantic type structure
- 180,998 instances of Amino Acid, Peptide, or Protein in UMLS
- Chemical Viewed Functionally and Chemical Viewed Structurally
- These 2 subtrees cover a large number of all types in the UMLS
- The UMLS gives semantic type bindings to 1.5 million entities
49NLP Applications using Semantic Typing
- Statistical Categorization and Disambiguation Tasks
- Resolution of Prepositional Attachment
- Relations between Constituents in Nominal Compounds
- Generalizing across semantic classes
- make up for the sparseness of data
- IR Tasks
- Query Reformulation
- Filtering and Ranking of Retrieved Results
- Information Extraction Tasks
- Coreference Resolution
- Relation Extraction (via Anaphora Resolution)
- Entity Identification
50GL as Modeling Bias in Rerendering
- Structural subtyping (Formal)
- Functional subtyping (Telic)
- Activation relations (Agentive)
- Molecular analysis (Const)
51Syntactic Rerendering Algorithm (I)
52Syntactic Rerendering Algorithm (II)
53Syntactic Rerendering Algorithm (III)
54Evaluating Results
- Comparison against Existing Ontologies
- overlap with Gene Ontology (GO) for select categories
- Receptor: 17.5% of 2nd level extension phrases are in GO
- Improved PR for the client NLP Applications
- Coreference Resolution Application
- Sortal Anaphora
- the enzyme, the protease, the same solvent, etc.
55Derivation of Instances for the Proposed Subtypes
- Syntactic templates (inhibitor, solvent)
- definitional constructions: X is a Y inhibitor
- aliasing constructions: X (the solvent)
- appositions: X, the inhibitor of Y,
- nominal compounds: the solvent X
- enumerations: the following solvents X, Y, ..
- relative clauses
- adjuncts: X and Y as solvents
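Two of these templates can be sketched as regular expressions over raw text. This is only an illustration under strong assumptions: the patterns are hypothetical, work on surface strings rather than parsed text, and cover just the definitional and appositive constructions for "inhibitor".

```python
import re

# Hypothetical surface patterns for two of the templates listed above.
TEMPLATES = [
    # definitional construction: "X is a Y inhibitor"
    ("definitional", re.compile(r"(?P<x>[\w-]+) is an? (?P<y>[\w-]+) inhibitor")),
    # apposition: "X, an/the inhibitor of Y,"
    ("apposition", re.compile(r"(?P<x>[^,]+), (?:an|the) inhibitor of (?P<y>[^,]+),")),
]


def find_inhibitors(text):
    """Return (template name, X, Y) for every template match in the text."""
    hits = []
    for name, pattern in TEMPLATES:
        for m in pattern.finditer(text):
            hits.append((name, m.group("x"), m.group("y")))
    return hits
```

Applied to the corpus sentence "12S E1A, an inhibitor of p300-dependent transcription, reduces the binding of TFIIB.", the apposition template yields X = "12S E1A" and Y = "p300-dependent transcription". Because the appositive pattern takes everything up to the comma as X, a realistic version would need noun phrase boundaries from a parser or chunker.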
56Semantic (Database) Rerendering
- Database of relations
- extracted from the Medline corpus
- inhibit, block, phosphorylate
- Typed projection from relations table
- induces an ad hoc category
- subtype of T1
- ?X X T1 R(X,Y) ?T1?UMLS1
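One possible reading of a typed projection is: collect every X of type T1 that stands in relation R to some Y, optionally restricting Y. The sketch below follows that reading, which is an assumption and not a reconstruction of the formula on the slide. The relation tuples reuse sentences from this talk, while the type bindings and the function name are hypothetical stand-ins for UMLS lookups.

```python
PROTEIN = "Amino Acid, Peptide, or Protein"

# Relation table as (relation, subject, object), taken from example sentences.
RELATIONS = [
    ("inhibit", "p21(WAF-1)", "CDK2"),
    ("inhibit", "p21(WAF-1)", "CDK4"),
    ("inhibit", "p16", "Cdk4"),
    ("inhibit", "staurosporine", "actin assembly"),
]

# Hypothetical type bindings standing in for UMLS semantic types.
TYPES = {
    "p21(WAF-1)": PROTEIN,
    "p16": PROTEIN,
    "staurosporine": "Chemical",
}


def typed_projection(relations, types, relation, t1, object_filter=None):
    """Set of X with type t1 such that relation(X, Y) holds for an accepted Y."""
    return {
        x
        for (r, x, y) in relations
        if r == relation
        and types.get(x) == t1
        and (object_filter is None or object_filter(y))
    }


# Ad hoc category "CDK inhibitor": proteins that inhibit something named CDK*.
cdk_inhibitors = typed_projection(
    RELATIONS, TYPES, "inhibit", PROTEIN, lambda y: y.upper().startswith("CDK")
)
```

The resulting set is an induced subtype of the protein type: every member already carries that type and additionally satisfies the relational condition.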
57Syntactic vs. Semantic Rerendering
- Sortals with no corresponding relational form
- solvent
- Sortal and relation predicates
- inhibitor/inhibit
- kinase/phosphorylate
- Relation predicates with no corresponding nominal forms
- bind with
- increase
58Syntactic vs. Semantic Rerendering (II)
- Overlap of derived subtypes
- CDK inhibitor
- p21(WAF-1) inhibited CDK2 and CDK4
- Recover different types of information
- Syntactic templates for sortal predicates: old information
- Typed projections of database relations: new information
67Case Study III: Applying Lexical Semantic Knowledge
TERQAS: Time and Event Recognition for Question Answering Systems
68Relevance to Question Answering Systems
- Is Gates currently CEO of Microsoft?
- Were there any meetings between the terrorist hijackers and Iraq before the WTC event?
- Did the Enron merger with Dynegy take place?
- How long did the hostage situation in Beirut last?
Questions over TIMEBANK Corpus
- When did the war between Iran and Iraq end?
- When did John Sununu travel to a fundraiser for John Ashcroft?
- How many Tutsis were killed by Hutus in Rwanda in 1994?
- Who was Secretary of Defense during the Gulf War?
- What was the largest U.S. military operation since Vietnam?
- When did the astronauts return from the space station on the last shuttle flight?
69Workshop Goals
- TimeML: Define and design a metadata standard for markup of events, their temporal anchoring, and how they are related to each other in news articles.
- TIMEBANK: Given the specification of TimeML, create a gold standard corpus of 300 articles marked up for temporal expressions, events, and basic temporal relations.
70TERQAS Participants
- James Pustejovsky, PI
- Rob Gaizauskas
- Graham Katz
- Bob Ingria
- José Castaño
- Inderjeet Mani
- Antonio Sanfilippo
- Dragomir Radev
- Patrick Hanks
- Marc Verhagen
- Beth Sundheim
- Andrea Setzer
- Jerry Hobbs
- Bran Boguraev
- Andy Latto
- John Frank
- Lisa Ferro
- Marcia Lazo
- Roser Saurí
- Anna Rumshisky
- David Day
- Luc Belanger
- Harry Wu
- Andrew See
71How TimeML Differs from Previous Markups
- Extends TIMEX2 annotation
- Temporal Functions: three years ago
- Anchors to events and other temporal expressions
- Identifies signals determining interpretation of temporal expressions
- Temporal Prepositions: for, during, on, at
- Temporal Connectives: before, after, while
- Identifies event expressions
- tensed verbs: has left, was captured, will resign
- stative adjectives: sunken, stalled, on board
- event nominals: merger, Military Operation, Gulf War
- Creates dependencies between events and times
- Anchoring: John left on Monday.
- Orderings: The party happened after midnight.
- Embedding: John said Mary left.
72<EVENT>
attributes ::= eid class tense aspect
eid ::= ID
eid ::= EventID
EventID ::= e<integer>
class ::= 'OCCURRENCE' | 'PERCEPTION' | 'REPORTING' | 'ASPECTUAL' | 'STATE' | 'I_STATE' | 'I_ACTION' | 'MODAL'
tense ::= 'PAST' | 'PRESENT' | 'FUTURE' | 'NONE'
aspect ::= 'PROGRESSIVE' | 'PERFECTIVE' | 'PERFECTIVE_PROGRESSIVE' | 'NONE'
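The attribute inventory for EVENT lends itself to a simple validity check. The sketch below is illustrative only: the value sets are copied from this slide, while the function name, the dictionary input format, and the decision to treat all four attributes as required are assumptions.

```python
import re

# Allowed values as listed on this slide.
EVENT_CLASSES = {
    "OCCURRENCE", "PERCEPTION", "REPORTING", "ASPECTUAL",
    "STATE", "I_STATE", "I_ACTION", "MODAL",
}
TENSES = {"PAST", "PRESENT", "FUTURE", "NONE"}
ASPECTS = {"PROGRESSIVE", "PERFECTIVE", "PERFECTIVE_PROGRESSIVE", "NONE"}


def validate_event(attrs):
    """Return the names of attributes that are missing or have invalid values."""
    errors = []
    # An event ID is the letter "e" followed by an integer.
    if not re.fullmatch(r"e\d+", attrs.get("eid", "")):
        errors.append("eid")
    if attrs.get("class") not in EVENT_CLASSES:
        errors.append("class")
    if attrs.get("tense") not in TENSES:
        errors.append("tense")
    if attrs.get("aspect") not in ASPECTS:
        errors.append("aspect")
    return errors
```

An empty list means the attribute set is well formed under these assumptions.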
73TimeML Event Classes
- Occurrence
- die, crash, build, merge, sell, take advantage of, ..
- State
- Be on board, kidnapped, recovering, love, ..
- Reporting
- Say, report, announce,
- I-Action
- Attempt, try, promise, offer
- I-State
- Believe, intend, want,
- Aspectual
- begin, start, finish, stop, continue.
- Perception
- See, hear, watch, feel.
74The young industry's rapid growth also is attracting regulators eager to police its many facets.
The young industry's rapid <EVENT eid="e1" class="OCCURRENCE"> growth </EVENT> also is <EVENT eid="e2" class="OCCURRENCE"> attracting </EVENT> regulators <EVENT eid="e4" class="I_STATE"> eager </EVENT> to <EVENT eid="e5" class="OCCURRENCE"> police </EVENT> its many facets.
75Israel will ask the United States to delay a military strike against Iraq until the Jewish state is fully prepared for a possible Iraqi attack.
Israel will <EVENT eid="e1" class="I_ACTION"> ask </EVENT> the United States to <EVENT eid="e2" class="I_ACTION"> delay </EVENT> a military <EVENT eid="e3" class="OCCURRENCE"> strike </EVENT> against Iraq until the Jewish state is fully <EVENT eid="e4" class="I_STATE"> prepared </EVENT> for a possible Iraqi <EVENT eid="e5" class="OCCURRENCE"> attack </EVENT>.
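Inline markup of this kind can be read back with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on the annotated sentence from this slide. The enclosing s element is an assumption added so the fragment has a single root, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# The <s> wrapper is an assumption; the slide shows only the inline EVENT tags.
MARKUP = (
    '<s>Israel will <EVENT eid="e1" class="I_ACTION">ask</EVENT> the United States '
    'to <EVENT eid="e2" class="I_ACTION">delay</EVENT> a military '
    '<EVENT eid="e3" class="OCCURRENCE">strike</EVENT> against Iraq until the '
    'Jewish state is fully <EVENT eid="e4" class="I_STATE">prepared</EVENT> for a '
    'possible Iraqi <EVENT eid="e5" class="OCCURRENCE">attack</EVENT>.</s>'
)


def list_events(markup):
    """Return (eid, class, text) for each EVENT element in document order."""
    root = ET.fromstring(markup)
    return [
        (e.get("eid"), e.get("class"), (e.text or "").strip())
        for e in root.iter("EVENT")
    ]


def plain_text(markup):
    """Recover the unannotated sentence by concatenating all text nodes."""
    return "".join(ET.fromstring(markup).itertext())
```

list_events(MARKUP) gives the five annotated events with their classes, and plain_text(MARKUP) returns the original sentence without tags.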
76<TIMEX3>
- Fully Specified Temporal Expressions
- June 11, 1989
- Summer, 2002
- Underspecified Temporal Expressions
- Monday
- Next month
- Last year
- Two days ago
- Durations
- Three months
- Two years
- functionInDocument allows for relative anchoring
of temporal expression values
77TLINK
- TLINK or Temporal Link represents the temporal relationship holding between events or between an event and a time, and establishes a link between the involved entities, making explicit if they are:
- 1. Simultaneous (happening at the same time)
- 2. Identical (referring to the same event)
- John drove to Boston. During his drive he ate a donut.
- 3. One before the other
- The police looked into the slayings of 14 women. In six of the cases suspects have already been arrested.
- 4. One after the other
- 5. One immediately before the other
- All passengers died when the plane crashed into the mountain.
- 6. One immediately after the other
- 7. One including the other
- John arrived in Boston last Thursday.
- 8. One being included in the other
- 9. One holding during the duration of the other
- 10. One being the beginning of the other
- John was in the gym between 6:00 p.m. and 7:00 p.m.
- 11. One being begun by the other
- 12. One being the ending of the other
- John was in the gym between 6:00 p.m. and 7:00 p.m.
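When both items have known start and end points, most of these relations reduce to comparisons on interval endpoints. The sketch below is a simplification under stated assumptions: intervals are (start, end) pairs of numbers with start < end, the label strings are illustrative names for the relations listed above, and identity of reference and "holding during" are not modeled because they are not purely a matter of endpoints.

```python
def interval_relation(a, b):
    """Label the temporal relation of interval a to interval b, or None."""
    a0, a1 = a
    b0, b1 = b
    if a == b:
        return "SIMULTANEOUS"
    if a1 < b0:
        return "BEFORE"
    if b1 < a0:
        return "AFTER"
    if a1 == b0:
        return "IBEFORE"  # a ends exactly where b starts
    if b1 == a0:
        return "IAFTER"   # a starts exactly where b ends
    if a0 == b0:
        return "BEGINS" if a1 < b1 else "BEGUN_BY"
    if a1 == b1:
        return "ENDS" if a0 > b0 else "ENDED_BY"
    if a0 < b0 and b1 < a1:
        return "INCLUDES"
    if b0 < a0 and a1 < b1:
        return "IS_INCLUDED"
    return None  # partial overlap, which the list above does not cover
```

With hours as numbers, a stay in the gym from 18 to 19 and an arrival interval from 18 to 18.5 give interval_relation((18, 18.5), (18, 19)) == "BEGINS".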
78SLINK
SLINK or Subordination Link is used for contexts introducing relations between two events, or an event and a signal, of the following sort:
- 1. Modal: Relation introduced mostly by modal verbs (should, could, would, etc.) and events that introduce a reference to a possible world, mainly I_STATEs.
- John should have bought some wine.
- Mary wanted John to buy some wine.
- 2. Factive: Certain verbs introduce an entailment (or presupposition) of the argument's veracity. They include forget in the tensed complement, regret, manage.
- John forgot that he was in Boston last year.
- Mary regrets that she didn't marry John.
- John managed to leave the party.
- 3. Counterfactive: The event introduces a presupposition about the non-veracity of its argument: forget (to), unable to (in past tense), prevent, cancel, avoid, decline, etc.
- John forgot to buy some wine.
- Mary was unable to marry John.
- John prevented the divorce.
- 4. Evidential: Evidential relations are introduced by REPORTING or PERCEPTION.
- John said he bought some wine.
- Mary saw John carrying only beer.
- 5. Negative evidential: Introduced by REPORTING (and PERCEPTION?) events conveying negative polarity.
- John denied he bought only beer.
- 6. Negative: Introduced only by negative particles (not, nor, neither, etc.), which will be marked as SIGNALs, with respect to the events they are modifying.
- John didn't forget to buy some wine.
- John did not want to marry Mary.
79ALINK
ALINK or Aspectual Link represents the relationship between an aspectual event and its argument event. Examples of the possible aspectual relations we will encode are:
- 1. Initiation: John started to read.
- 2. Culmination: John finished assembling the table.
- 3. Termination: John stopped talking.
- 4. Continuation: John kept talking.
80SLINK
(15) Bill wants to teach on Monday.
Bill
<EVENT eid="e1" class="I_STATE" tense="PRESENT" aspect="NONE"> wants </EVENT>
<MAKEINSTANCE eiid="ei1" eventID="e1"/>
<SLINK eventInstanceID="ei1" signalID="s1" subordinatedEvent="e2" relType="MODAL"/>
<SIGNAL sid="s1"> to </SIGNAL>
<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="NONE"> teach </EVENT>
<MAKEINSTANCE eiid="ei2" eventID="e2"/>
<SIGNAL sid="s2"> on </SIGNAL>
<TIMEX3 tid="t1" type="DATE" temporalFunction="true" value="XXXX-WXX-1"> Monday </TIMEX3>
<TLINK eventInstanceID="ei2" relatedToTime="t1" relType="IS_INCLUDED"/>
81ALINK
(18) The search party stopped looking for the survivors.
The search party
<EVENT eid="e1" class="ASPECTUAL" tense="PAST" aspect="NONE"> stopped </EVENT>
<MAKEINSTANCE eiid="ei1" eventID="e1"/>
<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="PROGRESSIVE"> looking </EVENT>
<ALINK eventInstanceID="ei1" relatedToEvent="e2" relType="TERMINATES"/>
for the survivors
82Multi-Document TimeML Annotation for Summarization
Even this simple summary is only possible using
TimeML
Multi-doc TimeML anchors single-doc events, and
merges events across multiple docs (via TimeML
graphs)
83TimeML for Multi-lingual Information Access
- Extend to multilingual annotation (re TIMEX2 results on Spanish, French, and Korean)
- Address translation of specialized TimeML constructs
84Open Problems in LKB Design
- Robust acquisition of semantic classes
- Classes modifiable by composition/context
- Persistence and Entailed Events
- The terrorists kidnapped the journalist.
- The President resigned.
- Event Normalization and Quantification
- Three deaths occurred.
- Three people died.
- Generalizing the Treatment of Negation
- No survivors were found.
- The plane did not crash.
85Conclusion: The Open Texture of Words
- Language is constructed by partial generating functions.
- There is inherent incompleteness of terms in language
- Richer modes of composition are used in determining sense and fixing reference
- Corpus data and statistical techniques determine the texture and completeness of the language in use.
86Acknowledgements
- Brandeis University
- José Castaño
- Wei Luo
- Roser Saurí
- Anna Rumshisky
- James Pustejovsky
- jamesp_at_cs.brandeis.edu
- medstract.org
- Tufts University
- Maciej Kotecki
- Brent Cochran
- TERQAS Workshop
- time2002.org