When Corpus Meets Theory

1
When Corpus Meets Theory
Models and Data
  • James Pustejovsky
  • TSD 2002
  • September 10, 2002

2
Talk Outline
  • Goals for Language Modeling
  • The Role of Corpus in Theory
  • Disambiguation
  • Selection discovery
  • Clustering
  • Category modification and formation
  • Grammar induction
  • The Role of Theory in Corpus

3
Goals of Language Modeling
  • Statistically informed models improve application
    performance
  • Speech
  • Search
  • Clustering
  • Parsing
  • Machine translation
  • Summarization
  • Question answering

4
Theory Drives the Model
  • Corpus Behavior of words is determined by their
    type.
  • You can't find what you can't model.
  • But you don't want to find only what you model!
  • Theory allows a model of reality, but
  • Corpus brings reality to the model.

5
Language Modeling with Generative Lexicon
  • Selection integrates paradigmatics and
    syntagmatics
  • Models the relationship between selectional
    contexts
  • Coercion in typing
  • Complex type (Dot Objects)
  • All major categories behave functionally
  • Qualia structure models much of this behavior
  • Semantic Types are differentiated and ranked
  • Grammatical behavior follows (generally) from type

6
Quine's Gambit in Corpora
  • Co-occurrence reveals surface relations.
  • Paradigmatics is first order.
  • Syntagmatics is first order.
  • LSA and other techniques create non-superficial
    associations.
  • Model Bias is necessary to create decision
    procedures
  • Example: Complex Types

7
Recognizing Selection
  • 1. a. The man fell/died.
  • b. The rock fell/!died.
  • 2. a. John forced/!convinced the door to open.
  • b. John forced/convinced the guests to leave.
  • 3. a. John poured milk into/!on his coffee.
  • b. John poured milk into/on the bowl.
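
The contrasts above can be read as type constraints that a predicate places on its argument. A minimal sketch, assuming a toy type hierarchy and hand-written selectional requirements (the type names and the `PARENT` and `SELECTS` tables are illustrative and not taken from the talk):

```python
# Toy type hierarchy: child -> parent (hypothetical, for illustration only).
PARENT = {"human": "animate", "animate": "physical", "rock": "physical",
          "door": "physical", "physical": "entity"}

# Hypothetical selectional requirement on one argument of each verb.
SELECTS = {"fall": "physical", "die": "animate",
           "force": "physical", "convince": "human"}

def is_subtype(t, required):
    """Walk up the hierarchy from t until `required` is found or the top is passed."""
    while t is not None:
        if t == required:
            return True
        t = PARENT.get(t)
    return False

def selects(verb, arg_type):
    """True if the argument's type satisfies the verb's requirement."""
    return is_subtype(arg_type, SELECTS[verb])

for verb, arg in [("fall", "human"), ("die", "human"),
                  ("fall", "rock"), ("die", "rock")]:
    print(verb, arg, selects(verb, arg))
```

Under these assumptions "The rock died" is rejected because `rock` never reaches `animate` when walking up the hierarchy, while "The rock fell" is accepted.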

8
Modeling Paradigmatic Systems
9
Integrating Selection into Grammars
10
Qualia Structure
  • Qualia are used to create new types
  • They are generative coherence relations between
    types.

11
Three Ranks of Type
Entities
Events
12
System of Generating Types
13
Qualia are incorporated into Type Itself
14
Qualia as Types
15
Functional Selection
16
Functional Type Coercion
17
Co-composition
18
Coercion in Function Composition
19
Selection and Coercion
20
Type Specification
21
Type Determines Grammatical Behavior
Corpus Distribution of different types should
correlate strongly with their type.
Behavior is measurable in corpus
22
Corpus Analysis provides probable values for
Coercion
Drinking, sipping, cooling, ?pouring, ?spilling,
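
One way to picture how a corpus could supply probable values for a coerced event is to rank the verbs that take a given noun as their object. A minimal sketch, assuming verb-object pairs have already been extracted from parsed text (the pairs below are invented toy data, and relative frequency is only one possible estimator):

```python
from collections import Counter

# Hypothetical (verb, object) pairs, standing in for parsed corpus data.
PAIRS = [("drink", "coffee"), ("drink", "coffee"), ("drink", "coffee"),
         ("sip", "coffee"), ("sip", "coffee"), ("cool", "coffee"),
         ("pour", "coffee"), ("read", "book"), ("read", "book"),
         ("write", "book")]

def coercion_candidates(noun, pairs):
    """Relative frequency of each verb among those taking `noun` as object."""
    counts = Counter(verb for verb, obj in pairs if obj == noun)
    total = sum(counts.values())
    return {verb: n / total for verb, n in counts.most_common()}

print(coercion_candidates("coffee", PAIRS))
```

The highest-ranked verbs are then candidate readings for an expression such as "begin the coffee"; lower-ranked ones correspond to the items marked with "?" on the slide.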
23
Complements of begin in AP (Pustejovsky and
Rooth, 1991 ms)
24
Complements of veto in AP
25
Limitations of this Approach: Fuzzy Selection
26
Dependencies that require Models: Complex Types
27
Complex Types
28
Contexts Introducing Complex Types
  • 1. a. John read the story/the book.
  • b. John told the story/!the book.
  • 2. Mary read the subway wall.

29
When Paradigmatic systems are modeled,
Syntagmatic Processes are affected
  • The specificity of argument selection by a
    predicate
  • The treatment of verbal polysemy and multiple
    subcategorization
  • The treatment of type mismatches and the
    semantics of solidarities

30
Types of Properties
31
Natural Binary Predicate
32
Polar Predicates
  • hot/cold
  • big/small
  • short/tall
  • clean/dirty

33
Lexical Asymmetries
  • Preferences and Defaults
  • clean/dirty, empty/full, pretty/ugly
  • Lexical Gaps
  • bald/(hairy), toothless/(toothed)
  • Lexical Perfectives
  • dead/alive

34
Sortal Opposition
  • External Negation points up in the Type system
  • Internal Negation points down in the Type System
  • (1) a. Rocks are not alive.
  • b. !Rocks are dead.
  • (2) a. The Pope is not married.
  • b. !The Pope is a bachelor.
  • (3) a. Bill did not run the race.
  • b. Hence, Bill did not win the race.
  • c. !Bill lost the race.

35
Case Study I: Corpus Drives Lexical Acquisition
36
Text Mining the Biobibliome
  • 40,000 papers published each month in Medline
  • 11 million abstracts currently in Medline
    Database
  • 36 GB of text

37
Robust Extraction of Relations from Biomedical
Texts
  • Statistical techniques are too coarse-grained
  • SU6656 does not inhibit the PDGF receptor.
  • Local Named Entity Extraction is not
    informative enough
  • This protein binds to Src.
  • Bag of words and bag of entities approaches
    are too weak
  • p16 inhibits Cdk4.
  • Cdk4 is inhibited by p16.
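
The last pair of sentences shows why an unordered bag of entities is too weak: both sentences mention the same two entities, but only a structural pattern recovers who inhibits whom. A minimal sketch, assuming two hand-written regular expressions for the active and passive forms (the patterns and the function names `entity_bag` and `extract` are illustrative, not the extraction grammar used in the work described):

```python
import re

# Hypothetical surface patterns for one relation in active and passive voice.
ACTIVE = re.compile(r"^(\S+) inhibits (\S+?)\.?$")
PASSIVE = re.compile(r"^(\S+) is inhibited by (\S+?)\.?$")

def entity_bag(sentence, entities=("p16", "Cdk4")):
    """Unordered set of known entity names mentioned in the sentence."""
    return frozenset(e for e in entities if e in sentence)

def extract(sentence):
    """Return (agent, 'inhibit', patient), or None if no pattern matches."""
    m = ACTIVE.match(sentence)
    if m:
        return (m.group(1), "inhibit", m.group(2))
    m = PASSIVE.match(sentence)
    if m:
        # In the passive, the grammatical subject is the patient.
        return (m.group(2), "inhibit", m.group(1))
    return None

for s in ["p16 inhibits Cdk4.", "Cdk4 is inhibited by p16."]:
    print(sorted(entity_bag(s)), extract(s))
```

Both sentences yield the same entity bag, which carries no role information, while the patterns normalize them to the same directed relation.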

38
Parsing Methodology
  • Identify Targets of Interest
  • Entities and relations to be extracted
  • Perform Corpus Analysis over targets
  • Cluster corpus occurrences by syntactic behavior
    and semantic type
  • Generate Patterns for extraction
  • Test and modify patterns against development
    corpus

39
Possible Selectional Frames
  • p16 inhibits Cdk4. (entity,entity)
  • p16 inhibits cell growth. (entity,process)
  • Methylation inhibits HDAC1. (process,entity)
  • Cell growth inhibits apoptosis.
    (process,process)

40
Corpus Pattern Analysis
  • Create concordances over target elements
  • Automatically cluster complementation patterns
  • Semi-automatically verify patterns and amend
    grammar rules accordingly.

41
Getting the Lexicon out of the Corpus
  • Preliminary examination of the text
  • Sort concordances according to semantic patterns
  • One-sense-per-domain doesn't cut it
  • Complementation patterns emerge from the corpus,
    with and without realization
  • Semantic patterns are a first step towards
    identifying lexical sets
  • Semantic patterns identified with specific
    lexical sets yield co-specifications
  • Implicatures can be identified with
    co-specifications for a very high proportion of
    uses of all predicators.

42
Corpus-derived Grammars distinguish Textual
Function
  • Tensed Sentence-based relational information
    conveys new information.
  • A peptide representing the carboxyl-terminal tail
    of the met receptor inhibits kinase activity.
  • Nominalization functions to
  • Allow further predication and modification
  • Bridge the new information with acceptance as
    given.
  • Provide economy of expression in text
  • Agentive Nominal conveys a relation as a given
    fact.
  • The protein kinase C inhibitor staurosporine ,
    inhibited actin assembly

43
Probable Syntactic Patterns: Sentential Forms
  • A peptide representing the carboxyl-terminal
    tail of the met receptor inhibits kinase
    activity.
  • Whereas phosphorylation of the IRK by ATP is
    inhibited by the nonhydrolyzable competitor
    adenylyl-imidodiphosphate, ...
  • The Met tail peptide inhibits the closely
    related Ron receptor but does not affect
  • Although the ability of individual trichothecenes
    to inhibit protein synthesis and activate JNK/p38
    kinases are dissociable , both effects contribute
    to the induction of apoptosis .

44
Probable Syntactic Patterns: Nominal Forms
  • 12S E1A , an inhibitor of p300-dependent
    transcription , reduces the binding of TFIIB ,
    but not that of cyclin E- Cdk2 , to p300.
  • The protein kinase C inhibitor staurosporine ,
    inhibited actin assembly and platelet aggregation
    induced by thrombin or PMA.

45
Probable Syntactic Patterns: Nominalizations
  • Structural basis for inhibition of protein
    tyrosine phosphatases by Keggin compounds
    phosphomolybdate and phosphotungstate.
  • Previous reports raised question as to whether
    8-Cl-cAMP is a prodrug for its metabolite,
    8-Cl-adenosine which exerts growth inhibition in
    a broad spectrum of cancer cells.

46
Case Study II: Theory Drives Corpus Analysis
47
Semantic Rerendering
  • A general technique for adapting and modifying an
    existing ontology
  • Types are extended and created through
    corpus analysis of patterns implicated with type
    structures
  • Ad hoc database projections over a relational
    database

48
Specialized Ontologies in the Biomedical Domain
  • The UMLS from National Library of Medicine
  • wide coverage
  • shallow semantic type structure
  • 180,998 instances of Amino Acid, Peptide, or
    Protein in UMLS
  • Chemical Viewed Functionally and Chemical Viewed
    Structurally
  • These 2 subtrees cover a large number of all
    types in the UMLS
  • The UMLS gives semantic type bindings to 1.5
    million entities

49
NLP Applications using Semantic Typing
  • Statistical Categorization and Disambiguation
    Tasks
  • Resolution of Prepositional Attachment
  • Relations between Constituents in Nominal
    Compounds
  • Generalizing across semantic classes
  • make up for the sparseness of data
  • IR Tasks
  • Query Reformulation
  • Filtering and Ranking of Retrieved Results
  • Information Extraction Tasks
  • Coreference Resolution
  • Relation Extraction (via Anaphora Resolution)
  • Entity Identification

50
GL as Modeling Bias in Rerendering
  • Structural subtyping (Formal)
  • Functional subtyping (Telic)
  • Activation relations (Agentive)
  • Molecular analysis (Const)

51
Syntactic Rerendering Algorithm (I)
52
Syntactic Rerendering Algorithm (II)
53
Syntactic Rerendering Algorithm (III)
54
Evaluating Results
  • Comparison against Existing Ontologies
  • overlap with Gene Ontology (GO) for select
    categories
  • Receptor: 17.5% of 2nd-level extension phrases
    are in GO
  • Improved P/R (precision/recall) for the client NLP Applications
  • Coreference Resolution Application
  • Sortal Anaphora
  • the enzyme, the protease, the same solvent,
    etc.

55
Derivation of Instances for the Proposed Subtypes
  • Syntactic templates (inhibitor, solvent)
  • definitional constructions: X is a Y inhibitor
  • aliasing constructions: X (the solvent)
  • appositions: X, the inhibitor of Y,
  • nominal compounds: the solvent X
  • enumerations: the following solvents X, Y, ..
  • relative clauses
  • adjuncts: X and Y as solvents
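
Three of the templates above can be approximated with regular expressions. A minimal sketch, assuming a crude token pattern for entity names and invented test sentences (the `NAME` pattern, the `TEMPLATES` list and `find_instances` are illustrative assumptions; a real system would match over parsed or chunked text):

```python
import re

# Crude, hypothetical token pattern for an entity name.
NAME = r"[A-Za-z0-9\-()]+"

TEMPLATES = [
    # definitional construction: "X is a Y inhibitor"
    re.compile(rf"(?P<x>{NAME}) is an? (?:{NAME} )?(?P<type>inhibitor|solvent)\b"),
    # apposition: "X, the inhibitor of Y,"
    re.compile(rf"(?P<x>{NAME}), (?:the|an?) (?P<type>inhibitor|solvent)\b"),
    # aliasing construction: "X (the solvent)"
    re.compile(rf"(?P<x>{NAME}) \(the (?P<type>inhibitor|solvent)\)"),
]

def find_instances(text):
    """Return (instance, proposed_subtype) pairs found by any template."""
    found = []
    for pattern in TEMPLATES:
        for m in pattern.finditer(text):
            found.append((m.group("x"), m.group("type")))
    return found

for sentence in ["Staurosporine is a kinase inhibitor.",
                 "p21, the inhibitor of CDK2, was induced.",
                 "DMSO (the solvent) was added."]:
    print(find_instances(sentence))
```

Each match proposes X as an instance of the sortal subtype named by the head noun.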

56
Semantic (Database) Rerendering
  • Database of relations
  • extracted from the Medline corpus
  • inhibit, block, phosphorylate
  • Typed projection from relations table
  • induces an ad hoc category
  • subtype of T1
  • ?X X T1 R(X,Y) ?T1?UMLS1
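
A typed projection can be pictured as a filtered query over the relations table: keep the first argument of relation R whenever that argument has type T1. A minimal sketch, assuming an in-memory list of relation tuples and a dictionary of type bindings (the table contents, the type labels and the function name `typed_projection` are invented for illustration and are not the actual database schema):

```python
# Hypothetical relation tuples (relation, X, Y), as if extracted from text.
RELATIONS = [("inhibit", "p16", "Cdk4"),
             ("inhibit", "p21", "CDK2"),
             ("inhibit", "staurosporine", "PKC"),
             ("phosphorylate", "kinaseA", "substrateB")]

# Hypothetical semantic type bindings standing in for an ontology lookup.
TYPE_OF = {"p16": "Protein", "p21": "Protein", "kinaseA": "Protein",
           "staurosporine": "Chemical"}

def typed_projection(relation, t1):
    """Entities X of type t1 such that relation(X, Y) holds for some Y."""
    return sorted({x for r, x, _ in RELATIONS
                   if r == relation and TYPE_OF.get(x) == t1})

print(typed_projection("inhibit", "Protein"))
print(typed_projection("inhibit", "Chemical"))
```

Each result set is an ad hoc category: for example, the proteins that inhibit something form a candidate subtype of the protein type.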

57
Syntactic vs. Semantic Rerendering
  • Sortals with no corresponding relational form
  • solvent
  • Sortal and relation predicates
  • inhibitor/inhibit
  • kinase/phosphorylate
  • Relation predicates with no corresponding nominal
    forms
  • bind with
  • increase

58
Syntactic vs. Semantic Rerendering (II)
  • Overlap of derived subtypes
  • CDK inhibitor
  • p21(WAF-1) inhibited CDK2 and CDK4
  • Recover different types of information
  • Syntactic templates for sortal predicates: old
    information
  • Typed projections of database relations: new
    information

67
Case Study III: Applying Lexical Semantic
Knowledge. TERQAS: Time and Event Recognition
for Question Answering Systems
68
Relevance to Question Answering Systems
  • Is Gates currently CEO of Microsoft?
  • Were there any meetings between the terrorist
    hijackers and Iraq before the WTC event?
  • Did the Enron merger with Dynegy take place?
  • How long did the hostage situation in Beirut last?

Questions over TIMEBANK Corpus
  • When did the war between Iran and Iraq end?
  • When did John Sununu travel to a fundraiser for
    John Ashcroft?
  • How many Tutsis were killed by Hutus in Rwanda
    in 1994?
  • Who was Secretary of Defense during the Gulf
    War?
  • What was the largest U.S. military operation
    since Vietnam?
  • When did the astronauts return from the space
    station on the last shuttle flight?

69
Workshop Goals
  • TimeML: Define and design a metadata standard for
    markup of events, their temporal anchoring, and
    how they are related to each other in news
    articles.
  • TIMEBANK: Given the specification of TimeML,
    create a gold standard corpus of 300 articles
    marked up for temporal expressions, events, and
    basic temporal relations.

70
TERQAS Participants
  • James Pustejovsky, PI
  • Rob Gaizauskas
  • Graham Katz
  • Bob Ingria
  • José Castaño
  • Inderjeet Mani
  • Antonio Sanfilippo
  • Dragomir Radev
  • Patrick Hanks
  • Marc Verhagen
  • Beth Sundheim
  • Andrea Setzer
  • Jerry Hobbs
  • Bran Boguraev
  • Andy Latto
  • John Frank
  • Lisa Ferro
  • Marcia Lazo
  • Roser Saurí
  • Anna Rumshisky
  • David Day
  • Luc Belanger
  • Harry Wu
  • Andrew See

71
How TimeML Differs from Previous Markups
  • Extends TIMEX2 annotation
  • Temporal Functions: three years ago
  • Anchors to events and other temporal expressions
  • Identifies signals determining interpretation of
    temporal expressions
  • Temporal Prepositions: for, during, on, at
  • Temporal Connectives: before, after, while
  • Identifies event expressions
  • tensed verbs: has left, was captured, will
    resign
  • stative adjectives: sunken, stalled, on board
  • event nominals: merger, Military Operation, Gulf
    War
  • Creates dependencies between events and times
  • Anchoring: John left on Monday.
  • Orderings: The party happened after midnight.
  • Embedding: John said Mary left.

72
<EVENT>
attributes ::= eid class tense aspect
eid ::= ID
eid ::= EventID
EventID ::= e<integer>
class ::= 'OCCURRENCE' | 'PERCEPTION' | 'REPORTING' |
          'ASPECTUAL' | 'STATE' | 'I_STATE' |
          'I_ACTION' | 'MODAL'
tense ::= 'PAST' | 'PRESENT' | 'FUTURE' | 'NONE'
aspect ::= 'PROGRESSIVE' | 'PERFECTIVE' |
           'PERFECTIVE_PROGRESSIVE' | 'NONE'
73
TimeML Event Classes
  • Occurrence: die, crash, build, merge, sell, take
    advantage of, ..
  • State: be on board, kidnapped, recovering, love, ..
  • Reporting: say, report, announce
  • I-Action: attempt, try, promise, offer
  • I-State: believe, intend, want
  • Aspectual: begin, start, finish, stop, continue
  • Perception: see, hear, watch, feel

74
The young industry's rapid growth also is
attracting regulators eager to police its many
facets.
The young industry's rapid
<EVENT eid="e1" class="OCCURRENCE">growth</EVENT>
also is
<EVENT eid="e2" class="OCCURRENCE">attracting</EVENT>
regulators
<EVENT eid="e4" class="I_STATE">eager</EVENT>
to
<EVENT eid="e5" class="OCCURRENCE">police</EVENT>
its many facets.
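
Inline EVENT markup of this kind can be read back with a standard XML parser. A minimal sketch using Python's `xml.etree.ElementTree`, assuming the annotated sentence is wrapped in a root element (the `<s>` wrapper and the function name `events` are assumptions made for the example; the class list is the one given on the EVENT slide):

```python
import xml.etree.ElementTree as ET

# Event classes as listed on the EVENT attribute slide.
EVENT_CLASSES = {"OCCURRENCE", "PERCEPTION", "REPORTING", "ASPECTUAL",
                 "STATE", "I_STATE", "I_ACTION", "MODAL"}

# The annotated sentence, wrapped in a hypothetical <s> root element.
ANNOTATED = """<s>The young industry's rapid
<EVENT eid="e1" class="OCCURRENCE">growth</EVENT> also is
<EVENT eid="e2" class="OCCURRENCE">attracting</EVENT> regulators
<EVENT eid="e4" class="I_STATE">eager</EVENT> to
<EVENT eid="e5" class="OCCURRENCE">police</EVENT> its many facets.</s>"""

def events(xml_text):
    """Return (eid, class, text) for every EVENT element, in document order."""
    root = ET.fromstring(xml_text)
    return [(e.get("eid"), e.get("class"), (e.text or "").strip())
            for e in root.iter("EVENT")]

for eid, cls, text in events(ANNOTATED):
    # Flag any class value that is not in the permitted list.
    status = "ok" if cls in EVENT_CLASSES else "unknown class"
    print(eid, cls, text, status)
```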
75
Israel will ask the United States to delay a
military strike against Iraq until the Jewish
state is fully prepared for a possible Iraqi
attack.
Israel will
<EVENT eid="e1" class="I_ACTION">ask</EVENT>
the United States to
<EVENT eid="e2" class="I_ACTION">delay</EVENT>
a military
<EVENT eid="e3" class="OCCURRENCE">strike</EVENT>
against Iraq until the Jewish state is fully
<EVENT eid="e4" class="I_STATE">prepared</EVENT>
for a possible Iraqi
<EVENT eid="e5" class="OCCURRENCE">attack</EVENT>.
76
<TIMEX3>
  • Fully Specified Temporal Expressions
  • June 11, 1989
  • Summer, 2002
  • Underspecified Temporal Expressions
  • Monday
  • Next month
  • Last year
  • Two days ago
  • Durations
  • Three months
  • Two years
  • functionInDocument allows for relative anchoring
    of temporal expression values

77
TLINK
  • TLINK or Temporal Link represents the temporal
    relationship holding between events or between an
    event and a time, and establishes a link between
    the involved entities, making explicit whether
    they are:
  • 1. Simultaneous (happening at the same time)
  • 2. Identical (referring to the same event)
  • John drove to Boston. During his drive he ate a
    donut.
  • 3. One before the other
  • The police looked into the slayings of 14
    women. In six of the cases suspects have already
    been arrested.
  • 4. One after the other
  • 5. One immediately before the other
  • All passengers died when the plane crashed into
    the mountain.
  • 6. One immediately after the other
  • 7. One including the other
  • John arrived in Boston last Thursday.
  • 8. One being included in the other
  • 9. One holding during the duration of the other
  • 10. One being the beginning of the other
  • John was in the gym between 6:00 p.m. and 7:00
    p.m.
  • 11. One being begun by the other
  • 12. One being the ending of the other
  • John was in the gym between 6:00 p.m. and 7:00
    p.m.
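
Several of the relations listed above can be illustrated by comparing endpoints on a timeline. A minimal sketch, assuming each event or time is a numeric interval `(start, end)` with `start < end` (this interval reading, the label strings and the function name `relate` are assumptions made for illustration; "ended by" is added as the converse of "ends", and identity and "holding during" are not captured by endpoints alone):

```python
def relate(a, b):
    """Describe how interval a = (start, end) relates to interval b."""
    (a0, a1), (b0, b1) = a, b
    if (a0, a1) == (b0, b1):
        return "simultaneous"
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "immediately before"
    if b1 < a0:
        return "after"
    if b1 == a0:
        return "immediately after"
    if a0 == b0:
        # Same start: the shorter interval begins the longer one.
        return "begins" if a1 < b1 else "begun by"
    if a1 == b1:
        # Same end: the later-starting interval ends the other.
        return "ends" if a0 > b0 else "ended by"
    if a0 < b0 and b1 < a1:
        return "includes"
    if b0 < a0 and a1 < b1:
        return "is included in"
    return "partial overlap (no dedicated relation in the list above)"

# Times as hours on a 24-hour clock (toy values).
gym = (18, 19)
print(relate((18, 18.5), gym))
print(relate((17, 18), gym))
print(relate((10, 20), gym))
```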

78
SLINK
SLINK or Subordination Link is used for contexts
introducing relations between two events, or an
event and a signal, of the following sorts:
1. Modal: Relation introduced mostly by modal verbs
   (should, could, would, etc.) and events that
   introduce a reference to a possible world,
   mainly I_STATEs.
   John should have bought some wine.
   Mary wanted John to buy some wine.
2. Factive: Certain verbs introduce an entailment
   (or presupposition) of the argument's veracity.
   They include forget in the tensed complement,
   regret, manage.
   John forgot that he was in Boston last year.
   Mary regrets that she didn't marry John.
   John managed to leave the party.
3. Counterfactive: The event introduces a
   presupposition about the non-veracity of its
   argument: forget (to), unable to (in past tense),
   prevent, cancel, avoid, decline, etc.
   John forgot to buy some wine.
   Mary was unable to marry John.
   John prevented the divorce.
4. Evidential: Evidential relations are introduced
   by REPORTING or PERCEPTION events.
   John said he bought some wine.
   Mary saw John carrying only beer.
5. Negative evidential: Introduced by REPORTING
   (and PERCEPTION?) events conveying negative
   polarity.
   John denied he bought only beer.
6. Negative: Introduced only by negative particles
   (not, nor, neither, etc.), which will be marked
   as SIGNALs, with respect to the events they are
   modifying.
   John didn't forget to buy some wine.
   John did not want to marry Mary.
79
ALINK
ALINK or Aspectual Link represents the
relationship between an aspectual event and its
argument event. Examples of the possible
aspectual relations we will encode are:
1. Initiation: John started to read.
2. Culmination: John finished assembling the
   table.
3. Termination: John stopped talking.
4. Continuation: John kept talking.
80
SLINK
(15) Bill wants to teach on Monday.
Bill
<EVENT eid="e1" class="I_STATE" tense="PRESENT" aspect="NONE">wants</EVENT>
<MAKEINSTANCE eiid="ei1" eventID="e1"/>
<SLINK eventInstanceID="ei1" signalID="s1" subordinatedEvent="e2" relType="MODAL"/>
<SIGNAL sid="s1">to</SIGNAL>
<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="NONE">teach</EVENT>
<MAKEINSTANCE eiid="ei2" eventID="e2"/>
<SIGNAL sid="s2">on</SIGNAL>
<TIMEX3 tid="t1" type="DATE" temporalFunction="true" value="XXXX-WXX-1">Monday</TIMEX3>
<TLINK eventInstanceID="ei2" relatedToTime="t1" relType="IS_INCLUDED"/>
81
ALINK
(18) The search party stopped looking for the
survivors.
The search party
<EVENT eid="e1" class="ASPECTUAL" tense="PAST" aspect="NONE">stopped</EVENT>
<MAKEINSTANCE eiid="ei1" eventID="e1"/>
<EVENT eid="e2" class="OCCURRENCE" tense="NONE" aspect="PROGRESSIVE">looking</EVENT>
<ALINK eventInstanceID="ei1" relatedToEvent="e2" relType="TERMINATES"/>
for the survivors
82
Multi-Document TimeML Annotation for Summarization
Even this simple summary is only possible using
TimeML
Multi-doc TimeML anchors single-doc events, and
merges events across multiple docs (via TimeML
graphs)
83
TimeML for Multi-lingual Information Access
  • Extend to multilingual annotation (re TIMEX2
    results on Spanish, French, and Korean)
  • Address translation of specialized TimeML
    constructs

84
Open Problems in LKB Design
  • Robust acquisition of semantic classes
  • Classes modifiable by composition/context
  • Persistence and Entailed Events
  • The terrorists kidnapped the journalist.
  • The President resigned.
  • Event Normalization and Quantification
  • Three deaths occurred.
  • Three people died.
  • Generalizing the Treatment of Negation
  • No survivors were found.
  • The plane did not crash.

85
Conclusion: The Open Texture of Words
  • Language is constructed by partial generating
    functions.
  • There is inherent incompleteness of terms in
    language
  • Richer modes of composition are used in
    determining sense and fixing reference
  • Corpus data and statistical techniques determine
    the texture and completeness of the language in
    use.

86
Acknowledgements
  • Brandeis University
  • José Castaño
  • Wei Luo
  • Roser Saurí
  • Anna Rumshisky
  • James Pustejovsky
  • jamesp_at_cs.brandeis.edu
  • medstract.org
  • Tufts University
  • Maciej Kotecki
  • Brent Cochran
  • TERQAS Workshop
  • time2002.org
