Title: Meaning from Text: Teaching Computers to Read
1 Meaning from Text: Teaching Computers to Read
- Steven Bethard
- University of Colorado
2 Query: Who is opposing the railroad through Georgia?
- 1 en.wikipedia.org/wiki/Sherman's_March_to_the_Sea
  ...they destroyed the railroads and the manufacturing and agricultural infrastructure of the state... Henry Clay Work wrote the song "Marching Through Georgia"...
- 3 www.ischool.berkeley.edu/mkduggan/politics.html
  While the piano piece "Marching Through Georgia" has no words... Party of California (1882) has several verses opposing the "railroad robbers"...
- 71 www.azconsulatela.org/brazaosce.htm
  Azerbaijan, Georgia and Turkey plan to start construction of Kars-Akhalkalaki-Tbilisi-Baku railroad in May, 2007... However, we've witnessed a very strong opposition to this project both in Congress and White House. President George Bush signed a bill prohibiting financing of this railroad...
3 What went wrong?
- Didn't find some similar word forms (Morphology)
  - Finds "opposing" but not "opposition"
  - Finds "railroad" but not "railway"
- Didn't know how words should be related (Syntax)
  - Looking for "opposing railroad"
  - Finds "opposing the 'railroad robbers'"
- Didn't know that "is opposing" means current (Semantics/Tense)
  - Looking for recent documents
  - Finds Civil War documents
- Didn't know that "who" means a person (Semantics/Entities)
  - Looking for "opposing"
  - Finds "several verses opposing"
4 Teaching Linguistics to Computers
- Natural Language Processing (NLP)
- Symbolic approaches
- Statistical approaches
- Machine learning overview
- Statistical NLP
- Example: Identifying people and places
- Example: Constructing timelines
5 Early Natural Language Processing
- Symbolic approaches
- Small domains
- Example: SHRDLU blocks world
  - Vocabulary of 50 words
  - Simple word combinations
  - Hand-written rules to understand sentences (a toy sketch of this style follows the dialogue)
Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Computer: SEVEN OF THEM.
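As a toy illustration of the hand-written-rule style (this is not SHRDLU itself, just a sketch in that spirit; the world model and patterns are made up), a few string patterns can answer questions against a tiny blocks world:

```python
import re

# A tiny hand-coded world: what each object contains or rests on.
WORLD = {"box": {"contains": "the blue pyramid"},
         "pyramid": {"supported_by": "the box"}}

RULES = [
    (r"WHAT DOES THE (\w+) CONTAIN", lambda m: WORLD[m.group(1).lower()]["contains"]),
    (r"WHAT IS THE (\w+) SUPPORTED BY", lambda m: WORLD[m.group(1).lower()]["supported_by"]),
]

def answer(question):
    """Try each hand-written pattern; symbolic systems fail outside their rules."""
    for pattern, responder in RULES:
        match = re.search(pattern, question.upper())
        if match:
            return responder(match).upper()
    return "I DO NOT UNDERSTAND."

print(answer("What does the box contain?"))        # THE BLUE PYRAMID
print(answer("What is the pyramid supported by?"))  # THE BOX
```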
6 Recent Natural Language Processing
- Large-scale linguistic corpora
  - e.g. Penn TreeBank: a million words of syntax
- Statistical machine learning
  - e.g. Charniak parser
    - Trained on the TreeBank
    - Builds new trees with 90% accuracy
7 Machine Learning
- General approach
- Analyze data
- Extract preferences
- Classify new examples using learned preferences
- Supervised machine learning
- Data have human-annotated labels
- e.g. each sentence in the TreeBank has a syntactic tree
- Learns human preferences
8 Supervised Machine Learning Models
- Given
- An N-dimensional feature space
- Points in that space
- A human-annotated label for each point
- Goal
- Learn a function to assign labels to points
- Methods
- K-nearest-neighbors, support vector machines, etc.
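To make this concrete, here is a minimal sketch (not from the talk; the data points and labels are made up) of a supervised learner: a k-nearest-neighbors classifier that assigns each new point the majority label of its closest labeled points.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, new_point, k=3):
    """Assign new_point the majority label among its k nearest labeled points."""
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-dimensional feature space with human-annotated labels.
points = [(0.1, 0.2), (0.0, 0.4), (0.9, 0.8), (1.0, 0.7)]
labels = ["circle", "circle", "square", "square"]
print(knn_classify(points, labels, (0.2, 0.1)))  # -> "circle"
```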
9 Machine Learning Examples
- Character Recognition
  - Feature space: 256 pixels (0 = black, 1 = white)
  - Labels: A, B, C, ...
- Cardiac Arrhythmia
  - Feature space: age, sex, heart rate, ...
  - Labels: has arrhythmia, doesn't have arrhythmia
- Mushrooms
  - Feature space: cap shape, gill color, stalk surface, ...
  - Labels: poisonous, edible
- ... and many more
  - http://www.ics.uci.edu/~mlearn/MLRepository.html
10 Machine Learning and Language
- Example: Identifying people, places, organizations (named entities)
  - However, we've witnessed a very strong opposition to this project both in [ORG Congress] and [ORG White House]. President [PER George Bush] signed a bill prohibiting financing of this railroad.
- This doesn't look like that lines-and-dots example!
  - What's the classification problem?
  - What's the feature space?
11 Named Entities: Classification
- Word-by-word classification
  - Is the word at the beginning, inside, or outside of a named entity?
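The standard way to encode this word-by-word view is BIO tagging: each token gets a label such as B-PER (begins a person), I-PER (inside one), or O (outside any entity). A small illustrative sketch, with the token and label lists filled in by hand rather than by a classifier:

```python
# BIO encoding of a fragment of the example sentence; in practice the labels
# come from a trained classifier, one decision per word.
tokens = ["President", "George", "Bush", "signed", "a", "bill"]
labels = ["O", "B-PER", "I-PER", "O", "O", "O"]

def decode_entities(tokens, labels):
    """Group B-/I- labeled tokens back into (entity_type, text) spans."""
    entities, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

print(decode_entities(tokens, labels))  # -> [('PER', 'George Bush')]
```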
12 Named Entities: Clues
- The word itself
  - "U.S." is always a Location
  - (though "Turkey" is not)
- Part of speech
  - The Locations "Turkey" and "Georgia" are nouns
  - (though the "White" of "White House" is not)
- Is the first letter of the word capitalized?
  - "Bush" and "Congress" are capitalized
  - (though the "von" of "von Neumann" is not)
- Is the word at the start of the sentence?
  - In the middle of a sentence, "Will" is likely a Person
  - (but at the start it could be an auxiliary verb)
13 Named Entities: Clues as Features
- Each clue defines part of the feature space
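For instance, each word can be mapped to a small record of clue values. A hedged sketch (the function and field names are mine, not from the slides):

```python
def word_clues(tokens, pos_tags, i):
    """Collect the clue values for token i: the word, its part of speech,
    its capitalization, and whether it starts the sentence."""
    return {
        "word": tokens[i],
        "pos": pos_tags[i],
        "is_capitalized": tokens[i][:1].isupper(),
        "starts_sentence": i == 0,
    }

tokens = ["President", "George", "Bush", "signed", "a", "bill"]
pos_tags = ["NNP", "NNP", "NNP", "VBD", "DT", "NN"]
print(word_clues(tokens, pos_tags, 2))
# {'word': 'Bush', 'pos': 'NNP', 'is_capitalized': True, 'starts_sentence': False}
```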
14 Named Entities: String Features
- But machine learning models need numeric features!
  - True -> 1
  - False -> 0
  - "Congress" -> ?
  - adjective -> ?
- Solution
  - Binary feature for each word
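One common realization of this solution (my sketch, not necessarily the exact scheme from the talk) is one-hot encoding: reserve one binary dimension per vocabulary word and per part-of-speech tag, plus one each for the two boolean clues.

```python
def encode(clues, vocabulary, pos_tags):
    """Turn string-valued clues into a flat 0/1 feature vector: one slot per
    known word, one per POS tag, plus the two boolean clues."""
    word_part = [1 if clues["word"] == w else 0 for w in vocabulary]
    pos_part = [1 if clues["pos"] == p else 0 for p in pos_tags]
    return word_part + pos_part + [
        int(clues["is_capitalized"]),
        int(clues["starts_sentence"]),
    ]

vocabulary = ["Bush", "Congress", "bill", "signed"]  # 10,000+ words in practice
pos_tags = ["NNP", "VBD", "DT", "NN"]                # ~50 tags in practice
clues = {"word": "Congress", "pos": "NNP",
         "is_capitalized": True, "starts_sentence": False}
print(encode(clues, vocabulary, pos_tags))
# [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
```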
15 Named Entities: Review
[ORG Congress] and [ORG White House]
16 Named Entities: Features and Models
- String features
- word itself
- part of speech
- starts sentence
- has initial capitalization
- How many numeric features?
  - N = N_words + N_parts-of-speech + 1 + 1
  - N_words ≈ 10,000
  - N_parts-of-speech ≈ 50
  - so N ≈ 10,052
- Need efficient implementations, e.g. TinySVM
17 Named Entities in Use
- We know how to
- View named entity recognition as classification
- Convert clues to an N-dimensional feature space
- Train a machine learning model
- How can we use the model?
18 Named Entities in Search Engines
19 Named Entities in Research
- TREC-QA
  - Factoid question answering
  - Various research systems compete
  - All use named entity matching
- State of the art performance: 90%
  - That's 10% wrong!
  - But good enough for real use
- Named entities are a solved problem
  - So what's next?
20 Learning Timelines
- The top commander of a Cambodian resistance force
said Thursday he has sent a team to recover the
remains of a British mine removal expert
kidnapped and presumed killed by Khmer Rouge
guerrillas almost two years ago.
23 Why Learn Timelines?
- Timelines are summarization
  - 1996: Khmer Rouge kidnapped and killed British mine removal expert
  - 1998: Cambodian commander sent recovery team
  - ...
- Timelines allow reasoning
  - Q: When was the expert kidnapped?  A: Almost two years ago.
  - Q: Was the team sent before the expert was killed?  A: No, afterwards.
24 Learning Timelines: Classification
- Standard questions
  - What's the classification problem?
  - What's the feature space?
- Three different problems
- Identify times
- Identify events
- Identify links (temporal relations)
25 Times and Events: Classification
- Word-by-word classification
- Time features
  - word itself
  - has digits
  - ...
- Event features
  - word itself
  - suffixes (e.g. -ize, -tion)
  - root (e.g. evasion -> evade)
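A hedged sketch of such per-word time and event features (the digit and suffix checks follow the slide; the crude suffix-stripping "root" is my stand-in for a real morphological analyzer):

```python
import re

SUFFIXES = ("-ize", "-tion", "-ed", "-ing")

def time_features(word):
    # Clues for time expressions: the word itself and whether it contains digits.
    return {"word": word, "has_digits": bool(re.search(r"\d", word))}

def event_features(word):
    # Clues for events: the word, any telltale suffix, and a crude root form.
    suffix = next((s for s in SUFFIXES if word.endswith(s.lstrip("-"))), None)
    root = word[: -len(suffix.lstrip("-"))] if suffix else word  # rough stemmer stand-in
    return {"word": word, "suffix": suffix, "root": root}

print(time_features("2007"))           # {'word': '2007', 'has_digits': True}
print(event_features("construction"))  # {'word': 'construction', 'suffix': '-tion', 'root': 'construc'}
```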
26 Times and Events: State of the Art
- Performance
  - Times: 90%
  - Events: 80%
- Mr Bryza, it's been [Event reported] that Azerbaijan, Georgia and Turkey [Event plan] to [Event start] [Event construction] of Kars-Akhalkalaki-Tbilisi-Baku railroad in [Time May], [Time 2007].
- Why are events harder?
  - No orthographic cues (capitalization, digits, etc.)
  - More parts of speech (nouns, verbs and adjectives)
27 Temporal Links
- Everything so far looked like
  - Aaaa X bb ccccc Y dd eeeee fff Z gggg
- But now we want relations between those pieces
- Word-by-word classification won't work!
28 Temporal Links: Classification
- Pairwise classification
  - Each event paired with each time (see the pairing sketch below)
- Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory and [Event release] soldiers [Event captured] during the Iran-Iraq [Event war].
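A minimal sketch of setting up the pairwise view (the names are mine): enumerate every (event, time) pair in the sentence, then let a classifier assign each pair a relation such as BEFORE, DURING, or AFTER.

```python
from itertools import product

events = ["sought", "peace", "promising", "withdraw", "release", "captured", "war"]
times = ["today"]

# One classification instance per (event, time) pair; the label to predict
# is the temporal relation, e.g. BEFORE / DURING / AFTER.
instances = list(product(events, times))
for event, time in instances:
    print(f"classify relation({event!r}, {time!r})")
```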
29 Temporal Links: Clues
- Tense of the event
  - "said" (past tense) is probably Before today
  - "says" (present tense) is probably During today
- Nearby temporal expression
  - In "said today", "said" is During today
  - In "captured in 1989", "captured" is During 1989
- Negativity
  - In "People believe this", "believe" is During today
  - In "People don't believe this any more", "believe" is Before today
30 Temporal Links: Features
- Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory
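Here is a hedged sketch (my own simplification, not the talk's feature set) of turning the clues above into features for one (event, time) pair: a crude tense guess, whether the time expression is adjacent to the event, and whether the event is negated.

```python
def pair_features(tokens, event_index, time_index):
    """Clue features for one (event, time) pair: a rough tense guess from the
    event word's ending, adjacency of the time expression, and nearby negation."""
    event = tokens[event_index]
    if event.endswith("ed"):
        tense = "past"
    elif event.endswith("s"):
        tense = "present"
    else:
        tense = "other"
    return {
        "event": event,
        "tense": tense,
        "time_adjacent": abs(event_index - time_index) == 1,
        "negated": any(t in ("not", "don't", "never")
                       for t in tokens[max(0, event_index - 2):event_index]),
    }

tokens = ["Saddam", "Hussein", "today", "sought", "peace"]
print(pair_features(tokens, event_index=3, time_index=2))
# {'event': 'sought', 'tense': 'other', 'time_adjacent': True, 'negated': False}
```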
31 Temporal Links: State of the Art
- Corpora with temporal links
  - PropBank: verbs and subjects/objects
  - TimeBank: certain pairs of events (e.g. reporting event and event reported)
  - TempEval A: events and times in the same sentence
  - TempEval B: events in a document and the document time
- Performance on TempEval data
  - Same-sentence links (A): 60%
  - Document-time links (B): 80%
32 What will make timelines better?
- Larger corpora
- TempEval is only 50 documents
- Treebank is 2400
- More types of links
- Event-time pairs for all events
- TempEval only considers high-frequency events
- Event-event pairs in the same sentence
33 Summary
- Statistical NLP asks
  - What's the classification problem?
    - Word-by-word?
    - Pairwise?
  - What's the feature space?
    - What are the linguistic clues?
    - What does the N-dimensional space look like?
- Statistical NLP needs
  - Learning algorithms efficient when N is very large
  - Large-scale corpora with linguistic labels
34 Future Work: Automate this!
35 References
- Symbolic NLP
  - Terry Winograd. 1972. Understanding Natural Language. Academic Press.
- Statistical NLP
  - Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. An Algorithm that Learns What's in a Name. Machine Learning.
  - Kadri Hacioglu, Ying Chen, and Benjamin Douglas. 2005. Automatic Time Expression Labeling for English and Chinese Text. In Proceedings of CICLing-2005.
  - Ellen M. Voorhees and Hoa Trang Dang. 2005. Overview of the TREC 2005 Question Answering Track. In Proceedings of the Fourteenth Text REtrieval Conference.
36 References
- Corpora
  - Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.
  - Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: A Corpus Annotated with Semantic Roles. Computational Linguistics, 31(1).
  - James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TIMEBANK Corpus. In Proceedings of Corpus Linguistics 2003, 647-656.
38 Feature Windowing (1)
- Problem
- Word-by-word gives no context
- Solution
- Include surrounding features
39 Feature Windowing (2)
- From previous word: features, label
- From current word: features
- From following word: features
- Need special values like !START! and !END!
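A minimal sketch of windowing (function and variable names are mine): each word's features are augmented with the previous word's features and already-predicted label and the next word's features, padding with !START!/!END! at the sentence edges.

```python
def windowed_features(words, predicted_labels, i):
    """Combine features of word i with those of its neighbors.
    predicted_labels holds the labels already assigned to earlier words."""
    prev_word = words[i - 1] if i > 0 else "!START!"
    prev_label = predicted_labels[i - 1] if i > 0 else "!START!"
    next_word = words[i + 1] if i + 1 < len(words) else "!END!"
    return {
        "prev_word": prev_word,
        "prev_label": prev_label,
        "word": words[i],
        "next_word": next_word,
    }

words = ["President", "George", "Bush", "signed"]
labels_so_far = ["O", "B-PER"]
print(windowed_features(words, labels_so_far, 2))
# {'prev_word': 'George', 'prev_label': 'B-PER', 'word': 'Bush', 'next_word': 'signed'}
```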
40 Evaluation: Precision, Recall, F
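As a hedged sketch of the standard definitions behind this slide title: precision is the fraction of predicted annotations that are correct, recall is the fraction of true annotations that were found, and F is their harmonic mean.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall, and F1 over sets of predicted vs. gold annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: two of three predicted entities are correct, one gold entity missed.
print(precision_recall_f({"George Bush", "Congress", "Georgia"},
                         {"George Bush", "Congress", "White House"}))
# -> (0.666..., 0.666..., 0.666...)
```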