Meaning from Text: Teaching Computers to Read

Provided by: stevenb1
Learn more at: http://wiki.western.edu

1
Meaning from Text: Teaching Computers to Read
  • Steven Bethard
  • University of Colorado

2
Query: Who is opposing the railroad through Georgia?
  • 1 en.wikipedia.org/wiki/Sherman's_March_to_the_Sea
    ... they destroyed the railroads and the
    manufacturing and agricultural infrastructure of
    the state ... Henry Clay Work wrote the song
    "Marching Through Georgia" ...
  • 3 www.ischool.berkeley.edu/mkduggan/politics.html
    While the piano piece "Marching Through Georgia"
    has no words ... Party of California (1882) has
    several verses opposing the "railroad robbers" ...
  • 71 www.azconsulatela.org/brazaosce.htm
    Azerbaijan, Georgia and Turkey plan to start
    construction of Kars-Akhalkalaki-Tbilisi-Baku
    railroad in May, 2007 ... However, we've witnessed
    a very strong opposition to this project both in
    Congress and White House. President George Bush
    signed a bill prohibiting financing of this
    railroad ...

3
What went wrong?
  • Didn't find some similar word forms (Morphology)
    • Finds "opposing" but not "opposition"
    • Finds "railroad" but not "railway"
  • Didn't know how words should be related (Syntax)
    • Looking for "opposing railroad"
    • Finds "opposing the railroad robbers"
  • Didn't know that "is opposing" means current
    (Semantics/Tense)
    • Looking for recent documents
    • Finds Civil War documents
  • Didn't know that "who" means a person
    (Semantics/Entities)
    • Looking for "opposing"
    • Finds "several verses opposing"

4
Teaching Linguistics to Computers
  • Natural Language Processing (NLP)
    • Symbolic approaches
    • Statistical approaches
  • Machine learning overview
  • Statistical NLP
    • Example: Identifying people and places
    • Example: Constructing timelines

5
Early Natural Language Processing
  • Symbolic approaches
  • Small domains
  • Example: SHRDLU block world
    • Vocabulary of 50 words
    • Simple word combinations
    • Hand-written rules to understand sentences

Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Computer: SEVEN OF THEM.

6
Recent Natural Language Processing
  • Large-scale linguistic corpora
    • e.g. Penn TreeBank: a million words of syntax
  • Statistical machine learning
    • e.g. Charniak parser
      • Trained on the TreeBank
      • Builds new trees with 90% accuracy

7
Machine Learning
  • General approach
    • Analyze data
    • Extract preferences
    • Classify new examples using learned preferences
  • Supervised machine learning
    • Data have human-annotated labels
      • e.g. each sentence in the TreeBank has a
        syntactic tree
    • Learns human preferences

8
Supervised Machine Learning Models
  • Given
    • An N-dimensional feature space
    • Points in that space
    • A human-annotated label for each point
  • Goal
    • Learn a function to assign labels to points
  • Methods
    • K-nearest-neighbors, support vector machines, etc.
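
As a rough illustration of "learn a function to assign labels to points", here is a minimal K-nearest-neighbors sketch. The 2-dimensional points and the labels "A" and "B" are made up for the example; real feature spaces have far more dimensions.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labeled point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two clusters in a 2-dimensional feature space (illustrative only).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, (1.2, 1.4)))
print(knn_classify(points, labels, (6.4, 6.1)))
```

The "learned function" here is simply the stored training data plus the voting rule; a support vector machine would instead learn a separating boundary.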

9
Machine Learning Examples
  • Character Recognition
    • Feature space: 256 pixels (0 = black, 1 = white)
    • Labels: A, B, C, ...
  • Cardiac Arrhythmia
    • Feature space: age, sex, heart rate, ...
    • Labels: has arrhythmia, doesn't have arrhythmia
  • Mushrooms
    • Feature space: cap shape, gill color, stalk
      surface, ...
    • Labels: poisonous, edible
  • ... and many more
    • http://www.ics.uci.edu/~mlearn/MLRepository.html

10
Machine Learning and Language
  • Example
    • Identifying people, places, organizations (named
      entities)
    • However, we've witnessed a very strong opposition
      to this project both in [ORG Congress] and
      [ORG White House]. President [PER George Bush]
      signed a bill prohibiting financing of this
      railroad.
  • This doesn't look like that lines-and-dots
    example!
    • What's the classification problem?
    • What's the feature space?

11
Named Entities: Classification
  • Word-by-word classification
    • Is the word at the beginning, inside or outside
      of a named entity?
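
A small sketch of what word-by-word labels could look like for one sentence. The tag names ("B-PER", "I-PER", "O") follow a common convention and are an assumption here, as is the `(start, end, type)` span format; the slide only says "beginning, inside or outside".

```python
def bio_labels(tokens, entities):
    """Turn entity spans into one beginning/inside/outside label per token.

    `entities` is a list of (start, end, type) spans, with `end` exclusive.
    """
    labels = ["O"] * len(tokens)  # every word starts as "outside"
    for start, end, entity_type in entities:
        labels[start] = "B-" + entity_type  # first word of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + entity_type  # remaining words of the entity
    return labels


tokens = ["President", "George", "Bush", "signed", "a", "bill"]
entities = [(1, 3, "PER")]  # "George Bush" is a person

for token, label in zip(tokens, bio_labels(tokens, entities)):
    print(f"{token}\t{label}")
```

With this encoding, finding named entities becomes the task of predicting one label for each word.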

12
Named Entities: Clues
  • The word itself
    • "U.S." is always a Location
    • (though "Turkey" is not)
  • Part of speech
    • The Locations "Turkey" and "Georgia" are nouns
    • (though the "White" of "White House" is not)
  • Is the first letter of the word capitalized?
    • "Bush" and "Congress" are capitalized
    • (though the "von" of "von Neumann" is not)
  • Is the word at the start of the sentence?
    • In the middle of a sentence, "Will" is likely a
      Person
    • (but at the start it could be an auxiliary verb)

13
Named Entities: Clues as Features
  • Each clue defines part of the feature space

14
Named Entities: String Features
  • But machine learning models need numeric
    features!
    • True → 1
    • False → 0
    • "Congress" → ?
    • "adjective" → ?
  • Solution
    • Binary feature for each word
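
A minimal sketch of the "binary feature for each word" idea, often called one-hot encoding. The tiny vocabularies below are illustrative assumptions; a real system would build them from the training corpus.

```python
def one_hot(value, vocabulary):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == item else 0 for item in vocabulary]


# Toy vocabularies (illustrative only).
words = ["Bush", "Congress", "House", "White", "railroad"]
parts_of_speech = ["adjective", "noun", "verb"]

print(one_hot("Congress", words))
print(one_hot("adjective", parts_of_speech))
```

Each string value gets its own dimension, so a word the vocabulary has never seen maps to all zeros.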

15
Named Entities: Review
[ORG Congress] and [ORG White House]
16
Named Entities: Features and Models
  • String features
    • word itself
    • part of speech
    • starts sentence
    • has initial capitalization
  • How many numeric features?
    • N = Nwords + Nparts-of-speech + 1 + 1
    • Nwords ≈ 10,000
    • Nparts-of-speech ≈ 50
  • Need efficient implementations, e.g. TinySVM
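
Reading the count on this slide as N = Nwords + Nparts-of-speech + 1 + 1, the figures given work out to 10,000 + 50 + 1 + 1 = 10,052 dimensions, of which only a handful are non-zero for any one word. The sketch below stores just the non-zero positions. The index layout and the toy vocabularies are assumptions made for illustration, and this is not how TinySVM represents its input.

```python
def feature_indices(word, pos, starts_sentence, word_index, pos_index):
    """Return the positions of the non-zero features for one token.

    Assumed layout: [one slot per word | one slot per part of speech |
                     starts-sentence flag | capitalized flag]
    """
    n_words, n_pos = len(word_index), len(pos_index)
    active = []
    if word in word_index:
        active.append(word_index[word])
    if pos in pos_index:
        active.append(n_words + pos_index[pos])
    if starts_sentence:
        active.append(n_words + n_pos)
    if word[:1].isupper():
        active.append(n_words + n_pos + 1)
    return active


# Toy vocabularies (illustrative only).
word_index = {w: i for i, w in enumerate(["bill", "Bush", "Congress", "signed"])}
pos_index = {p: i for i, p in enumerate(["noun", "verb"])}

n_features = len(word_index) + len(pos_index) + 1 + 1
print(n_features)
print(feature_indices("Congress", "noun", False, word_index, pos_index))
```

Storing only the active indices is one common way to keep a very large N manageable.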

17
Named Entities in Use
  • We know how to
    • View named entity recognition as classification
    • Convert clues to an N-dimensional feature space
    • Train a machine learning model
  • How can we use the model?

18
Named Entities in Search Engines
19
Named Entities in Research
  • TREC-QA
    • Factoid question answering
    • Various research systems compete
    • All use named entity matching
  • State of the art performance: 90%
    • That's 10% wrong!
    • But good enough for real use
  • Named entities are a solved problem
    • So what's next?

20
Learning Timelines
  • The top commander of a Cambodian resistance force
    said Thursday he has sent a team to recover the
    remains of a British mine removal expert
    kidnapped and presumed killed by Khmer Rouge
    guerrillas almost two years ago.

23
Why Learn Timelines?
  • Timelines are summarization
    • 1996
      • Khmer Rouge kidnapped and killed British mine
        removal expert
    • 1998
      • Cambodian commander sent recovery team
  • Timelines allow reasoning
    • Q: When was the expert kidnapped?
      A: Almost two years ago.
    • Q: Was the team sent before the expert was
      killed?
      A: No, afterwards.

24
Learning Timelines: Classification
  • Standard questions
    • What's the classification problem?
    • What's the feature space?
  • Three different problems
    • Identify times
    • Identify events
    • Identify links (temporal relations)

25
Times and Events: Classification
  • Word-by-word classification
  • Time features
    • word itself
    • has digits
  • Event features
    • word itself
    • suffixes (e.g. -ize, -tion)
    • root (e.g. evasion → evade)
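
The time and event features listed above could be extracted roughly as follows. This is a sketch: the suffix list contains only the two suffixes named on the slide, and the `roots` lookup table is a hypothetical stand-in for a real morphological analyzer.

```python
EVENT_SUFFIXES = ("ize", "tion")  # only the two suffixes named on the slide


def time_features(word):
    """Features for deciding whether a word is part of a time expression."""
    return {
        "word": word.lower(),
        "has_digits": any(ch.isdigit() for ch in word),
    }


def event_features(word, roots):
    """Features for deciding whether a word is an event."""
    lower = word.lower()
    # First listed suffix that the word ends with, or None if there is none.
    suffix = next((s for s in EVENT_SUFFIXES if lower.endswith(s)), None)
    return {
        "word": lower,
        "suffix": suffix,
        "root": roots.get(lower, lower),  # fall back to the word itself
    }


roots = {"evasion": "evade"}  # hypothetical root lookup

print(time_features("2007"))
print(event_features("construction", roots))
print(event_features("evasion", roots))
```

These string-valued features would still need to be turned into binary features before a model can use them.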

26
Times and Events: State of the Art
  • Performance
    • Times: 90%
    • Events: 80%
  • Mr Bryza, it's been [Event reported] that
    Azerbaijan, Georgia and Turkey [Event plan] to
    [Event start] [Event construction] of
    Kars-Akhalkalaki-Tbilisi-Baku railroad in
    [Time May], [Time 2007].
  • Why are events harder?
    • No orthographic cues (capitalization, digits,
      etc.)
    • More parts of speech (nouns, verbs and adjectives)

27
Temporal Links
  • Everything so far looked like
    • Aaaa X bb ccccc Y dd eeeee fff Z gggg
  • But now we want this
  • Word-by-word classification won't work!

28
Temporal Links: Classification
  • Pairwise classification
    • Each event with each time
  • Saddam Hussein [Time today] [Event sought]
    [Event peace] on another front by
    [Event promising] to [Event withdraw] from
    Iranian territory and [Event release] soldiers
    [Event captured] during the Iran-Iraq [Event war].
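
A sketch of how the pairwise candidates for the example sentence above might be generated: each event is paired with each time, and a classifier would then label every pair with a relation. The `(kind, text)` annotation format is an assumption made for illustration.

```python
from itertools import product


def candidate_links(annotations):
    """Pair every event with every time expression in one sentence."""
    events = [text for kind, text in annotations if kind == "Event"]
    times = [text for kind, text in annotations if kind == "Time"]
    return list(product(events, times))


# Times and events from the Saddam Hussein example sentence.
annotations = [
    ("Time", "today"),
    ("Event", "sought"),
    ("Event", "peace"),
    ("Event", "promising"),
    ("Event", "withdraw"),
    ("Event", "release"),
    ("Event", "captured"),
    ("Event", "war"),
]

pairs = candidate_links(annotations)
for event, time in pairs:
    print(f"({event}, {time}) -> ?")
```

Each pair, not each word, is now one point to classify.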

29
Temporal Links: Clues
  • Tense of the event
    • "said" (past tense) is probably Before "today"
    • "says" (present tense) is probably During "today"
  • Nearby temporal expression
    • In "said today", "said" is During "today"
    • In "captured in 1989", "captured" is During "1989"
  • Negativity
    • In "People believe this", "believe" is During "today"
    • In "People don't believe this any more", "believe"
      is Before "today"

30
Temporal Links: Features
Saddam Hussein [Time today] [Event sought]
[Event peace] on another front by
[Event promising] to [Event withdraw] from
Iranian territory
31
Temporal Links: State of the Art
  • Corpora with temporal links
    • PropBank: verbs and subjects/objects
    • TimeBank: certain pairs of events (e.g.
      reporting event and event reported)
    • TempEval A: events and times in the same sentence
    • TempEval B: events in a document and document
      time
  • Performance on TempEval data
    • Same-sentence links (A): 60%
    • Document time links (B): 80%

32
What will make timelines better?
  • Larger corpora
    • TempEval is only 50 documents
    • Treebank is 2400
  • More types of links
    • Event-time pairs for all events
      • TempEval only considers high-frequency events
    • Event-event pairs in the same sentence

33
Summary
  • Statistical NLP asks
    • What's the classification problem?
      • Word-by-word?
      • Pairwise?
    • What's the feature space?
      • What are the linguistic clues?
      • What does the N-dimensional space look like?
  • Statistical NLP needs
    • Learning algorithms efficient when N is very
      large
    • Large-scale corpora with linguistic labels
34
Future Work: Automate this!
35
References
  • Symbolic NLP
    • Terry Winograd. 1972. Understanding Natural
      Language. Academic Press.
  • Statistical NLP
    • Daniel M. Bikel, Richard Schwartz, and Ralph M.
      Weischedel. 1999. An Algorithm that Learns
      What's in a Name. Machine Learning.
    • Kadri Hacioglu, Ying Chen, and Benjamin Douglas.
      2005. Automatic Time Expression Labeling for
      English and Chinese Text. In Proceedings of
      CICLing-2005.
    • Ellen M. Voorhees and Hoa Trang Dang. 2005.
      Overview of the TREC 2005 Question Answering
      Track. In Proceedings of The Fourteenth Text
      REtrieval Conference.

36
References
  • Corpora
    • Mitchell P. Marcus, Beatrice Santorini, and Mary
      Ann Marcinkiewicz. 1993. Building a Large
      Annotated Corpus of English: The Penn Treebank.
      Computational Linguistics, 19(2):313-330.
    • Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005.
      The Proposition Bank: A Corpus Annotated with
      Semantic Roles. Computational Linguistics
      Journal, 31(1).
    • James Pustejovsky, Patrick Hanks, Roser Saurí,
      Andrew See, Robert Gaizauskas, Andrea Setzer,
      Dragomir Radev, Beth Sundheim, David Day, Lisa
      Ferro, and Marcia Lazo. 2003. The TIMEBANK Corpus.
      Proceedings of Corpus Linguistics 2003: 647-656.

38
Feature Windowing (1)
  • Problem
    • Word-by-word gives no context
  • Solution
    • Include surrounding features

39
Feature Windowing (2)
  • From previous word: features, label
  • From current word: features
  • From following word: features
  • Need special values like !START! and !END!
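
A minimal sketch of feature windowing with a window of one word on each side. For brevity the "features" of each word are just the word itself, which is a simplifying assumption; the padding values !START! and !END! are the ones named on the slide.

```python
def window_features(token_features, previous_labels, i):
    """Build the features for token i from itself and its two neighbors.

    `previous_labels` holds the labels already assigned to tokens 0..i-1.
    """
    def at(j):
        # Pad with special values when the window runs off either end.
        if j < 0:
            return "!START!"
        if j >= len(token_features):
            return "!END!"
        return token_features[j]

    return {
        "prev_features": at(i - 1),
        "prev_label": previous_labels[i - 1] if i > 0 else "!START!",
        "curr_features": at(i),
        "next_features": at(i + 1),
    }


tokens = ["George", "Bush", "signed"]

print(window_features(tokens, [], 0))
print(window_features(tokens, ["B"], 1))
print(window_features(tokens, ["B", "I"], 2))
```

Because the previous word's label is part of the window, words have to be classified left to right.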

40
Evaluation: Precision, Recall, F
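
The transcript gives only the title of this slide. As a sketch using the standard definitions, precision is the fraction of predicted items that are correct, recall is the fraction of gold items that were found, and F is taken here to be their harmonic mean (the balanced F1, which is an assumption). The entity spans below are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compare predicted items against gold items, counting exact matches."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start, end, type) entity spans.
gold = {(5, 6, "ORG"), (7, 9, "ORG"), (10, 12, "PER")}
predicted = {(5, 6, "ORG"), (10, 12, "PER"), (13, 14, "LOC")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```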