Title: CSA2050: Natural Language Processing
1CSA2050 Natural Language Processing
- Information Extraction
- Named Entities
- IE Systems
- MUC
- Finite State Machines
- Pattern Recognition
2Classification at different granularities
- Text Categorization
- Classify an entire document
- Information Extraction (IE)
- Identify and classify small units within documents
- Named Entity Extraction (NE)
- A subset of IE
- Identify and classify proper names
- People, locations, organizations
3Martin Baker, a person
Genomics job
Employers job posting form
4Aggregator Websites
6Aggregator Websites
- Read in many web pages from different sites
- Extract information into a database
- Screen Scraping
- Can then return data matching particular queries
- Data mining can extract meaningful insight that
might not have been obvious
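The aggregator steps above (read pages, extract into a database, answer queries) can be sketched as follows. This is a minimal illustration only: the HTML markup, the `title` and `city` class names, the `JobPostingParser` class and the sample pages are all hypothetical, and a real scraper would need per-site rules.

```python
import sqlite3
from html.parser import HTMLParser


class JobPostingParser(HTMLParser):
    """Collects the text of <span class="title"> and <span class="city">.

    The markup is an assumption made for this sketch, not a real site layout.
    """

    def __init__(self):
        super().__init__()
        self.field = None   # name of the field currently being read, if any
        self.record = {}    # extracted slot -> value

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "city"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.record[self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None


def extract(html):
    """Screen-scrape one page into a dict of slots."""
    parser = JobPostingParser()
    parser.feed(html)
    return parser.record


# Hypothetical pages standing in for postings read from different sites.
PAGES = [
    '<div><span class="title">Genomics Analyst</span> in <span class="city">Boston</span></div>',
    '<div><span class="title">Web Developer</span> in <span class="city">Austin</span></div>',
]

# Extract information into a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (title TEXT, city TEXT)")
for page in PAGES:
    rec = extract(page)
    db.execute("INSERT INTO jobs VALUES (?, ?)", (rec.get("title"), rec.get("city")))

# Return data matching a particular query.
rows = db.execute("SELECT title FROM jobs WHERE city = ?", ("Boston",)).fetchall()
print(rows)
```

Once the slots are in a table, any further querying or data mining works on structured rows rather than on raw pages.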
8Data Mining
9IE from Research Papers
10IE from Commercial Websites
11What is Information Extraction?
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT
For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
12What is Information Extraction?
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
13What is Information Extraction?
As a family of techniques
Information Extraction = segmentation + classification + association
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
aka named entity extraction
16IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment Classify Associate Cluster
Database
Load DB
Query, Search
Document collection
Train extraction models
Data mine
Label training data
17IE in Context Formatting
18IE in Context Formatting
19IE in Context Coverage
Web site specific
Genre specific
Formatting
Layout
Amazon.com Book Pages
Resumes
20IE in Context Coverage
Wide, non-specific
Language
University Names
21IE in Context Complexity
22IE in Context Single Field/Record
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity (named entity extraction)
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut
Binary relationship
Relation: Person-Title, Person: Jack Welch, Title: CEO
Relation: Company-Location, Company: General Electric, Location: Connecticut
N-ary record
Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt
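The three levels of complexity (single entity, binary relationship, n-ary record) can be written down as data structures. The sketch below is one possible representation, assuming hypothetical `Entity` and `Relation` classes; it only encodes the Jack Welch example and does no extraction itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A single extracted entity, e.g. Person: Jack Welch."""
    type: str
    text: str


@dataclass(frozen=True)
class Relation:
    """A named relation over any number of (role, Entity) arguments."""
    name: str
    args: tuple


# Single entities
welch = Entity("Person", "Jack Welch")
immelt = Entity("Person", "Jeffrey Immelt")
ge = Entity("Company", "General Electric")
ceo = Entity("Title", "CEO")
connecticut = Entity("Location", "Connecticut")

# Binary relationships: exactly two arguments
person_title = Relation("Person-Title", (("Person", welch), ("Title", ceo)))
company_location = Relation(
    "Company-Location", (("Company", ge), ("Location", connecticut))
)

# N-ary record: one relation tying four slots together
succession = Relation(
    "Succession",
    (("Company", ge), ("Title", ceo), ("Out", welch), ("In", immelt)),
)

print(dict(succession.args)["In"].text)
```

The same `Relation` type covers both the binary and the n-ary case; only the number of filled roles differs.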
23State of the Art
- Named entity recognition from newswire text
- Person, Location, Organization,
- F1 in high 80s or low- to mid-90s
- Binary relation extraction
- Contained-in (Location1, Location2)
- Member-of (Person1, Organization1)
- F1 in 60s or 70s or 80s
- Web site structure recognition
- Extremely accurate performance obtainable
- Human effort (10min?) required on each site
24IE Generations
- Hand-Built Systems: Knowledge Engineering (1980s)
- Rules written by hand
- Require experts who understand both the systems and the domain
- Iterative guess-test-tweak-repeat cycle
- Automatic, Trainable Rule-Extraction Systems (1990s)
- Rules discovered automatically using predefined templates, using automated rule learners
- Require huge, labeled corpora (effort is just moved!)
- Statistical Models (1997)
- Use machine learning to learn which features indicate boundaries and types of entities.
- Learning usually supervised; may be partially unsupervised
25IE Techniques
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama Alaska Wisconsin Wyoming
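The lexicon technique on this slide amounts to a membership test of each token against a list. A minimal sketch follows, assuming a deliberately tiny, hypothetical `US_STATES` lexicon and a hypothetical `tag_with_lexicon` helper; it splits on whitespace, so multi-word entries are not handled.

```python
# Hypothetical, deliberately incomplete lexicon of US state names.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}


def tag_with_lexicon(sentence, lexicon, label):
    """Return (token, label) for every token that is a member of the lexicon."""
    tokens = sentence.rstrip(".").split()
    return [(tok, label) for tok in tokens if tok in lexicon]


matches = tag_with_lexicon(
    "Abraham Lincoln was born in Kentucky.", US_STATES, "LOCATION"
)
print(matches)  # [('Kentucky', 'LOCATION')]
```

A lookup like this is cheap but only as good as the list: it misses names that are not listed and cannot tell apart ambiguous ones.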
26Trainable IE Systems
- Pros
- Annotating text is simpler and faster than writing rules.
- Domain independent
- Domain experts don't need to be linguists or programmers.
- Learning algorithms ensure full coverage of examples.
- Cons
- Hand-crafted systems perform better, especially at hard tasks (but this is changing).
- Training data might be expensive to acquire
- May need huge amount of training data
- Hand-writing rules isn't that hard!!
27MUC Genesis of IE
- DARPA funded significant efforts in IE in the early to mid 1990s.
- Message Understanding Conference (MUC) was an annual event/competition where results were presented.
- Focused on extracting information from news articles
- Terrorist events
- Industrial joint ventures
- Company management changes
- Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early 90s)
28MUC
- Named entity
- Person, Organization, Location
- Co-reference
- Clinton ↔ President Bill Clinton
- Template element
- Perpetrator, Target
- Template relation
- Incident
- Multilingual
29MUC Typical Text
- Bridgestone Sports Co. said Friday it has set up
a joint venture in Taiwan with a local concern
and a Japanese trading house to produce golf
clubs to be shipped to Japan. The joint venture,
Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production
of 20,000 iron and metal wood clubs a month
31MUC Templates
- Relationship
- tie-up
- Entities
- Bridgestone Sports Co, a local concern, a Japanese trading house
- Joint venture company
- Bridgestone Sports Taiwan Co
- Activity
- ACTIVITY 1
- Amount
- NT$20,000,000
32MUC Templates
- ACTIVITY 1
- Activity
- Production
- Company
- Bridgestone Sports Taiwan Co
- Product
- Iron and metal wood clubs
- Start Date
- January 1990
33Example from Fastus (1993)
36
1. Complex Words: Recognition of multi-words and proper names
e.g. "set up", "new Taiwan dollars"
2. Basic Phrases: Simple noun groups, verb groups and particles
e.g. "a Japanese trading house", "had set up"
3. Complex Phrases: Complex noun groups and verb groups
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.
37Evaluating IE Accuracy
- Always evaluate performance on independent, manually-annotated test data not used during system development.
- Measure for each test document:
- Total number of correct extractions in the solution template: N
- Total number of slot/value pairs extracted by the system: E
- Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
- Compute average value of metrics adapted from IR:
- Recall = C/N
- Precision = C/E
- F-Measure = Harmonic mean of recall and precision
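The per-document measures N, E and C translate directly into code. The sketch below assumes templates are represented as sets of (slot, value) pairs; the `score` function name and the "system output" are hypothetical, with one slot deliberately wrong and one missing.

```python
def score(gold, extracted):
    """Precision, recall and F-measure for one document.

    gold      -- set of (slot, value) pairs in the solution template
    extracted -- set of (slot, value) pairs produced by the system
    """
    n = len(gold)               # N: correct extractions in the solution
    e = len(extracted)          # E: pairs extracted by the system
    c = len(gold & extracted)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


gold = {
    ("Activity", "Production"),
    ("Company", "Bridgestone Sports Taiwan Co"),
    ("Product", "Iron and metal wood clubs"),
    ("Start Date", "January 1990"),
}
# Hypothetical system output: Product is missing, Start Date is wrong.
extracted = {
    ("Activity", "Production"),
    ("Company", "Bridgestone Sports Taiwan Co"),
    ("Start Date", "February 1990"),
}

precision, recall, f_measure = score(gold, extracted)
print(precision, recall, f_measure)
```

Here C = 2, N = 4 and E = 3, so recall is 2/4, precision is 2/3 and the F-measure is 4/7. Averaging these values over all test documents gives the reported figures.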
38Named Entities
- Named Entities
- Person Name: Colin Powell, Frodo
- Location Name: Middle East, Aiur
- Organization: UN, DARPA
- Domain Specific vs. Open Domain
39Nymble (BBN Corporation)
- State of the art system
- Near-human performance: 90% accuracy
- Statistical system
- Approach: Hidden Markov Model (HMM)
40Nymble (BBN Corporation)
- Noisy channel paradigm
- Originally, entities were marked in the raw text
- Post noisy channel, annotation is lost
- Probability of most likely sequence of name classes (NC) given a sequence of words (W)
- Pr(NC | W) = Pr(W, NC) / Pr(W)
- Since the a priori probability of the word sequence can be considered constant for any given sentence, maximize just the numerator
41Nymble (BBN Corporation)
Person
Start of Sentence
End of Sentence
Organization
Five other classes
Not-A-Name
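Finding the most likely name-class sequence over states like these is typically done with Viterbi decoding. The sketch below is a simplified first-order HMM, not the actual Nymble model: it uses only three of the classes, and every probability in `START`, `TRANS` and `EMIT` is a made-up number chosen for illustration, not estimated from data.

```python
import math

STATES = ["PERSON", "ORGANIZATION", "NOT-A-NAME"]

# Made-up probabilities, for illustration only.
START = {"PERSON": 0.3, "ORGANIZATION": 0.3, "NOT-A-NAME": 0.4}
TRANS = {
    "PERSON": {"PERSON": 0.5, "ORGANIZATION": 0.1, "NOT-A-NAME": 0.4},
    "ORGANIZATION": {"PERSON": 0.1, "ORGANIZATION": 0.5, "NOT-A-NAME": 0.4},
    "NOT-A-NAME": {"PERSON": 0.2, "ORGANIZATION": 0.2, "NOT-A-NAME": 0.6},
}
EMIT = {
    "PERSON": {"Bill": 0.4, "Gates": 0.4},
    "ORGANIZATION": {"Microsoft": 0.8},
    "NOT-A-NAME": {"CEO": 0.3, "said": 0.3},
}
UNSEEN = 1e-6  # small probability for words a state was never seen to emit


def viterbi(words):
    """Return the name-class sequence maximizing Pr(W, NC) for non-empty words."""
    # Each column maps state -> (log prob of best path ending there, previous state)
    columns = [{
        s: (math.log(START[s]) + math.log(EMIT[s].get(words[0], UNSEEN)), None)
        for s in STATES
    }]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev, best = max(
                ((p, columns[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda pair: pair[1],
            )
            column[s] = (best + math.log(EMIT[s].get(word, UNSEEN)), prev)
        columns.append(column)

    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


tags = viterbi(["Microsoft", "CEO", "Bill", "Gates", "said"])
print(tags)
```

Working in log space avoids numerical underflow on long sentences; the search itself is the same maximization of the joint probability.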
42Automatic Content Extraction
- DARPA ACE Program
- Identify Entities
- Named: Bilbo, San Diego, UNICEF
- Nominal: the president, the hobbit
- Pronominal: she
- Reference resolution
- Clinton ↔ the president ↔ he
43Question Answering
- The over-used pipeline paradigm
Question Analysis
Information Retrieval
Answer Extraction
Question
Answer Merging
Answer
44Question Answering
- Feedback loops can be present for constraint relaxation purposes
- Not all QA systems adhere to the pipeline architecture
- Question answering flavors
- Factoid vs. complex
- Who invented paper? vs. Which of Mr. Bush's friends are Black Sabbath fans?
- Closed vs. open domain
45Answer Extraction
- The over-used pipeline paradigm
- Focus on open domain, factoid question answering
Question Analysis
Information Retrieval
Answer Extraction
Question
Answer Merging
Answer
46Practical Issues
- Web Spell Checking
- Mispling
- nucular
- Infrequent forms
- Niagra vs. Niagara, Filenes vs. Filene's
- Google QA
- Genome, video, games
47Practical Issues
- Traditional Information Extraction
- Either expert built or statistical
- Specific strategies for specific question types
- Person Bio vs. Location question types
- Ability to generalize to new questions and new
question types
48Practical Issues
- Who invented Blah?
- Blah was invented by PersonName
- Blah was Verb by PersonName
- where Verb is a synonym of invented
- Blah VerbPhrase by PersonName
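The "Blah was Verb by PersonName" pattern can be sketched with a regular expression. Everything here is an assumption for illustration: the `VERBS` synonym list is hand-picked, the `PERSON` pattern (two or more capitalized words) is a crude stand-in for a real NE tagger, and `who_invented` is a hypothetical helper.

```python
import re

# Crude stand-in for an NE tagger: two or more capitalized words in a row.
PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
# Hand-picked verbs treated as synonyms of "invented" (an assumption).
VERBS = r"(?:invented|devised|created)"


def who_invented(thing, text):
    """Answer 'Who invented <thing>?' using '<thing> was Verb by PersonName'."""
    pattern = re.compile(rf"\b{re.escape(thing)} was {VERBS} by {PERSON}")
    match = pattern.search(text)
    return match.group("person") if match else None


answer = who_invented("paper", "Historians say paper was invented by Cai Lun in China.")
print(answer)  # Cai Lun
```

Each new surface form (active voice, appositives, a general VerbPhrase) needs its own pattern, which is why generalizing to new question types is listed as a practical issue.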
49Popular Resources
- Experts and/or Learning Algorithms
- Gazetteers
- NE taggers
- Part Of Speech taggers
- Parsers
- WordNet
- Stopword list
- Stemmer