Title: Introduction to Information Extraction
1Introduction to Information Extraction
- Chia-Hui Chang
- Dept. of Computer Science and Information Engineering, National Central University, Taiwan
- chia_at_csie.ncu.edu.tw
2Problem Definition
- Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
- Input → extractor → structured output
- The output template of the IE task
- Several fields (slots)
- Several instances of a field
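As a rough illustration of the input, extractor, structured output flow, the sketch below fills a template whose slots can each hold several instances. The seminar announcement, the slot names (`speaker`, `time`), and the two regular expressions are assumptions made up for this example, not part of any MUC task.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Template:
    """Output template: each field (slot) may hold several instances."""
    speaker: List[str] = field(default_factory=list)
    time: List[str] = field(default_factory=list)


def extractor(document: str) -> Template:
    """Toy extractor: fill each slot with every match found in the input."""
    return Template(
        speaker=re.findall(r"Speaker: (.+)", document),   # rest of the line
        time=re.findall(r"\b\d{1,2}:\d{2}\b", document),  # clock times
    )


# Hypothetical input document.
DOCUMENT = "Seminar on IE\nSpeaker: A. Lee\nTime: 14:00 to 15:30\n"

result = extractor(DOCUMENT)
print(result)
```

Here the `time` slot ends up with two instances while `speaker` has one, which is the situation the template bullets describe.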
3Difficulty of IE tasks depends on
- Text type
- From plain text to semi-structured Web pages
- e.g. Wall Street Journal articles, email messages, or HTML documents
- Domain
- From financial news or tourist information to various languages
- Scenario
4Various IE Tasks
- Free-text IE
- For MUC (Message Understanding Conference)
- E.g. terrorist activities, corporate joint ventures
- Semi-structured IE
- E.g. meta-search engines, shopping agents, bio-integration systems
5Types of IE from MUC
- Named Entity recognition (NE)
- Finds and classifies names, places, etc.
- Coreference Resolution (CO)
- Identifies identity relations between entities in texts.
- Template Element construction (TE)
- Adds descriptive information to NE results.
- Scenario Template production (ST)
- Fits TE results into specified event scenarios.
6Named Entity Recognition
- http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html
7NE Recognition (Cont.)
- Spanish 93
- Japanese 92
- Chinese 84.51
8Coreference Resolution
- Coreference resolution (CO) involves identifying identity relations between entities in texts.
- For example, in
- "Alas, poor Yorick, I knew him well."
- Tie "Yorick" with "him".
- The Sheffield system scored 51% recall and 71% precision.
- http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
9Template Element Production
- Adds descriptive information to named entities
- Sheffield system scores 71
10Scenario Template Extraction
- STs are the prototypical outputs of IE systems
- They tie together TE entities into event and relation descriptions.
- Performance for Sheffield: 49
- http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html
11Example
- The operational domains that user interests are centered around are drug enforcement, money laundering, organized crime, terrorism, ...
- 1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation
- 2. NE recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...)
- 3. TE associates certain types of descriptive information with these entities, e.g. the location of companies
- 4. ST identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
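Steps 2 to 4 can be pictured with a toy pipeline. This is only a sketch and not how any MUC system works internally: the gazetteer, the sample sentence, the event name, and both regular-expression patterns are invented for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical gazetteer for step 2 (NE): known name -> category.
GAZETTEER = {
    "Acme Corp": "company",
    "John Doe": "person",
    "Springfield": "location",
}

# Hypothetical input sentence (step 1).
TEXT = ("John Doe was charged with laundering money through Acme Corp, "
        "based in Springfield.")


@dataclass
class Entity:
    name: str
    category: str
    descriptors: dict = field(default_factory=dict)


def named_entities(text):
    """NE: find gazetteer names in the text and assign each a category."""
    return [Entity(n, c) for n, c in GAZETTEER.items() if n in text]


def template_elements(text, entities):
    """TE: attach descriptive information, here the location of a company."""
    by_name = {e.name: e for e in entities}
    pattern = r"(?P<org>[A-Z]\w+ Corp), based in (?P<loc>[A-Z]\w+)"
    for m in re.finditer(pattern, text):
        org = by_name.get(m.group("org"))
        if org is not None:
            org.descriptors["location"] = m.group("loc")
    return entities


def scenario_templates(text, entities):
    """ST: tie a person and a company together into an event relation."""
    people = [e for e in entities if e.category == "person"]
    companies = [e for e in entities if e.category == "company"]
    events = []
    for p in people:
        for c in companies:
            pattern = (re.escape(p.name)
                       + r" was charged with laundering money through "
                       + re.escape(c.name))
            if re.search(pattern, text):
                events.append({"event": "money_laundering",
                               "suspect": p.name,
                               "organization": c.name})
    return events


entities = template_elements(TEXT, named_entities(TEXT))
events = scenario_templates(TEXT, entities)
print(entities)
print(events)
```

Each stage consumes the output of the previous one, which is why errors in NE tend to propagate into TE and ST results.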
12Example Text
13Output Example (NE, TE)
14Output (STs)
15Another IE Example
- Corporate Management Changes
- Purpose
- which positions in which organizations are changing hands?
- who is leaving a position and where is the person going?
- who is appointed to a position and where is the person coming from?
- the locations and types of the organizations involved in the succession events
- the names and titles of the persons involved in the succession events
- http://www.cs.umanitoba.ca/lindek/ie-ex.htm
16Input Text
- President Clinton nominated John Rollwagen, the
chairman and CEO of Cray Research Inc., as the
No. 2 Commerce Department official. Mr. Rollwagen
said he wants to push the Clinton administration
to aggressively confront U.S. trading partners
such as Japan to open their markets, particularly
for high-tech industries. In a letter sent
throughout the Eagan, Minn.-based company on
Friday, Mr. Rollwagen warned "Whether we like it
or not, our country is in an economic war and we
are at a key turning point in that war." ......
- Cray said it has appointed John F. Carlson, its
president and chief operating officer, to succeed
him. ......
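A pattern-based extractor for this passage might look like the sketch below. It is a hand-written illustration, not the system behind this example: the name pattern, the two sentence patterns, and the output keys are assumptions, and the code simply takes "him" to refer to the nominated person instead of resolving the coreference.

```python
import re

# Two sentences from the input text (the elided parts are left out).
TEXT = (
    "President Clinton nominated John Rollwagen, the chairman and CEO of "
    "Cray Research Inc., as the No. 2 Commerce Department official. "
    "Cray said it has appointed John F. Carlson, its president and chief "
    "operating officer, to succeed him."
)

# Assumed name shape: first name, optional middle initial, last name.
NAME = r"[A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+"

# "nominated <person>, the <post> of <org>, as ..."
NOMINATED = re.compile(
    r"nominated (?P<person>" + NAME + r"), "
    r"the (?P<post>[^,]+?) of (?P<org>[^,]+), as"
)

# "<org> said it has appointed <person>, its <title>, to succeed ..."
APPOINTED = re.compile(
    r"(?P<org>[A-Z]\w+) said it has appointed (?P<person>" + NAME + r"), "
    r"its (?P<title>[^,]+), to succeed"
)


def extract_succession(text):
    """Return one succession event, or None if either pattern fails."""
    leaving = NOMINATED.search(text)
    arriving = APPOINTED.search(text)
    if leaving is None or arriving is None:
        return None
    return {
        "organization": leaving.group("org"),
        "post": leaving.group("post"),
        "person_out": leaving.group("person"),
        "person_in": arriving.group("person"),
        "person_in_title": arriving.group("title"),
    }


event = extract_succession(TEXT)
print(event)
```

Patterns this specific break as soon as the wording changes, which is one motivation for learning extraction rules from examples.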
17Extraction Result
18MUC
- Data sets
- MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
- MUC-3 and MUC-4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
- MUC-6 and MUC-7 from LDC: http://www.ldc.upenn.edu/
- MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
- MUC-7
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html
19Summary
- Evaluation
- Precision = (# of correctly extracted fields) / (# of extracted fields)
- Recall = (# of correctly extracted fields) / (# of fields to be extracted)
- Design Methodology for Text IE
- Natural Language Processing
- Machine Learning
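The two evaluation measures on this slide can be computed directly from the set of extracted fields and the set of fields that should have been extracted. In the sketch below, a field is represented as a (slot, value) pair and counted as correct only on an exact match; that representation and the sample values are assumptions for illustration.

```python
def precision_recall(extracted, gold):
    """Precision and recall over fields given as (slot, value) pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # fields that are both extracted and right
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard: the fields to be extracted.
gold = {
    ("person_out", "John Rollwagen"),
    ("person_in", "John F. Carlson"),
    ("organization", "Cray Research Inc."),
    ("post", "chairman and CEO"),
}

# Hypothetical system output: two correct fields and one wrong one.
extracted = {
    ("person_out", "John Rollwagen"),
    ("person_in", "John F. Carlson"),
    ("organization", "Commerce Department"),
}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With two of three extracted fields correct and four fields expected, precision is 2/3 and recall is 2/4.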
20IE from Web pages
- Output template: k-tuple
- Multiple instances of a field
- Missing data
21Web data extraction
- Various Web pages
- Multiple-record page extraction
- One-record (singular) page extraction
22Multiple-record page extraction
23One-record (singular) page extraction
24Applications
- Information integration
- Meta Search Engines
- Shopping agents
- Travel agents
25Information Integration Systems
- Abstracted Information
- Agent/Module Coordination
- Mediation
- Semantic Integration
- Translation and Wrapping
- Unprocessed, Unintegrated Details
26Web Wrappers
- What is a wrapper?
- A program that extracts desired information from Web pages.
- Web pages → wrapper → structured information
- Web wrappers wrap...
- Query-able or search-able Web sites
- Web pages with large itemized lists
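A hand-coded wrapper for a page with an itemized list could be as small as the sketch below. The page fragment, its tag conventions (`<b>` for title, `<i>` for author), and the 3-tuple layout are assumptions for illustration. The output is one k-tuple per record, with a list for a field that may have multiple instances and `None` for missing data.

```python
import re

# Hypothetical multiple-record page: one <li> per record.
PAGE = """
<ul>
  <li><b>Intro to IE</b> <i>Chang</i> $25.00</li>
  <li><b>Web Mining</b> $30.50</li>
  <li><b>Wrappers</b> <i>Lee</i> <i>Wu</i></li>
</ul>
"""

RECORD = re.compile(r"<li>(.*?)</li>", re.S)  # record boundaries
TITLE = re.compile(r"<b>(.*?)</b>")           # single-valued field
AUTHOR = re.compile(r"<i>(.*?)</i>")          # possibly multi-valued field
PRICE = re.compile(r"\$(\d+\.\d{2})")         # possibly missing field


def wrap(page):
    """Turn a page into a list of (title, authors, price) tuples."""
    rows = []
    for record in RECORD.findall(page):
        title = TITLE.search(record)
        price = PRICE.search(record)
        rows.append((
            title.group(1) if title else None,
            AUTHOR.findall(record),
            float(price.group(1)) if price else None,
        ))
    return rows


rows = wrap(PAGE)
for row in rows:
    print(row)
```

Writing such rules by hand for every site is tedious and brittle when the page layout changes, which is the usual argument for generating wrappers automatically.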
27Summary
- Evaluation
- Precision = (# of correctly extracted records) / (# of extracted records)
- Recall = (# of correctly extracted records) / (# of records to be extracted)
- Methodology for Web IE
- Programming package
- Machine Learning
- Pattern Mining
28Type III News Group IE
- Example: Computer-Related Jobs
29Output Template
- Between free-text IE and semi-structured IE
- Califf, RAPIER 99
30Wrapper Induction Systems
- Wrapper induction (WI) or information extraction (IE) systems are software designed to generate wrappers.
- Taxonomy of Web IE systems by
- Task domain
- free text vs semi-structured pages
- Automation degree
- supervised vs unsupervised
- Techniques applied
- Machine learning vs pattern mining
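To make "generating a wrapper" concrete, the sketch below induces a very simple rule from annotated examples (the supervised case): the left delimiter is the longest common suffix of the text before each labelled value, and the right delimiter is the longest common prefix of the text after it. This is a minimal illustration of delimiter-based induction under the assumption of one labelled value per training page. It is not the algorithm of any particular system in the taxonomy, and the training pages are invented.

```python
import os
import re


def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    """Longest string that every input ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def induce_lr(examples):
    """Learn (left, right) delimiters from (page, labelled value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)            # first occurrence of the label
        lefts.append(page[:start])           # text before the value
        rights.append(page[start + len(value):])  # text after the value
    left, right = common_suffix(lefts), common_prefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found in the examples")
    return left, right


def apply_lr(rule, page):
    """Extract every string enclosed by the learned delimiters."""
    left, right = rule
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), page)


# Hypothetical annotated training pages.
train = [
    ("<p>Name: <b>Ann</b>, age 30</p>", "Ann"),
    ("<div>Other <b>Bob</b> here</div>", "Bob"),
]

rule = induce_lr(train)
found = apply_lr(rule, "<li>Item <b>Cy</b> x</li><li>Thing <b>Di</b></li>")
print(rule, found)
```

The learned rule generalizes only as far as the examples allow: two training pages with identical surroundings would yield overly long delimiters that fail on any other layout.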
31Task Domain
- Document type
- Extraction level
- Field-level, record-level, page-level
- Extraction target variation
- Missing Attributes
- Multi-valued Attributes
- Multi-order attribute Permutations
- Nested Data Objects
- Template variation
- Various Templates for an attribute
- Common Templates for various attributes
- Untokenized Attributes
32Automation Degree
- Page-fetching Support
- Annotation Requirement
- Output Support
- API Support
33Techniques Applied
- Scan passes
- Extraction rule types
- Learning algorithms
- Tokenization schemes
- Features used
34Conclusion
- Define the IE problem
- Specify the input training examples
- with annotation, or
- without annotation
- Depict the extraction rule
- Use necessary background knowledge
35References
- H. Cunningham, Information Extraction: a User Guide, http://www.dcs.shef.ac.uk
- MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
- I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
- Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.