Introduction to Information Extraction

1
Introduction to Information Extraction

Chia-Hui Chang
Dept. of Computer Science and Information
Engineering, National Central University, Taiwan
chia_at_csie.ncu.edu.tw

2
Problem Definition

Information Extraction (IE) identifies relevant
information in documents, pulling it from a
variety of sources and aggregating it into a
homogeneous form.
Input → extractor → structured output
The output template of the IE task
Several fields (slots)
Several instances of a field
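
A minimal Python sketch of such an output template, given as an illustration rather than as the format any particular IE system uses. The class name `Template` and the slot names `person` and `company` are assumptions made for this example; the values come from the sample text later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Template:
    """An IE output template: named slots, each holding zero or more instances."""
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may be filled more than once (several instances of a field).
        self.slots.setdefault(slot, []).append(value)


# Hypothetical slot names, chosen only for illustration.
t = Template()
t.add("person", "John Rollwagen")
t.add("person", "John F. Carlson")
t.add("company", "Cray Research Inc.")
```

Keeping each slot as a list makes "several instances of a field" the normal case rather than a special one.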

3
The difficulty of an IE task depends on

Text type
From plain text to semi-structured Web pages
e.g. Wall Street Journal articles, email
messages, or HTML documents.
Domain
From financial news or tourist information to
various languages.
Scenario

4
Various IE Tasks

Free-text IE
For MUC (Message Understanding Conference)
E.g. terrorist activities, corporate joint
ventures
Semi-structured IE
E.g. meta-search engines, shopping agents,
bio-integration systems

5
Types of IE from MUC

Named Entity recognition (NE)
Finds and classifies names, places, etc.
Coreference Resolution (CO)
Identifies identity relations between entities in
texts.
Template Element construction (TE)
Adds descriptive information to NE results.
Scenario Template production (ST)
Fits TE results into specified event scenarios.
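
As a rough illustration of the NE step only, the sketch below finds and classifies names by dictionary lookup. This is a deliberately simple stand-in, not how the MUC systems worked; the `GAZETTEER` contents and the category labels are assumptions, and the sample sentence is taken from the input text later in the deck.

```python
import re

# Hypothetical gazetteer: surface string -> entity category.
GAZETTEER = {
    "Cray Research Inc.": "ORGANIZATION",
    "John Rollwagen": "PERSON",
    "Japan": "LOCATION",
}


def recognize_entities(text):
    """Return (start, end, surface, category) for every gazetteer match."""
    found = []
    for surface, category in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            found.append((m.start(), m.end(), surface, category))
    # Sort by position so entities come out in reading order.
    return sorted(found)


sample = ("President Clinton nominated John Rollwagen, the chairman "
          "and CEO of Cray Research Inc.")
entities = recognize_entities(sample)
```

A lookup like this cannot find names it has never seen, which is one reason practical NE systems combine lists with rules or learned models.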

6
Named Entity Recognition

http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

7
NE Recognition (Cont.)

Spanish 93
Japanese 92
Chinese 84.51

8
Coreference Resolution

Coreference resolution (CO) involves identifying
identity relations between entities in texts.
For example, in
"Alas, poor Yorick, I knew him well."
tie "Yorick" with "him".
The Sheffield system scored 51% recall and 71%
precision.
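
The Yorick example can be reproduced with a naive heuristic: link each pronoun to the nearest preceding known name. This is only a sketch under strong assumptions (the names are supplied by the caller, and the pronoun list is a small hand-picked set); it is not the method used by the Sheffield system or any other MUC participant.

```python
import re

# A small, hand-picked pronoun list (an assumption for this sketch).
PRONOUNS = {"he", "him", "his", "she", "her", "it", "they", "them"}


def link_pronouns(text, names):
    """Link each pronoun to the nearest preceding mention found in `names`."""
    links = []
    last_name = None
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group()
        if word in names:
            last_name = word
        elif word.lower() in PRONOUNS and last_name is not None:
            links.append((word, last_name))
    return links


links = link_pronouns("Alas, poor Yorick, I knew him well.", {"Yorick"})
```

The heuristic ignores gender, number, and syntax, so it fails quickly on real text; the modest scores quoted above suggest how hard the full task is.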

http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html
9
Template Element Production

Adds descriptive information to named entities
Sheffield system scores 71%

10
Scenario Template Extraction

STs are the prototypical outputs of IE systems
They tie together TE entities into event and
relation descriptions.
Performance for Sheffield: 49%

http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html
11
Example

The operational domains of interest to users are
drug enforcement, money laundering, organized
crime, terrorism, ...
1. Input texts dealing with drug enforcement,
money laundering, organized crime, terrorism, and
legislation
2. NE recognizes entities in those texts and
assigns them to one of a number of categories
drawn from the set of entities of interest
(person, company, . . . )
3. TE associates certain types of descriptive
information with these entities, e.g. the
location of companies
4. ST identifies a set (relatively small to
begin with) of events of interest by tying
entities together into event relations.

12
Example Text
13
Output Example (NE, TE)
14
Output (STs)
15
Another IE Example

Corporate Management Changes
Purpose
which positions in which organizations are
changing hands?
who is leaving a position and where the person is
going to?
who is appointed to a position and where the
person is coming from?
the locations and types of the organizations
involved in the succession events
the names and titles of the persons involved in
the succession events
http://www.cs.umanitoba.ca/lindek/ie-ex.htm

16
Input Text

President Clinton nominated John Rollwagen, the
chairman and CEO of Cray Research Inc., as the
No. 2 Commerce Department official. Mr. Rollwagen
said he wants to push the Clinton administration
to aggressively confront U.S. trading partners
such as Japan to open their markets, particularly
for high-tech industries. In a letter sent
throughout the Eagan, Minn.-based company on
Friday, Mr. Rollwagen warned "Whether we like it
or not, our country is in an economic war and we
are at a key turning point in that war." ......
Cray said it has appointed John F. Carlson, its
president and chief operating officer, to succeed
him. ......
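
To make the task concrete, here is a sketch that pulls one succession event out of an abridged copy of this text with two hand-written regular expressions. The patterns `OUT` and `IN`, the function name, and the field names are all assumptions made for illustration and fit only this passage; the sketch is not the extraction result shown on the next slide. It also does not resolve "him", so treating Carlson as successor to Rollwagen's post is an assumption.

```python
import re

# Abridged excerpt of the input text above.
TEXT = (
    "President Clinton nominated John Rollwagen, the chairman and CEO of "
    "Cray Research Inc., as the No. 2 Commerce Department official. "
    "Cray said it has appointed John F. Carlson, its president and chief "
    "operating officer, to succeed him."
)

# Hand-written patterns that fit only this passage (not general rules).
OUT = re.compile(
    r"nominated (?P<person>[A-Z][\w. ]+?), the (?P<post>[\w ]+?) "
    r"of (?P<org>[A-Z][\w. ]+?),"
)
IN = re.compile(r"appointed (?P<person>[A-Z][\w. ]+?), .*? to succeed")


def succession_event(text):
    """Tie the outgoing and incoming person into one event record."""
    out, inc = OUT.search(text), IN.search(text)
    if out is None or inc is None:
        return None
    return {
        "organization": out["org"],
        "post": out["post"],
        "person_out": out["person"],
        "person_in": inc["person"],
    }


event = succession_event(TEXT)
```

Patterns this specific break on any rewording, which is the motivation for learning extraction rules from examples instead of writing them by hand.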

17
Extraction Result
18
MUC

Data Set for
MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
MUC34: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
MUC67: from LDC, http://www.ldc.upenn.edu/
MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

19
Summary

Evaluation
Precision = (# of correctly extracted fields) / (# of extracted fields)
Recall = (# of correctly extracted fields) / (# of fields to be extracted)
Design Methodology for Text IE
Natural Language Processing
Machine Learning
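
The two evaluation measures on this slide can be computed as follows. This is a minimal sketch that treats each extracted field as a (slot, value) pair and counts exact matches only; the function name and the sample pairs are assumptions for illustration, and real MUC scoring also handled partial matches.

```python
def precision_recall(extracted, gold):
    """Compare extracted fields against the fields that should be extracted."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Guard against division by zero when either set is empty.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (slot, value) pairs, for illustration only.
gold = {
    ("person", "John Rollwagen"),
    ("person", "John F. Carlson"),
    ("org", "Cray Research Inc."),
}
extracted = {("person", "John Rollwagen"), ("org", "Cray")}
p, r = precision_recall(extracted, gold)
```

Here one of the two extracted fields is correct and one of the three expected fields is found, so precision is 1/2 and recall is 1/3.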
20
IE from Web pages

Output Template k-tuple
Multiple instances of a field
Missing data

21
Web data extraction

Various Web pages
Multiple-record page extraction
One-record (singular) page extraction

22
Multiple-record page extraction
23
One-record (singular) page extraction
24
Applications

Information integration
Meta Search Engines
Shopping agents
Travel agents

25
Information Integration Systems
Abstracted Information
Agent/Module Coordination
Mediation
Semantic Integration
Translation and Wrapping
Unprocessed, Unintegrated Details
26
Web Wrappers

What is a wrapper?
A program that extracts the desired
information from Web pages.
Web pages → wrapper → structured information
Web wrappers wrap...
Queryable or searchable Web sites
Web pages with large itemized lists
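
A hand-written wrapper for a page with an itemized list might look like the sketch below. The page markup, the `ITEM` pattern, and the function name `wrap` are all assumptions invented for this example. Each list item becomes one (name, price) tuple, with `None` standing in for a missing field.

```python
import re

# Hypothetical page with an itemized list; the markup is an assumption.
PAGE = """
<ul>
  <li><b>Widget</b> <i>$9.99</i></li>
  <li><b>Gadget</b></li>
  <li><b>Gizmo</b> <i>$4.50</i></li>
</ul>
"""

# One record per <li>; the price part is optional.
ITEM = re.compile(
    r"<li><b>(?P<name>.*?)</b>(?:\s*<i>(?P<price>.*?)</i>)?</li>"
)


def wrap(page):
    """Turn each list item into a (name, price) tuple; price may be missing."""
    return [(m["name"], m["price"]) for m in ITEM.finditer(page)]


records = wrap(PAGE)
```

Because the pattern is tied to one site's markup, a new wrapper is needed for each source, which is what motivates generating wrappers automatically.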

27
Summary

Evaluation
Precision = (# of correctly extracted records) / (# of extracted records)
Recall = (# of correctly extracted records) / (# of records to be extracted)
Methodology for Web IE
Programming package
Machine Learning
Pattern Mining
28
Type III News Group IE

Example: computer-related jobs

29
Output Template

Between free-text IE and semi-structured IE
Califf, Rapier 99

30
Wrapper Induction Systems

Wrapper induction (WI) or information extraction
(IE) systems are software tools designed to
generate wrappers.
Taxonomy of Web IE systems by
Task domain
free text vs semi-structured pages
Automation degree
supervised vs unsupervised
Techniques applied
Machine learning vs pattern mining

31
Task Domain

Document type
Extraction level
Field-level, record-level, page-level
Extraction target variation
Missing Attributes
Multi-valued Attributes
Multi-order attribute Permutations
Nested Data Objects
Template variation
Various Templates for an attribute
Common Templates for various attributes
Untokenized Attributes

32
Automation Degree

Page-fetching Support
Annotation Requirement
Output Support
API Support

33
Techniques Applied

Scan passes
Extraction rule types
Learning algorithms
Tokenization schemes
Features used

34
Conclusion

Define the IE problem
Specify the input training examples
with annotation, or
without annotation
Depict the extraction rule
Use necessary background knowledge

35
References

H. Cunningham, Information Extraction: A User
Guide, http://www.dcs.shef.ac.uk
MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
I. Muslea, Extraction Patterns for Information
Extraction Tasks: A Survey, The AAAI-99 Workshop
on Machine Learning for Information Extraction.
Califf, Relational Learning of Pattern-Match
Rules for Information Extraction, AAAI-99.
