Information Extraction and Integration - PowerPoint PPT Presentation

Slides: 46

Provided by: csU89

Learn more at: https://www.cs.uic.edu

Transcript and Presenter's Notes

Title: Information Extraction and Integration

1
Information Extraction and Integration

Bing Liu
Department of Computer Science
University of Illinois at Chicago (UIC)
liub@cs.uic.edu
http://www.cs.uic.edu/~liub

2
Introduction

The Web is perhaps the single largest data source in the world.
Much of Web (content) mining is about
data/information extraction from semi-structured objects and free text, and
integration of the extracted data/information.
Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.
This talk gives an overview.

3
Road map

Structured data extraction
Wrapper induction
Automatic extraction
Information integration
Summary

7
Wrapper induction

Using machine learning to generate extraction rules.
The user marks the target items in a few training pages.
The system learns extraction rules from these pages.
The rules are applied to extract target items from other pages.
Many wrapper induction systems exist, e.g.,
WIEN (Kushmerick et al., IJCAI-97),
Softmealy (Hsu and Dung, 1998),
Stalker (Muslea et al., Agents-99),
BWI (Freitag and McCallum, AAAI-00),
WL2 (Cohen et al., WWW-02),
IDE (Liu and Zhai, WISE-05),
Thresher (Hogue and Karger, WWW-05).

8
Stalker: A wrapper induction system (Muslea et al., Agents-99)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone (310) 798-0008
We want to extract the area code.
Start rules:
R1: SkipTo(()
R2: SkipTo(-<b>)
End rules:
R3: SkipTo())
R4: SkipTo(</b>)
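
How the start and end rules pick out the area code can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the examples are assumed to have carried <b> tags around the city name and around the 800 prefix (the transcript shows them stripped), the tokenizer is invented for the sketch, and the end rules are searched forward from the start position, which simplifies how Stalker itself applies them.

```python
import re

# Assumed tokenizer: HTML tags, alphanumeric runs, and single punctuation marks.
TOKEN = re.compile(r"</?\w+>|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

def find(tokens, landmark, start=0):
    """Index of the first occurrence of the consecutive landmark tokens
    at or after `start`, or None if the landmark does not occur."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None

START_RULES = [["("], ["-", "<b>"]]   # R1: SkipTo((), R2: SkipTo(-<b>)
END_RULES = [[")"], ["</b>"]]         # R3: SkipTo()), R4: SkipTo(</b>)

def area_code(text):
    tokens = tokenize(text)
    for rule in START_RULES:          # ordered list: the first rule that matches wins
        hit = find(tokens, rule)
        if hit is not None:
            begin = hit + len(rule)   # the item starts right after the landmark
            break
    else:
        return None
    for rule in END_RULES:            # simplification: search forward from `begin`
        end = find(tokens, rule, begin)
        if end is not None:
            return "".join(tokens[begin:end])
    return None

EXAMPLES = [
    "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515",
    "90 Colfax, <b>Palms</b>, Phone (800) 508-1570",
    "523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293",
    "403 La Tijera, <b>Watts</b>, Phone (310) 798-0008",
]
codes = [area_code(e) for e in EXAMPLES]
```

R1 handles the parenthesized form (E2, E4) and R2 the dashed form (E1, E3), which is why the rules are kept as an ordered list rather than a single pattern.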

9
Learning extraction rules

Stalker uses sequential covering to learn
extraction rules for each target item.
In each iteration, it learns a perfect rule that
covers as many positive items as possible without
covering any negative items.
Once a positive item is covered by a rule, the
whole example is removed.
The algorithm ends when all the positive items
are covered. The result is an ordered list of all
learned rules.
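
The covering loop described above can be sketched as follows. This is not Stalker's actual rule search: the fixed candidate pool, the tokenizer, and the <b> tags in the examples are all assumptions made for the illustration. A rule "covers" an example when it stops exactly at the area code, and it is "perfect" when it never stops at a wrong position on the remaining examples.

```python
import re

TOKEN = re.compile(r"</?\w+>|\w+|[^\w\s]")   # assumed tokenizer

def skip_to(tokens, landmark):
    """Position just past the first occurrence of the landmark, or None."""
    n = len(landmark)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

def sequential_covering(examples, candidates):
    """examples: list of (tokens, correct_start_index).
    candidates: hypothetical pool of landmarks standing in for rule search."""
    remaining = list(examples)
    learned = []
    while remaining:
        best, best_covered = None, []
        for rule in candidates:
            results = [(ex, skip_to(ex[0], rule)) for ex in remaining]
            # Reject rules that stop at a wrong position (a negative item).
            if any(pos is not None and pos != ex[1] for ex, pos in results):
                continue
            covered = [ex for ex, pos in results if pos == ex[1]]
            if len(covered) > len(best_covered):
                best, best_covered = rule, covered
        if best is None:
            raise ValueError("no perfect rule in the candidate pool")
        learned.append(best)
        # Covered examples are removed as a whole.
        remaining = [ex for ex in remaining if ex not in best_covered]
    return learned

def make_example(text, code):
    tokens = TOKEN.findall(text)
    return tokens, tokens.index(code)   # index of the labeled area code

examples = [
    make_example("513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515", "800"),
    make_example("90 Colfax, <b>Palms</b>, Phone (800) 508-1570", "800"),
    make_example("523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293", "800"),
    make_example("403 La Tijera, <b>Watts</b>, Phone (310) 798-0008", "310"),
]
pool = [["("], ["<b>"], ["-"], ["-", "<b>"]]
rules = sequential_covering(examples, pool)
```

With this pool the loop first keeps SkipTo((), which covers E2 and E4, and then SkipTo(-<b>) for E1 and E3; the returned list is ordered in the same way as the learned rules.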

10
Rule induction through an example

Training examples
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone (310) 798-0008
We learn a start rule for the area code.
Assume the algorithm starts with E2. It creates three initial candidate rules from the first prefix symbol and two wildcards:
R1: SkipTo(()
R2: SkipTo(Punctuation)
R3: SkipTo(Anything)
R1 is perfect. It covers two positive examples but no negative example.

11
Rule induction (cont.)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone (310) 798-0008
R1 covers E2 and E4, which are removed. E1 and E3 need additional rules.
Three candidates are created:
R4: SkipTo(<b>)
R5: SkipTo(HtmlTag)
R6: SkipTo(Anything)
None is good. Refinement is needed.
Stalker chooses R4 to refine, i.e., to specialize it by adding symbols.
It will find R7: SkipTo(-<b>), which is perfect.

12
Limitations of Supervised Learning
Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites.
Wrapper maintenance is very costly if Web sites change frequently:
It is necessary to detect when a wrapper stops working properly.
Any change may make existing extraction rules invalid.
Re-learning is needed, and most likely manual re-labeling as well.

13
Road map
Structured data extraction
Wrapper induction
Automatic extraction
Information integration
Summary

14
The RoadRunner System (Crescenzi et al., VLDB-01)
Given a set of positive examples (multiple sample pages), each containing one or more data records.
From these pages, generate a wrapper as a union-free regular expression (i.e., no disjunction).
The approach:
To start, a sample page is taken as the wrapper.
The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.

16
Compare with wrapper induction
No manual labeling, but it needs a set of positive pages of the same template,
which is not necessary for a page with multiple data records.
The wrapper is for whole pages, not for data records.
A Web page can have many pieces of irrelevant information.
Issues of automatic extraction
Hard to handle disjunctions.
Hard to generate attribute names for the extracted data.
Extracted data from multiple sites need integration, manual or automatic.

17
The DEPTA system (Zhai and Liu, WWW-05)
[Figure: a sample page with two data regions, Data region1 and Data region2, each containing data records.]
18
Align and extract data items (e.g., region1)
19
1. Mining Data Records (Liu et al., KDD-03; Zhai and Liu, WWW-05)
Given a single page with multiple data records (a list page), it extracts data records.
The algorithm is based on
two observations about data records in a Web page, and
a string matching algorithm (tree matching works too).
It considers both
contiguous and
non-contiguous data records.

20
The Approach
Given a page, three steps:
Building the HTML tag tree
Erroneous tags, unbalanced tags, etc.
Some problems are hard to fix
Mining data regions
String matching or tree matching
Identifying data records
Rendering (or visual) information is very useful in the whole process.

21
Building tree based on visual cues
Bounding boxes of the rendered elements:

element  left  right  top  bottom
table    100   300    200  400
tr       100   300    200  300
td       100   200    200  300
td       200   300    200  300
tr       100   300    300  400
td       100   200    300  400
td       200   300    300  400

The tag tree:
table
  tr
    td
    td
  tr
    td
    td
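
One way to read the slide is that the tree is recovered from box containment alone: an element becomes a child of the nearest preceding element whose rectangle encloses it. The sketch below follows that reading. The pairing of each coordinate row with a tag (table, tr, td, td, tr, td, td, in order) is an assumption reconstructed from the slide, and the function names are invented for the illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    left: int
    right: int
    top: int
    bottom: int
    children: list = field(default_factory=list)

def contains(outer, inner):
    """True if inner's rectangle lies within outer's rectangle."""
    return (outer.left <= inner.left and inner.right <= outer.right and
            outer.top <= inner.top and inner.bottom <= outer.bottom)

def build_tree(boxes):
    """boxes: (tag, left, right, top, bottom) tuples in document order.
    Each element is attached to the most recent element that encloses it."""
    root, stack = None, []
    for tag, left, right, top, bottom in boxes:
        node = Node(tag, left, right, top, bottom)
        while stack and not contains(stack[-1], node):
            stack.pop()               # leave elements that do not enclose this one
        if stack:
            stack[-1].children.append(node)
        elif root is None:
            root = node               # the first element becomes the root
        stack.append(node)
    return root

def shape(node):
    """Nested (tag, children) view of the tree, for inspection."""
    return (node.tag, [shape(c) for c in node.children])

BOXES = [
    ("table", 100, 300, 200, 400),
    ("tr", 100, 300, 200, 300),
    ("td", 100, 200, 200, 300),
    ("td", 200, 300, 200, 300),
    ("tr", 100, 300, 300, 400),
    ("td", 100, 200, 300, 400),
    ("td", 200, 300, 300, 400),
]
tree = build_tree(BOXES)
```

Because only rendered positions are used, unbalanced or missing closing tags in the HTML source do not affect the result, which is the appeal of visual cues here.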
22
Mining Data Regions
[Figure: a tag tree with nodes numbered 1-20, in which Region 1, Region 2, and Region 3 are marked.]
23
Identify Data Records
A generalized node may not be a data record.
Extra mechanisms are needed to identify true atomic objects (see the papers).
Some highlights: both contiguous and non-contiguous data records are handled.

24
2. Extract Data from Data Records
Once a list of data records is identified, we can align and extract the data items in them.
Approaches (align multiple data records):
Multiple string alignment
Many ambiguities due to the pervasive use of table-related tags.
Multiple tree alignment (partial tree alignment)
Effective together with visual information.
Most multiple alignment methods work like hierarchical clustering:
not effective, and very expensive.

25
Tree Matching (tree edit distance)
Intuitively, in the mapping
each node can appear no more than once,
the order between sibling nodes is preserved, and
the hierarchical relation between nodes is also preserved.

[Figure: two example trees, A and B, each rooted at p, with child nodes labeled among a, b, c, d, e, and h.]
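
A restricted form of tree matching that respects the three conditions above, usually called Simple Tree Matching, can be sketched with dynamic programming over the child lists. Treat the choice of this variant as an assumption: the slide names tree edit distance in general. The two trees below are hypothetical, since the trees in the slide's figure did not survive in the transcript.

```python
def simple_tree_matching(a, b):
    """a, b: trees written as (label, [children]).
    Returns the number of node pairs in a maximum matching that keeps
    sibling order and parent-child relations, and never replaces a node."""
    if a[0] != b[0]:
        return 0                      # different roots: nothing below can match
    ka, kb = a[1], b[1]
    # m[i][j]: best matching between the first i children of a
    # and the first j children of b.
    m = [[0] * (len(kb) + 1) for _ in range(len(ka) + 1)]
    for i in range(1, len(ka) + 1):
        for j in range(1, len(kb) + 1):
            m[i][j] = max(
                m[i][j - 1],
                m[i - 1][j],
                m[i - 1][j - 1] + simple_tree_matching(ka[i - 1], kb[j - 1]),
            )
    return m[len(ka)][len(kb)] + 1    # +1 for the matched roots

def leaf(label):
    return (label, [])

# Hypothetical trees, not the ones from the slide's figure.
A = ("p", [leaf("a"), ("b", [leaf("c"), leaf("d")])])
B = ("p", [leaf("a"), ("b", [leaf("d")]), leaf("e")])
score = simple_tree_matching(A, B)
```

Here the matched pairs are p, a, b, and d, so the score is 4; c and e have no counterpart and stay unmatched.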
26
The Partial Tree Alignment approach
Choose a seed tree: a seed tree, denoted by Ts, is picked with the maximum number of data items.
Tree matching:
For each unmatched tree Ti (i ≠ s),
match Ts and Ti.
Each pair of matched nodes is linked (aligned).
For each unmatched node nj in Ti,
expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts.
The expanded seed tree Ts is then used in subsequent matching.
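
The steps above can be sketched for the simplest case, where every record is flattened to the list of labels under its root. This is a one-level simplification, not the full tree algorithm: matching is done with a longest common subsequence as a stand-in for tree matching, and a position counts as uniquely determined when the matched neighbors of an unmatched run are adjacent in the seed (or the run sits at a matching end). The records T1, T2, and T3 are hypothetical.

```python
def lcs_pairs(a, b):
    """Index pairs (i, j) with a[i] == b[j] forming a longest common subsequence."""
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                m[i][j] = m[i + 1][j + 1] + 1
            else:
                m[i][j] = max(m[i + 1][j], m[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif m[i + 1][j] >= m[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def expand_seed(seed, rec):
    """Insert unmatched runs of rec into seed where the position is unique.
    Returns (new_seed, complete); complete is False if some run was left out."""
    pairs = lcs_pairs(seed, rec)
    matched = {j: i for i, j in pairs}          # rec index -> seed index
    inserts, complete, j = {}, True, 0
    while j < len(rec):
        if j in matched:
            j += 1
            continue
        k = j
        while k < len(rec) and k not in matched:
            k += 1                              # rec[j:k] is an unmatched run
        left = matched[j - 1] if j > 0 else -1
        right = matched[k] if k < len(rec) else len(seed)
        if pairs and right - left == 1:         # neighbors adjacent in the seed
            inserts[right] = rec[j:k]
        else:
            complete = False                    # ambiguous position: defer
        j = k
    out = []
    for pos in range(len(seed) + 1):
        out.extend(inserts.get(pos, []))
        if pos < len(seed):
            out.append(seed[pos])
    return out, complete

def partial_tree_alignment(records):
    s = max(range(len(records)), key=lambda i: len(records[i]))
    seed = list(records[s])                     # seed: the record with most items
    pending = [r for i, r in enumerate(records) if i != s]
    progress = True
    while pending and progress:                 # retry deferred records
        progress, still = False, []
        for rec in pending:
            new_seed, complete = expand_seed(seed, rec)
            progress = progress or len(new_seed) > len(seed)
            seed = new_seed
            if not complete:
                still.append(rec)
        pending = still
    return seed

T1 = ["u", "v", "w", "x", "b", "d"]
T2 = ["b", "n", "c", "k", "g"]
T3 = ["b", "c", "d", "h", "k"]
aligned = partial_tree_alignment([T1, T2, T3])
```

On the first pass T2 contributes nothing, because its unmatched items could go either before or after d; once T3 has added c, h, and k, T2 is matched again and n and g find unique positions.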

27
Illustration of partial tree alignment
[Figure: two cases of matching a tree Ti against the seed tree Ts, both rooted at p. In the first case insertion is possible and the inserted nodes become a new part of Ts. In the second case insertion is not possible.]
28
A complete example
[Figure: three trees rooted at p, with Ts = T1, followed by T2 and T3. Captions in the figure: "No node inserted"; "New Ts: c, h, and k inserted"; "T2 is matched again".]
29
Output Data Table
DEPTA does not work with nested data records.
NET (Liu and Zhai, WISE-05) extracts data from both flat and nested data records.

30
Some other systems and techniques
IEPAD (Chang and Lui, WWW-01), DeLa (Wang and Lochovsky, WWW-03)
These systems treat a page as a long string, and find repeated substring patterns.
They often produce multiple patterns (rules). It is hard to decide which is correct.
EXALG (Arasu and Garcia-Molina, SIGMOD-03), (Lerman et al., SIGMOD-04)
They require multiple pages to find patterns,
which is not necessary for pages with multiple records.
(Zhao et al., WWW-04)
It extracts data records in one area of a page.

31
Limitations and issues
Not for a page with only a single data record.
Does not generate attribute names for the extracted data (yet!).
Extracted data from multiple sites need integration.
It is possible in each specific application domain, e.g.,
products sold online:
need product name, image, and price;
identifying only these three fields may not be too hard.
Job postings, publications, etc.

32
Road map
Structured data extraction
Wrapper induction
Automatic extraction
Information integration
Summary

33
Web query interface integration
Many integration tasks:
Integrating Web query interfaces (search forms)
Integrating extracted data
Integrating textual information
Integrating ontologies (taxonomy)

We only introduce integration of query interfaces.
Many Web sites provide forms to query the deep Web.
Applications: meta-search and meta-query

34
Global Query Interface
united.com
airtravel.com
delta.com
hotwire.com
35
Synonym Discovery (He and Chang, KDD-04)
Discover synonym attributes:
Author = Writer, Subject = Category

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
[Figure: Holistic Model Discovery, relating the attributes category, author, name, subject, and writer across sources.]
36
Schema matching as correlation mining
Across many sources:
Synonym attributes are negatively correlated:
synonym attributes are semantic alternatives,
thus they rarely co-occur in query interfaces.
Grouping attributes are positively correlated:
grouping attributes are semantic complements,
thus they often co-occur in query interfaces.
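
The co-occurrence idea can be tried on the three book interfaces S1 to S3 from the synonym discovery slide. The sketch below uses raw co-occurrence counts as a stand-in for a proper correlation measure, which is an assumption made for brevity; with only three sources it shows why negative correlation yields candidate matchings that still need a selection step.

```python
from itertools import combinations

# Attribute sets of the three query interfaces S1, S2, S3.
INTERFACES = [
    {"author", "title", "subject", "ISBN"},
    {"writer", "title", "category", "format"},
    {"name", "title", "keyword", "binding"},
]

def cooccurrence(interfaces):
    """Return (freq, both): how often each attribute occurs, and how often
    each attribute pair occurs together in the same interface."""
    attrs = sorted(set().union(*interfaces))
    freq = {a: sum(a in s for s in interfaces) for a in attrs}
    both = {
        (a, b): sum(a in s and b in s for s in interfaces)
        for a, b in combinations(attrs, 2)
    }
    return freq, both

def synonym_candidates(interfaces):
    """Attribute pairs that never co-occur: candidates for synonyms."""
    _, both = cooccurrence(interfaces)
    return [pair for pair, n in both.items() if n == 0]

def grouping_candidates(interfaces, min_count=2):
    """Attribute pairs that co-occur in at least min_count interfaces."""
    _, both = cooccurrence(interfaces)
    return [pair for pair, n in both.items() if n >= min_count]

candidates = synonym_candidates(INTERFACES)
groups = grouping_candidates(INTERFACES)
```

The pair (author, writer) shows up among the candidates, but so does (author, category), because three interfaces are far too few to separate true synonyms from attributes that merely happen not to co-occur. No pair reaches the grouping threshold here, since title is the only attribute shared across sources.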

37
1. Positive correlation mining as potential groups
Mining positive correlations: {Last Name, First Name}
2. Negative correlation mining as potential matchings
Mining negative correlations: Author = {Last Name, First Name}
3. Matching selection as model construction
Author (any) = {Last Name, First Name}
Subject = Category
Format = Binding
38
A clustering approach to schema matching (Wu et al., SIGMOD-04)
1:1 mapping by clustering
Bridging effect:
a2 and c2 might not look similar themselves,
but they might both be similar to b3.
1:m mappings
Aggregate and is-a types
User interaction helps in
learning of matching thresholds
resolution of uncertain mappings

39
Find 1:1 Mappings via Clustering
[Figure: the interfaces, the initial similarity matrix, and the matrix after one merge.]
Similarity functions:
linguistic similarity
domain similarity

Final clusters:
{a1,b1,c1}, {b2,c2}, {a2,b3}
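
A minimal sketch of how such clusters might be formed: agglomerative clustering over a field similarity matrix, with the constraint that two fields of the same interface never share a cluster. The similarity values, the average-link rule, and the threshold are all hypothetical; the paper's own similarity functions and merge rule are not reproduced here.

```python
def cluster_fields(fields, sim, threshold):
    """fields: names such as 'a1', whose first character names the interface.
    sim: dict mapping frozenset({f, g}) to a similarity in [0, 1].
    Repeatedly merges the most similar pair of clusters above the threshold."""
    clusters = [{f} for f in fields]

    def link(c1, c2):
        # Fields from the same interface must not end up in one cluster.
        if {f[0] for f in c1} & {f[0] for f in c2}:
            return -1.0
        total = sum(sim.get(frozenset((f, g)), 0.0) for f in c1 for g in c2)
        return total / (len(c1) * len(c2))      # average link

    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(clusters[i], clusters[j])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]

FIELDS = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
# Hypothetical similarities; unlisted pairs count as 0.
SIM = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a1", "c1")): 0.8,
    frozenset(("b1", "c1")): 0.85,
    frozenset(("b2", "c2")): 0.7,
    frozenset(("a2", "b3")): 0.6,
}
result = cluster_fields(FIELDS, SIM, threshold=0.5)
final = sorted(sorted(c) for c in result)
```

With these made-up numbers the merges end in the same three clusters as listed above; each final cluster then stands for one attribute of the global interface.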
40
Find 1:m Complex Mappings
Aggregate type: contents of the fields on the many side are part of the content of the field on the one side.
Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics
41
Complex Mappings (Cont'd)
Is-a type: the content of the field on the one side is the sum/union of the contents of the fields on the many side.
Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics
42
Instance-based matching via query probing (Wang et al., VLDB-04)
Both query interfaces and returned results (called instances) are considered in matching. It assumes
a global schema (GS) is given, and
a set of instances is also given.
It uses each instance value (V) in GS to probe the underlying database and obtain the count of appearances of V in the returned results.
These counts are used to help matching.

43
Query interface and result page
Title
Author
Publisher
Publish Date
ISBN
Format

Data Attributes
44
Road map
Structured data extraction
Wrapper Induction
Automatic extraction
Information integration
Summary

45
Summary
Gave an overview of two topics:
Structured data extraction
Information integration
Some technologies are ready for industrial exploitation, e.g., data extraction.
Simple integration is doable; complex integration still needs further research.
