Title: The CROSSMARC Platform for Web Information Retrieval and Extraction
1. The CROSSMARC Platform for Web Information Retrieval and Extraction
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
NCSR "Demokritos", Athens, Greece
Vangelis Karkaletsis, Constantine D. Spyropoulos
NEMIS, Athens, 25 October 2004
2. Contents
- Problem Description
- The CROSSMARC project
- The Platform
- Customization Infrastructure
- The customizable run-time system
- Application Building
- Benefits
- Conclusions
3. Problem Description I
4. Problem Description II
- Poor performance of general-purpose search engines, which depend on the results of generic Web crawlers.
- The behavior of the search engines must be adapted to the users' requirements.
5. Problem Description III
- The traditional approach to web IE involves the creation of wrappers for specific web sites.
- Manual creation of wrappers presents many shortcomings, due to the overhead in writing and maintaining them.
- Automatic creation of wrappers (wrapper induction) also presents problems, since wrapper re-training is necessary
  - when changes occur, or
  - when pages from a similar Web site are to be analysed.
- Wrappers are more appropriate for structured or semi-structured pages.
6. Wrappers: Definition
- The wrapper creation problem can be stated as follows (Laender et al. 2002):
  - given a Web page W containing the information of interest, determine a mapping M that populates a data repository R with this information;
  - the mapping M must be able to extract data from any other page W' similar to W, where "similar" in most cases means belonging to the same site and having the same layout (see the sketch below).
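A minimal sketch of this definition in Python (illustrative only; the Wrapper and Repository classes and their field names are assumptions, not CROSSMARC code):

```python
from dataclasses import dataclass, field
from typing import Callable

# A "record" extracted from a page, e.g. {"MANUF": "Acme", "MODEL": "X100"}.
Record = dict[str, str]

@dataclass
class Repository:
    """Toy data repository R that the mapping M populates."""
    records: list[Record] = field(default_factory=list)

    def add(self, record: Record) -> None:
        self.records.append(record)

@dataclass
class Wrapper:
    """A wrapper for one site/layout: a mapping M from page content to records."""
    mapping: Callable[[str], list[Record]]

    def populate(self, page_html: str, repo: Repository) -> None:
        # Apply M to a page W (or any similar page W') and store the results in R.
        for record in self.mapping(page_html):
            repo.add(record)
```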
7. CROSSMARC objective
Develop a platform for the collection of web pages and the extraction of information from them, that
- enables handling of structured, semi-structured or unstructured data,
- enables adaptation to new domains and languages,
- facilitates maintenance for an existing domain,
- ensures personalised access to the extracted information,
- provides strategies for effective site navigation.
8. CROSSMARC consortium
Completion Date: 31/08/2003. Contract No: IST-2000-25366.
9. CROSSMARC Platform I
- Composed of
  - a customization infrastructure that supports configuration to new applications and languages,
  - a run-time system for web information retrieval and extraction, which can be trained using the customization infrastructure.
10. CROSSMARC Platform II: Customization infrastructure
- Ontology Management
- Corpus Formation
- Corpus Collection and Annotation
11. Customization infrastructure: Ontology management
- Ontology editor
- Creation and maintenance of domain ontologies
- Lexicon editor
- Creation and maintenance of domain lexica
- NERC editor
- Creation of the NERC DTD
- Template editor
- Creation of the FE template
- Stereotypes editor
- Creation and maintenance of user stereotypes
12. Customization Infrastructure: Corpus Formation
- Corpus Formation is based on an interactive
process between the user and a simple machine
learning based classifier.
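A rough sketch of what such an interactive loop could look like (an assumption about the workflow, not the CROSSMARC tool itself; scikit-learn is used only for brevity): the user labels a few seed pages, the classifier proposes labels for the remaining pages, and the user corrects the proposals before the next round.

```python
# Illustrative interactive corpus-formation loop (hypothetical, not CROSSMARC code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def propose_labels(labelled, unlabelled):
    """Train on the pages labelled so far and propose labels for the rest."""
    texts, labels = zip(*labelled)
    vec = TfidfVectorizer()
    clf = MultinomialNB()
    clf.fit(vec.fit_transform(texts), labels)
    return clf.predict(vec.transform(unlabelled))

# Seed labels given by the user (1 = relevant product page, 0 = irrelevant).
labelled = [("laptop 1.6GHz 512MB special offer", 1),
            ("contact us press releases careers", 0)]
unlabelled = ["notebook 2.0GHz 15in sale price", "about our company history"]

for page, label in zip(unlabelled, propose_labels(labelled, unlabelled)):
    # In the real workflow the user would confirm or correct each proposal here.
    print(label, page)
```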
13. Customization Infrastructure: Corpus Collection and Annotation
- The corpus collection methodology determines
  - how different web page characteristics must be taken into account, and
  - how they are to be represented in the corpora.
- According to this methodology, web pages in each domain are classified into categories.
14. Categories of Web pages containing Offer Descriptions (2nd Domain)
15. Customization Infrastructure: Corpus Collection and Annotation
- The corpus collection methodology follows standard annotation practices for information extraction.
- The annotation task is based on guidelines that are issued for each new domain, and on the use of an annotation tool.
16. CROSSMARC Platform III: architecture of the trainable system
17. CROSSMARC Platform IV: The trainable system
- Web pages collection
  - Focused Crawler: identifies web sites that are of relevance to a particular domain (see the sketch after this list)
    - Combines 3 distinct crawler types
    - Filters the list of Web sites produced
  - Site-specific spider: navigates in a Web site to locate domain-specific pages
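A toy sketch of the crawler's overall shape, assuming the three crawler types are a directory browser, a search-engine querier and a link-following crawler (all function names and URLs below are illustrative, not CROSSMARC's actual components):

```python
# Hypothetical sketch: merge candidate sites from three crawler types and
# keep only those a domain-relevance filter accepts.

def directory_crawl(domain_terms):      # e.g. browse a web directory category
    return {"http://laptops-shop.example", "http://pc-store.example"}

def search_engine_crawl(domain_terms):  # e.g. issue domain-specific queries
    return {"http://pc-store.example", "http://news-portal.example"}

def link_crawl(seed_sites):             # e.g. follow links from known sites
    return {"http://another-retailer.example"}

def is_relevant(site_url):
    # Placeholder for the trained site filter; here a trivial keyword test.
    return any(t in site_url for t in ("laptop", "pc", "retailer"))

def focused_crawl(domain_terms, seeds):
    candidates = (directory_crawl(domain_terms)
                  | search_engine_crawl(domain_terms)
                  | link_crawl(seeds))
    return sorted(site for site in candidates if is_relevant(site))

print(focused_crawl(["laptop"], seeds=set()))
```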
18. Web Pages Collection: Focused Crawler
19. Web Pages Collection: Site-specific spider
- Site navigation traverses a Web site, collecting information from each page visited and forwarding it to the Page-Filtering and Link-Scoring modules.
- Page-filtering is responsible for deciding whether a page is an interesting one and should be stored or not
  - before storing a page, its language is identified
  - the page is also converted to XHTML
- Link-scoring validates the links to be followed. Only links with a score above a certain threshold are followed (a sketch of the traversal follows this list).
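A simplified sketch of such a traversal (the thresholds, scoring functions and breadth-first strategy below are assumptions for illustration, not the actual spider):

```python
# Illustrative site-spider loop with page filtering and link scoring.
# fetch, extract_links, page_score and link_score are supplied by the caller.
from collections import deque

def spider(start_url, fetch, extract_links, page_score, link_score,
           page_threshold=0.5, link_threshold=0.3, max_pages=100):
    """Traverse a site, store interesting pages, follow only promising links."""
    stored, seen = [], {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page_score(page) >= page_threshold:
            # Page filtering: language identification and XHTML conversion
            # would happen here before the page is stored.
            stored.append(url)
        for link in extract_links(page):
            # Link scoring: follow only links above the threshold.
            if link not in seen and link_score(link, page) >= link_threshold:
                seen.add(link)
                queue.append(link)
    return stored
```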
20. Web Pages Collection: Site-specific spider
21. CROSSMARC Platform V: The trainable system - Information Extraction
22. Information Extraction: different approaches of the monolingual IE systems
- English IE: machine-learning-based NERC, heuristics demarcator, classification-based FE, name matcher / normaliser
- Greek IE: machine-learning-based NERC, heuristics demarcator, STALKER-based FE, name matcher / normaliser
- Italian IE: rule-based NERC, heuristics demarcator, name matcher / normaliser, WHISK-based FE, normaliser
- French IE: hybrid NERC, heuristics demarcator, name matcher / normaliser, hybrid FE
23. Information Extraction: an example, the Greek IE
24. CROSSMARC Platform VI: The trainable system
- Data Storage / Presentation
  - The data storage component stores the extracted facts, from the XML file produced by IE, into domain-specific databases (a toy storage sketch follows this list).
  - Data presentation is implemented by the end-user interface, an internationalized web-based application.
  - Personalization is performed by the general-purpose NCSR personalization server (PServer).
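A toy sketch of the storage step (not the CROSSMARC component; the XML element names and the SQLite table are assumptions for illustration):

```python
# Read extracted facts from an IE-produced XML file and store them in a
# domain-specific table (hypothetical element names and schema).
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<offers>
  <offer><manufacturer>Acme</manufacturer><model>X100</model><price>999</price></offer>
  <offer><manufacturer>Foo</manufacturer><model>Z20</model><price>1299</price></offer>
</offers>
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE laptop_offers (manufacturer TEXT, model TEXT, price TEXT)")

for offer in ET.fromstring(SAMPLE_XML).findall("offer"):
    db.execute("INSERT INTO laptop_offers VALUES (?, ?, ?)",
               (offer.findtext("manufacturer"),
                offer.findtext("model"),
                offer.findtext("price")))

print(db.execute("SELECT * FROM laptop_offers").fetchall())
```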
25. Application Building I
- Development of three applications to extract information from
  - laptop offers in e-retailers' web sites (in four languages),
  - job offers in IT companies' web sites (in four languages),
  - holiday packages in the sites of travel agencies (in two languages).
26. Application Building II
- Involves two main stages
  - Creation of application-specific resources using the customization infrastructure
  - Training of the system components using the application-specific resources, and configuration of the system components
27. Application Building III
- Stage 1: Creation of application-specific resources (a small illustrative sketch follows this list)
  - Creation of concepts, their relationships and attributes using the ontology editor
  - Creation of their linguistic realizations using the lexicon editor
  - Specification of the important named-entity types using the NERC editor, to form a common DTD
  - Specification of the FE XML schema using the template editor
  - Collection of the corpus for page filtering using the corpus formation tool
  - Collection and annotation of the corpus for IE using the annotation tools
  - Specification of user stereotypes using the stereotypes editor
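As a rough illustration of what the ontology and lexicon resources might look like (the concepts, attributes and realizations below are assumptions, not the actual CROSSMARC formats):

```python
# Illustrative sketch of domain resources: a tiny ontology fragment with
# per-language lexical realizations (hypothetical content and structure).
ontology = {
    "Laptop": {
        "attributes": {
            "processor": {"type": "Processor"},
            "price":     {"type": "Money"},
        }
    },
    "Processor": {"instances": ["PentiumIII", "Celeron"]},
}

lexicon = {
    # concept/instance -> linguistic realizations per language
    "PentiumIII": {"en": ["Pentium III", "P3"], "el": ["Pentium 3"],
                   "it": ["Pentium III"], "fr": ["Pentium III"]},
    "price":      {"en": ["price", "cost"], "el": ["τιμή"],
                   "it": ["prezzo"], "fr": ["prix"]},
}
```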
28. Ontology editor
29. Lexicon editor
30. NERC editor
31. Shared DTDs
Domain 1 (laptop offers):
  NE: MANUF, MODEL, PROCESSOR, SOFT_OS
  TIMEX: TIME, DATE, DURATION
  NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT
Domain 2 (job offers):
  NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
  TIMEX: DATE, DURATION
  NUMEX: MONEY
  TERM: SCHEDULE, ORG_UNIT
32. Template editor
33. FE schemas for 1st and 2nd domain
34. Stereotypes editor
35. Application Building IV
- Stage 2: Training of the system components, configuration
  - Training of the Crawler
  - Training of the page filtering and link scoring modules using the collected corpus
  - Training of each monolingual IE sub-system using the collected corpus
  - Customization of the UI, exploiting the ontology, the lexicons and the stereotype definitions
  - Configuration of the system components
36. Benefits I
- CROSSMARC is not just another web extraction system.
- CROSSMARC is a platform that provides a customization infrastructure and a trainable system.
- To cope with the shortcomings of existing wrappers, CROSSMARC combines
  - Wrapper Induction techniques, to exploit the formatting features of the web pages,
  - NLP techniques, to exploit linguistic features of the web pages,
  - Machine Learning techniques, to facilitate customization to new applications,
  enabling the processing of application-specific web pages in different sites and in different languages (multilingual, site-independent).
- CROSSMARC also employs ontology engineering techniques to coordinate the creation and maintenance of application- and language-specific resources.
37. Categorization of tools for Web IE (Laender et al. 2002)
38. Benefits II
- After the evaluation performed during application building, we can conclude that
  - Crawler: increased effort must be put into the initial stage of forming hypotheses about what would be good directory and query start points.
  - Spider: we are able to identify, with a fairly high degree of confidence, when a Web page is an interesting one.
  - Information Extraction: satisfactory performance, especially for offer descriptions extracted from simpler web pages. In addition, the existing systems can be tuned further in order to achieve better performance.
39. Crawler Evaluation
- More than one experimentation cycle may be needed, depending on the domain and language.
40. Spider Evaluation
41. Information Extraction Evaluation
- Results comparable to MUC
- Full IE is a complex task, which is made even more complicated by the visual nature of web pages.
42. Concluding remarks
- CROSSMARC is an operational platform for site-independent and multilingual information retrieval and extraction from web pages.
- The run-time system is accessible from the CROSSMARC site.
- Access to various project resources and corpora will be provided for research purposes.
- Currently, more advanced components are being tested and will be integrated into the platform.
43. Useful Links
- CROSSMARC site
  - http://www.iit.demokritos.gr/skel/crossmarc
- Ellogon
  - http://www.ellogon.org