Title: The CROSSMARC Platform for Web Information Retrieval and Extraction
1. The CROSSMARC Platform for Web Information Retrieval and Extraction
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
NCSR "Demokritos", Athens, Greece
Vangelis Karkaletsis, Constantine D. Spyropoulos
NEMIS, Athens, 25 October 2004
2. Contents
- Problem Description
- The CROSSMARC project
- The Platform
- Customization Infrastructure
- The customizable run-time system
- Application Building
- Benefits
- Conclusions
3. Problem Description I
4. Problem Description II
- Poor performance of general-purpose search engines, which depend on the results of generic Web crawlers.
- The behavior of the search engines must be adapted to the users' requirements.
5. Problem Description III
- The traditional approach to web IE involves the creation of wrappers for specific web sites.
- Manual creation of wrappers presents many shortcomings, due to the overhead in writing and maintaining them.
- Automatic creation of wrappers (wrapper induction) also presents problems, since wrapper re-training is necessary
  - when changes occur, or
  - when pages from a similar Web site are to be analysed.
- Wrappers are more appropriate for structured or semi-structured pages.
6. Wrappers: Definition
- The wrapper creation problem can be stated as follows (Laender et al. 2002):
  - given a Web page W containing the information of interest, determine a mapping M that populates a data repository R with this information;
  - the mapping M must be able to extract data from any other page W' similar to W, where "similar" in most cases means belonging to the same site and having the same layout (see the sketch below).
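A minimal sketch of this definition in Python (illustrative only; the Wrapper and Repository classes and their field names are assumptions, not CROSSMARC code):

```python
from dataclasses import dataclass, field
from typing import Callable

# A "record" extracted from a page, e.g. {"MANUF": "Acme", "MODEL": "X100"}.
Record = dict[str, str]

@dataclass
class Repository:
    """Toy data repository R that the mapping M populates."""
    records: list[Record] = field(default_factory=list)

    def add(self, record: Record) -> None:
        self.records.append(record)

@dataclass
class Wrapper:
    """A wrapper for one site/layout: a mapping M from page content to records."""
    mapping: Callable[[str], list[Record]]

    def populate(self, page_html: str, repo: Repository) -> None:
        # Apply M to a page W (or any similar page W') and store the results in R.
        for record in self.mapping(page_html):
            repo.add(record)
```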
7. CROSSMARC objective
Develop a platform for the collection of web pages and the extraction of information from them, that
- enables handling of structured, semi-structured or unstructured data,
- enables adaptation to new domains and languages,
- facilitates maintenance for an existing domain,
- ensures personalised access to the extracted information,
- provides strategies for effective site navigation.
8. CROSSMARC consortium
Completion Date: 31/08/2003. Contract No: IST-2000-25366.
9. CROSSMARC Platform I
- Composed of
  - a customization infrastructure that supports configuration to new applications and languages,
  - a run-time system for web information retrieval and extraction, which can be trained using the customization infrastructure.
10. CROSSMARC Platform II: Customization infrastructure
- Ontology Management
- Corpus Formation
- Corpus Collection and Annotation
11. Customization infrastructure: Ontology management
- Ontology editor
- Creation and maintenance of domain ontologies
- Lexicon editor
- Creation and maintenance of domain lexica
- NERC editor
- Creation of the NERC DTD
- Template editor
- Creation of the FE template
- Stereotypes editor
- Creation and maintenance of user stereotypes
12. Customization Infrastructure: Corpus Formation
- Corpus Formation is based on an interactive
process between the user and a simple machine
learning based classifier.
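A rough sketch of what such an interactive loop could look like (an assumption about the workflow, not the CROSSMARC tool itself; scikit-learn is used only for brevity): the user labels a few seed pages, the classifier proposes labels for the remaining pages, and the user corrects the proposals before the next round.

```python
# Illustrative interactive corpus-formation loop (hypothetical, not CROSSMARC code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def propose_labels(labelled, unlabelled):
    """Train on the pages labelled so far and propose labels for the rest."""
    texts, labels = zip(*labelled)
    vec = TfidfVectorizer()
    clf = MultinomialNB()
    clf.fit(vec.fit_transform(texts), labels)
    return clf.predict(vec.transform(unlabelled))

# Seed labels given by the user (1 = relevant product page, 0 = irrelevant).
labelled = [("laptop 1.6GHz 512MB special offer", 1),
            ("contact us press releases careers", 0)]
unlabelled = ["notebook 2.0GHz 15in sale price", "about our company history"]

for page, label in zip(unlabelled, propose_labels(labelled, unlabelled)):
    # In the real workflow the user would confirm or correct each proposal here.
    print(label, page)
```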
13. Customization Infrastructure: Corpus Collection and Annotation
- The corpus collection methodology determines
  - how different web page characteristics must be taken into account, and
  - how they are to be represented in the corpora.
- According to this methodology, web pages in each domain are classified into categories.
14. Categories of Web pages containing Offer Descriptions (2nd Domain)
15. Customization Infrastructure: Corpus Collection and Annotation
- The corpus collection methodology follows standard annotation practices for information extraction.
- The annotation task is based on guidelines that are issued for each new domain, and on the use of an annotation tool.
16. CROSSMARC Platform III: architecture of the trainable system
17. CROSSMARC Platform IV: The trainable system
- Web pages collection
  - Focused Crawler: identifies web sites that are of relevance to a particular domain (see the sketch after this list)
    - Combines 3 distinct crawler types
    - Filters the list of Web sites produced
  - Site-specific spider: navigates in a Web site to locate domain-specific pages
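A toy sketch of the crawler's overall shape, assuming the three crawler types are a directory browser, a search-engine querier and a link-following crawler (all function names and URLs below are illustrative, not CROSSMARC's actual components):

```python
# Hypothetical sketch: merge candidate sites from three crawler types and
# keep only those a domain-relevance filter accepts.

def directory_crawl(domain_terms):      # e.g. browse a web directory category
    return {"http://laptops-shop.example", "http://pc-store.example"}

def search_engine_crawl(domain_terms):  # e.g. issue domain-specific queries
    return {"http://pc-store.example", "http://news-portal.example"}

def link_crawl(seed_sites):             # e.g. follow links from known sites
    return {"http://another-retailer.example"}

def is_relevant(site_url):
    # Placeholder for the trained site filter; here a trivial keyword test.
    return any(t in site_url for t in ("laptop", "pc", "retailer"))

def focused_crawl(domain_terms, seeds):
    candidates = (directory_crawl(domain_terms)
                  | search_engine_crawl(domain_terms)
                  | link_crawl(seeds))
    return sorted(site for site in candidates if is_relevant(site))

print(focused_crawl(["laptop"], seeds=set()))
```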
18. Web Pages Collection: Focused Crawler
19. Web Pages Collection: Site-specific spider
- Site navigation traverses a Web site, collecting information from each page visited and forwarding it to the Page-Filtering and Link-Scoring modules.
- Page-filtering is responsible for deciding whether a page is an interesting one and should be stored or not
  - before storing a page, its language is identified
  - the page is also converted to XHTML
- Link-scoring validates the links to be followed. Only links with a score above a certain threshold are followed (a sketch of the traversal follows this list).
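A simplified sketch of such a traversal (the thresholds, scoring functions and breadth-first strategy below are assumptions for illustration, not the actual spider):

```python
# Illustrative site-spider loop with page filtering and link scoring.
# fetch, extract_links, page_score and link_score are supplied by the caller.
from collections import deque

def spider(start_url, fetch, extract_links, page_score, link_score,
           page_threshold=0.5, link_threshold=0.3, max_pages=100):
    """Traverse a site, store interesting pages, follow only promising links."""
    stored, seen = [], {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page_score(page) >= page_threshold:
            # Page filtering: language identification and XHTML conversion
            # would happen here before the page is stored.
            stored.append(url)
        for link in extract_links(page):
            # Link scoring: follow only links above the threshold.
            if link not in seen and link_score(link, page) >= link_threshold:
                seen.add(link)
                queue.append(link)
    return stored
```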
20. Web Pages Collection: Site-specific spider
21. CROSSMARC Platform V: The trainable system - Information Extraction
22. Information Extraction: different approaches of the monolingual IE systems
- English IE: machine-learning-based NERC, heuristics demarcator, classification-based FE, name matcher / normaliser
- Greek IE: machine-learning-based NERC, heuristics demarcator, STALKER-based FE, name matcher / normaliser
- Italian IE: rule-based NERC, heuristics demarcator, name matcher / normaliser, WHISK-based FE, normaliser
- French IE: hybrid NERC, heuristics demarcator, name matcher / normaliser, hybrid FE
23. Information Extraction: an example, the Greek IE
24. CROSSMARC Platform VI: The trainable system
- Data Storage / Presentation
  - The data storage component stores the extracted facts, from the XML file produced by IE, into domain-specific databases (a toy storage sketch follows this list).
  - Data presentation is implemented by the end-user interface, an internationalized web-based application.
  - Personalization is performed by the general-purpose NCSR personalization server (PServer).
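A toy sketch of the storage step (not the CROSSMARC component; the XML element names and the SQLite table are assumptions for illustration):

```python
# Read extracted facts from an IE-produced XML file and store them in a
# domain-specific table (hypothetical element names and schema).
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<offers>
  <offer><manufacturer>Acme</manufacturer><model>X100</model><price>999</price></offer>
  <offer><manufacturer>Foo</manufacturer><model>Z20</model><price>1299</price></offer>
</offers>
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE laptop_offers (manufacturer TEXT, model TEXT, price TEXT)")

for offer in ET.fromstring(SAMPLE_XML).findall("offer"):
    db.execute("INSERT INTO laptop_offers VALUES (?, ?, ?)",
               (offer.findtext("manufacturer"),
                offer.findtext("model"),
                offer.findtext("price")))

print(db.execute("SELECT * FROM laptop_offers").fetchall())
```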
25. Application Building I
- Development of three applications to extract information from
  - laptop offers in e-retailers' web sites (in four languages),
  - job offers in IT companies' web sites (in four languages),
  - holiday packages in the sites of travel agencies (in two languages).
26. Application Building II
- Involves two main stages
  - Creation of application-specific resources using the customization infrastructure
  - Training of the system components using the application-specific resources, and configuration of the system components
27. Application Building III
- Stage 1: Creation of application-specific resources (a small illustrative sketch follows this list)
  - Creation of concepts, their relationships and attributes using the ontology editor
  - Creation of their linguistic realizations using the lexicon editor
  - Specification of the important named-entity types using the NERC editor, to form a common DTD
  - Specification of the FE XML schema using the template editor
  - Collection of the corpus for page filtering using the corpus formation tool
  - Collection and annotation of the corpus for IE using the annotation tools
  - Specification of user stereotypes using the stereotypes editor
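As a rough illustration of what the ontology and lexicon resources might look like (the concepts, attributes and realizations below are assumptions, not the actual CROSSMARC formats):

```python
# Illustrative sketch of domain resources: a tiny ontology fragment with
# per-language lexical realizations (hypothetical content and structure).
ontology = {
    "Laptop": {
        "attributes": {
            "processor": {"type": "Processor"},
            "price":     {"type": "Money"},
        }
    },
    "Processor": {"instances": ["PentiumIII", "Celeron"]},
}

lexicon = {
    # concept/instance -> linguistic realizations per language
    "PentiumIII": {"en": ["Pentium III", "P3"], "el": ["Pentium 3"],
                   "it": ["Pentium III"], "fr": ["Pentium III"]},
    "price":      {"en": ["price", "cost"], "el": ["τιμή"],
                   "it": ["prezzo"], "fr": ["prix"]},
}
```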
28. Ontology editor
29. Lexicon editor
30. NERC editor
31. Shared DTDs
Domain 1 (laptop offers):
  NE: MANUF, MODEL, PROCESSOR, SOFT_OS
  TIMEX: TIME, DATE, DURATION
  NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT
Domain 2 (job offers):
  NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
  TIMEX: DATE, DURATION
  NUMEX: MONEY
  TERM: SCHEDULE, ORG_UNIT
32. Template editor
33. FE schemas for 1st and 2nd domain
34. Stereotypes editor
35. Application Building IV
- Stage 2: Training of the system components, configuration
  - Training of the Crawler
  - Training of the page filtering and link scoring modules using the collected corpus
  - Training of each monolingual IE sub-system using the collected corpus
  - Customization of the UI, exploiting the ontology, the lexicons and the stereotype definitions
  - Configuration of the system components
36. Benefits I
- CROSSMARC is not just another web extraction system.
- CROSSMARC is a platform that provides a customization infrastructure and a trainable system.
- To cope with the shortcomings of existing wrappers, CROSSMARC combines
  - Wrapper Induction techniques, to exploit the formatting features of the web pages,
  - NLP techniques, to exploit linguistic features of the web pages,
  - Machine Learning techniques, to facilitate customization to new applications,
  enabling the processing of application-specific web pages in different sites and in different languages (multilingual, site-independent).
- CROSSMARC also employs ontology engineering techniques to coordinate the creation and maintenance of application- and language-specific resources.
37. Categorization of tools for Web IE (Laender et al. 2002)
38. Benefits II
- After the evaluation performed during application building, we can conclude that
  - Crawler: increased effort must be put into the initial stage of forming hypotheses about what would be good directory and query start points.
  - Spider: we are able to identify, with a fairly high degree of confidence, when a Web page is an interesting one.
  - Information Extraction: satisfactory performance, especially for offer descriptions extracted from simpler web pages. In addition, the existing systems can be tuned further in order to achieve better performance.
39. Crawler Evaluation
- More than one experimentation cycle may be needed, depending on the domain and language.
40. Spider Evaluation
41. Information Extraction Evaluation
- Results comparable to MUC
- Full IE is a complex task, which is made even more complicated by the visual nature of web pages.
42. Concluding remarks
- CROSSMARC is an operational platform for site-independent and multilingual information retrieval and extraction from web pages.
- The run-time system is accessible from the CROSSMARC site.
- Access to various project resources and corpora will be provided for research purposes.
- Currently, more advanced components are being tested and will be integrated into the platform.
43. Useful Links
- CROSSMARC site
  - http://www.iit.demokritos.gr/skel/crossmarc
- Ellogon
  - http://www.ellogon.org