The CROSSMARC Platform for Web Information Retrieval and Extraction

1
The CROSSMARC Platform for Web Information
Retrieval and Extraction
Software & Knowledge Engineering Laboratory,
Institute of Informatics & Telecommunications,
NCSR Demokritos, Athens, Greece
Vangelis Karkaletsis, Constantine D. Spyropoulos

NEMIS, Athens, 25 October 2004
2
Contents
  • Problem Description
  • The CROSSMARC project
  • The Platform
  • Customization Infrastructure
  • The customizable run-time system
  • Application Building
  • Benefits
  • Conclusions

3
Problem Description I
4
Problem Description II
  • Poor performance of general-purpose search
    engines, which depend on the results of generic
    Web crawlers.
  • The behavior of the search engines must be
    adapted to the users' requirements.

5
Problem Description III
  • The traditional approach to web IE involves the
    creation of wrappers for specific web sites.
  • Manual creation of wrappers presents many
    shortcomings due to
  • the overhead in writing and maintaining them
  • Automatic creation of wrappers (wrapper
    induction) also presents problems, since wrapper
    re-training is necessary
  • when changes occur or
  • when pages from a similar Web site are to be
    analysed
  • Wrappers are more appropriate for structured or
    semi-structured pages

6
Wrappers Definition
  • The wrapper creation problem can be stated as
    follows (Laender et al. 2002):
  • given a Web page W containing the information of
    interest, determine a mapping M that populates a
    data repository R with this information,
  • the mapping M must be able to extract data from
    any other page W' similar to W, where similar
    means in most cases belonging to the same site
    and having the same layout.
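
The definition above can be illustrated with a minimal sketch. The page layout, the class names and the `apply_wrapper` function are all hypothetical and chosen only for illustration; they are not taken from CROSSMARC or from Laender et al. A regular expression plays the role of the mapping M, and a plain list plays the role of the repository R.

```python
import re

# Hypothetical layout shared by the pages of one site: every offer is an <li>
# holding a name span and a price span. The pattern acts as the mapping M.
OFFER_PATTERN = re.compile(
    r'<li class="offer">\s*<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">(?P<price>[^<]+)</span>\s*</li>'
)

def apply_wrapper(page_html):
    """Mapping M: return the records found in a page, ready for repository R."""
    return [m.groupdict() for m in OFFER_PATTERN.finditer(page_html)]

# W and W' share the same layout; the third page comes from a different site.
page_w = ('<ul><li class="offer"><span class="name">Laptop A</span>'
          '<span class="price">999</span></li></ul>')
page_w_prime = ('<ul><li class="offer"><span class="name">Laptop B</span> '
                '<span class="price">1200</span></li></ul>')
page_other = '<table><tr><td>Laptop C</td><td>1500</td></tr></table>'

repository = apply_wrapper(page_w) + apply_wrapper(page_w_prime)
```

The same wrapper returns nothing for `page_other`, which is the limitation the previous slide points at: a wrapper tied to one layout has to be rewritten or retrained for pages from another site.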

7
CROSSMARC objective
Develop a platform for the collection of web
pages and the extraction of information from
them, that
  • enables handling of structured, semi-structured
    or unstructured data,
  • enables adaptation to new domains and languages,
  • facilitates maintenance for an existing domain,
  • ensures personalised access to the extracted
    information,
  • provides strategies for effective site navigation

8
CROSSMARC consortium
Completion Date: 31/08/2003. Contract No: IST-2000-25366
9
CROSSMARC Platform I
  • Composed of
  • a customization infrastructure that supports
    configuration to new applications and languages
  • a run-time system for web information retrieval
    and extraction which can be trained using the
    customization infrastructure

10
CROSSMARC Platform II: Customization
infrastructure
  • Ontology Management
  • Corpus Formation
  • Corpus Collection and Annotation

11
Customization infrastructure: Ontology management
  • Ontology editor
  • Creation and maintenance of domain ontologies
  • Lexicon editor
  • Creation and maintenance of domain lexica
  • NERC editor
  • Creation of the NERC DTD
  • Template editor
  • Creation of the FE template
  • Stereotypes editor
  • Creation and maintenance of user stereotypes

12
Customization Infrastructure: Corpus Formation
  • Corpus Formation is based on an interactive
    process between the user and a simple machine
    learning based classifier.
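
One way such an interactive loop could look is sketched below. The slide gives no details about the classifier or the interaction, so everything here is an assumption: the word-count scoring, the `form_corpus` function and the `ask_user` callback are hypothetical stand-ins, not the CROSSMARC corpus formation tool.

```python
from collections import Counter

def train(labelled):
    """Count how often each word occurs in relevant vs. irrelevant pages."""
    pos, neg = Counter(), Counter()
    for text, label in labelled:
        (pos if label else neg).update(text.lower().split())
    return pos, neg

def score(model, text):
    """Higher scores mean the page looks more like the relevant examples."""
    pos, neg = model
    return sum(pos[w] - neg[w] for w in text.lower().split())

def form_corpus(seed, unlabelled, ask_user, rounds=3):
    """Alternate between classifier proposals and user decisions."""
    labelled, pool = list(seed), list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labelled)
        # Propose the page the current model rates as most relevant.
        candidate = max(pool, key=lambda t: score(model, t))
        pool.remove(candidate)
        # The user accepts or rejects it; the answer feeds the next round.
        labelled.append((candidate, ask_user(candidate)))
    return labelled

seed = [("laptop offer price", True), ("football match result", False)]
unlabelled = ["cheap laptop price list", "football league table", "weather forecast"]
corpus = form_corpus(seed, unlabelled, ask_user=lambda t: "laptop" in t, rounds=2)
```

Here the user is simulated by a lambda; in an interactive tool the callback would show the page and wait for a decision.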

13
Customization Infrastructure: Corpus Collection
and Annotation
  • The corpus collection methodology determines
  • how different web page characteristics must be
    taken into account and
  • how they are to be represented in the corpora.
  • According to this methodology, web pages in each
    domain are classified in categories

14
Categories of Web pages containing Offer
Descriptions (2nd Domain)
15
Customization Infrastructure: Corpus Collection
and Annotation
  • The corpus collection methodology follows
    standard annotation practices for information
    extraction
  • The annotation task is based on guidelines that
    are issued for each new domain and on the use of
    an annotation tool

16
CROSSMARC Platform III: architecture of the
trainable system
Ontology
17
CROSSMARC Platform IV: The trainable system
- Web pages collection
  • Focused Crawler identifies web sites that are of
    relevance to a particular domain
  • Combines 3 distinct crawler types
  • Filters the list of Web sites produced
  • Site-specific spider navigates in a Web site to
    locate domain specific pages
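
The combine-then-filter step of the Focused Crawler can be sketched as follows. The slides do not name the three crawler types or the filtering criterion, so the three input lists and the `looks_like_retailer` predicate are hypothetical placeholders; only the overall shape (merge several crawler outputs, then filter the resulting site list) comes from the slide.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Reduce a URL to its host so that pages of one site collapse together."""
    return urlsplit(url).netloc.lower()

def combine_and_filter(candidate_lists, is_relevant):
    """Merge the outputs of several crawlers, drop duplicates, keep relevant sites."""
    seen = []
    for urls in candidate_lists:
        for url in urls:
            site = site_of(url)
            if site and site not in seen:
                seen.append(site)
    return [site for site in seen if is_relevant(site)]

# Hypothetical outputs of three different crawlers.
crawler_a = ["http://shop.example/laptops", "http://news.example/today"]
crawler_b = ["http://shop.example/offers", "http://store.example/"]
crawler_c = ["http://STORE.example/notebooks"]

def looks_like_retailer(site):
    """Stand-in filter; a real one would inspect the site's content."""
    return site.startswith(("shop.", "store."))

sites = combine_and_filter([crawler_a, crawler_b, crawler_c], looks_like_retailer)
```

The resulting list of sites is what a site-specific spider would then be given as starting points.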

18
Web Pages Collection: Focused Crawler
19
Web Pages Collection: Site-specific spider
  • Site navigation traverses a Web site, collecting
    information from each page visited and forwarding
    it to the Page-Filtering and Link-Scoring
    modules
  • Page-filtering is responsible for deciding
    whether a page is an interesting one and should
    be stored or not
  • before storing a page, its language is identified
  • the page is also converted to XHTML
  • Link-scoring validates the links to be followed.
    Only links with a score above a certain threshold
    are followed.
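
The interplay of the three modules can be sketched as a small traversal loop. The in-memory `SITE`, the keyword-based `is_interesting` filter and the `score_link` function are hypothetical stand-ins for the trained modules, and the language identification and XHTML conversion steps are left out for brevity.

```python
from collections import deque

# Toy in-memory "site": page path -> (text, outgoing links). Illustrative only.
SITE = {
    "/": ("Welcome to our shop", ["/laptops", "/about"]),
    "/laptops": ("laptop offers: laptop X 999 EUR", ["/laptops/x", "/about"]),
    "/laptops/x": ("laptop X processor 2 GHz price 999 EUR", []),
    "/about": ("company history", ["/careers"]),
    "/careers": ("join us", []),
}

def is_interesting(text):
    """Stand-in page filter; a real one would be a trained classifier."""
    return "laptop" in text and "EUR" in text

def score_link(link):
    """Stand-in link scorer returning a value between 0 and 1."""
    return 0.9 if "laptop" in link else 0.1

def spider(start, threshold=0.5):
    """Traverse the site, store interesting pages, follow high-scoring links."""
    stored, visited, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        text, links = SITE[page]
        if is_interesting(text):      # page filtering
            stored.append(page)
        for link in links:            # link scoring
            if link not in visited and score_link(link) > threshold:
                visited.add(link)
                queue.append(link)
    return stored

collected = spider("/")
```

With the default threshold the spider never visits `/about` or `/careers`, which is the point of link scoring: low-scoring branches of a site are not explored at all.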

20
Web Pages Collection: Site-specific spider
21
CROSSMARC Platform V: The trainable system -
Information Extraction
22
Information Extraction: different approaches of
the monolingual IE systems
  • English IE: machine learning based NERC,
    heuristics demarcator, classification based FE,
    name matcher normaliser
  • Greek IE: machine learning based NERC, heuristics
    demarcator, STALKER-based FE, name matcher
    normaliser
  • Italian IE: rule based NERC, heuristics
    demarcator, name matcher normaliser,
    WHISK-based FE, normaliser
  • French IE: hybrid NERC, heuristics demarcator,
    name matcher normaliser, hybrid FE

23
Information Extraction: an example, the Greek IE
24
CROSSMARC Platform VI: The trainable system
- Data Storage & Presentation
  • Data storage component
  • stores the extracted facts, from the XML file
    produced by IE, into domain-specific databases.
  • Data presentation is implemented by the end User
    Interface, an internationalized web-based
    application
  • Personalization is performed by the
    general-purpose NCSR personalization server
    (PServer)
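
The data storage step (XML produced by IE, loaded into a domain-specific database) can be sketched as below. The XML element names, the `laptop_offers` table and the use of SQLite are assumptions made for illustration; the slides do not specify the actual FE output format or the database used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical IE output for the laptop offers domain.
IE_OUTPUT = """
<offers>
  <offer><manuf>Acme</manuf><model>X1</model><price>999</price></offer>
  <offer><manuf>Bolt</manuf><model>Z9</model><price>1200</price></offer>
</offers>
"""

def store_facts(xml_text, connection):
    """Parse the extracted facts and insert them into a domain-specific table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS laptop_offers "
        "(manuf TEXT, model TEXT, price REAL)"
    )
    rows = [
        (o.findtext("manuf"), o.findtext("model"), float(o.findtext("price")))
        for o in ET.fromstring(xml_text).iter("offer")
    ]
    connection.executemany("INSERT INTO laptop_offers VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
stored = store_facts(IE_OUTPUT, conn)
cheapest = conn.execute(
    "SELECT manuf, model FROM laptop_offers ORDER BY price LIMIT 1"
).fetchone()
```

Once the facts are in a table, the presentation layer can answer user queries with ordinary SQL, as the `cheapest` lookup shows.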

25
Application Building I
  • Development of three applications to extract
    information from
  • laptop offers on e-retailers' web sites (in four
    languages),
  • job offers on IT companies' web sites (in four
    languages),
  • holiday packages on the sites of travel
    agencies (in two languages)

26
Application Building II
  • Involves two main stages
  • Creation of application specific resources using
    the customization infrastructure
  • Training of the system components using the
    application specific resources, configuration of
    the system components

27
Application Building III
  • Stage 1: Creation of application specific
    resources
  • Creation of concepts, their relationships and
    attributes using the ontology editor
  • Creation of their linguistic realizations using
    the lexicon editor
  • Specification of the important named entity types
    using the NERC editor to form a common DTD
  • Specification of the FE XML schema using the
    template editor
  • Collection of the corpus for page filtering using
    the corpus formation tool
  • Collection and annotation of the corpus for IE
    using the annotation tools
  • Specification of user stereotypes using the
    stereotypes editor

28
Ontology editor
29
Lexicon editor
30
NERC editor
31
Shared DTDs
  • Domain 1
  • NE: MANUF, MODEL, PROCESSOR, SOFT_OS
  • TIMEX: TIME, DATE, DURATION
  • NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY,
    RESOLUTION, MONEY, PERCENT

  • Domain 2
  • NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION,
    JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
  • TIMEX: DATE, DURATION
  • NUMEX: MONEY
  • TERM: SCHEDULE, ORG_UNIT
32
Template editor
33
FE schemas for 1st and 2nd domain
34
Stereotypes editor
35
Application Building IV
  • Stage 2: Training of the system components,
    configuration
  • Training of the Crawler
  • Training of the page filtering and link scoring
    modules using the collected corpus
  • Training of each monolingual IE sub-system using
    the collected corpus
  • Customization of the UI exploiting the ontology,
    the lexicons and the stereotypes definitions
  • Configuration of the system components

36
Benefits I
  • CROSSMARC is not just another web extraction
    system
  • CROSSMARC is a platform that provides a
    customization infrastructure and a trainable
    system
  • To cope with the shortcomings of existing
    wrappers, CROSSMARC combines
  • Wrapper Induction techniques to exploit the
    formatting features of the web pages,
  • NLP techniques to exploit linguistic features of
    the web pages,
  • Machine Learning techniques to facilitate
    customization to new applications
  • enabling the processing of application specific
    web pages in different sites and in different
    languages (multilingual, site-independent).
  • CROSSMARC also employs ontology engineering
    techniques to coordinate the creation and
    maintenance of application and language specific
    resources

37
Categorization of tools for Web IE (Laender et
al. 2002)
38
Benefits II
  • After the evaluation performed during application
    building, we can conclude that
  • Crawler: increased effort must be put into the
    initial stage of forming hypotheses about what
    would be good directory and query start points.
  • Spider: we are able to identify, with a fairly
    high degree of confidence, when a Web page is an
    interesting one.
  • Information Extraction: satisfactory performance,
    especially for offer descriptions extracted from
    simpler web pages. In addition, the existing
    systems can be tuned further in order to achieve
    better performance.

39
Crawler Evaluation
  • more than one experimentation cycle may be needed
    depending on the domain and language

40
Spider Evaluation
41
Information extraction Evaluation
  • Results comparable to MUC
  • Full IE is a complex task which becomes even more
    complicated by the visual nature of web pages

42
Concluding remarks
  • CROSSMARC is an operational platform for
    site-independent and multilingual information
    retrieval and extraction from web pages
  • The run-time system is accessible from the
    CROSSMARC site.
  • Access to various project resources and corpora
    will be provided for research purposes.
  • Currently, more advanced components are being
    tested and will be integrated in the platform.

43
Useful Links
  • CROSSMARC site
  • http://www.iit.demokritos.gr/skel/crossmarc
  • Ellogon
  • http://www.ellogon.org