Technologies for semi automatic metadata creation - PowerPoint PPT Presentation

1
Technologies for (semi-)automatic metadata creation
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Diana Maynard, University of Sheffield
KnowledgeWeb WP 1.3 meeting, Crete, 14 May 2004
2
Overview
  • USFD is mainly concerned in this WP with best
    practices and guidelines for ontology-based web
    applications
  • State-of-the-art systems and platforms for
    metadata creation
  • Metadata is created through semantic tagging
  • Metadata can be represented as inline
    (modification of the original document) or
    standoff (separate storage from the document)
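
The inline/standoff distinction above can be illustrated with a minimal sketch. The `Annotation` class, the concept labels and the example sentence are assumptions made for illustration, and the spans are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str   # hypothetical concept name


def to_inline(text: str, annotations: list[Annotation]) -> str:
    """Inline metadata: return a modified copy of the text with tags added."""
    out = text
    # Insert from the rightmost span first so earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        out = (out[:ann.start] + f"<{ann.label}>" + out[ann.start:ann.end]
               + f"</{ann.label}>" + out[ann.end:])
    return out


def to_standoff(annotations: list[Annotation]) -> list[dict]:
    """Standoff metadata: records that point into the untouched text."""
    return [{"start": a.start, "end": a.end, "label": a.label}
            for a in annotations]


text = "Diana Maynard works in Sheffield."
annotations = [Annotation(0, 13, "Person"), Annotation(23, 32, "Location")]

inline = to_inline(text, annotations)    # the document itself is changed
standoff = to_standoff(annotations)      # the document is left as it was
print(inline)
print(standoff)
```

With standoff records the original document stays byte-for-byte unchanged, so several annotation sets can refer to the same text.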

3
Semi-automatic v automatic metadata creation
  • Semi-automatic methods are more reliable, but
    require human intervention
  • MnM: requires initial human annotation and a
    pre-defined ontology
  • S-CREAM
  • AERODAML
  • Automatic methods are less reliable, but suitable
    for large volumes of text, and offer a dynamic view
  • SemTag: semantic tagging from ontology
  • KIM: semantic tagging and ontology population
  • h-TechSight: semantic tagging, ontology population
    and evolution

4
Semi-automatic methods
  • MnM
  • S-CREAM

5
MnM
  • Semi-automatic in that it requires initial
    training by user
  • Uses pre-defined set of concepts in ontology
  • User browses the web and manually annotates
    chosen pages
  • System learns annotation rules, tests them, and
    takes over annotation, populating ontologies with
    the instances found
  • Precision and recall are not perfect, but
    retraining is possible at any stage

6
S-CREAM
  • Semi-automatic CREAtion of Metadata
  • Uses Ont-O-Mat and Amilcare
  • Trainable for different domains
  • Aligns conceptual markup (which defines
    relational metadata) provided by e.g. Ont-O-Mat
    with semantic markup provided by Amilcare

7
Annotated data in S-CREAM
8
Amilcare
  • Amilcare learns IE rules from pre-annotated data
    (e.g. using Ont-O-Mat)
  • Uses GATE (ANNIE) for pre-processing; applies
    rules learnt in training phase to new documents
  • Concepts need to be pre-defined, but system can
    be trained for new domain
  • Can be tuned towards precision or recall
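
The precision/recall tuning mentioned above can be sketched as a confidence threshold on extracted candidates. This is not Amilcare's actual mechanism or API; the candidate mentions, scores and gold set are invented for illustration.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Compute precision and recall of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical extraction candidates: (mention, confidence of the rule that fired).
candidates = [("Sheffield", 0.9), ("Crete", 0.8), ("May", 0.4), ("Amilcare", 0.3)]
# Hypothetical human-annotated gold standard.
gold = {"Sheffield", "Crete", "Amilcare"}


def extract(threshold: float) -> set:
    """Keep only candidates whose confidence reaches the threshold."""
    return {mention for mention, conf in candidates if conf >= threshold}


strict = precision_recall(extract(0.7), gold)    # fewer, safer annotations
lenient = precision_recall(extract(0.2), gold)   # more annotations, more noise
print("strict :", strict)
print("lenient:", lenient)
```

Raising the threshold favours precision; lowering it favours recall.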

9
Automatic methods
  • SemTag
  • KIM
  • h-TechSight

10
SemTag and KIM
  • SemTag and KIM both annotate webpages using
    instances from an ontology
  • Main problem is disambiguating instances that
    occur in multiple parts of the ontology
  • SemTag aims for accuracy of classification,
    whereas KIM aims more for recall (finding all
    instances)
  • KIM also uses IE to find new instances not
    present in ontology

11
SemTag
  • Automated semantic tagging of large corpora,
    using TAP ontology (contains 65K instances)
  • Largest scale semantic tagging effort to date
  • Uses concept of Semantic Label Bureau
  • Annotations are stored separately from web pages
    (standoff markup)
  • Uses corpus-wide statistics to improve quality of
    tagging, e.g. automated alias discovery
  • Tags can be extracted using a variety of
    mechanisms, e.g. search for all tags matching a
    particular object
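
Standoff storage with lookup by object, as described above, might look like the toy store below. `TagStore`, the `tap:` identifiers and the example URLs are hypothetical; this is not the interface of SemTag or the Semantic Label Bureau.

```python
from collections import defaultdict


class TagStore:
    """Toy in-memory standoff store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # object id -> list of (page URL, start offset, end offset)
        self._by_object = defaultdict(list)

    def add(self, url: str, start: int, end: int, object_id: str) -> None:
        """Record that the span [start, end) on a page refers to an object."""
        self._by_object[object_id].append((url, start, end))

    def tags_for(self, object_id: str) -> list:
        """Return every stored tag that matches the given object."""
        return list(self._by_object.get(object_id, []))


store = TagStore()
store.add("http://example.org/a", 10, 19, "tap:Sheffield")
store.add("http://example.org/b", 4, 13, "tap:Sheffield")
store.add("http://example.org/b", 40, 45, "tap:Crete")

print(store.tags_for("tap:Sheffield"))
```

Because the tags live outside the pages, the pages themselves never need to be rewritten or re-published.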

12
SemTag Architecture
13
KIM
  • Uses an ontology (KIMO) with 86K/200K instances
  • Lookup phase marks instances from the ontology
  • High ambiguity of instances with the same label
    (e.g. locations belonging to different countries)
  • Disambiguation uses an Entity Ranking algorithm,
    i.e., priority ordering of entities with the same
    label based on corpus statistics
  • Lookup is combined with rule-based IE system
    (from GATE) to recognise new instances of
    concepts and relations
  • Special KB enrichment stage where some of these
    new instances are added to the KB
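
The lookup and ranking steps above can be sketched as follows. The gazetteer entries, entity identifiers and counts are invented, and the ranking shown (most frequent entity first) is only one plausible reading of "priority ordering based on corpus statistics", not KIM's actual algorithm.

```python
# Hypothetical gazetteer: label -> candidate entity ids sharing that label.
GAZETTEER = {
    "Cambridge": ["Cambridge_US", "Cambridge_UK"],
    "Sheffield": ["Sheffield_UK"],
}

# Hypothetical corpus statistics: how often each entity was seen.
ENTITY_COUNTS = {"Cambridge_UK": 120, "Cambridge_US": 80, "Sheffield_UK": 60}


def lookup(tokens: list[str]) -> list[tuple[str, list[str]]]:
    """Lookup phase: mark every token whose label occurs in the gazetteer."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]


def rank(candidates: list[str]) -> list[str]:
    """Order entities that share a label, most frequent in the corpus first."""
    return sorted(candidates, key=lambda e: ENTITY_COUNTS.get(e, 0), reverse=True)


def annotate(text: str) -> list[tuple[str, str]]:
    """Tag each known label with its top-ranked entity."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    return [(tok, rank(cands)[0]) for tok, cands in lookup(tokens)]


result = annotate("She moved from Sheffield to Cambridge.")
print(result)
```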

14
KIM (2)
15
h-TechSight KMP
  • Knowledge management platform for fully automatic
    metadata creation and ontology population, and
    semi-automatic ontology evolution, powered by
    GATE and ToolBox.
  • Data-driven analysis of ontologies enables trends
    of instances to be monitored
  • Uses GATE to support the instance-based evolution
    of ontologies in the Chemical Engineering domain.
  • Analysis of unrestricted text to extract
    instances of concepts from such ontologies
  • Instances populated into a domain-specific
    ontology and/or exported to an Access / Oracle
    database
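
Monitoring trends of instances, as mentioned above, can be approximated by counting extracted instances per concept and time period. The periods and concept names below are invented for illustration and do not come from h-TechSight.

```python
from collections import Counter

# Hypothetical extraction output: (period, concept) for each instance found.
mentions = [
    ("2004-03", "Catalyst"), ("2004-03", "Polymer"), ("2004-03", "Polymer"),
    ("2004-04", "Catalyst"), ("2004-04", "Catalyst"), ("2004-04", "Catalyst"),
    ("2004-04", "Polymer"),
]


def counts_by_period(mentions: list[tuple[str, str]]) -> dict[str, Counter]:
    """Group instance counts by period, then by concept."""
    table: dict[str, Counter] = {}
    for period, concept in mentions:
        table.setdefault(period, Counter())[concept] += 1
    return table


def trend(table: dict[str, Counter], concept: str) -> list[tuple[str, int]]:
    """Return the count of one concept for each period, in period order."""
    return [(period, table[period][concept]) for period in sorted(table)]


table = counts_by_period(mentions)
catalyst_trend = trend(table, "Catalyst")
print(catalyst_trend)
```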

17
Ontology-based IE in h-TechSight
  • Ontology-Based IE for semantic tagging of job
    adverts, news and reports in chemical engineering
    domain
  • Semantic tagging used as input for ontological
    analysis
  • Fundamental to the application is a
    domain-specific ontology
  • Terminological gazetteer lists are linked to
    classes in the ontology
  • Rules classify the mentions in the text wrt the
    domain ontology
  • Annotations output into a database or as an
    ontology
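
A minimal sketch of gazetteer lists linked to ontology classes, plus one hand-written rule, is given below. The terms, class names and the company-name pattern are assumptions for illustration; the real system uses GATE components, which are not shown here.

```python
import re

# Hypothetical gazetteer: each term is linked to a class in the domain ontology.
GAZETTEER = {
    "process engineer": "Job",
    "distillation": "Process",
    "polymer": "Material",
}

# Hypothetical hand-written rule: capitalised words followed by "Ltd" or "plc"
# are classified as an Organisation.
ORG_RULE = re.compile(r"\b((?:[A-Z][a-z]+ )+)(?:Ltd|plc)\b")


def tag(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, ontology class) for every mention found."""
    annotations = []
    lowered = text.lower()  # assumes lowering keeps offsets aligned (ASCII text)
    for term, cls in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            annotations.append((m.start(), m.end(), cls))
    for m in ORG_RULE.finditer(text):
        annotations.append((m.start(), m.end(), "Organisation"))
    return sorted(annotations)


text = "Acme Chemicals Ltd seeks a process engineer with distillation experience."
mentions = [(text[start:end], cls) for start, end, cls in tag(text)]
print(mentions)
```

The resulting tuples could then be written to a database table or added as instances of the matching ontology class.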

18
Limitations
  • h-TechSight uses a rule-based IE system
  • Requires human expert to write rules
  • Accurate on restricted domains with small
    ontologies
  • Adaptation to a new domain / ontology may require
    some effort

19
Summary
  • Tradeoff between semi-automatic and fully
    automatic systems, dependent on application,
    corpus size, etc.
  • Tradeoff between rule-based and ML techniques for
    IE
  • Tradeoff between dynamic and static systems