
Transcript and Presenter's Notes

Title: Ontology-based Annotation


1
Ontology-based Annotation
  • Sergey Sosnovsky_at_PAWS_at_SIS_at_PITT

2
Outline
  • O-based Annotation
  • Conclusion
  • Questions

3
Why Do We Need Annotation?
  • Annotation-based Services
    • Integration of Dispersed Information
      (knowledge-based linking)
    • Better Indexing and Retrieval (based on the
      document semantics)
    • Content-based Adaptation (modeling document
      content in terms of the domain model)
  • Knowledge Management
    • Organizations' Repositories as mini-Webs (Boeing,
      Rolls Royce, Fiat, GlaxoSmithKline, Merck, NPSA, …)
  • Collaboration Support
    • Knowledge sharing and communication

What is Added by O-based Annotation?
  • Ontology-driven processing (effective formal
    reasoning)
  • Connection to other O-based Services (O-mapping,
    O-visualization)
  • Unified vocabulary
  • Connection to the rest of Semantic Web knowledge

4
Definition
  • O-based Annotation is the process of creating a
    mark-up of Web documents using a pre-existing
    ontology and/or populating knowledge bases from
    marked-up documents

Example: Michael Jordan plays basketball (see the markup sketch below)
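
A minimal sketch of what such a mark-up could look like, written with the
Python rdflib library; the sports namespace, class, and property names are
hypothetical and used only for illustration:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    # Hypothetical sports ontology namespace (illustrative only)
    SPORT = Namespace("http://example.org/sport#")

    g = Graph()
    g.bind("sport", SPORT)

    # Annotate "Michael Jordan plays basketball": the named entity is typed
    # as an instance of an ontology concept, and the verb phrase becomes a
    # relation to another ontology element.
    g.add((SPORT.MichaelJordan, RDF.type, SPORT.Athlete))
    g.add((SPORT.MichaelJordan, RDFS.label, Literal("Michael Jordan")))
    g.add((SPORT.MichaelJordan, SPORT.plays, SPORT.Basketball))

    print(g.serialize(format="turtle"))
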
5
List of Tools
  • AeroDAML / AeroSWARM
  • Annotea / Annozilla
  • Armadillo
  • AktiveDoc
  • COHSE
  • GOA
  • KIM Semantic Annotation Platform
  • MagPie
  • Melita
  • MnM
  • OntoAnnotate
  • Ontobroker
  • OntoGloss
  • ONTO-H
  • Ont-O-Mat / S-CREAM / CREAM
  • Ontoseek
  • Pankow
  • SHOE Knowledge Annotator
  • Seeker
  • Information Extraction Tools
    • Alembic
    • Amilcare / T-REX
    • ANNIE
    • FASTUS
    • LaSIE
    • Proteus
    • SIFT

6
Important Characteristics
  • Automation of Annotation (manual / semi-automatic
    / automatic / editable)
  • Ontology-related issues
    • pluggable ontology (yes/no)
    • ontology language (RDFS / DAML+OIL / OWL / …)
    • local / anywhere access
    • ontology elements available for annotation
      (concepts / instances / relations / triples)
    • where annotations are stored (in the annotated
      document / on a dedicated server / where
      specified)
    • annotation format (XML / RDF / OWL / …)
  • Annotated Documents
    • document kinds (text / multimedia)
    • document formats (plain text / HTML / PDF / …)
    • document access (local / web)
  • Architecture / Interface / Interoperability
    • standalone tool / web interface / web component /
      API / …
  • Annotation Scale (large: the WWW size / small:
    around a hundred documents)
  • Existing Documentation / Tutorial
  • Availability

7
SMORE
  • Manual Annotation
  • OWL-based Markup
  • Simultaneous O modification (if necessary)
  • ScreenScraper mines metadata from annotated pages
    and suggests it as candidates for the mark-up
  • Post-annotation O-based Inference

Example: Michael Jordan plays basketball
8
Problems of Manual Annotation
  • Expensive / Time-consuming
  • Difficult / Error-prone
  • Subjective (two people annotating the same
    documents will in 15-30% of cases annotate them
    differently)
  • Never-ending
    • new documents
    • new versions of ontologies
  • Annotation storage problem
    • where?
  • Trust in the owner's annotation
    • incompetence
    • spam (Google does not use <META> tag info)

Solution: Dedicated Automatic Annotation Services
(Search-Engine-like)
9
Automatic O-based Annotation
  • Supervised
    • MnM
    • S-CREAM
    • Melita / AktiveDoc
  • Unsupervised
    • SemTag - Seeker
    • Armadillo
    • AeroSWARM

10
MnM
  • Ontology-based Annotation Interface
    • Ontology browser (rich navigation capabilities)
    • Document browser (usually a Web browser)
    • The annotation is mainly based on
      select-drag-and-drop association of text fragments
      with ontology elements
  • Built-in or external ML component classifies the
    main corpus of documents
  • Activity Flow (a minimal sketch follows the list)
    • Markup (a human user manually annotates a training
      set of documents with ontology elements)
    • Learn (a learning algorithm is run over the
      marked-up corpus to learn extraction rules)
    • Extract (an IE mechanism is selected and run over
      a set of documents)
    • Review (a human user inspects the results and
      corrects them if necessary)
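
A minimal Python sketch of this activity flow; every function below is a
placeholder standing in for the corresponding component (the annotation
interface, an IE learner such as Amilcare, a review interface), not MnM's
actual API:

    # Placeholder components; replace with real implementations.
    def manual_markup(docs, ontology): raise NotImplementedError
    def learn_extraction_rules(annotated_docs): raise NotImplementedError
    def extract(corpus, rules): raise NotImplementedError
    def human_review(candidate_annotations): raise NotImplementedError

    def supervised_annotation(training_docs, corpus, ontology):
        """Sketch of the Markup -> Learn -> Extract -> Review flow."""
        annotated = manual_markup(training_docs, ontology)  # 1. Markup
        rules = learn_extraction_rules(annotated)           # 2. Learn
        candidates = extract(corpus, rules)                  # 3. Extract
        return human_review(candidates)                      # 4. Review
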

11
Amilcare and T-REX
  • Amilcare
    • Automatic IE component
    • Used in at least five O-based annotation tools
      (Melita, MnM, OntoAnnotate, OntoMat, SemantiK)
    • Released to about 50 industrial and academic
      sites
    • Java API
    • Recently succeeded by T-REX

12
Pankow
  • Input: A web page
  • Step 1: The web page is scanned for phrases that
    might be categorized as instances of the ontology
    (a part-of-speech tagger finds candidate proper
    nouns)
  • Result 1: A set of candidate proper nouns
  • Step 2: The system iterates through all candidate
    proper nouns and all candidate ontology concepts
    to derive hypothesis phrases using preset
    linguistic patterns
  • Result 2: A set of hypothesis phrases
  • Step 3: Google is queried for the hypothesis
    phrases
  • Result 3: The number of hits for each hypothesis
    phrase
  • Step 4: The system sums up the query results to a
    total for each instance-concept pair, then
    categorizes each candidate proper noun into its
    highest-ranked concept (a minimal sketch of this
    loop follows the list)
  • Result 4: An ontologically annotated web page
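
A minimal Python sketch of Steps 2-4; hit_count is a placeholder for a
web-search API call (the original system queried Google), and the patterns
shown are an illustrative subset:

    # Preset linguistic patterns used to build hypothesis phrases.
    PATTERNS = [
        "{instance} is a {concept}",
        "{concept}s such as {instance}",
        "{instance} and other {concept}s",
    ]

    def hit_count(phrase):
        """Placeholder: return the number of web hits for the exact phrase."""
        raise NotImplementedError("plug in a search-engine API here")

    def categorize(candidate_nouns, ontology_concepts):
        """Assign each candidate proper noun to its highest-ranked concept."""
        annotations = {}
        for noun in candidate_nouns:
            # Sum the hits over all hypothesis phrases for each pair.
            scores = {
                concept: sum(hit_count(p.format(instance=noun, concept=concept))
                             for p in PATTERNS)
                for concept in ontology_concepts
            }
            annotations[noun] = max(scores, key=scores.get)
        return annotations
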

13
SemTag - Seeker
  • IBM-developed
  • 264 million web pages
  • 72 thousand concepts (the TAP taxonomy)
  • 434 million automatically disambiguated semantic
    tags
  • Spotting pass
    • Documents are retrieved from the Seeker store
      and tokenized
    • Tokens are matched against the TAP concepts
    • Each resulting label is saved with ten words to
      either side as a "window" of context around the
      particular candidate object (a minimal sketch of
      this pass follows the list)
  • Learning pass
    • A representative sample of the data is scanned to
      determine the corpus-wide distribution of terms
      at each internal node of the taxonomy; the TBD
      (Taxonomy-Based Disambiguation) algorithm is used
  • Tagging pass
    • Windows are scanned once more to disambiguate
      each reference and determine the corresponding
      TAP object
    • A record is entered into a database of final
      results containing the URL, the reference, and
      any other associated metadata
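
A minimal Python sketch of the spotting pass, assuming the TAP labels are
available as a dictionary from lowercase label to concept identifier; the
single-token matching and all names here are simplifications for
illustration:

    WINDOW = 10  # words of context kept on each side of a match

    def spotting_pass(tokens, tap_labels):
        """Return (concept, context_window) pairs for tokens matching TAP labels."""
        spots = []
        for i, token in enumerate(tokens):
            concept = tap_labels.get(token.lower())
            if concept is not None:
                # Keep ten words to either side as the disambiguation context.
                left = tokens[max(0, i - WINDOW):i]
                right = tokens[i + 1:i + 1 + WINDOW]
                spots.append((concept, left + [token] + right))
        return spots

    # Example with a hypothetical label table:
    # spotting_pass("Michael Jordan plays basketball".split(),
    #               {"jordan": "tap:MichaelJordan"})
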

14
Conclusions
  • Web-document Annotation is a necessary thing
  • O-based Annotation adds benefits (O-based
    post-processing, unified vocabularies, etc.)
  • Manual Annotation is a bad thing
  • Automatic Annotation is a good thing
  • Supervised O-based Annotation
    • Useful O-based interface for annotating the
      training set
    • Traditional IE tools for textual classification
  • Unsupervised O-based Annotation
    • COHSE: matches concept names from the ontology
      and a thesaurus against tokens from the text
    • Pankow: uses the ontology to build candidate
      queries, then uses community wisdom to choose the
      best candidate
    • SemTag: uses concept names to match tokens, and
      hierarchical relations in the ontology to
      disambiguate between candidate concepts for a
      text fragment

15
Questions?