Technologies for semi-automatic metadata creation

1
Technologies for (semi-) automatic metadata creation
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Diana Maynard, University of Sheffield
KnowledgeWeb WP 1.3 meeting, Crete, 14 May 2004
2
Overview

USFD is mainly concerned in this WP with best
practices and guidelines for ontology-based web
applications
State-of-the-art systems and platforms for
metadata creation
Metadata is created through semantic tagging
Metadata can be represented as inline
(modification of the original document) or
standoff (separate storage from the document)
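
The difference between inline and standoff metadata can be illustrated with a minimal sketch. The example sentence, the class label `Organization` and the record layout are assumptions made for illustration; none of the systems discussed here is claimed to use this exact format.

```python
# Illustrative only: the sentence, class label and record layout are assumptions.
text = "Diana Maynard works at the University of Sheffield."

# One semantic tag: character offsets into `text` plus an ontology class.
mention = "University of Sheffield"
start = text.index(mention)
annotation = {"start": start, "end": start + len(mention), "cls": "Organization"}


def to_inline(text, ann):
    """Inline markup: the tag is written into a modified copy of the document."""
    s, e, cls = ann["start"], ann["end"], ann["cls"]
    return text[:s] + f"<{cls}>" + text[s:e] + f"</{cls}>" + text[e:]


def to_standoff(text, ann):
    """Standoff markup: the document is untouched; tags are stored separately."""
    return text, [ann]


inline_doc = to_inline(text, annotation)
original, standoff = to_standoff(text, annotation)
```

With standoff storage the original text is never modified, so several independent sets of annotations can point at the same document through offsets.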

3
Semi-automatic v automatic metadata creation

Semi-automatic methods are more reliable, but
require human intervention
MnM: requires initial human annotation and a
pre-defined ontology
S-CREAM
AeroDAML
Automatic methods are less reliable, but suitable
for large volumes of text, and offer a dynamic view
SemTag: semantic tagging from ontology
KIM: semantic tagging and ontology population
h-TechSight: semantic tagging, ontology population
and evolution

4
Semi-automatic methods

MnM
S-CREAM

5
MnM

Semi-automatic in that it requires initial
training by user
Uses pre-defined set of concepts in ontology
User browses web and manually annotates his
chosen pages
System learns annotation rules, tests them, and
takes over annotation, populating ontologies with
the instances found
Precision and recall are not perfect; however,
retraining is possible at any stage

6
S-CREAM

Semi-automatic CREAtion of Metadata
Uses Ont-O-Mat and Amilcare
Trainable for different domains
Aligns conceptual markup (which defines
relational metadata) provided by e.g. Ont-O-Mat
with semantic markup provided by Amilcare

7
Annotated data in S-CREAM
8
Amilcare

Amilcare learns IE rules from pre-annotated data
(e.g. using Ont-O-Mat)
Uses GATE (ANNIE) for pre-processing, then applies
rules learnt in training phase to new documents
Concepts need to be pre-defined, but system can
be trained for new domain
Can be tuned towards precision or recall
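
To make the precision/recall trade-off concrete, here is a minimal sketch. It assumes each candidate annotation carries a confidence score and that tuning means moving an acceptance threshold; the slides do not say how Amilcare is actually tuned, so the threshold, the scores and the gold set below are all hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted, recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def select(candidates, threshold):
    """Keep only candidates whose confidence reaches the threshold."""
    return [item for item, score in candidates if score >= threshold]


# Hypothetical gold annotations and scored system output.
gold = {"a", "b", "c", "d"}
candidates = [("a", 0.9), ("b", 0.8), ("x", 0.6), ("c", 0.4), ("y", 0.3)]

strict = precision_recall(select(candidates, 0.7), gold)   # favours precision
lenient = precision_recall(select(candidates, 0.2), gold)  # favours recall
```

In this toy data the strict threshold keeps only correct items but misses half of the gold set, while the lenient one finds more of the gold set at the cost of two wrong annotations.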

9
Automatic methods

SemTag
KIM
h-TechSight

10
SemTag and KIM

SemTag and KIM both annotate webpages using
instances from an ontology
Main problem is to disambiguate such instances
which occur in multiple parts of the ontology
SemTag aims for accuracy of classification,
whereas KIM aims more for recall (finding all
instances)
KIM also uses IE to find new instances not
present in ontology

11
SemTag

Automated semantic tagging of large corpora,
using TAP ontology (contains 65K instances)
Largest scale semantic tagging effort to date
Uses concept of Semantic Label Bureau
Annotations are stored separately from web pages
(standoff markup)
Uses corpus-wide statistics to improve quality of
tagging, e.g. automated alias discovery
Tags can be extracted using a variety of
mechanisms, e.g. search for all tags matching a
particular object
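
Retrieving all tags that match a particular object can be sketched as a lookup over standoff records. The record layout, the example URLs and the `tap:` identifiers are hypothetical; this is not SemTag's actual storage format or query interface.

```python
from collections import defaultdict

# Hypothetical standoff tags: (page URL, start offset, end offset, object id).
tags = [
    ("http://example.org/a", 10, 24, "tap:MichaelJordan"),
    ("http://example.org/b", 3, 17, "tap:MichaelJordan"),
    ("http://example.org/b", 40, 47, "tap:Chicago"),
]

# Index the tags by ontology object so lookups do not scan every page.
by_object = defaultdict(list)
for url, start, end, obj in tags:
    by_object[obj].append((url, start, end))


def tags_for(obj):
    """Return every (url, start, end) tagged with the given object."""
    return list(by_object.get(obj, []))
```

Because the tags live apart from the pages, an index like this can be rebuilt or queried without touching the annotated documents.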

12
SemTag Architecture
13
KIM

Uses an ontology (KIMO) with 86K/200K instances
Lookup phase marks instances from the ontology
High ambiguity of instances with the same label
(e.g. locations belonging to different countries)
Disambiguation uses an Entity Ranking algorithm,
i.e., priority ordering of entities with the same
label based on corpus statistics
Lookup is combined with rule-based IE system
(from GATE) to recognise new instances of
concepts and relations
Special KB enrichment stage where some of these
new instances are added to the KB
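
The idea of priority ordering among entities that share a label can be sketched as follows. The candidate table, the counts and the choice of raw frequency as the ranking statistic are assumptions; the slides do not specify which corpus statistics KIM's Entity Ranking algorithm uses.

```python
from collections import Counter

# Hypothetical lookup result: surface label -> candidate ontology instances.
candidates = {
    "Paris": ["Paris_Texas", "Paris_France"],
    "Sheffield": ["Sheffield_UK", "Sheffield_Alabama"],
}

# Hypothetical corpus statistics: how often each instance is mentioned.
mention_counts = Counter({
    "Paris_France": 950, "Paris_Texas": 50,
    "Sheffield_UK": 400, "Sheffield_Alabama": 20,
})


def rank_entities(label):
    """Order the candidates for a label from most to least frequent."""
    return sorted(candidates.get(label, []),
                  key=lambda entity: mention_counts[entity],
                  reverse=True)


def disambiguate(label):
    """Pick the top-ranked candidate, or None if the label is unknown."""
    ranked = rank_entities(label)
    return ranked[0] if ranked else None
```

A label with no candidates returns `None`, which is where a rule-based IE stage could step in to propose a new instance.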

14
KIM (2)
15
h-TechSight KMP

Knowledge management platform for fully automatic
metadata creation and ontology population, and
semi-automatic ontology evolution, powered by
GATE and ToolBox.
Data-driven analysis of ontologies enables trends
of instances to be monitored
Uses GATE to support the instance-based evolution
of ontologies in the Chemical Engineering domain.
Analysis of unrestricted text to extract
instances of concepts from such ontologies
Instances populated into a domain-specific
ontology and/or exported to an Access / Oracle
database

16
17
Ontology-based IE in h-TechSight

Ontology-Based IE for semantic tagging of job
adverts, news and reports in chemical engineering
domain
Semantic tagging used as input for ontological
analysis
Fundamental to the application is a
domain-specific ontology
Terminological gazetteer lists are linked to
classes in the ontology
Rules classify the mentions in the text with
respect to the domain ontology
Annotations output into a database or as an
ontology
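
Linking gazetteer terms to ontology classes can be sketched in plain Python. This is not GATE's API; the gazetteer entries, the class names and the output record layout are hypothetical, and a single lookup rule stands in for the hand-written rules described above.

```python
import re

# Hypothetical gazetteer: term -> class in the domain ontology.
gazetteer = {
    "chemical engineer": "JobTitle",
    "process engineer": "JobTitle",
    "polymer": "Material",
}


def tag(text):
    """Find gazetteer terms in the text and label each with its class."""
    annotations = []
    for term, cls in gazetteer.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append({"start": m.start(), "end": m.end(),
                                "text": m.group(0), "cls": cls})
    # Sort by position so the records follow document order.
    return sorted(annotations, key=lambda a: a["start"])


advert = "Process Engineer wanted: experience with polymer extrusion required."
rows = tag(advert)
```

Each record in `rows` has the shape of a database row (offsets, matched text, class), which is one way the output step could be fed.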

18
Limitations

h-TechSight uses a rule-based IE system
Requires human expert to write rules
Accurate on restricted domains with small
ontologies
Adaptation to a new domain / ontology may require
some effort

19
Summary

Tradeoff between semi-automatic and fully
automatic systems, dependent on application,
corpus size, etc.
Tradeoff between rule-based and ML techniques for
IE
Tradeoff between dynamic and static systems
