Metadata Extraction: Human Language Technology and the Semantic Web

1
Metadata Extraction: Human Language Technology and the Semantic Web
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard
SEKT meeting, London, 21 January 2004
2
The Knowledge Economy and Human Language

Gartner, December 2002:
taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
The challenge: to reconcile these two opposing tendencies

3
HLT and Knowledge: Closing the Language Loop
KEY: MNLG = Multilingual Natural Language Generation; OIE = Ontology-aware Information Extraction; AIE = Adaptive IE; CLIE = Controlled Language IE
[Diagram labels: (M)NLG; Semantic Web, Semantic Grid, Semantic Web Services; Formal Knowledge (ontologies and instance bases); Human Language; OIE; (A)IE; Controlled Language; CLIE]
4
Structure of the Tutorial

Information Extraction: definition
Evaluation: corpora and metrics
IE approaches: some examples
Rule-based approaches
Learning-based approaches
Semantic Tagging
Using traditional IE
Ontology-based IE
Platforms for large-scale processing
Language Generation

5
Information Extraction

Information Extraction (IE) pulls facts and
structured information from the content of large
text collections.
Contrast: IE and Information Retrieval
NLP history: from NLU to IE
Progress driven by quantitative measures
MUC: Message Understanding Conferences
ACE: Automatic Content Extraction

6
MUC-7 tasks

Held in 1997, with around 15 participants, including 2 from the UK.
Broke IE down into component tasks:
NE: Named Entity recognition and typing
CO: co-reference resolution
TE: Template Elements
TR: Template Relations
ST: Scenario Templates

7
An Example

NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
TE: the rocket is "shiny red" and Head's "brainchild".
TR: Dr. Head works for We Build Rockets Inc.
ST: a rocket launching event occurred with the various participants.

The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.

8
Performance levels

Vary according to text type, domain, scenario, language
NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
CO: 60-70% resolution
TE: 80%
TR: 75-80%
ST: 60% (but human level may be only 80%)

9
What are Named Entities?

NE involves identification of proper names in
texts, and classification into a set of
predefined categories of interest
Person names
Organizations (companies, government
organisations, committees, etc)
Locations (cities, countries, rivers, etc)
Date and time expressions

10
What are Named Entities (2)

Other common types: measures (percent, money, weight etc.), email addresses, Web addresses, street addresses, etc.
Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
MUC-7 entity definition guidelines [Chinchor97]
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

11
What are NOT NEs (MUC-7)

Artefacts: Wall Street Journal
Common nouns referring to named entities: the company, the committee
Names of groups of people and things named after people: the Tories, the Nobel prize
Adjectives derived from names: Bulgarian, Chinese
Numbers which are not times, dates, percentages, or money amounts

12
Basic Problems in NE

Variation of NEs: e.g. John Smith, Mr Smith, John.
Ambiguity of NE types:
John Smith (company vs. person)
May (person vs. month)
Washington (person vs. location)
1945 (date vs. time)
Ambiguity with common words, e.g. "may"

13
More complex problems in NE

Issues of style, structure, domain, genre etc.
Punctuation, spelling, spacing, formatting, ...
all have an impact
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
Tell me more about Leonardo
Da Vinci

15
Corpora and System Development

Corpora are typically divided into a training and a testing portion
Rules/learning algorithms are trained on the training part
They are then tuned on the testing portion in order to optimise:
Rule priorities, rule effectiveness, etc.
Parameters of the learning algorithm and the features used
Evaluation set: the best system configuration is run on this data and the system performance is obtained
No further tuning once the evaluation set is used!

16
Some NE Annotated Corpora

MUC-6 and MUC-7 corpora: English
CONLL shared task corpora:
http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
TIDES surprise language exercise (NEs in Cebuano and Hindi)
ACE: English - http://www.ldc.upenn.edu/Projects/ACE/

17
The MUC-7 corpus

100 documents in SGML
News domain
Named Entities:
1880 Organizations (46%)
1324 Locations (32%)
887 Persons (22%)
Inter-annotator agreement very high (97%)
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

18
The MUC-7 Corpus (2)

<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
<p>
Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
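
As a rough illustration of how such inline MUC-style markup can be read back into structured form, the sketch below pulls the tagged spans out with a regular expression. It assumes well-formed, non-nested tags of the shape shown on the slide; the function name `extract_entities` and the shortened sample string are illustrative and not part of any MUC scoring software.

```python
import re
from collections import Counter

# Matches <ENAMEX TYPE="...">text</ENAMEX> and the TIMEX/NUMEX equivalents.
# Assumption: tags are well formed and never nested.
MUC_TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>', re.DOTALL)

def extract_entities(sgml):
    """Return (element, type, text) triples for every tagged span."""
    return [(m.group(1), m.group(2), " ".join(m.group(3).split()))
            for m in MUC_TAG.finditer(sgml)]

# Shortened version of the slide's example text.
sample = ('<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, '
          '<ENAMEX TYPE="LOCATION">Fla.</ENAMEX> Working in chilly temperatures '
          '<TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, '
          '<ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the shuttle.')

entities = extract_entities(sample)
type_counts = Counter(etype for _, etype, _ in entities)
```

Counting by type in this way is how per-category totals like those quoted for the MUC-7 corpus could be produced.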

19
ACE: Towards Semantic Tagging of Entities

MUC NE tags segments of text whenever that text represents the name of an entity
In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
Rolls together the NE and CO tasks
Domain- and genre-independent approaches
ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)

20
ACE Entities

Dealing with:
Proper names, e.g., England, Mr. Smith, IBM
Pronouns, e.g., he, she, it
Nominal mentions: the company, the spokesman
Identify which mentions in the text refer to which entities, e.g.,
Tony Blair, Mr. Blair, he, the prime minister, he
Gordon Brown, he, Mr. Brown, the chancellor

21
ACE Example

<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type="ORGANIZATION">
  <entity_mention ID="M003" TYPE="NAME" string="National Air Traffic Services">
  </entity_mention>
  <entity_mention ID="M004" TYPE="NAME" string="NATS">
  </entity_mention>
  <entity_mention ID="M005" TYPE="PRO" string="its">
  </entity_mention>
  <entity_mention ID="M006" TYPE="NAME" string="Nats">
  </entity_mention>

22
Annotation Tools: Alembic, GATE, ...
23
Performance Evaluation

Evaluation metric: mathematically defines how to measure the system's performance against a human-annotated gold standard
Scoring program: implements the metric and provides performance measures
For each document and over the entire corpus
For each type of NE

24
The Evaluation Metric

Precision = correct answers / answers produced
Recall = correct answers / total possible correct answers
Trade-off between precision and recall
F-Measure: $F = \frac{(\beta^2 + 1) P R}{\beta^2 R + P}$ [van Rijsbergen 75]
β reflects the weighting between precision and recall, typically β = 1
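
The three measures can be computed directly from counts, as in the minimal sketch below. The F-measure is written in the form given on the slide, with β² multiplying recall in the denominator; many textbooks place β² on precision instead, and the two agree when β = 1. The function names and the example counts are illustrative assumptions.

```python
def precision_recall(correct, produced, possible):
    """Precision and recall from raw answer counts."""
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    """F-measure in the slide's form: (b^2 + 1) P R / (b^2 R + P)."""
    denominator = beta ** 2 * recall + precision
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator

# Hypothetical system: 100 answers produced, 80 correct, 160 in the gold standard.
p, r = precision_recall(correct=80, produced=100, possible=160)
f1 = f_measure(p, r)
```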

25
The Evaluation Metric (2)

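We may also want to take account of partially correct answers
$\text{Precision} = \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Incorrect} + \text{Partial}}$
$\text{Recall} = \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Missing} + \text{Partial}}$
Why: NE boundaries are often misplaced, so some results are only partially correct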
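
A short sketch of the half-credit variant follows. The function name and the counts in the example are assumptions made for illustration; only the two formulas come from the slide.

```python
def partial_credit_scores(correct, partial, incorrect, missing):
    """Precision and recall where each partially correct answer counts as 1/2."""
    credit = correct + 0.5 * partial
    produced = correct + incorrect + partial   # everything the system output
    possible = correct + missing + partial     # everything in the gold standard
    precision = credit / produced if produced else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall

# Hypothetical counts: 70 exact matches, 20 with misplaced boundaries,
# 10 spurious answers and 10 gold entities missed entirely.
lenient_p, lenient_r = partial_credit_scores(70, 20, 10, 10)
```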

26
The GATE Evaluation Tool
27
Corpus-level Regression Testing

Need to track the system's performance over time
When a change is made to the system we want to know what the implications are over the entire corpus
Why: because an improvement in one case can lead to problems in others
GATE offers an automated tool to help with the NE development task over time

28
Regression Testing (2)
At corpus level: GATE's corpus benchmark tool tracks the system's performance over time
29
Challenge: Evaluating Richer NE Tagging

Need for new metrics when evaluating hierarchy/ontology-based NE tagging
Need to take into account distance in the hierarchy
Tagging a company as a charity is less wrong than tagging it as a person
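
One way to make "less wrong" measurable is to score a prediction by its path distance from the gold class in the hierarchy. The sketch below is only an illustration of that idea: the five-class hierarchy, the function names, and the 1/(1 + distance) scoring are all assumptions, not a metric proposed in the tutorial.

```python
# Hypothetical toy hierarchy, stored as child -> parent.
PARENT = {
    "Company": "Organization",
    "Charity": "Organization",
    "Organization": "Entity",
    "Person": "Entity",
    "Entity": None,
}

def ancestors(cls):
    """Return [cls, parent, grandparent, ...] up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain

def tree_distance(a, b):
    """Number of edges on the path between two classes."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("classes share no ancestor")

def graded_score(gold, predicted):
    """1.0 for an exact match, falling as the prediction moves further away."""
    return 1.0 / (1.0 + tree_distance(gold, predicted))
```

With this toy hierarchy, `graded_score("Company", "Charity")` is higher than `graded_score("Company", "Person")`, which matches the intuition on the slide.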

30
SW IE Evaluation tasks

Detection of entities and events, given a target
ontology of the domain.
Disambiguation of the entities and events from
the documents with respect to instances in the
given ontology. For example, measuring whether
the IE correctly disambiguated Cambridge in the
text to the correct instance Cambridge, UK vs
Cambridge, MA.
Decision when a new instance needs to be added to
the ontology, because the text contains a new
instance, that does not already exist in the
ontology.

32
Two kinds of IE approaches

Knowledge Engineering
rule based
developed by experienced language engineers
make use of human intuition
requires only small amount of training data
development could be very time consuming
some changes may be hard to accommodate

Learning Systems
use statistics or other machine learning
developers do not need LE expertise
requires large amounts of annotated training data
some changes may require re-annotation of the
entire training corpus
annotators are cheap (but you get what you pay
for!)

33
NE Baseline: list lookup approach

System that recognises only entities stored in its lists (gazetteers).
Advantages: simple, fast, language independent, easy to retarget (just create lists)
Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
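
The baseline can be sketched in a few lines: a dictionary of known names and a greedy longest-match scan over the tokens. The gazetteer entries, the lower-casing, and the function name `lookup` are illustrative assumptions; this is not GATE's gazetteer implementation.

```python
# Hypothetical gazetteer: tuple of lower-cased tokens -> entity type.
GAZETTEER = {
    ("london",): "Location",
    ("new", "york"): "Location",
    ("ibm",): "Organization",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def lookup(tokens):
    """Greedy longest-match lookup; returns (start, end, type) token spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans

tokens = "IBM opened offices in New York and London".split()
spans = lookup(tokens)
```

Anything not in the lists is silently missed, and a listed string is always given the same type, which is exactly the set of disadvantages named on the slide.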

34
Shallow parsing approach using internal structure

Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. location:
Cap. Word + {City, Forest, Center, River}
e.g. Sherwood Forest
Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
e.g. Portobello Street
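
The two location patterns can be approximated with a regular expression, as in the sketch below. It treats a "Cap. Word" as a capital letter followed by lower-case letters, which is a simplifying assumption, and the names `LOCATION` and `find_locations` are illustrative.

```python
import re

NATURAL = r"(?:City|Forest|Center|River)"
STREET = r"(?:Street|Boulevard|Avenue|Crescent|Road)"
CAP_WORD = r"[A-Z][a-z]+"  # simplified notion of a capitalised word

# One or more capitalised words followed by a location indicator word.
LOCATION = re.compile(rf"\b(?:{CAP_WORD} )+(?:{NATURAL}|{STREET})\b")

def find_locations(text):
    """Return every substring that fits the internal-structure pattern."""
    return [m.group(0) for m in LOCATION.finditer(text)]

found = find_locations("We walked from Sherwood Forest to Portobello Street.")
```

Because the pattern relies only on capitalisation, a capitalised sentence-initial word directly before a name would be swallowed into the match.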

35
Problems ...

Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

36
Shallow parsing with context

Use of context-based patterns is helpful in ambiguous cases
"David Walton" and "Goldman Sachs" are indistinguishable
But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "Person of Organization" to identify "Goldman Sachs" correctly.

37
Examples of context patterns

PERSON earns MONEY
PERSON joined ORGANIZATION
PERSON left ORGANIZATION
PERSON joined ORGANIZATION as JOBTITLE
ORGANIZATION's JOBTITLE PERSON
ORGANIZATION JOBTITLE PERSON
the ORGANIZATION JOBTITLE
part of the ORGANIZATION
ORGANIZATION headquarters in LOCATION
price of ORGANIZATION
sale of ORGANIZATION
investors in ORGANIZATION
ORGANIZATION is worth MONEY
JOBTITLE PERSON
PERSON, JOBTITLE
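
A single context pattern of this kind, "PERSON of ORGANIZATION", can be sketched as follows. The sketch assumes the person names have already been recognised and that an organisation name is a run of capitalised words; the function name is hypothetical and the approach is far simpler than a real grammar.

```python
import re

# A run of capitalised words, e.g. "Goldman Sachs" (simplifying assumption).
CAP_PHRASE = r"((?:[A-Z][\w&]*)(?: [A-Z][\w&]*)*)"

def apply_person_of_org(text, known_persons):
    """Label the capitalised phrase after '<Person> of' as an Organization."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of " + CAP_PHRASE)
        for match in pattern.finditer(text):
            found.append((match.group(1), "Organization"))
    return found

text = "David Walton of Goldman Sachs said rates would rise."
result = apply_person_of_org(text, ["David Walton"])
```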

38
Example Rule-based System - ANNIE

Created as part of GATE
GATE automatically deals with document formats,
saving of results, evaluation, and visualisation
of results for debugging
GATE has a finite-state pattern-action rule
language, used by ANNIE
ANNIE modified for MUC guidelines: 89.5% f-measure on MUC-7 NE corpus

39
NE Components: the ANNIE system, a reusable and easily extendable set of components
40
Gazetteer lists for rule-based NE

Needed to store the indicator strings for the internal structure and context rules
Internal location indicators: e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square} for address locations
Internal organisation indicators: e.g., company designators {GmbH, Ltd, Inc, ...}
Produces Lookup results of the given kind

41
The Named Entity Grammars

Phases run sequentially and constitute a cascade
of FSTs over the pre-processing results
Hand-coded rules applied to annotations to
identify NEs
Annotations from format analysis, tokeniser,
sentence splitter, POS tagger, and gazetteer
modules
Use of contextual information
Finds person names, locations, organisations,
dates, addresses.

NE Rule in JAPE
JAPE: a Java Annotation Patterns Engine
Light, robust regular-expression-based processing
Cascaded finite state transduction
Low-overhead development of new components
Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+  //from tokeniser
  {Lookup.kind == companyDesignator}        //from gazetteer lists
):match
-->
:match.NamedEntity = { kind = company, rule = "Company1" }

43
Named Entities in GATE
44
Using co-reference to classify ambiguous NEs

Orthographic co-reference module that matches proper names in a document
Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
May not reclassify already classified entities
Classification of unknown entities is very useful for surnames which match a full name, or abbreviations, e.g. Bonfield will match Sir Peter Bonfield; International Business Machines Ltd. will match IBM
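
The two matching cases named above (a surname against a full name, an abbreviation against a long form) can be sketched like this. The designator list, the helper names, and the "last word or acronym" test are illustrative assumptions and cover much less than a real orthographic matcher would.

```python
# Company designators ignored when building an acronym (illustrative list).
DESIGNATORS = {"ltd", "ltd.", "inc", "inc.", "gmbh", "plc"}

def acronym(name):
    """Initial letters of the name, skipping company designators."""
    words = [w for w in name.split() if w.lower() not in DESIGNATORS]
    return "".join(w[0].upper() for w in words)

def matches(short, full):
    """True if `short` is the last word of `full` or its acronym."""
    return short != full and (full.split()[-1] == short or acronym(full) == short)

def propagate_types(classified, unclassified):
    """Give each unclassified name the type of a matching classified name."""
    result = {}
    for name in unclassified:
        for full, entity_type in classified.items():
            if matches(name, full):
                result[name] = entity_type
                break
    return result

classified = {"Sir Peter Bonfield": "Person",
              "International Business Machines Ltd.": "Organization"}
resolved = propagate_types(classified, ["Bonfield", "IBM", "Sheffield"])
```

Names that are already classified are never passed in as `unclassified`, which mirrors the rule that existing classifications are not overwritten.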

45
Named Entity Coreference
47
Machine Learning Approaches

Approaches:
Train ML models on manually annotated text
Mixed initiative learning
Used for producing training data
Used for producing working systems
ML Methods:
Symbolic learning: rules/decision trees induction
Statistical models: HMMs, Bayesian methods, Maximum Entropy

48
ML Terminology

Instances (tokens, entities)
Occurrences of a phenomenon
Attributes (features)
Characteristics of the instances
Classes
Sets of similar instances

49
Methodology

The task can be broken into several subtasks
(that can use different methods)
Boundary detection
Entity classification into NE types
Different models for different entity types
Several models can be used in competition.
Some algorithms perform better on little data while others are better when more training data is available

50
Methodology (2)

Boundaries (and entity types) notations
S(-XXX), E(-XXX)
<S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for <S-LOC/>Baghdad<E-LOC/>.
IOB notation (Inside, Outside, Beginning_of)
U.N. I-ORG
official O
Ekeus I-PER
heads O
for O
Baghdad I-LOC
. O
Translations between the two conventions are straightforward
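
A minimal sketch of the translation from boundary tags to IOB tags is given below. It assumes whitespace tokenisation and, following the example on the slide, uses the `B-` prefix only when an entity directly follows another entity of the same type; the function name `se_to_iob` is hypothetical.

```python
import re

TAG = re.compile(r"<(S|E)-([A-Z]+)/>")

def se_to_iob(marked_up):
    """Convert S-/E- boundary markup into (token, IOB tag) pairs."""
    spaced = TAG.sub(lambda m: f" {m.group(0)} ", marked_up)
    pairs, current, first, last_type = [], None, False, None
    for item in spaced.split():
        tag_match = TAG.fullmatch(item)
        if tag_match:
            kind, entity_type = tag_match.groups()
            if kind == "S":
                current, first = entity_type, True   # entering an entity
            else:
                current = None                       # leaving an entity
            continue
        if current is None:
            tag, last_type = "O", None
        else:
            # B- marks a new entity adjacent to one of the same type.
            prefix = "B" if first and last_type == current else "I"
            tag, last_type, first = f"{prefix}-{current}", current, False
        pairs.append((item, tag))
    return pairs

sentence = ("<S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> "
            "heads for <S-LOC/>Baghdad<E-LOC/>.")
iob = se_to_iob(sentence)
```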

51
Features

Linguistic features
POS
Morphology
Syntax
Lexicon data
Semantic features
Ontological class
ETC

Document structure
Original markup
Paragraph/sentence structure
Surface features
Token length
Capitalisation
Token type (word, punctuation, symbol)

Feature selection: the most difficult part
Some automatic scoring methods can be used

52
Mixed Initiative Learning

Human-computer interaction
Speeds up the creation of training data
Can be used for corpus/system creation
Example implementations:
Alembic [Day et al 97]
Amilcare [Ciravegna 03]
53
Mixed Initiative Learning (2)
User annotates
System learns
P > t1
P > t2
54
GATE Machine Learning support

Uses classification.
Attr1, Attr2, Attr3, ..., Attrn → Class
Classifies annotations.
(Documents can be classified as well using a 1-to-1 relation with annotations.)
Annotations of a particular type are selected as instances.
Attributes refer to features of the instance annotations or their context.
Generic implementation for attribute collection: can be linked to any ML engine.
ML engines currently integrated: WEKA and Ontotext's HMM.
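
To make the "attributes to class" framing concrete, the sketch below turns a token sequence into one attribute dictionary per token, drawing on the token itself and its neighbours. It does not use the GATE or WEKA APIs; the attribute names, the padding value, and the window size are assumptions chosen for illustration.

```python
def collect_attributes(tokens, index, window=1):
    """Build one attribute dict for the token at `index`, including context."""
    attrs = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        tok = tokens[pos] if 0 <= pos < len(tokens) else None
        attrs[f"string[{offset}]"] = tok.lower() if tok else "<pad>"
        attrs[f"upperInitial[{offset}]"] = bool(tok) and tok[0].isupper()
    return attrs

def build_dataset(tokens, labels, window=1):
    """Pair every token's attributes with its class label."""
    return [(collect_attributes(tokens, i, window), labels[i])
            for i in range(len(tokens))]

# Each token is an instance; the label is the class to be learned.
tokens = ["Mr", "Smith", "joined", "IBM"]
labels = ["O", "Person", "O", "Organization"]
dataset = build_dataset(tokens, labels)
```

The resulting list of (attributes, class) pairs is the kind of table that could then be handed to any classifier.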

55
Implementation

Machine Learning PR in GATE.
Has two functioning modes
training
application
Uses an XML file for configuration
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> </DATASET>
  <ENGINE></ENGINE>
</ML-CONFIG>

56
Attributes Collection
Instances type: Token
57
Dataflow
GATE ML Library
NLP Pipeline: Tokeniser, Gazetteer, POS Tagger, Lexicon Lookup, Semantic Tagger, etc.
Annotated documents
Plain text documents
Feature Collection
Results Converter
Engine Interface
Machine Learning Engine
58
Amilcare & Melita

Amilcare: rule-learning algorithm
Tagging rules: learn to insert tags in the text, given training examples
Correction rules: learn to move already inserted tags to their correct place in the text
Novel aspect: learns begin and end tags independently
Melita: supports adaptive IE
Applied in SemWeb context (see below)
Being extended as part of the EU-funded DOT.KOM project towards KM and SemWeb applications

[Ciravegna 03] www.dcs.shef.ac.uk/fabio
60
Towards Semantic Tagging of Entities

The MUC NE task tags selected segments of text
whenever that text represents the name of an
entity.
Semantic tagging: view these names as mentions of the underlying instances from the ontology
Identify which mentions in the text refer to
which instances in the ontology, e.g.,
Tony Blair, Mr. Blair, he, the prime minister, he
Gordon Brown, he, Mr. Brown, the chancellor

61
Tasks

Identify entity mentions in the text
Reference disambiguation:
Add new instances if needed
Disambiguate wrt instances in the ontology
Identify instances of attributes and relations:
take into account what is allowed given the ontology, using domain and range as constraints

62
Example
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in
[Diagram labels: Ontology & KB; classes Location, Company, City, Country; relations HQ, partOf, type, establOn; value 03/11/1978]
63
Classes, instances & metadata
Gordon Brown met George Bush during his two day visit.
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset> <e_offset> 12 </e_offset>
    <string>Gordon Brown</string> <class>Person</class> <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset> <e_offset> 32 </e_offset>
    <string>George Bush</string> <class>Person</class> <inst>Person67890</inst>
  </Annotation>
</metadata>
Classes & instances after
Classes & instances before
Bush
64
Classes, instances & metadata (2)
Gordon Brown met Tony Blair to discuss the university tuition fees.
<metadata>
  <DOC-ID>http:// 2.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset> <e_offset> 12 </e_offset>
    <string>Gordon Brown</string> <class>Person</class> <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset> <e_offset> 30 </e_offset>
    <string>Tony Blair</string> <class>Person</class> <inst>Person26389</inst>
  </Annotation>
</metadata>
Classes & instances after
Classes & instances before
G. Brown
G. Bush
65
Why not put metadata in ontologies?

Can be encoded in RDF/OWL, etc., but does it need to be put as instances in the ontology?
Typically we do not need to reason with it
Reasoning happens in the ontology when the new instances of classes and properties are added, but the metadata statements are different from them: they only refer to them
A lot more metadata than instances
Millions of metadata statements, thousands of instances, hundreds of concepts
Different access required:
By offset (give me all metadata of the first paragraph)
Efficient metadata-wide statistics based on strings: not an operation that people would do on other concepts
Mixing with keyword-based search using IR-style indexing
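
The "access by offset" requirement can be illustrated with a small stand-off store: annotations live outside the text and are retrieved by character span. The class and function names are hypothetical, and offsets here are computed from the text (0-based, end-exclusive), so they need not equal the figures printed on the earlier slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    cls: str     # ontology class
    inst: str    # instance identifier in the knowledge base

def annotate(text, mention, cls, inst):
    """Create a stand-off annotation for the first occurrence of `mention`."""
    start = text.index(mention)
    return Annotation(start, start + len(mention), cls, inst)

def in_span(annotations, span_start, span_end):
    """All annotations lying entirely inside the given character span."""
    return [a for a in annotations if a.start >= span_start and a.end <= span_end]

text = "Gordon Brown met George Bush during his two day visit."
annotations = [
    annotate(text, "Gordon Brown", "Person", "Person12345"),
    annotate(text, "George Bush", "Person", "Person67890"),
]
first_16_chars = in_span(annotations, 0, 16)
```

The text itself is never modified, and the annotations refer to ontology instances only by identifier, which is the separation the slide argues for.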

66
Metadata Creation with IE

Semantic tagging creates metadata
Stand-off or part of document
Semi-automatic
One view (given by the user, one ontology)
More reliable
Automatic metadata creation
Many views change with ontology, re-train IE
engine for each ontology
Always up to date, if ontology changes
Less reliable

67
Problems with traditional IE for metadata
creation

S-CREAM: Semi-automatic CREAtion of Metadata [Handschuh et al 02]
Semantic tags from IE need to be mapped to instances of concepts, attributes or relations
Most ML-based IE systems do not deal well with relations, mainly entities
Amilcare does not handle anaphora resolution; GATE has such a component, but it was not used here
Implemented a discourse model with logical rules
LASIE used a discourse model with a domain ontology: the problem is robustness and domain portability

68
Example
[Handschuh et al 02] S-CREAM, EKAW02
69
S-CREAM Discourse Rules

Rules to attach instances only when the ontology
allows that (e.g., prices)
Attach tag values to the nearest preceding
compatible entity (e.g., prices and rooms)
Create a complex object between two concept
instances if they are adjacent (e.g., rate
number followed by currency)
Experienced users can write new rules

70
Challenges for IE for SemWeb

Portability: different and changing ontologies
Different text types: structured, free, etc.
Utilise ontology information where available
Train from small amount of annotated text
Output results wrt the given ontology:
bridge the gap demonstrated in S-CREAM
Learn/Model at the right level:
ontologies are hierarchical and data will get sparser the lower we go

DOT.KOM: http://nlp.shef.ac.uk/dot.kom/
72
GATE: Infrastructure for metadata extraction for the SemWeb

Combines learning and rule-based methods
Allows combination of IE and IR
Enables use of large-scale linguistic resources
for IE, such as WordNet
Supports ontologies as part of IE applications -
Ontology-Based IE (OBIE)

73
Ontology Management in GATE
74
Information Retrieval: currently based on the Lucene IR engine; useful for combining semantic and keyword-based search
75
WordNet support
76
Populating Ontologies with IE
77
Example OBIE Application

hTechSight project: using Ontology-Based IE for semantic tagging of job adverts, news and reports in the chemical engineering domain
Aim is to track technological change over time
through terminological analysis
Fundamental to the application is a
domain-specific ontology
Terminological gazetteer lists are linked to
classes in the ontology
Rules classify the mentions in the text wrt the
domain ontology
Annotations output into a database or as an
ontology

80
Exported Database
82
Platforms for Large-Scale Metadata Creation

Allow use of corpus-wide statistics to improve
metadata quality, e.g., disambiguation
Automated alias discovery
Generate SemWeb output (RDF, OWL)
Stand-off storage and indexing of metadata
Use large instance bases to disambiguate to
Ontology servers for reasoning and access
Architecture elements:
Crawler, ontology storage, document indexing, query, annotators
Apps: semantic browsers, authoring tools, etc.

83
SemTag

Lookup of all instances from the ontology (TAP): 65K instances
Disambiguate the occurrences as:
One of those in the taxonomy
Not present in the taxonomy
Not very high ambiguity of instances with the same label in TAP, so concentrate on the second problem
Use bag-of-words approach for disambiguation
3 people evaluated 200 labels in context and agreed on only 68.5% (metonymy)
Placing labels in the taxonomy is hard

[Dill et al, SemTag and Seeker. WWW03]
84
Seeker

High-performance distributed infrastructure
128 dual-processor machines with separate ½
terabyte of storage
Each node runs approx. 200 documents per sec.
Service-oriented architecture: Vinci (SOAP)

[Dill et al, SemTag and Seeker. WWW03]
85
OBIE in KIM

The ontology (KIMO) and 86K/200K instances KB
High ambiguity of instances with the same label: need for a disambiguation step
Lookup phase marks mentions from the ontology
Combined with rule-based IE system to recognise new instances of concepts and relations
Special KB enrichment stage where some of these new instances are added to the KB
Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)

[Popov et al. KIM. ISWC03]
86
OBIE in KIM (2)
[Popov et al. KIM. ISWC03]
87
Comparison between SemTag and KIM

SemTag only aims for accuracy (precision) of classification of the annotated entities
KIM also aims for coverage (recall): whether all possible mentions of entities were found
Trade-off: sometimes finding some is enough
SemTag does not attempt to discover and expand the KB with new instances (e.g., new company): the reason why KIM uses IE, not simple KB lookup
i.e. OBIE is often needed for ontology population, not just metadata creation

88
Two Annotation Scenarios (1)

Getting the instances and the relations between
them is enough, maybe not all mentions in the
text are covered, but compensated by giving
access to this info from the annotated text

89
Example
Gordon Brown met president Bush during his two day visit. Afterwards George Bush said
The system
Bush
Score: 100%
90
Two Annotation Scenarios (2)

Exhaustive annotation is required, so all
occurrences of all instances and relations are
needed
Allows sentence and paragraph-level exploration,
rather than document-level as in the previous
scenario
Harder to achieve
Distinction between these scenarios needs to be
made in the metadata annotation tools/KM tools
using IE

91
Example
Gordon Brown met president Bush during his two day visit. Afterwards George Bush said
<metadata>
  <Annotation> <s_offset> 0 </s_offset> <e_offset> 12 </e_offset> <class>Person</class> <inst>Person12345</inst> </Annotation>
  <Annotation> <s_offset> 61 </s_offset> <e_offset> 72 </e_offset> <class>Person</class> <inst>Person1267</inst> </Annotation>
</metadata>
<metadata>
  <Annotation> <s_offset> 0 </s_offset> <e_offset> 12 </e_offset> <class>Person</class> <inst>Person12345</inst> </Annotation>
  <Annotation> <s_offset> 18 </s_offset> <e_offset> 32 </e_offset> <class>Person</class> <inst>Person1267</inst> </Annotation>
  <Annotation> <s_offset> 61 </s_offset> <e_offset> 72 </e_offset> <class>Person</class> <inst>Person1267</inst> </Annotation>
</metadata>
Score: 66%
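
One plausible reading of the two scores is sketched below: in the first scenario the system is credited per instance found, in the second per mention found. This interpretation, the function names, and the assignment of the shorter annotation set to the system and the longer one to the reference are assumptions; the offsets and instance identifiers are copied from the example.

```python
def instance_recall(system, gold):
    """Share of gold instances that the system found at least once."""
    gold_instances = {inst for _, _, inst in gold}
    system_instances = {inst for _, _, inst in system}
    if not gold_instances:
        return 1.0
    return len(gold_instances & system_instances) / len(gold_instances)

def mention_recall(system, gold):
    """Share of gold mentions (offsets plus instance) that the system found."""
    gold_mentions = set(gold)
    if not gold_mentions:
        return 1.0
    return len(gold_mentions & set(system)) / len(gold_mentions)

# (start offset, end offset, instance) triples from the example.
gold = [(0, 12, "Person12345"), (18, 32, "Person1267"), (61, 72, "Person1267")]
system = [(0, 12, "Person12345"), (61, 72, "Person1267")]

per_instance = instance_recall(system, gold)
per_mention = mention_recall(system, gold)
```

Under this reading the same output is perfect when only instances matter and reaches about two thirds when every mention must be annotated.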
92
Semantic Reference Disambiguation

Possible approaches
Vector-space models: compare context similarity; runs over a corpus
SemTag
Bagga's cross-document coreference work
Communities of practice approach from KM
Identity criteria from the ontology: based on properties, e.g., date_of_birth, name
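
The vector-space idea can be sketched with bag-of-words vectors and cosine similarity: the mention's context is compared with a textual profile of each candidate instance, and the closest one wins. The candidate profiles, identifiers, and function names below are invented for illustration and are not taken from SemTag or any other system mentioned here.

```python
import math
from collections import Counter

def bag(text):
    """Lower-cased bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(context, candidates):
    """Pick the candidate whose profile is most similar to the context."""
    ctx = bag(context)
    return max(candidates, key=lambda inst: cosine(ctx, bag(candidates[inst])))

# Hypothetical instance profiles for an ambiguous label "Cambridge".
candidates = {
    "Cambridge_UK": "university town england river cam colleges",
    "Cambridge_MA": "massachusetts boston harvard mit charles river",
}
choice = disambiguate("She studied at MIT near Boston before moving", candidates)
```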

93
Why disambiguation is hard: not all knowledge is explicit in text

Paris fashion week underway as cancellations continue
By Jo Johnson and Holly Finn - Oct 07 2001 18:48:17 (FT)
Even as Paris fashion week opened at the weekend, the cancellations and reschedulings were still trickling in over the fax machines: Loewe, the leather specialists owned by LVMH empire, is not showing, Cerruti, the Italian tailor, is downscaling to private viewings, Helmut Lang, master of the sharp suit, is cancelling his catwalk.
The Oscar de la Renta show, for example, which had been planned for September 11th in New York, and which might easily enough have moved over to Paris instead, is not on the schedule. When the Dominican Republic-born designer consulted American Vogue's influential editor, Anna Wintour, she reportedly told him it would be unpatriotic to decamp.

95
Natural Language Generation

NLG is the subfield of AI and CL that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information [Reiter & Dale 97]
NLG techniques are also applied to producing speech, e.g., in speech dialogue systems

Natural Language Generation

Ontology/KB/Database
Lexicons Grammars
Text
97
Requirements Analysis

Create a corpus of target texts and (if possible) their input representations
Analyse the information content:
Unchanging texts: thank you, hello, etc.
Directly available data: timetable of buses
Computable data: number of buses
Unavailable data: not in the system's KB/DB

98
NLG Tasks

Content determination
Discourse planning
Sentence aggregation
Lexicalisation
Referring expression generation
Linguistic realisation

99
Content determination

What information to include in the text: filtering and summarising input data into a formal knowledge representation
Application dependent
Example:
project: AKT
start_date: October-2000
end_date: October-2006
participants: A, E, OU, So, Sh

100
Discourse Planning

Determine ordering and structure over the knowledge to be generated
Theories of discourse: how texts are structured
Influences text readability
Result: tree structure imposing ordering over the predicates and possibly providing discourse relations

101
Example
SEQUENCE
LIST

ELABORATION
ELABORATION
projectAKT duration 6 yrs
project AKT participantShef
univ Shef Web-page URL

project AKT participantOU
102
Planning-Based Approaches

Use AI-style planners (e.g., [Moore & Paris 93])
Discourse relations (e.g., ELABORATION) are encoded as planning operators
Preconditions specify when the relation can apply
Planning starts from a top-level goal, e.g., define-project(X)
Computationally expensive and requires a lot of knowledge: a problem for real-world systems

103
Schema-Based Approaches

Capture typical text structuring patterns in templates (derived from corpus), e.g., [McKeown 85]
Typically implemented as RTN
Variety comes from different available knowledge for each entity
Reusable ones available: Exemplars
Example:
Describe-Project-Schema -> Sequence(duration, ProjParticipants-Schema)

104
Sentence Aggregation

Determine which predicates should be grouped
together in sentences
Less understood process
Default: each predicate can be expressed as a
sentence, so this is an optional step
SPoT: a trainable planner
Example:
AKT is a 6-year project with 5 participants:
Sheffield (URL)
OU
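
A very simple aggregation rule, grouping consecutive predicates that share a subject into one sentence, can be sketched as follows. This is a hand-written rule for illustration only, not the trainable SPoT approach; the fact wording and the `aggregate` function are assumptions.

```python
from itertools import groupby

# Hypothetical (subject, verb phrase) pairs, already ordered by discourse planning.
facts = [
    ("AKT", "is a 6-year project"),
    ("AKT", "has 5 participants"),
    ("Sheffield", "has a web page at URL"),
]


def aggregate(facts):
    """Merge consecutive facts with the same subject into a single sentence."""
    sentences = []
    for subject, group in groupby(facts, key=lambda f: f[0]):
        phrases = [phrase for _, phrase in group]
        if len(phrases) == 1:
            body = phrases[0]
        else:
            body = ", ".join(phrases[:-1]) + " and " + phrases[-1]
        sentences.append(f"{subject} {body}.")
    return sentences


sentences = aggregate(facts)
```

Without aggregation the same three facts would be expressed as three separate sentences, which is the default behaviour mentioned above.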

105
Lexicalisation

Choosing words and phrases to express the
concepts and relations in predicates
Trivial solution: 1-1 mapping between
concepts/relations and lexical entries
Variation is useful to avoid repetitiveness and
also convey pragmatic distinctions (e.g.
formality)
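
A minimal sketch of lexicalisation with variation, assuming a small hand-built lexicon. The lexicon entries, the register names and the `lexicalise` function are hypothetical.

```python
# Hypothetical lexicon: concept -> register -> alternative lexical entries.
LEXICON = {
    "participant": {
        "formal": ["participant", "member institution"],
        "informal": ["partner"],
    },
    "project": {
        "formal": ["project"],
        "informal": ["project"],
    },
}


def lexicalise(concept, register="formal", use_count=0):
    """Pick a word for a concept; rotate through alternatives to avoid repetition."""
    options = LEXICON[concept][register]
    return options[use_count % len(options)]


first = lexicalise("participant")                 # first mention
second = lexicalise("participant", use_count=1)   # next mention uses a variant
casual = lexicalise("participant", register="informal")
```

With a single entry per concept and register this reduces to the trivial 1-1 mapping; the extra entries provide variation and a formality distinction.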

106
Referring Expression Generation

Choose pronouns/phrases to refer to the entities
in the text
Example: "he" vs "Mr Smith" vs "John Smith, the
president of XXX Corp."
Depends on what has previously been said
"He" is only appropriate if the person has already
been introduced in the text
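
The dependence on discourse history can be sketched with a small class that tracks which entities have been introduced. The selection rule used here (full description on first mention, pronoun only when the entity was also the most recent mention, short name otherwise) is an assumed simplification, and the `Discourse` class and the second entity are hypothetical.

```python
class Discourse:
    """Tracks which entities have been mentioned, and which one was mentioned last."""

    def __init__(self):
        self.introduced = set()
        self.last = None

    def refer(self, entity):
        if entity["id"] not in self.introduced:
            expression = entity["full"]      # first mention: full description
        elif self.last == entity["id"]:
            expression = entity["pronoun"]   # still in focus: pronoun is safe
        else:
            expression = entity["short"]     # known, but not the latest mention
        self.introduced.add(entity["id"])
        self.last = entity["id"]
        return expression


smith = {"id": "smith", "full": "John Smith, the president of XXX Corp.",
         "short": "Mr Smith", "pronoun": "he"}
jones = {"id": "jones", "full": "Mary Jones", "short": "Ms Jones", "pronoun": "she"}

d = Discourse()
mentions = [d.refer(smith), d.refer(smith), d.refer(jones), d.refer(smith)]
```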

107
Linguistic Realisation

Use grammar to generate text which is
grammatical, i.e., syntactically and
morphologically correct
Domain-independent
Reusable components are available, e.g.,
RealPro, FUF/SURGE
Example:
Morphology: participant -> participants
Syntactic agreement: AKT starts on
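
The two kinds of rule in the example, morphology and agreement, can be illustrated with a toy realiser. It handles regular English forms only and is not how RealPro or FUF/SURGE work; all function names and the sample sentence are assumptions.

```python
def pluralise(noun, count):
    """Regular plural only: participant -> participants."""
    return noun if count == 1 else noun + "s"


def agree(verb, subject_is_plural):
    """Regular present-tense agreement: a singular subject takes -s."""
    return verb if subject_is_plural else verb + "s"


def realise(subject, verb, count, noun, subject_is_plural=False):
    """Assemble a simple subject-verb-object sentence from base forms."""
    return f"{subject} {agree(verb, subject_is_plural)} {count} {pluralise(noun, count)}."


sentence = realise("AKT", "involve", 5, "participant")
```

A reusable realiser replaces these hard-coded suffix rules with a grammar and a morphological lexicon, which is why this stage can be domain-independent.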

108
A GATE-based generator

Input
The MIAKT ontology
The RDF file for the given case
The MIAKT lexicon
Output
GATE document with the generated text

109
Lexicalising Concepts and Instances
110
Example RDF Input

<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
  <NS2:has_age>68</NS2:has_age>
  <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
  <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
  <NS2:has_date>22 9 1995</NS2:has_date>
  <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
  <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
  <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
  <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
  <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>

111
CASE0140.RDF

The 68 years old patient is involved in a
triple assessment procedure. The triple
assessment procedure contains a mammography exam.
The mammography exam is carried out on the
patient on 22 9 1995. The mammography exam
produced a right CC image. The right CC image
contains an abnormality and it has a right
lateral side and a craniocaudal view. The
abnormality has a mass, a microlobulated margin,
a round shape, and a probably malignant
assessment.
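
The step from RDF statements to sentences of this kind can be sketched with simple templates. This is an illustrative approximation, not the MIAKT generator: the namespace URI for `NS2`, the shortened resource identifiers, the lexicon entries and the sentence templates are all assumptions.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS2 = "http://example.org/breast_cancer_ontology#"  # assumed namespace URI

# Shortened, well-formed fragment modelled on the example input.
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:NS2="{NS2}">
  <rdf:Description rdf:about="#01401_patient">
    <rdf:type rdf:resource="#Patient"/>
    <NS2:has_age>68</NS2:has_age>
  </rdf:Description>
  <rdf:Description rdf:about="#01401_mammography">
    <rdf:type rdf:resource="#Mammography"/>
    <NS2:carried_out_on rdf:resource="#01401_patient"/>
    <NS2:has_date>22 9 1995</NS2:has_date>
  </rdf:Description>
</rdf:RDF>"""

# Hypothetical lexicon for classes and templates for properties.
CLASS_LEXICON = {"#Patient": "patient", "#Mammography": "mammography exam"}
TEMPLATES = {
    "has_age": "The {s} is {o} years old.",
    "carried_out_on": "The {s} is carried out on the {o}.",
    "has_date": "The {s} took place on {o}.",
}


def triples(xml_text):
    """Yield (subject, property, value) for each property element."""
    root = ET.fromstring(xml_text)
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            name = prop.tag.split("}")[1]  # strip the namespace part of the tag
            value = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
            yield subject, name, value


def generate(xml_text):
    """Lexicalise instances via their class, then fill one template per property."""
    facts = list(triples(xml_text))
    names = {s: CLASS_LEXICON[o] for s, p, o in facts if p == "type"}
    sentences = [
        TEMPLATES[p].format(s=names[s], o=names.get(o, o))
        for s, p, o in facts if p in TEMPLATES
    ]
    return " ".join(sentences)


text = generate(SNIPPET)
```

Instances are named through the lexical entry of their class, which is why the patient and the exam appear as "the patient" and "the mammography exam" rather than by identifier.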

112
Further Reading on IE for SemWeb

Requirements for Information Extraction for Knowledge Management. http://nlp.shef.ac.uk/dot.kom/publications.html
Information Extraction as a Semantic Web Technology: Requirements and Promises. Adaptive Text Extraction and Mining workshop, 2003.
A. Kiryakov, B. Popov, et al. Semantic Annotation, Indexing, and Retrieval. 2nd International Semantic Web Conference (ISWC2003). http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
S. Handschuh, S. Staab, R. Volz. On Deep Annotation. WWW03. http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
S. Dill, N. Eiron, et al. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. WWW03. http://www.tomkinshome.com/papers/2Web/semtag.pdf
E. Motta, M. Vargas-Vera, et al. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), (EKAW02). http://www.aktors.org/publications/selected-papers/06.pdf
K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, M. Dimitrov. Semantic Web Enabled, Open Source Language Technology. Language Technology and the Semantic Web, Workshop on NLP and XML (NLPXML-2003). http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
Handschuh, Staab, Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata (2002). http://citeseer.nj.nec.com/529793.html

113
Further Reading on traditional IE

Day et al. 97: D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP97). 1997.
Ciravegna 02: F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks. User-System Cooperation in Document Annotation based on Information Extraction. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), (EKAW02), 2002.
N. Kushmerick, B. Thomas. Adaptive information extraction: Core technologies for information agents (2002). http://citeseer.nj.nec.com/kushmerick02adaptive.html
H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). 2002.
D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
Califf and Mooney. Relational Learning of Pattern Matching Rules for Information Extraction. http://citeseer.nj.nec.com/6804.html
Borthwick A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999.
Bikel D., Schwartz R., Weischedel R. An algorithm that learns what's in a name. Machine Learning 34, pp. 211-231, 1999.
Riloff E. Automatically Generating Extraction Patterns from Untagged Text. Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049. http://www.cs.utah.edu/~riloff/psfiles/aaai96.pdf
Daelemans W. and Hoste V. Evaluation of Machine Learning Methods for Natural Language Processing Tasks. In LREC 2002: Third International Conference on Language Resources and Evaluation, pages 755-760.

114
Further Reading on traditional IE

Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002.
Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, 2000.
Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.
Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwl(TM) Extractor System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors, Corpus Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996.
Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.

115
Further Reading on multilingual IE

Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31 - April 3, 1997.
Sekine S., Grishman R. and Shinnou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998.
Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In Proceedings of the 19th International Conference on Computational Linguistics (COLING2002), pp. 967-973, 2002.
Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002.
D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard and H. Cunningham. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.
D. Maynard, V. Tablan and H. Cunningham. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.

116
Further Reading on multilingual IE

H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction: the MUMIS project. Data and Knowledge Engineering, 2003.
D. Manov, A. Kiryakov, B. Popov, K. Bontcheva, D. Maynard, H. Cunningham. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.
H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.