Title: Metadata Extraction: Human Language Technology and the Semantic Web
1Metadata Extraction: Human Language Technology and the Semantic Web
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard
SEKT meeting, London, 21 January 2004
2The Knowledge Economy and Human Language
- Gartner, December 2002:
- taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
- through 2012 more than 95% of human-to-computer information input will involve textual language
- A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
- The challenge: to reconcile these two opposing tendencies
3HLT and Knowledge: Closing the Language Loop
KEY:
- MNLG: Multilingual Natural Language Generation
- OIE: Ontology-aware Information Extraction
- AIE: Adaptive IE
- CLIE: Controlled Language IE
[Diagram labels: Human Language; Controlled Language; Formal Knowledge (ontologies and instance bases); Semantic Web, Semantic Grid, Semantic Web Services; (M)NLG; OIE; (A)IE; CLIE]
4Structure of the Tutorial
- Information Extraction - definition
- Evaluation: corpora and metrics
- IE approaches: some examples
- Rule-based approaches
- Learning-based approaches
- Semantic Tagging
- Using traditional IE
- Ontology-based IE
- Platforms for large-scale processing
- Language Generation
5Information Extraction
- Information Extraction (IE) pulls facts and structured information from the content of large text collections.
- Contrast: IE and Information Retrieval
- NLP history: from NLU to IE
- Progress driven by quantitative measures
- MUC: Message Understanding Conferences
- ACE: Automatic Content Extraction
6MUC-7 tasks
- Held in 1997, around 15 participants inc. 2 UK. Broke IE down into component tasks:
- NE: Named Entity recognition and typing
- CO: co-reference resolution
- TE: Template Elements
- TR: Template Relations
- ST: Scenario Templates
7An Example
- NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
- CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
- TE: the rocket is "shiny red" and Head's "brainchild".
- TR: Dr. Head works for We Build Rockets Inc.
- ST: a rocket launching event occurred with the various participants.
- The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
8Performance levels
- Vary according to text type, domain, scenario, language
- NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
- CO: 60-70% resolution
- TE: 80%
- TR: 75-80%
- ST: 60% (but human level may be only 80%)
9What are Named Entities?
- NE involves identification of proper names in texts, and classification into a set of predefined categories of interest:
- Person names
- Organizations (companies, government organisations, committees, etc)
- Locations (cities, countries, rivers, etc)
- Date and time expressions
10What are Named Entities (2)
- Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
- Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
- MUC-7 entity definition guidelines [Chinchor97]
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
11What are NOT NEs (MUC-7)
- Artefacts: Wall Street Journal
- Common nouns referring to named entities: the company, the committee
- Names of groups of people and things named after people: the Tories, the Nobel prize
- Adjectives derived from names: Bulgarian, Chinese
- Numbers which are not times, dates, percentages, and money amounts
12Basic Problems in NE
- Variation of NEs, e.g. John Smith, Mr Smith, John.
- Ambiguity of NE types:
- John Smith (company vs. person)
- May (person vs. month)
- Washington (person vs. location)
- 1945 (date vs. time)
- Ambiguity with common words, e.g. "may"
13More complex problems in NE
- Issues of style, structure, domain, genre etc.
- Punctuation, spelling, spacing, formatting, ... all have an impact
- Dept. of Computing and Maths
- Manchester Metropolitan University
- Manchester
- United Kingdom
- Tell me more about Leonardo
- Da Vinci
15Corpora and System Development
- Corpora are typically divided into a training and testing portion
- Rules/Learning algorithms are trained on the training part
- Tuned on the testing portion in order to optimise:
- Rule priorities, rules effectiveness, etc.
- Parameters of the learning algorithm and the features used
- Evaluation set: the best system configuration is run on this data and the system performance is obtained
- No further tuning once evaluation set is used!
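The three-way division described on this slide can be sketched as follows. This is only an illustration: the function name `split_corpus`, the 60/20/20 proportions and the fixed random seed are assumptions, not something the tutorial prescribes.

```python
import random

def split_corpus(documents, train_pct=60, test_pct=20, seed=0):
    """Split documents into training, testing (tuning) and evaluation portions.

    The percentages are illustrative assumptions; whatever remains after the
    training and testing portions becomes the held-out evaluation set.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)          # fixed seed keeps the split reproducible
    n_train = len(docs) * train_pct // 100
    n_test = len(docs) * test_pct // 100
    return {
        "training": docs[:n_train],                  # train rules / learning algorithm
        "testing": docs[n_train:n_train + n_test],   # tune priorities and parameters
        "evaluation": docs[n_train + n_test:],       # run once, no further tuning
    }

parts = split_corpus(["doc%03d" % i for i in range(100)])
```

Keeping the evaluation portion out of every tuning loop is what makes the final score an honest estimate.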
16Some NE Annotated Corpora
- MUC-6 and MUC-7 corpora - English
- CONLL shared task corpora:
- http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
- http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
- TIDES surprise language exercise (NEs in Cebuano and Hindi)
- ACE - English - http://www.ldc.upenn.edu/Projects/ACE/
17The MUC-7 corpus
- 100 documents in SGML
- News domain
- Named Entities
- 1880 Organizations (46%)
- 1324 Locations (32%)
- 887 Persons (22%)
- Inter-annotator agreement very high (97%)
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
18The MUC-7 Corpus (2)
- <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
- <p>
- Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
19ACE: Towards Semantic Tagging of Entities
- MUC NE tags segments of text whenever that text represents the name of an entity
- In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
- Rolls together the NE and CO tasks
- Domain- and genre-independent approaches
- ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
20ACE Entities
- Dealing with:
- Proper names, e.g., England, Mr. Smith, IBM
- Pronouns, e.g., he, she, it
- Nominal mentions: the company, the spokesman
- Identify which mentions in the text refer to which entities, e.g.,
- Tony Blair, Mr. Blair, he, the prime minister, he
- Gordon Brown, he, Mr. Brown, the chancellor
21ACE Example
<entity ID="ft-airlines-27-jul-2001-2"
        GENERIC="FALSE"
        entity_type = "ORGANIZATION">
  <entity_mention ID="M003"
        TYPE = "NAME"
        string = "National Air Traffic Services">
  </entity_mention>
  <entity_mention ID="M004"
        TYPE = "NAME"
        string = "NATS">
  </entity_mention>
  <entity_mention ID="M005"
        TYPE = "PRO"
        string = "its">
  </entity_mention>
  <entity_mention ID="M006"
        TYPE = "NAME"
        string = "Nats">
  </entity_mention>
22Annotation Tools Alembic, GATE, ...
23Performance Evaluation
- Evaluation metric: mathematically defines how to measure the system's performance against a human-annotated gold standard
- Scoring program: implements the metric and provides performance measures
- For each document and over the entire corpus
- For each type of NE
24The Evaluation Metric
- Precision = correct answers / answers produced
- Recall = correct answers / total possible correct answers
- Trade-off between precision and recall
- F-Measure: $F = \frac{(\beta^2 + 1) P R}{\beta^2 R + P}$ [van Rijsbergen 75]
- $\beta$ reflects the weighting between precision and recall, typically $\beta = 1$
25The Evaluation Metric (2)
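- We may also want to take account of partially correct answers:
- $\text{Precision} = \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Incorrect} + \text{Partial}}$
- $\text{Recall} = \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Missing} + \text{Partial}}$
- Why: NE boundaries are often misplaced, so some results are only partially correct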
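A minimal sketch of these metrics, assuming the four counts have already been produced by a scoring program. The function names and the example counts are illustrative; the F-measure follows the form given on the previous slide, which reduces to the harmonic mean of precision and recall when beta is 1.

```python
def precision_recall(correct, partial, incorrect, missing):
    """Precision and recall with half credit for partially correct answers."""
    credit = correct + 0.5 * partial
    produced = correct + incorrect + partial   # everything the system returned
    possible = correct + missing + partial     # everything in the gold standard
    precision = credit / produced if produced else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    """F-measure as written on the slide: (b^2 + 1)PR / (b^2 R + P)."""
    denominator = beta ** 2 * recall + precision
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator

# Hypothetical counts, for illustration only.
p, r = precision_recall(correct=80, partial=10, incorrect=10, missing=10)
f1 = f_measure(p, r)
```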
26The GATE Evaluation Tool
27Corpus-level Regression Testing
- Need to track the system's performance over time
- When a change is made to the system we want to know what the implications are over the entire corpus
- Why: because an improvement in one case can lead to problems in others
- GATE offers an automated tool to help with the NE development task over time
28Regression Testing (2)
At corpus level: GATE's corpus benchmark tool - tracking the system's performance over time
29Challenge: Evaluating Richer NE Tagging
- Need for new metrics when evaluating hierarchy/ontology-based NE tagging
- Need to take into account distance in the hierarchy
- Tagging a company as a charity is less wrong than tagging it as a person
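One way to make "distance in the hierarchy" concrete is to count the edges between the gold class and the predicted class. The sketch below assumes a toy hierarchy invented for illustration; the class names, the `max_distance` normaliser and the linear scoring are all assumptions, not a metric proposed in the tutorial.

```python
# Hypothetical toy hierarchy (child -> parent); not taken from any real ontology.
PARENT = {
    "Company": "Organization",
    "Charity": "Organization",
    "Organization": "Entity",
    "Person": "Entity",
}

def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def hierarchy_distance(a, b):
    """Number of edges between two classes via their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    for steps_a, cls in enumerate(path_a):
        if cls in path_b:
            return steps_a + path_b.index(cls)
    raise ValueError("classes share no common ancestor")

def tagging_score(gold, predicted, max_distance=4):
    """Map distance to a score in [0, 1]; 1.0 means an exact match."""
    return max(0.0, 1.0 - hierarchy_distance(gold, predicted) / max_distance)
```

With this toy hierarchy, a company tagged as a charity scores higher than a company tagged as a person, matching the intuition on the slide.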
30SW IE Evaluation tasks
- Detection of entities and events, given a target ontology of the domain.
- Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated "Cambridge" in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
- Decision when a new instance needs to be added to the ontology, because the text contains a new instance that does not already exist in the ontology.
32Two kinds of IE approaches
- Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- requires only small amount of training data
- development could be very time consuming
- some changes may be hard to accommodate
- Learning Systems
- use statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
- annotators are cheap (but you get what you pay for!)
33NE Baseline list lookup approach
- System that recognises only entities stored in its lists (gazetteers).
- Advantages: simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
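The baseline can be sketched in a few lines. The lists below are hypothetical stand-ins for real gazetteer files, and `lookup` is an illustrative name; the sketch does exact string matching only, which is exactly why it cannot handle name variants.

```python
# Hypothetical gazetteer lists; a real system would load these from files.
GAZETTEER = {
    "Location": ["London", "New York", "Sherwood Forest"],
    "Organization": ["IBM", "We Build Rockets Inc."],
}

def lookup(text, gazetteer=GAZETTEER):
    """Return (start, end, type, string) for every exact occurrence of a list entry."""
    matches = []
    for entity_type, names in gazetteer.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                matches.append((start, start + len(name), entity_type, name))
                start = text.find(name, start + 1)
    return sorted(matches)

text = "Dr. Head left IBM in London to join We Build Rockets Inc."
found = lookup(text)
```

A variant such as "I.B.M." is not in the list and is therefore missed, and a name that appears in two lists would simply be reported twice with no way to choose between the types.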
34Shallow parsing approach using internal structure
- Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. location:
- Cap. Word + {City, Forest, Center, River}
- e.g. Sherwood Forest
- Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
- e.g. Portobello Street
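The two internal-structure patterns can be approximated with a regular expression. This is a rough sketch, assuming that a "Cap. Word" is an initial capital followed by lower-case letters; a real shallow parser would work over tokeniser and gazetteer annotations instead of raw text.

```python
import re

# Indicator words taken from the two patterns on the slide.
NATURAL = r"(?:City|Forest|Center|River)"
ADDRESS = r"(?:Street|Boulevard|Avenue|Crescent|Road)"
CAP_WORD = r"[A-Z][a-z]+"   # assumption: initial capital plus lower-case letters

# One or more capitalised words followed by a location indicator.
LOCATION = re.compile(
    r"\b(%s(?: %s)*) (%s|%s)\b" % (CAP_WORD, CAP_WORD, NATURAL, ADDRESS)
)

def find_locations(text):
    """Return candidate location names: capitalised word(s) plus an indicator."""
    return [m.group(0) for m in LOCATION.finditer(text)]

candidates = find_locations("They walked from Portobello Street to Sherwood Forest.")
```

Because the pattern relies on capitalisation alone, a capitalised sentence-initial word directly before such a name would be swallowed into the match.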
35Problems ...
- Ambiguously capitalised words (first word in sentence): All American Bank vs. All State Police
- Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
- Structural ambiguity: Cable and Wireless vs. Microsoft and Dell; Center for Computational Linguistics vs. message from City Hospital for John Smith
36Shallow parsing with context
- Use of context-based patterns is helpful in ambiguous cases
- "David Walton" and "Goldman Sachs" are indistinguishable
- But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "Person of Organization" to identify "Goldman Sachs" correctly.
37Examples of context patterns
- PERSON earns MONEY
- PERSON joined ORGANIZATION
- PERSON left ORGANIZATION
- PERSON joined ORGANIZATION as JOBTITLE
- ORGANIZATION's JOBTITLE PERSON
- ORGANIZATION JOBTITLE PERSON
- the ORGANIZATION JOBTITLE
- part of the ORGANIZATION
- ORGANIZATION headquarters in LOCATION
- price of ORGANIZATION
- sale of ORGANIZATION
- investors in ORGANIZATION
- ORGANIZATION is worth MONEY
- JOBTITLE PERSON
- PERSON, JOBTITLE
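A small sketch of how such patterns might be applied once some Person entities are already known. It covers only "PERSON of ORGANIZATION", "PERSON joined ORGANIZATION" and "PERSON left ORGANIZATION"; the function name and the definition of a capitalised sequence are assumptions made for illustration.

```python
import re

# Assumption: an unknown name is a run of capitalised words.
CAP_SEQ = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

def apply_context_patterns(text, known_persons):
    """Type the name following 'PERSON of/joined/left' as an Organization."""
    found = []
    for person in known_persons:
        pattern = re.compile(
            re.escape(person) + r" (?:of|joined|left) (" + CAP_SEQ + ")"
        )
        for m in pattern.finditer(text):
            found.append((m.group(1), "Organization"))
    return found

orgs = apply_context_patterns(
    "David Walton of Goldman Sachs said rates may rise.",
    known_persons=["David Walton"],
)
```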
38Example Rule-based System - ANNIE
- Created as part of GATE
- GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- GATE has a finite-state pattern-action rule language, used by ANNIE
- ANNIE modified for MUC guidelines: 89.5% f-measure on MUC-7 NE corpus
39NE Components
The ANNIE system: a reusable and easily extendable set of components
40Gazetteer lists for rule-based NE
- Needed to store the indicator strings for the internal structure and context rules
- Internal location indicators, e.g., river, mountain, forest for natural locations; street, road, crescent, place, square for address locations
- Internal organisation indicators, e.g., company designators: GmbH, Ltd, Inc
- Produces Lookup results of the given kind
41The Named Entity Grammars
- Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
- Hand-coded rules applied to annotations to identify NEs
- Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
- Use of contextual information
- Finds person names, locations, organisations, dates, addresses.
42NE Rule in JAPE
- JAPE: a Java Annotation Patterns Engine
- Light, robust regular-expression-based processing
- Cascaded finite state transduction
- Low-overhead development of new components
- Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+ //from tokeniser
  {Lookup.kind == companyDesignator} //from gazetteer lists
):match
-->
:match.NamedEntity = { kind = "company", rule = "Company1" }
43Named Entities in GATE
44Using co-reference to classify ambiguous NEs
- Orthographic co-reference module that matches proper names in a document
- Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
- May not reclassify already classified entities
- Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. "Bonfield" will match "Sir Peter Bonfield"; "International Business Machines Ltd." will match "IBM"
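The idea of typing unknown names from already classified ones can be sketched as below. This is not GATE's orthographic matcher; the two matching rules (surname and acronym), the designator list and all function names are simplifying assumptions.

```python
DESIGNATORS = {"Ltd", "Inc", "GmbH"}   # company designators, as on the gazetteer slide

def acronym(name):
    """Initials of the capitalised words in a name, ignoring company designators."""
    words = [w for w in name.split() if w.rstrip(".") not in DESIGNATORS]
    return "".join(w[0] for w in words if w[0].isupper())

def propagate_types(classified, unclassified):
    """Give an unclassified name the type of a classified name it matches.

    A name matches if it equals the last word of a classified name (surname)
    or the acronym of that name. Classified names are never changed.
    """
    result = {}
    for name in unclassified:
        for full_name, entity_type in classified.items():
            is_surname = name == full_name.split()[-1]
            is_acronym = len(name) > 1 and name == acronym(full_name)
            if is_surname or is_acronym:
                result[name] = entity_type
                break
    return result

classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
types = propagate_types(classified, ["Bonfield", "IBM", "Tuesday"])
```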
45Named Entity Coreference
47Machine Learning Approaches
- Approaches
- Train ML models on manually annotated text
- Mixed initiative learning
- Used for producing training data
- Used for producing working systems
- ML Methods
- Symbolic learning: rules/decision trees induction
- Statistical models: HMMs, Bayesian methods, Maximum Entropy
48ML Terminology
- Instances (tokens, entities)
- Occurrences of a phenomenon
- Attributes (features)
- Characteristics of the instances
- Classes
- Sets of similar instances
49Methodology
- The task can be broken into several subtasks (that can use different methods):
- Boundary detection
- Entity classification into NE types
- Different models for different entity types
- Several models can be used in competition.
- Some algorithms perform better on little data while others are better when more training data is available
50Methodology (2)
- Boundaries (and entity types) notations:
- S(-XXX), E(-XXX)
- <S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for <S-LOC/>Baghdad<E-LOC/>.
- IOB notation (Inside, Outside, Beginning_of)
- U.N. I-ORG
- official O
- Ekeus I-PER
- heads O
- for O
- Baghdad I-LOC
- . O
- Translations between the two conventions are straightforward
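As an illustration of that translation, the sketch below converts the S/E boundary markup into token-level tags. It assumes whitespace tokenisation and the IOB variant shown on the slide, where an entity starts with I-XXX and B-XXX is needed only when an entity directly follows another of the same type; the function name is made up for this example.

```python
import re

MARKER = re.compile(r"(<[SE]-[A-Z]+/>)")

def se_to_iob(text):
    """Convert <S-XXX/>...<E-XXX/> markup into (token, tag) pairs."""
    tagged = []
    current = None    # entity type we are inside, or None
    first = False     # True until the first token of the current entity is emitted
    for piece in MARKER.split(text):
        m = re.fullmatch(r"<([SE])-([A-Z]+)/>", piece)
        if m:
            if m.group(1) == "S":
                current, first = m.group(2), True
            else:
                current = None
            continue
        for token in piece.split():
            if current is None:
                tagged.append((token, "O"))
                continue
            prev = tagged[-1][1] if tagged else "O"
            # B- only when a new entity directly follows one of the same type.
            prefix = "B" if first and prev.endswith("-" + current) else "I"
            tagged.append((token, prefix + "-" + current))
            first = False
    return tagged

tags = se_to_iob(
    "<S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for <S-LOC/>Baghdad<E-LOC/>."
)
```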
51Features
- Linguistic features
- POS
- Morphology
- Syntax
- Lexicon data
- Semantic features
- Ontological class
- ETC
- Document structure
- Original markup
- Paragraph/sentence structure
- Surface features
- Token length
- Capitalisation
- Token type (word, punctuation, symbol)
- Feature selection: the most difficult part
- Some automatic scoring methods can be used
52Mixed Initiative Learning
- Human computer interaction
- Speeds up the creation of training data
- Can be used for corpus/system creation
- Example implementations
- Alembic [Day et al97]
- Amilcare [Ciravegna03]
53Mixed Initiative Learning (2)
[Diagram labels: User annotates; System learns; P > t1; P > t2]
54GATE Machine Learning support
- Uses classification.
- [Attr1, Attr2, Attr3, ... Attrn] -> Class
- Classifies annotations.
- (Documents can be classified as well using a 1-to-1 relation with annotations.)
- Annotations of a particular type are selected as instances.
- Attributes refer to features of the instance annotations or their context.
- Generic implementation for attribute collection: can be linked to any ML engine.
- ML engines currently integrated: WEKA and Ontotext's HMM.
55Implementation
- Machine Learning PR in GATE.
- Has two functioning modes
- training
- application
- Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> </DATASET>
  <ENGINE></ENGINE>
</ML-CONFIG>
56Attributes Collection
Instances type: Token
57Dataflow
GATE ML Library
NLP Pipeline: Tokeniser, Gazetteer, POS Tagger, Lexicon Lookup, Semantic Tagger, etc.
Annotated documents
Plain text documents
Feature Collection
Results Converter
Engine Interface
Machine Learning Engine
58Amilcare & Melita
- Amilcare: rule-learning algorithm
- Tagging rules: learn to insert tags in the text, given training examples
- Correction rules: learn to move already inserted tags to their correct place in the text
- Novel aspect: learns begin and end tags independently
- Melita: supports adaptive IE
- Applied in SemWeb context (see below)
- Being extended as part of the EU-funded DOT.KOM project towards KM and SemWeb applications
[Ciravegna03] www.dcs.shef.ac.uk/fabio
60Towards Semantic Tagging of Entities
- The MUC NE task tags selected segments of text whenever that text represents the name of an entity.
- Semantic tagging: view as mentions of the underlying instances from the ontology
- Identify which mentions in the text refer to which instances in the ontology, e.g.,
- Tony Blair, Mr. Blair, he, the prime minister, he
- Gordon Brown, he, Mr. Brown, the chancellor
61Tasks
- Identify entity mentions in the text
- Reference disambiguation
- Add new instances if needed
- Disambiguate wrt instances in the ontology
- Identify instances of attributes and relations
- take into account what is allowed given the ontology, using domain and range as constraints
62Example
XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in
[Diagram labels: Ontology and KB; Location; Company; City; Country; HQ; partOf; type; establOn; 03/11/1978]
63Classes, instances & metadata
Gordon Brown met George Bush during his two day visit.
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Figure labels: Classes + instances before; Classes + instances after; Bush]
64Classes, instances & metadata (2)
Gordon Brown met Tony Blair to discuss the university tuition fees.
<metadata>
  <DOC-ID>http:// 2.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 30 </e_offset>
    <string>Tony Blair</string>
    <class>Person</class>
    <inst>Person26389</inst>
  </Annotation>
</metadata>
[Figure labels: Classes + instances before; Classes + instances after; G. Brown; G. Bush]
65Why not put metadata in ontologies?
- Can be encoded in RDF/OWL, etc. but does it need to be put as instances in the ontology?
- Typically we do not need to reason with it
- Reasoning happens in the ontology when the new instances of classes and properties are added, but the metadata statements are different from them, they only refer to them
- A lot more metadata than instances
- Millions of metadata statements, thousands of instances, hundreds of concepts
- Different access required:
- By offset (give me all metadata of the first paragraph)
- Efficient metadata-wide statistics based on strings - not an operation that people would do on other concepts
- Mixing with keyword-based search using IR-style indexing
66Metadata Creation with IE
- Semantic tagging creates metadata
- Stand-off or part of document
- Semi-automatic
- One view (given by the user, one ontology)
- More reliable
- Automatic metadata creation
- Many views: change with ontology, re-train IE engine for each ontology
- Always up to date, if ontology changes
- Less reliable
67Problems with traditional IE for metadata creation
- S-CREAM: Semi-automatic CREAtion of Metadata [Handschuh et al02]
- Semantic tags from IE need to be mapped to instances of concepts, attributes or relations
- Most ML-based IE systems do not deal well with relations, mainly entities
- Amilcare does not handle anaphora resolution; GATE has such a component but it is not used here
- Implemented a discourse model with logical rules
- LASIE used a discourse model with a domain ontology: problem is robustness and domain portability
68Example
Handschuh et al02 S-CREAM, EKAW02
69S-CREAM Discourse Rules
- Rules to attach instances only when the ontology allows that (e.g., prices)
- Attach tag values to the nearest preceding compatible entity (e.g., prices and rooms)
- Create a complex object between two concept instances if they are adjacent (e.g., rate: number followed by currency)
- Experienced users can write new rules
70Challenges for IE for SemWeb
- Portability: different and changing ontologies
- Different text types: structured, free, etc.
- Utilise ontology information where available
- Train from small amount of annotated text
- Output results wrt the given ontology
- bridge the gap demonstrated in S-CREAM
- Learn/Model at the right level
- ontologies are hierarchical and data will get sparser the lower we go
DOT.KOM http://nlp.shef.ac.uk/dot.kom/
72GATE: Infrastructure for metadata extraction for the SemWeb
- Combines learning and rule-based methods
- Allows combination of IE and IR
- Enables use of large-scale linguistic resources for IE, such as WordNet
- Supports ontologies as part of IE applications - Ontology-Based IE (OBIE)
73Ontology Management in GATE
74Information Retrieval
Currently based on the Lucene IR engine; useful for combining semantic and keyword-based search
75WordNet support
76Populating Ontologies with IE
77Example OBIE Application
- hTechSight project: using Ontology-Based IE for semantic tagging of job adverts, news and reports in the chemical engineering domain
- Aim is to track technological change over time through terminological analysis
- Fundamental to the application is a domain-specific ontology
- Terminological gazetteer lists are linked to classes in the ontology
- Rules classify the mentions in the text wrt the domain ontology
- Annotations output into a database or as an ontology
80Exported Database
82Platforms for Large-Scale Metadata Creation
- Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
- Automated alias discovery
- Generate SemWeb output (RDF, OWL)
- Stand-off storage and indexing of metadata
- Use large instance bases to disambiguate to
- Ontology servers for reasoning and access
- Architecture elements:
- Crawler, onto storage, doc indexing, query, annotators
- Apps: sem browsers, authoring tools, etc.
83SemTag
- Lookup of all instances from the ontology (TAP): 65K instances
- Disambiguate the occurrences as:
- One of those in the taxonomy
- Not present in the taxonomy
- Not very high ambiguity of instances with the same label in TAP: concentrate on the second problem
- Use bag-of-words approach for disambiguation
- 3 people evaluated 200 labels in context: agreed on only 68.5% - metonymy
- Placing labels in the taxonomy is hard
Dill et al, SemTag and Seeker. WWW03
84Seeker
- High-performance distributed infrastructure
- 128 dual-processor machines with separate ½ terabyte of storage
- Each node runs approx. 200 documents per sec.
- Service-oriented architecture: Vinci (SOAP)
Dill et al, SemTag and Seeker. WWW03
85OBIE in KIM
- The ontology (KIMO) and 86K/200K instances KB
- High ambiguity of instances with the same label: need for disambiguation step
- Lookup phase marks mentions from the ontology
- Combined with rule-based IE system to recognise new instances of concepts and relations
- Special KB enrichment stage where some of these new instances are added to the KB
- Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC03
86OBIE in KIM (2)
Popov et al. KIM. ISWC03
87Comparison between SemTag and KIM
- SemTag only aims for accuracy (precision) of classification of the annotated entities
- KIM also aims for coverage (recall): whether all possible mentions of entities were found
- Trade-off: sometimes finding some is enough
- SemTag does not attempt to discover and expand the KB with new instances (e.g., new company): the reason why KIM uses IE, not simple KB lookup
- i.e. OBIE is often needed for ontology population, not just metadata creation
88Two Annotation Scenarios (1)
- Getting the instances and the relations between
them is enough, maybe not all mentions in the
text are covered, but compensated by giving
access to this info from the annotated text
89Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
The system
Bush
Score: 100%
90Two Annotation Scenarios (2)
- Exhaustive annotation is required, so all occurrences of all instances and relations are needed
- Allows sentence and paragraph-level exploration, rather than document-level as in the previous scenario
- Harder to achieve
- Distinction between these scenarios needs to be made in the metadata annotation tools/KM tools using IE
91Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset>
    <e_offset> 72 </e_offset>
    <class>Person</class>
    <inst>Person1267</inst>
  </Annotation>
</metadata>
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <class>Person</class>
    <inst>Person1267</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset>
    <e_offset> 72 </e_offset>
    <class>Person</class>
    <inst>Person1267</inst>
  </Annotation>
</metadata>
Score: 66%
92Semantic Reference Disambiguation
- Possible approaches:
- Vector-space models: compare context similarity, runs over a corpus
- SemTag
- Bagga's cross-document coreference work
- Communities of practice approach from KM
- Identity criteria from the ontology: based on properties, e.g., date_of_birth, name
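The vector-space idea can be sketched with bag-of-words vectors and cosine similarity. The context profiles below are invented for illustration (in practice they would be collected from a corpus or from the KB), and this is not the SemTag or KIM algorithm itself.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; no stemming or stop-word removal."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical context profiles for two instances sharing the label "Cambridge".
PROFILES = {
    "Cambridge, UK": bag_of_words("university england cam river colleges uk"),
    "Cambridge, MA": bag_of_words("massachusetts boston mit harvard usa"),
}

def disambiguate(context, profiles=PROFILES):
    """Pick the instance whose profile is most similar to the mention's context."""
    vector = bag_of_words(context)
    scores = {inst: cosine(vector, profile) for inst, profile in profiles.items()}
    return max(scores, key=scores.get)

choice = disambiguate("researchers at mit in cambridge near boston")
```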
93Why disambiguation is hard: not all knowledge is explicit in text
- Paris fashion week underway as cancellations continue
- By Jo Johnson and Holly Finn - Oct 07 2001 18:48:17 (FT)
- Even as Paris fashion week opened at the weekend, the cancellations and reschedulings were still trickling in over the fax machines: Loewe, the leather specialists owned by LVMH empire, is not showing, Cerruti, the Italian tailor, is downscaling to private viewings, Helmut Lang, master of the sharp suit, is cancelling his catwalk.
- The Oscar de la Renta show, for example, which had been planned for September 11th in New York, and which might easily enough have moved over to Paris instead, is not on the schedule. When the Dominican Republic-born designer consulted America Vogue's influential editor, Anna Wintour, she reportedly told him it would be unpatriotic to decamp.
95Natural Language Generation
- NLG is:
- subfield of AI and CL that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information [Reiter & Dale 97]
- NLG techniques are applied also for producing speech, e.g., in speech dialogue systems
96Natural Language Generation
[Diagram labels: Ontology/KB/Database; Lexicons; Grammars; Text]
97Requirements Analysis
- Create a corpus of target texts and (if possible) their input representations
- Analyse the information content:
- Unchanging texts: thank you, hello, etc.
- Directly available data: timetable of buses
- Computable data: number of buses
- Unavailable data: not in the system's KB/DB
98NLG Tasks
- Content determination
- Discourse planning
- Sentence aggregation
- Lexicalisation
- Referring expression generation
- Linguistic realisation
99Content determination
- What information to include in the text: filtering and summarising input data into a formal knowledge representation
- Application dependent
- Example:
- project: AKT
- start_date: October-2000
- end_date: October-2006
- participants: A,E,OU,So,Sh
100Discourse Planning
- Determine ordering and structure over the knowledge to be generated
- Theories of discourse: how texts are structured
- Influences text readability
- Result: tree structure imposing ordering over the predicates and possibly providing discourse relations
101Example
SEQUENCE
LIST
ELABORATION
ELABORATION
projectAKT duration 6 yrs
project AKT participantShef
univ Shef Web-page URL
project AKT participantOU
102Planning-Based Approaches
- Use AI-style planners (e.g., [Moore & Paris 93])
- Discourse relations (e.g., ELABORATION) are encoded as planning operators
- Preconditions specify when the relation can apply
- Planning starts from a top-level goal, e.g., define-project(X)
- Computationally expensive and require a lot of knowledge: problem for real-world systems
103Schema-Based Approaches
- Capture typical text structuring patterns in templates (derived from corpus), e.g., [McKeown 85]
- Typically implemented as an RTN (recursive transition network)
- Variety comes from different available knowledge for each entity
- Reusable ones available: Exemplars
- Example:
- Describe-Project-Schema -> Sequence(duration, ProjParticipants-Schema)
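The example schema can be sketched as a table of named schemas that are instantiated recursively against the knowledge available for an entity. The body of ProjParticipants-Schema, the knowledge base layout and the function name `instantiate` are assumptions; this is not the Exemplars framework.

```python
# Schema name -> (discourse relation, slots). A slot is either the name of
# another schema (expanded recursively) or a property to look up.
SCHEMAS = {
    "Describe-Project-Schema": ("SEQUENCE", ["duration", "ProjParticipants-Schema"]),
    "ProjParticipants-Schema": ("LIST", ["participant"]),  # assumed body
}

# Hypothetical knowledge base: (subject, property) -> list of values.
KB = {
    ("AKT", "duration"): ["6 yrs"],
    ("AKT", "participant"): ["Shef", "OU"],
    ("Other", "duration"): ["3 yrs"],
}


def instantiate(schema_name, entity, kb):
    """Fill a schema with whatever knowledge is available for the entity."""
    relation, slots = SCHEMAS[schema_name]
    children = []
    for slot in slots:
        if slot in SCHEMAS:
            children.append(instantiate(slot, entity, kb))
        else:
            children.extend((entity, slot, v) for v in kb.get((entity, slot), []))
    return (relation, children)


akt_plan = instantiate("Describe-Project-Schema", "AKT", KB)
other_plan = instantiate("Describe-Project-Schema", "Other", KB)
```

The same schema yields different trees for the two entities, which is the sense in which variety comes from the knowledge available for each one.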
104Sentence Aggregation
- Determine which predicates should be grouped together in sentences
- Less understood process
- Default: each predicate can be expressed as a sentence, so this is an optional step
- SPOT: trainable planner
- Example:
- AKT is a 6-year project with 5 participants:
- Sheffield (URL)
- OU
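One simple, rule-based form of aggregation is to merge predicates that share a subject and property, so that several participants end up in one sentence. This is a hand-written sketch, not the trainable SPOT planner; the function names are hypothetical.

```python
def aggregate(predicates):
    """Merge predicates with the same subject and property into one entry."""
    grouped = {}
    for subject, prop, value in predicates:
        # dicts keep insertion order, so first-mention order is preserved
        grouped.setdefault((subject, prop), []).append(value)
    return [(s, p, values) for (s, p), values in grouped.items()]


def join_values(values):
    """Render a list as 'A', 'A and B' or 'A, B and C'."""
    if len(values) <= 1:
        return "".join(values)
    return ", ".join(values[:-1]) + " and " + values[-1]


predicates = [
    ("AKT", "participant", "Sheffield"),
    ("AKT", "duration", "6 yrs"),
    ("AKT", "participant", "OU"),
]
merged = aggregate(predicates)
```

Without this step, each of the three predicates would become its own sentence, which is the default described above.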
105Lexicalisation
- Choosing words and phrases to express the concepts and relations in predicates
- Trivial solution: 1-1 mapping between concepts/relations and lexical entries
- Variation is useful to avoid repetitiveness and also to convey pragmatic distinctions (e.g. formality)
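A sketch of lexicalisation that goes one step beyond the 1-1 mapping: each concept has variants per register, and repeated mentions cycle through them. The lexicon entries and the function name `lexicalise` are invented for illustration.

```python
# Hypothetical lexicon: concept -> register -> list of variants.
LEXICON = {
    "participant": {
        "formal": ["participant", "partner institution"],
        "informal": ["member"],
    },
    "duration": {
        "formal": ["duration"],
        "informal": ["length"],
    },
}


def lexicalise(concept, register="formal", mention_index=0):
    """Pick a word for a concept; cycle through variants on repeated mentions."""
    variants = LEXICON[concept][register]
    return variants[mention_index % len(variants)]


first = lexicalise("participant")
second = lexicalise("participant", mention_index=1)
casual = lexicalise("participant", register="informal")
```

With a single variant per concept and a single register, this reduces to the trivial 1-1 mapping.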
106Referring Expression Generation
- Choose pronouns/phrases to refer to the entities in the text
- Example: "he" vs "Mr Smith" vs "John Smith, the president of XXX Corp."
- Depends on what has already been said
- "He" is only appropriate if the person has already been introduced in the text
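The dependence on discourse history can be sketched with a rule of thumb: full description on first mention, a pronoun if the entity was the last one mentioned, a short name otherwise. The rule, the entity records and the function name `refer` are assumptions made for illustration, not a published algorithm.

```python
def refer(entity, history):
    """Choose a referring expression; history lists entity ids mentioned so far."""
    if entity["id"] not in history:
        expression = entity["full"]      # first mention: introduce the entity
    elif history[-1] == entity["id"]:
        expression = entity["pronoun"]   # still in focus: a pronoun is safe
    else:
        expression = entity["short"]     # something intervened: use the name
    history.append(entity["id"])
    return expression


smith = {"id": "smith", "full": "John Smith, the president of XXX Corp.",
         "short": "Mr Smith", "pronoun": "he"}
akt = {"id": "akt", "full": "the AKT project", "short": "AKT", "pronoun": "it"}

history = []
mentions = [refer(smith, history), refer(smith, history),
            refer(akt, history), refer(smith, history)]
```

The pronoun is only produced for the second mention, when the person has been introduced and nothing else has intervened.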
107Linguistic Realisation
- Use grammar to generate text which is grammatical, i.e., syntactically and morphologically correct
- Domain-independent
- Reusable components are available, e.g., RealPro, FUF/SURGE
- Example:
- Morphology: participant -> participants
- Syntactic agreement: AKT starts on
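The two examples can be sketched with naive rules for regular English nouns and verbs. This is a toy illustration only; it does not use RealPro or FUF/SURGE, and the function names are hypothetical.

```python
def pluralise(noun, count):
    """Regular nouns only: add 's' unless the count is exactly 1."""
    return noun if count == 1 else noun + "s"


def present_tense(verb, subject_count=1):
    """Regular verbs only: a singular third-person subject takes 's'."""
    return verb + "s" if subject_count == 1 else verb


def realise_count(subject, verb, count, noun):
    """Realise a 'subject verb N nouns' sentence with a singular subject."""
    return f"{subject} {present_tense(verb)} {count} {pluralise(noun, count)}."


sentence = realise_count("AKT", "include", 5, "participant")
```

A grammar-based realiser handles irregular forms, word order and punctuation in general; the point here is only that these decisions are independent of the domain.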
108A GATE-based generator
- Input:
- The MIAKT ontology
- The RDF file for the given case
- The MIAKT lexicon
- Output:
- GATE document with the generated text
109Lexicalising Concepts and Instances
110Example RDF Input
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
  <NS2:has_age>68</NS2:has_age>
  <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
  <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
  <NS2:has_date>22 9 1995</NS2:has_date>
  <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
  <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
  <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
  <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
  <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>
111CASE0140.RDF
- The 68 years old patient is involved in a
triple assessment procedure. The triple
assessment procedure contains a mammography exam.
The mammography exam is carried out on the
patient on 22 9 1995. The mammography exam
produced a right CC image. The right CC image
contains an abnormality and it has a right
lateral side and a craniocaudal view. The
abnormality has a mass, a microlobulated margin ,
a round shape, and a probably malignant
assessment.
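The step from RDF statements to sentences like those above can be sketched with a lexicon for instances and a sentence template per property. This is not the MIAKT generator: the triples are hand-copied from the Example RDF Input slide rather than parsed, and the lexicon entries, templates and function name `verbalise` are assumptions.

```python
# Triples (subject, property, object) taken from the example RDF by hand;
# a real system would obtain them from an RDF parser.
TRIPLES = [
    ("01401_mammography", "carried_out_on", "01401_patient"),
    ("01401_mammography", "produce_result", "image_01401_right_cc"),
]

# Hypothetical lexicon: instance identifier -> noun phrase.
INSTANCE_LEXICON = {
    "01401_patient": "patient",
    "01401_mammography": "mammography exam",
    "image_01401_right_cc": "right CC image",
}

# Hypothetical templates: property -> sentence pattern.
PROPERTY_TEMPLATES = {
    "carried_out_on": "The {s} is carried out on the {o}.",
    "produce_result": "The {s} produced a {o}.",
}


def verbalise(triples):
    """Turn each triple with a known template into one sentence."""
    sentences = []
    for subject, prop, obj in triples:
        template = PROPERTY_TEMPLATES.get(prop)
        if template is None:
            continue  # no lexical entry for this property: skip it
        sentences.append(template.format(
            s=INSTANCE_LEXICON.get(subject, subject),
            o=INSTANCE_LEXICON.get(obj, obj),
        ))
    return " ".join(sentences)


text = verbalise(TRIPLES)
```

The generated report above additionally merges the date into the first sentence and groups several image properties into one, i.e. it applies aggregation on top of this one-sentence-per-triple baseline.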
112Further Reading on IE for SemWeb
- Requirements for Information Extraction for Knowledge Management. http://nlp.shef.ac.uk/dot.kom/publications.html
- Information Extraction as a Semantic Web Technology: Requirements and Promises. Adaptive Text Extraction and Mining workshop, 2003.
- A. Kiryakov, B. Popov, et al. Semantic Annotation, Indexing, and Retrieval. 2nd International Semantic Web Conference (ISWC2003). http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
- S. Handschuh, S. Staab, R. Volz. On Deep Annotation. WWW03. http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
- S. Dill, N. Eiron, et al. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. WWW03. http://www.tomkinshome.com/papers/2Web/semtag.pdf
- E. Motta, M. Vargas-Vera, et al. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), (EKAW02). http://www.aktors.org/publications/selected-papers/06.pdf
- K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, M. Dimitrov. Semantic Web Enabled, Open Source Language Technology. Language Technology and the Semantic Web, Workshop on NLP and XML (NLPXML-2003). http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
- Handschuh, Staab, Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata (2002). http://citeseer.nj.nec.com/529793.html
113Further Reading on traditional IE
- [Day et al 97] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP97). 1997.
- [Ciravegna 02] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks. User-System Cooperation in Document Annotation based on Information Extraction. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), (EKAW02), 2002.
- N. Kushmerick, B. Thomas. Adaptive information extraction: Core technologies for information agents (2002). http://citeseer.nj.nec.com/kushmerick02adaptive.html
- H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). 2002.
- D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
- Califf and Mooney. Relational Learning of Pattern Matching Rules for Information Extraction. http://citeseer.nj.nec.com/6804.html
- Borthwick, A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999.
- Bikel D., Schwartz R., Weischedel R. An algorithm that learns what's in a name. Machine Learning 34, pp. 211-231, 1999.
- Riloff, E. Automatically Generating Extraction Patterns from Untagged Text. Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049. http://www.cs.utah.edu/%7Eriloff/psfiles/aaai96.pdf
- Daelemans W. and Hoste V. Evaluation of Machine Learning Methods for Natural Language Processing Tasks. In LREC 2002: Third International Conference on Language Resources and Evaluation, pages 755-760.
114Further Reading on traditional IE
- Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
- Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002.
- Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, 2000.
- Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.
- Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwl™ Extractor System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors: Corpus Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996.
- Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
115Further Reading on multilingual IE
- Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31 - April 3, 1997.
- Sekine S., Grishman R. and Shinou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998.
- Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In Proceedings of the 19th International Conference on Computational Linguistics (COLING2002), pp. 967-973, 2002.
- Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002.
- D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
- M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard and H. Cunningham. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.
- D. Maynard, V. Tablan and H. Cunningham. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.
116Further Reading on multilingual IE
- H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction: the MUMIS project. Data and Knowledge Engineering, 2003.
- D. Manov, A. Kiryakov, B. Popov, K. Bontcheva, D. Maynard, H. Cunningham. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.
- H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
- H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
- D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
- K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.
117THANK YOU!
The slides: http://gate.ac.uk/sale/talks/sekt-tutorial.ppt