Title: Named Entity Recognition
1Named Entity Recognition
http://gate.ac.uk/
http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva
RANLP, Borovets, Bulgaria, 8th September 2003
2Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
3Information Extraction
- Information Extraction (IE) pulls facts and structured information from the content of large text collections.
- IR - IE - NLU
- MUC: Message Understanding Conferences
- ACE: Automatic Content Extraction
4MUC-7 tasks
- NE: Named Entity recognition and typing
- CO: co-reference resolution
- TE: Template Elements
- TR: Template Relations
- ST: Scenario Templates
5An Example
- NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
- CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
- TE: the rocket is "shiny red" and Head's "brainchild"
- TR: Dr. Head works for We Build Rockets Inc.
- ST: a rocket launching event occurred with the various participants
- The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
6Performance levels
- Vary according to text type, domain, scenario, language
- NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
- CO: 60-70% resolution
- TE: 80%
- TR: 75-80%
- ST: 60% (but human level may be only 80%)
7What are Named Entities?
- NER involves identification of proper names in texts, and their classification into a set of predefined categories of interest:
- Person names
- Organisations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
8What are Named Entities (2)
- Other common types: measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.
- Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references, etc.
- MUC-7 entity definition guidelines [Chinchor 97]
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
9What are NOT NEs (MUC-7)
- Artefacts: Wall Street Journal
- Common nouns referring to named entities: the company, the committee
- Names of groups of people and things named after people: the Tories, the Nobel prize
- Adjectives derived from names: Bulgarian, Chinese
- Numbers which are not times, dates, percentages, or money amounts
10Basic Problems in NE
- Variation of NEs, e.g. John Smith, Mr Smith, John
- Ambiguity of NE types:
- John Smith (company vs. person)
- May (person vs. month)
- Washington (person vs. location)
- 1945 (date vs. time)
- Ambiguity with common words, e.g. "may"
11More complex problems in NE
- Issues of style, structure, domain, genre etc.
- Punctuation, spelling, spacing, formatting, ... all have an impact:
- Dept. of Computing and Maths
- Manchester Metropolitan University
- Manchester
- United Kingdom
- Tell me more about Leonardo
- Da Vinci
12Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
13Applications
- Can help summarisation, ASR and MT
- Intelligent document access
- Browse document collections by the entities that occur in them
- Formulate more complex queries than IR can answer
- Example application domains:
- News
- Scientific articles, e.g., MEDLINE abstracts
14Application - Threat Tracker
Search by entity: http://www.alias-i.com/iraq/feature_description/entity_search.html
15Application Example - KIM
Browsing by entity and ontology
http://www.ontotext.com/kim
16Application Example - KIM
Ontotext's KIM: formal query over OWL (including relations between entities) and results
17Application Example - Perseus
Time-line and geographic visualisation
http://www.perseus.tufts.edu/
18Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
19Some NE Annotated Corpora
- MUC-6 and MUC-7 corpora - English
- CONLL shared task corpora:
- http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
- http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
- TIDES surprise language exercise (NEs in Cebuano and Hindi)
- ACE - English - http://www.ldc.upenn.edu/Projects/ACE/
20The MUC-7 corpus
- 100 documents in SGML
- News domain
- 1880 Organizations (46%)
- 1324 Locations (32%)
- 887 Persons (22%)
- Inter-annotator agreement very high (97%)
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
21The MUC-7 Corpus (2)
- <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> -- Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
- <p>
- Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
22NE Annotation Tools - Alembic
23NE Annotation Tools - Alembic (2)
24NE Annotation Tools - GATE
25NE Annotation Tools - SProUT http://sprout.dfki.de/ (added by TD)
- SProUT (Shallow Processing with Unification and Typed Feature Structures) is a platform for the development of multilingual shallow text processing and information extraction systems (see http://sprout.dfki.de/)
- It consists of several reusable Unicode-capable online linguistic processing components for basic linguistic operations ranging from tokenization to coreference matching. Since typed feature structures (TFS) are used as a uniform data structure for representing the input and output of each of these processing resources, they can be flexibly combined into a pipeline that produces several streams of linguistically annotated structures, which serve as input for the shallow grammar interpreter applied at the next stage.
- The grammar formalism in SProUT, called XTDL, is a blend of very efficient finite-state techniques and unification-based formalisms which are known to guarantee transparency and expressiveness. A grammar in SProUT consists of pattern/action rules, where the LHS of a rule is a regular expression over TFSs with functional operators and coreferences, representing the recognition pattern, and the RHS of a rule is a TFS specification of the output structure.
- Furthermore, SProUT comes with an integrated grammar development and testing environment.
- Currently, the platform provides linguistic processing resources for several languages including, among others, English, German, French, Italian, Dutch, Spanish, Polish, Czech, Chinese, and Japanese.
26On-Line IE Tools - Open Calais (added by TD)
- http://www.opencalais.com/about
27Corpora and System Development
- Corpora are typically divided into a training and a testing portion
- Rules/learning algorithms are trained on the training part
- Tuned on the testing portion in order to optimise:
- Rule priorities, rule effectiveness, etc.
- Parameters of the learning algorithm and the features used
- Evaluation set: the best system configuration is run on this data and the system performance is obtained
- No further tuning once the evaluation set is used!
28Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
29Performance Evaluation
- Evaluation metric: mathematically defines how to measure the system's performance against a human-annotated gold standard
- Scoring program: implements the metric and provides performance measures
- For each document and over the entire corpus
- For each type of NE
30The Evaluation Metric
- Precision = correct answers / answers produced
- Recall = correct answers / total possible correct answers
- Trade-off between precision and recall
- F-measure = (β² + 1)·P·R / (β²·R + P) [van Rijsbergen 75]
- β reflects the weighting between precision and recall, typically β = 1
31The Evaluation Metric (2)
- We may also want to take account of partially correct answers (see the sketch below):
- Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
- Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
- Why? NE boundaries are often misplaced, so some results are only partially correct
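- As a rough illustration (not the MUC scorer itself), the following Python sketch computes these partial-credit scores and the F-measure from the previous slide; the counts in the example call are invented.

  # A minimal sketch of the slide's formulas: half credit for partially
  # correct spans, then van Rijsbergen's F-measure.
  def ne_scores(correct, partial, incorrect, missing, beta=1.0):
      """Counts of system responses against the gold standard annotations."""
      hits = correct + 0.5 * partial
      produced = correct + incorrect + partial
      possible = correct + missing + partial
      precision = hits / produced if produced else 0.0
      recall = hits / possible if possible else 0.0
      denom = beta ** 2 * recall + precision
      f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
      return precision, recall, f

  # e.g. 9 correct, 1 partial, 0 incorrect, 2 missing answers
  print(ne_scores(correct=9, partial=1, incorrect=0, missing=2))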
32The MUC scorer (1)
- Document 9601020572
- SUBTASK SCORES (enamex):

                 POS  ACT  COR  PAR  INC  MIS  SPU  NON  REC  PRE
   organization   11   12    9    0    0    2    3    0   82   75
   person         24   26   24    0    0    0    2    0  100   92
   location       27   31   25    0    0    2    6    0   93   81

- SUMMARY SCORES - TASK SCORES (enamex):

                 POS   ACT   COR  PAR  INC  MIS  SPU  NON  REC  PRE
   organization 1855  1757  1553    0   37  265  167   30   84   88
   person        883   859   797    0   13   73   49    4   90   93
33The MUC scorer (2)
- Using the detailed report we can track errors in each document, for each NE in the text
- ENAMEX cor inc PERSON PERSON "Wernher von Braun" / "Braun"
- ENAMEX cor inc PERSON PERSON "von Braun" / "Braun"
- ENAMEX cor cor PERSON PERSON "Braun" / "Braun"
- ENAMEX cor cor LOCATI LOCATI "Saturn" / "Saturn"
34The GATE Evaluation Tool
35Regression Testing
- Need to track the system's performance over time
- When a change is made to the system, we want to know what the implications are over the entire corpus
- Why? Because an improvement in one case can lead to problems in others
- GATE offers an automated tool to help with the NE development task over time
36Regression Testing (2)
At corpus level: GATE's corpus benchmark tool, tracking the system's performance over time
37Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
38Pre-processing for NE Recognition
- Format detection
- Word segmentation (for languages like Chinese)
- Tokenisation
- Sentence splitting
- POS tagging
39Two kinds of NE approaches
- Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- requires only small amount of training data
- development could be very time consuming
- some changes may be hard to accommodate
- Learning Systems
- use statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
- annotators are cheap (but you get what you pay for!)
40Baseline list lookup approach
- System that recognises only entities stored in its lists (gazetteers); a sketch follows below
- Advantages: simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
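- A minimal Python sketch of such a list-lookup baseline, using longest-match lookup over tokenised text; the gazetteer entries and type names are purely illustrative.

  GAZETTEERS = {
      "location": {("New", "York"), ("Paris",), ("United", "Kingdom")},
      "person_first": {("John",), ("Mary",)},
  }

  def lookup_entities(tokens, gazetteers=GAZETTEERS, max_len=4):
      """Return (start, end, type) spans for gazetteer matches, longest first."""
      entities, i = [], 0
      while i < len(tokens):
          match = None
          for length in range(min(max_len, len(tokens) - i), 0, -1):
              span = tuple(tokens[i:i + length])
              for etype, entries in gazetteers.items():
                  if span in entries:
                      match = (i, i + length, etype)
                      break
              if match:
                  break
          if match:
              entities.append(match)
              i = match[1]
          else:
              i += 1
      return entities

  print(lookup_entities("John lives in New York".split()))
  # -> [(0, 1, 'person_first'), (3, 5, 'location')]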
41Creating Gazetteer Lists
- Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva 02])
- Location lists:
- US GEOnet Names Server (GNS) data: 3.9 million locations with 5.37 million names (e.g. [Manov 03])
- UN site: http://unstats.un.org/unsd/citydata
- Global Discovery database from Europa Technologies Ltd, UK (e.g. [Ignat 03])
- Automatic collection from annotated training data
42Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
43Shallow Parsing Approach (internal structure)
- Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. for locations (see the sketch below):
- Cap. Word + {City, Forest, Center, River}
- e.g. Sherwood Forest
- Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
- e.g. Portobello Street
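- A minimal Python sketch of this internal-evidence idea (a capitalised word followed by a stored trigger word); the trigger list is illustrative, not an actual gazetteer from the tutorial.

  import re

  LOCATION_TRIGGERS = {"City", "Forest", "Center", "River",
                       "Street", "Boulevard", "Avenue", "Crescent", "Road"}

  # capitalised word followed by one of the location trigger words
  pattern = re.compile(r"\b([A-Z][a-z]+)\s+(" + "|".join(sorted(LOCATION_TRIGGERS)) + r")\b")

  def guess_locations(text):
      return [m.group(0) for m in pattern.finditer(text)]

  print(guess_locations("They met near Sherwood Forest, just off Portobello Street."))
  # -> ['Sherwood Forest', 'Portobello Street']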
44Problems with the shallow parsing approach
- Ambiguously capitalised words (first word in sentence): "All American Bank" vs. "All State Police"
- Semantic ambiguity: "John F. Kennedy" - airport (location); "Philip Morris" - organisation
- Structural ambiguity: "Cable and Wireless" vs. "Microsoft and Dell"; "Center for Computational Linguistics" vs. "message from City Hospital for John Smith"
45Shallow Parsing Approach with Context
- Use of context-based patterns is helpful in ambiguous cases
- "David Walton" and "Goldman Sachs" are indistinguishable on their own
- But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "Person of Organization" to identify "Goldman Sachs" correctly (sketched below)
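- A minimal Python sketch of applying the "Person of Organization" context pattern once a Person entity has been recognised; the regular expression and the entity representation are illustrative only.

  import re

  def organisations_from_context(text, known_persons):
      """Guess an Organization name following '<known person> of ...'."""
      found = []
      for person in known_persons:
          # a capitalised word sequence after "<person> of"
          for m in re.finditer(re.escape(person) + r" of ((?:[A-Z][\w&.]*\s?)+)", text):
              found.append(m.group(1).strip())
      return found

  text = "David Walton of Goldman Sachs said the deal was close."
  print(organisations_from_context(text, ["David Walton"]))  # -> ['Goldman Sachs']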
46Identification of Contextual Information
- Use KWIC index and concordancer to find windows of context around entities
- Search for repeated contextual patterns of either strings, other entities, or both
- Manually post-edit the list of patterns, and incorporate useful patterns into new rules
- Repeat with new entities
47Examples of context patterns
- PERSON earns MONEY
- PERSON joined ORGANIZATION
- PERSON left ORGANIZATION
- PERSON joined ORGANIZATION as JOBTITLE
- ORGANIZATION's JOBTITLE PERSON
- ORGANIZATION JOBTITLE PERSON
- the ORGANIZATION JOBTITLE
- part of the ORGANIZATION
- ORGANIZATION headquarters in LOCATION
- price of ORGANIZATION
- sale of ORGANIZATION
- investors in ORGANIZATION
- ORGANIZATION is worth MONEY
- JOBTITLE PERSON
- PERSON, JOBTITLE
48Caveats
- Patterns are only indicators based on likelihood
- Can set priorities based on frequency thresholds
- Need training data for each domain
- More semantic information would be useful (e.g. to cluster groups of verbs)
49Rule-based Example - FACILE
- FACILE - used in MUC-7 [Black et al. 98]
- Uses Inxight's LinguistiX tools for tagging and morphological analysis
- Database for external information, role similar to a gazetteer
- Linguistic info per token, encoded as a feature vector:
- Text offsets
- Orthographic pattern (first/all capitals, mixed, lowercase)
- Token and its normalised form
- Syntax category and features
- Semantics from database or morphological analysis
- Morphological analyses
- Example: (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (^PER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
- PER_CIV_F: female civilian (from database)
50FACILE (2)
- Context-sensitive rules written in a special rule notation, executed by an interpreter
- Writing rules in Perl is too error-prone and hard
- Rules of the kind A => B\C/D, where:
- A is a set of attribute-value expressions and an optional score; the attributes refer to elements of the input token feature vector
- B and D are the left and right context respectively and can be empty
- B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported
- Example: [syn=NP, sem=ORG] (0.9) => \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] /
51FACILE (3)
- Rule for the markup of person names when the first name is not present or known from the gazetteers, e.g. 'Mr J. Cass':
- [SYN=PROP, SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]
- (_F, _I, _M, _S are variables, transferring info from the RHS)
- =>
- [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
- \ [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
- [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
- [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
- [SYN=NAME, TOKEN=_M]?,
- [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]
- (proper name, not recognised by a rule)
- /
52FACILE (4)
- Preference mechanism
- The rule with the highest score is preferred
- Longer matches are preferred to shorter matches
- Results are always one semantic categorisation of the named entity in the text
- Evaluation (MUC-7 scores):
- Organization: 86% precision, 66% recall
- Person: 90% precision, 88% recall
- Location: 81% precision, 80% recall
- Dates: 93% precision, 86% recall
53Example Rule-based System - ANNIE
- Created as part of GATE
- GATE: Sheffield's open-source infrastructure for language processing
- GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- GATE has a finite-state pattern-action rule language, used by ANNIE
- ANNIE modified for MUC guidelines: 89.5% f-measure on MUC-7 corpus
54NE Components - the ANNIE system, a reusable and easily extendable set of components
55Gazetteer lists for rule-based NE
- Needed to store the indicator strings for the internal structure and context rules
- Internal location indicators - e.g., river, mountain, forest for natural locations; street, road, crescent, place, square for address locations
- Internal organisation indicators - e.g., company designators such as GmbH, Ltd, Inc
- Produces Lookup results of the given kind
56The Named Entity Grammars
- Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
- Hand-coded rules applied to annotations to identify NEs
- Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
- Use of contextual information
- Finds person names, locations, organisations, dates, addresses
57NE Rule in JAPE
- JAPE: a Java Annotation Patterns Engine
- Light, robust regular-expression-based processing
- Cascaded finite state transduction
- Low-overhead development of new components
- Simplifies multi-phase regex processing
- Rule: Company1
- Priority: 25
- (
-   ( {Token.orthography == upperInitial} )   // from tokeniser
-   {Lookup.kind == companyDesignator}        // from gazetteer lists
- ):match
- -->
- :match.NamedEntity = { kind = "company", rule = "Company1" }
58Named Entities in GATE
59Using co-reference to classify ambiguous NEs
- Orthographic co-reference module that matches proper names in a document
- Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
- May not reclassify already classified entities
- Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. "Bonfield" will match "Sir Peter Bonfield"; "International Business Machines Ltd." will match "IBM" (see the sketch below)
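- A minimal Python sketch of this orthographic matching idea (surname match, abbreviation match); the designator list and heuristics are illustrative, not the actual GATE module.

  DESIGNATORS = {"Ltd.", "Ltd", "Inc.", "Inc", "Corp.", "Corp", "GmbH"}

  def classify_by_orthography(unknown, classified):
      """classified: list of (name, entity_type) already assigned in the document."""
      for name, etype in classified:
          words = name.split()
          if etype == "Person" and unknown == words[-1]:          # surname match
              return etype
          core = [w for w in words if w not in DESIGNATORS]
          acronym = "".join(w[0] for w in core if w[0].isupper())
          if etype == "Organization" and unknown == acronym:       # abbreviation match
              return etype
      return None

  known = [("Sir Peter Bonfield", "Person"),
           ("International Business Machines Ltd.", "Organization")]
  print(classify_by_orthography("Bonfield", known))  # -> Person
  print(classify_by_orthography("IBM", known))       # -> Organization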
60Named Entity Coreference
61DEMO
62Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
63Machine Learning Approaches
- ML approaches frequently break down the NE task into two parts:
- Recognising the entity boundaries
- Classifying the entities into the NE categories
- Some work is only on one task or the other
- Tokens in text are often coded with the IOB scheme (decoded in the sketch below)
- O: outside; B-XXX: first word in an NE; I-XXX: all other words in an NE
- Easy to convert to/from inline MUC-style markup
- Argentina B-LOC / played O / with O / Del B-PER / Bosque I-PER
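- A minimal Python sketch of decoding IOB tags into entity spans, using the example above.

  def iob_to_spans(tokens, tags):
      spans, start, etype = [], None, None
      for i, tag in enumerate(tags):
          if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
              if etype is not None:
                  spans.append((start, i, etype))   # close the previous entity
              start, etype = i, tag[2:]
          elif tag == "O":
              if etype is not None:
                  spans.append((start, i, etype))
              start, etype = None, None
      if etype is not None:
          spans.append((start, len(tokens), etype))
      return [(" ".join(tokens[s:e]), t) for s, e, t in spans]

  tokens = ["Argentina", "played", "with", "Del", "Bosque"]
  tags = ["B-LOC", "O", "O", "B-PER", "I-PER"]
  print(iob_to_spans(tokens, tags))  # -> [('Argentina', 'LOC'), ('Del Bosque', 'PER')]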
64IdentiFinder [Bikel et al. 99]
- Based on Hidden Markov Models
- Features
- Capitalisation
- Numeric symbols
- Punctuation marks
- Position in the sentence
- 14 features in total, combining the above info, e.g., containsDigitAndDash (09-96), containsDigitAndComma (23,000.00); see the sketch below
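- A minimal Python sketch of word-feature functions in this spirit; the three features shown are illustrative, not Bikel et al.'s exact 14-feature set.

  import re

  def word_feature(token):
      if re.fullmatch(r"\d+-\d+", token):
          return "containsDigitAndDash"        # e.g. 09-96
      if re.fullmatch(r"[\d,]+\.\d+", token):
          return "containsDigitAndComma"       # e.g. 23,000.00
      if token[:1].isupper() and token[1:].islower():
          return "initialCap"
      return "other"

  for tok in ["09-96", "23,000.00", "Endeavour", "shuttle"]:
      print(tok, "->", word_feature(tok))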
65IdentiFinder (2)
- MUC-6 (English) and MET-1 (Spanish) corpora used for evaluation
- Mixed-case English:
- IdentiFinder - 94.9% f-measure
- Best rule-based - 96.4%
- Spanish mixed case:
- IdentiFinder - 90%
- Best rule-based - 93%
- Lower-case names, noisy training data, less training data
- Training data: 650,000 words, but similar performance with half of the data. Less than 100,000 words reduces the performance to below 90% on English
66MENE [Borthwick et al. 98]
- Combines rule-based and ML NE to achieve better performance
- Tokens tagged as XXX_start, XXX_continue, XXX_end, XXX_unique, other (non-NE), where XXX is an NE category
- Uses Maximum Entropy
- One only needs to find the best features for the problem
- The ME estimation routine finds the best relative weights for the features
67MENE (2)
- Features (see the sketch below):
- Binary features: token begins with a capitalised letter, token is a four-digit number
- Lexical features: dependencies on the surrounding tokens (window of 2), e.g., "Mr" for people, "to" for locations
- Dictionary features: equivalent to gazetteers (first names, company names, dates, abbreviations)
- External systems: whether the current token is recognised as an NE by a rule-based system
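- A minimal Python sketch of MENE-style binary and lexical features fed to a maximum entropy classifier; scikit-learn's LogisticRegression stands in for the ME estimation routine, and the tiny training set is invented.

  from sklearn.feature_extraction import DictVectorizer
  from sklearn.linear_model import LogisticRegression

  def token_features(tokens, i):
      tok = tokens[i]
      return {
          "initial_cap": tok[:1].isupper(),                       # binary feature
          "four_digit_number": tok.isdigit() and len(tok) == 4,   # binary feature
          "prev_word": tokens[i - 1].lower() if i > 0 else "<s>", # lexical features
          "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
      }

  sents = [("Mr Smith visited London", ["other", "person_unique", "other", "location_unique"]),
           ("She flew to Paris", ["other", "other", "other", "location_unique"])]
  X, y = [], []
  for text, labels in sents:
      toks = text.split()
      for i, label in enumerate(labels):
          X.append(token_features(toks, i))
          y.append(label)

  vec = DictVectorizer()
  clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
  test = "Mr Jones visited Berlin".split()
  print(clf.predict(vec.transform([token_features(test, 1)])))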
68MENE (3)
- MUC-7 formal run corpus
- MENE: 84.2% f-measure
- Rule-based systems it uses: 86-91%
- MENE + rule-based systems: 92%
- Learning curve:
- 20 docs: 80.97%
- 40 docs: 84.14%
- 100 docs: 89.17%
- 425 docs: 92.94%
69NE Recognition without Gazetteers [Mikheev et al. 99]
- How big should gazetteer lists be?
- Experiment with a simple list lookup approach on the MUC-7 corpus
- Learned lists (from the MUC-7 training corpus):
- 1228 person names
- 809 organisations
- 770 locations
- Common lists (from the Web)
- 5000 locations
- 33,000 organisations
- 27,000 person names
70NE Recognition without Gazetteers (2)
Category | Learned lists (Recall / Precision) | Common lists (Recall / Precision) | Combined (Recall / Precision)
ORG      | 49 / 75                            | 3 / 51                            | 50 / 72
PER      | 26 / 92                            | 31 / 81                           | 47 / 85
LOC      | 76 / 93                            | 74 / 94                           | 86 / 90
71NE Recognition without Gazetteers (3)
- System combines rule-based grammars and statistical (MaxEnt) models
- Full gaz: 4900 LOC, 30,000 ORG, 10,000 PER
- Some locs: 200 countries + continents + 8 planets
- Ltd gaz: Some locs + lists inferred from 30 processed texts in the same domain

Category | Full Gaz (rec / prec) | Ltd Gaz (rec / prec) | Some locs (rec / prec) | No Gaz (rec / prec)
ORG      | 90 / 93               | 87 / 90              | 87 / 89                | 86 / 85
PER      | 96 / 98               | 92 / 97              | 90 / 97                | 90 / 95
LOC      | 95 / 94               | 91 / 92              | 85 / 90                | 46 / 59
72NE Recognition without Gazetteers (4)
Stage                    | ORG     | PER     | LOC
Sure-fire rule           | R42 P98 | R40 P99 | R36 P96
Part. match 1            | R75 P98 | R80 P99 | R69 P93
Relaxed rules (use gaz.) | R83 P96 | R90 P98 | R86 P93
Part. match 2            | R85 P96 | R93 P97 | R88 P93
Title assignment         | R91 P95 | R95 P97 | R95 P93
73Fine-grained Classification of NEs [Fleischman 02]
- Finer-grained categorisation needed for applications like question answering
- Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police
- Approach uses local context and global semantic information such as WordNet
- Used a decision list classifier and IdentiFinder to automatically construct a training set from untagged data
- Held-out set of 1300 instances hand-annotated
74Fine-grained Classification of NEs (2)
- Word frequency features: how often the words surrounding the target instance occur with a specific category in training
- For each of the 8 categories, 10 distinct word positions = 80 features per instance:
- 3 words before / after the instance
- The two-word bigrams immediately before and after the instance
- The three-word trigrams before/after the instance

Position | N-gram                       | Category    | Freq.
1        | Previous unigram "introduce" | politician  | 3
2        | Previous unigram "introduce" | entertainer | 43
3        | Following bigram "into that" | politician  | 2
4        | Following bigram "into that" | business    | 0
75Fine-grained Classification of NEs (3)
- Topic signatures and WordNet information
- Compute lists of terms that signal relevance to a topic/category [Lin & Hovy 00]; expand with WordNet synonyms to counter unseen examples
- Politician: campaign, republican, budget
- The topic signature features convey information about the overall context in which each instance exists
- Due to differing contexts, instances of the same name in a single text were classified differently
76Fine-grained Classification of NEs (4)
- MemRun chooses the prevailing sub-category based on the most frequent classification of each name's instances
- An orthomatching-like algorithm is developed to match George Bush, Bush, and George W. Bush
- Experiments with k-NN, Naïve Bayes, SVMs, Neural Networks and C4.5 show that C4.5 is best
- Experiments with different feature configurations: 70.4% with all features discussed here
- Future work: treating finer-grained classification as a WSD task (categories are different senses of a person)
77Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
78Multilingual Named Entity Recognition
- Recent experiments are aimed at NE recognition in multiple languages
- TIDES surprise language evaluation exercise: measures how quickly researchers can develop NLP components in a new language
- CONLL02, CONLL03: focus on language-independent NE recognition
79Analysis of the NE Task in Multiple Languages [Palmer & Day 97]

Language   | NEs  | Time/Date (%) | Numeric exprs. (%) | Org/Per/Loc (%)
Chinese    | 4454 | 17.2          | 1.8                | 80.9
English    | 2242 | 10.7          | 9.5                | 79.8
French     | 2321 | 18.6          | 3.0                | 78.4
Japanese   | 2146 | 26.4          | 4.0                | 69.6
Portuguese | 3839 | 17.7          | 12.1               | 70.3
Spanish    | 3579 | 24.6          | 3.0                | 72.5
80Analysis of Multilingual NE (2)
- Numerical and time expressions are very easy to capture using rules
- Together they constitute about 20-30% of all NEs
- All numerical expressions in the 6 languages required only 5 patterns (a sketch of this kind of pattern follows below)
- Time expressions similarly require only a few rules (less than 30 per language)
- Many of these rules are reusable across the languages
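- A minimal Python sketch of such rule-based patterns; the two regular expressions are illustrative, not the actual five patterns referred to above.

  import re

  PATTERNS = {
      "DATE": re.compile(r"\b\d{1,2}(st|nd|rd|th)?\s+(January|February|March|April|May|June|"
                         r"July|August|September|October|November|December)\s+\d{4}\b"),
      "PERCENT": re.compile(r"\b\d+(\.\d+)?\s?(%|percent)\b"),
  }

  def tag_expressions(text):
      return [(kind, m.group(0)) for kind, pat in PATTERNS.items() for m in pat.finditer(text)]

  print(tag_expressions("Turnout rose 4.5 percent before 8th September 2003."))
  # -> [('DATE', '8th September 2003'), ('PERCENT', '4.5 percent')]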
81Analysis of Multilingual NE (3)
- Suggest a method for calculating the lower bound of system performance given a corpus in the target language
- Conclusion: much of the NE task can be achieved by simple string analysis and common phrasal contexts
- Zipf's law: the prevalence of frequent phenomena allows high scores to be achieved directly from the training data
- Chinese, Japanese, and Portuguese corpora had a lower bound above 70%
- Substantial further advances require language specificity
82What is needed for multilingual NE
- Extensive support for non-Latin scripts and text encodings, including conversion utilities
- Automatic recognition of encoding [Ignat et al. 03]
- Occupied up to 2/3 of the TIDES Hindi effort
- Bi-lingual dictionaries
- Annotated corpus for evaluation
- Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)
83Multilingual support - Alembic
Japanese example
84Editing Multilingual Data
- GATE Unicode Kit (GUK)
- Complements Java's facilities
- Support for defining Input Methods (IMs)
- Currently 30 IMs for 17 languages
- Pluggable into other applications (e.g. JEdit)
85Multilingual Data - GATE: all processing, visualisation and editing tools use GUK
86Gazetteer-based Approach to Multilingual NE [Ignat et al. 03]
- Deals with locations only
- Even more ambiguity than in one language:
- Multiple places that share the same name, such as the fourteen cities and villages in the world called Paris
- Place names that are also words in one or more languages, such as And (Iran), Split (Croatia)
- Places have varying names in different languages (Italian Venezia vs. English Venice, German Venedig, French Venise)
87Gazetteer-based multilingual NE (2)
- Disambiguation module applies heuristics based on location size and country mentions (prefer the locations from the country mentioned most; see the sketch below)
- Performance evaluation:
- 853 locations from 80 English texts
- 96.8% precision
- 96.5% recall
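- A minimal Python sketch of this disambiguation heuristic (prefer candidates from the country mentioned most often, then the larger place); the candidate data is invented.

  def disambiguate(name, candidates, country_mentions):
      """candidates: list of (country, population) readings of `name`;
      country_mentions: dict country -> number of mentions in the document."""
      return max(candidates,
                 key=lambda c: (country_mentions.get(c[0], 0), c[1]))

  paris_candidates = [("France", 2_100_000), ("United States", 25_000)]  # Paris, Texas
  doc_country_mentions = {"France": 7, "United States": 1}
  print(disambiguate("Paris", paris_candidates, doc_country_mentions))
  # -> ('France', 2100000)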
88Machine Learning for Multilingual NE
- CONLL2002 and 2003 shared tasks were NE in Spanish, Dutch, English, and German
- The most popular ML techniques used:
- Maximum Entropy (5 systems)
- Hidden Markov Models (4 systems)
- Connectionist methods (4 systems)
- Combining ML methods has been shown to boost results
89ML for NE at CONLL (2)
- The choice of features is at least as important as the choice of ML algorithm:
- Lexical features (words)
- Part-of-speech
- Orthographic information
- Affixes
- Gazetteers
- External, unmarked data is useful for deriving gazetteers and for extracting training instances
90ML for NE at CONLL (3)
- English (f-measure):
- Baseline - 59.5% (list lookup of entities with 1 class in training data)
- Systems between 60.2% and 88.76%
- German (f-measure):
- Baseline - 30.3%
- Systems between 47.7% and 72.4%
- Spanish (f-measure):
- Baseline - 35.9%
- Systems between 60.9% and 81.4%
- Dutch (f-measure):
- Baseline - 53.1%
- Systems between 56.4% and 77%
91TIDES surprise language exercise
- Collaborative effort between a number of sites to develop resources and tools for various LE tasks on a surprise language
- Tasks: IE (including NE), machine translation, summarisation, cross-language IR
- Dry-run lasted 10 days, on the Cebuano language from the Philippines
- Surprise language was Hindi, announced at the start of June 2003; duration 1 month
92Language categorisation
- LDC survey of the 300 largest languages (by population) to establish what resources are available
- http://www.ldc.upenn.edu/Projects/TIDES/language-summary-table.html
- Classification dimensions:
- Dictionaries, news texts, parallel texts (e.g., Bible)
- Script, orthography, words separated by spaces
93The Surprise Languages
- Cebuano
- Latin script and words are spaced, but
- Few resources and little work, so
- Medium difficulty
- Hindi
- Non-Latin script, different encodings used, words are spaced, no capitalisation
- Many resources available
- Medium difficulty
94Named Entity Recognition for TIDES
- Information on other systems and results from TIDES is still unavailable to non-TIDES participants
- Will be made available by the end of 2003 in a special issue of ACM Transactions on Asian Language Information Processing (TALIP): "Rapid Development of Language Capabilities: The Surprise Languages"
- The Sheffield approach is presented below, because it is not subject to these restrictions
95Dictionary-based Adaptation of an English POS Tagger
- Substituted a Hindi/Cebuano lexicon for the English one in a Brill-like tagger
- Hindi/Cebuano lexicon derived from a bi-lingual dictionary
- Used an empty ruleset since no training data was available
- Used default heuristics (e.g. return NNP for capitalised words)
- Very experimental, but reasonable results
96Evaluation of the Tagger
- No formal evaluation was possible
- Estimate around 67% accuracy on Hindi, evaluated by a native speaker on 1000 words
- Created in 2 person-days
- Results and a tagging service made available to other researchers in TIDES
- Important pre-requisite for NE recognition
97NE grammars
- Most English JAPE rules are based on POS tags and gazetteer lookup
- Grammars can be reused for languages with similar word order, orthography, etc.
- No time to make a detailed study of Cebuano, but it is very similar in structure to English
- Most of the rules were left as for English, but with some adjustments, especially to handle dates
- Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
98(No Transcript)
99Evaluation Results
Category | Cebuano (P / R / F) | English Baseline (P / R / F)
Person   | 71 / 65 / 68        | 36 / 36 / 36
Org      | 75 / 71 / 73        | 31 / 47 / 38
Location | 73 / 78 / 76        | 65 / 7 / 12
Date     | 83 / 100 / 92       | 42 / 58 / 49
Total    | 76 / 79 / 77.5      | 45 / 41.7 / 43
100Structure of the Tutorial
- task definition
- applications
- corpora, annotation
- evaluation and testing
- how to
- preprocessing
- approaches to NE
- baseline
- rule-based approaches
- learning-based approaches
- multilinguality
- future challenges
101Future challenges
- Towards semantic tagging of entities
- New evaluation metrics for semantic entity recognition
- Expanding the set of entities recognised, e.g., vehicles, weapons, substances (food, drugs)
- Finer-grained hierarchies, e.g., types of Organizations (government, commercial, educational, etc.), Locations (regions, countries, cities, water, etc.)
102Future challenges (2)
- Standardisation of the annotation formats
- [Ide & Romary 02]: RDF-based annotation standards
- [Collier et al. 02]: multi-lingual named entity annotation guidelines
- Aimed at defining how to annotate in order to make corpora more reusable and lower the overhead of writing format conversion tools
- MUC used inline markup
- TIDES and ACE used stand-off markup, but of two different kinds (XML vs. one word per line)
103Towards Semantic Tagging of Entities
- The MUC NE task tagged selected segments of text whenever that text represents the name of an entity
- In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) from the mentions in the text the entities themselves
- ACE focuses on domain- and genre-independent approaches
- The ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
104ACE Entities
- Dealing with:
- Proper names, e.g., England, Mr. Smith, IBM
- Pronouns, e.g., he, she, it
- Nominal mentions: the company, the spokesman
- Identify which mentions in the text refer to which entities, e.g.:
- Tony Blair, Mr. Blair, he, the prime minister, he
- Gordon Brown, he, Mr. Brown, the chancellor
105ACE Example
- <entity ID="ft-airlines-27-jul-2001-2"
-     GENERIC="FALSE"
-     entity_type="ORGANIZATION">
- <entity_mention ID="M003"
-     TYPE="NAME"
-     string="National Air Traffic Services">
- </entity_mention>
- <entity_mention ID="M004"
-     TYPE="NAME"
-     string="NATS">
- </entity_mention>
- <entity_mention ID="M005"
-     TYPE="PRO"
-     string="its">
- </entity_mention>
- <entity_mention ID="M006"
-     TYPE="NAME"
-     string="Nats">
- </entity_mention>
106ACE Entities (2)
- Some entities can have different roles, i.e., behave as Organizations, Locations, or Persons: GPEs (Geo-Political Entities)
- "New York [GPE, role Person], flush with Wall Street money, has a lot of loose change jangling in its pockets."
- "All three New York [GPE, role Location] regional commuter train systems were found to be punctual more than 90 percent of the time."
107Further information on ACE
- ACE is a closed-evaluation initiative, which does not allow the publication of results
- Further information on guidelines and corpora is available at:
- http://www.ldc.upenn.edu/Projects/ACE/
- ACE also includes other IE tasks; for further details see Doug Appelt's presentation: http://www.clsp.jhu.edu/ws03/groups/sparse/presentations/doug.ppt
108Evaluating Richer NE Tagging
- Need for new metrics when evaluating hierarchy/ontology-based NE tagging
- Need to take into account distance in the hierarchy
- Tagging a company as a charity is less wrong than tagging it as a person (see the sketch below)
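- A minimal Python sketch of one possible hierarchy-aware credit function (not an established metric from the tutorial): credit decays with the distance between gold and predicted types in a small, purely illustrative ontology.

  ONTOLOGY_PARENTS = {"Company": "Organization", "Charity": "Organization",
                      "Organization": "Entity", "Person": "Entity", "Entity": None}

  def path_to_root(label):
      path = []
      while label is not None:
          path.append(label)
          label = ONTOLOGY_PARENTS[label]
      return path

  def label_credit(gold, predicted):
      """1.0 for an exact match, less for labels further apart in the hierarchy."""
      gold_path, pred_path = path_to_root(gold), path_to_root(predicted)
      common = set(gold_path) & set(pred_path)
      distance = gold_path.index(next(l for l in gold_path if l in common)) + \
                 pred_path.index(next(l for l in pred_path if l in common))
      return 1.0 / (1 + distance)

  print(label_credit("Company", "Charity"))  # 0.33: siblings under Organization
  print(label_credit("Company", "Person"))   # 0.25: only meet at Entity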
109Further Reading
- Aberdeen J., Day D., Hirschman L., Robinson P. and Vilain M. MITRE: Description of the Alembic System Used for MUC-6. MUC-6 proceedings, pages 141-155. Columbia, Maryland. 1995.
- Black W.J., Rinaldi F., Mowatt D. FACILE: Description of the NE System Used For MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- Borthwick A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999.
- Bikel D., Schwartz R., Weischedel R. An algorithm that learns what's in a name. Machine Learning 34, pp. 211-231, 1999.
- Carreras X., Màrquez L., Padró. Named Entity Extraction using AdaBoost. The 6th Conference on Natural Language Learning. 2002.
- Chang J.S., Chen S.D., Zheng Y., Liu X.Z., and Ke S.J. Large-corpus-based methods for Chinese personal name recognition. Journal of Chinese Information Processing, 6(3):7-15, 1992.
- Chen H.H., Ding Y.W., Tsai S.C. and Bian G.W. Description of the NTU System Used for MET2. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- Chinchor N. MUC-7 Named Entity Task Definition (Version 3.5). Available from ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997.
110Further reading (2)
- Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
- Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002.
- Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, 2000.
- Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.
- Ignat C., Pouliquen B., Ribeiro A. and Steinberger R. Extending an Information Extraction Tool Set to Eastern-European Languages. Proceedings of the Workshop on Information Extraction for Slavonic and other Central and Eastern European Languages (IESL'03). 2003.
- Krupka G.R., Hausman K. IsoQuest Inc.: Description of the NetOwl™ Extractor System as Used for MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors: Corpus Processing for Lexical Acquisition, pages 21-39. MIT Press. Cambridge, MA. 1996.
- Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
- Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
111Further reading (3)
- Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31 - April 3, 1997.
- Sekine S., Grishman R. and Shinou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998.
- Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pp. 967-973, 2002.
- Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002.
- Maynard D., Bontcheva K. and Cunningham H. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
- Wood M.M., Lydon S.J., Tablan V., Maynard D. and Cunningham H. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.
- Maynard D., Tablan V. and Cunningham H. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.
112Further reading (4)
- Saggion H., Cunningham H., Bontcheva K., Maynard D., Hamza O., Wilks Y. Multimedia Indexing through Multisource and Multilingual Information Extraction: the MUMIS project. Data and Knowledge Engineering, 2003.
- Manov D., Kiryakov A., Popov B., Bontcheva K., Maynard D., Cunningham H. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.
- Cunningham H., Maynard D., Bontcheva K., Tablan V. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
- Cunningham H. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
- Maynard D., Cunningham H., Bontcheva K., Dimitrov M. Adapting a Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
- Paskaleva E., Angelova G., Yankova M., Bontcheva K., Cunningham H. and Wilks Y. Slavonic Named Entities in GATE. CS-02-01, 2003.
- Pastra K., Maynard D., Cunningham H., Hamza O., Wilks Y. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.