Title: Named Entity Recognition
1Named Entity Recognition
- Sobha Lalitha Devi
- AU-KBC Research Centre
- Chennai
2Named Entity(NE) Recognition
- What is NE and What is not an NE
- How to identify NE
- Tagset and Annotation Guidelines
- Methods Used in developing NER
3Why do NER?
- Key part of Information Extraction system
- Robust handling of proper names is essential for many applications, such as summarization, information retrieval (IR) and anaphora resolution
- Pre-processing for different classification levels
- Information filtering
- Information linking
4What is NER ?
- NER involves identifying proper names in texts and classifying them into a set of predefined categories of interest.
- Three universally accepted categories
- Person, location and organisation
- Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), email addresses etc.
- Other domain-specific entities: names of drugs, genes, medical conditions, names of ships, bibliographic references etc.
5NER Definition
- Named entity recognition (NER), also known as entity identification (EI) and entity extraction, is the task of locating and classifying atomic elements in text into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc.
- John sold 5 companies in 2002.
- <ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
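As a rough illustration of how such inline markup can be consumed, the sketch below pulls the entities out of the tagged example sentence and recovers the plain text with character offsets. It assumes non-nested elements with a single TYPE attribute; the function name `extract_entities` is hypothetical and not part of any annotation toolkit.

```python
import re

# Matches one non-nested element such as <ENAMEX TYPE="PERSON">John</ENAMEX>.
# Assumption: only ENAMEX, NUMEX and TIMEX elements with a single TYPE attribute.
TAG_RE = re.compile(r'<(ENAMEX|NUMEX|TIMEX)\s+TYPE="([^"]+)">(.*?)</\1>')


def extract_entities(tagged):
    """Return (plain_text, entities).

    Each entity is (start, end, element, type, text), where start/end are
    character offsets into the plain (untagged) text.
    """
    plain_parts = []
    entities = []
    pos = 0       # read position in the tagged string
    out_len = 0   # length of the plain text built so far
    for m in TAG_RE.finditer(tagged):
        before = tagged[pos:m.start()]
        plain_parts.append(before)
        out_len += len(before)
        text = m.group(3)
        entities.append((out_len, out_len + len(text), m.group(1), m.group(2), text))
        plain_parts.append(text)
        out_len += len(text)
        pos = m.end()
    plain_parts.append(tagged[pos:])
    return "".join(plain_parts), entities


tagged = ('<ENAMEX TYPE="PERSON">John</ENAMEX> sold '
          '<NUMEX TYPE="QUANTITY">5</NUMEX> companies in '
          '<TIMEX TYPE="DATE">2002</TIMEX>.')
plain, ents = extract_entities(tagged)
for start, end, element, etype, text in ents:
    print(element, etype, repr(plain[start:end]))
```

Keeping offsets into the plain text, rather than the tagged string, makes it easier to compare system output against a reference annotation.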
6What is not NER?
- NER is not event recognition.
- NER does not create templates.
- NER does not perform co-reference or entity linking, though these processes are often implemented alongside NER as part of a larger IE system.
- NER is not just matching text strings against pre-defined lists of names.
- It recognises entities which are being used as entities in a given context.
- NER is not an easy task!
7Named Entity and Philosophy of Language
- Proper names are defined by
- Descriptivist theory of names
- Frege, Russell, Ludwig Wittgenstein and John Searle
- Causal theory of reference
- Saul Kripke
8- Descriptivist theory of names
- Proper names either are synonymous with descriptions, or have their reference determined by virtue of the name's being associated with a description or cluster of descriptions that an object uniquely satisfies.
- Causal theory of reference
- Proper names refer to an object by virtue of a causal connection with the object, as mediated through communities of speakers. That is, proper names, in contrast to descriptions, are rigid designators.
- Rigid designators: a proper name refers to the named object in every possible world in which the object exists.
- Descriptions, by contrast, may designate different objects in different possible worlds.
9Proper Names and Definite Descriptions
- The meaning of a sentence involving a proper name can be preserved by substituting a contextually appropriate description for the name.
- e.g. Otto von Bismarck can be known or described as the first Chancellor of the German Empire
- Kripke argues that definite descriptions cannot be rigid designators, because a definite description need not pick out the same object in all possible worlds
- More on Kripke's account of proper names in Naming and Necessity (1980)
10What is Named Entity
- Named entities are
- Noun phrases
- Rigid designators: a rigid designator denotes the same thing in all possible worlds in which that thing exists, and does not designate anything else in those possible worlds in which that thing does not exist
11EXAMPLES for Named Entity and not a Named entity
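- Hotel (not an NE) vs. Taj Hotel (NE)
- Flower (not an NE) vs. Rose Flower (NE)
- Beach (not an NE) vs. Kovalam Beach (NE)
- Airport (not an NE) vs. Indira Gandhi International Airport (NE)
- The School (not an NE) vs. Good Shepherd School (NE)
- Prime Minister (not an NE) vs. Mr. Manmohan Singh (NE)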
12Some problems in identifying NE
- Variation of NEs.
- Manmohan Singh, Manmohan, Dr. Manmohan Singh
- Ambiguity of NE types
- 1945 (date vs. time)
- Washington (location vs. person)
- May (person vs. month)
- Tata (person vs. organization)
13Ambiguity Examples
- Person vs. Location
- Sir C. P. Ramaswamy was the Divan of Travancore (Per)
- Sir C. P. Ramaswamy Road is in Chennai (Loc)
- Person vs. Organization
- Anil Ambani opened Reliance Fresh (Per)
- Reliance Fresh is under Anil Ambani Group Ltd (Org)
14More complex problems in NER
- Issues of style, structure, domain, genre etc.
- Punctuation, spelling, spacing and formatting all have an impact
- Dept. of Computing and Information Science
- Manchester Metropolitan University
- Manchester
- United Kingdom
- > Tell me more about Leonardo
- > Da Vinci
15Problems in NE Task Definition
- Category definitions are intuitively quite clear, but there are many grey areas.
- Many of these grey areas are caused by metonymy.
- Person vs. Artefact
- Organisation vs. Location
- Company vs. Artefact
- Location vs. Organisation
16Tagset for Named Entity
- The ACE tagset is hierarchical
- ACE: Automatic Content Extraction
- The tagset
- CLIA: hierarchical, similar to ACE
- Developed for two domains
- Tourism and Health
17- Manmade
- Religious Places
- Roads/Highways
- Museum
- Theme parks/Parks/Gardens
- Monuments
- Facilities
- Hospitals
- Institutes
- Library
- Hotel/Restaurants/Lodges
- Plant/Factories
- Police Station/Fire Services
- Public Comfort Stations
- Airports
- Ports
- Bus-Stations
- Locomotives
- Artifacts
- TAGSET
- ENAMEX
- Person
- Individual
- Family name
- Title
- Group
- Organization
- Government
- Public/private company
- Religious
- Non-government
- Political Party
- Para military
- Charitable
- Association
- GPE (Geo-political Social Entity)
- Media
- Location
18Tagset Continued
Tagset counts: first-level tags: 3, second-level tags: 43, third-level tags: 40, total: 86
- NUMEX
- Distance
- Money
- Quantity
- Count
- TIMEX
- Time
- Date
- Day
- Period
19How to Annotate
- 1. ENAMEX
- 1.1 Person
- 1.1.1 Individual
- These refer to the names of individual persons, including names of fictional characters found in stories, novels etc.
- Tag structure
- <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL"> abc </ENAMEX>
- Examples
- English
- <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Abdul Kalam</ENAMEX>
20Annotation continued
- 1.1.1.1 Family name
- In general, a person's name includes a family name. Whenever an individual's name occurs with a family name, the part of the name that refers to the family name must be tagged specifically with the subtag FAMILYNAME, as shown below.
- Tag structure
- <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME"> abc </ENAMEX>
- Examples
- English
- <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL">Lalu Prasad <ENAMEX TYPE="PERSON" SUBTYPE_1="INDIVIDUAL" SUBTYPE_2="FAMILYNAME">Yadav</ENAMEX></ENAMEX>
21NE Types
The named entity hierarchy is divided into three major classes: Entity Name, Time and Numerical Expressions.
22Entity Types
23Entity Name Types
- Persons are entities limited to humans. A person may be a single individual or a group. Individual refers to the name of an individual person. Group refers to a set of individuals.
- Location entities are limited to geographical entities such as countries, cities, continents and landmasses, bodies of water, and geological formations.
- Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.
24Examples for Entity Name Types
- En: Sita [PERSON] is working at HCL [ORGANIZATION], which is in Chennai [LOCATION]
- Ta: Seetha [PERSON] chennaiyilrukkira [LOCATION] HCLlil [ORGANIZATION] velaiseikirAl.
- Gloss (En): Sita / Chennai / HCL / working
- Ml: Seetha [PERSON] chennaiyillula [LOCATION] HCLlil [ORGANIZATION] jolicheyyunnu.
- Gloss (En): Sita / Chennai / HCL / working
- Hi: Seetha [PERSON] HCL [ORGANIZATION] main kaam kar raha hai, jo chennai [LOCATION] main hain.
- Gloss (En): Sita / HCL / work is / which / Chennai in
25Entity Name Types
Facilities are limited to buildings and other permanent man-made structures and real estate improvements such as hospitals, airports, colleges, libraries etc.
En: Apollo Hospital [FACILITY] is in Chennai [LOCATION]
Ta: Appallo maruthuvamanAi [FACILITY] Chennaiyil [LOCATION] irukkirathu
Ml: Appolo Asupathri [FACILITY] chennaiyil [LOCATION] aaN
Hi: Appolo aspathaal [FACILITY] chennai [LOCATION] mein haim.
26Entity Name Types
A locomotive entity is a physical device primarily designed to move an object from one location to another, by carrying, pulling, or pushing the transported object.
En: Ananthapuri Express [LOCOMOTIVE] departs from Chennai [LOCATION] at 7.30 pm [TIME].
Hi: Ananthapuri express [LOCOMOTIVE] Chennai [LOCATION] se rAth 7.30 [TIME] ko ravana hoga
Ml: Ananthapuri eksprass [LOCOMOTIVE] chennaiyilninn [LOCATION] raathri 7.30 maNikk [TIME] puRappetum.
Ta: Ananthapuri viraivu rayil [LOCOMOTIVE] chennaiyilirunthu [LOCATION] iRavu 7.30 maNikku [TIME] puRappatukirathu
27Entity Name Types
Artifact entities are objects or things produced or shaped by human craft, such as tools, weapons/ammunition, paintings, clothes, ornaments and medicines.
En: Vinayaga Statue [ARTIFACT] is looking beautiful
Ta: Vinayakarin Silai [ARTIFACT] pArpatharkku alakAkAkairukkirathu
Ml: ganapathi vigraham [ARTIFACT] baMgiyaayi irikkunnu.
Hi: Vinayaka moorthi [ARTIFACT] achi lagh rahi haim.
28Entity Name Types
Entertainment entities denote activities that are diverting and hold human attention or interest, giving pleasure, happiness or amusement, especially performances of some kind such as dance, music, sports and events.
En: Flower Exhibition [ENTERTAINMENT] is held at Hyderabad [LOCATION]
Ta: Malar kankAtchi [ENTERTAINMENT] hyderabaadil [LOCATION] Nadaiperukirathu
Ml: pushpa pradarshanam [ENTERTAINMENT] hyderabaadil [LOCATION] natakkunnu
Hi: phool pradarshnii [ENTERTAINMENT] hyderabad [LOCATION] meN Ayojith kiyaa jAthA hai
29Entity Name Types
Materials refer to the names of food items, cuisines, chemicals and cosmetics.
En: Honey [MATERIALS] is good for face
Ta: ThEn [MATERIALS] mukaththiRku nallathu
Ml: Madhu [MATERIALS] mukaththinu nallathAN
Hi: Shahad [MATERIALS] chehare ke liye achcha hai.
30Entity Name Types
Organisms: These are the names of different animal species, including birds and reptiles, as well as viruses, bacteria and the names of herbs, medicinal plants, shrubs, trees, fruits, flowers etc.
En: Peacock [ORGANISM] is the national bird of India [LOCATION]
Ta: Mayil [ORGANISM] InthiyAvin [LOCATION] thEciyappaRavai Akum.
Ml: Mayil [ORGANISM] indyayute [LOCATION] raashtrapakshi AN.
Hi: Mor [ORGANISM] bhaarath [LOCATION] kaa raashtrIya pakshi hai.
31Entity Name Types
Disease: Names of diseases, symptoms, diagnoses and treatments come under this type.
En: Smoking causes cancer [DISEASE]
Ta: PukaippithithalAl puRRuNoi [DISEASE] varukiRathu
Ml: pukavali aRbhudham [DISEASE] uNtAkkunnu
Hi: dhumrapan kaansar [DISEASE] ka kaaraN banaatha hai.
32Numerical Expressions
33Numerical Expressions
- Distance refers to distance measures such as kilometers, centimeters, meters, acres, feet etc.
- Example: 10 cm., twenty feet, 15 hectares
- Money specifies currency values in units such as rupee, euro, dinar, dollar etc.
- Example: Rs. 1000, 250 Euro, 160
- Count denotes the number (or count) of items/articles/things etc.
- Example: 5 subjects, 12 students, 20 books
- Quantity covers measurements in units such as liters, tons, grams, volts etc.
- Example: 20 litres, 22 kg, 50g, 100 volts
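The four NUMEX subtypes can be told apart largely by the unit word that accompanies the number. The sketch below is a minimal illustration of that idea, not the classifier used for the tagset: the unit lists are small samples taken from the examples on this slide (acres and hectares are kept under Distance because the slide lists them there), and `classify_numex` is a hypothetical helper name.

```python
import re

# Illustrative unit lists drawn from the slide's examples; not exhaustive.
DISTANCE_UNITS = {"km", "kilometers", "cm", "centimeters", "meters",
                  "feet", "acres", "hectares"}
QUANTITY_UNITS = {"litres", "liters", "kg", "g", "tons", "grams", "volts"}
CURRENCY_WORDS = {"rs", "rupee", "rupees", "euro", "dinar", "dollar", "dollars"}
NUMBER_WORDS = {"one", "two", "five", "ten", "twelve", "twenty"}


def classify_numex(expr):
    """Return DISTANCE, MONEY, QUANTITY, COUNT, or None for a short expression."""
    # Split into alphabetic words and numbers, so "50g" becomes ["50", "g"].
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", expr.lower())
    has_number = any(t[0].isdigit() or t in NUMBER_WORDS for t in tokens)
    if not has_number:
        return None
    words = [t for t in tokens if t.isalpha() and t not in NUMBER_WORDS]
    if any(w in CURRENCY_WORDS for w in words):
        return "MONEY"
    if any(w in DISTANCE_UNITS for w in words):
        return "DISTANCE"
    if any(w in QUANTITY_UNITS for w in words):
        return "QUANTITY"
    if words:
        # A number followed by some other noun is treated as a count.
        return "COUNT"
    return None  # a bare number carries no unit to decide on


for example in ["10 cm.", "twenty feet", "Rs. 1000", "250 Euro",
                "12 students", "22 kg", "50g", "160"]:
    print(example, "->", classify_numex(example))
```

A bare number such as the last example is left unclassified, which reflects the point that the surrounding context, not the digits, determines the subtype.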
34Time Expressions
35Temporal Expressions
- Temporal expressions are entities that refer to time, date, year, month and day
- Time: expressions of time in their different forms, including hours, minutes and seconds.
- Example
- 5 o'clock in the morning
- 9.30 a.m.
- Evening 6.30 p.m.
- Date: expressions of date in different forms, such as 13/12/2001, including month, date and year
- Example
- August 15 1947
- 1956
- September 11
36Temporal Expressions
- Day: expressions that convey days in a year, including days that recur weekly, fortnightly, monthly, quarterly, biennially etc.
- Example
- Sunday
- Tomorrow
- Today
- Yesterday
- Special Day refers to special days in a year
- Example
- Gandhi Jayanthi
- Rama Navami
37Temporal Expressions
- Period refers to expressions that express a duration of time, a time period or a time interval.
- Example
- 17th century
- 10 minutes
- 10 a.m. to 12 p.m.
- One year
38Methodologies
- Methods
- Rule Based
- Machine Learning
- Hidden Markov Model (HMM)
- Naïve Bayes Classifier
- Maximum Entropy Markov Model (MEMM)
- Conditional Random Fields (CRF)
- Hybrid Approach
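To make the HMM entry in the list above concrete, here is a small sketch of Viterbi decoding for a tagging HMM. The two states, the probability tables and the smoothing floor are toy values invented purely for illustration; a real system would estimate them from a tagged corpus.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable state sequence for tokens under an HMM.

    Unseen (state, token) pairs get the small probability `floor`.
    Log probabilities are used to avoid numeric underflow.
    """
    if not tokens:
        return []

    def log_emit(state, token):
        return math.log(emit_p[state].get(token, floor))

    best = [{s: math.log(start_p[s]) + log_emit(s, tokens[0]) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximises the score of reaching s.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit(s, tokens[t])
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (assumed values): "O" = outside any entity, "PER" = person.
STATES = ["O", "PER"]
START_P = {"O": 0.8, "PER": 0.2}
TRANS_P = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.6, "PER": 0.4}}
EMIT_P = {"O": {"sold": 0.5, "companies": 0.5}, "PER": {"john": 1.0}}

path = viterbi(["john", "sold", "companies"], STATES, START_P, TRANS_P, EMIT_P)
print(path)
```

MEMM and CRF taggers use the same kind of dynamic-programming decoder; what differs is how the local scores are modelled and trained.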
39Challenges of NER in Indian Languages
- The major challenges encountered in Indian languages are the following.
- Agglutination
- Ambiguity
- Between Proper and common nouns
- Between named entities
- Lack of Capitalization
40Challenges of NER in Indian Languages
Agglutination: In Dravidian languages, words consist of a lexical root to which one or more affixes are attached. Examples in Tamil:
1) Ta: Ramanaiththavira (other than Raman)
2) Ta: Cevvaiyandru (on Tuesday)
3) Ta: Inthiyavilllula (in India)
4) Ta: KannanaippaRRikkondu (hold onto Kannan)
41Challenges of NER in Indian Languages
Examples in Malayalam:
1) Ml: hemayiluNtaayirunna (that which Hema had)
2) Ml: Chennaiyilethunna (reaching Chennai)
3) Ml: arabikatalinaBimukhamaayi (towards the Arabian Sea)
4) Ml: kaaSiyilekkozhukunna (flowing towards kaaSi)
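Because the entity name is buried inside a longer word form, an exact dictionary match on the whole token fails for the examples above. The sketch below shows one crude workaround, matching the longest gazetteer entry that is a prefix of the token. The gazetteer contents and the name `find_root` are assumptions made for illustration; a real system would rely on a morphological analyser, since plain prefix matching also fires on unrelated words that merely start with a name.

```python
def find_root(token, gazetteer):
    """Return the longest gazetteer entry that is a prefix of token, else None."""
    lowered = token.lower()
    matches = [name for name in gazetteer if lowered.startswith(name.lower())]
    return max(matches, key=len) if matches else None


# Hypothetical gazetteer in the same romanisation as the slide examples.
gazetteer = {"Raman", "Kannan", "Chennai", "Hema", "Inthiya"}

for token in ["Ramanaiththavira", "KannanaippaRRikkondu", "Inthiyavilllula",
              "hemayiluNtaayirunna", "Chennaiyilethunna", "Cevvaiyandru"]:
    print(token, "->", find_root(token, gazetteer))
```

The last token has no entry in this toy gazetteer, so nothing is returned for it.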
42Challenges of NER in Indian Languages
- Ambiguity
- Indian languages suffer comparatively more from ambiguity, both between common and proper nouns and among named entity types themselves. In some cases the same word can refer to different named entity types. Such instances can be resolved using contextual information.
- Examples
- Hi: Akash - person name and sky
- Hi: Sooraj - person name and sun
- Hi: Chaanth - moon and silver
- Hi: Aam - mango and common
- Ml: Roopa - person name and rupee
- Ml: Madhu - person name and honey
- Ml: Mala - person name and garland
43Challenges of NER in Indian Languages
- Ta: Thinkal - day and month
- Ta: Malar - person name and flower
- Ta: Chevvai - day and planet
- Ta: Shakthi - person name and power
- Ta: MAlai - evening and garland
- Ta, Ml: Velli - silver, planet, day
44Challenges of NER in Indian Languages
Spell variation: Owing to different writing styles, the same entity is represented in various word forms. In Tamil, Sanskrit letters such as ja, sha, sri and Ha are replaced by sa, ciri and ka. Examples:
- Roja can be written as Rosa
- Srimathi - cirimathi
- Raja - rasa
- ShajahAn - sajakAn
45Challenges of NER in Indian Languages
- Lack of Capitalization
- In English and some other European languages, capitalization is an important feature for identifying proper nouns.
- It plays a major role in NE identification.
- Unlike English, Indian languages have no concept of capitalization.
46Nested Entities
Nested entities are named entities that occur within other named entities. They are also called embedded entities.
- Ta: Mathurai [LOCATION] MeenAtchi Amman [PERSON] Koyil [RELPLACE]
- En: Mathurai Meenatchi Amman Temple
- Ml: Nittoor [PERSON] Srinivasa rao [PERSON]
- En: Nitoor Srinivasa rao
- Hi: Rajeev [PERSON] MArg [ROAD]
- En: Rajeev Road
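One way to hold nested annotations in a program is as labelled token spans, where an inner span lies entirely within an outer one. The sketch below uses the Tamil temple example and assumes that the RELPLACE label covers the whole four-word phrase, with the LOCATION and PERSON spans embedded in it; that reading, the `Span` class and the `nested_pairs` helper are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # index of the first token, inclusive
    end: int     # index one past the last token, exclusive
    label: str


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            contained = outer.start <= inner.start and inner.end <= outer.end
            shorter = (inner.end - inner.start) < (outer.end - outer.start)
            if inner is not outer and contained and shorter:
                pairs.append((outer, inner))
    return pairs


tokens = ["Mathurai", "MeenAtchi", "Amman", "Koyil"]
spans = [
    Span(0, 4, "RELPLACE"),   # assumed: the whole temple name
    Span(0, 1, "LOCATION"),   # Mathurai
    Span(1, 3, "PERSON"),     # MeenAtchi Amman
]

for outer, inner in nested_pairs(spans):
    print(inner.label, tokens[inner.start:inner.end],
          "inside", outer.label, tokens[outer.start:outer.end])
```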
47Approaches in Named Entity Resolution
- Dictionary Look-up
- Rule based (using lexical, contextual and morphological information)
- Maximum entropy theory based
- Hidden Markov Model
- Conditional Random Fields
- Hybrid methods (statistical and linguistic)
48Dictionary (Gazetteers) Look-up Approach
- Uses dictionaries (gazetteers) for identifying NEs
- Gazetteer contains NEs from all domains
- Advantage
- Very simple approach
- Gives very high precision
49Disadvantages of Dictionary Approach
- Preparation of an exhaustive dictionary is a tedious and expensive process.
- The dictionary should cover the different spellings of the same place.
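The look-up approach can be sketched as a greedy longest-match scan over the token sequence. The gazetteer entries, the `max_len` window and the function name `gazetteer_tag` are illustrative assumptions, not a description of any particular system.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match look-up.

    gazetteer maps tuples of lowercased tokens to an entity type.
    Returns a list of (start, end, type) with end exclusive.
    """
    results = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                results.append((i, i + n, gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return results


# Hypothetical gazetteer built from names that appear on earlier slides.
gazetteer = {
    ("taj", "hotel"): "FACILITY",
    ("kovalam", "beach"): "LOCATION",
    ("chennai",): "LOCATION",
    ("washington",): "LOCATION",
}

print(gazetteer_tag("The Taj Hotel is not in Chennai".split(), gazetteer))
print(gazetteer_tag("Mr. Washington spoke".split(), gazetteer))
```

The second call tags a person's surname as a location, which illustrates why look-up alone cannot handle entities whose type depends on context, and the exact-match keys show why every spelling variant has to be listed.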
50Rule Based Approach
- Rule Based System
- Needs more rules to tag all kinds of NE
- Advantages
- Rich and expressive rules
- Good results
- Disadvantages
- Requires extensive experience and grammatical knowledge
- Experts to craft rules are expensive
- Highly domain specific (not portable to a new domain)
51General difficulties
- "Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc. to become operations director of Arthur Andersen".
- Capitalization is useless for the first word
- "'s" is not part of the name "Italy"
- Date is "last Thursday", not "Thursday"
- Milan is a location, not an organization
- Arthur Andersen is an organization, not a person
52Rules success and failure
- Title Capitalized_Word => Title Person_Name
- Correct: Mr. Jones
- Incorrect: Mrs. Field's Cookies (corporation)
- Month_name number_less_than_32 => Date
- Correct: February 28
- Incorrect: Long March 3 (a Chinese rocket)
- From Date to Date => Date
- Correct: from August 3 to August 9
- Incorrect: I moved my trip from April to June (two separate dates)
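The first two rules above can be written as regular expressions, which makes both their successes and their failures easy to reproduce. This is a minimal sketch; the title list and the exact patterns are assumptions chosen for illustration.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

RULES = [
    # Title Capitalized_Word => Person
    ("PERSON", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+")),
    # Month_name number_less_than_32 => Date
    ("DATE", re.compile(r"\b" + MONTHS + r" (?:3[01]|[12]\d|[1-9])\b")),
]


def apply_rules(text):
    """Return (label, matched_text) for every rule match, in rule order."""
    return [(label, m.group(0))
            for label, pattern in RULES
            for m in pattern.finditer(text)]


print(apply_rules("Mr. Jones arrived on February 28"))
print(apply_rules("Mrs. Field's Cookies opened a new store"))
print(apply_rules("Long March 3 is a Chinese rocket"))
```

The first sentence is tagged correctly. In the second, the rule wrongly extracts a person from a company name, and in the third it wrongly reads part of a rocket's name as a date, matching the failure cases listed above.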
53Statistical based approach
- Need to identify features
- Feature selection has to be correct for all types of NE
- Development of a tagged corpus
- The corpus should contain all types of tags in appropriate numbers
- A domain-based corpus has to be generated.
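As a rough idea of what identifying features means in practice, the sketch below turns one token and its neighbours into a feature dictionary of the kind commonly fed to a statistical sequence tagger. The particular features and their names are assumptions for illustration; for Indian languages the capitalization feature would carry no information, and root or POS features would have to come from separate tools.

```python
def token_features(tokens, i):
    """Return a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_digit": word.isdigit(),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        # Context features; sentence boundaries get placeholder values.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = ["Sita", "works", "in", "Chennai"]
features = [token_features(sentence, i) for i in range(len(sentence))]
print(features[3])
```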
54Automated approaches
- Address drawbacks of hand-coded systems
- Automated training
- Human-annotated training data (with desired output standards)
- Annotation requires less effort and expertise than hand-coding rules
- Annotation accuracy
- Two annotators for checking, third annotator to resolve disputes
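The two-annotators-plus-adjudicator scheme can be sketched as follows: where the first two annotators agree, their label is kept, and where they disagree, the third annotator's label is used. The function name and the token-level label lists are hypothetical, made up only to show the idea.

```python
def adjudicate(ann_a, ann_b, ann_c):
    """Merge two annotations, using a third to settle disagreements.

    Returns (final_labels, disputed_positions).
    """
    if not (len(ann_a) == len(ann_b) == len(ann_c)):
        raise ValueError("all annotations must cover the same tokens")
    final, disputed = [], []
    for i, (a, b, c) in enumerate(zip(ann_a, ann_b, ann_c)):
        if a == b:
            final.append(a)
        else:
            final.append(c)      # third annotator resolves the dispute
            disputed.append(i)
    return final, disputed


# Made-up labels for: Washington visited Tata in May
ann_a = ["PERSON", "O", "ORGANIZATION", "O", "DATE"]
ann_b = ["LOCATION", "O", "ORGANIZATION", "O", "DATE"]
ann_c = ["PERSON", "O", "PERSON", "O", "DATE"]

final, disputed = adjudicate(ann_a, ann_b, ann_c)
agreement = 1 - len(disputed) / len(ann_a)
print(final, disputed, agreement)
```

The share of positions on which the first two annotators agree gives a simple, if crude, indication of annotation accuracy.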
55Literature Survey
- 1) Named Entity Recognition was one of the tasks defined in the Message Understanding Conference (MUC) 6.
- 2) A survey on Named Entity Recognition was done by Nadeau and Sekine (2007).
- 3) Techniques used include
- - rule based technique by Krupka (1998)
- - using maximum entropy by Borthwick (1998)
- - using Hidden Markov Model by Bikel (1997)
- - bootstrapping approach using concept based seeds (Niu et al., 2003)
- - hybrid approaches, such as rule based tagging for certain entities such as date, time and percentage, and a maximum entropy based approach for entities like location and organization (Srihari et al., 2000)
- 4) The Stanford NER software (Finkel et al., 2005) uses linear chain CRFs in its NER engine. It identifies three classes of NEs, viz. Person, Organization and Location.
56References
- Arulmozhi, P. and Sobha, L. (2006). HMM-based Part of Speech Tagger for Relatively Free Word Order Language. Advances in Natural Language Processing, Research in Computing Science Journal, Mexico, Volume 18, pp. 37-48.
- Bikel, D. M., Miller, S., Schwartz, R. and Weischedel, R. (1997). Nymble: A high-performance learning name-finder. In Fifth Conference on Applied Natural Language Processing. pp. 194-201.
- Borthwick, A., Sterling, J., Agichtein, E. and Grishman, R. (1998). Description of the MENE Named Entity System. In Seventh Message Understanding Conference (MUC-7).
- Chen, W., Zhang, Y. and Isahara, H. (2006). Chinese Named Entity Recognition with Conditional Random Fields. In Fifth SIGHAN Workshop on Chinese Language Processing, Sydney. pp. 118-121.
- Ekbal, A. and Bandyopadhyay, S. (2009). A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1). pp. 1-44.
57References
- Finkel, J. N., Grenager, T. and Manning, C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005). pp. 363-370.
- Finkel, J., Dingare, S., Nguyen, H., Nissim, M., Sinclair, G. and Manning, C. (2004). Exploiting Context for Biomedical Entity Recognition: from Syntax to the Web. In Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA), Geneva, Switzerland.
- Gali, K., Surana, H., Vaidya, A., Shishtla, P. and Sharma, D. M. (2008). Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition. In Workshop on NER for South and South East Asian Languages, IJCNLP-08, Hyderabad, India.
- Kumar, K. N., Santosh, G. S. K. and Varma, V. (2011). A Language-Independent Approach to Identify the Named Entities in under-resourced languages and Clustering Multilingual Documents. In International Conference on Multilingual and Multimodal Information Access Evaluation, University of Amsterdam, Netherlands.
- Lafferty, J., McCallum, A. and Pereira, F. (2001). Conditional Random Fields for segmenting and labeling sequence data. In ICML-01, pp. 282-289.
- Loinaz, I. A., Uriarte, O. A., Ramos, N. E. and Castro, M. I. F. D. (2006). Lessons from the Development of Named Entity Recognizer for Basque. Natural Language Processing, 36. pp. 25-37.
- McCallum, A. and Li, W. (2003). Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Seventh Conference on Natural Language Learning (CoNLL).
58References
- Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1). pp. 3-26.
- Niu, C., Li, W., Ding, J. and Srihari, R. K. (2003). Bootstrapping for Named Entity Tagging using Concept-based Seeds. In HLT-NAACL03, Companion Volume, Edmonton, AT. pp. 73-75.
- Pandian, S. Lakshmana, Geetha, T. V. and Krishna. (2007). Named Entity Recognition in Tamil using Context-cues and the E-M algorithm. In Proceedings of the 3rd Indian International Conference on Artificial Intelligence, Pune, India. pp. 1951-1958.
- Sasidhar, B., Yohan, P. M., Babu, V. A. and Govarhan, A. (2011). A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu. International Journal of Computer Science Issues, Volume 8, pp. 1694-0814.
- Sobha, L. and Vijay Sundar Ram, R. (2006). Noun Phrase Chunker for Tamil. In Proceedings of Symposium on Modeling and Shallow Parsing of Indian Languages, Indian Institute of Technology, Mumbai, pp. 194-198.
- Srihari, R. K., Niu, C. and Yu, L. (2000). A Hybrid Approach for Named Entity Recognition in Indian Languages. In 6th Applied Natural Language Conference, pp. 247-254.
- Gupta, S. and Bhattacharyya, P. (2010). Think globally, apply locally: using distributional characteristics for Hindi named entity identification. In 2010 Named Entities Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA.
- Vijayakrishna, R. and Sobha, L. (2008). Domain focused Named Entity for Tamil using Conditional Random Fields. In IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India. pp. 59-66.
59Literature Survey
- Indian Languages
- 5) Named entity recognition for Hindi, Bengali, Oriya, Telugu and Urdu (some of the major Indian languages) was addressed as a shared task in the NERSSEAL workshop of IJCNLP. The tagset used there consisted of 12 tags.
- 6) Vijayakrishna and Sobha (2008) worked on a domain-focused Tamil named entity recognizer for the tourism domain using CRF. It handles nested tagging of named entities with a hierarchical tagset containing 106 tags. They used the root of words, POS, combined word and POS, and a dictionary of named entities as features to build the system.
- 7) Pandian et al. (2007) built a Tamil NER system using contextual cues and the E-M algorithm.
- 8) The NER system of Gali et al. (2008), built for the NERSSEAL-2008 shared task, combines machine learning techniques with language-specific heuristics. The system was tested on five languages (Telugu, Hindi, Bengali, Urdu and Oriya) using CRF followed by post-processing that involves some heuristics.
60Thank you