Title: Combining terminology resources and statistical methods for entity recognition: an evaluation
1. Combining terminology resources and statistical methods for entity recognition: an evaluation
Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo
Presented by George Demetriou
Natural Language Processing Group, University of Sheffield, UK
2. Introduction
- Combining techniques for entity recognition:
  - Dictionary-based term recognition
  - Filtering of ambiguous terms
  - Statistical entity recognition
- How do the techniques compare separately and in combination?
- When combined, can we retain the advantages of both?
3. Semantic annotation of clinical text
[Slide figure: an example of clinical text annotated with the entity types Investigation, Locus and Condition]
- Our basic task is semantic annotation of clinical text
- For the purposes of this paper, we ignore:
  - Modifiers such as negation
  - Relations and coreference
- These are the subject of other papers
4. Entity recognition in specialist domains
- Specialist domains, e.g. medicine, are rich in:
  - Complex terminology
  - Terminology resources and ontologies
- We might expect these resources to be of use in entity recognition
- We might expect annotation using these resources to add value to the text, providing additional information to applications
5. Ambiguity in term resources
- Most term resources have not been designed with NLP applications in mind
- When used for dictionary lookup, many suffer from problems of ambiguity
  - "I": Iodine, an Iodine test, or the personal pronoun
  - "be": bacterial endocarditis, or the root of a verb
- Various techniques can overcome this:
  - Filtering or elimination of problematic terms
  - Use of context; in our case, statistical models
6. Corpus: the CLEF gold standard
- For experiments, we used a manually annotated gold standard
- Careful construction of a schema and guidelines
- Double annotation with a consensus step
- Measurement of Inter-Annotator Agreement (IAA), sketched below
- (Roberts et al. 2008, LREC bio text mining workshop)
- For the experiments reported, we use 77 gold standard documents
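A common way to measure IAA for entity annotation, and a reasonable reading of this slide, is to score one annotator against the other with precision, recall and F1 over exact (start, end, type) matches. The sketch below illustrates that approach under this assumption; the annotation tuples are illustrative, and this is not necessarily the exact metric definition used in the paper.

```python
# Minimal IAA sketch: score annotator B against annotator A using F1
# over exact (start, end, type) matches. An illustration of the usual
# approach, not necessarily the paper's exact metric.

def iaa_f1(anns_a, anns_b):
    a, b = set(anns_a), set(anns_b)
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(b)
    recall = len(a & b) / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ann_a = {(0, 12, "Condition"), (20, 27, "Locus")}          # annotator A
ann_b = {(0, 12, "Condition"), (30, 41, "Investigation")}  # annotator B
print(round(iaa_f1(ann_a, ann_b), 2))  # -> 0.5
```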
7. Entity types
8. Dictionary lookup: Termino
- Termino is loaded from external resources
- FSM matchers are compiled out of Termino (sketched below)
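As an illustration of the compilation step, the sketch below builds a simple trie-based matcher from a term list, which is one way to realise a finite-state matcher; the terms, payloads and class name are illustrative, not Termino's actual structure.

```python
# A toy trie-based matcher standing in for Termino's compiled FSM
# matchers: terms are token sequences, and matching finds the longest
# term starting at a given position. Terms and payloads are illustrative.

class TrieMatcher:
    def __init__(self):
        self.root = {}

    def add(self, term_tokens, payload):
        node = self.root
        for tok in term_tokens:
            node = node.setdefault(tok, {})
        node["$"] = payload  # end-of-term marker

    def match_at(self, tokens, i):
        """Longest match starting at position i, or None."""
        node, best = self.root, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$" in node:
                best = (i, j + 1, node["$"])
        return best

matcher = TrieMatcher()
matcher.add(["myocardial", "infarction"], "Condition")
matcher.add(["infarction"], "Condition")
print(matcher.match_at(["acute", "myocardial", "infarction"], 1))
# -> (1, 3, 'Condition'): the longer term wins
```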
9. Finding entities with Termino
- Termino loaded with selected terms from UMLS (600K terms)
- Pre-processing includes tokenisation and morphological analysis
- Lookup is against the roots of tokens (see the sketch below)
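The sketch below illustrates root-based lookup: both the dictionary entries and the input tokens are reduced to morphological roots before matching, so inflected forms still match. The one-line stemmer and the dictionary entry are crude stand-ins for the pipeline's real morphological analyser and term list.

```python
# Lookup against morphological roots: "ulcers" matches an entry for
# "ulcer". The root() function is a crude stand-in for a real
# morphological analyser; the dictionary entry is illustrative.

DICTIONARY = {("gastric", "ulcer"): "Condition"}

def root(token):
    return token[:-1] if token.endswith("s") else token  # toy stemmer

def lookup(tokens):
    roots = [root(t.lower()) for t in tokens]
    hits = []
    for i in range(len(roots)):
        for j in range(i + 1, len(roots) + 1):
            label = DICTIONARY.get(tuple(roots[i:j]))
            if label:
                hits.append((i, j, label))
    return hits

print(lookup(["Gastric", "ulcers", "were", "seen"]))
# -> [(0, 2, 'Condition')]: the plural surface form still matches
```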
10. Filtering problematic terms
- Many UMLS terms are not suitable for NLP
- Ambiguity with common general language words
- To identify the most problematic of these, we ran Termino over a separate development corpus, and manually inspected the results
- A supplementary list of missing terms was compiled by domain experts (6 terms)
- Creation of these lists took a couple of hours
11. Creating the filter list
- Add all unique terms of 1 character to the list
- For all unique terms of < 6 characters:
  - Add to the list if it matches a common general language word or abbreviation
  - Add to the list if it has a numeric component
  - Reject from the list if it is an obvious technical term
  - Reject from the list if none of the above apply
- Filter list size: 232 terms (heuristics sketched below)
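The automatable parts of these rules can be captured as below; the common-word list and candidate terms are illustrative, and the "obvious technical term" override, a manual judgement in the paper, is not modelled.

```python
# Sketch of the filter-list heuristics above. The common-word list and
# candidates are illustrative; the manual "obvious technical term"
# override from the slide is not modelled here.

COMMON_WORDS = {"be", "i", "all", "lead"}  # illustrative general-language words

def should_filter(term):
    if len(term) == 1:
        return True                          # all 1-character terms
    if len(term) < 6:
        if term.lower() in COMMON_WORDS:
            return True                      # matches a common word
        if any(c.isdigit() for c in term):
            return True                      # has a numeric component
    return False                             # otherwise keep the term

candidates = ["I", "be", "2nd", "aorta", "stenosis"]
print([t for t in candidates if should_filter(t)])
# -> ['I', 'be', '2nd']
```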
12. Entities found by Termino
- UMLS alone gives poor precision, due to term ambiguity with general language words
- Adding in the filter list improves precision with little loss in recall
13. Statistical entity recognition
- Statistical entity recognition allows us to model context
- We use an SVM implementation provided with GATE
- Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE (see the sketch below)
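A standard way to perform such a decomposition is one-vs-rest: one binary SVM per entity type, each separating that type from everything else. The sketch below shows the idea with scikit-learn standing in for GATE's SVM implementation; the features and labels are illustrative toy data, not the paper's feature set.

```python
# One-vs-rest decomposition of multi-class entity tagging into binary
# SVMs, with scikit-learn standing in for GATE's SVM implementation.
# Features and labels are toy data; "O" marks non-entity tokens.

from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = [{"root": "gastric"}, {"root": "ulcer"}, {"root": "seen"},
     {"root": "aorta"}, {"root": "the"}]
y = ["Locus", "Condition", "O", "Locus", "O"]

vec = DictVectorizer()
clf = OneVsRestClassifier(LinearSVC())  # one binary SVM per class
clf.fit(vec.fit_transform(X), y)

print(clf.predict(vec.transform([{"root": "ulcer"}])))
# expected on this toy data: ['Condition']
```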
14. Features for machine learning
- Token kind (e.g. number, word)
- Orthographic type (e.g. lower case, upper case)
- Morphological root
- Affix
- Generalised part of speech
  - The first two characters of the Penn Treebank tag
- Termino-recognised terms (feature extraction sketched below)
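A per-token feature extractor along these lines might look as follows; the helpers are simplified stand-ins for the pipeline's real tokeniser, morphological analyser and tagger, and the affix is read as a suffix here.

```python
# Sketch of per-token feature extraction for the feature set above.
# The root and affix computations are crude stand-ins for real
# morphological analysis; the POS tag is assumed to be a Penn
# Treebank tag supplied by an upstream tagger.

def token_features(token, pos_tag, in_termino_term):
    return {
        "kind": "number" if token.isdigit() else "word",
        "orth": "upper" if token.isupper()
                else "initcap" if token[0].isupper() else "lower",
        "root": token.lower().rstrip("s"),   # toy morphological root
        "affix": token[-3:].lower(),         # affix read as a suffix here
        "pos": pos_tag[:2],                  # first two chars of PTB tag
        "termino": in_termino_term,          # Termino-recognised term?
    }

print(token_features("Ulcers", "NNS", True))
# -> {'kind': 'word', 'orth': 'initcap', 'root': 'ulcer',
#     'affix': 'ers', 'pos': 'NN', 'termino': True}
```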
15. Finding entities: ML
[Slide diagram: a GATE training pipeline applies linguistic processing to human-annotated gold standard texts to train the model; a GATE application pipeline applies the same linguistic processing to new texts to produce annotated texts]
16. Finding entities: ML + Termino
[Slide diagram: the same training and application pipelines, with Termino lookup added to the linguistic processing in both; a combined pipeline sketch follows]
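A minimal end-to-end rendering of the combined pipeline, with a toy dictionary standing in for Termino and scikit-learn for GATE's SVM: dictionary matches become token features at both training and application time. Everything here is illustrative.

```python
# Toy end-to-end version of the combined pipeline: dictionary matches
# are added as features, a model is trained on gold labels, and the
# same processing is applied to unseen text. Dictionary, texts and
# labels are all illustrative.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

DICT = {"ulcer": "Condition", "aorta": "Locus"}  # Termino stand-in

def process(tokens):
    # "Linguistic processing" plus dictionary lookup, per token.
    return [{"root": t.lower(), "dict": DICT.get(t.lower(), "none")}
            for t in tokens]

train_tokens = ["ulcer", "seen", "near", "aorta"]
train_labels = ["Condition", "O", "O", "Locus"]

vec = DictVectorizer()
model = LinearSVC().fit(vec.fit_transform(process(train_tokens)), train_labels)

print(model.predict(vec.transform(process(["aorta", "ulcer"]))))
# expected on this toy data: ['Locus' 'Condition']
```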
17. Entities found by SVM
- Statistical entity recognition alone gives a higher P than dictionary lookup, but a lower R
- The combined system gains from the higher R of dictionary lookup, with no loss in P
18. Linkage to external resources
- Semantic annotation allows us to link texts to existing domain resources
- Giving more intelligent indexing, and making additional information available to applications
19. Linkage to external resources
- UMLS links terms to Concept Unique Identifiers (CUIs)
- Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI
- If the SVM finds an entity when Termino has found nothing, the entity cannot be linked to a CUI (see the sketch below)
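The linkage rule reduces to a simple lookup, sketched below with an illustrative term-to-CUI table in place of Termino's actual contents.

```python
# Sketch of CUI linkage: an entity backed by a Termino match inherits
# that term's CUIs; an entity found by the SVM alone stays unlinked.
# The term-to-CUI table is an illustrative stand-in.

TERM_CUIS = {("myocardial", "infarction"): ["C0027051"]}  # illustrative CUI

def link_entity(termino_term):
    """Return CUIs for the entity's underlying Termino term, or None
    if the entity was found by the SVM alone."""
    if termino_term is None:
        return None
    return TERM_CUIS.get(tuple(termino_term))

print(link_entity(("myocardial", "infarction")))  # -> ['C0027051']
print(link_entity(None))                          # -> None (SVM-only entity)
```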
20. CUIs assigned
- At least one CUI can be automatically assigned to 83% of the terms in the gold standard
- Some are ambiguous, and resolution is needed
21. Availability
- Most of the software is open source, and can be downloaded as part of GATE
- We are currently packaging Termino for public release
- We are currently preparing a UK research ethics committee application for release of the annotated gold standard
22. Conclusions
- Dictionary lookup gives good recall but poor precision, due to term ambiguity
- Much ambiguity is due to a few terms, which can be filtered out with little loss in recall
- Combining dictionary lookup with statistical models of context improves precision
- A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system
23. Questions?
http://www.clinical-escience.org
http://www.clef-user.com