Title: Combining terminology resources and statistical methods for entity recognition: an evaluation
1. Combining terminology resources and statistical methods for entity recognition: an evaluation
Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo
Presented by George Demetriou
Natural Language Processing Group, University of Sheffield, UK
2. Introduction
- Combining techniques for entity recognition:
  - Dictionary-based term recognition
  - Filtering of ambiguous terms
  - Statistical entity recognition
- How do the techniques compare separately and in combination?
- When combined, can we retain the advantages of both?
3. Semantic annotation of clinical text
[Slide figure: an example of clinical text annotated with the entity types Investigation, Locus and Condition]
- Our basic task is semantic annotation of clinical text
- For the purposes of this paper, we ignore:
  - Modifiers such as negation
  - Relations and coreference
- These are the subject of other papers
4. Entity recognition in specialist domains
- Specialist domains, e.g. medicine, are rich in:
  - Complex terminology
  - Terminology resources and ontologies
- We might expect these resources to be of use in entity recognition
- We might expect annotation using these resources to add value to the text, providing additional information to applications
5. Ambiguity in term resources
- Most term resources have not been designed with NLP applications in mind
- When used for dictionary lookup, many suffer from problems of ambiguity
  - "I": Iodine, an Iodine test, or the personal pronoun
  - "be": bacterial endocarditis, or the root of a verb
- Various techniques can overcome this:
  - Filtering or elimination of problematic terms
  - Use of context; in our case, statistical models
6. Corpus: the CLEF gold standard
- For experiments, we used a manually annotated gold standard
- Careful construction of a schema and guidelines
- Double annotation with a consensus step
- Measurement of Inter-Annotator Agreement (IAA), sketched below
- (Roberts et al. 2008, LREC bio text mining workshop)
- For the experiments reported, we use 77 gold standard documents
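A common way to measure IAA for entity annotation, and a reasonable reading of this slide, is to score one annotator against the other with precision, recall and F1 over exact (start, end, type) matches. The sketch below illustrates that approach under this assumption; the annotation tuples are illustrative, and this is not necessarily the exact metric definition used in the paper.

```python
# Minimal IAA sketch: score annotator B against annotator A using F1
# over exact (start, end, type) matches. An illustration of the usual
# approach, not necessarily the paper's exact metric.

def iaa_f1(anns_a, anns_b):
    a, b = set(anns_a), set(anns_b)
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(b)
    recall = len(a & b) / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ann_a = {(0, 12, "Condition"), (20, 27, "Locus")}          # annotator A
ann_b = {(0, 12, "Condition"), (30, 41, "Investigation")}  # annotator B
print(round(iaa_f1(ann_a, ann_b), 2))  # -> 0.5
```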
7. Entity types
8. Dictionary lookup: Termino
- Termino is loaded from external resources
- FSM matchers are compiled out of Termino (sketched below)
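As an illustration of the compilation step, the sketch below builds a simple trie-based matcher from a term list, which is one way to realise a finite-state matcher; the terms, payloads and class name are illustrative, not Termino's actual structure.

```python
# A toy trie-based matcher standing in for Termino's compiled FSM
# matchers: terms are token sequences, and matching finds the longest
# term starting at a given position. Terms and payloads are illustrative.

class TrieMatcher:
    def __init__(self):
        self.root = {}

    def add(self, term_tokens, payload):
        node = self.root
        for tok in term_tokens:
            node = node.setdefault(tok, {})
        node["$"] = payload  # end-of-term marker

    def match_at(self, tokens, i):
        """Longest match starting at position i, or None."""
        node, best = self.root, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$" in node:
                best = (i, j + 1, node["$"])
        return best

matcher = TrieMatcher()
matcher.add(["myocardial", "infarction"], "Condition")
matcher.add(["infarction"], "Condition")
print(matcher.match_at(["acute", "myocardial", "infarction"], 1))
# -> (1, 3, 'Condition'): the longer term wins
```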
9. Finding entities with Termino
- Termino loaded with selected terms from UMLS (600K terms)
- Pre-processing includes tokenisation and morphological analysis
- Lookup is against the roots of tokens (see the sketch below)
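The sketch below illustrates root-based lookup: both the dictionary entries and the input tokens are reduced to morphological roots before matching, so inflected forms still match. The one-line stemmer and the dictionary entry are crude stand-ins for the pipeline's real morphological analyser and term list.

```python
# Lookup against morphological roots: "ulcers" matches an entry for
# "ulcer". The root() function is a crude stand-in for a real
# morphological analyser; the dictionary entry is illustrative.

DICTIONARY = {("gastric", "ulcer"): "Condition"}

def root(token):
    return token[:-1] if token.endswith("s") else token  # toy stemmer

def lookup(tokens):
    roots = [root(t.lower()) for t in tokens]
    hits = []
    for i in range(len(roots)):
        for j in range(i + 1, len(roots) + 1):
            label = DICTIONARY.get(tuple(roots[i:j]))
            if label:
                hits.append((i, j, label))
    return hits

print(lookup(["Gastric", "ulcers", "were", "seen"]))
# -> [(0, 2, 'Condition')]: the plural surface form still matches
```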
10. Filtering problematic terms
- Many UMLS terms are not suitable for NLP
- Ambiguity with common general language words
- To identify the most problematic of these, we ran Termino over a separate development corpus, and manually inspected the results
- A supplementary list of missing terms was compiled by domain experts (6 terms)
- Creation of these lists took a couple of hours
11. Creating the filter list
- Add all unique terms of 1 character to the list
- For all unique terms of < 6 characters:
  - Add to the list if it matches a common general language word or abbreviation
  - Add to the list if it has a numeric component
  - Reject from the list if it is an obvious technical term
  - Reject from the list if none of the above apply
- Filter list size: 232 terms (heuristics sketched below)
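The automatable parts of these rules can be captured as below; the common-word list and candidate terms are illustrative, and the "obvious technical term" override, a manual judgement in the paper, is not modelled.

```python
# Sketch of the filter-list heuristics above. The common-word list and
# candidates are illustrative; the manual "obvious technical term"
# override from the slide is not modelled here.

COMMON_WORDS = {"be", "i", "all", "lead"}  # illustrative general-language words

def should_filter(term):
    if len(term) == 1:
        return True                          # all 1-character terms
    if len(term) < 6:
        if term.lower() in COMMON_WORDS:
            return True                      # matches a common word
        if any(c.isdigit() for c in term):
            return True                      # has a numeric component
    return False                             # otherwise keep the term

candidates = ["I", "be", "2nd", "aorta", "stenosis"]
print([t for t in candidates if should_filter(t)])
# -> ['I', 'be', '2nd']
```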
12. Entities found by Termino
- UMLS alone gives poor precision, due to term ambiguity with general language words
- Adding in the filter list improves precision with little loss in recall
13. Statistical entity recognition
- Statistical entity recognition allows us to model context
- We use an SVM implementation provided with GATE
- Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE (see the sketch below)
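A standard way to perform such a decomposition is one-vs-rest: one binary SVM per entity type, each separating that type from everything else. The sketch below shows the idea with scikit-learn standing in for GATE's SVM implementation; the features and labels are illustrative toy data, not the paper's feature set.

```python
# One-vs-rest decomposition of multi-class entity tagging into binary
# SVMs, with scikit-learn standing in for GATE's SVM implementation.
# Features and labels are toy data; "O" marks non-entity tokens.

from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X = [{"root": "gastric"}, {"root": "ulcer"}, {"root": "seen"},
     {"root": "aorta"}, {"root": "the"}]
y = ["Locus", "Condition", "O", "Locus", "O"]

vec = DictVectorizer()
clf = OneVsRestClassifier(LinearSVC())  # one binary SVM per class
clf.fit(vec.fit_transform(X), y)

print(clf.predict(vec.transform([{"root": "ulcer"}])))
# expected on this toy data: ['Condition']
```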
14. Features for machine learning
- Token kind (e.g. number, word)
- Orthographic type (e.g. lower case, upper case)
- Morphological root
- Affix
- Generalised part of speech
  - The first two characters of the Penn Treebank tag
- Termino-recognised terms (feature extraction sketched below)
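A per-token feature extractor along these lines might look as follows; the helpers are simplified stand-ins for the pipeline's real tokeniser, morphological analyser and tagger, and the affix is read as a suffix here.

```python
# Sketch of per-token feature extraction for the feature set above.
# The root and affix computations are crude stand-ins for real
# morphological analysis; the POS tag is assumed to be a Penn
# Treebank tag supplied by an upstream tagger.

def token_features(token, pos_tag, in_termino_term):
    return {
        "kind": "number" if token.isdigit() else "word",
        "orth": "upper" if token.isupper()
                else "initcap" if token[0].isupper() else "lower",
        "root": token.lower().rstrip("s"),   # toy morphological root
        "affix": token[-3:].lower(),         # affix read as a suffix here
        "pos": pos_tag[:2],                  # first two chars of PTB tag
        "termino": in_termino_term,          # Termino-recognised term?
    }

print(token_features("Ulcers", "NNS", True))
# -> {'kind': 'word', 'orth': 'initcap', 'root': 'ulcer',
#     'affix': 'ers', 'pos': 'NN', 'termino': True}
```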
15. Finding entities: ML
[Slide diagram: a GATE training pipeline applies linguistic processing to human-annotated gold standard texts to train the model; a GATE application pipeline applies the same linguistic processing to new texts to produce annotated texts]
16. Finding entities: ML + Termino
[Slide diagram: the same training and application pipelines, with Termino lookup added to the linguistic processing in both; a combined pipeline sketch follows]
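A minimal end-to-end rendering of the combined pipeline, with a toy dictionary standing in for Termino and scikit-learn for GATE's SVM: dictionary matches become token features at both training and application time. Everything here is illustrative.

```python
# Toy end-to-end version of the combined pipeline: dictionary matches
# are added as features, a model is trained on gold labels, and the
# same processing is applied to unseen text. Dictionary, texts and
# labels are all illustrative.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

DICT = {"ulcer": "Condition", "aorta": "Locus"}  # Termino stand-in

def process(tokens):
    # "Linguistic processing" plus dictionary lookup, per token.
    return [{"root": t.lower(), "dict": DICT.get(t.lower(), "none")}
            for t in tokens]

train_tokens = ["ulcer", "seen", "near", "aorta"]
train_labels = ["Condition", "O", "O", "Locus"]

vec = DictVectorizer()
model = LinearSVC().fit(vec.fit_transform(process(train_tokens)), train_labels)

print(model.predict(vec.transform(process(["aorta", "ulcer"]))))
# expected on this toy data: ['Locus' 'Condition']
```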
17. Entities found by SVM
- Statistical entity recognition alone gives a higher P than dictionary lookup, but a lower R
- The combined system gains from the higher R of dictionary lookup, with no loss in P
18. Linkage to external resources
- Semantic annotation allows us to link texts to existing domain resources
- Giving more intelligent indexing, and making additional information available to applications
19. Linkage to external resources
- UMLS links terms to Concept Unique Identifiers (CUIs)
- Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI
- If the SVM finds an entity when Termino has found nothing, the entity cannot be linked to a CUI (see the sketch below)
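The linkage rule reduces to a simple lookup, sketched below with an illustrative term-to-CUI table in place of Termino's actual contents.

```python
# Sketch of CUI linkage: an entity backed by a Termino match inherits
# that term's CUIs; an entity found by the SVM alone stays unlinked.
# The term-to-CUI table is an illustrative stand-in.

TERM_CUIS = {("myocardial", "infarction"): ["C0027051"]}  # illustrative CUI

def link_entity(termino_term):
    """Return CUIs for the entity's underlying Termino term, or None
    if the entity was found by the SVM alone."""
    if termino_term is None:
        return None
    return TERM_CUIS.get(tuple(termino_term))

print(link_entity(("myocardial", "infarction")))  # -> ['C0027051']
print(link_entity(None))                          # -> None (SVM-only entity)
```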
20. CUIs assigned
- At least one CUI can be automatically assigned to 83% of the terms in the gold standard
- Some are ambiguous, and resolution is needed
21. Availability
- Most of the software is open source, and can be downloaded as part of GATE
- We are currently packaging Termino for public release
- We are currently preparing a UK research ethics committee application for release of the annotated gold standard
22. Conclusions
- Dictionary lookup gives good recall but poor precision, due to term ambiguity
- Much ambiguity is due to a few terms, which can be filtered out with little loss in recall
- Combining dictionary lookup with statistical models of context improves precision
- A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system
23. Questions?
http://www.clinical-escience.org
http://www.clef-user.com