Combining terminology resources and statistical methods for entity recognition: an evaluation


1
Combining terminology resources and statistical
methods for entity recognition: an evaluation
Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo
Presented by George Demetriou
Natural Language Processing Group, University of Sheffield, UK
2
Introduction
  • Combining techniques for entity recognition:
    • Dictionary-based term recognition
    • Filtering of ambiguous terms
    • Statistical entity recognition
  • How do the techniques compare separately and in
    combination?
  • When combined, can we retain the advantages of
    both?

3
Semantic annotation of clinical text
(Figure: an example clinical sentence annotated with Investigation, Locus and Condition entities)
  • Our basic task is semantic annotation of clinical
    text
  • For the purposes of this paper, we ignore:
    • Modifiers such as negation
    • Relations and coreference
  • These are the subject of other papers

4
Entity recognition in specialist domains
  • Specialist domains, e.g. medicine, are rich in:
    • Complex terminology
    • Terminology resources and ontologies
  • We might expect these resources to be of use in
    entity recognition
  • We might expect annotation using these resources
    to add value to the text, providing additional
    information to applications

5
Ambiguity in term resources
  • Most term resources have not been designed with
    NLP applications in mind
  • When used for dictionary lookup, many suffer from
    problems of ambiguity
    • "I": Iodine, an Iodine test, or the personal pronoun
    • "be": bacterial endocarditis, or the root of a verb
  • Various techniques can overcome this:
    • Filtering or elimination of problematic terms
    • Use of context (in our case, statistical models)

6
Corpus: the CLEF gold standard
  • For experiments, we used a manually annotated
    gold standard
  • Careful construction of a schema and guidelines
  • Double annotation with a consensus step
  • Measurement of Inter-Annotator Agreement (IAA)
    (Roberts et al. 2008, LREC bio text mining workshop)
  • For the experiments reported, we use 77 gold
    standard documents

7
Entity types
8
Dictionary lookup: Termino
  • Termino is loaded from external resources
  • FSM matchers are compiled out of Termino
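As a rough illustration of what compiling terms into finite-state matchers involves, the sketch below builds a token-level trie from a term list and scans token sequences with it. The class, term entries and labels are hypothetical; this is not Termino's actual implementation.

# Sketch: compile a term list into a token-level trie (a simple finite-state
# matcher) and scan a token sequence for longest matches. Illustrative only.

class TrieMatcher:
    def __init__(self, terms):
        # terms: iterable of (token_list, label) pairs
        self.root = {}
        for tokens, label in terms:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node["__label__"] = label  # marks the end of a term

    def scan(self, tokens):
        """Yield (start, end, label) for the longest match at each start position."""
        for start in range(len(tokens)):
            node, best = self.root, None
            for end in range(start, len(tokens)):
                node = node.get(tokens[end])
                if node is None:
                    break
                if "__label__" in node:
                    best = (start, end + 1, node["__label__"])
            if best:
                yield best

matcher = TrieMatcher([
    (["myocardial", "infarction"], "Condition"),
    (["left", "ventricle"], "Locus"),
])
print(list(matcher.scan("old myocardial infarction of the left ventricle".split())))
# [(1, 3, 'Condition'), (5, 7, 'Locus')]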

9
Finding entities with Termino
  • Termino loaded with selected terms from UMLS
    (600K terms)
  • Pre-processing includes tokenisation and
    morphological analysis
  • Lookup is against the roots of tokens
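A minimal sketch of lookup keyed on token roots, so that surface variants match their dictionary forms. The root() function here is a toy suffix-stripper standing in for the pipeline's real morphological analyser; terms and labels are illustrative.

# Sketch: dictionary lookup against morphological roots, so "ulcers" matches
# the dictionary entry for "ulcer". root() is a toy stand-in for the real
# morphological analysis step.

def root(token):
    token = token.lower()
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    for suffix in ("es", "s", "ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def lookup(tokens, term_dict):
    """Match single-token terms by root; returns (index, surface form, label) tuples."""
    hits = []
    for i, tok in enumerate(tokens):
        label = term_dict.get(root(tok))
        if label:
            hits.append((i, tok, label))
    return hits

term_dict = {root(t): lab for t, lab in [("ulcer", "Condition"), ("biopsy", "Investigation")]}
print(lookup("two ulcers were seen and biopsies taken".split(), term_dict))
# [(1, 'ulcers', 'Condition'), (5, 'biopsies', 'Investigation')]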

10
Filtering problematic terms
  • Many UMLS terms are not suitable for NLP
  • Ambiguity with common general language words
  • To identify the most problematic of these, we ran
    Termino over a separate development corpus, and
    manually inspected the results
  • A supplementary list of missing terms was
    compiled by domain experts (6 terms)
  • Creation of these lists took a couple of hours

11
Creating the filter list
  • Add all unique terms of 1 character to the list
  • For all unique terms of < 6 characters:
    • Add to the list if it matches a common general
      language word or abbreviation
    • Add to the list if it has a numeric component
    • Reject from the list if it is an obvious
      technical term
    • Reject from the list if none of the above apply
  • Filter list size: 232 terms
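Read as a procedure, the steps above amount to a small heuristic filter. A sketch, assuming the common-word list and the "obvious technical term" judgement (both manual in the original work) are supplied as plain sets:

# Sketch of the filter-list heuristic described above. The common-word set
# and the technical-term set stand in for manual judgements.

import re

def build_filter_list(terms, common_words, technical_terms):
    filter_list = set()
    for term in set(terms):
        if len(term) == 1:                       # all 1-character terms
            filter_list.add(term)
        elif len(term) < 6:
            if term in technical_terms:          # reject obvious technical terms
                continue
            if term.lower() in common_words:     # common word or abbreviation
                filter_list.add(term)
            elif re.search(r"\d", term):         # has a numeric component
                filter_list.add(term)
    return filter_list

common = {"i", "be", "all", "man", "on"}
technical = {"DNA"}
print(sorted(build_filter_list(["I", "be", "on", "DNA", "T4", "aspirin"], common, technical)))
# ['I', 'T4', 'be', 'on']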

12
Entities found by Termino
  • UMLS alone gives poor precision, due to term
    ambiguity with general language words
  • Adding in the filter list improves precision with
    little loss in recall
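Precision and recall in these results are scored against the gold-standard entities. A minimal sketch of that scoring, assuming exact span-and-type matching (the reported results may use a different matching criterion):

# Sketch: precision/recall/F1 of system entity spans against gold-standard
# spans, using exact (start, end, type) matching.

def score(system, gold):
    sys_set, gold_set = set(system), set(gold)
    matched = len(sys_set & gold_set)
    precision = matched / len(sys_set) if sys_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 9, "Condition"), (15, 22, "Locus"), (30, 41, "Investigation")]
system = [(0, 9, "Condition"), (15, 22, "Locus"), (50, 55, "Drug")]
print(score(system, gold))  # (0.666..., 0.666..., 0.666...)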

13
Statistical entity recognition
  • Statistical entity recognition allows us to model
    context
  • We use an SVM implementation provided with GATE
  • Mapping of our multi-class entity recognition
    task to binary SVM classifiers is handled by GATE
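GATE's SVM wrapper performs the multi-class-to-binary decomposition internally. The sketch below shows the same one-vs-rest idea with scikit-learn on toy per-token feature dictionaries; the features and data are illustrative, and this is not the GATE code.

# Sketch: multi-class token classification decomposed into binary SVMs
# (one-vs-rest), analogous to what the GATE SVM wrapper does internally.

from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# One feature dict per token; "O" marks tokens outside any entity.
train_feats = [
    {"root": "biopsy", "pos": "NN", "termino": "Investigation"},
    {"root": "of", "pos": "IN", "termino": ""},
    {"root": "stomach", "pos": "NN", "termino": "Locus"},
    {"root": "ulcer", "pos": "NN", "termino": "Condition"},
]
train_labels = ["Investigation", "O", "Locus", "Condition"]

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = OneVsRestClassifier(LinearSVC())   # one binary SVM per entity class
clf.fit(X, train_labels)

test = vec.transform([{"root": "ulcer", "pos": "NN", "termino": "Condition"}])
print(clf.predict(test))                 # e.g. ['Condition']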

14
Features for machine learning
  • Token kind (e.g. number, word)
  • Orthographic type (e.g. lower case, upper case)
  • Morphological root
  • Affix
  • Generalised part of speech
    • The first two characters of the Penn Treebank tag
  • Termino recognised terms
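A sketch of how such per-token features might be assembled into a dictionary for the learner. Helper names are hypothetical; the root and POS tag would come from the earlier linguistic processing, and the Termino label from dictionary lookup.

# Sketch: building a per-token feature dictionary of the kind listed above.

def orth_type(token):
    if token.isupper():
        return "allCaps"
    if token[:1].isupper():
        return "upperInitial"
    return "lowercase"

def token_features(token, root, penn_tag, termino_label):
    return {
        "kind": "number" if token.isdigit() else "word",   # token kind
        "orth": orth_type(token),                          # orthographic type
        "root": root,                                      # morphological root
        "suffix3": token[-3:].lower(),                      # simple affix feature
        "pos": penn_tag[:2],                                # generalised POS (first two chars)
        "termino": termino_label or "",                     # Termino-recognised term type
    }

print(token_features("Ulcers", "ulcer", "NNS", "Condition"))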

15
Finding entities: ML
(Diagram: a GATE training pipeline runs linguistic processing over the human-annotated gold standard texts to train the SVM; a GATE application pipeline runs the same linguistic processing over new texts to produce annotated texts)
16
Finding entities: ML + Termino
(Diagram: the same training and application pipelines as the previous slide, with Termino lookup added to the linguistic processing step)
17
Entities found by SVM
  • Statistical entity recognition alone gives higher
    precision (P) than dictionary lookup, but lower
    recall (R)
  • The combined system gains from the higher R of
    dictionary lookup, with no loss in P

18
Linkage to external resources
  • Semantic annotation allows us to link texts to
    existing domain resources
  • Giving more intelligent indexing and making
    additional information available to applications

19
Linkage to external resources
  • UMLS links terms to Concept Unique Identifiers
    (CUIs)
  • Where a recognised entity is associated with an
    underlying Termino term, we can likewise
    automatically link the entity to a CUI
  • If the SVM finds an entity when Termino has found
    nothing, the entity cannot be linked to a CUI
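A minimal sketch of that linkage step, assuming a term-to-CUI table derived from UMLS; the table entries and entity records here are illustrative.

# Sketch: linking recognised entities to UMLS CUIs via the Termino term that
# matched them. Entities found only by the SVM carry no Termino term and so
# receive no CUI. Table contents are illustrative examples.

term_to_cuis = {
    "myocardial infarction": ["C0027051"],
    "aspirin": ["C0004057"],
}

def link_to_cuis(entities):
    """entities: list of dicts with 'text' and an optional 'termino_term'."""
    for ent in entities:
        term = ent.get("termino_term")
        ent["cuis"] = term_to_cuis.get(term, []) if term else []
    return entities

entities = [
    {"text": "MI", "termino_term": "myocardial infarction"},   # dictionary-backed
    {"text": "chest pain", "termino_term": None},               # SVM-only, no CUI
]
print(link_to_cuis(entities))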

20
CUIs assigned
  • At least one CUI can be automatically assigned to
    83% of the terms in the gold standard
  • Some are ambiguous, and resolution is needed

21
Availability
  • Most of the software is open source and can be
    downloaded as part of GATE
  • We are currently packaging Termino for public
    release
  • We are currently preparing a UK research ethics
    committee application for release of the
    annotated gold standard

22
Conclusions
  • Dictionary lookup gives good recall but poor
    precision, due to term ambiguity
  • Much ambiguity is due to a small number of terms,
    which can be filtered out with little loss in recall
  • Combining dictionary lookup with statistical
    models of context improves precision
  • A benefit of dictionary lookup, linkage to
    external resources, can be retained in the
    combined system

23
Questions?
http://www.clinical-escience.org
http://www.clef-user.com