Natural Language Processing and the Semi-Automated Construction Approach

1
Natural Language Processing and the
Semi-Automated Construction Approach
  • 2007 AmphibAnat Meeting
  • Wed, Nov. 14, 11:30-12:00
  • Jennifer Leopold

2
Outline
  • Text mining
  • Manual curation
  • Document parsing
  • Potential problems
  • Methodology
  • Evaluation
  • Semi-automated construction
  • Enriching manually created ontology
  • Extrinsic and intrinsic benchmarking

3
Text Mining
  • Experts can design ontology (classes, hierarchy,
    etc.)
  • But need to systematically go through literature
    to identify instances and their properties
  • Particularly important to accommodate diversity

4
Text Mining
  • Goals
  • Discover new instances and properties
  • Increase strength of existing annotations by
    locating additional paper evidence

5
FlyBase Curation
  • Watch list of 35 journals
  • Each curator inspects latest issue of a journal
    to identify papers to curate
  • So curation takes place on paper-by-paper basis
    (as opposed to topic-by-topic)

6
FlyBase Curation
  • Curator fills out record for each paper
  • Some fields require rephrasing, paraphrasing,
    summarization
  • Other fields record very specific facts using
    terms from ontologies

7
FlyBase Curation
  • Software like PaperBrowser presents enhanced
    display of text with recognized terms highlighted
    (e.g., Named Entity Recognition)
  • Parser identifies boundaries of the NP around
    each term name and its grammatical relations to
    other NPs in the text

8
Document Parsing
  • PDF is only standard electronic format in which
    all relevant papers are available
  • PDF-to-text processors not aware of the
    typesetting of each journal, have trouble with
    some formatting (e.g., 2-column text, footnotes,
    headers, figure captions, etc.)
  • Document parsing best done with optical character
    recognition (OCR)
  • For images, can parse their captions

9
Potential Problems for Text Mining
  • Lexical ambiguity (e.g., words that denote > 1
    concept)
  • Polysemy (e.g., term present in 2 papers denotes
    different concepts)
  • Abbreviation (e.g., same concept, but different
    abbreviations in different papers)

10
Potential Problems for Text Mining
  • Digit removal (e.g., 4-hydroxybutan vs.
    2-hydroxybutan)
  • Stemming (e.g., removing prefixes, suffixes,
    etc.)
  • Stop word removal (e.g., the, a)

Need a domain-specific text miner!!!
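
To illustrate why the generic preprocessing steps above are risky here, the following is a minimal sketch of a naive normalizer. The stop word list, the suffix rules, and the function names are assumptions made for illustration and are not taken from the presentation.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and"}


def strip_suffix(token):
    """Crude stemming: drop a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def naive_normalize(text):
    """Generic pipeline: lowercase, remove digits, drop stop words, stem."""
    text = re.sub(r"\d+", "", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]


a = naive_normalize("4-hydroxybutan")
b = naive_normalize("2-hydroxybutan")
print(a == b)  # True: digit removal makes the two names indistinguishable
```

A domain-specific miner would need to keep such digits, and to be just as careful about which affixes and short words it discards.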
11
Methodology
  • Extract textual elements from papers identifying
    a term in the ontology
  • Construct patterns with reliability scores
    (confidence that pattern represents term)
  • Extend pattern set with longer pattern sets
  • Apply semantic pattern matching techniques (i.e.,
    consider synonyms)
  • Annotate terms based on quality of matched
    pattern to concept occurring in the text

12
Training Phase
  • Objective: construct a set of patterns that
    characterize indicators for annotation
  • Find terms in the training set papers
  • Extract significant terms/phrases that appear in
    the papers
  • Construct patterns based on significant
    terms/phrases and terms surrounding significant
    terms

13
Annotation Phase
  • Look for possible matches to the patterns in the
    papers
  • Compute a matching score which indicates the
    strength of the prediction
  • Determine the term to be associated with the
    pattern match
  • Order new annotation predictions by their scores,
    and present to user
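
The annotation steps above could be sketched roughly as follows. The presentation does not give the matching formula, so the score used here (pattern reliability times the fraction of the pattern's context words found around the match) is an assumption, as are the dictionary layout, the reliability values, and the example sentence.

```python
def match_score(pattern, tokens):
    """Score in [0, 1]; 0.0 if the pattern's middle does not occur in tokens.

    Assumed scoring rule: reliability * share of LEFT/RIGHT words present.
    """
    middle = pattern["middle"]
    n = len(middle)
    expected = pattern["left"] | pattern["right"]
    best = 0.0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != middle:
            continue
        context = set(tokens[:i]) | set(tokens[i + n:])
        overlap = len(context & expected) / len(expected) if expected else 1.0
        best = max(best, pattern["reliability"] * overlap)
    return best


def rank_predictions(patterns, tokens):
    """Return (score, term) pairs with a non-zero score, best first."""
    scored = [(match_score(p, tokens), p["term"]) for p in patterns]
    return sorted((s for s in scored if s[0] > 0), reverse=True)


# Hypothetical patterns and reliability scores.
patterns = [
    {"term": "sphenethmoid", "left": {"frontoparietal"},
     "middle": ["invests"], "right": {"sphenethmoid"}, "reliability": 0.9},
    {"term": "planum antorbitale", "left": {"anterior", "ramus", "pterygoid"},
     "middle": ["invests"], "right": {"planum", "antorbitale"},
     "reliability": 0.8},
]

tokens = "frontoparietal invests sphenethmoid dorsally".split()
ranked = rank_predictions(patterns, tokens)
print(ranked)  # [(0.9, 'sphenethmoid')]
```

The ranked list is what would be presented to the user, who accepts or rejects each predicted annotation.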

14
Pattern Construction
  • Structured as LEFT <MIDDLE> RIGHT
  • <MIDDLE> is an ordered sequence of significant
    terms (i.e., identifying elements)
  • LEFT and RIGHT are sets of words that appear
    around significant terms (i.e., auxiliary
    descriptors)
  • number of words in LEFT and RIGHT can be
    limited
  • stop words not included in patterns

15
Pattern Construction
  • Example pattern template:
    LEFT <invests> RIGHT
  • pattern1: frontoparietal <invests> sphenethmoid
  • pattern2: anterior ramus pterygoid <invests>
    planum antorbitale
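
A minimal sketch of how such patterns might be built from a tokenized sentence is shown below. The `Pattern` class, the `window` parameter, the stop word list, and the example sentence are assumptions for illustration. LEFT and RIGHT are stored as sets, the middle as an ordered tuple, and stop words are left out, as described on the slides.

```python
from dataclasses import dataclass

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and"}


@dataclass(frozen=True)
class Pattern:
    left: frozenset    # unordered words before the significant terms
    middle: tuple      # ordered sequence of significant terms
    right: frozenset   # unordered words after the significant terms


def build_patterns(tokens, significant, window=3):
    """Build one pattern per occurrence of the significant term sequence.

    `window` limits how many non-stop words are kept on each side.
    """
    significant = tuple(significant)
    n = len(significant)
    patterns = []
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) != significant:
            continue
        left = [t for t in tokens[:i] if t not in STOP_WORDS][-window:]
        right = [t for t in tokens[i + n:] if t not in STOP_WORDS][:window]
        patterns.append(Pattern(frozenset(left), significant, frozenset(right)))
    return patterns


# Made-up sentence that would yield pattern2 from the slide.
sentence = "the anterior ramus of the pterygoid invests the planum antorbitale"
p = build_patterns(sentence.split(), ["invests"])[0]
print(sorted(p.left), p.middle, sorted(p.right))
```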

16
Pattern Scoring
  • Calculate score representing how confidently a
    pattern represents a term
  • MT: source of <middle>
  • Patterns whose <middle> exactly matches an ontology
    term get a higher score

17
Pattern Scoring
  • Calculate score representing how confidently a
    pattern represents a term
  • TT: type of individual terms in the <middle>
  • Considers occurrence frequency of a word in
    <middle> among all ontology terms, and position
    of word in an ontology term (gets more specific
    from right to left)

18
Pattern Scoring
  • Calculate score representing how confidently a
    pattern represents a term
  • PP: term-wise paper frequency of <middle>
  • Patterns with a <middle> that is highly frequent in
    the paper dataset get higher scores
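
The slides name three scoring factors (MT, TT, PP) but give no formulas, so the sketch below is only one possible reading. The 1.0/0.5 values for MT, the inverse term frequency used for TT (the word-position component is omitted), the substring test for PP, and the plain average that combines them are all assumptions, and the toy ontology and papers are made up.

```python
def mt_score(middle, ontology_terms):
    """MT: higher when the middle exactly matches an ontology term."""
    return 1.0 if " ".join(middle) in ontology_terms else 0.5


def tt_score(middle, ontology_terms):
    """TT: words that occur in fewer ontology terms count as more specific."""
    weights = []
    for word in middle:
        containing = sum(1 for term in ontology_terms if word in term.split())
        weights.append(1.0 / containing if containing else 0.0)
    return sum(weights) / len(weights)


def pp_score(middle, papers):
    """PP: fraction of papers whose text contains the middle phrase."""
    phrase = " ".join(middle)
    return sum(1 for text in papers if phrase in text) / len(papers)


def pattern_score(middle, ontology_terms, papers):
    """Assumed combination: unweighted mean of the three factors."""
    return (mt_score(middle, ontology_terms)
            + tt_score(middle, ontology_terms)
            + pp_score(middle, papers)) / 3


# Toy data, for illustration only.
ontology_terms = {"planum antorbitale", "planum terminale", "sphenethmoid"}
papers = [
    "the pterygoid invests the planum antorbitale",
    "the planum antorbitale is cartilaginous",
    "the frontoparietal invests the sphenethmoid",
    "no relevant anatomy is mentioned here",
]
score = pattern_score(("planum", "antorbitale"), ontology_terms, papers)
print(round(score, 2))
```

In practice the three factors would probably be weighted and tuned against curated annotations rather than averaged.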

19
Evaluation
  • Recall = correct responses by software /
    all human responses
  • Precision = correct responses by software /
    all responses by software
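
The two measures can be computed directly from the sets of annotations, as in this minimal sketch. The function name and the example annotations are made up for illustration, and the human curators' annotations are treated as the reference.

```python
def precision_recall(software, human):
    """Compare software annotations against human (reference) annotations."""
    software, human = set(software), set(human)
    correct = software & human
    precision = len(correct) / len(software) if software else 0.0
    recall = len(correct) / len(human) if human else 0.0
    return precision, recall


# Hypothetical (paper, term) annotations.
human = {("p1", "sphenethmoid"), ("p1", "frontoparietal"),
         ("p2", "pterygoid"), ("p2", "planum antorbitale"),
         ("p3", "sphenethmoid")}
software = {("p1", "sphenethmoid"), ("p2", "pterygoid"),
            ("p3", "sphenethmoid"), ("p3", "pterygoid")}

precision, recall = precision_recall(software, human)
print(precision, recall)  # 0.75 0.6
```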

20
Semi-Automated Construction
  • Goal: Develop automated techniques to further
    reduce manual efforts and enhance existing,
    manually created ontologies by
  • Enriching concepts in the ontology
  • Applying metrics to benchmark empirically the
    ontology's suitability

21
Semi-Automated Construction
  • Related Work
  • Omega ontology is large general-purpose ontology
    created semi-automatically from other existing
    ontologies and lexicons (e.g., WordNet and
    Mikrokosmos)
  • Small test ontologies constructed using words
    automatically extracted from documents; fuzzy
    logic is used to figure out the hierarchical
    organization

22
Semi-Automated Construction
  • Enriching existing ontology
  • Identify relevant data sources
  • Use topic-specific spider to generate queries for
    concepts in the ontology
  • Collect potentially relevant documents
  • Train IE system to retrieve info from documents
  • Pattern-based extraction methods
  • Statistical NLP algorithms that identify and
    weight most important elements

23
Semi-Automated Construction
  • Extrinsic benchmarking
  • How well the distribution of concepts and
    properties in the ontology reflects that in the
    literature
  • Ex.: if 80% of concepts in the ontology have 0
    instances in the literature, and only 5% of
    concepts account for 90% of all instances, then
    the ontology is not a good fit
  • - Concepts with many instances may require
    refinement and subdivision
  • - Concepts with no instances may be pruned (or
    designated as a less preferred synonym)
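
As a rough sketch of the extrinsic check described above, the function below summarizes how instances found in the literature are spread over the ontology's concepts. The function name, the `top_fraction` parameter, and the counts are assumptions. The toy data is constructed to reproduce the slide's example of a poor fit.

```python
def distribution_summary(instance_counts, top_fraction=0.05):
    """Return (share of concepts with 0 instances,
               share of all instances held by the top concepts)."""
    counts = sorted(instance_counts.values(), reverse=True)
    total = sum(counts)
    empty_share = sum(1 for c in counts if c == 0) / len(counts)
    top_n = max(1, round(len(counts) * top_fraction))
    top_share = sum(counts[:top_n]) / total if total else 0.0
    return empty_share, top_share


# Hypothetical counts: 20 concepts, 16 of them never seen in the literature.
counts = {"c0": 90, "c1": 5, "c2": 3, "c3": 2}
counts.update({f"c{i}": 0 for i in range(4, 20)})

empty_share, top_share = distribution_summary(counts)
print(empty_share, top_share)  # 0.8 0.9
```

Concepts at the top of the sorted counts are the candidates for subdivision, and those with a count of zero are the candidates for pruning.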

24
Semi-Automated Construction
  • Intrinsic benchmarking
  • Show how well the software-based ontology
    reflects the concepts, properties, and
    hierarchies represented in the manually-developed
    ontology
  • Determine if software can learn the preferred
    terminology automatically by identifying the most
    frequently used term from a synset, or the term
    most often used by authoritative sources
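
The frequency-based choice of a preferred term could be sketched as below. The synset, the documents, and the function name are made up for illustration, and weighting by authoritative sources is not covered. Counting uses simple substring matching, so it assumes that no synonym is contained in another.

```python
def preferred_term(synset, documents):
    """Pick the synonym used most often across the documents.

    Ties go to the term listed first in `synset`.
    """
    counts = {term: 0 for term in synset}
    for text in documents:
        lowered = text.lower()
        for term in synset:
            counts[term] += lowered.count(term.lower())
    return max(synset, key=lambda term: counts[term])


# Hypothetical synset and document snippets.
synset = ["girdle bone", "sphenethmoid"]
documents = [
    "The sphenethmoid is ossified.",
    "Sphenethmoid and girdle bone are used interchangeably.",
    "The sphenethmoid is invested by the frontoparietal.",
]
print(preferred_term(synset, documents))  # sphenethmoid
```

The result can then be compared with the preferred term chosen by the curators of the manually developed ontology.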

25
Susan Gauch, Ph.D.
  • Department Chair
  • Computer Science & Computer Engineering
    University of Arkansas
    Fayetteville, AR 72704
    Email: sgauch_at_uark.edu
    Web: http://www.csce.uark.edu/sgauch
  • Research areas
  • Intelligent information retrieval
  • Personalization and web search
  • Semi-automated ontology construction and
    modification

26
Discussion