1
Natural Language Processing and the
Semi-Automated Construction Approach

2007 AmphibAnat Meeting
Wed, Nov. 14, 11:30-12:00
Jennifer Leopold

2
Outline

Text mining
Manual curation
Document parsing
Potential problems
Methodology
Evaluation
Semi-automated construction
Enriching manually created ontology
Extrinsic and intrinsic benchmarking

3
Text Mining

Experts can design ontology (classes, hierarchy,
etc.)
But need to systematically go through literature
to identify instances and their properties
Particularly important to accommodate diversity

4
Text Mining

Goals
Discover new instances and properties
Increase strength of existing annotations by
locating additional paper evidence

5
FlyBase Curation

Watch list of 35 journals
Each curator inspects latest issue of a journal
to identify papers to curate
So curation takes place on paper-by-paper basis
(as opposed to topic-by-topic)

6
FlyBase Curation

Curator fills out record for each paper
Some fields require rephrasing, paraphrasing,
summarization
Other fields record very specific facts using
terms from ontologies

7
FlyBase Curation

Software like PaperBrowser presents enhanced
display of text with recognized terms highlighted
(e.g., Named Entity Recognition)
Parser identifies boundaries of the NP around
each term name and its grammatical relations to
other NPs in the text

8
Document Parsing

PDF is only standard electronic format in which
all relevant papers are available
PDF-to-text processors not aware of the
typesetting of each journal, have trouble with
some formatting (e.g., 2-column text, footnotes,
headers, figure captions, etc.)
Document parsing best done with optical character
recognition (OCR)
For images, can parse their captions

9
Potential Problems for Text Mining

Lexical ambiguity (e.g., words that denote > 1
concept)
Polysemy (e.g., term present in 2 papers denotes
different concepts)
Abbreviation (e.g., same concept, but different
abbreviations in different papers)

10
Potential Problems for Text Mining

Digit removal (e.g., 4-hydroxybutan vs.
2-hydroxybutan)
Stemming (e.g., removing prefixes, suffixes,
etc.)
Stop word removal (e.g., the, a)
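
The digit-removal problem above can be illustrated with a minimal sketch. The two normalizers, the tiny stop word list, and the token regexes are assumptions made for this example, not the presenter's pipeline. The generic version strips digits and collapses the two compound names from the slide into one string, while the domain-aware version keeps them distinct.

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}  # tiny illustrative list (assumption)


def generic_normalize(term: str) -> str:
    """Generic pipeline: lowercase, strip digits, drop stop words."""
    text = re.sub(r"\d", "", term.lower())
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def domain_normalize(term: str) -> str:
    """Domain-aware variant: keep digits and hyphens inside a token."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", term.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


a, b = "4-hydroxybutan", "2-hydroxybutan"
print(generic_normalize(a) == generic_normalize(b))  # True: the names collide
print(domain_normalize(a) == domain_normalize(b))    # False: still distinct
```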

Need a domain-specific text miner!!!

11
Methodology

Extract textual elements from papers identifying
a term in the ontology
Construct patterns with reliability scores
(confidence that pattern represents term)
Extend pattern set with longer pattern sets
Apply semantic pattern matching techniques (i.e.,
consider synonyms)
Annotate terms based on quality of matched
pattern to concept occurring in the text

12
Training Phase

Objective: construct a set of patterns that
characterize indicators for annotation
Find terms in the training set papers
Extract significant terms/phrases that appear in
the papers
Construct patterns based on significant
terms/phrases and terms surrounding significant
terms

13
Annotation Phase

Look for possible matches to the patterns in the
papers
Compute a matching score which indicates the
strength of the prediction
Determine the term to be associated with the
pattern match
Order new annotation predictions by their scores,
and present to user
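
The four annotation steps can be sketched roughly as follows. The slides do not give the matching formula, so the score used here (pattern reliability multiplied by the share of expected context words found near the match) is an assumption for illustration only. The dictionary layout and the term labels `term_A` and `term_B` are likewise hypothetical.

```python
def find_middle(tokens, middle):
    """Return the index where the ordered middle sequence starts, or -1."""
    n = len(middle)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == tuple(middle):
            return i
    return -1


def match_score(pattern, tokens, window=3):
    """Assumed score: reliability times the share of context words matched."""
    i = find_middle(tokens, pattern["middle"])
    if i < 0:
        return 0.0
    j = i + len(pattern["middle"])
    context = set(tokens[max(0, i - window):i]) | set(tokens[j:j + window])
    expected = pattern["left"] | pattern["right"]
    overlap = len(context & expected) / len(expected) if expected else 1.0
    return pattern["reliability"] * overlap


def annotate(patterns, tokens):
    """Score every pattern against the text and rank the predictions."""
    scored = [(match_score(p, tokens), p["term"]) for p in patterns]
    return sorted((s, t) for s, t in scored if s > 0)[::-1]


patterns = [
    {"left": {"frontoparietal"}, "middle": ("invests",),
     "right": {"sphenethmoid"}, "term": "term_A", "reliability": 0.8},
    {"left": {"pterygoid"}, "middle": ("invests",),
     "right": {"antorbitale"}, "term": "term_B", "reliability": 0.9},
]
print(annotate(patterns, "frontoparietal invests sphenethmoid".split()))
```

Predictions with a zero score are dropped before ranking, so the user only sees candidates with some supporting context.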

14
Pattern Construction

structured as LEFT <MIDDLE> RIGHT
<MIDDLE> is an ordered sequence of significant
terms (i.e., identifying elements)
LEFT and RIGHT are sets of words that appear
around significant terms (i.e., auxiliary
descriptors)
number of words in LEFT and RIGHT can be
limited
stop words not included in patterns
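
A minimal sketch of this structure, assuming a pre-tokenized sentence and a known span for the significant terms. The `Pattern` class, the `build_pattern` helper, the stop word list, and the sample sentence are all hypothetical. Stop words are filtered first and the context is then limited to `window` words on each side; the slides do not say in which order these two steps are applied.

```python
from dataclasses import dataclass

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative only (assumption)


@dataclass(frozen=True)
class Pattern:
    left: frozenset    # auxiliary descriptors before the significant terms
    middle: tuple      # ordered sequence of significant terms
    right: frozenset   # auxiliary descriptors after the significant terms


def build_pattern(tokens, start, end, window=3):
    """Build LEFT <MIDDLE> RIGHT around the span tokens[start:end]."""
    left = [w for w in tokens[:start] if w not in STOP_WORDS][-window:]
    right = [w for w in tokens[end:] if w not in STOP_WORDS][:window]
    return Pattern(frozenset(left), tuple(tokens[start:end]), frozenset(right))


sentence = "the frontoparietal invests the sphenethmoid".split()
print(build_pattern(sentence, 2, 3))
```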

15
Pattern Construction

Example pattern template:
LEFT <invests> RIGHT
pattern1: frontoparietal <invests> sphenethmoid
pattern2: anterior ramus pterygoid <invests> planum antorbitale

16
Pattern Scoring

Calculate score representing how confidently a
pattern represents a term
MT: source of <middle>
Patterns whose <middle> exactly matches an ontology
term get a higher score

17
Pattern Scoring

Calculate score representing how confidently a
pattern represents a term
TT: type of individual terms in the <middle>
Considers occurrence frequency of a word in
<middle> among all ontology terms, and position
of word in an ontology term (gets more specific
from right to left)

18
Pattern Scoring

Calculate score representing how confidently a
pattern represents a term
PP: term-wise paper frequency of <middle>
Patterns with a <middle> that is highly frequent in
the paper dataset get higher scores
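
The slides name three scoring components (MT, TT, PP) but give no formulas, so the following is only a rough sketch under stated assumptions. MT is 1.0 for an exact ontology match and 0.5 otherwise. TT is reduced to word rarity among ontology terms, and the right-to-left position weighting is left out. PP is the fraction of papers containing the middle. Combining the three by a plain average is also an assumption, and the sample ontology terms and papers are invented for the example.

```python
def mt_score(middle, ontology_terms):
    """MT (assumed values): 1.0 for an exact ontology match, else 0.5."""
    return 1.0 if " ".join(middle) in ontology_terms else 0.5


def tt_score(middle, ontology_terms):
    """TT (simplified): words used in fewer ontology terms score higher."""
    scores = []
    for word in middle:
        n = sum(1 for term in ontology_terms if word in term.split())
        scores.append(1.0 / n if n else 0.0)
    return sum(scores) / len(scores)


def pp_score(middle, papers):
    """PP: fraction of papers whose text contains the middle phrase."""
    phrase = f" {' '.join(middle)} "
    hits = sum(1 for text in papers if phrase in f" {text} ")
    return hits / len(papers)


def pattern_score(middle, ontology_terms, papers):
    """Unweighted mean of the three components (an assumption)."""
    parts = (mt_score(middle, ontology_terms),
             tt_score(middle, ontology_terms),
             pp_score(middle, papers))
    return sum(parts) / len(parts)


ontology = {"pterygoid", "anterior ramus pterygoid", "sphenethmoid"}
papers = ["the pterygoid is large", "nothing relevant here"]
print(pattern_score(("pterygoid",), ontology, papers))
```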

19
Evaluation

Recall = (correct responses by software) / (all human responses)
Precision = (correct responses by software) / (all responses by software)
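
As a small worked example of these two ratios, the sketch below treats each response as a (paper, term) pair and counts a software response as correct when a human curator made the same annotation. The pair representation and the sample data are assumptions for illustration.

```python
def precision_recall(software, human):
    """Return (precision, recall) for two collections of annotations."""
    software, human = set(software), set(human)
    correct = software & human  # software responses that curators also made
    precision = len(correct) / len(software) if software else 0.0
    recall = len(correct) / len(human) if human else 0.0
    return precision, recall


software = {("p1", "a"), ("p1", "b"), ("p2", "c"), ("p3", "d")}
human = {("p1", "a"), ("p2", "c"), ("p2", "e")}
precision, recall = precision_recall(software, human)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here 2 of the 4 software responses are correct, and 2 of the 3 human annotations were recovered.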

20
Semi-Automated Construction

Goal: Develop automated techniques to further
reduce manual efforts and enhance existing,
manually created ontologies by:
Enriching concepts in the ontology
Applying metrics to benchmark empirically the
ontology's suitability

21
Semi-Automated Construction

Related Work
Omega ontology is large general-purpose ontology
created semi-automatically from other existing
ontologies and lexicons (e.g., WordNet and
Mikrokosmos)
Small test ontologies constructed using words
automatically extracted from documents; uses
fuzzy logic to figure out hierarchical
organization

22
Semi-Automated Construction

Enriching existing ontology
Identify relevant data sources
Use topic-specific spider to generate queries for
concepts in the ontology
Collect potentially relevant documents
Train IE system to retrieve info from documents
Pattern-based extraction methods
Statistical NLP algorithms that identify and
weight most important elements

23
Semi-Automated Construction

Extrinsic benchmarking
How well the distribution of concepts and
properties in the ontology reflects that in the
literature
Ex: if 80% of concepts in the ontology have 0
instances in the literature, and only 5% of
concepts account for 90% of all instances, then
the ontology is not a good fit
- Concepts with many instances may require
refinement and subdivision
- Concepts with no instances may be pruned (or
designated as a less preferred synonym)
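
The distribution check in the example above could be computed along these lines. This is a sketch only: the function name, the `top_fraction` parameter, and the sample counts are hypothetical, and the slides do not define a threshold for "not a good fit".

```python
def distribution_report(instance_counts, top_fraction=0.05):
    """Summarize how instances are spread across ontology concepts.

    instance_counts maps each concept to the number of instances found
    in the literature. Returns (share of concepts with no instances,
    share of all instances held by the top `top_fraction` of concepts).
    """
    counts = sorted(instance_counts.values(), reverse=True)
    if not counts:
        return 0.0, 0.0
    total = sum(counts)
    empty_share = sum(1 for c in counts if c == 0) / len(counts)
    top_n = max(1, round(len(counts) * top_fraction))
    top_share = sum(counts[:top_n]) / total if total else 0.0
    return empty_share, top_share


# Hypothetical ontology of 20 concepts, 16 of them never seen in the papers.
counts = {f"concept_{i}": 0 for i in range(16)}
counts.update({"c16": 90, "c17": 4, "c18": 3, "c19": 3})
print(distribution_report(counts))
```

With these made-up counts, 80% of concepts are empty and the top 5% of concepts hold 90% of instances, which matches the poor-fit case described on the slide.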

24
Semi-Automated Construction

Intrinsic benchmarking
Show how well the software-based ontology
reflects the concepts, properties, and
hierarchies represented in the manually-developed
ontology
Determine if software can learn the preferred
terminology automatically by identifying the most
frequently used term from a synset, or the term
most often used by authoritative sources
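
Picking the most frequently used term from a synset can be sketched with a simple counter. The assumptions here are that synonyms are single tokens and that documents are already tokenized; the synonym labels are placeholders, not real anatomical terms. Weighting by authoritative sources is not shown.

```python
from collections import Counter


def preferred_term(synset, documents):
    """Return the synset member used most often, or None if none occurs."""
    counts = Counter()
    for tokens in documents:
        counts.update(t for t in tokens if t in synset)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


synset = {"variant_a", "variant_b"}  # placeholder synonyms (assumption)
documents = [
    ["variant_a", "lies", "near", "variant_a"],
    ["variant_b", "is", "rarely", "used"],
]
print(preferred_term(synset, documents))  # variant_a
```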

25
Susan Gauch, Ph.D.

Department Chair
Computer Science and Computer Engineering
University of Arkansas
Fayetteville, AR 72704
Email: sgauch_at_uark.edu
Web: http://www.csce.uark.edu/sgauch
Research areas
Intelligent information retrieval
Personalization and web search
Semi-automated ontology construction and
modification

26
Discussion
