1
BeeSpace Informatics Research
  • ChengXiang (Cheng) Zhai
  • Department of Computer Science
  • Institute for Genomic Biology
  • Department of Statistics
  • Graduate School of Library & Information Science
  • University of Illinois at Urbana-Champaign

BeeSpace Workshop, May 21, 2007
2
Overview of BeeSpace Technology
(Architecture diagram, top to bottom:)
  • Users
  • Task Support: Function Annotator, Gene Summarizer
  • Space Navigation: Space/Region Manager, Navigation
    Support
  • Text Miner, Search Engine, Relational Database
    (over Words/Phrases and Entities)
  • Content Analysis: Natural Language Understanding
  • Meta Data, Literature Text
3
Part 1: Content Analysis
4
Natural Language Understanding
  • "We have cloned and sequenced a cDNA encoding Apis
    mellifera ultraspiracle (AMUSP) and examined its
    responses to ..."

5
Sample Technique 1: Automatic Gene Recognition
  • Syntactic clues
  • Capitalization (especially acronyms)
  • Numbers (gene families)
  • Punctuation: "-", "/", etc.
  • Contextual clues
  • Local: surrounding words such as gene,
    encoding, regulation, expressed, etc.
  • Global: the same noun phrase occurs several times
    in the same article

6
Maximum Entropy Model for Gene Tagging
  • Given an observation (a token or a noun phrase),
    together with its context, denoted as x
  • Predict y ∈ {gene, non-gene}
  • Maximum entropy model:
  • P(y|x) = (1/K) exp(Σ_i λ_i f_i(x, y)), where K is a
    normalizing constant
  • Typical features f:
  • y = gene and the candidate phrase starts with a
    capital letter
  • y = gene and the candidate phrase contains digits
  • Estimate the λ_i from training data

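To make the model concrete, here is a minimal sketch of such a classifier in Python, assuming binary feature functions like the two on this slide; the feature names and weights (the λ_i, normally estimated from labeled training data) are invented for illustration, not taken from the BeeSpace tagger.

```python
# Minimal maximum-entropy sketch with invented binary features/weights.
import math

def features(phrase):
    """Binary features of the candidate phrase for y = gene."""
    return {
        "starts_capital": phrase[:1].isupper(),
        "contains_digit": any(c.isdigit() for c in phrase),
        "has_hyphen": "-" in phrase,
    }

def p_gene(phrase, weights):
    """P(y=gene|x) = exp(sum_i lambda_i f_i(x, gene)) / K, where K
    normalizes over both classes (the non-gene class contributes
    exp(0) because its features are all off in this toy setup)."""
    f = features(phrase)
    score = sum(w for name, w in weights.items() if f.get(name))
    return math.exp(score) / (math.exp(score) + 1.0)

weights = {"starts_capital": 0.8, "contains_digit": 1.2, "has_hyphen": 0.3}
print(p_gene("CD38", weights))      # capitalized + digits: likely gene
print(p_gene("behavior", weights))  # no surface clues: 0.5 baseline
```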
7
Domain overfitting problem
  • When a learning based gene tagger is applied to a
    domain different from the training domain(s), the
    performance tends to decrease significantly.
  • The same problem occurs in other types of text,
    e.g., named entities in news articles.

Training domain   Test domain   F1
mouse             mouse         0.541
fly               mouse         0.281
Reuters           Reuters       0.908
Reuters           WSJ           0.643
8
Observation I
  • Overemphasis on domain-specific features in the
    trained model

wingless, daughterless, eyeless, apexless (fly)
The suffix "-less" is weighted high in the model
trained from fly data.
9-10
Observation II
  • Generalizable features generalize well in all
    domains
  • "decapentaplegic and wingless are expressed in
    analogous patterns in each primordium of ..." (fly)
  • "... that CD38 is expressed by both neurons and glial
    cells ...", "... that PABPC5 is expressed in fetal brain
    and in a range of adult tissues." (mouse)
  • The feature w_{i+2} = "expressed" (the word two
    positions to the right of the candidate) is
    generalizable
11
Generalizability-based feature ranking
[Figure: features are ranked separately within each training domain
(fly, mouse, D3, ..., Dm). "-less" tops the ranking only in the fly
domain, while "expressed" ranks high in every domain; combining the
per-domain ranks gives overall scores (expressed 0.125, -less 0.167)
used to order features by generalizability.]
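As an illustration of the ranking idea, the sketch below scores each feature by its average rank across per-domain models, so a feature that ranks high in every domain ("expressed") beats a domain-specific one ("-less"). The per-domain weights are invented, and the averaging scheme is one plausible reading of the figure, not necessarily the exact formula used.

```python
# Generalizability-based feature ranking: average per-domain ranks.

def generalizability_ranking(domain_weights):
    """domain_weights: {domain: {feature: weight}} -> features sorted
    from most to least generalizable (smallest average rank first)."""
    avg_rank = {}
    for weights in domain_weights.values():
        ranked = sorted(weights, key=weights.get, reverse=True)
        for rank, feat in enumerate(ranked, start=1):
            avg_rank[feat] = avg_rank.get(feat, 0.0) + rank / len(domain_weights)
    return sorted(avg_rank, key=avg_rank.get)

domains = {
    "fly":   {"-less": 2.1, "expressed": 1.5, "gene": 0.9},
    "mouse": {"-less": 0.1, "expressed": 1.8, "gene": 1.0},
    "yeast": {"-less": 0.2, "expressed": 1.6, "gene": 1.1},
}
print(generalizability_ranking(domains))  # "expressed" comes out on top
```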
12
Adapting Biological Named Entity Recognizer
[Figure: a named entity recognizer is trained on data from source
domains T1, ..., Tm and applied to test data E from a new domain.]
13
Effectiveness of Domain Adaptation
Exp    Method    Precision   Recall    F1
FM→Y   Baseline  0.557       0.466     0.508
FM→Y   Domain    0.575       0.516     0.544
FM→Y   Imprv.    +3.2%       +10.7%    +7.1%
FY→M   Baseline  0.571       0.335     0.422
FY→M   Domain    0.582       0.381     0.461
FY→M   Imprv.    +1.9%       +13.7%    +9.2%
MY→F   Baseline  0.583       0.097     0.166
MY→F   Domain    0.591       0.139     0.225
MY→F   Imprv.    +1.4%       +43.3%    +35.5%
  • Text data from BioCreAtIvE (Medline)
  • 3 organisms: Fly (F), Mouse (M), Yeast (Y); e.g., FM→Y
    means trained on Fly and Mouse, tested on Yeast

14
Gene Recognition in V3
  • A variation of the basic maximum entropy model
  • Classes: Begin, Inside, Outside
  • Features: syntactic features, POS tags, and the
    class labels of the previous two tokens
  • Post-processing to exploit global features
  • Leverage existing toolkit: BMR

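A small sketch of the Begin/Inside/Outside scheme in practice: decoding a tag sequence into gene-mention spans, the representation on which post-processing with global features (e.g., labeling repeated noun phrases consistently across an article) would operate. The tokens and tags here are illustrative.

```python
# Decode a Begin/Inside/Outside tag sequence into mention spans.

def bio_to_spans(tokens, tags):
    """Collect maximal B I* runs as (start, end) token spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

tokens = ["the", "wingless", "gene", "and", "AmUSP"]
tags   = ["O",   "B",        "O",    "O",   "B"]
for s, e in bio_to_spans(tokens, tags):
    print(" ".join(tokens[s:e]))   # -> "wingless", then "AmUSP"
```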
15
Part 2: Navigation Support
16
Space-Region Navigation

[Figure: navigation between spaces and topic regions; labels include
Topic Regions, My Regions/Topics, Bee Forager, the Bee/Bird/Fly
literature spaces, My Spaces, and Behavior.]
17
MAP: Topic/Region → Space
  • MAP: use the topic/region description as a query
    to search a given space
  • Retrieval algorithm:
  • Query word distribution p(w|θ_Q)
  • Document word distribution p(w|θ_D)
  • Score a document based on the similarity of θ_Q
    and θ_D
  • Leverage existing retrieval toolkits: Lemur/Indri

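The sketch below illustrates the scoring idea with a maximum-likelihood query model and Dirichlet-smoothed document models, ranking by cross entropy (equivalent to negative KL divergence up to a query-only constant). In BeeSpace this step is delegated to Lemur/Indri; the mu = 2000 default and the toy corpus are illustrative assumptions.

```python
# Language-model retrieval sketch: score(D) ~ sum_w p(w|Q) log p(w|D).
import math
from collections import Counter

def score(query_tokens, doc_tokens, collection, mu=2000):
    q = Counter(query_tokens)
    d = Counter(doc_tokens)
    coll = Counter(w for doc in collection for w in doc)
    n_coll = sum(coll.values())
    s = 0.0
    for w, qc in q.items():
        p_w_q = qc / len(query_tokens)                 # query model (MLE)
        p_w_c = (coll[w] + 1) / (n_coll + len(coll))   # collection model
        p_w_d = (d[w] + mu * p_w_c) / (len(doc_tokens) + mu)  # Dirichlet
        s += p_w_q * math.log(p_w_d)                   # -KL up to a constant
    return s

docs = [["foraging", "behavior", "in", "honey", "bees"],
        ["actin", "filaments", "in", "flight", "muscle"]]
query = ["foraging", "behavior"]
print(max(docs, key=lambda d: score(query, d, docs)))  # the foraging doc
```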
18
EXTRACT: Space → Topic/Region
  • Assume k topics, each represented by a word
    distribution
  • Use a k-component mixture model to fit the
    documents in a given space (EM algorithm)
  • The estimated k component word distributions are
    taken as k topic regions

Likelihood: log p(D|Λ) = Σ_d Σ_w c(w,d) log Σ_{j=1..k} π_{d,j} p(w|θ_j)
Maximum likelihood estimator: Λ* = argmax_Λ p(D|Λ)
Bayesian (MAP) estimator: Λ* = argmax_Λ p(D|Λ) p(Λ)
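A compact sketch of the extraction step: fitting k unigram topic distributions to a space with EM, using per-document mixing weights π_{d,j} as in the likelihood above. Background smoothing and the topic priors of the next slides are omitted for brevity, and the toy documents are invented.

```python
# EM for a k-component mixture of unigram language models.
import random
from collections import Counter

def extract_topics(docs, k, iters=50, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # random initial p(w|theta_j), normalized over the vocabulary
    topics = []
    for _ in range(k):
        raw = [rng.random() for _ in vocab]
        s = sum(raw)
        topics.append({w: r / s for w, r in zip(vocab, raw)})
    pi = [[1.0 / k] * k for _ in docs]               # pi_{d,j}
    counts = [Counter(d) for d in docs]
    for _ in range(iters):
        new_topics = [dict.fromkeys(vocab, 1e-10) for _ in range(k)]
        for di, cnt in enumerate(counts):
            acc = [1e-10] * k
            for w, c in cnt.items():
                # E-step: p(z=j | w, d) proportional to pi_{d,j} p(w|theta_j)
                post = [pi[di][j] * topics[j][w] for j in range(k)]
                z = sum(post)
                for j in range(k):
                    r = c * post[j] / z
                    new_topics[j][w] += r            # expected counts
                    acc[j] += r
            pi[di] = [a / sum(acc) for a in acc]     # M-step for pi
        for j in range(k):                           # M-step for topics
            z = sum(new_topics[j].values())
            topics[j] = {w: v / z for w, v in new_topics[j].items()}
    return topics

docs = [["foraging", "nectar", "food"], ["muscle", "actin", "filaments"],
        ["foraging", "food", "colony"], ["flight", "muscle", "myosin"]]
for t in extract_topics(docs, k=2):
    print(sorted(t, key=t.get, reverse=True)[:3])    # top words per topic
```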
19
A Sample Topic
Word distribution (language model):
  filaments 0.0410238, muscle 0.0327107, actin 0.0287701,
  z 0.0221623, filament 0.0169888, myosin 0.0153909,
  thick 0.00968766, thin 0.00926895, sections 0.00924286,
  er 0.00890264, band 0.00802833, muscles 0.00789018,
  antibodies 0.00736094, myofibrils 0.00688588,
  flight 0.00670859, images 0.00649626
Meaningful labels: actin filaments, flight muscle, flight muscles
Example documents from the corresponding space:
  • actin filaments in honeybee-flight muscle move
    collectively
  • arrangement of filaments and cross-links in the
    bee flight muscle z disk by image analysis of
    oblique sections
  • identification of a connecting filament protein
    in insect fibrillar flight muscle
  • the invertebrate myosin filament subfilament
    arrangement of the solid filaments of insect
    flight muscles
  • structure of thick filaments from insect flight
    muscle

20
Incorporating Topic Priors
  • In either topic extraction or clustering, user
    exploration usually has a preference
  • E.g., the user may want one topic/cluster to be
    about foraging behavior
  • Use a prior to guide topic extraction
  • Prior expressed as a simple language model
  • E.g., forage 0.2, foraging 0.3, food 0.05, etc.

21
Incorporating a Topic Prior
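One standard way to realize this (a sketch of the usual conjugate-prior formulation, not necessarily the exact formula on the original slide): the prior language model contributes μ pseudo-counts to the EM M-step update of its topic,

\[
p^{(n+1)}(w \mid \theta_j) \;=\;
\frac{\sum_{d} c(w,d)\, p(z_{d,w}{=}j) \;+\; \mu\, p(w \mid \mathrm{prior}_j)}
     {\sum_{w'} \sum_{d} c(w',d)\, p(z_{d,w'}{=}j) \;+\; \mu}
\]

where μ controls the prior's strength: μ = 0 recovers the maximum likelihood estimator, and a large μ pins the topic to the prior (e.g., toward forage/foraging/food).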
22
Incorporating Topic Priors: Sample Topic 1
Prior: labor 0.2, division 0.2
Resulting topic:
  age 0.0672687, division 0.0551497, labor 0.052136,
  colony 0.038305, foraging 0.0357817, foragers 0.0236658,
  workers 0.0191248, task 0.0190672, behavioral 0.0189017,
  behavior 0.0168805, older 0.0143466, tasks 0.013823,
  old 0.011839, individual 0.0114329, ages 0.0102134,
  young 0.00985875, genotypic 0.00963096, social 0.00883439
23
Incorporating Topic Priors: Sample Topic 2
Prior: behavioral 0.2, maturation 0.2
Resulting topic:
  behavioral 0.110674, age 0.0789419, maturation 0.057956,
  task 0.0318285, division 0.0312101, labor 0.0293371,
  workers 0.0222682, colony 0.0199028, social 0.0188699,
  behavior 0.0171008, performance 0.0117176, foragers 0.0110682,
  genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816,
  older 0.00808171, plasticity 0.00804363, changes 0.00794045
24
Exploit Prior for Concept Switching
Topic 1:
  foraging 0.290076, nectar 0.114508, food 0.106655,
  forage 0.0734919, colony 0.0660329, pollen 0.0427706,
  flower 0.0400582, sucrose 0.0334728, source 0.0319787,
  behavior 0.0283774, individual 0.028029, rate 0.0242806,
  recruitment 0.0200597, time 0.0197362, reward 0.0196271,
  task 0.0182461, sitter 0.00604067, rover 0.00582791,
  rovers 0.00306051
Topic 2:
  foraging 0.142473, foragers 0.0582921, forage 0.0557498,
  food 0.0393453, nectar 0.03217, colony 0.019416,
  source 0.0153349, hive 0.0151726, dance 0.013336,
  forager 0.0127668, information 0.0117961, feeder 0.010944,
  rate 0.0104752, recruitment 0.00870751, individual 0.0086414,
  reward 0.00810706, flower 0.00800705, dancing 0.00794827,
  behavior 0.00789228
25
Part 3: Task Support
26
Gene Summarization
  • Task: Automatically generate a text summary for a
    given gene
  • Challenge: Need to summarize different aspects of
    a gene
  • Standard summarization methods would generate an
    unstructured summary
  • Solution: A new method for generating
    semi-structured summaries

27
An Ideal Gene Summary
  • http://flybase.bio.indiana.edu/.bin/fbidq.html?FBgn0000017

Summary aspects: GP (Gene Products), EL (Expression Location),
SI (Sequence Information), GI (Genetical Interaction), MP (Mutant
Phenotype), WFPI (Wild-type Function & Phenotypic Information)
28
Semi-structured Text Summarization
29
Summary example (Abl)
30
A General Entity Summarizer
  • Task: Given any entity and k aspects to
    summarize, generate a semi-structured summary
  • Assumption: Training sentences are available for
    each aspect
  • Method (a sketch follows after this list):
  • Train a recognizer for each aspect
  • Given an entity, retrieve sentences relevant to
    the entity
  • Classify each sentence into one of the k aspects
  • Choose the best sentences in each category

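A self-contained sketch of the four-step method above, with a deliberately simple word-overlap recognizer per aspect standing in for the trained classifiers; the aspect names, training sentences, and corpus are invented for illustration.

```python
# Entity summarizer sketch: train, retrieve, classify, select.
from collections import Counter

def train_recognizer(sentences):
    """Aspect model = relative word frequencies of its training sentences."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def aspect_score(model, sentence):
    return sum(model.get(w, 0.0) for w in sentence.lower().split())

def summarize_entity(entity, aspect_training, corpus, per_aspect=2):
    models = {a: train_recognizer(s) for a, s in aspect_training.items()}
    # step 2: retrieve sentences mentioning the entity
    hits = [s for s in corpus if entity.lower() in s.lower()]
    summary = {a: [] for a in models}
    for sent in hits:
        # step 3: classify each sentence into the best-scoring aspect
        best = max(models, key=lambda a: aspect_score(models[a], sent))
        summary[best].append((aspect_score(models[best], sent), sent))
    # step 4: keep the top sentences per aspect
    return {a: [s for _, s in sorted(v, reverse=True)[:per_aspect]]
            for a, v in summary.items()}

corpus = ["Abl is expressed in the embryonic CNS",
          "Abl mutants show axon guidance defects"]
training = {"EL": ["expressed in the brain", "expressed in muscle"],
            "MP": ["mutants are lethal", "mutant flies show defects"]}
print(summarize_entity("Abl", training, corpus, per_aspect=1))
```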
31
Summary
  • All the methods we developed are
  • General
  • Scalable
  • The problems are hard, but good progress has been
    made in all directions
  • The V3 system incorporates only the basic
    research results
  • More advanced technologies are available for
    immediate implementation:
  • Better tokenization for retrieval
  • Domain adaptation techniques
  • Automatic topic labeling
  • General entity summarizer
  • More research to be done on:
  • Entity relation extraction
  • Graph mining/question answering
  • Domain adaptation
  • Active learning

32
Looking Ahead: X-Space
(Same architecture diagram as in slide 2, "Overview of BeeSpace
Technology", with the same components from Users down to Literature
Text.)
33
Thank You!
  • Questions?