1
Learning to Extract Genic Interactions Using Gleaner
  • LLL05 Workshop, 7 August 2005
  • ICML 2005, Bonn, Germany
  • Mark Goadrich, Louis Oliphant and Jude Shavlik
  • Department of Computer Sciences
  • University of Wisconsin-Madison, USA

2
Learning Language in Logic
  • Biomedical Information Extraction Challenge
  • Two tasks: with and without co-reference
  • 80 sentences for training
  • 40 sentences for testing
  • Our approach: Gleaner (ILP 2004)
  • Fast ensemble ILP algorithm
  • Focused on recall and precision evaluation

3
A Sample Positive Example
  • Given: medical journal abstracts tagged
    with genic interaction relations
  • Do: construct a system to extract genic
    interaction phrases from unseen text
  • ykuD was transcribed by SigK RNA polymerase
    from T4 of sporulation.

4
What is a Negative Example?
  • All unlabeled word pairings?
  • Wastes time with irrelevant words
  • We know the test set will include a dictionary
  • Use only unlabeled pairings of words in the
    dictionary (sketched below)
  • 106 positive, 414 negative without co-reference
  • 59 positive, 261 negative with co-reference
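A minimal sketch of this negative-example generation, assuming plain list and set inputs (make_examples and its argument names are hypothetical, not the challenge data format):

    # Negatives: unlabeled ordered pairs of dictionary words in a sentence.
    def make_examples(sentence_words, dictionary, positive_pairs):
        candidates = [w for w in sentence_words if w in dictionary]
        negatives = []
        for agent in candidates:
            for target in candidates:
                pair = (agent, target)
                if agent != target and pair not in positive_pairs:
                    negatives.append(pair)
        return negatives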

5
Tagging and Parsing
  [Figure: part-of-speech tags and phrase parse of the sample sentence
  ykuD was transcribed by SigK RNA polymerase from T4 of sporulation]

6
Some Additional Predicates
  • High-scoring words in agent phrases (scoring
    sketched below)
  • depend, bind, protein, ...
  • High-scoring words in target phrases
  • gene, promote, product, ...
  • High-scoring words between agent and target phrases
  • negative, regulate, transcribe, ...
  • Medical Subject Headings (MeSH)
  • canonical vocabulary for indexing biomedical articles
  • in_mesh(RNA), in_mesh(gene)
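One plausible way to compute such high-scoring words is sketched below; the talk does not specify the actual scoring metric, so the frequency-ratio score and the function high_scoring_words are assumptions:

    from collections import Counter

    # Rank words by how concentrated they are in agent phrases relative to
    # all phrases (assumes agent_phrases is a subset of all_phrases).
    def high_scoring_words(agent_phrases, all_phrases, k=10):
        in_agent = Counter(w for p in agent_phrases for w in p)
        overall = Counter(w for p in all_phrases for w in p)
        score = {w: in_agent[w] / overall[w] for w in in_agent}
        return sorted(score, key=score.get, reverse=True)[:k]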

7
Even More Predicates
  • Lexical Predicates
  • internal_caps(Word)
  • alphanumeric(Word)
  • Look-ahead Phrase Predicates
  • few_POS_in_phrase(Phrase, POS)
  • phrase_contains_specific_word_triple(Phrase, W1,
    W2, W3)
  • phrase_contains_some_marked_up_arg(Phrase, Arg,
    Word, Fold)
  • Relative Location of Phrases
  • agent_before_target(ExampleID)
  • word_pair_in_between_target_phrases(ExampleID,
    W1, W2)

8
Enriched Data from the Challenge Committee
  • Link Parser (CMU) creates a parse tree
  • Root lemma of each word (not used)
  • 27 Syntactic Information Predicates
  • complement_of_N_N(Word, Word)
  • modifier_ADV_V(Word, Word)
  • object_V_Passive_N(Word, Word)

9
Gleaner
  • Definition of Gleaner
  • One who gathers grain left behind by reapers
  • Key Ideas of Gleaner
  • Use Aleph as underlying ILP clause engine
  • Keep wide range of clauses usually discarded
  • Create separate theories for different recall
    ranges

10
Aleph - Background
  • Seed Example
  • A positive example that our clause must cover
  • Bottom Clause
  • All predicates that are true of the seed example

11
Aleph - Learning
  • Aleph learns theories of clauses (Srinivasan,
    v4, 2003)
  • Pick positive seed example, find bottom clause
  • Use heuristic search to find best clause
  • Pick a new seed from the uncovered positives and
    repeat until a threshold of positives is covered
    (loop sketched below)
  • Theory produces one recall-precision point
  • Learning complete theories is time-consuming
  • Can produce a ranking with ensembles
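A minimal sketch of Aleph's covering loop under these assumptions (build_bottom_clause, heuristic_search, and clause.covers are hypothetical helpers; real Aleph is a Prolog program):

    # Learn clauses until a threshold fraction of positives is covered.
    def learn_theory(positives, negatives, coverage_threshold=0.95):
        theory, uncovered = [], set(positives)
        while len(uncovered) > (1 - coverage_threshold) * len(positives):
            seed = next(iter(uncovered))        # pick a positive seed
            bottom = build_bottom_clause(seed)  # hypothetical helper
            clause = heuristic_search(bottom, positives, negatives)  # hypothetical
            theory.append(clause)
            uncovered = {p for p in uncovered if not clause.covers(p)}
        return theory  # one theory = one recall-precision point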

12
Gleaner - Background
  • Rapid Random Restart (Zelezny et al., ILP 2002;
    sketched below)
  • Stochastic selection of initial clause
  • Time-limited local heuristic search
  • Randomly choose new initial clause and repeat
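A sketch of that restart loop (random_clause_from, refine, and score are hypothetical stand-ins for the clause sampler, refinement operator, and search heuristic):

    import time

    def rrr_search(bottom_clause, score, restarts=10, time_limit=1.0):
        best = None
        for _ in range(restarts):
            clause = random_clause_from(bottom_clause)  # stochastic initial clause
            deadline = time.time() + time_limit         # time-limited local search
            while time.time() < deadline:
                neighbor = refine(clause)               # hypothetical greedy step
                if score(neighbor) <= score(clause):
                    break                               # local optimum reached
                clause = neighbor
            if best is None or score(clause) > score(best):
                best = clause
        return best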

13
Gleaner - Learning
  • Create B Bins
  • Generate Clauses
  • Record Best per Bin
  • Repeat for K seeds

  [Plot: precision vs. recall of generated clauses, best clause
  recorded per bin; loop sketched below]
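A minimal sketch of this gathering loop (generate_clauses, recall, and precision are hypothetical stand-ins for the per-seed search and clause evaluation):

    # B recall bins; keep the highest-precision clause per bin for each seed.
    def gleaner_learn(seeds, B, generate_clauses, recall, precision):
        best = [dict() for _ in range(B)]                # best[b][seed] = clause
        for seed in seeds:
            for clause in generate_clauses(seed):        # clauses found from this seed
                b = min(int(recall(clause) * B), B - 1)  # bin index by recall
                kept = best[b].get(seed)
                if kept is None or precision(clause) > precision(kept):
                    best[b][seed] = clause
        return best                                      # up to K clauses per bin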
14
Gleaner - Combining
  • Combine K clauses per bin
  • If at least L of K clauses match, call example
    positive
  • How to choose L?
  • L = 1: high recall, low precision
  • L = K: low recall, high precision
  • We want a collection of high-precision theories
    spanning the space of recall levels (vote sketched
    below)
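A sketch of the L-of-K vote and the curve it traces (clause.covers is a hypothetical matching method; assumes at least one positive label):

    def classify(example, clauses, L):
        # positive iff at least L of the bin's K clauses match the example
        return sum(1 for c in clauses if c.covers(example)) >= L

    def recall_precision_points(clauses, examples, labels):
        points = []
        for L in range(1, len(clauses) + 1):  # sweep the threshold
            preds = [classify(x, clauses, L) for x in examples]
            tp = sum(1 for p, y in zip(preds, labels) if p and y)
            fp = sum(1 for p, y in zip(preds, labels) if p and not y)
            fn = sum(1 for p, y in zip(preds, labels) if not p and y)
            points.append((tp / (tp + fn), tp / (tp + fp) if tp + fp else 1.0))
        return points  # one (recall, precision) point per L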

15
Gleaner - Overlap
  • Take the topmost curve of the overlapping theories
    (sketched below)

  [Plot: precision vs. recall, topmost curve over the
  overlapping theories]
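A sketch of taking that topmost curve, i.e. discarding points dominated in both recall and precision once all bins' points are pooled:

    def topmost_curve(points):
        # points: (recall, precision) pairs pooled from every bin's curve
        out, best_precision = [], 0.0
        for r, p in sorted(points, reverse=True):  # scan from high recall down
            if p > best_precision:                 # keep only undominated points
                out.append((r, p))
                best_precision = p
        return sorted(out)                         # curve from low to high recall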
16
Gleaner - Practical Use
  • Generate Curve
  • User Selects Recall Bin
  • Return classifications with L-of-K confidence
    (selection sketched below)

  [Plot: precision vs. recall; example operating point at
  recall 0.50, precision 0.70]
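A sketch of that selection step (select_operating_point and the tuple layout are hypothetical); for the slide's example, asking for recall 0.50 would return the bin and threshold that gave precision 0.70 on the tuning set:

    def select_operating_point(curve, desired_recall):
        # curve: (recall, precision, bin_id, L) tuples measured on a tuning set
        return min(curve, key=lambda pt: abs(pt[0] - desired_recall))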
17
Sample Extraction Clause
  agent_target(Agent, Target, Sentence) :-
      several_phrases_in_sentence(Sentence),
      some_wordPOS_in_sentence(Sentence, novelword),
      n(Agent),
      alphabetic(Agent),
      word_parent(Agent, F),
      phrase_contains_internal_cap_word(F, noun, _),
      few_POS_in_phrase(F, novelword),
      in_between_target_phrases(Agent, Target, _),
      n(Target).
  • 0.14 recall, 0.93 precision on the without-co-reference
    training set

18
Sample Extraction Clause
  agent_target(Agent, Target, Sentence) :-
      avg_length_sentence(Sentence),
      n(Agent),
      word_previous(Target, _),
      in_between_target_phrases(Agent, Target, _).
  • 0.76 recall, 0.49 precision on the without-co-reference
    training set

19
Experimental Methodology
  • Used the other task's training set as a tuning set
    in both cases
  • Test set unlabeled, but a dictionary was provided
  • Included sentences with no positives
  • 936 total test-set examples generated
  • Parameter Settings
  • Gleaner (20 recall bins)
  • seeds: 100
  • clauses: 25,000
  • Aleph (0.75 minimum accuracy)
  • nodes: 1K and 25K
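For reference, the reported settings collected as plain Python dicts (key names are mine; values are from the slide):

    gleaner_params = {"recall_bins": 20, "seeds": 100, "clauses": 25000}
    aleph_params = {"min_accuracy": 0.75, "search_nodes": [1000, 25000]}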

20
LLL Without Co-reference Results
  [Results chart: recall-precision curves for Gleaner Basic,
  Aleph Basic 1K, and Gleaner Enriched]
21
LLL With Co-reference Results
  [Results chart: recall-precision curves for Gleaner Basic,
  Aleph Basic 1K, and Gleaner Enriched]
22
We Need More Datasets
  • LLL Challenge task is small
  • Would prefer to do cross-validation
  • Need labels for the test set
  • Our ILP 2004 dataset is open to the community
  • ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/datasets/IE-protein-location
  • Biomedical information-extraction tasks
  • Genetic Disorder (Ray and Craven, 2001)
  • Genia, BioCreAtiVe

23
Conclusions
  • Contributions
  • Developed a large amount of background knowledge
  • Exploited normally discarded clauses
  • Visually presented the precision-recall trade-off
  • Proposed Work
  • Achieve gains in high-recall areas
  • Reduce overfitting when using enriched data
  • Increase diversity of learned clauses

24
Acknowledgements
  • USA DARPA Grant F30602-01-2-0571
  • USA Air Force Grant F30602-01-2-0571
  • USA NLM Grant 5T15LM007359-02
  • USA NLM Grant 1R01LM07050-01
  • UW Condor Group
  • David Page, Vitor Santos Costa, Ines Dutra,
    Soumya Ray, Marios Skounakis, Mark Craven, Burr
    Settles, Jesse Davis