ProtChew - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: ProtChew


1
ProtChew
  • Automatic Extraction of Protein Names from
    BioMedical Texts
  • Rune Sætre, NTNU

2
Outline
  • Introduction
  • Motivation: GeneTUC
  • Motivation: Challenges
  • The ProtChew System
  • Results
  • Summary

3
Acknowledgements
  • Project participants
  • Department of Computer and Information Science:
    Amund Tveit and Rune Sætre
  • Supervisor: Tore Amble
  • Department of Cancer Research and Molecular
    Medicine: Astrid Lægreid and Tonje Strømmen Steigedal
  • At the Norwegian University of Science and
    Technology (NTNU), Norway

4
GeneTUC Project Goals
  • Help biologists discover already known facts from
    the literature
  • Make computers understand bio-medical language
  • Provide a natural language interface to
    bio-molecular information

5
Challenges
  • No complete database of protein names exists
  • Goal: build such databases automatically

6
Workflow

7
Step 1: Data Selection
  • 7 million Medline abstracts
  • 12,000 about gastrin
  • Example:
  • Stimulation was effected by 1 mug/kg/h
    pentagastrin via drip infusion.

8
Step 2: (Text) Tokenization
  • WhiteSpaceTokenizer
  • From the Natural Language Toolkit
  • Python NLTK
  • Example:
  • 'Stimulation', 'was', 'effected', 'by', '1', 'mug/kg/h',
    'pentagastrin', 'via', 'drip', 'infusion', '.'
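
The tokenization step can be sketched in plain Python. This is a minimal illustration, not the NLTK tokenizer itself: `whitespace_tokenize` is a hypothetical helper, and the rule that detaches a sentence-final period is an assumption made so that the output matches the slide's example (plain whitespace splitting would leave "infusion." as one token).

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace, then detach a trailing period."""
    tokens = []
    for chunk in text.split():
        # Assumption: a "." ending a longer chunk is sentence punctuation.
        if len(chunk) > 1 and chunk.endswith("."):
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens


sentence = "Stimulation was effected by 1 mug/kg/h pentagastrin via drip infusion."
print(whitespace_tokenize(sentence))
```

Note that "mug/kg/h" survives as a single token because it contains no whitespace, which is why it later shows up as one unknown word.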

9
Step 3: POS Tagging
  • Brill tagger
  • Trained on the Brown Corpus (1 million words)
  • Based on initial unigram tagging
  • Example:
  • <Stimulation/None>, <was/bedz>, <effected/vbn>,
    <by/in>, <1/cd>, <mug/kg/h/None>, <pentagastrin/None>,
    <via/in>, <drip/nn>, <infusion/nn>, <./.>
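
The None tags in the example come from the initial unigram stage: a word never seen in training has no tag to look up. The sketch below shows only that lookup stage, not Brill's transformation rules, and it uses a tiny hypothetical training list in place of the Brown Corpus; `train_unigram` and `tag` are illustrative names.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_words):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for word, pos in tagged_words:
        counts[word][pos] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model):
    """Tag tokens by lookup; words unseen in training get None."""
    return [(token, model.get(token)) for token in tokens]


# Hypothetical toy training data with Brown-style tags (not the real corpus).
training = [
    ("was", "bedz"), ("effected", "vbn"), ("by", "in"), ("1", "cd"),
    ("via", "in"), ("drip", "nn"), ("drip", "vb"), ("drip", "nn"),
    ("infusion", "nn"), (".", "."),
]
model = train_unigram(training)
tokens = ["Stimulation", "was", "effected", "by", "1", "mug/kg/h",
          "pentagastrin", "via", "drip", "infusion", "."]
for token, pos in tag(tokens, model):
    print(f"<{token}/{pos}>")
```

Domain terms such as "pentagastrin" are absent from general-English training text, so they stay untagged, which motivates the later steps.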

10
Step 4: Porter Stemming
  • Too many unknown words remained after tagging
    with the Brown-trained tagger
  • Stemming helped only a little
  • Example:
  • <TEXT='Stimulation', STEM='Stimul'>
  • <TEXT='mug/kg/h', STEM='mug/kg/h'>
  • <TEXT='pentagastrin', STEM='pentagastrin'>

11
Step 5: Gsearch Tagging
  • Gsearch combines other databases
  • LocusLink, UniGene, Swiss-Prot
  • Hits in Name or Symbol → Part of Protein
    (PoP)
  • Hits in other fields → Not PoP
  • Example:
  • Pentagastrin: GSearch_Medline (4) (Not PoP)
  • mug/kg/h: None (Not Found)
  • Gastrin: GSearch_Name (4) (PoP)
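
The labeling rule above can be sketched as a lookup over records. Everything here is an assumption for illustration: the record layout, the field names, and the single toy record are hypothetical stand-ins, not the actual Gsearch schema or data.

```python
NAME_FIELDS = ("name", "symbol")

# Hypothetical toy record; the real system queries merged databases.
RECORDS = [
    {"name": "gastrin", "symbol": "GAST",
     "medline": "pentagastrin stimulates gastric acid secretion"},
]


def gsearch_label(term, records):
    """Return (field, label): PoP for name/symbol hits, Not PoP otherwise."""
    needle = term.lower()
    other_field = None
    for record in records:
        for field, value in record.items():
            if needle not in value.lower().split():
                continue
            if field in NAME_FIELDS:
                return field, "PoP"  # a name or symbol hit decides at once
            other_field = other_field or field
    if other_field is not None:
        return other_field, "Not PoP"
    return None, "Not Found"


for term in ("Pentagastrin", "mug/kg/h", "Gastrin"):
    print(term, gsearch_label(term, RECORDS))
```

Terms labeled PoP or Not PoP can then serve as training examples, while "Not Found" terms are the ones left for the classifier.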

12
Step 6: Feature Selection
  • HASBRACKET
  • HASFIRSTUPPER
  • HASNONALPHANUMPREFIX
  • ISLOWERCASE
  • ISNUMERIC
  • ISUPPERCASE
  • Text, POS
  • Example:
  • 0 0 0 1 0 0 pentagastrin GSearch_Medline (45)
  • 0 0 0 1 0 0 gastrin GSearch_Name (4)
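
The six binary features can be computed from the token string alone. The slides give only the feature names, so the definitions below are assumptions (for instance, which characters count as brackets); `orthographic_features` is a hypothetical helper.

```python
FEATURES = ["HASBRACKET", "HASFIRSTUPPER", "HASNONALPHANUMPREFIX",
            "ISLOWERCASE", "ISNUMERIC", "ISUPPERCASE"]


def orthographic_features(token):
    """Return the six 0/1 features in the order listed in FEATURES."""
    return [
        int(any(ch in "()[]{}" for ch in token)),     # HASBRACKET
        int(token[:1].isupper()),                     # HASFIRSTUPPER
        int(bool(token) and not token[0].isalnum()),  # HASNONALPHANUMPREFIX
        int(token.islower()),                         # ISLOWERCASE
        int(token.isdigit()),                         # ISNUMERIC
        int(token.isupper()),                         # ISUPPERCASE
    ]


for token in ("pentagastrin", "gastrin", "Gastrin", "1"):
    print(*orthographic_features(token), token)
```

Under these definitions "pentagastrin" and "gastrin" both yield 0 0 0 1 0 0, matching the example rows, which also shows that orthography alone cannot separate the two.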

13
Step 7: Classification
  • Trained with PoPs from Gsearch as positive
    examples (how many?)
  • Non-PoPs as negative examples (how many?)
  • Classified the rest using different
    classifiers
  • C4.5, SVM, others?

14
Step 8: Automatic Evaluation
  • Jack-knife test
  • 10-fold cross validation
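
A 10-fold cross-validation loop can be sketched without any library. The helper names and the `train_fn`/`predict_fn` callbacks are hypothetical; the slides do not say which tooling was used. Setting `k` equal to the number of examples gives the leave-one-out (jack-knife) variant.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(examples, labels, train_fn, predict_fn, k=10):
    """Average held-out accuracy over k folds (requires len(examples) >= k)."""
    scores = []
    for train, test in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train],
                         [labels[i] for i in train])
        correct = sum(predict_fn(model, examples[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Stand-in classifier: always predict the most frequent training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


examples = list(range(50))
labels = [1 if i % 5 == 0 else 0 for i in examples]
print(cross_validate(examples, labels, train_majority, predict_majority))
```

The majority-class stand-in is a useful baseline: with skewed classes it can reach high accuracy while finding no positives at all.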

15
Step 9: Expert Evaluation
  • Random sample of 200 Unknowns

16
Step 10: Post-Mortem Analysis
  • Where to go?

17
Results
  • Precision: 17.7%
  • Recall: 3.2%
  • F-score, 2PR/(P+R): 5.4%
  • Classification accuracy: 82.5%
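
The reported F-score follows from the precision and recall through the harmonic-mean formula. A short check, with `f_score` as an illustrative helper name:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(17.7, 3.2), 1))  # 5.4
```

Because the harmonic mean is dominated by the smaller value, the low recall keeps the F-score low despite the much higher classification accuracy.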

18
Future
  • Running the system on BioCreative Data
  • Results that can be compared between projects

19
Example Results
  • Stimulation was effected by 1 mug/kg/h
    pentagastrin via drip infusion.
  • gastrin

20
Example Texts
  • URL1: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=26&dopt=Abstract
  • URL2: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=491&dopt=Abstract