AKANE System: Protein-Protein Interaction Pairs

1
AKANE System: Protein-Protein Interaction Pairs
  • Rune Sætre, K Yoshida, A Yakushiji, Y
    Matsubayashi, Y Miyao, T Ohta
  • Tsujii-Lab, University of Tokyo
  • BioCreAtIvE II, 28-May-09

2
Overview
  • System Description
  • Evaluation
  • Results
  • Error Analysis
  • Future Work

3
Our System
  • Core component: AKANE PPI System
  • Trained on the AImed corpus
  • Other components:
  • Sentence splitter (Matsubayashi)
  • Protein name identifier (Yoshida)
  • Parser: Enju (Miyao et al., 2005)

4
Overview of the system
[Figure: system pipeline diagram. Data flow: Input Text, POS-tagged Text, Syntax Parsed Text, Named Entity Tagged Text, Co-occurrences, Interaction annotations. Components: HTML-stripping sentence splitter, POS tagger, Enju Parser, Named Entity Identifier, AKANE, Interaction Selector. Resources: GENIA corpus, Penn Treebank, GENA Dictionary, UniProt database, AIMED, BioCreative Training Data.]
5
AKANE: Acquisition of Information Extraction Patterns
  • Acquiring patterns on predicate-argument
    structures from training data (Yakushiji et al.,
    2006)
  • Training data: interacting proteins are provided

CD4 protein interacts with non-polymorphic regions of MHCII .
[Figure: pattern on the predicate-argument structure of the sentence above. The two protein mentions are marked ENTITY1 and ENTITY2. Nodes: CD4, protein, interact, with, non-polymorphic, region, of, MHCII. Edge labels: arg1, arg2, argM.]
6
Our Approach
  • Two interfaces to the AKANE PPI System:
  • Change BioCreative2 training data into AImed-like
    training data to train the system on full-text
    interactions
  • Use the AKANE System rules that were extracted
    from AImed abstracts earlier
  • All 3 runs were optimized for a good F-score on the
    training data
  • Only the most likely ProtID for each text
    segment
  • Only 20 pairs suggested for each article
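
The per-article cutoff can be pictured as a simple ranking step. The sketch below is an illustration only, not the submitted system: the function name `top_pairs`, the tuple layout, and the scores are hypothetical, and only the limit of 20 pairs per article comes from the slide.

```python
from typing import List, Tuple

# (protein A, protein B, confidence); a hypothetical layout for illustration
ScoredPair = Tuple[str, str, float]


def top_pairs(candidates: List[ScoredPair], limit: int = 20) -> List[ScoredPair]:
    """Keep the `limit` highest-scoring candidate pairs for one article."""
    ranked = sorted(candidates, key=lambda pair: pair[2], reverse=True)
    return ranked[:limit]


# Toy candidates with made-up scores, for illustration only.
candidates = [(f"P{i}", f"Q{i}", i / 100) for i in range(30)]
selected = top_pairs(candidates)
print(len(selected))
```

A fixed cutoff trades recall for precision: articles with more than 20 true pairs cannot be fully covered, but low-confidence suggestions are kept out of the submission.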

7
Results
  • Run 1: BioCreative
  • F-score of 10.5 (P = 8.2 and R = 14.6)
  • Run 2: BioCreative, species filter
  • F-score of 13.7 (P = 10.6 and R = 19.1)
  • Run 3: AImed, species filter
  • F-score of 15.8 (P = 15.7 and R = 15.9)
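
As a cross-check, the balanced F-score is the harmonic mean of precision and recall, and it can be recomputed from the rounded P and R values on this slide. This is a minimal sketch; the function name `f_score` is an assumption and not part of the AKANE system.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall as reported for the three runs.
runs = {
    "Run 1": (8.2, 14.6),
    "Run 2": (10.6, 19.1),
    "Run 3": (15.7, 15.9),
}

for name, (p, r) in runs.items():
    print(f"{name}: F = {f_score(p, r):.1f}")
```

With these rounded inputs the sketch reproduces 10.5 and 15.8 for Runs 1 and 3 and gives 13.6 for Run 2. The small gap to the reported 13.7 is presumably a rounding effect in the published P and R.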

8
Evaluation
9
Training Data (co-occurrences) Evaluation
  • Co-occurrence = interaction?
  • 28 sentences / 47 pairs (59 including FNs)
  • 25 pairs (53%) TP: 53% precision
  • 22 pairs (47%) FP
  • 8 sentences / 12 pairs were FNs
  • 25 / (25 + 12) = 68% recall
  • Problem: training data should be complete!
  • FN according to BioCreative's definition of
    verified interactions?
  • Bad NE identification. Example (PMID 11447115):
    The cell polarity protein ASIP / PAR-3 directly
    associates with junctional adhesion molecule
    (JAM)
  • Conclusion: transformation of training data into
    co-occurrence sentences leads to low-quality
    training data for the AKANE system.
  • file//C/biocreative2/erroranalysis/Ohta_evaluation.xml
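
The precision and recall figures on this slide follow directly from the pair counts. The short sketch below recomputes them; the variable names are illustrative, and the counts are the ones given above (25 TP, 22 FP, 12 FN).

```python
tp = 25  # co-occurring pairs judged to be real interactions
fp = 22  # co-occurring pairs judged not to be interactions
fn = 12  # interacting pairs missing from the co-occurrence data

precision = tp / (tp + fp)  # 25 of the 47 extracted pairs
recall = tp / (tp + fn)     # 25 of the 37 interacting pairs

print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```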

10
Species Evaluation
  • Articles given in TrainSet: 740
  • Collection pairs (perfect NER) in TrainSet: 3085
  • Interacting proteins in TrainSet: 3221
  • Species/Interactions (10 representatives)

Species or species pair   Count
9tryp                        14
9tryp - 9tryp                12
arath                       150
arath - arath               141
bovin                        17
bovin - bovin                16
caeel                       138
caeel - caeel               133
chick                        15
chick - chick                13
drome                       136
drome - drome               128
ecoli                        24
ecoli - ecoli                22
human                      1510
human - hpv16                34
human - human              1289
human - mouse                68
human - rat                  24
mouse                       351
mouse - human                58
mouse - mouse               258
rat                         122
rat - human                  23
rat - mouse                  11
rat - rat                    79
schpo                        74
schpo - schpo                68
syny3                        73
syny3 - syny3                71
xenla                        22
xenla - xenla                21
yeast                       182

11
Summary
  • A small high-quality corpus (AImed, Run 3)
    produced better results than a large
    automatically generated (50% correct) corpus
    (Runs 1 and 2)
  • Machine learning will improve the results
  • AImed cross-validation gave an F-measure of 35
  • Machine learning raised the results to an
    F-measure of 58 (on sentence-level evaluation)
  • We will see if such a gain is possible for the
    BioCreative data as well
  • A good corpus is needed for training

12
Summary / Message
  • Deep Syntactic Parsing produced good results in
    the BioCreative challenge (1st Quartile)
  • The more you know about the structure of the
    language, the more information you can extract
    from it

13
Future Improvements
  • Granularity
  • Sentence, Paragraph, Section or Paper as the unit
    of PPI detection
  • AKANE is sentence-based
  • Doesn't work well on document level?
  • Idea: all co-occurrence sentences per pair form
    one unit!
  • Improved NE identification needed
  • 70% NER F-score → 49% pair F-score expected
    (0.70 × 0.70 = 0.49: both proteins of a pair
    must be identified)
  • Include (more) Machine Learning

14
Questions or Comments?
  • Thank You for Listening!
  • Acknowledgements
  • Yakushiji Akane, Yoshida Kazuhiro, Miyao Yusuke,
    Ohta Tomoko, Matsubayashi Yuichiro and Jin-Dong
    Kim
  • BioCreative Organizers
  • Jörg Hakenberg, for his species idea

15
AIMED
  • Consists of 225 Medline abstracts
  • 200 are known to describe interactions between
    human proteins
  • 25 do not refer to any interaction
  • 4084 protein references
  • 1000 tagged interactions

2.1 in http://www.cs.utexas.edu/users/ml/papers/bionlp-aimed-04.pdf
5.1 in http://books.nips.cc/papers/files/nips18/NIPS2005_0450.pdf
16
References
  • Miyao et al., 2005: Y. Miyao, T. Ninomiya, and
    J. Tsujii. Corpus-oriented grammar development
    for acquiring a Head-driven Phrase Structure
    Grammar from the Penn Treebank. In Natural
    Language Processing - IJCNLP 2004, volume 3248 of
    LNAI, pages 684-693. Springer-Verlag, 2005.
  • Yakushiji et al., 2006: Automatic Construction
    of Predicate-argument Structure Patterns for
    Biomedical Information Extraction. EMNLP 2006,
    poster.

18
BioCreative2 - Definition of PPI
  • PPI: Protein-Protein Interaction
  • IntAct and MINT curate all interactions of
    interaction type (MI:0190)
  • Colocalisation (MI:0403)
  • Physical interaction (MI:0218) (direct)
  • Binding or Reaction
  • Acetylation, Cleavage, Phosphorylation,
    Methylation
  • Lipid Addition, DNA Strand Elongation
  • Not included:
  • Genetic interactions
  • Predicted interactions, Speculation

MI ontology (file)
19
BioCreative Training Data
  • Example: PMID 9045636
  • pmid1 detection1 interactorName1interactorXref1
    interactorName2interactorXref2
    interactionType1
  • 9045636 coip cbl_humanP22681
    sam68_humanQ07666 physical interaction
  • 9045636 coip fyn_humanP06241
    sam68_humanQ07666 ...
  • 9045636 coip grb2_humanP62993
    sam68_humanQ07666 ...
  • 9045636 coip lyn_humanP07948
    sam68_humanQ07666 ...
  • 9045636 coip sam68_humanQ07666
    grb2_humanP62993 ...
  • 9045636 coip sam68_humanQ07666
    jak3_humanP52333 ...
  • 9045636 coip sam68_humanQ07666
    lck_humanP06239 ...
  • 9045636 coip sam68_humanQ07666
    plcg1_humanP19174 ...
  • 9045636 coip sam68_humanQ07666
    ptn6_humanP29350 ...
  • 9045636 pull down fyn_humanP06241
    sam68_humanQ07666 ...
  • 9045636 pull down sam68_humanQ07666
    fyn_humanP06241 ...
  • 9045636 pull down sam68_humanQ07666
    lck_mouseP06240 ...
  • 9045636 pull down sam68_humanQ07666
    lck_mouseP06240 ...

20
Training Data Creation
  • Given by BioCreative:
  • Interaction sets, from 1 to many (66) proteins
  • Spokes model: baits and preys
  • Spokes is the most important evaluation type,
    but is not included in the training data
  • Matrix model: all (66) proteins interact
    pair-wise (2145 pairs)
  • We used the matrix model and extracted
    co-occurrence sentences
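
The difference between the two models is easy to see in a few lines of code. The sketch below is an illustration under assumptions: the function names `matrix_pairs` and `spoke_pairs` and the placeholder protein names are hypothetical, and only the set size of 66 comes from the slide.

```python
from itertools import combinations
from typing import List, Tuple


def matrix_pairs(proteins: List[str]) -> List[Tuple[str, str]]:
    """Matrix model: every protein in the set is paired with every other one."""
    return list(combinations(sorted(proteins), 2))


def spoke_pairs(bait: str, preys: List[str]) -> List[Tuple[str, str]]:
    """Spokes model: the bait is paired with each prey, and nothing else."""
    return [(bait, prey) for prey in preys]


# Placeholder names; a real interaction set would hold protein identifiers.
proteins = [f"prot{i:02d}" for i in range(66)]

print(len(matrix_pairs(proteins)))                  # 66 * 65 / 2 unordered pairs
print(len(spoke_pairs(proteins[0], proteins[1:])))  # one pair per prey
```

For the largest set the matrix model yields 2145 candidate pairs where a single-bait spokes model would yield 65, which is one way such automatically generated training data can end up containing many pairs that do not really interact.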

21
Predicate Argument Structure
[Figure: Enju parse tree (NP, VP, DT, AJ, AV nodes) with predicate-argument relations (ARG1, ARG2, MOD) for the sentence below.]
A normal serum CRP measurement does not exclude
deep vein thrombosis.
23
Ideas
  • Rhetorical Zoning
  • Use only Abstract, Methods or other (composed)
    parts?

24
Example: PMID 11447115
  • The cell polarity protein ASIP / PAR-3 directly
    associates with junctional adhesion molecule
    (JAM)
  • Correct: pard3_ratQ9Z340 - jam1_ratQ9JHY1

25
Task Specification
  • Describe your system
  • novel features (parsing, patterns)
  • error analysis: understanding the problem
  • the other task: IAS
  • relationship between the systems.
  • (funny figure!)

26
Extra
  • Show actual sentence with two protein names, the
    confidence scores, and corresponding pair
    scores.
  • Prot A activate prot B
  • ListA ListB PairList (long!)

27
Results / Last Minute Ideas
  • Training evaluation: low F-measure, around 10
    (balanced)
  • Use interactions only between proteins within the
    same species?
  • Not between, for example:
  • I2C2_MOUSE (Q8CJG0)
  • DICER_HUMAN (Q9UPY3)
  • F-measure increased from 10 to 14
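
A same-species filter of this kind could look like the sketch below. It is a minimal illustration, not the submitted code: the function names are hypothetical, and it assumes the species can be read from the suffix after the last underscore in identifiers such as DICER_HUMAN.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def species_of(protein_id: str) -> str:
    """Return the species suffix of an identifier such as 'DICER_HUMAN'."""
    return protein_id.rsplit("_", 1)[-1].upper()


def same_species_only(pairs: Iterable[Pair]) -> List[Pair]:
    """Drop every candidate pair whose two proteins come from different species."""
    return [(a, b) for a, b in pairs if species_of(a) == species_of(b)]


pairs = [
    ("I2C2_MOUSE", "DICER_HUMAN"),  # cross-species pair, filtered out
    ("cbl_human", "sam68_human"),   # same-species pair, kept
]
kept = same_species_only(pairs)
print(kept)
```

The filter is a heuristic: the training data does contain some cross-species pairs (for example human - mouse), so it gives up a little recall in exchange for fewer false positives.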

28
Alternative Directions
  • A Classification of Pair-Text Task?
  • Group all sentences containing one or both of
    the proteins for each pair
  • Interaction or not? Training and Predicting
  • Based on MUC or SIGIR approaches/systems
  • Negative examples? Remaining Sentences?
  • Abstracts from IAS?
  • Without negative examples: probabilistic model?
  • Need good negative examples!
  • Inter-annotator agreement?
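
The grouping step proposed here, collecting all sentences that mention one or both proteins of a pair, can be sketched as follows. This is an illustration under assumptions: the function name `group_sentences` is hypothetical, and plain substring matching stands in for real protein name identification.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def group_sentences(sentences: List[str], pairs: List[Pair]) -> Dict[Pair, List[str]]:
    """For each pair, collect every sentence that mentions at least one of its proteins."""
    groups: Dict[Pair, List[str]] = defaultdict(list)
    for pair in pairs:
        for sentence in sentences:
            # Naive substring test; a real system would use NE identification.
            if any(name in sentence for name in pair):
                groups[pair].append(sentence)
    return dict(groups)


sentences = [
    "Yeast two-hybrid interaction between MDM2 and MDMX proteins .",
    "The cell polarity protein ASIP / PAR-3 directly associates with "
    "junctional adhesion molecule (JAM)",
]
pairs = [("MDM2", "MDMX"), ("PAR-3", "JAM")]
groups = group_sentences(sentences, pairs)
print({pair: len(found) for pair, found in groups.items()})
```

Each group would then be one training or prediction instance for an "interaction or not?" classifier, which is where the question of good negative examples arises.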

29
Our Submission
  • Two Interfaces to the AKANE System
  • Change BioCreative2 training data into AIMED-like
    training data to train the system on full-text
    interactions
  • Run numbers 1 and 2
  • Use the AKANE System that was trained on
    interactions from MEDLINE abstracts (AIMED
    corpus)
  • Run number 3
  • All runs optimized for balanced F-score
  • Show results
  • Excel and PDF

30
Complex Merging of Patterns
  • Hierarchy
  • General Root Node
  • Specified Child Nodes
  • Rewrite in C
  • Use some LiLFeS predicates?
  • Which Format?

31
Plural Forms
  • 1) , the C-terminal regions of the MDM2 and MDMX
    proteins likely play an important role in their
    interaction .
  • Yeast two-hybrid interaction between MDM2 and MDMX proteins .

32
Plural Forms
  • How to treat plural forms?
  • Are they proteins?
  • Or families?
  • Necessary to distinguish?
  • How?

33
AKANE System Challenges
  • Memory Boundaries exceeded in merging of
    patterns
  • Rewrite in C
  • Infinite Loops encountered while splitting the
    patterns
  • Because of bad sentence splitting
  • Adapt sentence splitter to BioCreative domain?
  • Examples on following slides

34
Sentence-Splitting - Patterns
  • The primary reason for pattern splitting failure
    is bad sentence splitting
  • In contrast , the suppression effect of double
    knockdown appears much greater than the sum of
    effects of single knockdown of dicer and
    eIF2C1.(B and C) Coimmunoprecipitation
    experiments of Dicer and either eIF2C1 or
    pair1 eIF2C2 with anti- Dicer or anti- Myc
    antibodies .

35
  • From set0:

    ((1 'enjutypes'conj_relation
        'enjutypes''PRED'\('mayzlexentry'lex_entry
          'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
            'papatenju_pos''ENTITY'\"(entity1)"
            'papatenju_pos''PROTEIN'\"protein"
            'mayzword''BASE_POS'\"CC" ))
        'enjutypes''MODARG'\((2 'enjutypes'noun_relation
          'enjutypes''PRED'\('mayzlexentry'lex_entry
            'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
              'papatenju_pos''PROTEIN'\""
              'mayzword''BASE'\"eif2c1"
              'mayzword''BASE_POS'\"NN" )))))) ,
    ((3 'enjutypes'noun_relation
        'enjutypes''PRED'\('mayzlexentry'lex_entry
          'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
            'papatenju_pos''ENTITY'\"(entity2)"
            'papatenju_pos''PROTEIN'\"protein"
            'mayzword''BASE_POS'\"NN" )))) ,
    3, ('enjutypes'coordination_relation
        'enjutypes''PRED'\('mayzlexentry'lex_entry
          'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
            'papatenju_pos''PROTEIN'\""
            'mayzword''BASE'\"or"
            'mayzword''BASE_POS'\"CC" ))
        'enjutypes''ARG1'\2
        'enjutypes''ARG2'\3
    )
    , 2
    , 1
    , 1
    , 1
    .

36
Summary
37
Astra Zeneca
  • ESR1 (2099) IGF1R (3480) query with gProt
  • http://furu.idi.ntnu.no:8080/gprot/index.y
  • 51 75 Facts extracted Around 10 GO terms
  • Parsing is needed to further improve and filter
    the results
  • Show Examples

38
Original Subtasks, No PDF-work
39
Presentation 16/11-06
  • Show false and positive co-occurrences with
    percentage given
  • Show system architecture figure

40
TODO
  • Change to IDs from accession numbers (300k IDs,
    but only 200k accession numbers)
  • Less Ambiguous Term Identifiers
  • About pattern types
  • Two, Single, Double, Naive
  • Which is better? Make more patterns?

41
Biocreative2 Participants
  • Gene Mention Task: 20 groups
  • Gene Normalization Task: 20 groups
  • Protein-Protein Interaction Task: 26 groups
  • 19 teams, Article/Abstract (IAS) subtask
  • 16 teams, Pairs (IPS) subtask
  • 11 teams, Sentences (ISS) subtask
  • 2 teams, Method (IMS) subtask

42
Overview
  • BioCreative2
  • PPI Definition
  • Our System
  • Co-occurrence Evaluation
  • Alternative Directions
  • Summary