Title: AKANE System: Protein-Protein Interaction Pairs
1AKANE System: Protein-Protein Interaction Pairs
- Rune Sætre, K Yoshida, A Yakushiji, Y
Matsubayashi, Y Miyao, T Ohta
- Tsujii-Lab, University of Tokyo
- BioCreAtIvE II, 28-May-09
2Overview
- System Description
- Evaluation
- Results
- Error Analysis
- Future Work
3Our System
- Core Component AKANE PPI System
- Trained on the AImed corpus
- Other components
- Sentence Splitter (Matsubayashi)
- Protein Name Identifier (Yoshida)
- Parser ENJU (Miyao et al., 05)
4Overview of the system
[Figure: system architecture flowchart]
- Processing steps: HTML-stripping sentence splitter, POS tagger, Enju parser, named entity identifier, AKANE, interaction selector
- Data: Input Text, POS-tagged Text, Syntax Parsed Text, Named Entity Tagged Text, Co-occurrences, Interaction annotations
- Resources: GENIA corpus, Penn Treebank, GENA Dictionary, Uniprot DataBase, AIMED, BioCreative Training Data
5AKANE: Acquisition of Information Extraction Patterns
- Acquiring patterns on predicate argument structures from training data (Yakushiji et al., 2006)
- Training data: interacting proteins are provided
- Example: CD4 protein interacts with non-polymorphic regions of MHCII .
[Figure: pattern on predicate argument structure for the example sentence. Word nodes: CD4 (ENTITY1), protein, interact, with, non-polymorphic, region, of, MHCII (ENTITY2). Edge labels: arg1, arg2, argM.]
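As a rough illustration of what a pattern on predicate argument structures could look like in code, the sketch below stores the CD4 / MHCII example as labelled edges and takes the shortest chain of words linking ENTITY1 to ENTITY2. The edge list is an assumed reading of the figure, and `shortest_pattern` is a hypothetical helper, not the AKANE implementation.

```python
from collections import deque

# Assumed reading of the figure: (predicate, label, argument) triples.
EDGES = [
    ("CD4", "arg1", "protein"),
    ("interact", "arg1", "protein"),
    ("interact", "argM", "with"),
    ("with", "arg2", "region"),
    ("non-polymorphic", "arg1", "region"),
    ("of", "arg1", "region"),
    ("of", "arg2", "MHCII"),
]

def shortest_pattern(edges, start, goal):
    """Breadth-first search over the undirected word graph."""
    neighbours = {}
    for pred, _label, arg in edges:
        neighbours.setdefault(pred, []).append(arg)
        neighbours.setdefault(arg, []).append(pred)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two entities are not connected

pattern = shortest_pattern(EDGES, "CD4", "MHCII")
print(pattern)
```

Under these assumed edges the modifier "non-polymorphic" is not on the connecting path, so it would not become part of the pattern.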
6Our Approach
- Two Interfaces to the AKANE PPI System
- Change BioCreative2 training data into AImed-like training data to train the system on full-text interactions
- Use the AKANE System rules that were extracted from AImed abstracts earlier
- All 3 runs were optimized for good F-score on the training data
- Only the single most likely ProtID for each text segment
- Only 20 pairs suggested for each article
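The two output restrictions above (one ProtID per text segment, 20 pairs per article) can be sketched as simple ranking steps. The score dictionaries and the helper names `best_id` and `select_pairs` are hypothetical; how the system actually scores candidates is not shown in the slides.

```python
def best_id(candidates):
    """Return the ProtID with the highest score from a {ProtID: score} dict."""
    return max(candidates, key=candidates.get)

def select_pairs(scored_pairs, limit=20):
    """Keep the `limit` highest-scoring (protein_a, protein_b, score) tuples."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    return ranked[:limit]

# Made-up scores, for illustration only.
print(best_id({"P22681": 0.7, "Q07666": 0.3}))
print(select_pairs([("A", "B", 0.9), ("A", "C", 0.2), ("B", "C", 0.5)], limit=2))
```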
7Results
- Run1: BioCreative
- F-score of 10.5 (P = 8.2, R = 14.6)
- Run2: BioCreative, species filter
- F-score of 13.7 (P = 10.6, R = 19.1)
- Run3: Aimed, species filter
- F-score of 15.8 (P = 15.7, R = 15.9)
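For reference, the balanced F-score is the harmonic mean of precision and recall. The sketch below recomputes it from the P and R values listed for the three runs; the function name `f_score` is just for illustration.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

runs = {"Run1": (8.2, 14.6), "Run2": (10.6, 19.1), "Run3": (15.7, 15.9)}
for name, (p, r) in runs.items():
    print(name, round(f_score(p, r), 1))
```

Recomputing from the rounded P and R gives about 13.6 for Run2 rather than the reported 13.7; the difference is presumably due to rounding of the inputs.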
8Evaluation
9Training Data (co-occurrences) Evaluation
- Does co-occurrence imply interaction?
- 28 sentences / 47 pairs (59 including FNs)
- 25 pairs (53%) TP: 53% precision
- 22 pairs (47%) FP
- 8 sentences / 12 pairs were FNs
- 25 / (25 + 12) = 68% recall
- Problem: training data should be complete!
- FN according to BioCreative's definition of verified interactions?
- Bad NE identification. Example PMID 11447115: The cell polarity protein ASIP / PAR-3 directly associates with junctional adhesion molecule (JAM)
- Conclusion: transformation of training data into co-occurrence sentences leads to low-quality training data for the AKANE system.
- file//C/biocreative2/erroranalysis/Ohta_evaluation.xml
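The precision and recall figures on this slide follow directly from the pair counts (25 TP, 22 FP, 12 FN). A minimal sketch of that arithmetic, with `precision_recall` as an illustrative helper name:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall(tp=25, fp=22, fn=12)
print(f"precision={p:.0%} recall={r:.0%}")
```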
10Species Evaluation
- Articles given in TrainSet: 740
- Collection pairs (perfect NER) in TrainSet: 3085
- Interacting Proteins in TrainSet: 3221
- Species/Interactions (10 representatives)
- Species: 9tryp, Count: 14
- Species: 9tryp - 9tryp, Count: 12
- Species: arath, Count: 150
- Species: arath - arath, Count: 141
- Species: bovin, Count: 17
- Species: bovin - bovin, Count: 16
- Species: caeel, Count: 138
- Species: caeel - caeel, Count: 133
- Species: chick, Count: 15
- Species: chick - chick, Count: 13
- Species: drome, Count: 136
- Species: drome - drome, Count: 128
- Species: ecoli, Count: 24
- Species: ecoli - ecoli, Count: 22
- Species: human, Count: 1510
- Species: human - hpv16, Count: 34
- Species: human - human, Count: 1289
- Species: human - mouse, Count: 68
- Species: human - rat, Count: 24
- Species: mouse, Count: 351
- Species: mouse - human, Count: 58
- Species: mouse - mouse, Count: 258
- Species: rat, Count: 122
- Species: rat - human, Count: 23
- Species: rat - mouse, Count: 11
- Species: rat - rat, Count: 79
- Species: schpo, Count: 74
- Species: schpo - schpo, Count: 68
- Species: syny3, Count: 73
- Species: syny3 - syny3, Count: 71
- Species: xenla, Count: 22
- Species: xenla - xenla, Count: 21
- Species: yeast, Count: 182
11Summary
- A small high-quality corpus (Aimed, Run3) produced better results than a large automatically generated (50% correct) corpus (Run1 and Run2)
- Machine Learning will improve the results
- Aimed cross-validation gave 35% F-measure
- Machine Learning raised the results to 58% F-measure (on sentence-level evaluation)
- We will see if such a gain is possible for the BioCreative data as well
- A good corpus is needed for training
12Summary / Message
- Deep Syntactic Parsing produced good results in
the BioCreative challenge (1st Quartile)
- The more you know about the structure of the
language, the more information you can extract
from it
13Future Improvements
- Granularity
- Sentence, Paragraph, Section or Paper as the unit
of PPI detection
- AKANE is sentence-based
- Doesn't work well on document level?
- Idea: all co-occurrence sentences per pair form one unit!
- Improved NE identification needed
- 70% NER F-score → 49% pair F-score expected (0.7 × 0.7 = 0.49)
- Include (more) Machine Learning
14Questions or Comments?
- Thank you for listening!
- Acknowledgements
- Yakushiji Akane, Yoshida Kazuhiro, Miyao Yusuke, Ohta Tomoko, Matsubayashi Yuichiro and Jin-Dong Kim
- BioCreative Organizers
- Jörg Hakenberg, for his species idea
15AIMED
- Consists of 225 Medline abstracts
- 200 are known to describe interactions between
human proteins
- 25 do not refer to any interaction
- 4084 protein references
- 1000 tagged interactions
2.1 in http://www.cs.utexas.edu/users/ml/papers/bionlp-aimed-04.pdf
5.1 in http://books.nips.cc/papers/files/nips18/NIPS2005_0450.pdf
16References
- Miyao et al., 2005: Y. Miyao, T. Ninomiya, and J. Tsujii. Corpus-oriented grammar development for acquiring a Head-driven Phrase Structure Grammar from the Penn Treebank. In Natural Language Processing - IJCNLP 2004, volume 3248 of LNAI, pages 684-693. Springer-Verlag, 2005.
- Yakushiji et al., 2006: Automatic Construction of Predicate-argument Structure Patterns for Biomedical Information Extraction. EMNLP 2006, poster.
18BioCreative2 - Definition of PPI
- PPI: Protein-Protein Interaction
- IntAct and MINT curate all interactions of interaction type (MI:0190)
- Colocalisation (MI:0403)
- Physical interaction (MI:0218) (direct)
- Binding or Reaction
- Acetylation, Cleavage, Phosphorylation, Methylation
- Lipid Addition, DNA Strand Elongation
- Not included
- Genetic interactions
- Predicted interactions, Speculation
MI ontology (file)
19BioCreative Training Data
- Example PMID 9045636
- pmid1 detection1 interactorName1interactorXref1 interactorName2interactorXref2 interactionType1
- 9045636 coip cbl_humanP22681 sam68_humanQ07666 physical interaction
- 9045636 coip fyn_humanP06241 sam68_humanQ07666 ...
- 9045636 coip grb2_humanP62993 sam68_humanQ07666 ...
- 9045636 coip lyn_humanP07948 sam68_humanQ07666 ...
- 9045636 coip sam68_humanQ07666 grb2_humanP62993 ...
- 9045636 coip sam68_humanQ07666 jak3_humanP52333 ...
- 9045636 coip sam68_humanQ07666 lck_humanP06239 ...
- 9045636 coip sam68_humanQ07666 plcg1_humanP19174 ...
- 9045636 coip sam68_humanQ07666 ptn6_humanP29350 ...
- 9045636 pull down fyn_humanP06241 sam68_humanQ07666 ...
- 9045636 pull down sam68_humanQ07666 fyn_humanP06241 ...
- 9045636 pull down sam68_humanQ07666 lck_mouseP06240 ...
- 9045636 pull down sam68_humanQ07666 lck_mouseP06240 ...
20Training Data Creation
- Given from BioCreative
- Interaction sets, from 1 to many (66) proteins
- Spokes model: baits and preys
- Spokes is the most important evaluation type
- but not included in the training data
- Matrix model: all (66) proteins interact pairwise (2145 pairs)
- We used the matrix model and extracted co-occurrence sentences
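The two pairing models can be contrasted in a few lines. In the sketch below, `matrix_pairs` enumerates every unordered pair in an interaction set, which for 66 proteins gives the 2145 pairs quoted above, while `spoke_pairs` links one bait to each prey. The function names and the placeholder protein labels are illustrative assumptions.

```python
from itertools import combinations

def matrix_pairs(proteins):
    """Matrix model: every protein in the set pairs with every other one."""
    return list(combinations(sorted(proteins), 2))

def spoke_pairs(bait, preys):
    """Spokes model: the bait pairs with each prey, preys do not pair up."""
    return [(bait, prey) for prey in preys if prey != bait]

proteins = [f"P{i:02d}" for i in range(66)]  # placeholder labels
print(len(matrix_pairs(proteins)))           # 66 * 65 / 2
print(len(spoke_pairs(proteins[0], proteins[1:])))
```

The gap between the two counts suggests why matrix-model training data may contain many pairs that were never verified as direct interactions.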
21Predicate Argument Structure
[Figure: Enju parse tree with predicate-argument links (ARG1, ARG2, MOD) for the sentence below]
A normal serum CRP measurement does not exclude deep vein thrombosis.
23Ideas
- Rhetorical Zoning
- Use only Abstract, Methods or other (composed)
parts?
24Example PMID 11447115
- The cell polarity protein ASIP / PAR-3 directly
associates with junctional adhesion molecule
(JAM)
- Correct: pard3_ratQ9Z340 - jam1_ratQ9JHY1
25Task Specification
- Describe your system
- novel features (parsing, patterns)
- error analysis: understanding the problem
- the other task: IAS
- relationship between the systems.
- (funny figure!)
26Extra
- Show actual sentence with two protein names, the
confidence scores, and corresponding pair
scores.
- Prot A activate prot B
- ListA ListB PairList (long!)
27Results / Last Minute Ideas
- Training evaluation: low F-measure, around 10 (balanced)
- Use interactions only between proteins within the same species?
- Not between
- I2C2_MOUSE (Q8CJG0)
- DICER_HUMAN (Q9UPY3)
- F-measure increased from 10 to 14
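A same-species filter of this kind can be sketched from the protein entry names alone, assuming the species mnemonic is the part after the last underscore (as in I2C2_MOUSE and DICER_HUMAN). The helper names are hypothetical and this is not necessarily how the submitted runs implemented the filter.

```python
def species_of(entry_name):
    """Species mnemonic after the last underscore, lower-cased."""
    return entry_name.rsplit("_", 1)[-1].lower()

def same_species(protein_a, protein_b):
    return species_of(protein_a) == species_of(protein_b)

def filter_pairs(pairs):
    """Drop candidate pairs whose two proteins come from different species."""
    return [(a, b) for a, b in pairs if same_species(a, b)]

candidates = [("I2C2_MOUSE", "DICER_HUMAN"), ("CBL_HUMAN", "SAM68_HUMAN")]
print(filter_pairs(candidates))
```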
28Alternative Directions
- A Classification of Pair-Text Task?
- Group all sentences containing one or both of
the proteins for each pair
- Interaction or not? Training and Predicting
- Based on MUC or SIGIR approaches/systems
- Negative examples? Remaining Sentences?
- Abstracts from IAS?
- Without negative examples: probabilistic model?
- Need good negative examples!
- Interannotator agreement?
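The grouping step proposed above (all sentences containing one or both proteins of a pair become one unit) could look roughly like this. It assumes whitespace-tokenized sentences and exact name matches; `group_sentences` is an illustrative name and the sentences are made up.

```python
def group_sentences(sentences, pairs):
    """Map each pair to the sentences mentioning at least one of its proteins."""
    groups = {pair: [] for pair in pairs}
    for sentence in sentences:
        tokens = set(sentence.split())
        for a, b in pairs:
            if a in tokens or b in tokens:
                groups[(a, b)].append(sentence)
    return groups

sentences = [
    "MDM2 binds MDMX .",
    "MDM2 is degraded .",
    "Dicer is unrelated .",
]
groups = group_sentences(sentences, [("MDM2", "MDMX")])
print(groups[("MDM2", "MDMX")])
```

Each group would then be classified as interaction or not, which is where the open question about negative examples comes in.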
29Our Submission
- Two Interfaces to the AKANE System
- Change BioCreative2 training data into AIMED-like training data to train the system on full-text interactions
- Run number 1 and 2
- Use the AKANE System that was trained on
interactions from MEDLINE abstracts (AIMED
corpus)
- Run number 3
- All runs optimized for balanced F-score
- Show results
- Excel and PDF
30Complex Merging of Patterns
- Hierarchy
- General Root Node
- Specified Child Nodes
- Rewrite in C
- Use some LiLFeS predicates?
- Which Format?
31Plural Forms
- 1) , the C-terminal regions of the MDM2 and MDMX proteins likely play an important role in their interaction .
- Yeast two-hybrid interaction between MDM2 and MDMX proteins .
32Plural Forms
- How to treat plural forms?
- Are they proteins?
- Or families?
- Necessary to distinguish?
- How?
33AKANE System Challenges
- Memory Boundaries exceeded in merging of
patterns
- Rewrite in C
- Infinite Loops encountered while splitting the
patterns
- Because of bad sentence splitting
- Adapt sentence splitter to BioCreative domain?
- Examples on following slides
34Sentence-Splitting - Patterns
- Primary reason for pattern splitting failure is bad sentence splitting
- In contrast , the suppression effect of double knockdown appears much greater than the sum of effects of single knockdown of dicer and eIF2C1.(B and C) Coimmunoprecipitation experiments of Dicer and either eIF2C1 or pair1 eIF2C2 with anti- Dicer or anti- Myc antibodies .
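The example fails to split because the period in "eIF2C1.(B and C)" is followed directly by a figure-panel label with no space. One possible adaptation, shown here as a naive regular-expression sketch and not as the splitter actually used, is to also accept a boundary when sentence-final punctuation is immediately followed by an opening parenthesis or an uppercase letter. It would over-split abbreviations such as "Dr. Smith", so it is only a starting point.

```python
import re

# Naive boundary: sentence-final punctuation, optional whitespace,
# then an opening parenthesis or an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s*(?=[(A-Z])")

def split_sentences(text):
    """Split text at the naive boundaries and drop empty pieces."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

text = ("single knockdown of dicer and eIF2C1.(B and C) "
        "Coimmunoprecipitation experiments of Dicer")
for sentence in split_sentences(text):
    print(sentence)
```

Splitting on a zero-width match relies on the behaviour of `re.split` in Python 3.7 and later.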
35- From set0
- ((1 'enjutypes'conj_relation
  'enjutypes''PRED'\('mayzlexentry'lex_entry
  'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
  'papatenju_pos''ENTITY'\"(entity1)"
  'papatenju_pos''PROTEIN'\"protein"
  'mayzword''BASE_POS'\"CC" ))
- 'enjutypes''MODARG'\((2 'enjutypes'noun_relation
  'enjutypes''PRED'\('mayzlexentry'lex_entry
  'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
  'papatenju_pos''PROTEIN'\""
  'mayzword''BASE'\"eif2c1"
  'mayzword''BASE_POS'\"NN" )))))) ,
- ((3 'enjutypes'noun_relation
  'enjutypes''PRED'\('mayzlexentry'lex_entry
  'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
  'papatenju_pos''ENTITY'\"(entity2)"
  'papatenju_pos''PROTEIN'\"protein"
  'mayzword''BASE_POS'\"NN" )))) ,
- 3, ('enjutypes'coordination_relation
  'enjutypes''PRED'\('mayzlexentry'lex_entry
  'mayzlexentry''LEX_WORD'\('papatenju_pos'word_entity
  'papatenju_pos''PROTEIN'\""
  'mayzword''BASE'\"or"
  'mayzword''BASE_POS'\"CC" ))
- 'enjutypes''ARG1'\2
  'enjutypes''ARG2'\3
)
, 2
, 1
, 1
, 1
.
36Summary
37Astra Zeneca
- ESR1 (2099) IGF1R (3480) query with gProt
- http://furu.idi.ntnu.no:8080/gprot/index.y
- 51 75 Facts extracted Around 10 GO terms
- Parsing is needed to further improve and filter
the results
- Show Examples
38Original Subtasks, No PDF-work
39Presentation 16/11-06
- Show false and positive co-occurrences with percentage given
- Show system architecture figure
40TODO
- Change to ID from Accession numbers (300k id, but
only 200k AccNumber)
- Less Ambiguous Term Identifiers
- About pattern types
- Two, Single, Double, Naive
- Which is better? Make more patterns?
41Biocreative2 Participants
- Gene Mention Task: 20 groups
- Gene Normalization Task: 20 groups
- Protein-Protein Interaction Task: 26 groups
- 19 teams, Article/Abstract (IAS) subtask
- 16 teams, Pairs (IPS) subtask
- 11 teams, Sentences (ISS) subtask
- 2 teams, Method (IMS) subtask
42Overview
- BioCreative2
- PPI Definition
- Our System
- Co-occurrence Evaluation
- Alternative Directions
- Summary