ISMB 2003 presentation Extracting Synonymous Gene and Protein Terms from Biological Literature

Transcript and Presenter's Notes

1
ISMB 2003 presentation Extracting Synonymous
Gene and Protein Terms from Biological Literature
  • Hong Yu and Eugene Agichtein

Dept. of Computer Science, Columbia University, New York, USA
{hongyu, eugene}@cs.columbia.edu, 212-939-7028
2
Significance and Introduction
  • Genes and proteins are often associated with
    multiple names
  • Apo3, DR3, TRAMP, LARD, and lymphocyte associated
    receptor of death
  • Authors often use different synonyms
  • Information extraction benefits from identifying
    those synonyms
  • Synonym knowledge sources are not complete
  • Developing automated approaches for identifying
    gene/protein synonyms from literature

3
Background-synonym identification
  • Semantically related words
  • Distributional similarity [Lin 98] [Li and Abe 98]
    [Dagan et al. 95]
  • Example: beer and wine
  • Both occur near drink, people, bottle and make
  • Mapping abbreviations to full forms
  • Map LARD to lymphocyte associated receptor of
    death
  • [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu
    and Friedman 03] [Pakhomov 02] [Park and Byrd 01]
    [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu
    et al. 02]
  • Methods for detecting biomedical multiword
    synonyms
  • Sharing one or more words [Hole 00]
  • cerebrospinal fluid → cerebrospinal fluid protein
    assay
  • Information retrieval approach
  • Trigram matching algorithm [Wilbur and Kim 01]
  • Vector space model
  • cerebrospinal fluid → cer, ere, ..., uid
  • cerebrospinal fluid protein assay → cer, ere, ..., say
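The trigram idea above can be sketched as follows. This is a minimal illustration, not the algorithm of [Wilbur and Kim 01]: it assumes trigrams are taken within each word, uses raw counts as vector weights, and compares terms by cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter

def char_trigrams(term):
    """Count character trigrams inside each word of a term (assumed scheme)."""
    grams = Counter()
    for word in term.lower().split():
        for i in range(len(word) - 2):
            grams[word[i:i + 3]] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

short_term = char_trigrams("cerebrospinal fluid")
long_term = char_trigrams("cerebrospinal fluid protein assay")
score = cosine(short_term, long_term)  # high, because all short-term trigrams are shared
```

Under this scheme the shorter term contributes cer, ere, ..., uid and the longer term adds trigrams up to say, so the two vectors overlap heavily without being identical.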

4
Background-synonym identification
  • GPE [Yu et al. 02]
  • A rule-based approach for detecting synonymous
    gene/protein terms
  • Manually recognize patterns authors use to list
    synonyms
  • Apo3/TRAMP/WSL/DR3/LARD
  • Extract synonym candidates and apply heuristics to
    filter out unrelated terms
  • e.g., ng/kg/min
  • Advantages and disadvantages
  • High precision (90%)
  • Recall might be low; rules are expensive to build
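A rule of this kind might look like the sketch below. It is not GPE's actual rule set: the slash-list regular expression, the unit stop list, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: three or more slash-separated tokens.
SLASH_LIST = re.compile(r"\b[A-Za-z0-9-]+(?:/[A-Za-z0-9-]+){2,}")

# Hypothetical stop list of measurement-unit tokens.
UNIT_TOKENS = {"ng", "mg", "kg", "g", "ml", "l", "min", "h", "s", "day"}

def slash_synonym_candidates(sentence):
    """Return slash-separated term groups, dropping groups made only of units."""
    groups = []
    for match in SLASH_LIST.finditer(sentence):
        terms = match.group(0).split("/")
        if all(t.lower() in UNIT_TOKENS for t in terms):
            continue  # e.g. ng/kg/min is a dose rate, not a synonym list
        groups.append(terms)
    return groups

# Toy sentence built from the two examples on the slide.
candidates = slash_synonym_candidates(
    "Apo3/TRAMP/WSL/DR3/LARD was given at 5 ng/kg/min."
)
```

Each hand-written pattern needs its own filter of this sort, which is one reason such a system is precise but costly to extend.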

5
Background-Machine-learning
  • Machine-learning reduces manual effort by
    automatically acquiring rules from data
  • Unsupervised and supervised
  • Semi-supervised
  • Bootstrapping [Hearst 92] [Yarowsky 95] [Agichtein
    and Gravano 00]
  • Hyponym detection [Hearst 92]
  • The bow lute, such as the Bambara ndang, is
    plucked and has an individual curved neck for
    each string.
  • A Bambara ndang is a kind of bow lute
  • Co-training [Blum and Mitchell 98]

6
Method-Outline
  • Machine-learning
  • Unsupervised
  • Similarity [Dagan et al. 95]
  • Semi-supervised
  • Bootstrapping
  • SNOWBALL [Agichtein and Gravano 02]
  • Supervised
  • Support Vector Machine
  • Comparison between machine-learning and GPE
  • Combined approach

7
Method--Unsupervised
  • Contextual similarity [Dagan et al. 95]
  • Hypothesis: synonyms have similar surrounding
    words
  • Mutual information
  • Similarity
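As a rough illustration of the hypothesis, the sketch below weights context words by positive pointwise mutual information and compares terms by cosine similarity. Both choices are assumptions standing in for the measures of [Dagan et al. 95], the window size is arbitrary, and the sentences are toy data.

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, targets, window=2):
    """Count words appearing within `window` tokens of each target term."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in targets:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tok][tokens[j]] += 1
    return counts

def pmi_vectors(counts):
    """Turn raw co-occurrence counts into positive-PMI weighted vectors."""
    total = sum(sum(c.values()) for c in counts.values())
    context_totals = Counter()
    for c in counts.values():
        context_totals.update(c)
    vectors = {}
    for term, c in counts.items():
        term_total = sum(c.values())
        vec = {}
        for ctx, n in c.items():
            pmi = math.log((n * total) / (term_total * context_totals[ctx]))
            if pmi > 0:
                vec[ctx] = pmi
        vectors[term] = vec
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus (assumption): apo3 and dr3 share contexts, p53 does not.
sentences = ["apo3 induces apoptosis", "dr3 induces apoptosis",
             "p53 binds dna", "p53 binds chromatin"]
vectors = pmi_vectors(context_counts(sentences, {"apo3", "dr3", "p53"}))
sim_synonyms = cosine(vectors["apo3"], vectors["dr3"])
sim_unrelated = cosine(vectors["apo3"], vectors["p53"])
```

On the toy corpus the two synonymous names score higher than the unrelated pair, which is the behavior the hypothesis predicts.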

8
Methods-semi-supervised
  • SNOWBALL [Agichtein and Gravano 02]
  • Bootstrapping
  • Starts with a small set of user-provided seed
    tuples for the relation, then automatically generates
    and evaluates patterns for extracting new tuples.

SNOWBALL bootstrapping example (figure):
Text occurrences: "Apo3, also known as DR3"; "DR3, also called LARD"
Patterns: "<GENE>, also known as <GENE>"; "<GENE>, also called <GENE>"
Tuples: (Apo3, DR3); (LARD, Apo3); (DR3, LARD)
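The bootstrapping loop can be sketched in a few lines. This is a heavily simplified illustration, not SNOWBALL itself: patterns here are exact middle strings rather than weighted term vectors, no pattern or tuple confidence is computed, and a hypothetical regular expression stands in for the gene tagger. The sentences are toy data.

```python
import re

# Hypothetical stand-in for a gene/protein name tagger.
GENE = r"([A-Za-z][A-Za-z0-9-]*)"

def learn_patterns(sentences, tuples):
    """Collect the text found between the two names of any known tuple."""
    patterns = set()
    for sentence in sentences:
        for a, b in tuples:
            m = re.search(re.escape(a) + r"(,? [a-z ]+? )" + re.escape(b), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(sentences, patterns):
    """Extract every name pair joined by one of the learned middle strings."""
    found = set()
    for sentence in sentences:
        for middle in patterns:
            for m in re.finditer(GENE + re.escape(middle) + GENE, sentence):
                found.add((m.group(1), m.group(2)))
    return found

def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns and extracting new tuples."""
    known = set(seeds)
    for _ in range(rounds):
        known |= apply_patterns(sentences, learn_patterns(sentences, known))
    return known

sentences = [
    "Apo3, also known as DR3, was studied.",
    "TRAMP, also known as WSL, was studied.",
    "TRAMP, also called WSL, was studied.",
    "DR3, also called LARD, was studied.",
]
synonyms = bootstrap(sentences, {("Apo3", "DR3")})
```

Starting from the single seed (Apo3, DR3), the first round learns ", also known as " and finds (TRAMP, WSL); the second round learns ", also called " from that new tuple and finds (DR3, LARD).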
9
Method--Supervised
  • Support Vector Machine
  • State-of-the-art text classification method
  • SVMlight
  • Training sets
  • The same sets of positive and negative tuples as
    SNOWBALL
  • Features: the same terms and term weights used by
    SNOWBALL
  • Kernel function
  • Radial basis function (RBF) kernel
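For reference, the standard RBF kernel is K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it on sparse term-weight vectors; the feature names, weights, and gamma value are made up for illustration and are not the settings used in the talk.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel over sparse vectors stored as {term: weight} dicts."""
    keys = set(x) | set(y)
    sq_dist = sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * sq_dist)

# Hypothetical context features for two candidate synonym pairs.
pair_a = {"also": 0.5, "known": 0.5, "as": 0.5}
pair_b = {"also": 0.5, "called": 0.7}

same = rbf_kernel(pair_a, pair_a)     # identical vectors give 1.0
different = rbf_kernel(pair_a, pair_b)
```

The kernel value falls from 1 toward 0 as two context vectors move apart, so candidate pairs with similar surrounding terms are treated as similar by the classifier.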

10
Methods-Combined
  • Rationale
  • Machine-learning approaches increase recall
  • The manual rule-based approach GPE has a high
    precision with lower recall
  • Combining them should boost both recall and precision
  • Method
  • Assume each system is an independent predictor
  • Prob = 1 - Prob(all systems extracted incorrectly)
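Under the independence assumption, the probability that every system is wrong is the product of the individual error probabilities. A minimal sketch, with a hypothetical function name and made-up confidence values:

```python
def combined_confidence(system_confidences):
    """Return 1 minus the probability that all systems are wrong,
    treating each system as an independent predictor."""
    prob_all_wrong = 1.0
    for p in system_confidences:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong

# Made-up confidences for one synonym pair from two systems.
combined = combined_confidence([0.9, 0.5])  # 1 - (0.1 * 0.5)
```

A pair found by several systems therefore scores at least as high as its best single score, and a pair found by no system scores 0.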

11
Evaluation-data
  • Data
  • GeneWays corpora [Friedman et al. 01]
  • 52,000 full-text journal articles
  • Science, Nature, Cell, EMBO, Cell Biology, PNAS,
    Journal of Biochemistry
  • Preprocessing
  • Gene/protein named-entity tagging
  • Abgene [Tanabe and Wilbur 02]
  • Segmentation
  • SentenceSplitter
  • Training and testing
  • 20,000 articles for training
  • Tuning SNOWBALL parameters such as context
    window, etc.
  • 32,000 articles for testing

12
Evaluation-metrics
  • Estimating precision
  • Randomly select 20 synonym pairs in each
    confidence-score range (0.0-0.1, 0.1-0.2, ..., 0.9-1.0)
  • Biological experts judged the correctness of
    synonym pairs
  • Estimating recall
  • SWISSPROT as the gold standard
  • 989 pairs of SWISSPROT synonyms co-appear in at
    least one sentence in the test set
  • Biological experts judged that 588 of these pairs
    were indeed synonyms
  • "... and cdc47, cdc21, and mis5 form another
    complex, which relatively weakly associates with
    mcm2"
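The two estimates could be computed along these lines. This is a sketch under assumptions: it samples up to 20 pairs from each of ten equal-width confidence bins and treats synonym pairs as unordered. The function names and data are hypothetical, not the authors' evaluation code.

```python
import random

def sample_by_confidence(scored_pairs, per_bin=20, bins=10, seed=0):
    """Draw up to `per_bin` pairs from each confidence range for expert review."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(bins)]
    for pair, score in scored_pairs:
        idx = min(int(score * bins), bins - 1)  # a score of 1.0 goes in the last bin
        buckets[idx].append(pair)
    return [rng.sample(b, min(per_bin, len(b))) for b in buckets]

def recall(extracted, gold):
    """Fraction of gold-standard pairs that were extracted, ignoring pair order."""
    gold_set = {frozenset(p) for p in gold}
    found = {frozenset(p) for p in extracted} & gold_set
    return len(found) / len(gold_set) if gold_set else 0.0

# Toy data (assumption): two gold pairs, one of them extracted in reverse order.
gold = [("Apo3", "DR3"), ("DR3", "LARD")]
toy_recall = recall([("DR3", "Apo3")], gold)

scored = [(("a", "b"), 0.05), (("c", "d"), 0.95), (("e", "f"), 1.0)]
samples = sample_by_confidence(scored)
```

Precision for a bin is then the share of its sampled pairs that the experts accept, and recall is measured against the expert-confirmed gold pairs.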

13
Results
  • Patterns SNOWBALL found
  • Of 148 evaluated synonym pairs, 62 (42%) were not
    listed as synonyms in SWISSPROT

14
Results
15
Results
16
Results
  • System performance

17
Conclusions
  • Extraction techniques can be used as a valuable
    supplement to resources such as SWISSPROT
  • Synonym extraction can be automated through
    machine-learning approaches
  • SNOWBALL can be applied successfully for
    recognizing the patterns