Predicting Gene Functions from Text Using a Cross-Species Approach

1
Predicting Gene Functions from Text Using a
Cross-Species Approach
  • Emilia Stoica and Marti Hearst, School of
    Information, University of California, Berkeley

Research Supported by NSF DBI-0317510 and a gift
from Genentech
2
Goal
  • Annotate genes with functional information
    derived from journal articles.

3
Gene Ontology (GO)
  • Gene Ontology (GO): a controlled vocabulary for
    functional annotation
  • 17,600 terms (circa July 2004)
  • Organized into 3 distinct acyclic graphs:
    • molecular functions
    • biological processes
    • cellular locations
  • More general terms are parents of less general
    terms
    • development (GO:0007275) is the parent of
      embryonic development (GO:0001756)
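
The parent/child relation can be sketched as a small graph walk. This is a minimal illustration, not the GO file format: the `PARENTS` dictionary and the `ancestors` helper are hypothetical names, and the only data used is the pair of terms from the slide.

```python
# Toy child -> parents map. The two identifiers come from the slide;
# the dictionary layout is a hypothetical illustration.
PARENTS = {
    "GO:0001756": {"GO:0007275"},  # embryonic development -> development
    "GO:0007275": set(),           # no parents recorded in this toy map
}


def ancestors(term, parents=PARENTS):
    """Return every more general term reachable from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Because a term may have several parents, the walk keeps a `seen` set so that shared ancestors are visited only once.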

4
Challenges
  • GO tokens might not appear explicitly
    • Example: PubMed 10692450
    • GO:0008285, negative regulation of cell
      proliferation
    • Occurs as "inhibition of cell
      proliferation"
  • GO tokens might not occur contiguously
    • Example: PubMed 10734056
    • GO:0007186, G-protein coupled receptor protein
      signaling pathway
    • Occurs as:
      "Results indicate that CCR1-mediated responses
      are regulated in the signaling pathway, by
      receptor phosphorylation at the level of
      receptor G/protein coupling CCR1 binds MIP-1
      alpha."

5
Challenges
  • The simplest strategy (assigning GO codes to
    genes simply because the GO tokens occur near the
    gene) yields a large number of false positives.
  • Issues:
  • The text does not contain evidence to support the
    annotation,
  • The text contains evidence for the annotation,
    but the curator knows the gene to be involved in
    a function that is more general or more specific
    than the GO code matched in text.

6
Challenges
  • GO contains hints about what kinds of evidence
    are required for annotation, e.g.
  • The text should mention co-purification,
    co-immunoprecipitation experiments
  • Requiring these evidence terms does not seem to
    improve algorithms.

7
Related Work
  • Mainly in the context of BioCreative competition
    (2004)
  • Chiang and Yu 2003, 2004
  • Find phrase patterns commonly used in sentences
    describing gene functions
  • (e.g., "gene plays an important role in", "gene
    is involved in")
  • Final assignments made with a Naïve Bayes
    classifier
  • Ray and Craven 2004, 2005
  • Learn a statistical model for each GO code (which
    words are likely to co-occur in the paragraphs
    containing GO codes)
  • Decide among candidates via a multinomial Naïve
    Bayes classifier
  • Rice et al. 2004
  • Train an SVM for each GO code.
  • Target genes assigned best-scoring GO code.

8
Related Work, cont.
  • Couto et al. 2004
  • Determine if the information content of the
    matching GO terms is larger than for all the
    candidate GO terms.
  • Verspoor et al. 2004
  • Expand GO tokens with words that frequently
    co-occur in a training set; use a categorizer
    that explores the structure of the Gene Ontology
    to find best hits.
  • Ehrler and Ruch 2004
  • Treat each document as a query to be categorized
  • Create a score based on a combination of pattern
    matching and TF-IDF weighting
  • Annotate gene with top-scoring GO codes.

9
Our Approach
  • Two main contributions
  • Use cross-species information (CSM)
  • Check for biological (in)consistencies (CSC)

10
Cross-Species Match: Main Idea
  • Use orthologous genes
  • Genes of different species that have evolved
    directly from a common ancestor.
  • Assumption
  • Since there is an overlap between the genomes of
    the two species, their orthologs may share some
    functions, and consequently some GO codes
  • Idea: to predict GO codes for target genes in
    the target species, use the GO codes assigned to
    their orthologous genes
  • We use Mouse vs. Human genes

11
General procedure
  • Analyze text at sentence level
  • Eliminate stop words, punctuation characters and
    divide the text into tokens using space as
    delimiter
  • Normalize and match different variations of gene
    names using the algorithm of Bhalotia et al. '03
  • For every sentence that contains the target gene:
  • A GO code is matched if the fraction of its
    tokens found in the sentence reaches a threshold
    (0.75 for CSM and 1 for CSC)
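
The sentence-level matching step can be sketched as follows. This is an illustration under stated assumptions: the stop-word list is a tiny placeholder (the list actually used is not given in the slides), gene-name normalization is reduced to lowercasing rather than the Bhalotia et al. algorithm, and `tokenize`, `go_match_fraction` and `matches` are hypothetical names.

```python
import re

# Tiny placeholder stop-word list; the real list is not given in the slides.
STOP_WORDS = {"of", "the", "in", "a", "an", "and", "to", "by", "is", "at"}


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split, drop stop words."""
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def go_match_fraction(go_term, sentence):
    """Fraction of the GO term's tokens that occur anywhere in the sentence."""
    go_tokens = set(tokenize(go_term))
    if not go_tokens:
        return 0.0
    sentence_tokens = set(tokenize(sentence))
    return len(go_tokens & sentence_tokens) / len(go_tokens)


def matches(go_term, sentence, gene, threshold):
    """True if the sentence names the gene and reaches the token threshold."""
    if gene.lower() not in tokenize(sentence):
        return False
    return go_match_fraction(go_term, sentence) >= threshold
```

Working on token sets rather than phrases is what lets a GO term match even when its tokens are not contiguous in the sentence. It does not help when the text uses a synonym, as in the "inhibition of cell proliferation" case, where only two of the four GO tokens are present.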

12
Cross Species Match Algorithm
  • CSM(g, a): For a target gene g, search in article
    a for only the GO codes annotated to its ortholog
  • If at least 75% of the GO code terms are found in
    a sentence containing the gene name, the code is
    matched.
  • Note: we must eliminate annotations of orthologs
    marked with IEA and ISS codes to avoid circular
    references.
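
A minimal sketch of CSM under simplifying assumptions: tokenization is plain whitespace splitting, ortholog annotations are given as (GO id, evidence code) pairs, and `go_names` maps each GO id to its term text. The function and parameter names are hypothetical, not taken from the authors' system.

```python
# Evidence codes excluded on the slide to avoid circular references.
EXCLUDED_EVIDENCE = {"IEA", "ISS"}


def usable_ortholog_codes(ortholog_annotations):
    """Keep GO ids whose evidence code is neither IEA nor ISS."""
    return {
        go_id
        for go_id, evidence in ortholog_annotations
        if evidence not in EXCLUDED_EVIDENCE
    }


def csm(gene, sentences, ortholog_annotations, go_names, threshold=0.75):
    """Return the ortholog's GO ids that are matched in the article."""
    candidates = usable_ortholog_codes(ortholog_annotations)
    matched = set()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        if gene.lower() not in tokens:
            continue  # only sentences that mention the target gene count
        for go_id in candidates:
            go_tokens = set(go_names[go_id].lower().split())
            if go_tokens and len(go_tokens & tokens) / len(go_tokens) >= threshold:
                matched.add(go_id)
    return matched
```

Restricting the search to the ortholog's codes is the point of the method: a GO term that happens to appear near the gene name is ignored unless the ortholog already carries it with experimental evidence.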

13
Cross-Species Correlation: Main Idea
  • Observation
  • Since GO codes indicate gene function, it is
    logical for some to often co-occur in annotations
    and for others to rarely do so.
  • Assumption
  • If one GO code tends to occur in the orthologous
    genes' annotations when another one does not,
    then assume the second is not a valid assignment
    for the target species
  • Example
  • If text seems to contain evidence for rRNA
    transcription (GO:0009303), nucleolus (GO:0005737),
    and extracellular (GO:0005576), then
    extracellular is suspicious.
  • The algorithm identifies the suspicious cases.

14
Cross-Species Correlation Algorithm
  • For every pair of GO codes in the orthologous
    genes database, compute a χ² coefficient.
  • N: the total number of GO codes
  • O11: # of times the ortholog is annotated with
    both GO1 and GO2
  • O12: # of times the ortholog is annotated with
    GO1 but not GO2
  • O21: # of times the ortholog is annotated with
    GO2 but not GO1
  • O22: # of times the ortholog is not annotated
    with GO1 or GO2

χ² = N (O11·O22 − O12·O21)² / ((O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22))
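
The coefficient for one pair of GO codes can be sketched from the four counts. This assumes the standard χ² statistic for a 2×2 contingency table and takes N as the sum of the four cells, which may differ from how the authors define N; `pair_counts` and `chi_square_2x2` are hypothetical names, and each ortholog's annotations are represented as a set of GO ids.

```python
def pair_counts(annotation_sets, go1, go2):
    """Count orthologs annotated with both, only one, or neither GO code."""
    o11 = o12 = o21 = o22 = 0
    for codes in annotation_sets:
        has1, has2 = go1 in codes, go2 in codes
        if has1 and has2:
            o11 += 1
        elif has1:
            o12 += 1
        elif has2:
            o21 += 1
        else:
            o22 += 1
    return o11, o12, o21, o22


def chi_square_2x2(o11, o12, o21, o22):
    """Chi-square statistic for a 2x2 table; N is assumed to be the cell sum."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association is measurable
    return n * (o11 * o22 - o12 * o21) ** 2 / denom
```

With one degree of freedom, a value above 3.84 corresponds to significance at the 0.05 level.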
15
Cross-Species Correlation Algorithm
  • M(g,a): GO codes matched in article a for gene g
  • O(g): GO codes assigned to the ortholog of g
  • o: size of O(g); p: percentage (0.2)
  • For every potentially matching GO code GO1 in
    M(g,a):
    • For every GO code GO2 in O(g):
      • Count how often χ²(GO1,GO2) is significant
    • If this count is < p·o, then assume GO1 is not
      valid.
    • Else assign GO1 to g
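
The filtering step can be sketched as below. It is an illustration, not the authors' code: `chi2` is a caller-supplied function returning the coefficient for a pair of GO codes, 3.84 is assumed as the significance cutoff (the 0.05 critical value for one degree of freedom), and pairs where GO1 equals GO2 are skipped.

```python
CRITICAL_VALUE = 3.84  # assumed cutoff: chi-square, 1 degree of freedom, alpha 0.05


def csc(matched, ortholog_codes, chi2, p=0.2):
    """Keep matched GO codes that correlate with enough ortholog codes."""
    o = len(ortholog_codes)
    kept = set()
    for go1 in matched:
        count = sum(
            1
            for go2 in ortholog_codes
            if go2 != go1 and chi2(go1, go2) > CRITICAL_VALUE
        )
        # A code is rejected when count < p * o, and assigned otherwise.
        if count >= p * o:
            kept.add(go1)
    return kept
```

Note that with an empty ortholog set the threshold p·o is zero, so every matched code passes; a caller may want to treat that case separately.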

16
Information Flow
17
Evaluation using BioCreative
  • Task 2.2
  • Annotate 138 human genes with GO codes using 99
    full text articles
  • For each annotation, provide the passage of text
    that the annotation was based upon.
  • Annotations from participants were manually
    judged by human curators
  • A prediction was considered perfect if the text
    passage
  • contained the gene name, and
  • provided evidence for annotating the gene with
    the GO code

18
Results on BioCreative
  • Our research was conducted after the competition
    had passed, so our annotations could not be judged
    by the same curators
  • Used the perfect predictions
  • (unfair to our system: ignores relevant
    predictions we find that other systems do not)
  • Our prediction is correct if it matches a perfect
    prediction (e.g., vhl is annotated with
    transcription (GO:0006350) in PubMed 12169961:
    "vhl inhibits transcription elongation, mRNA
    stability and PKC activity")
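
Scoring against the set of perfect predictions can be sketched as a set comparison. The slides do not spell out the scoring procedure, so this assumes each prediction is a (gene, GO id, article id) triple and uses the usual definitions of precision, recall and F-measure; `precision_recall_f` is a hypothetical name.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-measure of predicted triples against gold ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Because only previously judged predictions count as gold, any correct annotation that no competition system produced is scored as a false positive, which is the unfairness the slide points out.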

19
BioCreative Results
20
Results on Larger Dataset
  • A much larger test set has been made publicly
    available by Chiang and Yu.
  • EBI human test set
  • 4,410 genes
  • 13,626 GO code annotations
  • MGI mouse test set
  • 2,188 genes
  • 6,338 GO code annotations
  • Note that Chiang and Yu used the same data for
    both training and testing.

21
Results on EBI Human and MGI datasets
  • EBI human: 4,410 genes and 5,714 abstracts
  • MGI: 2,188 genes and 1,947 abstracts

22
Conclusions and Future Work
  • We propose an algorithm that annotates genes with
    GO codes using the information available from
    other species
  • Experimental results on three datasets show that
    our algorithm consistently achieves higher
    F-measures than other solutions
  • Future improvements to our algorithm:
    • combine, or use a voting scheme between, the
      predictions our system makes and the predictions
      of a machine learning system
    • investigate how effective other genes with
      sequences similar to the target gene (but not
      orthologous to it) are for predicting the GO
      codes

23
Thank you!
  • http://biotext.berkeley.edu

24
Example
  • "The marked accumulation of lipid droplets in
    LNCaP cells...is accompanied by an increase in
    phospholipid synthesis. The increase in PAP-2
    might be related to changes in lipid metabolism...
    Since PAP-2 plays a pivotal role in the control
    of signal transduction by lipid mediators, the
    ability of androgens to stimulate
    this enzyme in prostatic cells may provide
    opportunity for cross-talk between signaling
    pathways involving lipid mediators and androgens."

25
CSC Algorithm
  • M(g,a): GO codes matched in article a for gene g
  • O(g): GO codes annotated to the ortholog of g
  • o: size of O(g); p: percentage (0.2)
  • CSC(g,a):
    for every GO1 in M(g,a)
      count = 0
      for every GO2 in O(g)
        if ((χ²(GO1,GO2) > 3.84) && (GO1 != GO2))
          count++
      if (count > p·o)
        add GO1 to CSC(g,a)