Interpreting Microarray Expression Data Using Text Annotating the Genes

Slides: 22
Provided by: Mol5
Learn more at: https://ftp.cs.wisc.edu

1
Interpreting Microarray Expression Data Using
Text Annotating the Genes
  • Michael Molla, Peter Andreae, Jeremy Glasner,
    Frederick Blattner, Jude Shavlik
  • University of Wisconsin-Madison

2
The Basic Task
  • Given
    • Microarray Expression Data
    • Text Annotations of Genes
  • Generate
    • Model of Expression

3
Motivation
  • Lots of Data Available on the Internet
    • Microarray Expression Data
    • Text Annotations of Genes
  • Maybe We Can Make the Scientist's Job Easier
    • Generate a Model of Expression Automatically
    • Easier First Step for the Human

4
Microarray Expression Data
  • Each spot represents a gene in E. coli
  • Colors Indicate Up- or Down-Regulation Under
    Antibiotic Shock
  • For Our Purposes, 3 Classes
    • Up-Regulated
    • Down-Regulated
    • No-Change

5
Microarray Expression Data
From Genome-Wide Expression in Escherichia coli
K-12, Blattner et al., 1999

6
Our Microarray Experiment
  • 4290 genes
    • 574 up-regulated
    • 333 down-regulated
    • 2747 un-regulated
    • 636 not enough signal

7
Text Annotations of Genes
  • The text from a sample SwissProt entry (b1382)
    • The description field
      • HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR
        INTERGENIC REGION
    • The keyword field
      • HYPOTHETICAL PROTEIN
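
A minimal sketch of how an annotation like this could be reduced to the set of words it contains, which is the form the rule and likelihood-ratio slides work with. The tokenization rule and the helper name `annotation_words` are assumptions for illustration; the slides do not say how the text was split into words.

```python
import re

def annotation_words(*fields: str) -> set[str]:
    """Return the distinct upper-cased word tokens found in the given text fields."""
    words = set()
    for field in fields:
        # Assumption: a word is a maximal run of letters or digits, so
        # "LDHA-FEAR" yields two tokens and "6.8" yields "6" and "8".
        words.update(re.findall(r"[A-Z0-9]+", field.upper()))
    return words

# Text of the sample entry shown on the slide.
description = "HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION"
keywords = "HYPOTHETICAL PROTEIN"

words = annotation_words(description, keywords)
print(sorted(words))
```

Using a set records only whether a word is present or absent, not how often it occurs.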

8
Sample Rules From a Model for Up-Regulation
  • IF
    • The annotation contains FLAGELLAR AND does NOT
      contain HYPOTHETICAL
    • OR
    • The annotation contains BIOSYNTHESIS
  • THEN
    • The gene is up-regulated
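
The sample rule can be written directly as a predicate over the set of words in a gene's annotation. This is only an illustrative encoding; the function name and the example word sets are made up.

```python
def predicts_up_regulated(words: set[str]) -> bool:
    """Apply the sample rule from the slide to one annotation's word set."""
    flagellar_rule = "FLAGELLAR" in words and "HYPOTHETICAL" not in words
    biosynthesis_rule = "BIOSYNTHESIS" in words
    # The two rules are alternatives: either one is enough to predict up-regulation.
    return flagellar_rule or biosynthesis_rule

# Hypothetical annotations, not taken from the data set.
print(predicts_up_regulated({"FLAGELLAR", "MOTOR", "PROTEIN"}))
print(predicts_up_regulated({"FLAGELLAR", "HYPOTHETICAL", "PROTEIN"}))
print(predicts_up_regulated({"HISTIDINE", "BIOSYNTHESIS"}))
```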

9
Why use Machine Learning?
  • Concerned with machines learning from available
    data
  • Informed by text data, the learner can make a
    first-pass model for the scientist

10
Desired Properties of a Model
  • Accurate
    • Measure with cross-validation
  • Comprehensible
    • Measure with model size
  • Stable to Small Changes in the Data
    • Measure with random subsampling
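
As a rough illustration of the first measurement, the sketch below estimates accuracy with k-fold cross-validation. The number of folds, the function names, and the majority-class stand-in learner are all assumptions; the slides do not give the experimental settings.

```python
import random

def k_fold_accuracy(examples, labels, train, k=5, seed=0):
    """Average held-out accuracy over k folds.

    `train(examples, labels)` must return a predict(example) function.
    Assumes there are at least k examples so no fold is empty.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        predict = train([examples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        correct = sum(predict(examples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

def majority_trainer(train_examples, train_labels):
    """Hypothetical baseline learner: always predict the most common training label."""
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return lambda example: majority

# Toy data, made up for illustration.
examples = [{"WORD%d" % i} for i in range(10)]
labels = ["no-change"] * 7 + ["up"] * 3
accuracy = k_fold_accuracy(examples, labels, majority_trainer, k=5)
print(accuracy)
```

Any learner that fits the `train` signature can be swapped in for the baseline.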

11
Approaches
  • Naïve Bayes
    • Statistical method
    • Uses all of the words (present or absent)
  • PFOIL
    • Covering algorithm
    • Chooses words to use one at a time

12
Naïve Bayes
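  • For each word $w_i$, there are two likelihood ratios
    (lr)
    • $\mathrm{lr}(w_i\ \text{present}) = p(w_i\ \text{present} \mid \text{up}) / p(w_i\ \text{present} \mid \text{down})$
    • $\mathrm{lr}(w_i\ \text{absent}) = p(w_i\ \text{absent} \mid \text{up}) / p(w_i\ \text{absent} \mid \text{down})$
  • For each annotation, the lrs are combined to form
    a lr for a gene
    • $\mathrm{lr}(\text{gene}) = \prod_i \mathrm{lr}(w_i\ \text{is}\ X)$,
      where $X$ is either present or absent.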
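
The likelihood ratios on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the add-alpha smoothing, the function names, and the toy annotations are assumptions, and the ratios are combined as a sum of logs (equivalent to a product) with no class-prior term.

```python
import math

def likelihood_ratios(up_docs, down_docs, vocabulary, alpha=1.0):
    """Return {word: (lr_present, lr_absent)} from two lists of word sets."""
    ratios = {}
    for w in vocabulary:
        # Assumption: add-alpha smoothing keeps each probability strictly
        # between 0 and 1, so neither ratio divides by zero.
        p_up = (sum(w in d for d in up_docs) + alpha) / (len(up_docs) + 2 * alpha)
        p_down = (sum(w in d for d in down_docs) + alpha) / (len(down_docs) + 2 * alpha)
        ratios[w] = (p_up / p_down, (1 - p_up) / (1 - p_down))
    return ratios

def gene_log_lr(words, ratios):
    """Sum log lr over every vocabulary word, present or absent in this gene."""
    return sum(math.log(present if w in words else absent)
               for w, (present, absent) in ratios.items())

# Toy annotations, made up for illustration.
up_docs = [{"FLAGELLAR", "PROTEIN"}, {"BIOSYNTHESIS"}]
down_docs = [{"HYPOTHETICAL", "PROTEIN"}, {"HYPOTHETICAL"}]
vocabulary = set().union(*up_docs, *down_docs)

ratios = likelihood_ratios(up_docs, down_docs, vocabulary)
score_up = gene_log_lr({"FLAGELLAR"}, ratios)
score_down = gene_log_lr({"HYPOTHETICAL"}, ratios)
print(score_up, score_down)
```

A positive log ratio favors up-regulation and a negative one favors down-regulation.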

13
PFOIL
  • Learn rules from data
  • Produces multiple if-then rules from data
  • Builds rules by adding one word at a time
  • Easy to interpret models
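
To make the covering idea concrete, here is a simplified greedy rule learner in the same spirit. It is not the authors' PFOIL: the FOIL-style gain, the limit on literals per rule, and the toy data are assumptions chosen to keep the sketch short.

```python
import math

def matches(rule, words):
    """A rule is a list of (word, must_be_present) literals that must all hold."""
    return all((w in words) == present for w, present in rule)

def foil_gain(p0, n0, p1, n1):
    """FOIL-style gain when coverage narrows from (p0, n0) to (p1, n1)."""
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def learn_rule(pos, neg, vocabulary, max_literals=3):
    """Add one literal at a time until no negatives are covered or gain runs out."""
    rule = []
    while neg and len(rule) < max_literals:
        def gain(lit):
            p1 = sum(matches([lit], d) for d in pos)
            n1 = sum(matches([lit], d) for d in neg)
            return foil_gain(len(pos), len(neg), p1, n1)
        candidates = [(w, present) for w in sorted(vocabulary) for present in (True, False)]
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break
        rule.append(best)
        pos = [d for d in pos if matches([best], d)]
        neg = [d for d in neg if matches([best], d)]
    return rule

def learn_rule_set(pos, neg, vocabulary, max_rules=5):
    """Covering loop: learn a rule, drop the positives it covers, repeat."""
    rules, remaining = [], list(pos)
    while remaining and len(rules) < max_rules:
        rule = learn_rule(remaining, neg, vocabulary)
        if not rule:
            break
        rules.append(rule)
        remaining = [d for d in remaining if not matches(rule, d)]
    return rules

# Toy annotations, made up for illustration.
pos = [{"FLAGELLAR", "PROTEIN"}, {"FLAGELLAR"}, {"BIOSYNTHESIS", "PROTEIN"}]
neg = [{"FLAGELLAR", "HYPOTHETICAL"}, {"HYPOTHETICAL", "PROTEIN"}, {"PROTEIN"}]
vocabulary = set().union(*pos, *neg)
rules = learn_rule_set(pos, neg, vocabulary)
print(rules)
```

Each learned rule is a conjunction of word tests, and the rule set is their disjunction, which is what makes the resulting model easy to read.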

14
Accuracy/Comprehensibility Tradeoff

15
Stabilized PFOIL
  • Repeatedly run PFOIL on randomly sampled subsets
  • For each word, count the number of models it
    appears in
  • Restrict PFOIL to only those words that appear in
    a minimum of m models
  • Rerun PFOIL with only those words
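
The four steps above can be sketched as a wrapper around any rule learner. The number of runs, the subsample fraction, the threshold `m`, and the stand-in `toy_learner` are assumptions; a real run would pass in the actual PFOIL learner instead.

```python
import random
from collections import Counter

def stable_words(pos, neg, vocabulary, learn, runs=20, fraction=0.8, m=10, seed=0):
    """Return the words that appear in at least m of the models learned on subsamples."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        sub_pos = rng.sample(pos, max(1, int(fraction * len(pos))))
        sub_neg = rng.sample(neg, max(1, int(fraction * len(neg))))
        rules = learn(sub_pos, sub_neg, vocabulary)
        # Count each word once per model, however many rules use it.
        counts.update({w for rule in rules for w, _ in rule})
    return {w for w, c in counts.items() if c >= m}

def toy_learner(pos, neg, vocabulary):
    """Hypothetical stand-in learner: one single-word rule on the best-separating word."""
    if not vocabulary:
        return []
    best = max(sorted(vocabulary),
               key=lambda w: sum(w in d for d in pos) - sum(w in d for d in neg))
    return [[(best, True)]]

# Toy data, made up for illustration.
pos = [{"A", "B"}, {"A"}, {"A", "C"}, {"A", "B"}, {"A"}]
neg = [{"B"}, {"C"}, {"B", "C"}, {"C"}, {"B"}]
vocabulary = {"A", "B", "C"}

kept = stable_words(pos, neg, vocabulary, toy_learner)
final_model = toy_learner(pos, neg, kept)  # rerun restricted to the stable words
print(kept, final_model)
```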

16
Stability Measure
  • After running the algorithm N times to generate
    N rule sets
  • Where
    • U = the set of words appearing in any rule set
    • count($w_i$) = number of rule sets containing word
      $w_i$
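
The formula itself did not survive in this transcript. One plausible reading that is consistent with the definitions of N, U, and count is the average, over the words in U, of the fraction of rule sets each word appears in. The sketch below implements that reading; it is an assumption, not the authors' confirmed definition.

```python
def stability(rule_sets):
    """Mean fraction of rule sets containing each word of U (assumed definition).

    Each rule set is a list of rules; each rule is a list of
    (word, must_be_present) literals. Returns 1.0 when every word
    appears in every rule set.
    """
    n = len(rule_sets)
    word_sets = [{w for rule in rules for w, _ in rule} for rules in rule_sets]
    union = set().union(*word_sets)  # U: words appearing in any rule set
    if not union:
        return 0.0
    total = sum(sum(w in ws for ws in word_sets) for w in union)
    return total / (len(union) * n)

# Two hypothetical rule sets: "A" appears in both, "B" in only one.
example = [
    [[("A", True)]],
    [[("A", True), ("B", False)]],
]
print(stability(example))
```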

17
Accuracy/Stability Tradeoff

18
Discussion
  • Not very severe tradeoffs in Accuracy
    • vs. stability
    • vs. comprehensibility
  • PFOIL not as good at characterizing data
    • suggests not many dependencies
    • need for softer rules

19
Future Directions
  • M-of-N rules
  • Permutation Test
  • More Sources of Text Data

20
Take-Home Message
  • This is just a first step toward an aid for
    understanding expression data
  • Make expression models based on text instead of
    DNA sequence.

21
Acknowledgements
  • This research was funded by the following grants:
    • NLM 1 R01 LM07050-01,
    • NSF IRI-9502990,
    • NIH 2 P30 CA14520-29, and
    • NIH 5 T32 GM08349.