Title: Interpreting Microarray Expression Data Using Text Annotating the Genes
1Interpreting Microarray Expression Data Using
Text Annotating the Genes
- Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik
- University of Wisconsin-Madison
2The Basic Task
- Given
- Microarray Expression Data
- Text Annotations of Genes
- Generate
- Model of Expression
3Motivation
- Lots of Data Available on the Internet
- Microarray Expression Data
- Text Annotations of Genes
- Maybe we can Make the Scientist's Job Easier
- Generate a Model of Expression Automatically
- Easier First Step for the Human
4Microarray Expression Data
- Each spot represents a gene in E. coli
- Colors Indicate Up- or Down-Regulation Under Antibiotic Shock
- For our Purposes, 3 Classes
- Up-Regulated
- Down-Regulated
- No-Change
5Microarray Expression Data
From Genome-Wide Expression in Escherichia coli
K-12, Blattner et al., 1999
6Our Microarray Experiment
- 4290 genes
- 574 up-regulated
- 333 down-regulated
- 2747 un-regulated
- 636 not enough signal
7Text Annotations of Genes
- The text from a sample SwissProt entry (b1382)
- The description field
- HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION
- The keyword field
- HYPOTHETICAL PROTEIN
8Sample Rules From a Model for Up-Regulation
- IF
- The annotation contains FLAGELLAR AND does NOT contain HYPOTHETICAL
- OR
- The annotation contains BIOSYNTHESIS
- THEN
- The gene is up-regulated
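The sample rule above is a disjunction of word-presence tests, so it can be checked mechanically. The following is a minimal sketch, assuming an annotation is tokenized into upper-case words; the helper names `words_of` and `is_up_regulated` are hypothetical and not part of the original system.

```python
import re

def words_of(annotation):
    """Split an annotation string into a set of upper-case word tokens."""
    return set(re.findall(r"[A-Z0-9]+", annotation.upper()))

def is_up_regulated(annotation):
    """Apply the sample rule: (FLAGELLAR and not HYPOTHETICAL) or BIOSYNTHESIS."""
    w = words_of(annotation)
    return ("FLAGELLAR" in w and "HYPOTHETICAL" not in w) or "BIOSYNTHESIS" in w

# The b1382 description from the SwissProt example matches neither clause.
b1382 = "HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION"
predicted_up = is_up_regulated(b1382)
```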
9Why use Machine Learning?
- Concerned with machines learning from available data
- Informed by text data, the learner can make a first-pass model for the scientist
10Desired Properties of a Model
- Accurate
- Measure with cross validation
- Comprehensible
- Measure with model size
- Stable to Small Changes in the Data
- Measure with random subsampling
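Measuring accuracy with cross validation means every gene is classified by a model that never saw it during training. The following is a minimal sketch, assuming a hypothetical `fit(train_x, train_y)` callable that returns a prediction function; the fold assignment (every k-th example) is an illustrative choice, not necessarily the one used in the experiments.

```python
def cross_validated_accuracy(examples, labels, fit, k=10):
    """Fraction of examples classified correctly while held out of training."""
    n = len(examples)
    correct = 0
    for fold in range(k):
        # Hold out every k-th example, starting at offset `fold`.
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(examples) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        predict = fit(train_x, train_y)
        correct += sum(predict(examples[i]) == labels[i] for i in test_idx)
    return correct / n
```

Each example lands in exactly one held-out fold, so the result is the overall held-out accuracy.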
11Approaches
- Naïve Bayes
- Statistical method
- Uses all of the words (present or absent)
- PFOIL
- Covering algorithm
- Chooses words to use one at a time
12Naïve Bayes
- For each word wi, there are two likelihood ratios (lr)
- lr(wi present) = p(wi present | up) / p(wi present | down)
- lr(wi absent) = p(wi absent | up) / p(wi absent | down)
- For each annotation, the lrs are combined to form a lr for a gene
- lr(gene) = Π_i lr(wi is X), where X is either present or absent.
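The likelihood-ratio computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: annotations are represented as sets of words, probabilities are estimated with add-one smoothing (an assumption, to avoid division by zero), and the per-word ratios are summed in log space, which is equivalent to multiplying them. The toy annotations are hypothetical.

```python
import math

def word_probs(annotations, vocabulary):
    """Estimate p(word present | class) with add-one smoothing (an assumption)."""
    n = len(annotations)
    return {w: (sum(w in a for a in annotations) + 1) / (n + 2) for w in vocabulary}

def gene_log_lr(annotation, p_up, p_down):
    """Sum the log likelihood ratios of every vocabulary word, present or absent."""
    total = 0.0
    for w in p_up:
        if w in annotation:
            total += math.log(p_up[w] / p_down[w])
        else:
            total += math.log((1 - p_up[w]) / (1 - p_down[w]))
    return total

# Hypothetical toy data: each annotation is a set of words.
up = [{"FLAGELLAR", "PROTEIN"}, {"BIOSYNTHESIS", "PROTEIN"}]
down = [{"HYPOTHETICAL", "PROTEIN"}, {"HYPOTHETICAL", "TRANSPORT"}]
vocab = set().union(*up, *down)
p_up, p_down = word_probs(up, vocab), word_probs(down, vocab)

# A positive log ratio favors up-regulation, a negative one down-regulation.
score = gene_log_lr({"FLAGELLAR", "PROTEIN"}, p_up, p_down)
```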
13PFOIL
- Learn rules from data
- Produces multiple if-then rules from data
- Builds rules by adding one word at a time
- Easy to interpret models
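To make "covering algorithm" and "one word at a time" concrete, here is a generic FOIL-style propositional rule learner. It is a sketch only: the slides do not give PFOIL's scoring function or stopping criteria, so the FOIL information gain, the `max_literals` and `max_rules` limits, and all function names are assumptions.

```python
import math

def covers(rule, words):
    """A rule is a list of (word, must_be_present) literals; all must hold."""
    return all((w in words) == present for w, present in rule)

def foil_gain(p0, n0, p1, n1):
    """FOIL-style information gain from adding one literal to a rule."""
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def learn_rule(pos, neg, vocabulary, max_literals=3):
    """Grow one rule by greedily adding the literal with the highest gain."""
    rule = []
    while neg and len(rule) < max_literals:
        best, best_gain = None, 0.0
        for w in sorted(vocabulary):
            for present in (True, False):
                lit = (w, present)
                p1 = sum(covers([lit], x) for x in pos)
                n1 = sum(covers([lit], x) for x in neg)
                gain = foil_gain(len(pos), len(neg), p1, n1)
                if gain > best_gain:
                    best, best_gain = lit, gain
        if best is None:
            break  # no literal improves the rule
        rule.append(best)
        pos = [x for x in pos if covers([best], x)]
        neg = [x for x in neg if covers([best], x)]
    return rule

def learn_rule_set(pos, neg, vocabulary, max_rules=5):
    """Covering loop: learn a rule, remove the positives it covers, repeat."""
    rules, remaining = [], list(pos)
    while remaining and len(rules) < max_rules:
        rule = learn_rule(remaining, neg, vocabulary)
        if not rule:
            break
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules
```

Each learned rule is a conjunction of word tests, and the rule set is their disjunction, which matches the IF ... OR ... THEN form of the sample model.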
14Accuracy/Comprehensibility Tradeoff
15Stabilized PFOIL
- Repeatedly run PFOIL on randomly sampled subsets
- For each word, count the number of models it appears in
- Restrict PFOIL to only those words that appear in a minimum of m models
- Rerun PFOIL with only those words
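The word-selection step can be sketched independently of the rule learner. In this minimal illustration, `learn_words` is a hypothetical callable that trains a model on a subset and returns the words the model uses; the number of runs, the subsample fraction, and the seed are assumed parameters, not values from the experiments.

```python
import random
from collections import Counter

def stable_words(examples, learn_words, runs=20, fraction=0.8, m=10, seed=0):
    """Words used by models learned on at least m of `runs` random subsamples."""
    rng = random.Random(seed)
    counts = Counter()
    k = max(1, int(fraction * len(examples)))
    for _ in range(runs):
        subset = rng.sample(examples, k)         # sample without replacement
        counts.update(set(learn_words(subset)))  # count each word once per model
    return {w for w, c in counts.items() if c >= m}
```

The final step would rerun the learner on the full data with its vocabulary restricted to the returned set.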
16Stability Measure
- After running the algorithm N times to generate N rule sets
- Where
- U = the set of words appearing in any rule set
- count(wi) = number of rule sets containing word wi
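The definitions of U and count(wi) do not by themselves pin down the formula. One measure consistent with them, assumed here purely for illustration, is the average fraction of the N rule sets in which each word of U appears; it equals 1 when every rule set uses exactly the same words and approaches 1/N when no word is shared. Each rule set is represented by the set of words it uses.

```python
from collections import Counter

def stability(rule_sets):
    """Average fraction of rule sets containing each word of U (assumed form)."""
    n = len(rule_sets)
    # count[w] = number of rule sets containing word w; U = all counted words.
    count = Counter(w for words in rule_sets for w in set(words))
    u = set(count)
    if n == 0 or not u:
        return 0.0
    return sum(count[w] for w in u) / (n * len(u))
```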
17Accuracy/Stability Tradeoff
18Discussion
- Not very severe tradeoffs in Accuracy
- vs. stability
- vs. comprehensibility
- PFOIL not as good at characterizing data
- suggests not many dependencies
- need for softer rules
19Future Directions
- M of N rules
- Permutation Test
- More Sources of Text Data
20Take-Home Message
- This is just a first step toward an aid for understanding expression data
- Make expression models based on text instead of DNA sequence.
21Acknowledgements
- This research was funded by the following grants
- NLM 1 R01 LM07050-01,
- NSF IRI-9502990,
- NIH 2 P30 CA14520-29, and
- NIH 5 T32 GM08349.