Title: talk in bioitworld2002
1From Datamining to Bioinformatics
Limsoon Wong
Laboratories for Information Technology, Singapore
2What is Bioinformatics?
3Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
4Benefits of Bioinformatics
- To the patient: Better drug, better treatment
- To the pharma: Save time, save cost, make more
- To the scientist: Better science
5From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
- MHC-Peptide Binding (PREDICT)
- Protein Interactions Extraction (PIES)
- Gene Expression & Medical Record Datamining (PCL)
- Cleansing & Warehousing (FIMM)
- Gene Feature Recognition (Dragon)
- Integration Technology (Kleisli)
- Venom Informatics
Timeline: 1994, 1996, 1998, 2000, 2002
Organisations: ISS, KRDL, LIT
6Quick Samplings
7Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
8Epitope Prediction Results
- Prediction by our ANN model for HLA-A11
- 29 predictions
- 22 epitopes
- 76% specificity
- Prediction by BIMAS matrix for HLA-A1101
- Number of experimental binders, by BIMAS rank: 19 (52.8%), 5 (13.9%), 12 (33.3%)
9Transcription Start Prediction
10Transcription Start Prediction Results
11Medical Record Analysis
- Looking for patterns that are
- valid
- novel
- useful
- understandable
12Gene Expression Analysis
- Classifying gene expression profiles
- find stable differentially expressed genes
- find significant gene groups
- derive coordinated gene expression
13Medical Record & Gene Expression Analysis Results
- PCL, a novel emerging pattern method
- Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
- Works well for gene expressions
Cancer Cell, March 2002, 1(2)
14Behind the Scene
- Allen Chong
- Judice Koh
- SPT Krishnan
- Huiqing Liu
- Seng Hong Seah
- Soon Heng Tan
- Guanglan Zhang
- Zhuo Zhang
- Vladimir Bajic
- Vladimir Brusic
- Jinyan Li
- See-Kiong Ng
- Limsoon Wong
- Louxin Zhang
and many more students, folks from
geneticXchange, MolecularConnections, and other
collaborators.
15Questions?
16A More Detailed Account
17What is Datamining?
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest
18What is Datamining?
Question: Can you explain how?
19The Steps of Data Mining
- Training data gathering
- Signal generation
- k-grams, colour, texture, domain know-how, ...
- Signal selection
- Entropy, χ², CFS, t-test, domain know-how, ...
- Signal integration
- SVM, ANN, PCL, CART, C4.5, kNN, ...
20Translation Initiation Recognition
21A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation
initiation site?
22Signal Generation
- k-grams (i.e., k consecutive letters)
- k = 1, 2, 3, 4, 5, ...
- Window size vs. fixed position
- Up-stream, downstream vs. anywhere in window
- In-frame vs. any frame
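The signal-generation choices above can be sketched in Python roughly as follows. This is a minimal illustration, not the original implementation: the function names `kgram_counts` and `kgram_signals`, the default window of 30 letters, and the feature-naming scheme are all assumptions made for the example.

```python
from collections import Counter

def kgram_counts(seq, k, start=0, step=1):
    """Count k-grams in seq, beginning at offset `start` and moving `step` letters at a time."""
    counts = Counter()
    for i in range(start, len(seq) - k + 1, step):
        counts[seq[i:i + k]] += 1
    return counts

def kgram_signals(cdna, atg_pos, k=3, window=30):
    """Up-stream and down-stream k-gram counts around a candidate ATG.

    "any" signals count a k-gram wherever it starts in the window;
    "inframe" signals count only k-grams that start in the ATG's reading frame.
    """
    up = cdna[max(0, atg_pos - window):atg_pos]
    down = cdna[atg_pos + 3:atg_pos + 3 + window]
    # An upstream position is in frame when its distance to the ATG is a multiple of 3.
    up_offset = len(up) % 3
    signals = {}
    for name, region, offset in (("up", up, up_offset), ("down", down, 0)):
        for gram, n in kgram_counts(region, k).items():
            signals[f"{name}_any_{gram}"] = n
        for gram, n in kgram_counts(region, k, start=offset, step=3).items():
            signals[f"{name}_inframe_{gram}"] = n
    return signals

# Toy sequence (made up for illustration) with a candidate ATG at index 6.
signals = kgram_signals("AAATAGATGGCCTAA", atg_pos=6)
print(sorted(signals))
```

Each key in the returned dictionary is one candidate signal; repeating this for several values of k is what makes the number of signals grow so quickly.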
23Too Many Signals
- For each value of k, there are
- 4^k × 3 × 2 k-grams
- If we use k = 1, 2, 3, 4, 5, we have
- 4 + 24 + 96 + 384 + 1536 + 6144 = 8188
- features!
- This is too many for most machine learning algorithms
24Signal Selection (Basic Idea)
- Choose a signal w/ low intra-class distance
- Choose a signal w/ high inter-class distance
- Which of the following 3 signals is good?
25Signal Selection (eg., t-statistics)
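As a rough illustration of scoring one signal by a t-statistic, the sketch below uses the unequal-variance (Welch) form: the gap between the two class means, divided by the combined spread within the classes. The function name `t_score` and the toy values are assumptions for the example; the exact variant used in the talk is not stated.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values_c1, values_c2):
    """Absolute t-statistic of one signal across two classes.

    A large value means the class means are far apart (high inter-class
    distance) relative to the spread inside each class (low intra-class
    distance). Assumes at least two samples per class and non-zero variance.
    """
    m1, m2 = mean(values_c1), mean(values_c2)
    v1, v2 = variance(values_c1), variance(values_c2)
    return abs(m1 - m2) / sqrt(v1 / len(values_c1) + v2 / len(values_c2))

# Toy values: the first signal separates the classes, the second does not.
good = t_score([1, 2, 3], [7, 8, 9])
poor = t_score([1, 5, 9], [2, 6, 10])
print(good > poor)
```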
26Signal Selection (eg., MIT-correlation)
27Signal Selection (eg., χ²)
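A minimal sketch of the χ² idea, assuming the signal has already been discretised so that samples can be tallied in a contingency table (rows are signal values, columns are classes). The function name `chi_square` and the example tables are illustrative only.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows.

    Compares each observed count with the count expected if the signal were
    independent of the class. Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: signal present / absent. Columns: class C1 / class C2.
informative = chi_square([[30, 10], [10, 30]])
uninformative = chi_square([[20, 20], [20, 20]])
print(informative, uninformative)
```

A higher statistic means the signal's distribution differs more between the classes, so signals are typically ranked by this value.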
28Signal Selection (eg., CFS)
- Instead of scoring individual signals, how about scoring a group of signals as a whole?
- CFS
- A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
- Homework: find a formula that captures the key idea of CFS above
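One widely cited candidate is the merit heuristic from Hall's correlation-based feature selection. Whether this is the formula the speaker had in mind is an assumption. For a group $S$ of $k$ signals, with $\bar{r}_{cf}$ the mean signal-class correlation and $\bar{r}_{ff}$ the mean signal-signal correlation:

```latex
\mathrm{Merit}_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}
```

The numerator grows when the signals track the class; the denominator grows when the signals are redundant with one another.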
29Sample k-grams Selected
- Position -3 (Kozak consensus)
- in-frame upstream ATG (leaky scanning)
- in-frame downstream
- TAA, TAG, TGA (stop codon)
- CTG, GAC, GAG, and GCC (codon bias)
30Signal Integration
- kNN
- Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
- SVM
- Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes.
- Naïve Bayes, ANN, C4.5, ...
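The kNN rule described above can be sketched in a few lines of Python. Euclidean distance is assumed as the similarity measure, and the function name `knn_predict` and the toy points are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest training samples.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes.
train = [((0, 0), "C1"), ((0, 1), "C1"), ((1, 0), "C1"),
         ((5, 5), "C2"), ((5, 6), "C2"), ((6, 5), "C2")]
print(knn_predict(train, (0.5, 0.5)))
```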
31Results (on Pedersen & Nielsen's mRNA)
32Acknowledgements
- Roland Yap
- Zeng Fanfan
- A.G. Pedersen
- H. Nielsen
33Questions?
34Common Mistakes
35Self-fulfilling Oracle
- Consider this scenario
- Given classes C1 and C2 w/ explicit signals
- Apply χ² to C1 and C2 to select signals s1, s2, s3
- Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
- Is the accuracy really 90%?
- What can be wrong with this?
36Phil Long's Experiment
- Let there be classes C1 and C2 w/ 100000 features having randomly generated values
- Use χ² to select 20 features
- Run k-fold x-validation on C1 and C2 w/ these 20 features
- Expect 50% accuracy
- Get 90% accuracy!
- Lesson: choose features within each fold, using only that fold's training samples
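A scaled-down Python sketch of this experiment follows. It is not the original setup: it assumes 60 samples and 1000 random features instead of 100000, swaps χ² for a simple t-like score, and uses a nearest-centroid classifier. All function names are hypothetical. The only difference between the two runs is whether the 20 features are picked once on all the data or re-picked inside each fold.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def t_like_score(a, b):
    """Gap between class means, scaled by the spread of the two classes."""
    ma, mb = mean(a), mean(b)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((x - mb) ** 2 for x in b) / len(b)
    return abs(ma - mb) / (va / len(a) + vb / len(b)) ** 0.5

def select(rows, labels, n_keep):
    """Indices of the n_keep highest-scoring features on the given samples."""
    scored = []
    for j in range(len(rows[0])):
        a = [r[j] for r, lab in zip(rows, labels) if lab == 0]
        b = [r[j] for r, lab in zip(rows, labels) if lab == 1]
        scored.append((t_like_score(a, b), j))
    return [j for _, j in sorted(scored, reverse=True)[:n_keep]]

def centroid(rows, keep):
    return [mean([r[j] for r in rows]) for j in keep]

def cross_validate(rows, labels, n_keep, n_folds, select_inside_fold):
    n = len(rows)
    # Selecting on all samples lets the test labels influence the features.
    fixed = None if select_inside_fold else select(rows, labels, n_keep)
    correct = 0
    for f in range(n_folds):
        train = [i for i in range(n) if i % n_folds != f]
        test = [i for i in range(n) if i % n_folds == f]
        tr_rows = [rows[i] for i in train]
        tr_labels = [labels[i] for i in train]
        keep = select(tr_rows, tr_labels, n_keep) if select_inside_fold else fixed
        c0 = centroid([r for r, lab in zip(tr_rows, tr_labels) if lab == 0], keep)
        c1 = centroid([r for r, lab in zip(tr_rows, tr_labels) if lab == 1], keep)
        for i in test:
            x = [rows[i][j] for j in keep]
            d0 = sum((xi - c) ** 2 for xi, c in zip(x, c0))
            d1 = sum((xi - c) ** 2 for xi, c in zip(x, c1))
            correct += (0 if d0 <= d1 else 1) == labels[i]
    return correct / n

random.seed(0)
n_samples, n_features = 60, 1000
rows = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
labels = [i % 2 for i in range(n_samples)]  # labels carry no real signal
leaky = cross_validate(rows, labels, 20, 3, select_inside_fold=False)
proper = cross_validate(rows, labels, 20, 3, select_inside_fold=True)
print(f"features chosen once on all data: {leaky:.2f}")
print(f"features chosen inside each fold: {proper:.2f}")
```

Because the data are pure noise, any accuracy well above chance in the first run is an artefact of letting the held-out samples take part in feature selection.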
37Apples vs Oranges
- Consider this scenario
- Fanfan reported 89% accuracy on his TIS prediction method
- Hatzigeorgiou reported 94% accuracy on her TIS prediction method
- So Hatzigeorgiou's method is better
- What is wrong with this conclusion?
38Apples vs Oranges
- Differences in datasets used
- Fanfan's expt used Pedersen's dataset
- Hatzigeorgiou's expt used her own dataset
- Differences in counting
- Fanfan's expt was on a per ATG basis
- Hatzigeorgiou's expt used the scanning rule and thus was on a per cDNA basis
- When Fanfan ran the same dataset and counted the same way as Hatzigeorgiou, he got 94% also!
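To see how the counting method alone can change the reported number, here is a small Python sketch that scores the same predictions both ways. The data are made up purely for illustration, the function names are hypothetical, and the scanning rule is assumed to mean "scan from the 5' end and call the first ATG predicted positive the TIS".

```python
def per_atg_accuracy(cdnas):
    """Fraction of individual ATGs whose predicted label matches the truth."""
    calls = [pred == truth for cdna in cdnas for truth, pred in cdna]
    return sum(calls) / len(calls)

def per_cdna_accuracy(cdnas):
    """Fraction of cDNAs where the first ATG predicted positive is the true TIS."""
    correct = 0
    for cdna in cdnas:
        first_hit = next((truth for truth, pred in cdna if pred), None)
        correct += first_hit is True
    return correct / len(cdnas)

# Each cDNA lists its ATGs from 5' to 3' as (is_true_TIS, predicted_positive).
toy_cdnas = [
    [(False, False), (True, True), (False, True), (False, True)],
    [(True, True), (False, True)],
    [(False, True), (True, True)],
]
print(per_atg_accuracy(toy_cdnas), per_cdna_accuracy(toy_cdnas))
```

Under the scanning rule, false positives downstream of the true site are never counted, so the per cDNA figure can be higher than the per ATG figure for identical predictions.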
39Questions?