Title: Data mining in bioinformatics: problems and challenges
1Data mining in bioinformatics: problems and challenges
Sorin Draghici
Email sod_at_cs.wayne.edu WWW http://vortex.cs.wayne.edu http://www.cs.wayne.edu/sod
2Why bioinformatics?
- We are witnessing a "biotechnology revolution"
- Biotechnology
- has the potential to improve our lives dramatically (new drugs, treatments, etc.)
- also has a huge destructive potential (careless genetic manipulations, etc.)
3Why bioinformatics?
- Human genome project
- draft sequence completed (Celera and the public Human Genome Project consortium)
- How is that to be used?
- map functions on genes
- find/treat/correct/eliminate genetic diseases
- gene treatment
- patient-oriented treatment and drugs (pharmacogenomics), e.g. ACE inhibitors (blood pressure medication)
4The HIV virus
- HIV is a retrovirus that attacks the immune system
- Replication mechanism
- RNA based
- makes lots of mistakes during replication
- compensates for the primitive replication through a high replication speed
5Why is it so deadly?
- 10 billion copies of HIV are produced every day
- High replication speed
- Many random mutations
- Selection pressure from the drug
- very good search ability in the version space of
all viable HIV viruses
6Current treatments
- Protease inhibitors
- Reverse transcriptase inhibitors
7Current problems
- Very few drugs available
- 5 FDA approved protease inhibitors
- 9 FDA approved RT inhibitors
- Cross-resistance
- patient treated with drug A may develop
resistance to drug B as well
8Current problems
- Drug development is
- very slow (10 years)
- very expensive (10-30 million/year)
- Viral mutations are
- very probable in each generation
- very rapid (10 billion copies a day)
- The result: throwing stones at fighter planes
9Our approach
- Find the structural features which
- cause drug resistance
- are common to several mutants
- Design drugs to counteract such common features as opposed to individual mutants
- secondary therapy
10[Diagram; recoverable labels: wild type HIV, mutant HIV, effective, less effective, drug development, resistance genotyping, options 1-3 (mutant HIV 1-3), first antiretroviral therapy (FAT), second antiretroviral therapy (SAT)]
11Our data
- Genotypic data (genetic sequences of mutants)
- easy to obtain
- there are lots of them
- Structural data (X-ray crystallography)
- difficult to obtain
- not very many
- Phenotypic data (drug resistance)
- very difficult to obtain
- very few available
12Our data
- Genotypic data
- PQITLWQRPLVTIKIGGQLKEALLDTGADDT... (approx. 200 residues for protease)
- Structure data
- Phenotypic data
- IC90 = 3.51
- fold resistance = IC90 mutant / IC90 wildtype
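The fold resistance ratio above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the wild-type IC90 used in the example is a made-up number (only the value 3.51 appears on the slide).

```python
def fold_resistance(ic90_mutant: float, ic90_wildtype: float) -> float:
    """Ratio of the mutant IC90 to the wild-type IC90 (same drug, same assay)."""
    if ic90_wildtype <= 0:
        raise ValueError("wild-type IC90 must be positive")
    return ic90_mutant / ic90_wildtype


# 3.51 is the IC90 shown on the slide; the wild-type value 0.5 is an
# invented placeholder used only to make the example runnable.
example = fold_resistance(3.51, 0.5)
print(f"fold resistance: {example:.2f}")
```

Because the result is a ratio, it is unitless as long as both IC90 values are reported in the same units.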
13Our work
- Develop a structure-function model of HIV drug
resistance
[Diagram labels: sequence, structure, resistance]
14Dataflow
[Diagram labels: Sequence, Contacts/PDB, Structures, Machine learning]
15Supervised learning
- Inputs
- Atomic contacts between the inhibitor and the protease
- Atomic distances
- Output
- Fold resistance
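The slides do not say which supervised learner was used. As a minimal sketch of the input/output setup only, the example below uses a k-nearest-neighbour regressor as a stand-in: each mutant is a vector of inhibitor-protease distances and the target is its fold resistance. The function name and all numbers are assumptions made up for illustration.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Average the fold resistance of the k training mutants closest to `query`."""
    if not train_x or len(train_x) != len(train_y):
        raise ValueError("training inputs and outputs must be non-empty and aligned")
    # Rank training mutants by Euclidean distance in feature space.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Toy data, invented for this example: two atomic distances per mutant
# and a fold resistance value for each.
toy_distances = [[3.1, 4.0], [3.0, 4.2], [5.5, 6.1], [5.8, 6.0]]
toy_fold = [1.2, 1.5, 12.0, 15.0]

prediction = knn_predict(toy_distances, toy_fold, [3.05, 4.1], k=2)
print(f"predicted fold resistance: {prediction:.2f}")
```

Any other regressor could be dropped in behind the same interface; the point here is the shape of the data, not the choice of model.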
16Ligplot Contacts File
17Atomic contacts - resistance
18Unsupervised learning
- Inputs
- Contact residues (21 distinct contacts)
- Output
- A self-organizing map embedding structural information
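For readers unfamiliar with self-organizing maps, here is a minimal sketch of a generic Kohonen-style map trained on binary contact vectors. It is not the code behind these slides: the map size, learning schedule, function names and the two toy contact patterns are all assumptions chosen for illustration. Only the vector length (21 contacts) comes from the slide.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2
    # One weight vector per map node, initialised at random in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = t / steps
            lr = lr0 * (1 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1 - frac) + 1e-3    # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))
                # Pull the node towards the sample, scaled by grid proximity to the BMU.
                for j in range(dim):
                    w[j] += lr * h * (x[j] - w[j])
            t += 1
    return weights


# Two invented contact patterns over 21 possible contacts (1 = contact present).
pattern_a = [1] * 10 + [0] * 11
pattern_b = [0] * 10 + [1] * 11
data = [pattern_a, pattern_b, pattern_a, pattern_b]

weights = train_som(data, rows=3, cols=3, epochs=20)
print("pattern A maps to node", best_matching_unit(weights, pattern_a))
print("pattern B maps to node", best_matching_unit(weights, pattern_b))
```

After training, mutants with similar contact patterns land on the same or neighbouring nodes, so resistance values can then be overlaid on the map.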
19Ligplot Contacts File
20Self-organizing feature maps
21Residue contacts - resistance
22Problems and challenges in bioinformatics
- Insufficient data
- Example
- Largest data set has 50 mutants
- Why?
- The field is very recent
- Data collection can be very difficult (one structure may take 1-2 years if done from scratch; one IC90 value may take up to two weeks)
- Data has commercial value
- Solutions
- Get more data
- Cross-validate very carefully
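With a data set of about 50 mutants, one common way to cross-validate is leave-one-out: each mutant is predicted from a model fitted on all the others. The sketch below assumes that scheme (the slides do not name one) and uses a trivial mean predictor as a placeholder model; all names and numbers are hypothetical.

```python
def leave_one_out_error(xs, ys, fit_predict):
    """Mean absolute error when each sample is predicted from all the others."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two aligned samples")
    errors = []
    for i in range(len(xs)):
        # Hold out sample i; everything else is the training set.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predicted = fit_predict(train_x, train_y, xs[i])
        errors.append(abs(predicted - ys[i]))
    return sum(errors) / len(errors)


def mean_baseline(train_x, train_y, query):
    """Placeholder model: always predict the mean of the training targets."""
    return sum(train_y) / len(train_y)


# Invented toy values: one feature per mutant and a fold resistance each.
toy_x = [[0.0], [1.0], [2.0]]
toy_y = [1.0, 2.0, 3.0]

loo_error = leave_one_out_error(toy_x, toy_y, mean_baseline)
print(f"leave-one-out mean absolute error: {loo_error:.2f}")
```

Comparing any real model against a baseline like this one is a cheap check that it has learned more than the average of the data.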
23Problems and challenges in bioinformatics
- Data consistency
- Example
- Same sample sent to two different labs can come back with different IC90 values
- Why?
- The experimental tools are not mature yet
- Solutions
- Select your data carefully
- Use data from consistent sources
- If not possible, pre-process the data to make it
consistent (not very good since you actually
change the data!)
24Problems and challenges in bioinformatics
- Data accuracy
- Example
- Same sample sent to the same lab at different times can be reported with different IC90 values (4-fold error)
- Why?
- The experimental tools are not mature yet
- Solutions
- Use relative values to reduce the requirement for high numerical precision
- Map data into clusters and attach values to clusters (1-4 no resistance, 4-10 reduced resistance, >10 resistance)
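The mapping from fold resistance to the three clusters can be sketched as a small function. The slide leaves the boundaries ambiguous, so two assumptions are made here: a value of exactly 4 or exactly 10 counts as "reduced resistance", and values below 1 are treated as "no resistance". The function name is hypothetical.

```python
def resistance_class(fold: float) -> str:
    """Map a fold resistance value to one of the three clusters from the slide."""
    if fold <= 0:
        raise ValueError("fold resistance must be positive")
    if fold < 4:        # assumption: values below 1 also count as no resistance
        return "no resistance"
    if fold <= 10:      # assumption: boundaries 4 and 10 fall in this class
        return "reduced resistance"
    return "resistance"


for value in (2.0, 7.0, 25.0):
    print(value, "->", resistance_class(value))
```

Working with classes instead of raw values makes a 4-fold measurement error matter mainly near the class boundaries.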
25Problems and challenges in bioinformatics
- Data quality
- Example
- Papers reporting IC90 values do not give the whole sequence
- Why?
- People are not aware of its importance
- Data may have commercial value
- Solutions
- Never trust your data...
26Problems and challenges in bioinformatics
- The choice of features
- Example
- Atoms?, Residues?, Genes?, Larger structures?
- Why?
- The phenomena are very complex and span different scales in time and space
- Solutions
- Try to merge different types of data in order to capture the complexity of the phenomenon
- Use several qualitatively different analysis and machine learning techniques
27Problems and challenges in bioinformatics
- Lack of tools
- Example
- There were no tools able to correlate sequence/structure/resistance data for the HIV virus
- We wrote more than 15,000 lines of code for this problem
- Why?
- The field is new
- The structure/function problem is just starting to be addressed
- Solutions
- Develop your own software
- Partnerships with bioinformatics companies?
28Problems and challenges in bioinformatics
- Difficult communication between the "bio" and the "informatics" sides
- Example
- Definition of "successful prediction"
- Why?
- Different backgrounds, different traditions
- Solution
- Cross-training
- Exposure to "the other" field
29Conclusions
- Data mining in bioinformatics is
- Challenging
- Interesting
- Useful