Data mining in bioinformatics: problems and challenges - PowerPoint PPT Presentation

About This Presentation

Title:

Data mining in bioinformatics: problems and challenges

Slides: 30

Provided by: iroUmo

Transcript and Presenter's Notes

1
Data mining in bioinformatics: problems and
challenges
Sorin Draghici
Email: sod_at_cs.wayne.edu
WWW: http://vortex.cs.wayne.edu http://www.cs.wayne.edu/sod

2
Why bioinformatics?

We are witnessing a "biotechnology revolution"
Biotechnology
has the potential to improve our lives
dramatically (new drugs, treatments, etc.)
also has huge destructive potential (careless
genetic manipulations, etc.)

3
Why bioinformatics?

Human genome project
completed by Celera
How is that to be used?
map functions on genes
find/treat/correct/eliminate genetic diseases
gene treatment
patient-oriented treatment and drugs
(pharmacogenomics), e.g. ACE inhibitors (blood
pressure medication)

4
The HIV virus

HIV is a retrovirus that attacks the immune
system
Replication mechanism
RNA based
makes lots of mistakes during the replication
Compensates for the primitive replication through
a high replication speed

5
Why is it so deadly?

10 billion copies of HIV are produced every day
High replication speed
Many random mutations
Selection pressure from the drug
very good search ability in the version space of
all viable HIV viruses

6
Current treatments

Protease inhibitors
Reverse transcriptase inhibitors

7
Current problems

Very few drugs available
5 FDA approved protease inhibitors
9 FDA approved RT inhibitors
Cross-resistance
patient treated with drug A may develop
resistance to drug B as well

8
Current problems

Drug development is
very slow (10 years)
very expensive (10-30 million/year)
Viral mutations are
very probable in each generation
very rapid (10 billion copies a day)
The result: throwing stones at fighter planes

9
Our approach

Find the structural features which
cause drug resistance
are common to several mutants
Design drugs to counteract such common features
as opposed to individual mutants
secondary therapy

10
[Diagram. Labels: drug development; wild type HIV (effective); mutant HIV (less effective); resistance genotyping; options 1-3 for mutant HIV 1-3; first antiretroviral therapy (FAT); second antiretroviral therapy (SAT)]

11
Our data

Genotypic data (genetic sequences of mutants)
easy to obtain
there are lots of them
Structural data (X ray crystallography)
difficult to obtain
not very many
Phenotypic data (drug resistance)
very difficult to obtain
very few available

12
Our data

Genotypic data
PQITLWQRPLVTIKIGGQLKEALLDTGADDT... (approx. 200
residues for protease)
Structure data
Phenotypic data
IC90 = 3.51
fold resistance = IC90 mutant / IC90 wildtype
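
The fold resistance ratio above can be sketched in a few lines of Python. This is only an illustration: the function name and the IC90 values in the example are hypothetical and are not taken from the presentation's data.

```python
def fold_resistance(ic90_mutant: float, ic90_wildtype: float) -> float:
    """Return the ratio IC90(mutant) / IC90(wild type)."""
    if ic90_wildtype <= 0:
        raise ValueError("wild-type IC90 must be positive")
    return ic90_mutant / ic90_wildtype


# Hypothetical IC90 values, for illustration only.
example = fold_resistance(ic90_mutant=12.0, ic90_wildtype=3.0)
```

Because the result is a ratio, it is unitless as long as both IC90 values are reported in the same units.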

13
Our work

Develop a structure-function model of HIV drug
resistance

[Diagram. Labels: sequence, structure, resistance]

14
Dataflow
Sequence
Contacts/PDB
Structures
Machine learning
15
Supervised learning

Inputs
Atomic contacts between the inhibitor and the
protease
Atomic distances
Output
Fold resistance
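
The slides do not say which learning algorithm was used, so the following is only a minimal sketch of the supervised setup, assuming each mutant is described by a vector of inhibitor-protease atomic distances and labelled with its fold resistance. A k-nearest-neighbour regressor stands in for the unspecified model; the function name and all numbers are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict fold resistance as the mean over the k nearest training mutants.

    train_x: list of feature vectors (e.g. atomic distances), one per mutant
    train_y: list of fold-resistance values, aligned with train_x
    query:   feature vector of the mutant to predict
    """
    if not train_x or len(train_x) != len(train_y):
        raise ValueError("train_x and train_y must be non-empty and aligned")
    k = min(k, len(train_x))
    # Rank training mutants by Euclidean distance to the query vector.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical distance features and fold-resistance labels.
train_x = [[3.1, 4.0], [3.3, 4.2], [5.0, 6.1], [5.2, 6.0]]
train_y = [1.0, 2.0, 12.0, 14.0]
prediction = knn_predict(train_x, train_y, [3.2, 4.1], k=2)
```

A nearest-neighbour model is chosen here only because it needs no training step and works with very small data sets; any regression method could take its place.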

16
Ligplot Contacts File
17
Atomic contacts - resistance
18
Unsupervised learning

Inputs
Contact residues (21 distinct contacts)
Output
A self-organizing map embedding structural
information
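
As a rough illustration of the unsupervised step, the sketch below trains a small one-dimensional self-organizing map on binary contact vectors (1 if a residue contact is present, 0 otherwise). The map size, learning schedule, function names and sample vectors are all assumptions made for this example; the presentation does not give these details.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))


def train_som(samples, n_units=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D self-organizing map and return its unit weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays linearly from 1 toward 0
        lr = lr0 * frac                       # shrinking learning rate
        radius = max(radius0 * frac, 1e-3)    # shrinking neighbourhood radius
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighbourhood on the 1-D grid of units.
                h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical binary contact vectors (shortened to 4 contacts for brevity).
samples = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
som = train_som(samples, n_units=3, epochs=10)
units = [best_matching_unit(som, s) for s in samples]
```

After training, mutants with similar contact patterns tend to be assigned to the same or neighbouring units, which is what makes the map useful for comparing it against resistance values.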

19
Ligplot Contacts File
20
Self-organizing feature maps
21
Residue contacts - resistance
22
Problems and challenges in bioinformatics

Insufficient data
Example
Largest data set has 50 mutants
Why?
The field is very recent
Data collection can be very difficult (one
structure may take 1-2 years if done from
scratch; one IC90 value may take up to two weeks)
Data has commercial value
Solutions
Get more data
Cross-validate very carefully
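
With a data set of only about 50 mutants, one common way to cross-validate is leave-one-out: each mutant is held out in turn and predicted from all the others. The sketch below assumes a generic `predict(train_x, train_y, query)` callable; that interface and the trivial mean predictor are hypothetical and are used only to keep the example self-contained.

```python
def leave_one_out_errors(xs, ys, predict):
    """Return the absolute prediction error for each held-out sample.

    predict(train_x, train_y, query) must return a float.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two aligned samples")
    errors = []
    for i in range(len(xs)):
        # Train on everything except sample i, then test on sample i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        errors.append(abs(predict(train_x, train_y, xs[i]) - ys[i]))
    return errors


def mean_predictor(train_x, train_y, query):
    """Baseline that ignores the features and predicts the training mean."""
    return sum(train_y) / len(train_y)


# Hypothetical features and fold-resistance values.
xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 2.0, 3.0]
errors = leave_one_out_errors(xs, ys, mean_predictor)
```

Comparing a model's held-out errors against a baseline like the mean predictor is one simple check that the model has learned more than the average of the data.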

23
Problems and challenges in bioinformatics

Data consistency
Example
Same sample sent to two different labs can come
back with different IC90 values
Why?
The experimental tools are not mature yet
Solutions
Select your data carefully
Use data from consistent sources
If not possible, pre-process the data to make it
consistent (not very good since you actually
change the data!)

24
Problems and challenges in bioinformatics

Data accuracy
Example
Same sample sent to the same lab at different
times can be reported with different IC90 values
(4-fold error)
Why?
The experimental tools are not mature yet
Solutions
Use relative values to reduce the requirement for
high numerical precision
Map data into clusters and attach values to
clusters (1-4 no resistance, 4-10 reduced
resistance, >10 resistance)
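
The mapping from fold resistance to clusters can be sketched as a simple binning function. The slide does not say which class the boundary values 4 and 10 belong to, nor how values below 1 are handled; the choices below are assumptions, and the function name is hypothetical.

```python
def resistance_class(fold: float) -> str:
    """Map a fold-resistance value to one of the three classes on the slide.

    Assumptions (not stated in the source): values below 4, including
    values below 1, count as "no resistance"; 4 and 10 both fall in
    "reduced resistance".
    """
    if fold <= 0:
        raise ValueError("fold resistance must be positive")
    if fold < 4:
        return "no resistance"
    if fold <= 10:
        return "reduced resistance"
    return "resistance"


# Hypothetical fold-resistance values.
labels = [resistance_class(f) for f in (1.5, 4.0, 10.0, 25.0)]
```

Binning in this way trades numerical detail for robustness: a 4-fold measurement error can still move a sample across a class boundary, but small fluctuations inside a class no longer matter.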

25
Problems and challenges in bioinformatics

Data quality
Example
Papers reporting IC90 values do not give the
whole sequence
Why?
People are not aware of its importance
Data may have commercial value
Solutions
Never trust your data...

26
Problems and challenges in bioinformatics

The choice of features
Example
Atoms?, Residues?, Genes?, Larger structures?
Why?
The phenomena are very complex and span different
scales in time and space
Solutions
Try to merge different types of data in order to
capture the complexity of the phenomenon
Use several qualitatively different analysis and
machine learning techniques

27
Problems and challenges in bioinformatics

Lack of tools
Example
There were no tools able to correlate
sequence/structure/resistance data for the HIV
virus
We wrote more than 15,000 lines of code for this
problem
Why?
The field is new
The structure/function problem is just starting
to be addressed
Solutions
Develop your own software
Partnerships with bioinformatics companies?

28
Problems and challenges in bioinformatics