Data Mining and Bioinformatics - PowerPoint PPT Presentation

About This Presentation

Title:

Data Mining and Bioinformatics

Slides: 28

Provided by: UNC52

Transcript and Presenter's Notes

1
Data Mining and Bioinformatics

Wei Wang
Assistant Professor

329 Sitterson Hall
weiwang_at_cs.unc.edu
www.cs.unc.edu/weiwang
2
What is Data Mining?

Techniques that can extract valuable knowledge
from massive data
Clustering
Anomaly detection
Data modeling
Compression
Classification
Association/correlation analysis
Similarity search
Trend detection

Wide applications
Bioinformatics
Image processing
Traffic engineering
Security
E-commerce

3
Bioinformatics

Biological data are abundant and information rich
Data produced at different levels:
molecules, cells, organs, organisms, populations
Data obtained from different channels:
Structure: sequence, shape, energy, ...
Function: gene expression, pathway, phenotypic and clinical data, ...

4
Bioinformatics

Molecular level

5
Bioinformatics

Molecular level

6
Bioinformatics

Challenges
Highly complex
Noisy
Inconsistent
Redundant
Data mining can help!

7
What We Are Doing

Proteomics
Protein structure modeling and analysis
with Jan Prins (CS), Alex Tropsha (Pharmacology)
Gene expression and pathways
OP-Cluster: tendency-based gene expression analysis
Classification based on association
Discriminative feature selection
Classification on one class and unlabelled data
with Andrew Nobel (Statistics), Peter Petrusz
(Medicine), UIUC

8
What We Are Doing

Semantic integration of heterogeneous genome
databases
Similarity Queries across heterogeneous
Microarray data
UIUC

9
Protein Data Bank (PDB) Growth
Can we find patterns in the exponentially growing PDB?
10
Protein Structure Visualization
11
Protein Classification
12
Computer Understands Numbers
13
Graph Representation of Proteins

We represent a protein by an undirected labeled graph
Every node corresponds to a residue in the protein, labeled by its type.
(u, v) is an edge in the graph iff
there is a peptide bond between residues u and v (peptide edge), or
the distance between u and v (represented by their two Cα atoms) is less than 10 Å (proximity edge)

ATOM 820 CA THR 115 -7.108 8.835 6.640 1.00 8.21
ATOM 1280 CA THR 175 -19.567 2.837 0.682 1.00 14.73
ATOM 1671 CA ARG 229 -15.242 -4.327 0.885 1.00 6.50
ATOM 1707 CA SER 233 -15.989 -6.491 -4.881 1.00 6.86
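As a minimal sketch of the graph construction described on this slide, the Python below parses the four abbreviated Cα records shown above and adds an edge for every residue pair that satisfies either rule. The whitespace-separated parsing, the function names, and the use of consecutive residue numbers as a stand-in for "there is a peptide bond" are assumptions made for illustration, not the presenter's implementation. Real PDB files use fixed-width columns, so this parser only fits the shortened form seen here.

```python
import math
from itertools import combinations

# The four C-alpha records from the slide, one per line.
RECORDS = """\
ATOM 820 CA THR 115 -7.108 8.835 6.640 1.00 8.21
ATOM 1280 CA THR 175 -19.567 2.837 0.682 1.00 14.73
ATOM 1671 CA ARG 229 -15.242 -4.327 0.885 1.00 6.50
ATOM 1707 CA SER 233 -15.989 -6.491 -4.881 1.00 6.86
"""


def parse_ca_records(text):
    """Parse whitespace-separated C-alpha records (slide format, not full PDB)."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "ATOM" or fields[2] != "CA":
            continue
        residues.append({
            "type": fields[3],
            "seq": int(fields[4]),
            "xyz": tuple(float(v) for v in fields[5:8]),
        })
    return residues


def build_protein_graph(residues, cutoff=10.0):
    """Return (nodes, edges): nodes map residue number to residue type,
    edges map a residue-number pair to 'peptide' or 'proximity'."""
    nodes = {r["seq"]: r["type"] for r in residues}
    edges = {}
    for a, b in combinations(residues, 2):
        pair = (a["seq"], b["seq"])
        # Assumption: adjacent residue numbers stand in for a peptide bond.
        if abs(a["seq"] - b["seq"]) == 1:
            edges[pair] = "peptide"
        elif math.dist(a["xyz"], b["xyz"]) < cutoff:
            edges[pair] = "proximity"
    return nodes, edges


nodes, edges = build_protein_graph(parse_ca_records(RECORDS))
for pair, kind in edges.items():
    print(pair, kind)
```

None of the four residues are adjacent in sequence, so any edge the sketch reports for them comes from the 10 Å distance rule alone.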
14
Graph Representation of Proteins
15
Finding All Frequent Subgraphs
NP-hard!
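To make the notion of a "frequent" pattern concrete, here is a rough sketch that handles only the smallest case: single labeled edges. It counts in how many graphs each (residue type, residue type, edge kind) pattern occurs and keeps those that reach a minimum support. The toy graphs, the data layout, and the function name are hypothetical. Larger patterns require subgraph isomorphism tests, which is where the computational difficulty comes from, and this sketch does not attempt them.

```python
from collections import Counter


def frequent_edge_patterns(graphs, min_support):
    """Count, per one-edge pattern, the number of graphs containing it.

    Each graph is (labels, edges): labels maps node id to a residue type,
    edges maps a node-id pair to an edge kind.
    """
    support = Counter()
    for labels, edges in graphs:
        # A set, so that a pattern counts at most once per graph.
        patterns = {
            tuple(sorted((labels[u], labels[v]))) + (kind,)
            for (u, v), kind in edges.items()
        }
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}


# Hypothetical toy database of three small protein graphs.
toy_graphs = [
    ({1: "ASP", 2: "GLY", 3: "HIS"}, {(1, 2): "proximity", (2, 3): "peptide"}),
    ({1: "GLY", 2: "ASP", 3: "SER"}, {(1, 2): "proximity", (2, 3): "proximity"}),
    ({1: "ASP", 2: "GLY"}, {(1, 2): "peptide"}),
]

frequent = frequent_edge_patterns(toy_graphs, min_support=2)
print(frequent)
```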
16
Classifying Proteins from SCOP
SCOP classifies proteins by five levels: Class, Fold, Superfamily, Family, and individual proteins. We formed three datasets from SCOP.
Accuracy is defined as (true positives + true negatives) / total samples. The results are reported as average values over ten-fold cross-validation.
We used the LibSVM classifier from http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Parameters: C-SVM classification model with the linear kernel, leaving all other parameters at their defaults.
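The evaluation protocol above can be sketched in a few lines of Python: accuracy as (true positives + true negatives) / total samples, averaged over ten folds. LibSVM itself is not called here. The threshold classifier, the toy data, the fold assignment, and all function names are hypothetical stand-ins so that the sketch runs on its own.

```python
def accuracy(y_true, y_pred):
    """(true positives + true negatives) / total samples, for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k disjoint folds (round-robin assignment)."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_accuracy(fit_predict, X, y, k=10):
    """Average held-out accuracy over k folds.

    fit_predict(X_train, y_train, X_test) must return predicted labels.
    """
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        y_pred = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(accuracy([y[i] for i in test_idx], y_pred))
    return sum(scores) / len(scores)


def threshold_classifier(X_train, y_train, X_test):
    """Hypothetical stand-in for an SVM: split at the training mean."""
    cut = sum(x[0] for x in X_train) / len(X_train)
    return [1 if x[0] >= cut else 0 for x in X_test]


# Hypothetical one-feature data set with 20 samples.
X = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10
mean_acc = cross_validated_accuracy(threshold_classifier, X, y, k=10)
print(mean_acc)
```

With the LibSVM command-line tools, a C-SVM with a linear kernel corresponds to the options `-s 0 -t 0`, and `-v 10` requests ten-fold cross-validation. Whether the reported numbers came from that built-in option or from a separate script is not stated on the slide.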
17
Fingerprints in Prokaryotic Serine Protease
G1
G2
Backbone: Achromobacter lyticus protease I (PDB ID 1ARB).
18
An Even Larger Fingerprint
ASP 2   2   0   0   0   0   0
2   GLY 0   2   0   0   0   0
2   0   GLY 0   0   0   0   0
0   2   0   PHE 2   0   0   0
0   0   0   2   LEU 2   2   0
0   0   0   0   2   ALA 0   2
0   0   0   0   2   0   VAL 0
0   0   0   0   0   2   0   ALA
2: proximity edge.
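One way to read the fingerprint listing is as a symmetric 8 x 8 adjacency matrix whose diagonal holds the residue labels and whose off-diagonal entries are edge codes. That reading is an interpretation of the slide, supported by the fact that the 64 tokens then form a symmetric matrix. The sketch below rebuilds the matrix from the token stream and lists its edges. Only code 2 (proximity edge) is defined on the slide, so no other codes are assumed.

```python
# Token stream from the slide, read row by row: 8 labels + 56 edge codes.
STREAM = (
    "ASP 2 2 0 0 0 0 0 2 GLY 0 2 0 0 0 0 2 0 GLY 0 0 0 "
    "0 0 0 2 0 PHE 2 0 0 0 0 0 0 2 LEU 2 2 0 0 0 0 0 2 "
    "ALA 0 2 0 0 0 0 2 0 VAL 0 0 0 0 0 0 2 0 ALA"
)
EDGE_CODES = {"2": "proximity"}  # the only code the slide explains

cells = STREAM.split()
n = 8
rows = [cells[i * n:(i + 1) * n] for i in range(n)]

# Residue labels sit on the diagonal under this reading.
labels = [rows[i][i] for i in range(n)]

# Sanity check: an undirected graph needs a symmetric matrix.
symmetric = all(rows[i][j] == rows[j][i] for i in range(n) for j in range(i))

# Upper-triangle entries with a known code become edges.
edges = [
    (i, j, EDGE_CODES[rows[i][j]])
    for i in range(n)
    for j in range(i + 1, n)
    if rows[i][j] in EDGE_CODES
]

for i, j, kind in edges:
    print(f"{labels[i]}({i}) - {labels[j]}({j}): {kind}")
```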
19
Clustering and Classification

Gene expression data

[Figure: gene expression measurement workflow. Labels: Cells; Poly(A)/Total RNA; cDNA; IVT (Biotin-UTP, Biotin-CTP); Labeled transcript; Fragment (heat, Mg2+); Labeled fragments; Hybridize (16 hours); Wash, Stain; Scan]
20
Clustering
21
We are looking for solutions to ...

Gene Discovery
Screening technique to identify regulated genes,
e.g. transcriptional response of yeast to environmental stresses (cold, saline, nutrient starvation, ...)
Transcript profiles of diseases, e.g. cancer
-> identification of single gene products
establishment of tumor markers for diagnostic purposes
-> drug development only affecting expressed genes
-> cancer classification
Toxicological research, drug discovery
Genetic network interference

22
We work on numbers again
samples
genes
23
OP-Clustering on mouse gene expression
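The deck describes OP-Cluster only as tendency-based gene expression analysis. One simple reading of "tendency" is that genes belong together when their expression values rise and fall in the same order across a set of conditions, regardless of magnitude. The sketch below groups genes by that rank order. It is a simplification under that assumption and not the OP-Cluster algorithm itself, which the slides do not spell out. The gene names, condition names, and values are hypothetical.

```python
from collections import defaultdict


def group_by_tendency(expression, conditions):
    """Group genes whose values induce the same ordering of the conditions.

    expression maps gene -> {condition: value}. Conditions are sorted from
    lowest to highest value per gene; the resulting order is the group key.
    """
    groups = defaultdict(list)
    for gene, values in expression.items():
        order = tuple(sorted(conditions, key=lambda c: values[c]))
        groups[order].append(gene)
    return dict(groups)


# Hypothetical expression values for three genes under three conditions.
expression = {
    "g1": {"c1": 1.0, "c2": 5.0, "c3": 3.0},
    "g2": {"c1": 10.0, "c2": 90.0, "c3": 40.0},
    "g3": {"c1": 7.0, "c2": 2.0, "c3": 4.0},
}

groups = group_by_tendency(expression, ["c1", "c2", "c3"])
for order, genes in groups.items():
    print(" < ".join(order), genes)
```

Here g1 and g2 differ by an order of magnitude yet share the same tendency, which is the kind of similarity a distance-based clustering on raw values would miss.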
24
Database Integration
Schema Integration
25
Want to learn more?