Data Mining For Bioinformatics: Tools and Applications

Slides: 29
Provided by: CraigAS7

1
Data Mining For Bioinformatics: Tools and Applications

Craig A. Struble
Department of Mathematics, Statistics, and
Computer Science
craig.struble_at_marquette.edu

2
Overview

Clustering
Hierarchical Clustering, SOMs, Model-based
clustering
Classification
SVMs, neural networks
Tool building

3
Clustering

Basic idea
Group similar things together
d(x,y): distance function between x and y
Euclidean, mismatches, etc.
Bioinformatics context
Similar expression profiles imply similar
function
This is under some scrutiny
Unsupervised
Useful when no other information is available
Just to see what happens
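The two distance functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `euclidean` and `mismatches` are chosen here and do not come from the presentation, and both functions assume inputs of equal length.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mismatches(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(s, t) if a != b)
```

Euclidean distance suits numeric expression profiles, while a mismatch count suits aligned sequences of symbols.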

4
Example Data

Genes x Experiments
6000 genes x 16 experiments
Could use ratios or other values for data

5
Hierarchical Clustering

Bottom up (agglomerative)
Top down (divisive)
Linkage
How groups are combined (or split)
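As a rough illustration of the bottom-up approach, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters. The linkage argument decides how the distance between two groups is measured. All names (`agglomerate`, `linkage_distance`) are hypothetical, the distance is assumed to be Euclidean, and the brute-force search is written for clarity, not for 6000 genes.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(c1, c2, points, linkage="single"):
    """Distance between two clusters, each given as a list of point indices."""
    pair_dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(pair_dists)
    if linkage == "complete":    # farthest pair of members
        return max(pair_dists)
    if linkage == "average":     # mean over all pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

For example, `agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)], 2)` groups the first two points together and the last two together. Recording the order of merges, instead of stopping at a fixed number of clusters, would give the tree usually drawn for hierarchical clustering.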

6
Hierarchical Clustering (Example)

Analyzing yeast data (different experiment)

7
Self Organizing Maps

Also called Kohonen maps
Example of neural networks

8
Self Organizing Maps

Yeast data set
Typical values
Error bars
Using GeneSOM in R
Many other visualizations

9
Clustering With Models

Create/select representative points (i.e. models)
Perform cluster analysis (K-means, K-medoids,
etc.)
Classify/identify real data items by finding
which representative they cluster with
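A minimal K-means sketch of these steps follows. It assumes Euclidean distance and, to keep the example deterministic, takes the first k points as the initial representatives. That initialization is a simplification made here, not something stated in the presentation, and the function names are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(point, centers):
    """Index of the representative closest to point."""
    return min(range(len(centers)), key=lambda k: euclidean(point, centers[k]))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, max_iter=100):
    """Return (centers, labels) after at most max_iter refinement rounds."""
    centers = [tuple(p) for p in points[:k]]  # simplistic initialization (assumption)
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # keep the old center if a cluster loses all of its members
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

Once the representatives are fixed, a new data item is identified by calling `nearest(item, centers)`.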

10
Example: Yeast Sporulation

Chu et al., Science 282

11
Example: Yeast Sporulation

12
After Clustering

Try multiple sequence alignment of genes closely
clustered together
Include upstream/downstream sequence to look for
promoter regions, etc.
Search KEGG for metabolic pathways the genes may
be involved in
Look at functional classification of genes in the
same cluster
May be able to assign putative function to genes
with unknown function (again, this is under
scrutiny)

13
Classification

Use data to create a classifier
A predictive model for labeling new data items
Supervised learning
Generate data associated with known labels
Train the specific technique with labeled data

14
Support Vector Machines

Find a hyperplane to separate data points
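To make the idea concrete, the sketch below fits a linear hyperplane w·x + b = 0 by subgradient descent on the regularized hinge loss, which is one simple way to train a linear soft-margin SVM. It is not the solver used in any of the studies cited in this presentation. The function names, the learning rate, the regularization strength and the number of epochs are all illustrative assumptions, and labels are assumed to be +1 or -1.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss; y holds +1/-1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # the regularization term shrinks w a little at every step
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # point is misclassified or inside the margin: push the hyperplane away
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Label a point by the side of the hyperplane it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Only points with a margin below 1 change the hyperplane, which mirrors the role support vectors play in the exact formulation. Kernels, which allow non-linear separation, are left out of this sketch.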

15
Feature Selection

Identify subset of attributes as most important
Identify group of genes that play most important
role in distinguishing classes
Use information gain or other statistical
measures to determine the importance of an
attribute
In many cases, feature selection is the true goal
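Information gain can be sketched as the drop in class entropy after splitting the samples on one attribute. The example below assumes a numeric attribute split at a single threshold. The threshold split and the function names are choices made for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples into value <= threshold and value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder
```

An attribute whose split separates the classes perfectly gains the full entropy of the labels, while an attribute unrelated to the classes gains nothing. Ranking genes by this value is one way to pick the subset that best distinguishes the classes.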

16
Example: Leukemia Classification

Ovarian Cancer (Furey et al., 2000)
Also tested on Leukemia data (Golub et al., 1999)
97,802 DNA clones used
31 tissue samples
Cancerous and normal ovarian tissue
Non-ovarian tissue

17
Example (cont.)

Feature selection
High scoring genes differ most on average and
have small deviations in value
50 relevant clones identified
Leave one out testing
80% accuracy
Really testing if selected features are good
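One score with the property described above (large difference in class means, small deviations) is the absolute mean difference divided by the sum of the two class standard deviations. Whether this matches the exact formula used in the cited studies is an assumption. The sketch also shows leave-one-out testing in general form. `fit` and `predict` are hypothetical callables supplied by the caller, and labels are assumed to be +1 or -1.

```python
import statistics

def signal_to_noise(pos, neg):
    """|mean difference| / (sum of standard deviations) for one gene."""
    spread = statistics.pstdev(pos) + statistics.pstdev(neg)
    diff = abs(statistics.mean(pos) - statistics.mean(neg))
    if spread == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / spread

def top_features(X, y, k):
    """Indices of the k highest-scoring columns of X."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        scores.append((signal_to_noise(pos, neg), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def leave_one_out_accuracy(X, y, fit, predict):
    """Hold out each sample once, train on the rest, test on the held-out sample."""
    correct = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)
```

With 31 samples, leave-one-out testing trains 31 models, each on 30 samples, and reports the fraction of held-out samples labeled correctly.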

18
Neural Networks

Network of connected neurons
Trained with data with known output values
Errors propagated through network for learning

[Figure: network with input layer, hidden layer, and output layer]

19
Example: Cleavage Site Prediction (Nielsen et al., 1997)

Predict where cleavage sites are in protein
precursors
Training data from SWISS-PROT database

20
Example (cont.)

Input layer
Groups of 20 neurons per location
20 amino acids
Sliding window of 5-39 amino acids
Hidden layer
0-10 neurons tried
Output layer
2 neurons per location, P(c),P(s)
Trained with backpropagation
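The input layer described above amounts to a one-hot encoding: each window position gets 20 inputs, exactly one of which is switched on for the amino acid found there. The sketch below illustrates that encoding only. The residue ordering, the all-zero treatment of unknown residues and the function names are assumptions made here, and the handling of sequence ends in the cited study is not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def one_hot(residue):
    """20 inputs for one position; unknown residues map to all zeros (assumption)."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        vec[idx] = 1
    return vec

def encode_window(window):
    """Concatenate the one-hot groups for every position in the window."""
    encoded = []
    for residue in window:
        encoded.extend(one_hot(residue))
    return encoded

def sliding_windows(sequence, width):
    """All contiguous windows of the given width, left to right."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
```

A window of 5 residues therefore yields 100 inputs, and a window of 39 residues yields 780.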

[Figure: network outputs P(c) and P(s) for each sequence position]

21
Tool Building

Many commercial packages have lots of tools
Not always integrated
May not provide enough flexibility
Sometimes you've just gotta do it yourself

22
Tool Building

Identify problem to work on
E.g. predict where miRNAs are on the genome
Determine where to get data
E.g. NCBI, KEGG, literature
Determine the final format of your data
E.g. Oracle, PostgreSQL, CSV, XML, etc.
Select data mining techniques to use
Literature and experience

23
Tool Building

Select user interface style
Web-based vs. applet vs. application
Upload data/download data?
Select visualizations
Decide how to present the data
Communicate with your users

24
Tool Building

Select software to use
Does it contain the data mining techniques to
use?
Is the source code available?
Library vs. application vs. interpreter
What do you know and what are you willing to
learn?
Eventually, you'll build up a collection of tools
to build on

25
Typical architecture for small project

[Figure: architecture diagram with components: Perl scripts, clustering, neural network, output]

26
Where to Find Examples

Search Google (http://www.google.com)
E.g. "self organizing maps bioinformatics"
Citeseer (http://citeseer.nj.nec.com/)
PubMed (http://www.ncbi.nih.gov)
Web pages and papers usually contain links to
software used

27
Be Prepared

Programming for bioinformatics/data mining often
requires knowing many languages (Perl, C, Java
at a minimum)
Practice on supplied sample data sets if any
Read, read, read!

28
Useful free tools