Title: CS491JH: Data Mining in Bioinformatics
1. CS491JH: Data Mining in Bioinformatics
- Introduction to Microarray Technology
- Technology Background
- Data Processing Procedure
- Characteristics of Data
- Data integration and Data mining
2. Substrates for High Throughput Arrays
Single label: P33
Single label: biotin-streptavidin
Dual label: Cy3, Cy5
3. GeneChip Probe Arrays
Hybridized Probe Cell
GeneChip Probe Array
Single stranded, labeled RNA target
Oligonucleotide probe
24 µm
Millions of copies of a specific oligonucleotide probe
1.28 cm
>200,000 different complementary probes
Image of Hybridized Probe Array
4. GeneChip Expression Array Design
Gene Sequence
Probes designed to be Perfect Match
Probes designed to be Mismatch
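A minimal sketch of how a mismatch probe can be derived from a perfect-match probe. It assumes the usual GeneChip convention of a 25-mer whose central (13th) base is swapped for its complement; the slide itself does not state this, and the probe sequence and the function name `mismatch_probe` are hypothetical.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(perfect_match):
    """Return a copy of the probe with its central base complemented.

    Assumption: the mismatch differs from the perfect match only at the
    middle position (index 12 of a 25-mer).
    """
    mid = len(perfect_match) // 2
    return perfect_match[:mid] + COMPLEMENT[perfect_match[mid]] + perfect_match[mid + 1:]

# Hypothetical 25-mer, not a real probe sequence.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
```

Comparing the signal of each perfect-match probe with its mismatch partner gives a per-probe estimate of non-specific hybridization.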
5. Procedures for Target Preparation
- Cells
- Poly(A)/Total RNA
- cDNA
- IVT (Biotin-UTP, Biotin-CTP)
- Labeled transcript
- Fragment (heat, Mg2+)
- Labeled fragments
- Hybridize (16 hours)
- Wash, Stain
- Scan
6. Microarray Technology
7. Printing Arrays on 50 slides
8. Ratio of expression of genes from two sources
Total or
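The ratio of expression between two sources is commonly reported as a log ratio of the two channel intensities for each spot. The sketch below assumes Cy5 is the numerator and Cy3 the denominator, and that intensities are already background-subtracted; the `floor` parameter, the gene names, and the intensity values are hypothetical.

```python
import math

def log2_ratio(cy5, cy3, floor=1.0):
    """Return log2(Cy5 / Cy3) for one spot.

    Intensities below `floor` are raised to `floor` so that the ratio
    and the logarithm are always defined.
    """
    cy5 = max(cy5, floor)
    cy3 = max(cy3, floor)
    return math.log2(cy5 / cy3)

# Hypothetical background-subtracted intensities as (Cy5, Cy3) pairs.
spots = {
    "geneA": (2000.0, 500.0),
    "geneB": (300.0, 1200.0),
    "geneC": (800.0, 800.0),
}
ratios = {gene: log2_ratio(cy5, cy3) for gene, (cy5, cy3) in spots.items()}
```

On a log scale, a fourfold increase and a fourfold decrease are symmetric around zero, which is why log ratios are usually preferred over raw ratios.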
9. GSI Lumonics
10. Cattle and Soy Controls
Beta Actin
PKG
HPRT
Beta 2 microglobulin
Rubisco
AB binding protein
Major latex protein homologue (MSG)
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green). 1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng), and MSG (0.05 ng) were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem, Inc.). To the right of each set of spots are five negative controls (water).
11. Fetal Spleen (Cy3), Adult Spleen (Cy5)
Labeled spots: IgM, MYLK, IgM heavy chain, COL1A2
12. GenePix Image Analysis Software
Placenta vs. Brain, 3800 Cattle Placenta Array
Channels: Cy3, Cy5
14. Microarray Data Process
- Experimental Design
- Image Analysis: raw data
- Normalization: clean data
- Data Filtering: informative data
- Model building
- Data Mining (clustering, pattern recognition, etc.)
- Validation
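The normalization step in the list above can be illustrated with global median centering of log ratios. This is one common choice, not necessarily the method used in the course; the function name and the values are hypothetical.

```python
import statistics

def median_center(log_ratios):
    """Shift log ratios so that their median is zero.

    Assumption: most genes do not change between the two samples, so
    the median log ratio should be zero after correcting dye or
    labeling bias.
    """
    med = statistics.median(log_ratios)
    return [x - med for x in log_ratios]

# Hypothetical raw log ratios with a systematic offset.
raw = [0.9, 0.4, 0.5, 0.6, -0.1]
normalized = median_center(raw)
```

After centering, a value of zero means "no change relative to the typical gene on this array", which makes arrays comparable before filtering and mining.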
15. Scatterplot of Normalized Data
Axes: Fetal, Adult
16. > 0.3
< -0.3
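Reading the two labels on this slide as symmetric cutoffs on a normalized log ratio (an assumption; the slide gives no units), the filtering step could look like the sketch below. If the values are log10 ratios, 0.3 corresponds to roughly a twofold change. Gene names and values are hypothetical.

```python
def split_by_cutoff(values, cutoff=0.3):
    """Return (up, down) gene lists for values above +cutoff and below -cutoff."""
    up = sorted(g for g, v in values.items() if v > cutoff)
    down = sorted(g for g, v in values.items() if v < -cutoff)
    return up, down

# Hypothetical normalized log ratios; not data from the slides.
normalized = {"geneA": 0.45, "geneB": -0.62, "geneC": 0.05, "geneD": 0.30}
up, down = split_by_cutoff(normalized)
```

Genes that fall between the two cutoffs are treated as unchanged and dropped before model building.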
17. Characteristics of Data
Data can be viewed as an N x M matrix (N >> M)
- N is the number of genes
- M is the number of data points for each gene
Or N x (M + K)
- K is the number of features describing each gene (genome location, functional description, metabolic pathway, etc.)
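A small sketch of the two views of the data described above: an N x M expression matrix, and the same matrix extended with K annotation features per gene. All gene names, values, and annotations are hypothetical placeholders.

```python
# N = 3 genes, M = 4 measurements per gene (hypothetical values).
expression = {
    "geneA": [0.1, 0.5, 1.2, 0.9],
    "geneB": [-0.3, -0.8, -1.1, -0.7],
    "geneC": [0.0, 0.1, -0.1, 0.0],
}

# K = 3 descriptive features per gene (hypothetical annotations).
feature_names = ["location", "function", "pathway"]
annotations = {
    "geneA": {"location": "chr1", "function": "kinase", "pathway": "signaling"},
    "geneB": {"location": "chr7", "function": "transporter", "pathway": "metabolism"},
    "geneC": {"location": "chr3", "function": "unknown", "pathway": "unknown"},
}

def build_rows(expression, annotations, feature_names):
    """Return one row per gene: M expression values followed by K features."""
    rows = []
    for gene in sorted(expression):
        features = [annotations[gene].get(name) for name in feature_names]
        rows.append(expression[gene] + features)
    return rows

rows = build_rows(expression, annotations, feature_names)
n_genes, n_columns = len(rows), len(rows[0])
```

In real experiments N is in the thousands while M is often a few dozen at most, which is why methods designed for many samples and few variables do not transfer directly.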
18. Model for Data Analysis
- Gene expression is a dynamic process
- Each microarray experiment is a snapshot of the process
- Need basic biological knowledge to build a model
- For example
- Assumption: in most experiments, only a small set of genes (100s/1000s) is affected significantly.
19. Need for Data Mining
Data Mining
- Data volumes are too large for traditional analysis methods
- Large number of records and high-dimensional data
- Only a small portion of the data is analyzed
- Decision support process becomes more complex
Functions of Data Mining
- Use the data to build predictors: prediction, classification, deviation detection, segmentation
- Generate more sophisticated summaries and reports to aid understanding of the data: find clusters, partitions in data
20. Data Mining Methods
- Classification, Regression (Predictive Modeling)
- Clustering (Segmentation)
- Association Discovery (Summarization)
- Change and deviation detection
- Dependency Modeling
- Information Visualization
21. Clustered display of data from time course of serum stimulation of primary human fibroblasts.
Cholesterol Biosynthesis
Cell Cycle
Immediate Early Response
Signaling and Angiogenesis
Wound Healing and Tissue Remodeling
Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998), p. 14865
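Clustered displays of this kind are produced by hierarchical clustering of gene expression profiles. The sketch below shows average-linkage agglomerative clustering with 1 minus the Pearson correlation as the distance. It is in the spirit of the cited paper, not a reproduction of its exact procedure, and the profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    # Pairwise distance between genes: 1 - correlation (0 = identical shape).
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}

    def cluster_dist(c1, c2):
        # Average of all pairwise gene distances between the two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        d, i, j = min((cluster_dist(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical time-course profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
merges = average_linkage(profiles)
```

The merge order defines the tree drawn beside the heat map; genes with similar temporal patterns end up adjacent, which is what makes functional groups visible as blocks.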
24. Self Organizing Maps
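A minimal sketch of a one-dimensional self-organizing map, to show the basic update rule: find the best-matching node for each profile, then pull that node and its grid neighbors toward the profile. The initialization from data points, the linear decay schedules, and all parameter values and profiles are assumptions chosen to keep the example deterministic.

```python
import math

def bmu(weights, x):
    """Index of the best-matching unit: the node closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(data, n_nodes=2, epochs=50, lr0=0.5, sigma0=1.0, sigma_end=0.1):
    """Train a 1-D SOM with n_nodes >= 2 and return the node weight vectors."""
    # Assumption: initialize nodes from evenly spaced data points.
    step = (len(data) - 1) / (n_nodes - 1)
    weights = [list(data[round(i * step)]) for i in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                        # learning rate decays to 0
        sigma = sigma0 + (sigma_end - sigma0) * frac   # neighborhood shrinks
        for x in data:
            win = bmu(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood on the node grid around the winner.
                h = math.exp(-((i - win) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical profiles: two rising genes, then two falling genes.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0],
    [2.9, 2.1, 1.1],
]
weights = train_som(data)
assignments = [bmu(weights, x) for x in data]
```

Unlike hierarchical clustering, the number of nodes is fixed in advance, and neighboring nodes on the grid end up representing similar expression patterns.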
25. Molecular Classification of Cancer
27. Gene Expression Profile of Aging and Its Retardation by Caloric Restriction
Cheol-Koo Lee, Roger G. Klopp, Richard Weindruch, Tomas A. Prolla
28. Expression Landscape of cell-cycle regulated genes in yeast
29. Multi-dimensional data visualization