Micro Array Analysis

Slides: 78

Provided by: Sam5170

Transcript and Presenter's Notes

1
Micro Array Analysis

Lisi's and Ivan's data sets

2
Biological Background

Mechanisms that protect against DNA damage:
Prevention of DNA damage
Aiding in DNA repair
Elimination of damaged cells through apoptosis
Lack of DNA repair pathways -> high risk for cancer
DNA damage -> cancer, aging, toxicity
Alkylating agents: diet, atmosphere, chemotherapy, smoking
UV radiation, polymerase errors, spontaneous DNA decomposition

3
Biological Background 2

7MeG: damage at the N7 position of guanine -> 7MeG is not mutagenic or lethal
Lesions: O6MeG and 3MeA
O6MeG lesions are caused by cancer chemotherapy, MNNG, and BCNU
MMR can fix single base mismatches and insertion/deletion loops -> low mutation rate
E. coli MMR: excises hundreds of bases near the mismatch on the daughter strand and replaces them

4
Biological Background 3

Mammalian MMR: more complex mechanism, strand-specific, lack of strand-discrimination mechanism
MSH2, MSH3, MSH6: recognize heteroduplex DNA
MSH4 and MSH5: recombination
EXO1: does the exonucleolytic reaction
PCNA: mismatch recognition

5
Biological Background 4

Mutations in MMR -> increase in spontaneous mutation rate -> microsatellite instability -> cancer (nonpolyposis colon cancer)
Explain normal mismatch repair

6
Microarray Technology

Simultaneously measures the relative expression levels of thousands of genes within a specific tissue
The mRNA transcripts of a cell are isolated and reverse transcribed to cDNA; this is the cDNA library of the cell
The cDNA representation of a cell is hybridized to labeled cDNA or to synthetic oligonucleotides (short sequences of single-stranded cDNA)
The cDNA or oligos on the chip are called probes, while the cDNA of the cell is the target

7
cDNA Arrays

Selection of probes
Amplification of cellular mRNA -> cDNA through PCR
Each spot in a cDNA array is a gene or EST
Extract total mRNA from two cell types, label with green and red -> relative abundance for each gene in the two cells

8
Oligo arrays

Cellular mRNA -> cDNA -> cRNA
Photolithographic, Short cDNAs
Represent genes by fixed length independent
segments
Each probe is 25 bp, each gene represented by a
number of probe pairs
Well chosen probes to specify gene uniquely and
reduce chances of cross hybridization
Probe pair consists of perfect match and mismatch
Mismatch pair same as perfect match except for a
base inversion in a central position

9
Pros and Cons of cDNA arrays vs Oligo arrays
10
Analysis Overview

Background subtraction (Affymetrix)
Normalization (RMA express, the effect of the number of chips?)
Filtering
P/A (all Ps, 3 of 4, 2 of 4, 0 of 4, etc)
Same/Same filters for duplications
Find correlation between duplicates and
triplicates
Find Log2 ratios
Find up- and down-regulated genes (use cutoff of +/- 1, arbitrary)
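
The log2 ratio and cutoff steps above can be sketched as follows. This is a minimal illustration, not the program used in the talk: the probe names and expression values are hypothetical, and treating a ratio exactly equal to the cutoff as regulated is an assumption.

```python
import math


def log2_ratio(treated, baseline):
    """Log2 of (mean treated expression / mean baseline expression)."""
    mean_treated = sum(treated) / len(treated)
    mean_baseline = sum(baseline) / len(baseline)
    return math.log2(mean_treated / mean_baseline)


def classify(ratio, cutoff=1.0):
    """Call a gene up or down when its log2 ratio reaches +/- cutoff."""
    if ratio >= cutoff:
        return "up"
    if ratio <= -cutoff:
        return "down"
    return "unchanged"


# Hypothetical replicate values: (treated replicates, baseline replicates).
expression = {
    "probe_A": ([400.0, 440.0], [100.0, 110.0]),
    "probe_B": ([50.0, 50.0], [200.0, 200.0]),
    "probe_C": ([120.0, 130.0], [125.0, 125.0]),
}

calls = {probe: classify(log2_ratio(t, b)) for probe, (t, b) in expression.items()}
print(calls)
```

A cutoff of 1 on the log2 scale corresponds to a two-fold change in either direction.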

11
Analysis 2

Find Up and Down regulated by T-test
Make GO maps using annotation database
Clustering
Protein-Protein databases correlated with log
ratios (redo)
SIPnomes and log ratios

12
Future Analysis Plans

Find Up and Down regulated by LPE test and
Generalized Likelihood ratio test
Predictions of complexes and new pathways
Classification of unknown sample by class
discovery
Time series analysis (regulatory networks,
periodic expression, coregulation)
9 region graphs
PCA
Using e-value cutoffs for subnetworks

13
Future Analysis Plans 2

Long term outcome prediction
Prediction of viability using network
interactions
Phenotype data, phosphorylation data, localization data, protein abundances
Mathematical properties of the network
Clustering methods comparison
Robustness of network
mRNA degradation effect
Genetic diagnostic test
Promoter Mapping

14
Annotation Database

Probe Set ID,Title,Gene Symbol,Location
UniGene ID, LocusLink ID, Swiss-Prot ID, Ensembl ID
GO
Biological Process
Cellular Component
Molecular Function
Pathways
Etc
Verified annotation cross-referencing

15
My program

Inputs
P/A calls from Affymetrix analysis
Annotation database
Normalized Cell files
Names of Cell files and their replicates
Names of one or more baseline Cell files
Fold cutoff

16
Names of Cell Files

TK6 Untreated,6-27-021.CEL,6-28-022.CEL
TK6 24 hr,6-28-024.CEL,6-28-025.CEL
TK6 48 hr,6-29-028.CEL,6-28-029.CEL
TK6 48 hr 2,6-27-028.CEL,6-28-029.CEL
----------------------------------------
TK6-MGMT Untreated,6-27-0210.CEL,6-28-0211.CEL
TK6-MGMT 24 hr,6-27-0214.CEL,6-28-0215.CEL
TK6-MGMT 48 hr,6-27-0217.CEL,6-28-0218.CEL
-----------------------------------------------
MT1 Untreated,6-27-0219.CEL,6-28-0220.CEL,6-28-0221.CEL
MT1 24 hr,6-27-0222.CEL,6-28-0223.CEL,6-28-0224.CEL
MT1 48 hr,6-27-0225.CEL,6-28-0226.CEL,6-28-0227.CEL
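
A list in this format (condition name, then comma-separated replicate files, with dashed lines between cell lines) could be read roughly as below. The function name `parse_groups` is hypothetical and the sample text is a shortened copy of the list above. How the original program parsed the file is not stated.

```python
def parse_groups(text):
    """Map each condition name to its list of replicate CEL file names."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the dashed separators between cell lines.
        if not line or set(line) == {"-"}:
            continue
        name, *files = [field.strip() for field in line.split(",")]
        groups[name] = files
    return groups


sample = """TK6 Untreated,6-27-021.CEL,6-28-022.CEL
TK6 24 hr,6-28-024.CEL,6-28-025.CEL
----------------------------------------
MT1 Untreated,6-27-0219.CEL,6-28-0220.CEL,6-28-0221.CEL"""

groups = parse_groups(sample)
for condition, files in groups.items():
    print(condition, len(files), "replicates")
```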

17
Program Outputs

Makes new database of average over replicates,
log2 ratios
Upregulated and Downregulated Lists for any
combination of P/A filters and Same/Same filters
Number up or down regulated
List of probeset ID, gene symbol, title, GO
(Biological Process), and log 2 ratio
List of probesets up or downregulated ready for
GO analysis
Set Operations
Intersection
In A not B
In A not B,C,D,
9 regions (explain)
Cross-referencing of protein-protein data with
expression data
Subnetworks of up or down regulation
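
The set operations listed above map directly onto Python's built-in set type. The probeset IDs and list names here are hypothetical placeholders for up-regulated lists from different conditions.

```python
# Hypothetical up-regulated probeset lists from four conditions.
up_a = {"p1", "p2", "p3", "p4"}
up_b = {"p2", "p3", "p5"}
up_c = {"p4"}
up_d = {"p6"}

in_a_and_b = up_a & up_b                       # Intersection
in_a_not_b = up_a - up_b                       # In A not B
in_a_only = up_a - (up_b | up_c | up_d)        # In A not B, C, D

print(sorted(in_a_and_b), sorted(in_a_not_b), sorted(in_a_only))
```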

18
Correlation of Ivan's Duplicates
Example of correlation between duplicate experiments. Similar correlation for all duplicates.
19
Ivan's Counts for TK6 24 hours with Filtering
Shows that no filtering is best
Used probeset IDs
20
TK6 24 hours no filtering
21
Ivan's Counts no Filtering
22
Example Genes Up Regulated (TK6 at 24 hours)
23
Example Genes Downregulated (TK6 at 24 hours)
24
Correlation of Lisi's Triplicates
25
Lisi's Liver Counts no Filtering
26
Lisi's Counts 2 no Filtering
27
Finding Up and Down Regulation by T-test

T-test statistic for each gene:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s}$$
where $\bar{x}_1$ and $\bar{x}_2$ are the averages over replicates for a single gene in condition one and condition two, and $s$ is the standard deviation of the gene over both conditions:
$$s = \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^{1/2}$$
with $s_1$, $s_2$ the standard deviations of the gene in conditions one and two, and $n_1$, $n_2$ the numbers of replicates

28
More on T-test

2-sample T-test, determines if two population
means are equal
Paired or unpaired
Paired when samples are dependent
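Degrees of freedom:
$$\nu = \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
where $n_1$, $n_2$ are the numbers of replicates in the two conditions; round down to the nearest integer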
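
As a small worked example of the unpaired two-sample statistic and its degrees of freedom, the sketch below uses only the standard library. The function name `welch_t` and the replicate values are hypothetical; converting t to a p-value would need a t-distribution table or a statistics package and is left out.

```python
import math
from statistics import mean, variance


def welch_t(x1, x2):
    """Return (t, degrees of freedom) for two independent samples."""
    n1, n2 = len(x1), len(x2)
    v1 = variance(x1) / n1  # s_1^2 / n_1 (sample variance, n - 1 denominator)
    v2 = variance(x2) / n2  # s_2^2 / n_2
    t = (mean(x1) - mean(x2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, math.floor(df)  # degrees of freedom rounded down


# Hypothetical expression values for one gene, three replicates per condition.
t_value, dof = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(t_value, 4), dof)
```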

29
Correlation between Log2 ratio and T-test

High correlation between Log2 ratio and T-value for TK6, TK6-MGMT, and MT1 for Ivan's data sets
Example of T-value on y-axis, Log2 ratio on x-axis for TK6 at 24 hours

30
Correlation between Log2 cutoff and T-value
0.25 < p < 0.45
Need to find p-values corresponding to these
cutoff t-values
31
T-test for Liver MGMT
T-test versus Log2 ratio for Liver MGMT Untreated
32
Correlation between Log2 cutoff and T-value
0.25 < p < 0.45
33
T-test cutoff procedure

Normality method
Empirical method - Shuffle labels of conditions
and find empirical t-distribution
Find the areas of the tails in order to find the
t-cutoff for a specific p-value threshold
Find the p-values for log2 ratios of +/- 1
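
The empirical method can be sketched as below: shuffle the condition labels within each gene, collect the resulting t-values into a null distribution, and read the cutoff off its tail. Everything here is an illustrative assumption: the simulated data, the number of shuffles, pooling the null over all genes, and using a two-sided tail on |t|.

```python
import math
import random
from statistics import mean, variance


def t_stat(x1, x2):
    """Unpaired two-sample t statistic (0.0 if both samples are constant)."""
    se = math.sqrt(variance(x1) / len(x1) + variance(x2) / len(x2))
    return 0.0 if se == 0 else (mean(x1) - mean(x2)) / se


def empirical_t_cutoff(genes, p_threshold, n_shuffles=200, seed=0):
    """|t| cutoff whose two-sided tail area in the shuffled-label null is p_threshold."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_shuffles):
        for x1, x2 in genes:
            pooled = list(x1) + list(x2)
            rng.shuffle(pooled)  # reassign condition labels at random
            null.append(abs(t_stat(pooled[:len(x1)], pooled[len(x1):])))
    null.sort()
    index = min(len(null) - 1, int((1 - p_threshold) * len(null)))
    return null[index]


# Hypothetical data: 50 genes, 3 replicates per condition, no true difference.
data_rng = random.Random(1)
genes = [
    ([data_rng.gauss(8, 1) for _ in range(3)], [data_rng.gauss(8, 1) for _ in range(3)])
    for _ in range(50)
]
cutoff_05 = empirical_t_cutoff(genes, 0.05)
cutoff_01 = empirical_t_cutoff(genes, 0.01)
print(cutoff_05, cutoff_01)
```

With only three replicates per condition each gene has few distinct label arrangements, which is why this sketch pools the null over genes.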

34
PCA background

Transforms a number of correlated variables into
a smaller number of uncorrelated variables called
principal components
Reduces dimensionality of the data set
Identification of underlying variables
First component accounts for as much variability
as possible
Each succeeding accounts for as much as possible
of the remaining variability

35
PCA algorithm and Example

Start with random vector x
Find its expectation
Form its covariance matrix
Find its eigenvalues and eigenvectors
Let A be the matrix of eigenvectors
Let A_k be the first K eigenvectors as rows
Then use the 2 transformations
$y = A_k (x - E[x])$
$x = A_k^T y + E[x]$ (exact when all eigenvectors are kept, an approximation otherwise)
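
The steps above can be tried on a toy data set with the standard library alone. This sketch keeps only the first principal component (k = 1) and finds it by power iteration instead of a full eigendecomposition, which is a simplification chosen for brevity. The data are hypothetical, and power iteration assumes the starting vector is not orthogonal to the leading eigenvector.

```python
import math


def mean_vector(rows):
    """Expectation E[x], estimated as the column means."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance_matrix(rows, mu):
    """Sample covariance matrix of the rows."""
    n, d = len(rows), len(mu)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def leading_eigenvector(cov, iterations=200):
    """First principal direction via power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Hypothetical data lying exactly on the line x2 = 2 * x1.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mu = mean_vector(rows)
a1 = leading_eigenvector(covariance_matrix(rows, mu))

# y = A_k (x - E[x]) with k = 1, then x_hat = A_k^T y + E[x].
y = [sum(a * (x - m) for a, x, m in zip(a1, r, mu)) for r in rows]
x_hat = [[a * yi + m for a, m in zip(a1, mu)] for yi in y]
print(a1, x_hat)
```

Because the toy data are perfectly correlated, one component reconstructs them exactly. Real expression data would need more components.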

36
GO analysis: Upregulated Aag Brain
37
GO analysis: Downregulated Aag Brain
38
Database of Interacting Proteins (DIP) and SIPnomes

Cross-referenced DIP database with Log2 ratios
Red: below -1.0, Green: greater than 1.0
Cross-referenced SIPnome with Log2 ratio
Need to filter by e-value (later)
Too little data

39
Background on Clustering

Options
Hierarchical clustering
Ward's method
Single linkage
Complete linkage
UPGMA
WPGMA
Self-organizing maps
K-means clustering
Choose a clustering metric
Euclidean, Manhattan

40
Cluster Validation

Are the clusters real?
Internal (sub-sampling)
External validation (match to known categories)
Internal methods
Measure the similarity between two sets of
clusters
Use label matrices: L_ij = 1 if i and j are in the same cluster, 0 otherwise
Compare the label matrices of the clusters found using all of the data with the clusters found using 80% of the data
High confidence in original clustering if there
is high similarity between the label matrices
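
One way to compare two label matrices is the fraction of item pairs on which they agree. That choice of similarity score is an assumption, since the slide does not name one. The gene names and cluster labels below are hypothetical.

```python
from itertools import combinations


def label_matrix(labels):
    """L[(i, j)] = 1 if items i and j share a cluster, else 0."""
    return {
        (i, j): int(labels[i] == labels[j])
        for i, j in combinations(sorted(labels), 2)
    }


def similarity(labels_full, labels_sub):
    """Fraction of pairs (over items present in both runs) with equal L_ij."""
    shared = set(labels_full) & set(labels_sub)
    full = label_matrix({k: labels_full[k] for k in shared})
    sub = label_matrix({k: labels_sub[k] for k in shared})
    return sum(full[pair] == sub[pair] for pair in full) / len(full)


# Clustering on all data vs. clustering on an 80% sub-sample (g5 left out).
full_run = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
sub_same = {"g1": 1, "g2": 1, "g3": 0, "g4": 0}  # same grouping, labels renamed
sub_diff = {"g1": 0, "g2": 1, "g3": 1, "g4": 1}  # g2 moved to the other cluster

print(similarity(full_run, sub_same), similarity(full_run, sub_diff))
```

Comparing pairs instead of raw labels makes the score insensitive to how the clusters happen to be numbered in each run.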

41
K-means algorithm and Example

Ask for the number of clusters, k
Randomly guess k centers of clusters
Each data point finds the closest center
Each center finds the centroid of the points it
owns
Center jumps to the centroid
Repeat
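
A minimal sketch of the loop above, using only the standard library. Picking the initial centers from the data points (with a fixed seed), stopping when the centers no longer move, and the toy points are all assumptions of this example.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, clusters) after iterating assign/update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial guess for the k centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Each data point finds the closest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Each center jumps to the centroid of the points it owns.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On less clearly separated data the result depends on the initial guess, so the algorithm is usually run from several starting points.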

42
Hierarchical clustering and Example

Let each point be a cluster
Find the most similar pair of clusters through a
cluster distance
Merge into a parent cluster
Repeat until all data merged into one cluster

43
Cluster similarity

Complete linkage: maximum distance between points in clusters
Single linkage: minimum distance between points in clusters
Average linkage: average distance between points in clusters
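
The three linkage rules can be plugged into a small agglomerative loop, sketched here with the standard library (`math.dist` needs Python 3.8 or later). The function names and the one-dimensional toy points are hypothetical. This naive version recomputes all pairwise distances at every merge, so it only suits small inputs.

```python
import math
from itertools import product

LINKAGES = {
    "complete": max,                              # maximum pairwise distance
    "single": min,                                # minimum pairwise distance
    "average": lambda d: sum(d) / len(d),         # mean pairwise distance
}


def cluster_distance(a, b, linkage):
    """Distance between clusters a and b under the chosen linkage rule."""
    distances = [math.dist(p, q) for p, q in product(a, b)]
    return LINKAGES[linkage](distances)


def agglomerate(points, linkage="average", n_clusters=1):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with each point as its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge into a parent cluster
        del clusters[j]
    return clusters


# Hypothetical 1-D points written as 1-tuples.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
print(agglomerate(points, "single", n_clusters=2))
```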

44
UPGMA

Unweighted Pair Group Method with Arithmetic Mean
Start with distance matrix for each pair of data
Find the smallest distance, and cluster these,
the branching point is half the distance
Find a new distance matrix
Repeat the last two steps
UPGMA vs WPGMA
WPGMA: Weighted Pair Group Method with Arithmetic Mean
Difference is in the calculation of a new
distance matrix

45
Self-organizing Maps
46
Advantages and Disadvantages of each clustering
method
47
Clustering results Ivan
Used Ward's method. Note that similar experiments cluster together. Green: down, Red: up.
48
Clustering Lisi
49
Aag and MMS liver counts
Probeset IDs
50
Finding Hubs in protein-protein networks
Mouse data hubs
Ivan's data: the hubs
51
Hub examples
52
Number of proteins (y) having x neighbors
Human
Mouse
53
SIPnome protein/protein, and protein/splice
variant connections
54
SIP and splice variants

For each up or down regulated splice variant
Find its parent protein, A
Find which proteins that protein A is connected
to, call this set B
Find the splice variants of set B
If both splice variants are regulated, then
success
Results: none found

55
LPE test

Two sample t-test
Large p-values
Large false positive rate
Assumes many replicates but we have only 3
Therefore use LPE test
LPE test
Local pooled error test
Add more
Independence? Problematic for time course data

56
Time Course Analysis

Correlation method
Edge detection method
Bayesian networks
Event method

57
Event method

Smooth the data
Events for each instant (falling, rising,
constant)
Sequence alignment
Best match of two event strings taking time into
account
Use global sequence alignment
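
A rough sketch of the event method: turn each profile into a string of rising (R), falling (F), and constant (C) events, then score two strings with a standard global (Needleman-Wunsch) alignment. The tolerance, the scoring values, and the sample series are assumptions. Smoothing and any time-aware weighting of the match are not shown.

```python
def to_events(series, tolerance=0.1):
    """Encode consecutive changes as R (rising), F (falling), or C (constant)."""
    events = []
    for previous, current in zip(series, series[1:]):
        delta = current - previous
        if delta > tolerance:
            events.append("R")
        elif delta < -tolerance:
            events.append("F")
        else:
            events.append("C")
    return "".join(events)


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    previous_row = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        row = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous_row[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            row.append(max(diagonal, previous_row[j] + gap, row[j - 1] + gap))
        previous_row = row
    return previous_row[-1]


# Hypothetical expression profile.
events = to_events([1.0, 2.0, 2.05, 1.0])
print(events, global_alignment_score(events, "RF"))
```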

58
Sample Correlation Analysis (no time lag)
Top Gene Pair Correlation
Top Gene Pairs AntiCorrelation
59
Sample Time Series (No time lag)
60
Sample Time Series with Significant Fold Change
(no time lag)
61
Correlation method

Correlate two profiles with a 6 hr time lag
Check all 144 million probeset pairs for correlation
Require 98% correlation
Require at least one time point in both series with fold change > 1.7
1006 matches out of 144 million pairs
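
The per-pair test can be sketched as below. Several details are assumptions of this example: the lag is expressed as a number of time points (the sampling interval is not given), fold change is measured against the first time point, and the profiles are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (0.0 if either series is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return 0.0 if denominator == 0 else numerator / denominator


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag]."""
    return pearson(leader[:len(leader) - lag], follower[lag:])


def has_fold_change(series, fold=1.7):
    """True if any time point differs from the first by at least `fold`."""
    base = series[0]
    return any(v / base >= fold or base / v >= fold for v in series[1:])


def is_match(leader, follower, lag=1, min_corr=0.98, fold=1.7):
    return (
        lagged_correlation(leader, follower, lag) > min_corr
        and has_fold_change(leader, fold)
        and has_fold_change(follower, fold)
    )


# Hypothetical profiles: the follower repeats the leader one time point later.
leader = [1.0, 2.0, 4.0, 2.0, 1.0]
follower = [5.0, 1.0, 2.0, 4.0, 2.0]
print(lagged_correlation(leader, follower, 1), is_match(leader, follower))
```

Running this over every ordered pair is quadratic in the number of probesets, which is where the 144 million comparisons come from.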

62
Sample Time Series Significant Fold change (with
Time lag)
64
Combining SIP and activators

For each gene pair in the SIPnome
Loop through each splice variant of geneA
Loop through each splice variant of geneB
Find the time lagged correlation of these two
splice variants
If the correlation > 0.98 and both splice variants show 1.7 fold change then keep the SIPnome interaction
Results: none found

65
Pathway Mapping

Take GenMAPP pathways
Combine with microarray data
Color by up or down regulation

66
Combining pathways and Chip Data for 6 Aag liver
67
TGF Beta Signaling
68
Promoter Mapping

For each accession number, find 2500 bases upstream and 50 downstream using PromoSer
For each of these 4000 sequences of 2550 bp, identify possible TF binding sites using TFsearch in the forward strand
Find how many times Oct-1 (for example) occurs at random vs in Aag promoters
Random samples: use 100 samples of size 100
If Oct-1 occurs more than once in the 2500 bp upstream, count it as one

69
Promoter results

Oct-1, YY1, S8, C/EBPb, Oct-x, TATA, Sox-5, HNF-3a, C/EBPa are overrepresented for up regulated AagNull Brain, 2 std above random
Aag6MMSWTULIV has Oct-1 overrepresented by > 2 std
AagBM has GR, NGFI-C, CRE-BP overrepresented by > 2 std
WT6MMSWTULIV has Oct-1 overrepresented by > 2 std
All have p-values < 0.02
Overrepresentation was found by computing the average and standard deviation of the random samples and comparing them to the observed sample
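
The comparison against random samples can be sketched as a z-score: draw repeated random promoter sets, count how many promoters in each contain the site (counting a promoter once even if the site occurs several times), and express the observed count in standard deviations above the random mean. The promoter sets below are synthetic placeholders, not PromoSer or TFsearch output, and the function names are hypothetical.

```python
import random
from statistics import mean, stdev


def count_with_site(promoters, site):
    """Number of promoters containing the site at least once."""
    return sum(1 for sites in promoters if site in sites)


def overrepresentation(observed, background, site, n_samples=100, seed=0):
    """Standard deviations by which the observed count exceeds random samples."""
    rng = random.Random(seed)
    size = len(observed)
    counts = [
        count_with_site(rng.sample(background, size), site)
        for _ in range(n_samples)
    ]
    return (count_with_site(observed, site) - mean(counts)) / stdev(counts)


# Synthetic background: 1000 promoters, 10% of which carry an Oct-1 site.
background = [{"Oct-1"}] * 100 + [set()] * 900
# Synthetic regulated sets of 100 promoters each.
enriched = [{"Oct-1"}] * 50 + [set()] * 50
typical = [{"Oct-1"}] * 10 + [set()] * 90

z_enriched = overrepresentation(enriched, background, "Oct-1")
z_typical = overrepresentation(typical, background, "Oct-1")
print(z_enriched > 2, abs(z_typical) < 2)
```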

70
Promoter pair results
71
Map of DNA binding sites for one Aag gene
72
Distribution of TF binding sites for Aag
YY1
S8
Positions are distributed randomly?
73
Promoter Alignment
74
Promoter Part 2
75
Works read

1. Statistical Challenges in Functional Genomics, P. Sebastiani, E. Gussoni, I. Kohane, M. Ramoni, Statistical Science, Vol. 18, No. 1, pgs. 33-70
2. Mark Hickman's thesis, pgs. 1-41
3. Probability and Statistics for Engineering and the Sciences, J. Devore, Thomson Learning, 2004
4. A generalized likelihood ratio test to identify differentially expressed genes from microarray data, S. Wang, S. Ethier, Bioinformatics, Vol. 20, no. 1, 2004, pgs. 100-104
5. Global snapshot of a protein interaction network: a percolation based approach, C. Chin, M. Samanta, Vol. 19, no. 18, 2003, pgs. 2413-2419
6. Essentiality and damage in metabolic networks, N. Lemke, F. Heredia, et al., Vol. 20, no. 1, 2004, pgs. 115-119
7. Self-Organizing Maps, http://davis.wpi.edu/matt/courses/soms

76
Works Read

8. Using Structured Self-Organizing Maps in News Integration Websites, I. Perelomov, A. Azcarraga, J. Tan, et al.
9. Inference of transcriptional regulation relationships from gene expression data, A. Kwon, H. Hoos, R. Ng, Bioinformatics, Vol. 19, no. 8, 2003, pgs. 905-912
10. Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments, Y. Zhao and W. Pan, Bioinformatics, Vol. 19, no. 9, 2003, pgs. 1046-1054
11. Metabolic pathways in three dimensions, I. Rojdestvenski, Bioinformatics, Vol. 19, no. 18, 2003, pgs. 2436-2441
12. Local pooled error test for identifying differentially expressed genes with a small number of replicates
13. Linking gene expression data with patient survival times using partial least squares, P. Park, L. Tian, I. Kohane, Bioinformatics, Vol. 18, Suppl. 1, 2002, pgs. S120-S127