Title: Microarray Analysis
1. Microarray Analysis
2. Biological Background
- Mechanisms that protect from DNA damage:
- Prevention of DNA damage
- Aiding in DNA repair
- Elimination of damaged cells through apoptosis
- Lack of DNA repair pathways -> high risk for cancer
- DNA damage -> cancer, aging, toxicity
- Alkylating agents: diet, atmosphere, chemotherapy, smoking
- UV radiation, polymerase errors, spontaneous DNA decomposition
3. Biological Background 2
- 7MeG: methylation at the N7 position of guanine -> 7MeG is not mutagenic or lethal
- Lesions: O6MeG and 3MeA
- O6MeG lesions are caused by cancer chemotherapy, MNNG, and BCNU
- MMR can fix single-base mismatches and insertion/deletion loops -> low mutation rate
- E. coli MMR: excises hundreds of bases near the mismatch on the daughter strand and replaces them
4. Biological Background 3
- Mammalian MMR: more complex mechanism, strand-specific, lack of strand-discrimination mechanism
- MSH2, MSH3, MSH6: recognize heteroduplex DNA
- MSH4 and MSH5: recombination
- EXO1: performs the exonucleolytic reaction
- PCNA: mismatch recognition
5. Biological Background 4
- Mutations in MMR -> increase in spontaneous mutation rate -> microsatellite instability -> cancer (nonpolyposis colon cancer)
- Explain normal mismatch repair
6. Microarray Technology
- Simultaneously measures the relative expression level of thousands of genes within a specific tissue
- The mRNA transcripts of a cell are isolated and reverse transcribed to cDNA; this is the cDNA library of a cell
- The labeled cDNA representation of a cell is hybridized to cDNA or to synthetic oligonucleotides (short single-stranded DNA sequences) on the chip
- The cDNA or oligos on the chip are called probes, while the cDNA of the cell is the target
7. cDNA Arrays
- Selection of probes
- Amplification of cellular mRNA -> cDNA through PCR
- Each spot in a cDNA array is a gene or EST
- Extract total mRNA from two cell types, label with green and red -> relative abundance for each gene in the two cells
8. Oligo arrays
- Cellular mRNA -> cDNA -> cRNA
- Photolithographic, short cDNAs
- Represent genes by fixed-length independent segments
- Each probe is 25 bp; each gene is represented by a number of probe pairs
- Well-chosen probes specify the gene uniquely and reduce the chance of cross-hybridization
- A probe pair consists of a perfect match and a mismatch
- The mismatch probe is the same as the perfect match except for a base inversion in a central position
9. Pros and Cons of cDNA arrays vs Oligo arrays
10. Analysis Overview
- Background subtraction (Affymetrix)
- Normalization (RMAExpress; the effect of the number of chips?)
- Filtering
- P/A (all Ps, 3 of 4, 2 of 4, 0 of 4, etc.)
- Same/Same filters for duplications
- Find correlation between duplicates and triplicates
- Find log2 ratios
- Find upregulated and downregulated genes (use cutoff of +/- 1, arbitrary)
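The log2 ratio and cutoff steps above can be sketched as follows. This is a minimal illustration, assuming the inputs are positive intensities on the linear scale (if the normalized values are already log2 values, the ratio becomes a difference of means). The probeset names and intensities are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def log2_ratio(treated_replicates, baseline_replicates):
    """Log2 of (mean treated intensity / mean baseline intensity)."""
    return math.log2(mean(treated_replicates) / mean(baseline_replicates))


def classify(ratio, cutoff=1.0):
    """Label a log2 ratio using the +/- cutoff from the slide (1 by default)."""
    if ratio >= cutoff:
        return "up"
    if ratio <= -cutoff:
        return "down"
    return "unchanged"


# Hypothetical probesets: (treated replicates, baseline replicates).
probesets = {
    "probe_a": ([400.0, 440.0], [100.0, 110.0]),
    "probe_b": ([50.0, 55.0], [210.0, 210.0]),
    "probe_c": ([120.0, 118.0], [115.0, 121.0]),
}
calls = {name: classify(log2_ratio(t, b)) for name, (t, b) in probesets.items()}
```

A cutoff of 1 on the log2 scale corresponds to a two-fold change in either direction.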
11. Analysis 2
- Find upregulated and downregulated genes by t-test
- Make GO maps using annotation database
- Clustering
- Protein-protein databases correlated with log ratios (redo)
- SIPnomes and log ratios
12. Future Analysis Plans
- Find upregulated and downregulated genes by LPE test and generalized likelihood ratio test
- Predictions of complexes and new pathways
- Classification of unknown sample by class discovery
- Time series analysis (regulatory networks, periodic expression, coregulation)
- 9 region graphs
- PCA
- Using e-value cutoffs for subnetworks
13. Future Analysis Plans 2
- Long-term outcome prediction
- Prediction of viability using network interactions
- Phenotype data, phosphorylation data, localization data, protein abundances
- Mathematical properties of the network
- Clustering methods comparison
- Robustness of network
- mRNA degradation effect
- Genetic diagnostic test
- Promoter mapping
14. Annotation Database
- Probe Set ID, Title, Gene Symbol, Location
- UniGene ID, LocusLink ID, Swiss-Prot ID, Ensembl ID
- GO
- Biological Process
- Cellular Component
- Molecular Function
- Pathways
- Etc
- Verified annotation cross-referencing
15. My program
- Inputs
- P/A calls from Affymetrix analysis
- Annotation database
- Normalized CEL files
- Names of CEL files and their replicates
- Names of one or more baseline CEL files
- Fold cutoff
16. Names of CEL Files
- TK6 Untreated,6-27-021.CEL,6-28-022.CEL
- TK6 24 hr,6-28-024.CEL,6-28-025.CEL
- TK6 48 hr,6-29-028.CEL,6-28-029.CEL
- TK6 48 hr 2,6-27-028.CEL,6-28-029.CEL
- ----------------------------------------
- TK6-MGMT Untreated,6-27-0210.CEL,6-28-0211.CEL
- TK6-MGMT 24 hr,6-27-0214.CEL,6-28-0215.CEL
- TK6-MGMT 48 hr,6-27-0217.CEL,6-28-0218.CEL
- -----------------------------------------------
- MT1 Untreated,6-27-0219.CEL,6-28-0220.CEL,6-28-0221.CEL
- MT1 24 hr,6-27-0222.CEL,6-28-0223.CEL,6-28-0224.CEL
- MT1 48 hr,6-27-0225.CEL,6-28-0226.CEL,6-28-0227.CEL
17. Program Outputs
- Makes new database of averages over replicates and log2 ratios
- Upregulated and downregulated lists for any combination of P/A filters and Same/Same filters
- Number upregulated or downregulated
- List of probeset ID, gene symbol, title, GO (Biological Process), and log2 ratio
- List of probesets upregulated or downregulated, ready for GO analysis
- Set operations
- Intersection
- In A not B
- In A not B, C, D, ...
- 9 regions (explain)
- Cross-referencing of protein-protein data with expression data
- Subnetworks of up or down regulation
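The set operations listed above map directly onto Python sets. The sketch below is illustrative only; the probeset IDs and list names are hypothetical and not taken from the program itself.

```python
def in_a_not_others(a, *others):
    """Return the members of `a` that appear in none of the other sets."""
    excluded = set().union(*others)
    return a - excluded


# Hypothetical upregulated probeset lists from three comparisons.
up_24h = {"ps_1", "ps_2", "ps_3", "ps_4"}
up_48h = {"ps_3", "ps_4", "ps_5"}
up_mgmt = {"ps_4", "ps_6"}

in_both = up_24h & up_48h                 # Intersection
only_24h = in_a_not_others(up_24h, up_48h)  # In A not B
only_24h_strict = in_a_not_others(up_24h, up_48h, up_mgmt)  # In A not B, C
```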
18. Correlation of Ivan's Duplicates
Example of correlation between duplicate experiments. Similar correlation for all duplicates.
19. Ivan's Counts for TK6 24 hours with Filtering
Shows no filtering is best
Used probeset IDs
20. TK6 24 hours, no filtering
21. Ivan's Counts, no Filtering
22. Example Genes Upregulated (TK6 at 24 hours)
23. Example Genes Downregulated (TK6 at 24 hours)
24. Correlation of Lisi's Triplicates
25. Lisi's Liver Counts, no Filtering
26. Lisi's Counts 2, no Filtering
27. Finding Up and Down Regulation by T-test
- T-test statistic for each gene:
- $t = (\bar{x}_1 - \bar{x}_2) / s$, where $\bar{x}_1$ is the average over replicates for a single gene in condition one, $\bar{x}_2$ is the average over replicates in condition two, and $s$ is the standard deviation of the gene over both conditions
- where the standard deviation over both conditions is
- $s = \left[ \frac{1}{n_1} s_1^2 + \frac{1}{n_2} s_2^2 \right]^{1/2}$, with $s_1$, $s_2$ the standard deviations of the gene in conditions one and two and $n_1$, $n_2$ the numbers of replicates
28. More on T-test
- 2-sample t-test: determines whether two population means are equal
- Paired or unpaired
- Paired when samples are dependent
- Degrees of freedom: $\nu = \dfrac{(s_1^2/m + s_2^2/n)^2}{(s_1^2/m)^2/(m-1) + (s_2^2/n)^2/(n-1)}$, rounded down to the nearest integer, where $m$ and $n$ are the two sample sizes
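The statistic and degrees of freedom described on these two slides can be computed as in the sketch below. It assumes the unpaired, unequal-variance form of the two-sample t-test, which is what the formulas describe; the function names are illustrative.

```python
import math
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Difference of means divided by sqrt(s1^2/n1 + s2^2/n2)."""
    n1, n2 = len(sample1), len(sample2)
    # statistics.variance is the sample variance (divides by n - 1).
    s = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / s


def welch_df(sample1, sample2):
    """Degrees of freedom from the slide's formula, rounded down."""
    m, n = len(sample1), len(sample2)
    a = variance(sample1) / m
    b = variance(sample2) / n
    df = (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))
    return math.floor(df)
```

For example, `welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` is about -1.549 with `welch_df` equal to 2, which illustrates how few degrees of freedom remain with three replicates per condition.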
29. Correlation between Log2 ratio and T-test
- High correlation between log2 ratio and t-value for TK6, TK6-MGMT, and MT1 for Ivan's data sets
- Example of t-value on y-axis, log2 ratio on x-axis for TK6 24 hours
30. Correlation between Log2 cutoff and T-value
0.25 < p < 0.45
Need to find p-values corresponding to these cutoff t-values
31. T-test for Liver MGMT
T-test versus log2 ratio for Liver MGMT Untreated
32. Correlation between Log2 cutoff and T-value
0.25 < p < 0.45
33. T-test cutoff procedure
- Normality method
- Empirical method
- Shuffle labels of conditions and find empirical t-distribution
- Find the areas of the tails in order to find the t-cutoff for a specific p-value threshold
- Find the p-values for log2 ratios of +/- 1
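The empirical method can be sketched as a label-shuffling (permutation) test. The version below works on a single gene and estimates a two-sided p-value as the fraction of shuffles whose |t| is at least the observed |t|; pooling shuffled t-values across genes, the number of shuffles, and the fixed seed are choices the slide does not specify and are assumptions here.

```python
import math
import random
from statistics import mean, variance


def t_statistic(g1, g2):
    """Unequal-variance two-sample t statistic."""
    diff = mean(g1) - mean(g2)
    se = math.sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
    if se == 0.0:
        # Degenerate case: no spread within either group.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def empirical_p_value(g1, g2, n_shuffles=2000, seed=0):
    """Two-sided p-value from the shuffled-label t-distribution."""
    rng = random.Random(seed)
    observed = abs(t_statistic(g1, g2))
    pooled = list(g1) + list(g2)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign condition labels at random
        shuffled = t_statistic(pooled[:len(g1)], pooled[len(g1):])
        if abs(shuffled) >= observed:
            hits += 1
    return hits / n_shuffles
```

With only two or three replicates per condition the number of distinct relabelings is very small, so per-gene p-values from this procedure are coarse.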
34. PCA background
- Transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components
- Reduces dimensionality of the data set
- Identification of underlying variables
- First component accounts for as much variability as possible
- Each succeeding component accounts for as much as possible of the remaining variability
35. PCA algorithm and Example
- Start with random vector $x$
- Find its expectation, $E[x]$
- Form its covariance matrix
- Find its eigenvalues and eigenvectors
- Let $A$ be the matrix of eigenvectors
- Let $A_k$ be the first $k$ eigenvectors (ordered by decreasing eigenvalue) as rows
- Then use the 2 transformations
- $y = A_k (x - E[x])$
- $x \approx A_k^T y + E[x]$ (exact when $k$ equals the dimension of $x$)
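The two transformations can be illustrated for the case k = 1. Note the substitution: instead of a full eigendecomposition, this sketch uses power iteration to find only the leading eigenvector of the covariance matrix, so that it runs with the standard library alone. It assumes the starting vector is not orthogonal to that eigenvector; all function names are illustrative.

```python
import math


def mean_vector(rows):
    """Estimate E[x] as the column means of the data."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows, mu):
    """Sample covariance matrix of the data (divides by n - 1)."""
    n, d = len(rows), len(mu)
    centered = [[r[j] - mu[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(cov, iterations=200):
    """Leading eigenvector by power iteration (stands in for A_1)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(x, mu, a1):
    """y = A_1 (x - E[x]) for a single component."""
    return sum(a * (xi - m) for a, xi, m in zip(a1, x, mu))


def reconstruct(y, mu, a1):
    """x ~ A_1^T y + E[x]."""
    return [a * y + m for a, m in zip(a1, mu)]
```

For data lying exactly on a line, such as `[[1, 2], [2, 4], [3, 6], [4, 8]]`, one component reconstructs every point exactly; for real expression data the reconstruction is only approximate.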
36. GO analysis: Upregulated Aag Brain
37. GO analysis: Downregulated Aag Brain
38. Database of Interacting Proteins (DIP) and SIPnomes
- Cross-referenced DIP database with log2 ratios
- Red: below -1.0; green: greater than 1.0
- Cross-referenced SIPnome with log2 ratio
- Need to filter by e-value (later)
- Too little data
39. Background on Clustering
- Options
- Hierarchical clustering
- Ward's method
- Single linkage
- Complete linkage
- UPGMA
- WPGMA
- Self-organizing maps
- K-means clustering
- Choose a clustering metric
- Euclidean, Manhattan
40. Cluster Validation
- Are the clusters real?
- Internal (sub-sampling)
- External validation (match to known categories)
- Internal methods
- Measure the similarity between two sets of clusters
- Use label matrices: $L_{ij} = 1$ if $i$ and $j$ are in the same cluster, 0 otherwise
- Compare the label matrices of the clusters found using all of the data with the clusters found using 80% of the data
- High confidence in original clustering if there is high similarity between the label matrices
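The label-matrix comparison can be sketched as below. The slide does not say how similarity between two label matrices is scored; this sketch assumes the fraction of item pairs on which the two matrices agree. It also assumes both label lists cover the same items in the same order, so a clustering of an 80% subsample would first be restricted to the shared items.

```python
from itertools import combinations


def label_matrix(labels):
    """L[i][j] = 1 if items i and j share a cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def label_matrix_similarity(labels_a, labels_b):
    """Fraction of item pairs (i < j) on which the two label matrices agree."""
    la, lb = label_matrix(labels_a), label_matrix(labels_b)
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(1 for i, j in pairs if la[i][j] == lb[i][j])
    return agree / len(pairs)
```

Because only co-membership is compared, the score does not depend on how the clusters happen to be numbered in each run.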
41. K-means algorithm and Example
- Ask for the number of clusters, k
- Randomly guess k centers of clusters
- Each data point finds the closest center
- Each center finds the centroid of the points it owns
- Center jumps to the centroid
- Repeat until the assignments stop changing
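The steps above can be sketched as follows. The choices here are assumptions for illustration: initial centers are drawn from the data points, distance is Euclidean, the loop stops when assignments no longer change, and the random seed is fixed so runs are repeatable.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Return (cluster index per point, list of cluster centers)."""
    rng = random.Random(seed)
    # Randomly guess k centers by picking k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Each data point finds the closest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the centers will not move again
        assignment = new_assignment
        # Each center jumps to the centroid of the points it owns.
        for c in range(k):
            owned = [p for p, a in zip(points, assignment) if a == c]
            if owned:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(owned) for col in zip(*owned)]
    return assignment, centers
```

The result can depend on the initial guess, so in practice the algorithm is often run from several starting points and the tightest clustering kept.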
42. Hierarchical clustering and Example
- Let each point be a cluster
- Find the most similar pair of clusters through a cluster distance
- Merge into a parent cluster
- Repeat until all data are merged into one cluster
43. Cluster similarity
- Complete linkage: maximum distance between points in clusters
- Single linkage: minimum distance between points in clusters
- Average linkage: average distance between points in clusters
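A minimal sketch of agglomerative clustering with the three linkage rules above. It assumes Euclidean distance between points and recomputes cluster distances from scratch at every step, which is simple but slow for large data sets; the function and dictionary names are illustrative.

```python
import math

# Each linkage rule reduces the list of point-to-point distances
# between two clusters to a single cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerate(points, linkage="single"):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [math.dist(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges
```

The merge distances recorded in the history are the heights at which branches join in the resulting dendrogram.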
44. UPGMA
- Unweighted Pair Group Method with Arithmetic Mean
- Start with distance matrix for each pair of data points
- Find the smallest distance and cluster these; the branching point is at half the distance
- Find a new distance matrix
- Repeat the last two steps
- UPGMA vs WPGMA
- WPGMA: Weighted Pair Group Method with Arithmetic Mean
- Difference is in the calculation of the new distance matrix
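The difference between the two update rules can be shown in a few lines. When clusters A and B are merged, the distance from the merged cluster to any other cluster C is recomputed as follows (function names are illustrative):

```python
def upgma_distance(d_ac, d_bc, size_a, size_b):
    """UPGMA: average weighted by cluster sizes, so every point counts equally."""
    return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)


def wpgma_distance(d_ac, d_bc):
    """WPGMA: plain average of the two cluster distances, ignoring sizes."""
    return (d_ac + d_bc) / 2
```

With `d_ac = 4.0`, `d_bc = 10.0`, and cluster sizes 3 and 1, UPGMA gives 5.5 while WPGMA gives 7.0; the two rules agree whenever the merged clusters have equal size.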
45. Self-organizing Maps
46. Advantages and Disadvantages of each clustering method
47. Clustering results: Ivan
Used Ward's method. Note similar experiments cluster together. Green: down; red: up.
48. Clustering: Lisi
49. Aag and MMS liver counts
Probeset IDs
50. Finding Hubs in protein-protein networks
Mouse data: hubs
Ivan's data: the hubs
51. Hub examples
52. Number of proteins (y) having x neighbors
Human
Mouse
53. SIPnome protein/protein and protein/splice variant connections
54. SIP and splice variants
- For each upregulated or downregulated splice variant
- Find its parent protein, A
- Find which proteins protein A is connected to; call this set B
- Find the splice variants of set B
- If both splice variants are regulated, then success
- Results: none found
55. LPE test
- Two-sample t-test
- Large p-values
- Large false positive rate
- Assumes many replicates, but we have only 3
- Therefore use LPE test
- LPE test
- Local pooled error test
- Add more
- Independence? Problematic for time course data
56. Time Course Analysis
- Correlation method
- Edge detection method
- Bayesian networks
- Event method
57. Event method
- Smooth the data
- Events for each instant (falling, rising, constant)
- Sequence alignment
- Best match of two event strings taking time into account
- Use global sequence alignment
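A rough sketch of the event method, assuming the input series has already been smoothed. Events are encoded as R (rising), F (falling), and C (constant), and two event strings are compared with a basic Needleman-Wunsch global alignment. The tolerance and the scoring values (match +1, mismatch -1, gap -1) are assumptions, and this simple scoring does not weight gaps by elapsed time.

```python
def event_string(series, tolerance=0.0):
    """Encode each step of a series as R (rising), F (falling), or C (constant)."""
    events = []
    for previous, current in zip(series, series[1:]):
        change = current - previous
        if change > tolerance:
            events.append("R")
        elif change < -tolerance:
            events.append("F")
        else:
            events.append("C")
    return "".join(events)


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for the best global alignment of two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]
```

Gene pairs whose event strings align with a high score are candidates for coregulation; a gap in the alignment corresponds to one profile lagging the other by a time step.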
58. Sample Correlation Analysis (no time lag)
Top gene pairs: correlation
Top gene pairs: anticorrelation
59. Sample Time Series (no time lag)
60. Sample Time Series with Significant Fold Change (no time lag)
61. Correlation method
- Correlate two profiles with 6 hr time lag
- Check all 144 million probeset pairs for correlation
- Require 98% correlation
- Require one time point for both series to have fold change > 1.7
- 1006 matches out of 144 million pairs
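The pairwise check can be sketched as below. The lag is expressed in sampling steps rather than hours, and the fold-change filter takes precomputed fold changes per time point; both are assumptions, as is treating the criterion as "at least one time point above 1.7 in each series".

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(leader, follower, lag):
    """Correlate leader at time t with follower at time t + lag (lag in steps)."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


def is_match(leader, follower, lag, fold_leader, fold_follower,
             min_corr=0.98, min_fold=1.7):
    """Apply both filters: lagged correlation and a fold change in each series."""
    has_fold = (any(f > min_fold for f in fold_leader)
                and any(f > min_fold for f in fold_follower))
    return has_fold and lagged_correlation(leader, follower, lag) >= min_corr
```

Checking the cheap fold-change filter first avoids computing correlations for pairs that would be rejected anyway, which matters when scanning on the order of 144 million pairs.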
62. Sample Time Series with Significant Fold Change (with time lag)
64. Combining SIP and activators
- For each gene pair in the SIPnome
- Loop through each splice variant of geneA
- Loop through each splice variant of geneB
- Find the time-lagged correlation of these two splice variants
- If the correlation > 0.98 and both splice variants show a 1.7-fold change, then keep the SIPnome interaction
- Results: none found
65. Pathway Mapping
- Take GenMAPP pathways
- Combine with microarray data
- Color by up or down regulation
66. Combining pathways and Chip Data for 6 Aag liver
67. TGF Beta Signaling
68. Promoter Mapping
- For each accession number, find 2500 bases upstream and 50 downstream; use PromoSer
- For each of these 4000 sequences of 2550 bp, identify possible TF binding sites using TFSEARCH in the forward strand
- Find how many times Oct-1 (for example) occurs at random vs in Aag promoters
- Random samples: use 100 samples of size 100
- If Oct-1 occurs more than once in the 2500 bp upstream, count it as one
69. Promoter results
- Oct-1, YY1, S8, C/EBPb, Oct-x, TATA, Sox-5, HNF-3a, C/EBPa are overrepresented for upregulated AagNull Brain, 2 std above random
- Aag6MMSWTULIV has Oct-1 overrepresented by > 2 std
- AagBM has GR, NGFI-C, CRE-BP overrepresented by > 2 std
- WT6MMSWTULIV has Oct-1 overrepresented by > 2 std
- All have p-values < 0.02
- Found overrepresentation by computing the average and standard deviation of the random samples and comparing them to the observed sample
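The overrepresentation check can be sketched as a z-score against random samples. The sketch assumes each promoter is represented as the set of TF names found in it (so multiple occurrences count once, as on the previous slide) and that the observed list has the same size as each random sample so the counts are comparable. The data layout, function names, and fixed seed are illustrative.

```python
import random
from statistics import mean, stdev


def count_with_site(promoters, factor):
    """Number of promoters containing at least one site for the factor."""
    return sum(1 for sites in promoters if factor in sites)


def overrepresentation_z(observed, background, factor,
                         n_samples=100, sample_size=100, seed=0):
    """Standard deviations by which the observed count exceeds random samples."""
    rng = random.Random(seed)
    random_counts = [
        count_with_site(rng.sample(background, sample_size), factor)
        for _ in range(n_samples)
    ]
    # stdev is zero only if every random sample gives the same count,
    # in which case this division fails and the z-score is undefined.
    return (count_with_site(observed, factor) - mean(random_counts)) / stdev(random_counts)
```

A factor would be reported as overrepresented when the returned value exceeds 2, matching the "2 std above random" criterion.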
70. Promoter pair results
71. Map of DNA binding sites for one Aag gene
72. Distribution of TF binding sites for Aag
YY1
S8
Positions are distributed randomly?
73. Promoter Alignment
74. Promoter Part 2
75. Works read
- Statistical Challenges in Functional Genomics, P. Sebastiani, E. Gussoni, I. Kohane, M. Ramoni, Statistical Science, Vol. 18, No. 1, 33-70
- Mark Hickman's thesis, pgs. 1-41
- Probability and Statistics for Engineering and the Sciences, J. Devore, Thomson Learning, 2004
- A generalized likelihood ratio test to identify differentially expressed genes from microarray data, S. Wang, S. Ethier, Bioinformatics, Vol. 20, no. 1, 2004, pgs. 100-104
- Global snapshot of a protein interaction network: a percolation based approach, C. Chin, M. Samanta, Vol. 19, no. 18, 2003, pgs. 2413-2419
- Essentiality and damage in metabolic networks, N. Lemke, F. Heredia, et al., Vol. 20, no. 1, 2004, pgs. 115-119
- Self-Organizing Maps, http://davis.wpi.edu/matt/courses/soms
76. Works Read
- Using Structured Self-Organizing Maps in News Integration Websites, I. Perelomov, A. Azcarraga, J. Tan, et al.
- Inference of transcriptional regulation relationships from gene expression data, A. Kwon, H. Hoos, R. Ng, Bioinformatics, Vol. 19, no. 8, 2003, pgs. 905-912
- Modified nonparametric approaches to detecting differentially expressed genes in replicated microarray experiments, Y. Zhao and W. Pan, Bioinformatics, Vol. 19, no. 9, 2003, pgs. 1046-1054
- Metabolic pathways in three dimensions, I. Rojdestvenski, Bioinformatics, Vol. 19, no. 18, 2003, pgs. 2436-2441
- Local pooled error test for identifying differentially expressed genes with a small number of replicates
- Linking gene expression data with patient survival times using partial least squares, P. Park, L. Tian, I. Kohane, Bioinformatics, Vol. 18, Suppl. 1, 2002, pgs. S120-S127
77. Acknowledgements
- Professor Samson
- Coworkers
- Katherine
- Lisi
- Tom
- Rebecca
- Mark