Title: APO-SYS workshop on data analysis and pathway charting
1APO-SYS workshop on data analysis and pathway charting
- Igor Ulitsky
- Ron Shamir's Computational Genomics Group
2Part I: Presentations
- EXPANDER
- AMADEUS
- SPIKE
- MATISSE
3Part II: Hands-on Session
4EXPression ANalyzer and DisplayER
- Adi Maron-Katz
- Chaim Linhart
- Amos Tanay
- Rani Elkon
- Israel Steinfeld
- Seagull Shavit
- Igor Ulitsky
- Roded Sharan
- Yossi Shiloh
- Ron Shamir
http://acgt.cs.tau.ac.il/expander
5EXPANDER
- Low level analysis
- Missing data estimation (KNN or manual)
- Normalization: quantile, loess
- Filtering: fold change, variation, t-test
- Standardization: mean 0 std 1, take log, fixed norm
- High level gene partition analysis
- Clustering
- Biclustering
- Ascribing biological meaning to patterns
- Enriched functional categories (Gene Ontology)
- Identify transcriptional regulators: promoter analysis
- Built-in support for 9 organisms
- human, mouse, rat, chicken, zebrafish, fly, worm, arabidopsis, yeast
6Input data
Normalization/ Filtering
Links to public annotation databases
Visualization utilities
Clustering (CLICK, SOM, K-means, Hierarchical)
Biclustering (SAMBA)
Functional enrichment (TANGO)
Promoter signals (PRIMA)
7EXPANDER - Preprocessing
- Input data
- Expression matrix (rows: probes, columns: conditions)
- One-channel data (e.g., Affymetrix)
- Dual-channel data (cDNA microarrays; data are (log) ratios between the Red and Green channels)
- .cel files
- ID conversion file: map probes to genes
- Gene sets data
- Data definitions
- Defining condition subsets
- Data type and scale (log)
8EXPANDER - Preprocessing (II)
- Data Adjustments
- Missing value estimation (KNN or arbitrary)
- Merging conditions
- Normalization: removal of systematic biases from the analyzed chips
- Implemented methods: quantile, lowess
- Visualization: box plots, scatter plots (simple, M vs. A)
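As an illustration of the quantile method named above, here is a minimal sketch in plain Python. It is not EXPANDER's own code: the function name `quantile_normalize`, the toy matrix, and the tie-breaking rule (ties keep row order) are assumptions made for this example. The idea is to force every chip (column) to share one reference distribution, built from the mean of the k-th smallest value across chips.

```python
from statistics import mean

def quantile_normalize(matrix):
    """Quantile-normalize the columns (chips) of a probes x conditions matrix.

    matrix: list of rows, each a list of floats with no missing values.
    Returns a new matrix in which every column has the same set of values.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value across chips.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Rank each probe within its chip; ties keep row order (an assumption).
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result

# Hypothetical toy data: 4 probes measured on 3 chips.
example = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(example)
```

After normalization, box plots of the three chips would coincide, because each column holds the same values in a chip-specific order.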
9EXPANDER - Preprocessing (III)
- Filtering: focus downstream analysis on the set of responding genes
- Fold-Change
- Variation
- Statistical tests (T-test)
- Standardization: create a common scale
- For each probe: mean = 0, STD = 1
- Log data (base 2)
- Fixed Norm (divide by norm of probe vector)
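The three standardization options listed above can be sketched per probe as follows. This is an illustrative example, not EXPANDER's implementation: the function names are hypothetical, the standard deviation is assumed to be the sample (n - 1) version, and the norm is assumed to be the Euclidean norm.

```python
import math

def standardize(probe):
    """Shift and scale one probe's values to mean 0 and standard deviation 1.

    Uses the sample standard deviation (n - 1); this choice is an assumption.
    A constant probe has zero deviation and would need to be filtered out first.
    """
    n = len(probe)
    mu = sum(probe) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in probe) / (n - 1))
    return [(x - mu) / sd for x in probe]

def log2_transform(probe):
    """Take the base-2 logarithm of positive expression levels."""
    return [math.log2(x) for x in probe]

def fixed_norm(probe):
    """Divide the probe vector by its Euclidean norm (assumed norm)."""
    norm = math.sqrt(sum(x * x for x in probe))
    return [x / norm for x in probe]

# Hypothetical probe whose level doubles at every condition.
probe = [100.0, 200.0, 400.0, 800.0]
logged = log2_transform(probe)   # consecutive values differ by exactly 1
z = standardize(logged)          # mean 0, standard deviation 1
unit = fixed_norm(probe)         # vector of length 1
```

Taking logs first turns fold changes into differences, so a doubling contributes the same step regardless of the absolute expression level.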
11Cluster Analysis
- Partition the responding genes into distinct sets, each with a particular expression pattern
- Identify major patterns in the data: reduce the dimensionality of the problem
- co-expression → co-function
- co-expression → co-regulation
- Partition the genes to achieve
- Homogeneity: genes inside a cluster show highly similar expression patterns.
- Separation: genes from different clusters have different expression patterns.
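One way to make the two criteria above concrete is to average the Pearson correlation over gene pairs inside the same cluster (homogeneity) and over pairs from different clusters (separation). The sketch below assumes that formalization; the slides do not state which similarity measure is used, and the function names and toy profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def homogeneity_and_separation(profiles, labels):
    """Return (mean within-cluster similarity, mean between-cluster similarity).

    A good partition has a high first value and a low second value.
    """
    within, between = [], []
    for i, j in combinations(range(len(profiles)), 2):
        r = pearson(profiles[i], profiles[j])
        (within if labels[i] == labels[j] else between).append(r)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical toy data: two rising and two falling genes.
profiles = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.3],
    [2.0, 1.0, 0.0],
    [2.2, 0.9, 0.1],
]
labels = [0, 0, 1, 1]
hom, sep = homogeneity_and_separation(profiles, labels)
```

Here the within-cluster similarity is close to 1 and the between-cluster similarity is close to -1, which is what a clean two-pattern partition should look like.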
12Cluster Analysis (II)
- Implemented algorithms
- CLICK, K-means, SOM, Hierarchical
- Visualization
- Mean expression patterns
- Heat-maps
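To illustrate one of the listed algorithms and the "mean expression patterns" view, here is a minimal K-means sketch in plain Python. It is not the EXPANDER implementation: the function names, the fixed number of iterations, the explicit initial centroids, and the toy profiles are all assumptions for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to profile p."""
    distances = [squared_distance(p, c) for c in centroids]
    return distances.index(min(distances))

def kmeans(profiles, initial_centroids, iterations=10):
    """Alternate between assigning genes to centroids and updating centroids."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in profiles:
            members[nearest(p, centroids)].append(p)
        for k, group in enumerate(members):
            if group:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(group) for col in zip(*group)]
    labels = [nearest(p, centroids) for p in profiles]
    return labels, centroids

# Hypothetical toy data: two rising and two falling genes.
profiles = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.3],
    [2.0, 1.0, 0.0],
    [2.2, 0.9, 0.1],
]
labels, centroids = kmeans(profiles, initial_centroids=[profiles[0], profiles[2]])
```

Each final centroid is the mean expression pattern of its cluster, which is the quantity a mean-pattern plot would display.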
13Example study: responses to ionizing radiation
Ionizing Radiation
Double Strand Breaks
14Example study: experimental design
- Genotypes: Atm-/- and control w.t. mice
- Tissue: Lymph node
- Treatment: Ionizing radiation
- Time points: 0, 30 min, 120 min
- Microarrays: Affymetrix U74Av2 (12k probesets)
15Test case - Data Analysis
- Dataset: six conditions (2 genotypes, 3 time points)
- Normalization
- Filtering step: define the set of responding genes
- genes whose expression level changed by at least 1.75 fold
- Over 700 genes met this criterion
- The set contains genes with various response patterns; we applied CLICK to this set of genes
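The filtering step can be sketched as below. The slides do not say how the fold change was measured, so this example assumes it is the ratio between a gene's highest and lowest level across the six conditions, on a linear (non-log) scale. The gene names and values are made up for illustration.

```python
def max_fold_change(values):
    """Ratio between the highest and lowest expression level of one gene.

    Assumes positive, linear-scale values (an assumption of this sketch).
    """
    return max(values) / min(values)

def responding_genes(expression, threshold=1.75):
    """Keep the genes whose fold change reaches the threshold."""
    return [gene for gene, values in expression.items()
            if max_fold_change(values) >= threshold]

# Hypothetical toy data: gene -> expression level in six conditions.
expression = {
    "geneA": [100.0, 180.0, 210.0, 100.0, 105.0, 110.0],  # induced
    "geneB": [300.0, 310.0, 290.0, 305.0, 295.0, 300.0],  # flat
    "geneC": [400.0, 200.0, 220.0, 410.0, 390.0, 400.0],  # repressed
}
selected = responding_genes(expression)
```

Both induced and repressed genes pass the filter, which is why the resulting set mixes several response patterns and is then handed to a clustering algorithm.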
16Major Gene Clusters - Irradiated Lymph node
Atm-dependent early responding genes
17Major Gene Clusters - Irradiated Lymph node
Atm-dependent 2nd wave of responding genes
19Ascribe Functional Meaning to the Clusters
- Gene Ontology (GO) annotations for human, mouse, rat, chicken, fly, worm, Arabidopsis, zebrafish and yeast.
- TANGO: apply statistical tests that seek over-represented GO functional categories in the clusters.
20Enriched GO Functional Categories
- Hierarchical structure → highly dependent categories.
- Problems
- High redundancy
- Multiple testing corrections assume independent tests
- TANGO
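For orientation, the basic test behind "over-represented category" is usually a hypergeometric tail probability. The sketch below shows only that single-category test. It is not TANGO's procedure and does nothing about the redundancy and dependence problems listed above. The function name and the numbers are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, cluster, overlap):
    """P(X >= overlap) when `cluster` genes are drawn without replacement
    from `population` genes, of which `annotated` belong to the category."""
    total = comb(population, cluster)
    upper = min(annotated, cluster)
    tail = sum(comb(annotated, k) * comb(population - annotated, cluster - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 1000 genes on the chip, 50 annotated to a category,
# a cluster of 40 genes of which 10 carry the annotation (2 expected by chance).
p = hypergeom_enrichment_p(population=1000, annotated=50, cluster=40, overlap=10)
```

Running this test for every GO category in every cluster produces many correlated p-values, which is the multiple testing issue the slide raises.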
21Functional Enrichment - Visualization
22 Functional Categories
Cell cycle control (p < 1x10^-6)
23 Functional Categories
Cell cycle control (p < 5x10^-6), Apoptosis (p = 0.001)
25Clues are in the promoters
Identify Transcriptional Regulators
[Figure: network diagram with ATM, a hidden layer of transcription factors (p53, TF-A, TF-B, TF-C, NEW) with five "?" marks, and an observed layer of genes g1-g13]
26Reverse engineering of transcriptional networks
- Infers regulatory mechanisms from gene expression data
- Assumption
- co-expression → transcriptional co-regulation → common cis-regulatory promoter elements
- Step 1: Identification of co-expressed genes using microarray technology (clustering algorithms)
- Step 2: Computational identification of cis-regulatory elements that are over-represented in promoters of the co-expressed genes
27PRIMA: general description
- Input
- Target set (e.g., co-expressed genes)
- Background set (e.g., all genes on the chip)
- Analysis
- Identify transcription factors whose binding site signatures are enriched in the Target set with respect to the Background set.
- TF binding site models: TRANSFAC DB
- Default: from -1000 bp to 200 bp relative to the TSS
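The target-versus-background comparison can be illustrated with a simple ratio of hit frequencies. The slides name an "enrichment factor" but do not define it, so the definition below is an assumption of this sketch, as are the function name and the toy gene sets. Scanning promoters with TRANSFAC models to decide which genes carry a site is taken as already done.

```python
def enrichment_factor(target, background, genes_with_site):
    """Ratio between the fraction of target promoters that contain the
    binding site and the corresponding fraction in the background set.

    This definition is assumed for illustration. It raises
    ZeroDivisionError if no background promoter contains the site.
    """
    target_fraction = len(target & genes_with_site) / len(target)
    background_fraction = len(background & genes_with_site) / len(background)
    return target_fraction / background_fraction

# Hypothetical toy data: 20 genes on the chip, 5 of them co-expressed.
background = {f"g{i}" for i in range(1, 21)}
target = {"g1", "g2", "g3", "g4", "g5"}
genes_with_site = {"g1", "g2", "g3", "g8", "g15"}  # promoters with a hit

factor = enrichment_factor(target, background, genes_with_site)
# 3 of 5 target promoters versus 5 of 20 background promoters
```

A factor well above 1 means the site is more frequent among the co-expressed genes than on the chip as a whole. A p-value is still needed to judge whether the excess could arise by chance.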
28Promoter Analysis - Visualization
29PRIMA - Results
30PRIMA Results
P-value      Enrichment factor   Transcription factor
6.0x10^-5    2.6                 CREB

P-value      Enrichment factor   Transcription factor
3.8x10^-8    5.1                 NF-κB
9.6x10^-7    4.2                 p53
5.4x10^-6    3.2                 STAT-1
6.5x10^-4    1.7                 Sp-1
32Biclustering
- Clustering becomes too restrictive on large datasets
- Seeks global partition of genes according to similarity in their expression across ALL conditions
- Relevant knowledge can be revealed by identifying genes with a common pattern across a subset of the conditions
- Biclustering: algorithmic approach
33Biclustering: SAMBA - Statistical Algorithmic Method for Bicluster Analysis
A. Tanay, R. Sharan, R. Shamir, RECOMB 02
- Bicluster (module): subset of genes with similar behavior in a subset of conditions
- Computationally challenging: has to consider many combinations of sub-conditions
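The motivation for biclustering can be seen on a toy pair of genes that agree on only some conditions. The sketch below is not SAMBA. It only shows that similarity computed across all conditions can hide a pattern that is clear on a subset, and it counts how many condition subsets an exhaustive search would face. All values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical genes: similar on conditions 0-3, unrelated on conditions 4-7.
gene_a = [1.0, 2.0, 3.0, 4.0, 0.5, 3.0, 1.0, 2.5]
gene_b = [1.1, 2.1, 2.9, 4.2, 3.0, 0.4, 2.6, 1.0]
subset = [0, 1, 2, 3]

r_all = pearson(gene_a, gene_b)
r_sub = pearson([gene_a[c] for c in subset], [gene_b[c] for c in subset])

# Number of non-empty condition subsets for 8 conditions.
n_subsets = 2 ** len(gene_a) - 1
```

With only eight conditions there are already 255 candidate subsets, and the count doubles with each added condition, so exhaustive search is impractical on large datasets.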
34Biclustering - Visualization
35Expression Data Input File
[Figure: expression matrix with conditions as columns and probes as rows]
36ID Conversion File
37Normalization - Box plots
38Standardization of Expression Levels
39Cluster Analysis - Visualization (I)
40Cluster Analysis - Visualization (II)