Title: Aracne
1Aracne
- Jorge Viveros
- Summer 2006 Workshop
- June 29th, 2006
2Contents
- Overview (the problem, the alternatives, the central idea of ARACNE's algorithm)
- Demo (reconstruction of gene regulatory networks from Affymetrix gene expression data)
- Algorithm details (approximating the mutual information; comparative study results: ARACNE vs. Bayesian and Relevance Networks)
- Conclusions
- Bibliography
31. Overview: ARACNE
- Algorithm for the Reconstruction of Accurate Cellular Networks
- Reverse engineering or deconvolution problem
[Figure: expression samples for genes ga, gb, gc, gd, ge are processed by information-theoretic, maximum-entropy methods to yield a gene regulatory network]
4(overview, contd) Authors
- A.A. Margolin (1,2), I. Nemenman (2), K. Basso (3), C. Wiggins (2,4), G. Stolovitzky (5), R. Dalla-Favera (3), A. Califano (1,2)
- (1) Dept. of Biomedical Informatics, (2) Joint Centers for Systems Biology, (3) Institute for Cancer Genetics, (4) Dept. of Applied Physics and Applied Mathematics, Columbia University
- (5) IBM T.J. Watson Research Center
Main reference: http://www.arxiv.org/abs/q-bio/0410037; BMC Bioinformatics 2006, 7(Suppl 1):S7
5(overview, contd)
Goal
- Understand normal mammalian cell physiology and complex pathologic phenotypes by elucidating gene transcriptional regulatory networks.
Thesis
- Statistical associations between mRNA abundance levels help to uncover gene regulatory mechanisms.
6(overview, alternatives) ARACNE vs. Clustering
- ARACNE recovers specific transcriptional interactions but does not attempt to recover all of them (too complex a problem).
- Genome-wide clustering of gene expression profiles cannot discern direct (irreducible) from cascade transcriptional gene interactions.
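[Figure: starting from genes ga, gb, gc, gd, ge, clustering yields the groups (ga, gb), (gc, gd), (ge), whereas ARACNE yields a network over the nodes a, b, c, d, e]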
7(central idea) Gene network inference
- Edges: (direct) statistical dependencies, read as direct regulatory interactions
- Nodes: genes
- Temporal gene expression data for higher eukaryotes are difficult to obtain.
- Only steady-state statistical dependencies are studied.
[Figure: an edge joining nodes gi and gj]
8Accounting for dependence: definition and measurement
- Gene expression values: samples from a joint probability distribution (JPD).
- Consider the multi-information: the average log-deviation of the JPD from the product of its marginals (also the Kullback-Leibler divergence, KL-div, between the two).
- Use maximum entropy methods to approximate the JPD by an element of its m-way marginal Fréchet class (the m-way maximum-entropy estimate, m-MEE).
- Use the m-MEE to define the mth-order connected information (m-cinfo), which accounts for m-way statistical dependencies (only!).
- Multi-information = sum of all m-cinfos.
9The multi-information
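- Multi-information (KL-div): $I[P] = \sum_{x} P(x) \log \frac{P(x)}{\prod_{i=1}^{N} P_i(x_i)} = \sum_{i=1}^{N} H[P_i] - H[P]$
- $P(x)$, with $x = (x_1, \dots, x_N)$: the JPD; $i$ indexes the nodes (expressions of genes) and $P_i$ is the marginal of $x_i$.
- The sum over $x$ is an integral in the continuous case and a sum in the discrete case.
- $H[P]$: entropy of $P(x)$.
- The JPD is not known: approximate it!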
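As a small illustration of the definition on this slide, the sketch below computes the multi-information of a discrete JPD stored as a NumPy array, using the "sum of marginal entropies minus joint entropy" form. The function names and the two toy distributions are illustrative assumptions, not part of ARACNE.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in bits) of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())


def multi_information(joint):
    """Sum of the marginal entropies minus the joint entropy.

    `joint` is an N-dimensional array; axis i indexes the states of variable i.
    """
    joint = np.asarray(joint, dtype=float)
    n = joint.ndim
    total = 0.0
    for i in range(n):
        other_axes = tuple(a for a in range(n) if a != i)
        total += entropy(joint.sum(axis=other_axes))
    return total - entropy(joint)


# Two toy JPDs over two binary variables (hypothetical data).
independent = np.full((2, 2), 0.25)          # the variables ignore each other
copy = np.array([[0.5, 0.0], [0.0, 0.5]])    # the second variable copies the first

mi_independent = multi_information(independent)
mi_copy = multi_information(copy)
```

For two variables the multi-information reduces to the ordinary mutual information, which is why the independent pair gives zero and the copied pair gives one bit.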
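10m-way maximum-entropy estimate of the JPD
- The m-MEE, $\tilde P^{(m)}$, has the same m-way marginals as $P$.
- The m-MEE has an exponential form, $\tilde P^{(m)}(x) = \frac{1}{Z} \exp\left[-\sum_{i_1 < \dots < i_m} \lambda_{i_1 \dots i_m}(x_{i_1}, \dots, x_{i_m})\right]$, where the $\lambda$ are Lagrange multipliers enforcing the marginal constraints.
- The multipliers have no analytical solution in general, BUT they can be obtained via the iterative proportional fitting procedure (IPFP).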
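The iterative proportional fitting idea can be sketched for the pairwise case (m = 2) on a small discrete distribution. This is a minimal sketch, assuming the JPD is available as a NumPy array; the function name `ipfp_pairwise`, the iteration count and the toy distributions are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations


def ipfp_pairwise(joint, n_iter=500):
    """Approximate the 2-way maximum-entropy estimate of `joint` by IPFP.

    Starts from the uniform distribution and repeatedly rescales it so that
    each pairwise marginal matches the corresponding marginal of `joint`.
    """
    joint = np.asarray(joint, dtype=float)
    n = joint.ndim
    q = np.full(joint.shape, 1.0 / joint.size)
    for _ in range(n_iter):
        for i, j in combinations(range(n), 2):
            other = tuple(a for a in range(n) if a not in (i, j))
            target = joint.sum(axis=other, keepdims=True)
            current = q.sum(axis=other, keepdims=True)
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            q = q * ratio  # broadcasts over the summed-out axes
    return q


# Toy case 1: x3 = x1 XOR x2 with x1, x2 fair coins. All pairwise marginals
# are uniform, so the pairwise estimate is the uniform distribution.
xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        xor[a, b, a ^ b] = 0.25
xor_fit = ipfp_pairwise(xor)

# Toy case 2: a Markov chain x1 -> x2 -> x3 is already a pairwise model,
# so the pairwise estimate should recover it.
p1 = np.array([0.5, 0.5])
t12 = np.array([[0.9, 0.1], [0.2, 0.8]])
t23 = np.array([[0.7, 0.3], [0.1, 0.9]])
chain = np.einsum("a,ab,bc->abc", p1, t12, t23)
chain_fit = ipfp_pairwise(chain)
```

The XOR case shows what a pairwise estimate cannot see: the fitted distribution is uniform even though the true JPD is far from it.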
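11Connected and multi-informations
- mth-order connected information: $I_C^{(m)}[P] = I[\tilde P^{(m)}] - I[\tilde P^{(m-1)}]$, where $\tilde P^{(m)}$ denotes the m-MEE of $P$.
- Multi-information: $I[P] = \sum_{m=2}^{N} I_C^{(m)}[P]$
- Compensate for the lack of knowledge of the JPD by using the (truncated!) multi-information to establish and quantify statistical dependencies.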
12Detecting a particular m-way interaction
- An m-way interaction among $x_{i_1}, \dots, x_{i_m}$ contributes to the multi-information iff the minimum of the interaction multi-information (inter multi-info) over the interaction-specific Fréchet class is positive.
- Inter multi-info: defined from two m-MEEs that share the same m-way marginals except, perhaps, the one for $x_{i_1}, \dots, x_{i_m}$.
- Positivity of the minimal inter multi-info implies that the interaction is irreducible (direct). Thus draw edges coming from the nodes $i_1, \dots, i_m$ and meeting at an m-edge vertex.
13Examples
Regulatory cascade (Markov chain)
Information processing inequality
generically dependent (similarly, )
generically independent
No triplet interactions (coregulation)
14(examples, contd) Other dependencies
- 2 regulates 1 and 3, OR 1 and 3 regulate 2 jointly.
- The joint distribution does not factor, but the pairwise marginals do.
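One way to see how a joint distribution can fail to factor while its pairwise marginals do is an XOR toy model. The sketch below assumes, purely for illustration, that gene 2 is the exclusive-or of genes 1 and 3 (one hypothetical form of "1 and 3 regulate 2 jointly"); none of the names come from ARACNE.

```python
import numpy as np
from itertools import combinations, product

# Hypothetical toy JPD over three binary genes: x1 and x3 are fair coins
# and x2 = x1 XOR x3. Axis order is (x1, x2, x3).
joint = np.zeros((2, 2, 2))
for x1, x3 in product((0, 1), repeat=2):
    joint[x1, x1 ^ x3, x3] = 0.25

# Single-gene marginals.
singles = [joint.sum(axis=tuple(a for a in range(3) if a != i)) for i in range(3)]


def pair_factorizes(i, j):
    """True if the (i, j) marginal equals the product of its single marginals."""
    other = tuple(a for a in range(3) if a not in (i, j))
    pair = joint.sum(axis=other)
    return bool(np.allclose(pair, np.outer(singles[i], singles[j])))


pairs_factorize = all(pair_factorizes(i, j) for i, j in combinations(range(3), 2))
joint_factorizes = bool(np.allclose(joint, np.einsum("a,b,c->abc", *singles)))
```

Here every pair of genes looks independent, so no pairwise statistic can reveal the three-way dependency.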
152. Demo
- Platforms
- caWorkBench2.0 (Java; downloadable through the web site). Most developed features: microarray data analysis, pathway analysis and reverse engineering, sequence analysis, transcription factor binding site analysis, pattern discovery.
- http://amdec-bioinfo.cu-genome.org/html/caWorkBench.htm
- Cygwin (for Windows). Windows and Linux versions are available on the web site.
16(Demo) Sample input data file
- Input_file_name.exp
- N = 3 genes
- M = 2 microarrays
- The input file has N+1 = 4 lines
- The header line has M+2 fields; each gene line has 2M+2 fields

  AffyID HG_U95Av2 SudHL6.CHP ST486.CHP
  G1 G1 16.477367 0.69939363 20.150969 0.5297595
  G2 G2 7.6989274 0.55935365 26.04019 0.5445875
  G3 G3 8.8098955 0.5445875 21.554955 0.31372303

- Header line: AffyID, annotation name, then the microarray chip names.
- Gene lines: AffyID, annotation, then one (value, p-value) pair per chip, starting with chip 1.
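A minimal sketch of reading this layout in Python follows. It assumes whitespace-separated fields with no spaces inside the identifiers, which holds for the sample above but may not hold for real files; `parse_exp` and the dictionary keys are hypothetical names, not part of the ARACNE distribution.

```python
import io


def parse_exp(handle):
    """Parse the .exp layout shown above: a header line, then one line per gene."""
    header = handle.readline().split()
    chips = header[2:]  # the first two header fields are the ID and annotation labels
    genes = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        affy_id, annotation, numbers = fields[0], fields[1], fields[2:]
        if len(numbers) != 2 * len(chips):
            raise ValueError(f"expected {2 * len(chips)} numbers for {affy_id}")
        genes[affy_id] = {
            "annotation": annotation,
            "values": [float(v) for v in numbers[0::2]],    # expression values
            "p_values": [float(v) for v in numbers[1::2]],  # matching p-values
        }
    return chips, genes


SAMPLE = """AffyID HG_U95Av2 SudHL6.CHP ST486.CHP
G1 G1 16.477367 0.69939363 20.150969 0.5297595
G2 G2 7.6989274 0.55935365 26.04019 0.5445875
G3 G3 8.8098955 0.5445875 21.554955 0.31372303
"""

chips, genes = parse_exp(io.StringIO(SAMPLE))
```

With M = 2 chips, each gene ends up with two values and two p-values, matching the 2M+2 fields per gene line.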
17(Demo, contd) Syntax (Cygwin)
- ARACNE: algorithm for gene regulatory network computation given microarray data.
Usage:
  aracne GeneExpressionFile -a -k -s -t -e -f
  aracne -adj GeneExpressionFile AdjacencyFile -t -e
Options:
  -a  accurate | fast (default: accurate)
  -k  Gaussian kernel width, accurate method only (default: 0.15)
  -s  averaging window step size, fast method only (default: 6)
  -t  mutual information threshold (default: 0)
  -e  DPI tolerance, between 0 and 1 (default: 1)
  -f  mean, stdev (default: no filtering)
18(Demo, contd) Sample output data file
- File name: input_data_file_name, followed by the non-default parameter values, with extension .adj
- Number of lines = N genes
- Each line: AffyID:ID, followed by one (associated gene ID, MI value) pair per inferred interaction

  G1:0 8 0.064729
  G2:1 2 0.0298643 7 0.0521425
  G3:2 1 0.0298643
  G4:3 8 0.0427217
  G5:4 5 0.403516
  G6:5 4 0.403516 6 0.582265
  G7:6 5 0.582265 9 0.38039
  G8:7 1 0.0521425 8 0.743262
  G9:8 0 0.064729 3 0.0427217 7 0.743262 9 0.333104
  G10:9 6 0.38039 8 0.333104

[Figure: the network drawn from this file, with nodes numbered 1 to 10]
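The adjacency listing on this slide can be turned into an undirected edge list with a few lines of Python. This sketch assumes the first field of each line is the AffyID and the numeric gene ID joined by a colon (the separator is an assumption, since it did not survive in the slide text) and that the remaining fields come in (gene ID, MI) pairs; `parse_adj` is a hypothetical helper name.

```python
def parse_adj(text):
    """Return {(i, j): mi} with i < j for every interaction listed in the text."""
    edges = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # "G9:8" -> AffyID "G9", numeric ID 8 (colon separator assumed).
        _affy_id, _, gene_id = fields[0].rpartition(":")
        i = int(gene_id)
        for j_text, mi_text in zip(fields[1::2], fields[2::2]):
            j = int(j_text)
            edges[(min(i, j), max(i, j))] = float(mi_text)
    return edges


SAMPLE = """G1:0 8 0.064729
G2:1 2 0.0298643 7 0.0521425
G3:2 1 0.0298643
G4:3 8 0.0427217
G5:4 5 0.403516
G6:5 4 0.403516 6 0.582265
G7:6 5 0.582265 9 0.38039
G8:7 1 0.0521425 8 0.743262
G9:8 0 0.064729 3 0.0427217 7 0.743262 9 0.333104
G10:9 6 0.38039 8 0.333104
"""

edges = parse_adj(SAMPLE)
strongest = max(edges, key=edges.get)
```

Each interaction appears twice in the file (once per endpoint), so the 18 listed pairs collapse to 9 undirected edges.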
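19 3. Algorithm details
- Incorporate information-theoretic ideas (Markov networks) to model statistical dependencies (cf. 2).
- $P(\{g_i\}) = \frac{1}{Z} \exp\left[-\sum_{i} \phi_i(g_i) - \sum_{i,j} \phi_{ij}(g_i, g_j) - \sum_{i,j,k} \phi_{ijk}(g_i, g_j, g_k) - \dots\right] \equiv \frac{1}{Z} e^{-H(\{g_i\})}$: joint probability distribution function of the stationary expressions of all genes ($i = 1, \dots, N$).
- $N$: number of genes; $Z$: partition function (normalization factor); $H$: Hamiltonian.
- $\phi_i$, $\phi_{ij}$, $\phi_{ijk}$, ...: interaction potentials (e.g., genes $i$, $j$, $k$ do not interact in the model iff $\phi_{ijk} = 0$).
- Aim: identify nonzero potentials.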
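20(Algorithm details) ARACNE's model
- First-order approximation: genes are independent, $P(\{g_i\}) = \prod_i P(g_i)$.
- First-order potentials are obtained from the marginal probabilities $P(g_i)$ (estimated experimentally).
- ARACNE's approximation: truncate the joint probability distribution function to pairwise potentials, $H(\{g_i\}) = \sum_i \phi_i(g_i) + \sum_{i,j} \phi_{ij}(g_i, g_j)$.
- In this model, non-interacting genes have $\phi_{ij} = 0$. This includes statistically independent genes, $P(g_i, g_j) = P(g_i) P(g_j)$, and genes that do not interact directly, i.e., $P(g_i, g_j) \neq P(g_i) P(g_j)$ but $\phi_{ij} = 0$.
- Reduce the number of potential pairwise interactions via realistic biological assumptions.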
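21(algorithm details, contd) MI estimation
- Assume that two-way interactions (pairwise potentials) determine all statistical dependencies.
- Mutual information (MI): a measure of relatedness, $I(g_i, g_j) = S(g_i) + S(g_j) - S(g_i, g_j)$, where $S$ denotes entropy.
- $I(g_i, g_j) = 0$ iff $P(g_i, g_j) = P(g_i) P(g_j)$.
- MI approximation from $M$ samples $(x_i, y_i)$: $I(\{x_i\}, \{y_i\}) = \frac{1}{M} \sum_{i} \log \frac{f(x_i, y_i)}{f(x_i) f(y_i)}$, with the kernel density estimate $f(x, y) = \frac{1}{M} \sum_{i} \frac{1}{h^2} G\left(\frac{x - x_i}{h}, \frac{y - y_i}{h}\right)$.
- $G$: bivariate standard Gaussian density; $h$: kernel width.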
22(algorithm details, contd)
- Some details and technicalities
- Transform x, y so that they lie in [0, 1] and their marginal distributions appear uniform.
- There is no universal way of choosing h; however, the ranking of the MIs depends only weakly on it.
23(algorithm details, contd) Establishing the network
- Define a threshold $I_0$ to discard MIs below it (lower bound on interaction).
- Shuffle genes across microarray profiles, evaluate MIs for these seemingly independent genes, and choose $I_0$ based on what fraction of these MIs falls below the threshold.
- Data processing inequality: if genes $g_1$ and $g_2$ interact through $g_3$, then $I(g_1, g_2) \le \min[I(g_1, g_3), I(g_2, g_3)]$.
- ARACNE starts with a network that has an edge for every pair with $I(g_i, g_j) > I_0$; it then looks at gene triplets and removes the edge with the smallest MI.
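The thresholding and triplet-pruning steps can be sketched as follows, assuming a precomputed symmetric MI matrix. The function name, the way the tolerance enters the comparison, and the choice to test every triplet against the thresholded (unpruned) graph are assumptions of this sketch rather than details taken from the slides.

```python
import numpy as np
from itertools import combinations


def dpi_prune(mi, threshold=0.0, tolerance=0.0):
    """Return a boolean adjacency matrix after thresholding and DPI pruning.

    `mi` is a symmetric matrix of pairwise MI estimates. In each fully
    connected triplet the weakest edge is marked for removal, unless it is
    within `tolerance` (a fraction between 0 and 1) of the second weakest.
    """
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    adj = mi > threshold
    np.fill_diagonal(adj, False)
    remove = np.zeros_like(adj)
    for i, j, k in combinations(range(n), 3):
        if not (adj[i, j] and adj[j, k] and adj[i, k]):
            continue  # only fully connected triplets are examined
        edges = sorted([((i, j), mi[i, j]), ((j, k), mi[j, k]), ((i, k), mi[i, k])],
                       key=lambda e: e[1])
        (a, b), weakest = edges[0]
        second = edges[1][1]
        if weakest < second * (1.0 - tolerance):
            remove[a, b] = remove[b, a] = True
    return adj & ~remove


# Toy cascade 0 - 1 - 2 (hypothetical MI values): the indirect 0-2 edge is weakest.
mi = np.array([[0.0, 0.8, 0.3],
               [0.8, 0.0, 0.6],
               [0.3, 0.6, 0.0]])
pruned = dpi_prune(mi)
kept_all = dpi_prune(mi, tolerance=1.0)
```

With a tolerance of 1 nothing is ever removed, so the result is just the thresholded graph; a tolerance of 0 applies the inequality strictly.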
24(algorithm details, contd) Establishing the network
- ARACNE's algorithm complexity: $O(N^3 + N^2 M^2)$
- N: number of genes; M: number of samples
- DPI analysis: $O(N^3)$; MI estimation: $O(N^2 M^2)$ ($N^2$: order of the number of pairwise interactions)
25 Perfect network reconstruction theorems
- Thm 1: If MIs are estimated with no errors and the true underlying interaction network is a tree with only pairwise interactions, then ARACNE will reconstruct it.
- Thm 2: If the Chow-Liu maximum mutual information tree is a subnetwork of ARACNE's network, then this is the true network.
- Thm 3: ARACNE will reconstruct tree-network topologies exactly.
26Comparative study results
- Reconstruction of a class of synthetic transcriptional networks by Mendes et al. (cf. 1) and of a human B lymphocyte genetic network from gene expression profile data.
- Performance of ARACNE compared against Bayesian Networks (using the LibB package) and Relevance Networks (similar to ARACNE, but with a less accurate MI estimation procedure and a less-developed method of assigning statistical significance).
27(results) Synthetic networks
- 100 genes, 200 interactions, organized in two types of networks:
- 1. Erdős-Rényi: each vertex interaction is equally likely.
- 2. Scale-free topology: the distribution of vertex connections obeys a power law.
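For the first topology, a random graph with a fixed number of genes and interactions can be drawn in a few lines. This sketch covers only the undirected topology (not the expression dynamics of the synthetic networks) and uses the fixed-edge-count variant of the Erdős-Rényi model; the function name and seed are illustrative assumptions.

```python
import random
from itertools import combinations


def erdos_renyi_edges(n_genes=100, n_edges=200, seed=0):
    """Draw `n_edges` distinct gene pairs uniformly at random."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(n_genes), 2))  # every pair is equally likely
    return rng.sample(all_pairs, n_edges)


edges = erdos_renyi_edges()

# Degree of each gene, to contrast with the power-law case on the slide.
degree = [0] * 100
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
```

With 200 edges over 100 genes the mean degree is 4, and degrees cluster around that value instead of following a heavy-tailed power law.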
28(results) Performance metrics
- A pairwise gene interaction is
- (True) positive if the two genes are directly linked by a regulatory interaction.
- (True) negative if their interaction is not direct.
- Precision, $N_{TP} / (N_{TP} + N_{FP})$: fraction of true interactions among all inferred ones (expected success rate in experimental validation of predicted interactions).
- Recall, $N_{TP} / (N_{TP} + N_{FN})$: fraction of true interactions correctly inferred.
- Performance is assessed via precision-recall curves (PRCs).
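Computed on edge sets, the two metrics are precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below treats edges as undirected pairs; the function name and the toy edge lists are hypothetical.

```python
def precision_recall(true_edges, inferred_edges):
    """Precision and recall of an inferred undirected edge list."""
    true_set = {frozenset(e) for e in true_edges}
    inferred_set = {frozenset(e) for e in inferred_edges}
    tp = len(true_set & inferred_set)   # inferred and true
    fp = len(inferred_set - true_set)   # inferred but not true
    fn = len(true_set - inferred_set)   # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example (hypothetical edges): 4 true interactions, 3 inferred, 2 of them correct.
truth = [(0, 1), (1, 2), (2, 3), (3, 4)]
inferred = [(1, 0), (2, 3), (0, 4)]
precision, recall = precision_recall(truth, inferred)
```

Sweeping the MI threshold and recomputing the pair at each value traces out one precision-recall curve.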
29(results contd) PRCs for synthetic data
[Figure: precision-recall curves for synthetic network models 1 and 2]
ARACNE's performance is above 40% for both models.
30(results contd) Quantitative results on synthetic data
ARACNE recovers far more true connections and predicts far fewer false ones.
31(results contd) Results on human B cells
- Assembled an expression profile data set of 340 B lymphocytes from normal, tumor-related and experimentally manipulated populations.
- The data set was deconvoluted by ARACNE to generate a B-cell-specific regulatory network of 129,000 interactions.
- The network's quality was validated by comparing inferred interactions with those identified through biochemical methods.
- See cf. 3.
32Conclusions and Discussion
- The algorithm is robust enough for application to other network reconstruction problems in biology and in the social and engineering fields.
- Pairwise interaction model, hence higher-order potential interactions will not be accounted for (ARACNE's algorithm will open three-gene loops).
- A two-gene interaction will be detected iff there are no alternate paths.
- To keep three-gene loops, relax edge removal by introducing a tolerance parameter.
- ARACNE's performance deteriorates as the local (true) network topology deviates from a tree (tight loops may be a problem).
- ARACNE achieved high precision and substantial recall, even for few data points, when compared to BN and RN (synthetic data).
- ARACNE cannot predict the orientation of the edges of the network.
- The algorithm is suited for more complex (mammalian) networks.
33Bibliography
- [1] P. Mendes, W. Sha, K. Ye. Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics 2003, 19 Suppl 2:II122-II129.
- [2] I. Nemenman. Information theory, multivariate dependence and genetic network inference. Technical report arXiv:q-bio/0406015, 2004.
- [3] K. Basso, A.A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, A. Califano. Reverse engineering of regulatory networks in human B cells. Nature Genetics 2005, 37(4):382-390.
34Main web site
- Important documentation and relevant publications, application download and support.
- AMDeC Bioinformatics Core Facility at the Columbia Genome Center
- AMDeC (Academic Medicine Development Company)
- http://amdec-bioinfo.cu-genome.org/html/ARACNE.htm