Title: Evolutionary Systems Biology
1 Evolutionary Systems Biology
- Eugene V. Koonin
- National Center for Biotechnology Information
- National Library of Medicine
- NIH, Bethesda, MD
"Nothing in (systems) biology makes sense except
in the light of evolution"
After Theodosius Dobzhansky (1970)
2 Molecular evolution 1962-
- Zuckerkandl, E., Pauling, L. 1962. Molecular
evolution. In Horizons in Biochemistry, pp.
189-225
Majority of studies focus on sequence evolution:
- Phylogeny and taxonomy
- Genomic effects of natural selection
- Mechanisms of heredity, e.g., horizontal gene transfer
3 From Molecular Evolution to Evolutionary Systems Biology 2001-
"Every complex system can be abstracted in a network"
(I. King Jordan, pers. comm.)
"Systems biology offers an opportunity to study
how the phenotype is generated from the genotype
and with it a glimpse of how evolution has
crafted the phenotype." M. Kirschner, Cell, 2005
4 The links between evolution of mammalian
gene sequences
and gene expression networks
5 Gene sequences and expression in evolution
Evolution of expression regulation may be even
more important than sequence evolution
Britten, R. J. & Davidson, E. H. Repetitive and
non-repetitive DNA sequences and a speculation
on the origins of evolutionary novelty.
Q Rev Biol (1971) 46:111-138
King, M-C. & Wilson, A. C. Evolution at
two levels in humans and chimpanzees.
Science (1975) 188:107-116
6 Gene sequences and expression in evolution
Now we can study both on genome scale and at high
resolution
Schena, M. et al. Quantitative monitoring
of gene expression patterns with a
complementary DNA microarray.
Science (1995) 270:467-470
Fleischmann, R. D. et al. Whole-genome
random sequencing and assembly
of Haemophilus influenzae Rd.
Science (1995) 269:496-512
7 Motivation
Gene sequence and expression divergence
- comparative analysis of human and rodent gene
sequences: substitution rates, selection
- comparative analysis of human and rodent gene
expression profiles: correlations, coexpression
network topology, organization
Questions
1 - Are sequence evolutionary rates
related to expression levels and patterns?
2 - What are the topological properties of gene
co-expression networks?
3 - How does gene co-expression network topology
relate to sequence evolution?
4 - How does natural selection affect
gene expression divergence (convergence)?
8 Sequence analysis
Mouse
Human
- 1. Identify orthologs
- 2. Align sequences
- 3. Calculate substitution rates
gi|13376510|ref|NM_025000.1|NP_079276.1
CTTTGAGGTGTCATCCCTTTTGGAGAATGCTTTTCAGATTGGAGGCCATC---CTTGGCACTACATCGTC...
gi|37574097|ref|NM_198005.1|NP_932122.1
CTTTGAAGTTTCATCTCT---GGAGAACGCATTCCAGATCGGAGGCCATCAAACTTGGCACTACATCATC...
CDS: dN = nonsynonymous nucleotide substitution rate,
dS = synonymous nucleotide substitution rate
5' and 3' UTRs: d = nucleotide substitution rate
5' promoters: nt word counts, cis-regulatory sequences
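A minimal sketch of the simplest rate in this list, the per-site distance d for an aligned non-coding region such as a UTR. The slide does not say which substitution model was used; the Jukes-Cantor correction below is an assumption chosen for illustration, and the function names are hypothetical. It does not compute dN or dS, which require a codon-based method.

```python
import math

def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, skipping any column with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # alignment gap: not a substitution
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return diffs / compared

def jukes_cantor(p):
    """Jukes-Cantor corrected distance (assumed model, not from the slides)."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy alignment: 8 ungapped columns, 1 difference.
p = p_distance("ACGT-ACGT", "ACGTTACGA")
d = jukes_cantor(p)
```

The same two functions can be applied directly to the two aligned sequences shown on the slide.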
9 Expression profile analysis
Su, A. I. et al. Large scale analysis of the
human and mouse transcriptomes.
PNAS (2002) 99:4465
Su, A. I. et al. A gene atlas of the mouse and
human protein-encoding transcriptomes.
PNAS (2004) 101:6062
10 Expression profile analysis
Pearson correlation coefficient used as the
measure of similarity/distance between expression
profiles
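A minimal sketch of this similarity measure, assuming each expression profile is a list of expression values over the same ordered set of tissues. The function names and the conversion to a distance as 1 - r are illustrative assumptions; the slide does not state how the distance was defined.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

def profile_distance(x, y):
    """One common way to turn r into a distance (an assumption here)."""
    return 1.0 - pearson(x, y)

# Two hypothetical profiles over four tissues.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.0])
```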
11 Results
12 Results
Expression breadth vs. evolutionary rate
[Figure: axis label: expression breadth; panel P-values: 5.5e-7, 2.1e-5, ns, 4.9e-5]
13 Results
Expression level vs. evolutionary rate
[Figure: axis label: expression level; panel P-values: 5.5e-7, 1.1e-4, ns, 4.9e-5]
14 Expression vs. sequence divergence
- 19 tissues shared between human and mouse
- species-specific profiles of orthologous genes
are strongly correlated
[Figure: cumulative frequency of r; series: within, between]
15 Results
Expression divergence vs. sequence divergence
[Figure: axis label: correlation; panel P-values: ns, ns, ns, ns]
However, divergence of expression profiles is not
correlated with sequence divergence of orthologs
16 Sequence versus expression divergence
- divergence of sequence and expression pattern
not correlated
- but expression pattern divergence not neutral
- are distinct evolutionary mechanisms
responsible?
sequence - purifying selection:
divergence from CA
expression - adaptive selection:
convergence to coexpression
[Figure: two trees with leaf orders A B C and A C B; internal nodes labeled CA]
17 Results
Network parameters
1 node degree (k): number of links per node
2 degree distribution P(k): probability that a node has k links
3 clustering coefficient (C): ratio of the number
of observed links connecting the k_i
neighbors of node i to the possible number
of links
Network models
Barabasi & Oltvai (2004) Nat Rev Genet. 5:101
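The three parameters defined on this slide can be sketched for a small undirected network given as an edge list. This is an illustration with hypothetical function names and a toy graph, not the code used in the study.

```python
from collections import Counter
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_distribution(adj):
    """P(k): fraction of nodes that have exactly k links."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}

def clustering_coefficient(adj, node):
    """Observed links among the k_i neighbors of a node divided by
    the k_i * (k_i - 1) / 2 possible links."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Toy network: a triangle a-b-c with one extra node d attached to a.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
degrees = {node: len(nb) for node, nb in adj.items()}
pk = degree_distribution(adj)
c_a = clustering_coefficient(adj, "a")
```

The average degree of a network is twice the number of edges divided by the number of nodes, which is how edge and vertex counts relate to the average degrees quoted for whole networks.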
18 Results
19 http://www.cytoscape.org
20 Results
Network models
Node degree distribution
- The co-expression network is scale-free
Barabasi & Oltvai (2004) Nat Rev Genet. 5:101
21 Results
Network models
Clustering coefficient (C) vs. node degree
C is 22x stronger than expected
- The co-expression network is not hierarchical
Barabasi & Oltvai (2004) Nat Rev Genet. 5:101
22 Results
Node degree vs. evolutionary rate
[Figure: axis label: co-expressed genes; panel P-values: 4.9e-5, 1.7e-2, ns, 5.8e-2]
- Co-expression network hubs evolve slowly
23 Network comparison: degree distribution
Human: 158K edges, 7,208 non-zero degree vertices (avg
degree 44)
Mouse: 178K edges, 7,730 non-zero degree vertices (avg
degree 46)
24 Network comparison: clustering coefficient
Human 0.41
Mouse 0.44
- Networks contain very dense areas but no
evidence of hierarchical structure
25 Network comparison: global vs. local properties
- Globally, the mouse and human networks are very
similar; for all networks we observe
1 - power-law degree distributions
2 - high clustering coefficient
3 - many densely connected components
independent of the measure used to connect nodes
(Euclidean dist, Manhattan dist, dot product, etc.)
- Question: how similar are they
at the local level?
26 Network comparison: intersection graph
(edges connecting orthologs)
Pearson correlation, Cosine, Euclidean, Manhattan,
Jensen-Shannon
- only a small percentage of the edges is
preserved
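One way to picture the intersection graph: keep a human co-expression edge only if the orthologs of its two genes are also connected in the mouse network. The sketch below assumes a one-to-one ortholog map stored as a dictionary; the gene names and function names are hypothetical.

```python
def normalize_edges(edges):
    """Undirected edges as frozensets, dropping self-loops."""
    return {frozenset((u, v)) for u, v in edges if u != v}

def intersection_graph(human_edges, mouse_edges, orthologs):
    """Human edges whose ortholog pair is also an edge in mouse.

    orthologs maps a human gene to its mouse ortholog (assumed 1:1).
    """
    mouse = normalize_edges(mouse_edges)
    shared = set()
    for u, v in human_edges:
        if u not in orthologs or v not in orthologs:
            continue  # no ortholog: the edge cannot be compared
        mapped = frozenset((orthologs[u], orthologs[v]))
        if len(mapped) == 2 and mapped in mouse:
            shared.add(frozenset((u, v)))
    return shared

# Hypothetical toy networks.
human = [("H1", "H2"), ("H2", "H3"), ("H1", "H4")]
mouse = [("M2", "M1"), ("M3", "M4")]
orth = {"H1": "M1", "H2": "M2", "H3": "M3", "H4": "M4"}
shared = intersection_graph(human, mouse, orth)
fraction_preserved = len(shared) / len(normalize_edges(human))
```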
27 Results
- Network comparison: intersection graph
(significance)
test against randomly re-wired networks (with
degree distribution preserved)
PCC
Euclidean dist
- the observed conservation is highly significant,
even if low
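A sketch of how such a null model could be built, assuming the standard double edge swap: two edges a-b and c-d are replaced by a-d and c-b, which leaves every node's degree unchanged. The slides do not describe the rewiring procedure actually used, so the swap scheme, the empirical P-value formula, and all names below are assumptions.

```python
import random
from collections import Counter

def degree_counts(edges):
    """Degree of every node in an undirected edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts

def rewire_preserving_degrees(edges, n_swaps, rng):
    """Randomize a network by double edge swaps (degrees are preserved)."""
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(e) for e in edge_list}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i] = (a, d)
        edge_list[j] = (c, b)
        done += 1
    return edge_list

def empirical_p_value(observed, null_values):
    """Fraction of null values at least as large as the observed one,
    with the usual +1 correction."""
    hits = sum(1 for v in null_values if v >= observed)
    return (hits + 1) / (len(null_values) + 1)

edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
rewired = rewire_preserving_degrees(edges, 20, random.Random(0))
```

In use, the number of shared edges would be recomputed for many rewired networks and passed to `empirical_p_value` together with the observed count.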
28 Conclusions
- Sequence evolution is strongly linked to
expression network topology, e.g., highly
connected genes evolve slowly
- However, this relationship is not simple:
- - sequence divergence between orthologs is not
correlated with expression divergence
- - expression networks seem to rewire rapidly
during evolution despite conservation of the
global structure
- Thus, as opposed to sequence evolution, which is
dominated by divergence, expression networks
may evolve primarily by convergence of regulatory
elements
Jordan IK, Marino-Ramirez L, Wolf YI, Koonin EV.
Conservation and coevolution in the scale-free
human gene coexpression network.
Mol Biol Evol. 2004 Nov;21(11):2058-70
Jordan IK, Marino-Ramirez L, Koonin EV.
Evolutionary significance of gene expression
divergence. Gene. 2005 Jan 17;345(1):119-26
Tsaparas P, Jordan IK, Koonin EV, in preparation
29 Unifying measures of gene function and evolution
30 Evolutionary systems biology
- In principle, we address the classical problem:
the relationship between the (largely neutral?)
evolution of the genome and the (largely
adaptive) evolution of the phenotype
- In practice, the progress of genomics and other
OMICS allows us to measure, on whole-genome
scale, the effects of all kinds of molecular
phenotypic characteristics (expression level,
protein-protein interactions, etc.) on
evolutionary rates
- Can we synthesize these measurements to produce
a coherent picture of the links between
phenotypic and genomic evolution?
31 The Cautionary Tale
"It was six men of Indostan / To learning much
inclined, / Who went to see the Elephant / (Though
all of them were blind), / That each by observation
/ Might satisfy his mind" (J.G. Saxe)
32 The Cautionary Tale
"each was partly in the right / And all were in
the wrong"
(J.G. Saxe)
33 Evolution Rate vs. Fitness Effect
1974 Kimura & Ohta, 1976 Zuckerkandl, 1977 Wilson et al.:
there should be a correlation between the evolution rate and
importance of a gene (knockout fitness effect)
1999 Hurst & Smith: no, there isn't (mammalian data)
2001 Hirsh & Fraser, 2002 Jordan et al.:
yes, there is (the other guys had small biased dataset)
2003 Pal et al., 2004 Rocha & Danchin:
no, there isn't (expression level determines both ER and KE)
2003 Hirsh & Fraser, 2003 Krylov et al.:
yes, there is (we have double-checked and still there is)
2005 Wall et al., 2005 Zhang & He, 2005 Drummond et al.:
there is, weak but highly significant
Consensus, finally?
34 Different Faces of the Hypercube?
Pairwise correlations
Synthesis
35 Analysis of Multidimensional Data
36 Analysis of Multidimensional Data
37 Analysis of Multidimensional Data
PC1
PC3
PC2
Principal Components Analysis (PCA) introduces a
new orthogonal coordinate system where axes are
ranked by the fraction of original variance
accounted for.
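A small pure-Python sketch of the idea: standardize the variables, form their covariance (here, correlation) matrix, and find the first axis and the fraction of total variance it accounts for. Power iteration is used only to keep the example dependency-free; it is an assumption for illustration, not the procedure used for the analysis, and it assumes no variable is constant.

```python
import math

def standardize_columns(rows):
    """Center each column and scale it to unit sample variance."""
    n = len(rows)
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out_cols.append([(v - mean) / sd for v in col])
    return [list(r) for r in zip(*out_cols)]

def covariance_matrix(rows):
    """Covariance of already-centered columns."""
    n = len(rows)
    m = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(m)]
            for i in range(m)]

def first_principal_component(cov, iterations=500):
    """Leading eigenvector by power iteration, plus the fraction of
    total variance (eigenvalue / trace) it accounts for."""
    m = len(cov)
    v = [1.0 / (i + 1) for i in range(m)]  # arbitrary non-uniform start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m))
                     for i in range(m))
    trace = sum(cov[i][i] for i in range(m))
    return v, eigenvalue / trace

# Toy data: two perfectly correlated variables, so PC1 explains everything.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1, fraction = first_principal_component(
    covariance_matrix(standardize_columns(data)))
```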
38 The Data Set: KOGs
- Ideally, we would like to obtain and synthesize
data for individual genes in precise space-time
coordinates (e.g., instant evolutionary rates)
- However:
- some of the parameters (variables) are not easily
measurable (if defined at all) for genes in
extant species, e.g., rate of evolution
- much of the data are inherently noisy, either due
to technical problems or true biological
variation, e.g., fitness effect of gene
disruption
- Thus, we analyze orthologous protein sets, using
the proteins from different species to derive
some data and smooth out variations in other data.
- Practically, this means using the KOG dataset
(with additions): 10058 KOGs from 15 species.
- Koonin et al. A comprehensive evolutionary
classification of proteins encoded in complete
eukaryotic genomes. Genome Biol. 2004;5(2):R7
39 The Data Set: KOGs
Plants
Amoebozoa
Fungi
Animals
Original KOGs for some species, "index orthologs"
for others.
40 Variables: Gene Loss
Propensity for Gene Loss (PGL), introduced by
Krylov et al. (Genome Res. 13, 2229-2235, 2003).
Computed from the KOG phyletic pattern.
Originally an empirical measure (Dollo parsimony
reconstruction of events; ratio of branch
lengths). This work employs an Expectation
Maximization algorithm.
41 Variables: Gene Duplication
Number of Paralogs: average number observed for a
given KOG. Example: KOG0417 (Ubiquitin-protein
ligase) and KOG0424 (Ubiquitin-protein ligase).
42 Variables: Evolution Rate
- Select a taxon
- Build an alignment (MUSCLE)
- Compute distance matrix (PAML)
- Select minimum distance between members of the
two subtrees of the group.
Ascomycota: Sordariomycetes vs. Yeasts
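The last step can be sketched as follows, assuming the pairwise distances are already available as a dictionary keyed by species pairs. The species labels and distance values are made up for illustration, and the function name is hypothetical.

```python
def min_cross_distance(dist, subtree_a, subtree_b):
    """Smallest distance between any member of one subtree and any
    member of the other; returns None if no pair has a distance."""
    best = None
    for a in subtree_a:
        for b in subtree_b:
            d = dist.get((a, b), dist.get((b, a)))
            if d is None:
                continue  # protein missing from one of the species
            if best is None or d < best:
                best = d
    return best

# Hypothetical distances for one KOG between two subtrees of a taxon.
dist = {
    ("Mg", "Sc"): 1.31,
    ("Mg", "Ca"): 1.12,
    ("Nc", "Sc"): 1.25,
    ("Ca", "Nc"): 0.98,  # stored in the opposite order on purpose
}
rate = min_cross_distance(dist, ["Mg", "Nc"], ["Sc", "Ca"])
```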
43 Variables: Expression Level
Expression Level data for S. cerevisiae, D.
melanogaster and H. sapiens were downloaded from
UCSC Table Browser (hgFixed).
Organism  Table               exp.  probes  KOGs
Sacce     yeastChoCellCycle     17    6602  3030
Drome     arbFlyLifeAll        162    4921  2617
Homsa     gnfHumanAtlas2All    158   10197  3872
Standardized (μ = 0, σ = 1) log values; maximum
expression level among paralogs was used to
represent a KOG.
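A sketch of the two operations described here: standardizing log expression values to zero mean and unit standard deviation, then representing each KOG by the maximum over its paralogs. The gene names, the natural logarithm, and the use of the sample standard deviation are assumptions for illustration.

```python
import math
from statistics import mean, stdev

def standardized_log_levels(levels):
    """Log-transform expression levels, then standardize to mean 0, sd 1."""
    logs = {gene: math.log(value) for gene, value in levels.items()}
    mu = mean(logs.values())
    sd = stdev(logs.values())  # sample standard deviation (assumed)
    return {gene: (x - mu) / sd for gene, x in logs.items()}

def kog_expression(z_scores, kog_members):
    """Represent each KOG by the maximum value among its paralogs;
    KOGs with no measured member are left out."""
    result = {}
    for kog, genes in kog_members.items():
        values = [z_scores[g] for g in genes if g in z_scores]
        if values:
            result[kog] = max(values)
    return result

# Hypothetical expression levels and KOG membership.
z = standardized_log_levels({"g1": 1.0, "g2": 10.0, "g3": 100.0})
per_kog = kog_expression(z, {"K1": ["g1", "g3"], "K2": ["g2"], "K3": ["gX"]})
```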
44 Variables: Interactions
Physical Protein-Protein and Genetic Interactions
(PPI and GI) data for S. cerevisiae, C. elegans
and D. melanogaster were downloaded from the GRID
FTP site. Maximum number of interaction partners
among paralogs was used to represent a KOG.
45 Variables: Lethality
Lethality of Gene Knockout data for S. cerevisiae
were downloaded from MIPS FTP site (0/1 values).
Embryonic Lethality of RNA Interference (RNAi) data
for C. elegans were taken from Kamath et al., 2003
(0/1 values).
46 Missing Data
Total: 32 variables in 10058 KOGs; lots of
missing data.
Complete data (all 34 parameters available):
381 KOGs; too few.
Combined data: 7 variables, 3724 KOGs (after
removal of outliers).
Example: evolution rate.
          At.Os  Sc.Ca  Mg.Nc  Hs.Mm  Pl.MF
KOG0009       -  0.168  0.300      -  0.405
KOG0010   0.671  1.252  0.606  0.087  1.492
KOG0011   0.905  1.698  0.428  0.073  1.547
KOG0012       -  2.238  0.665  0.244      -
KOG0013   0.355      -      -  0.014  1.343
KOG0014   1.913  4.041      -  0.126  2.840
KOG0015       -  2.286  0.400  0.027      -
KOG0016       -      -  0.506  0.380      -
          0.667  1.864  0.521  0.075  1.910
Each value divided by the last row of the table above,
then averaged over the available columns:
          At.Os  Sc.Ca  Mg.Nc  Hs.Mm  Pl.MF  Average
KOG0009       -  0.090  0.575      -  0.212    0.293
KOG0010   1.006  0.672  1.162  1.166  0.781    0.957
KOG0011   1.358  0.911  0.821  0.984  0.810    0.977
KOG0012       -  1.201  1.275  3.275      -    1.917
KOG0013   0.532      -      -  0.181  0.703    0.472
KOG0014   2.869  2.168      -  1.692  1.487    2.054
KOG0015       -  1.227  0.767  0.365      -    0.786
KOG0016       -      -  0.970  5.087      -    3.028
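The example can be reproduced with a short sketch: divide each available rate by a per-column scale and average whatever values remain for the KOG. The scale values are taken from the unlabeled row of the slide's first table; how that row was computed is not stated, so treating it as a given per-column scale is an assumption, and small differences from the slide come from its rounding.

```python
def combined_rate(rates, scale):
    """Average of scale-normalized rates, skipping missing values (None)."""
    normalized = [r / s for r, s in zip(rates, scale) if r is not None]
    if not normalized:
        return None  # no data at all for this KOG
    return sum(normalized) / len(normalized)

# Column order: At.Os, Sc.Ca, Mg.Nc, Hs.Mm, Pl.MF
scale = [0.667, 1.864, 0.521, 0.075, 1.910]
kog0009 = [None, 0.168, 0.300, None, 0.405]
kog0016 = [None, None, 0.506, 0.380, None]

avg_0009 = combined_rate(kog0009, scale)
avg_0016 = combined_rate(kog0016, scale)
```

Dividing by a per-column scale first keeps a fast-evolving lineage pair from dominating the average when other columns are missing.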
47 Correlations between variables
       NP      PPI     GI      PGL     ER      EL      KE
NP     -
PPI    0.205   -
GI     0.070   0.041   -
PGL    0.001  -0.117   0.006   -
ER    -0.065  -0.185   0.047   0.147   -
EL     0.329   0.222  -0.040  -0.113  -0.278   -
KE     0.017   0.219  -0.090  -0.193  -0.157   0.137   -
48 Two Classes of Variables
Observation on the pattern of pairwise
relationships in the data: "phenotypic" and
"evolutionary" variables.
49 Structure of the correlation table: phenotypic
and evolutionary variables
       NP      PPI     GI      PGL     ER      EL      KE
NP     -
PPI    0.205   -
GI     0.070   0.041   -
PGL    0.001  -0.117   0.006   -
ER    -0.065  -0.185   0.047   0.147   -
EL     0.329   0.222  -0.040  -0.113  -0.278   -
KE     0.017   0.219  -0.090  -0.193  -0.157   0.137   -
GI: redundant pathways, backup
KE: essential function, no backup
50 PCA of the Data Space
       PC.1    PC.2    PC.3
NP     0.35    0.59    0.16
PPPI   0.46    0.09   -0.21
GPPI  -0.04    0.46   -0.79
PGL   -0.30    0.40    0.44
ER    -0.43    0.16   -0.12
EL     0.51    0.23    0.28
KE     0.37   -0.44   -0.15
-----------------------------
Var.  26.10   16.95   14.26
51 PCA of the Data Space
1st vs 2nd PC
52 PCA of the Data Space
2nd vs 3rd PC
53 Positive and negative contributions to PC1
PC1: status/importance of a gene
54 Positive and negative contributions to PC2
"adaptable"
"rigid"
PC2: gene's adaptability
55 Positive and negative contributions to PC3
"edge"
"core"
PC3: a different (non-interactive) kind of
adaptability
56 Interpretation of the first 3 PCs
PC3 "Adaptability 2"
PC2 "Adaptability 1"
PC1 "Status"
57 Prediction of the adaptability model: expression
profile skew
Skew 0
Skew 0
Human expression profiles
              PC2 LOW   PC2 HIGH   P
Status LOW    1.9       2.3        2.9E-07
Status HIGH   2.1       2.6        3.6E-12
PC2 is a measure of "Adaptability"
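The slide does not define how the skew of an expression profile was computed. As an assumption, the sketch below uses the standard moment coefficient of skewness over a gene's expression values across tissues: a profile dominated by one or a few tissues gives a large positive value, a symmetric profile gives a value near zero.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant profile: no asymmetry
    return m3 / m2 ** 1.5

# Hypothetical profiles over four tissues.
broad = skewness([2.0, 3.0, 3.0, 4.0])      # symmetric
specific = skewness([0.0, 0.0, 0.0, 4.0])   # one dominant tissue
```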
58 The status-adaptability model of the
phenotype-genotype-evolution relationship
59 Status and Adaptability of Gene Classes
Classification of KOGs into 4 major categories
60 Status and Adaptability of Gene Classes
High Adaptability
neutral
High Status Low Adaptability
Low Status
Classification of KOGs into 4 major categories
61 Status and Adaptability of Genes
Low Status
High Status
Replication Repair KOGs
62 Status and Adaptability of Genes
Variable Repair
Core Replication
Replication Repair KOGs
63 Status and Adaptability of Genes
Cytoplasmic and Mitochondrial ribosomal proteins
64 Status and Adaptability of Genes
Replication Licensing Complex and Histones
65 Conclusions
- Two composite variables, "status" and
"adaptability", dominate the multidimensional
parameter space of quantitative genomics
- The notion of status provides biologically
relevant null hypotheses regarding the
connections between various phenotypic and
evolutionary variables
- Breaks in the pattern may indicate non-trivial
links: targets for further investigation
- Functional groups of genes show distinctive
patterns of status and adaptability
- Wolf YI, Carmel L, Koonin EV, submitted
66 The Cautionary Tale
?
Wrong again?!
67 Acknowledgements
I. King Jordan
Liran Carmel
Yuri I. Wolf
Artwork: Olga Karengina (LSDN, Moscow)
Panayiotis Tsaparas
Leonardo Mariño-Ramírez