Title: Bioinformatics III
1 Bioinformatics III: Systems biology, integrative cell biology
The course will address two areas:
- 25% genomics: single-protein phylogenies versus genome rearrangement, comparative genomics
- 75% integrated view of cellular networks
2 Content
- Week 1: scale-free networks in biology
- Week 2: transcription, regulatory networks
- Week 3: protein complexes (Cellzome, Aloy et al. 2004)
- Week 4: protein networks: experimental data (Y2H, MS), computational data (Rosetta)
- Week 5: protein networks: graphical layout (force minimization)
- Week 6: protein networks: quality check (Bayesian analysis)
- Week 7: protein networks: modularity?
- Week 8: phylogeny
- Week 9: genome rearrangement (breakpoint analysis)
- Weeks 10-11: metabolic networks: metabolic flux analysis, extreme pathways, elementary modes, C13 method
- Week 12: mathematical modelling of signal transduction networks
- Week 13: integration of protein networks with metabolic pathways
- Week 14: exam
3 Literature
Lecture slides will be available 1-2 days prior to the lecture.
Suggested reading: links will be put up on the course website, http://gepard.bioinformatik.uni-saarland.de/teaching...
4 Assignments
12 weekly assignments are planned.
Homework assignments are handed out in the Thursday lectures and are available on the course website on the same day.
Solutions must be returned by 14.00 on Thursday of the following week to Tihamer Geyer in room 1.09, Geb. 17.1, first floor, or handed in prior (!) to the lecture starting at 14.15.
Two students may submit one joint solution.
Solutions may also be submitted by e-mail as one printable PDF file to tihamer.geyer_at_bioinformatik.uni-saarland.de.
Tutorial: participation is recommended but not mandatory. Tue 11-13. Homework submitted on Thursdays will be discussed on the following Tuesday.
In case of illness, please send an e-mail to kerstin.gronow-p_at_bioinformatik.uni-saarland.de and provide a medical certificate.
5 Schein: successful written exam
Successful participation in the lecture course (Schein) will be certified upon successful completion of the written exam in February 2005.
The exam is open to students who have received ≥ 50% of the credit points for the 12 assignments.
Unless published otherwise on the course website at least 3 weeks prior to the exam, the exam will be based on all material covered in the lectures and in the assignments.
In case of illness, please send an e-mail to kerstin.gronow-p_at_bioinformatik.uni-saarland.de and provide a medical certificate.
A second and final-chance exam will be offered in April 2005.
6 Tutor
Dr. Tihamer Geyer (assignments), Geb. 17.1, room 1.09, tihamer.geyer_at_bioinformatik.uni-saarland.de
7 Systems biology
Biological research in the 1900s followed a reductionist approach: detect an unusual phenotype → isolate/purify one protein/gene → determine its function.
However, it is increasingly clear that a discrete biological function can only rarely be attributed to an individual molecule.
→ New task: understanding the structure and dynamics of the complex intercellular web of interactions that contribute to the structure and function of a living cell.
8 Systems biology
The development of high-throughput data-collection techniques, e.g. microarrays, protein chips and yeast two-hybrid screens, makes it possible to interrogate all cell components simultaneously at any given time.
→ There exist various types of interaction webs/networks:
- protein-protein interaction network
- metabolic network
- signalling network
- transcription/regulatory network
- ...
These networks are not independent but form a network of networks.
9 DOE initiative: Genomes to Life, a coordinated effort
Slides borrowed from a talk by Marvin Frazier, Life Sciences Division, U.S. Dept. of Energy
10 Facility I: Production and Characterization of Proteins
Estimating Microbial Genome Capability
- Computational Analysis
  - Genome analysis of genes, proteins, and operons
  - Metabolic pathways analysis from reference data
  - Protein machines estimate from PM reference data
- Knowledge Captured
  - Initial annotation of genome
  - Initial perceptions of pathways and processes
  - Recognized machines, function, and homology
  - Novel proteins/machines (including prioritization)
  - Production conditions and experience
11 Facility II: Whole Proteome Analysis
Modeling Proteome Expression, Regulation, and Pathways
- Analysis and Modeling
  - Mass spectrometry expression analysis
  - Metabolic and regulatory pathway/network analysis and modeling
- Knowledge Captured
  - Expression data and conditions
  - Novel pathways and processes
  - Functional inferences about novel proteins/machines
  - Genome super annotation: regulation, function, and processes (deep knowledge about cellular subsystems)
12 Facility III: Characterization and Imaging of Molecular Machines
Exploring Molecular Machine Geometry and Dynamics
- Computational Analysis, Modeling and Simulation
  - Image analysis/cryoelectron microscopy
  - Protein interaction analysis/mass spec
  - Machine geometry and docking modeling
  - Machine biophysical dynamic simulation
- Knowledge Captured
  - Machine composition, organization, geometry, assembly and disassembly
  - Component docking and dynamic simulations of machines
13 Facility IV: Analysis and Modeling of Cellular Systems
Simulating Cell and Community Dynamics
- Analysis, Modeling and Simulation
  - Couple knowledge of pathways, networks, and machines to generate an understanding of cellular and multi-cellular systems
  - Metabolism, regulation, and machine simulation
  - Cell and multicell modeling and flux visualization
- Knowledge Captured
  - Cell and community measurement data sets
  - Protein machine assembly time-course data sets
  - Dynamic models and simulations of cell processes
14 Genomes To Life Computing Roadmap
[Figure: computing and information infrastructure capabilities plotted against biological complexity, relative to current U.S. computing. Labels in the figure: comparative genomics; genome-scale protein threading; constrained rigid docking; constraint-based flexible docking; protein machine interactions; molecular machine classical simulation; cell, pathway, and network simulation; community metabolic, regulatory, and signaling simulations; molecule-based cell simulation.]
15 First breakthrough: scale-free metabolic networks
(d) The degree distribution, P(k), of the metabolic network illustrates its scale-free topology.
(e) The scaling of the clustering coefficient C(k) with the degree k illustrates the hierarchical architecture of metabolism. (The data shown in d and e represent an average over 43 organisms.)
(f) The flux distribution in the central metabolism of Escherichia coli follows a power law, which indicates that most reactions have a small metabolic flux, whereas a few reactions with high fluxes carry most of the metabolic activity.
Note that on all three plots the axes are logarithmic, and a straight line on such log-log plots indicates power-law scaling.
Abbreviations: CTP, cytidine triphosphate; GLC, aldo-hexose glucose; UDP, uridine diphosphate; UMP, uridine monophosphate; UTP, uridine triphosphate.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
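The remark that a power law shows up as a straight line on a log-log plot can be checked numerically. The following is a minimal sketch on synthetic data, not the analysis behind the figure: the function name `loglog_slope` and the exponent 2.2 are assumptions chosen for illustration, and a plain least-squares fit is only a rough estimator for real, noisy degree data.

```python
import math

def loglog_slope(ks, pk):
    """Least-squares slope of log(pk) plotted against log(ks)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pk]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic, unnormalised power law P(k) ~ k^(-2.2); the exponent is arbitrary.
ks = list(range(1, 101))
pk = [k ** -2.2 for k in ks]
slope = loglog_slope(ks, pk)
print(round(slope, 3))
```

The fitted slope is the negative of the exponent, which is why the exponent can be read directly off the straight line.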
16 Second breakthrough: yeast protein interaction network, first example of a scale-free network
A map of protein-protein interactions in Saccharomyces cerevisiae, which is based on early yeast two-hybrid measurements, illustrates that a few highly connected nodes (which are also known as hubs) hold the network together. The largest cluster, which contains 78% of all proteins, is shown. The colour of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
17 Characterising metabolic networks
To study the network characteristics of metabolism, a graph-theoretic description needs to be established.
(a) illustrates the graph-theoretic description for a simple pathway (catalysed by Mg2+-dependent enzymes).
(b) In the most abstract approach all interacting metabolites are considered equally. The links between nodes represent reactions that interconvert one substrate into another.
(c) For many biological applications it is useful to ignore co-factors, such as the high-energy-phosphate donor ATP, which results in a second type of mapping that connects only the main source metabolites to the main products.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
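The two mappings (b) and (c) can be sketched in a few lines. This is an illustrative sketch only: the two toy reactions, the co-factor list and the function name `metabolite_links` are assumptions and do not reproduce the pathway shown in the figure.

```python
from itertools import product

# Hypothetical toy pathway: each reaction is (substrates, products).
# ATP and ADP play the role of co-factors here.
reactions = [
    (["A", "ATP"], ["B", "ADP"]),
    (["B", "ATP"], ["C", "ADP"]),
]
cofactors = {"ATP", "ADP"}  # assumed co-factor list

def metabolite_links(reactions, ignore=frozenset()):
    """Directed links from every substrate to every product of each reaction,
    skipping any metabolite listed in `ignore`."""
    links = set()
    for substrates, products in reactions:
        for s, p in product(substrates, products):
            if s not in ignore and p not in ignore:
                links.add((s, p))
    return links

all_links = metabolite_links(reactions)              # mapping (b): all metabolites
main_links = metabolite_links(reactions, cofactors)  # mapping (c): co-factors ignored
print(sorted(all_links))
print(sorted(main_links))
```

Ignoring the co-factors leaves only the chain of main metabolites, which is usually the representation of interest when comparing pathway topology.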
18 Degree
The most elementary characteristic of a node is its degree (or connectivity), $k$, which tells us how many links the node has to other nodes.
(a) In the undirected network, node A has $k = 5$.
(b) In networks in which each link has a selected direction there is an incoming degree, $k_{in}$, which denotes the number of links that point to a node, and an outgoing degree, $k_{out}$, which denotes the number of links that start from it. E.g., node A in b has $k_{in} = 4$ and $k_{out} = 1$.
An undirected network with $N$ nodes and $L$ links is characterized by an average degree $\langle k \rangle = 2L/N$ (where $\langle \cdot \rangle$ denotes the average).
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
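The degree definitions above translate directly into counting over an edge list. A minimal sketch, assuming small hypothetical networks (they are not the networks in the figure, only chosen so that node A has the degrees quoted on the slide); nodes without any link do not appear in an edge list and would have to be added separately.

```python
from collections import Counter

def degrees(edges):
    """Degree k of each node in an undirected edge list."""
    k = Counter()
    for u, v in edges:
        k[u] += 1
        k[v] += 1
    return k

def in_out_degrees(arcs):
    """Incoming and outgoing degree for a directed edge list (u -> v)."""
    k_in, k_out = Counter(), Counter()
    for u, v in arcs:
        k_out[u] += 1
        k_in[v] += 1
    return k_in, k_out

# Hypothetical undirected network in which node A has k = 5.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"), ("B", "C")]
k = degrees(edges)
avg_k = 2 * len(edges) / len(k)  # <k> = 2L/N

# Hypothetical directed network in which node A has k_in = 4 and k_out = 1.
arcs = [("B", "A"), ("C", "A"), ("D", "A"), ("E", "A"), ("A", "F")]
k_in, k_out = in_out_degrees(arcs)
print(k["A"], avg_k, k_in["A"], k_out["A"])
```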
19 Degree distribution
The degree distribution, $P(k)$, gives the probability that a selected node has exactly $k$ links. $P(k)$ is obtained by counting the number of nodes $N(k)$ with $k = 1, 2, \ldots$ links and dividing by the total number of nodes $N$.
The degree distribution allows us to distinguish between different classes of networks. For example, a peaked degree distribution, as seen in a random network, indicates that the system has a characteristic degree and that there are no highly connected nodes (which are also known as hubs). By contrast, a power-law degree distribution indicates that a few hubs hold together numerous small nodes.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
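The counting recipe $P(k) = N(k)/N$ can be written out as follows. This is a sketch on a hypothetical six-node network; the function name `degree_distribution` is an assumption made for illustration.

```python
from collections import Counter

def degree_distribution(edges, nodes):
    """Return {k: P(k)} with P(k) = N(k) / N for an undirected network."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())  # N(k): number of nodes with k links
    total = len(degree)                # N: total number of nodes
    return {k: n_k / total for k, n_k in sorted(counts.items())}

# Hypothetical network: one hub (A) and five less connected nodes.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F"), ("B", "C")]
pk = degree_distribution(edges, nodes)
print(pk)
```

Passing the node list explicitly keeps isolated nodes in the count, so the probabilities always sum to one.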
20 Random networks
(Aa) The Erdős-Rényi (ER) model of a random network starts with $N$ nodes and connects each pair of nodes with probability $p$, which creates a graph with approximately $pN(N-1)/2$ randomly placed links.
(Ab) The node degrees follow a Poisson distribution, where most nodes have approximately the same number of links (close to the average degree $\langle k \rangle$). The tail (high-$k$ region) of the degree distribution $P(k)$ decreases exponentially, which indicates that nodes that significantly deviate from the average are extremely rare.
(Ac) The clustering coefficient is independent of a node's degree, so $C(k)$ appears as a horizontal line if plotted as a function of $k$.
The mean path length is proportional to the logarithm of the network size, $\ell \sim \log N$, which indicates that the network is characterized by the small-world property.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
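The ER construction is short enough to sketch directly. The values of `n`, `p` and the seed below are arbitrary assumptions for illustration; the point is only that the number of links comes out close to $pN(N-1)/2$.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """Connect each of the n(n-1)/2 node pairs independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

n, p = 200, 0.05                      # arbitrary illustration values
edges = erdos_renyi(n, p)
expected_links = p * n * (n - 1) / 2  # about 995 links expected
avg_k = 2 * len(edges) / n            # should be close to p * (n - 1)
print(len(edges), expected_links, avg_k)
```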
21 Origin of scale-free topology and hubs in biological networks
The origin of the scale-free topology in complex networks can be reduced to two basic mechanisms: growth and preferential attachment.
Growth means that the network emerges through the subsequent addition of new nodes, such as the new red node that is added to the network shown in part a.
Preferential attachment means that new nodes prefer to link to more connected nodes. For example, the probability that the red node will connect to node 1 is twice as large as that of connecting to node 2, as the degree of node 1 ($k_1 = 4$) is twice the degree of node 2 ($k_2 = 2$).
Growth and preferential attachment generate hubs through a 'rich-gets-richer' mechanism: the more connected a node is, the more likely it is that new nodes will link to it, which allows the highly connected nodes to acquire new links faster than their less connected peers.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
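The factor of two in the example follows from making the attachment probability proportional to the degree. A minimal sketch, assuming linear preferential attachment and a hypothetical five-node network: only the degrees of nodes 1 and 2 are taken from the slide, the degrees of nodes 3 to 5 are made up.

```python
def attachment_probabilities(degree):
    """Probability of a new node linking to each existing node,
    taken proportional to that node's degree."""
    total = sum(degree.values())
    return {node: k / total for node, k in degree.items()}

# k_1 = 4 and k_2 = 2 as on the slide; nodes 3, 4, 5 are assumptions
# (consistent with the edges 1-2, 1-3, 1-4, 1-5, 2-5).
degree = {1: 4, 2: 2, 3: 1, 4: 1, 5: 2}
prob = attachment_probabilities(degree)
print(prob[1], prob[2])
```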
22 Scale-free networks
Scale-free networks are characterized by a power-law degree distribution: the probability that a node has $k$ links follows $P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent. The probability that a node is highly connected is statistically more significant than in a random graph, the network's properties often being determined by a relatively small number of highly connected nodes (hubs, see blue nodes in Ba).
In the Barabási-Albert model of a scale-free network, at each time point a node with $M$ links is added to the network, which connects to an already existing node $I$ with probability $\Pi_I = k_I / \sum_J k_J$, where $k_I$ is the degree of node $I$ and $J$ is the index denoting the sum over network nodes. The network that is generated by this growth process has a power-law degree distribution with $\gamma = 3$.
(Bb) Such distributions are seen as a straight line on a log-log plot.
(Bc) The network that is created by the Barabási-Albert model does not have an inherent modularity, so $C(k)$ is independent of $k$.
Scale-free networks with degree exponents $2 < \gamma < 3$, a range that is observed in most biological and non-biological networks, are ultra-small, with the average path length following $\ell \sim \log \log N$, which is significantly shorter than the $\log N$ that characterizes random small-world networks.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
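The growth rule of the Barabási-Albert model can be simulated with a simple list trick. This is a sketch under stated assumptions rather than a reference implementation: the seed network (a fully connected group of `m + 1` nodes), the rejection of duplicate targets, and the values of `n`, `m` and the random seed are choices made here, not taken from the lecture.

```python
import random
from collections import Counter

def barabasi_albert(n, m, seed=0):
    """Grow a network to n nodes; each new node attaches m links preferentially."""
    rng = random.Random(seed)
    # Assumed seed network: m + 1 fully connected nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # Every node appears in `stubs` once per link, so a uniform draw from
    # `stubs` selects node I with probability k_I / sum_J k_J.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # redraw duplicates so the m targets are distinct
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

n, m = 2000, 2  # arbitrary illustration values
edges = barabasi_albert(n, m)
degree = Counter(node for edge in edges for node in edge)
avg_k = 2 * len(edges) / n
max_k = max(degree.values())
print(avg_k, max_k)
```

Comparing `max_k` with `avg_k` gives a quick impression of the hubs: the most connected node ends up far above the average degree, unlike in a random network of the same size.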
23 Network measures
Scale-free networks and the degree exponent
Most biological networks are scale-free, which means that their degree distribution approximates a power law, $P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent and $\sim$ indicates 'proportional to'. The value of $\gamma$ determines many properties of the system. The smaller the value of $\gamma$, the more important the role of the hubs is in the network. Whereas for $\gamma > 3$ the hubs are not relevant, for $2 < \gamma < 3$ there is a hierarchy of hubs, with the most connected hub being in contact with a small fraction of all nodes, and for $\gamma = 2$ a hub-and-spoke network emerges, with the largest hub being in contact with a large fraction of all nodes.
In general, the unusual properties of scale-free networks are valid only for $\gamma < 3$, when the dispersion of the $P(k)$ distribution, which is defined as $\sigma^2 = \langle k^2 \rangle - \langle k \rangle^2$, increases with the number of nodes (that is, $\sigma$ diverges), resulting in a series of unexpected features, such as a high degree of robustness against accidental node failures. For $\gamma > 3$, however, most unusual features are absent, and in many respects the scale-free network behaves like a random one.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
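Why the dispersion grows with network size for $\gamma < 3$ can be seen from a continuum approximation. The sketch below assumes $P(k) = C k^{-\gamma}$ between a smallest degree $k_{\min}$ and a largest degree $k_{\max}$; the normalisation constant $C$ and both cut-offs are symbols introduced here for illustration.

```latex
\langle k^2 \rangle
  = \int_{k_{\min}}^{k_{\max}} k^2 \, C k^{-\gamma} \, dk
  = \frac{C}{3-\gamma}\left( k_{\max}^{\,3-\gamma} - k_{\min}^{\,3-\gamma} \right),
  \qquad \gamma \neq 3 .
```

For $\gamma < 3$ the exponent $3-\gamma$ is positive, so $\langle k^2 \rangle$ keeps growing as the largest degree $k_{\max}$ grows with the number of nodes. For $\gamma > 3$ the term in $k_{\max}$ vanishes for large networks and $\langle k^2 \rangle$ stays finite, which matches the statement that such networks behave much like random ones.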
24 Shortest path and mean path length
Distance in networks is measured with the path length, which tells us how many links we need to pass through to travel between two nodes. As there are many alternative paths between two nodes, the shortest path (the path with the smallest number of links between the selected nodes) has a special role.
In directed networks, the distance $\ell_{AB}$ from node A to node B is often different from the distance $\ell_{BA}$ from B to A. E.g. in b, $\ell_{BA} = 1$, whereas $\ell_{AB} = 3$. Often there is no direct path between two nodes. As shown in b, although there is a path from C to A, there is no path from A to C.
The mean path length, $\langle \ell \rangle$, represents the average over the shortest paths between all pairs of nodes and offers a measure of a network's overall navigability.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
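Shortest path lengths in an unweighted network can be found with a breadth-first search. A minimal sketch, assuming a hypothetical directed network that reproduces the distances quoted above (it is not the network from the figure); the function name `shortest_path_length` is illustrative.

```python
from collections import deque

def shortest_path_length(arcs, source, target):
    """Number of links on the shortest directed path, or None if unreachable."""
    neighbours = {}
    for u, v in arcs:
        neighbours.setdefault(u, []).append(v)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in neighbours.get(node, []):
            if nxt not in dist:  # first visit is along a shortest path
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

# Hypothetical directed network with l_BA = 1, l_AB = 3,
# a path from C to A and no path from A to C.
arcs = [("B", "A"), ("A", "D"), ("D", "E"), ("E", "B"), ("C", "A")]
l_ba = shortest_path_length(arcs, "B", "A")
l_ab = shortest_path_length(arcs, "A", "B")
l_ca = shortest_path_length(arcs, "C", "A")
l_ac = shortest_path_length(arcs, "A", "C")
print(l_ba, l_ab, l_ca, l_ac)
```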
25 Clustering coefficient
In many networks, if node A is connected to B, and B is connected to C, then it is highly probable that A also has a direct link to C. This phenomenon can be quantified using the clustering coefficient $C_I = 2 n_I / (k_I (k_I - 1))$, where $n_I$ is the number of links connecting the $k_I$ neighbours of node $I$ to each other. In other words, $n_I$ gives the number of 'triangles' that go through node $I$, whereas $k_I (k_I - 1)/2$ is the total number of triangles that could pass through node $I$, should all of node $I$'s neighbours be connected to each other.
For example, only one pair of node A's five neighbours in a are linked together (B and C), which gives $n_A = 1$ and $C_A = 2/20$. By contrast, none of node F's neighbours link to each other, giving $C_F = 0$.
The average clustering coefficient, $\langle C \rangle$, characterizes the overall tendency of nodes to form clusters or groups. An important measure of the network's structure is the function $C(k)$, which is defined as the average clustering coefficient of all nodes with $k$ links. For many real networks $C(k) \sim k^{-1}$, which is an indication of a network's hierarchical character.
The average degree $\langle k \rangle$, average path length $\langle \ell \rangle$ and average clustering coefficient $\langle C \rangle$ depend on the number of nodes and links ($N$ and $L$) in the network. By contrast, the $P(k)$ and $C(k)$ functions are independent of the network's size and they therefore capture a network's generic features, which allows them to be used to classify various networks.
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
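The clustering coefficient of a single node can be computed by counting the links among its neighbours. This sketch assumes a hypothetical network built to match the worked numbers ($n_A = 1$, $C_A = 2/20$, $C_F = 0$), not the network in the figure; returning 0 for nodes with fewer than two neighbours is a convention chosen here.

```python
from itertools import combinations

def clustering_coefficient(edges, node):
    """C_I = 2 n_I / (k_I (k_I - 1)) for one node of an undirected network."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    neighbours = adjacency.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no neighbour pair, no triangle
    # n_I: number of links between pairs of neighbours of the node
    n_links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2 * n_links / (k * (k - 1))

# Hypothetical network: A has neighbours B, C, D, E, G, of which only B and C
# are linked; F has neighbours D and E, which are not linked.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "G"),
         ("B", "C"), ("F", "D"), ("F", "E")]
c_a = clustering_coefficient(edges, "A")
c_f = clustering_coefficient(edges, "F")
print(c_a, c_f)
```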
26 Hierarchical networks
To account for the coexistence of modularity, local clustering and scale-free topology in many real systems, it has to be assumed that clusters combine in an iterative manner, generating a hierarchical network.
The starting point of this construction is a small cluster of 4 densely linked nodes (the 4 central nodes in Ca). Next, 3 replicas of this module are generated and the 3 external nodes of the replicated clusters are connected to the central node of the old cluster, which produces a large 16-node module. 3 replicas of this 16-node module are then generated and the 16 peripheral nodes are connected to the central node of the old module, which produces a new module of 64 nodes.
The hierarchical network model seamlessly integrates a scale-free topology with an inherent modular structure by generating a network that has a power-law degree distribution with degree exponent $\gamma = 1 + \ln 4 / \ln 3 \approx 2.26$ (Cb) and a large, system-size-independent average clustering coefficient $\langle C \rangle \approx 0.6$.
The most important signature of hierarchical modularity is the scaling of the clustering coefficient, which follows $C(k) \sim k^{-1}$, a straight line of slope -1 on a log-log plot (Cc). A hierarchical architecture implies that sparsely connected nodes are part of highly clustered areas, with communication between the different highly clustered neighbourhoods being maintained by a few hubs (Ca).
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)
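The numbers quoted for the hierarchical model can be checked by hand. In the sketch below the iteration index $n$ and the symbol $N_n$ for the module size are introduced here for illustration; each iteration keeps the old module and adds three replicas, so the size grows by a factor of four.

```latex
N_n = 4^{\,n}: \quad N_1 = 4,\; N_2 = 16,\; N_3 = 64,
\qquad
\gamma = 1 + \frac{\ln 4}{\ln 3} \approx 1 + \frac{1.386}{1.099} \approx 2.26 .
```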
29 Summary
Many cellular networks show properties of scale-free networks:
- protein-protein interaction networks
- metabolic networks
- genetic regulatory networks (where nodes are individual genes and links are derived from expression correlation, e.g. from microarray data)
- protein domain networks
However, not all cellular networks are scale-free. E.g. the transcription regulatory networks of S. cerevisiae and E. coli show mixed scale-free and exponential characteristics.
Next lecture:
- mathematical properties of networks
- origin of scale-free topology
- topological robustness
Barabási & Oltvai, Nature Reviews Genetics 5, 101 (2004)