Title: CS/CBB 545 - Data Mining Function Prediction
1 CS/CBB 545 - Data Mining: Function Prediction
Networks as an application of Mining
- Mark Gerstein, Yale University
- gersteinlab.org/courses/545
- (class 2007.02.20, 14:30-15:45)
2 Specific Applications: Function Prediction
3 The Problem: Grappling with Function on a Genome Scale?
- 250 of 530 originally characterized on chr. 22 (Dunham et al.)
- >25K proteins in the entire human genome (with alt. splicing)
4 Traditional single-molecule way to integrate evidence & describe function
EF2_YEAST
Descriptive Name: Elongation Factor 2
Lots of references to papers
Summary sentence describing function: This protein promotes the GTP-dependent translocation of the nascent protein chain from the A-site to the P-site of the ribosome.
5 Functional Classification
ENZYME (SwissProt, Bairoch/Apweiler; just enzymes, cross-org.)
COGs (cross-org., just conserved; NCBI, Koonin/Lipman)
GenProtEC (E. coli, Riley)
Also: other SwissProt annotation, WIT, KEGG (just pathways), TIGR EGAD (human ESTs), SGD (yeast)
Fly (fly, Ashburner), now extended to GO (cross-org.)
MIPS/PEDANT (yeast, Mewes)
6 Hierarchy of Protein Functions
7 Some obvious issues in scaling single-molecule definition to a genomic scale
- Fundamental complexities
- Often >2 proteins/function
- Multi-functionality: 2 functions/protein
- Role conflation: molecular, cellular, phenotypic
8 Some obvious issues in scaling single-molecule definition to a genomic scale
- Fun terms, but do they scale?
- Starry night (P. Adler, '94)
- Lush, cheapdate (former wants alcohol, latter makes susceptible)
- Vulcan, Klingon
- Sonic, Kryptonite
9 Toward Systematic Ontologies for Function, using Networks
Hierarchies & DAGs: Enzyme (Bairoch); GO (Ashburner); MIPS (Mewes, Frishman)
General Networks: Eisenberg et al.
Interaction Vectors: Lan et al., IEEE 90:1848
10 Gene Expression Information and Protein Features
11 Typical Predictors and Response for Yeast
12 Prediction of Function on a Genomic Scale from Array Data & Sequence Features
Core
Different aspects of function: molecular action, cellular role, phenotypic manifestation
Also: localization, interactions, complexes
13 Specific Applications: Networks -- What are the types of Biological Networks?
14
- Graph: a pair of sets G = (P, E), where P is a set of nodes and E is a set of edges, each connecting 2 elements of P
- Directed, undirected graphs
- Large, complex networks are ubiquitous in the world:
- Genetic networks
- Nervous system
- Social interactions
- World Wide Web
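The graph definition above can be sketched in a few lines of Python. This is only an illustration, assuming string node labels; the helper name `build_adjacency` and the toy node set are hypothetical and not part of the lecture.

```python
def build_adjacency(nodes, edges, directed=False):
    """Return {node: set of neighbours} for a graph G = (P, E)."""
    adj = {p: set() for p in nodes}
    for a, b in edges:
        # Every edge must connect two elements of P.
        if a not in adj or b not in adj:
            raise ValueError(f"edge ({a}, {b}) uses a node outside P")
        adj[a].add(b)
        if not directed:
            adj[b].add(a)  # undirected: store the edge in both directions
    return adj

# Hypothetical toy graph: 4 nodes, 4 edges.
P = {"A", "B", "C", "D"}
E = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

undirected = build_adjacency(P, E)
directed = build_adjacency(P, E, directed=True)
print(sorted(undirected["C"]), sorted(directed["C"]))
```

The same edge list gives different neighbour sets depending on whether direction is kept, which is the directed/undirected distinction on the slide.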
15 Biological Networks
- Protein-protein interaction networks
- Regulatory networks
- Expression networks
- Metabolic networks
- more biological networks
- Other types of networks
16 Expression networks
Qian et al., J. Mol. Bio., 314:1053-1066
17 Format of Gene Expression Data
MCM3 MCM6 CDC47 MCM2 CDC46 CDC54
DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4
ORC2 ORC6 ORC5 ORC4 ORC3 ORC1
18 Clustering the yeast cell cycle to uncover interacting proteins
Brown, Davis
Extra
Microarray timecourse of 1 ribosomal protein
19 Clustering the yeast cell cycle to uncover interacting proteins
Extra
Random relationship from 18M
20 Clustering the yeast cell cycle to uncover interacting proteins
Botstein, Church, Vidal
Extra
Close relationship from 18M (2 Interacting Ribosomal Proteins)
21 Clustering the yeast cell cycle to uncover interacting proteins
Extra
Predict Functional Interaction of Unknown Member of Cluster
22 Global Network of Relationships
Core
470K significant relationships from 18M possible
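As a rough sketch of how an expression network can be built, the snippet below links two genes when their expression profiles are strongly correlated. It uses plain Pearson correlation with an arbitrary cutoff, which is an assumption made for illustration and not necessarily the scoring used in the cited work; the gene names, profiles, and threshold are hypothetical. The last lines check that about 6,000 genes (an assumed round figure for yeast) give roughly 18M possible pairs.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sx * sy)

def expression_network(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with r >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical time courses (5 time points each).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 2.0, 1.0],
    "geneB": [2.1, 4.0, 6.2, 3.9, 2.0],  # rises and falls with geneA
    "geneC": [3.0, 1.0, 2.5, 0.5, 3.2],  # different pattern
}
edges = expression_network(profiles)
print(edges)

# Assumed gene count: about 6,000 genes gives about 18M gene pairs.
n_genes = 6000
possible_pairs = n_genes * (n_genes - 1) // 2
print(possible_pairs)
```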
23 Horak et al., Genes & Development, 16:3017-3033
24 Protein Interaction Network
Jeong et al.
25 Yeast two-hybrid
26 Affinity Purification and Mass Spec.
From ocw.mit.edu//791_ak_lecture7.pdf
27 DeRisi, Iyer, and Brown, Science, 278:680-686
28 Interaction networks
Metabolic networks
29 ... more biological networks
30 Networks as a universal language
Internet (Burch & Cheswick)
Electronic Circuit
Food Web
Disease Spread (Krebs)
Neural Network (Cajal)
Protein Interactions (Barabasi)
Social Network
31 Networks occupy a midway point in terms of level of understanding
1D: Complete Genetic Partslist
2D: Bio-molecular Network Wiring Diagram
3D: Detailed structural understanding of cellular machinery
Jeong et al.
32 Richness of the Visual Representation of Networks
- Some structure (connectivity) but some flexibility (e.g. edge colors, node positions and shapes) that can be used to encode additional information
33 VisualComplexity.com
34 Networks: What are the Main Quantities that Can be Calculated from Network Topology?
35
- Degree of a node: the number of edges incident on the node
- Example: degree of node i = 5
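A minimal sketch of the degree calculation, assuming an undirected graph given as an edge list. The node labels are hypothetical; node "i" is given five incident edges to mirror the example value on the slide.

```python
def degrees(edges):
    """Count the edges incident on each node of an undirected graph."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

# Hypothetical edge list: node "i" touches five edges.
edges = [("i", "a"), ("i", "b"), ("i", "c"), ("i", "d"), ("i", "e"), ("a", "b")]
deg = degrees(edges)
print(deg["i"])
```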
36 Network parameters: Connectivity
- Number of incoming and outgoing connections
37 Network parameters: Clustering coefficient
- Ratio of existing links to maximum number of links for neighbouring nodes
- Measure of inter-connectedness of the network
- Example: 1/6 = 0.17
- Average coefficient: 0.04
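The ratio described above can be computed directly. The sketch below assumes an undirected graph stored as a dict of neighbour sets; the toy graph is hypothetical and chosen so that one node has four neighbours with exactly one link among them, giving 1 existing link out of 6 possible, as in the 1/6 example.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Existing links among a node's neighbours / maximum possible links."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no possible link between them
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)

# Hypothetical graph: "x" has 4 neighbours, only a-b are linked to each other.
adj = {
    "x": {"a", "b", "c", "d"},
    "a": {"x", "b"},
    "b": {"x", "a"},
    "c": {"x"},
    "d": {"x"},
}
c_x = clustering_coefficient(adj, "x")
print(round(c_x, 2), round(average_clustering(adj), 2))
```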
38 Path length
- Number of intermediate TFs to reach final target
- Indication of how immediate a response is
- Average path length: 4
39 Characteristic path length: a GLOBAL property
- L(i, j) is the number of edges in the shortest path between vertices i and j
- The characteristic path length L is the average of L(i, j) over all pairs of vertices
- Networks with small values of L are said to have the small world property
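A sketch of the calculation, assuming the usual definition in which the characteristic path length is the shortest-path length between two vertices averaged over all connected pairs. Breadth-first search gives shortest paths in an unweighted graph; the 4-node chain is a hypothetical example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def characteristic_path_length(adj):
    """Average shortest-path length over all reachable ordered pairs."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Hypothetical chain 1 - 2 - 3 - 4.
chain = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
L = characteristic_path_length(chain)
print(round(L, 3))
```

Unreachable pairs are simply skipped here, which is one of several possible conventions for disconnected networks.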
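40 Network motifs
- Regulatory modules within the network
- SIM: single input motif
- MIM: multi-input motif
- FFL: feed-forward loop
- FBL: feed-back loop
Alon
41 FFL: Feed-forward loops
[Figure nodes: SBF, Yox1, Pog1, Tos8, Plm2]
Alon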
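A feed-forward loop can be searched for mechanically: regulator X controls Y, Y controls Z, and X also controls Z directly. The sketch below assumes a directed regulatory network stored as a dict from each regulator to its targets. The names X, Y, Z, W are hypothetical placeholders, not the wiring of the factors shown in the figure.

```python
def feed_forward_loops(targets):
    """Find (x, y, z) with edges x->y, y->z and x->z in a directed network.

    targets: {regulator: set of genes it regulates}
    """
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue  # skip auto-regulation
            for z in targets.get(y, set()):
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)

# Hypothetical network: X -> Y -> Z with the shortcut X -> Z, plus W -> X.
regulation = {
    "X": {"Y", "Z"},
    "Y": {"Z"},
    "W": {"X"},
}
ffls = feed_forward_loops(regulation)
print(ffls)
```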
42 Cliques
- Fully connected sub-components
- Related measures: k-cores (Hogue)
43 Predicting protein interactions by completing defective cliques
- High-throughput experiments are prone to missing interactions
- If proteins P and Q interact with a clique K of proteins which all interact with each other, then P and Q are more likely to interact with each other
- P, Q, and K form a defective clique
Yu et al. Bioinformatics (2006)
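The defective-clique idea can be illustrated with a brute-force search suited only to small graphs. This is a simplified sketch, not the algorithm of the cited paper: it looks for non-interacting pairs whose shared partners contain a clique of at least `min_clique_size` proteins, where `min_clique_size` and the protein labels are assumptions made for the example.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of the given nodes interacts."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def defective_clique_predictions(adj, min_clique_size=3):
    """Predict P-Q interactions where P and Q share a clique of partners."""
    predictions = []
    for p, q in combinations(sorted(adj), 2):
        if q in adj[p]:
            continue  # interaction already observed
        shared = sorted(adj[p] & adj[q])
        for k in combinations(shared, min_clique_size):
            if is_clique(adj, k):
                predictions.append((p, q, k))
                break  # one supporting clique is enough for this sketch
    return predictions

# Hypothetical data: K1, K2, K3 form a clique; P and Q each bind all three.
edges = [("K1", "K2"), ("K1", "K3"), ("K2", "K3"),
         ("P", "K1"), ("P", "K2"), ("P", "K3"),
         ("Q", "K1"), ("Q", "K2"), ("Q", "K3"),
         ("R", "K1")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

preds = defective_clique_predictions(adj)
print(preds)
```

Only the P-Q pair is predicted: R shares a single partner with the others, so no clique of size 3 supports it.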
44 Networks: Simple Mathematical Models for Interpreting Complex Topology
45 Models for networks of complex topology
- Erdős-Rényi (1960)
- Watts-Strogatz (1998)
- Barabási-Albert (1999)
A. Barabási & R. Albert, "Emergence of scaling in random networks," Science 286, 509-512 (1999).
46 The Erdős-Rényi (ER) model (1960)
- Start with N vertices and no edges
- Connect each pair of vertices with probability p_ER
- Important result: many properties in these graphs appear quite suddenly, at a threshold value of p_ER(N)
- If p_ER ~ c/N with c < 1, then almost all vertices belong to isolated trees
- Cycles of all orders appear at p_ER ~ 1/N
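The construction above is short enough to simulate. The sketch below generates ER graphs with connection probability c/N and prints the size of the largest connected component for c below, at, and above 1, where a giant component is expected to emerge once c exceeds 1. The function names, N = 500, and the seed are arbitrary choices for illustration.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Start with n vertices and no edges; connect each pair with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def largest_component_size(adj):
    """Size of the largest connected component (depth-first search)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

n = 500
for c in (0.5, 1.0, 2.0):
    g = erdos_renyi(n, c / n, seed=0)
    print(c, largest_component_size(g))
```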
47 The Watts-Strogatz (WS) model (1998)
- Start with a regular network with N vertices
- Rewire each edge with probability p
- For p = 0 (Regular Networks):
- high clustering coefficient
- high characteristic path length
- For p = 1 (Random Networks):
- low clustering coefficient
- low characteristic path length
QUESTION: What happens for intermediate values of p?
48
1) There is a broad interval of p for which L is small but C remains large
2) Small world networks are common
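A small simulation makes the behaviour of C and L across p easier to explore. This is a sketch under stated assumptions: the regular network is taken to be a ring in which each vertex is joined to its k nearest neighbours, rewiring moves one end of an edge to a uniformly chosen vertex that is not already a neighbour, and path lengths are averaged over reachable pairs only. Function names and parameter values are illustrative.

```python
import random
from collections import deque
from itertools import combinations

def watts_strogatz(n, k, p, seed=None):
    """Ring lattice (k nearest neighbours, k even, k < n), rewired with prob. p."""
    if k % 2 or not 0 < k < n:
        raise ValueError("k must be even and satisfy 0 < k < n")
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):                      # build the regular ring
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    for v in range(n):                      # visit each lattice edge once
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            if rng.random() >= p:
                continue
            free = [u for u in range(n) if u != v and u not in adj[v]]
            if free:
                new = rng.choice(free)
                adj[v].remove(w)
                adj[w].remove(v)
                adj[v].add(new)
                adj[new].add(v)
    return adj

def average_clustering(adj):
    total = 0.0
    for node, nbrs in adj.items():
        kk = len(nbrs)
        if kk >= 2:
            links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
            total += links / (kk * (kk - 1) / 2)
    return total / len(adj)

def characteristic_path_length(adj):
    total, pairs = 0, 0
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

n, k = 20, 4
for p in (0.0, 0.1, 1.0):
    g = watts_strogatz(n, k, p, seed=1)
    print(p, round(average_clustering(g), 3), round(characteristic_path_length(g), 3))
```

For p = 0 the ring with k = 4 has clustering coefficient 0.5; larger N makes the gap between regular and rewired path lengths more visible.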
49 The Barabási-Albert (BA) model (1999)
Look at the distribution of degrees
[Figure panels: ER model, WS model, WWW, actors, power grid]
In the ER and WS models, the probability of finding a highly connected node decreases exponentially with the degree k
50
- Two problems with the previous models:
- 1. N does not vary
- 2. the probability that two vertices are connected is uniform
- GROWTH: starting with a small number of vertices m0, at every timestep add a new vertex with m (≤ m0) edges
- PREFERENTIAL ATTACHMENT: the probability Π that a new vertex will be connected to vertex i depends on the connectivity of that vertex
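The two ingredients, growth and preferential attachment, can be sketched as follows. In the cited Barabási-Albert paper the attachment probability for vertex i with degree k_i is k_i divided by the sum of all degrees. The code makes two assumptions of its own: the seed network is a complete graph on m + 1 vertices (so every seed vertex has non-zero degree), and distinct targets are obtained by redrawing duplicates, which makes the probabilities only approximately proportional to degree after the first draw.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a network to n vertices; each new vertex brings m edges."""
    if m < 1 or n < m + 1:
        raise ValueError("need m >= 1 and n >= m + 1")
    rng = random.Random(seed)
    m0 = m + 1  # assumption: seed network is a complete graph on m + 1 vertices
    adj = {v: {u for u in range(m0) if u != v} for v in range(m0)}
    # Each vertex appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks vertex i with probability k_i / sum_j k_j.
    stubs = [v for v in adj for _ in adj[v]]
    for new in range(m0, n):                 # GROWTH: one vertex per timestep
        targets = set()
        while len(targets) < m:              # PREFERENTIAL ATTACHMENT
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend((new, t))
    return adj

g = barabasi_albert(200, 2, seed=0)
max_degree = max(len(nbrs) for nbrs in g.values())
print(len(g), max_degree)
```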
51 Birth of Scale-Free Network
From Barabasi & Bonabeau, Sci. Am., May '03
52 SCALE FREENESS GENERALLY EVOLVES THROUGH PREFERENTIAL ATTACHMENT (THE RICH GET RICHER)
ILLUSTRATIVE
The Duplication Mutation Model
Description
- Theoretical work shows that a mechanism of preferential attachment leads to a scale-free topology (the rich get richer)
- In an interaction network, gene duplication followed by mutation of the duplicated gene is generally thought to lead to preferential attachment
- Simple reasoning: the partners of a hub are more likely to be duplicated than the partners of a non-hub
[Figure: gene duplication; the interaction partners of A are more likely to be duplicated]
Source: Albert et al., Rev. Mod. Phys. (2002) and Middendorf et al., PNAS (2005)
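The "simple reasoning" can be made concrete: if one gene is picked uniformly at random for duplication and the copy inherits its partners, a protein gains a new link whenever one of its partners is the gene picked, which happens with probability (its degree) / N. The sketch below shows only the duplication step under that assumption; the mutation step (loss of some inherited links) is left out, and all names are hypothetical.

```python
def gain_probability(adj):
    """P(node gains a link) when one gene is duplicated uniformly at random."""
    n = len(adj)
    return {v: len(nbrs) / n for v, nbrs in adj.items()}

def duplicate(adj, gene, copy_name):
    """Add a copy of `gene` that inherits all of its interaction partners."""
    adj[copy_name] = set(adj[gene])
    for partner in adj[copy_name]:
        adj[partner].add(copy_name)

# Hypothetical network: H is a hub with three partners; d is peripheral.
adj = {
    "H": {"a", "b", "c"},
    "a": {"H"},
    "b": {"H"},
    "c": {"H", "d"},
    "d": {"c"},
}
probs = gain_probability(adj)
print(probs["H"], probs["d"])   # the hub is the most likely to gain a link

duplicate(adj, "a", "a2")       # duplicating a partner of H raises H's degree
print(len(adj["H"]))
```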
53 Random vs. Scale-free Networks
From Barabasi & Bonabeau, Sci. Am., May '03
54 Scale-free networks in Biology
Power-law distribution
[Plot: log(Frequency) vs. log(Degree)]
Hubs dictate the structure of the network
Barabasi
55 Power-law distribution of connectivities
- Many TFs have few target genes
- Few TFs have many target genes
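A power law, frequency proportional to degree raised to a negative exponent, appears as a straight line on log-log axes, and the slope of that line estimates the exponent. The sketch below fits the slope by ordinary least squares, which is a crude estimator used here only for illustration; the degree counts are synthetic and were constructed to follow frequency = 1000 / k^2.

```python
from collections import Counter
from math import log

def degree_distribution(degrees):
    """Return sorted (degree, frequency) pairs."""
    return sorted(Counter(degrees).items())

def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(degree)."""
    xs = [log(k) for k, f in points if k > 0 and f > 0]
    ys = [log(f) for k, f in points if k > 0 and f > 0]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic degree sequence: many low-degree nodes, few high-degree nodes.
degs = []
for k, f in [(1, 1000), (2, 250), (5, 40), (10, 10)]:
    degs.extend([k] * f)

slope = loglog_slope(degree_distribution(degs))
print(round(slope, 3))
```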
56 Knocking Out Nodes in Scale-free and Random Networks
From Barabasi & Bonabeau, Sci. Am., May '03
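The effect of a knockout can be measured by deleting a node and checking how large the biggest connected piece of the network remains. The sketch below uses a hypothetical hub-and-spoke toy graph, the extreme case of a hub-dominated topology, to contrast removing the hub with removing a peripheral node.

```python
def remove_node(adj, node):
    """Return a copy of the network with `node` and its edges deleted."""
    return {u: nbrs - {node} for u, nbrs in adj.items() if u != node}

def largest_component_size(adj):
    """Size of the largest connected component (depth-first search)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

# Hypothetical hub-and-spoke network: hub "H" with five spokes.
adj = {"H": {"a", "b", "c", "d", "e"},
       "a": {"H"}, "b": {"H"}, "c": {"H"}, "d": {"H"}, "e": {"H"}}

after_hub = largest_component_size(remove_node(adj, "H"))
after_spoke = largest_component_size(remove_node(adj, "a"))
print(after_hub, after_spoke)
```

Removing the hub leaves only isolated nodes, while removing a spoke leaves the rest of the network connected.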
57 Hubs tend to be Essential
Integrate gene essentiality data with protein interaction network. Perhaps hubs represent vulnerable points? (Lauffenburger, Barabasi)
[Plot: "hubbiness" of essential vs. non-essential genes]
Yu et al., 2003, TIG
58 Relationship extends to "Marginal Essentiality"
- Marginal essentiality measures relative importance of each gene (e.g. in growth-rate and condition-specific essentiality experiments) and scales continuously with "hubbiness"
[Plot: "hubbiness" by category: not important, important, very important, essential]
59 Bottlenecks & Hubs
Yu et al., PLoS CB (2007)
60 Bottleneck: bridging between processes
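One common way to quantify a bottleneck is betweenness: the number of shortest paths between other node pairs that pass through a node. The sketch below assumes that definition for an undirected, unweighted network and uses a straightforward all-pairs calculation suitable only for small graphs. The toy network (two triangles joined through a single node "B") is hypothetical; "B" bridges the two groups although its degree is low.

```python
from collections import deque

def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths from s to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    return dist, sigma

def betweenness(adj):
    """Fraction of shortest s-t paths through each node, summed over pairs."""
    info = {s: bfs_counts(adj, s) for s in adj}
    nodes = list(adj)
    score = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

# Hypothetical network: two triangles bridged by node "B".
adj = {
    "a1": {"a2", "a3"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2", "B"},
    "B": {"a3", "b1"},
    "b1": {"B", "b2", "b3"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}
score = betweenness(adj)
print(score["B"], len(adj["B"]))
```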