Title: YoungRae Cho
1 Seminar 2009
Identification of Functional Modules and Hub
Proteins in Protein Interaction Networks
- Young-Rae Cho
- Department of Computer Science and Engineering
- State University of New York at Buffalo
2 What is Bioinformatics?
- Bioinformatics
  - Interdisciplinary research area to manage and analyze biological data
- [Diagram: Techniques, Data, Applications]
3 What is Bioinformatics?
- Computational Techniques: Data Mining, Machine Learning
- Biological Data: Genome, Proteome, Networks
- Biomedical Applications: Functional Characterization, Disease Diagnosis, Drug Development
- Knowledge
4 Overview
- Introduction
  - Protein Interaction Networks and Their Structural Properties
- Preprocess: Network Weighting
  - Integration of Gene Ontology using Semantic Similarity Measures
- Functional Module Identification
  - Weighted Interaction Networks → Functional Modules
- Hub Protein Identification
  - Weighted Interaction Networks → Hub Proteins
- Conclusion
5 Biological Network
- Definition
  - Directed or undirected graph representation
  - Biological molecules as nodes; biochemical reactions or biophysical interactions as edges
- Examples
  - Metabolic networks
  - Signal transduction networks
  - Gene regulatory networks
  - Protein interaction networks
- Importance
  - Provide a global view of cellular organization and biological processes
  - Applicable to systematic approaches for knowledge discovery
6 Protein-Protein Interaction (PPI)
- Biological Meaning of PPI
  - Proteins interact with each other for stability and functionality
  - Most cellular functions are performed at the protein complex level
  - Interaction evidence is interpreted as functional coherence / consistency
- Determination of PPIs
  - Experimental methods
    - Yeast two-hybrid systems, mass spectrometry, protein microarrays
  - Computational methods
    - Homology search, gene fusion analysis, phylogenetic profiles
- Problem of PPI Data
  - Current PPI databases include a large number of false positives and false negatives
  - → Unreliability
7 Protein Interaction Network
- Representation of Protein Interaction Networks
  - Undirected, unweighted graph G(V, E)
  - A set of nodes V as proteins and a set of edges E as interactions
- Problems of Protein Interaction Networks
  - Large scale
  - Complex connectivity
8 Structural Properties
- Small-world Phenomenon (Watts & Strogatz)
  - Networks lying between regular and random networks
  - Higher average clustering coefficient than expected by random chance
  - Significantly small average shortest path length
- Scale-free Distribution (Barabási & Albert)
  - Network growth by preferential attachment
  - Power-law degree distribution: a few high-degree nodes, many low-degree nodes
  - Clustering coefficient distribution independent of degree
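The two small-world quantities named above can be computed directly from an adjacency structure. The following is a minimal sketch in plain Python, assuming the network is stored as a dictionary mapping each protein to the set of its neighbors; the function names and the toy network are illustrative and do not come from the talk.

```python
from collections import deque

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

def average_shortest_path_length(adj):
    """Mean BFS distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Toy network (hypothetical): a triangle A-B-C with a pendant node D on C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
avg_cc = average_clustering(toy)
avg_spl = average_shortest_path_length(toy)
```

Comparing `avg_cc` and `avg_spl` against the same quantities on degree-matched random graphs is the usual way to check for the small-world property.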
9 Overview
- Introduction
  - Protein Interaction Networks and Their Structural Properties
- Preprocess: Network Weighting
  - Integration of Gene Ontology using Semantic Similarity Measures
- Functional Module Identification
  - Weighted Interaction Networks → Functional Modules
- Hub Protein Identification
  - Weighted Interaction Networks → Hub Proteins
- Conclusion
10 Network Weighting Schemes
- Motivation
  - Unreliable protein interaction networks
  - Transforming an unweighted graph into a weighted graph by assigning the interaction reliability (or intensity) to each edge as a weight
- Unsupervised Approaches
  - Using network connectivity, e.g., common neighbors, alternative paths
  - Problem: unreliable weights
- Supervised Approaches
  - Using other resources to verify interactions, e.g., gene sequence, gene expression
  - Integrating Gene Ontology data in my work
    - The most comprehensive
    - Well-curated
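As an illustration of the unsupervised, connectivity-based idea, the sketch below weights each edge by the overlap of its endpoints' neighborhoods. The Jaccard index is an assumption chosen for the example; it is one common common-neighbor score and is not necessarily the scheme used in the talk. The network layout and function names are hypothetical.

```python
def jaccard_weight(adj, u, v):
    """Neighborhood overlap of u and v; each endpoint counts as its own neighbor."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / len(nu | nv)

def weight_network(adj):
    """Return {(u, v): weight} for every undirected edge, with u < v."""
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                weights[(u, v)] = jaccard_weight(adj, u, v)
    return weights

# Toy network (hypothetical): a triangle A-B-C with a pendant node D on C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
edge_weights = weight_network(toy)
```

Edges inside the triangle receive higher weights than the pendant edge C-D, which shares no further neighbors.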
11 Gene Ontology (GO)
- Structure
  - Terms (Concepts): well-defined biological descriptions
  - Relationships: is-a / part-of (general-to-specific) between terms
- Annotation
  - If a protein is annotated to a term, then it is also annotated to all terms on the paths toward the root
  - → Transitivity
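[Figure: example GO structure with the protein sets annotated to each term: {P5}, {P1}, {P1, P2, P3}, {P1, P2, P4}, {P2, P3}, {P1, P6}, {P1, P2, P3, P6}, {P2, P3}]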
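The transitivity of annotations can be sketched as a walk from each directly annotated term up to the root. The code below is a minimal illustration, assuming the ontology is given as a dictionary from each term to its list of parent terms; the four-term structure and the protein names are hypothetical and are not the example from the figure.

```python
def propagate_annotations(parents, direct):
    """Copy each term's direct annotations to all of its ancestor terms."""
    full = {term: set() for term in parents}
    for term, proteins in direct.items():
        stack, seen = [term], set()
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            full[t] |= proteins        # annotate the term itself and every ancestor
            stack.extend(parents[t])   # a term may have more than one parent
    return full

# Hypothetical structure: "c" is a child of both "a" and "b", which are children of "root".
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
direct = {"a": {"P1"}, "b": {"P2"}, "c": {"P3"}}
full = propagate_annotations(parents, direct)
```

After propagation the root carries every annotated protein, and each term carries the annotations of all its descendants.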
12 Semantic Similarity
- Reliability of Interacting Proteins
  - Average (or maximum) semantic similarity of pair-wise terms whose annotations include the interacting proteins
- Structure-based Approaches
  - Path length or common parent terms
  - Problem: assumes that all edges represent uniform specificity
- Information Content-based Approaches
  - Information content of a term T is defined as $-\log P(T)$
  - $\mathrm{sim}(x, y) = -\log P_i(x, y)$
    - where $P_i(x, y)$ is the proportion of annotations on the term that includes both x and y
  - Normalized $\mathrm{sim}(x, y)$
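A minimal sketch of the information-content-based similarity follows. Two points are assumptions made for the example and are not stated on the slide: the term used for a protein pair is the smallest term annotated with both proteins, and normalization divides by the logarithm of the total number of annotated proteins. The annotation table is hypothetical.

```python
import math

def semantic_similarity(annotations, x, y, normalize=True):
    """-log of the proportion of proteins on the smallest term containing x and y."""
    total = len(set().union(*annotations.values()))
    shared = [len(p) for p in annotations.values() if x in p and y in p]
    if not shared:
        return 0.0
    proportion = min(shared) / total          # assumed: most specific shared term
    sim = -math.log(proportion)
    if normalize:
        max_sim = math.log(total)             # assumed normalization constant
        sim = sim / max_sim if max_sim > 0 else 0.0
    return sim

# Hypothetical annotation table (already propagated toward the root).
annotations = {
    "root": {"P1", "P2", "P3", "P4"},
    "a": {"P1", "P2"},
    "b": {"P1", "P3", "P4"},
}
sim_12 = semantic_similarity(annotations, "P1", "P2")
sim_23 = semantic_similarity(annotations, "P2", "P3")
```

Proteins that share only the root term get a similarity of zero, while proteins that share a small, specific term score higher.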
13 Overview
- Introduction
  - Protein Interaction Networks and Their Structural Properties
- Preprocess: Network Weighting
  - Integration of Gene Ontology using Semantic Similarity Measures
- Functional Module Identification
  - Weighted Interaction Networks → Functional Modules
- Hub Protein Identification
  - Weighted Interaction Networks → Hub Proteins
- Conclusion
14 Functional Module Identification
- Functional Module
  - A set of molecules that participate in the same biological processes or functions
  - Sub-network with dense intra-connections and sparse inter-connections
- Functional Module Identification
  - → Graph clustering problem
- Previous Clustering Approaches
  - Density-based methods, e.g., maximum clique, quasi-clique, clique percolation
  - Partition-based methods, e.g., restricted neighborhood search, Markov clustering
  - Hierarchical methods
    - Bottom-up approaches, e.g., distance-based, common neighbors
    - Top-down approaches, e.g., minimum cut, betweenness cut
15 Functional Influence Model
- Functional Influence
  - Influence factors: normalized weights, inverse of degree
- Measurements
  - Single-path-based method
    - $O(|V||E|)$
  - All-path-based method
    - NP
  - Random-walk-based method
    - $O(|V|^3)$ per iteration, $O(|V|^4)$ in total
    - Improvement by an efficient algorithm
16 Flow Simulation
- Information Flow Simulation
  - Computation of functional influence $\mathrm{inf}_s(x)$ of s on $x \in V$ based on random walks
  - Input: a weighted interaction network and a source node s
  - Output: functional influence pattern of s
- Algorithm
  1. Initialize $\mathrm{inf}_s(s)$
  2. Compute initial flow $f_{init}(s \to y)$
  3. Update $\mathrm{inf}_s(y)$
  4. Compute flow $f_s(y \to z)$
  5. Repeat 3 and 4 until $f_s(y \to z)$ is less than a threshold
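The five steps can be sketched as a frontier-based propagation. The update formulas did not survive in this transcript, so every rule in the code below is an assumption made for illustration: the initial flow is the source influence times the edge weight, a node's influence is the sum of the flow it receives, and outgoing flow is the incoming flow times the edge weight divided by the node's degree. The `max_steps` guard and all names are hypothetical additions.

```python
def simulate_flow(weights, source, threshold=0.05, initial=1.0, max_steps=100):
    """Spread influence from `source`; weights is {node: {neighbor: weight}}."""
    influence = {source: initial}                      # step 1
    frontier = {}
    for y, w in weights[source].items():               # step 2 (assumed rule)
        flow = initial * w
        if flow >= threshold:
            frontier[y] = frontier.get(y, 0.0) + flow
    for _ in range(max_steps):                         # step 5, with a safety cap
        if not frontier:
            break
        next_frontier = {}
        for y, incoming in frontier.items():
            # Step 3 (assumed rule): influence accumulates the incoming flow.
            influence[y] = influence.get(y, 0.0) + incoming
            degree = len(weights[y])
            for z, w in weights[y].items():
                # Step 4 (assumed rule): attenuate by weight and inverse degree.
                flow = incoming * w / degree
                if flow >= threshold:                  # drop trivial flow early
                    next_frontier[z] = next_frontier.get(z, 0.0) + flow
        frontier = next_frontier
    return influence

# Hypothetical three-node path s - a - b with edge weights 0.8 and 0.5.
toy_weights = {
    "s": {"a": 0.8},
    "a": {"s": 0.8, "b": 0.5},
    "b": {"a": 0.5},
}
pattern = simulate_flow(toy_weights, "s")
```

The returned dictionary plays the role of the functional influence pattern of the source: nodes closer to the source, or connected through heavier edges, accumulate more influence.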
17 Lower-level Algorithm
18 Schematic View
[Figure: schematic view of an example network annotated with numeric values, followed by a "Pattern Clustering" step]
19 Time Complexity
- Efficiency
  - Traces only connected nodes to calculate the functional influence of a source
  - Removes trivial flow, i.e., flow below the threshold, as early as possible
- Run Time
  - Theoretical upper bound is unknown (does not depend on the network diameter)
  - Test potential factors (number of nodes, density, average degree) with synthetic networks
20 Accuracy
- Experiment
  - Data: yeast protein interaction network from DIP
  - Pattern clustering: pCluster algorithm (Wang et al., SIGMOD 2002)
- Evaluation
  - Functional categories and annotations from MIPS
  - Hypergeometric p-value
- Result
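The hypergeometric p-value used for evaluation can be sketched as follows. It gives the probability of seeing at least k proteins from a functional category in a module of size n by chance. The parameter names are illustrative, and the numbers in the usage line are made up for the example; they are not results from the talk.

```python
from math import comb

def hypergeometric_p_value(N, M, n, k):
    """P(X >= k) when drawing n proteins from N, of which M are in the category."""
    upper = min(n, M)
    favorable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# Hypothetical example: 10 proteins, 4 in the category, module of size 3, 2 in common.
p = hypergeometric_p_value(N=10, M=4, n=3, k=2)
```

A small p-value indicates that the module is enriched for the category beyond what random grouping would produce.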
21 Overview
- Introduction
  - Protein Interaction Networks and Their Structural Properties
- Preprocess: Network Weighting
  - Integration of Gene Ontology using Semantic Similarity Measures
- Functional Module Identification
  - Weighted Interaction Networks → Functional Modules
- Hub Protein Identification
  - Weighted Interaction Networks → Hub Proteins
- Conclusion
22 Hub Protein Identification
- Hub Protein
  - Centrally located node in the modular structure of a protein interaction network (a structural hub)
  - Functionally essential protein
- Previous Centrality Measurements
  - Closeness centrality
  - Betweenness centrality
  - Bridging centrality
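As a reference point for the centrality measures listed above, closeness centrality on an unweighted network can be sketched with a breadth-first search. This is the standard unweighted definition (reachable nodes divided by the sum of distances), not the weighted variant used later in the talk; the toy network and names are hypothetical.

```python
from collections import deque

def closeness_centrality(adj, v):
    """Number of reachable nodes divided by the sum of BFS distances from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# Toy network (hypothetical): a triangle A-B-C with a pendant node D on C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
closeness = {v: closeness_centrality(toy, v) for v in toy}
```

Node C, which is adjacent to every other node, gets the highest closeness in this example.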
23 Functional Influence Model
- Functional Influence
  - Influence factors: normalized weights, inverse of degree
- Measurements
  - Single-path-based method
    - $O(|V||E|)$
  - All-path-based method
    - NP
  - Random-walk-based method
    - $O(|V|^3)$ per iteration, $O(|V|^4)$ in total
    - Improvement by a heuristic algorithm
24 Path Strength
- Single-path-based path strength
- All-path-based path strength
  - Sums up the k-length path strength for all possible k
  - Uses a threshold on the maximum k
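The difference between the two path-strength variants can be sketched by enumerating simple paths up to a maximum length. The strength formula itself is not preserved in this transcript, so the one used here is an assumption: the product of the edge weights along the path, divided by the degree of each intermediate node. The single-path value is taken as the strongest path and the all-path value as the sum over all paths of at most `k_max` edges. All names and the toy network are hypothetical.

```python
def path_strengths(weights, source, target, k_max):
    """Return (strongest single-path strength, summed strength of all simple paths)."""
    best, total = 0.0, 0.0
    stack = [(source, 1.0, frozenset([source]), 0)]
    while stack:
        node, strength, visited, length = stack.pop()
        if node == target:
            best = max(best, strength)
            total += strength
            continue
        if length == k_max:          # threshold on the maximum path length k
            continue
        # Assumed rule: intermediate nodes split strength by their degree.
        degree = len(weights[node]) if node != source else 1
        for nxt, w in weights[node].items():
            if nxt not in visited:
                stack.append((nxt, strength * w / degree, visited | {nxt}, length + 1))
    return best, total

# Hypothetical triangle: s-a (0.8), a-t (0.5), s-t (0.1).
toy_weights = {
    "s": {"a": 0.8, "t": 0.1},
    "a": {"s": 0.8, "t": 0.5},
    "t": {"a": 0.5, "s": 0.1},
}
single, all_paths = path_strengths(toy_weights, "s", "t", k_max=2)
```

Here the indirect path through `a` is stronger than the weak direct edge, and the all-path value adds both contributions. Capping `k_max` keeps the enumeration tractable, which is the role of the maximum-k threshold.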
25 Network Conversion
- Network Conversion
  - Input: a protein interaction network / Output: a hierarchical tree structure
- Algorithm
  - Centrality (weighted closeness) of a node a
  - Set of ancestor nodes T(a) of a
  - Parent node p(a) of a
- Hub Confidence Measurement
  - Set of child nodes D(a) of a
  - Set of descendant nodes $L_a$ of a
  - Hub confidence H(a) of a
26 Schematic View
- Hub Confidence
  - How strongly a node plays the role of a structural hub
  - Does not fully depend on the hierarchical level in the tree structure
27 Structural Hubs
- Top 10 Structural Hubs in the Yeast Protein Interaction Network
  - Not related to their degree
  - Each one has several different functions
28 Lethality
- Biological Essentiality
  - Evaluated by comparing with lethal proteins
  - Lethality has been determined by protein knock-out experiments
- Result
29 Conclusion
- Problems
  - Complex and unreliable connectivity in protein interaction networks
- Contributions
  - Reliable network generation by edge weighting
  - Hidden knowledge discovery, e.g., patterns or taxonomy
  - Collaboration with existing computational techniques
- Future Work
  - Integration with multiple data sources
  - Comparative analysis across organisms
30 Questions?