Title: Bioinformatics and Computational Biology Projects
1Bioinformatics and Computational Biology Projects
- HiPCAT Meeting
- September 22, 2005
- Richard H. Scheuermann
- Department of Pathology
- U.T. Southwestern Medical Center
2Overview
- Projects
- Bioinformatics Resource Centers for Biodefense
and Emerging/Re-emerging Infectious Disease (BRC)
- Bioinformatics Integration Support Contract
(BISC)
- Data Analysis and Algorithm Development
- Microarray data analysis and gene ontology
integration
- Microarray data analysis and transcription
regulation integration
- Biological network analysis
- Yeast protein-protein interaction network
- B lymphocyte genetic interaction network
- Population genetic analysis
- Genome annotation
- Multi-dimensional fluorescence activated cell
sorting analysis
3Bioinformatics Resource Centers for Biodefense
and Emerging/Re-emerging Infectious Disease
(BRC)
4BRC Program Overview
- NIAID (National Institute of Allergy and
Infectious Diseases)
- Significant investment in sequencing pathogens
- Potential targets for diagnostics, therapeutics,
and vaccines
- Desire to facilitate data exploration
- Desire to integrate data and tools from disparate
sources
- Accordingly, NIAID awarded 8 BRC contracts
- 5-year contract starting June 2004
- NIAID Program Officer: Dr. Valentina di
Francesco
- The BioHealthBase BRC
- $17 million project
- Collaboration with Northrop Grumman IT
5BioHealthBase BRC Overview
- Web-based system accessible to science researchers
- Storage of sequences, annotation, reference
databases, etc.
- Generation and storage of value-added products
- Pathways, promoter regions, operons,
multi-alignments, epitopes, etc.
- Deployment of available science community tools
- GBrowse, BioCyc Navigator, etc.
- Development of new analytical tools
- Development of new query and visualization tools
- GeneCards query tool with PlasmoDB-like
interface, etc.
- Emphasis on host-pathogen interactions to support
drug discovery and vaccine development
6BioHealthBase BRC - Pathogen Rationale
- Each pathogen is not only a potential
bioterrorism agent, but is also endemic or
re-emerging in the United States
- Dual purpose: biodefense and public health
- At least one pathogen from each of NIAID
Categories A, B, and C
- Pathogens represent each of the three major
classes of microorganism (bacteria, viruses, and
parasites), and even plants
- Developing a database that supports these
disparate organism types represents a significant
challenge. However, the microorganism-independent
database structure that we will develop will
support the addition of other organisms in the
future.
- Francisella tularensis
- Influenza virus
- Mycobacterium tuberculosis
- Microsporidia
- Giardia lamblia
- Ricinus communis
7Bioinformatics Integration Support Contract (BISC)
8BISC summary
- The Bioinformatics Integration Support Contract
(BISC) was awarded to provide advanced computer
support for the collection, integration and
analysis of immunology research data from
numerous, diverse sources, and their long-term
storage in sustainable databases
- $28 million project
- Collaboration with Northrop Grumman IT
- The users of BISC include all NIAID/DAIT-funded
programs, which include
- basic scientific research
- clinical trials
- The Immunology Database and Analysis Portal
(ImmPort) is being developed as the solution to
the needs defined in the BISC RFP
- System deployment: ImmPort v1.0 by 26 October 2005
9ImmPort System Components
- Data Warehouse
- Access restricted to DAIT/NIAID investigators
- Containing public reference data, primary
experimental data, processed data
- Built upon a cross-disciplinary data model
supported by a customized ImmPort Ontology
- Private Project Workspaces
- For pre-publication data storage and analysis
- PI-controlled access to collaborators
- Securely segregated from public warehouse
- Analytical Toolkit, Interfaces and Reports
- Database query interfaces
- Data analysis applications
- Journaling and reports
- User Support
- Data sharing compliance
- Educational resources
- Outreach
10ImmPort - Principles
- Utilize open source software as much as possible
- Contribution of data to the ImmPort system should
require minimal effort and should significantly
enhance the research program of the investigator
(incentive)
- Ensure that data is appropriately attributed
- Data integration is key
- Provide analytical tools, both extant and novel
- Provide training in the use of the analytical
tools
- Develop and utilize data standards, including
controlled vocabularies
11Public reference data
- Protein information
- Protein description
- Cross referencing IDs
- PubMed references
- Domain structure
- Motif location
- Molecular function
- UniProt
- Basic gene information
- Standard gene symbol/name
- Cross referencing IDs
- Gene location/chr. position
- GO terms
- Reference sequences
- Homologues
- NCBI Entrez
- Polymorphism information
- SNP location
- SNP allele frequencies
- SNP haplotype blocks
- Microsatellite/VNTR location
- Disease association
- NCBI dbSNP, dbMHC, UniSTS
- HapMap
- CEPH
Immune response gene
- Gene expression information: NCBI GEO
- Metabolic pathways: BioCyc, KEGG
- Protein-protein interaction networks: DIP, BIND,
MIPS, AfCS Y2H
- Signal transduction pathways: AfCS Molecule Page
network, Reactome
12Data Analysis and Algorithm Development
13Parallel MoNet
- Biological network analysis
14Motivation
- The cell functions as a system of integrated
components
- There is increasing evidence that the cell system
is composed of modules
- A module in a biological system is a discrete
unit whose function is separable from those of
other modules
- Modules defined based on functional criteria
reflect the critical level of biological
organization (Hartwell, et al.)
- A modular system can reuse existing, well-tested
modules
- The notion of regulation requires the assembly of
individual components into modular networks
- Functional modules can then be assembled together
into cellular networks
- Thus, identifying functional modules and their
relationships from biological networks is
important to the understanding of the
organization, evolution and interaction of the
cellular systems they represent
15Definition of network modules
[Figure: three panels labeled 1, 2, and 3]
16Edge betweenness
- Girvan and Newman proposed an algorithm to find
social communities within human population
networks
- Utilized the concept of edge betweenness as a
unit of measure
- The betweenness of an edge is defined as the
number of shortest paths between all pairs of
vertices that run through it
- Edges between modules tend to have higher values
- Provides a quantitative criterion to distinguish
edges inside modules from the edges between
modules
[Figure: edge with betweenness = 20]
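The edge betweenness calculation can be sketched as follows. This is a minimal illustration, assuming an unweighted, undirected network stored as an adjacency dictionary; the function name `edge_betweenness` and the small test graph are hypothetical and not taken from the talk. When several shortest paths join a pair of vertices, each path contributes a fractional count, which is the usual convention and is assumed here.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph.

    adj maps each vertex to the set of its neighbours.
    Returns a dict keyed by frozenset({u, v}).
    """
    bet = defaultdict(float)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = defaultdict(int)
        sigma[s] = 1
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, sharing path counts
        # among the edges that lead toward s.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every vertex pair was visited from both ends, so halve the totals.
    return {e: b / 2 for e, b in bet.items()}

# Two triangles joined by a single bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
print(scores[frozenset((2, 3))])  # 9.0: all 3 x 3 cross pairs use the bridge
print(scores[frozenset((0, 1))])  # 1.0: only the pair (0, 1) uses this edge
```

The bridge between the two triangles scores far higher than any edge inside a triangle, which is the property the slide relies on to tell inter-module edges from intra-module edges.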
17A new definition of network modules
- Definition of module degree
- Given a graph G, let U be a subgraph of G ($U \subseteq G$).
The number of edges within U is defined as the
indegree of U, ind(U). The number of edges that
connect U to the remaining part of G ($G - U$) is defined
as the outdegree of U, outd(U).
- Definition of module
- A subgraph $U \subseteq G$ is a module if $\mathrm{ind}(U) > \mathrm{outd}(U)$.
- A subgraph is a complex module if it can be
separated into at least two modules by removing
edges inside it. Otherwise, it is a simple
module.
- Adjacent relationship between modules
- Given two subgraphs $U, V \subseteq G$, U and V are adjacent
if $U \cap V = \emptyset$ and there are edges in G connecting
vertices in U and V.
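The definitions above can be checked on a small example. The sketch below assumes that a subgraph is given by its vertex set, that ind and outd are counted over the edges of the full network, and that the module condition is a strict inequality (the comparison symbol is an assumption here). The function names are illustrative only.

```python
def module_degrees(edges, U):
    """Return (ind(U), outd(U)) for the vertex set U."""
    U = set(U)
    ind = sum(1 for u, v in edges if u in U and v in U)
    # An edge leaves U when exactly one of its endpoints lies in U.
    outd = sum(1 for u, v in edges if (u in U) != (v in U))
    return ind, outd

def is_module(edges, U):
    """Module test; the strict inequality is an assumption."""
    ind, outd = module_degrees(edges, U)
    return ind > outd

def are_adjacent(edges, U, V):
    """U and V are adjacent if they are disjoint and joined by an edge."""
    U, V = set(U), set(V)
    if U & V:
        return False
    return any((u in U and v in V) or (u in V and v in U) for u, v in edges)

# Two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(module_degrees(edges, {0, 1, 2}))         # (3, 1)
print(is_module(edges, {0, 1, 2}))              # True
print(is_module(edges, {2, 3}))                 # False: ind = 1, outd = 4
print(are_adjacent(edges, {0, 1, 2}, {3, 4, 5}))  # True
```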
18Algorithm
- 1. The G-N algorithm is run on the network and
the order of edge deletion is obtained.
- a. The betweenness score of each edge in the
network is calculated.
- b. The edge with the highest betweenness is
identified and removed from the network.
- c. Steps a and b are repeated until no edges
remain in the network.
- 2. An edge list is created in the reverse order
of edge deletion in Step 1.
- 3. The agglomerative algorithm is initialized by
setting each vertex as a singleton sub-graph with
no edges. All singleton sub-graphs are labeled
as mergeable.
- 4. An edge is removed from the top of the edge
list.
- 5. If the edge connects vertices in the same
sub-graph, it is added to the sub-graph.
- 6. If the edge connects vertices in two different
sub-graphs:
- a. If both sub-graphs are mergeable, the two
sub-graphs are evaluated based on the module
definition.
- i. The edge is retained, merging the two
sub-graphs, if the merge is
- 1) between two non-modules, or
- 2) between a non-module and a module.
- ii. Otherwise (both sub-graphs are modules), the
two sub-graphs are set as non-mergeable.
- b. If one of the sub-graphs is non-mergeable, the
other sub-graph is set as non-mergeable.
- 7. Steps 4-6 are repeated until no edges are left
in the edge list.
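The steps above can be sketched in a few dozen lines. This is an illustrative reading of the slide, not the MoNet implementation itself: it assumes an unweighted, undirected network without self-loops, tracks each sub-graph by its vertex set, evaluates ind and outd against the full network, treats the module test as a strict inequality, and breaks betweenness ties arbitrarily. All function names are hypothetical.

```python
from collections import defaultdict, deque

def edge_betweenness(adj):
    """Edge betweenness scores (each pair counted from both ends)."""
    bet = defaultdict(float)
    for s in adj:
        dist, sigma, preds, order = {s: 0}, defaultdict(int), defaultdict(list), []
        sigma[s] = 1
        queue = deque([s])
        while queue:                       # BFS counting shortest paths from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):          # push path counts back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return bet

def gn_deletion_order(edges):
    """Step 1: repeatedly delete the edge with the highest betweenness."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    order = []
    while any(adj.values()):
        bet = edge_betweenness(adj)        # recomputed after every deletion
        u, v = max(bet, key=bet.get)       # ties broken arbitrarily
        adj[u].discard(v)
        adj[v].discard(u)
        order.append((u, v))
    return order

def monet(edges):
    """Steps 2-7: agglomerate sub-graphs in reverse deletion order."""
    edges = [tuple(e) for e in edges]
    label = {v: v for e in edges for v in e}    # vertex -> sub-graph id
    members = {v: {v} for v in label}           # sub-graph id -> vertex set
    mergeable = dict.fromkeys(members, True)    # step 3

    def is_module(U):                           # assumed test: ind(U) > outd(U)
        ind = sum(1 for u, v in edges if u in U and v in U)
        outd = sum(1 for u, v in edges if (u in U) != (v in U))
        return ind > outd

    for u, v in reversed(gn_deletion_order(edges)):   # steps 2 and 4
        a, b = label[u], label[v]
        if a == b:
            continue                            # step 5: edge inside one sub-graph
        if mergeable[a] and mergeable[b]:
            if is_module(members[a]) and is_module(members[b]):
                mergeable[a] = mergeable[b] = False   # step 6.a.ii
            else:                               # step 6.a.i: merge b into a
                for w in members[b]:
                    label[w] = a
                members[a].update(members.pop(b))
                del mergeable[b]
        else:
            mergeable[a] = mergeable[b] = False # step 6.b
    return {frozenset(U) for U in members.values() if is_module(U)}

# Two triangles joined by a bridge: each triangle comes back as a module.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(monet(edges))
```

On this toy network the bridge (2, 3) is deleted first, so it is the last edge considered during agglomeration. By then both triangles already satisfy the module definition, so they are frozen rather than merged.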
19Modular decomposition of the Yeast PIN
- Yeast protein interaction network
- The core protein interaction network of yeast
from the DIP database (version ScereCR20041003)
(Xenarios et al. 2002)
- After removal of all self-connecting links, the
final protein interaction network included 2609
yeast proteins and 6355 interactions
- Consists of a large component of 2440 proteins and
65 small components, each with no more than 7
proteins
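The preprocessing described above (dropping self-connecting links, then splitting the network into connected components) might look like the following sketch. The function name and the toy edge list are hypothetical, and the sketch assumes that a protein whose only interaction is with itself is dropped from the network.

```python
from collections import defaultdict

def connected_components(edges):
    """Remove self-links, then return components, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                  # skip self-connecting links
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                # depth-first traversal of one component
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Toy interaction list: one self-link, one 3-protein and one 2-protein component.
toy = [("A", "A"), ("A", "B"), ("B", "C"), ("D", "E")]
print([len(c) for c in connected_components(toy)])  # [3, 2]
```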
20Modular decomposition of Yeast PIN
- Result Summary
- 202 simple modules generated
- 137 simple modules from the large component
- The size of modules ranges from 2 to 201
21Gene/protein annotation
- The Gene Ontology (GO) Consortium database
describes genes based on their molecular
function, biological process and cellular
component.
- Gene example: Retinoblastoma
- Function: DNA binding
- Component: nucleus
- Process: transcription regulation, regulation of
cell cycle
- Each description represents a leaf on a branch in a
hierarchical tree indicating parent-child
relationships (directed acyclic graph).
- Although we have used GO annotation as a means of
gene classification, CLASSIFI can be used with
any gene description scheme of interest.
22Validation of modules
- Annotated each protein with the Gene Ontology™
(GO) terms from the Saccharomyces Genome Database
(SGD) (Cherry et al. 1998; Balakrishna et al.)
- Quantified the co-occurrence of GO terms using
the hypergeometric distribution analysis
supported by the Gene Ontology Term Finder
(Balakrishna et al.)
- The results show that each module has
statistically significant co-occurrence of
functional GO categories
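For reference, the hypergeometric tail probability behind this kind of GO term co-occurrence test can be written in a few lines. This is a generic sketch of the statistic, not a reproduction of the Gene Ontology Term Finder (which may apply further steps such as multiple-testing correction). The function name and the example counts other than the network size of 2609 proteins are hypothetical.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): chance of seeing at least k annotated proteins in a module.

    N: proteins in the network, K: proteins carrying the GO term,
    n: proteins in the module, k: module proteins carrying the GO term.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total

# Hypothetical module: 20 proteins, 10 of them sharing a GO term that
# annotates 50 of the 2609 proteins in the network.
print(hypergeom_pvalue(2609, 50, 20, 10))
```

A small P value indicates that the GO term occurs in the module more often than expected if the module's proteins were drawn at random from the network.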
23Validation of modules
Modules with 100% GO frequency
24Validation of modules
- P values of modules obtained based on our
definition, plotted against P values of the
corresponding weak modules (a module and a weak
module correspond when they share most of their
genes)
25Why Parallelize?
26Time Complexity of MoNet
- Time complexity of the G-N algorithm is $O(M^2 N)$, where
M is the number of edges and N is the number of
vertices.
- Each step to delete the highest-betweenness edge
needs $O(MN)$.
- The time complexity of MoNet is $O(M^2 N + M)$.
27Why Parallelize?
- Modular decomposition of the yeast core protein
interaction network from DIP
- Number of vertices: 2609
- Number of edges: 6355
- Time complexity: $(6355^2 \times 2609 + 6355) \approx 1.1 \times 10^{11}$
- Run time on a server running Microsoft Windows
Server 2003 Enterprise Edition with two Intel
Xeon 3.2 GHz processors and 4 GB RAM: 2 days
28B lymphocyte Genetic Network
- Reverse engineering of regulatory networks in
human B cells. K. Basso, A.A. Margolin, G.
Stolovitzky, U. Klein, R. Dalla-Favera, A.
Califano. Nature Genetics 37, 382-390 (2005)
- Cellular network from a set of 336 expression
profiles representative of normal B cell
subpopulations, various subtypes of B cell tumors
and experimentally manipulated B cells
- MoNet Analysis
- Number of vertices: 6111
- Number of edges: 64751
- Time complexity: $(64751^2 \times 6111 + 64751) \approx 2.6 \times 10^{13}$
- Run time on a server running Microsoft Windows
Server 2003 Enterprise Edition with two Intel
Xeon 3.2 GHz processors and 4 GB RAM: 50 days
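The two operation-count estimates can be reproduced with a short calculation, taking the MoNet cost estimate to be M^2 N + M with M edges and N vertices. The function name is illustrative.

```python
def monet_operation_estimate(num_edges, num_vertices):
    """Evaluate M^2 * N + M for a network with M edges and N vertices."""
    return num_edges ** 2 * num_vertices + num_edges

networks = {
    "Yeast core PIN (DIP)": (6355, 2609),
    "B lymphocyte network": (64751, 6111),
}
for name, (m, n) in networks.items():
    print(f"{name}: {monet_operation_estimate(m, n):.1e}")
# Yeast core PIN (DIP): 1.1e+11
# B lymphocyte network: 2.6e+13
```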
29Where to Parallelize
- The calculation of all shortest paths in the G-N
algorithm, which is needed to obtain the edge
betweenness.
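One way to parallelize this step is to note that the shortest-path search from each source vertex is independent of the others, so the sources can be split across workers and the partial scores summed. The sketch below illustrates that decomposition with Python's `concurrent.futures`; it is an assumption about how the work could be divided, not a description of the Parallel MoNet implementation, and the function names are hypothetical.

```python
from collections import defaultdict, deque
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def partial_betweenness(adj, sources):
    """Edge betweenness contributions from the given source vertices only."""
    bet = defaultdict(float)
    for s in sources:
        dist, sigma, preds, order = {s: 0}, defaultdict(int), defaultdict(list), []
        sigma[s] = 1
        queue = deque([s])
        while queue:                       # BFS counting shortest paths from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):          # accumulate back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return dict(bet)

def parallel_edge_betweenness(adj, workers=4, executor_cls=ProcessPoolExecutor):
    """Split the source vertices across workers and sum the partial scores.

    Pass executor_cls=ThreadPoolExecutor for quick checks; threads show the
    decomposition but do not speed up CPU-bound Python code.
    """
    vertices = list(adj)
    chunks = [vertices[i::workers] for i in range(workers)]
    total = defaultdict(float)
    with executor_cls(max_workers=workers) as pool:
        for part in pool.map(partial_betweenness, [adj] * workers, chunks):
            for edge, score in part.items():
                total[edge] += score
    # Every vertex pair was visited from both ends, so halve the totals.
    return {edge: score / 2 for edge, score in total.items()}
```

Each worker receives the whole network but only a subset of the source vertices, so the only communication is the final sum of per-edge scores. With the default process pool, call `parallel_edge_betweenness` from under an `if __name__ == "__main__":` guard on platforms that start workers by spawning.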