Bioinformatics and Computational Biology Projects

1
Bioinformatics and Computational Biology Projects
  • HiPCAT Meeting
  • September 22, 2005
  • Richard H. Scheuermann
  • Department of Pathology
  • U.T. Southwestern Medical Center

2
Overview
  • Projects
  • Bioinformatics Resource Centers for Biodefense
    and Emerging/Re-emerging Infectious Disease (BRC)
  • Bioinformatics Integration Support Contract
    (BISC)
  • Data Analysis and Algorithm Development
  • Microarray data analysis and gene ontology
    integration
  • Microarray data analysis and transcription
    regulation integration
  • Biological network analysis
  • Yeast protein-protein interaction network
  • B lymphocyte genetic interaction network
  • Population genetic analysis
  • Genome annotation
  • Multi-dimensional fluorescence activated cell
    sorting analysis

3
Bioinformatics Resource Centers for Biodefense
and Emerging/Re-emerging Infectious Disease
(BRC)
4
BRC Program Overview
  • NIAID (National Institute of Allergy and
    Infectious Diseases)
  • Significant investment in sequencing pathogens
  • Potential targets for diagnostics, therapeutics,
    and vaccines
  • Desire to facilitate data exploration
  • Desire to integrate data and tools from disparate
    sources
  • Accordingly, NIAID awarded 8 BRC contracts
  • 5-year contract starting June, 2004
  • NIAID Program Officer - Dr. Valentina di
    Francesco
  • The BioHealthBase BRC
  • $17 million project
  • Collaboration with Northrop Grumman IT

5
BioHealthBase BRC Overview
  • Web-based system accessible to science researchers
  • Storage of sequences, annotation, reference
    databases, etc.
  • Generation and storage of value added products
  • Pathways, promoter regions, operons,
    multi-alignments, epitopes, etc.
  • Deployment of available science community tools
  • GBrowse, BioCyc Navigator, etc.
  • Development of new analytical tools
  • Development of new query and visualization tools
  • GeneCards query tool with PlasmoDB-like
    interface, etc.
  • Emphasis on host-pathogen interactions to support
    drug discovery and vaccine development

6
BioHealthBase BRC - Pathogen Rationale
  • Each pathogen is not only a potential
    bioterrorism agent, but is also endemic or
    re-emerging in the United States
  • Dual purpose: biodefense and public health
  • At least one pathogen from NIAID Category A, B,
    and C
  • Pathogens represent each of the three major
    classes of microorganism - bacteria, viruses, and
    parasites - and even plants
  • Developing a database that supports these
    disparate organism types represents a significant
    challenge. However, the microorganism-independent
    database structure that we will develop will
    support the addition of other organisms in the
    future

  • Francisella tularensis
  • Influenza virus
  • Mycobacterium tuberculosis
  • Microsporidia
  • Giardia lamblia
  • Ricinus communis

7
Bioinformatics Integration Support Contract (BISC)
8
BISC summary
  • The Bioinformatics Integration Support Contract
    (BISC) was awarded to provide advanced computer
    support for the collection, integration and
    analysis of immunology research data from
    numerous, diverse sources, and their long-term
    storage in sustainable databases
  • $28 million project
  • Collaboration with Northrop Grumman IT
  • The users of BISC include all NIAID/DAIT-funded
    programs, which include
  • basic scientific research
  • clinical trials
  • The Immunology Database and Analysis Portal
    (ImmPort) is being developed as the solution to
    the needs defined in the BISC RFP
  • System deployment - ImmPort v1.0 by 26OCT2005

9
ImmPort System Components
  • Data Warehouse
  • Access restricted to DAIT/NIAID investigators
  • Containing public reference data, primary
    experimental data, processed data
  • Built upon a cross-disciplinary data model
    supported by a customized ImmPort Ontology
  • Private Project Workspaces
  • For pre-publication data storage and analysis
  • PI-controlled access to collaborators
  • Securely segregated from public warehouse
  • Analytical Toolkit, Interfaces and Reports
  • Database query interfaces
  • Data analysis applications
  • Journaling and reports
  • User Support
  • Data sharing compliance
  • Educational resources
  • Outreach

10
ImmPort - Principles
  • Utilize open source software as much as possible
  • Contribution of data to the ImmPort system should
    require minimal effort and should significantly
    enhance the research program of the investigator
    (incentive)
  • Ensure that data is appropriately attributed
  • Data integration is key
  • Provide analytical tools - extant and novel
  • Provide training in the use of the analytical
    tools
  • Develop and utilize data standards, including
    controlled vocabularies

11
Public reference data
  • Protein information
  • Protein description
  • Cross referencing IDs
  • PubMed references
  • Domain structure
  • Motif location
  • Molecular function
  • UniProt
  • Basic gene information
  • Standard gene symbol/name
  • Cross referencing IDs
  • Gene location/chr. position
  • GO terms
  • Reference sequences
  • Homologues
  • NCBI Entrez
  • Polymorphism information
  • SNP location
  • SNP allele frequencies
  • SNP haplotype blocks
  • Microsatellite/VNTR location
  • Disease association
  • NCBI dbSNP, dbMHC, UniSTS
  • HapMap
  • CEPH

Immune response gene
  • Gene expression information - NCBI GEO
  • Metabolic pathways - BioCyc, KEGG
  • Protein-protein interaction networks - DIP, BIND,
    MIPS, AfCS Y2H
  • Signal transduction pathways - AfCS Molecule Page
    network, Reactome

12
Data Analysis and Algorithm Development
13
Parallel MoNet
  • Biological network analysis

14
Motivation
  • The cell functions as a system of integrated
    components
  • There is increasing evidence that the cell system
    is composed of modules
  • A module in a biological system is a discrete
    unit whose function is separable from those of
    other modules
  • Modules defined based on functional criteria
    reflect the critical level of biological
    organization (Hartwell, et al.)
  • A modular system can reuse existing, well-tested
    modules
  • The notion of regulation requires the assembly of
    individual components into modular networks
  • Functional modules can then be assembled together
    into cellular networks
  • Thus, identifying functional modules and their
    relationship from biological networks is
    important to the understanding of the
    organization, evolution and interaction of the
    cellular systems they represent

15
Definition of network modules
16
Edge betweenness
  • Girvan and Newman proposed an algorithm to find
    social communities within human population
    networks
  • Utilized the concept of edge betweenness as a
    unit of measure
  • The betweenness of an edge is defined as the
    number of shortest paths between all pairs of
    vertices that run through that edge
  • Edges between modules tend to have higher values
  • Provides a quantitative criterion to distinguish
    edges inside modules from the edges between
    modules

Betweenness = 20
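
The betweenness definition above can be made concrete with a short sketch. It assumes an undirected, unweighted graph stored as a dictionary of neighbour sets, and uses Brandes-style path counting so that tied shortest paths share credit equally; the function name and the toy graph (two triangles joined by one bridge edge) are illustrative and do not come from the slides.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Return {frozenset({u, v}): betweenness} for an undirected, unweighted graph.

    adj maps each vertex to the set of its neighbours.
    """
    score = defaultdict(float)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, sharing path credit over edges.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Every unordered vertex pair was visited from both of its endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}

# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
eb = edge_betweenness(toy)
bridge = eb[frozenset(("c", "d"))]   # 9.0: all 3 x 3 cross-triangle pairs
inside = eb[frozenset(("a", "b"))]   # 1.0: only the pair (a, b) itself
```

The bridge scores far higher than any edge inside a triangle, which is the property the Girvan-Newman procedure relies on.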
17
A new definition of network modules
  • Definition of module degree
  • Given a graph G, let U be a subgraph of G (U ⊂ G).
    The number of edges within U is defined as the
    indegree of U, ind(U). The number of edges that
    connect U to the remaining part of G (G-U) is
    defined as the outdegree of U, outd(U).
  • Definition of module
  • A subgraph U ⊂ G is a module if ind(U) > outd(U).
  • A subgraph is a complex module if it can be
    separated into at least two modules by removing
    edges inside it. Otherwise, it is a simple
    module.
  • Adjacency relationship between modules
  • Given two subgraphs U, V ⊂ G, U and V are adjacent
    if U ∩ V = ∅ and there are edges in G connecting
    vertices in U and V.
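
The degree and module definitions above translate almost directly into code. The following is a minimal sketch, assuming the graph is given as a list of undirected edges and a subgraph as a set of vertices; the comparison symbol is damaged in the transcript, so the strict inequality used here is an assumption, and all function names are illustrative.

```python
def indegree(edges, U):
    """ind(U): number of edges of G with both endpoints inside U."""
    return sum(1 for u, v in edges if u in U and v in U)

def outdegree(edges, U):
    """outd(U): number of edges of G with exactly one endpoint inside U."""
    return sum(1 for u, v in edges if (u in U) != (v in U))

def is_module(edges, U):
    # Assumption: the module test is the strict inequality ind(U) > outd(U).
    return indegree(edges, U) > outdegree(edges, U)

def are_adjacent(edges, U, V):
    """U and V are adjacent if they are disjoint and some edge joins them."""
    if U & V:
        return False
    return any((u in U and v in V) or (u in V and v in U) for u, v in edges)

# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
left, right = {"a", "b", "c"}, {"d", "e", "f"}

left_is_module = is_module(toy_edges, left)          # ind = 3, outd = 1
pair_is_module = is_module(toy_edges, {"c", "d"})    # ind = 1, outd = 4
triangles_adjacent = are_adjacent(toy_edges, left, right)
```

Each triangle qualifies as a module, while the two bridge endpoints taken together do not, because most of their edges lead outside the pair.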

18
Algorithm
  • 1. The G-N algorithm is run on the network to
    obtain the order of edge deletion.
    • a. The betweenness score of each edge in the
      network is calculated.
    • b. The edge with the highest betweenness is
      identified and removed from the network.
    • c. Steps a and b are repeated until no edges
      remain in the network.
  • 2. An edge list is created in the reverse order
    of edge deletion in Step 1.
  • 3. The agglomerative algorithm is initialized by
    setting each vertex as a singleton sub-graph with
    no edges. All singleton sub-graphs are labeled
    as mergeable.
  • 4. An edge is removed from the top of the edge
    list.
  • 5. If the edge connects vertices in the same
    sub-graph, it is added to the sub-graph.
  • 6. If the edge connects vertices in two different
    sub-graphs:
    • a. If both sub-graphs are mergeable, the two
      sub-graphs are evaluated based on the module
      definition.
      • i. The edge is retained, and the sub-graphs
        are merged, if the merge is
        • 1) between two non-modules, or
        • 2) between a non-module and a module.
      • ii. Otherwise (both sub-graphs are modules),
        the two sub-graphs are set as non-mergeable.
    • b. If one of the sub-graphs is non-mergeable,
      the other sub-graph is set as non-mergeable.
  • 7. Steps 4 - 6 are repeated until no edges are
    left in the edge list.
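
Steps 3 to 7 can be sketched as follows. This is an illustrative reading of the slide, not the MoNet implementation: it assumes the reverse deletion order from Step 1 is already available, tracks only the vertex set of each sub-graph, evaluates ind and outd on the full graph G, and uses a strict inequality for the module test. The function name and toy data are hypothetical.

```python
def monet_agglomerate(edges, ordered_edges):
    """Agglomerative phase (steps 3-7), sketched under the assumptions above.

    edges: every edge of G; ordered_edges: the same edges in reverse order
    of Girvan-Newman deletion. Returns the final vertex groups.
    """
    vertices = {v for edge in edges for v in edge}
    group_of = {v: v for v in vertices}      # vertex -> group label
    members = {v: {v} for v in vertices}     # group label -> vertex set
    mergeable = {v: True for v in vertices}  # step 3: all singletons mergeable

    def is_module(U):
        # Assumption: ind/outd are counted on G, with ind(U) > outd(U).
        ind = sum(1 for u, v in edges if u in U and v in U)
        outd = sum(1 for u, v in edges if (u in U) != (v in U))
        return ind > outd

    for u, v in ordered_edges:               # step 4
        gu, gv = group_of[u], group_of[v]
        if gu == gv:
            continue                         # step 5: internal edge, groups unchanged
        if mergeable[gu] and mergeable[gv]:
            if is_module(members[gu]) and is_module(members[gv]):
                mergeable[gu] = mergeable[gv] = False   # step 6.a.ii
            else:
                for w in members[gv]:                   # step 6.a.i: merge gv into gu
                    group_of[w] = gu
                members[gu] |= members.pop(gv)
                del mergeable[gv]
        else:
            mergeable[gu] = mergeable[gv] = False       # step 6.b
    return sorted(sorted(group) for group in members.values())

# Toy graph: two triangles joined by the bridge c-d. The bridge has the
# highest betweenness, so it is deleted first and appears last here.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
modules = monet_agglomerate(toy_edges, toy_edges)
```

On the toy graph each triangle grows into a module before the bridge is considered, so the bridge is rejected and the two triangles are returned as separate modules.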

19
Modular decomposition of the Yeast PIN
  • Yeast protein interaction network
  • The core protein interaction network of yeast
    from DIP database (version ScereCR20041003)
    (Xenarios et al. 2002)
  • After removal of all self-connecting links, the
    final protein interaction network included 2609
    yeast proteins and 6355 interactions
  • The network consists of one large component of
    2440 proteins and 65 small components of no more
    than 7 proteins each
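
The preprocessing described above (dropping self-connecting links, then splitting the network into connected components) can be sketched in a few lines. The interaction pairs below are invented placeholders, not DIP records, and the function name is illustrative.

```python
from collections import deque

def connected_components(interactions):
    """Drop self-links, then return components as sets, largest first.

    Proteins that appear only in self-links are left out of the result.
    """
    adj = {}
    for u, v in interactions:
        if u == v:
            continue  # remove self-connecting links
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    component.add(y)
                    queue.append(y)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical interaction pairs; P3-P3 is a self-link and is ignored.
pairs = [("P1", "P2"), ("P2", "P3"), ("P3", "P3"), ("P4", "P5")]
components = connected_components(pairs)
sizes = [len(c) for c in components]
```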

20
Modular decomposition of Yeast PIN
  • Result Summary
  • 202 simple modules generated
  • 137 simple modules from the large component
  • The size of modules ranges from 2 to 201

21
Gene/protein annotation
  • The Gene Ontology (GO) Consortium maintains a
    database that describes genes based on their
    molecular function, biological process and
    cellular component.
  • Gene example - Retinoblastoma
  • Function - DNA binding
  • Component - nucleus
  • Process - transcription regulation; regulation of
    cell cycle
  • Description represents a leaf on a branch in a
    hierarchical tree indicating parent-child
    relationships (directed acyclic graph).
  • Although we have used GO annotation as a means of
    gene classification, CLASSIFI can be used with
    any gene description scheme of interest.

22
Validation of modules
  • Annotated each protein with the Gene Ontology™
    (GO) terms from the Saccharomyces Genome Database
    (SGD) (Cherry et al. 1998; Balakrishna et al.)
  • Quantified the co-occurrence of GO terms using
    the hypergeometric distribution analysis
    supported by the Gene Ontology Term Finder
    (Balakrishna et al.)
  • The results show that each module has
    statistically significant co-occurrence of
    functional GO categories
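
As a rough illustration of the hypergeometric test mentioned above, the sketch below computes the probability of seeing at least k proteins with a given GO term in a module of n proteins, when K of the N proteins in the network carry that term. It is a generic one-sided test written for this note, not the Gene Ontology Term Finder code, and it applies no multiple-testing correction; the parameter names are illustrative.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(population N, annotated K, sample n)."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Small worked case: 10 proteins, 4 annotated, module of 3, at least 2 annotated.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = (36 + 4) / 120 = 1/3
p_small = hypergeom_pvalue(10, 4, 3, 2)
```

A small p-value means the term is over-represented in the module relative to what random sampling from the network would give.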

23
Validation of modules
Modules with 100% GO frequency
24
Validation of modules
  • P values of modules obtained using our definition,
    plotted against the P values of the corresponding
    weak modules (two modules correspond when they
    share most of their genes)

25
Why Parallelize?
26
Time Complexity of MoNet
  • The time complexity of the G-N algorithm is
    O(M²N), where M is the number of edges and N is
    the number of vertices.
  • Each step that deletes the highest-betweenness
    edge needs O(MN), and there are M such steps.
  • The time complexity of MoNet is O(M²N + M).

27
Why Parallelize?
  • Modular decomposition of the yeast core protein
    interaction network from DIP
  • Number of vertices: 2609
  • Number of edges: 6355
  • Time complexity: (6355² × 2609 + 6355) ≈ 1.1 × 10¹¹
  • Run time on a server running Microsoft Windows
    Server 2003 Enterprise Edition with two Intel
    Xeon 3.2 GHz processors and 4 GB RAM: 2 days

28
B lymphocyte Genetic Network
  • Reverse engineering of regulatory networks in
    human B cells. K. Basso, A.A. Margolin, G.
    Stolovitzky, U. Klein, R. Dalla-Favera & A.
    Califano. Nature Genetics 37, 382-390 (2005)
  • Cellular network from a set of 336 expression
    profiles representative of normal B cell
    subpopulations, various subtypes of B cell tumors
    and experimentally manipulated B cells
  • MoNet Analysis
  • Number of vertices: 6111
  • Number of edges: 64751
  • Time complexity: (64751² × 6111 + 64751) ≈ 2.6 × 10¹³
  • Run time on a server running Microsoft Windows
    Server 2003 Enterprise Edition with two Intel
    Xeon 3.2 GHz processors and 4 GB RAM: 50 days
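
The operation counts quoted for the yeast and B cell networks can be checked with a few lines of arithmetic. The sketch simply evaluates M²N + M for each network; the function name is illustrative, and the count is a complexity estimate, not a run-time prediction.

```python
def monet_cost(num_edges, num_vertices):
    """Evaluate the M^2 * N + M estimate for a network with M edges, N vertices."""
    return num_edges ** 2 * num_vertices + num_edges

yeast = monet_cost(6355, 2609)     # yeast core protein interaction network
b_cell = monet_cost(64751, 6111)   # B lymphocyte genetic network

yeast_text = f"{yeast:.1e}"        # "1.1e+11"
b_cell_text = f"{b_cell:.1e}"      # "2.6e+13"
```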

29
Where to Parallelize
  • The calculation of all shortest paths in the G-N
    algorithm to obtain the edge betweenness.
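
One way to see why this step parallelizes well: the shortest paths from each source vertex can be computed independently, and the per-source contributions to edge betweenness simply add up. The sketch below splits the source vertices into chunks and sums the partial results. The slides do not say which parallel framework was used; a thread pool stands in here only to show the decomposition (in CPython it gives no real speed-up for this CPU-bound work, so a process pool or a message-passing library would be substituted in practice). All names are illustrative.

```python
from collections import deque, defaultdict
from concurrent.futures import ThreadPoolExecutor

def partial_betweenness(adj, sources):
    """Edge-betweenness contributions of shortest paths starting at `sources`."""
    score = defaultdict(float)
    for s in sources:
        dist, sigma = {s: 0}, defaultdict(float)
        sigma[s] = 1.0
        preds, order, queue = defaultdict(list), [], deque([s])
        while queue:                       # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):          # accumulate credit back toward s
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    return score

def parallel_edge_betweenness(adj, workers=2):
    vertices = list(adj)
    chunks = [vertices[i::workers] for i in range(workers)]  # one chunk per worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda chunk: partial_betweenness(adj, chunk), chunks))
    total = defaultdict(float)
    for part in partials:                  # reduction: sum the partial scores
        for edge, value in part.items():
            total[edge] += value
    return {edge: value / 2.0 for edge, value in total.items()}

# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
scores = parallel_edge_betweenness(toy, workers=3)
bridge_score = scores[frozenset(("c", "d"))]
```

Because the only shared step is the final summation, the work divides cleanly across as many workers as there are source-vertex chunks.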