Title: The Cancer Biomedical Informatics Grid
1The Cancer Biomedical Informatics Grid (caBIG)
Overview of the Integrative Cancer Research Workspace
Carl Schaefer, National Cancer Institute Center for Bioinformatics
March 16, 2006
2caBIG Introductory Seminars March 2006
- Topics
- caBIG Overview: March 13
- Overview of caBIG Activities for Clinical Trials and Tissue Banking: March 15
- Overview of caBIG Activities for Integrated Cancer Research: March 16
- caBIG Interoperability and Compatibility Basics: March 17
- https://cabig.nci.nih.gov/seminars
- http://videocast.nih.gov/
3Agenda
- Mission and goals (briefly)
- Overview of Year 1 and Year 2 products
- Example usage scenarios
- Year 3
4Cancer Biomedical Informatics Grid (caBIG™)
- Common, widely distributed infrastructure permits research community to focus on innovation
- Shared vocabulary, data elements, data models facilitate information exchange
- Collection of interoperable applications developed to common standards
- Raw published cancer research data is available for mining and integration
5System Interoperability
- Need to use common data elements (e.g. mutation type): registered in caDSR
- Need to use common vocabularies (e.g. missense, nonsense, insertion, ...): registered in EVS
- Need to know how data elements are aggregated into complex objects (e.g. mutation: locus, mutation type, normal allele, mutated allele): UML model
- Need to use a standard query/transport protocol (e.g. WSDL/SOAP or HTTP GET/POST with XML)
- caBIG Compatibility Guidelines at https://cabig.nci.nih.gov/guidelines_documentation
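The three modeling requirements on this slide (common data elements, a controlled vocabulary, and aggregation into a complex object) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, and the three vocabulary terms are hypothetical and are not taken from caDSR, EVS, or any caBIG UML model.

```python
from dataclasses import dataclass
from enum import Enum


class MutationType(Enum):
    """Controlled vocabulary for one data element (illustrative terms only)."""
    MISSENSE = "missense"
    NONSENSE = "nonsense"
    INSERTION = "insertion"


@dataclass(frozen=True)
class Mutation:
    """Complex object that aggregates four data elements (hypothetical model)."""
    locus: str
    mutation_type: MutationType
    normal_allele: str
    mutated_allele: str


def parse_mutation(record: dict) -> Mutation:
    """Build a Mutation from a plain record, rejecting unknown vocabulary terms."""
    return Mutation(
        locus=record["locus"],
        # Enum lookup by value raises ValueError for a term outside the vocabulary.
        mutation_type=MutationType(record["mutation_type"]),
        normal_allele=record["normal_allele"],
        mutated_allele=record["mutated_allele"],
    )
```

Two systems that agree on the element names and the vocabulary can exchange such records without per-pair translation code, which is the point of the compatibility requirements.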
6Four Domain Workspaces and two Cross Cutting Workspaces have been launched
DOMAIN WORKSPACE 1: Clinical Trial Management Systems
Addresses the need for consistent, open and comprehensive tools for clinical trials management.
DOMAIN WORKSPACE 2: Integrative Cancer Research
Provides tools and systems to enable integration and sharing of information.
DOMAIN WORKSPACE 3: Tissue Banks and Pathology Tools
Provides for the integration, development, and implementation of tissue and pathology tools.
DOMAIN WORKSPACE 4: Imaging
Provides for the sharing and analysis of in vivo imaging data.
CROSS CUTTING WORKSPACE 1: Vocabularies and Common Data Elements
Responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery.
CROSS CUTTING WORKSPACE 2: Architecture
Developing architectural standards and architecture necessary for other workspaces.
7Strategic Level Workspaces
Data Sharing and Intellectual Capital
Addresses issues related to the sharing of data,
applications and infrastructure both within the
consortium and in the larger cancer research
community.
Training
Developing strategies for providing training in the use of caBIG-developed resources, including on-line tutorials, workshops, and training programs.
caBIG Strategic Planning
Assists in identifying strategic priorities for
the development and evolution of the caBIG effort.
8ICR Mission
- Facilitate translational research by integrating clinical and basic research data
- Produce informatics systems and tools that are
- interoperable
- modular
- well-engineered, well-documented
- validated
9Participating Cancer Centers
- Burnham Institute
- Cold Spring Harbor
- Columbia University (Herbert Irving)
- Dartmouth (Norris Cotton)
- Duke University
- Fox Chase
- Fred Hutchinson Cancer Research Center
- Georgetown University (Lombardi)
- Massachusetts Institute of Technology
- Memorial Sloan Kettering
- Meyer L. Prentis-Karmanos
- New York University
- Northwestern University (Robert H. Lurie)
- Oregon Health and Science University
- Thomas Jefferson University (Kimmel)
- University of California San Francisco
- University of Chicago
- University of Iowa (Holden)
- University of Michigan
- University of North Carolina (Lineberger)
- University of Pennsylvania (Abramson)
- University of South Florida (H. Lee Moffitt)
- Vanderbilt University (Ingram)
- Washington University (Siteman)
- Wistar
10End-Users
- Informatics researchers
- Bench researchers
- Clinical researchers
- Patients
11Typical Project Tasks
- Use case document (developer, with adopter approval)
- Software requirements specification (developer)
- Data model (developer, with VCDE WS approval), data elements registered in caDSR
- Code, compatible with caBIG guidelines (developer, with Architecture WS approval)
- Test Procedures (adopter)
- Installation Guide (developer)
- Training Plan (adopter)
- User Guide (adopter)
12Overview of Year 1 and Year 2 Products
13ICR Special Interest Groups
14ICR Projects by Domain (SIG)
15ICR Products by Type
16Pathways Projects
- Reactome Data
- Developer CSHL
- Adopter MSKCC
- Pathways Tools (cPath, Cytoscape, BioPAX)
- Developer MSKCC
- Adopter OHSU
- QPACA
- Developer UCSF
- Adopter OHSU
17Proteomics Tools
- RProteomics
- Developer Duke
- Adopters Penn, OHSU
- Proteomics LIMS
- Developer Fox Chase
- Adopter Moffitt
- Q5
- Developer Dartmouth
- Adopter OHSU
18Genome Annotation
- GOMiner
- Developer CCR
- Adopter Wistar
- TrAPSS
- Developer Iowa
- Adopter Wistar
- HapMap Data
- Developer CSHL
- Adopter Wistar
- Vertebrate Promoter Data
- Developer CSHL
- Adopter MSKCC
- FunctionExpress
- Developer Wash U
- Adopter Wistar
- Cancer Molecular Pages
- Developer Burnham
- Adopter Moffitt
- Seed
- Developer U Chicago
- Adopter Georgetown
- PIR
- Developer Georgetown
- Adopter Penn
19Microarray Repositories
- caArray
- Developer NCICB
- Adopters Georgetown, NYU, Wistar, Thomas Jefferson
- NCI-60 Data
- Developer CCR
- Adopter MSKCC
20Data Analysis and Statistical Tools
- DWD
- Developer UNC
- Adopter Wistar
- VISDA
- Developer Georgetown/VA Tech
- Adopter Wistar
- Magellan
- Developer UCSF
- Adopter Penn
- GenePattern
- Developer MIT
- Adopter NYU
- caWorkbench
- Developer Columbia
- Adopter Northwestern
21CDEs and Vocabularies
22Example Scenarios
23Example Scenarios
- Annotate lists of genes, proteins
- Search for and retrieve array-based data
- Display expression data on pathway networks
- Integrate biologically heterogeneous data
- Aggregate data from heterogeneous platforms
- Build tumor/normal mass-spec classifier
- Custom analysis and visualization
24Annotate List of Genes and Proteins
- Example: get physical and functional properties and homologies for 1500 proteins detected in serum sample
- Using caBIG standard APIs, query
- Cancer Molecular Pages (Burnham)
- PIR/UniProt (Georgetown); now on the grid
- SEED (U. Chicago)
- Retrieve data: molecular weight, functional domains, modified residues, homologies, etc.
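The query-then-merge pattern in this scenario can be sketched as follows. The sketch does not use any real caBIG API: each source is a hypothetical callable that takes a protein identifier and returns whatever properties it holds, standing in for a client of the corresponding service.

```python
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for an annotation service client: maps a protein
# identifier to the properties that source knows about (possibly none).
Source = Callable[[str], Dict[str, object]]


def annotate(protein_ids: Iterable[str],
             sources: Dict[str, Source]) -> Dict[str, Dict[str, object]]:
    """Query every source for every protein and merge the results per protein."""
    merged: Dict[str, Dict[str, object]] = {}
    for pid in protein_ids:
        record: Dict[str, object] = {}
        for source_name, lookup in sources.items():
            for key, value in lookup(pid).items():
                # Prefix each property with its source name so that differing
                # values from different sources are kept side by side.
                record[f"{source_name}.{key}"] = value
        merged[pid] = record
    return merged
```

Keeping the source name in each key is one possible design choice; it avoids silently preferring one provider's value when two sources report the same property.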
25Search for/Retrieve Array-Based Data
- Example: Find copy number alteration data and gene expression data for cases of invasive ductal carcinoma
- caArray (NCICB and CC adopters)
- MAGE-compliant repository for microarray data
- international standard for array data
- Oligo arrays, spotted arrays, array CGH
- Raw data and (in the future) analyzed data
- Data in via web-based data annotation forms
- MIAME 1.1 level annotations using controlled terminology from MGED ontology
- Data out via low-level MAGE-OM API and higher-level services API
- Now on the grid
26Display Expression Data on Pathways
- Example: highlight functional roles of genes overexpressed in glioblastoma multiforme samples (compared with normal)
- Query caArray repositories for availability of samples; retrieve data in MAGE-ML format
- Query cPath and Reactome for network data in BioPAX format
- cPath: protein/protein interaction data (MSKCC)
- Reactome: curated pathways (CSHL)
- Using Cytoscape, superimpose expression data on a network with gene expression values displayed along a color gradient
- Cytoscape plugins for cPath, BioPAX, MAGE-ML (MSKCC)
- Use QPACA (UCSF) to assess match between expression data and pathway membership
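Displaying expression values along a color gradient amounts to mapping each gene's value to a node color. The function below is a minimal sketch of one such mapping, assuming log-ratio input (tumor versus normal) and a blue-white-red scale; the function name, the default limit of 2.0, and the color scheme are illustrative choices, not Cytoscape's own implementation.

```python
def expression_to_rgb(log_ratio: float, limit: float = 2.0) -> tuple:
    """Map a log expression ratio to an (R, G, B) color on a blue-white-red scale.

    Values at or below -limit are pure blue, 0 is white, and values at or
    above +limit are pure red.
    """
    # Clamp to [-limit, limit], then rescale to [-1, 1].
    t = max(-limit, min(limit, log_ratio)) / limit
    if t >= 0:
        # Overexpressed: fade green and blue toward 0 (white to red).
        fade = round(255 * (1 - t))
        return (255, fade, fade)
    # Underexpressed: fade red and green toward 0 (white to blue).
    fade = round(255 * (1 + t))
    return (fade, fade, 255)
```

Clamping keeps a few extreme values from washing out the rest of the network; the limit would normally be chosen from the spread of the data being displayed.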
27Integrate Heterogeneous Data
- Example: If we select genes whose mRNA expression correlates with an outcome, do copy number changes of loci that map close to those genes also correlate?
- Magellan (UCSF)
- Allows use of biological annotation information to reduce false positives from multiple comparisons in high-throughput data
- qualitative descriptions of biological variables
- quantitative results of computations
- Allows user-defined data types (stored as entity-value pairs)
- Interoperable with caArray (for mRNA expression, CGH) and caBIO (for genomic location)
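The example question on this slide can be sketched as a two-step computation: select genes whose expression correlates with the outcome, then check the copy number correlation for the same genes. The sketch assumes copy number values have already been mapped to nearby genes (for instance by genomic location) and uses a plain Pearson coefficient with an arbitrary threshold; it is not Magellan's method, and all names are hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def compare_expression_and_copy_number(
    expression: Dict[str, Sequence[float]],
    copy_number: Dict[str, Sequence[float]],
    outcome: Sequence[float],
    threshold: float = 0.8,
) -> Dict[str, Tuple[float, float]]:
    """For genes whose expression correlates with outcome, report both correlations.

    Returns {gene: (expression_vs_outcome, copy_number_vs_outcome)}.
    """
    report = {}
    for gene, values in expression.items():
        r_expr = pearson(values, outcome)
        if abs(r_expr) >= threshold and gene in copy_number:
            report[gene] = (r_expr, pearson(copy_number[gene], outcome))
    return report
```

With thousands of genes, a threshold alone invites false positives from multiple comparisons, which is the problem the slide says annotation information is meant to reduce.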
28Build Proteomics Classifier
- Example: given peptide mass-spectra from serum samples (100 cases of non-small-cell carcinoma, 100 controls), infer diagnostic profile
- Retrieve data from future proteomics repository in mzXML format
- Use RProteomics (Duke) to de-noise, remove background, align peaks
- Reads mzXML
- Analysis routines in R with Java wrappers
- Now on the grid
- Use Q5 (Dartmouth) to build the classifier
29Aggregate Data from Heterogeneous Platforms
- Example: infer differential expression patterns for subtypes of breast cancer where available data was generated on multiple array platforms by multiple institutions
- FunctionExpress (WashU) to correlate probes on different platforms
- Distance-Weighted Discrimination (UNC Lineberger)
- Tools for combining comparable but distinct types of micro-array data sets, with the goal of improving statistical power
- Cross-platform analysis of oligo arrays (Affymetrix) and cDNA spotted arrays
- In tests of DWD, institutional and chip clustering disappears while a clear clustering by cancer type emerges
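To make the idea of removing platform effects concrete, the sketch below subtracts each platform's per-gene mean from its samples. This is a much simpler stand-in than DWD and is not the DWD algorithm; it only illustrates why samples from different platforms can be compared once a systematic platform offset is removed. The record layout and function name are assumptions.

```python
from typing import Dict, List


def center_by_platform(samples: List[dict]) -> List[dict]:
    """Subtract each platform's per-gene mean so platforms share a baseline.

    Each sample is {"platform": str, "values": {gene: float}}; all samples
    from one platform are assumed to report the same genes.
    """
    # First pass: per-platform sample counts and per-gene sums.
    sums: Dict[str, Dict[str, float]] = {}
    counts: Dict[str, int] = {}
    for s in samples:
        platform_sums = sums.setdefault(s["platform"], {})
        counts[s["platform"]] = counts.get(s["platform"], 0) + 1
        for gene, value in s["values"].items():
            platform_sums[gene] = platform_sums.get(gene, 0.0) + value

    # Second pass: subtract the platform mean from every value.
    centered = []
    for s in samples:
        n = counts[s["platform"]]
        totals = sums[s["platform"]]
        centered.append({
            "platform": s["platform"],
            "values": {g: v - totals[g] / n for g, v in s["values"].items()},
        })
    return centered
```

Mean-centering removes only a constant shift per gene and platform, and it can also remove real biology if the platforms hold different mixes of cancer subtypes, which is one reason more careful methods are used in practice.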
30Custom Analysis and Visualization
- Example: Jointly analyze microarray expression profiles, sequences, motifs, and transcription factors to identify candidate upstream regulators of a particular transcription factor
- caWorkBench (Columbia)
- Customizable, configurable graphical user interface
- Visualization and analytical components can be plugged in
- interoperable based on published interfaces
- Scripting support: caScript
- Java-like programmatic access to components from GUI
31Year 3
- More focused
- example: informatics for translational research
- Other new projects
- BioConductor
- GeneConnect
- DAS2 plugin for caCORE
- possibly a new datatype for caArray
- Continuing work
- RProteomics
- Grid enablement of caWorkBench and GenePattern
- Proteomics LIMS
32caBIG/ICR and Other NCI Initiatives
- caIntegrator Clinical Genomics Object Model (CGOM)
- SNP500Cancer
- Mouse Proteomics Biomarker Discovery Initiative
- CGEMS
- TCGA
33List of Tools
- caBIG Program Update March 2006
- This issue spotlights caBIG products currently available and pending release in 2006-2007, and highlights the release of caGrid Software Version 0.5
- cabig.nci.nih.gov/Program_Updates/cabig_March_2006_Program_Update.pdf
34How can my research benefit from caBIG Tools?
- Everything developed by the program is open source and freely available
- Training is available at https://cabig.nci.nih.gov/training
- The latest versions of all the software developed as part of the project can be obtained from the caBIG CVS site
- http://cabigcvs.nci.nih.gov/viewcvs/viewcvs.cgi/
- Commercial-grade documentation is provided as part of the project and will be located at the project gforge site
- http://gforge.nci.nih.gov
35How can I get support for these tools?
- NCICB Applications Support will coordinate support for caBIG tools
- Live Support: Monday to Friday, 8 am to 8 pm Eastern Time
- Telephone support is available Monday to Friday, 8 am to 8 pm Eastern Time, excluding government holidays.
- You may leave a message, send an email or submit a support request via the Web at any time.
- Email: ncicb_at_pop.nci.nih.gov
- Phone: 301-451-4384
- Toll-free: 888-478-4423
- Web: http://ncicbsupport.nci.nih.gov
36caBIG Getting Involved
- To get involved with caBIG
- Track caBIG activities on the NCI's caBIG website, https://cabig.nci.nih.gov/
- Attend caBIG Annual Meeting, April 9-11, 2006, Hyatt Regency Crystal City, Arlington, Virginia
- Learn about the existing bioinformatics infrastructure, caCORE, at https://ncicb.nci.nih.gov/core
- Download currently available caBIG tools from the caBIG website at https://cabig.nci.nih.gov/inventory
- Sign up for the caBIG mailing list at http://list.nih.gov/archives/cabig_announce.html
- Please visit the main caBIG website for more information: https://cabig.nci.nih.gov/
37Save the Date!
- The caBIG™ 2006 Annual Meeting
- April 9-11, 2006
- Hyatt Regency Crystal City, Arlington, Virginia
- Plenary sessions, 35 breakout sessions, dozens of demonstrations and posters, exhibits, hackathon
- Tailored sessions for newcomers April 9 and throughout the conference
- https://cabig.nci.nih.gov/2006_Annual_Meeting
38Contact Information
- Carl Schaefer, Ph.D.
- Director for Biomedical Informatics
- NCI Center for Bioinformatics
- National Cancer Institute
- 6116 Executive Blvd., Suite 403
- Rockville, MD 20852
- tel: 301-435-1535
- schaefec_at_mail.nih.gov