Title: CCEGA Vision
Dan Reed
1Carolina Center for Exploratory Genetic Analysis: Introduction and Context
- Dan Reed
- Dan_Reed_at_unc.edu
- Chancellor's Eminent Professor
- Director, Renaissance Computing Institute
- University of North Carolina at Chapel Hill
Supported in part by NIH Grant 5P20RR020751-02
2Genetics and Disease Susceptibility
Phenotype 1 Phenotype 2 Phenotype 3
Phenotype 4
Ancestry Environment
Age Gender
Identify Genes
Pharmacokinetics
Metabolism
Endocrine
Biomarker Signatures
Physiology
Proteome
Transcriptome
Immune
Morphometrics
Predictive Disease Susceptibility
Source: David Threadgill/Terry Magnuson
3The Data Wave
- Many sources
- sequencing
- microarrays
- environmental
- public health
- family studies
- clinical
- Many technology enablers
- increased detector resolution
- increased storage capability
- The challenges
- managing complexity
- tracking knowledge
- extracting insight
We Are Here!
4Data Heterogeneity and Complexity
Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors),
Proteome
Source: Carole Goble (Manchester)
5Convergence and Opportunity
- Carolina Center for Genome Sciences (CCGS)
- ten-year investment of $245M
- new center and department
- 4 buildings and 22 faculty lines
- advanced facilities and equipment
- participation by multiple schools and departments
- $25M anonymous gift for proteomics
- Renaissance Computing Institute (RENCI)
- interdisciplinary applications of computing
- faculty, staff and student collaborations
- new infrastructure and capabilities
- technology transfer and economic development
6Information Visualization Challenges
- Heterogeneity and complexity
- data types
- numerical, non-numerical
- textual, graphical,
- sources and scale
- distributed databases, conferences
- journals, web pages,
- ontology and relationships
- metadata, meanings and connections
- Compatibility and collaboration
- from desktop to distributed and collaborative
- familiar tools and interfaces
- UNC tiled display wall deployment
- Health Sciences Library/RENCI collaboration
- HSL focus on fostering collaboration
- building renovation and redesign
7RENCI/Health Sciences Collaboration
June 2005
8PITAC Data and Software Repositories
- Findings
- Explosive growth in sensors and scientific instruments has engendered unprecedented volumes of data, presenting historic opportunities for major scientific breakthroughs in the 21st century
- Computational science now encompasses modeling and simulation using data from these and other sources, requiring data management, mining, and interrogation
- Recommendations
- Federal government must provide long-term support for computational science community data repositories
- defined frameworks, metadata structures
- algorithms, data sets, applications
- review and validation infrastructure
- Government must require funded researchers to deposit their data and research software in these repositories or with access providers that respect any necessary or appropriate security and/or privacy requirements
9Deep Carolina (Proposed)
- Features
- five-year partnership timeline (minimum)
- RTP/UNC IBM anchors
- proximity facilitates collaboration
- joint faculty/staff participation
- leading edge computing infrastructure
- hardware and software
- Rationale
- leverage IBM and Triangle resources
- develop and evaluate new technologies
- explore applications of computing to new problems
- Joint resource commitments
- Carolina and IBM
10CCEGA Project Goals
- Develop collaborative experiences and plans
- preliminary data to apply for a P50 grant
- Deliverables and activities
- develop a protocol for prospective studies
- using ongoing studies as examples to define best practices
- Carolina Cohort
- develop a prototype informatics infrastructure
- data models, methods, tools and portals
- demonstrate the utility of data mining
- applied to established project(s)
- facilitate use of best practices for existing projects
- develop an environment for cross training and education
- formal and informal education touching project participants and trainees
- Foster mutual awareness and shared needs
11CCEGA Vision
Interoperable Data Management
Faculty, Staff and Students
Driving Problems
Promoting Mutual Awareness
Experimental Genetics Portal
Analysis Techniques
Statistical and Computational Techniques
Extant Data Models
Virtuous Cycle
Interdisciplinary Research and Education
12Tentative Science Requirements
- Integrated storage, analysis and exploration
- reusable infrastructure and shared capability
- Shared collaborative infrastructure
- new science and larger collaborations
- Leverage from other infrastructure
- distributed resource sharing and use
- Simplicity, simplicity, simplicity
- reduce redundant infrastructure construction
- focus time and talent on research
13CCEGA Participants
- Coordination team
- Dan Reed, RENCI
- Terry Magnuson, CCGS
- Alan Blatecky, RENCI
- Kirk Wilhelmsen, CCGS
- Eleven departments/institutes
- Biostatistics
- Cancer Center
- CCGS
- Computer Science
- Epidemiology
- Genetics
- Health Sciences Library
- Information and Library Science
- Pharmacy
- RENCI
- Statistics
- Campus-wide support
- from many sources
- Project participants
- Brad Hemminger, Information and Library Science
- James Evans, Genetics
- Kevin Gamiel, RENCI
- Xiaojun Guan, RENCI
- Barrie Hays, Health Sciences Library
- Clark Jefferies, RENCI
- Ethan Lange, Genetics
- Andrew Nobel, Statistics
- Karen Mohlke, Genetics
- Kari North, Epidemiology
- Susan Paulsen, Computer Science
- Fernando Manuel Pardo, Genetics
- Charles Perou, Cancer Center
- Lavanya Ramakrishnan, RENCI
- Jan Prins, Computer Science
- Patrick Sullivan, Genetics
- Lisa Susswein, Cancer Center
- David Threadgill, Genetics
14Formal CCEGA Activities
- Workshops
- genetics and disease
- analysis methods (today)
- Cross-disciplinary tutorials
- genotyping
- XML
- others to come
- Working groups
- ELSI, analysis and informatics
- Software prototyping
- portal and data model planning
- Management group
- planning and strategy
www.renci.org/P20
15CCEGA Working Groups/Structures
- ELSI
- IRB and coordinated data sharing
- James Evans, Genetics (lead)
- Exploratory analysis
- data mining and classification techniques
- Jan Prins, Computer Science (lead)
- Informatics
- LIMS, data models and representations
- Brad Hemminger, Information and Library Science (lead)
- Integration and prototyping
- portals, software and tools
- Xiaojun Guan, RENCI (lead)
- Organization and operation
- weekly meetings with posted topics
- web summaries for project access
- www.renci.org/P20
16Infrastructure and Data
- Bioinformatics portal
- standard tools and community interfaces
- data integration and access
- Large scale visualization and collaboration
- tiled display wall and tools
- Strawman data models
- discussion, data validation and tool development
- Simulated case-control data sets
- no genotype/phenotype connection (null data)
- phenotype via simulated development process
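A null case-control data set of the kind listed above could be generated along the following lines. This is a minimal sketch, not the project's actual simulator: the function name, the minor-allele-frequency range, and the use of Hardy-Weinberg genotype proportions are all assumptions made for illustration. It covers only the null case (no genotype/phenotype connection), not the simulated development process.

```python
import random


def simulate_null_case_control(n_cases, n_controls, n_snps, seed=0):
    """Hypothetical generator for null case-control genotype data.

    Genotypes are drawn without reference to case/control status, so any
    association a method reports on this data is a false positive.
    """
    rng = random.Random(seed)

    # Assumed minor-allele frequency for each SNP (range chosen arbitrarily).
    mafs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]

    def genotype(maf):
        # Count of minor alleles (0, 1 or 2) from two independent draws,
        # which gives Hardy-Weinberg proportions.
        return int(rng.random() < maf) + int(rng.random() < maf)

    n_total = n_cases + n_controls
    genotypes = [[genotype(maf) for maf in mafs] for _ in range(n_total)]

    # Status labels (1 = case, 0 = control) shuffled independently of genotype.
    status = [1] * n_cases + [0] * n_controls
    rng.shuffle(status)

    return genotypes, status, mafs


if __name__ == "__main__":
    genotypes, status, mafs = simulate_null_case_control(100, 100, 5, seed=42)
    print(len(genotypes), "individuals,", sum(status), "cases")
```

A fixed seed makes the data set reproducible, which is useful when several analysis methods are to be compared on the same null data.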
17North Carolina Bioportal
- Goals and features
- standard interfaces
- common tools and databases
- extensibility mechanisms
- new tools, techniques and data
- authentication and security
- controlled access
- local and remote access
- national coupling and sharing
- Currently
- 100 standard applications
- EMBOSS, Glimmer, HMMER
- NCBI, PHYLIP, other,
- growing suite of databases
- NCBI BLAST, GenBank, GenPept
- PDB, PRINTS, REBASE, Repbase
- UniProt, FASTA, genomes, Pfam
- PROSITE, RefSeq, TRANSFAC, WU-BLAST
- May 2005 initial release
19North Carolina Bioportal
Users
Account Management
BioPortal
MySQL databases
Grid Gatekeeper
MyProxy
GridFTP
OpenPBS
Applications
Application Databases
Pise
- Open Grid Computing Environment (OGCE)
- shared development
- standard web services
- adopting portal standards (JSR168)
- used by cyberinfrastructure projects
- LEAD, NEES, PACI, DOE, TeraGrid
Local cluster
20Our Vision of Success
- Local avatars for the national community
- driving problems and experiences
- infrastructure testing and validation
- Multidisciplinary collaboration
- biomedical and IT researchers
- software developers
- National infrastructure and communities
- distributed and federated
- customizable to local needs
- interoperable and shared
- The Virtual Observatory astronomy model
- standard tools
- metadata and data models
- virtual community
21Next: Kirk Wilhelmsen