Title: The NIH Roadmap and PubChem
1 The NIH Roadmap and PubChem
- Gary Wiggins
- I533
- Spring 2006
2 NIH Roadmap
- Series of initiatives designed to pursue major
opportunities in biomedical research and gaps in
current knowledge that cannot be addressed by any
single NIH Institute or Center
- Goal: enable rapid transformation of new
scientific knowledge into tangible benefits for
public health
- http://nihroadmap.nih.gov/
3 NIH Molecular Libraries and Imaging Initiative
- Part of the New Pathways to Discovery area
- Goal: augment the toolbox for understanding the
functionally interconnected molecular events
that maintain health and lead to disease
- Builds on high-throughput, highly specific,
mechanism-based biological assays
- Aims to develop and discover small molecules that
hold promise as research tools to probe cellular
physiology and pathophysiology
4 NIH Molecular Imaging Roadmap
- High specificity/high sensitivity molecular
imaging probes
- Molecular imaging and contrast database
- Imaging probe development center
5 NIH Roadmap Molecular Libraries Initiative (MLI)
- A series of integrated research programs with the
goal of making small molecule screening and
screening data more widely available to the
research community
- http://nihroadmap.nih.gov/molecularlibraries/index.asp
6 MLI Aims
- Go beyond the identification of compounds with
potential therapeutic properties
- Will result in the identification of compounds to
use as probes to study cellular processes in
health and disease
- Biological screening data, assay protocols, and
chemical structures for compounds will be publicly
available in PubChem
7 NIH MLI Components
- Molecular Libraries Screening Center Network
(MLSCN)
- Cheminformatics (centered around PubChem)
- Technology development
8 NIH MLI Technology Development Areas
- Chemical diversity
- Pilot-scale libraries for investigation of novel
chemical diversity space
- Novel methods for natural product chemistry
- Development of assays
- Novel instrumentation and detection technologies
for high throughput screening
- Datasets and algorithms for better prediction of
absorption, distribution, metabolism, excretion,
and toxicity properties of small molecules
9 Assay Guidance Manual
- Originally written as a guide for therapeutic
project teams within Eli Lilly; covers:
- Identifying potential assay formats compatible
with High Throughput Screening (HTS) and Structure
Activity Relationship (SAR) studies
- Developing optimal assay reagents
- Optimizing the assay protocol with respect to
sensitivity, dynamic range, signal intensity and
stability
- Adapting the assay to microtiter plate formats
- Validating assay performance
- Orthogonal follow-up assays for chemical probe
validation and refinement
- http://www.ncgc.nih.gov/guidance/index.html
10 NIH Molecular Libraries Small Molecule Repository
- Run under contract by Discovery Partners
International
- Collects samples for high throughput biological
screening and distributes them to the NIH
Molecular Libraries Screening Center Network
- http://mlsmr.discoverypartners.com/MLSMR_HomePage/
11 Roadmap MLI Funded Areas
- Molecular Libraries Screening Centers (MLSCN)
- Ten of them at academic institutions
- NIH Chemical Genomics Center
- http://www.ncgc.nih.gov/
- http://nihroadmap.nih.gov/molecularlibraries/fundedresearch.asp
12 Roadmap MLI Funded Areas
- Submitting assays for HTS in the MLSCN
- 28 different submissions
- Pilot-scale libraries for HTS (8)
- New methodologies for natural product chemistry
(6)
- Assay development for HT molecular screening (39)
- Molecular libraries screening instrumentation (4)
13 Roadmap MLI Funded Areas
- Novel preclinical tools for predictive
ADME-Toxicology (5)
- Innovation in molecular imaging probes (11)
- Development of high-resolution probes for
cellular imaging (9)
14 Roadmap MLI Funded Areas
- Exploratory Centers for Cheminformatics Research
at:
- Indiana University
- University of Michigan
- Rensselaer Polytechnic Institute
- MIT
- North Carolina State University, Raleigh
- University of North Carolina, Chapel Hill
15 IU Projects Underway
- Innovative cross-screen analysis of NIH
Developmental Therapeutics Program Human Tumor
Cell Line data
- Development of cheminformatics web services and
use cases in Taverna
- Development of a novel interface for the analysis
of PubChem HTS data
- A structure storage and searching system for
Distributed Drug Discovery
- Quantum chemical computer simulations database
- Training modules for cheminformatics instruction
on the Web
- Web guide for essential cheminformatics resources
(http://www.indiana.edu/cheminfo/cicc/resources.html)
- Design of a grid-based distributed data
architecture for chemistry
16 NIH NCI Developmental Therapeutics Program
- The NCI has been collecting and testing compounds
for 50 years. For about 30 years this has been
managed by the Developmental Therapeutics Program
(DTP). From 1955 to 1985 the primary test was to
look for increase in survival of mice bearing
transplantable tumors. In 1990, the primary
screen switched to looking for inhibition of
growth of 60 human tumor cell lines in culture.
DTP also ran the anti-HIV screen for about 10
years and managed the yeast anti-cancer screen in
which compounds were tested for their ability to
inhibit the growth of yeast strains with defined
mutations in cell cycle genes. These assays
provide the bulk of the data DTP makes publicly
available.
17 NIH NCI DTP
- DTP's correlation analyses allow one to associate
a list of genes with a given compound or vice
versa
- Goal: get workflows running that integrate
chemical structure data with the gene expression
and sequence data in the bioinformatics world
- Need help with the practical details of creating
web services that will work in the myGrid/Taverna
(or equivalent) framework
18 NIH DTP Data
19 NCI Panel of 60 Human Cancer Cell Lines
- Protein levels
- RNA measurements
- Mutation status
- Enzyme activity levels
20 NIH DTP's COMPARE Program
- The pattern of activity that a compound exhibits
across all 60 cell lines is related to its
mechanism of action
- Can be used to discover the mechanism of a
compound's action by looking at which compounds
of known activity are correlated with the unknown
- Has been used to discover novel compounds with a
given activity by testing the compounds that
correlate best with a compound that has the
activity of interest
- Used to prioritize compounds that seem to have a
novel mechanism
- Calculates a correlation coefficient between two
vectors in 60-dimensional space
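The correlation step described on this slide can be sketched as follows. The slide does not say which coefficient COMPARE computes, so the Pearson coefficient is assumed here, and the compound names and six-element activity profiles are hypothetical stand-ins for real 60-cell-line data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length activity profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(seed_profile, library):
    """Return (name, r) pairs sorted from most to least correlated."""
    scored = [(name, pearson(seed_profile, profile))
              for name, profile in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: one value per cell line (6 here instead of 60).
seed = [5.1, 6.3, 4.8, 7.0, 5.5, 6.1]
library = {
    "compound_A": [5.0, 6.4, 4.9, 7.1, 5.4, 6.0],  # similar pattern
    "compound_B": [6.9, 5.0, 7.2, 4.7, 6.5, 5.2],  # roughly opposite pattern
}

for name, r in rank_by_correlation(seed, library):
    print(f"{name}: r = {r:.3f}")
```

With real data each profile would hold 60 values, one per cell line, and the top-ranked compounds would be the candidates for sharing a mechanism with the seed.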
21 NIH DTP
- Given a compound tested in the 60 cell assay, one
can look for the genes whose expression most
highly correlates with the ability of the
compound to inhibit cell growth. Conversely,
given a gene, one can look for compounds whose
ability to inhibit cell growth is most highly
correlated with the expression of that gene.
22 NIH DTP Needs
- Grid Web services
- Visualization; may use VOTables
- Tools to project a set of points in a
high-dimensional space down into 2D or 3D while
attempting to preserve the relative distances
- Looking at the nearest neighbors of the point of
interest on such a map could reveal relations
that would be missed in just a table listed by
distance
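One family of tools that fits this description is multidimensional scaling (MDS). The slide names no specific method, so the sketch below uses classical (Torgerson) MDS as an assumed example: it takes a matrix of pairwise distances and returns 2D coordinates. It is written in plain Python with power iteration so that it runs without external libraries; the function names and sample points are illustrative only.

```python
import math
import random


def distance_matrix(points):
    """Pairwise Euclidean distances between points of any dimension."""
    return [[math.dist(p, q) for q in points] for p in points]


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n points in `dims` dimensions from an n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration converges to the dominant eigenvector of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient of the unit vector v is its eigenvalue.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Hypothetical 5-dimensional points that happen to lie on a plane.
points = [(a, b, a, b, a) for a, b in [(0, 0), (4, 0), (0, 1), (4, 1), (2, 3)]]
for xy in classical_mds(distance_matrix(points)):
    print([round(c, 3) for c in xy])
```

When the data are not truly planar, the 2D map only approximates the original distances, which is the "attempting to preserve" caveat in the bullet above.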
23 NIH DTP Main Search Page
- http://dtp.nci.nih.gov/docs/dtp_search.html
24 High-Throughput Screening (HTS)
- the integration of biological, chemical and
clinical data
- automated standardized statistical analysis of
large and complex data volumes
- biological and chemical profiling by use of
statistical analyses on combined data from
screening, pharmacological profiling, and
structural properties
25 Other Potential Partners
- Center for Chemical Genomics at the University of
Michigan
- http://www.lifesciences.umich.edu/institute/labs/ccg/index.html
- Milos Novotny (IUB Chemistry): $3.5 million
National Center for Research Resources (NIH)
grant to conduct research in the analysis of
glycoproteins
- David Flockhart (IUB School of Medicine):
Cytochrome P450 database,
http://medicine.iupui.edu/flockhart/
26 PubChem
- 5,298,729 compounds as of 1/16/2006
- the place to go for biological and related data
- the central depository of all information related
to the NIH Roadmap project
- expected that the actual data will reside there,
and only some things may be held elsewhere, with
PubChem acting as a pointer
- May even have the images from screens and assays
- chemical structures from Elsevier's xPharm
database
27 PubChem Data (as of 10/25/2005)
- Bioassays deposited: 177
- Bioassay test results: 3,158,669
- Substances deposited: 7,848,390
- Unique substances: 5,269,228
28 PubChem Technical Details
- Entrez database system
- for all textual information in the database
- NCBI Toolkit: an open-source infrastructure
toolkit
- OpenEye OEChem toolkit and associated software
- for most structure standardization tasks, plus
some structure identifier computations like
SMILES and IUPAC name generation
- NIST InChI library
- for computing the InChI identifier
- CACTVS Chemoinformatics Toolkit
- for structure depictions, structure database
system, structure query execution, structure
deduplication, some property calculations and the
WWW structure and image editors
- Various general low-level support libraries,
e.g., zlib, png, gd and freetype
- In-house code
- for the queuing system, deposition system,
display CGIs, structure standardization set-up,
update scripts, etc.
29 PubChem Database Display and Query Subsystems - 1
- A special Entrez version
- stores textual and numerical data
- hosted on a MS SQL Server relational database
cluster
- holds precomputed structure images for display,
ASN.1 structure data blobs for download, and
extensive crosslinking functions for linking to
other NCBI databases
30 PubChem Display and Query Subsystems - 2
- structure search component
- based on the CACTVS structure search system
- pseudo-relational in nature (the underlying
storage manager is the Sleepycat BDB database
manager)
- hosted on a Linux server cluster
- structure search file is not stored in the SQL
database, but there is an automatic
synchronization and update mechanism
- Some data, such as Lipinski filter criteria, are
stored in both databases
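For readers unfamiliar with the Lipinski filter mentioned above, the sketch below applies the commonly cited "rule of five" thresholds to precomputed property values. How PubChem itself stores or evaluates these criteria is not described on the slide, so the function names, the one-violation tolerance, and the sample values are assumptions for illustration.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """List which of the four rule-of-five thresholds a compound exceeds."""
    checks = [
        (mol_weight > 500, "molecular weight > 500"),
        (logp > 5, "logP > 5"),
        (h_donors > 5, "H-bond donors > 5"),
        (h_acceptors > 10, "H-bond acceptors > 10"),
    ]
    return [label for failed, label in checks if failed]


def passes_lipinski(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    """Assumed convention: a compound passes with at most one violation."""
    violations = rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors)
    return len(violations) <= max_violations


# Hypothetical property values, not taken from any database record.
print(rule_of_five_violations(300.0, 2.0, 1, 4))
print(rule_of_five_violations(650.0, 6.0, 1, 4))
print(passes_lipinski(650.0, 6.0, 1, 4))
```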
31 PubChem Programming Utilities
- Entrez Programming Utilities
- http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
- CACTVS chemoinformatics toolkit
- a full ASN.1 parser for CACTVS that understands
the full data spec for structures and assay data
- modules for talking to the Entrez database for
accessing structure blobs and some other NCBI
systems
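As a rough illustration of the Entrez Programming Utilities, the sketch below builds an ESearch request URL for the PubChem Compound database and parses the ID list from a response. The base URL and the `pccompound` database name reflect the current E-utilities service as far as is known and may differ from what was deployed when these slides were written. The sample response and its IDs are made up, and no network request is issued.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed current E-utilities endpoint (not stated on the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="pccompound", retmax=20):
    """Build an ESearch URL; db could also be pcsubstance or pcassay."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_BASE + "esearch.fcgi?" + query


def parse_id_list(xml_text):
    """Extract the integer IDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [int(node.text) for node in root.findall("./IdList/Id")]


# Made-up response body in the general shape ESearch returns.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList>"
    "</eSearchResult>"
)

print(esearch_url("rofecoxib"))
print(parse_id_list(SAMPLE_RESPONSE))
```

Fetching the URL with any HTTP client and passing the body to `parse_id_list` would complete the round trip.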
32 PubChem Data Deposition
- PubChem Deposition Gateway
- http://pubchem.ncbi.nlm.nih.gov/deposit/deposit.cgi
33 PubChem Sketcher
- No need to worry about the type of structure
definition displayed in the top line
- uses a hidden internal representation to transfer
the information
- http://pubchem.ncbi.nlm.nih.gov/search/
34 InChI, The IUPAC International Chemical
Identifier
- Official site: http://www.iupac.org/inchi/
- Unofficial InChI FAQ
- http://wwmm.ch.cam.ac.uk/inchifaq/
- WSDL InChI server at
- http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere
35 Searching InChIs
- Sample search
- "InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)15-11-21-17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H3"
- Must include the quotation marks
- no carriage return or line feed in the string
- InChI code for C60 fullerene
- InChI=1/C60/c1-2-5-6-3(1)8-12-10-4(1)9-11-7(2)17-21-13(5)23-24-14(6)22-18(8)28-20(12)30-26-16(10)15(9)25-29-19(11)27(17)37-41-31(21)33(23)43-44-34(24)32(22)42-38(28)48-40(30)46-36(26)35(25)45-39(29)47(37)55-49(41)51(43)57-52(44)50(42)56(48)59-54(46)53(45)58(55)60(57)59
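The two search requirements on this slide (quotation marks, no carriage return or line feed) are easy to violate when an InChI is copied from wrapped text. The helper below is a minimal sketch of cleaning such a string before searching. The function names are hypothetical, and it assumes identifiers begin with the `InChI=` prefix.

```python
def normalize_inchi(pasted):
    """Remove all whitespace, including CR/LF left over from wrapped text."""
    cleaned = "".join(pasted.split())
    if not cleaned.startswith("InChI="):
        raise ValueError("not an InChI string: missing 'InChI=' prefix")
    return cleaned


def quoted_query(pasted):
    """Return the cleaned InChI wrapped in the required quotation marks."""
    return '"' + normalize_inchi(pasted) + '"'


def formula_layer(inchi):
    """Return the molecular formula, the layer after the version prefix."""
    layers = inchi.split("/")
    return layers[1] if len(layers) > 1 else ""


# The sample search string, broken across lines as it might be when pasted.
wrapped = (
    "InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)1\n"
    "5-11-21-17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H\r\n"
    "3"
)

print(quoted_query(wrapped))
print(formula_layer(normalize_inchi(wrapped)))
```

Stripping whitespace is safe here because a valid InChI string contains none.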
36 ACD Labs and InChIs
- Transferring structures from PubChem to
ACD/ChemSketch
- http://www.acdlabs.com/download/technotes/90/draw_db/pubchem.pdf
37 InChI Support in BKChem
- BKChem: a free chemical drawing program
- Successfully reads most InChIs
- http://bkchem.zirael.org/inchi_en.html
38 InChI
- PubChem sketcher also supports generation of
InChI strings
- http://pubchem.ncbi.nlm.nih.gov/edit/
- change the format selector to "InChI"
39 Protein Data Bank (PDB) Data Dictionaries
- develop software and data definitions to support
the structural genomics efforts
- enable high-throughput data deposition
- data dictionaries define items at the level of
detail of the materials and methods section of a
journal
- uses macromolecular Crystallographic Information
File (mmCIF) data dictionaries
- http://mmcif.pdb.org/index.html
40 Translate WSDL to Human Readable Form
- http://soapclient.com/soaptest.html
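A similar summary can also be produced locally. The sketch below parses a WSDL 1.1 document with Python's standard library and lists each operation with its input and output messages. The embedded service definition is entirely hypothetical and is not the actual InChI server's WSDL.

```python
import xml.etree.ElementTree as ET

# WSDL 1.1 namespace, as defined by the WSDL specification.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Hypothetical, minimal service description used only for this example.
SAMPLE_WSDL = """<definitions name="HypotheticalInChIService"
    targetNamespace="urn:example:inchi"
    xmlns:tns="urn:example:inchi"
    xmlns="http://schemas.xmlsoap.org/wsdl/">
  <message name="GetInChIRequest"/>
  <message name="GetInChIResponse"/>
  <portType name="InChIPort">
    <operation name="getInChI">
      <input message="tns:GetInChIRequest"/>
      <output message="tns:GetInChIResponse"/>
    </operation>
  </portType>
</definitions>"""


def message_name(node):
    """Return the message name without its namespace prefix."""
    if node is None:
        return "(none)"
    return node.get("message", "(none)").split(":")[-1]


def summarize_wsdl(wsdl_text):
    """List 'portType.operation: input -> output' for every operation."""
    root = ET.fromstring(wsdl_text)
    lines = []
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        for op in port_type.findall("wsdl:operation", WSDL_NS):
            request = message_name(op.find("wsdl:input", WSDL_NS))
            response = message_name(op.find("wsdl:output", WSDL_NS))
            lines.append(
                f"{port_type.get('name')}.{op.get('name')}: "
                f"{request} -> {response}"
            )
    return lines


for line in summarize_wsdl(SAMPLE_WSDL):
    print(line)
```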