Title: E-Chemistry and Web 2.0
1E-Chemistry and Web 2.0
- Marlon Pierce
- mpierce_at_cs.indiana.edu
- Community Grids Lab
- Indiana University
2One Talk, Two Projects
- NIH-funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC)
- Geoffrey Fox, Gary Wiggins, Rajarshi Guha, David Wild, Mookie Baik, Kevin Gilbert
- Proposed Microsoft-funded project
- Carl Lagoze (Cornell), Lee Giles (PSU), Steve Bryant (NIH), Jeremy Frey (Soton), Peter Murray-Rust (Cambridge), Herbert Van de Sompel (Los Alamos), Geoffrey Fox (Indiana)
3CICC Project Summary
- Creating a comprehensive, easily accessible infrastructure for chemoinformatics tools and data sources, linked with NIH PubChem and made available as web services, and partnering with screening centers and other users to demonstrate how this infrastructure can be usefully applied
- Infrastructure can include any tools, not just ours (commercial/open source, chemoinformatics, bioinformatics, and so on)
- New, custom applications can be built quickly using existing services, in a similar way to Google Maps and other Web 2.0 resources
- Develop an education program to foster chemical informatics as an academic discipline
- Field is dominated by big pharma
- Need to train students, provide an open research environment
4Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC
CICC Combines Grid Computing with Chemical
Informatics
Large Scale Computing Challenges
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data
deluge of publicly available data on potential
new drugs.
Chemical informatics is a non-traditional area of
high performance computing, but many new,
challenging problems may be investigated.
NIH PubMed Database
OSCAR Text Analysis
Toxicity Filtering
Cluster Grouping
Docking
Initial 3D Structure Calculation
OSCAR-mined molecular signatures can be
clustered, filtered for toxicity, and docked onto
larger proteins. These are classic pleasingly
parallel tasks. Top-ranking docked molecules
can be further examined for drug potential.
Chemical informatics text analysis programs can
process hundreds of thousands of abstracts of online
journal articles to extract chemical signatures
of potential drugs.
Molecular Mechanics Calculations
Big Red (and the TeraGrid) will also enable us to
perform time-consuming, multi-step quantum
chemistry calculations on all of PubMed. Results
go back to public databases that are freely
accessible by the scientific community.
- CICC supports the NIH mission by combining state-of-the-art chemical informatics techniques with
- World-class high performance computing
- National-scale computing resources (TeraGrid)
- Internet-standard web services
- International activities for service orchestration
- Open distributed computing infrastructure for scientists worldwide
NIH PubChem Database
Quantum Mechanics Calculations
IU's Varuna Database
POV-Ray Parallel Rendering
Indiana University Department of Chemistry,
School of Informatics, and Pervasive Technology
Laboratories
5CICC Infrastructure Vision
- Drug discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology
- CICC set up as distributed cyberinfrastructure in eScience model
- Web clients (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses
- Aggregated into portals
- Web services manipulate this data and are combined into workflows
- CICC includes access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar
- The services include open-source software like CDK, commercial code from vendors such as BCI, OpenEye, Gaussian and Google, and any user-contributed programs
6Services and Sample Clients
- Our SOA philosophy: use standard Web services.
- Mostly stateless
- Some cluster, HPC work
- Services are aggregate-able into different workflows.
- Taverna, Pipeline Pilot
- You can also build lots of clients.
- See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details.
- Not so far from Web 2.0.
7Sample Services
8More Services
9More Services
10More Services
11More Services (Last)
12Web Client Interfaces
13More Clients
14More Clients
15Web Service Locations
- Cambridge University
- InChI generation / search
- CMLRSS
- OpenBabel
- Indiana University
- Clustering
- VOTables
- OSCAR3
- Toxicity classification
- Database services
- Statistics services
- University of Cologne
- NMRShiftDB
16Where Does The Functionality Come From?
- University of Michigan
- PkCell
- Cambridge University
- InChI generation / search
- OSCAR
- DigitalChemistry
- BCI fingerprints
- DivKMeans
- gNova Consulting
- European Chemicals Bureau
- ToxTree toxicity predictions
- Indiana University
- VOTables
- NCI DTP predictions
- Database services
17MLSCN Post-HTS Biology Decision Support
Percent inhibition or IC50 data is retrieved from HTS
Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis
A Grid of Grids linking collections of services at PubChem, ECCR centers, MLSCN centers
Question: Was this screen successful?
Workflows encoding plate and control well statistics, distribution analysis, etc.
Question: What should the active/inactive cutoffs be?
Workflows encoding distribution analysis of screening results
Question: What can we learn about the target protein or cell line from this screen?
Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
Compounds submitted to PubChem
CHEMINFORMATICS
PROCESS
GRIDS
18Example HTS workflow: finding cell-protein relationships
A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)
The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.
Structures similar to the ligand can be browsed using client portlets.
Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
Docking results and activity patterns are fed into R services for building activity models and correlations:
Least Squares Regression
Random Forests
Neural Nets
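The 2D similarity search step in this workflow can be illustrated with a small sketch. It assumes fingerprints are available as sets of "on" bit positions and uses the Tanimoto coefficient as the similarity measure; the slides do not say which measure the CICC services use, and the compound names, bit sets, and 0.5 cutoff below are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(query, library, cutoff):
    """Return (name, score) pairs with score >= cutoff, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions that are set.
ligand_fp = {1, 4, 7, 9, 12}
screened = {
    "cmpd_A": {1, 4, 7, 9, 12},  # identical to the ligand
    "cmpd_B": {1, 4, 7, 20},     # shares 3 of 6 distinct bits
    "cmpd_C": {30, 31, 32},      # shares nothing
}

hits = similarity_search(ligand_fp, screened, cutoff=0.5)
for name, score in hits:
    print(name, round(score, 2))
```

Only the compounds that pass the cutoff would move on to the filtering, 3D conversion, and docking steps.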
19Example: PubDock
- Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB
- Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot
- Several interfaces developed, including one based on Chimera (right) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
- Can be used as a tool to help understand the molecular basis of activity in cellular or image-based assays
20Example: R Statistics applied to PubChem data
- By exposing the R statistical package and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data
- Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications
- Example uses DTP Tumor Cell Line screens: a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines.
21RSS Feeds
- Provide access to DBs via RSS feeds
- Feeds include 2D/3D structures in CML
- Viewable in Bioclipse, Jmol as well as Sage etc.
- Two feeds currently available
- SynSearch: get structures based on full or partial chemical names
- DockSearch: get best N structures for a target
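A client could read such a feed roughly as sketched below. The feed layout here is an assumption made for illustration (an RSS 2.0 item carrying an inline CML `molecule` element); the actual SynSearch and DockSearch feeds may be structured differently, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # CML schema namespace

# Hypothetical sample feed, not real service output.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>DockSearch (hypothetical sample)</title>
    <item>
      <title>Hit 1</title>
      <molecule xmlns="http://www.xml-cml.org/schema" id="m1">
        <atomArray>
          <atom id="a1" elementType="C"/>
          <atom id="a2" elementType="O"/>
        </atomArray>
      </molecule>
    </item>
  </channel>
</rss>"""


def molecules_in_feed(feed_xml):
    """Return (item title, molecule id, element symbols) for each CML molecule."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title")
        for mol in item.iter("{%s}molecule" % CML_NS):
            elements = [atom.get("elementType")
                        for atom in mol.iter("{%s}atom" % CML_NS)]
            results.append((title, mol.get("id"), elements))
    return results


for entry in molecules_in_feed(SAMPLE_FEED):
    print(entry)
```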
22R, CDK and PubChem
- Goals
- Access cheminformatics from within R
- Access PubChem data from within R
- The rcdk package allows cheminformatics to be done within R using CDK functionality
- rpubchem provides access to PubChem compound data and bioassay data
- Searchable via assay ID, keywords
- J. Stat. Soft., 2007, 18(6)
23Databases
- Most of our databases aim to add value to PubChem or link into PubChem
- We maintain a local mirror for testing, data mining
- 3D structures (MMFF94)
- Searchable by CID, SMARTS, 3D similarity
- Docked ligands (FRED)
- 906K drug-like compounds docked into 7 targets
- Will eventually cover 2000 targets
24(Cheminformatics) Algorithm Development
- Goals
- Focus on interpretability and applicability
- Devise novel approaches to clustering problems
- Investigate the utility of low dimensional representations for a variety of problems
- Examples
- Ensemble feature selection (JCIM, in press)
- Cluster counting with R-NN curves (in revision)
25Chemical Data Mining
- Collaboration on screening data with Scripps, FL
- Random forests (modeling and feature selection)
- Naïve Bayes (modeling)
- Identifying features indicative of toxicity
- Domain applicability
- NCI DTP Cell line activity predictions
- Random forest models for 60 cell lines
- All available as
- downloadable R models
- web services (supply SMILES, get prediction) with
web page clients
26Mining information from journal articles
- Until now, SciFinder / CAS was the only chemistry-aware portal into journal information
- We can access full text of journal articles online (with subscription)
- ACS does not make full text available, but there are ways round that!
- RSC is now marking up with SMILES and GO/Goldbook terms!
- www.projectprospect.org
- Having SMILES or InChI means that we can build a similarity/structure searchable database of papers, e.g. find me all the papers published since 2000 which contain a structure with 90% similarity to this one
- In the absence of full text, we can at least use the abstract
- OSCAR3 - Murray-Rust Group
- A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles).
- It identifies (or attempts to identify)
- Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
- Chemical data: spectra, melting/boiling point, yield etc. in experimental sections.
- Other entities: things like N(5)-C(3) and so on.
- Part of the larger SciBorg effort
- See http://www.cl.cam.ac.uk/aac10/escience/sciborg.html
- http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
27E-Chemistry and Digital Libraries
28E-Chemistry and Digital Libraries
- Key problem with our SOA-based e-Science is information management.
- Where is the service?
- What does it do?
- We may also consider our data-centric services to be digital libraries.
- Data is diverse
- Documents
- Not just computational information like structures.
29Digital Libraries
- Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)
- Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.
- Graph-based representations
30Security
33More Information
- Project Web Site www.chembiogrid.org
- Project Wiki www.chembiogrid.org/wiki
- Contact me mpierce_at_cs.indiana.edu
35R Web Services
36Why?
- Need access to math and stat functionality
- Did not want to recode algorithms
- Wanted latest methods
- Needed a distributed approach to computation
- Keep computation on a powerful machine
- Access it from a smaller machine
37Why R?
- Free, open-source
- Many cutting edge methods available
- Flexible programming language
- Interfaces with many languages
- Python
- Perl
- Java
- C
38The R Server
- R can be run as a remote compute server
- Requires the Rserve package
- Allows authenticated access over TCP/IP
- Connections can maintain state
- Client libraries for Java and C
39R as a Web Service
- On its own the R server is not a web service
- We provide Java frontends to specific functionalities
- The frontend classes are hosted in a Tomcat web container
- Accessible via SOAP
- Full Javadocs for all available WSs
40Flowchart
41Functionality
- Two classes of functionality
- General functions
- Allows you to supply data and build a predictive model
- Sample from various distributions
- Obtain scatter plots and histograms
- Model development functions use a Java front-end to encapsulate model specific information
42Functionality
- Two classes of functionality
- Model deployment
- Allows you to build a model outside of the infrastructure
- Place the final model in the infrastructure
- Becomes available as a web service
- Each model deployed requires its own front end class
- In general, these classes are identical and could be autogenerated
43Available Functionality
- Predictive models - OLS, RF, CNN, LDA
- Clustering - k-means
- Statistical distributions
- XY plot and scatter plots
- Model deployment for single model types and
ensemble model types
44Deployed Models
- Since deployed models are visible as web services we can build a simple web front end for them
- Examples
- NCI anti-cancer predictions
- Ames mutagenicity predictions
45Applications
- The R WS is not restricted to atomic functionality
- Can write a whole R program
- Load it on the R compute server
- Provide a Java WS frontend
- Examples
- Feature selection
- Automated model generation
- Pharmacokinetic parameter calculation
46Data Input/Output
- Most modeling applications require data matrices
- Depending on client language we can use
- SOAP array of arrays (2D matrices)
- SOAP array (1D vector form of a 2D matrix)
- VOTables
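The "1D vector form of a 2D matrix" option can be sketched as follows. This is an illustration only: it assumes row-major ordering with the shape sent alongside the vector, which the slides do not specify, and the function names are hypothetical. R's `matrix()` fills column by column by default, so client and service would need to agree on the ordering.

```python
def flatten(matrix):
    """Flatten a rectangular 2D list row by row; return (values, n_rows, n_cols)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("matrix rows must all have the same length")
    values = [value for row in matrix for value in row]
    return values, n_rows, n_cols


def unflatten(values, n_rows, n_cols):
    """Rebuild the 2D list from a row-major vector and its shape."""
    if len(values) != n_rows * n_cols:
        raise ValueError("vector length does not match the given shape")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Hypothetical descriptor matrix: 2 compounds by 3 descriptors.
descriptors = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
vector, n_rows, n_cols = flatten(descriptors)
restored = unflatten(vector, n_rows, n_cols)
print(vector, (n_rows, n_cols))
```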
47Data Input/Output
- Some R web services can take a URL to a VOTables document
- Conversion to R or Java matrices is done by a local VOTables Java library
- R also has basic support for VOTables directly
- Ignores binary data streams
48Interacting With R WSs
- Traditional WSs do not maintain state
- Predictive models are different
- A model is built at one time
- May be used for prediction at another time
- Need to maintain state
- State is maintained by serialization to R binary files on the compute server
- Clients deal with model IDs
49Interacting with R WSs
- Protocol
- Send data to model WS
- Get back model ID
- Get various information via model ID
- Fitted values
- Training statistics
- New predictions
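The model-ID protocol can be mimicked locally to show the idea. The sketch below is not the CICC implementation: it is a Python stand-in in which pickle files play the role of the R binary files on the compute server, a one-variable least-squares fit stands in for the real models, and the class and method names are hypothetical.

```python
import os
import pickle
import tempfile
import uuid


class ModelStore:
    """Toy stand-in for the compute server: models persist on disk between calls."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, model_id):
        return os.path.join(self.directory, model_id + ".pkl")

    def build(self, x, y):
        """Fit y = slope * x + intercept by least squares and return a model ID."""
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        fitted = [slope * xi + intercept for xi in x]
        rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
        model = {"slope": slope, "intercept": intercept,
                 "fitted": fitted, "rss": rss}
        model_id = uuid.uuid4().hex
        with open(self._path(model_id), "wb") as handle:
            pickle.dump(model, handle)  # serialized state lives on the server
        return model_id

    def _load(self, model_id):
        with open(self._path(model_id), "rb") as handle:
            return pickle.load(handle)

    def fitted_values(self, model_id):
        return self._load(model_id)["fitted"]

    def training_statistics(self, model_id):
        model = self._load(model_id)
        return {key: model[key] for key in ("slope", "intercept", "rss")}

    def predict(self, model_id, new_x):
        model = self._load(model_id)
        return [model["slope"] * xi + model["intercept"] for xi in new_x]


# Send data, get back an ID, then query by ID in later, independent calls.
store = ModelStore(tempfile.mkdtemp())
model_id = store.build([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predictions = store.predict(model_id, [5.0, 6.0])
print(model_id, predictions)
```

Because each call reloads the model from disk using only the ID, the service calls themselves stay stateless while the model persists between them.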
50Cheminformatics at Indiana University School of
Informatics
- David J. Wild
- djwild_at_indiana.edu
- Associate Director of Chemical Informatics and Assistant Professor
- Indiana University School of Informatics, Bloomington
- http://djwild.info
51Cheminformatics education at Indiana
- M.S. in Chemical Informatics
- 2 years, 36 semester hours
- Includes a 6-hour capstone / research project
- Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)
- Currently 9 students enrolled
- Ph.D. in Informatics, Cheminformatics Specialty
- 90 credit hours, including 30 hours dissertation research. Usually 4 years.
- Research rotations expose students to research in related areas
- Currently 4 students enrolled
- Graduate Certificate
- 4 courses, all available by Distance Education
- I571 Chemical Information Technology
- I572 Computational Chemistry and Molecular Modeling
- I573 Programming for Science Informatics
- I553 Independent Study in Chemical Informatics
- D.E. students pay in-state fees! (800 per class)
- See http://cheminfo.informatics.indiana.edu for more information, or a general review of cheminformatics education in Drug Discovery Today 11 (9-10) (May 2006), pp 436-439
52Distance Education for Cheminformatics
- Uses Breeze teleconference for live sharing of classes; all that is required is a P.C. and a telephone. Optional Polycom videoconferencing.
- Lectures are recorded for easy playback through a web browser
- Wiki or similar webpage for dissemination of course materials
- Also participate in CIC courseshare to give class at University of Michigan
- Of 75 students taking our courses since fall 2005, 39 have been D.E. students
- See JCIM 2006 46(2) pp 495-502 for more details
53Current research in the Wild lab
- Integration of cheminformatics tools and data sources
- A web service infrastructure for cheminformatics
- Compound information aggregation web service and interface ("by the way" box)
- An enhanced chatbot for exploiting chemical information web services
- A semantically-aware workflow tool for cheminformatics
- Data mining the NIH DTP tumor cell line database
- PubDock: a docking database for PubChem
- Aggregating life science information from web and journal documents
- Data mining semantically rich chemistry journal articles
- Document similarity based on chemical structure similarity
- Evaluating semantic markup of chemistry journal articles
- Integrating cheminformatics into the chemistry lab
- Integrating cheminformatics with the Second Life virtual world
- Integrating cheminformatics tools with electronic lab notebooks
- Usability of cheminformatics tools
54Current research in the Guha lab
- Predictive Modeling
- Interpretation, validation, domain applicability
- Generalization to other models such as docking, pharmacophore etc.
- Integration of multiple data types
- Addressing imbalanced and noisy datasets
- Analysis of Chemical Spaces
- Quantify distributions in spaces
- Investigation of density approaches
- Applications to lead hopping, model domains
- Methods to summarize and compare data
- Applications to HTS and smaller lead series type datasets
- Network models combining chemical structures and biological systems
- Software and infrastructure
- Model exchange and annotation
- Pharmacophore representations, matching
- Toolkit development (CDK)
55Cheminformatics web service infrastructure
- Cheminformatics services
- Docking (FRED)
- 3D structure generation (OMEGA)
- Filtering (FRED, etc)
- OSCAR3
- Fingerprints (BCI, CDK)
- Clustering (BCI)
- Toxicity prediction (ToxTree)
- R-based predictive models
- Similarity calculations (CDK)
- Descriptor calculation (CDK)
- 2D structure diagrams (CDK)
- Database Services
- PostgreSQL / gNova
- PubChem mirror (augmented)
- Pub3D - 3D structures for PubChem
- PubDock - Bound 3D structures
- Compound-indexed journal article DB
- NIH Human Tumor Cell Line
- Local PubChem mirror
- VARUNA quantum chemistry database
- Statistics (based on R)
- Regression, LDA
- Neural Nets, Random Forest
- K-means clustering
- Plotting
- T-test and distribution sampling
Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, "Web service infrastructure for chemoinformatics", Journal of Chemical Information and Modeling, 2007, 47(4), pp 1303-1307
56Tools and mashups based on web service
infrastructure
http://www.chembiogrid.org/projects/proj_tools.html
57PubDock - database of docked PubChem Ligands
- 1 million PubChem compounds (druggable) docked into PDB proteins (currently 7 but more coming; hope to have 100 or so)
- Multiple interfaces. This is really a bioinformatics / chemoinformatics mashup
- Retrieve top hits for a protein
- Organize proteins by similarity between docking profiles over compounds
- Cluster compounds by docking profile across targets
- Uses many web services: PDB services, our PubDock database service, our CDK services etc.
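One way to compare proteins by their docking profiles is sketched below. It assumes a profile is the vector of docking scores for the same compounds in the same order, and it uses Pearson correlation as the similarity measure; the slides do not name the measure used in PubDock, and the protein names and scores are invented for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length score vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Hypothetical docking scores: one list per protein, one entry per compound.
profiles = {
    "protein_1": [-9.1, -7.4, -5.0, -8.2],
    "protein_2": [-8.9, -7.0, -5.3, -8.0],
    "protein_3": [-5.1, -8.8, -9.0, -4.9],
}

names = sorted(profiles)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(profiles[first], profiles[second])
        print(first, second, round(r, 2))
```

Transposing the table (one score vector per compound, one entry per target) gives the input for clustering compounds by docking profile across targets.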
58MashUp: What published compounds might bind to this protein?
Create a database containing the text of all recent PubMed abstracts (2006-2007, 500,000)
Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES
DATABASE SERVICE
DOCKING SERVICE
Convert molecules to 3D and dock into a protein of interest
Visualize top docked molecules in a Google-like interface
59RSC Project Prospect - what can we do with the
information?
- www.projectprospect.org
- 100 papers marked up with SMILES/InChI (using OSCAR3), plus Gene Ontology and Goldbook Ontology terms
- Created similarity searchable PostgreSQL / gNova database with paper DOIs, SMILES, and ontology terms
- Web service and simple HTML interfaces for searching: which papers reference compounds similar to this one in the scope of these ontological terms?
- Applying statistics to look at co-occurrence of compounds, structural features (MACCS keys) and ontological terms in papers
60Greasemonkey / OSCAR script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp
61"By the way" annotation (mock-up!)
By the way: This compound is very similar to a prescription drug, Tamoxifen.
This compound is referenced in 20 journal articles published in the last 5 years
Similar compounds are associated with the words "toxic" and "death" in 280 web pages
It appears to be covered under 3 patents
It has been shown to be active in 5 screens
Computer models predict it to show some activity against 8 protein targets
Here are some comments on this compound:
David Wild: don't take any notice of the computational models - they are rubbish
62Cheminformatics aware simple lab notebook (mock
up!)
Plug-in allows structures to be drawn with the
pen and cleaned up
Some useful chemical reactions Iodoacetate a
Iodoacetamide I-CH4COO- ICH2CONH2 This
may also react, chem favored by alkaline pH .
Web service interface provides access to computation and searching. Page is marked up by what is possible
FIND INFO ABOUT THIS REACTION
Free text input can be converted to
machine readable form by electrovaya
Automatic detection of data fields (yield, etc.) where possible
63Automatic workflow generation and natural
language queries
- Develop service ontology using OWL-S or similar language
- Allows service interoperability, replacement and input/output compatibility
- We can then use generic reasoning and network analysis tools to find paths from inputs to desired outputs
- Natural language can be parsed to inputs and desired outputs
- Smart Clients Agents Services
- Possible "supercharged life science Google"? E.g. type in "what compounds might bind to the enclosed protein?"
[Diagram: example service graph. Services: 2D structure crawler, 2D similarity, 2D to 3D, 3D search, pharmacophore search, dock. Data passed between services: 2D structures, 3D structures, 3D protein structure, 3D structure complexes (result).]
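Finding a path from available inputs to a desired output can be sketched with simple forward chaining over typed services. This is a toy illustration, not an OWL-S reasoner: the service names and data types are loosely modeled on the labels in the diagram, and the input/output assignments are assumptions.

```python
def plan_workflow(services, available, goal):
    """Return an ordered list of service names that produces `goal`, or None.

    services maps a name to (set of required input types, output type).
    """
    known = set(available)
    producer = {}  # data type -> service that first produced it
    changed = True
    while changed and goal not in known:
        changed = False
        for name, (inputs, output) in services.items():
            if output not in known and inputs <= known:
                known.add(output)
                producer[output] = name
                changed = True
    if goal not in known:
        return None

    plan = []

    def require(data_type):
        # Walk back from the goal so only the services actually needed are kept.
        name = producer.get(data_type)
        if name is None or name in plan:
            return
        for needed in services[name][0]:
            require(needed)
        plan.append(name)

    require(goal)
    return plan


# Hypothetical service descriptions.
services = {
    "2D similarity search": ({"query structure"}, "2D structures"),
    "2D to 3D conversion": ({"2D structures"}, "3D structures"),
    "dock": ({"3D structures", "3D protein structure"}, "docked complexes"),
    "pharmacophore search": ({"pharmacophore"}, "3D structures"),
}

plan = plan_workflow(services,
                     {"query structure", "3D protein structure"},
                     "docked complexes")
print(plan)
```

A natural language front end would only need to map a question such as "what compounds might bind to the enclosed protein?" onto the available inputs and the goal type; the planner then selects and orders the services.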