Title: E-Chemistry and Web 2.0
1E-Chemistry and Web 2.0
- Marlon Pierce
- mpierce_at_cs.indiana.edu
- Community Grids Lab
- Indiana University
2One Talk, Two Projects
- NIH-funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) at IU
- Geoffrey Fox
- Gary Wiggins
- Rajarshi Guha
- David Wild
- Mookie Baik
- Kevin Gilbert
- And others
- Proposed Microsoft-funded project: E-Chemistry
- Carl Lagoze (Cornell),
- Lee Giles (PSU),
- Steve Bryant (NIH),
- Jeremy Frey (Soton),
- Peter Murray-Rust (Cambridge),
- Herbert Van de Sompel (Los Alamos),
- Geoffrey Fox (Indiana)
- And others
3CICC Infrastructure Vision
- Chemical informatics: drug discovery and other academic chemistry, pharmacology, and bioinformatics research will be aided by powerful, modern, open information technology.
- NIH PubChem and PubMed provide unprecedented open, free data and information.
- We need a corresponding open service architecture (i.e. avoid stove-piped applications)
- CICC is set up as a distributed cyberinfrastructure in the eScience model
- Web clients (user interfaces) to distributed databases, results of high-throughput screening instruments, results of computational chemical simulations, and other analyses.
- Composed of clients to open service APIs (mash-ups)
- Aggregated into portals
- Web services manipulate this data and are combined into workflows.
- So our main agenda items: create interesting databases and build lots of Web services and clients.
4CICC Databases
- Most of our databases aim to add value to PubChem or link into PubChem
- 1D (SMILES) and 2D structures
- 3D structures (MMFF94)
- Searchable by CID, SMARTS, 3D similarity
- Docked ligands (FRED, Autodock)
- 906K drug-like compounds into 7 ligands
- Will eventually cover 2000 targets
- Philosophy: we have big computers, so let's calculate everything ahead of time and put the results in a DB.
5Building Up the Infrastructure
- Our SOA philosophy: use standard Web services.
- Mostly stateless
- Some cluster, HPC work needed, but these populate databases
- Services are aggregate-able into different workflows.
- Taverna, Pipeline Pilot
- You can also build lots of Web clients.
- See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details.
- Not so far from Web 2.0.
6Sample Services
11Web Client Interfaces
12More Clients
13More Clients
14Example PubDock
- Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB
- Available as a web service, so structures can be accessed in your own programs or using workflow tools like Pipeline Pilot
- Several interfaces developed, including one based on Chimera (right) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
- Can be used as a tool to help understand the molecular basis of activity in cellular or image-based assays
15Example R Statistics applied to PubChem data
- By exposing the R statistical package and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data
- Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications
- Example uses DTP Tumor Cell Line screens: a predictive model using Random Forests in R predicts the probability of activity across multiple cell lines.
16Example assay screening workflow finding
cell-protein relationships
- A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).
- The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand.
- Docking results and activity patterns are fed into R services for building of activity models and correlations (least-squares regression, random forests, neural nets).
- Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein.
- Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.
- Structures similar to the ligand can be browsed using client portlets.
17Relevance to Web 2.0
- Some Web 2.0 Key Features
- REST Services
- Use of RSS/Atom feeds
- Client interfaces are mashups
- Gadgets, widgets for portals aggregate clients
- So
- We provide RSS as an alternative WS format.
- We have experimented with RSS feeds, using Yahoo Pipes to manipulate multiple feeds.
- CICC Web interfaces can be easily wrapped as universal gadgets in iGoogle, Netvibes.
- Alternative to classic science gateways.
18RSS Feeds/REST Services
- Provide access to DB's via RSS feeds
- Feeds include 2D/3D structures in CML
- Viewable in Bioclipse, Jmol as well as Sage etc.
- Two feeds currently available
- SynSearch: get structures based on full or partial chemical names
- DockSearch: get best N structures for a target
- Really hampered by size of DB and Postgres performance.
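A client for such a feed could look roughly like the sketch below. It is only an illustration: the feed document, item titles, and the way the CML fragment is embedded in each item are assumptions made here, not the actual SynSearch or DockSearch output format. It uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# CML namespace URI; the embedding of CML inside <item> is an assumption.
CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical feed content, standing in for a real DockSearch response.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:cml="http://www.xml-cml.org/schema">
  <channel>
    <title>DockSearch results (hypothetical)</title>
    <item>
      <title>CID 0001</title>
      <cml:molecule id="m1">
        <cml:atomArray>
          <cml:atom id="a1" elementType="C" x3="0.0" y3="0.0" z3="0.0"/>
          <cml:atom id="a2" elementType="O" x3="1.2" y3="0.0" z3="0.0"/>
        </cml:atomArray>
      </cml:molecule>
    </item>
  </channel>
</rss>"""


def read_structures(feed_xml):
    """Return a list of (item title, [(element, x, y, z), ...]) from a feed string."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title")
        atoms = [
            (a.get("elementType"), float(a.get("x3")), float(a.get("y3")), float(a.get("z3")))
            for a in item.iter("{%s}atom" % CML_NS)
        ]
        results.append((title, atoms))
    return results


if __name__ == "__main__":
    for title, atoms in read_structures(SAMPLE_FEED):
        print(title, atoms)
```

In practice the feed text would be fetched over HTTP from the service URL rather than read from a string.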
19Tools and mashups based on web service
infrastructure
http://www.chembiogrid.org/projects/proj_tools.html
20Mining information from journal articles
- Until now, SciFinder / CAS has been the only chemistry-aware portal into journal information
- We can access full text of journal articles online (with subscription)
- ACS does not make full text available, but there are ways round that!
- RSC is now marking up with SMILES and GO/Goldbook terms!
- www.projectprospect.org
- Having SMILES or InChI means that we can build a similarity/structure searchable database of papers, e.g. find me all the papers published since 2000 which contain a structure with >90% similarity to this one
- In the absence of full text, we can at least use the abstract
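The similarity query described above can be sketched as follows. This is a minimal illustration under stated assumptions: each structure is assumed to be already reduced to a fingerprint (here a set of integer feature bits), and the Tanimoto coefficient is used as the similarity measure. The slides do not say which fingerprint or measure is used, and the DOIs, years, and fingerprints below are made up.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical index: (paper DOI, publication year, fingerprint of a structure in it).
PAPER_INDEX = [
    ("10.0000/example.1", 1998, frozenset(range(0, 20))),
    ("10.0000/example.2", 2004, frozenset(range(0, 19))),
    ("10.0000/example.3", 2006, frozenset(range(10, 30))),
]


def similar_papers(query_fp, since_year, threshold=0.9):
    """Papers from since_year onward containing a structure above the threshold."""
    hits = []
    for doi, year, fp in PAPER_INDEX:
        score = tanimoto(query_fp, fp)
        if year >= since_year and score > threshold:
            hits.append((doi, score))
    # Most similar first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    query = frozenset(range(0, 20))
    print(similar_papers(query, since_year=2000))
```

A real deployment would push this comparison into the database (for example through a chemistry cartridge) rather than scanning every record in Python.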
21Text Mining OSCAR
- A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles).
- It identifies (or attempts to identify)
- Chemical names: singular nouns, plurals, verbs, etc., also formulae and acronyms.
- Chemical data: spectra, melting/boiling point, yield, etc. in experimental sections.
- Other entities: things like N(5)-C(3) and so on.
- Part of the larger SciBorg effort
- See http://www.cl.cam.ac.uk/aac10/escience/sciborg.html
- http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
22Mash-Up What published compounds might bind to
this protein?
Create a database containing the text of all recent PubMed abstracts (2006-2007, 500,000)
Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES
DATABASE SERVICE
DOCKING SERVICE
Convert molecules to 3D and dock into a protein
of interest
Visualize top docked molecules in a Google-like
interface
23E-Chemistry and Digital Libraries
- We can't wait to get started.
24E-Chemistry and Digital Libraries
- Key problem with our SOA-based e-Science is information management.
- Where is the service that I need?
- What does it do?
- We may consider our data-centric services to be digital libraries.
- Data is diverse
- Documents
- Not just computational information like structures.
- Another point of view: how can I link together publications, results, workflows, etc.?
- That is, I need to manage digital documents.
25Digital Libraries
- Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)
- Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.
- Graph-based representations of connected digital objects.
- Objects may be encoded in (for example) RDF or XML,
- Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol)
- Obtain, harvest, and register
28Challenges for E-Chemistry
- Can digital library principles be applied to data as well as documents?
- Can you link your workflow to your conference paper?
- Can we engineer a publishing framework and message formats around Web 2.0 principles?
- REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats
- Can we do this securely?
- Access control, provenance, identity federation are key problems.
30More Information
- Project Web Site: www.chembiogrid.org
- Project Wiki: www.chembiogrid.org/wiki
- Contact me: mpierce_at_cs.indiana.edu
32Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC
CICC Combines Grid Computing with Chemical
Informatics
Large Scale Computing Challenges
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data
deluge of publicly available data on potential
new drugs.
Chemical Informatics is a non-traditional area of
high performance computing, but many new,
challenging problems may be investigated.
NIH PubMed DataBase
OSCAR Text Analysis
Toxicity Filtering
Cluster Grouping
Docking
Initial 3D Structure Calculation
OSCAR-mined molecular signatures can be
clustered, filtered for toxicity, and docked onto
larger proteins. These are classic pleasingly
parallel tasks. Top-ranking docked molecules
can be further examined for drug potential.
Chemical informatics text analysis programs can
process 100,000s of abstracts of online
journal articles to extract chemical signatures
of potential drugs.
Molecular Mechanics Calculations
Big Red (and the TeraGrid) will also enable us to
perform time consuming, multi-stepped Quantum
Chemistry calculations on all of PubMed. Results
go back to public databases that are freely
accessible by the scientific community.
- CICC supports the NIH mission by combining state-of-the-art chemical informatics techniques with
- World-class high performance computing
- National-scale computing resources (TeraGrid)
- Internet-standard web services
- International activities for service orchestration
- Open distributed computing infrastructure for scientists worldwide
NIH PubChem DataBase
Quantum Mechanics Calculations
IU's Varuna DataBase
POVRay Parallel Rendering
Indiana University Department of Chemistry,
School of Informatics, and Pervasive Technology
Laboratories
33MLSCN Post-HTS Biology Decision Support
- Percent inhibition or IC50 data is retrieved from HTS.
- Question: Was this screen successful?
- Workflows encoding plate control well statistics, distribution analysis, etc.
- Question: What should the active/inactive cutoffs be?
- Workflows encoding distribution analysis of screening results
- Question: What can we learn about the target protein or cell line from this screen?
- Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
- Compounds submitted to PubChem
- Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis.
- A Grid of Grids linking collections of services at PubChem, ECCR centers, and MLSCN centers
- Diagram labels: CHEMINFORMATICS, PROCESS, GRIDS
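One way a "Was this screen successful?" check could be computed from plate control wells is the Z'-factor, a commonly used HTS quality statistic. The slide does not name the statistic its workflows use, so the choice of Z'-factor, the 0.5 cutoff, and the well values below are all assumptions for illustration.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    window = abs(mean(positive_controls) - mean(negative_controls))
    if window == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1 - spread / window


def screen_ok(positive_controls, negative_controls, cutoff=0.5):
    """Assumed rule of thumb: treat the plate as usable when Z' >= cutoff."""
    return z_prime(positive_controls, negative_controls) >= cutoff


if __name__ == "__main__":
    pos = [100, 98, 102]   # hypothetical percent-inhibition values, positive control wells
    neg = [0, 2, -2]       # hypothetical values, negative control wells
    print(z_prime(pos, neg), screen_ok(pos, neg))
```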
34R Web Services
35Why?
- Need access to math and stat functionality
- Did not want to recode algorithms
- Wanted latest methods
- Needed a distributed approach to computation
- Keep computation on a powerful machine
- Access it from a smaller machine
36Why R?
- Free, open-source
- Many cutting-edge methods available
- Flexible programming language
- Interfaces with many languages
- Python
- Perl
- Java
- C
37The R Server
- R can be run as a remote compute server
- Requires the Rserve package
- Allows authenticated access over TCP/IP
- Connections can maintain state
- Client libraries for Java and C
38R as a Web Service
- On its own, the R server is not a web service
- We provide Java frontends to specific functionalities
- The frontend classes are hosted in a Tomcat web container
- Accessible via SOAP
- Full Javadocs for all available WSs
39Flowchart
40Functionality
- Two classes of functionality
- General functions
- Allows you to supply data and build a predictive model
- Sample from various distributions
- Obtain scatter plots and histograms
- Model development functions use a Java front-end to encapsulate model-specific information
41Functionality
- Two classes of functionality
- Model deployment
- Allows you to build a model outside of the infrastructure
- Place the final model in the infrastructure
- Becomes available as a web service
- Each model deployed requires its own front end class
- In general, these classes are identical - could be autogenerated
42Available Functionality
- Predictive models - OLS, RF, CNN, LDA
- Clustering - k-means
- Statistical distributions
- XY plot and scatter plots
- Model deployment for single model types and
ensemble model types
43Deployed Models
- Since deployed models are visible as web services, we can build a simple web front end for them
- Examples
- NCI anti-cancer predictions
- Ames mutagenicity predictions
44Applications
- The R WS is not restricted to atomic functionality
- Can write a whole R program
- Load it on the R compute server
- Provide a Java WS frontend
- Examples
- Feature selection
- Automated model generation
- Pharmacokinetic parameter calculation
45Data Input/Output
- Most modeling applications require data matrices
- Depending on client language we can use
- SOAP array of arrays (2D matrices)
- SOAP array (1D vector form of a 2D matrix)
- VOTables
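The second option above, sending a 2D matrix as a 1D vector, needs the client and service to agree on the dimensions and the element order. The sketch below assumes row-major order and that the row and column counts travel alongside the vector; the slides do not specify the convention the CICC services use (R itself stores matrices column-major), so treat both as assumptions.

```python
def flatten(matrix):
    """Return (values, nrow, ncol) with values laid out in row-major order."""
    nrow = len(matrix)
    ncol = len(matrix[0]) if nrow else 0
    if any(len(row) != ncol for row in matrix):
        raise ValueError("all rows must have the same length")
    return [value for row in matrix for value in row], nrow, ncol


def unflatten(values, nrow, ncol):
    """Rebuild the 2D matrix from a row-major vector and its dimensions."""
    if len(values) != nrow * ncol:
        raise ValueError("vector length does not match nrow * ncol")
    return [values[r * ncol:(r + 1) * ncol] for r in range(nrow)]


if __name__ == "__main__":
    descriptors = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    vector, nrow, ncol = flatten(descriptors)
    print(vector, nrow, ncol)
    print(unflatten(vector, nrow, ncol) == descriptors)
```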
46Data Input/Output
- Some R web services can take a URL to a VOTables document
- Conversion to R or Java matrices is done by a local VOTables Java library
- R also has basic support for VOTables directly
- Ignores binary data streams
47Interacting With R WSs
- Traditional WSs do not maintain state
- Predictive models are different
- A model is built at one time
- May be used for prediction at another time
- Need to maintain state
- State is maintained by serialization to R binary files on the compute server
- Clients deal with model IDs
48Interacting with R WSs
- Protocol
- Send data to model WS
- Get back model ID
- Get various information via model ID
- Fitted values
- Training statistics
- New predictions
49Cheminformatics at Indiana University School of
Informatics
- David J. Wild
- djwild_at_indiana.edu
- Associate Director of Chemical Informatics, Assistant Professor
- Indiana University School of Informatics, Bloomington
- http://djwild.info
50Cheminformatics education at Indiana
- M.S. in Chemical Informatics
- 2 years, 36 semester hours
- Includes a 6-hour capstone / research project
- Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)
- Currently 9 students enrolled
- Ph.D. in Informatics, Cheminformatics Specialty
- 90 credit hours, including 30 hours dissertation research. Usually 4 years.
- Research rotations expose students to research in related areas
- Currently 4 students enrolled
- Graduate Certificate
- 4 courses, all available by Distance Education
- I571 Chemical Information Technology
- I572 Computational Chemistry and Molecular Modeling
- I573 Programming for Science Informatics
- I553 Independent Study in Chemical Informatics
- D.E. students pay in-state fees! (800 per class)
- See http://cheminfo.informatics.indiana.edu for more information, or a general review of cheminformatics education in Drug Discovery Today 11, 910 (May 2006), pp436-439
51Distance Education for Cheminformatics
- Uses Breeze teleconference for live sharing of classes; all that is required is a P.C. and a telephone. Optional Polycom videoconferencing.
- Lectures are recorded for easy playback through a web browser
- Wiki or similar webpage for dissemination of course materials
- Also participate in CIC courseshare to give class at University of Michigan
- Of 75 students taking our courses since fall 2005, 39 have been D.E. students
- See JCIM 2006 46(2) pp 495 - 502 for more details
52Current research in the Wild lab
- Integration of cheminformatics tools and data sources
- A web service infrastructure for cheminformatics
- Compound information aggregation web service and interface ("by the way" box)
- An enhanced chatbot for exploiting chemical information web services
- Semantically-aware workflow tools for cheminformatics
- Data mining the NIH DTP tumor cell line database
- PubDock: a docking database for PubChem
- Aggregating life science information from web and journal documents
- Data mining semantically rich chemistry journal articles
- Document similarity based on chemical structure similarity
- Evaluating semantic markup of chemistry journal articles
- Integrating cheminformatics into the chemistry lab
- Integrating cheminformatics with the Second Life virtual world
- Integrating cheminformatics tools with electronic lab notebooks
- Usability of cheminformatics tools
53Current research in the Guha lab
- Predictive Modeling
- Interpretation, validation, domain applicability
- Generalization to other models such as docking, pharmacophore, etc.
- Integration of multiple data types
- Addressing imbalanced and noisy datasets
- Analysis of Chemical Spaces
- Quantify distributions in spaces
- Investigation of density approaches
- Applications to lead hopping, model domains
- Methods to summarize and compare data
- Applications to HTS and smaller lead series type datasets
- Network models combining chemical structures and biological systems
- Software and infrastructure
- Model exchange and annotation
- Pharmacophore representations, matching
- Toolkit development (CDK)
54Cheminformatics web service infrastructure
- Cheminformatics services
- Docking (FRED)
- 3D structure generation (OMEGA)
- Filtering (FRED, etc)
- OSCAR3
- Fingerprints (BCI, CDK)
- Clustering (BCI)
- Toxicity prediction (ToxTree)
- R-based predictive models
- Similarity calculations (CDK)
- Descriptor calculation (CDK)
- 2D structure diagrams (CDK)
- Database Services
- PostgreSQL gNova
- PubChem mirror (augmented)
- Pub3D - 3D structures for PubChem
- PubDock - Bound 3D structures
- Compound-indexed journal article DB
- NIH Human Tumor Cell Line
- Local PubChem mirror
- VARUNA quantum chemistry database
- Statistics (based on R)
- Regression, LDA
- Neural Nets, Random Forest
- K-means clustering
- Plotting
- T-test and distribution sampling
Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service infrastructure for chemoinformatics, Journal of Chemical Information and Modeling, 2007, 47(4), pp 1303-1307
55RSC Project Prospect - what can we do with the
information?
- www.projectprospect.org
- >100 papers marked up with SMILES/InChI (using OSCAR3), plus Gene Ontology and Goldbook Ontology terms
- Created similarity searchable PostgreSQL / gNova database with paper DOIs, SMILES, and ontology terms
- Web service and simple HTML interfaces for searching: which papers reference compounds similar to this one in the scope of these ontological terms?
- Applying statistics to look at co-occurrence of compounds, structural features (MACCS keys) and ontological terms in papers
56Greasemonkey / OSCAR script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp
57By the way annotation (mock-up!)
By the way:
- This compound is very similar to a prescription drug, Tamoxifen.
- This compound is referenced in 20 journal articles published in the last 5 years.
- Similar compounds are associated with the words "toxic" and "death" in 280 web pages.
- It appears to be covered under 3 patents.
- It has been shown to be active in 5 screens.
- Computer models predict it to show some activity against 8 protein targets.
- Here are some comments on this compound: David Wild: "don't take any notice of the computational models - they are rubbish"
58Cheminformatics aware simple lab notebook (mock
up!)
Plug-in allows structures to be drawn with the
pen and cleaned up
Some useful chemical reactions Iodoacetate a
Iodoacetamide I-CH4COO- ICH2CONH2 This
may also react, chem favored by alkaline pH .
Web service interface provides access to computation and searching. Page is marked up by what is possible.
FIND INFO ABOUT THIS REACTION
Free text input can be converted to machine-readable form by electrovaya.
Automatic detection of data fields (yield, etc.) where possible
59Automatic workflow generation and natural
language queries
- Develop service ontology using OWL-S or similar language
- Allows service interoperability, replacement and input/output compatibility
- We can then use generic reasoning and network analysis tools to find paths from inputs to desired outputs
- Natural language can be parsed to inputs and desired outputs
- Smart Clients <--> Agents <--> Services
- Possible supercharged life science Google? e.g. type in "what compounds might bind to the enclosed protein?"
[Figure: example service graph. Services (2D similarity, 2D structure crawler, 2D -> 3D conversion, 3D search, pharmacophore search, dock) connect 2D structures, 3D structures, and a 3D protein structure to a result of docked 3D structure complexes.]
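Finding a path from available inputs to a desired output can be illustrated with a breadth-first search over services described by their input and output data types. This is a simplified sketch, not the OWL-S based reasoning the slide proposes: each service is assumed to take a single input type and produce a single output type, and the service names and type labels are hypothetical, loosely following the diagram.

```python
from collections import deque

# Hypothetical service descriptions: (service name, input type, output type).
SERVICES = [
    ("2d_similarity", "2D structures", "2D structures"),
    ("convert_2d_to_3d", "2D structures", "3D structures"),
    ("3d_search", "3D structures", "3D structures"),
    ("dock", "3D structures", "3D complexes"),
]


def find_workflow(have, want):
    """Return the shortest list of service names turning `have` into `want`, or None."""
    if have == want:
        return []
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        data_type, path = queue.popleft()
        for name, src, dst in SERVICES:
            if src != data_type or dst in seen:
                continue
            new_path = path + [name]
            if dst == want:
                return new_path
            seen.add(dst)
            queue.append((dst, new_path))
    return None


if __name__ == "__main__":
    print(find_workflow("2D structures", "3D complexes"))
```

A natural-language front end would only need to map a question onto the `have` and `want` types; services with several inputs (such as docking, which also needs a protein structure) would require a richer graph model than this sketch uses.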