E-Chemistry and Web 2.0

1
E-Chemistry and Web 2.0

Marlon Pierce
mpierce_at_cs.indiana.edu
Community Grids Lab
Indiana University

2
One Talk, Two Projects

NIH funded Chemical Informatics and
Cyberinfrastructure Collaboratory.
Geoffrey Fox, Gary Wiggins, Rajarshi Guha, David
Wild, Mookie Baik, Kevin Gilbert
Proposed Microsoft-Funded Project
Carl Lagoze (Cornell), Lee Giles (PSU), Steve
Bryant (NIH), Jeremy Frey (Soton), Peter
Murray-Rust (Cambridge), Herbert Van de Sompel
(Los Alamos), Geoffrey Fox (Indiana)

3
CICC Project Summary

Creating a comprehensive, easily accessible
infrastructure for chemoinformatics tools and
data sources, linked with NIH PubChem and made
available as web services, and partnering with
screening centers and other users to demonstrate
how this infrastructure can be usefully applied
Infrastructure can include any tools, not just
ours (commercial/open source, chemoinformatics,
bioinformatics, and so on)
New, custom applications can be built quickly
using existing services in a similar way to
Google Maps and other web 2.0 resources
Develop education program to foster chemical
informatics as an academic discipline.
Field is dominated by big pharma
Need to train students, provide open research
environment.

4
Chemical Informatics and Cyberinfrastructure
Collaboratory Funded by the National Institutes
of Health www.chembiogrid.org
CICC
CICC Combines Grid Computing with Chemical
Informatics
Large Scale Computing Challenges
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data
deluge of publicly available data on potential
new drugs.
Chemical Informatics is a non-traditional area of
high performance computing, but many new,
challenging problems may be investigated.
NIH PubMed DataBase
OSCAR Text Analysis
Toxicity Filtering
Cluster Grouping
Docking
Initial 3D Structure Calculation
OSCAR-mined molecular signatures can be
clustered, filtered for toxicity, and docked onto
larger proteins. These are classic pleasingly
parallel tasks. Top-ranking docked molecules
can be further examined for drug potential.
Chemical informatics text analysis programs can
process 100,000s of abstracts of online
journal articles to extract chemical signatures
of potential drugs.
Molecular Mechanics Calculations
Big Red (and the TeraGrid) will also enable us to
perform time consuming, multi-stepped Quantum
Chemistry calculations on all of PubMed. Results
go back to public databases that are freely
accessible by the scientific community.

CICC supports the NIH mission by combining state
of the art chemical informatics techniques with
World class high performance computing
National-scale computing resources (TeraGrid)
Internet-standard web services
International activities for service
orchestration
Open distributed computing infrastructure for
scientists world wide

NIH PubChem DataBase
Quantum Mechanics Calculations
IU's Varuna DataBase
POVRay Parallel Rendering
Indiana University Department of Chemistry,
School of Informatics, and Pervasive Technology
Laboratories
5
CICC Infrastructure Vision

Drug Discovery and other academic chemistry and
pharmacology research will be aided by powerful
modern information technology
CICC is set up as distributed cyberinfrastructure
in the eScience model
Web clients (user interfaces) to distributed
databases, results of high throughput screening
instruments, results of computational chemical
simulations and other analyses.
Aggregated into portals
Web services manipulate this data and are
combined into workflows.
CICC includes access to PubChem, PubMed, PubMed
Central, the Internet and its derivatives like
Microsoft Academic Live and Google Scholar
The services include open-source software like
CDK, commercial code from vendors such as BCI,
OpenEye, Gaussian and Google, and any user
contributed programs

6
Services and Sample Clients

Our SOA philosophy: use standard Web services.
Mostly stateless
Some cluster, HPC work
Services are aggregate-able into different
workflows.
Taverna, Pipeline Pilot,
You can also build lots of clients.
See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources
for links and details.
Not so far from Web 2.0.

7
Sample Services
8
More Services
9
More Services.
10
More Services
11
More Services (Last)
12
Web Client Interfaces
13
More Clients
14
More Clients
15
Web Service Locations

Cambridge University
InChI generation / search
CMLRSS
OpenBabel

Indiana University
Clustering
VOTables
OSCAR3
Toxicity classification
Database services
Statistics services

VCC Laboratory
ALogPS

NCI
CSLS

University of Cologne
NMRShiftDB

16
Where Does The Functionality Come From?

University of Michigan
PkCell

Cambridge University
InChI generation / search
OSCAR

DigitalChemistry
BCI fingerprints
DivKMeans

gNova Consulting

NIH
PubChem
PubMed

CDK
Cheminformatics

European Chemicals Bureau
ToxTree toxicity predictions

OpenEye
Docking

R Foundation
R package

Indiana University
VOTables
NCI DTP predictions
Database services

17
MLSCN Post-HTS Biology Decision Support
Percent Inhibition or IC50 data is retrieved from
HTS
Grids can link data analysis (e.g. image
processing developed in existing Grids),
traditional cheminformatics tools, as well as
annotation tools (Semantic Web, del.icio.us) and
enhance lead ID and SAR analysis
A Grid of Grids linking collections of services at
PubChem, ECCR centers and MLSCN centers
Workflows encoding plate and control well
statistics, distribution analysis, etc.
Question: Was this screen successful?
Workflows encoding distribution analysis of
screening results
Question: What should the active/inactive cutoffs
be?

Question: What can we learn about the target
protein or cell line from this screen?
Workflows encoding statistical comparison of
results to similar screens, docking of compounds
into proteins to correlate binding with
activity, literature search of active compounds,
etc.
Compounds submitted to PubChem
CHEMINFORMATICS
PROCESS
GRIDS
18
Example: HTS workflow finding cell-protein
relationships
A protein implicated in tumor growth with known
ligand is selected (in this case HSP90 taken from
the PDB 1Y4 complex)
The screening data from a cellular HTS assay is
similarity searched for compounds with similar 2D
structures to the ligand.
Docking results and activity patterns fed into R
services for building of activity models and
correlations
LeastSquares Regression
RandomForests
NeuralNets
Similar structures are filtered for drugability,
are converted to 3D, and are automatically passed
to the OpenEye FRED docking program for docking
into the target protein.
Once docking is complete, the user visualizes the
high-scoring docked structures in a portlet using
the Jmol applet.
Similar structures to the ligand can be browsed
using client portlets.
19
Example: PubDock

Database of approximately 1 million PubChem
structures (the most drug-like) docked into
proteins taken from the PDB
Available as a web service, so structures can be
accessed in your own programs, or using workflow
tools like Pipeline Pilot
Several interfaces developed, including one based
on Chimera (right) which integrates the database
with the PDB to allow browsing of compounds in
different targets, or different compounds in the
same target
Can be used as a tool to help understand
molecular basis of activity in cellular or image
based assays

20
Example: R Statistics applied to PubChem data

By exposing the R statistical package, and the
Chemistry Development Kit (CDK) toolkit as web
services and integrating them with PubChem, we
can quickly and easily perform statistical
analysis and virtual screening of PubChem assay
data
Predictive models for particular screens are
exposed as web services, and can be used either
as simple web tools or integrated into other
applications
Example uses DTP Tumor Cell Line screens - a
predictive model using Random Forests in R makes
predictions of probability of activity across
multiple cell lines.

21
RSS Feeds

Provide access to databases via RSS feeds
Feeds include 2D/3D structures in CML
Viewable in Bioclipse, Jmol as well as Sage etc.
Two feeds currently available
SynSearch: get structures based on full or
partial chemical names
DockSearch: get best N structures for a target

22
R, CDK and PubChem

Goals
Access cheminformatics from within R
Access PubChem data from within R
The rcdk package allows cheminformatics within
R using CDK functionality
rpubchem provides access to PubChem compound data
and bioassay data
Searchable via assay ID, keywords
J. Stat. Soft, 2007, 18(6)

23
Databases

Most of our databases aim to add value to PubChem
or link into PubChem
We maintain a local mirror for testing, data
mining
3D structures (MMFF94)
Searchable by CID, SMARTS, 3D similarity
Docked ligands (FRED)
906K drug-like compounds docked into 7 targets
Will eventually cover 2000 targets

24
(Cheminformatics) Algorithm Development

Goals
Focus on interpretability and applicability
Devise novel approaches to clustering problems
Investigate the utility of low dimensional
representations for a variety of problems
Examples
Ensemble feature selection (JCIM, in press)
Cluster counting with R-NN curves (in revision)

25
Chemical Data Mining

Collaboration on screening data with Scripps, FL
Random forests (modeling and feature selection)
Naïve Bayes (modeling)
Identifying features indicative of toxicity
Domain applicability
NCI DTP Cell line activity predictions
Random forest models for 60 cell lines
All available as
downloadable R models
web services (supply SMILES, get prediction) with
web page clients

26
Mining information from journal articles

Until now, SciFinder / CAS was the only
chemistry-aware portal into journal information
We can access full text of journal articles
online (with subscription)
ACS does not make full text available but there
are ways round that!
RSC is now marking up with SMILES and GO/Goldbook
terms!
www.projectprospect.org
Having SMILES or InChI means that we can build a
similarity/structure searchable database of
papers, e.g. "find me all the papers published
since 2000 which contain a structure with 90%
similarity to this one"
In the absence of full text, we can at least use
the abstract
OSCAR3 - Murray Rust Group
A tool for shallow, chemistry-specific natural
language parsing of chemical documents (e.g.
journal articles).
It identifies (or attempts to identify):
Chemical names: singular nouns, plurals, verbs
etc., also formulae and acronyms.
Chemical data: Spectra, melting/boiling point,
yield etc. in experimental sections.
Other entities: Things like N(5)-C(3) and so on.
Part of the larger SciBorg effort
See http://www.cl.cam.ac.uk/aac10/escience/sciborg.html
http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3

27
E-Chemistry and Digital Libraries
28
E-Chemistry and Digital Libraries

Key problem with our SOA-based e-Science is
information management.
Where is the service?
What does it do?
We may also consider our data-centric services to
be digital libraries.
Data is diverse
Documents
Not just computational information like
structures.

29
Digital Libraries

Open Archives Initiative Object Reuse and
Exchange Project (OAI-ORE)
Developing standardized, interoperable, and
machine-readable mechanisms to express
information about compound information objects on
the web.
Graph-based representations

30
Security
31
(No Transcript)
32
(No Transcript)
33
More Information

Project Web Site: www.chembiogrid.org
Project Wiki: www.chembiogrid.org/wiki
Contact me: mpierce_at_cs.indiana.edu

34
(No Transcript)
35
R Web Services
36
Why?

Need access to math and stat functionality
Did not want to recode algorithms
Wanted latest methods
Needed a distributed approach to computation
Keep computation on a powerful machine
Access it from a smaller machine

37
Why R?

Free, open-source
Many cutting edge methods available
Flexible programming language
Interfaces with many languages
Python
Perl
Java
C

38
The R Server

R can be run as a remote compute server
Requires the Rserve package
Allows authenticated access over TCP/IP
Connections can maintain state
Client libraries for Java and C

39
R as a Web Service

On its own the R server is not a web service
We provide Java frontends to specific
functionalities
The frontend classes are hosted in a Tomcat web
container
Accessible via SOAP
Full Javadocs for all available WSs

40
Flowchart
41
Functionality

Two classes of functionality
General functions
Allows you to supply data and build a predictive
model
Sample from various distributions
Obtain scatter plots and histograms
Model development functions use a Java front-end
to encapsulate model specific information

42
Functionality

Two classes of functionality
Model deployment
Allows you to build a model outside of the
infrastructure
Place the final model in the infrastructure
Becomes available as a web service
Each model deployed requires its own front end
class
In general, these classes are identical - could
be autogenerated

43
Available Functionality

Predictive models - OLS, RF, CNN, LDA
Clustering - k-means
Statistical distributions
XY plot and scatter plots
Model deployment for single model types and
ensemble model types

44
Deployed Models

Since deployed models are visible as web services
we can build a simple web front end for them
Examples
NCI anti-cancer predictions
Ames mutagenicity predictions

45
Applications

The R WS is not restricted to atomic
functionality
Can write a whole R program
Load it on the R compute server
Provide a Java WS frontend
Examples
Feature selection
Automated model generation
Pharmacokinetic parameter calculation

46
Data Input/Output

Most modeling applications require data matrices
Depending on client language we can use
SOAP array of arrays (2D matrices)
SOAP array (1D vector form of a 2D matrix)
VOTables

47
Data Input/Output

Some R web services can take a URL to a VOTables
document
Conversion to R or Java matrices is done by a
local VOTables Java library
R also has basic support for VOTables directly
Ignores binary data streams

48
Interacting With R WSs

Traditional WSs do not maintain state
Predictive models are different
A model is built at one time
May be used for prediction at another time
Need to maintain state
State is maintained by serialization to R binary
files on the compute server
Clients deal with model IDs

49
Interacting with R WSs

Protocol
Send data to model WS
Get back model ID
Get various information via model ID
Fitted values
Training statistics
New predictions

50
Cheminformatics at Indiana University School of
Informatics

David J. Wild
djwild_at_indiana.edu
Associate Director of Chemical Informatics
Assistant Professor
Indiana University School of Informatics,
Bloomington
http://djwild.info

51
Cheminformatics education at Indiana

M.S. in Chemical Informatics
2 years, 36 semester hours
Includes a 6-hour capstone / research project
Opportunity to work in Laboratory Informatics
(IUPUI) or closely with Bioinformatics (IUB)
Currently 9 students enrolled
Ph.D. in Informatics, Cheminformatics Specialty
90 credit hours, including 30 hours dissertation
research. Usually 4 years.
Research rotations expose students to research in
related areas
Currently 4 students enrolled
Graduate Certificate
4 courses, all available by Distance Education
I571 Chemical Information Technology
I572 Computational Chemistry and Molecular Modeling
I573 Programming for Science Informatics
I553 Independent Study in Chemical Informatics
D.E. students pay in-state fees! ($800 per
class)
See http://cheminfo.informatics.indiana.edu for
more information, or a general review of
cheminformatics education in Drug Discovery Today
11 (9-10), May 2006, pp. 436-439

52
Distance Education for Cheminformatics

Uses Breeze teleconference for live sharing of
classes; all that is required is a PC and a
telephone. Optional Polycom videoconferencing.
Lectures are recorded for easy playback through a
web browser
Wiki or similar webpage for dissemination of
course materials
Also participate in CIC courseshare to give class
at University of Michigan
Of 75 students taking our courses since fall
2005, 39 have been D.E. students
See JCIM 2006 46(2) pp 495 - 502 for more
details

53
Current research in the Wild lab

Integration of cheminformatics tools and data
sources
A web service infrastructure for cheminformatics
Compound information aggregation web service
and interface ("by the way" box)
An enhanced chatbot for exploiting chemical
information web services
Semantically-aware workflow tools for
cheminformatics
Data mining the NIH DTP tumor cell line database
PubDock a docking database for PubChem
Aggregating life science information from web and
journal documents
Data mining semantically rich chemistry journal
articles
Document similarity based on chemical structure
similarity
Evaluating semantic markup of chemistry journal
articles
Integrating cheminformatics into the chemistry
lab
Integrating cheminformatics with the Second Life
virtual world
Integrating cheminformatics tools with electronic
lab notebooks
Usability of cheminformatics tools

54
Current research in the Guha lab

Predictive Modeling
Interpretation, validation, domain applicability
Generalization to other models such as docking,
pharmacophore etc
Integration of multiple data types
Addressing imbalanced and noisy datasets
Analysis of Chemical Spaces
Quantify distributions in spaces
Investigation of density approaches
Applications to lead hopping, model domains
Methods to summarize and compare data
Applications to HTS and smaller lead series type
datasets
Network models combining chemical structures and
biological systems
Software and infrastructure
Model exchange and annotation
Pharmacophore representations, matching
Toolkit development (CDK)

55
Cheminformatics web service infrastructure
Cheminformatics services
Docking (FRED)
3D structure generation (OMEGA)
Filtering (FRED, etc)
OSCAR3
Fingerprints (BCI, CDK)
Clustering (BCI)
Toxicity prediction (ToxTree)
R-based predictive models
Similarity calculations (CDK)
Descriptor calculation (CDK)
2D structure diagrams (CDK)

Database Services
PostgreSQL / gNova
PubChem mirror (augmented)
Pub3D - 3D structures for PubChem
PubDock - Bound 3D structures
Compound-indexed journal article DB
NIH Human Tumor Cell Line
Local PubChem mirror
VARUNA quantum chemistry database
Statistics (based on R)
Regression, LDA
Neural Nets, Random Forest
K-means clustering
Plotting
T-test and distribution sampling

Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy
Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey
C. Fox and David J. Wild, Web service
infrastructure for chemoinformatics, Journal of
Chemical Information and Modeling, 2007 47(4) pp
1303-1307
56
Tools and mashups based on web service
infrastructure
http://www.chembiogrid.org/projects/proj_tools.html
57
PubDock - database of docked PubChem Ligands

1 million PubChem compounds (drugable) docked
into PDB proteins (currently 7 but more coming -
hope to have 100 or so)
Multiple interfaces. This is really a
bioinformatics / chemoinformatics mashup
Retrieve top hits for a protein
Organize proteins by similarity between docking
profiles over compounds
Cluster compounds by docking profile across
targets
Uses many web services PDB services, our PubDock
database service, our CDK services etc
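One way to "organize proteins by similarity between docking profiles" is to treat each target's docking scores over a shared compound list as a vector and correlate the vectors pairwise. The sketch below uses Pearson correlation, which is an assumption; the slides do not name the similarity measure PubDock uses, and the function names and scores are illustrative only.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation between two equal-length score vectors."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_a * var_b)


def rank_target_pairs(
    profiles: Dict[str, List[float]]
) -> List[Tuple[str, str, float]]:
    """Return all target pairs, most similar docking profiles first.

    profiles maps a target name to its docking scores over the same
    ordered list of compounds.
    """
    names = sorted(profiles)
    pairs = [
        (p, q, pearson(profiles[p], profiles[q]))
        for i, p in enumerate(names)
        for q in names[i + 1:]
    ]
    return sorted(pairs, key=lambda item: item[2], reverse=True)
```

Transposing the input (one score vector per compound across targets) gives the compound-by-docking-profile view mentioned on the same slide.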

58
Mashup: What published compounds might bind to
this protein?
Create a database containing the text of all
recent PubMed abstracts (2006-2007, 500,000)
Use OSCAR to extract all of the chemical names
referred to in the abstracts and convert to SMILES
DATABASE SERVICE

DOCKING SERVICE
Convert molecules to 3D and dock into a protein
of interest
Visualize top docked molecules in a Google-like
interface
59
RSC Project Prospect - what can we do with the
information?

www.projectprospect.org
100 papers marked up with SMILES/InChI (using
OSCAR3), plus Gene Ontology and Goldbook Ontology
terms
Created similarity searchable PostgreSQL / gNova
database with paper DOIs, SMILES, and ontology
terms
Web service and simple HTML interfaces for
searching, e.g. "which papers reference compounds
similar to this one in the scope of these
ontological terms?"
Applying statistics to look at co-occurrence of
compounds, structural features (MACCS keys) and
ontological terms in papers
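The "papers that reference compounds similar to this one" query can be sketched with Tanimoto similarity over structural-key fingerprints. This is a minimal in-memory illustration, assuming fingerprints are represented as sets of on-bit positions; the real system uses a PostgreSQL / gNova database, and the function names and paper identifiers here are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def papers_with_similar_compound(
    query: Fingerprint,
    index: Dict[str, List[Fingerprint]],
    threshold: float = 0.9,
) -> List[Tuple[str, float]]:
    """Return (paper ID, best similarity) for papers with a compound above threshold.

    index maps a paper identifier (e.g. a DOI) to the fingerprints of the
    compounds marked up in that paper.
    """
    hits = []
    for paper_id, fingerprints in index.items():
        best = max((tanimoto(query, fp) for fp in fingerprints), default=0.0)
        if best >= threshold:
            hits.append((paper_id, best))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Restricting the search to papers tagged with particular ontology terms would amount to filtering `index` before the call.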

60
Greasemonkey / OSCAR script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp
61
"By the way" annotation (mock-up!)
By the way...
This compound is very similar to a prescription
drug, Tamoxifen.
This compound is referenced in 20 journal articles
published in the last 5 years
Similar compounds are associated with the words
"toxic" and "death" in 280 web pages
It appears to be covered under 3 patents
It has been shown to be active in 5 screens
Computer models predict it to show some activity
against 8 protein targets
Here are some comments on this compound:
David Wild: don't take any notice of the
computational models - they are rubbish
62
Cheminformatics aware simple lab notebook (mock
up!)
Plug-in allows structures to be drawn with the
pen and cleaned up
Some useful chemical reactions Iodoacetate a
Iodoacetamide I-CH4COO- ICH2CONH2 This
may also react, chem favored by alkaline pH .
Web service interface provides access
to computation and searching. Page is marked up
by what is possible
FIND INFO ABOUT THIS REACTION
Free text input can be converted to
machine readable form by electrovaya
Automatic detection of data fields (yield,
etc.) where possible
63
Automatic workflow generation and natural
language queries

Develop service ontology using OWL-S or similar
language
Allows service interoperability, replacement and
input/output compatibility
We can then use generic reasoning and network
analysis tools to find paths from inputs to
desired outputs
Natural language can be parsed to inputs and
desired outputs
Smart Clients Agents Services
Possible supercharged life science Google? -
e.g. type in "what compounds might bind to the
enclosed protein?"
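Finding "paths from inputs to desired outputs" can be sketched as a breadth-first search over services typed by what they consume and produce. This is a simplification under stated assumptions: each service takes a single input type, the service names and type labels are hypothetical, and a real OWL-S based reasoner would handle multi-input services such as docking (which also needs a protein).

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# service name -> (input data type, output data type)
ServiceTable = Dict[str, Tuple[str, str]]


def find_workflow(
    services: ServiceTable, have: str, want: str
) -> Optional[List[str]]:
    """Return the shortest chain of services turning `have` into `want`.

    Returns an empty list if no conversion is needed and None if no
    chain exists.
    """
    if have == want:
        return []
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        data_type, path = queue.popleft()
        for name, (source, target) in services.items():
            if source != data_type or target in seen:
                continue
            new_path = path + [name]
            if target == want:
                return new_path
            seen.add(target)
            queue.append((target, new_path))
    return None
```

With a table such as `{"name_to_smiles": ("chemical name", "2D structure"), "convert_2d_to_3d": ("2D structure", "3D structure"), "dock": ("3D structure", "docked complex")}`, asking for a path from "chemical name" to "docked complex" yields the three services in order.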

(Diagram: service graph connecting 2D structures,
3D structures, a 3D protein structure and 3D
complexes through a 2D structure crawler, 2D
similarity, 2D to 3D conversion, 3D search,
pharmacophore search and dock services)
