Title: Ioerger Lab
1. Ioerger Lab: Bioinformatics Research
- Pattern recognition/machine learning
  - issues of representation
  - effect of feature extraction, weighting, and interaction on performance of induction algorithm
- Applications in Structural Biology
  - molecular basis of biology: protein structures
  - predicting structures
  - tools for solving structures (X-ray crystallography, NMR)
  - stability, folding, packing, motions
  - drug design (small-molecule inhibitors)
  - large datasets exist: exploit them, find the patterns
2. TEXTAL: Automated Crystallographic Protein Structure Determination Using Pattern Recognition
- Principal Investigators: Thomas Ioerger (Dept. Computer Science), James Sacchettini (Dept. Biochem/Biophys)
- Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee, Lalji Kanbi, Reetal Pai, Jacob Smith
- Funding: National Institutes of Health
- Texas A&M University
3. X-ray crystallography
- Most widely used method for protein modeling
- Steps:
  - Grow crystal
  - Collect diffraction data
  - Generate electron density map (Fourier transform)
  - Interpret map, i.e., infer atomic coordinates
  - Refine structure
- Model-building
  - Currently: crystallographers
  - Challenges: noise, resolution
  - Goal: automation
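
The "Generate electron density map (Fourier transform)" step can be illustrated with a one-dimensional toy Fourier synthesis. This is only a sketch under stated assumptions: point atoms, a unit cell of length 1, and made-up atom positions and scattering weights. Real maps are three-dimensional, and the phases of the structure factors have to be estimated rather than computed from known atoms.

```python
import cmath

# Hypothetical 1-D "crystal": (fractional position, scattering weight) per atom.
ATOMS = [(0.25, 6.0), (0.60, 8.0)]
H_MAX = 10      # highest reflection index kept; plays the role of resolution
N_GRID = 100    # number of grid points across the unit cell


def structure_factor(h, atoms):
    """F(h) = sum_j f_j * exp(2*pi*i*h*x_j) for point atoms."""
    return sum(f * cmath.exp(2j * cmath.pi * h * x) for x, f in atoms)


def density(x, factors):
    """rho(x) = sum_h F(h) * exp(-2*pi*i*h*x), for a unit cell of length 1."""
    total = sum(F * cmath.exp(-2j * cmath.pi * h * x) for h, F in factors.items())
    # Imaginary parts cancel because F(-h) is the complex conjugate of F(h).
    return total.real


# "Diffraction data": structure factors for reflections -H_MAX..H_MAX.
factors = {h: structure_factor(h, ATOMS) for h in range(-H_MAX, H_MAX + 1)}

# "Electron density map": the Fourier synthesis sampled on a grid.
rho = [density(i / N_GRID, factors) for i in range(N_GRID)]
peak = max(range(N_GRID), key=lambda i: rho[i])
print("highest density at x =", peak / N_GRID)
```

Lowering `H_MAX` broadens the peaks and adds ripples, which is a small-scale picture of why limited resolution makes a map harder to interpret.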
5. Overview of TEXTAL
- Automated model-building program
- Can we automate the kind of visual processing of patterns that crystallographers use?
- Intelligent methods to interpret density, despite noise
- Exploit knowledge about typical protein structure
- Focus on medium-resolution maps
  - optimized for 2.8 Å (actually, 2.6-3.2 Å is fine)
  - typical for MAD data (useful for high-throughput)
  - other programs exist for higher-res data (ARP/wARP)
(Diagram: Electron density map (or structure factors) -> TEXTAL -> Protein model (may need refinement))
6. (Flow diagram: TEXTAL pipeline)
Crystal -> Collect data -> Diffraction data -> Electron density map
CAPRA (models backbone): SCALE MAP -> TRACE MAP -> CALCULATE FEATURES -> PREDICT Cα's -> BUILD CHAINS -> PATCH & STITCH CHAINS -> REFINE CHAINS -> Model of backbone
LOOKUP (models side chains) -> Model of backbone & side chains
POST-PROCESSING: SEQUENCE ALIGNMENT -> REAL SPACE REFINEMENT -> Corrected & refined model
7. (Figure: regions of density and their feature vectors)
F=<1.72,-0.39,1.04,1.55...>
F=<1.58,0.18,1.09,-0.25...>
F=<0.90,0.65,-1.40,0.87...>
F=<1.79,-0.43,0.88,1.52...>
8. Examples of Numeric Density Features
- Distance from center-of-sphere to center-of-mass
- Moments of inertia: relative dispersion along orthogonal axes
- Geometric features like "spoke angles"
- Local variance and other statistics
Features are designed to be rotation-invariant, i.e., they take the same values for a region in any orientation/frame of reference. TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
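
A minimal sketch of how rotation-invariant features of this kind could be computed, not TEXTAL's actual feature code. It covers three illustrative features (distance to center of mass, a total second moment standing in for the per-axis moments of inertia, and local variance) rather than all 19, and the radii in `RADII`, the function names, and the sample points are assumptions made for the example.

```python
import math

# Assumed radii in angstroms; the slide only says "4 different radii".
RADII = (3.0, 4.0, 5.0, 6.0)
FEATURE_NAMES = ("com_distance", "spread", "variance")


def region_features(points, center, radius):
    """points: iterable of (x, y, z, rho) map samples. Returns features of the
    density inside a sphere of the given radius around center."""
    inside = [p for p in points if math.dist(p[:3], center) <= radius]
    mass = sum(p[3] for p in inside)
    if not inside or mass <= 0:
        return dict.fromkeys(FEATURE_NAMES, 0.0)
    # Density-weighted center of mass of the region.
    com = tuple(sum(p[i] * p[3] for p in inside) / mass for i in range(3))
    # Total second moment about the center of mass (trace of the inertia-like
    # tensor); per-axis moments would need its eigenvalues.
    spread = sum(p[3] * math.dist(p[:3], com) ** 2 for p in inside) / mass
    mean = mass / len(inside)
    variance = sum((p[3] - mean) ** 2 for p in inside) / len(inside)
    return {"com_distance": math.dist(com, center),
            "spread": spread,
            "variance": variance}


def feature_vector(points, center, radii=RADII):
    """Concatenate the features over all radii (3 features x 4 radii here)."""
    vec = []
    for r in radii:
        feats = region_features(points, center, r)
        vec.extend(feats[name] for name in FEATURE_NAMES)
    return vec


# Made-up density samples, then the same samples rotated 90 degrees about z.
points = [(1.0, 0.0, 0.0, 2.0), (0.0, 2.0, 0.0, 1.0),
          (0.0, 0.0, 3.5, 0.5), (-1.0, -1.0, 1.0, 1.5)]
rotated = [(-y, x, z, rho) for x, y, z, rho in points]
original = feature_vector(points, (0.0, 0.0, 0.0))
turned = feature_vector(rotated, (0.0, 0.0, 0.0))
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, turned))
print("rotation-invariant:", same)
```

Every quantity depends only on distances and density values, never on the coordinate axes, which is what makes the vector identical for the rotated copy.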
9. The LOOKUP Process
- Region in map to be interpreted
- Database of known maps
- Two-step filter: 1) by features, 2) by density correlation
- Weighted Euclidean distance metric (2-norm) for retrieving matches
- Find optimal rotation
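
The two-step filter can be sketched as follows. This is an illustration under assumptions, not the LOOKUP implementation: the database entries, labels, weights, and the shortlist size `k` are hypothetical, and the density arrays are assumed to be already rotated into a common frame, so the search for the optimal rotation is left out.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean (2-norm) distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def correlation(a, b):
    """Pearson correlation between two equally sampled density arrays."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                     * sum((y - mean_b) ** 2 for y in b))
    return cov / norm if norm else 0.0


def lookup(query, database, weights, k=2):
    """Step 1: shortlist the k nearest entries by feature distance.
    Step 2: return the shortlisted entry with the best density correlation."""
    shortlist = sorted(
        database,
        key=lambda e: weighted_distance(query["features"], e["features"], weights),
    )[:k]
    return max(shortlist, key=lambda e: correlation(query["density"], e["density"]))


# Hypothetical database of regions from known maps.
database = [
    {"label": "region_A", "features": [1.0, 0.2], "density": [0.1, 0.9, 0.4, 0.2]},
    {"label": "region_B", "features": [1.1, 0.3], "density": [0.8, 0.2, 0.1, 0.7]},
    {"label": "region_C", "features": [3.0, 2.0], "density": [0.7, 0.3, 0.2, 0.8]},
]
query = {"features": [1.05, 0.28], "density": [0.9, 0.1, 0.2, 0.6]}
best = lookup(query, database, weights=[1.0, 1.0])
print(best["label"])
```

The cheap feature comparison prunes the database first, so the more expensive density correlation only runs on a handful of candidates.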
10. SLIDER: Feature-weighting algorithm
- Euclidean distance metric used for retrieval
- relevant features good, irrelevant features bad
- Goal: find optimal weight vector w that generates highest probability of hits (matches) in top K candidates from database
- Concept of Slider
  - adjust features so that most matches are ranked higher than mismatches

Slider Algorithm(w, F, R_i, matches, mismatches)
  choose feature $f \in F$ at random
  for each $\langle R_i, R_j, R_k \rangle$, $R_j \in \mathrm{matches}(R_i)$, $R_k \in \mathrm{mismatches}(R_i)$:
    compute cross-over point $\lambda_i$ where $\mathrm{dist}(R_i, R_j) = \mathrm{dist}(R_i, R_k)$,
    with $\mathrm{dist}(X, Y) = \lambda (X_f - Y_f)^2 + (1 - \lambda)\,\mathrm{dist}_{\setminus f}(X, Y)$
  pick $\lambda$ that is the best compromise among the $\lambda_i$: ranks most matches above mismatches
  update weight vector: $w \leftarrow \mathrm{update}(w, f, \lambda)$, $w_f = \lambda$
  repeat until convergence
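
One way to read the pseudocode above as runnable code is sketched below. Several details are assumptions rather than the published algorithm: the remaining weights are normalized to sum to 1 inside the "all other features" distance, the update rescales them by (1 - lambda), a fixed number of rounds stands in for "repeat until convergence", and the regions and triples are made-up data.

```python
import random


def split(x, y, w, f):
    """Return (squared difference on feature f, weighted squared distance over
    the other features, with their weights normalized to sum to 1)."""
    rest = [i for i in range(len(w)) if i != f]
    total = sum(w[i] for i in rest) or 1.0
    d_rest = sum(w[i] / total * (x[i] - y[i]) ** 2 for i in rest)
    return (x[f] - y[f]) ** 2, d_rest


def blended(lam, parts):
    """dist = lam * (X_f - Y_f)^2 + (1 - lam) * dist over the other features."""
    d_f, d_rest = parts
    return lam * d_f + (1.0 - lam) * d_rest


def crossover(match, mismatch):
    """lambda in (0, 1) where match and mismatch are equally distant, else None."""
    (a_j, b_j), (a_k, b_k) = match, mismatch
    denom = (a_j - a_k) + (b_k - b_j)
    if denom == 0:
        return None
    lam = (b_k - b_j) / denom
    return lam if 0.0 < lam < 1.0 else None


def slider_step(regions, triples, w, f):
    """Choose lambda for feature f that ranks the most matches above
    mismatches, and return (lambda, updated weight vector)."""
    parts = [(split(regions[i], regions[j], w, f),
              split(regions[i], regions[k], w, f)) for i, j, k in triples]
    cuts = [0.0, 1.0]
    for match, mismatch in parts:
        lam = crossover(match, mismatch)
        if lam is not None:
            cuts.append(lam)
    cuts.sort()
    # Rankings only change at cross-over points, so test one value per interval.
    candidates = [(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])]
    best = max(candidates, key=lambda lam: sum(
        blended(lam, m) < blended(lam, mm) for m, mm in parts))
    rest_total = sum(w[i] for i in range(len(w)) if i != f) or 1.0
    new_w = [best if i == f else (1.0 - best) * w[i] / rest_total
             for i in range(len(w))]
    return best, new_w


def slider(regions, triples, w, rounds=20, seed=0):
    """Repeat slider_step on randomly chosen features for a fixed number of rounds."""
    rng = random.Random(seed)
    for _ in range(rounds):
        _, w = slider_step(regions, triples, w, rng.randrange(len(w)))
    return w


# Region 0 matches region 1 (close on feature 0) and mismatches region 2.
regions = [[0.0, 0.0], [0.1, 5.0], [3.0, 0.1]]
triples = [(0, 1, 2)]
lam, w = slider_step(regions, triples, [0.5, 0.5], f=0)
final = slider(regions, triples, [0.5, 0.5])
print(lam, w, final)
```

In this toy case feature 0 separates the match from the mismatch and feature 1 is misleading, so the weight on feature 0 slides up past the cross-over point.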
11. Quality of TEXTAL models
- Typically builds >80% of the protein atoms
- Accuracy of coordinates: 1 Å error (RMSD)
- Depends on resolution and quality of map
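
For reference, the RMSD figure quoted above is a root-mean-square deviation over paired atoms. A minimal sketch, assuming the model and the reference structure are already in the same frame and their atoms are listed in the same order (the coordinates below are made up):

```python
import math


def rmsd(model, reference):
    """Root-mean-square deviation between paired atom coordinates."""
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(total / len(model))


# Hypothetical reference atoms, and a model displaced by 1 angstrom along x.
reference = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]
model = [(x + 1.0, y, z) for x, y, z in reference]
error = rmsd(model, reference)
print(error)
```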
12. Closeup of β-strand (TEXTAL model in green)
13. Deployment
- September 2004: Linux and OSX distributions
  - Can be downloaded from http://textal.tamu.edu
  - 40 trial licenses granted so far
- June 2002: WebTex (http://textal.tamu.edu)
  - Till May 2005: TB Structural Genomics Consortium members only
  - Recently open to the public
  - users upload data; processed on server; can download results
  - 120 users from 70 institutions in 20 countries
- July 2003: Model building component of PHENIX
  - Python-based Hierarchical ENvironment for Integrated Xtallography
  - Consortium members:
    - Lawrence Berkeley National Lab
    - University of Cambridge
    - Los Alamos National Lab
    - Texas A&M University
14. Intelligent Methods for Drug Design
- structure-based
  - given protein structure, predict ligands that might bind active site
- other methods
  - QSAR, high-throughput/combi-chem, manual design using 3D
- Virtual Screening
  - docking algorithm + large library of chemical structures
  - sort compounds by interaction energy
  - purchase top-ranked hits and assay in lab
  - looking for μM inhibitors (leads that can be refined)
  - goal: enrichment to 5% hit rate
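
The ranking and enrichment steps can be sketched in a few lines. The compound names, energies, and set of actives below are invented for illustration, and the code assumes that a lower (more negative) interaction energy is better.

```python
def top_hits(scores, n):
    """scores: {compound: interaction energy}. Returns the n best-scoring
    compounds, assuming lower energy means stronger predicted binding."""
    return sorted(scores, key=scores.get)[:n]


def hit_rate(selected, actives):
    """Fraction of the selected compounds that are confirmed actives."""
    return sum(c in actives for c in selected) / len(selected)


def enrichment(scores, actives, n):
    """Hit rate among the top n divided by the hit rate of the whole library."""
    baseline = len(actives.intersection(scores)) / len(scores)
    return hit_rate(top_hits(scores, n), actives) / baseline


# Hypothetical docking scores and assay results.
scores = {
    "cmpd_a": -42.1, "cmpd_b": -17.3, "cmpd_c": -38.5, "cmpd_d": -9.8,
    "cmpd_e": -30.2, "cmpd_f": -12.0, "cmpd_g": -25.4, "cmpd_h": -5.1,
    "cmpd_i": -21.7, "cmpd_j": -14.9,
}
actives = {"cmpd_a", "cmpd_e"}

selected = top_hits(scores, 3)
factor = enrichment(scores, actives, 3)
print(selected, hit_rate(selected, actives), factor)
```

An enrichment above 1 means that buying and assaying only the top-ranked compounds finds actives more often than sampling the library at random.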
15. Virtual Screening
- diversity
  - ZINC database: 2.6 million compounds
  - purchasable; satisfy Lipinski's rules
- docking algorithms
  - FlexX, DOCK, GOLD, AutoDock, ICM...
  - search for position and conformation of ligand
- scoring function
  - electrostatic + steric + desolvation
  - entropy effects?
- major open issues
  - active site flexibility, charge state, waters, co-factors
  - works best with co-crystal structures (already bound)
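
As a reminder of what "satisfy Lipinski's rules" means in practice, here is a small filter using the usual rule-of-five thresholds. The property names and the two compound records are hypothetical; a real pipeline would compute the descriptors with a cheminformatics toolkit.

```python
# Rule-of-five upper limits: molecular weight (Da), logP,
# hydrogen-bond donors, hydrogen-bond acceptors.
LIPINSKI_LIMITS = {"mol_weight": 500.0, "logp": 5.0, "h_donors": 5, "h_acceptors": 10}


def lipinski_violations(props):
    """Names of the rule-of-five limits that a compound exceeds."""
    return [name for name, limit in LIPINSKI_LIMITS.items() if props[name] > limit]


def passes_lipinski(props, max_violations=0):
    """Strict by default; the original formulation tolerates one violation."""
    return len(lipinski_violations(props)) <= max_violations


# Made-up descriptor values for two hypothetical compounds.
small = {"mol_weight": 312.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5}
bulky = {"mol_weight": 712.9, "logp": 6.3, "h_donors": 3, "h_acceptors": 9}

print(passes_lipinski(small), lipinski_violations(bulky))
```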
16. Grid at Texas A&M
(Diagram: gridmaster.tamu.edu sends DOCK binaries + receptor files + 20 ligands at a time to computers in campus labs: West Campus Library, Blocker, Zachary)
- typical configuration: 2.8 GHz dual-core Pentium CPUs running Windows XP
- 1600 computers in student labs on TAMU campus (Open-Access Labs)
- GridMP software by United Devices (Austin, TX)
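
The "20 ligands at a time" packaging amounts to splitting the ligand library into fixed-size work units. A minimal sketch of that batching follows; the ligand identifiers and the function name are hypothetical, and nothing here uses the GridMP API.

```python
from itertools import islice


def work_units(ligands, size=20):
    """Yield successive batches of at most `size` ligands."""
    it = iter(ligands)
    while batch := list(islice(it, size)):
        yield batch


# Hypothetical ligand identifiers.
ligands = [f"ligand_{i:04d}" for i in range(45)]
units = list(work_units(ligands))
print([len(u) for u in units])

# Number of work units if a 2.6-million-compound library were split this way
# (ceiling division).
library_size = 2_600_000
n_units = -(-library_size // 20)
print(n_units)
```

Small work units keep each job short, which suits shared lab machines that may be interrupted at any time.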
17. Data Mining of Results
- promiscuous binders
- clusters of related compounds
- patterns of contacts within active site
  - hydrogen-bonding interactions
- adjust weights of scoring function for unique properties of each site
  - open/closed, hydrophobic/charged...
- ideas for active site variations
- development of pharmacophore search patterns
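
As one example of this kind of mining, promiscuous binders can be flagged by counting how many different targets rank the same compound among their top hits. The target names, compound names, and the threshold below are invented for the sketch.

```python
from collections import Counter


def promiscuous_binders(top_hits_by_target, min_targets=3):
    """Compounds that appear among the top hits of at least min_targets targets."""
    counts = Counter(
        compound
        for hits in top_hits_by_target.values()
        for compound in set(hits)      # count each compound once per target
    )
    return sorted(c for c, n in counts.items() if n >= min_targets)


# Hypothetical top-ranked compounds from three separate screens.
top_hits_by_target = {
    "target_1": ["cmpd_a", "cmpd_b", "cmpd_c"],
    "target_2": ["cmpd_a", "cmpd_d", "cmpd_e"],
    "target_3": ["cmpd_a", "cmpd_b", "cmpd_f"],
}

print(promiscuous_binders(top_hits_by_target))
print(promiscuous_binders(top_hits_by_target, min_targets=2))
```

Compounds flagged this way score well everywhere, which usually says more about the scoring function than about any one active site.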
18. Current Screens in Sacchettini Lab
- proteins related to tuberculosis (Mycobacterium)
- focus on unique pathways involved in dormancy/starvation
  - glyoxylate shunt: slow-growth metabolic pathway
  - cell-wall biosynthesis (unique mycolic acid layer in tb.)
  - biosynthesis of amino acids/co-factors that humans get from diet
- isocitrate lyase
- malate synthase
- PcaA: mycolic acid cyclopropane synthase
- ACPS: acyl-carrier protein synthase
- InhA: enoyl-acyl reductase (target of isoniazid)
- KasB: fatty-acid synthase
- BioA: biotin (co-factor) synthase
- PGDH: phosphoglycerate dehydrogenase (serine biosynthesis)
- Related proteins in malaria, SARS, shigella
23. Conclusions
- Many opportunities for research in Structural Bioinformatics
  - large datasets
  - significant problems
- Provides challenges for machine learning
  - drives development of novel methods, especially for dealing with noise, sampling biases, extraction of features...
- Requires inherently interdisciplinary approach
  - training in biochemistry: knowledge of molecular interactions
  - understanding chemical intuition; use of visualization tools
  - insights about strengths and limitations of existing methods
- Requires collaboration to construct appropriate representations to enable learning algorithms to find patterns
  - translate expectations about what is relevant, dependencies, smoothing, sources of noise...