Ioerger Lab - PowerPoint PPT Presentation

About This Presentation

Title:

Ioerger Lab

Slides: 24

Provided by: thomasr1

Learn more at: https://people.engr.tamu.edu

Transcript and Presenter's Notes

1
Ioerger Lab Bioinformatics Research

Pattern recognition/machine learning
issues of representation
effect of feature extraction, weighting, and interaction on performance of induction algorithm
Applications in Structural Biology
molecular basis of biology: protein structures
predicting structures
tools for solving structures (X-ray crystallography, NMR)
stability, folding, packing, motions
drug design (small-molecule inhibitors)
large datasets exist: exploit them, find the patterns

2
TEXTAL - Automated Crystallographic Protein
Structure Determination Using Pattern Recognition

Principal Investigators: Thomas Ioerger (Dept. Computer Science), James Sacchettini (Dept. Biochem/Biophys)
Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee, Lalji Kanbi, Reetal Pai, Jacob Smith
Funding: National Institutes of Health
Texas A&M University

3
X-ray crystallography

Most widely used method for protein modeling
Steps:
Grow crystal
Collect diffraction data
Generate electron density map (Fourier transform)
Interpret map, i.e. infer atomic coordinates
Refine structure
Model-building:
Currently: crystallographers
Challenges: noise, resolution
Goal: automation
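The "Fourier transform" step above can be illustrated with a toy one-dimensional example. This is only a sketch: a real map is three-dimensional and is computed from measured amplitudes plus estimated phases, and the toy density values below are made up for illustration.

```python
import cmath

def structure_factors(density):
    """Forward DFT: F(h) = sum_x rho(x) * exp(2*pi*i*h*x/N)."""
    n = len(density)
    return [sum(rho * cmath.exp(2j * cmath.pi * h * x / n)
                for x, rho in enumerate(density))
            for h in range(n)]

def density_map(factors):
    """Inverse DFT: rho(x) = (1/N) * sum_h F(h) * exp(-2*pi*i*h*x/N)."""
    n = len(factors)
    return [(sum(f * cmath.exp(-2j * cmath.pi * h * x / n)
                 for h, f in enumerate(factors)) / n).real
            for x in range(n)]

# Hypothetical 1-D "electron density" sampled on 8 grid points.
toy_density = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 0.0]
factors = structure_factors(toy_density)
recovered = density_map(factors)  # matches toy_density up to rounding
```

With complete complex structure factors the density is recovered exactly; noise and limited resolution in real data are what make the later interpretation step hard.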

5
Overview of TEXTAL

Automated model-building program
Can we automate the kind of visual processing of
patterns that crystallographers use?
Intelligent methods to interpret density, despite
noise
Exploit knowledge about typical protein structure
Focus on medium-resolution maps:
optimized for 2.8 Å (actually, 2.6-3.2 Å is fine)
typical for MAD data (useful for high-throughput)
other programs exist for higher-res data (ARP/wARP)

[Diagram: Electron density map (or structure factors) -> TEXTAL -> Protein model (may need refinement)]
6
[Flowchart: TEXTAL pipeline]
Crystal -> collect data -> diffraction data -> electron density map
CAPRA (models backbone): scale map, trace map, calculate features, predict Cα's, build chains, patch and stitch chains, refine chains -> model of backbone
LOOKUP (models side chains) -> model of backbone + side chains
Post-processing: sequence alignment, real-space refinement -> corrected and refined model
7
F = <1.72, -0.39, 1.04, 1.55, ...>
F = <1.58, 0.18, 1.09, -0.25, ...>
F = <0.90, 0.65, -1.40, 0.87, ...>
F = <1.79, -0.43, 0.88, 1.52, ...>
8
Examples of Numeric Density Features

Distance from center-of-sphere to center-of-mass
Moments of inertia - relative dispersion along
orthogonal axes
Geometric features like Spoke angles
Local variance and other statistics

Features are designed to be rotation-invariant,
i.e. same values for region in any
orientation/frame-of-reference. TEXTAL uses 19
distinct numeric features to represent the
pattern of density in a region, each calculated
over 4 different radii, for a total of 76
features.
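A minimal sketch of what "rotation-invariant, calculated over several radii" can mean in practice. It computes only two simple quantities per radius (distance from the sphere center to the center of mass, and the density-weighted spread about that center of mass), not TEXTAL's 19 features. The radii, the point format, and the sample region are assumptions made for illustration.

```python
import math

def region_features(points, radius):
    """points: (x, y, z, density) tuples relative to the sphere center.
    Returns (center-of-mass distance, density-weighted mean squared spread)."""
    inside = [p for p in points
              if math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) <= radius]
    total = sum(p[3] for p in inside)
    if total == 0:
        return (0.0, 0.0)
    cx = sum(p[0] * p[3] for p in inside) / total
    cy = sum(p[1] * p[3] for p in inside) / total
    cz = sum(p[2] * p[3] for p in inside) / total
    com_dist = math.sqrt(cx * cx + cy * cy + cz * cz)
    spread = sum(p[3] * ((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2)
                 for p in inside) / total
    return (com_dist, spread)

def feature_vector(points, radii=(3.0, 4.0, 5.0, 6.0)):
    """Concatenate the per-radius features (radii in Angstroms are assumed)."""
    vec = []
    for r in radii:
        vec.extend(region_features(points, r))
    return vec

def rotate_z(points, angle):
    """Rotate every point about the z axis through the sphere center."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z, d) for x, y, z, d in points]

# Hypothetical density samples around a region center.
region = [(1.0, 0.0, 0.0, 2.0), (0.0, 2.5, 1.0, 1.0),
          (-3.0, 1.0, 2.0, 0.5), (4.0, 4.0, 1.0, 1.5)]
original = feature_vector(region)
rotated = feature_vector(rotate_z(region, 1.1))  # same values up to rounding
```

Because both quantities depend only on distances, the vector is unchanged when the region is rotated, which is the property the slide describes.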
9
The LOOKUP Process
[Diagram: region in map to be interpreted -> database of known maps -> find optimal rotation]
Two-step filter: 1) by features, 2) by density correlation
2-norm (weighted Euclidean) distance metric for retrieving matches
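The two-step filter can be sketched as follows. This is an illustration under simplifying assumptions, not TEXTAL's implementation: the database entries, feature values, and flattened density samples are hypothetical, and the search for the optimal rotation is omitted.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean (2-norm) distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def correlation(u, v):
    """Pearson correlation between two equally sampled density regions."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv) if su and sv else 0.0

def lookup(query, database, weights, top_k=2):
    # Step 1: cheap filter, keep the top_k entries closest in feature space.
    shortlist = sorted(
        database,
        key=lambda e: weighted_distance(query["features"], e["features"], weights),
    )[:top_k]
    # Step 2: expensive filter, pick the shortlisted entry whose density correlates best.
    return max(shortlist, key=lambda e: correlation(query["density"], e["density"]))

# Hypothetical database of known regions.
database = [
    {"name": "entry_a", "features": [0.90, 0.65, -1.40], "density": [1, 2, 3, 2, 1]},
    {"name": "entry_b", "features": [1.70, -0.40, 1.00], "density": [3, 1, 0, 1, 3]},
    {"name": "entry_c", "features": [1.80, -0.40, 0.90], "density": [0, 1, 3, 1, 0]},
]
query = {"features": [1.72, -0.39, 1.04], "density": [0, 1, 2.9, 1.2, 0]}
best = lookup(query, database, weights=[1.0, 1.0, 1.0])
```

Here entry_b is nearest in feature space, but entry_c wins after the density-correlation step, which shows why the second filter is useful.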
10
SLIDER Feature-weighting algorithm

Euclidean distance metric used for retrieval: relevant features good, irrelevant features bad
Goal: find the optimal weight vector w that generates the highest probability of hits (matches) in the top K candidates from the database
Concept of Slider:
adjust feature weights so that the most matches are ranked higher than mismatches

Slider Algorithm(w, F, {Ri}, matches, mismatches)
  choose feature f ∈ F at random
  for each <Ri, Rj, Rk>, Rj ∈ matches(Ri), Rk ∈ mismatches(Ri):
    compute cross-over point λi where dist(Ri,Rj) = dist(Ri,Rk)
    dist(X,Y) = λ(Xf − Yf)² + (1 − λ) dist\f(X,Y)
  pick λ that is the best compromise among the λi: ranks most matches above mismatches
  update weight vector: w ← update(w, f, λ), wf = λ
  repeat until convergence
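A runnable sketch of the Slider idea, written from the pseudocode above rather than from the TEXTAL source. Several details are assumptions: weights are kept normalized to sum to 1, dist\f is taken as the weighted squared distance over the other features with renormalized weights, candidate values of λ are midpoints between consecutive cross-over points, and a fixed number of rounds stands in for a convergence test. The training triples are made up, and at least two features are assumed.

```python
import random

def rest_distance(x, y, w, f):
    """Weighted squared distance over all features except f (weights renormalized)."""
    others = [g for g in range(len(w)) if g != f]
    rest = sum(w[g] for g in others)
    if rest == 0:  # degenerate case: fall back to uniform weights
        return sum((x[g] - y[g]) ** 2 for g in others) / len(others)
    return sum(w[g] / rest * (x[g] - y[g]) ** 2 for g in others)

def slid_distance(x, y, w, f, lam):
    """dist(X,Y) = lam*(Xf - Yf)^2 + (1 - lam)*dist_without_f(X,Y)."""
    return lam * (x[f] - y[f]) ** 2 + (1 - lam) * rest_distance(x, y, w, f)

def crossover(ri, rj, rk, w, f):
    """lam at which the match Rj and the mismatch Rk are equidistant from Ri."""
    a_j, a_k = (ri[f] - rj[f]) ** 2, (ri[f] - rk[f]) ** 2
    b_j, b_k = rest_distance(ri, rj, w, f), rest_distance(ri, rk, w, f)
    denom = (a_j - a_k) + (b_k - b_j)
    if denom == 0:
        return None
    lam = (b_k - b_j) / denom
    return lam if 0.0 < lam < 1.0 else None

def slider_step(w, triples, f):
    crossings = [crossover(ri, rj, rk, w, f) for ri, rj, rk in triples]
    points = sorted({0.0, 1.0, *(c for c in crossings if c is not None)})
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]

    def score(lam):  # number of triples whose match is ranked above the mismatch
        return sum(slid_distance(ri, rj, w, f, lam) < slid_distance(ri, rk, w, f, lam)
                   for ri, rj, rk in triples)

    best = max(candidates, key=score)
    rest = sum(w[g] for g in range(len(w)) if g != f)
    new_w = [(1 - best) * (w[g] / rest if rest else 1.0 / (len(w) - 1))
             for g in range(len(w))]
    new_w[f] = best
    return new_w

def train(w, triples, rounds=10, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        w = slider_step(w, triples, rng.randrange(len(w)))
    return w

# Hypothetical triples (Ri, match Rj, mismatch Rk): feature 0 is informative, feature 1 is noise.
triples = [([0.0, 0.0], [0.1, 1.0], [0.8, 0.1]),
           ([1.0, 0.5], [1.1, -0.5], [0.2, 0.6])]
learned = train([0.5, 0.5], triples)
```

On this toy data the informative feature ends up with the larger weight, and both matches are then ranked above their mismatches.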
11
Quality of TEXTAL models

Typically builds >80% of the protein atoms
Accuracy of coordinates: 1 Å error (RMSD)
Depends on resolution and quality of map
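For reference, the RMSD used as the accuracy measure can be computed as below. This is a minimal sketch that assumes the built model and the reference structure list the same atoms in the same order and share one coordinate frame, so no superposition is done. The coordinates are made up.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired atomic coordinates (in Angstroms)."""
    if not model or len(model) != len(reference):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(model))

# Two hypothetical atoms, each displaced by 1 Angstrom from its reference position.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]
error = rmsd(model, reference)
```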

12
Closeup of β-strand (TEXTAL model in green)
13
Deployment

September 2004: Linux and OSX distributions
Can be downloaded from http://textal.tamu.edu
40 trial licenses granted so far
June 2002: WebTex (http://textal.tamu.edu)
Till May 2005: TB Structural Genomics Consortium members only
Recently opened to the public
users upload data; processed on server; can download results
120 users from 70 institutions in 20 countries
July 2003: Model-building component of PHENIX
Python-based Hierarchical ENvironment for Integrated Xtallography
Consortium members:
Lawrence Berkeley National Lab
University of Cambridge
Los Alamos National Lab
Texas A&M University

14
Intelligent Methods for Drug Design

structure-based:
given protein structure, predict ligands that might bind active site
other methods:
QSAR, high-throughput/combi-chem, manual design using 3D
Virtual Screening:
docking algorithm + large library of chemical structures
sort compounds by interaction energy
purchase top-ranked hits and assay in lab
looking for µM inhibitors (leads that can be refined)
goal: enrichment to 5% hit rate
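The ranking and hit-rate bookkeeping described above can be sketched in a few lines. The compound names, energies, and assay outcomes are hypothetical, and the convention that a more negative interaction energy is better is an assumption about the docking score.

```python
def top_hits(scores, n):
    """Return the n compounds with the lowest (most favorable) interaction energy."""
    return sorted(scores, key=scores.get)[:n]

def hit_rate(selected, actives):
    """Fraction of the selected compounds that were active in the assay."""
    return sum(1 for c in selected if c in actives) / len(selected)

# Hypothetical docking scores (arbitrary energy units) and assay results.
scores = {"cmpd_a": -42.0, "cmpd_b": -12.5, "cmpd_c": -38.1,
          "cmpd_d": -7.3, "cmpd_e": -29.9, "cmpd_f": -15.0}
actives = {"cmpd_c", "cmpd_f"}

selected = top_hits(scores, 2)                 # compounds to purchase and assay
screen_rate = hit_rate(selected, actives)      # hit rate among purchased compounds
baseline_rate = hit_rate(list(scores), actives)  # hit rate of the whole library
enrichment = screen_rate / baseline_rate
```

Enrichment compares the hit rate among the top-ranked compounds with the hit rate expected from picking compounds at random.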

15
Virtual Screening

diversity:
ZINC database: 2.6 million compounds
purchasable; satisfy Lipinski's rules
docking algorithms:
FlexX, DOCK, GOLD, AutoDock, ICM...
search for position and conformation of ligand
scoring function:
electrostatic + steric + desolvation
entropy effects?
major open issues:
active site flexibility, charge state, waters, co-factors
works best with co-crystal structures (already bound)
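As an illustration of the "satisfy Lipinski's rules" filter, the sketch below checks the usual rule-of-five thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors). The descriptor values are hypothetical and assumed to be precomputed. Whether any violations are tolerated is a choice; the default of zero here is an assumption, not a statement about how ZINC filters compounds.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # Daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors

def lipinski_violations(c):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([c.mol_weight > 500, c.logp > 5, c.h_donors > 5, c.h_acceptors > 10])

def passes_lipinski(c, max_violations=0):
    return lipinski_violations(c) <= max_violations

# Hypothetical library entries.
library = [Compound("cmpd_a", 350.4, 2.1, 2, 5),
           Compound("cmpd_b", 720.9, 6.3, 3, 8)]
drug_like = [c.name for c in library if passes_lipinski(c)]
```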

16
Grid at Texas A&M
[Diagram: gridmaster.tamu.edu distributing work to campus lab buildings: West Campus Library, Blocker, Zachary]
gridmaster.tamu.edu sends: DOCK binaries + receptor files + 20 ligands at a time
1600 computers in student labs on TAMU campus (Open-Access Labs)
typical configuration: 2.8 GHz dual-core Pentium CPUs running Windows XP
GridMP software by United Devices (Austin, TX)
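Splitting a ligand library into work units of 20 can be sketched as below. This illustrates the batching only. It does not use the GridMP API, and the receptor file name and ligand identifiers are placeholders.

```python
def work_units(ligands, receptor, batch_size=20):
    """Yield one work unit per batch: the receptor plus up to batch_size ligands."""
    for start in range(0, len(ligands), batch_size):
        yield {"receptor": receptor, "ligands": ligands[start:start + batch_size]}

# Hypothetical library of 45 ligand identifiers.
ligands = [f"ligand_{i:04d}" for i in range(45)]
units = list(work_units(ligands, receptor="receptor_file"))
sizes = [len(u["ligands"]) for u in units]  # [20, 20, 5]
```

Small fixed-size batches keep each unit short enough to finish on a lab desktop before its owner needs the machine back, at the cost of more scheduling overhead.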
17
Data Mining of Results

promiscuous binders
clusters of related compounds
patterns of contacts within active site
hydrogen-bonding interactions
adjust weights of scoring function for unique
properties of each site
open/closed, hydrophobic/charged...
ideas for active site variations
development of pharmacophore search patterns

18
Current Screens in Sacchettini Lab

proteins related to tuberculosis (Mycobacterium)
focus on unique pathways involved in dormancy/starvation:
glyoxylate shunt: slow-growth metabolic pathway
cell-wall biosynthesis (unique mycolic acid layer in TB)
biosynthesis of amino acids/co-factors that humans get from diet
isocitrate lyase
malate synthase
PcaA: mycolic acid cyclopropane synthase
ACPS: acyl-carrier protein synthase
InhA: enoyl-acyl reductase (target of isoniazid)
KasB: fatty-acid synthase
BioA: biotin (co-factor) synthase
PGDH: phosphoglycerate dehydrogenase (serine biosynthesis)
Related proteins in malaria, SARS, shigella

23
Conclusions

Many opportunities for research in Structural Bioinformatics:
large datasets
significant problems
Provides challenges for machine learning:
drives development of novel methods, especially for dealing with noise, sampling biases, extraction of features...
Requires inherently interdisciplinary approach:
training in biochemistry; knowledge of molecular interactions
understanding chemical intuition; use of visualization tools
insights about strengths and limitations of existing methods
Requires collaboration to construct appropriate representations to enable learning algorithms to find patterns:
translate expectations about what is relevant, dependencies, smoothing, sources of noise...