Title: TEXTAL: Applications of Pattern Recognition to Macromolecular Crystallography
1. TEXTAL: Applications of Pattern Recognition to
Macromolecular Crystallography
- Dr. Thomas R. Ioerger
- Department of Computer Science
- Texas A&M University
Collaboration with Dr. James C. Sacchettini,
Center for Structural Biology, Texas A&M
2. Automating Structure Determination
- Typical Steps
- obtain crystals
- collect data (e.g. MAD, at synchrotron)
- determine initial set of phases
- generate electron density map
- density modification/phase refinement
- construct model (atomic coordinates)
3. Automating Structure Determination
- Existing computational routines
- heavy atom search, Patterson correlation, solvent flattening, maximum likelihood phase combination
- few methods to interpret electron density maps
- requires humans: potential bottleneck
- difficulty: low res., phase errors, weak density
- must automate for structural genomics and rational drug design
4. Overview of TEXTAL
- Apply pattern recognition techniques
- Exploit database of previously-solved maps
- Model molecular structures in local regions (e.g. spheres of 5 Angstrom radius)
- Intuitive principles
- 1) Have I ever seen a region with a pattern of density like this before?
- 2) If so, what were the previous local atomic coordinates?
5. Overview (cont'd)
- Divide-and-Conquer
- 1) identify alpha-carbon positions (chain-tracing)
- 2) model regions around alpha-carbons (CAs), including backbone and side-chain atoms
- 3) concatenate local models back together, resolve any conflicts
- Database contains many regions centered on CAs from previous maps
- 5 Å radius is right for structural repetition
6. Overview (cont'd)
- Database: 10^5 regions from 100 maps
- How to identify closest match (efficiently)?
- Calculate numerical features that represent the pattern in each region
- Must be rotation-invariant
- Search can be very fast: just compare features
7. Overview (cont'd)
8. Database Construction
- Ideally would use solved MAD/MIR maps
- Using back-transformed maps works well
- PDB → structure factors (include B-factors)
- keep reflections down to 2.8 Å
- Fourier transform → electron density map
- 50 proteins from PDBSelect (non-homologous)
- about 50,000 regions
- Feature extraction done offline
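9. Rotation-Invariant Features
- Average density: $\mu = \frac{1}{n}\sum_i \rho_i$, where $\rho_i$ is the density at each lattice point in the region
- Other statistical features: standard deviation, kurtosis
- Distance to center of mass
- $\langle x_c, y_c, z_c \rangle = \frac{1}{n} \langle \sum_i x_i\rho_i/\mu,\ \sum_i y_i\rho_i/\mu,\ \sum_i z_i\rho_i/\mu \rangle$
- $d_{cen} = \sqrt{x_c^2 + y_c^2 + z_c^2}$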
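The statistical features and the center-of-mass distance on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TEXTAL's actual code: the function name `region_features` and the input layout (a list of `(x, y, z, rho)` tuples with coordinates relative to the region center) are hypothetical, and the mean density is assumed to be positive.

```python
import math

def region_features(points):
    """Compute a few rotation-invariant features for one region.

    points: list of (x, y, z, rho) tuples (assumed layout); coordinates are
    relative to the region center, rho is the density at that lattice point.
    """
    n = len(points)
    densities = [rho for _, _, _, rho in points]
    mean = sum(densities) / n                          # mu = (1/n) * sum(rho_i)
    variance = sum((rho - mean) ** 2 for rho in densities) / n
    std = math.sqrt(variance)
    # Kurtosis as the fourth standardized moment; set to 0 for flat density.
    if variance > 0:
        kurtosis = (sum((rho - mean) ** 4 for rho in densities) / n) / variance ** 2
    else:
        kurtosis = 0.0
    # Density-weighted center of mass: (1/n) * sum(x_i * rho_i / mu), etc.
    xc = sum(x * rho for x, _, _, rho in points) / (n * mean)
    yc = sum(y * rho for _, y, _, rho in points) / (n * mean)
    zc = sum(z * rho for _, _, z, rho in points) / (n * mean)
    d_cen = math.sqrt(xc ** 2 + yc ** 2 + zc ** 2)
    return {"mean": mean, "std": std, "kurtosis": kurtosis, "d_cen": d_cen}
```

Because every quantity depends only on the density values and on distances from the region center, rotating the region about its center leaves the returned features unchanged.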
10. More Features
- Moments of inertia
- measures dispersion around axes of symmetry in a density distribution
- calculate 3x3 inertia matrix
- diagonalize to get eigenvalues
- sort from largest to smallest
- take magnitudes and ratios of moments
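The moment-of-inertia steps above (build the 3x3 matrix, diagonalize, sort, take magnitudes and ratios) might look like the following sketch. The assumptions are mine: the matrix is taken about the region center with density as the weight, the input is a list of `(x, y, z, rho)` tuples, and the helper names are hypothetical. The eigenvalues come from the standard closed-form trigonometric solution for symmetric 3x3 matrices, so no external library is needed.

```python
import math

def inertia_matrix(points):
    """Density-weighted 3x3 inertia matrix about the region center (origin)."""
    m = [[0.0] * 3 for _ in range(3)]
    for x, y, z, rho in points:
        r = (x, y, z)
        r2 = x * x + y * y + z * z
        for i in range(3):
            for j in range(3):
                m[i][j] += rho * ((r2 if i == j else 0.0) - r[i] * r[j])
    return m

def det3(a):
    """Determinant of a 3x3 matrix."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix, sorted largest to smallest."""
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    shifted = [[a[i][j] - (q if i == j else 0.0) for j in range(3)] for i in range(3)]
    p2 = sum(v * v for row in shifted for v in row)
    if p2 == 0.0:                      # matrix is a multiple of the identity
        return [q, q, q]
    p = math.sqrt(p2 / 6.0)
    b = [[v / p for v in row] for row in shifted]
    r = max(-1.0, min(1.0, det3(b) / 2.0))   # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]

def moment_features(points):
    """Three sorted moments followed by three ratios (0.0 if undefined)."""
    e1, e2, e3 = symmetric_eigenvalues(inertia_matrix(points))
    ratios = [e1 / e2 if e2 else 0.0,
              e1 / e3 if e3 else 0.0,
              e2 / e3 if e3 else 0.0]
    return [e1, e2, e3] + ratios
```

Eigenvalues of the inertia matrix do not change when the region is rotated, which is why they qualify as rotation-invariant features.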
11. More Features
- Spoke angles
- if region centered on CA, should have 3 spokes of density emanating from center
- find best-fit vectors, calc. angles among them
- surface area of contours
- connectivity of density/bones in region
- other geometrical features...
12. Details of Matching Process
- Feature-based matching
- Euclidean distance metric between feature vectors
- $dist(R_1, R_2) = \sqrt{\sum_i w_i (F_i(R_1) - F_i(R_2))^2}$
- Must weight features by relevance
- less-relevant features add noise
- Slider algorithm: optimize weights by comparing features in matching regions versus mismatches
- Verify selections by density correlation
- requires search for optimal rotation
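The weighted Euclidean distance between two feature vectors is simple to write down. The sketch below assumes feature vectors and weights are plain lists of floats of equal length. The function name `weighted_distance` is hypothetical, and the weights are inputs here (how the Slider algorithm chooses them is not shown).

```python
import math

def weighted_distance(f1, f2, weights):
    """sqrt(sum_i w_i * (F_i(R1) - F_i(R2))^2) for two feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2)))
```

Setting a weight to zero removes that feature from the comparison entirely, which illustrates how down-weighting a less-relevant feature keeps it from adding noise to the match.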
13. Experiments
- Goal: evaluate potential of pattern-matching
- Assumption: CA positions known
- Procedure
- 1. extract features for each region
- 2. collect top K=400 feature-based matches in DB
- 3. calculate density correlation, take best match
- 4. rotate backbone+sidechain atoms into position
- 30 sec/residue on SGI Origin 2000
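Steps 2 and 3 of the procedure form a two-stage search: a cheap feature comparison produces a shortlist, and the more expensive density correlation picks the winner. The sketch below is a simplified illustration with several assumptions of my own: database entries are dicts with hypothetical `features` and `density` keys, the densities are already sampled on a common grid in a common orientation (the search for the optimal rotation is omitted), and correlation is the Pearson coefficient.

```python
import heapq
import math

def weighted_distance(f1, f2, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2)))

def density_correlation(d1, d2):
    """Pearson correlation of two equally sampled density lists (0.0 if flat)."""
    n = len(d1)
    m1, m2 = sum(d1) / n, sum(d2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2))
    v1 = sum((a - m1) ** 2 for a in d1)
    v2 = sum((b - m2) ** 2 for b in d2)
    if v1 == 0.0 or v2 == 0.0:
        return 0.0
    return cov / math.sqrt(v1 * v2)

def best_match(query, database, weights, k=400):
    """Shortlist the k nearest entries by features, then rank by correlation."""
    shortlist = heapq.nsmallest(
        k, database,
        key=lambda entry: weighted_distance(query["features"], entry["features"], weights))
    return max(shortlist,
               key=lambda entry: density_correlation(query["density"], entry["density"]))
```

Only the k shortlisted regions ever reach the correlation step, so the costly comparison runs a fixed number of times per residue regardless of database size.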
14. Feature Weights
15. Results
- 1gcn: glucagon
- 1fnb: ferredoxin reductase
- 1tup: p53 tumor suppressor
- IFABP: intestinal fatty acid binding protein
- BT: back-transformed
16. Results
Structural similarity groups:
- Ala
- Asp, Asn, Leu
- Gly
- Glu, Gln
- Pro
- Arg, Lys, Met
- Cys, Ser
- Phe, Trp, Tyr, His
- Ile, Val, Thr
17. Results
18. Example: Portion of 1tup
19. Example: Glucagon
20. Post-Processing Routines
- Concatenate local models per a.a. into PDB
- Detect and repair flips by majority chain direction
- Utilize amino acid sequence information
- map chains into known sequence (alignment)
- re-lookup residues based on identity
- Real-space refinement
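The slides do not spell out how flips are detected. One plausible reading of "majority chain direction" is a simple vote: each local model implies a backbone direction along the chain, the most common direction is taken as correct, and residues that disagree are flagged for repair. The sketch below illustrates only that reading. The `+1`/`-1` encoding and the name `find_flipped` are hypothetical, and ties are resolved in favor of `+1`.

```python
def find_flipped(directions):
    """Return indices of residues whose direction disagrees with the majority.

    directions: list of +1 / -1 votes, one per residue in a chain (assumed
    encoding of the backbone direction implied by each local model).
    """
    majority = 1 if sum(directions) >= 0 else -1
    return [i for i, d in enumerate(directions) if d != majority]
```

For example, `find_flipped([1, 1, -1, 1, 1])` flags only the third residue (index 2), whose local model would then be reversed or looked up again.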
21. CAPRA
- Need to find CAs automatically and accurately
- Bones doesn't identify CAs (except branches)
- Use pattern recognition again
- Extract features for all lattice points inside 1σ contour, or along trace
- Use neural net to predict distance to true CA
- Training set: examples of <F1, F2>, Di
- Status: currently 1 Å rms, need to get 0.5-0.8 Å
22. Example
23. Acknowledgements
- Dr. James C. Sacchettini
- Center for Structural Biology, Texas A&M
- Graduate students/post-docs
- Dr. Jon Christopher, Tom Holton, Lydia Tapia
- Funding provided by NIH (GM-59398)
- See our forthcoming paper in Acta Cryst. D