Title: Bioinformatics Tools
1Bioinformatics Tools
2Bioinformatics
- The use of computer science, mathematics, and
information theory to model and analyze
biological systems, especially systems involving
genetic material.
- Things I can do with a computer to improve and
accelerate my work.
3Applications of Bioinformatics
- Manage and analyze data in very large databases
- genetic info (DNA)
- protein sequences, structures
- Collections of scientific papers, experimental
results
- Compare sequences and structures
- Do similar sequences or folding indicate proteins
have similar functions?
- Modeling and prediction
- Predict 3D structure from known structures
(homology) or from a computational approach that
does not rely on a known template (ab initio)
- Prediction of function from structure
- Molecular mechanics / molecular dynamics
- Prediction of molecular interactions, docking
- Perform energy minimization calculations
- Predict useful mutations for protein engineering
4Sources of Data
- Sequence databases (EBI)
- FASTA (sequence similarity)
- http://www.ebi.ac.uk/Tools/fasta33/
- SwissProt (database of protein sequences)
- http://expasy.org/sprot/
- 3D structure database: the RCSB PDB
- http://www.rcsb.org/pdb/home/home.do
5Sequence Analyses
- Sequence Alignment
- Single or Multiple Sequences
- Motif or Pattern Search
- Prediction of Secondary Structure
- >1E9N:A|PDBID|CHAIN|SEQUENCE
MPKRGKKGAVAEDGDELRTEPEA
KKSKTAAKKNDKEAAGEGPALYEDPPDQKTSPSGKPATLKICSWNVDGLR
AWIKKKGLDWVKEEAPDILCLQETKCSENKLPAELQELPGLSHQYWLAL
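A sequence such as the 1E9N example is normally exchanged in FASTA format: a header line starting with ">" followed by one or more sequence lines. The sketch below shows one way to read such text in Python. The function name `parse_fasta` is hypothetical, and the header punctuation (">1E9N:A|PDBID|CHAIN|SEQUENCE") is an assumption about how the slide originally read.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Header punctuation is assumed; only the first two sequence lines are used.
example = """>1E9N:A|PDBID|CHAIN|SEQUENCE
MPKRGKKGAVAEDGDELRTEPEA
KKSKTAAKKNDKEAAGEGPALYEDPPDQKTSPSGKPATLKICSWNVDGLR
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

The resulting sequence string can then be pasted into a search form or passed to other tools.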
6Sequence Alignment
- Usually the first step in analysis of any
new/unidentified sequence is to perform
comparisons with sequence databases to find
existing homologues.
- This might give you some idea of
- How the protein might potentially fold
- What other proteins it is related to
- What its function might be
- FASTA (http://www.ebi.ac.uk/Tools/fasta33/)
- One of several web servers you can use for this.
- Provides similarity search against protein
database.
- Lets you select substitution matrix (BLOSUM50,
BLOSUM62, etc.) for search.
- A substitution matrix describes the rate at which
one character in a sequence changes to other
character states over time.
- One would use a higher numbered BLOSUM matrix for
aligning two closely related sequences and a
lower number for more divergent sequences.
7Gaps In Sequence Alignments
- When aligning sequences, the score is affected by how
much penalty is assigned to gaps in the sequence.
- Larger gaps
- Imply greater evolutionary distance between
sequences
- Probably should be assigned a higher penalty
Smaller gap, smaller penalty:
ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
Larger gap, larger penalty:
ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
ATCTTCAGTGTTTCCCCTGTTTTGCCCGCCCCCCCCCCCCCCCCCCCATTTAGTTCGCTC
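The effect of gap length on the score can be made concrete with a small Python sketch. It scores an already-aligned pair of rows with an affine gap penalty (one cost to open a gap, a smaller cost for each further gap column). The function name `score_alignment` and all numeric values (match +1, mismatch -1, gap open -5, gap extend -1) are illustrative assumptions, not the defaults of FASTA or any other server.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1, gap_chars=".-"):
    """Score two aligned rows of equal length with an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a in gap_chars or b in gap_chars:
            # First column of a gap pays gap_open, later columns gap_extend.
            # Simplification: gaps in either row are treated as one run.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# The two alignments from the slide, rebuilt from their shared flanks.
prefix = "ATCTTCAGTGTTTCCCCTGTTTTGCCC"
suffix = "ATTTAGTTCGCTC"

short_score = score_alignment(prefix + "." + suffix,
                              prefix + "G" + suffix)
long_score = score_alignment(prefix + "." * 20 + suffix,
                             prefix + "G" + "C" * 19 + suffix)
print(short_score, long_score)
```

With these assumed values the one-column gap costs only the opening penalty, while the 20-column gap also pays 19 extension penalties, so the second alignment scores lower even though the flanking matches are identical.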
8Other Sequence Databases
- BLAST and PSI-BLAST also commonly used.
- BLAST can be found at
- http://www.ncbi.nlm.nih.gov/BLAST/
- PSI-BLAST can be found at
- http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on
9BLAST Entry Window
- Enter FASTA sequence or upload file
- Choose your search set
- Select Program
10Results for mutant 1exr
- Calmodulin At 1.68 Angstroms Resolution
- Length = 148
- Score = 267 bits (683), Expect = 2e-70,
Method: Compositional matrix adjust.
- Identities = 142/153 (92%), Positives = 142/153
(92%), Gaps = 8/153 (5%)
Query 1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
           AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct 1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
Query 61   GTIDFPEFLSLMARKMKEQDDQFDQSEEELIEAFKVFDRFFFGLISAAELRHV---LGEK 117
           GTIDFPEFLSLMARKMKEQD     SEEELIEAFKVFDR   GLISAAELRHV   LGEK
Sbjct 61   GTIDFPEFLSLMARKMKEQD-----SEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEK 115
Query 118  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK 150
           LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK
Sbjct 116  LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK 148
Figure labels marking regions of the alignment: Deleted, Inserted, Mutated
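The summary line of a BLAST hit (identities, gaps) can be recomputed directly from the aligned rows, which is a useful sanity check when reading a result. The Python sketch below does this for the calmodulin alignment on this slide; the function name `alignment_stats` is hypothetical, and the percentages are rounded down on the assumption that this matches the integers BLAST reports.

```python
def alignment_stats(query, sbjct, gap="-"):
    """Count identities and gap columns in two aligned rows."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    pairs = list(zip(query, sbjct))
    identities = sum(1 for q, s in pairs if q == s and q != gap)
    gaps = sum(1 for q, s in pairs if q == gap or s == gap)
    length = len(pairs)
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        # Integer percentages, rounded down (assumed to mirror BLAST output).
        "percent_identity": 100 * identities // length,
        "percent_gaps": 100 * gaps // length,
    }


# Aligned rows from the slide, concatenated across the three blocks.
block1 = "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"
block3 = "LTDDEVDEMIREADIDGDGHINYEEFVRMMVSK"
query = (block1
         + "GTIDFPEFLSLMARKMKEQDDQFDQSEEELIEAFKVFDRFFFGLISAAELRHV---LGEK"
         + block3)
sbjct = (block1
         + "GTIDFPEFLSLMARKMKEQD-----SEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEK"
         + block3)

stats = alignment_stats(query, sbjct)
print(stats)
```

The five-column gap in the subject row corresponds to residues inserted in the query, the three-column gap in the query row to residues deleted from it, and the remaining non-identical columns to substitutions.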
11Swiss-Prot/UniProt Database
- Central hub for the collection of functional
information on proteins.
- amino acid sequence
- protein name or description
- taxonomic data and citation information
- Access point to ProSite
- http://www.expasy.ch/tools/scanprosite/
- ProSite identifies sequences and displays any
associated motifs, or accepts a motif and returns
related sequences.
12What are Motifs?
- A different approach for incorporating multiple
sequence information into a database search is to
use a motif.
- Motifs do not assign a score at every position in
an alignment, but describe key residues that are
conserved and define the family. Sometimes this
is called a "signature".
- Example of pseudo-EF-Hand motif (Calciomics
Pattern Search, developed in Dr. Yang's lab)
- [LMVITNF]-[FY]-X(2)-[YHIVF]-[SAITV]-X(5,9)-[LIMV]-
X(3)-[EDS]-[LFM]-[KRQL]-X(20,28)-[LQKF]-[DNG]-X(1)
-[DNSC]-X(1)-[DKN]-X(4)-[FY]-X(1)-[EKS]
- Specific residues can also be excluded by
enclosing them in curly brackets, e.g. {DE}
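A motif written in this style can be searched for with an ordinary regular expression. The Python sketch below translates a PROSITE-like pattern into a regex, assuming the usual conventions: square brackets list allowed residues, curly brackets list excluded residues, X (or x) is any residue, and (n) or (n,m) is a repeat count. The function name `prosite_to_regex` and the short test pattern and sequence are made up for illustration; this is not the Calciomics Pattern Search implementation.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split an element such as "X(5,9)" into its core and repeat counts.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            regex = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # excluded residues
        else:
            regex = core                      # "[ABC]" or a single residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# Hypothetical short pattern and sequence, chosen only to show the mechanics.
regex = prosite_to_regex("[FY]-X(2)-{DE}-[EKS]")
hit = re.search(regex, "AAFQQKEAA")
print(regex, hit.group(0) if hit else None)
```

The same function can be applied to the full pseudo-EF-hand pattern once its brackets are written out, and the compiled regex run over each sequence in a FASTA file.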
13Multiple Sequence Alignment (MSA)
- Alignments can provide information on
- Domain structure
- Location of residues likely to be involved in
protein function
- Solvent exposure of residues
- Evolutionary relationships
- Build profiles for more sensitive searches
- What you can do with this information
- Create signatures for pattern searching
- Identify conserved vs. variable regions
- Identify structural and/or functional motifs
14MSA with ClustalW
http://www.ebi.ac.uk/Tools/clustalw2/index.html
15Ca-O-C angles for a) Non-EF-Hand and b) EF-Hand
16Distribution of SC angles
Figure: unrooted N-J phylogenetic tree generated by TreeView. Leaf labels: S100, Calbindin d9k, Penta-EF, Parvalbumin, Osteonectin, Polcalcin, 1WDC C, 2BL0 C, 2HQ8, 2H2K; marked R1 and R2.
17Distribution of MC angles
Figure: unrooted N-J phylogenetic tree generated by TreeView. Leaf labels: S100, Calbindin d9k, Penta-EF, Parvalbumin, Osteonectin, Polcalcin, 1WDC C, 2BL0 C, 2HQ8, 2H2K; marked R1 and R2.
18Secondary Structure PDBSum
- http://www.ebi.ac.uk/pdbsum/
- Predicted secondary (2°) structure from sequence
- Either enter a PDB file or load a new/existing
sequence
19Secondary Structure PDBSum
2oky
20Protein Data Bank
- http://www.rcsb.org/pdb/home/home.do
- Comprehensive database of protein structures
- Provides
- 3D structural data
- FASTA sequence
- Citation info (who solved it, related
publications, etc.)
- Experimental methods (X-ray diffraction, NMR)
- Resolution
- Classification (e.g. metal transporter)
- Ligands, cofactors
- Related PDB entries
21PDB ATOM/HETATM Record Format
Data Record Partitioning
Occupancy: Indicates how frequently an atom is
detected in a specific location. Where occupancy <
1.00, x-ray diffraction indicates more than one
position, i.e. there is flexibility or
disorder.
B-Factor: Thermal motion of the atom. A high
B-factor implies uncertainty.
Columns  Field
1-6      Record name "ATOM  " or "HETATM"
7-11     Atom serial number
13-14    Chemical symbol (right justified)
18-20    Residue name
22       Chain identifier
23-26    Residue sequence number
31-38    X coordinate
39-46    Y coordinate
47-54    Z coordinate
55-60    Occupancy
61-66    Isotropic B-factor
77-78    Element symbol
Text View of PDB File
ATOM      1  N   ALA A  43      69.834  21.345  42.623  1.00 76.76           N
ATOM      2  CA  ALA A  43      69.016  22.376  41.988  1.00 72.63           C
ATOM      3  C   ALA A  43      67.991  21.777  41.038  1.00 63.96           C
ATOM      4  O   ALA A  43      66.942  22.368  40.784  1.00 56.68           O
ATOM      5  CB  ALA A  43      69.924  23.339  41.198  1.00 72.97           C
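Because ATOM/HETATM records are fixed-width, they are read by column position rather than by splitting on whitespace. The Python sketch below slices one record using the column ranges listed on this slide. The function name `parse_atom_record` is hypothetical, and one detail is an assumption that differs from the slide: the atom name is read from columns 13-16, as in the wwPDB format description, rather than 13-14.

```python
def parse_atom_record(line):
    """Parse one fixed-width ATOM/HETATM line into a dictionary."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    # Python slices are 0-based and end-exclusive: columns 7-11 -> [6:11].
    return {
        "record": record,
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),      # columns 13-16 (assumed, see above)
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# First record from the slide, laid out in its fixed columns.
sample = ("ATOM      1  N   ALA A  43      69.834  21.345  42.623"
          "  1.00 76.76           N")
atom = parse_atom_record(sample)
print(atom)
```

Slicing by column keeps working when fields run together, for example when a coordinate is wide enough to leave no space before its neighbor, which is where whitespace splitting fails.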
22Pymol Viewer
Can save session, including labels, angles,
distances, etc. These features can be turned on
or off without loss of data.
23Proteomics Tools External tools to extract PDB
Data
- http://bip.weizmann.ac.il/oca-bin/lpccsu/
- LPC: Analysis of interatomic Contacts in
Ligand-Protein complexes
- CSU: Analysis of interatomic contacts in protein
entries
- OCA allows the user to rapidly search through the
contents of the entire PDB Archive for entries
obeying certain constraints
- Ex. I want to find all proteins that have Zn2+
bound to the structure, deposited in the PDB between
certain dates
24Revising the PDB File
- Adding Hydrogen Atoms (required for using DelPhi)
- Reduce (http://kinemage.biochem.duke.edu/software/reduce.php)
- Runs on Mac, Linux, Windows
- Free to download
- Sybyl (http://www.tripos.com)
- Runs on Linux
- Not free
- Calculating Electrostatic Potential
- DelPhi (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:DelPhi)
- Runs on Mac, Linux, Windows (C and Fortran
compilers required)
- Free to download
25Protein Structure Adding Hydrogen SYBYL
- In addition to adding hydrogen atoms to a PDB
file, Sybyl can be used to compare structures,
calculate RMSD values between structures, and perform
minimization calculations.
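For reference, the RMSD between two structures is the square root of the mean squared distance between matched atoms. The Python sketch below computes it for two coordinate lists that are already paired atom by atom and already superimposed. It is only an illustration of the formula: the function name `rmsd` and the toy coordinates are made up, and packages such as Sybyl typically perform an optimal superposition before reporting RMSD, which this sketch does not.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the matched pair of atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates: the second atom is displaced by 2 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(reference, model)
print(round(value, 3))
```

In practice the coordinate lists would come from matched atoms (for example C-alpha atoms) of two PDB files.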
26Protein Structure Analysis
- PONDR (Predictor of Naturally Disordered Regions)
- (http://www.pondr.com/)
- Internet-based
- Not free
- VADAR (Volume, Area, Dihedral Angle Reporter)
- (http://redpoll.pharmacy.ualberta.ca/vadar/)
- Internet-based
- Free to use
Leigh Willard, Anuj Ranjan, Haiyan Zhang, Hassan
Monzavi, Robert F. Boyko, Brian D. Sykes, and
David S. Wishart. "VADAR: a web server for
quantitative evaluation of protein structure
quality." Nucleic Acids Res. 2003 July 1; 31(13):
3316-3319
27Protein Structure Analysis PONDR
Uses a series of neural network predictors (NNPs)
that use sequence data to predict disorder (i.e.
lack of fixed tertiary (3°) structure) in a given region.
28Protein Structure Analysis VADAR
- A compilation of 15 algorithms for analyzing and
assessing peptide and protein structures from PDB
data.
- Ramachandran plot
- Shows possible conformations of phi and psi
angles for residues in a protein based on energy
considerations.
- Very useful for determining whether model
structures are likely conformations
- Disallowed regions involve steric clash (VDW
distances)
Figure: Ramachandran plot with regions labeled β-sheet, LH α-helix, RH α-helix
http://www.bmb.uga.edu/wampler/tutorial/prot2.html
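Each point on a Ramachandran plot is a pair of backbone dihedral angles: phi is defined by the atoms C(i-1), N(i), CA(i), C(i), and psi by N(i), CA(i), C(i), N(i+1). The Python sketch below computes a signed dihedral angle from four points using the standard atan2 formulation. The function name `dihedral_degrees` and the toy coordinates are illustrative; this is not VADAR's code.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)   # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy geometry: the two outer bonds are perpendicular when viewed down p1-p2.
plus_90 = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
minus_90 = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, -1, 1))
print(plus_90, minus_90)
```

Feeding the function backbone coordinates read from ATOM records, residue by residue, gives the (phi, psi) pairs that are plotted against the allowed regions.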
29Visualizing Electrostatic Potential DelPhi and
Grasp
- DelPhi
- (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:DelPhi)
- SGI Unix
- Free to download
- GRASP (Graphical Representation and Analysis of
Structural Properties)
- (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:GRASP)
- SGI Unix
- Free to download
30Visualizing Electrostatic Potential DelPhi and
Grasp
DelPhi takes as input a coordinate file of
a molecule, or equivalent data for geometrical
objects (PDB file), and calculates the electrostatic
potential in and around the system, using a
finite difference solution to the
Poisson-Boltzmann equation. It produces a modified
PDB file and an emap file as input to third-party
visualization software (e.g. GRASP). GRASP then
displays and manipulates the surfaces of
molecules and their electrostatic properties.
31Proteomics Tools GetArea 1.1
Total Area
Area by Residue
- To quickly calculate solvent accessible surface
area or solvation energy of a protein molecule.
- Ex. Is a proposed metal-binding site solvent
accessible?
http://pauli.utmb.edu/cgi-bin/get_a_form.tcl
32Prediction and Design
- Prediction of protein functional site
- Prediction of protein structure
- Design of protein functional site
- Design of protein structure
- Why prediction and design?
33Protein Structure Prediction
- Modeller
- http://www.salilab.org/modeller/
- Homology modeling
- Tasser
- http://zhang.bioinformatics.ku.edu/I-TASSER/
- Threading
- Rosetta
- http://robetta.bakerlab.org/
- Ab initio
- CASP (many others)
- http://predictioncenter.org/
- A center providing objective testing of
prediction programs
34Protein Structure Homology Modeling SWISS-MODEL
- http://swissmodel.expasy.org/
- Submit a FASTA sequence (known or unknown)
- Swiss-Model conducts a BLAST search to align the
sequence with known structures
- Builds a 3D output model that can be viewed using
DeepView (expasy)
- Graphic file can be saved
- Many other features including alignment modeling
with MSAs.
1EXR.pdb viewed using DeepView
35Protein Structure Homology Modeling
PredictProtein
Similar to Swiss-Model, Modeller
Requires registration/login
36Protein Structure Homology Modeling Modeller
37Modeller
Šali and Blundell, JMB, 1993, Comparative protein
modeling by satisfaction of spatial restraints
38TASSER
1. Find templates (sequences with known structure)
that share sequence similarity (global or local) with
the query sequence.
2. Based on step 1, the query sequence is divided
into aligned segments (which have a template) and
unaligned segments.
3. Use a Monte Carlo method
to connect the aligned segments.
4. Outputs
(multiple possible structures) are clustered and
the final structure is obtained.
Zhang and Skolnick, PNAS, 2004. Automated
structure prediction of weakly homologous
proteins on a genomic scale
39Rosetta
1. Construct a fragment library for each three-
and nine-residue segment. The fragments are extracted
from observed structures in the PDB.
2. Model the
structure of the fragments from the library.
3. Connect the fragments.
4. Rank the predicted
structures according to a scoring function.
40Programs for Predicting Metal Binding Site
- FEATURE
- http://feature.stanford.edu/webfeature/
- Machine learning (Bayesian method)
- MUG
- http://chemistry.gsu.edu/faculty/Yang/Calciomics.htm
- Geometric search to predict calcium binding site
- CHED
- http://ligin.weizmann.ac.il/ched
- Combine machine learning and geometric search to
predict zinc and other transition metal binding
sites.
41FEATURE
- 1. Designed and tested their algorithm on
protein holo structures.
- 2. The protein structure is embedded into a 3D
grid.
- 3. Each grid point is evaluated by a probability
scoring function (Wei and Altman)
- 4. The points of high score are the predicted
Ca2+ locations
Wei and Altman, Protein Science, 1998
42Observation
MUG
Figure: panels A, B, C, D; labels "filters" and "< 6.0 Å"
Wang, Kirberger, Qiu, Chen and Yang, Proteins,
2009
43 CHED
- 1. Use protein apo structures
- 2. Geometric search for a qualified triad of C,
H, E, D
- 3. Side-chain rotation of an unqualified triad
- 4. Apply filters to the resulting qualified triad to
classify the triad as a binding triad or
non-binding triad
Figure: flowchart with distances d1, d2, d3; branches "qualified" and "unqualified"; output: binding/nonbinding triad
Babor et al., Proteins, 2008
44Design Program
- DEZYMER (Hellinga)
- Given a ligand and a protein with known
structure, suggest residues to be mutated so that
the resulting protein binds the ligand.
- ORBIT (Mayo)
- Given a backbone structure, design a sequence
such that it folds to that backbone.
- Rosetta (Baker)
- One program to treat diverse problems
- Prediction and design
45DEZYMER
1. Define the expected binding geometry.
2. Find backbone places where, if appropriate side chains
are added, the predefined geometry is
satisfied.
3. Place the side chains and ligand,
and optimize their positions.
4. Repack residues
in positions other than binding residues. If
necessary, change the residue type.
Hellinga and Richards, JMB, 1991. Construction of
new ligand binding sites in proteins of known
structure
46ORBIT
1. Divide the target structure into three parts:
core, surface and boundary
2. Core: Ala, Val,
Leu, Ile, Phe, Tyr, Trp
Surface: Ala, Ser,
Thr, His, Asp, Asn, Glu, Gln, Lys, and Arg
Boundary: union of the above two
3. 1.9 × 10^27 possible sequences
4. Select the best sequence
efficiently, using dead-end elimination (DEE)
Solution structure of the designed protein.
Stereoview showing the best-fit superposition of
the 41
Comparison between the designed backbone
(averaged NMR structure, blue) and the target
backbone (red)
Dahiyat and Mayo, Science, 1997. De Novo Protein
Design: Fully Automated Sequence Selection
47Supplemental Slides
48Calciomics
- Calciomics is a specialized area of biochemistry
focusing on the study of calcium-binding
biological macromolecules and proteins to
understand the factors that contribute to
calcium-binding affinity and the selectivity of
proteins and calcium-dependent conformational
change.
- http://lithium.gsu.edu/faculty/Yang/Calciomics.htm