Title: Introduction to Sequence Analysis Software
1Introduction to Sequence Analysis Software
Data Libraries
- BIOINFORMATICS I
- Protein and DNA Sequence Analysis
- Jaime E. Ramirez-Vick, PhD
2Why Sequence Analysis
- Sequence analysis is the process of applying computational methods to a biological sequence represented as a character string.
- The goal is to use these computational methods to infer information about the structure, function, or evolutionary history of the sequence.
- The stronger the evidence, the more confident we can be in the inference.
- To get the strongest evidence, the proper techniques must be employed.
3The Goal
4The Process
[Diagram labels: Curated Dataset, Homology Modeling]
5The Project
- Part I: Submit three candidate families for your course project.
- Part II: Collect an initial set of sequences, generate a multiple sequence alignment
- Part III: Improve the quality of your alignment, and identify additional family members
- Part IV: Add structural and/or evolutionary information, and give a final report
6Part I
[Diagram labels: Curated Dataset, Homology Modeling]
7Part II
[Diagram labels: Curated Dataset, Homology Modeling, Classification Libraries, Sequence Libraries]
8Part III
[Diagram labels: Curated Dataset, Homology Modeling, Multiple Sequence Alignment, Sequence Libraries, Profile PSSM, Local Patterns]
9Part IV
[Diagram labels: Curated Dataset, Homology Modeling, Multiple Sequence Alignment, Evolutionary Analysis]
10Structural Libraries
- Structure libraries contain the actual three-dimensional coordinates of a macromolecule.
- Used to:
- Determine if the three-dimensional structure for a molecule has been solved
- Visualize the three-dimensional structure
- Assist in homology modeling
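As a rough illustration of what "three-dimensional coordinates" means in practice, the sketch below reads the x, y, z fields of PDB-style ATOM records by column position. The helper names (`make_atom_line`, `parse_atom_line`) and the two atoms are made up for the example; the coordinates are not taken from a real entry.

```python
import math

def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a synthetic ATOM record in the fixed-column PDB layout."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")

def parse_atom_line(line):
    """Read atom name, residue, chain, and coordinates by column position."""
    return {
        "atom": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]),       # columns 31-38
                float(line[38:46]),       # columns 39-46
                float(line[46:54])),      # columns 47-54
    }

# Two hypothetical backbone atoms of one residue (illustrative values only).
n_atom = parse_atom_line(make_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614))
ca_atom = parse_atom_line(make_atom_line(2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842))

# Euclidean distance between the two atoms, in the same units as the coordinates.
distance = math.dist(n_atom["xyz"], ca_atom["xyz"])
print(f"{n_atom['atom']}-{ca_atom['atom']} distance: {distance:.2f}")
```

Visualization and homology modeling tools work from the same coordinate fields, only over every atom in the file.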
11Structural Libraries
- Protein Data Bank (PDB)
- Large Molecules (1000 atoms)
- For more information see
- http://www.psc.edu/general/software/packages/pdb/pdb.html
- http://www.rcsb.org/pdb/
- Cambridge Structural Database
- Small Molecules (100 atoms)
- For more information see
- http://www.ccdc.cam.ac.uk/
12Classification Libraries
- Built from sets of related sequences and contain information about the residues that are essential to the structure/function of the group of related sequences
- Used to:
- Generate a testable hypothesis that the query sequence belongs to the group.
- Quickly identify a good group of sequences known to share a biological relationship.
13Classification Libraries
- Some Popular Classification Libraries
- PROSITE: http://www.expasy.ch/prosite.html
- PFAM: http://pfam.wustl.edu/
- IPROCLASS: http://pir.georgetown.edu/iproclass/
- BLOCKS: http://www.blocks.fhcrc.org/
- PRINTS: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
- Transcription Factor Database: http://transfac.gbf.de/TRANSFAC/
- Restriction Enzyme Database: http://rebase.neb.com/
- Search software is usually specific to the database
14Classification Libraries - Representation
- Consensus
- Residue most common at each position in alignment
- Composite
- Set representation (e.g. a g,c acg t a)
- Composition Matrix
- Table of how many residues are present at each position
- Position Specific Scoring Matrix (PSSM or Profile)
- Log-odds likelihood of each residue at each position
- Hidden Markov Model
- Probabilistic state representation
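The first four representations can be derived from the same alignment. The sketch below is a minimal illustration on a made-up DNA alignment; the uniform background frequency and the pseudocount of 1 are assumptions chosen for simplicity, not the settings of any particular classification library.

```python
import math
from collections import Counter

# Toy alignment (hypothetical data): four sequences, five columns.
alignment = ["ACGTA", "ACGTA", "AGGTA", "ACCTA"]

def composition_matrix(seqs):
    """Count how many times each residue appears in each column."""
    return [Counter(col) for col in zip(*seqs)]

def consensus(seqs):
    """Most common residue at each position."""
    return "".join(c.most_common(1)[0][0] for c in composition_matrix(seqs))

def composite(seqs):
    """Set of residues seen at each position, written like A[CG]T."""
    parts = []
    for col in zip(*seqs):
        residues = "".join(sorted(set(col)))
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "".join(parts)

def log_odds_pssm(seqs, alphabet="ACGT", pseudocount=1.0):
    """Log2 odds of each residue per column against a uniform background."""
    background = 1.0 / len(alphabet)
    total = len(seqs) + pseudocount * len(alphabet)
    return [
        {a: math.log2(((counts[a] + pseudocount) / total) / background)
         for a in alphabet}
        for counts in composition_matrix(seqs)
    ]

print(consensus(alignment))
print(composite(alignment))
print(composition_matrix(alignment)[1])
print(log_odds_pssm(alignment)[1])
```

Positive log-odds values mark residues seen more often than the background predicts; negative values mark residues that are rare or absent at that position.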
15Sequence Libraries
- Compilations of known sequences with experimental information about those sequences.
- Used to:
- Generate a testable hypothesis that the query sequence may be related to known sequences in the library.
- Retrieve annotation information about sequences
16Sequence Libraries
- Nucleic Acids
- GenBank: http://www.ncbi.nlm.nih.gov/
- EMBL: http://www.ebi.ac.uk/
- Protein
- UniProt/UniRef: http://www.uniprot.org/
- Other protein collections
- PIR: http://nbrfa.georgetown.edu/
- Swiss-Prot: http://www.ebi.ac.uk/
- GenPept: http://www.ncbi.nlm.nih.gov/
- PIR-NREF: http://nbrfa.georgetown.edu/
- TREMBL: http://www.ebi.ac.uk/
- Older Libraries: PATCHX, OWL
17Sequence Libraries - Searching
- Searching Methods
- Dynamic Programming
- Global: Needleman-Wunsch, Sellers
- Local: Smith-Waterman, Waterman-Eggert "Maxsegs"
- Approximations
- FASTA
- BLAST
- The user must understand what the searching method considers "similar"
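To make the global versus local distinction concrete, here is a minimal dynamic programming sketch that returns only the optimal score. With `local=False` it follows the Needleman-Wunsch recurrence; with `local=True` it follows Smith-Waterman. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions for illustration; real searches use substitution matrices and affine gap penalties.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of a and b using a linear gap penalty."""
    # First row: leading gaps cost nothing locally, j * gap globally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]

query = "TTTACGTTTT"
subject = "GGACGGG"
print("global:", align_score(query, subject))
print("local: ", align_score(query, subject, local=True))
```

The two sequences share only a short region, so the local score rewards that region while the global score is dragged down by the flanking mismatches and gaps. This is one reason the same pair can look "similar" to one method and not to another.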
18Sequence Libraries Similar?
Figure: Blood coagulation protein superfamily. From "The Molecular Basis of Blood Coagulation" by Furie and Furie, Cell, Vol. 53.
19Sequence Libraries - Results
Box 3.6 from Introduction to Bioinformatics by
Attwood and Parry-Smith
20Multiple Sequence Alignment
- An MSA is an alignment of a group of related sequences across their entire lengths in a manner that highlights the conservation of the important residues in the sequences
- A critical building block for many next steps, such as finding distantly related sequences, determining the evolutionary history, and homology modeling
- Not all sets of related sequences can be aligned cleanly across their entire lengths
21Multiple Sequence Alignment
Figure: Blood coagulation protein superfamily. From "The Molecular Basis of Blood Coagulation" by Furie and Furie, Cell, Vol. 53.
22Multiple Sequence Alignment
- When aligning groups of related sequences, it is important to note that residues in those sequences fall into one of three classes:
- Conserved (not mutated)
- Unconstrained (when mutated, can be almost any amino acid)
- Constrained (when mutated, must be one of a few amino acids)
- Motifs are distinct units that consist of the conserved and constrained regions. Motifs generally tell us the residues that are essential to the structure/function of the sequence
- Typically, multiple sequence alignments contain motifs as well as unconstrained regions.
- Aligning motifs in a multiple sequence alignment will improve the quality of the alignment
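The three residue classes can be approximated by counting how many distinct residues appear in each alignment column. The sketch below is a simple heuristic, not a statistical test: the cutoff of three residues for "constrained" is an assumption, and the toy sequences are made up.

```python
def classify_columns(seqs, max_constrained=3):
    """Label each alignment column by how many distinct residues it contains."""
    labels = []
    for col in zip(*seqs):
        residues = {r.upper() for r in col if r != "-"}
        if not residues:
            labels.append("gap")            # column contains only gaps
        elif len(residues) == 1:
            labels.append("conserved")      # one residue, never mutated
        elif len(residues) <= max_constrained:
            labels.append("constrained")    # a few alternatives tolerated
        else:
            labels.append("unconstrained")  # almost anything tolerated
    return labels

# Hypothetical aligned fragments, one character per column.
toy_alignment = ["LTEAG", "LTDSG", "LTEQG", "LTDHG"]
for position, label in enumerate(classify_columns(toy_alignment), start=1):
    print(position, label)
```

Runs of adjacent conserved and constrained columns are candidate motifs; runs of unconstrained columns are where alignment programs have the most freedom to place gaps.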
23Multiple Sequence Alignment
Secondary structure labels: helix, helix, sheet
Lrr 2e  kdafrdlhsLsl-LsLydNnI-----qsL
LRR A1  wteLlpllqqyEvvrLddCgLTeehCkdi
LRR A2  lqgLqsPtCkiqkLsLqnCsLTeaGCgvL
LRR A3  cegLldPqChLEkLqLeyCrLTaasCepL
LRR A4  gqgLadsaCqLEtLrLenCgLTpanCkdL
LRR A5  cpgLlsPasrLktLwLweCdiTasGCrdL
LRR A6  cesLlqPGCQLEsLwvksCsLTaacCqhv
LRR A7  cqaLsqPgttLrvLcLgdCeVTnsGCssL
LRR A8  lgsLeQPgCaLEqLvLydtywTeevedrL
LRR B1  gsaLranpsLtE-LcLrtNeLGDaGvhlv
LRR B2  pstLrslptLrE-LhLsdNpLGDaGlrlL
LRR B3  asvLratraLkE-LtvsnNdiGeaGarvL
LRR B4  cgivasqasLrE-LDLgsNgLGDaGiaeL
LRR B5  crvLqaketkKE-LsLagNkLGDeGarlL
LRR B6  slmLtqnkhLlE-LqLssNkLGDsGiqeL
LRR B7  aslLlanrsLRE-LdLsnNcvGDpGvlqL
24Multiple Sequence Alignment
Figure 8.2 from Introduction to Bioinformatics by
Attwood and Parry-Smith
25Multiple Sequence Alignment - Programs
- MSA Using Progressive Pairwise Technique
- Clustal
- MSA Using Multidimensional Dynamic Programming
- MSA
- MSA Using Consistency Measures
- T-Coffee
- Probcons
- MSA Editor
- GeneDoc
26Position Specific Scoring Matrix
- A Position Specific Scoring Matrix (PSSM or Profile) is a way to abstract the information contained in a multiple sequence alignment.
- Think of a PSSM as a custom PAM or BLOSUM scoring matrix that has been specially tuned to locate sequences exactly like those in the alignment.
- Probabilities are represented using the log-odds technique
27Position Specific Scoring Matrix
- Used to:
- Help locate distantly related sequences
- Help resolve sequences that are not considered statistically significant by a database search, but share enough important residues to infer that the sequences may have the same function and be distant members of the sequence family
- Good MSA -> Good PSSM
- Poor MSA -> Poor PSSM
- A lot of Sequences -> Good PSSM
- Few Sequences -> Good PSSM
28PSSM Programs
- MakePSSM
- Used to create a PSSM from a multiple sequence alignment
- PSSM can be created using different methods, including Gribskov and Henikoff, and with a variety of PAM and BLOSUM matrices.
- ProfileSS
- Used to search a sequence database with a profile.
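The idea of searching with a profile can be sketched as sliding the PSSM along a sequence and summing the per-position scores of each window. This is only an ungapped illustration and is not the algorithm used by ProfileSS; the three-column `toy_pssm`, its scores, and the `default` score for unknown residues are all made up for the example.

```python
def scan_with_pssm(pssm, sequence, default=-4.0):
    """Score every ungapped window of the sequence against the PSSM."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the score of the observed residue in each profile column.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        hits.append((score, start, window))
    return sorted(hits, reverse=True)   # best-scoring window first

# Hypothetical log-odds scores for a profile that favors the pattern A-C-G.
toy_pssm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]

for score, start, window in scan_with_pssm(toy_pssm, "TTACGTT")[:3]:
    print(f"{window} at {start}: {score:+.1f}")
```

Running the same scan over every sequence in a library and ranking the best window per sequence gives a crude profile search; deciding which scores are statistically significant is a separate step.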
29Hidden Markov Model
- A Hidden Markov Model (HMM) is a way to abstract the information contained in a multiple sequence alignment.
- Think of an HMM as a way to represent a multiple sequence alignment by deriving probabilities directly from the multiple sequence alignment
- Probabilistic model: includes probabilities for insertions and deletions
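A heavily simplified sketch of "deriving probabilities directly from the alignment" follows. It marks mostly ungapped columns as match positions, estimates residue emission probabilities there, and records how often a sequence skips each match position. A real profile HMM also estimates state-to-state transition probabilities, which this sketch omits; the 50% gap cutoff, the pseudocount, and the toy alignment are assumptions.

```python
from collections import Counter

def match_columns(seqs, max_gap_fraction=0.5):
    """Columns with few gaps are treated as match positions; the rest as inserts."""
    n = len(seqs)
    return [i for i, col in enumerate(zip(*seqs))
            if col.count("-") / n <= max_gap_fraction]

def estimate_match_states(seqs, alphabet="ACGT", pseudocount=1.0):
    """Emission probabilities and deletion frequency for each match column."""
    states = []
    for i in match_columns(seqs):
        column = [s[i] for s in seqs]
        counts = Counter(r for r in column if r != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        states.append({
            "column": i,
            "emissions": {a: (counts[a] + pseudocount) / total for a in alphabet},
            # Fraction of sequences with a gap here, i.e. that delete this position.
            "delete_frequency": column.count("-") / len(column),
        })
    return states

# Hypothetical alignment: column 2 is mostly gaps, so it is treated as an insertion.
toy_alignment = ["AC-GT", "ACAGT", "A--GT", "AC-GT"]
print("match columns:", match_columns(toy_alignment))
for state in estimate_match_states(toy_alignment):
    print(state["column"], state["delete_frequency"], state["emissions"])
```

With only four sequences the pseudocounts dominate the estimates, which illustrates why a model built this way needs many sequences to be reliable.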
30Hidden Markov Model
- Used to:
- Help locate distantly related sequences
- Help resolve sequences that are not considered statistically significant by a database search, but share enough important residues to infer that the sequences may have the same function and be distant members of the sequence family
- Good MSA -> Good HMM
- Poor MSA -> Poor HMM
- A lot of Sequences -> Good HMM
- Few Sequences -> Poor HMM
31Hidden Markov Model
- HMMER Package
- hmmalign - Align multiple sequences to a profile HMM.
- hmmbuild - Build a profile HMM from a given multiple sequence alignment.
- hmmcalibrate - Determine appropriate statistical significance parameters for a profile HMM prior to doing database searches.
- hmmconvert - Convert HMMER profile HMMs to other formats
- hmmemit - Generate sequences probabilistically from a profile HMM.
- hmmfetch - Retrieve an HMM from an HMM database
- hmmindex - Create a binary SSI index for an HMM database
- hmmpfam - Search a profile HMM database with a sequence
- hmmsearch - Search a sequence database with a profile HMM
32Local Patterns
- Local patterns are short motifs that exist in all (or a subset) of the related sequences.
- Finding local patterns may help us to align biologically important sections in a multiple sequence alignment.
- These local patterns can also be used to probe sequence data libraries for distant relatives
33Local Patterns
Figure: Blood coagulation protein superfamily. From "The Molecular Basis of Blood Coagulation" by Furie and Furie, Cell, Vol. 53.
34Local Patterns - Programs
- MEME: a tool for discovering motifs in groups of sequences
- oops - One Occurrence Per Sequence
- zoops - Zero or One Occurrence Per Sequence
- tcm - Multiple occurrences per sequence
- MAST: will search a sequence database for sequences that contain MEME patterns
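The three occurrence models differ only in how many times a motif is allowed to appear in each sequence. The sketch below does not discover motifs as MEME does; it counts matches of an already known pattern with a regular expression and reports the most restrictive model that fits. The pattern `GDSG`, the sequence names, and the sequences themselves are made up.

```python
import re

def occurrences_per_sequence(pattern, seqs):
    """Count non-overlapping matches of the pattern in each named sequence."""
    regex = re.compile(pattern)
    return {name: len(regex.findall(seq)) for name, seq in seqs.items()}

def most_restrictive_model(counts):
    """Pick the occurrence model that the observed counts are consistent with."""
    values = list(counts.values())
    if all(v == 1 for v in values):
        return "oops"    # exactly one occurrence in every sequence
    if all(v <= 1 for v in values):
        return "zoops"   # zero or one occurrence per sequence
    return "tcm"         # some sequence has multiple occurrences

toy_sequences = {
    "seq1": "AAGDSGAA",
    "seq2": "GDSGTTGDSG",
    "seq3": "TTTTTTTT",
}
counts = occurrences_per_sequence("GDSG", toy_sequences)
print(counts, most_restrictive_model(counts))
```

Choosing too restrictive a model can hide repeated motifs, while choosing too permissive a model makes the search slower and more prone to chance matches.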
35Homology Modeling
- Predicts the three-dimensional structure of a given protein sequence (TARGET) based on an alignment to one or more known protein structures (TEMPLATES)
- If similarity between the TARGET sequence and the TEMPLATE sequence is detected, structural similarity can be assumed.
36Homology Modeling
Structural Superposition of Aldehyde
Dehydrogenase Family Members
37Homology Modeling - Programs
- Modeller: used for homology (or comparative) modeling of protein three-dimensional structures
- MMTSB: Multiscale Modeling Tools for Structural Biology
- VMD: molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting
38Evolutionary Analysis
- Inferring phylogenies: finding the tree that implies the correct evolutionary history of the sequences
- Principal Methods
- Parsimony analysis
- Distance methods
- Maximum Likelihood
- Each approach has its own strengths/weaknesses
- To assess the correctness of the tree we need to understand
- The overall signal-to-noise ratio in the data
- How the tree compares to alternate trees
- How reliable the individual branches are
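Distance methods start from a table of pairwise distances between the aligned sequences. The sketch below computes the simplest such measure, the proportion of differing sites (p-distance), on a made-up alignment. Practical analyses usually apply a substitution model to correct for multiple changes at the same site, which this sketch leaves out.

```python
def p_distance(a, b):
    """Fraction of aligned, ungapped positions where the two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(seqs):
    """Pairwise p-distances keyed by (name, name)."""
    return {(p, q): p_distance(seqs[p], seqs[q]) for p in seqs for q in seqs}

# Hypothetical aligned sequences of equal length.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TCGAACGA"}
matrix = distance_matrix(toy)
for (p, q), d in matrix.items():
    if p < q:
        print(p, q, d)
```

A tree-building method such as neighbor joining would then group the closest pair first, here `a` and `b`, and place `c` further away.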
39Evolutionary Analysis
- Refining Trees
- Bootstrap analysis can give us estimates of variability
- Incorporating information about duplication and loss
- reconcile a gene tree with a species tree
- identify gene duplications
- root an unrooted tree by minimizing gene duplications and losses
- refine rooted trees to minimize duplications and losses
- Groups analysis
- Discover what is unique about each subgroup in a tree
- Help resolve which subgroup a sequence belongs to in a tree
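The resampling step behind bootstrap analysis can be sketched as drawing alignment columns at random, with replacement, until a pseudo-alignment of the original length is built. The sketch covers only that step; a full analysis would rebuild a tree from every replicate and count how often each branch reappears. The toy alignment, the seed, and the number of replicates are arbitrary choices.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping rows in step."""
    length = len(next(iter(seqs.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    # The same column picks are applied to every sequence so columns stay intact.
    return {name: "".join(seq[i] for i in picks) for name, seq in seqs.items()}

# Hypothetical aligned sequences of equal length.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TCGAACGA"}
rng = random.Random(0)          # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

Branches supported by many columns survive in most replicates, while branches that depend on one or two columns appear and disappear, which is what the variability estimate measures.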
40Evolutionary Analysis - Programs
- Phylip Package
- Package contains many programs to help infer phylogenies, including programs for bootstrapping, maximum parsimony, distance methods, maximum likelihood methods, etc.
- Notung
- Enables the incorporation of information about duplication and loss into phylogenies
- Subgroup Refinement
- GEnt: calculates a group cross entropy
41Course Project
[Diagram labels: Curated Dataset, Homology Modeling]
42Part I
[Diagram labels: Curated Dataset, Homology Modeling]
43Course Project Part I
- Select Sequence Family
- In the labs we will be working with the Haloalkane Dehalogenase superfamily
- A hydrolase that acts on halide bonds in C-halide compounds.
- Reaction: 1-haloalkane + H(2)O = a primary alcohol + halide.
- The PIR Superfamily is PIRSF037173
- The initial query sequence that we will be using for database searching is from Xanthobacter autotrophicus. The UniProt ID is DHLA_XANAU.
44Part II
[Diagram labels: Curated Dataset, Homology Modeling, Classification Libraries, Sequence Libraries]
45Course Project Part II
- Collect an initial set of sequences, generate a multiple sequence alignment
- Perform a database search with the query sequence across several databases with different algorithms
- Perform a multiple sequence alignment with a variety of different alignment algorithms
- Select the best multiple sequence alignment
46Part III
[Diagram labels: Curated Dataset, Homology Modeling, Multiple Sequence Alignment, Sequence Libraries, Profile PSSM, Local Patterns]
47Course Project Part III
- Improve the quality of your alignment, and identify additional family members
- Search for local patterns in the group of sequences
- Refine the multiple sequence alignment based on local patterns
- Convert the alignment into an HMM/PSSM and search the database for distantly related sequences.
48Part IV
[Diagram labels: Curated Dataset, Homology Modeling, Multiple Sequence Alignment, Evolutionary Analysis]
49Course Project Part IV
- Add structural and evolutionary information
- Build a phylogenetic tree
- Refine the phylogenetic tree
- Refine groups
- Produce and visualize structure using homology modeling techniques
- Produce final multiple sequence alignment