Introduction to Sequence Analysis Software

1
Introduction to Sequence Analysis Software
Data Libraries

BIOINFORMATICS I
Protein and DNA Sequence Analysis
Jaime E. Ramirez-Vick, PhD

2
Why Sequence Analysis

Sequence analysis is the process of applying
computational methods to a biological sequence
represented as a character string.
The goal is to use these computational methods to
infer information about the structure, function,
or evolutionary history of the sequence.
The stronger the evidence, the more confident we
can be in the inference.
To get the strongest evidence the proper
techniques must be employed.

3
The Goal
4
The Process
[Diagram labels: Homology Modeling, Curated Dataset]
5
The Project

Part I: Submit three candidate families for your
course project.
Part II: Collect an initial set of sequences and
generate a multiple sequence alignment.
Part III: Improve the quality of your alignment
and identify additional family members.
Part IV: Add structural and/or evolutionary
information and give a final report.

6
Part I
[Diagram labels: Homology Modeling, Curated Dataset]
7
Part II
[Diagram labels: Homology Modeling, Curated Dataset, Classification Libraries, Sequence Libraries]
8
Part III
[Diagram labels: Homology Modeling, Curated Dataset, Multiple Sequence Alignment, Sequence Libraries, Profile PSSM, Local Patterns]
9
Part IV
[Diagram labels: Evolutionary Analysis, Homology Modeling, Curated Dataset, Multiple Sequence Alignment]
10
Structural Libraries

Structural libraries contain the actual
three-dimensional coordinates of a macromolecule.
Used to:
Determine whether the three-dimensional structure
of a molecule has been solved
Visualize the three-dimensional structure
Assist in homology modeling

11
Structural Libraries

Protein Data Bank (PDB)
Large molecules (>1000 atoms)
For more information see
http://www.psc.edu/general/software/packages/pdb/pdb.html
http://www.rcsb.org/pdb/
Cambridge Structural Database
Small molecules (<100 atoms)
For more information see
http://www.ccdc.cam.ac.uk/

12
Classification Libraries

Built from sets of related sequences and contain
information about the residues that are essential
to the structure/function of the group of related
sequences
Used to
Generate a testable hypothesis that the query
sequence belongs to the group.
Quickly identify a good group of sequences known
to share a biological relationship.

13
Classification Libraries

Some popular classification libraries:
PROSITE: http://www.expasy.ch/prosite.html
PFAM: http://pfam.wustl.edu/
IPROCLASS: http://pir.georgetown.edu/iproclass/
BLOCKS: http://www.blocks.fhcrc.org/
PRINTS: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
Transcription Factor Database: http://transfac.gbf.de/TRANSFAC/
Restriction Enzyme Database: http://rebase.neb.com/
Search software is usually specific to the
database.

14
Classification Libraries - Representation

Consensus: the residue most common at each
position in the alignment
Composite: set representation (e.g. a g,c acg t a)
Composition Matrix: table of how many of each
residue are present at each position
Position Specific Scoring Matrix (PSSM or
Profile): log-odds likelihood of each residue at
each position
Hidden Markov Model: probabilistic state
representation
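
The first three representations can be sketched in a few lines of Python. This is only an illustration: the toy DNA alignment and the function names are made up for this example and are not part of any of the libraries listed in these slides.

```python
from collections import Counter

# Hypothetical toy alignment; every sequence has the same length.
alignment = [
    "ACGTA",
    "ACGTA",
    "AGGTA",
    "ACCTA",
]

def composition_matrix(seqs):
    """Count how often each residue occurs in each alignment column."""
    return [Counter(column) for column in zip(*seqs)]

def consensus(seqs):
    """Take the most common residue in each column."""
    return "".join(c.most_common(1)[0][0] for c in composition_matrix(seqs))

def composite(seqs):
    """List the set of residues observed in each column."""
    return ["".join(sorted(set(column))) for column in zip(*seqs)]

counts = composition_matrix(alignment)
print(consensus(alignment))   # one residue per column
print(composite(alignment))   # one residue set per column
print(dict(counts[1]))        # residue counts for the second column
```

Note how the consensus throws away the minority residues that the composite and the composition matrix still record.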

15
Sequence Libraries

Compilations of known sequences with experimental
information about those sequences.
Used to
Generate a testable hypothesis that the query
sequence may be related to known sequences in the
library.
Retrieve annotation information about sequences

16
Sequence Libraries

Nucleic Acids
GenBank: http://www.ncbi.nlm.nih.gov/
EMBL: http://www.ebi.ac.uk/
Protein
UniProt/UniRef: http://www.uniprot.org/
Other protein collections
PIR: http://nbrfa.georgetown.edu/
Swiss-Prot: http://www.ebi.ac.uk/
GenPept: http://www.ncbi.nlm.nih.gov/
PIR-NREF: http://nbrfa.georgetown.edu/
TREMBL: http://www.ebi.ac.uk/
Older libraries: PATCHX, OWL

17
Sequence Libraries - Searching

Searching Methods
Dynamic Programming
Global: Needleman-Wunsch, Sellers
Local: Smith-Waterman, Waterman-Eggert "Maxsegs"
Approximations
FASTA
BLAST
The user must understand what the searching
method considers "similar" to mean
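
To see how the notion of "similar" depends on the method, here is a minimal dynamic programming sketch that returns either a global (Needleman-Wunsch style) or a local (Smith-Waterman style) score. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for the example; real searches use substitution matrices and usually affine gap penalties.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False scores the sequences end to end (global);
    local=True scores the best-matching pair of subsequences (local).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

print(align_score("GGACGTGG", "ACGT", local=True))
print(align_score("GGACGTGG", "ACGT", local=False))
```

The same pair of sequences scores well locally (the shared ACGT) but poorly globally, because the global method charges for every unmatched flanking residue.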

18
Sequence Libraries - Similar?
Blood coagulation protein superfamily. From "The
Molecular Basis of Blood Coagulation" by Furie and
Furie, Cell, Vol. 53.
19
Sequence Libraries - Results
Box 3.6 from Introduction to Bioinformatics by
Attwood and Parry-Smith
20
Multiple Sequence Alignment

An MSA is an alignment of a group of related
sequences across their entire lengths in a manner
that highlights the conservation of the important
residues in the sequences.
It is a critical building block for many next
steps, such as finding distantly related
sequences, determining the evolutionary history,
and homology modeling.
Not all sets of related sequences can be aligned
cleanly across their entire lengths.

21
Multiple Sequence Alignment
Blood coagulation protein superfamily. From "The
Molecular Basis of Blood Coagulation" by Furie and
Furie, Cell, Vol. 53.
22
Multiple Sequence Alignment

When aligning groups of related sequences it is
important to note that residues in those
sequences fall into one of three classes:
Conserved (not mutated)
Unconstrained (when mutated can be almost any
amino acid)
Constrained (when mutated must be one of a few
amino acids)
Motifs are distinct units that consist of the
conserved and constrained regions. Motifs
generally tell us the residues that are essential
to the structure/function of the sequence.
Typically, multiple sequence alignments contain
motifs as well as unconstrained regions.
Aligning motifs in a multiple sequence alignment
will improve the quality of the alignment
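
As a rough illustration of the three residue classes, the sketch below labels each alignment column by the number of distinct residues it contains. The cutoff of three residues for "constrained" and the toy protein alignment are assumptions made for this example only; a real analysis would typically use substitution groups or an entropy measure instead of a raw count.

```python
def classify_column(column, max_constrained=3):
    """Label one alignment column by how many distinct residues it holds."""
    residues = {r.upper() for r in column if r != "-"}   # ignore gaps and case
    if not residues:
        return "gap-only"
    if len(residues) == 1:
        return "conserved"
    if len(residues) <= max_constrained:
        return "constrained"
    return "unconstrained"

# Hypothetical aligned protein fragments.
alignment = [
    "LKEL",
    "LREL",
    "LKDL",
    "LAEV",
    "LSEL",
]

labels = [classify_column(col) for col in zip(*alignment)]
print(labels)
```

Runs of conserved and constrained columns are the candidates for motifs; unconstrained columns are where the alignment is least informative.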

23
Multiple Sequence Alignment
Secondary structure annotation: helix, helix, sheet
Lrr 2e  kdafrdlhsLsl-LsLydNnI-----qsL
LRR A1  wteLlpllqqyEvvrLddCgLTeehCkdi
LRR A2  lqgLqsPtCkiqkLsLqnCsLTeaGCgvL
LRR A3  cegLldPqChLEkLqLeyCrLTaasCepL
LRR A4  gqgLadsaCqLEtLrLenCgLTpanCkdL
LRR A5  cpgLlsPasrLktLwLweCdiTasGCrdL
LRR A6  cesLlqPGCQLEsLwvksCsLTaacCqhv
LRR A7  cqaLsqPgttLrvLcLgdCeVTnsGCssL
LRR A8  lgsLeQPgCaLEqLvLydtywTeevedrL
LRR B1  gsaLranpsLtE-LcLrtNeLGDaGvhlv
LRR B2  pstLrslptLrE-LhLsdNpLGDaGlrlL
LRR B3  asvLratraLkE-LtvsnNdiGeaGarvL
LRR B4  cgivasqasLrE-LDLgsNgLGDaGiaeL
LRR B5  crvLqaketkKE-LsLagNkLGDeGarlL
LRR B6  slmLtqnkhLlE-LqLssNkLGDsGiqeL
LRR B7  aslLlanrsLRE-LdLsnNcvGDpGvlqL
24
Multiple Sequence Alignment
Figure 8.2 from Introduction to Bioinformatics by
Attwood and Parry-Smith
25
Multiple Sequence Alignment - Programs

MSA using progressive pairwise technique: Clustal
MSA using multidimensional dynamic programming: MSA
MSA using consistency measures: T-Coffee, ProbCons
MSA editor: GeneDoc

26
Position Specific Scoring Matrix

A Position Specific Scoring Matrix (PSSM or
Profile) is a way to abstract the information
contained in a multiple sequence alignment.
Think of a PSSM as a custom PAM or BLOSUM scoring
matrix that has been specially tuned to locate
sequences exactly like those in the alignment.
Probabilities are represented as log-odds scores.
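
A minimal sketch of the log-odds idea follows: each column's observed residue frequency is divided by a background frequency and the logarithm is taken. The DNA alphabet, the uniform background, and the pseudocount of 1 are simplifying assumptions for illustration; this is not the Gribskov or Henikoff method, which bring in PAM or BLOSUM matrices and sequence weighting.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
BACKGROUND = {r: 0.25 for r in ALPHABET}   # assumed uniform background

def build_pssm(seqs, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / BACKGROUND[r])
            for r in ALPHABET
        })
    return pssm

def score(pssm, seq):
    """Sum the column scores for a sequence of the same length as the PSSM."""
    return sum(column[r] for column, r in zip(pssm, seq))

# Hypothetical toy alignment.
alignment = ["ACGT", "ACGT", "ACGA", "ATGT"]
pssm = build_pssm(alignment)
print(round(score(pssm, "ACGT"), 2), round(score(pssm, "TGCA"), 2))
```

Positive scores mean a residue is seen more often at that position than the background predicts; a sequence resembling the family accumulates a high total.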

27
Position Specific Scoring Matrix

Used to
Help locate distantly related sequences
Help resolve sequences that are not considered
statistically significant by a database search,
but share enough important residues to infer that
the sequences may have the same function and be
distant members of the sequence family
Good MSA -> Good PSSM
Poor MSA -> Poor PSSM
A lot of Sequences -> Good PSSM
Few Sequences -> Good PSSM

28
PSSM Programs

MakePSSM
Used to create a PSSM from a multiple sequence
alignment
A PSSM can be created using different methods,
including Gribskov and Henikoff, and with a
variety of PAM and BLOSUM matrices.
ProfileSS
Used to search a sequence database with a profile.

29
Hidden Markov Model

A Hidden Markov Model (HMM) is a way to abstract
the information contained in a multiple sequence
alignment.
Think of an HMM as a way to represent a multiple
sequence alignment by deriving probabilities
directly from the multiple sequence alignment.
Probabilistic model: includes probabilities for
insertions and deletions.

30
Hidden Markov Model

Used to
Help locate distantly related sequences
Help resolve sequences that are not considered
statistically significant by a database search,
but share enough important residues to infer that
the sequences may have the same function and be
distant members of the sequence family
Good MSA -> Good HMM
Poor MSA -> Poor HMM
A lot of Sequences -> Good HMM
Few Sequences -> Poor HMM

31
Hidden Markov Model

HMMER Package
hmmalign - Align multiple sequences to a profile HMM.
hmmbuild - Build a profile HMM from a given multiple sequence alignment.
hmmcalibrate - Determine appropriate statistical significance parameters for a profile HMM prior to doing database searches.
hmmconvert - Convert HMMER profile HMMs to other formats.
hmmemit - Generate sequences probabilistically from a profile HMM.
hmmfetch - Retrieve an HMM from an HMM database.
hmmindex - Create a binary SSI index for an HMM database.
hmmpfam - Search a profile HMM database with a sequence.
hmmsearch - Search a sequence database with a profile HMM.
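
A typical build, calibrate, search sequence with the programs above could be assembled as shown below. The sketch only constructs and prints the command lines; it does not run them. The file names are hypothetical, and the argument order assumes the usage "hmmbuild <hmmfile> <alignment>", "hmmcalibrate <hmmfile>", "hmmsearch <hmmfile> <sequence database>". Program names and options differ between HMMER releases, so check the documentation of the installed version.

```python
def hmmer_commands(alignment_file, hmm_file, sequence_db):
    """Return argument lists for a build -> calibrate -> search run."""
    return [
        ["hmmbuild", hmm_file, alignment_file],   # alignment -> profile HMM
        ["hmmcalibrate", hmm_file],               # fit significance parameters
        ["hmmsearch", hmm_file, sequence_db],     # profile HMM vs. sequences
    ]

# Hypothetical file names.
commands = hmmer_commands("family.sto", "family.hmm", "sequences.fasta")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run once the programs are confirmed to be installed.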

32
Local Patterns

Local patterns are short motifs that exist in
all (or a subset) of the related sequences.
Finding local patterns may help us to align
biologically important sections in a multiple
sequence alignment.
These local patterns can also be used to probe
sequence data libraries for distant relatives
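
Probing a library with a local pattern can be sketched with a regular expression. The motif, the sequence names, and the sequences below are invented for the example; dedicated tools such as MAST score matches statistically rather than requiring an exact pattern match.

```python
import re

# Hypothetical motif: L, any residue, E, then L or V.
motif = re.compile(r"L.E[LV]")

# Hypothetical miniature sequence library.
library = {
    "seq1": "MKLAELGG",
    "seq2": "MKKAAAGG",
    "seq3": "GGLSEVKK",
}

def scan(library, motif):
    """Return {name: [1-based start positions]} for sequences with a hit."""
    hits = {}
    for name, seq in library.items():
        positions = [m.start() + 1 for m in motif.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits

hits = scan(library, motif)
print(hits)
```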

33
Local Patterns
Blood coagulation protein superfamily. From "The
Molecular Basis of Blood Coagulation" by Furie and
Furie, Cell, Vol. 53.
34
Local Patterns - Programs

MEME: a tool for discovering motifs in groups of
sequences
oops - One Occurrence Per Sequence
zoops - Zero or One Occurrence Per Sequence
tcm - Multiple occurrences per sequence
MAST: searches a sequence database for
sequences that contain MEME patterns

35
Homology Modeling

Predicts the three-dimensional structure of a
given protein sequence (TARGET) based on an
alignment to one or more known protein structures
(TEMPLATES)
If similarity between the TARGET sequence and the
TEMPLATE sequence is detected, structural
similarity can be assumed.
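
One simple way to quantify TARGET/TEMPLATE similarity is percent identity over the aligned positions. The sketch below uses one common definition (identical pairs divided by columns where neither sequence has a gap); other definitions exist, and the aligned fragments are hypothetical. No identity cutoff for reliable modeling is implied here.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent of gap-free aligned column pairs with identical residues."""
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical aligned fragments (same length, '-' marks a gap).
target = "MKT-AYIAK"
template = "MKSLAYLAK"
print(percent_identity(target, template))
```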

36
Homology Modeling
Structural Superposition of Aldehyde
Dehydrogenase Family Members
37
Homology Modeling - Programs

Modeller: used for homology (or comparative)
modeling of protein three-dimensional structures
MMTSB: Multiscale Modeling Tools for Structural
Biology
VMD: molecular visualization program for
displaying, animating, and analyzing large
biomolecular systems using 3-D graphics and
built-in scripting

38
Evolutionary Analysis

Inferring phylogenies: finding the tree that
implies the correct evolutionary history of the
sequences
Principal Methods
Parsimony analysis
Distance methods
Maximum Likelihood
Each approach has its own strengths/weaknesses
To assess the correctness of the tree we need
to understand
The overall signal-to-noise in the data
How the tree compares to alternate trees
How reliable the individual branches are
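
Distance methods start from a matrix of pairwise distances between the aligned sequences. The sketch below computes the simplest such measure, the proportion of differing sites (p-distance), for a made-up alignment; it does not build a tree, and real analyses usually apply an evolutionary correction to these raw distances.

```python
def p_distance(a, b):
    """Fraction of compared (gap-free) sites at which two sequences differ."""
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no gap-free sites to compare")
    return sum(x != y for x, y in sites) / len(sites)

def distance_matrix(seqs):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    return {(p, q): p_distance(seqs[p], seqs[q]) for p in seqs for q in seqs}

# Hypothetical aligned sequences.
seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGAACGA",
}
matrix = distance_matrix(seqs)
print(matrix[("A", "B")], matrix[("A", "C")], matrix[("B", "C")])
```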

39
Evolutionary Analysis

Refining Trees
Bootstrap analysis can give us estimates of
variability
Incorporating information about duplication and
loss
reconcile a gene tree with a species tree
identify gene duplications
root an unrooted tree by minimizing gene
duplications and losses
refine rooted trees to minimize duplications and
losses
Groups analysis
Discover what is unique about each subgroup in a
tree
Help resolve which subgroup a sequence belongs to
in a tree
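
The resampling step behind bootstrap analysis can be sketched as follows: alignment columns are drawn with replacement to create replicate alignments of the same size. Building a tree from each replicate and counting how often each branch reappears is left to a phylogeny package and is not shown. The toy alignment, the seed, and the number of replicates are assumptions for the example.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(seqs[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in seqs]

# Hypothetical toy alignment.
alignment = ["ACGTAC", "ACGTTC", "TCGAAC"]

rng = random.Random(0)   # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Because whole columns are resampled, every sequence in a replicate is rearranged the same way, so the column-wise relationships between sequences are preserved.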

40
Evolutionary Analysis - Programs

Phylip Package
Package contains many programs to help infer
phylogenies, including programs for
bootstrapping, maximum parsimony, distance
methods, maximum likelihood methods, etc.
Notung
Enables the incorporation of information about
duplication and loss into phylogenies
Subgroup Refinement
GEnt: calculates a group cross entropy

41
Course Project
[Diagram labels: Homology Modeling, Curated Dataset]
42
Part I
[Diagram labels: Homology Modeling, Curated Dataset]
43
Course Project Part I

Select Sequence Family
In the labs we will be working with the
Haloalkane Dehalogenase superfamily
A hydrolase that acts on halide bonds in C-halide
compounds.
Reaction: 1-haloalkane + H(2)O = a primary
alcohol + halide.
The PIR Superfamily is PIRSF037173.
The initial query sequence that we will be using
for database searching is from Xanthobacter
autotrophicus. The UniProt ID is DHLA_XANAU.

44
Part II
[Diagram labels: Homology Modeling, Curated Dataset, Classification Libraries, Sequence Libraries]
45
Course Project Part II

Collect an initial set of sequences, generate a
multiple sequence alignment
Perform a database search with the query sequence
across several databases with different
algorithms
Perform a multiple sequence alignment with a
variety of different alignment algorithms
Select best multiple sequence alignment

46
Part III
[Diagram labels: Homology Modeling, Curated Dataset, Multiple Sequence Alignment, Sequence Libraries, Profile PSSM, Local Patterns]
47
Course Project Part III

Improve the quality of your alignment, and
identify additional family members
Search for local patterns in the group of
sequences
Refine multiple sequence alignment based on local
patterns
Convert alignment into HMM/PSSM and search the
database for distantly related sequences.