Multiple Sequence Analysis

Slides: 79

Provided by: ElliotLe6

1
Multiple Sequence Analysis
2
Conserved functional domains

Sequences required for common function
Conserved among different species

3
Variable domains

Sequences under selection
Antibody escape mutants

4
Peptide motifs

Active sites
Binding motifs
Protein modification motifs

5
Evolutionary relationships

Origin of new strains
Epidemiology
Disease spread
Disease origins
Common ancestry

6
Multiple Sequence Alignments
7
The "Best" Alignment

Optimal Alignments
BestFit and Gap algorithms
"Approximate" Alignments
When the optimal alignment would exceed computer
capabilities
PileUp
In either case, the final alignment will always
be dependent on the chosen variables

8
Program Variables

Symbol Comparison Table
Gap Weight
Gap Length Weight
Algorithm
Other

9
Determining the Best Alignment

Optimize
Percent Identity/Similarity
Quality
Statistical measures
Make up a number

10
The Final Analysis

Your own eyes
Human knowledge
Biology
Evolution

11
PileUp

Creates multiple sequence alignments from a group
of related sequences by performing pairwise
alignments among all of the sequences in the group

12
PileUp Initial Comparison

Compares each sequence to every other sequence
Uses the GAP global alignment algorithm
Creates a table of similarities between every
sequence
The table is plotted as a dendrogram to a .figure
file

13
PileUp Alignment

Align the two most similar sequences to each other
Forms cluster number one
Align cluster one to the next most similar
sequence
Gaps introduced into cluster one are introduced
into both sequences
Forms cluster two (group of three)

14
Completion of the Alignment

Repeat the alignment by gapping each new cluster
to the next most similar sequence
Writes the final alignment to a Multiple Sequence
Format (MSF) file

15
MSF file

Can be read by other programs
Individually gapped sequences can be utilized on
their own or in groups that are subsets of the
whole alignment

16
Dendrogram

Original pairwise relationships among all of the
sequences used to determine cluster alignment
order
The dendrogram does not predict phylogenetic
relationships
The final alignment was not used to determine the
sequence relationships

17
(No Transcript)
18
Alignment Order

Alignments begin with the two most similar
sequences and end with the most distant sequence
The final alignment may be influenced by the
alignment order
This order cannot be changed in the present
version of PileUp

19
Similar Sequences

PileUp does not allow differential weighting of
the input sequences
All input sequences are weighted equally
Several very similar sequences will contribute
equally to the alignment
Several very similar sequences may bias the final
alignment

20
Different PileUp Runs

Run PileUp with different sets of input sequences
Use all members or only one member of a group of
very similar sequences
Run PileUp using previously determined consensus
sequences for sequence groups

21
Unrelated Sequences

PileUp includes all sequences in the final
alignment
An unrelated sequence appears in the alignment
even if it has no similarity to any of the other
sequences
Unrelated sequences may greatly alter the final
sequence alignment by the introduction of many
additional gaps

22
Restrictions

500 sequences
5,000 symbols per sequence
2,000 new gaps per sequence
7,000 final alignment length

23
Restrictions

Surface of comparison between any two sequences
cannot exceed 2,250,000
Product of the sequence lengths
If the surface of comparison does exceed the
limit, the program will attempt an alignment by
limiting the total number of gaps introduced
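A quick way to check a set of sequence lengths against the limits on the two Restrictions slides is sketched below. The constants come from the slides; the function names and the wording of the messages are illustrative assumptions, not part of PileUp.

```python
from itertools import combinations

# Limits quoted on the Restrictions slides.
MAX_SEQUENCES = 500
MAX_SYMBOLS = 5_000
MAX_SURFACE = 2_250_000


def surface_of_comparison(len_a, len_b):
    """Surface of comparison = product of the two sequence lengths."""
    return len_a * len_b


def check_limits(lengths):
    """Return a list of notes for every limit the input would exceed."""
    notes = []
    if len(lengths) > MAX_SEQUENCES:
        notes.append(f"{len(lengths)} sequences exceeds {MAX_SEQUENCES}")
    for index, length in enumerate(lengths):
        if length > MAX_SYMBOLS:
            notes.append(f"sequence {index} has {length} symbols")
    for (i, a), (j, b) in combinations(enumerate(lengths), 2):
        if surface_of_comparison(a, b) > MAX_SURFACE:
            notes.append(f"pair ({i}, {j}) surface {a * b} exceeds limit")
    return notes


print(check_limits([1500, 1501]))
```

Two sequences of 1,500 symbols each sit exactly at the limit (1,500 x 1,500 = 2,250,000); one extra symbol pushes the pair over it.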

24
analyze pileup -check @pol.list

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

Minimal Syntax
pileup -INfile=@Hsp70.List -Default

Prompted Parameters
-GAPweight=12  gap creation penalty
-LENgthweight=4  gap extension penalty
-DENsity=20.0  number of sequences per 100 pu in the dendrogram
-OUTfile1=hsp70.msf  output file for multiple sequence alignment

Local Data Files
-MATRix=blosum62.cmp  scoring matrix for peptides
-MATRix=pileupdna.cmp  scoring matrix for nucleic acids
25
Optional Parameters
-BEGin=1  sets beginning position for every sequence to be aligned
-END=100  sets ending position for every sequence to be aligned
-REVerse  uses the reverse strand for each input sequence
-ENDWeight  penalizes end gaps like other gaps
-INSitu  realigns a portion of an existing alignment
-HIGhroad  selects "top" alignment path for equally optimal gaps
-LOWroad  selects "bottom" alignment path for equally optimal gaps
-MAXSeg=5000  sets maximum segment length for every input sequence
-MAXGap=2000  sets maximum combined length of all gaps added to a sequence
-NOSORt  presents output sequences in the same order as input
-LINesize=50  sets the number of sequence symbols per line
-BLOcksize=10  sets the number of sequence symbols per block
-DEGap  removes gap characters ('.' and '') from the input sequences
-NOPLOt  suppresses plot of clustering relationships
-NOMONitor  suppresses screen trace of each alignment
-NOSUMmary  suppresses screen summary at the end of the program
-BATch  submits program to the batch queue

Add what to the command line ?
26
1 POLG_POL1M 461 aa
2 POLH_POL1M 461 aa
........
48 POLN_SOUV3 266 aa
49 POLN_FCVF4 175 aa

What is the gap creation penalty ( 12 ) ? 5
What is the gap extension penalty ( 4 ) ? 1

This program can display the clustering relationships graphically. Do you want to
A) Plot to a FIGURE file called "pileup.figure"
B) Plot graphics on COLORWORKSTATION attached to GCG_Graphics
C) Suppress the plot
Please choose one ( A ) c

What should I call the output file name ( pol.msf ) ? pol-a.msf
27
Determining pairwise similarity scores...
1 x 2 5.24
1 x 3 5.22
47 x 49 3.43
48 x 49 1.42

Aligning...
1 ........-.
2 ........-. ........-.
47 .............-.
48 .....................-...

Total sequences 49
Alignment length 495
CPU time 0207.25
Output file /export/home/lefkowit/temp/pol-a.msf
28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
Specifying MSF Sequences

Picorna.MSF{*}
All sequences
Picorna.MSF{Polg_pol*}
All polio sequences
Picorna.MSF{Polg_Pol1m}
Only the pol1 Mahoney strain

33
The SeqLab Editor

Alignment Refinement
Color-coded symbol groups

34
Pretty

Display multiple sequence alignments
Does not create the alignment
Calculates a consensus sequence
Allows control over the output

35
Pretty Output

Show all symbols (default)
-Consensus
show all symbols and a consensus sequence
-Case
show symbols agreeing with the consensus in upper
case

36
Pretty Output

-Differences
Only show symbols differing from the consensus
-Identity
Only show consensus symbols which are identical
in all of the aligned sequences
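The -Case and -Differences display modes can be illustrated with a few lines of Python. This is a sketch of the idea only, not Pretty's own code: the function name, its arguments, and the use of "-" to mark agreeing positions are assumptions made for the example.

```python
def render(sequence, consensus, mode="case", agree_char="-"):
    """Render one aligned sequence relative to a consensus.

    mode="case":        agreeing symbols upper case, others lower case.
    mode="differences": only differing symbols are shown; agreeing
                        positions are replaced by `agree_char`.
    """
    rendered = []
    for symbol, cons in zip(sequence, consensus):
        agrees = symbol.upper() == cons.upper()
        if mode == "case":
            rendered.append(symbol.upper() if agrees else symbol.lower())
        elif mode == "differences":
            rendered.append(agree_char if agrees else symbol)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return "".join(rendered)


print(render("ACTT", "ACGT", mode="case"))         # ACtT
print(render("ACTT", "ACGT", mode="differences"))  # --T-
```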

37
Pretty Sequence file Output

-Ugly
Write individual sequences into separate files
Includes the consensus sequence

38
Consensus Calculation

Find the symbol at a particular position which
has the greatest number of "votes"
A vote is determined by the sum of the symbol
comparison values between that symbol and all
other symbols at that position

39
Consensus Calculation

If the total vote for the highest scoring symbol
is greater than the threshold value, that symbol
appears in the consensus
If no vote is higher than the threshold, no
symbol appears at that position in the consensus

40
Vote Weight

All sequences participate equally in the
consensus calculation by having a vote weight of
1.0
The vote weight can be changed for any individual
sequence
The symbols in that sequence are then weighted
when the consensus is calculated according to
their vote weight
This allows the votes of very similar sequences
not to overly influence the consensus calculation
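The voting scheme from the Consensus Calculation and Vote Weight slides can be sketched as follows. This is a simplified illustration, not Pretty's implementation: it assumes an identity comparison table (a symbol's vote is the summed weight of the sequences carrying that symbol), and the function name and the "-" placeholder for "no consensus" are made up for the example.

```python
from collections import defaultdict


def consensus(sequences, weights=None, threshold=2.0, blank="-"):
    """Weighted-vote consensus of equal-length aligned sequences."""
    if weights is None:
        weights = [1.0] * len(sequences)  # every sequence votes equally
    result = []
    for column in zip(*sequences):
        votes = defaultdict(float)
        for symbol, weight in zip(column, weights):
            # Assumption: identity scoring, so each sequence adds its
            # vote weight to the symbol it carries at this position.
            votes[symbol] += weight
        best = max(votes, key=votes.get)
        # The symbol appears only if its vote is above the threshold.
        result.append(best if votes[best] > threshold else blank)
    return "".join(result)


aligned = ["ACGT", "ACGA", "ACTA", "TCTA"]
print(consensus(aligned))  # AC-A
print(consensus(aligned, weights=[0.5, 0.5, 0.5, 1.0], threshold=1.0))  # ACTA
```

In the second call the first three sequences are down-weighted, so the fourth sequence has more say at the third position.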

41
Specifying Vote Weights

In an MSF file change the number under the weight
column
In a file of sequence names, add a number
following the sequence name for a vote weight
other than one

42
analyze pretty -check pol-a.msf

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

Minimal Syntax
pretty -INfile=@Pretty.List -Default

Prompted Parameters
-BEGin=1 -END=349  range of interest
-OUTfile=pretty.pretty  output file

Local Data Files
-MATRix=prettydna.cmp  consensus scoring matrix for nucleotides
-MATRix=blosum62.cmp  consensus scoring matrix for peptides
43
Optional Parameters
-CONsensus  generates (displays) a consensus sequence
-IDEntity  only shows positions of unanimous agreement in the consensus
-DIFferences="-"  only shows positions disagreeing with the calculated consensus
-CASe  shows positions agreeing with the calculated consensus in upper case
-THReshold=1  sets minimum comparison value for symbol to vote in consensus
-PLUrality=2.0  defines the minimum number of votes for a consensus to exist
-LINesize=50  sets the number of residues per line
-WEIGHT=1.0  sets the weight for all input sequences
-BLOcksize=10  sets the number of residues per block
-UGLy  writes the individual sequences into new files
44
Add what to the command line ?

pol-a.msfPOLG_HPAV8 len 495 wgt 1.00
pol-a.msfPOLG_HPAV4 len 495 wgt 1.00
pol-a.msfPOLN_SMSV1 len 495 wgt 1.00
pol-a.msfPOLN_SOUV3 len 495 wgt 1.00

Begin ( 1 ) ?
End ( 495 ) ?
What should I call the output file ( pretty.pretty ) ? pol-a.pretty
45
pretty
46
pretty -case
47
pretty -Dif
48
pretty -Ide
49
Picorna.MSF - Weighted
50
Picorna.MSF - Weighted
51
Pretty Consensus with Altered Vote Weights
52
Pretty Consensus with Altered Vote Weights
53
(No Transcript)
54
PrettyBox

Shaded representation of a multiple sequence
alignment
Creates a PostScript file
Create a PDF file using Adobe Distiller
Edit using Adobe Illustrator

55
analyze prettybox -check dnaa.msf

PrettyBox displays multiple sequence alignments in PostScript format, using shading to represent regions that agree with a calculated consensus sequence. The program does not create the alignment; it simply displays it.

Minimal Syntax
prettybox -INfile=@pretty.list -Default

Prompted Parameters
-BEGin=1 -END=349  sets the range of interest
-ORIentation=l  specifies the direction for printing as Landscape (L) or Portrait (P)
-NUMbering=r  sets printing of sequence numbering to Right side (R), Top (T), or None
-CONsensus  generates a consensus sequence
-OUTfile=prettybox.ps  writes to PostScript output file

Local Data Files
-MATRix=prettyboxdna.cmp  assigns the scoring matrix for nucleotides
-MATRix=blosum62.cmp  assigns the scoring matrix for proteins
-MARk=pretty.mrk  defines regions to be shaded
56
Optional Parameters
-PAIr=x,2,1  sets thresholds for identical (x), very similar, and weakly similar comparisons to the consensus, respectively. Protein defaults are x, 2, 1. Nucleic acid defaults are 1, 1, 1.
-THReshold=1  sets minimum comparison value for symbol to vote in the consensus
-PLUrality=2.0  defines the minimum number of votes for a consensus to exist
-IDEntity  restricts shading and consensus determination to positions of unanimous agreement
-CASe  shows positions agreeing with the calculated consensus in uppercase
-SIMPlify=simplify.txt  simplifies sequences; works like the Simplify program
-SIMIlar=a  considers similarity in generating a consensus. If 'O' is used, then only identical matches are considered.
-NOOFFset  prevents printing the consensus line offset from the other sequences
-NOHEAder  suppresses printing a header
-SEQName=p  sets sequence names to be Partial (P), Full (F), or None (N)
57
-ASKstart  asks about the starting numbers for each sequence
-WIDth=50  sets the number of residues per line
-BLOcksize=10  sets the number of residues per block
-SPAcing=1  sets the number of spaces between blocks
-BLAnklines=2  sets the number of blank lines between each group of sequence lines
-FONtsize=10  sets the font size in terms of PostScript numbers
-XMArgin=20  sets the left and right margins in PostScript units
-YMArgin=20  sets the top and bottom margins in PostScript units
-FAT  uses fat (bold) lettering
-COLor=b,L,P,W  sets the colors (shading intensities) to use for identical, similar, somewhat-similar, and non-similar comparisons to the consensus, respectively. The available colors, by decreasing order of intensity, are Black (B), Dark (D), Light (L), Pale (P), and White (W).
-DENsity=f  sets the density of printing to be either Rough (R) or Fine (F). Rough may photocopy better. Density only works with the colors Dark, Light, and Pale.

Add what to the command line ?
58
dnaa.msfDNAA_CAUCR, len 749
dnaa.msfDNAA_RHIME, len 749
dnaa.msfDNAA_MYCCA, len 749
dnaa.msfDNAA_MYCMY, len 749
dnaa.msfDNAA_SPIAP, len 749
dnaa.msfDNAA_SPICI, len 749
dnaa.msfDNAA_BORBU, len 749
dnaa.msfDNAA_TREPA, len 749
dnaa.msfDNAA_RICPR, len 749
dnaa.msfDNAA_WOLSP, len 749
dnaa.msfDNAA_BUCAI, len 749
dnaa.msfDNAA_BUCAP, len 749
dnaa.msfDNAA_ECOLI, len 749
dnaa.msfDNAA_SALTY, len 749
dnaa.msfDNAA_SERMA, len 749
dnaa.msfDNAA_PROMI, len 749
dnaa.msfDNAA_VIBHA, len 749
dnaa.msfDNAA_PSEPU, len 749
dnaa.msfDNAA_HAEIN, len 749
dnaa.msfDNAA_MYCBO, len 749
dnaa.msfDNAA_MYCPA, len 749
dnaa.msfDNAA_MYCAV, len 749
dnaa.msfDNAA_MYCTU, len 749
dnaa.msfDNAA_MYCLE, len 749
dnaa.msfDNAA_MYCSM, len 749
dnaa.msfDNAA_STRCO, len 749
dnaa.msfDNAA_STRRE, len 749
dnaa.msfDNAA_STRCH, len 749
dnaa.msfDNAA_MICLU, len 749
dnaa.msfDNAA_BACSU, len 749
dnaa.msfDNAA_STAAU, len 749
dnaa.msfDNAA_PROMA, len 749
dnaa.msfDNAA_SYNY3, len 749
dnaa.msfDNAA_STRPN, len 749
dnaa.msfDNAA_THEMA, len 749
dnaa.msfDNAA_MYCGE, len 749
dnaa.msfDNAA_MYCPN, len 749
dnaa.msfDNAA_UREPA, len 749

Begin ( 1 ) ?
End ( 749 ) ?
59
Print in which orientation
l)andscape p)ortrait
Please select ( L )
Display a consensus ( No ) ?
Find consensus to what minimum plurality ( 2.00 ) ?
Where should numbers be placed
r)ight side t)op n)one
Please select ( R )
What should I call the output PostScript file ( prettybox.ps ) ? dnaa.ps
analyze
60
PrettyBox Output

pdf file

61
PlotSimilarity

Plots the similarity among sequences in a
multiple sequence alignment

62
Similarity Statistic

The similarity statistic is the average of all
symbol comparison scores when all symbols at any
one position are compared with each other
The similarity statistic is averaged over a
window size of 10 (default) and plotted along the
length of the sequence

63
PlotSimilarity -IDEntity

A measure of symbol identity along the sequence
Instead of using a symbol comparison table for
the calculation, all matches receive a value of
1, and mismatches a value of 0
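The similarity statistic in its -IDEntity form (matches score 1, mismatches 0) can be sketched as below. This is an illustration under stated assumptions, not PlotSimilarity's code: the function names are hypothetical, gap characters are treated like any other symbol, at least two sequences are required, and the window is a plain sliding average that yields one value per full window.

```python
from itertools import combinations


def column_similarity(column):
    """Average identity score over all pairs of symbols in one column."""
    pairs = list(combinations(column, 2))
    return sum(1.0 if a == b else 0.0 for a, b in pairs) / len(pairs)


def similarity_profile(sequences, window=10):
    """Running average of the per-column similarity along the alignment."""
    per_column = [column_similarity(col) for col in zip(*sequences)]
    return [
        sum(per_column[start:start + window]) / window
        for start in range(len(per_column) - window + 1)
    ]


# Tiny made-up alignment; a window of 2 keeps the numbers easy to follow.
profile = similarity_profile(["AAAA", "AAAT", "AATT"], window=2)
print([round(value, 3) for value in profile])
```

Replacing the 1/0 identity score with a lookup in a symbol comparison table would give the default (non-identity) statistic described on the previous slide.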

64
analyze plotsimilarity -check pol-a.msf

PlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

Minimal Syntax
plotsimilarity -INfile1=hsp70.msf -Default

Prompted Parameters
-WINdow=10  comparison window size
-DENsity=624.3  the number of bases per 100 platen units

Prompted Parameters for comparing 2 sequences only
-INfile2=ggamma.gap  second input sequence
-BEGin1=1 -END1=1700  the range of interest for sequence 1
-BEGin2=1 -END2=1700  the range of interest for sequence 2
-REVerse1 -REVerse2  strand of each sequence

Local Data Files
-MATRix=blosum62.cmp  scoring matrix for peptides
-MATRix=plotsimdna.cmp  scoring matrix for nucleic acids
65
Optional Parameters
-BEGin1=1 -END1=718  the range of interest in the alignment
-OUTfile=Hsp70.plotsim  writes the similarity values to a file
-WEIGHT=1  sets the weight for all input sequences
-IDEntity  plots the level of identity among the sequences
-BARgraph  plots a bar graph (rather than a continuous curve)
-PROFile  plots positional conservation in a profile
-MINScale=0  sets the bottom of the similarity score scale
-MAXScale=2  sets the top of the similarity score scale
-EXPand  scales plot between observed min and max similarity scores
-NOAVErage  suppresses the plot of overall similarity
-NOPLOt  suppresses the plot
-CMASK=filename  creates a SeqLab colormask file with grayscale values for levels of similarity

Add what to the command line ?
66
Process set to plot with COLORWORKSTATION attached to GCG_Graphics using the xwindows graphic interface.

pol-a.msfPOLG_HPAV8
pol-a.msfPOLG_HPAV4
pol-a.msfPOLN_SMSV1
pol-a.msfPOLN_SOUV3

What window to average ( 10 ) ?
The minimum density for this plot is 430.4 residues/100 platen units.
What density do you want ( 430.4 ) ?
xwindows instructions for a COLORWORKSTATION are now being sent to GCG_Graphics.
Press <Return>
67
(No Transcript)
68
(No Transcript)
69
(No Transcript)
70
(No Transcript)
71
MACAW

Multiple Alignment Construction and Analysis
Workbench
Locate, analyze, edit, and combine blocks of
aligned sequence segments
Gregory D. Schuler, Stephen F. Altschul, and
David J. Lipman
FTP to ncbi.nlm.nih.gov

72
MACAW Blocks

Blocks
Ungapped regions of similarity between two or
more sequences
Identifies the best local regions of similarity
between two or more sequences
BestFit-like search
Will identify multiple blocks of similarity
between two or more sequences
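To make the idea of an ungapped block concrete, the sketch below finds the best-scoring ungapped segment shared by two sequences. It is not MACAW's algorithm: it swaps in a simple scan of every diagonal with a running maximum-segment score, and the function name and the +1/-1 match/mismatch scores are assumptions chosen for the example.

```python
def best_ungapped_block(seq_a, seq_b, match=1, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped segment."""
    best = (0, 0, 0, 0)
    # Each offset is one diagonal of the comparison surface (i - j = offset).
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        i, j = max(offset, 0), max(-offset, 0)
        running = 0
        start_i, start_j = i, j
        while i < len(seq_a) and j < len(seq_b):
            score = match if seq_a[i] == seq_b[j] else mismatch
            if running <= 0:
                # Start a new candidate segment at this position.
                running = score
                start_i, start_j = i, j
            else:
                running += score
            if running > best[0]:
                best = (running, start_i, start_j, i - start_i + 1)
            i += 1
            j += 1
    return best


score, start_a, start_b, length = best_ungapped_block("TTGACGTACC", "CGACGTAGG")
print(score, "TTGACGTACC"[start_a:start_a + length])  # 6 GACGTA
```

Because no gaps are allowed, each candidate block lies on a single diagonal, which is what keeps the search this simple.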

73
MACAW Sensitivity

Multiple sequence patterns are located in more
than 2 sequences at a time
The significance and sensitivity of a match are
greater when a similar pattern is located in more
than two sequences

74
Detection and Alignment of Blocks

Multiple algorithms available
The statistical significance of blocks of
similarity is evaluated
Candidate blocks may be visually evaluated for
potential inclusion in a multiple alignment
Each block can be edited by moving its boundaries
or by eliminating particular segments
Blocks may be linked to form a composite multiple
alignment

75
MACAW Scoring