Title: Sequence Alignments
1Sequence Alignments
2Reading
- Chapter 2 in your textbook
3Sequence Alignments
- Alignment between characters found in two or more nucleotide or amino acid sequences
- Amino acids and nucleotides are both referred to as residues
- Similarity between sequences
- What can this tell you?
- In this chapter
- How do we align two or more sequences?
- How do we evaluate these alignments?
- What conclusions can we make based on these
alignments?
4Dot Plots
- Used to visualize regions of similarity
- One sequence placed on the x-axis, the other on the y-axis
- Dots are placed in the plot where the two sequences are identical
- Diagonal lines in plot indicate regions of similarity
- Example: compare ATCG to GATC
- Advantages: easy, quick
- Disadvantages: only gives regions of similarity, not actual alignment
- What would the dot plot look like with longer sequences?
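The ATCG versus GATC example can be sketched in a few lines of Python. This is only an illustration of the idea; the function name `dot_plot` and the use of `*` for a dot are choices made here, not part of the course software.

```python
def dot_plot(x_seq, y_seq):
    """Return one text row per y-axis residue: '*' where residues are identical."""
    return ["".join("*" if x == y else "." for x in x_seq) for y in y_seq]

rows = dot_plot("ATCG", "GATC")   # ATCG on the x-axis, GATC on the y-axis
print("  ATCG")
for label, row in zip("GATC", rows):
    print(label, row)
```

The three dots for A, T, and C fall on one diagonal, which is the region of similarity (ATC) shared by the two sequences.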
5Noise in Dot Plots
- Control by adjusting the following
- Window size
- Similarity cutoff
- Removing too much noise might conceal small regions of similarity
- Example: GCTAGTCAGA and GATGGTCACA
Complete this plot!
Window of 1, similarity cutoff of 1
Window of 4, similarity cutoff of 3
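A minimal sketch of the window and cutoff filter, assuming a dot is recorded at the starting position of each window (some programs record it at the window center instead). The function name `windowed_dot_plot` is hypothetical.

```python
def windowed_dot_plot(x_seq, y_seq, window, cutoff):
    """Place '*' where at least `cutoff` of `window` consecutive residues are identical."""
    rows = []
    for j in range(len(y_seq) - window + 1):
        row = ""
        for i in range(len(x_seq) - window + 1):
            # Count identities along the diagonal that starts at (i, j).
            identities = sum(x_seq[i + k] == y_seq[j + k] for k in range(window))
            row += "*" if identities >= cutoff else "."
        rows.append(row)
    return rows

noisy = windowed_dot_plot("GCTAGTCAGA", "GATGGTCACA", window=1, cutoff=1)
filtered = windowed_dot_plot("GCTAGTCAGA", "GATGGTCACA", window=4, cutoff=3)
for row in filtered:
    print(row)
```

With a window of 1 and cutoff of 1 this reduces to the plain identity plot; the larger window keeps the main diagonal while dropping isolated single-residue matches.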
6Dot Plots in Excel
7Try the DotPlot Program
- Download the program from this link
- It will automatically save the program and several files to your desktop
- Open DotPlot application
- Load sequences as FASTA text files
- File, Open Horizontal, Browse
- File, Open Vertical, Browse
- Parameters menu changes length and cutoff
- Draw, Identities shows plot
- Clear the screen after changing parameters, then redraw to visualize
- Example: Bos taurus and porcine myoglobin mRNA sequences (sequences on course website)
8Simple Alignments
- Molecular changes occur when organisms evolve
- Mutation
- Most common
- Insertion
- Deletion
- Gaps in alignments
- Added to account for insertions/deletions
- Goal: to obtain optimal alignment
- Most likely to represent the true relationship between homologous sequences
- Consider the following sequences: AATCTATA and AAGATA
- Either 2 insertions in first sequence or 2 deletions in second sequence
- What is the optimal alignment?
9- If no gaps are allowed, there are three ways the sequences can be aligned
AATCTATA   AATCTATA   AATCTATA
AAGATA      AAGATA      AAGATA
- Which alignment is optimal?
- Scoring alignments
- Match score: credit for identical aligned pair
- Mismatch score: penalty for nonidentical residues
- Total score: sum of match and mismatch scores
- Higher score: better alignment
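The three ungapped placements can be scored with a short sketch. The match score of 1 and mismatch score of 0 are assumptions borrowed from the gapped examples later in the slides, and `ungapped_scores` is a hypothetical helper name.

```python
def ungapped_scores(long_seq, short_seq, match=1, mismatch=0):
    """Score every placement of short_seq along long_seq with no gaps allowed."""
    scores = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        scores.append(sum(match if a == b else mismatch
                          for a, b in zip(window, short_seq)))
    return scores

scores = ungapped_scores("AATCTATA", "AAGATA")
print(scores)  # one score per placement, left to right
```

Under these assumed scores the first placement (no offset) scores highest.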
10- If gaps are allowed, there are many more ways the sequences can be aligned
- Three examples
AATCTATA   AATCTATA   AATCTATA
AAG-AT-A   AA-G-ATA   AA--GATA
- Scoring must now account for gaps
- Gap penalty: penalty for each residue aligned with a gap (-)
- Total score = match score + mismatch score + gap penalty
11- If match = 1, mismatch = 0, and gap penalty = -1, what are the scores for these three alignments?
AATCTATA   AATCTATA   AATCTATA
AAG-AT-A   AA-G-ATA   AA--GATA
Score: 1, 3, 3
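The arithmetic for these three alignments can be checked with a small sketch that walks the columns of an alignment. The function name `score_alignment` is hypothetical; the default scores are the ones stated on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum match, mismatch, and per-residue gap scores over aligned columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # residue aligned with a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

alignments = ["AAG-AT-A", "AA-G-ATA", "AA--GATA"]
linear_scores = [score_alignment("AATCTATA", b) for b in alignments]
print(linear_scores)
```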
12Gap Penalties
- Is it more likely to have one longer insertion/deletion, or multiple smaller ones?
- Two types of gap penalties
- Length penalty
- Penalty for each residue aligned with a gap (-)
- Origination penalty
- Penalty for presence of a gap
- Allows differentiation between alignments with many short gaps and those with fewer, longer gaps
- Further penalizes for rare insertion/deletion (indel) events
13- If match = 1, mismatch = 0, length penalty = -1, and origination penalty = -2, what are the scores for these three alignments?
AATCTATA   AATCTATA   AATCTATA
AAG-AT-A   AA-G-ATA   AA--GATA
Score: -3, -1, 1
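A sketch of scoring with both an origination and a length penalty, using the values from the slide. The function name `score_with_origination` is hypothetical, and treating a gap that switches from one sequence to the other as a new gap is an assumption.

```python
def score_with_origination(top, bottom, match=1, mismatch=0,
                           length_penalty=-1, origination_penalty=-2):
    """Score an alignment, charging once per gap plus once per gapped residue."""
    total = 0
    open_gap = None  # which row holds the currently open gap: "top", "bottom", or None
    for a, b in zip(top, bottom):
        side = "top" if a == "-" else "bottom" if b == "-" else None
        if side is None:
            total += match if a == b else mismatch
        else:
            if side != open_gap:
                total += origination_penalty  # a new gap starts here
            total += length_penalty           # every gapped residue costs this
        open_gap = side
    return total

alignments = ["AAG-AT-A", "AA-G-ATA", "AA--GATA"]
affine_scores = [score_with_origination("AATCTATA", b) for b in alignments]
print(affine_scores)
```

The third alignment wins because its two gapped residues sit in a single gap, so the origination penalty is paid only once.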
14Terminal Gaps
- Might not actually be indels
- Data could be incomplete
- Sometimes ignored in scoring
AATCTATAGC
AAG--ATA--
15Mismatch Penalties
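- Different mismatch scores depending on the particular nucleotide or amino acid that is mismatched
- Reward mismatches that are more likely to occur (common substitutions) with a smaller penalty
- Nucleotides
- Purine (A, G) vs. pyrimidine (C, T)
- Transitions (purine to purine, or pyrimidine to pyrimidine) vs. transversions (purine to pyrimidine, or the reverse)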
16Scoring Matrices
- Show scores for all non-gap positions in alignment
- For nucleotide sequences
- Identity (sparse)
- BLAST
- Transition/transversion
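The three nucleotide matrix types can be sketched as lookup tables. The numeric values below are illustrative assumptions, since the matrices themselves are not reproduced in the slide text: 1/0 for identity, 5/-4 for a BLAST-style matrix, and 1/-1/-5 for match/transition/transversion. The names `build_matrix` and `is_transition` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)

def build_matrix(match, transition, transversion):
    """Return a dict mapping (base, base) pairs to scores."""
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                matrix[a, b] = match
            elif is_transition(a, b):
                matrix[a, b] = transition
            else:
                matrix[a, b] = transversion
    return matrix

identity = build_matrix(1, 0, 0)     # assumed values
blast = build_matrix(5, -4, -4)      # assumed values
ts_tv = build_matrix(1, -1, -5)      # assumed values

print(ts_tv["A", "G"], ts_tv["A", "T"])  # transition vs. transversion score
```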
17Matrices for Proteins
- Amino acids
- 1. Structure and properties
- Substitution of similar AAs
- More likely to retain protein function (conservative substitution)
- 2. Genetic code
- Minimum number of nucleotide substitutions needed to convert one codon into another
18Matrices for Proteins
- 3. Actual observed substitution rates
- Point accepted mutation (PAM)
- Alignment constructed with high similarity (>85%)
- Calculate relative mutability (m_j)
- Number of times one amino acid (j) is substituted by any other
- Calculate specific substitution (A_ij)
- Number of times j is substituted by a specific amino acid i
- See Box 2.1 (page 40)
19PAM Example
- Ambiguities
- X: ambiguous amino acid
- B: Asn or Asp
- Z: Gln or Glu
- Some algorithms take ambiguities into account and score them, some count them as identical, and others ignore them
- If the sequence has lots of ambiguities, scores may not be reliable with certain types of software
- Identical amino acids: highest score
- Conservative substitution: next highest score
- Non-conservative substitution: lowest score
20PAM Matrices
- PAM matrix is normalized to represent substitution over a fixed period of evolutionary change
- PAM-1
- 1 substitution per 100 residues
- Matrix represents probability of AA substitution in the time it takes for 1% of all residues to be substituted
- Used to compare sequences that are closely related
- PAM-1000
- Used for sequences with distant relationships
- PAM-250
- Commonly used middle ground
21BLOSUM Matrix
- Also derived from observing substitution rates in proteins
- Looks at clusters of amino acid sequences
- Lower numbered matrices used for more distantly related sequences
- BLOSUM-45 vs. BLOSUM-80
- BLOSUM-62 is the middle ground and default matrix in most protein alignment programs
22PAM and BLOSUM
Less divergent to more divergent (left to right):
BLOSUM 80   BLOSUM 62   BLOSUM 45
PAM 1       PAM 250     PAM 1000
23Types of Scores
- Raw Score
- Protein and nucleotide alignments
- Sum the scores for matches, mismatches, and gaps
- Percent identities
- Protein and nucleotide alignments
- Ratio of residues that match up in both sequences to total number of residues compared
- Percent positives
- Protein alignments only
- Matrix values of 1 or greater are called positives
- Ratio of positive values to total number of residues compared
24An Example
- Alignment of mouse and crayfish trypsin
Mouse     I  V  G  G  Y  N  C  E  E  N  S  V  P  Y  Q
Score     5  4  5  5 -3  2 -2 -2 -3  0  0 -1  6 10  4
Crayfish  I  V  G  G  T  D  A  V  L  G  E  F  P  Y  Q
- Raw score: 30
- Identities: 7/15 = 47%
- Positives: 8/15 = 53%
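The three score types for the mouse and crayfish example can be recomputed with a short sketch. The per-position scores are typed in by hand from the slide, with the E/V and E/L pairs taken as -2 and -3, which is what the stated raw score of 30 requires.

```python
mouse    = "IVGGYNCEENSVPYQ"
crayfish = "IVGGTDAVLGEFPYQ"
# Per-position matrix scores for the aligned pairs, in order.
scores = [5, 4, 5, 5, -3, 2, -2, -2, -3, 0, 0, -1, 6, 10, 4]

length = len(mouse)
raw_score = sum(scores)                                      # sum of all position scores
identities = sum(a == b for a, b in zip(mouse, crayfish))    # identical residue pairs
positives = sum(s > 0 for s in scores)                       # positions with a positive score

print(f"Raw score: {raw_score}")
print(f"Identities: {identities}/{length} = {round(100 * identities / length)}%")
print(f"Positives: {positives}/{length} = {round(100 * positives / length)}%")
```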
25Algorithms for Alignments
- Global
- Dynamic programming
- Breaking a problem down into smaller subproblems, then rebuilding
- Needleman and Wunsch
- Aligns whole sequences
- All gaps accounted for (internal and terminal)
- Semiglobal
- Revised by Needleman and Wunsch
- Aligns whole sequences
- Only internal gaps count
- Local
- Smith and Waterman
- Aligns localized regions of similarity
- Ignores unaligned ends (no terminal gap penalties)
26Partial Scores Table
- Used to align sequences
- Top and left axes labeled with sequences
- Contains alignment scores for all alignment options
- Used to determine optimal alignment
- Example: alignment of ACTCG and ACAGTAG
- Rules for global alignment
- Horizontal move: -1 (indicates gap in left axis)
- Vertical move: -1 (indicates gap in top axis)
- Diagonal move: +1 for match or 0 for mismatch
- First row and column are initialized with multiples of gap penalty
27Initial Partial Scores Table
28- Start in outlined box
- Calculate the possible scores from diagonal, above, and left
- Put the LARGEST (best) score in the box
- Move across table to complete first row
- Move to second row, etc., until table is complete
Diagonal: 0 + 1 (match) = 1
Top: -1 + (-1) = -2
Left: -1 + (-1) = -2
29Diagonal: -1 + 0 (mismatch) = -1
Top: -2 + (-1) = -3
Left: 1 + (-1) = 0
30Completed Table
Now, trace the optimal path. Start at the bottom
right, and move in the direction that gave that
score. End at the top left.
32Completed Path
Now, write the alignment
33Writing the Alignment from the Partial Scores
Table
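- ↖ (diagonal) means the two residues are aligned
- ↑ (vertical) means there is a gap in top axis
- ← (horizontal) means there is a gap in left axis
Each arrow in the path adds one column to the alignment, starting from the bottom right:
G         G
CG        AG
TCG       TAG
-TCG      GTAG
--TCG     AGTAG
C--TCG    CAGTAG
AC--TCG   ACAGTAG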
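The global alignment rules (gap -1, match +1, mismatch 0, first row and column filled with multiples of the gap penalty) can be sketched as below. This is a minimal illustration, not the course software; the name `global_align` and the choice of which sequence goes on the left versus the top axis are assumptions, and ties in the traceback may be broken differently than on the slide.

```python
def global_align(left, top, match=1, mismatch=0, gap=-1):
    """Fill a partial scores table and trace back one optimal global alignment."""
    n, m = len(left), len(top)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap            # first column: multiples of gap penalty
    for j in range(1, m + 1):
        table[0][j] = j * gap            # first row: multiples of gap penalty

    def pair(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(table[i - 1][j - 1] + pair(i, j),  # diagonal
                              table[i - 1][j] + gap,             # vertical
                              table[i][j - 1] + gap)             # horizontal

    # Trace back from the bottom right to the top left.
    i, j, out_left, out_top = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + pair(i, j):
            out_left.append(left[i - 1]); out_top.append(top[j - 1]); i -= 1; j -= 1
        elif j > 0 and table[i][j] == table[i][j - 1] + gap:
            out_left.append("-"); out_top.append(top[j - 1]); j -= 1   # gap in left axis
        else:
            out_left.append(left[i - 1]); out_top.append("-"); i -= 1  # gap in top axis
    return table[n][m], "".join(reversed(out_left)), "".join(reversed(out_top))

score, aligned_left, aligned_top = global_align("ACTCG", "ACAGTAG")
print(score)
print(aligned_left)
print(aligned_top)
```

The score in the bottom-right cell is the score of the optimal alignment; several alignments can share that score.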
34Semiglobal Alignments
- Only internal gaps count
- Do not penalize gaps at ends of sequence
- Rules for semiglobal alignment
- Horizontal move: -1 (indicates gap in left axis) EXCEPT in bottom row
- Vertical move: -1 (indicates gap in top axis) EXCEPT in last column
- Diagonal move: +1 for match or 0 for mismatch
- First row and column are initialized to zero
- Example: align ACACTG and ACACTGATCG
35Initial Partial Scores Table
36Diagonal: 0 + 0 (mismatch) = 0
Top: 0 + 0 (no penalty, last column) = 0
Left: 0 + (-1) = -1
37Diagonal: 0 + 0 (mismatch) = 0
Top: 0 + (-1) = -1
Left: 0 + 0 (no penalty, last row) = 0
38Completed Table
39Completed Path and Alignment
ACACTGATCG
ACACTG----
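A sketch of the semiglobal table fill, following the rules above: zeros in the first row and column, and no gap penalty for horizontal moves in the bottom row or vertical moves in the last column. The name `semiglobal_score` and the axis assignment are assumptions; only the final score is computed here, not the path.

```python
def semiglobal_score(left, top, match=1, mismatch=0, gap=-1):
    """Return the bottom-right score of a semiglobal partial scores table."""
    n, m = len(left), len(top)
    table = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column are zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = table[i - 1][j - 1] + (match if left[i - 1] == top[j - 1] else mismatch)
            # Vertical moves are free in the last column (terminal gap in top axis).
            vertical = table[i - 1][j] + (0 if j == m else gap)
            # Horizontal moves are free in the bottom row (terminal gap in left axis).
            horizontal = table[i][j - 1] + (0 if i == n else gap)
            table[i][j] = max(diagonal, vertical, horizontal)
    return table[n][m]

semi = semiglobal_score("ACACTG", "ACACTGATCG")
print(semi)
```

The six matching residues contribute the whole score, and the four trailing gaps cost nothing because they are terminal.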
40Local Alignments
- Used to find best matching subsequences within two sequences
- Rules for local alignment
- Horizontal move: -1
- Vertical move: -1
- Diagonal move: +1 for match or -1 for mismatch
- First row and column are initialized to zero
- Place a zero in the table if all other scores are negative for that box
- When determining path, find highest number on table, and work back until you come to a zero
- Example: GCGATATA and AACCTATAGCT
41Completed Table
42Alignment
Start with the highest value and continue until you reach zero
TATA
TATA
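The local alignment rules can be sketched in the same style: scores never drop below zero, and the traceback starts at the highest value and stops at a zero. The name `local_align` and the axis assignment are assumptions, and only the first highest-scoring cell is traced.

```python
def local_align(left, top, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned left segment, aligned top segment)."""
    n, m = len(left), len(top)
    table = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column are zero
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(0,                                   # never go negative
                              table[i - 1][j - 1] + pair(i, j),
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            if table[i][j] > best:
                best, best_pos = table[i][j], (i, j)

    # Trace back from the highest value until a zero is reached.
    i, j = best_pos
    out_left, out_top = [], []
    while i > 0 and j > 0 and table[i][j] > 0:
        if table[i][j] == table[i - 1][j - 1] + pair(i, j):
            out_left.append(left[i - 1]); out_top.append(top[j - 1]); i -= 1; j -= 1
        elif table[i][j] == table[i][j - 1] + gap:
            out_left.append("-"); out_top.append(top[j - 1]); j -= 1
        else:
            out_left.append(left[i - 1]); out_top.append("-"); i -= 1
    return best, "".join(reversed(out_left)), "".join(reversed(out_top))

result = local_align("GCGATATA", "AACCTATAGCT")
print(result)
```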
43Next: BLAST!
- Let's let the computer do the work