Title: Bioinformatics Unit 1: Data Bases and Alignments
1. Bioinformatics Unit 1: Data Bases and Alignments
- Lecture 3
- Homology Searches and Sequence Alignments (cont.)
- The Mechanics of Alignments
2. Overview
- Introduction/review
- Reading alignment outputs
- Scoring (substitution) matrices
- More on alignment algorithms and dynamic programming
- Useful alignment algorithms
- Examples
3. Introduction
- Sequence alignment is a useful tool with many diverse applications.
- Examples of sequence alignments
- Compare a new sequence against an established sequence from a database
- In sequencing a new gene one usually sequences both strands and then aligns them (after reverse-complementing one of them, of course!). Agreement between the two strands helps confirm accuracy.
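A minimal Python sketch of the both-strands check described above. The function names and the two short reads are invented for illustration. Real reads rarely start and end at the same position, so in practice a proper alignment step would replace the simple equality test used here.

```python
# Watson-Crick complements, plus N for an undetermined base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def strands_agree(forward_read, reverse_read):
    """True if the reverse-strand read, once reverse-complemented,
    is identical to the forward-strand read."""
    return forward_read.upper() == reverse_complement(reverse_read)


forward = "CCTTAATTCC"   # hypothetical read of one strand
reverse = "GGAATTAAGG"   # hypothetical read of the opposite strand

print(reverse_complement(reverse))      # CCTTAATTCC
print(strands_agree(forward, reverse))  # True
```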
4. Examples of Sequence Alignments (cont.)
- Compare sequences to look for evolutionary relatedness (homology).
- To identify the sites of mutations
- To find regions of overlapping sequence (cosmids or YACs, for example)
- To identify conserved functional domains in gene products
- Others, to be sure!
5. Understanding Alignment Outputs
- One sequence is placed above another and the aligned vertical pairs are compared (scored)
- Matching pairs are joined with a bar (|) to indicate identity.
- A colon (:) is used to identify similar but nonidentical pairs.
- IUB ambiguity codes are used (e.g. N pairs with G, C, T or A).
- Nonidentical amino acids with similar physical properties can also be reported as similar.
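The bar and colon conventions above can be sketched in a few lines of Python. This is only an illustration for nucleotide sequences: the function names are made up, the table holds the standard IUB ambiguity codes, and amino acid similarity (the last bullet) would need a substitution matrix instead.

```python
# IUB nucleotide ambiguity codes mapped to the bases each one can stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_symbol(a, b):
    """'|' for identical bases, ':' for compatible ambiguity codes,
    ' ' for a mismatch or a gap."""
    if a == "-" or b == "-":
        return " "
    if a == b and a in "ACGT":
        return "|"
    if set(IUB[a]) & set(IUB[b]):
        return ":"
    return " "


def match_line(top, bottom):
    """Build the line drawn between two aligned sequences."""
    return "".join(match_symbol(a, b)
                   for a, b in zip(top.upper(), bottom.upper()))


top, bottom = "ACGTN", "ACCTA"   # hypothetical aligned sequences
print(top)
print(match_line(top, bottom))   # || |:
print(bottom)
```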
6. Example
  330 CCTTNATTTCCTTTTTGACA 349
      ||||:||| |||||||||||
  991 CCTTAATTCCCTTTTTGACA 972
- Only 20 bases of each sequence aligned (a local alignment)
- The numbers at each end of the alignment correspond to the nucleotide numbers in the original sequences.
- There was a 329-nucleotide unaligned region ahead of the match in the top sequence, and nucleotides 1 to 971 of the lower sequence also fall outside the alignment.
- There may have been unaligned regions on the other side too, or the entered sequences may have been only 349 and 991 bases long, respectively.
7. Example (cont.)
  330 CCTTNATTTCCTTTTTGACA 349
      ||||:||| |||||||||||
  991 CCTTAATTCCCTTTTTGACA 972
- The lower sequence has been reverse-complemented (its numbering runs from 991 down to 972)
- There are two non-identical pairs
- Nucleotides number 334 and 987 are paired by a colon (:). The nucleotide at this position on the upper strand is an N, indicating that the sequencer was unable to determine the nucleotide identity.
- The nucleotide pair between numbers 338 (top) and 983 (bottom) comprises a T and a C. These do not match and no line has been drawn between them. This may be the result of a point mutation, or a mistake in determining or entering the sequence.
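As a quick check of the positions quoted in this example, the sketch below walks along the two aligned 20-base segments and reports every column where the bases differ, converting the column offset back to the original numbering. The function name and its parameters are assumptions made for illustration, and it only handles an ungapped alignment like this one.

```python
def non_identical_pairs(top, bottom, top_start, bottom_start, bottom_step=-1):
    """Return (top_pos, top_base, bottom_pos, bottom_base) for every column
    whose two bases differ. bottom_step=-1 means the lower sequence is
    numbered in descending order (it was reverse-complemented)."""
    pairs = []
    for offset, (a, b) in enumerate(zip(top, bottom)):
        if a != b:
            pairs.append((top_start + offset, a,
                          bottom_start + bottom_step * offset, b))
    return pairs


top = "CCTTNATTTCCTTTTTGACA"      # positions 330 to 349
bottom = "CCTTAATTCCCTTTTTGACA"   # positions 991 down to 972

for pair in non_identical_pairs(top, bottom, 330, 991):
    print(pair)
# (334, 'N', 987, 'A')
# (338, 'T', 983, 'C')
```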
8. Scoring Alignments
- Positive values are given for each identical match
- Smaller positive values are given for conservative substitutions
- Negative values are given for non-identical, non-conservative pairs
- Gaps are penalized
- Total score is the sum of the individual pairwise scores
- Longer alignments tend to give higher scores than shorter ones
9. Gaps and Scoring
- Gaps may be caused by an insertion in one sequence or a deletion in the other (indel events). We don't know which.
- Gaps in an alignment are indicated by a "-" in one or both of the sequences
- Gaps are penalized in scoring an alignment in two ways
- Origination penalty: the scoring penalty for creating a gap of any length (larger)
- Length penalty: based on the length of the gap (smaller)
10. A Simple Example of Gap Scoring
If the scoring matrix says:
Match = 1
Mismatch = 0
Gap origination penalty = -2
Gap length penalty = -1 (for each base)
Calculate the scores for each alignment. Which alignment is best and why?
11. A Simple Example of Gap Scoring
Alignment 1: Score = -3
Alignment 2: Score = -1
Alignment 3: Score = 1
If the scoring matrix says:
Match = 1
Mismatch = 0
Gap origination penalty = -2
Gap length penalty = -1 (for each base)
The third alignment is best. From an evolutionary standpoint it requires only one genetic event (an indel spanning 2 bases).
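The scoring rules above can be sketched in Python as follows. The candidate alignments from the slide were not recovered, so the three alignments below (GATTACA against GATCA) are invented for illustration; they were chosen so that they happen to give the same three scores. The function name and parameter names are also assumptions.

```python
def score_alignment(top, bottom, match=1, mismatch=0,
                    gap_open=-2, gap_extend=-1):
    """Score a gapped alignment: each gap costs the origination penalty
    once, plus the length penalty for every base it spans."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != gap_row:          # a new gap starts here
                score += gap_open
            score += gap_extend
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


# Hypothetical alignments, not the ones from the slide.
print(score_alignment("GATTACA", "G-A-TCA"))  # -3: 3 matches, two 1-base gaps
print(score_alignment("GATTACA", "GA-T-CA"))  # -1: 5 matches, two 1-base gaps
print(score_alignment("GATTACA", "GAT--CA"))  #  1: 5 matches, one 2-base gap
```

The last two alignments have the same number of matches; the third scores higher only because it pays the origination penalty once instead of twice.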
12. Scoring Matrices: How values are assigned for each pair in an alignment
- DNA scoring matrices are fairly simple
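The matrix shown on this slide was not recovered. As a rough illustration of how simple a DNA matrix can be, the sketch below builds a 4 x 4 table, assuming the match = 1 and mismatch = 0 values from the gap scoring example; the function names are hypothetical.

```python
BASES = "ACGT"


def dna_matrix(match=1, mismatch=0):
    """Build a DNA scoring matrix as a dict keyed by (base, base)."""
    return {(a, b): (match if a == b else mismatch)
            for a in BASES for b in BASES}


def print_matrix(matrix):
    """Print the matrix as a small text table."""
    print("  " + " ".join(BASES))
    for a in BASES:
        print(a, " ".join(str(matrix[(a, b)]) for b in BASES))


def ungapped_score(top, bottom, matrix):
    """Sum the matrix values over every aligned pair (no gaps allowed)."""
    return sum(matrix[(a, b)] for a, b in zip(top, bottom))


matrix = dna_matrix()
print_matrix(matrix)
print(ungapped_score("ACGT", "ACCT", matrix))  # 3
```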
13. Scoring Matrices: How values are assigned for each pair in an alignment
- Protein matrices are far more complex
- There are 20 letters v. only 4 in DNA
- Far greater opportunity for conservative substitutions
- Some are based on observed substitutions
- Others are based on chemical/physical properties of the amino acids
- Others are based on the genetic code (how easily could a codon specifying one amino acid be changed to a codon specifying a different amino acid?)
14. Two Common Protein Scoring Matrices
- The Point Accepted Mutation (PAM) matrix
- Based on observed substitution rates
- Different variations are used based on assumptions about the length of time since the sequences diverged
- PAM-1 may be best for comparing two closely related sequences
- PAM-1000 may be best for comparing sequences with distant relationships
- PAM-250 is a suitable compromise
15. A PAM250 Scoring Matrix
16. Two Common Protein Scoring Matrices (cont.)
- BLOSUM matrices are also commonly used
- Constructed by analyzing substitution rates for sequences that cluster by phylogenetic analysis
- Also appended with numbers (but with a different meaning)
- BLOSUM-62 is best for comparing sequences with approximately 62% identity
- BLOSUM-80 is best for comparing sequences with approximately 80% identity
17. Alignment Algorithms and Dynamic Programming
- Computer trickery!
- The straightforward approach (scoring every possible alignment) is too computationally intensive
- For 2 sequences of 95 and 100 nucleotides there are 55 million possible alignments!
- (imagine a database search in this context!)
- Dynamic programming breaks the problem into a series of small steps and combines the results of these small steps to answer the problem
18. Dynamic Programming (cont.)
When you run an alignment, a dynamic programming matrix is formed with the two sequences along its sides. Scores for each pair are placed in the matrix. If the sequences match, you would start in the lower right corner and proceed diagonally to the upper left corner.
AC--TCG
ACAGTAG
Alignment score: 2
Vertical arrows indicate internal gaps
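The matrix fill and the walk back from the lower right corner can be sketched as a small global alignment routine in the style of Needleman and Wunsch. The slide does not state its scoring scheme, so match = 1, mismatch = 0 and a flat penalty of -1 per gap position are assumptions; under them the best score for these two sequences is 2, the same value the slide reports. Several alignments tie at that score, so the one returned may differ from the one shown on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming.
    Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):            # first column: a aligned to gaps
        score[i][0] = i * gap
    for j in range(1, cols):            # first row: b aligned to gaps
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill each cell from its three already-computed neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # diagonal
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Trace back from the lower right corner to the upper left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = needleman_wunsch("ACTCG", "ACAGTAG")
print(best)     # 2
print(top)
print(bottom)
```

The fill step visits each cell of the matrix once, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.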
19. Graphical Output: Dot Plots and Path Graphs
20. Comparison
- Dot Plots
  - Have been popular
  - Reveal complex relationships involving multiple regions
  - Difficult to interpret as they (may) show many alignments
  - Hard to see gaps and visualize the best alignment
- Path Diagrams
  - Simpler to interpret
  - Show only one alignment
  - (Some can show more)
  - Gaps appear as horizontal or vertical segments of the path line
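A text-only dot plot takes only a few lines of Python, which may help show why such plots contain many candidate alignments at once. This is a sketch under assumed names: sequence X runs across the columns, sequence Y down the rows, and a '*' marks every position where the two sequences share an identical word of length `window`.

```python
def dot_plot(x, y, window=1):
    """Return one text row per position in y; '*' marks positions where
    the window-length words starting in x and y are identical."""
    rows = []
    for j in range(len(y) - window + 1):
        row = ""
        for i in range(len(x) - window + 1):
            row += "*" if x[i:i + window] == y[j:j + window] else "."
        rows.append(row)
    return rows


# Hypothetical sequences: the second lacks two bases found in the first.
for row in dot_plot("GATTACA", "GATCA"):
    print(row)
```

Matching regions show up as diagonal runs of '*', and a jump from one diagonal to another corresponds to a gap. Raising `window` removes many of the isolated chance matches.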
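21. Example 1
[Figure not recovered: plot of sequence X against sequence Y, each axis labeled with its 5' and 3' ends]
22. Example 2
[Figure not recovered: plot of sequence X against sequence Y, each axis labeled with its 5' and 3' ends]
23. Example 3
[Figure not recovered: plot of sequence X against sequence Y, each axis labeled with its 5' and 3' ends]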
24. Some Useful Alignment Programs
- BLAST 2 Sequences (NCBI)
- CLUSTALW (Biology Workbench)
- MAP (Multiple Alignment Program) at Baylor, TX
- Many others
25. A Nice BLAST 2 Sequences Example at
http://www.ncbi.nlm.nih.gov/blast/