Bioinformatics Unit 1: Data Bases and Alignments - PowerPoint PPT Presentation

Slides: 26
Provided by: marc47
Learn more at: http://bioweb.uwlax.edu
Transcript and Presenter's Notes

Title: Bioinformatics Unit 1: Data Bases and Alignments


1
Bioinformatics Unit 1: Data Bases and Alignments
  • Lecture 3
  • Homology Searches and Sequence Alignments
    (cont.)
  • The Mechanics of Alignments

2
Overview
  • Introduction/review
  • Reading alignment outputs
  • Scoring (substitution) matrices
  • More on alignment algorithms and dynamic
    programming
  • Useful alignment algorithms
  • Examples

3
Introduction
  • Sequence alignment is a useful tool with many,
    diverse applications.
  • Examples of sequence alignments
  • Compare a new sequence against an established
    sequence from a database
  • In sequencing a new gene one usually sequences
    both strands and then aligns them (reverse-complementing
    one of them, of course!).  This helps confirm accuracy.
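
The strand-reversal step mentioned in the last bullet can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` and the two short reads are hypothetical, not taken from the lecture.

```python
# Complement table for the four bases plus N (undetermined base).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

forward_read = "CCTTAATG"   # hypothetical read from one strand
reverse_read = "CATTAAGG"   # hypothetical read from the opposite strand

# Once the second read is reverse-complemented, the two reads can be
# compared base by base; agreement supports the sequence call.
print(reverse_complement(reverse_read) == forward_read)
```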

4
Examples of Sequence Alignments (cont.)
  • Compare the sequence homology to look for
    evolutionary relatedness.
  • To identify the sites of mutations
  • To find regions of overlapping sequence (cosmids
    or YACs for example)
  • To identify conserved functional domains in gene
    products
  • Others to be sure!

5
Understanding Alignment Outputs
  • One sequence is placed above another and the
    aligned vertical pairs are compared (scored)
  • Matching pairs are joined with a bar (|) to
    indicate identity.
  • A colon (:) is used to identify similar but
    nonidentical pairs.
  • IUB ambiguity codes are used (e.g. N pairs with
    G, C, T or A).
  • Nonidentical amino acids with similar physical
    properties can also be reported as similar.

6
Example
    330 CCTTNATTTCCTTTTTGACA 349
        ||||:||| |||||||||||
    991 CCTTAATTCCCTTTTTGACA 972
  • Only 20 bases of each sequence aligned (a local
    alignment)
  • The numbers at each end of the alignment
    correspond to the nucleotide numbers in the
    original sequences.
  • There was a 329 nucleotide non-identical prefix
    in the top query sequence and a 971 nucleotide
    non-identical prefix in the lower query sequence.
  • There may have been non-identical suffixes too,
    or the entered sequences may only have been 349
    and 991 bases long, respectively.

7
Example (cont.)
    330 CCTTNATTTCCTTTTTGACA 349
        ||||:||| |||||||||||
    991 CCTTAATTCCCTTTTTGACA 972
  • The lower sequence has been reverse-complemented
    (its numbering runs from 991 down to 972)
  • There are two non-identical pairs
  • Nucleotides number 334 and 987 are paired by a
    colon (:). The nucleotide at this position on the
    upper strand is an N, indicating that the
    sequencer was unable to determine the nucleotide
    identity.
  • The nucleotide pair between numbers 338 (top) and
    983 (bottom) comprises a T and a C. These do not
    match and no line has been drawn between them.
    This may be the result of a point mutation, or a
    mistake in determining or entering the sequence.
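
As a rough illustration of how the bar and colon symbols could be derived, the following Python sketch builds the middle line for the 20-base example above. The helper name `match_line` is hypothetical, and treating any pair of overlapping IUB codes as "similar" is an assumption about how a given program applies the colon.

```python
# IUB/IUPAC nucleotide codes and the bases each one can stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_line(top: str, bottom: str) -> str:
    """Build the symbol line drawn between two aligned sequences."""
    symbols = []
    for a, b in zip(top.upper(), bottom.upper()):
        if a == b and a in "ACGT":
            symbols.append("|")      # identical, unambiguous bases
        elif set(IUB.get(a, "")) & set(IUB.get(b, "")):
            symbols.append(":")      # compatible through an ambiguity code
        else:
            symbols.append(" ")      # mismatch or gap
    return "".join(symbols)

top = "CCTTNATTTCCTTTTTGACA"
bottom = "CCTTAATTCCCTTTTTGACA"
print(top)
print(match_line(top, bottom))   # ||||:||| |||||||||||
print(bottom)
```

The colon falls under the N (position 334) and the blank falls under the T/C pair (position 338), matching the description on the slide.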

8
Scoring Alignments
  • Positive values are given for each identical
    match
  • Smaller positive values are given for
    conservative substitutions
  • Negative values are given for non-identical,
    non-conservative pairs
  • Gaps are penalized
  • Total score is the sum of the individual
    pairwise scores
  • Longer alignments give higher scores than shorter
    ones

9
Gaps and Scoring
  • Gaps may be caused by insertion in one sequence
    or deletion in the other (indel events). We
    don't know which.
  • Gaps in an alignment are indicated by a dash (-) in
    one or both of the sequences
  • Gaps are penalized in scoring an alignment in two
    ways:
  • Origination penalty - the scoring penalty for
    creating a gap of any length (larger)
  • Length penalty - based on the length of the gap
    (smaller)

10
A Simple Example of Gap Scoring
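If the scoring matrix says:
  • Match = 1
  • Mismatch = 0
  • Gap origination penalty = -2
  • Gap length penalty = -1 (for each base)
Calculate the scores for each alignment. Which alignment is best and
why?
[The candidate alignments shown on the slide are not included in this transcript.]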
11
A Simple Example of Gap Scoring
  • First alignment: score = -3
  • Second alignment: score = -1
  • Third alignment: score = 1
If the scoring matrix says Match = 1, Mismatch = 0,
Gap origination penalty = -2, Gap length penalty = -1
(for each base), the third alignment is best. From an
evolutionary standpoint it requires only one genetic
event (an indel spanning 2 bases).
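
The scoring scheme above can be sketched in Python as follows. The slide's own alignments are not available in this transcript, so the two alignments below are hypothetical, and the function name `score_alignment` is illustrative. The sketch assumes that a gap of length L costs one origination penalty plus L length penalties; some programs count the first gapped base differently.

```python
def score_alignment(top, bottom, match=1, mismatch=0,
                    gap_open=-2, gap_extend=-1):
    """Sum column scores; each run of gaps pays gap_open once,
    plus gap_extend for every gapped base."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap = None                      # which sequence the current gap is in
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            which = "top" if a == "-" else "bottom"
            if which != prev_gap:        # a new gap starts here
                score += gap_open
            score += gap_extend
            prev_gap = which
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score

# Hypothetical alignments of the same two sequences.
one_gap  = score_alignment("ACGT--AC", "ACGTTTAC")   # one indel of 2 bases
two_gaps = score_alignment("ACG-T-AC", "ACGTTTAC")   # two indels of 1 base
print(one_gap, two_gaps)                             # 2 0
```

Both alignments contain six matches and two gapped bases, but the single two-base gap pays the origination penalty only once, so it scores higher. This mirrors the slide's argument that one indel event is the more economical explanation.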
12
Scoring Matrices: How values are assigned for
each pair in an alignment
  • DNA scoring matrices are fairly simple

13
Scoring Matrices: How values are assigned for
each pair in an alignment
  • Protein matrices are far more complex
  • There are 20 letters vs. only 4 in DNA
  • Far greater opportunity for conservative
    substitutions
  • Some are based on observed substitutions
  • Others are based on chemical/physical properties
    of the amino acids
  • Others are based on the genetic code (how easily
    could a codon specifying one amino acid be
    changed to a codon specifying a different amino
    acid?)
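
The genetic-code idea in the last bullet can be made concrete with a small Python sketch: for two amino acids, find the fewest single-nucleotide changes needed to turn a codon for one into a codon for the other. The function name `min_codon_changes` is hypothetical, and this is only one simple way to quantify "how easily" a codon could change; it is not claimed to be how any particular published matrix was built.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def min_codon_changes(aa1: str, aa2: str) -> int:
    """Fewest nucleotide substitutions separating any codon for aa1
    from any codon for aa2 (one-letter amino acid codes)."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons1
        for c2 in codons2
    )

print(min_codon_changes("F", "L"))   # 1  (e.g. TTT -> TTA)
print(min_codon_changes("M", "W"))   # 2  (ATG -> TGG)
print(min_codon_changes("F", "K"))   # 3
```

A matrix built on this principle would give a pair such as F/L a more favorable score than a pair such as F/K.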

14
Two Common Protein Scoring Matrices
  • The Point Accepted Mutation (PAM) matrix
  • Based on observed substitution rates
  • Different variations are used based on
    assumptions of the length of time since the
    sequences diverged
  • PAM-1 may be best for comparing two closely
    related sequences
  • PAM-1000 may be best for comparing sequences with
    distant relationships
  • PAM-250 is a suitable compromise

15
A PAM250 Scoring Matrix
16
Two Common Protein Scoring Matrices (cont.)
  • BLOSUM matrices are also commonly used
  • Constructed by analyzing substitution rates for
    sequences that cluster by phylogenetic analysis
  • Also appended with numbers (but different
    meaning)
  • BLOSUM-62 is best for comparing sequences with
    approximately 62% similarity
  • BLOSUM-80 is best for comparing sequences with
    approximately 80% similarity

17
Alignment Algorithms and Dynamic Programming
  • Computer trickery!
  • The straightforward approach is too intense
  • For 2 sequences of 95 and 100 nucleotides there
    are 55 million possible alignments!
  • (imagine a database search in this context!)
  • Dynamic programming breaks the problem into a
    series of small steps and adds the results of
    these small steps to answer the problem

18
Dynamic Programming (cont.)
When you run an alignment a dynamic programming
matrix is formed with the two sequences on the
sides.  Scores for each pair are placed in the
matrix.  If the sequences match, you would start
in the lower right corner and proceed
diagonally to the upper left corner.

    AC--TCG
    ACAGTAG

Alignment score: 2. Vertical arrows indicate
internal gaps.
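
The matrix fill and traceback described above can be sketched as a basic global alignment in Python. This is a minimal sketch, not the specific program behind the slide: the function name `global_align` is hypothetical, and the scoring values (match = 1, mismatch = 0, a flat gap penalty of -1 per base with no separate origination penalty) are assumptions chosen for simplicity.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Fill a dynamic programming matrix, then trace back from the
    lower right corner to the upper left to recover one alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap            # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap            # leading gaps in a

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell is the best of: diagonal (pair the bases),
    # up (gap in b), or left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

best, top, bottom = global_align("ACTCG", "ACAGTAG")
print(best)
print(top)
print(bottom)
```

With these assumed values the best score for the slide's pair of sequences is 2. Several alignments can share that score, so the one recovered depends on the order in which the traceback prefers diagonal, vertical, and horizontal moves.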
19
Graphical Output: Dot Plots and Path Graphs
20
Comparison
  • Dot Plots
    • Have been popular
    • Reveal complex relationships involving multiple
      regions
    • Difficult to interpret as they (may) show many
      alignments
    • Hard to see gaps and visualize best alignment
  • Path Diagrams
    • Simpler to interpret
    • Show only one alignment
    • (Some can show more)
    • Gaps appear as horizontal or vertical segments of
      the path line
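
A dot plot is simple enough to sketch directly in Python. The version below marks every position where the two sequences carry the same base; the function name `dot_plot` and the two sequences are hypothetical. It assumes no window or threshold filtering, which is why even short sequences produce scattered background dots.

```python
def dot_plot(x: str, y: str) -> list[str]:
    """One text row per base of y, one column per base of x;
    '*' marks identical bases, '.' marks everything else."""
    return ["".join("*" if a == b else "." for a in x) for b in y]

x = "ACGTACGT"      # hypothetical sequence along the horizontal axis
y = "ACGTTACGT"     # hypothetical sequence along the vertical axis

print("  " + x)
for base, row in zip(y, dot_plot(x, y)):
    print(base, row)
```

Runs of `*` along a diagonal correspond to matching regions. A shift from one diagonal to a neighboring one marks a gap, which a path diagram would draw as a horizontal or vertical segment.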

21
Example 1
[Figure: plot of sequence X against sequence Y, each axis labeled from 5' to 3']
22
Example 2
[Figure: plot of sequence X against sequence Y, each axis labeled from 5' to 3']
23
Example 3
[Figure: plot of sequence X against sequence Y, each axis labeled from 5' to 3']
24
Some Useful Alignment Programs
  • BLAST 2 Sequences (NCBI)
  • CLUSTALW (Biology Workbench)
  • MAP (Multiple Alignment Program) at Baylor, TX
  • Many others

25
A Nice BLAST 2 Sequences Example at
http://www.ncbi.nlm.nih.gov/blast/