Bioinformatics Unit 1: Data Bases and Alignments - PowerPoint PPT Presentation

Slides: 26

Provided by: marc47

Learn more at: http://bioweb.uwlax.edu

Transcript and Presenter's Notes

1
Bioinformatics Unit 1: Data Bases and Alignments

Lecture 3
Homology Searches and Sequence Alignments
(cont.)
The Mechanics of Alignments

2
Overview

Introduction/review
Reading alignment outputs
Scoring (substitution) matrices
More on alignment algorithms and dynamic
programming
Useful alignment algorithms
Examples

3
Introduction

Sequence alignment is a useful tool with many,
diverse applications.
Examples of sequence alignments
Compare a new sequence against an established
sequence from a database
In sequencing a new gene one usually sequences
both strands and then aligns them (after
reverse-complementing one of them, of course!).
Agreement between the two reads confirms accuracy.
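
Bringing the second strand into the same orientation as the first means reversing it and complementing each base. A minimal Python sketch, assuming plain A/C/G/T/N input; the name `reverse_complement` is illustrative and not from the lecture:

```python
# Complement table for unambiguous bases; N (unknown base) stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the whole string.
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check on the complement table.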

4
Examples of Sequence Alignments (cont.)

Compare the sequence homology to look for
evolutionary relatedness.
To identify the sites of mutations
To find regions of overlapping sequence (cosmids
or YACs for example)
To identify conserved functional domains in gene
products
Others to be sure!

5
Understanding Alignment Outputs

One sequence is placed above another and the
aligned vertical pairs are compared (scored)
Matching pairs are joined with a bar (|) to
indicate identity.
A colon (:) is used to identify similar but
nonidentical pairs.
IUB ambiguity codes are used (e.g. N pairs with
G, C, T or A).
Nonidentical amino acids with similar physical
properties can also be reported as similar.

6
Example

330 CCTTNATTTCCTTTTTGACA 349
    ||||:||| |||||||||||
991 CCTTAATTCCCTTTTTGACA 972
Only 20 bases of each sequence aligned (a local
alignment)
The numbers at each end of the alignment
correspond to the nucleotide numbers in the
original sequences.
There was a 329-nucleotide unaligned prefix
in the top (query) sequence and 971 unaligned
nucleotides in the lower sequence.
There may have been unaligned suffixes too,
or the entered sequences may only have been 349
and 991 bases long, respectively.

7
Example (cont.)

330 CCTTNATTTCCTTTTTGACA 349
    ||||:||| |||||||||||
991 CCTTAATTCCCTTTTTGACA 972
The lower sequence has been reverse-complemented
(note its descending numbering)
There are two non-identical pairs
Nucleotides number 334 and 987 are paired by a
colon (:). The nucleotide at this position on the
upper strand is an N indicating that the
sequencer was unable to determine the nucleotide
identity.
The nucleotide pair between numbers 338 (top) and
983 (bottom) comprises a T and a C. These do not
match and no line has been drawn between them.
This may be the result of a point mutation, or a
mistake in determining or entering the sequence.
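
The bar/colon/blank convention described on these slides can be reproduced with a short Python sketch. The function name `match_line` and the choice to mark any pair involving an ambiguity code as "similar" are assumptions made for illustration; real alignment programs differ in the details.

```python
# IUB/IUPAC nucleotide ambiguity codes mapped to the bases they can stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_line(top, bottom):
    """Build the middle line of a pairwise nucleotide alignment."""
    marks = []
    for a, b in zip(top, bottom):
        if a == b and a in "ACGT":
            marks.append("|")          # identical, unambiguous pair
        elif set(IUB[a]) & set(IUB[b]):
            marks.append(":")          # compatible through an ambiguity code
        else:
            marks.append(" ")          # mismatch: no symbol drawn
    return "".join(marks)

top = "CCTTNATTTCCTTTTTGACA"
bottom = "CCTTAATTCCCTTTTTGACA"
print(top)
print(match_line(top, bottom))
print(bottom)
```

For the example above, the colon falls at offset 4 (nucleotide 334 of the top sequence) and the blank at offset 8 (nucleotide 338), matching the positions discussed on the slide.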

8
Scoring Alignments

Positive values are given for each identical
match
Smaller positive values are given for
conservative substitutions
Negative values are given for non-identical,
non-conservative pairs
Gaps are penalized
Total score is the sum of the individual pair
wise scores
Longer alignments tend to give higher scores than
shorter ones

9
Gaps and Scoring

Gaps may be caused by insertion in one sequence
or deletion in the other (indel events). We
don't know which.
Gaps in an alignment are indicated by a dash (-) in
one or both of the sequences
Gaps are penalized in scoring an alignment in two
ways
Origination penalty - the scoring penalty for
creating a gap of any length (larger)
Length penalty - based on the length of the gap
(smaller)

10
A Simple Example of Gap Scoring

If the scoring matrix says:
Match = 1
Mismatch = 0
Gap origination penalty = -2
Gap length penalty = -1 (for each base)
Calculate the scores for each alignment. Which
alignment is best and why?

11
A Simple Example of Gap Scoring

Alignment 1: score = -3
Alignment 2: score = -1
Alignment 3: score = 1
If the scoring matrix says:
Match = 1
Mismatch = 0
Gap origination penalty = -2
Gap length penalty = -1 (for each base)
The third alignment is best. From an evolutionary
standpoint it requires only one genetic event (an
indel spanning 2 bases).

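The alignments scored on this slide did not survive in the transcript. The Python sketch below applies the stated scheme (match 1, mismatch 0, gap origination -2, gap length -1 per base) to three stand-in alignments of AATCTATA and AAGATA. These alignments are an assumption chosen for illustration, not taken from the slide, although they do reproduce the listed scores. The function name `affine_score` is also hypothetical.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score a finished pairwise alignment with origination + length gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_in = None  # which row the current gap is in: "top", "bottom", or None
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        side = "top" if a == "-" else "bottom" if b == "-" else None
        if side is None:
            score += match if a == b else mismatch
        else:
            if side != gap_in:
                score += gap_open    # charged once when a new gap starts
            score += gap_extend      # charged for every gapped position
        gap_in = side
    return score

# Stand-in alignments (assumed, not from the slide).
candidates = [
    ("AATCTATA", "AAG-AT-A"),
    ("AATCTATA", "AA-G-ATA"),
    ("AATCTATA", "AA--GATA"),
]
for top, bottom in candidates:
    print(bottom, affine_score(top, bottom))
```

The third candidate wins because its two gapped positions form a single gap, so the origination penalty is paid only once.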
12
Scoring Matrices: How values are assigned for
each pair in an alignment

DNA scoring matrices are fairly simple
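
The matrix shown on this slide is not preserved in the transcript. As a rough illustration of how simple a DNA matrix can be, the Python sketch below builds an identity matrix using the match = 1 and mismatch = 0 values from the gap-scoring example; those values, and the names `identity_matrix` and `ungapped_score`, are assumptions.

```python
BASES = "ACGT"

def identity_matrix(match=1, mismatch=0):
    """Return a 4 x 4 DNA scoring matrix as a dict keyed by base pairs."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}

def ungapped_score(top, bottom, matrix):
    """Sum the matrix values over each aligned pair (no gaps allowed)."""
    return sum(matrix[(a, b)] for a, b in zip(top, bottom))

matrix = identity_matrix()

# Print the matrix as a small table.
print("  " + " ".join(BASES))
for a in BASES:
    print(a + " " + " ".join(str(matrix[(a, b)]) for b in BASES))

print(ungapped_score("ACGT", "ACCT", matrix))  # 3
```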

13
Scoring Matrices: How values are assigned for
each pair in an alignment

Protein matrices are far more complex
There are 20 letters vs. only 4 in DNA
Far greater opportunity for conservative
substitutions
Some are based on observed substitutions
Others are based on chemical/physical properties
of the amino acids
Others are based on the genetic code (how easily
could a codon specifying one amino acid be
changed to a codon specifying a different amino
acid?)
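
The genetic-code criterion can be made concrete by counting the fewest nucleotide changes needed to turn a codon for one amino acid into a codon for another. The Python sketch below uses the standard genetic code; the function names are hypothetical, and turning these counts into actual matrix scores is left open because the slide does not specify a scheme.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codons_for(amino_acid):
    """All codons that specify the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]

def min_base_changes(aa1, aa2):
    """Fewest single-base changes converting some codon of aa1 into one of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )

print(min_base_changes("D", "E"))  # Asp -> Glu: 1
print(min_base_changes("M", "W"))  # Met -> Trp: 2
print(min_base_changes("F", "K"))  # Phe -> Lys: 3
```

Pairs separated by a single base change would be treated as "easier" substitutions than pairs needing two or three changes.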

14
Two Common Protein Scoring Matrices

The Point Accepted Mutation (PAM) matrix
Based on observed substitution rates
Different variations are used based on
assumptions of the length of time since the
sequences diverged
PAM-1 may be best for comparing two closely
related sequences
PAM-1000 may be best for comparing sequences with
distant relationships
PAM-250 is a suitable compromise

15
A PAM250 Scoring Matrix
16
Two Common Protein Scoring Matrices (cont.)

BLOSUM matrices are also commonly used
Constructed by analyzing substitution rates for
sequences that cluster by phylogenetic analysis
Also appended with numbers (but different
meaning)
BLOSUM-62 is best for comparing sequences with
approximately 62% similarity
BLOSUM-80 is best for comparing sequences with
approximately 80% similarity

17
Alignment Algorithms and Dynamic Programming

Computer trickery!
The straightforward approach is too intense
For 2 sequences of 95 and 100 nucleotides there
are 55 million possible alignments!
(imagine a database search in this context!)
Dynamic programming breaks the problem into a
series of small steps and adds the results of
these small steps to answer the problem

18
Dynamic Programming (cont.)
When you run an alignment, a dynamic programming
matrix is formed with the two sequences along its
sides. Scores for each pair are placed in the
matrix. If the sequences match, you would start
in the lower right corner and proceed
diagonally to the upper left corner.
AC--TCG
ACAGTAG
Alignment score: 2
Vertical arrows indicate internal gaps
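
The matrix-filling and traceback steps can be sketched in Python as a basic global (Needleman-Wunsch style) alignment. The scoring values used here (match 1, mismatch 0, and a flat penalty of -1 per gapped position) are an assumption: the slide does not state the scheme for this example, but these values reproduce its score of 2. The function name `global_align` is illustrative.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Fill a dynamic programming matrix, then trace back one best alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap              # leading gaps in b
    for j in range(1, cols):
        F[0][j] = j * gap              # leading gaps in a

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell reuses the three neighbouring sub-problems already solved.
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,             # gap in b
                          F[i][j - 1] + gap)             # gap in a

    # Traceback from the lower right corner to the upper left corner.
    i, j, top, bottom = len(a), len(b), [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = global_align("ACTCG", "ACAGTAG")
print(top)
print(bottom)
print("Alignment score:", score)
```

The matrix has only (5 + 1) x (7 + 1) = 48 cells for this pair, which is why the approach scales where enumerating every possible alignment does not.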
19
Graphical Output: Dot Plots and Path Graphs
20
Comparison