Title: Roadmap
1Roadmap
- The topics
- basic concepts of molecular biology
- more on Perl
- overview of the field
- biological databases and database searching
- sequence alignments
- phylogenetics
- structure prediction
- microarray data analysis
2Sequence alignments
- Introduction
- What is an alignment?
- Why do alignments?
- A bit of history
- Dot matrix comparison
- Scoring alignments
- Alignment methods
- Significance of alignments
3What is Sequence alignment
- A sequence alignment is an arrangement of two or more sequences that highlights their similarity.
4Why do alignments?
- Sequence alignment is useful for discovering structural, functional and evolutionary information in biological sequences.
5Over time, genes accumulate mutations
- Environmental factors
- Radiation
- Oxidation
- Mistakes in replication/repair
- Deletions, Duplications
- Insertions
- Inversions
- Point mutations
6Comparing two sequences
- Point mutations, easy
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
- Insertions/deletions, must align
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
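A minimal sketch of the "easy" case: when only point mutations separate two sequences of equal length, they can be compared position by position without any alignment step. The function name `point_differences` is made up for this illustration; the two sequences are the point-mutation pair from this slide.

```python
def point_differences(seq_a, seq_b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply insertions/deletions, so an alignment is needed first.
        raise ValueError("sequences must have the same length; align them first")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


original = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(point_differences(original, mutated))  # [9, 14, 18]
```

The same call raises an error for the insertion/deletion pair, which is exactly the case where gaps must be introduced before positions can be compared.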
7Sequence Alignment
- Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.
- A sequence for platelet-derived growth factor (PDGF) from mammalian cells was virtually identical to the sequence for the retrovirus-encoded oncogene known as v-sis (a gene causing cancer in animals).
- The retrovirus had acquired the gene from the host cell in some kind of genetic exchange event and had then produced a mutant that could alter the function of the normal protein when it infected another animal.
8Dot Matrix Comparison
- Sequence A: T C A G A G G T C T G
- Sequence B: T C A G A G C T G
[Figure: dot matrix of sequence A against sequence B; an X marks each position where the two residues match.]
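The dot matrix for sequences A and B can be regenerated with a few lines of code. This is a sketch only: the function name `dot_matrix` and the choice to put sequence A across the top and sequence B down the side are assumptions made for the illustration, not necessarily the orientation of the original figure.

```python
def dot_matrix(seq_a, seq_b):
    """Return one text row per residue of seq_b; 'X' marks a match with seq_a."""
    return ["".join("X" if a == b else "." for a in seq_a) for b in seq_b]


seq_a = "TCAGAGGTCTG"
seq_b = "TCAGAGCTG"

print("  " + seq_a)  # column labels: sequence A
for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
    print(residue + " " + row)  # row label: residue of sequence B
```

The run of X marks starting at the top-left corner corresponds to the shared prefix TCAGAG; the other marks are chance matches between single residues.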
9Interpretation of dot matrix
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to the main diagonal) indicate inversions
- Separate diagonals can be linked or "joined" to form an alignment with "gaps"
10More on Dot Matrix
- Detection of matching regions can be improved by filtering
- Use a sliding window to compare the two sequences. For example, print a dot at a matrix position only if
- 7 out of the next 11 positions in the sequence are identical, or
- the similarity score of the next 11 positions in the sequence is greater than 5.
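The first filtering rule (7 identical positions out of the next 11) can be sketched as follows. The function name `windowed_dot_matrix` and its parameter names are hypothetical; the example sequences are the point-mutation pair used earlier in these slides.

```python
def windowed_dot_matrix(seq_a, seq_b, window=11, min_matches=7):
    """Return the set of (i, j) start positions that pass the window filter.

    A dot is kept at (i, j) only if at least `min_matches` of the `window`
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


seq_a = "ACGTCTGATACGCCGTATAGTCTATCT"
seq_b = "ACGTCTGATTCGCCCTATCGTCTATCT"
dots = windowed_dot_matrix(seq_a, seq_b)
# Every window on the main diagonal passes despite the three point mutations.
print(all((i, i) in dots for i in range(len(seq_a) - 11 + 1)))
```

The second rule on the slide would replace the match count with a sum of substitution scores over the window and compare that sum to the threshold.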
11Sequence repeats
- Many sequences contain repetitive regions.
- Example: a retrovirus vector sequence compared against itself using a window size of 9 and a mismatch limit of 2 (http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)
12More on Dot Matrix
- A dot matrix graphically presents regions of identity or similarity between two sequences
- The use of windows and thresholds can reduce noise in a dot matrix
- Inversions and duplications have unique signatures in a dot matrix
13Software
- Dotlet (java applet)
- www.ch.embnet.org
- Dnadot
- arbl.cvmbs.colostate.edu/molkit/dnadot/
- Dotter
- www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
- Dottup
- www.emboss.org
14How to measure the similarity
- Basically three kinds of changes can occur at any given position within a sequence
- Mutation (substitution of one residue for another)
- Insertion
- Deletion
- Insertions and deletions have been found to occur in nature at a significantly lower frequency than mutations.
15Scoring Matrices for Aligning DNA Sequences
- Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
- Transversions: substitutions in which a purine is replaced by a pyrimidine or vice versa
- (A/G) ↔ (C/T)
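The two definitions translate directly into a small classifier. This is an illustrative sketch; the function name `classify_substitution` and its return labels are assumptions of this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(base_a, base_b):
    """Classify a DNA base pair as 'match', 'transition' or 'transversion'."""
    a, b = base_a.upper(), base_b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "match"
    # Both purines or both pyrimidines: transition. Otherwise: transversion.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```

A DNA scoring matrix built on this distinction would assign a different mismatch score to each of the two classes.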
16Scoring a sequence alignment
- Match score: 1
- Mismatch score: 0
- Gap penalty: -1
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Gaps: 7 × (-1)
- Score = 18 + 0 - 7 = +11
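The arithmetic on this slide can be checked with a short function. It is a sketch under the assumption that the gap penalty of 1 is subtracted once for every gap column; the name `score_alignment` and its parameters are made up for the illustration.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column with a fixed per-column gap cost."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # gap column
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 11
```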
17Gap opening and extension penalties
- We want to find alignments that are evolutionarily likely.
- Which of the following alignments seems more likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
- We can achieve this by penalizing more for a new gap than for extending an existing gap
18Scoring a sequence alignment
- Match/mismatch score: 1/0
- Open/extension penalty: -2/-1
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Open: 2 × (-2)
- Extension: 5 × (-1)
- Score = 18 + 0 - 4 - 5 = +9
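A sketch of the same scoring with separate opening and extension penalties, assuming (as the counts on this slide imply) that the first column of each gap is charged the opening penalty and every further column of that gap the extension penalty. The name `score_affine` is hypothetical, and for simplicity the function treats adjacent gap columns as one gap regardless of which row they are in.

```python
def score_affine(row_a, row_b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows with gap-opening and gap-extension penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            # First column of a gap pays the opening cost, later ones the extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_affine(top, "----CTGATTCGC---ATCGTCTATCT"))  # 9

# The two candidate alignments from the gap opening/extension slide:
print(score_affine(top, "ACGTCTGAT-------ATAGTCTATCT"))  # 12, one long gap
print(score_affine(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 7, six short gaps
```

Both candidate alignments contain 20 matches and 7 gap columns, so a flat per-column gap penalty cannot tell them apart; the opening penalty is what favours the single long gap.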
19Amino Acid Substitution Matrices
- PAM (point accepted mutation): based on a global alignment evolutionary model
- BLOSUM (blocks substitution matrix): based on local alignments of conserved sequence regions
20Part of PAM 250 Matrix
     C   S   T   P   A   G
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
21PAM matrices
- The PAM 1 matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).
- PAM 250: 250 units of evolution
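In the Dayhoff model as it is usually described, the mutation probabilities for N units of evolution are obtained by multiplying the PAM 1 probability matrix by itself N times. The sketch below illustrates this with a toy two-letter alphabet; the matrix `toy_pam1` and the helper names are invented for the example and are not real PAM data (real PAM matrices are 20 × 20 and are converted to log-odds scores afterwards).

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy data: each residue changes with probability 0.01 per unit of evolution.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 units the toy matrix is close to 0.5 everywhere: many positions have changed more than once, which is why 250 units of evolution does not mean that every residue differs.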
22Limitations of PAM Matrices
- Constructed based on phylogenetic relationships inferred prior to scoring mutations
- Difficulty of determining ancestral relationships among sequences
- Based on a small set of closely related proteins
23BLOSUM Matrices
- Based on the observed amino acid substitutions in a large set of about 2000 conserved amino acid patterns (blocks). The blocks were found in a database of protein sequences representing more than 500 families of related proteins and act as signatures of these protein families.
- The matrices are measured on the multiple alignments of the blocks.
- The entries of the matrices are computed based on the same principle used in PAM: log(odds ratio).
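The log(odds ratio) principle compares how often a pair of residues is observed aligned with how often it would be expected by chance. A minimal sketch follows; the function name, the base-2 logarithm, the scale factor of 2 (half-bit units) and the example frequencies are all assumptions of this illustration, not values taken from the slides.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded, scaled log2 of observed / expected frequency for an aligned pair.

    observed_pair_freq: how often residue a is seen aligned with residue b
    freq_a, freq_b:     background frequencies of the two residues
    """
    expected = freq_a * freq_b  # chance of the pair if residues were independent
    return round(scale * math.log2(observed_pair_freq / expected))


# Made-up frequencies: pair seen twice as often as chance, then half as often.
print(log_odds_score(0.02, 0.1, 0.1))   # 2
print(log_odds_score(0.005, 0.1, 0.1))  # -2
```

Positive entries therefore mark substitutions seen more often than chance would predict, and negative entries mark substitutions seen less often.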
24Part of BLOSUM 62 Matrix
- BLOSUM62 was measured on sequence blocks clustered at 62% identity (sequences at least 62% identical were merged before substitutions were counted).
     C   S   T   P   A   G
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
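A sketch of how such a matrix is used: store the lower triangle shown above and look up each aligned pair symmetrically. The dictionary holds only the six residues in this excerpt, and the names `BLOSUM62_PART`, `pair_score` and `score_ungapped` and the two short peptides are hypothetical.

```python
# Lower triangle of the BLOSUM62 excerpt, keyed as (row residue, column residue).
BLOSUM62_PART = {
    ("C", "C"): 9,
    ("S", "C"): -1, ("S", "S"): 4,
    ("T", "C"): -1, ("T", "S"): 1, ("T", "T"): 5,
    ("P", "C"): -3, ("P", "S"): -1, ("P", "T"): -1, ("P", "P"): 7,
    ("A", "C"): 0, ("A", "S"): 1, ("A", "T"): 0, ("A", "P"): -1, ("A", "A"): 4,
    ("G", "C"): -3, ("G", "S"): 0, ("G", "T"): -2, ("G", "P"): -2,
    ("G", "A"): 0, ("G", "G"): 6,
}


def pair_score(a, b, matrix=BLOSUM62_PART):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for residues outside the excerpt


def score_ungapped(pep_a, pep_b, matrix=BLOSUM62_PART):
    """Sum substitution scores over an ungapped alignment of equal-length peptides."""
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(pep_a, pep_b))


# C/C = 9, A/G = 0, S/S = 4, T/P = -1
print(score_ungapped("CAST", "CGSP"))  # 12
```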
25PAM vs. BLOSUM
- PAM
- Based on a mutational model of evolution (Markov process)
- PAM1 is based on sequences of 85% similarity
- Designed to track evolutionary origins
- BLOSUM
- Based on the multiple alignment of blocks
- Well suited to comparing distant sequences
- Designed to find conserved protein domains
26Gap Penalty
- Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.
- When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty often give better results
- When comparing closely related sequences, gaps should be penalized through both the gap-opening and the gap-extension penalty