Roadmap - PowerPoint PPT Presentation

About This Presentation
Title:

Roadmap

Description:

Sequence alignments Introduction What is an alignment? ... Retrovirus had acquired the gene from the host cell as some kind of genetic ... Bioinformatics Author: – PowerPoint PPT presentation

Slides: 27
Provided by: duan85

Transcript and Presenter's Notes

Title: Roadmap


1
Roadmap
  • The topics
  • basic concepts of molecular biology
  • more on Perl
  • overview of the field
  • biological databases and database searching
  • sequence alignments
  • phylogenetics
  • structure prediction
  • microarray data analysis

2
Sequence alignments
  • Introduction
  • What is an alignment?
  • Why do alignments?
  • A bit of history
  • Dot matrix comparison
  • Scoring alignments
  • Alignment methods
  • Significance of alignments

3
What is sequence alignment?
  • Sequence alignment is an arrangement of two or
    more sequences, highlighting their similarity.

4
Why do alignments?
  • Sequence alignment is useful for discovering
    structural, functional, and evolutionary
    information in biological sequences.

5
Over time, genes accumulate mutations
  • Environmental factors
  • Radiation
  • Oxidation
  • Mistakes in replication/repair
  • Deletions, Duplications
  • Insertions
  • Inversions
  • Point mutations

6
Comparing two sequences
  • Point mutations, easy:
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGATTCGCCCTATCGTCTATCT
  • Insertions/deletions, must align:
    ACGTCTGATACGCCGTATAGTCTATCT
    CTGATTCGCATCGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT

7
Sequence Alignment
  • Doolittle RF, Hunkapiller MW, Hood LE,
    Devare SG, Robbins KC, Aaronson SA,
    Antoniades HN. Science 221:275-277, 1983.
  • The sequence of platelet-derived growth factor
    (PDGF) from mammalian cells was found to be
    virtually identical to that of v-sis, a
    retrovirus-encoded oncogene (a gene that causes
    cancer in animals).
  • The retrovirus had apparently acquired the gene
    from the host cell through some kind of genetic
    exchange event and then produced a mutant that
    could alter the function of the normal protein
    when it infected another animal.

8
Dot Matrix Comparison
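  • Sequence A (columns): T C A G A G G T C T G
  • Sequence B (rows): T C A G A G C T G

        T C A G A G G T C T G
      T X . . . . . . X . X .
      C . X . . . . . . X . .
      A . . X . X . . . . . .
      G . . . X . X X . . . X
      A . . X . X . . . . . .
      G . . . X . X X . . . X
      C . X . . . . . . X . .
      T X . . . . . . X . X .
      G . . . X . X X . . . X

    (X = match, . = no match)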
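
The matrix on this slide can be rebuilt with a few lines of Python. This is only a minimal sketch: the function names `dot_matrix` and `render` are made up for illustration, and "X" / "." are just one way to print matches and non-matches.

```python
def dot_matrix(seq_a, seq_b):
    """Return one row per base of seq_b; each cell is True where the bases match."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Format the dot matrix as text, with seq_a across the top and seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for base, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(base + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


SEQ_A = "TCAGAGGTCTG"  # sequence A from the slide
SEQ_B = "TCAGAGCTG"    # sequence B from the slide
print(render(SEQ_A, SEQ_B))
```

The run of matches along the main diagonal corresponds to the shared prefix TCAGAG of the two sequences.
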
9
Interpretation of dot matrix
  • Regions of similarity appear as diagonal runs of
    dots
  • Reverse diagonals (perpendicular to diagonal)
    indicate inversions
  • Can link or "join" separate diagonals to form
    alignment with "gaps"

10
More on Dot Matrix
  • Detection of matching regions can be improved by
    filtering.
  • Use a sliding window to compare the two
    sequences. For example, print a dot at a matrix
    position only if either
  • at least 7 of the next 11 positions in the
    sequences are identical, or
  • the similarity score of the next 11 positions in
    the sequences is greater than 5.
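
The identity-based filter can be sketched as follows. The function name `windowed_dots` and its parameters are hypothetical; the defaults (window of 11, at least 7 identical positions) follow the example on this slide, and the test sequences are the point-mutation pair from slide 6.

```python
def windowed_dots(seq_a, seq_b, window=11, min_matches=7):
    """Return (row, column) positions whose next `window` bases agree in at least `min_matches` places.

    Rows index seq_b and columns index seq_a, as in the dot matrix slide.
    """
    dots = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "ACGTCTGATTCGCCCTATCGTCTATCT"
dots = windowed_dots(top, bottom)
# The two sequences differ at only three positions, so every window on the
# main diagonal passes the 7-of-11 filter.
print(all((i, i) in dots for i in range(len(top) - 11 + 1)))
```

A larger window or a higher threshold removes more background dots, at the cost of missing short matching regions.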

11
Sequence repeats
  • Many sequences contain repetitive regions.

Example: a retrovirus vector sequence compared against itself using
a window size of 9 and a mismatch limit of 2
(http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)
12
More on Dot Matrix
  • A dot matrix graphically presents regions of
    identity or similarity between two sequences
  • The use of windows and thresholds can reduce
    noise in a dot matrix
  • Inversions and duplications have unique
    signatures in a dot matrix

13
Software
  • Dotlet (Java applet): www.ch.embnet.org
  • Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/
  • Dotter: www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
  • Dottup: www.emboss.org

14
How to measure the similarity
  • Three kinds of changes can occur at any
    given position within a sequence
  • Substitution (point mutation)
  • Insertion
  • Deletion
  • Insertions and deletions have been found to occur
    in nature at a significantly lower frequency than
    substitutions.

15
Scoring Matrices for Aligning DNA Sequences
  • Transitions: substitutions in which a purine
    (A/G) is replaced by another purine (A/G) or a
    pyrimidine (C/T) is replaced by another
    pyrimidine (C/T).
  • Transversions: substitutions in which a purine
    is replaced by a pyrimidine or vice versa,
    (A/G) ↔ (C/T).
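
The two definitions translate directly into a small classifier. This is an illustrative sketch; `classify_substitution` is a made-up name and only the four DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Classify a pair of DNA bases as 'match', 'transition', or 'transversion'."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "match"
    # Both bases in the same chemical class means a transition.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```

A DNA scoring matrix can use this distinction by giving transitions a milder penalty than transversions.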

16
Scoring a sequence alignment
  • Match score: +1
  • Mismatch score: 0
  • Gap penalty: -1
  • ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT
  • Matches: 18 × (+1) = 18
  • Mismatches: 2 × 0 = 0
  • Gaps: 7 × (-1) = -7

Score = 18 + 0 - 7 = 11
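
The arithmetic on this slide can be checked with a short function. This is a sketch under the slide's scoring scheme (match +1, mismatch 0, every gap position -1); the name `score_alignment` is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column with a fixed per-position gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))  # prints 11
```
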
17
Gap opening and extension penalties
  • We want to find alignments that are
    evolutionarily likely.
  • Which of the following alignments seems more
    likely to you?
  • Alignment 1 (one long gap):
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGAT-------ATAGTCTATCT
  • Alignment 2 (many short gaps):
    ACGTCTGATACGCCGTATAGTCTATCT
    AC-T-TGA--CG-CGT-TA-TCTATCT
  • We can achieve this by penalizing more for a new
    gap, than for extending an existing gap

18
Scoring a sequence alignment
  • Match/mismatch score: +1/0
  • Gap opening/extension penalty: -2/-1
  • ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT
  • Matches: 18 × (+1) = 18
  • Mismatches: 2 × 0 = 0
  • Gap openings: 2 × (-2) = -4
  • Gap extensions: 5 × (-1) = -5

Score = 18 + 0 - 4 - 5 = 9
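
A sketch of the same calculation in Python, assuming the convention used on this slide: the first position of each gap is charged the opening penalty (-2) and every further position of that gap is charged the extension penalty (-1). The function name `score_affine` is made up for illustration.

```python
def score_affine(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows with separate gap-opening and gap-extension penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension; otherwise a new gap opens.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


TOP = "ACGTCTGATACGCCGTATAGTCTATCT"
print(score_affine(TOP, "----CTGATTCGC---ATCGTCTATCT"))  # prints 9
# The two candidate alignments from the previous slide:
print(score_affine(TOP, "ACGTCTGAT-------ATAGTCTATCT"))  # prints 12
print(score_affine(TOP, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # prints 7
```

Both candidate alignments contain 20 matches and 7 gap positions, so a flat gap penalty cannot tell them apart. Under this scheme the alignment with a single long gap scores higher.
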
19
Amino Acid Substitution Matrices
  • PAM (point accepted mutation): based on global
    alignments and an evolutionary model
  • BLOSUM (blocks substitution matrix): based on
    local alignments of conserved sequence regions

20
Part of PAM 250 Matrix
       C   S   T   P   A   G
   C  12
   S   0   2
   T  -2   1   3
   P  -3   1   0   6
   A  -2   1   1   1   2
   G  -3   1   0  -1   1   5
21
PAM matrices
  • The PAM 1 matrix reflects an amount of evolution
    producing on average one accepted mutation per
    hundred amino acids (1 unit of evolution).
  • PAM 250: 250 units of evolution
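
In the Markov model behind PAM, the mutation probabilities for n units of evolution are obtained by multiplying the PAM 1 probability matrix by itself n times. The sketch below illustrates only that step. It uses a made-up two-letter alphabet with invented probabilities (`TOY_PAM1`), not the real 20 × 20 amino acid matrix, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1": each letter stays the same with probability 0.99 (assumed values).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(TOY_PAM1, 250)
print([round(p, 3) for p in toy_pam250[0]])
```

After 250 steps a residue has usually changed more than once, which is why PAM 250 suits more distant comparisons than PAM 1. Scores such as those in the PAM 250 table are log-odds values derived from these probabilities, a step not shown here.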

22
Limitations of PAM Matrices
  • Construction requires phylogenetic relationships
    to be inferred before mutations are scored
  • Ancestral relationships among sequences are
    difficult to determine
  • Based on a small set of closely related proteins

23
BLOSUM Matrices
  • Based on the observed amino acid substitutions in
    a large set of about 2000 conserved amino acid
    patterns (blocks). The blocks were found in a
    database of protein sequences representing more
    than 500 families of related proteins and act as
    signatures of these protein families.
  • The matrices are derived from the multiple
    alignments of the blocks.
  • The entries of the matrices are computed using
    the same principle as in PAM: the logarithm of
    an odds ratio (log-odds).
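
A log-odds entry compares how often a pair of amino acids is observed aligned with how often it would be expected by chance. The sketch below assumes base-2 logarithms scaled by 2 and rounded to an integer (half-bit units, the scaling commonly quoted for BLOSUM62). The function name and the example frequencies are hypothetical, not real block counts.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Return round(scale * log2(observed / expected)) for one amino acid pair."""
    if observed_freq <= 0 or expected_freq <= 0:
        raise ValueError("frequencies must be positive")
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(4.0, 1.0))   # pair seen more often than chance: positive score
print(log_odds_score(1.0, 4.0))   # pair seen less often than chance: negative score
```

Pairs that occur as often as chance predicts score 0.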

24
Part of BLOSUM 62 Matrix
  • BLOSUM62 was derived from blocks in which
    sequences sharing 62% or more identical amino
    acids were clustered and counted as one.

       C   S   T   P   A   G
   C   9
   S  -1   4
   T  -1   1   5
   P  -3  -1  -1   7
   A   0   1   0  -1   4
   G  -3   0  -2  -2   0   6
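
A substitution matrix is used by looking up one score per aligned pair of residues and summing. The sketch below loads only the six-residue excerpt shown on this slide, so it handles just C, S, T, P, A and G. The names `BLOSUM62_PART` and `score_ungapped` and the example peptides are made up for illustration.

```python
RESIDUES = "CSTPAG"
LOWER_TRIANGLE = [  # values copied from the slide, one row per residue
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
]

# The matrix is symmetric, so store each score under both orderings of the pair.
BLOSUM62_PART = {}
for i, row in enumerate(LOWER_TRIANGLE):
    for j, value in enumerate(row):
        BLOSUM62_PART[(RESIDUES[i], RESIDUES[j])] = value
        BLOSUM62_PART[(RESIDUES[j], RESIDUES[i])] = value


def score_ungapped(pep_a, pep_b, matrix=BLOSUM62_PART):
    """Sum substitution scores over two equal-length peptides aligned without gaps."""
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(pep_a, pep_b))


print(score_ungapped("CAT", "CAT"))  # 9 + 4 + 5 = 18
print(score_ungapped("GAP", "SAT"))  # 0 + 4 + (-1) = 3
```
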
25
PAM vs. BLOSUM
  • PAM
  • Based on a mutational model of evolution (Markov
    process)
  • PAM1 is based on sequences with at least 85%
    similarity
  • Designed to track evolutionary origins
  • BLOSUM
  • Based on the multiple alignment of blocks
  • Well suited to comparing distantly related
    sequences
  • Designed to find conserved domains in proteins

26
Gap Penalty
  • Optimal penalties vary from sequence to sequence,
    and finding the most suitable values is a matter
    of empirical trial and error.
  • When comparing distantly related sequences, a high
    gap-opening penalty and a very low gap-extension
    penalty often give better results.
  • When comparing closely related sequences, both
    gap opening and gap extension should be
    penalized.