Introduction to Bioinformatics

Provided by: personeel
1
Introduction to
Bioinformatics
2
Introduction to Bioinformatics
LECTURE 3: SEQUENCE ALIGNMENT (Chapter 3: All
in the family)
3
Introduction to Bioinformatics, LECTURE 3:
SEQUENCE ALIGNMENT
  • 3.1 Eye of the tiger
  • In 1994 Walter Gehring et al. (University of
    Basel) turned the gene eyeless on in various
    places on Drosophila melanogaster
  • Result: eyes are formed in multiple places
  • eyeless is a master regulatory gene that
    controls approximately 2000 other genes
  • Turning eyeless on induces formation of an eye

4
(No Transcript)
5
Mutant Drosophila melanogaster: gene EYELESS
turned on
6
LECTURE 3 SEQUENCE ALIGNMENT: Homeoboxes and
master regulatory genes
7
  • HOMEOBOX
  • A homeobox is a DNA sequence found within genes
    that are involved in the regulation of
    development (morphogenesis) of animals, fungi and
    plants.

8
LECTURE 3 SEQUENCE ALIGNMENT: Drosophila
melanogaster HOX homeoboxes
9
LECTURE 3 SEQUENCE ALIGNMENT: Drosophila
melanogaster PAX homeoboxes
10
LECTURE 3 SEQUENCE ALIGNMENT: Homeoboxes and
master regulatory genes
11
  • 3.2 On sequence alignment
  • Sequence alignment is the most important task in
    bioinformatics!

12
  • 3.2 On sequence alignment
  • Sequence alignment is important for:
  • prediction of function
  • database searching
  • gene finding
  • sequence divergence
  • sequence assembly

13
  • 3.3 On sequence similarity
  • Homology: genes that derive from a common
    ancestor gene are called homologs
  • Orthologous genes are homologous genes in
    different organisms that diverged through
    speciation
  • Paralogous genes are homologous genes in one
    organism that derive from gene duplication
  • Gene duplication: one gene is duplicated into
    multiple copies, which are therefore free to
    evolve and assume new functions

14
LECTURE 3 SEQUENCE ALIGNMENT: HOMOLOGOUS and
PARALOGOUS
15
LECTURE 3 SEQUENCE ALIGNMENT: HOMOLOGOUS and
PARALOGOUS
16
LECTURE 3 SEQUENCE ALIGNMENT: HOMOLOGOUS and
PARALOGOUS versus ANALOGOUS
17
Introduction to Bioinformatics, LECTURE 3:
SEQUENCE ALIGNMENT, sequence similarity
  • Causes for sequence (dis)similarity:
  • mutation (substitution): a nucleotide at a
    certain location is replaced by another
    nucleotide (e.g. ATA → AGA)
  • insertion: at a certain location one new
    nucleotide is inserted between two existing
    nucleotides (e.g. AA → AGA)
  • deletion: at a certain location one existing
    nucleotide is deleted (e.g. ACTG → AC-G)
  • indel: an insertion or a deletion

18
  • 3.4 Sequence alignment: global and local
  • Find the similarity between two (or
    more) DNA-sequences by finding a good
    alignment between them.

19
The biological problem of sequence alignment
DNA-sequence-1  tcctctgcctctgccatcat---caaccccaaagt
DNA-sequence-2  tcctgtgcatctgcaatcatgggcaaccccaaagt
20
Sequence alignment - definition
Sequence alignment is an arrangement of two or
more sequences, highlighting their similarity.
The sequences are padded with gaps (dashes) so
that wherever possible, columns contain identical
characters from the sequences involved.
tcctctgcctctgccatcat---caaccccaaagt
tcctgtgcatctgcaatcatgggcaaccccaaagt
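To make the definition concrete, here is a minimal Python sketch that marks the identical columns of the alignment shown on this slide. The helper name `match_line` is hypothetical and introduced only for illustration; it assumes both rows have already been padded to the same length.

```python
def match_line(row1: str, row2: str) -> str:
    """Return '|' under every column holding two identical characters, ' ' elsewhere."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(row1, row2)
    )


# The two padded rows from the slide (gaps are dashes).
seq1 = "tcctctgcctctgccatcat---caaccccaaagt"
seq2 = "tcctgtgcatctgcaatcatgggcaaccccaaagt"

markers = match_line(seq1, seq2)
print(seq1)
print(markers)
print(seq2)
print("identical columns:", markers.count("|"), "of", len(seq1))
```

Columns left unmarked are either mismatches or gap positions.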
21
Algorithms
Needleman-Wunsch: pairwise global alignment only.
Smith-Waterman: pairwise local (or global)
alignment.
BLAST: pairwise heuristic local alignment.
22
Pairwise alignment
Pairwise sequence alignment methods are concerned
with finding the best-matching piecewise local or
global alignments of protein (amino acid) or DNA
(nucleic acid) sequences. Typically, the purpose
of this is to find homologues (relatives) of a
gene or gene product in a database of known
examples. This information is useful for
answering a variety of biological questions:
1. The identification of sequences of unknown
   structure or function.
2. The study of molecular evolution.
23
Global alignment
A global alignment between two sequences is an
alignment in which all the characters in both
sequences participate in the alignment. Global
alignments are useful mostly for finding
closely related sequences. As these sequences
are also easily identified by local alignment
methods, global alignment is now somewhat
deprecated as a technique. Further, there are
several complications in molecular evolution
(such as domain shuffling) that limit the
usefulness of global methods.
24
Global Alignment
Find the global best fit between two
sequences. Example: the sequences
s = VIVALASVEGAS and t = VIVADAVIS align
like A(s,t):
V I V A L A S V E G A S
V I V A D A - V - - I S
25
The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm (1970, J Mol Biol.
48(3):443-53) performs a global alignment on two
sequences (s and t) and is applied to align
protein or nucleotide sequences. The
Needleman-Wunsch algorithm is an example of
dynamic programming, and is guaranteed to find
the alignment with the maximum score.
26
The Needleman-Wunsch algorithm
Of course this works for DNA sequences as well as
for protein sequences.
27
Alignment scoring function
The cost of aligning two symbols $x_i$ and $y_j$ is
given by the scoring function $s(x_i, y_j)$.
28
Alignment cost
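The cost of the entire alignment A is the sum of
the scores of its columns:
$M(A) = \sum_{k} s(x_k, y_k)$, where $x_k$ and $y_k$
are the symbols (or gaps) in column $k$.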
29
A simple scoring function
$s(-,a) = s(a,-) = -1$
$s(a,b) = -1$ if $a \neq b$
$s(a,b) = 1$ if $a = b$
30
The substitution matrix
A more realistic scoring function is given by the
biologically inspired substitution matrix

  -    A    G    C    T
  A   10   -1   -3   -4
  G   -1    7   -5   -3
  C   -3   -5    9    0
  T   -4   -3    0    8

Examples: PAM (Point Accepted Mutation)
(Margaret Dayhoff), BLOSUM (BLOcks SUbstitution
Matrix) (Henikoff and Henikoff)
31
Scoring function
The cost for aligning the two sequences
s = VIVALASVEGAS and t = VIVADAVIS with the
alignment A(s,t) under the simple scoring
function is
M(A) = (7 matches) - (2 mismatches) - (3 gaps)
     = 7 - 2 - 3 = 2
V I V A L A S V E G A S
V I V A D A - V - - I S
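The count above can be checked with a short Python sketch. The function names `simple_score` and `alignment_score` are hypothetical and chosen for illustration; the scoring values (+1 match, -1 mismatch, -1 gap) are the ones from the simple scoring function slide.

```python
def simple_score(a: str, b: str) -> int:
    """Simple scoring function: +1 for a match, -1 for a mismatch or a gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def alignment_score(row_s: str, row_t: str, score=simple_score) -> int:
    """Sum the column scores of an alignment given as two padded rows."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    return sum(score(a, b) for a, b in zip(row_s, row_t))


# The alignment A(s,t) from the slide, written without spaces.
row_s = "VIVALASVEGAS"
row_t = "VIVADA-V--IS"
print(alignment_score(row_s, row_t))  # 7 - 2 - 3 = 2
```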
32
Optimal global alignment
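The optimal global alignment A* between two
sequences s and t is the alignment A(s,t) that
maximizes the total alignment score M(A) over all
possible alignments: $A^* = \arg\max_A M(A)$.
Finding the optimal alignment A* looks like a
combinatorial optimization problem:
i. generate all possible alignments
ii. compute the score M of each alignment
iii. select the alignment A* with the maximum
score M
Because the number of possible alignments grows
exponentially with the sequence lengths, this
enumeration is infeasible for all but very short
sequences.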
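To get a feeling for how quickly step i becomes impractical, the following Python sketch counts the possible global alignments of two sequences of lengths m and n. It assumes that every column is a match/mismatch, a gap in s, or a gap in t, and that alignments differing only in the order of adjacent gaps count as distinct. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of global alignments of sequences with lengths m and n.

    The last column is either (symbol, symbol), (symbol, gap) or (gap, symbol),
    which gives a three-way recursion on the remaining prefix lengths.
    """
    if m == 0 or n == 0:
        return 1  # only one way: align the rest against gaps
    return (
        count_alignments(m - 1, n - 1)
        + count_alignments(m - 1, n)
        + count_alignments(m, n - 1)
    )


for length in (1, 2, 3, 5, 10, 20):
    print(length, count_alignments(length, length))
```

Even for two sequences of length 20 the count is far too large to enumerate, which is why dynamic programming is used instead.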
33
Local alignment
Local alignment methods find related regions
within sequences; these regions can consist of a
subset of the characters within each sequence. For
example, positions 20-40 of sequence A might be
aligned with positions 50-70 of sequence
B. This is a more flexible technique than global
alignment and has the advantage that related
regions which appear in a different order in the
two proteins (which is known as domain shuffling)
can be identified as being related. This is not
possible with global alignment methods.
34
The Smith-Waterman algorithm
The Smith-Waterman algorithm (1981) determines
similar regions between two nucleotide or protein
sequences. Smith-Waterman is also a dynamic
programming algorithm and is a variation of
Needleman-Wunsch for local alignment. As such, it
has the desirable property that it is guaranteed
to find the optimal local alignment with respect
to the scoring system being used (which includes
the substitution matrix and the gap-scoring
scheme). However, the Smith-Waterman algorithm is
demanding of time and memory resources: in order
to align two sequences of lengths m and n, O(mn)
time and space are required. As a result, it
has largely been replaced in practical use by the
BLAST algorithm; although not guaranteed to find
optimal alignments, BLAST is much more efficient.
35
Optimal local alignment
The optimal local alignment A* between two
sequences s and t is the optimal global alignment
$A(s(i_1 : i_2), t(j_1 : j_2))$ of the subsequences
$s(i_1 : i_2)$ and $t(j_1 : j_2)$ for some optimal
choice of $i_1$, $i_2$, $j_1$ and $j_2$.
36
Sequence alignment - meaning
Sequence alignment is used to study the evolution
of sequences, such as protein sequences or DNA
sequences, from a common ancestor. Mismatches
in the alignment correspond to mutations, and
gaps correspond to insertions or deletions.
Sequence alignment also refers to the process
of constructing significant alignments in a
database of potentially unrelated sequences.
37
  • 3.5 Statistical analysis of alignments
  • This works identically to the statistical
    analysis in gene finding
  • Generate randomized sequences based on the
    second sequence
  • Determine the optimal alignments of the first
    sequence with these randomized sequences
  • Compute a histogram of the resulting scores
    and rank the observed score in this histogram
  • The relative position defines the p-value.
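The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the randomized sequences are obtained by shuffling the second sequence, the score is the optimal global score under the simple +1/-1/-1 scheme, and the names `global_score` and `permutation_p_value` are hypothetical.

```python
import random


def global_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score (dynamic programming, score only)."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def permutation_p_value(s: str, t: str, n_random: int = 200, seed: int = 0):
    """Rank the observed score among scores against shuffled copies of t."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = global_score(s, t)
    letters = list(t)
    at_least_as_good = 0
    for _ in range(n_random):
        rng.shuffle(letters)
        if global_score(s, "".join(letters)) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed sequence itself, so p is never exactly 0.
    return observed, (at_least_as_good + 1) / (n_random + 1)


observed, p = permutation_p_value("VIVALASVEGAS", "VIVADAVIS")
print("observed score:", observed, "empirical p-value:", p)
```

A small p-value means that few shuffled sequences align to the first sequence as well as the real one does.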

38
Introduction to Bioinformatics, LECTURE 3:
SEQUENCE ALIGNMENT, statistical analysis
39
  • 3.6 BLAST: fast approximate alignment
  • Fast but heuristic
  • Most widely used algorithm in bioinformatics
  • Has become a verb: to blast

40
(No Transcript)
41
(No Transcript)
42
  • 3.7 Multiple sequence alignment
  • Determine the best alignment between multiple
    (more than two) DNA-sequences.

43
Multiple alignment
Multiple alignment is an extension of pairwise
alignment to incorporate more than two sequences
into an alignment. Multiple alignment methods
try to align all of the sequences in a specified
set. The most popular multiple alignment tool
is CLUSTAL. Multiple sequence alignment is
computationally difficult and is classified as an
NP-hard problem.
44
Multiple alignment
45
  • 3.8 Computing the alignments
  • NW and SW are both based on Dynamic Programming
    (DP)
  • A recursive relation breaks the computation
    down into smaller subproblems

46
Dynamic Programming Approach to Sequence
Alignment
The dynamic programming approach to sequence
alignment builds each result on the best prior
results computed so far. It tries to align two
sequences by inserting gaps at different
locations so as to maximize the score of the
alignment. The score is determined by the "match
award", "mismatch penalty" and "gap penalty". The
higher the score, the better the alignment. If
both penalties are set to 0, it always finds an
alignment with the maximum number of matches.
Maximum match: the largest number of matches
that one sequence can have with another when all
possible deletions are allowed. The approach is
used to compare the similarity between two
sequences of DNA or protein, to predict
similarity of their functions.
Examples: Needleman-Wunsch (1970),
Sellers (1974), Smith-Waterman (1981)
47
The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm (1970, J Mol Biol.
48(3):443-53) performs a global alignment on two
sequences (A and B) and is applied to align
protein or nucleotide sequences. The
Needleman-Wunsch algorithm is an example of
dynamic programming, and is guaranteed to find
the alignment with the maximum score. Scores
for aligned characters are specified by the
substitution matrix: s(i,j) is the similarity of
characters i and j.
48
The Needleman-Wunsch algorithm
For example, if the substitution matrix was

  -    A    G    C    T
  A   10   -1   -3   -4
  G   -1    7   -5   -3
  C   -3   -5    9    0
  T   -4   -3    0    8

then the alignment

AGACTAGTTAC
CGA---GACGT

with a gap penalty of -5, would have the following
score...
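The score itself is not shown in this transcript. As a sketch, it can be computed column by column from the matrix and the gap penalty given on the slide; the names `SUBSTITUTION`, `GAP` and `matrix_score` are hypothetical.

```python
# Substitution matrix from the slide, as a nested dictionary.
SUBSTITUTION = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
GAP = -5  # gap penalty from the slide


def matrix_score(row_a: str, row_b: str, gap: int = GAP) -> int:
    """Score an alignment: substitution score per column, gap penalty per gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        total += gap if "-" in (a, b) else SUBSTITUTION[a][b]
    return total


# Columns: A/C, G/G, A/A, three gaps, G/G, T/A, T/C, A/G, C/T
# -3 + 7 + 10 - 15 + 7 - 4 + 0 - 1 + 0 = 1
print(matrix_score("AGACTAGTTAC", "CGA---GACGT"))  # 1
```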
49
The Needleman-Wunsch algorithm
  • Create a table of size (m+1)x(n+1) for sequences
    s and t of lengths m and n,
  • Fill the first column and the first row of the
    table with their initial values (cumulative
    gap penalties)
  • Starting from the top left, compute each entry
    using the recursive relation
  • Perform the trace-back procedure from the
    bottom-right corner
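The recursive relation referred to in these steps is not reproduced in this transcript. A common textbook form, given here as an assumption rather than as the slide's own notation, uses a linear gap penalty $d$ (a negative number), writes $x_i$ and $y_j$ for the i-th symbol of s and the j-th symbol of t, and lets $F(i,j)$ be the best score for aligning the first $i$ symbols of s with the first $j$ symbols of t:

```latex
\begin{aligned}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d, \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(align } x_i \text{ with } y_j\text{)} \\
F(i-1,\,j) + d & \text{(align } x_i \text{ with a gap)} \\
F(i,\,j-1) + d & \text{(align } y_j \text{ with a gap)}
\end{cases}
\end{aligned}
```

The optimal global score is then $F(m,n)$, the bottom-right entry of the table.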

50
The Needleman-Wunsch algorithm
Once the F matrix is computed, note that the
bottom right hand corner of the matrix is the
maximum score for any alignment. To compute
which alignment actually gives this score, you
can start from the bottom right cell, and compare
the value with the three possible sources
(Choice1, the diagonal neighbour; Choice2, the
neighbour above; Choice3, the neighbour to the
left) to see which it came from. If it was
Choice1, then A(i) and B(j) are aligned, if it
was Choice2 then A(i) is aligned with a gap, and
if it was Choice3, then B(j) is aligned with a
gap.
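The table fill and the trace-back can be put together in a minimal Python sketch. It assumes the simple scoring scheme (+1 match, -1 mismatch, -1 gap) and a linear gap penalty; the function name `needleman_wunsch` and its parameters are hypothetical, and when several alignments are optimal it returns just one of them.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, padded row for a, padded row for b) of one optimal global alignment."""
    m, n = len(a), len(b)

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(i, j),  # Choice1: diagonal
                F[i - 1][j] + gap,            # Choice2: from above
                F[i][j - 1] + gap,            # Choice3: from the left
            )

    # Trace back from the bottom-right cell to the top-left cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return F[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, row_s, row_t = needleman_wunsch("VIVALASVEGAS", "VIVADAVIS")
print(score)
print(row_s)
print(row_t)
```

Storing the full table costs O(mn) memory, which is what makes the trace-back possible.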
51
The Needleman-Wunsch algorithm
52
(No Transcript)
53
The Smith-Waterman algorithm
  • Create a table of size (m+1)x(n+1) for sequences
    s and t of lengths m and n,
  • Fill the first row and the first column of the
    table with zeros.
  • Starting from the top left, compute each entry
    using the recursive relation
  • Perform the trace-back procedure from the maximum
    element in the table to the first zero element on
    the trace-back path.
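These steps can be sketched in Python as follows. The recursive relation is not shown in this transcript, so the sketch assumes the usual form: the same three choices as in Needleman-Wunsch plus a fourth option of 0, which lets a local alignment start anywhere. The scoring values (+1, -1, -1) and the function name `smith_waterman` are assumptions for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, padded row for a, padded row for b) of one optimal local alignment."""
    m, n = len(a), len(b)

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # First row and first column stay zero.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(
                0,                            # start a new local alignment here
                H[i - 1][j - 1] + sub(i, j),  # diagonal
                H[i - 1][j] + gap,            # from above
                H[i][j - 1] + gap,            # from the left
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum element until a zero element is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# A shared region embedded in otherwise unrelated flanking symbols.
result = smith_waterman("XXXGATTACAYYY", "ZZGATTACAWW")
print(result)
```

Unlike the global case, the flanking symbols are simply left out of the reported alignment.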

54
  • EXAMPLE: Eyeless gene homeobox
  • Compare the gene eyeless of Drosophila
    melanogaster with the human gene aniridia. They
    are master regulatory genes producing proteins
    that control a large cascade of other genes.
    Certain segments of the genes eyeless of
    Drosophila melanogaster and human aniridia are
    almost identical. The most important of these
    segments encodes the PAX (paired-box) domain, a
    sequence of 128 amino acids whose function is to
    bind specific sequences of DNA. Another common
    segment is the HOX (homeobox) domain, which is
    thought to be part of more than 0.2% of the
    total number of vertebrate genes.

55
Introduction to Bioinformatics, LECTURE 3:
SEQUENCE ALIGNMENT
56
Introduction to Bioinformatics, LECTURE 3: GLOBAL
ALIGNMENT
57
Introduction to Bioinformatics, LECTURE 3: GLOBAL
ALIGNMENT
58
Introduction to Bioinformatics, LECTURE 3:
SEQUENCE ALIGNMENT
59
END of LECTURE 3
60
Introduction to Bioinformatics, LECTURE 3:
SEQUENCE ALIGNMENT
61
(No Transcript)