Alignments and Comparative Genomics - PowerPoint PPT Presentation

Title:

Alignments and Comparative Genomics

Description:

Biology in One Slide Twentieth Century ...ACGTGACTGAGGACCGTG. CGACTGAGACTGACTGGGT ... Construct a dictionary of all the words in the query ... – PowerPoint PPT presentation

Slides: 59
Provided by: Sera4

Transcript and Presenter's Notes

Title: Alignments and Comparative Genomics


1
Alignments and Comparative Genomics
2
Welcome to CS374!
  • Today
  • Serafim: Alignments and Comparative Genomics
  • Omkar: Administrivia

3
Biology in One Slide Twentieth Century
and today
ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGAC
TACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG
ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
4
Complete DNA Sequences
Nearly 200 complete genomes have been sequenced
5
Evolution
6
Evolution at the DNA level
Deletion
Mutation
ACGGTGCAGTTACCA
SEQUENCE EDITS
AC----CAGTCCACCA
REARRANGEMENTS
Inversion
Translocation
Duplication
7
Evolutionary Rates
[Figure: next generation; outcomes labeled OK, OK, OK, X, X, Still OK?]
8
Sequence conservation implies function
  • Alignment is the key to
  • Finding important regions
  • Determining function
  • Uncovering the evolutionary forces

9
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings x = x1x2...xM, y = y1y2...yN,
an alignment is an assignment of gaps to positions 0,…, M in x,
and 0,…, N in y, so as to line up each letter in one sequence
with either a letter, or a gap in the other sequence
10
What is a good alignment?
  • Alignment
  • The best way to match the letters of one
    sequence with those of the other
  • How do we define best?
  • Alignment
  • A hypothesis that the two sequences come from a
    common ancestor through sequence edits
  • Parsimonious explanation
  • Find the minimum number of edits that transform
    one sequence into the other

11
Scoring Function
  • Sequence edits: AGGCCTC
  • Mutations
  • AGGACTC
  • Insertions
  • AGGGCCTC
  • Deletions
  • AGG.CTC
  • Scoring Function
  • Match: m
  • Mismatch: -s
  • Gap: -d
  • Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
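
The scoring function above can be sketched in a few lines. This is a minimal illustration, not code from the slides: the function name `alignment_score` and the default values m = s = d = 1 are assumptions chosen for the example.

```python
def alignment_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Score two gapped strings column by column.

    Each match adds m, each mismatch subtracts s, each gap subtracts d.
    The defaults m = s = d = 1 are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for ca, cb in zip(a, b):
        if ca == "-" and cb == "-":
            raise ValueError("a column cannot pair two gaps")
        if ca == "-" or cb == "-":
            score -= d          # gap column
        elif ca == cb:
            score += m          # match column
        else:
            score -= s          # mismatch column
    return score


# One mutation between AGGCCTC and AGGACTC: 6 matches, 1 mismatch.
print(alignment_score("AGGCCTC", "AGGACTC"))
```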

12
How do we compute the best alignment?
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Too many possible alignments: O(2^(M+N))
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
13
Dynamic Programming
  • Given two sequences x = x1…xM and y = y1…yN
  • Let F(i, j) = score of best alignment of x1…xi
    to y1…yj
  • Then, F(M, N) = score of best alignment
  • Idea
  • Compute F(i, j) for all i and j
  • Do this by using F(i-1, j), F(i, j-1), F(i-1, j-1)

14
Dynamic Programming (contd)
  • Notice three possible cases
  • 1. xi aligns to yj
  •    x1…xi-1 xi
  •    y1…yj-1 yj
  •    F(i, j) = F(i-1, j-1) + m, if xi = yj; F(i-1, j-1) - s, if not
  • 2. xi aligns to a gap
  •    x1…xi-1 xi
  •    y1…yj -
  •    F(i, j) = F(i-1, j) - d
  • 3. yj aligns to a gap
  •    x1…xi -
  •    y1…yj-1 yj
  •    F(i, j) = F(i, j-1) - d
15
Dynamic Programming (contd)
  • How do we know which case is correct?
  • Inductive assumption
  • F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
  • Then,
  • F(i, j) = max { F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d }
  • Where s(xi, yj) = m, if xi = yj; -s, if not

[Figure: cell (i, j) computed from cells (i-1, j-1), (i-1, j), and (i, j-1)]
16
Example
  • x = AGTA, m = 1
  • y = ATA, s = 1 (each mismatch scores -1)
  • d = 1 (each gap scores -1)

[Table: F(i, j) for i = 0..4 and j = 0..3]
Optimal alignment: F(4, 3) = 2
AGTA
A-TA
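
The matrix for this example can be filled in mechanically from the recurrence. The sketch below is an illustration rather than code from the slides: `fill_matrix` is a hypothetical helper, and it treats the mismatch and gap penalties as magnitudes of 1, so each mismatch or gap contributes -1 to the score.

```python
def fill_matrix(x: str, y: str, m: int = 1, s: int = 1, d: int = 1):
    """Return the table F where F[i][j] scores the best alignment
    of x[:i] with y[:j] (match +m, mismatch -s, gap -d)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                      # x[:i] against all gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d                      # y[:j] against all gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,   # xi aligned to yj
                          F[i - 1][j] - d,         # xi aligned to a gap
                          F[i][j - 1] - d)         # yj aligned to a gap
    return F


F = fill_matrix("AGTA", "ATA")
for j in range(4):                            # one printed row per j
    print([F[i][j] for i in range(5)])
print("F(4, 3) =", F[4][3])
```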
17
The Needleman-Wunsch Matrix
[Figure: matrix with x1 … xM along one axis and y1 … yN along the other]
Every nondecreasing path from (0,0) to (M, N)
corresponds to an alignment of the two
sequences
18
The Needleman-Wunsch Algorithm
  • Initialization.
  • F(0, 0) = 0
  • F(0, j) = -j × d
  • F(i, 0) = -i × d
  • Main Iteration. Filling-in partial alignments
  • For each i = 1…M
  • For each j = 1…N
  • F(i, j) = max { F(i-1, j) - d [case 1], F(i, j-1) - d [case 2],
    F(i-1, j-1) + s(xi, yj) [case 3] }
  • Ptr(i, j) = UP if [case 1], LEFT if [case 2], DIAG if [case 3]
  • Termination. F(M, N) is the optimal score, and
    from Ptr(M, N) can trace back optimal alignment
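
A compact sketch of the three steps (initialization, main iteration, termination with traceback) might look as follows. The function name, the default scores m = s = d = 1, and the tie-breaking order (DIAG preferred, then UP, then LEFT) are assumptions made for this illustration; the slides do not specify how ties are resolved.

```python
def needleman_wunsch(x: str, y: str, m: int = 1, s: int = 1, d: int = 1):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # Initialization: first row and column are all-gap prefixes.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # Main iteration: keep the best of the three cases and remember it.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [(F[i - 1][j - 1] + sub, "DIAG"),
                          (F[i - 1][j] - d, "UP"),
                          (F[i][j - 1] - d, "LEFT")]
            # max() keeps the first maximal entry, so DIAG wins ties.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Termination: follow pointers back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:  # LEFT
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("AGTA", "ATA")
print(score)
print(ax)
print(ay)
```

Both the table and the pointer matrix hold (M+1) × (N+1) entries, which is where the O(NM) time and space come from.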

19
Performance
  • Time
  • O(NM)
  • Space
  • O(NM)

20
Alignment on a Large Scale
  • Given a newly sequenced organism,
  • Which subregions align with other organisms?
  • Potential genes
  • Other biological characteristics
  • Assume we use Dynamic Programming

Our newly sequenced mammal: 3×10^9
The entire genomic database: 10^10 - 10^11
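
To see why full dynamic programming is impractical at this scale, a rough back-of-the-envelope estimate helps. The rate of 10^9 matrix cells per second below is purely an assumption for illustration, and `dp_years` is a hypothetical helper.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def dp_years(query_len: int, db_len: int, cells_per_second: float) -> float:
    """Years needed to fill a query_len x db_len DP matrix at a given rate."""
    cells = query_len * db_len
    return cells / cells_per_second / SECONDS_PER_YEAR


mammal = 3 * 10**9                      # letters in the new genome
for db in (10**10, 10**11):             # database size range from the slide
    years = dp_years(mammal, db, 1e9)   # assumed rate: 1e9 cells per second
    print(f"{mammal * db:.1e} cells, roughly {years:,.0f} years")
```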
21
Index-based Local Alignment
  • Main idea
  • Construct a dictionary of all the words in the
    query
  • Initiate a local alignment for each word match
    between query and DB
  • Running Time
  • Theoretical worst case O(MN)
  • Fast in practice
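
The dictionary idea can be sketched as follows: index every length-k word of the query, then scan the database once and report each position where a word matches. The names `build_index` and `find_seeds` are hypothetical, and this sketch only finds exact word matches; it does not perform the local alignment that each match would initiate.

```python
from collections import defaultdict


def build_index(query: str, k: int) -> dict:
    """Map each length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query: str, db: str, k: int) -> list:
    """Return (query_pos, db_pos) for every exact k-word match."""
    index = build_index(query, k)
    seeds = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], ()):
            seeds.append((i, j))
    return seeds


print(find_seeds("ACGT", "TTACGA", 3))
```

The scan touches each database position once, so the cost is dominated by the number of word matches rather than by the full M × N matrix.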

[Figure: query and DB]
22
Index-based Local Alignment BLAST
  • Dictionary
  • All words of length k (11)
  • Alignment initiated between exact-matching words
  • (more generally, between words of alignment
    score ≥ T)
  • Alignment
  • Ungapped extensions until score falls
    below statistical threshold
  • Output
  • All local alignments with score
    > statistical threshold

[Figure: query scanned against DB]
23
Index-based Local Alignment BLAST
A C G A A G T A A G G T C
C A G T
Example: k = 4, T = 4. The matching word GGTC
initiates an alignment. Extension to the left and
right with no gaps until alignment falls < 50%.
Output:
GTAAGGTCC
GTTAGGTCC
C C C T T C C T G G A T T
G C G A
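
Ungapped extension from a seed word can be sketched as below. Note the stopping rule is a swapped-in assumption: this sketch stops when the running score drops more than `x_drop` below the best score seen (a common "X-drop" style rule), not the threshold described on the slide. The function name, the scores, and the two flanked test strings are hypothetical; the strings were chosen so that the extended region reproduces the slide's output pair.

```python
def extend_ungapped(query, db, qi, dj, k, m=1, s=1, x_drop=2):
    """Extend the seed query[qi:qi+k] == db[dj:dj+k] in both directions
    without gaps. Returns (query_start, db_start, length, score) of the
    best-scoring extension found."""
    score = best = k * m
    # Extend to the right of the seed.
    best_right = step = 0
    while qi + k + step < len(query) and dj + k + step < len(db):
        score += m if query[qi + k + step] == db[dj + k + step] else -s
        step += 1
        if score > best:
            best, best_right = score, step
        elif best - score > x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    score = best
    best_left = step = 0
    while qi - 1 - step >= 0 and dj - 1 - step >= 0:
        score += m if query[qi - 1 - step] == db[dj - 1 - step] else -s
        step += 1
        if score > best:
            best, best_left = score, step
        elif best - score > x_drop:
            break
    return qi - best_left, dj - best_left, best_left + k + best_right, best


query = "ACGTAAGGTCCAG"   # hypothetical flanks around GTAAGGTCC
db = "TTGTTAGGTCCGA"      # hypothetical flanks around GTTAGGTCC
q0, d0, length, score = extend_ungapped(query, db, 6, 6, 4)
print(query[q0:q0 + length])
print(db[d0:d0 + length])
```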
24
Gapped BLAST
A C G A A G T A A G G T C
C A G T
  • Added features
  • Pairs of words can initiate alignment
  • Nearby alignments are merged
  • Extensions with gaps until score < T below best
    score so far
  • Output
  • GTAAGGTCCAGT
  • GTTAGGTC-AGT

C T G A T C C T G G A T T
G C G A
25
Example
  • Query: gattacaccccgattacaccccgattaca (29 letters),
    2 mins
  • Database: All GenBank+EMBL+DDBJ+PDB sequences
    (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
    sequences); 1,726,556 sequences; 8,074,398,388
    total letters
  • >gi|28570323|gb|AC108906.9| Oryza sativa
    chromosome 3 BAC OSJNBa0087C10 genomic sequence,
    complete sequence; Length = 144487
  • Score = 34.2 bits (17), Expect = 4.5,
    Identities = 20/21 (95%), Strand = Plus / Plus
  • Query: 4 tacaccccgattacaccccga 24
  • Sbjct: 125138 tacacccagattacaccccga 125158
  • Score = 34.2 bits (17), Expect = 4.5,
    Identities = 20/21 (95%), Strand = Plus / Plus
  • Query: 4 tacaccccgattacaccccga 24
  • Sbjct: 125104 tacacccagattacaccccga 125124
  • >gi|28173089|gb|AC104321.7| Oryza sativa
    chromosome 3 BAC OSJNBa0052F07 genomic sequence,
    complete sequence; Length = 139823
  • Score = 34.2 bits (17), Expect = 4.5,
    Identities = 20/21 (95%), Strand = Plus / Plus

26
Efficient global alignment
27
Global alignment with the chaining approach
  • Find local alignments
  • Chain them into a rough global map
  • Align regions in-between
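
The chaining step can be illustrated with a simple quadratic dynamic program over local alignments: pick the highest-scoring subset that is increasing in both sequences. This is a sketch under stated assumptions, not LAGAN's actual implementation: the tuple layout `(x_start, x_end, y_start, y_end, score)` with half-open intervals and the function name `chain_anchors` are hypothetical.

```python
def chain_anchors(anchors):
    """Return (total_score, chain) for the best chain of local alignments.

    Each anchor is (x_start, x_end, y_start, y_end, score), half-open.
    Anchor a may precede anchor b only if a ends before b starts
    in both sequences.
    """
    order = sorted(anchors)                  # by x_start
    n = len(order)
    if n == 0:
        return 0, []
    best = [a[4] for a in order]             # best chain score ending at i
    prev = [-1] * n                          # predecessor in that chain
    for i in range(n):
        for j in range(i):
            fits = order[j][1] <= order[i][0] and order[j][3] <= order[i][2]
            if fits and best[j] + order[i][4] > best[i]:
                best[i] = best[j] + order[i][4]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    total = best[end]
    chain = []
    while end != -1:                         # walk predecessors backwards
        chain.append(order[end])
        end = prev[end]
    chain.reverse()
    return total, chain


anchors = [(0, 10, 0, 10, 10), (12, 20, 50, 58, 8),
           (15, 25, 12, 22, 10), (30, 40, 25, 35, 10)]
total, chain = chain_anchors(anchors)
print(total, chain)
```

The regions between consecutive anchors of the chain are the ones left for the final alignment step.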

28
LAGAN 1. FIND Local Alignments
  • Find Local Alignments
  • Chain Local Alignments
  • Restricted DP

Mike Brudno, Chuong B Do, et al.
29
LAGAN 2. CHAIN Local Alignments
  • Find Local Alignments
  • Chain Local Alignments
  • Restricted DP

30
LAGAN 3. Restricted DP
  • Find Local Alignments
  • Chain Local Alignments
  • Restricted DP

31
Restricted DP (contd)
  • What if a box is too large?
  • Recursive application of LAGAN,
  • more sensitive word search

32
Multiple Alignment
34
Scoring Function Sum Of Pairs
  • Definition: Induced pairwise alignment
  • A pairwise alignment induced by the multiple
    alignment
  • Example
  • x: AC-GCGG-C
  • y: AC-GC-GAG
  • z: GCCGC-GAG
  • Induces
  • x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  • y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

35
Sum Of Pairs (contd)
  • The sum-of-pairs score of an alignment is the
    sum of the scores of all induced pairwise
    alignments
  • S(m) = Σ_{k<l} s(m^k, m^l)
  • s(m^k, m^l) = score of induced alignment (k, l)
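
As an illustration, the sum-of-pairs score can be computed by inducing each pairwise alignment (dropping columns where both rows have a gap) and adding up the pairwise scores. The helper names and the pairwise scoring values m = s = d = 1 are assumptions for this sketch; the slides do not fix them.

```python
from itertools import combinations


def induced_pair(a: str, b: str):
    """Drop columns where both rows are gaps."""
    cols = [(ca, cb) for ca, cb in zip(a, b) if not (ca == "-" and cb == "-")]
    return "".join(c[0] for c in cols), "".join(c[1] for c in cols)


def pair_score(a: str, b: str, m: int = 1, s: int = 1, d: int = 1) -> int:
    """Match +m, mismatch -s, gap -d (illustrative values)."""
    score = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            score -= d
        elif ca == cb:
            score += m
        else:
            score -= s
    return score


def sum_of_pairs(rows) -> int:
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b))
               for a, b in combinations(rows, 2))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z from the example
print(induced_pair(rows[0], rows[1]))
print(sum_of_pairs(rows))
```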

36
Dynamic Programming for Multiple Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
37
Progressive Alignment
  • Multiple Alignment is NP-complete
  • Most used heuristic Progressive Alignment
  • Algorithm
  • Until all sequences are aligned
  • Align two (multi-)sequences to each other, and
    treat the result as a new sequence
  • Example aligning AACTGTA with AATGTC, gives
  • AACTGTA
  • AA-TGTC, with letters (AA), (AA), (C-), (TT),
    (GG), (TT), (AC)
  • Running time: O(N L^2), where N = # seqs, L = length
    of a seq
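
Treating an alignment as a new sequence amounts to viewing it column by column, so that each column becomes one "letter". A minimal sketch, with `to_columns` as a hypothetical helper name:

```python
def to_columns(rows):
    """Turn aligned rows of equal length into a list of column 'letters'."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return ["".join(col) for col in zip(*rows)]


# The pairwise alignment from the example becomes a 7-letter multi-sequence.
print(to_columns(["AACTGTA", "AA-TGTC"]))
```

A later alignment step would then score a column against a letter (or another column) instead of a letter against a letter.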

38
MLAGAN Progressive Alignment
Human
Baboon
Mouse
Rat
  • Given N sequences, phylogenetic tree
  • Align pairwise, in order of the tree (LAGAN)
  • With needed generalizations for multi-anchoring
    scoring edit distance

39
Evolution at the DNA level
Deletion
Mutation
ACGGTGCAGTTACCA
SEQUENCE EDITS
AC----CAGTCCACCA
REARRANGEMENTS
Inversion
Translocation
Duplication
40
Local Global Alignment
Global
Local
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
41
Glocal Alignment Problem
  • Find least cost transformation of one sequence
    into another using shuffle operations

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
  • Sequence edits
  • Inversions
  • Translocations
  • Duplications
  • Combinations of above

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
42
SLAGAN 1. Find Local Alignments
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

43
SLAGAN 2. Build Homology Map
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

44
SLAGAN 2. Build Homology Map
Chain using Sparse Dynamic Programming
  • Penalties
  • a) regular
  • b) translocation
  • c) inversion
  • d) inverted translocation

45
SLAGAN 2. Build Homology Map
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

46
SLAGAN 3. Global Alignment
  • Find Local Alignments
  • Build Rough Homology Map
  • Globally Align Consistent Parts

47
SLAGAN Example Chromosome 20
  • Human Chromosome 20 versus Mouse Chromosome 2
  • 270 Segments of conserved synteny
  • 70 Inversions

48
SLAGAN example HOX cluster
  • 10 paralogous genes
  • Conserved order in Human/Mouse/Rat

49
SLAGAN example HOX cluster
  • 10 paralogous genes
  • Conserved order in Human/Mouse/Rat

50
Whole-genome alignment with SLAGAN
  • Two-step Shuffle
  • Shuffle for large-scale synteny map
  • Shuffle each syntenic region for
    microrearrangements

51
The ENCODE Project
52
ENCODE regions shuffled
Hum/Rat
Hum/Mus
53
ENCODE regions shuffled
Hum/Mus
Hum/Rat
54
ENCODE regions shuffled
Hum/Rat
Hum/Mus
55
ENCODE regions shuffled
Hum/Mus
Hum/Rat
56
ENCODE regions shuffled
Hum/Rat
Hum/Mus
57
Constrained Elements in Alignments
58
Human-Mouse-Rat
Berkeley Genome Pipeline http://pipeline.lbl.gov
59
Human-Mouse-Rat
60
More DNA is coming