Title: Some new sequencing technologies
1. Some new sequencing technologies
2. Molecular Inversion Probes
3. Single Molecule Array for Genotyping (Solexa)
4. Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm
5. Pyrosequencing
6. Pyrosequencing on a chip
Mostafa Ronaghi, Stanford Genome Technologies Center; 454 Life Sciences
7. Polony Sequencing
8. Some future directions for sequencing
- 1. Personalized genome sequencing
- Find your 1,000,000 single nucleotide polymorphisms (SNPs)
- Find your rearrangements
- Goals
- Link genome with phenotype
- Provide personalized diet and medicine
- (???) designer babies, big-brother insurance companies
- Timeline
- Inexpensive sequencing: 2010-2015
- Genotype-phenotype association: 2010-???
- Personalized drugs: 2015-???
9. Some future directions for sequencing
- 2. Environmental sequencing
- Find your flora: organisms living in your body
- External organs: skin, mucous membranes
- Gut, mouth, etc.
- Normal flora: >200 species, >trillions of individuals
- Flora-disease, flora-non-optimal health associations
- Timeline
- Inexpensive research sequencing: today
- Research associations: within next 10 years
- Personalized sequencing: 2015
- Find diversity of organisms living in different environments
- Hard to isolate
- Assembly of all organisms at once
10. Some future directions for sequencing
- 3. Organism sequencing
- Sequence a large fraction of all organisms
- Deduce ancestors
- Reconstruct ancestral genomes
- Synthesize ancestral genomes
- Clone: Jurassic Park!
- Study evolution of function
- Find functional elements within a genome
- How those evolved in different organisms
- Find how modules/machines composed of many genes evolved
11. RNA Secondary Structure
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc
12. RNA and Translation
13. RNA and Splicing
14. Hairpin Loops
[Figure labels: hairpin loops, interior loops, stems, multi-branched loop, bulge loop]
15. Tertiary Structure
[Figure: secondary structure vs. tertiary structure]
18. Modeling RNA Secondary Structure: Context-Free Grammars
19. A Context Free Grammar
- S → AB          Nonterminals: S, A, B
- A → aAc | a     Terminals: a, b, c, d
- B → bBd | b     Production rules: 5 rules
- Derivation
- Start from the S nonterminal
- Use any production rule, replacing a nonterminal with the right-hand side of one of its rules, until no more nonterminals are present
- S → AB → aAcB → ... → aaaacccB → aaaacccbBd → ... → aaaacccbbbbbdddd
- Produces all strings a^(i+1) c^i b^(j+1) d^j, for i, j ≥ 0 (a small generator sketch follows below)
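As a quick illustration of the derivation above, here is a minimal Python sketch (the helper name `derive` is mine, not from the slides) that applies A → aAc i times and B → bBd j times:

```python
# Minimal sketch: derive a string from S -> AB, A -> aAc | a, B -> bBd | b.

def derive(i, j):
    a_part = "a" * i + "a" + "c" * i   # A => aAc (i times) => a^(i+1) c^i
    b_part = "b" * j + "b" + "d" * j   # B => bBd (j times) => b^(j+1) d^j
    return a_part + b_part             # S => AB

# Matches the derivation on the slide: i = 3, j = 4
assert derive(3, 4) == "aaaa" + "ccc" + "bbbbb" + "dddd"
```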
20. Example: modeling a stem loop
- S → a W1 u
- W1 → c W2 g
- W2 → g W3 c
- W3 → g L c
- L → agugc
- What if the stem loop can have other letters in place of the ones shown?
[Figure: stem-loop with stem ACGG/UGCC and loop AG U CG]
21. Example: modeling a stem loop
- S → a W1 u | g W1 u
- W1 → c W2 g
- W2 → g W3 c | g W3 u
- W3 → g L c | a L u
- L → agucg | agccg | cugugc
- More general: any 4-long stem, 3-5-long loop (a small recognizer sketch follows below)
- S → aW1u | gW1u | gW1c | cW1g | uW1g | uW1a
- W1 → aW2u | gW2u | gW2c | cW2g | uW2g | uW2a
- W2 → aW3u | gW3u | gW3c | cW3g | uW3g | uW3a
- W3 → aLu | gLu | gLc | cLg | uLg | uLa
- L → aL1 | cL1 | gL1 | uL1
- L1 → aL2 | cL2 | gL2 | uL2
- L2 → a | c | g | u | aa | uu | aaa | uuu
[Figure: example stem-loops with stems ACGG/UGCC, GCGA/UGCU, GCGA/UGUU and loops AG U CG, AG C CG, CUG U CG]
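To make the generalized rules concrete, here is a minimal Python recognizer sketch (the names CANONICAL, L2, and is_stem_loop are mine; lowercase input is assumed). It checks a 4-pair stem and a 3-5-long loop whose tail matches the L2 rule above.

```python
# Sketch of a recognizer for the "more general" stem-loop grammar above.
CANONICAL = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}
L2 = {"a", "c", "g", "u", "aa", "uu", "aaa", "uuu"}    # rule L2 from the slide

def is_stem_loop(x):
    x = x.lower()
    if not 11 <= len(x) <= 13:                         # 4 bp stem + 3-5 loop + 4 bp stem
        return False
    stem_ok = all((x[i], x[-1 - i]) in CANONICAL for i in range(4))
    loop = x[4:len(x) - 4]
    loop_ok = all(c in "acgu" for c in loop[:2]) and loop[2:] in L2
    return stem_ok and loop_ok

assert is_stem_loop("acgg" + "gcaaa" + "ccgu")      # L -> g L1, L1 -> c L2, L2 -> aaa
assert not is_stem_loop("acgg" + "gcaaa" + "ccga")  # a/a cannot close the stem
```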
22. A parse tree: alignment of CFG to sequence
- S → a W1 u
- W1 → c W2 g
- W2 → g W3 c
- W3 → g L c
- L → agugc
[Figure: parse tree with root S and nested nonterminals W1, W2, W3, L over the sequence A C G G A G U G C C C G U]
23. Alignment scores for parses!
- We can define each rule X → s, where s is a string, to have a score.
- Example (written out as a small lookup sketch below)
- W → g W c : +3 (forms 3 hydrogen bonds)
- W → a W u : +2 (forms 2 hydrogen bonds)
- W → g W u : +1 (forms 1 hydrogen bond)
- W → x W z : -1, when (x, z) is not an a/u, g/c, g/u pair
- Questions
- How do we best align a CFG to a sequence? (DP)
- How do we set the parameters? (Stochastic CFGs)
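The per-rule scores above can be written as a small lookup table; this Python sketch (names are mine) uses the same values that reappear in the Nussinov sketch further below.

```python
# Per-rule scores from the slide: g/c = +3, a/u = +2, g/u = +1, anything else = -1.
PAIR_SCORE = {("g", "c"): 3, ("c", "g"): 3,
              ("a", "u"): 2, ("u", "a"): 2,
              ("g", "u"): 1, ("u", "g"): 1}

def rule_score(x, z):
    """Score of the rule W -> x W z."""
    return PAIR_SCORE.get((x, z), -1)
```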
24. The Nussinov Algorithm
- Let's forget CFGs for a moment
- Problem
- Find the RNA structure with the maximum (weighted) number of nested pairings
[Figure: an example fold of the sequence ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG]
25. The Nussinov Algorithm
- Given sequence X = x1...xN,
- Define DP matrix
- F(i, j) = maximum number of weighted bonds if xi...xj folds optimally
- Two cases, if i < j
- xi is paired with xj:
  F(i, j) = s(xi, xj) + F(i+1, j-1)
- xi is not paired with xj:
  F(i, j) = max_{i ≤ k < j} F(i, k) + F(k+1, j)
[Figure: the two cases on the interval i..j, with split position k]
26. The Nussinov Algorithm
- Initialization
- F(i, i-1) = 0, for i = 2 to N
- F(i, i) = 0, for i = 1 to N
- Iteration
- For l = 2 to N
- For i = 1 to N - l + 1
- j = i + l - 1
- F(i, j) = max of:
- F(i+1, j-1) + s(xi, xj)
- max_{i ≤ k < j} F(i, k) + F(k+1, j)
- Termination
- Best structure is given by F(1, N)
- (Need to trace back; refer to the Durbin book)
- (A Python sketch of this fill stage follows below)
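Here is a minimal Python sketch of the fill stage above (all names are mine; scores follow slide 23, except that non-canonical pairs are simply never paired rather than scored -1, and the traceback is omitted as the slide defers it to the Durbin book):

```python
# Sketch of the Nussinov fill stage: maximum weighted number of nested pairings.
PAIR_SCORE = {("g", "c"): 3, ("c", "g"): 3,
              ("a", "u"): 2, ("u", "a"): 2,
              ("g", "u"): 1, ("u", "g"): 1}

def nussinov(x):
    x = x.lower()
    n = len(x)
    F = [[0] * n for _ in range(n)]            # F[i][j], 0-based; F(i, i) = F(i, i-1) = 0
    for l in range(2, n + 1):                  # subsequence length
        for i in range(0, n - l + 1):
            j = i + l - 1
            # xj unpaired (k = j-1) or bifurcation at k
            best = max(F[i][k] + F[k + 1][j] for k in range(i, j))
            if (x[i], x[j]) in PAIR_SCORE:     # xi paired with xj
                best = max(best, PAIR_SCORE[(x[i], x[j])] + F[i + 1][j - 1])
            F[i][j] = best
    return F[0][n - 1]                         # best structure score, F(1, N)

print(nussinov("ACCACGCUUAAGACACCUAGC"))       # prefix of the slide's example sequence
```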
27. The Nussinov Algorithm and CFGs
- Define the following grammar, with scores:
- S → g S c : +3 | c S g : +3
      | a S u : +2 | u S a : +2
      | g S u : +1 | u S g : +1
      | S S : 0
      | a S : 0 | c S : 0 | g S : 0 | u S : 0 | ε : 0
- Note: ε is the empty string
- Then, the Nussinov algorithm finds the optimal parse of a string with this grammar
28. The Nussinov Algorithm
- Initialization
- F(i, i-1) = 0, for i = 2 to N
- F(i, i) = 0, for i = 1 to N          (S → a | c | g | u)
- Iteration
- For l = 2 to N
- For i = 1 to N - l + 1
- j = i + l - 1
- F(i, j) = max of:
- F(i+1, j-1) + s(xi, xj)              (S → a S u, etc.)
- max_{i ≤ k < j} F(i, k) + F(k+1, j)  (S → S S)
- Termination
- Best structure is given by F(1, N)
29. Stochastic Context Free Grammars
- In an analogy to HMMs, we can assign probabilities to transitions:
- Given grammar
- X1 → s11 | ... | s1n
- ...
- Xm → sm1 | ... | smn
- Can assign a probability to each rule, s.t.
- P(Xi → si1) + ... + P(Xi → sin) = 1
30. Example
- S → a S b : ½ | a : ¼ | b : ¼
- Probability distribution over all strings x:
- x = a^n b^(n+1), then P(x) = 2^(-n) × ¼ = 2^(-(n+2))
- x = a^(n+1) b^n, same
- Otherwise P(x) = 0
- (A quick numeric check follows below)
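A quick numeric check of the probabilities above (the function name is mine): deriving a^n b^(n+1) uses S → aSb n times and then S → b once.

```python
# Check: P(a^n b^(n+1)) = (1/2)^n * 1/4 = 2^-(n+2) under the toy grammar above.
def p_toy(n):
    return 0.5 ** n * 0.25       # n applications of S -> aSb, then S -> b

for n in range(5):
    assert abs(p_toy(n) - 2.0 ** -(n + 2)) < 1e-12
```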
31. Computational Problems
- Calculate an optimal alignment of a sequence and a SCFG (DECODING)
- Calculate Prob[ sequence | grammar ] (EVALUATION)
- Given a set of sequences, estimate parameters of a SCFG (LEARNING)
32. Normal Forms for CFGs
- Chomsky Normal Form:
- X → YZ
- X → a
- All productions are either to 2 nonterminals or to 1 terminal
- Theorem (technical)
- Every CFG has an equivalent one in Chomsky Normal Form
- (The grammar in normal form produces exactly the same set of strings)
33. Example of converting a CFG to C.N.F.
- S → ABC
- A → Aa | a
- B → Bb | b
- C → CAc | c
- Converting:
- S → AS
- S → BC
- A → AA | a
- B → BB | b
- C → DC | c
- C → c
- D → CA
[Figure: parse trees of an example string under the original and the converted grammar]
34. Another example
- S → ABC
- A → C | aA
- B → bB | b
- C → cCd | c
- Converting:
- S → AS
- S → BC
- A → CC | c | AA
- A → a
- B → BB | b
- B → b
- C → CC | c
- C → c
- C → CD
- D → d
35. Decoding: the CYK algorithm
- Given x = x1...xN, and a SCFG G,
- Find the most likely parse of x
- (the most likely alignment of G to x)
- Dynamic programming variable:
- γ(i, j, V) = likelihood of the most likely parse of xi...xj, rooted at nonterminal V
- Then,
- γ(1, N, S) = likelihood of the most likely parse of x by the grammar
36. The CYK algorithm (Cocke-Younger-Kasami)
- Initialization
- For i = 1 to N, any nonterminal V,
- γ(i, i, V) = log P(V → xi)
- Iteration
- For i = 1 to N - 1
- For j = i+1 to N
- For any nonterminal V,
- γ(i, j, V) = max_X max_Y max_{i ≤ k < j} γ(i, k, X) + γ(k+1, j, Y) + log P(V → XY)
- Termination
- log P(x | θ, π*) = γ(1, N, S)
- where π* is the optimal parse tree (if traced back appropriately from above)
- (A Python sketch of this recurrence follows below)
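Below is a minimal Python sketch of the CYK recurrence for an SCFG in Chomsky Normal Form. The grammar representation (dictionaries of log-probabilities) and all names are my own choices, not from the slides.

```python
# Sketch of CYK decoding for a CNF SCFG, following the recurrence above.
# emit[V][a] = log P(V -> a); binary[V] = list of (X, Y, log P(V -> X Y)).
import math

def cyk(x, emit, binary, start="S"):
    n = len(x)
    nonterms = set(emit) | set(binary)
    for rules in binary.values():
        for X, Y, _ in rules:
            nonterms |= {X, Y}
    # gamma[(i, j, V)] = log-likelihood of the best parse of x[i..j] rooted at V
    gamma = {(i, i, V): emit.get(V, {}).get(x[i], -math.inf)
             for i in range(n) for V in nonterms}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for V in nonterms:
                best = -math.inf
                for X, Y, logp in binary.get(V, []):
                    for k in range(i, j):
                        best = max(best, gamma[(i, k, X)] + gamma[(k + 1, j, Y)] + logp)
                gamma[(i, j, V)] = best
    return gamma[(0, n - 1, start)]

# Tiny usage: S -> A B (prob 1), A -> a, B -> b; best parse of "ab" has log-prob 0.0
print(cyk("ab", emit={"A": {"a": 0.0}, "B": {"b": 0.0}}, binary={"S": [("A", "B", 0.0)]}))
```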
37. A SCFG for predicting RNA structure
- S → a S | c S | g S | u S | ε
      | S a | S c | S g | S u
      | a S u | c S g | g S u | u S g | g S c | u S a
      | S S
- Adjust the probability parameters to reflect bond strength, etc.
- No distinction between non-paired bases, bulges, loops
- Can modify to model these events
- L: loop nonterminal
- H: hairpin nonterminal
- B: bulge nonterminal
- etc.
38. CYK for RNA folding
- Initialization
- γ(i, i-1) = log P(ε)
- Iteration
- For i = 1 to N
- For j = i to N
- γ(i, j) = max of:
- γ(i+1, j-1) + log P(xi S xj)
- γ(i+1, j) + log P(xi S)
- γ(i, j-1) + log P(S xj)
- max_{i < k < j} γ(i, k) + γ(k+1, j) + log P(S S)
39. Evaluation
- Recall HMMs
- Forward: fl(i) = P(x1...xi, πi = l)
- Backward: bk(i) = P(xi+1...xN | πi = k)
- Then,
- P(x) = Σ_k fk(N) ak0 = Σ_l a0l el(x1) bl(1)
- Analogue in SCFGs:
- Inside: a(i, j, V) = P(xi...xj is generated by nonterminal V)
- Outside: b(i, j, V) = P(x, excluding xi...xj, is generated by S and the excluded part is rooted at V)
40. The Inside Algorithm
- To compute
- a(i, j, V) = P(xi...xj, produced by V)
- a(i, j, V) = Σ_X Σ_Y Σ_k a(i, k, X) a(k+1, j, Y) P(V → XY)
[Figure: V spans i..j, split at k into X over i..k and Y over k+1..j]
41. Algorithm: Inside
- Initialization
- For i = 1 to N, V a nonterminal,
- a(i, i, V) = P(V → xi)
- Iteration
- For i = 1 to N-1
- For j = i+1 to N
- For V a nonterminal
- a(i, j, V) = Σ_X Σ_Y Σ_k a(i, k, X) a(k+1, j, Y) P(V → XY)
- Termination
- P(x | θ) = a(1, N, S)
- (A Python sketch follows below)
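A matching Python sketch of the Inside recursion, using the same CNF-grammar representation as the CYK sketch above but with plain probabilities (all names are mine):

```python
# Sketch of the Inside algorithm: a[(i, j, V)] = P(x[i..j] generated by V).
# emit[V][c] = P(V -> c); binary[V] = list of (X, Y, P(V -> X Y)).

def inside(x, emit, binary, start="S"):
    n = len(x)
    nonterms = set(emit) | set(binary)
    for rules in binary.values():
        for X, Y, _ in rules:
            nonterms |= {X, Y}
    a = {(i, i, V): emit.get(V, {}).get(x[i], 0.0)
         for i in range(n) for V in nonterms}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for V in nonterms:
                a[(i, j, V)] = sum(a[(i, k, X)] * a[(k + 1, j, Y)] * p
                                   for X, Y, p in binary.get(V, [])
                                   for k in range(i, j))
    return a[(0, n - 1, start)]     # P(x | grammar) = a(1, N, S)

# Tiny usage: the same toy CNF grammar as in the CYK sketch gives P("ab") = 1.0
print(inside("ab", {"A": {"a": 1.0}, "B": {"b": 1.0}}, {"S": [("A", "B", 1.0)]}))
```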
42. The Outside Algorithm
- b(i, j, V) = Prob(x1...xi-1, xj+1...xN, where the gap is rooted at V)
- Given that V is the right-hand-side nonterminal of a production,
- b(i, j, V) = Σ_X Σ_Y Σ_{k < i} a(k, i-1, X) b(k, j, Y) P(Y → XV)
[Figure: Y spans k..j, with X over k..i-1 and V over i..j]
43. Algorithm: Outside
- Initialization
- b(1, N, S) = 1
- For any other V, b(1, N, V) = 0
- Iteration
- For i = 1 to N-1
- For j = N down to i
- For V a nonterminal
- b(i, j, V) = Σ_X Σ_Y Σ_{k < i} a(k, i-1, X) b(k, j, Y) P(Y → XV)
              + Σ_X Σ_Y Σ_{k > j} a(j+1, k, X) b(i, k, Y) P(Y → VX)
- Termination
- It is true for any i that
- P(x | θ) = Σ_X b(i, i, X) P(X → xi)
44. Learning for SCFGs
- We can now estimate
- c(V) = expected number of times V is used in the parse of x1...xN
- c(V) = (1 / P(x | θ)) Σ_{1 ≤ i ≤ N} Σ_{i ≤ j ≤ N} a(i, j, V) b(i, j, V)
- c(V → XY) = (1 / P(x | θ)) Σ_{1 ≤ i ≤ N} Σ_{i < j ≤ N} Σ_{i ≤ k < j} b(i, j, V) a(i, k, X) a(k+1, j, Y) P(V → XY)
45. Learning for SCFGs
- Then, we can re-estimate the parameters with EM, by:
- Pnew(V → XY) = c(V → XY) / c(V)
- Pnew(V → a) = c(V → a) / c(V), where
- c(V → a) = Σ_{i: xi = a} b(i, i, V) P(V → a)
- c(V) = Σ_{1 ≤ i ≤ N} Σ_{i < j ≤ N} a(i, j, V) b(i, j, V)
46. Summary: SCFG and HMM algorithms
- GOAL                 HMM algorithm    SCFG algorithm
- Optimal parse        Viterbi          CYK
- Estimation           Forward          Inside
-                      Backward         Outside
- Learning             EM: Fw/Bck       EM: Ins/Outs
- Memory complexity    O(N K)           O(N^2 K)
- Time complexity      O(N K^2)         O(N^3 K^3)
- Where K = # of states in the HMM
-         = # of nonterminals in the SCFG
47. The Zuker algorithm: main ideas
- Models the energy of an RNA fold
- Instead of base pairs, considers pairs of base pairs (more accurate)
- Separate score for bulges
- Separate score for loops of different size and composition
- Separate score for interactions between a stem and the beginning of a loop
- Can also do all that with a SCFG, and train it on real data