Title: GigAssembler
1 GigAssembler
2 Genome Assembly: A big picture
http://www.nature.com/scitable/content/anatomy-of-whole-genome-assembly-20429
3 GigAssembler: Preprocessing
- Decontaminating and repeat masking.
- Aligning mRNAs, ESTs, BAC ends, and paired reads against initial sequence contigs.
  - psLayout → BLAT
- Creating an input directory (folder) structure.
4 http://www.triazzle.com
The image from http://www.dangilbert.com/port_fun.html
Reference: Jones NC, Pevzner PA, Introduction to Bioinformatics Algorithms, MIT Press
5 RepBase, RepeatMasker
6 GigAssembler: Build merged sequence contigs (rafts)
7 Sequencing quality (Phred Score)
8 Sequencing quality (Phred Score)
Base-calling Error Probability
http://en.wikipedia.org/wiki/Phred_quality_score
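The relation between a Phred score and the base-calling error probability is Q = -10 · log10(P). A minimal Python sketch of the conversion in both directions follows; the function names are illustrative and do not come from the slides.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score Q for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the error probability for a few common scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score corresponds to a tenfold drop in the probability that the base call is wrong.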
9 GigAssembler: Build merged sequence contigs (rafts)
10 GigAssembler: Build merged sequence contigs (rafts)
11 GigAssembler: Build sequenced clone contigs (barges)
12 GigAssembler: Build a raft-ordering graph
13 GigAssembler: Build a raft-ordering graph
- Add information from mRNAs, ESTs, paired plasmid reads, and BAC end pairs, building a bridge.
- Different weights are given to different data types (mRNA highest).
- Conflicts with the graph as constructed so far are rejected.
- Build a sequence path through each raft.
- Fill the gaps with N:
  - 100 between rafts
  - 50,000 between bridged barges
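The gap-filling step can be sketched in Python as below. This is only an illustration of the two gap sizes listed on the slide, not GigAssembler's actual code: the function name, the constants, and the nested-list input (one list of ordered raft sequences per barge) are assumptions made for the example.

```python
RAFT_GAP = 100       # Ns inserted between adjacent rafts (from the slide)
BARGE_GAP = 50_000   # Ns inserted between bridged barges (from the slide)


def join_with_gaps(barges):
    """Join ordered raft sequences into one sequence with N-filled gaps.

    `barges` is a hypothetical input format: a list of barges, each given
    as a list of raft sequences already placed in their final order.
    """
    barge_seqs = [("N" * RAFT_GAP).join(rafts) for rafts in barges]
    return ("N" * BARGE_GAP).join(barge_seqs)


# Two rafts in the first barge, one raft in the second.
example = join_with_gaps([["ACGT", "GG"], ["TT"]])
print(len(example))
```

The fixed gap lengths act as placeholders that mark a gap of unknown size; they are not estimates of the real distance.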
14 Bellman-Ford algorithm
http://compprog.wordpress.com/2007/11/29/one-source-shortest-path-the-bellman-ford-algorithm/
15 Find the shortest path to all nodes.
Take every edge and try to relax it (N - 1 times, where N is the count of nodes).
[Figure: weighted graph on nodes A, B, C, D, E with edge weights 5, -2, 6, 8, -3, 7, -4, 7, 2, 9]
17 Find the shortest path to all nodes.
Take every edge and try to relax it (N - 1 times, where N is the count of nodes).
[Figure: initial state. A is the START node; B, C, D, and E are at Inf.]
18 Find the shortest path to all nodes.
Take every edge and try to relax it (N - 1 times, where N is the count of nodes).
[Figure: graph state. A = 0 (START), B = 6 (← A), C = Inf, D = 7 (← A), E = Inf]
19 Find the shortest path to all nodes.
Take every edge and try to relax it (N - 1 times, where N is the count of nodes).
[Figure: graph state. A = 0 (START), B = 6 (← A), C = 4 (← D), D = 7 (← A), E = 2 (← B)]
20 Find the shortest path to all nodes.
Take every edge and try to relax it (N - 1 times, where N is the count of nodes).
[Figure: graph state. A = 0 (START), B = 2 (← C), C = 4 (← D), D = 7 (← A), E = 2 (← B)]
21 Find the shortest path to all nodes.
Take every edge and try to relax it (N - 1 times, where N is the count of nodes).
[Figure: graph state. A = 0 (START), B = 2 (← C), C = 4 (← D), D = 7 (← A), E = -2 (← B)]
22 Answer: A-D-C-B-E
[Figure: final state. A = 0 (START), B = 2 (← C), C = 4 (← D), D = 7 (← A), E = -2 (← B)]
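The relaxation procedure walked through on the preceding slides can be sketched in Python as follows. Five of the edges (A→B 6, A→D 7, D→C -3, C→B -2, B→E -4) follow from the predecessor labels and the answer path on the slides. The endpoints of the other five weights (5, 8, 9, 7, 2) cannot be read from the transcript, so they are assumed here to follow a common textbook layout of this example. Function and variable names are illustrative.

```python
import math


def bellman_ford(nodes, edges, source):
    """Return (dist, pred) for single-source shortest paths from `source`."""
    dist = {v: math.inf for v in nodes}
    pred = {v: None for v in nodes}
    dist[source] = 0
    # Relax every edge up to N - 1 times, where N is the count of nodes.
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                changed = True
        if not changed:  # no update in a full pass: distances are final
            break
    # One extra pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, pred


def path_to(pred, target):
    """Follow predecessor links back from `target` to the start node."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return "-".join(reversed(path))


NODES = ["A", "B", "C", "D", "E"]
EDGES = [
    # Implied by the predecessor labels on the slides.
    ("A", "B", 6), ("A", "D", 7), ("D", "C", -3), ("C", "B", -2), ("B", "E", -4),
    # Assumed endpoints for the remaining weights (not recoverable from the slides).
    ("B", "C", 5), ("B", "D", 8), ("D", "E", 9), ("E", "C", 7), ("E", "A", 2),
]

dist, pred = bellman_ford(NODES, EDGES, "A")
print(dist)
print(path_to(pred, "E"))
```

The intermediate values depend on the order in which edges are relaxed, so a run of this sketch may not pass through the same states as the slides, but the final distances and predecessors do not depend on that order.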
23 Next-generation sequencing
24 Mardis ER, Annu. Rev. Genomics Hum. Genet., 2008
Illumina
25 Mardis ER, Annu. Rev. Genomics Hum. Genet., 2008
Illumina
26 Mardis ER, Annu. Rev. Genomics Hum. Genet., 2008
Roche/454
27 Mardis ER, Annu. Rev. Genomics Hum. Genet., 2008
SOLiD
28 An example of single-molecule DNA sequencing, from Helicos (approx. 1 billion reads/run)
Pushkarev, D., N.F. Neff, and S.R. Quake. Nat Biotechnol (2009) 27, 847-50
Harris, T.D., et al. Science (2008) 320, 106-9
29 Mapping program
Trapnell C, Salzberg SL, Nat. Biotech., 2009
30 Two strategies in mapping
Trapnell C, Salzberg SL, Nat. Biotech., 2009