Title: Genome Sequencing and Assembly: High-throughput Sequencing
1Genome Sequencing and Assembly: High-throughput Sequencing
- Xiaole Shirley Liu
- Jun Liu
- STAT115, STAT215
2Outline
- Genome sequencing strategy
- Clone-by-clone sequencing
- Whole-genome shotgun sequencing
- Hybrid method
- Next generation sequencing technologies
3Competing Sequencing Strategies
- Clone-by-clone and whole-genome shotgun
4Clone-by-Clone Shotgun Sequencing
- E.g. Human genome project
- Map construction
- Clone selection
- Subclone library construction
- Random shotgun phase
- Directed finishing phase and sequence
authentication
5Map Construction
- Clone genomic DNA in YACs (1MB) or BACs (200KB)
- Map the relative location of clones
- Sequence-tagged sites (STS, e.g. EST) mapping
- PCR or probe hybridization to screen STS
- Restriction site fingerprint
- Most time consuming
- 1990-98 to generate physical maps for human
http://www.ncbi.nlm.nih.gov/genemap99/
6Resolve Clone Relative Location
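- Find a column permutation of the binary hybridization matrix such that all the ones in each row fall in one consecutive block
- Rows are clones, columns are STSs
Figure: STSs 1-5 along the DNA, covered by overlapping clones 1-4
Observed matrix (STS columns in arbitrary order):
Clone \ STS   3  5  1  4  2
1             1  0  0  1  0
2             0  0  1  0  1
3             1  0  0  1  1
4             1  1  0  1  0
After column permutation (ones consecutive in every row):
Clone \ STS   1  2  3  4  5
1             0  0  1  1  0
2             1  1  0  0  0
3             0  1  1  1  0
4             0  0  1  1  1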
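The column-permutation problem on this slide can be sketched in a few lines of Python. This is a brute-force illustration, not the method used for real physical maps: it tries every column order, which is only feasible for a handful of STSs, and it assumes error-free hybridization data. The function names are made up for this example.

```python
from itertools import permutations

def ones_are_consecutive(row):
    """Return True if all 1s in the row form a single contiguous block."""
    positions = [i for i, value in enumerate(row) if value == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)

def find_sts_order(matrix, labels):
    """Brute force: try column orders until every row has its 1s in one block.

    Returns the STS labels in a consistent order, or None if no order works.
    """
    for perm in permutations(range(len(labels))):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            return [labels[j] for j in perm]
    return None

# Hybridization matrix from the slide: rows are clones 1-4,
# columns are STSs in the unordered arrangement 3, 5, 1, 4, 2.
sts_labels = [3, 5, 1, 4, 2]
hybridization = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
]

order = find_sts_order(hybridization, sts_labels)
print("A consistent STS order:", order)
```

More than one order can be consistent with the data: the reversed order always works, and STSs 3 and 4 hit exactly the same clones here, so their relative order is not determined. PQ-tree algorithms solve the same consecutive-ones problem in linear time when the data are clean.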
7Clone Selection
- Based on the clone map, select authentic clones to generate a minimum tiling path
- Most important criterion: authenticity
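One way to picture a minimum tiling path is as an interval-cover problem: each clone is an interval on the physical map, and the goal is the fewest clones that cover the region. The sketch below uses the standard greedy choice (always take the clone that extends coverage furthest). The clone names and coordinates are hypothetical, and the sketch ignores the authenticity criterion, which in practice would filter the candidate clones first.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy interval cover.

    clones: dict mapping clone name -> (start, end) on the physical map.
    Returns clone names covering [region_start, region_end] with the fewest clones.
    """
    by_start = sorted(clones.items(), key=lambda item: item[1][0])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting inside the covered part, keep the one reaching furthest.
        while i < len(by_start) and by_start[i][1][0] <= covered:
            if best is None or by_start[i][1][1] > best[1][1]:
                best = by_start[i]
            i += 1
        if best is None or best[1][1] <= covered:
            raise ValueError(f"gap in clone map at position {covered}")
        path.append(best[0])
        covered = best[1][1]
    return path

# Hypothetical clone map (coordinates in kb).
clone_map = {
    "A": (0, 150), "B": (100, 300), "C": (120, 260),
    "D": (250, 450), "E": (280, 400), "F": (430, 600),
}
path = minimum_tiling_path(clone_map, 0, 600)
print(path)
```

The greedy rule minimizes the number of clones; a real selection would also weigh overlap size and clone quality.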
8Subclone Library Construction
- DNA fragmented by sonication or RE cut
- Fragment size: 2-5 KB
9Random Shotgun Phase
- Dideoxy termination reaction
- Informatics programs
- Coverage and contigs
10Dideoxy Termination
- Method invented by Fred Sanger
- Automated sequencing developed by Leroy Hood
(Caltech) and Michael Hunkapiller (ABI)
11Bioinformatics Programs
- Developed at the University of Washington
- Phred
- Base calling
- Brent Ewing and Phil Green
- Phrap
- Assembly
- Phil Green
- Consed
- Viewing and editing
- David Gordon
12Coverage and contigs
- Coverage = sequenced bp / fragment size
- E.g. 200 KB BAC, sequenced 1000 x 500 bp subclones: coverage = 1000 x 500 bp / 200 KB = 2.5X
- Lander-Waterman curve
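The coverage arithmetic and the shape of the Lander-Waterman curve can be checked with a short calculation. The sketch below uses the simplest form of the Lander-Waterman model, which assumes reads land uniformly at random and ignores the minimum overlap needed to join two reads: the expected fraction of the clone covered is 1 - e^(-c) and the expected number of contigs is N e^(-c), where c is the coverage and N the number of reads. Function names are illustrative.

```python
import math

def coverage(num_reads, read_length, target_length):
    """Fold coverage: total sequenced bases divided by the target size."""
    return num_reads * read_length / target_length

def lander_waterman(num_reads, read_length, target_length):
    """Simplified Lander-Waterman expectations (minimum overlap ignored)."""
    c = coverage(num_reads, read_length, target_length)
    gap_probability = math.exp(-c)  # chance that a given base is not covered
    return {
        "coverage": c,
        "fraction_covered": 1 - gap_probability,
        "expected_contigs": num_reads * gap_probability,
    }

# The example from the slide: 200 KB BAC, 1000 subclone reads of 500 bp.
stats = lander_waterman(num_reads=1000, read_length=500, target_length=200_000)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, 2.5X coverage still leaves roughly 8% of the BAC unsequenced, which is why the directed finishing phase is needed.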
13Directed Finishing Phase
- David Gordon auto-finish
- Design primers at the two ends of each gap, PCR amplify, and sequence from both ends until the reads meet
- Sequence authentication: verify STS and RE sites
- Finished: < 1 error (or ambiguity) in 10,000 bp, in the right order and orientation along a chromosome, almost no gaps
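The finishing standard of fewer than 1 error in 10,000 bp can be restated on the Phred quality scale, where Q = -10 log10(P) and P is the error probability. The conversion below is a small sketch. Linking the finishing standard to a Phred score is an added illustration rather than something stated on the slide, and the function names are made up.

```python
import math

def phred_from_error(error_probability):
    """Phred-scaled quality for a given per-base error probability."""
    return -10 * math.log10(error_probability)

def error_from_phred(quality):
    """Per-base error probability for a given Phred-scaled quality."""
    return 10 ** (-quality / 10)

# One error in 10,000 bases, the finished-sequence threshold.
finished_q = phred_from_error(1 / 10_000)
print(f"1 error in 10,000 bp corresponds to Q{finished_q:.0f}")
print(f"Q20 corresponds to an error probability of {error_from_phred(20):.2f}")
```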
14Genome-Shotgun Sequencing
- Celera: human and Drosophila genomes
- No physical map
- Jigsaw puzzle assembly
- Coverage: 7-10X
15Shotgun Assembly
- Screener
- Identify low quality reads, contamination, and repeats
- Overlapper
- > 40 bp overlap with < 6% mismatches
- Unitigger
- Combine the easy (unique assembly) subset first
- Scaffolder and repeat resolution
- Generate clone libraries with different insert sizes, and sequence just the clone ends (read pairs)
- Use physical map information if available
- Consensus
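The overlapper step can be illustrated with a minimal suffix-prefix comparison. This sketch is not the Celera overlapper: it only considers substitutions (no insertions or deletions), it checks one read pair at a time instead of using an index, and it reads the mismatch threshold as a rate of about 6% of the overlap length, which is an assumption. `best_overlap` and its parameters are hypothetical names.

```python
def best_overlap(read_a, read_b, min_overlap=40, max_mismatch_rate=0.06):
    """Longest overlap between a suffix of read_a and a prefix of read_b.

    Returns (overlap_length, mismatches), or None if no overlap of at least
    min_overlap bases has a mismatch count below max_mismatch_rate * length.
    """
    longest = min(len(read_a), len(read_b))
    for length in range(longest, min_overlap - 1, -1):
        suffix = read_a[-length:]
        prefix = read_b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches < max_mismatch_rate * length:
            return length, mismatches
    return None

# Toy reads: a shared 40 bp core flanked by unrelated sequence.
core = "GGTGTTGTGG" * 4
read_a = "A" * 20 + core   # the core is the suffix of read_a
read_b = core + "C" * 20   # the core is the prefix of read_b
print(best_overlap(read_a, read_b))
```

Production overlappers find candidate pairs through shared k-mers and then run a banded alignment, so that all-against-all comparison stays tractable for millions of reads.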
16Hybrid Method
17Hybrid Method
- Optimal mixture of clone-by-clone vs whole-genome shotgun not established
- Still need 8-10X overall coverage
- Bacterial genomes can be sequenced by WGS alone
- Higher eukaryotes need more clone-by-clone
- Comparative genomics can reduce the need for physical mapping (clone-by-clone)
- Sequencing cost decreasing quickly
- Goal: $1000 / genome
18First Generation Sequencing
19Second Generation Sequencing
20Second Generation Sequencing Tech
- Traditional sequencing machine
- 384 reads of 1 kb / 3 hours
- 454 (Roche)
- 1M reads 400bp / 5 hours
- Solexa (Illumina)
- 100M-1B reads of 30-100 bp / 3-8 days, 8-16 samples
- SOLiD (Applied Biosystems)
- 1.4B reads of 35-50bp / 5-8 days, 16 samples
- Helicos (single molecule sequencing)
- 500M reads of 30 bp / week, 50 samples
- Moving targets
21Illumina (Solexa) Workflow
22Illumina HiSeq2000
- Throughput
- $1100 / lane
- 35-100 bp / read
- 16 lanes (2 flow cells) / run
- 60-80 million reads / lane
- Sequencing a human genome: $10,000, 1 week
- Bioinfo challenges
- Very large files
- CPU and RAM hungry
- Sequence quality filtering
- Mapping and downstream analysis
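The throughput figures on this slide translate into total yield per run with simple multiplication. The sketch below does that arithmetic and converts the yield into fold coverage of a human genome. The genome size of 3.1 billion bp is an assumption added for the example, and the function names are illustrative.

```python
HUMAN_GENOME_BP = 3.1e9  # assumed haploid human genome size, not from the slide

def run_yield_bp(lanes, reads_per_lane, read_length):
    """Total bases produced by one run."""
    return lanes * reads_per_lane * read_length

def fold_coverage(yield_bp, genome_bp=HUMAN_GENOME_BP):
    """Average number of times each genome position is sequenced."""
    return yield_bp / genome_bp

# Ranges from the slide: 16 lanes, 60-80 million reads per lane, 35-100 bp per read.
low = run_yield_bp(16, 60_000_000, 35)
high = run_yield_bp(16, 80_000_000, 100)
print(f"Yield per run: {low / 1e9:.1f} to {high / 1e9:.1f} Gb")
print(f"Human genome coverage: {fold_coverage(low):.0f}X to {fold_coverage(high):.0f}X")
```

These are raw-read numbers; the usable coverage is lower once low-quality and unmappable reads are filtered out.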
23Seq Files
FASTQ:
@HWI-EAS3051119910/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS3051119910/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS3051112010/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS3051112010/1
PXXXTXYXTTWYYYXXWWWTMTVXWBBB
SAM:
HWUSI-EAS366_0112611298188280/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\bcdab\_UULbTUT\ccLbbYaYcWLYW XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSI-EAS366_0112611257188190/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ecedddT\cTcaccdK\c__Yb\_cKS_W\ XM:i:1
HWUSI-EAS366_0112611315195290/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC c_Yc\LcbbbYdTa\dd\ddacdd\Y\dddcT XA:i:0 MD:Z:38 NM:i:0
BED:
chr1 123450 123500 +
chr5 28374615 28374615 -
- Raw FASTQ
- Sequence ID, sequence
- Quality ID, quality score
- Mapped SAM
- Flag: 0 OK (mapped, forward strand), 4 unmapped, 16 mapped to the reverse strand
- MD: mismatch info
- NM: number of mismatches
- Mapped BED
- Chr, start, end, strand
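The SAM fields listed above can be decoded with a few lines of Python. The sketch below parses the first SAM record shown on the slide, interprets the flag, counts mismatches from the MD tag, and converts the alignment to a BED-style interval (BED starts are 0-based, SAM positions are 1-based). It is a minimal illustration with hypothetical function names, not a replacement for a SAM library: it handles only the flag bits 4 and 16, and the quality string is replaced by `*` to keep the example short.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING_OPS = set("MDN=X")

def describe_flag(flag):
    """Interpret the two flag bits mentioned on the slide."""
    if flag & 4:
        return "unmapped"
    return "mapped, reverse strand" if flag & 16 else "mapped, forward strand"

def parse_sam_line(line):
    """Split one tab-delimited SAM record into a dict of the main fields."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for tag in fields[11:]:
        name, value_type, value = tag.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return {
        "qname": fields[0], "flag": int(fields[1]), "rname": fields[2],
        "pos": int(fields[3]), "mapq": int(fields[4]), "cigar": fields[5],
        "seq": fields[9], "tags": tags,
    }

def md_mismatches(md):
    """Count substituted bases in an MD string; deletions (^...) are skipped."""
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    return sum(1 for token in tokens if token.isalpha())

def sam_to_bed(record):
    """Return (chrom, start, end, strand) with a 0-based start, or None if unmapped."""
    if record["flag"] & 4:
        return None
    ref_length = sum(
        int(count)
        for count, op in re.findall(r"(\d+)([MIDNSHP=X])", record["cigar"])
        if op in REF_CONSUMING_OPS
    )
    start = record["pos"] - 1
    strand = "-" if record["flag"] & 16 else "+"
    return (record["rname"], start, start + ref_length, strand)

# First SAM record from the slide, with QUAL replaced by "*".
sam_line = "\t".join([
    "HWUSI-EAS366_0112611298188280/1", "16", "chr9", "98116600", "255", "38M",
    "*", "0", "0", "TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG", "*",
    "XA:i:1", "MD:Z:3C30T3", "NM:i:2",
])
record = parse_sam_line(sam_line)
print(describe_flag(record["flag"]))
print(md_mismatches(record["tags"]["MD"]), record["tags"]["NM"])
print(sam_to_bed(record))
```

The MD string `3C30T3` reads as 3 matches, a reference C, 30 matches, a reference T, and 3 matches, which adds up to the 38 aligned bases and agrees with `NM:i:2`.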
24(Potential) Applications
- Metagenomics and infectious disease
- Ancient DNA, recreate extinct species
- Comparative genomics (between species) and personal genomes (within species)
- Genetic tests and forensics
- Circulating nucleic acids
- Risk, diagnosis, and prognosis prediction
- Transcriptome and transcriptional regulation
- More later in the semester
25Third Generation Sequencing
- Single molecule sequencing (no amplification needed)
- Some can read very long sequences
- In 2-3 years, the cost of sequencing a human genome will drop below $1000
- Personal genome sequencing will be a key component of public health in every developed country
- The cost of sequencing will be lower than the cost of storing the sequences
- Bioinformatics will be key to converting data into knowledge
26HW Q for Graduate Students
- Write the first page (Specific Aims) of an NIH research proposal with a $1 million budget that uses high throughput sequencing and bioinformatics analysis to solve an interesting biomedical problem
27How to Write a Specific Aims Page
- Grant title and your name
- Introductory (1-2) paragraphs (1/4-1/3 page)
- A is very important in bio/medicine/disease
- Recent development in A has led to some really significant findings or improvements
- However, something is still lacking or not known about A
- The long term goal of our research is ...
- The focus of this proposal is ...
- The central hypothesis of this proposal is ...
- Therefore, we plan to investigate / develop ...
28How to Write a Specific Aims Page
- Specific Aims (1/3 to ½ page)
- Specifically, we plan to
- Aim 1: profile the genome-wide xxx of xxx
- 1.1 establish xxx
- 1.2
- Aim 2: develop a computer algorithm or knowledge base
- 2.1 model xxx
-
- Aim 3: identify the mechanism of xxx
29Specific Aims
- Sound like you can definitely do it
- Do not use words that sound like a fishing expedition, such as try, explore (find), or words that sound too trivial, such as download (collect)
- Try not to let the success of one aim depend on another, but keep the aims intellectually coherent (not a collection of unrelated topics)
- Propose as many aims as years for the grant
30How to Write a Specific Aims Page
- Last paragraph (< ¼ page)
- What's novel about our approach
- Deliverables (what will the scientific community see at the end)
- A software, database, map, mechanism, or resource
- What's the (potential) significance of our proposal
- Best possible outcome
31Summary
- Genome sequencing and assembly
- Clone-by-clone: HGP
- Map big clones, find path, shotgun sequence subclones, assemble and finish
- Sequencing: dideoxy termination
- Whole-genome shotgun: Celera
- Massively Parallel Sequencing
- 454, Solexa, SOLiD, Helicos
- Many opportunities and many challenges
- Project proposal
32Acknowledgement
- Fritz Roth
- Dannie Durand
- Larry Hunter
- Richard Davis
- Wei Li
- Jarek Meller
- Stefan Bekiranov
- Stuart M. Brown
- Rob Mitra