Genome Sequencing and Assembly: High-throughput Sequencing


About This Presentation

Title:

Genome Sequencing and Assembly High throughput Sequencing

Description:

Genome Sequencing and Assembly: High-throughput Sequencing. Xiaole Shirley Liu, Jun Liu. STAT115, STAT215.


Transcript and Presenter's Notes


1
Genome Sequencing and Assembly: High-throughput Sequencing

Xiaole Shirley Liu
Jun Liu
STAT115, STAT215

2
Outline

Genome sequencing strategy
Clone-by-clone sequencing
Whole-genome shotgun sequencing
Hybrid method
Next generation sequencing technologies

3
Competing Sequencing Strategies

Clone-by-clone and whole-genome shotgun

4
Clone-by-Clone Shotgun Sequencing

E.g. Human genome project
Map construction
Clone selection
Subclone library construction
Random shotgun phase
Directed finishing phase and sequence authentication

5
Map Construction

Clone genomic DNA in YACs (1 Mb) or BACs (200 kb)
Map the relative location of clones
Sequence-tagged site (STS, e.g. EST) mapping
PCR or probe hybridization to screen STS
Restriction site fingerprint
Most time-consuming step
1990-98 to generate physical maps for human
http://www.ncbi.nlm.nih.gov/genemap99/

6
Resolve Clone Relative Location

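Find a column permutation of the binary hybridization matrix such that all the ones in each row are located in one consecutive block

[Figure: STS 1-5 ordered along the DNA, spanned by overlapping clones 1-4]

Observed matrix (rows: clones; columns: STS in arbitrary order)
Clone   3 5 1 4 2
1       1 0 0 1 0
2       0 0 1 0 1
3       1 0 0 1 1
4       1 1 0 1 0

After column permutation (rows: clones; columns: STS in map order)
Clone   1 2 3 4 5
1       0 0 1 1 0
2       1 1 0 0 0
3       0 1 1 1 0
4       0 0 1 1 1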
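
The column-permutation idea on this slide can be sketched in a few lines. This is only a brute-force illustration that tries every STS order, which is workable for the 5-STS example but not for real maps. The function names `ones_are_consecutive` and `find_sts_order` are made up for this sketch and are not from the slides.

```python
from itertools import permutations

def ones_are_consecutive(row):
    """True if all 1s in the row form a single unbroken block."""
    idx = [i for i, v in enumerate(row) if v == 1]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)

def find_sts_order(matrix, sts_labels):
    """Try every column order; return the first one in which every
    clone (row) hits a consecutive block of STSs, or None."""
    for perm in permutations(range(len(sts_labels))):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            return [sts_labels[j] for j in perm]
    return None

# Observed hybridization matrix from the slide: rows are clones 1-4,
# columns are STSs in the arbitrary order 3, 5, 1, 4, 2.
observed = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
]
labels = [3, 5, 1, 4, 2]

order = find_sts_order(observed, labels)
print("One consistent STS order:", order)
```

More than one order can satisfy the constraint (for instance, a valid order read backwards is also valid), so the sketch reports only the first one it finds.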
7
Clone Selection

Based on the clone map, select authentic clones to generate a minimum tiling path
Most important criterion: authenticity
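
A minimum tiling path can be illustrated as an interval-covering problem. The sketch below assumes each clone already has start and end coordinates on the physical map. The clone names and coordinates are hypothetical, and the authenticity check that the slide stresses is not modelled. It uses a standard greedy rule: among the clones that start inside the region covered so far, take the one that reaches furthest.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """clones: list of (name, start, end) map coordinates.
    Returns the names of a smallest set of clones covering
    [region_start, region_end], or None if the map has a gap."""
    remaining = sorted(clones, key=lambda c: c[1])  # sort by start
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Scan clones that start within the covered region; keep the longest reach.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            return None  # no clone extends the path: gap in the map
        path.append(best[0])
        covered_to = best[2]
    return path

# Hypothetical BAC coordinates in kb.
bacs = [
    ("BAC1", 0, 200),
    ("BAC2", 50, 180),
    ("BAC3", 150, 350),
    ("BAC4", 190, 400),
    ("BAC5", 380, 600),
]
tiling = minimum_tiling_path(bacs, 0, 600)
print(tiling)
```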

8
Subclone Library Construction

DNA fragmented by sonication or restriction enzyme (RE) digestion
Fragment size: 2-5 kb

9
Random Shotgun Phase

Dideoxy termination reaction
Informatics programs
Coverage and contigs

10
Dideoxy Termination

Method invented by Fred Sanger
Automated sequencing developed by Leroy Hood (Caltech) and Michael Hunkapiller (ABI)

11
Bioinformatics Programs

Developed at the University of Washington
Phred: base calling (Brent Ewing, Phil Green)
Phrap: assembly (Phil Green)
Consed: viewing and editing (David Gordon)

12
Coverage and contigs

Coverage = sequenced bp / fragment size
E.g. 200 kb BAC, sequenced 1000 x 500 bp subclones: coverage = 1000 x 500 bp / 200 kb = 2.5X
Lander-Waterman curve
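
The coverage arithmetic, and the kind of estimate behind a Lander-Waterman curve, can be checked with a short script. It uses the usual Poisson approximations: the fraction of the target left unsequenced is about e^(-c), and the expected number of contigs is about N e^(-c) when the minimum detectable overlap is ignored. The function names and the `min_overlap_fraction` parameter are choices made for this sketch, not part of the slides.

```python
import math

def coverage(num_reads, read_len, target_len):
    """Coverage c = total sequenced bases / target length."""
    return num_reads * read_len / target_len

def expected_uncovered_fraction(c):
    """Poisson approximation: probability that a base is hit by no read."""
    return math.exp(-c)

def expected_contigs(num_reads, c, min_overlap_fraction=0.0):
    """Lander-Waterman estimate N * exp(-c * sigma), where
    sigma = 1 - (minimum overlap / read length)."""
    sigma = 1.0 - min_overlap_fraction
    return num_reads * math.exp(-c * sigma)

# Slide example: 200 kb BAC, 1000 subclone reads of 500 bp each.
c = coverage(1000, 500, 200_000)
print(f"coverage = {c:.1f}X")
print(f"expected uncovered fraction = {expected_uncovered_fraction(c):.3f}")
print(f"expected contigs = {expected_contigs(1000, c):.0f}")
```

At 2.5X the estimate still leaves roughly 8% of the BAC unsequenced, which is the reason for the directed finishing phase.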

13
Directed Finishing Phase

Autofinish (David Gordon)
Design primers at the two ends of each gap, PCR amplify, and sequence from both ends until the reads meet
Sequence authentication: verify STS and RE sites
Finished: < 1 error (or ambiguity) in 10,000 bp, in the right order and orientation along a chromosome, almost no gaps

14
Genome-Shotgun Sequencing

Celera: human and Drosophila genomes
No physical map
Jigsaw puzzle assembly
Coverage: 7-10X

15
Shotgun Assembly

Screener
Identify low quality reads, contamination, and repeats
Overlapper
> 40 bp overlap with < 6% mismatches
Unitigger
Combine the easy (unique assembly) subset first
Scaffolder: repeat resolution
Generate clone libraries of different insert sizes, and sequence just the clone ends (read pairs)
Use physical map information if available
Consensus
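
The overlapper step can be illustrated with a naive suffix-prefix comparison. This is a sketch, not the Celera implementation: it reads the slide's thresholds as an overlap longer than 40 bp with a mismatch rate below 6% (treating the "6" as a percentage is an assumption), it ignores indels and reverse complements, and the function name `best_overlap` is invented here.

```python
def best_overlap(a, b, min_len=40, max_mismatch_rate=0.06):
    """Longest overlap between a suffix of read a and a prefix of read b
    that is longer than min_len with a mismatch rate below
    max_mismatch_rate. Returns (length, mismatches) or None."""
    for length in range(min(len(a), len(b)), min_len, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches < max_mismatch_rate * length:
            return length, mismatches
    return None

# Toy example: two reads drawn from the same made-up 75 bp sequence.
genome = ("ACGTTGCAAG" "CTTAGGCTAA" "CCGTATCGGA" "TCCTAGGTAC"
          "CATGCAATTG" "GCCTTAAGGC" "ATCGTAGCTA" "GGATC")
read_a = genome[:55]   # bases 0-54
read_b = genome[10:]   # bases 10-74, so the true overlap is 45 bp
print(best_overlap(read_a, read_b))
```

Comparing every read pair this way is quadratic in the number of reads; real overlappers index k-mers first and only align candidate pairs.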

16
Hybrid Method
17
Hybrid Method

Optimal mixture of clone-by-clone vs whole-genome shotgun (WGS) not established
Still need 8-10X overall coverage
Bacterial genomes can be sequenced by WGS alone
Higher eukaryotes need more clone-by-clone
Comparative genomics can reduce the need for physical mapping (clone-by-clone)
Sequencing cost decreasing quickly
Goal: $1000 / genome

18
First Generation Sequencing
19
Second Generation Sequencing
20
2nd Gen Sequencing Tech

Traditional sequencing machine
384 reads of 1 kb / 3 hours
454 (Roche)
1M reads of 400 bp / 5 hours
Solexa (Illumina)
100M-1B reads of 30-100 bp / 3-8 days, 8-16 samples
SOLiD (Applied Biosystems)
1.4B reads of 35-50 bp / 5-8 days, 16 samples
Helicos (single molecule sequencing)
500M reads of 30 bp / week, 50 samples
Moving targets

21
Illumina (Solexa) Workflow
22
Illumina HiSeq2000

Throughput
$1100 / lane
35-100 bp / read
16 lanes (2 flow cells) / run
60-80 million reads / lane
Sequencing a human genome: $10,000, 1 week
Bioinfo challenges
Very large files
CPU and RAM hungry
Sequence quality filtering
Mapping and downstream analysis

23
Seq Files

FASTQ example
@HWI-EAS3051119910/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS3051119910/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS3051112010/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS3051112010/1
PXXXTXYXTTWYYYXXWWWTMTVXWBBB

SAM example
HWUSI-EAS366_0112611298188280/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\bcdab\_UULbTUT\ccLbbYaYcWLYW XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSI-EAS366_0112611257188190/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ecedddT\cTcaccdK\c__Yb\_cKS_W\ XM:i:1
HWUSI-EAS366_0112611315195290/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC c_Yc\LcbbbYdTa\dd\ddacdd\Y\dddcT XA:i:0 MD:Z:38 NM:i:0

BED example
chr1 123450 123500 +
chr5 28374615 28374615 -

Raw: FASTQ
Sequence ID, sequence
Quality ID, quality score
Mapped: SAM
Flag: 0 OK, 4 unmapped, 16 mapped to reverse strand
MD: mismatch info
NM: number of mismatches
Mapped: BED
Chr, start, end, strand
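
To make the SAM fields above concrete, the following sketch parses one tab-separated SAM line, interprets the flag values listed on the slide, and converts a mapped read into the four BED columns. It is a hand-rolled illustration rather than a replacement for a real SAM library. The helper names, the read name `read1`, and the placeholder quality string are made up, while the coordinates and tags are modelled on the first SAM record shown above.

```python
import re

def decode_flag(flag):
    """Interpret the flag bits mentioned on the slide (4 and 16)."""
    if flag & 4:
        return "unmapped"
    if flag & 16:
        return "mapped, reverse strand"
    return "mapped, forward strand"

def parse_sam_line(line):
    """Split one SAM alignment line into the fields used here."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "name": fields[0],
        "flag": int(fields[1]),
        "chrom": fields[2],
        "pos": int(fields[3]),   # 1-based leftmost mapped position
        "cigar": fields[5],
        "seq": fields[9],
        "tags": {},
    }
    for tag in fields[11:]:      # optional TAG:TYPE:VALUE fields
        key, kind, value = tag.split(":", 2)
        record["tags"][key] = int(value) if kind == "i" else value
    return record

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

def sam_to_bed(record):
    """Return (chrom, start, end, strand) with BED's 0-based start."""
    if record["flag"] & 4:
        return None
    start = record["pos"] - 1
    end = start + reference_span(record["cigar"])
    strand = "-" if record["flag"] & 16 else "+"
    return (record["chrom"], start, end, strand)

example = "\t".join([
    "read1", "16", "chr9", "98116600", "255", "38M", "*", "0", "0",
    "TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG", "I" * 38,
    "XA:i:1", "MD:Z:3C30T3", "NM:i:2",
])
rec = parse_sam_line(example)
print(decode_flag(rec["flag"]), rec["tags"]["NM"], sam_to_bed(rec))
```

Note the coordinate shift: SAM positions are 1-based, while BED intervals are 0-based and half-open.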

24
(Potential) Applications

Metagenomics and infectious disease
Ancient DNA, recreate extinct species
Comparative genomics (between species) and
personal genomes (within species)
Genetic tests and forensics
Circulating nucleic acids
Risk, diagnosis, and prognosis prediction
Transcriptome and transcriptional regulation
More later in the semester

25
Third Generation Sequencing

Single molecule sequencing (no amplification needed)
Some can read very long sequences
In 2-3 years, the cost of sequencing a human genome will drop below $1000
Personal genome sequencing will be a key component of public health in every developed country
The cost of sequencing will be lower than the cost of storing the sequences
Bioinformatics will be key to converting data into knowledge

26
HW Q for Graduate Students

Write the first page (Specific Aims) of an NIH research proposal with a $1 million budget that uses high-throughput sequencing and bioinformatics analysis to solve an interesting biomedical problem

27
How to Write a Specific Aims Page

Grant title and your name
Introductory paragraphs (1-2 paragraphs, 1/4-1/3 page)
A is very important in bio/medicine/disease
Recent developments in A have led to some really significant findings or improvements
However, something is still lacking or not known about A
The long-term goal of our research is ...
The focus of this proposal is ...
The central hypothesis of this proposal is ...
Therefore, we plan to investigate / develop ...

28
How to Write a Specific Aims Page

Specific Aims (1/3 to 1/2 page)
Specifically, we plan to
Aim 1: profile the genome-wide xxx of xxx
1.1 establish xxx
1.2
Aim 2: develop a computer algorithm or knowledge base
2.1 model xxx
Aim 3: identify the mechanism of xxx

29
Specific Aims

Sound like you can definitely do it
Do not use words that sound like a fishing expedition, such as try, explore (find), or words that sound too trivial, such as download (collect)
Try not to let the success of one aim depend on another, but keep the aims intellectually coherent (not a collection of unrelated topics)
Propose as many aims as years for the grant

30
How to Write a Specific Aims Page