Title: Genome Sequence Acquisition and Analysis, Lecture 3
1Genome Sequence Acquisition and Analysis: Lecture 3
2Data Acquisition
- Sequencing the whole genome
- Sequencing technology
- The polymerase chain reaction leads to cycle
sequencing
- Capillary electrophoresis
- Sequencing strategies for mapping whole genomes
- Hierarchical shotgun sequencing
- Shotgun sequencing
- Assembly strategies for the first draft of the
human genome
- How do we make sense of all these bases?
3Reading
- Baxevanis, Chapters II, III, and XIII
4Automated DNA sequencing
- DNA polymerase copies a strand of DNA.
- The incorporation of a dye terminator stops DNA
synthesis. The resulting fragments are run on a
gel, with a single lane containing the
terminators for all four bases.
- The terminators are labeled with a different dye
depending on whether the base is A, G, C, or T.
- The sequence is read by a computer, which
generates a sequence trace by translating the
fluorescence signal into DNA sequence.
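As an illustration of how a sequence can be read from dye-terminated fragments, here is a minimal Python sketch. It assumes an idealized reaction in which exactly one terminated fragment exists for every length; the function names are hypothetical and do not come from any sequencing software.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_fragments(template):
    """Return (length, dye) pairs, one per chain-terminated copy.

    Assumption: the template is given 3'->5', so the newly synthesized
    strand grows 5'->3' in the same index order. Each fragment is
    labeled by the dye of its last (terminator) base.
    """
    new_strand = [COMPLEMENT[base] for base in template]
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_trace(fragments):
    """Order fragments by length, as electrophoresis does, and call bases."""
    return "".join(dye for _, dye in sorted(fragments))


fragments = terminated_fragments("TACG")
random.shuffle(fragments)          # the reaction tube holds them unordered
print(read_trace(fragments))
```

Sorting by length plays the role of the electrophoresis step; the dye on the last base of each fragment gives the base call.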
5Thermocycler for PCR and sequencing
6Automated cycling DNA sequencing
Terminator Dye
- Using a thermostable DNA polymerase, it is
possible to use the polymerase chain reaction
method for sequencing. The advantage is that we
can generate sequence from a very small amount
of DNA, since the template is amplified during the
process.
Polymerase
55°C: annealing of the primer to the DNA
72°C: DNA synthesis of fragments with different
dye terminators and lengths
95°C: dissociation of the duplex DNA;
amplification of the template
7Capillary Electrophoresis
Electrophoresis is a powerful method that was
developed long before molecular biology came into
existence. The principle behind electrophoresis
is that charged molecules will migrate toward the
opposite pole and separate from each other based
on physical characteristics. For example,
SDS-PAGE separates proteins according to their
molecular weights. The two limiting factors of
traditional electrophoresis were 1) detection of
molecules upon completion of electrophoretic
separation and 2) only low voltages could be used
to prevent heat damage of the samples. Capillary
electrophoresis solved both of these problems.
Because the capillary tube has a high
surface-to-volume ratio (25-100 µm diameter), it
radiates heat readily and thus samples do not
overheat.
Detection of the migrating molecules is
accomplished by shining a light source through a
portion of the tubing and detecting the light
emitted from the other side (figure 1).
Scanner
8The Mega-sequencer
9The Mega-sequencer
10Data acquisition
11Mapping strategies
12Definitions
- BAC: Bacterial Artificial Chromosomes are used to
clone large fragments of DNA (up to 150 kb) and
can replicate in E. coli. Their inserts are
shorter than those of YAC vectors. They are the
vectors most used today for hierarchical
sequencing.
- YAC: Yeast Artificial Chromosomes are used to
clone large fragments of DNA, from 150 kb to
1.5 Mb, for sequencing. The problem is that the
inserts are sometimes unstable and rearrangement
of the DNA can occur.
- STS: To allow better assembly of the fragments
obtained from sequencing a genome, sequence-tagged
sites were mapped all over the human genome. They
are short, unique sequences of DNA located all
along the chromosomes, each defined by a pair of
primers that amplifies only one segment of the
chromosome.
- EST: When genomes started to be annotated (that
is, when the coding regions of the genes started
to be defined), several labs began to sequence
short cDNA transcripts (<500 bp) from either the
3' or the 5' end of the cDNA. These short segments
were called Expressed Sequence Tags (ESTs). It did
not matter whether the labs knew which genes the
ESTs came from, since their goal was to compile
every EST. ESTs have been very useful for cloning
genes and identifying splice variants of genes.
If an investigator sequenced a piece of DNA that
contained an EST, they would know that the DNA
was a coding region. The number of ESTs grew very
rapidly, and they are now organized in an EST
database called UniGene.
- Today many sequencing consortia produce
full-length cDNAs daily, and these are added to
the UniGene database. The goal is to have the
full-length cDNA of every gene available to the
research community. These genes can be bought
(300 USD).
- FISH: Fluorescence in situ hybridization.
Chromosomes are isolated, fixed on a glass slide
and hybridized to a fluorescent molecular probe.
This experiment allows markers (STS, BAC) to be
positioned on the cytogenetic map of the
chromosome.
- CONTIGS: BAC and YAC vectors contain contiguous
fragments of DNA that overlap.
13BACs
- Large insert size
- No Chimeras
- Stable in host cells
BAC ends
14Chromosome Map Cytogenetic Map
Centromere
Distant Markers
Closely linked markers
BAC Fragment collection Physical Map
Mostly found on same fragments
Rarely found on same fragments
The Phred and Phrap suite of programs (Phil Green)
is used to assemble the DNA fragments and
assess the quality of the assembly (Baxevanis,
Chapter XIII).
15Feature annotation: Fluorescence in Situ
Hybridization (FISH)
- When a particular marker (STS, BAC) is positioned
on the cytogenetic map of a chromosome by FISH,
all overlapping DNA fragments (BACs) that contain
the marker can ultimately be positioned on the
chromosome to build the physical map.
An example of fluorescence in situ hybridization
using a BAC subsequence as a fluorescent probe
16Physical Map assembly
- Library of BACs containing genome fragments
- STS linkage to position fragments
- Physical Map assembly
- Fragment BACs and assemble sequenced contigs
- Align sequenced fragments to physical map by
Phred and Phrap.
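The STS linkage step can be sketched as follows: BACs that share an STS marker must overlap, so they belong to the same contig. This is a simplified Python illustration with hypothetical BAC and marker names. It only groups clones, it does not order them, and it is not the algorithm of any particular mapping project.

```python
def group_into_contigs(bac_markers):
    """Group BACs that share at least one STS, directly or transitively.

    bac_markers maps a BAC name to the STS markers detected on it.
    Returns a list of contigs, each a sorted list of BAC names.
    """
    contigs = []  # list of (set of BACs, set of markers seen in them)
    for bac, markers in bac_markers.items():
        markers = set(markers)
        merged_bacs, merged_markers = {bac}, set(markers)
        untouched = []
        for bacs, seen in contigs:
            if seen & markers:          # shares a marker: same contig
                merged_bacs |= bacs
                merged_markers |= seen
            else:
                untouched.append((bacs, seen))
        contigs = untouched + [(merged_bacs, merged_markers)]
    return [sorted(bacs) for bacs, _ in contigs]


# Hypothetical STS content of four clones
library = {
    "BAC1": ["STS_A", "STS_B"],
    "BAC2": ["STS_B", "STS_C"],
    "BAC3": ["STS_X"],
    "BAC4": ["STS_C", "STS_D"],
}
print(group_into_contigs(library))
```

BAC1, BAC2 and BAC4 are chained by shared markers into one contig, while BAC3 shares no marker and stays alone, which corresponds to a gap in the physical map.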
17Hierarchical shotgun sequencing
The method preferred by the Human Genome Project
is the hierarchical shotgun sequencing method. In
this approach, genomic DNA is cut into pieces of
about 150 kb and inserted into BAC vectors, which
are transformed into E. coli, where they are
replicated and stored. The BAC inserts are
isolated and mapped to determine the order of
each cloned 150 kb fragment. This ordered set of
clones is referred to as the Golden Tiling Path.
Each BAC fragment in
the Golden Path is fragmented randomly into
smaller pieces and each piece is cloned into a
plasmid and sequenced on both strands. These
sequences are aligned so that identical sequences
are overlapping. These contiguous pieces are then
assembled into finished sequence once each strand
has been sequenced about 4 times to produce 8X
coverage of high quality data.
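The coverage figure can be made concrete with a little arithmetic. The sketch below assumes a read length of 500 bp (an illustrative value, not taken from the lecture) and uses the standard idealized model in which reads land uniformly at random, so that the expected fraction of bases left uncovered at coverage c is e^(-c).

```python
import math


def reads_needed(target_length, read_length, coverage):
    """Number of reads so that reads * read_length = coverage * target."""
    return math.ceil(coverage * target_length / read_length)


def expected_fraction_covered(coverage):
    """Idealized expectation: a base is missed with probability e^(-c)."""
    return 1.0 - math.exp(-coverage)


# One 150 kb BAC insert at 8X, with an assumed read length of 500 bp
print(reads_needed(150_000, 500, 8))
print(round(expected_fraction_covered(8), 5))
```

Under these assumptions a single BAC needs a few thousand reads, and even at 8X a small fraction of bases is expected to remain unsequenced, which is one reason finishing work is needed.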
18Shotgun sequencing: Celera method
- The method developed and preferred by Celera is
simply called shotgun sequencing. This approach
was developed and perfected on prokaryotic
genomes which are smaller in size and contain
less repetitive DNA. Shotgun sequencing randomly
shears genomic DNA into small pieces which are
cloned into plasmids and sequenced on both
strands, thus eliminating the BAC step from the
HGP's approach. Once the sequences are obtained,
they are aligned and assembled into finished
sequence.
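The align-and-assemble step can be sketched with a greedy overlap merge. This is a toy Python illustration only: it assumes error-free reads from a single strand, and the function names and the minimum overlap length are hypothetical. Real assemblers must handle sequencing errors, both strands, and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining pieces are contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]))
```

Reads that overlap nothing remain as separate contigs, which is how gaps appear in a draft. A repeat longer than the reads can make two distinct regions look identical to this procedure, which illustrates why shotgun assembly is harder on repeat-rich genomes.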
19Which strategy is the best?
- When the human genome drafts were published in
2001, in Nature (HGP, academic) and in Science
(Celera), many of the contigs did not overlap,
leaving gaps in the genome.
- Many believe that the HGP method is better suited
to filling gaps between contigs, since the
fragments are ordered.
- For example, duplicated genes could be combined
into a single locus by the Celera method, while
the HGP method could identify two different loci
for the duplicated genes.
- The completed human genome from the HGP was
released in April 2003, for the 50th anniversary
of the double helix. Celera was able to update
its genome using the HGP genome. The human genome
produced by the HGP consortium did not take
advantage of the Celera genome, since Celera did
not share its data, for intellectual property
reasons.
- Today the HGP genome is available in public
databases, while the Celera genome is available
only by subscription (10,000 USD).
20Assembling the Human Genome
- The HGP was not limited to the human genome; it
also targeted fly, worm, mouse, plants, and yeast.
By comparing different genomes we should be able
to learn more about genomes in general and the
human genome in particular.
- Big genomes contain repetitive regions from old
viruses and transposons that have accumulated
over millennia. Piecing together the full genome
is easier when markers exist. For this purpose
STSs were designed to position the different
pieces on the genome.
- Since many regions of the DNA do not code for RNA
or proteins, ESTs were used to differentiate
between junk DNA and coding regions.
21Assembling the Human Genome
- Assemble the sequence Data
- Annotate the features
- Provide a dataset of assembled genomic
sequence, RNAs, and proteins
- Create a map viewer of the genomic sequences
22Genomics sequence assemblies
- Public effort. Objective: assemble overlapping
fragments from different BAC sequences. Two
organizations:
- The Human Genome Project Working Draft (also
known as the 'Golden Path' assemblies) at the
University of California, Santa Cruz (UCSC)
- The National Center for Biotechnology
Information (NCBI) Contig Assemblies
23Genomics sequence assemblies
- Public effort: The UCSC strategy begins with human
genomic sequences from GenBank, composed of BAC
clones at a given point in time (a 'freeze'
dataset). These are ordered and oriented
according to the appropriate fingerprinting
contigs obtained from the Washington University
Genome Sequencing Center (WUGSC).
A typical fingerprint gel containing 121 lanes of
DNA. Every 5th lane is a marker lane with 37
bands to calibrate the gel metric. Each of the 96
data lanes contains a digested BAC, with 30-60
bands.
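Two clones that overlap share many restriction fragments, so their fingerprints have bands in common. A simplified Python sketch of counting shared bands follows. The tolerance value and the band sizes are made up for illustration, and real fingerprinting software uses a statistical score rather than a raw count.

```python
def shared_bands(bands_a, bands_b, tolerance=5):
    """Count bands of A that match a not-yet-used band of B.

    Two bands are treated as the same fragment when their sizes differ
    by at most `tolerance` (an assumed measurement error).
    """
    remaining = sorted(bands_b)
    shared = 0
    for band in sorted(bands_a):
        for k, other in enumerate(remaining):
            if abs(band - other) <= tolerance:
                shared += 1
                del remaining[k]   # each band of B is matched only once
                break
    return shared


# Hypothetical band sizes for three digested BACs
bac_a = [1200, 2500, 3100, 4800]
bac_b = [1203, 3098, 5200, 7000]
bac_c = [900, 1500]
print(shared_bands(bac_a, bac_b), shared_bands(bac_a, bac_c))
```

A high shared-band count suggests that two clones overlap and can be placed next to each other in a fingerprint contig; clones with no bands in common give no evidence of overlap.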
24Genomics sequence assemblies
- Public effort: Within each WUGSC contig, draft
sequence fragments are assembled into consensus
'raft' sequences using overlaps between fragments
and bridging mRNA, expressed sequence tag (EST),
plasmid and BAC-end-pair sequences. Repeated
tracts of the letter 'N' are inserted between
non-overlapping rafts to give a longer consensus
sequence for each WUGSC contig.
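The use of runs of 'N' as gap placeholders can be illustrated in a few lines of Python. The gap length used here is an arbitrary assumption, since the lecture does not state one, and the function names are hypothetical.

```python
import re


def join_rafts(rafts, gap_length=100):
    """Concatenate ordered rafts, separated by runs of the letter N."""
    return ("N" * gap_length).join(rafts)


def find_gaps(sequence):
    """Return (start, end) positions of every run of N in a sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", sequence)]


contig = join_rafts(["ACGT", "TTGA"], gap_length=5)
print(contig)
print(find_gaps(contig))
```

Because the gaps are marked explicitly, downstream tools can report how many gaps a contig contains and where they lie, without mistaking the placeholder for real sequence.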
25Genomics sequence assemblies
- Public effort: The NCBI approach also begins by
finding an order for adjacent BACs, but in this
case it is derived from BAC sequence overlaps,
chromosome assignment by fluorescence in situ
hybridization (FISH), and sequence-tagged site
(STS) content. The sequence fragments from these
overlapping BACs are then merged into consensus
'meld' sequences. As with the UCSC method, these
melds are then ordered and orientated using ESTs,
mRNAs, etc., before being combined into a single
NCBI genomic sequence contig with melds separated
by runs of the letter 'N'. Because NCBI will
often be dealing with fewer sequence fragments
than UCSC in the construction of a given contig
(since NCBI only assembles contigs from
overlapping BAC sequences, not from all BAC
sequences within an FPC contig as UCSC does),
there should be less opportunity for misassembly
(erroneously ordering or orientating sequence
fragments).
26Genomics sequence assemblies
- Private effort: Celera Genomics has also published
a draft assembly of the genome. It is available,
under a variety of restrictions, only from their
web site.
- They used a whole-genome shotgun sequencing
approach to obtain fragmentary and unmapped
sequence data, and combined these with
International Human Genome Sequencing Consortium
(IHGSC) sequence and mapping data.
- Celera assembled its own data with those produced
by the IHGSC using two different assembly
protocols, either with or without mapping
information. They found that using the public
mapping data gave the assembly greater sequence
coverage.
- The Celera sequences may provide a useful
resource for plugging gaps in the IHGSC draft
genome. Up to 0.14 of the Celera sequence was
not present in the public data set in the first
draft of the human genome.
27Genomics sequence assemblies
- The UCSC effort provided an exhaustive account of
the shortcomings of their assemblies. Summary
statistics are given for almost every measurable
characteristic, including numbers and lengths of
gaps and the estimated number of misassemblies.
- The UCSC group has examined the performance of
their assembly algorithms on artificially
produced fragmentary sequence (fragmented
sequences from finished regions of the genome).
- Their assemblies reproduced the correct ordering
of fragments around 85% of the time and the
correct orientation of fragments around 90% of
the time.
- The quality of the assemblies degraded rapidly
where the algorithms had to deal with many small
fragments, with both correct order and
orientation being achieved only 50% of the time
(effectively the same as random assembly) in the
worst cases.
- Possible misassemblies were also identified in
the Celera assembly. Neither Celera nor NCBI
publishes statistics analogous to UCSC's for
draft sequence assembly, nor have they published
estimates of the rate of error involved.
28Genomics Mapping Data
- WUGSC (Washington University Genome Sequencing
Center) has produced a physical map of the
genome using 'fingerprinting' analysis of BACs
from a library of genomic sequence clones called
RPCI-11.
- Overlaps between clones are calculated using
different algorithms on the basis of clone
restriction-fragment patterns, or fingerprints.
In this way, fingerprinting-based contigs
covering at least 96% of the euchromatic genome
have been constructed. The first published draft
genome sequence assemblies contain hundreds of
thousands of gaps, whereas this physical map
contains fewer than 1,000.
- The International Human Genome Mapping Consortium
(IHGMC) physical map was supplemented with data
from other sources. Many fully or partially
sequenced BAC clones have been cytogenetically
mapped to particular regions using FISH, and this
cytogenetic location is given in the GenBank
sequence entry for the clone.
- The end sequences from many BAC clones, including
those from libraries other than RPCI-11, are
available from The Institute for Genomic Research
(TIGR) Human BAC Ends site.
- The physical and genetic marker content of a
sequence of interest can be determined online
using the electronic-PCR (e-PCR) program at NCBI.
This is a rapid sequence-search algorithm that
searches your sequence for occurrences of marker
sequences in GenBank. If you enter a BAC clone
accession number, the output consists of an
ordered list of markers (and the chromosome they
map to) along the clone sequence.
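The idea behind searching a sequence for an STS can be sketched in a few lines of Python. This is not the NCBI e-PCR implementation: it assumes exact primer matches on one strand, and the function names, primers, and product size limit are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts_hits(sequence, forward, reverse, max_product=1000):
    """Return (start, end) spans that the primer pair could amplify.

    The forward primer must match the sequence as given; the reverse
    primer anneals to the other strand, so its reverse complement is
    searched for downstream of the forward primer.
    """
    hits = []
    rc_reverse = reverse_complement(reverse)
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(rc_reverse, start + len(forward))
        if end != -1:
            product_end = end + len(rc_reverse)
            if product_end - start <= max_product:
                hits.append((start, product_end))
        start = sequence.find(forward, start + 1)
    return hits


# Hypothetical clone sequence and primer pair
clone = "GGACGTACTTTTCCATGAAA"
print(find_sts_hits(clone, forward="ACGTAC", reverse="TCATGG"))
```

A well-designed STS should give exactly one hit in the genome, which is what makes it usable as a landmark for positioning clones.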
29Possible stages of BAC clone sequence
- Phase 1: Unfinished; unordered/unoriented contigs
with gaps
- Phase 2: Unfinished; ordered/oriented contigs
with or without gaps
- Phase 3: Finished; no gaps
30Writing the parts list: genomic sequence
annotation
- "We've called the human genome the blueprint, the
Holy Grail, all sorts of things. It's a parts
list," said Eric Lander at the Millennium Evening
at the White House, 14 October 1999, about the
first draft of the human genome.
- The IHGSC gene-prediction process was
estimated to have a sensitivity of 68-85% (so
that 15-32% of genes present in the genome were
missed) and an accuracy of around 79% on average
(so that 21% of the sequence of each predicted
gene was missed).
- Since then, gene prediction has been constantly
improved.
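To make the sensitivity figure concrete: sensitivity is the fraction of the genes really present that the prediction finds, so the missed fraction is one minus the sensitivity. A small Python sketch with made-up counts:

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real genes that were predicted."""
    return true_positives / (true_positives + false_negatives)


def missed_fraction(true_positives, false_negatives):
    """Fraction of real genes that the prediction failed to find."""
    return 1.0 - sensitivity(true_positives, false_negatives)


# Hypothetical benchmark: 100 known genes, 68 of them predicted
print(sensitivity(68, 32), missed_fraction(68, 32))
```

The counts are illustrative only; in practice such numbers come from comparing predictions against a set of experimentally verified genes.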
31Writing the parts list: genomic sequence
annotation
- To improve gene prediction in the human genome,
cDNA libraries were made and screened at the
Sanger Centre to verify or refute the gene
predictions made for the first completed human
chromosomes.
- A large collection of full-length mouse cDNAs
and large human cDNA collections were also
produced.
- The IHGSC has assembled an initial, non-redundant
set of predicted and known human genes called the
integrated gene index (IGI), which is to be
updated as more data accumulate and is made
accessible via the Ensembl site.
- Beyond the question of where exactly the genes
are, it would also be useful to have some idea of
what they do.
- Only around 40% of the IGI-predicted proteins
appeared initially to contain known domains,
implying that we may be able to predict at least
some of their functions. Even in these cases,
the computational predictions must be supported
by lab work before they are regarded as accurate.
32Genomes Browsers
- There are three major well-designed websites
offering users the chance to browse annotations
of the human genome. They offer a graphical
interface to display the results of various
analyses, such as gene predictions and similarity
searches, for draft and finished genomic
sequence.
- They allow rapid, intuitive comparisons between
the features predicted by different programs.
- One can see at once where an exon prediction
overlaps with interspersed repeats or a
single-nucleotide polymorphism (SNP).
- There are differences between the three
browsers.
33Ensembl
- The Ensembl database was the first to provide a
window on the draft genome and started by
curating 'confirmed' genes that were
computationally predicted and also supported by a
significant match to one or more expressed
sequences.
- It also uses gene structures from public sequence
database entries (many of which are
experimentally verified), so the total set of
Ensembl genes should be a more accurate
reflection of reality than computational
predictions alone.
- Other features include repetitive sequence,
cytological bands, genetic markers, CpG island
predictions, tRNA gene predictions, UniGene
(expressed sequence) clusters, SNPs from the
dbSNP database, disease genes found in the draft
genome (identified by the Online Mendelian
Inheritance in Man database, OMIM), and regions
of homology to mouse draft genomic sequences.
- Ensembl uses the UCSC draft sequence assemblies
as its starting point, so its description of a
region can only be as accurate as those
assemblies allow. Gaps and misassemblies in the
genomic sequence could lead to Ensembl missing or
wrongly positioning genes and other features.
- Misassemblies are expected to arise most
frequently among small BAC fragments, but more
broadly the BAC clones mapped to a given region
by UCSC (on the basis of FPC data, FISH data, STS
content and sequence overlap) should be very
accurate. Thus, Ensembl annotation may not be
accurate on the finest, local scales but should
give a very good idea of what is present in a
larger region, perhaps at the level of megabases.
34UCSC Human Genome Browser
- The UCSC Human Genome Browser (HGB; Figure 2)
bears many similarities to Ensembl:
- it too provides annotation of the UCSC assemblies
- it displays a similar array of features.
- Additional features of HGB not yet found in
Ensembl include:
- Predictions from more than one ab initio
gene-prediction program (programs that predict
coding sequence on the basis of statistical
measures of features such as codon usage,
initiation or polyadenylation signals, rather
than by homology to known genes), and an
indication of regions with significant homology
to other organisms.
- These features can provide useful information
when dealing with gene predictions that are not
well supported by similarity to known mRNA
sequences.
- Detailed description of the genomic sequence
assemblies. Graphical representations of the
fragments making up a region of draft genome can
be displayed, showing the relative size and
sequence quality of each fragment and also
whether any gaps between fragments are bridged by
mRNAs or paired BAC end sequences. One can thus
get an idea of the likely degree of misassembly
in a region. Data retrieval is possible via text,
BLAT searches (a faster, less accurate algorithm
than BLAST) and bulk downloads of annotation or
sequence data.
35NCBI Map Viewer
- Whereas Ensembl and HGB both show annotation of
the UCSC draft genome assemblies, the NCBI Map
Viewer (NMV) displays features present in the
NCBI assemblies. The NMV shows:
- Comparisons between cytogenetic, genetic and
radiation hybrid maps in parallel with NCBI draft
and finished sequence contigs. The locations of
genes, STSs, and SNPs are indicated on the contig
sequences. The NCBI approach to gene prediction
is more conservative than those in Ensembl and
HGB.
- No ab initio gene-prediction program is used;
instead, known genes, mRNAs and ESTs are
aligned to genomic sequence using a program
called Acembly. The program also attempts to give
alternative splice variants of genes where its
alignments suggest them. All annotated genes are
connected to NCBI LocusLink, which provides links
to associated information such as related
sequence accession numbers, expression data,
known phenotypes, polymorphisms, and so on.
36NCBI Map Viewer
- In spite of difficulties with the quality of
genomic sequence assemblies, the three browsers
remain extremely useful tools for the cautious
biologist. They undoubtedly indicate the presence
of most of the coding sequence in a given
fragment of genomic sequence and indicate its
location in the genome as determined by the best
available mapping data.
- Most aspects of the analysis carried out are the
subjects of active research, and improvements in
performance resulting from the inclusion of new
sequence data and algorithms will be ongoing;
annotations are constantly subject to change.
37End Product
- The NCBI annotation project provides sequences and
resource support via AceView, LocusLink and
Map Viewer.
- RefSeq provides accession prefixes for each
type of sequence:
- XM_XXX: mRNA
- XR_XXXX: non-coding transcript
- XP_XXX: protein, model reference sequence
- The equivalent accession numbers on the NCBI
contigs server are prefixed NT_ or NW_; these may
be generated from incomplete data with mistakes.
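The prefix convention listed above lends itself to a simple lookup. The Python sketch below covers only the prefixes named on this slide; the function name and the example accession numbers are hypothetical.

```python
# Prefixes as listed on the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "XM_": "mRNA (model)",
    "XR_": "non-coding transcript (model)",
    "XP_": "protein (model reference sequence)",
    "NT_": "genomic contig",
    "NW_": "genomic contig",
}


def classify_accession(accession):
    """Return the sequence type implied by the accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# Made-up accession numbers, for illustration only
for acc in ["XM_000001", "XP_000002", "NT_000003", "AB000004"]:
    print(acc, "->", classify_accession(acc))
```

A lookup like this is handy when filtering a mixed list of accessions, for example to keep only protein records before running a similarity search.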
38NCBI Assembly Pipeline
43Useful Websites
- NCBI - http://www.ncbi.nlm.nih.gov
- Ensembl - http://www.ensembl.org/
- UCSC - http://genome.ucsc.edu/