Genome Sequence Acquisition and Analysis Lecture3 - PowerPoint PPT Presentation

Slides: 44
Provided by: freder8
1
Genome Sequence Acquisition and Analysis: Lecture 3
2
Data Acquisition
  • Sequencing the Whole Genome
  • Sequencing technology
  • Polymerase chain reaction leads to cycle
    sequencing.
  • Capillary electrophoresis
  • Sequencing strategies for mapping whole genomes
  • Hierarchical Shotgun Sequencing
  • Shotgun Sequencing
  • Assembly strategies for the first draft of the
    human genome
  • How do we make sense of all these bases?

3
Reading
  • Baxevanis, Chapters II, III, and XIII

4
Automated DNA sequencing
  • DNA polymerase copies a strand of DNA.
  • Incorporation of a dye terminator stops DNA
    synthesis. The resulting fragments of different
    lengths are run on a gel, with a single lane
    containing all four base terminators.
  • Each terminator is labeled with a different dye
    depending on whether it is A, G, C, or T.
  • The sequence is read by a computer, which
    generates a sequence trace and translates the
    fluorescence signal into DNA sequence.
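
The readout described in the bullets above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the base-calling software of any real sequencer: the dye colours in DYE_FOR_BASE and both function names are invented for the example, and signal noise is ignored.

```python
import random

# Dye colours are illustrative assumptions, not taken from the slides.
DYE_FOR_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}


def terminated_fragments(new_strand):
    """Return one (length, dye) pair per possible termination point.

    A fragment of length n ends in the dye terminator that matches
    the n-th base of the newly synthesised strand.
    """
    return [(i + 1, DYE_FOR_BASE[base]) for i, base in enumerate(new_strand)]


def read_sequence(fragments):
    """Sort fragments by length, as electrophoresis does, and read the dyes."""
    return "".join(BASE_FOR_DYE[dye] for _, dye in sorted(fragments))


if __name__ == "__main__":
    frags = terminated_fragments("GATTACA")
    random.shuffle(frags)        # fragments are mixed in the reaction tube
    print(read_sequence(frags))  # GATTACA
```

Because every fragment length occurs once, sorting by length is enough to recover the order of the bases from the dye at each fragment's end.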

5
Thermocycler for PCR and sequencing
6
Automated cycling DNA sequencing
Terminator Dye
  • With a thermostable DNA polymerase, the
    polymerase chain reaction method can be used for
    sequencing. The advantage is that sequence can be
    generated from a very small amount of DNA, since
    the template is amplified during the process.

Polymerase
55 °C: annealing of the primer to the DNA
72 °C: DNA synthesis of fragments of different
lengths, each ending in a different dye terminator
95 °C: dissociation of the duplex DNA;
amplification of the template
7
Capillary Electrophoresis
Electrophoresis is a powerful method that was
developed long before molecular biology came into
existence. The principle behind electrophoresis
is that charged molecules will migrate toward the
opposite pole and separate from each other based
on physical characteristics. For example,
SDS-PAGE separates proteins according to their
molecular weights. Traditional electrophoresis had
two limiting factors: 1) molecules could be
detected only after the electrophoretic separation
was complete, and 2) only low voltages could be
used, to prevent heat damage to the samples.
Capillary electrophoresis solved both of these
problems. Because the capillary tube is narrow
(25-100 µm in diameter) and has a high
surface-to-volume ratio, it radiates heat readily
and the samples do not overheat. The migrating
molecules are detected by shining a light source
through a portion of the tubing and detecting the
light that emerges on the other side (Figure 1).
Scanner
8
The Mega-sequencer
9
The Mega-sequencer
10
Data acquisition
11
Mapping strategies
12
Definitions
  • BAC: Bacterial Artificial Chromosomes are used to
    clone large fragments of DNA, up to 150 kb, and
    can replicate in E. coli. Their inserts are
    shorter than those of YAC vectors. They are the
    vectors most used today for hierarchical
    sequencing.
  • YAC: Yeast Artificial Chromosomes are used to
    clone large fragments of DNA, from 150 kb to
    1.5 Mb, for sequencing. The problem is that the
    inserts are sometimes unstable and rearrangement
    of the DNA can occur.
  • STS: To allow better assembly of the fragments
    obtained from sequencing a genome, sequence-tagged
    sites were mapped all over the human genome. They
    are short, unique sequences of DNA spread along
    the chromosomes, each defined by a pair of primers
    that amplifies only one segment of the chromosome.
  • EST: When genomes started to be annotated (that
    is, when the coding regions of the genes were
    being defined), several labs began to sequence
    short cDNA transcripts (<500 bp) from either the
    3' or the 5' end of the cDNA. These short segments
    were called Expressed Sequence Tags (ESTs). It did
    not matter whether the labs knew which genes the
    tags came from, since the goal was to compile
    every EST. ESTs have been very useful for cloning
    genes and identifying splice variants. If an
    investigator sequenced a piece of DNA that
    contained an EST, they would know that the DNA
    was a coding region. The number of ESTs grew very
    rapidly, and they are now organized in an EST
    database called UniGene.
  • Today many sequencing consortia produce
    full-length cDNAs daily, and these are added to
    the UniGene database. The goal is to make the
    full-length cDNAs of all genes available to the
    research community. These clones can be bought
    (300 USD).
  • FISH: Fluorescence in situ hybridization.
    Chromosomes are isolated, fixed on a glass slide,
    and hybridized to a fluorescent molecular probe.
    This experiment allows markers (STS, BAC) to be
    positioned on the cytogenetic map of the
    chromosome.
  • Contigs: BAC and YAC clones whose inserts are
    contiguous, overlapping fragments of DNA.

13
BACs
  • Large insert size
  • No Chimeras
  • Stable in host cells

BAC ends
14
Chromosome map (cytogenetic map): centromere,
distant markers, closely linked markers
BAC fragment collection (physical map): closely
linked markers are mostly found on the same
fragments; distant markers are rarely found on the
same fragments
The Phred and Phrap suite of programs (Phil Green)
is used to assemble the DNA fragments and assess
the quality of the assembly (Baxevanis,
Chapter XIII).
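
The observation that closely linked markers are mostly found on the same fragments can be turned into a simple count. The sketch below is only an illustration: the clone names, the STS names, and the function marker_cooccurrence are hypothetical, and real physical-mapping software uses more careful statistics than a raw co-occurrence count.

```python
from collections import Counter
from itertools import combinations


def marker_cooccurrence(bac_markers):
    """Count, for every marker pair, how many clones contain both markers."""
    counts = Counter()
    for markers in bac_markers.values():
        for pair in combinations(sorted(markers), 2):
            counts[pair] += 1
    return counts


# Hypothetical clones and STS names, invented for illustration.
bac_markers = {
    "BAC1": {"sts1", "sts2"},
    "BAC2": {"sts1", "sts2", "sts3"},
    "BAC3": {"sts3", "sts4"},
}

counts = marker_cooccurrence(bac_markers)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs that share many clones (here sts1 and sts2) are candidates for being close together, while pairs that never share a clone (sts1 and sts4) are likely to be farther apart.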
15
Feature annotation: Fluorescence in Situ
Hybridization (FISH)
  • When a particular marker (STS, BAC) is positioned
    on the cytogenetic map of a chromosome by FISH,
    all overlapping DNA fragments (BACs) that contain
    the marker can ultimately be positioned on the
    chromosome to build the physical map.

An example of fluorescence in situ hybridization
using a BAC subsequence as a fluorescent probe
16
Physical Map assembly
  • Library of BACs containing genome fragments
  • STS linkage to position fragments
  • Physical Map assembly
  • Fragment BACs and assemble sequenced contigs
  • Align sequenced fragments to physical map by
    Phred and Phrap.

17
Hierarchical shotgun sequencing
The method preferred by the Human Genome Project
is the hierarchical shotgun sequencing method. In
this approach, genomic DNA is cut into pieces of
about 150 kb and inserted into BAC vectors,
transformed into E. coli where they are
replicated and stored. The BAC inserts are
isolated and mapped to determine the order of
each cloned 150 kb fragment. This is referred to
as the Golden Tiling Path. Each BAC fragment in
the Golden Path is fragmented randomly into
smaller pieces and each piece is cloned into a
plasmid and sequenced on both strands. These
sequences are aligned so that identical sequences
are overlapping. These contiguous pieces are then
assembled into finished sequence once each strand
has been sequenced about 4 times to produce 8X
coverage of high quality data.
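
The coverage figure above follows from simple arithmetic: coverage = (number of reads × read length) / target length. The sketch below works this through for one 150 kb BAC. The 500 bp read length is an assumption chosen for illustration (the slides do not give one), and the uncovered-fraction estimate uses the standard Poisson approximation, which assumes reads land uniformly at random.

```python
import math


def reads_needed(target_length, read_length, coverage):
    """Number of reads required to reach the requested average coverage."""
    return math.ceil(coverage * target_length / read_length)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-coverage)


if __name__ == "__main__":
    bac_length = 150_000   # from the text: about 150 kb per BAC insert
    read_length = 500      # assumed read length, not stated in the slides
    print(reads_needed(bac_length, read_length, 8))   # 2400
    print(expected_uncovered_fraction(8))
```

Under these assumptions about 2,400 reads give 8X coverage of one BAC, and only a very small fraction of bases is expected to remain unsequenced by chance.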
18
Shotgun sequencing: Celera method
  • The method developed and preferred by Celera is
    simply called shotgun sequencing. This approach
    was developed and perfected on prokaryotic
    genomes which are smaller in size and contain
    less repetitive DNA. Shotgun sequencing randomly
    shears genomic DNA into small pieces which are
    cloned into plasmids and sequenced on both
    strands, thus eliminating the BAC step from the
    HGP's approach. Once the sequences are obtained,
    they are aligned and assembled into finished
    sequence.
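
The "aligned and assembled" step can be illustrated with a toy greedy assembler that repeatedly merges the two reads with the longest suffix-prefix overlap. This is a minimal sketch, not the assembler used by Celera or the HGP: it requires exact overlaps, ignores sequencing errors and the reverse strand, and the function names and the min_len threshold are assumptions made for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # remaining reads do not overlap: separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
    # ['ATGGCGTACGTTA']
```

A greedy strategy like this one cannot tell two copies of a repeated sequence apart, which is one reason repetitive genomes are harder to assemble from unordered reads.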

19
Which strategy is the best?
  • When the human genome drafts were published in
    2001, in Nature (HGP, academic) and Science
    (Celera), many of the contigs did not overlap,
    leaving gaps in the genome.
  • Many believe that the HGP method is better suited
    to filling gaps between contigs, since the
    fragments are ordered.
  • For example, duplicated genes could be merged
    into a single locus by the Celera method, while
    the HGP method could identify two different loci
    for the duplicated genes.
  • The completed human genome from the HGP was
    released in April 2003, for the 50th anniversary
    of the double helix. Celera was able to update
    its genome using the HGP genome. The human genome
    produced by the HGP consortium did not take
    advantage of the Celera genome, since Celera did
    not share its data, for intellectual property
    reasons.
  • Today the HGP genome is available in public
    databases, while the Celera genome is available
    only by subscription (10,000 USD).

20
Assembling the Human Genome
  • The HGP was not limited to the human genome; it
    also targeted fly, worm, mouse, plants, and yeast.
    By comparing different genomes we should be able
    to learn more about genomes in general and the
    human genome in particular.
  • Big genomes contain repetitive regions from old
    viruses and transposons that have accumulated
    over millennia. Piecing together the full genome
    is easier when markers exist. For this purpose,
    STSs were designed to position the different
    pieces on the genome.
  • Since many regions of the DNA do not code for RNA
    or proteins, ESTs were used to differentiate
    between junk DNA and coding regions.

21
Assembling the Human Genome
  • Assemble the sequence Data
  • Annotate the features
  • Provide a dataset of assembled genomic
    sequence, RNAs, and proteins
  • Create a mapviewer of the genomic sequences

22
Genomics sequence assemblies
  • Public effort. Objective: assemble overlapping
    fragments from different BAC sequences. Two
    organizations:
  • The Human Genome Project Working Draft (also
    known as the 'Golden Path' assemblies) at
    University of California, Santa Cruz (UCSC)
  • The National Center for Biotechnology
    Information (NCBI) Contig Assemblies.

23
Genomics sequence assemblies
  • Public effort: The UCSC strategy begins with the
    human genomic sequences of BAC clones in GenBank
    at a given point in time (a 'freeze' dataset).
    These are ordered and oriented according to the
    appropriate fingerprinting contigs obtained from
    the Washington University Genome Sequencing
    Center (WUGSC).

A typical fingerprint gel containing 121 lanes of
DNA. Every 5th lane is a marker lane with 37
bands to calibrate the gel metric. Each of the 96
data lanes contains a digested BAC, with 30-60
bands.
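
To give a feel for how two fingerprints like those in the gel might be compared, the sketch below counts bands that two clones share within a size tolerance. It is only an illustration: the band sizes, the tolerance value, and the function shared_bands are assumptions, and real fingerprinting software scores overlaps statistically rather than by a raw count.

```python
def shared_bands(bands_a, bands_b, tolerance=5):
    """Count bands of clone A that match an unused band of clone B.

    Two bands match when their sizes differ by at most `tolerance`.
    Each band of clone B can be matched only once.
    """
    remaining = sorted(bands_b)
    shared = 0
    for size in sorted(bands_a):
        for k, other in enumerate(remaining):
            if abs(size - other) <= tolerance:
                shared += 1
                del remaining[k]
                break
    return shared


if __name__ == "__main__":
    # Hypothetical restriction-fragment sizes, invented for illustration.
    clone_a = [1200, 2500, 3100, 4800]
    clone_b = [1203, 3098, 4000, 4801, 6000]
    print(shared_bands(clone_a, clone_b))  # 3
```

Clones that share many bands probably overlap; clones that share none probably do not.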

24
Genomics sequence assemblies
  • Public effort: Within each WUGSC contig, draft
    sequence fragments are assembled into consensus
    'raft' sequences using overlaps between fragments
    and bridging mRNA, expressed sequence tag (EST),
    plasmid and BAC-end-pair sequences. Repeated
    tracts of the letter 'N' are inserted between
    non-overlapping rafts to give a longer consensus
    sequence for each WUGSC contig.
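
The convention of padding non-overlapping rafts with runs of 'N' is easy to show in code. This is a minimal sketch: the gap length of 100 and the function names are assumptions for illustration, since the slides do not say how many N characters are inserted.

```python
import re


def join_rafts(rafts, gap_length=100):
    """Concatenate ordered rafts, separating them with a run of 'N'."""
    return ("N" * gap_length).join(rafts)


def count_gaps(sequence):
    """Count the runs of 'N' in a consensus sequence."""
    return len(re.findall(r"N+", sequence))


if __name__ == "__main__":
    contig = join_rafts(["ACGT", "GGCC", "TTAA"], gap_length=5)
    print(contig)              # ACGTNNNNNGGCCNNNNNTTAA
    print(count_gaps(contig))  # 2
```

Anyone analysing such a contig has to remember that the N runs mark gaps of unknown sequence, not real bases.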

25
Genomics sequence assemblies
  • Public effort: The NCBI approach also begins by
    finding an order for adjacent BACs, but in this
    case the order is derived from BAC sequence
    overlaps, chromosome assignment by fluorescence
    in situ hybridization (FISH), and sequence-tagged
    site (STS) content. The sequence fragments from
    these overlapping BACs are then merged into
    consensus 'meld' sequences. As with the UCSC
    method, these melds are then ordered and oriented
    using ESTs, mRNAs, etc., before being combined
    into a single NCBI genomic sequence contig with
    melds separated by runs of the letter 'N'.
    Because NCBI will often be dealing with fewer
    sequence fragments than UCSC in the construction
    of a given contig (NCBI assembles contigs only
    from overlapping BAC sequences, not from all BAC
    sequences within an FPC contig as UCSC does),
    there should be less opportunity for misassembly
    (erroneously ordering or orienting sequence
    fragments).

26
Genomics sequence assemblies
  • Private effort: Celera Genomics has also published
    a draft assembly for the genome. It is available,
    under a variety of restrictions, only from their
    web site.
  • They used a whole-genome shotgun sequencing
    approach to obtain fragmentary and unmapped
    sequence data, and combined these with
    International Human Genome Sequencing Consortium
    (IHGSC) sequence and mapping data.
  • Celera assembled its own data with those produced
    by the IHGSC using two different assembly
    protocols either with or without mapping
    information. They found that using the public
    mapping data gave the assembly greater sequence
    coverage.
  • The Celera sequences may provide a useful
    resource for plugging gaps in the IHGSC draft
    genome. In the first draft of the human genome,
    up to 0.14 of the Celera sequence was not present
    in the public data set.

27
Genomics sequence assemblies
  • The UCSC effort provided an exhaustive account of
    the shortcomings of their assemblies. Summary
    statistics are given for almost every measurable
    characteristic, including numbers and lengths of
    gaps and the estimated number of misassemblies.
  • The UCSC group has examined the performance of
    their assembly algorithms on artificially
    produced fragmentary sequence (fragmented
    sequences from finished regions of the genome).
  • Their assemblies reproduced the correct ordering
    of fragments around 85% of the time and the
    correct orientation of fragments around 90% of
    the time.
  • The quality of the assemblies degraded rapidly
    where the algorithms had to deal with many small
    fragments, with both correct order and
    orientation being achieved only 50% of the time
    (effectively the same as random assembly) in the
    worst cases.
  • Possible misassemblies were also identified in
    the Celera assembly. Neither Celera nor NCBI
    publishes statistics analogous to UCSC's for
    draft sequence assembly, nor have they published
    estimates of the rate of error involved.
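
The kind of benchmark described above, where a finished region is fragmented and the assembler's output is compared with the known answer, could be scored along the following lines. The two metrics here are assumptions made for illustration; the slides do not say exactly how UCSC defined correct order or correct orientation, and the fragment names are hypothetical.

```python
def orientation_accuracy(truth, assembled):
    """Fraction of fragments placed in the same orientation as the truth."""
    true_orient = dict(truth)
    correct = sum(1 for frag, orient in assembled if true_orient[frag] == orient)
    return correct / len(assembled)


def order_accuracy(truth, assembled):
    """Fraction of adjacent fragment pairs whose relative order is correct."""
    rank = {frag: i for i, (frag, _) in enumerate(truth)}
    pairs = list(zip(assembled, assembled[1:]))
    correct = sum(1 for (a, _), (b, _) in pairs if rank[a] < rank[b])
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical fragments: (name, strand) in their true order.
    truth = [("f1", "+"), ("f2", "+"), ("f3", "-"), ("f4", "+")]
    assembled = [("f1", "+"), ("f3", "-"), ("f2", "-"), ("f4", "+")]
    print(orientation_accuracy(truth, assembled))  # 0.75
    print(order_accuracy(truth, assembled))
```

Scoring against a finished region is what makes figures such as the order and orientation rates quoted above possible, because the correct answer is known in advance.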

28
Genomics Mapping Data
  • WUGSC (Washington University Genome Sequencing
    Center) has produced a physical map of the
    genome using 'fingerprinting' analysis of BACs
    from a library of genomic sequence clones called
    RPCI-11.
  • Overlaps between clones are calculated using
    different algorithms on the basis of clone
    restriction-fragment patterns, or fingerprints.
    In this way, fingerprinting-based contigs
    covering at least 96% of the euchromatic genome
    have been constructed. The first published draft
    genome sequence assemblies contain hundreds of
    thousands of gaps, whereas this physical map
    contains fewer than 1,000.
  • The International Human Genome Mapping Consortium
    (IHGMC) physical map was supplemented with data
    from other sources. Many fully or partially
    sequenced BAC clones have been cytogenetically
    mapped to particular regions using FISH, and this
    cytogenetic location is given in the GenBank
    sequence entry for the clone.
  • The end sequences from many BAC clones, including
    those from libraries other than RPCI-11, are
    available from The Institute for Genomic Research
    (TIGR) Human BAC Ends site.
  • The physical and genetic marker content of a
    sequence of interest can be determined online
    using the electronic-PCR (e-PCR) program at NCBI.
    This is a rapid sequence-search algorithm that
    searches your sequence for occurrences of marker
    sequences in GenBank. If you enter a BAC clone
    accession number, the output consists of an
    ordered list of markers (and the chromosome they
    map to) down the clone sequence.
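
The idea behind an electronic PCR search can be sketched as follows: look for the forward primer, then for the reverse complement of the reverse primer downstream, and accept the hit if the product size is close to what the marker predicts. This is a simplified illustration, not the NCBI e-PCR program or its interface: it allows only exact primer matches, and the function names, the margin parameter, and the example sequences are all assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def find_sts(sequence, forward, reverse, expected_size, margin=50):
    """Return (start, product_size) for each primer-pair hit of plausible size."""
    hits = []
    rev_site = reverse_complement(reverse)  # reverse primer as seen on this strand
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(rev_site, start + len(forward))
        while end != -1:
            size = end + len(rev_site) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = sequence.find(rev_site, end + 1)
        start = sequence.find(forward, start + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical clone sequence and primers, invented for illustration.
    clone = "TTT" + "ACGTAC" + "G" * 20 + "GAATCC" + "AAA"
    print(find_sts(clone, "ACGTAC", "GGATTC", expected_size=32, margin=5))
    # [(3, 32)]
```

Running such a search for every known marker and sorting the hits by position gives the ordered list of markers along a clone described above.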

29
Possible stages of a BAC clone sequence
  • Phase 1: Unfinished; unordered/unoriented contigs
    with gaps
  • Phase 2: Unfinished; ordered/oriented contigs
    with or without gaps
  • Phase 3: Finished; no gaps

30
Writing the parts list: genomic sequence annotation
  • "We've called the human genome the blueprint, the
    Holy Grail, all sorts of things. It's a parts
    list," said Eric Lander at the Millennium Evening
    at the White House, 14 October 1999, about the
    first draft of the human genome.
  • The IHGSC gene-prediction process was estimated
    to have a sensitivity of 68-85% (so that 15-32%
    of the genes present in the genome were missed)
    and an accuracy of around 79% on average (so that
    21% of the sequence of each predicted gene was
    missed).
  • Since then, gene prediction has been constantly
    improved.
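
The two figures quoted above can be made concrete with a small calculation. The sketch below assumes the usual definition of sensitivity (genes found divided by genes present) and treats the per-gene figure as the fraction of a gene's true bases recovered by the prediction. The function names, gene counts, and exon coordinates are invented for illustration.

```python
def sensitivity(found, missed):
    """Fraction of real genes that the prediction process detected."""
    return found / (found + missed)


def covered_fraction(true_exons, predicted_exons):
    """Fraction of a gene's true bases covered by the predicted exons.

    Exons are (start, end) pairs with the end position excluded.
    """
    true_bases = {p for s, e in true_exons for p in range(s, e)}
    pred_bases = {p for s, e in predicted_exons for p in range(s, e)}
    return len(true_bases & pred_bases) / len(true_bases)


if __name__ == "__main__":
    # 85 of 100 hypothetical genes found, 15 missed.
    print(sensitivity(85, 15))  # 0.85
    # A hypothetical gene whose second exon is predicted too short.
    print(covered_fraction([(0, 100), (200, 300)], [(0, 100), (200, 258)]))  # 0.79
```

The first number says how many genes are found at all; the second says how complete each found gene is, which is why both are reported.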

31
Writing the parts list: genomic sequence annotation
  • To obtain better gene predictions for the human
    genome, cDNA libraries were made and screened at
    the Sanger Centre to verify or refute the gene
    predictions made for the first completed human
    chromosomes.
  • Large collections of full-length mouse cDNAs and
    large human cDNA collections were also produced.
  • The IHGSC has assembled an initial, non-redundant
    set of predicted and known human genes called the
    integrated gene index (IGI), which is to be
    updated as more data accumulate and is made
    accessible via the Ensembl site.
  • Beyond the question of where exactly the genes
    are, it would also be useful to have some idea of
    what they do.
  • Only around 40% of the IGI-predicted proteins
    appeared initially to contain known domains,
    implying that we may be able to predict at least
    some of their functions. Even in these cases,
    the computational predictions must be supported
    by lab work before they are regarded as accurate.

32
Genome Browsers
  • There are three major, well-designed websites
    offering users the chance to browse annotations
    of the human genome. They offer a graphical
    interface to display the results of various
    analyses, such as gene predictions and similarity
    searches, for draft and finished genomic
    sequence.
  • They allow rapid, intuitive comparisons between
    the features predicted by different programs.
  • One can see at once where an exon prediction
    overlaps with interspersed repeats or a
    single-nucleotide polymorphism (SNP).
  • There are differences among the three browsers.

33
Ensembl
  • The Ensembl database was the first to provide a
    window on the draft genome and started by
    curating 'confirmed' genes that were
    computationally predicted and also supported by a
    significant match to one or more expressed
    sequences.
  • Uses gene structures from public sequence
    database entries (many of which are
    experimentally verified), so the total set of
    Ensembl genes should be a more accurate
    reflection of reality than computational
    predictions alone.
  • Other features include repetitive sequence,
    cytological bands, genetic markers, CpG island
    predictions, tRNA gene predictions, UniGene
    (expressed sequence) clusters, SNPs from the
    dbSNP database, disease genes found in the draft
    genome (identified by the Online Mendelian
    Inheritance in Man database, OMIM), and regions
    of homology to mouse draft genomic sequences.
  • Ensembl uses the UCSC draft sequence assemblies
    as its starting point, so its description of a
    region can only be as accurate as those
    assemblies allow. Gaps and misassemblies in the
    genomic sequence could lead to Ensembl missing or
    wrongly positioning genes and other features.
  • Misassemblies are expected to arise most
    frequently among small BAC fragments, but more
    broadly the BAC clones mapped to a given region
    by UCSC (on the basis of FPC data, FISH data, STS
    content and sequence overlap) should be very
    accurate. Thus, Ensembl annotation may not be
    accurate on the finest, local scales but should
    give a very good idea of what is present in a
    larger region, perhaps at the level of megabases.

34
UCSC Human Genome Browser
  • The UCSC Human Genome Browser (HGB; Figure 2)
    bears many similarities to Ensembl:
  • it too provides annotation of the UCSC assemblies
  • it displays a similar array of features.
  • Additional features of HGB not yet found in
    Ensembl include:
  • Predictions from more than one ab initio
    gene-prediction program (programs that predict
    coding sequence on the basis of statistical
    measures of features such as codon usage,
    initiation or polyadenylation signals, rather
    than by homology to known genes), and indications
    of regions with significant homology to other
    organisms.
  • These features can provide useful information
    when dealing with gene predictions that are not
    well supported by similarity to known mRNA
    sequences.
  • Detailed description of the genomic sequence
    assemblies. Graphical representations of the
    fragments making up a region of draft genome can
    be displayed, showing the relative size and
    sequence quality of each fragment and also
    whether any gaps between fragments are bridged by
    mRNAs or paired BAC end sequences. One can thus
    get an idea of the likely degree of misassembly
    in a region. Data retrieval is possible via text,
    BLAT searches (a faster, less accurate algorithm
    than BLAST) and bulk downloads of annotation or
    sequence data.

35
NCBI Map Viewer
  • Whereas Ensembl and HGB both show annotation of
    the UCSC draft genome assemblies, the NCBI Map
    Viewer (NMV) displays features present in the
    NCBI assemblies. The NMV shows:
  • Comparisons between cytogenetic, genetic and
    radiation hybrid maps in parallel with NCBI draft
    and finished sequence contigs. The locations of
    genes, STSs, and SNPs are indicated on the contig
    sequences. The NCBI approach to gene prediction
    is more conservative than those in Ensembl and
    HGB.
  • No ab initio gene-prediction program is used;
    instead, known genes, mRNAs and ESTs are
    aligned to genomic sequence using a program
    called Acembly. The program also attempts to give
    alternative splice variants of genes where its
    alignments suggest them. All annotated genes are
    connected to NCBI LocusLink, which provides links
    to associated information such as related
    sequence accession numbers, expression data,
    known phenotypes, polymorphisms, and so on.

36
NCBI Map Viewer
  • In spite of difficulties with the quality of
    genomic sequence assemblies, the three browsers
    remain extremely useful tools for the cautious
    biologist. They undoubtedly indicate the presence
    of most of the coding sequence in a given
    fragment of genomic sequence and indicate its
    location in the genome as determined by the best
    available mapping data.
  • Most aspects of the analysis carried out are the
    subjects of active research, and improvements in
    performance resulting from the inclusion of new
    sequence data and algorithms will be ongoing;
    annotations are constantly subject to change.

37
End Product
  • The NCBI annotation project provides sequences
    and supporting resources via AceView, LocusLink,
    and Map Viewer.
  • RefSeq provides accession prefixes for each
    type of sequence:
  • XM_XXX: mRNA
  • XR_XXXX: non-coding transcript
  • XP_XXX: protein, model reference sequence
  • The equivalent accession prefixes on the NCBI
    contigs server are NT_ and NW_; these contigs may
    be generated from incomplete data and may contain
    mistakes.
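
The prefixes listed above lend themselves to a simple lookup, for example when sorting a mixed list of accessions by type. The sketch below covers only the prefixes named on this slide; the function name and the example accession numbers are hypothetical.

```python
# Prefix meanings as listed on the slide above.
REFSEQ_PREFIXES = {
    "XM_": "mRNA",
    "XR_": "non-coding transcript",
    "XP_": "protein",
    "NT_": "genomic contig",
    "NW_": "genomic contig",
}


def classify_accession(accession):
    """Return the sequence type implied by an accession's three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown")


if __name__ == "__main__":
    # Hypothetical accession numbers, invented for illustration.
    for acc in ["XM_000001", "XP_000002", "NT_000003", "AB000004"]:
        print(acc, classify_accession(acc))
```

Anything with a prefix outside this short table is reported as unknown rather than guessed.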

38
NCBI Assembly Pipeline
43
Useful Websites
  • NCBI - http://www.ncbi.nlm.nih.gov
  • Ensembl - http://www.ensembl.org/
  • UCSC - http://genome.ucsc.edu/