Automated sequencing machines, - PowerPoint PPT Presentation

About This Presentation
Title:

Automated sequencing machines,

Description:

Automated sequencing machines, particularly those made by PE Applied Biosystems, use 4 colors, so they can read all 4 bases at once.

Slides: 64
Provided by: Pev99

Transcript and Presenter's Notes



1
  • Automated sequencing machines,
  • particularly those made by PE Applied
    Biosystems, use 4 colors, so they can read all 4
    bases at once.

2
All the Genes?
  • Any human gene can now be found in the genome by
    similarity searching with over 95% certainty.
  • However, the sequence still has many gaps:
  • it is unlikely that an uninterrupted genomic
    segment will be found for any given gene
  • pseudogenes still can't be identified with
    certainty
  • This will improve as more sequence data
    accumulates.

3
Finding Genes in Genome Sequence Is Not Easy
  • About 2% of human DNA encodes functional genes.
  • Genes are interspersed among long stretches of
    non-coding DNA.
  • Repeats, pseudogenes, and introns confound
    matters.

4
Impact on Bioinformatics
  • Genomics produces high-throughput, high-quality
    data, and bioinformatics provides the analysis
    and interpretation of these massive data sets.
  • It is impossible to separate genomics laboratory
    technologies from the computational tools
    required for data analysis.

5
Six basic questions about genomes
1. How is a genome sequenced?
2. When is the project finished?
3. Sequence one individual or many?
4. What information is in the DNA?
5. How many genes are in the genome?
6. How can whole genomes be compared?
6
1. Genome projects: sequencing strategies
Hierarchical shotgun method: Assemble contigs from
various chromosomes, then sequence and assemble
them.
Contig: A set of overlapping clones or sequences
from which a sequence can be obtained. The
sequence may be draft or finished. A contig is
thus a chromosome map showing the locations of
those regions of a chromosome where contiguous
DNA segments overlap. Contig maps are important
because they make it possible to study a
complete, and often large, segment of the genome
by examining a series of overlapping clones,
which together provide an unbroken succession of
information about that region.
Scaffold: An ordered set of contigs placed on a
chromosome.
Shotgun: An approach used to decode an organism's
genome by shredding it into smaller fragments of
DNA which can be sequenced individually. The
sequences of these fragments are then ordered,
based on overlaps between their sequences, and
finally reassembled into the complete sequence.
The 'whole genome shotgun' method is applied to
the entire genome all at once, while the
'hierarchical shotgun' method is applied to
large, overlapping DNA fragments of known
location in the genome.
http://www.genome.gov/glossary.cfm
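The idea of ordering fragments by their overlaps and merging them into contigs can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and the sequence contains no repeats. The function names `overlap` and `greedy_assemble` are made up for this example and do not belong to ARACHNE or any other assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps: the remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Real assemblers must also handle sequencing errors, reverse-complement reads, and repeats, which is where the paired forward-reverse reads of the whole genome shotgun method help.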
7
3. Whole Genome Shotgun Sequencing
genome
forward-reverse linked reads
8
ARACHNE Whole Genome Shotgun Assembly
http://www-genome.wi.mit.edu/wga/
9
2. When is the project finished?
Get five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously
sequenced to a high quality standard, with an
error rate of 0.01%. There are usually no gaps in
the sequence.
Draft sequence: clone sequences may contain
several regions separated by gaps. The true order
and orientation of the pieces may not be known.
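To see why five- to ten-fold coverage is the target, a rough calculation helps. The sketch below uses the idealized Lander-Waterman approximation, which is not mentioned on the slide and is assumed here: with reads placed uniformly at random, coverage is c = N·L/G and the expected fraction of bases never sequenced is about e^(-c). The function names are illustrative.

```python
import math


def coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Fold coverage c = N * L / G."""
    return n_reads * read_len / genome_size


def fraction_unsequenced(c: float) -> float:
    """Expected fraction of bases covered by no read (Poisson model): e^-c."""
    return math.exp(-c)


def reads_needed(target_cov: float, read_len: int, genome_size: int) -> int:
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_cov * genome_size / read_len)


if __name__ == "__main__":
    for c in (1, 5, 10):
        print(f"{c:>2}x coverage: {fraction_unsequenced(c):.5%} of bases unsequenced")
```

Under this model, the fraction of unsequenced bases falls from roughly a third at 1x to well under 1% at 5x. Real projects still leave gaps at such depths because cloning bias and repeats break the uniform-sampling assumption.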
10
(No Transcript)
11
Repetitive DNA sequences: five classes
1. Interspersed repeats (transposon-derived
   repeats) -- 45% of the human genome: LTR, SINE,
   LINE
2. Processed pseudogenes
3. Simple sequence repeats (micro- and
   minisatellites)
   -- ACAAACT, 11 million times in a Drosophila
   -- Human genome has 50,000 CA dinucleotide
      repeats
4. Segmental duplications (about 5% of the human
   genome)
5. Tandem repeats (e.g. telomeres, centromeres)
12
  • LINE and SINE repeats. A LINE (long interspersed
    nuclear element) encodes a reverse transcriptase
    (RT) and perhaps other proteins. Mammalian
    genomes contain an old LINE family, called LINE2,
    which apparently stopped transposing before the
    mammalian radiation, and a younger family, called
    L1 or LINE1, many of which were inserted after
    the mammalian radiation (and are still being
    inserted). A SINE (short interspersed nuclear
    element) generally moves using RT from a LINE.
    Examples include the MIR elements, which
    co-evolved with the LINE2 elements. Since the
    mammalian radiation, each lineage has evolved its
    own SINE family. Primates have Alu elements and
    mice have B1, B2, etc. The process of insertion
    of a LINE or SINE into the genome causes a short
    sequence (7-21 bp for Alus) to be repeated, with
    one copy (in the same orientation) at each end of
    the inserted sequence. Alus have accumulated
    preferentially in GC-rich regions, L1s in GC-poor
    regions.

13
What is the function of nongenic DNA?
  • Hypotheses:
  • Nongenic DNA performs essential functions, such
    as regulation of gene expression.
  • Nongenic DNA is inert, genetically and
    physiologically. Excess DNA is incidental and is
    called junk DNA.
  • Nongenic DNA is a functional parasite or selfish
    DNA (retrotransposons).
  • Nongenic DNA has a structural function.

14
Classification of DNA
  • FUNCTIONAL (sequences that perform a function)
  • - Coding (translated into protein)
  • - Non-coding (not translated)
  • Transcribed (functions at the RNA level:
    ribosomal subunits)
  • Non-transcribed (functions at the DNA level:
    intron, promoter, enhancer, etc.)
  • NON-FUNCTIONAL (sequences that perform no
    function: junk DNA)

15
Gene-finding algorithms
Homology-based searches (extrinsic): rely on
previously identified genes.
Algorithm-based searches (intrinsic): investigate
nucleotide composition, open reading frames, and
other intrinsic properties of genomic DNA.
16
DNA
RNA
intron
Mature RNA
protein
17
Homology-based searching: compare DNA to
expressed genes (ESTs)
DNA
RNA
intron
RNA
protein
18
DNA
RNA
Algorithm-based searching: compare DNA in
exons (unique codon usage) to introns (unique
splice sites) to noncoding DNA. Identify open
reading frames (ORFs).
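Identifying open reading frames is the simplest intrinsic signal to compute. The sketch below scans all six reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It assumes an unspliced sequence (no introns) and the standard stop codons; `find_orfs` and its `min_codons` parameter are illustrative names, not part of any tool named in these slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs.

    All six frames are scanned. Coordinates are 0-based, end-exclusive,
    and refer to the strand that was scanned (for '-' this is the
    reverse complement). The codon count includes the stop codon.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return results


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

In intron-containing genomes an ORF scan like this only finds exon fragments, which is why the slide pairs it with codon usage and splice-site signals.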
19
(No Transcript)
20
(No Transcript)
21
6. How can whole genomes be compared?
-- Molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or
   protein in one genome against another
-- We looked at TaxPlot and COG for bacterial (and
   for some eukaryotic) genomes
22
Orthologue Paralogue
  • Orthologue - homologous genes in different
    organisms that arose by speciation; they often
    retain the same function.
  • Paralogue - homologous genes in the same organism
    that originated from gene duplication.
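A common practical approximation for picking out candidate orthologues from an all-against-all BLAST of two genomes is the reciprocal best hit rule: gene a and gene b are paired only if each is the other's top-scoring hit. This heuristic is not stated on the slides and is assumed here. The input format (a dictionary of scores keyed by query and subject) and all gene names are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): score}, e.g. BLAST bit scores.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores: genome A genes a1, a2; genome B genes b1, b2.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
    b_vs_a = {("b1", "a1"): 195, ("b2", "a1"): 140, ("b2", "a2"): 80}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this made-up example a2 and b2 are not paired, because b2's best hit is a1. That pattern is what one would expect if a2 and b2 were related through a duplication rather than by speciation alone.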

23
Orthologue Paralogue
Gene A
24
Orthologue Paralogue
25
Orthologue Paralogue
26
Orthologue Paralogue
Species 1
Species 2
Gene A
Gene B
27
Comparative Genomics Using ACT: The Artemis
Comparison Tool
28
Artemis
  • Artemis is a free DNA sequence viewer and
    annotation tool that allows visualization of
    sequence features and the results of analyses
    within the context of the sequence, and its
    six-frame translation.
  • http://www.sanger.ac.uk/Software/Artemis/

29
Artemis comparison tool ACT
  • Based on Artemis and written in Java.
  • Allows visualisation of two or more sequences
    and a comparison file.
  • The comparison file can be BLASTn or tBLASTx.
  • Retains all the functionality of Artemis.

30
Running ACT
Sequence 1
Sequence 2
BLASTn tBLASTx
MSPcrunch
Reformat
31
DNA sequence
Gene finders
Blastn
Halfwise
Blastx
tRNA scan
RepeatMasker
Repeats
Promoters
Pseudo-Genes
rRNA
Genes
tRNA
Fasta
BlastP
Pfam
Prosite
Psort
SignalP
TMHMM
32
The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
33
DNA in Artemis
AT content
Forward translations
Reverse Translations
DNA and amino acids
34
Gene structure
  • IN TRYPANOSOMATIDS
  • Polycistronic structure
  • Genes occur on a single strand at a time.
  • Inflection points
  • No splicing

35
(No Transcript)
36
Trypanosome gene structure
37
GENE STRUCTURE IN MALARIA
  • Splicing
  • No polycistronic units
  • Can have small exons
  • Low complexity regions

38
AT content
  • Coding regions have higher GC content in AT rich
    genomes
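A plot of AT content along a sequence, like the one Artemis draws, can be approximated with a sliding window. The following is a minimal sketch; the window and step sizes are arbitrary choices for illustration, not Artemis defaults.

```python
def at_content(seq: str, window: int = 100, step: int = 50):
    """Return (window_start, AT_fraction) pairs along the sequence.

    A sequence shorter than the window yields a single, shorter window.
    """
    seq = seq.upper()
    points = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        win = seq[start:start + window]
        if win:
            at = sum(base in "AT" for base in win)
            points.append((start, at / len(win)))
    return points


if __name__ == "__main__":
    print(at_content("AAAATTTTGGGGCCCC", window=8, step=8))
```

In an AT-rich genome, dips in this profile are candidate coding regions, in line with the observation on this slide.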

39
AT content
40
CODON USAGE
  • Codon bias differs between organisms.
  • Base composition is constrained in coding regions
    but not in non-coding regions.
  • The codon usage of any particular gene can
    influence its expression.

41
Codon usage
  • All organisms have a preferred set of codons.
  • Fraction of each valine codon:
      Codon   Malaria   Trypanosoma
      GUU     0.41      0.28
      GUC     0.06      0.19
      GUA     0.42      0.14
      GUG     0.11      0.39
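The values on this slide are fractions within one synonymous family: the four valine codons sum to 1.0 in each organism. The sketch below computes such fractions from a coding sequence. It assumes the sequence is in frame from its first base; the function names and the example sequence are invented for illustration.

```python
from collections import Counter

# DNA spellings of the valine codons GUU, GUC, GUA, GUG.
VALINE_CODONS = ("GTT", "GTC", "GTA", "GTG")


def codon_counts(cds: str) -> Counter:
    """Count codons in an in-frame coding sequence (DNA or RNA letters)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def synonymous_fractions(cds: str, family=VALINE_CODONS):
    """Fraction of each codon within one synonymous family."""
    counts = codon_counts(cds)
    total = sum(counts[codon] for codon in family)
    if total == 0:
        return {codon: 0.0 for codon in family}
    return {codon: counts[codon] / total for codon in family}


if __name__ == "__main__":
    # Made-up coding sequence with four valine codons.
    print(synonymous_fractions("ATGGTTGTAGTTGTGTAA"))
```

Stable estimates like those in the table need counts pooled over many genes; a single short gene gives very noisy fractions.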

42
Codon Usage
  • http://www.kazusa.or.jp/codon/

43
Codon Usage in Artemis
Forward frames
Reverse frames
44
GC frame plot
  • Plots the third position GC content of each frame
    of a DNA sequence.
  • In coding DNA the GC content of the 3rd base is
    often higher.
  • A good predictor of coding regions in malaria
    and trypanosomes.
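The quantity behind a GC frame plot can be sketched as follows: for each window, take every third base starting at the third position of each forward frame and measure its GC fraction. This is an illustrative reimplementation, not the Artemis code, and the default window and step sizes are assumptions. The step should be a multiple of 3 so that frame numbering stays consistent from one window to the next.

```python
def gc3_by_frame(seq: str, window: int = 120, step: int = 30):
    """Return (window_start, (gc3_f0, gc3_f1, gc3_f2)) for each window.

    gc3_fN is the GC fraction at third codon positions when codons are
    read in forward frame N relative to the window start.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        fractions = []
        for frame in range(3):
            thirds = win[frame + 2::3]  # third base of each codon in this frame
            gc = sum(base in "GC" for base in thirds)
            fractions.append(gc / len(thirds) if thirds else 0.0)
        rows.append((start, tuple(fractions)))
    return rows


if __name__ == "__main__":
    # Toy sequence in which every codon of frame 0 ends in G.
    print(gc3_by_frame("AAG" * 10, window=30, step=3))
```

A frame whose value stands clearly above the other two over a stretch of sequence is the likely coding frame there. The reverse strand can be handled by running the same function on the reverse complement.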

45
(No Transcript)
46
Genefinding programs
  • Genefinding software packages use hidden Markov
    models.
  • Predict coding, intergenic and intron sequences
  • Need to be trained on a specific organism.
  • Never perfect!

47
Phat
Cawley et al. (2001) Mol. Bio. Para. 118 p167
http://www.stat.berkeley.edu/users/scawley/Phat/
  • Based on a generalised hidden Markov model (GHMM)
  • Free, easily installed and run.
  • Good at predicting multi-exon genes, but in some
    cases misses genes altogether and also
    over-predicts.

48
What is an HMM?
  • A statistical model that represents a gene.
  • Similar to a weight matrix, but one that can
    recognise gaps and treat them in a systematic way.
  • Has different states that represent introns,
    exons and intergenic regions.
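To make the idea of states concrete, here is a deliberately tiny two-state HMM (coding vs. noncoding) decoded with the Viterbi algorithm, which finds the most probable state path for a sequence. Every probability below is invented for illustration: the noncoding state is AT-rich and the coding state is GC-rich. Real gene finders such as Phat and GlimmerM use far richer models trained on a specific organism, and nothing here reflects their actual parameters.

```python
import math

STATES = ("noncoding", "coding")

# Illustrative, untrained parameters (assumptions for this sketch only).
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq: str):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    # Best log-probability of any path ending in each state so far.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    for base, state in zip("ATATATGCGCGCGC", viterbi("ATATATGCGCGCGC")):
        print(base, state)
```

The transition probabilities discourage frequent switching, so the model labels runs of bases rather than classifying each base on its own. That run-level behavior is what distinguishes an HMM from a plain weight matrix.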

49
GlimmerM
Salzberg et al. (1999) Genomics 59 24-31
  • Adaptation of the prokaryotic genefinder Glimmer.
  • Delcher et al. (1999) NAR 2 4363-4641
  • Based on an interpolated HMM (IHMM).
  • Uses only short chains of bases (Markov chains)
    to generate probabilities.
  • Trained identically to Phat.

50
GlimmerM
  • Under-predicts splicing.
  • Hardly ever misses a gene completely.
  • Does over-predict.
  • Free with licence.

51
Homology Data
  • Coding regions are more conserved than non coding
    regions due to selective pressure.
  • Comparing all possible translations against all
    known proteins will give clues to known genes.
  • Blastx

52
The Gene Prediction Process
ESTs
FASTA
BlastX
DNA SEQUENCE
Good Gene Models
ANALYSIS SOFTWARE
Phat
GlimmerM
DNA Plots
Annotator
53
T. brucei vs L. major (cont.)
54
T. brucei vs T. cruzi
55
L. major has a break in synteny that is conserved
in T. brucei and T. cruzi
T. cruzi chr3
T. brucei chr1
L. major chr12
T. brucei chr6
56
The ACT Display
genome1
Zoom scroll bar
Filter scroll bar
genome2
Genome2
Blast HSPs
genome3
57
ACT
  • Designed for looking at complete bacterial
    genomes.

58
Knowlesi contigs
tblastx
Falciparum Chr 3
tblastx
Yoelii Contigs (TIGR)
59
(No Transcript)
60
AG-FMVZ-USP
61
(No Transcript)
62
(No Transcript)
63
Software
  • www.sanger.ac.uk/Software/Artemis
  • www.sanger.ac.uk/Software/ACT
  • www.genome.nghri.nih.gov/blastall
  • www.cgr.ki.se/cgr/goups/sonnhammer/MSPcrunch.html