Title: Automated sequencing machines
1- Automated sequencing machines, particularly those made by PE Applied Biosystems, use 4 colors, so they can read all 4 bases at once.
2All the Genes?
- Any human gene can now be found in the genome by similarity searching with over 95% certainty.
- However, the sequence still has many gaps:
- it is unlikely that an uninterrupted genomic segment will be found for any gene
- pseudogenes still can't be identified with certainty
- This will improve as more sequence data accumulates.
3Finding Genes in Genome Sequence is Not Easy
- About 2% of human DNA encodes functional genes.
- Genes are interspersed among long stretches of non-coding DNA.
- Repeats, pseudogenes, and introns confound matters.
4Impact on Bioinformatics
- Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.
- It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.
5Six basic questions about genomes
(1) How is a genome sequenced?
(2) When is the project finished?
(3) Sequence one individual or many?
(4) What information is in the DNA?
(5) How many genes are in the genome?
(6) How can whole genomes be compared?
6(1) Genome projects: sequencing strategies
Hierarchical shotgun method: Assemble contigs from various chromosomes, then sequence and assemble them.
Contig: a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished. A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large, segment of the genome by examining a series of overlapping clones, which then provide an unbroken succession of information about that region.
Scaffold: an ordered set of contigs placed on a chromosome.
Shotgun: An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
http://www.genome.gov/glossary.cfm
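The "order by overlaps, then reassemble" step in the shotgun definition can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); the function names and the minimum overlap of 3 bases are illustrative choices, and real assemblers such as ARACHNE are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads drawn from one short made-up sequence.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap stay as separate contigs, which is the gap situation that draft sequence is left with.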
73. Whole Genome Shotgun Sequencing
(Figure labels: genome; forward-reverse linked reads)
8ARACHNE Whole Genome Shotgun Assembly
http://www-genome.wi.mit.edu/wga/
9(2) When is the project finished?
Get five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
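To see why five- to ten-fold coverage is a sensible target, a small calculation helps. The sketch below assumes the idealized model in which reads land uniformly at random on the genome (a Poisson approximation that the slides do not state); under it, the expected fraction of bases never sequenced at coverage c is exp(-c). The genome size and read length in the example are round illustrative numbers.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average fold coverage: total bases sequenced divided by genome length."""
    return num_reads * read_len / genome_len


def expected_uncovered_fraction(c: float) -> float:
    """Expected fraction of bases hit by no read, assuming uniformly random reads."""
    return math.exp(-c)


# Illustrative numbers only: a 3 Gb genome and 500 bp reads.
c = coverage(num_reads=30_000_000, read_len=500, genome_len=3_000_000_000)
print(c)  # fold coverage
for fold in (1, 5, 10):
    print(fold, expected_uncovered_fraction(fold))
```

In practice repeats and cloning bias leave more gaps than this model predicts, so the figures are a lower bound on the missing fraction.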
11 Repetitive DNA sequences: five classes
(1) Interspersed repeats (transposon-derived repeats) -- 45% of the human genome: LTR, SINE, LINE
(2) Processed pseudogenes
(3) Simple sequence repeats -- micro- and minisatellites
-- ACAAACT, 11 million times in a Drosophila species
-- The human genome has 50,000 CA dinucleotide repeats
(4) Segmental duplications (about 5% of the human genome)
(5) Tandem repeats (e.g. telomeres, centromeres)
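Simple sequence repeats such as the CA dinucleotide runs mentioned above are easy to locate with a regular expression. The following is a minimal sketch; the function name, the example sequence, and the threshold of four consecutive units are arbitrary illustrative choices, not a standard definition of a microsatellite.

```python
import re


def find_simple_repeats(seq: str, unit: str = "CA", min_units: int = 4):
    """Return (start, end, n_units) for each run of at least min_units copies of unit."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_units},}}")
    return [
        (m.start(), m.end(), (m.end() - m.start()) // len(unit))
        for m in pattern.finditer(seq.upper())
    ]


# A made-up sequence with one (CA)5 run and one (CA)2 run that is too short to report.
print(find_simple_repeats("GGCACACACACATTCACA"))
```

Coordinates are 0-based with an exclusive end, as in Python slicing.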
12- LINE and SINE repeats. A LINE (long interspersed
nuclear element) encodes a reverse transcriptase
(RT) and perhaps other proteins. Mammalian
genomes contain an old LINE family, called LINE2,
which apparently stopped transposing before the
mammalian radiation, and a younger family, called
L1 or LINE1, many of which were inserted after
the mammalian radiation (and are still being
inserted). A SINE (short interspersed nuclear
element) generally moves using RT from a LINE.
Examples include the MIR elements, which
co-evolved with the LINE2 elements. Since the
mammalian radiation, each lineage has evolved its
own SINE family. Primates have Alu elements and
mice have B1, B2, etc. The process of insertion
of a LINE or SINE into the genome causes a short
sequence (7-21 bp for Alus) to be repeated, with
one copy (in the same orientation) at each end of
the inserted sequence. Alus have accumulated
preferentially in GC-rich regions, L1s in GC-poor
regions.
13What is the function of nongenic DNA?
- Hypotheses:
- Nongenic DNA performs essential functions, such as regulation of gene expression.
- Nongenic DNA is inert, genetically and physiologically. Excess DNA is incidental and is called junk DNA.
- Nongenic DNA is a functional parasite or selfish DNA (retrotransposons).
- Nongenic DNA has a structural function.
14Classification of DNA
- FUNCTIONAL (sequences that perform a function)
- - Coding (translated into proteins)
- - Non-coding (not translated)
- Transcribed (functions at the RNA level: ribosomal subunits)
- Not transcribed (functions at the DNA level: intron, promoter, enhancer, etc.)
- NON-FUNCTIONAL (sequences that perform no function: junk DNA)
15 Gene-finding algorithms
Homology-based searches (extrinsic): rely on previously identified genes.
Algorithm-based searches (intrinsic): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA.
16(Figure labels: DNA, RNA, intron, mature RNA, protein)
17Homology-based searching: compare DNA to expressed genes (ESTs)
(Figure labels: DNA, RNA, intron, RNA, protein)
18Algorithm-based searching: compare DNA in exons (unique codon usage) to introns (unique splice sites) to noncoding DNA. Identify open reading frames (ORFs).
(Figure labels: DNA, RNA)
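Identifying open reading frames is the simplest intrinsic signal to compute. The sketch below scans all six frames for stretches that begin with ATG and end at the first in-frame stop codon. It assumes the standard stop codons and ignores introns entirely, so it is closer to a prokaryotic ORF scan than to a eukaryotic gene finder; the function names and the minimum length are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and relative to the strand scanned.
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```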
21(6) How can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another
-- We looked at TaxPlot and COG for bacterial (and for some eukaryotic) genomes
22Orthologue Paralogue
- Orthologue: homologous genes in different organisms that arose by speciation and usually retain the same function.
- Paralogue: homologous genes in the same organism that originated from gene duplication.
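One common heuristic for proposing orthologue pairs from an all-against-all BLAST of two genomes is the reciprocal best hit: gene a in species 1 and gene b in species 2 are paired when each is the other's top-scoring hit. The slides do not name this method, so treat the sketch below as an illustration only; the gene names and bit scores are invented, and paralogues can still confuse the result.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): bit_score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    ab = best_hits(a_vs_b)
    ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


# Hypothetical bit scores: geneA2 is a paralogue of geneA1 within species 1.
sp1_vs_sp2 = {("geneA1", "geneX"): 250.0, ("geneA2", "geneX"): 180.0, ("geneB", "geneY"): 310.0}
sp2_vs_sp1 = {("geneX", "geneA1"): 245.0, ("geneX", "geneA2"): 175.0, ("geneY", "geneB"): 305.0}
print(reciprocal_best_hits(sp1_vs_sp2, sp2_vs_sp1))
```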
23Orthologue Paralogue
(Figure built up over slides 23-26; labels: Species 1, Species 2, Gene A, Gene B)
27Comparative Genomics Using ACT: The Artemis Comparison Tool
28Artemis
- Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence and its six-frame translation.
- http://www.sanger.ac.uk/Software/Artemis/
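A six-frame translation like the one Artemis displays can be sketched in a few lines. This is not Artemis code; it is a minimal illustration assuming the standard genetic code, with unknown codons shown as "X" and stops as "*". The frame labels "+1" to "-3" are a naming choice made here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; incomplete trailing bases are dropped."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq: str) -> dict:
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```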
29Artemis comparison tool ACT
- Based on Artemis and coded in Java.
- Allows visualisation of two or more sequences and a comparison file.
- The comparison file can be BLASTn or tBLASTx.
- Retains all the functionality of Artemis.
30Running ACT
Sequence 1
Sequence 2
BLASTn tBLASTx
MSPcrunch
Reformat
31DNA sequence
Gene finders
Blastn
Halfwise
Blastx
tRNA scan
RepeatMasker
Repeats
Promoters
Pseudo-Genes
rRNA
Genes
tRNA
Fasta
BlastP
Pfam
Prosite
Psort
SignalP
TMHMM
32The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
33DNA in Artemis
AT content
Forward translations
Reverse Translations
DNA and amino acids
34Gene structure
- IN TRYPANOSOMATIDS
- Polycistronic structure
- Genes occur on a single strand at a time.
- Inflection points
- No splicing
36Trypanosome gene structure
37GENE STRUCTURE IN MALARIA
- Splicing
- No polycistronic units
- Can have small exons
- Low complexity regions
38AT content
- In AT-rich genomes, coding regions have a higher GC content than non-coding regions.
39AT content
40CODON USAGE
- Codon bias is different for each organism.
- Base composition is constrained in coding regions but not in non-coding regions.
- The codon usage of any particular gene can influence its expression.
41Codon usage
- All organisms have a preferred set of codons.
Codon  Malaria  Trypanosoma
GUU    0.41     0.28
GUC    0.06     0.19
GUA    0.42     0.14
GUG    0.11     0.39
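The values in the table sum to 1.0 within each organism, so they read as the fraction of valine codons taken by each of the four synonymous codons. Fractions of that kind can be computed from a coding sequence as sketched below; the function name and the toy sequence are illustrative, and real codon usage tables are built from many genes, not one.

```python
from collections import Counter

VALINE_CODONS = ("GUU", "GUC", "GUA", "GUG")


def valine_codon_fractions(cds: str) -> dict:
    """Fraction of valine codons that use each synonymous codon, read in frame 1."""
    rna = cds.upper().replace("T", "U")
    codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
    counts = Counter(c for c in codons if c in VALINE_CODONS)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in VALINE_CODONS}


# Toy coding sequence: Met, Val, Val, Val, Val, stop.
print(valine_codon_fractions("ATGGTTGTTGTAGTGTAA"))
```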
42Codon Usage
- http://www.kazusa.or.jp/codon/
43Codon Usage in Artemis
Forward frames
Reverse frames
44GC frame plot
- Plots the third-position GC content of each frame of a DNA sequence.
- In coding DNA the GC content of the 3rd base is often higher.
- Good prediction of coding in malaria and trypanosomes.
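The quantity behind a GC frame plot can be computed directly: for each sliding window, take the bases that would be third codon positions in each of the three forward frames and measure their GC fraction. The sketch below is not the Artemis implementation; the window and step sizes are arbitrary defaults, and the reverse strand is left out for brevity.

```python
def gc3_by_frame(seq: str, window: int = 120, step: int = 30):
    """Return [(window_start, (gc3_frame1, gc3_frame2, gc3_frame3)), ...].

    Frames are defined on absolute coordinates, so frame 1 codons start at
    positions 0, 3, 6, ... regardless of where the window begins.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        fractions = []
        for frame in range(3):
            thirds = [seq[i] for i in range(start, start + window) if (i - frame) % 3 == 2]
            gc = sum(base in "GC" for base in thirds)
            fractions.append(gc / len(thirds) if thirds else 0.0)
        profile.append((start, tuple(fractions)))
    return profile


# Artificial sequence whose third position in frame 1 is always G.
print(gc3_by_frame("ATG" * 10, window=30, step=30))
```

A frame whose curve stays well above the other two over a long stretch is a candidate coding frame.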
46Genefinding programs
- Genefinding software packages use hidden Markov models.
- They predict coding, intergenic and intron sequences.
- They need to be trained on a specific organism.
- Never perfect!
47Phat: Cawley et al. (2001) Mol. Bio. Para. 118 p167
http://www.stat.berkeley.edu/users/scawley/Phat/
- Based on a generalised hidden Markov model (GHMM).
- Free, easily installed and run.
- Good at predicting multi-exon genes, but will in some cases miss genes altogether and will over-predict.
48What is an HMM?
- A statistical model that represents a gene.
- Similar to a weight matrix that can recognise gaps and treat them in a systematic way.
- Has different states that represent introns, exons and intergenic regions.
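The idea of hidden states can be made concrete with a toy two-state model decoded by the Viterbi algorithm. Everything numeric below is invented for illustration: a GC-rich "coding" state, an AT-rich "noncoding" state, and a 0.9 probability of staying in the current state. Real gene finders such as Phat use many more states (exons, introns, intergenic regions) and parameters trained on the organism.

```python
import math

STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {  # hypothetical transition probabilities
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {  # hypothetical emission probabilities
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq: str) -> list:
    """Most probable state path for seq (only A, C, G, T), using log probabilities."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace back from the last base
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATATATGCGCGCGCGC"))
```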
49GlimmerM: Salzberg et al. (1999) Genomics 59 24-31
- Adaptation of the prokaryotic genefinder Glimmer.
- Delcher et al. (1999) NAR 2 4363-4641
- Based on an interpolated HMM (IHMM).
- Only uses short chains of bases (Markov chains) to generate probabilities.
- Trained identically to Phat.
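How short chains of bases generate probabilities can be shown with a fixed-order Markov chain: count which base follows each short context in training sequences, then score a new sequence by multiplying the resulting conditional probabilities. Note the simplification: this sketch uses a single fixed order with add-one smoothing, whereas an interpolated model mixes several orders, and none of the code comes from GlimmerM itself.

```python
import math
from collections import defaultdict


def train_markov(seqs, order: int = 2):
    """Count next-base occurrences for every context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def log_prob(seq: str, counts, order: int = 2, alphabet: str = "ACGT") -> float:
    """Log probability of seq given its first `order` bases, with add-one smoothing."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        seen = sum(context.values())
        total += math.log((context.get(seq[i], 0) + 1) / (seen + len(alphabet)))
    return total


# Toy training set standing in for known coding sequence.
model = train_markov(["GCGCGCGCGC"])
print(log_prob("GCGCGC", model), log_prob("ATATAT", model))
```

A gene finder would train one such model on coding sequence and another on non-coding sequence and compare the two scores for each candidate region.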
50GlimmerM
- Under predicts splicing
- Hardly ever misses a gene completely.
- Does over predict.
- Free with licence.
51Homology Data
- Coding regions are more conserved than non-coding regions due to selective pressure.
- Comparing all possible translations against all known proteins will give clues to known genes.
- Blastx
52The Gene Prediction Process
ESTs
FASTA
BlastX
DNA SEQUENCE
Good Gene Models
ANALYSIS SOFTWARE
Phat
GlimmerM
DNA Plots
Annotator
53T. brucei vs L. major (cont.)
54T. brucei vs T. cruzi
55L. major has a break in synteny that is conserved in T. brucei and T. cruzi
T. cruzi chr3
T. brucei chr1
L. major chr12
T. brucei chr6
56The ACT Display
genome1
Zoom scroll bar
Filter scroll bar
genome2
Genome2
Blast HSPs
genome3
57ACT
- Designed for looking at complete bacterial
genomes.
58Knowlesi contigs
tblastx
Falciparum Chr 3
tblastx
Yoelii Contigs (TIGR)
60AG-FMVZ-USP
63Software
- www.sanger.ac.uk/Software/Artemis
- www.sanger.ac.uk/Software/ACT
- www.genome.nghri.nih.gov/blastall
- www.cgr.ki.se/cgr/goups/sonnhammer/MSPcrunch.html