Title: The Genome Access Course Genome Analysis
1The Genome Access Course Genome Analysis
From a 13th century French Bible
2- Genome Sequencing and Assembly
- Genome Analysis
- Genomes on Display
3Poliovirus 1981 7,433 nt First eukaryotic virus
Haemophilus influenzae 1995 1.83 Mbp First cellular organism
Saccharomyces cerevisiae 1996 13 Mbp First eukaryote
Escherichia coli 1997 4.6 Mbp
Caenorhabditis elegans 1998 97 Mbp First multicellular organism
Drosophila melanogaster 2000 137 Mbp
Arabidopsis thaliana 2000 125 Mbp First plant
Homo sapiens 2001 3.2 Gbp First vertebrate
Oryza sativa 2002 430 Mbp
4Hierarchical vs. Whole Genome Shotgun
5Hierarchical Shotgun Sequencing
6Sequencing Software Examples
- Phred base-calling
- Phrap assembler
- Cross_Match
- Consed graphical editor
- AutoFinish finishing
7Sequence Trace
8Phrap Aligns Reads
AGCTNGTATCGTAGCTNGATCGCAA
          GTAGCTAGATCGCTATACGTACNACGT
                 GATCGCTATACGTACCACGT
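The read layout on this slide can be reproduced with a small sketch. It assumes a very simplified model: ungapped overlaps, N treated as "no information", and a per-column majority vote for the consensus. Phrap itself works with base quality values and gapped alignments, so the function names and the scoring below are illustrative assumptions, not Phrap's method.

```python
from collections import Counter


def overlap_score(left, right, offset):
    """Score placing `right` at `offset` within `left`.

    +1 for each agreeing base, -1 for each conflict; N is ignored.
    """
    score = 0
    for a, b in zip(left[offset:], right):
        if a == "N" or b == "N":
            continue
        score += 1 if a == b else -1
    return score


def best_offset(left, right, min_overlap=8):
    """Return the offset with the highest ungapped overlap score."""
    offsets = range(len(left) - min_overlap + 1)
    return max(offsets, key=lambda off: overlap_score(left, right, off))


def consensus(placed):
    """Majority vote per column over (start_column, read) pairs, ignoring N."""
    width = max(start + len(read) for start, read in placed)
    columns = [Counter() for _ in range(width)]
    for start, read in placed:
        for i, base in enumerate(read):
            if base != "N":
                columns[start + i][base] += 1
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)


# The three reads shown on the slide.
reads = [
    "AGCTNGTATCGTAGCTNGATCGCAA",
    "GTAGCTAGATCGCTATACGTACNACGT",
    "GATCGCTATACGTACCACGT",
]
off12 = best_offset(reads[0], reads[1])
off23 = best_offset(reads[1], reads[2])
placed = [(0, reads[0]), (off12, reads[1]), (off12 + off23, reads[2])]
print(consensus(placed))
```

A column covered only by an N stays N in the consensus, which is the kind of position a finisher would target with additional reads.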
10Finishing A BAC
Multiple clone coverage of both strands
Area of single clone coverage
Area of single strand coverage
Alignment
Gap
Gap
12Constructing Maps Using Fingerprints
- Restriction digest (HindIII) of BACs generates a series of fragments
- Determine size of fragments by gel electrophoresis
- Fingerprint comparison determines overlap
- STS markers localize fingerprint clone contigs
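The fingerprint comparison step can be sketched as counting bands that two clones share within a sizing tolerance. The fragment sizes, the 5 bp tolerance, and the function name below are hypothetical; real fingerprinting software also estimates the probability that shared bands arise by chance, which this sketch does not attempt.

```python
def shared_bands(fp_a, fp_b, tolerance=5):
    """Count bands of fp_a that pair with a distinct band of fp_b.

    Two bands pair when their sizes differ by at most `tolerance` bp.
    Both fingerprints are lists of fragment sizes in bp.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


# Hypothetical fragment sizes (bp) for three BAC clones.
bac1 = [1200, 2350, 3100, 4800, 7600]
bac2 = [2348, 3105, 4795, 6100, 9000]
bac3 = [500, 1500, 8800]

print(shared_bands(bac1, bac2))  # many shared bands suggest overlap
print(shared_bands(bac1, bac3))  # no shared bands suggest no overlap
```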
13BAC Fingerprints
14Contig Assembly
- Use FPC map
- Find (potential) overlapping sequences
- Order the fragments
- Generate the sequence
15Directed Sequencing
- Sequence walking
- Use a primer near the end of the contig
- Extend the sequence
- Repeat if the gap is not covered
16Genome Analysis
- Whole genome analysis
- Gene count
- Gene classification
- Repeat content
- Chromosomal duplications
- Multi-Genome Analysis
- Synteny
- Sequence similarity
- Gene classification comparisons
17Gene Count: How do we find genes in genomic sequences?
Map cDNA sequences to a genome.
- Sim4 (http://pbil.univ-lyon1.fr/sim4.html)
- EST2Genome (http://bioweb.pasteur.fr/seqanal/interfaces/est2genome.html)
- Genomewise
18Finding Genes Cont.
Gene Predictions
- Fgenesh (http://www.softberry.com)
- GenemarkHMM (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)
- Genscan (http://genes.mit.edu/GENSCAN.html)
- Grail (http://compbio.ornl.gov/Grail-1.3/)
- Glimmer (http://www.tigr.org/softlab/glimmer/glimmer.html)
Homology
- blastx
19Gene Prediction Types
- Known: cDNA evidence/homology
- Putative: Gene prediction which has homology to known gene
- Unknown: EST matching a gene prediction
- Hypothetical: Gene prediction(s) only
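The four evidence categories above amount to a small decision rule, sketched here. The boolean inputs and the order of precedence (cDNA first, then homology, then EST) are assumptions made for illustration; the slide does not say how conflicting evidence is ranked.

```python
def classify_prediction(has_cdna=False, has_homology=False, has_est=False):
    """Return the evidence category for one gene prediction.

    Assumed precedence: cDNA evidence > homology to a known gene >
    EST match > prediction only.
    """
    if has_cdna:
        return "Known"
    if has_homology:
        return "Putative"
    if has_est:
        return "Unknown"
    return "Hypothetical"


print(classify_prediction(has_homology=True, has_est=True))
print(classify_prediction())
```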
20Gene Classification
- 1) Automated
- Similarity search against an annotated database
- Swiss-Prot
- nr
- Protein Domain search
- i. Pfam (http://www.sanger.ac.uk/Software/Pfam/)
- ii. Prosite
- iii. Prints
- iv. Prodom
- v. Interpro (http://www.ebi.ac.uk/interpro/scan.html)
21Gene Classification Cont.
- 2) Curated
- Similar to the automated approach, but curators usually verify results through literature searches
22Looking for Repeats
- RepeatMasker can find and mask repeats in DNA sequence
- Can be run on cerebus or at http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker
- RepeatMasker is often run on genomic sequences before doing gene predictions
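What "masking" means can be shown with a short sketch that is independent of RepeatMasker itself: given repeat coordinates, the repeat bases are replaced with N (hard masking) or lowercased (soft masking) so that a gene predictor can skip them. The function name, the 0-based half-open interval convention, and the example coordinates are assumptions for illustration.

```python
def mask_repeats(seq, intervals, soft=False):
    """Mask the given (start, end) intervals of seq.

    Intervals are 0-based with end exclusive. Hard masking writes N,
    soft masking lowercases the original bases.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


# Hypothetical sequence with one repeat annotated at positions 2-4.
print(mask_repeats("ACGTACGTAC", [(2, 5)]))
print(mask_repeats("ACGTACGTAC", [(2, 5)], soft=True))
```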
23Comparative Genome Analysis
24MUMmer
- Whole genome alignments
- Compares closely related sequences
- Maximal Unique Matches (MUMs): subsequences that occur exactly once in each sequence and cannot be extended in either direction
- agctcgatGGGCTTTAGACTCTCGATAggcgcagagGCTCGCTAGAATCGCTAGATCac
- agacctaaGGGCTTTAGACTCTCGATAagtctatccGCTCGCTAGAATCGCTAGATCta
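The two example sequences above can be used to illustrate what a maximal unique match is. The sketch below is a brute-force quadratic scan written only for illustration; MUMmer finds the same kind of matches far more efficiently with a suffix tree. The function names and the minimum length of 10 are assumptions.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=10):
    """Return (pos_a, pos_b, match) for matches that are maximal and unique."""
    a, b = a.upper(), b.upper()
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # could be extended to the left, so not maximal
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, match))
    return mums


seq1 = "agctcgatGGGCTTTAGACTCTCGATAggcgcagagGCTCGCTAGAATCGCTAGATCac"
seq2 = "agacctaaGGGCTTTAGACTCTCGATAagtctatccGCTCGCTAGAATCGCTAGATCta"
for pos1, pos2, match in find_mums(seq1, seq2):
    print(pos1, pos2, match)
```

The two matches it reports correspond to the uppercase blocks in the slide's sequences; the lowercase flanks differ and so end each match.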
26Segmentally duplicated regions in the Arabidopsis
genome, detected using MUMmer
Individual chromosomes are depicted as horizontal
grey bars (with chromosome 1 at the top),
centromeres are marked black. Coloured bands
connect corresponding duplicated segments.
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815 (2000).
27Aligning Human Chromosomes using MUMmer
- Regular MUMmer is not sufficiently sensitive to align whole chromosomes, as it was designed to align closely similar sequences
- Modification
- Concatenate all proteins in the order they occur on each chromosome (on either strand)
- The concatenated strings were aligned using MUMmer.
- The resulting matches were clustered to extract all sets of three or more that occur in close proximity on each chromosome; these are potential duplications
Science 291 February 2001.
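The clustering step described on this slide can be sketched as a simple greedy grouping. Each match is taken to be a pair of positions (for example, protein indices along two chromosomes), and neighbouring matches are chained when both coordinates stay within a maximum gap. The gap of 5, the coordinates, and the function name are hypothetical; the published analysis may have used a different clustering rule.

```python
def cluster_matches(matches, max_gap=5, min_size=3):
    """Group (pos_a, pos_b) matches that lie close together on both axes.

    Matches are sorted by pos_a and chained greedily; only clusters with
    at least `min_size` members are returned.
    """
    clusters, current = [], []
    for m in sorted(matches):
        if current:
            prev = current[-1]
            too_far = (m[0] - prev[0] > max_gap
                       or abs(m[1] - prev[1]) > max_gap)
            if too_far:
                if len(current) >= min_size:
                    clusters.append(current)
                current = []
        current.append(m)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Hypothetical match positions between two chromosomes.
matches = [(10, 200), (12, 203), (15, 205), (40, 90), (80, 300), (82, 301)]
print(cluster_matches(matches))
```

Only the first three matches form a set of three or more in close proximity, so only they are reported as a candidate duplication.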
30PIPMaker
- PIP stands for Percent Identity Plot
- Graphical view of similarity between two or more sequences
- http://bio.cse.psu.edu/pipmaker/
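The quantity plotted in a percent identity plot can be computed for a single gap-free aligned segment as follows. This is a minimal sketch of the arithmetic only, with a hypothetical function name and made-up segments; it does not perform the alignment that PIPMaker relies on.

```python
def percent_identity(seg_a, seg_b):
    """Percent of positions with identical bases in two gap-free aligned segments."""
    if len(seg_a) != len(seg_b) or not seg_a:
        raise ValueError("segments must be non-empty and of equal length")
    same = sum(x == y for x, y in zip(seg_a.upper(), seg_b.upper()))
    return 100.0 * same / len(seg_a)


# Hypothetical aligned segments differing at one of ten positions.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
```

In a plot, each such segment would be drawn as a horizontal line at its percent identity, spanning its coordinates in the first sequence.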
31Alignment
PIP Plot
Dot Plot
33Fugu PTEN
[Figure: percent identity plots (50-100% identity; scale 2kb, 4kb, 6kb) for Fugu PTEN compared with H. sapiens, M. musculus, D. melanogaster, C. briggsae, C. elegans, A. thaliana 2, A. thaliana 3, L. major, S. pombe and X. laevis; features labelled A, B and 1-9]
34Vista
- Similar to PipMaker
- http://www-gsd.lbl.gov/vista/
36Genomes on Display
- UCSC Browser
- Ensembl browser
- NCBI Browser
- GMOD
37UCSC Browser
38Ensembl Browser
39NCBI Browser
- http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/hum_srch?chr=hum_chr.inf&query
40GMOD
- Generic Model Organism Database
- Attempt to make a common set of tools for databases/browsers for various species
- www.gmod.org