Title: Identification and Quantification of Polypeptide Similarity Tim
1Bioinformatics
Tim Stevens, November 2008
Department of Biochemistry, University of Cambridge
http://www.bio.cam.ac.uk/tjs23/ tjs23_at_mole.bio.cam.ac.uk
2
- Genome assembly
- Sequence annotation
- Comparative genomics
- Computational methods
- Sequence similarity
- Homology detection
- Protein structure prediction
- Protein-protein interactions
3Genome Sequence to Biological Knowledge
ACATTTGCTTCTGACACAACTGTGTTCACT AGCAACCTCAAACAGACAC
CATGGTGCACC TGACTCCTGAGGAGAAGTCTGCCGTTACTG CCCTGTG
GGGCAAGGTGAACGTGGATGAAG
Whole Genome Sequence
Gene prediction
Open Reading Frames
MVHLTPEEKSAVTALWGKVNVDEVGGEALG RLLVVYPWTQRFFESFGDL
STPDAVMGNPK VKAHGKKVLGAFSDGLAHLDNLKGTFATLS
Regulation
Expression
Expressed Proteins
Homology Detection
Structure Prediction
Functional Knowledge
Genome Context
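The step from genome sequence to open reading frame can be illustrated with the two sequences printed on this slide. The following is a minimal sketch, assuming the standard genetic code and a simple "translate from the first ATG" rule; the function name `translate_from_first_atg` is hypothetical, and real gene prediction must also handle introns, both strands and all reading frames.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_from_first_atg(dna):
    """Translate from the first ATG until a stop codon or the end of the sequence."""
    dna = dna.upper().replace(" ", "")
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for pos in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


# DNA fragment as printed on the slide (the spaces are layout residue).
SLIDE_DNA = (
    "ACATTTGCTTCTGACACAACTGTGTTCACT AGCAACCTCAAACAGACAC"
    "CATGGTGCACC TGACTCCTGAGGAGAAGTCTGCCGTTACTG CCCTGTG"
    "GGGCAAGGTGAACGTGGATGAAG"
)
# Start of the protein sequence as printed on the slide.
SLIDE_PROTEIN_START = "MVHLTPEEKSAVTALWGKVNVDEVGGEALG"

peptide = translate_from_first_atg(SLIDE_DNA)
print(peptide)
print(SLIDE_PROTEIN_START.startswith(peptide))
```

The translated fragment is shorter than the printed protein because the DNA shown on the slide ends before the coding region does.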
4Genome Assembly
5The Genome Milestones
Year | Class | Organism | Size (Mb) | Genes
1976 | RNA Virus | Phage MS2 | 0.003 | 4
1977 | DNA Virus | Phage Phi X174 | 0.005 | 11
1995 | Bacterium | H. influenzae | 2 | 1,700
1996 | Eukaryote | S. cerevisiae | 13 | 6,000
1998 | Metazoan | C. elegans | 100 | 18,000
2000 | Plant | A. thaliana | 157 | 27,000
2003 | Mammal | H. sapiens | 3000 | 30,000
6Value of Genomic Sequence
- Complete information: we can get it right once and for all
- Complete gene catalogue
- An index reference
- A reference archive
- New entry points
- Extend families across species
- Gene disruption and expression studies in model organisms
- Comparative studies: conservation between organisms
- Genome structure and archaeology
- Long range structure: chromosome organisation and function
- Evolutionary studies: fossil genes
- Materials
- Design experiments in advance
- Computational knowledge extraction (i.e. via databases)
7Hierarchical Sequencing Strategy (human)
Chromosome
24
Overlapping BACs
354,510
29,298
15 contigs per clone
1 contig less than one error in 10,000
8Whole Genome Shotgun (WGS) Sequencing Strategy
Reads
Contigs
Read pairs
Scaffold
Order scaffolds on chromosomes using genetic markers and other maps
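The "reads to contigs" step can be sketched as a toy greedy overlap merge. This is only an illustration under strong assumptions (error-free reads, same strand, no repeats, no read pairs); the function names and the `min_len` parameter are hypothetical and do not correspond to any particular assembler.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (at least min_len), else 0."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping "reads" cut from the start of the DNA shown on slide 3.
reference = "ACATTTGCTTCTGACACAACTGTGTTCACTAGC"
reads = [reference[18:33], reference[0:15], reference[9:24]]
print(greedy_assemble(reads))
```

In a real genome, repeated sequence makes overlaps ambiguous, which is one reason read pairs and marker maps are needed to build and order scaffolds.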
9WGS v Hierarchical
- Hierarchical
- High quality continuous sequence
- FPC (FingerPrint Contig) maps provide valuable experimental BAC clone resources
- Slow/more expensive: need to construct FPC map before starting
- WGS
- Fast/cheaper: no initial maps required (though a marker map is ultimately required to anchor assemblies)
- Quality variable, depending on genome (e.g. amount of segmental duplication)
- Hard to finish and close gaps from initial WGS
- In practice
- WGS initially, with hierarchical if high quality is required
- Frequently WGS with targeted clone-based sequencing in regions of interest
- Various hybrid strategies
10Completed Genomes 2008
- Viruses
- 2100
- Prokaryotes
- 21 Archaea
- 200 Eubacteria
- Organelles
- 1900 Mitochondria
- 160 Plastids
- Eukaryotes
- 17 Protists
- 8 Plants
- 15 Fungi
- 23 Animals
Pedersen et al. J Mol Biol. 2000
11Sequencing Status
- Human Genome Sequence
- Finished, at least the first one?
- Nearly all chromosomes now independently published (Nature)
- Global checks reveal very little missing (find all RefSeq mRNAs; check against independent fosmid library)
- Genome maintenance system being set up
- Other Vertebrate Genome Sequences
- Range of quality from mouse (originally whole genome shotgun (WGS), now mostly finished with clone sequencing) to elephant et al. (currently low coverage WGS)
- Automatic annotations available via genome browsers
- Ensembl: http://www.ensembl.org
- Curated annotation available
- Vega: human, mouse, zebrafish
- Model organism databases: FlyBase, WormBase, ZFIN
- Also through Ensembl
12
[Figure: tree of sequenced species against a time axis in millions of years (ticks at 100, 200, 300, 400, 500 and 1000); the branch-point labels are not recoverable from the extracted text. Species and genome sizes shown:]
H. sapiens (human) 3Gb
P. troglodytes (common chimpanzee) 3Gb
M. mulatta (rhesus macaque)
M. musculus (house mouse) 2.6Gb
R. norvegicus (Norway rat) 2.6Gb
C. familiaris (domestic dog) 2.5Gb
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle)
O. aries (domestic sheep)
M. domestica (opossum)
G. gallus (domestic fowl) 1.2Gb
X. laevis (African clawed frog) 3.1Gb
X. tropicalis (tropical clawed frog) 1.7Gb
D. rerio (zebrafish) 1.7Gb
O. latipes (Japanese medaka) 800Mb
T. nigroviridis (freshwater pufferfish) 400Mb
F. rubripes (tiger pufferfish) 400Mb
C. savignyi (sea squirt) 180Mb
C. intestinalis (sea squirt) 180Mb
A. aegypti (yellow fever mosquito)
A. gambiae (African malaria mosquito) 230Mb
D. melanogaster (fruitfly, FLYBASE) 125Mb
A. mellifera (honey bee) 200Mb
I. scapularis (tick)
C. elegans (nematode, WORMBASE) 100Mb
S. cerevisiae (yeast, SGD) 12Mb
40 species currently in Ensembl (includes Elephant 2x, not shown)
Blue: finished assembly available or planned
Red: whole genome assembly available
Green: whole genome assembly due in the next 2 years
13DNA sequencing revolution
- Genome sequencing costs are falling very fast
- ABI 3730: old Sanger technology
- 80kb per run in 800bp reads, 500/Mb
- 454: introduced in 2005
- 100Mb per run in 250bp reads, 100/Mb
- Illumina/Solexa: introduced in 2006; ABI SOLiD in 2007
- 1Gb per run in 35bp reads, 5/Mb
- Substantial informatics requirements
- Raw output of each run is 1 Tb (every 3 days)
- Storage of output after processing (trace format) from 30 Illumina machines: 200 Tb per year
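The per-run figures above imply very different read counts and costs per genome. The following is a rough worked calculation using only the slide's numbers; the currency of the per-Mb cost is not stated in the source, the 3000 Mb genome size is taken from the milestones table, and the helper names are hypothetical.

```python
# name: (bases per run, read length in bp, cost per Mb in the slide's unstated currency)
PLATFORMS = {
    "ABI 3730 (Sanger)": (80_000, 800, 500),
    "454": (100_000_000, 250, 100),
    "Illumina/Solexa, ABI SOLiD": (1_000_000_000, 35, 5),
}
HUMAN_GENOME_MB = 3000  # approximate size, as in the genome milestones table


def reads_per_run(bases_per_run, read_length):
    """Number of whole reads that make up one run."""
    return bases_per_run // read_length


def cost_for_coverage(cost_per_mb, genome_mb, coverage):
    """Raw sequencing cost for the given fold coverage (ignores all other costs)."""
    return cost_per_mb * genome_mb * coverage


for name, (bases, read_len, cost_per_mb) in PLATFORMS.items():
    print(
        name,
        reads_per_run(bases, read_len),
        cost_for_coverage(cost_per_mb, HUMAN_GENOME_MB, 1),
    )
```

Shorter reads need higher coverage to assemble or align reliably, so the 1x costs printed here are a lower bound rather than a realistic project cost.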
14Appetite for Natural Variation Data
- Reference genome
- Sequence variation (collecting SNPs)
- Sanger ExoSeq project: 35,000 novel rare SNPs identified from exons from 14 human chromosomes in 48 Caucasian individuals
- Cancer Genome Project: Greenman et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153 (2007)
- Haplotypes (genotyping from reference SNPs)
- HapMap project
- Wellcome Trust Case Control Consortium (WTCCC): Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661 (2007)
- Copy number variations (CNVs)
- Redon et al. Global variation in copy number in the human genome. Nature 444, 444 (2006)
- Multiple complete genomes of individuals
15Planned use of new technology
- Sequencing 200 individuals as part of proposed 1000 humans consortium (Richard Durbin)
- Ancestral Recombination Graph (ARG) algorithm will allow low coverage sequencing on many individuals, with missing data inferred to high accuracy. Piloted on yeast strains and human chromosome X.
- Opportunities for larger scale cancer resequencing
- Cancer Genomics Consortium meeting in Toronto held to plan resequencing entire cancer cell lines
- Sequence all expressed RNA in a cell in a single step
16Changing healthcare research
- Genome sequencing costs are falling fast
- 2000: 1,000,000,000 per genome
- 2004: 10,000,000 per genome
- 2008: 100,000 per genome
- 2012: ?
- Sequencing expected to displace genotyping as costs drop
- Already 1,000,000-SNP chips, which allow whole genome association studies through genotyping; however, these will not necessarily identify causal variations
- Already seeing companies starting to sell personal genome services (23andMe, deCODE)
- Future human health research will be increasingly driven by the availability of this data
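The three cost figures on this slide fall by a constant factor of 100 every four years. As a purely illustrative calculation (an assumption, not a forecast from the source), the sketch below computes the average yearly fold drop and shows what the "2012: ?" entry would be if the same rate simply continued; the function names are hypothetical.

```python
# Cost per genome by year, as listed on the slide (currency not stated in the source).
COST_PER_GENOME = {2000: 1_000_000_000, 2004: 10_000_000, 2008: 100_000}


def fold_drop_per_year(costs):
    """Average yearly fold reduction between the first and last year."""
    years = sorted(costs)
    first, last = years[0], years[-1]
    return (costs[first] / costs[last]) ** (1 / (last - first))


def extrapolate(costs, year):
    """Cost in a later year, ASSUMING the average yearly fold drop continues unchanged."""
    last = max(costs)
    return costs[last] / fold_drop_per_year(costs) ** (year - last)


print(round(fold_drop_per_year(COST_PER_GENOME), 2))
print(round(extrapolate(COST_PER_GENOME, 2012)))
```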
17Craig Venter Goes Boating
- Trawl the ocean for microbes
- Atlantic via Panama to Pacific
- Untargeted environmental sampling
- 6.3 GB of sequences
- Assembled into contigs/genomes where possible
- Mass sequence comparisons
- Massive sequence diversity
- 60 common ribotypes
- Cladistics of oceanic microbial taxa
- Large sampling of sequence space
- Little genome assembly
- The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007
18Genome sequence annotation
19Methods for gene annotation
- Ab initio gene prediction
- Use general knowledge of gene structure: rules and statistics
- Current best methods are all based on hidden Markov models, which use dynamic programming
- Genscan (Burge)
- FGENES (Solovyev)
- HMMGene (Krogh)
- Similarity based annotation
- Comparison to known proteins, cDNAs, ESTs
- Better, but only possible if you have similar data to compare to
- Best if the sequence comes from this gene (verification, not prediction)
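To show how a hidden Markov model uses dynamic programming, here is a minimal Viterbi sketch with two states, non-coding ("n") and coding ("c"). Every probability below is a made-up illustrative value, and the model is far simpler than Genscan, FGENES or HMMGene, which use many more states and trained parameters.

```python
import math

STATES = ("n", "c")  # n = non-coding, c = coding
# All probabilities are hypothetical, chosen only to make the example readable.
START = {"n": 0.5, "c": 0.5}
TRANS = {"n": {"n": 0.9, "c": 0.1}, "c": {"n": 0.1, "c": 0.9}}
EMIT = {
    "n": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},  # AT-rich background
    "c": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},  # GC-rich "coding" state
}


def viterbi(seq):
    """Return the most probable state path for seq, working in log space."""
    if not seq:
        return ""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best previous state for state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ATATATGCGCGCGCATATAT"))
```

The table of best scores is filled one base at a time, so the cost grows linearly with sequence length rather than with the number of possible state paths.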
20Searching for Genes Bacteria vs Human
[Figure: gene structure diagrams. Bacterial gene: promoter, 5' UTR, coding region, 3' UTR. Human gene, shown as predicted and as real: 5' UTR, interrupted coding region, 3' UTR, polyA site; upstream signals marked "??"]
Bacterial gene: continuous coding region, known signals
Human gene: fragmented coding region, unknown signals, contained in much more DNA
21Searching for Exons
22Genscan (Burge and Karlin, 1998)
- Dramatic improvement over previous methods
- Generalised HMM
- Different parameter sets for different GC content
regions (intron length distribution and exon
stats)
23Performance of ab initio methods
- Can confirm gene structures experimentally by sequencing cDNA
- Current methods are not really good enough
- 75% correct per exon, worse with initial and final exons
- 20% correct per gene
- Easier for simpler organisms, e.g. C. elegans
- Options are to improve methods, or get extra information
- An attractive source of new information is whole genome sequence from a related species, e.g. mouse for man
24GeneWise (Birney)
- GeneWise aligns a protein sequence (or HMM) to
genomic DNA taking into account splicing
information
25
[Figure: diagram labelled with lengths in bp; the labels are not recoverable from the extracted text]
26Other Comparative Approaches
- Procrustes (Gelfand, Mironov and Pevzner, 1996)
- Find possible exons, align and piece together homologue
- Similar sensitivity, lower specificity compared with GeneWise
- GenomeScan (Yeh, Lim and Burge, 2001)
- Extension of Genscan to use protein matches where available to add to the Genscan score for an exon
- Higher sensitivity, especially when match is weak (it always predicts something); lower specificity
27Targeted Genewise UTRs
PMATCH all genome
Genewise
BLAST
Human Protein Seqs: UniProt/TrEMBL/RefSeq
Est2Genome
cDNAs
BLAST
Genewise: phases, no UTRs
Est2genome: UTRs, no phases
Translatable gene with UTRs
28Conservation Helps Gene Structure Prediction
Test on Chromosome 22: 13472 mouse hits, 4978 exons
- Specificity (accuracy)
- Coding
- 79% correct
- 21% wrong
- Non coding
- 85% correct
- 15% wrong
- Sensitivity (coverage)
- Coding
- 1266 out of 2991 exons found (42%) ?
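The sensitivity figure above can be checked directly from the exon counts. A small sketch, assuming the usual gene-prediction definitions (sensitivity is the fraction of real exons that were found; specificity is the fraction of predictions that were correct); the function names are hypothetical.

```python
def sensitivity(found, total_real):
    """Percentage of real (annotated) exons that were found."""
    return 100.0 * found / total_real


def specificity(correct, wrong):
    """Percentage of predictions that were correct, as the term is used in gene prediction."""
    return 100.0 * correct / (correct + wrong)


# Exon counts from the chromosome 22 test on this slide.
print(round(sensitivity(1266, 2991), 1))
# The coding figures are already quoted as 79 correct to 21 wrong.
print(specificity(79, 21))
```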
29Twinscan (Korf et al., 2001)
Fit a conservation sequence alongside the
target sequence
30Alternative Splicing
- Alternative splicing is very prevalent in vertebrates; historically underestimated
- For Human Genome paper: reconstruct full (coding) length transcripts from cDNAs and ESTs on two chromosomes
- Chr 22: 642 transcripts map to 245 genes
- Average 2.6 transcripts per gene
- Two or more transcripts for 145 (90) genes
- Chr 19: 1859 transcripts map to 544 genes
- Average 3.2 transcripts per gene
- 70% of alternatives affect coding sequence
- Compare C. elegans data
- 22% of genes have multiple transcripts, average 1.34 transcripts per gene
31Genome Annotation Strategy
32Annotation process
- Automated analysis
- Repeat detection
- RepeatMasker (Smit), tandem, inverted
- Gene prediction
- Genscan (Burge), FGENESH (Solovyev)
- Database searches
- Initial protein and DNA matches using BLAST
- Refined protein matches using genewise (wise2)
- Refined EST matches using ESTGENOME, spangle
- Pfam annotation using halfwise (wise2)
- Integrate results, display, annotate
- ACEDB, web-based tools (e.g. spangle)
- Investigate gene predictions experimentally
- Submit to EMBL
33Ensembl What do you get?
- Genome Annotation
- Protein coding gene structure
- Consistent with genome, predicted across all vertebrates
- RNA genes (including miRNA)
- Consistent with genome, predicted across mammals
- Additional identifiers per gene
- Affymetrix, EntrezGene, UniProt
- Comparative Genomics
- Genome alignments
- Blastz, Blat, coordinated with UCSC
- Orthologs between genomes
- Protein evolution rates
- dN/dS rates between species
- Variants (SNPs), strains, genotypes
- Functional Genomics datasets
- Infrastructure
- Website, data mining tool, database and data dump
- Portable, extendable, open source system with database, API, website, pipeline
34Beyond Classical ab initio Computational Gene
Prediction
- Ensembl-style automatic gene annotation relies on alignment of supporting evidence: mRNA, EST and protein sequences that have been independently experimentally determined
- Classical ab initio gene prediction partly relies on statistics of coding potentials, derived from a database
- From the point of view of the cellular transcription machinery, genes are just a series of short signals
- Transcription start site
- Translation start site
- 5' and 3' intron splicing signals
- Termination signals
- Short signal sequences are historically difficult to recognise over background noise in large genomes: can we recognise them better with today's machine learning approaches?
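The background-noise problem can be made concrete with a back-of-envelope estimate. The sketch below assumes independent, equally frequent bases (a simplification; real genomes are not uniform) and uses the canonical GT dinucleotide at the 5' end of introns as the example signal; the function names are hypothetical.

```python
def expected_random_hits(motif_length, genome_length):
    """Expected chance occurrences of one fixed motif, ASSUMING uniform independent bases."""
    return genome_length / 4 ** motif_length


def count_hits(sequence, motif):
    """Count (possibly overlapping) occurrences of motif in sequence."""
    return sum(
        1
        for i in range(len(sequence) - len(motif) + 1)
        if sequence.startswith(motif, i)
    )


HUMAN_GENOME_BP = 3_000_000_000  # roughly 3 Gb, as quoted elsewhere in these slides

# Chance GT dinucleotides expected on one strand of a genome of this size.
print(expected_random_hits(2, HUMAN_GENOME_BP))
# Observed hits in a short toy sequence.
print(count_hits("GTAAGTGT", "GT"))
```

With only tens of thousands of genes, the chance matches vastly outnumber true splice donors, which is why a two-base signal is useless on its own and has to be combined with surrounding context.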
35Machine Learning of Promoter Elements
Method | Predictions | Starts found (%) | Accuracy (%)
Eponine | 215 | 53.5 | 73.0
Pro'spector | 278 | 55.5 | 64.0
CpG islands | 306 | 65.8 | 62.0
TATA -6.5 | 39869 | 99.6 | 5.7
TATA -2.6 | 540 | 13.0 | 7.4
36Gene Expression
37Ensembl Regulatory Build
- Assumes punctuated regulatory sites (elements)
- Union of all sites used in any cell/tissue type are assigned starts and ends on the genome
- Element may have a cell/tissue specific annotation
- Build steps
- Define focus elements (DNase, FAIRE, CTCF, TFBS)
- Create functional annotations with overlap elements (histone modifications)
- First build: Ensembl release 45 (June 2007)
- 110,000 elements, 2 Mb of DNA
- 6,000 promoter associated by inherent pattern (DNaseI H3K36me3)
Flicek, Birney et al.
38Microarray Experiments
- Multiplex On-chip binding
- Hybridisation
- Dual fluorescent tag red/green
- Relatively new technology
- Computational issues
- Normalisation
- Clustering
- Significance
- False positive/negative
- Reference spikes
- Relation to test axis (e.g. disease, drug etc.)
- Data standards
- MIAME (Minimum Information About a Microarray
Experiment)
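As one example of the normalisation issue listed above, a common simple approach for dual-channel (red/green) arrays is to take log2 ratios and centre them on their median. This is a minimal sketch, assuming that most genes do not change between the two samples; the function name is hypothetical and the intensities are toy values, not data from the source.

```python
import math
from statistics import median


def normalised_log_ratios(red, green):
    """log2(red/green) per spot, shifted so that the median log ratio is zero."""
    ratios = [math.log2(r / g) for r, g in zip(red, green)]
    centre = median(ratios)
    return [m - centre for m in ratios]


# Toy intensities: the red channel is globally twice as bright (e.g. a dye bias),
# and the third spot shows a real additional four-fold change.
red = [200, 400, 800]
green = [100, 200, 100]
print(normalised_log_ratios(red, green))
```

After centring, the global two-fold dye bias disappears and only the third spot stands out.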
39Microarrays SAM
- Significance Analysis of Microarrays
- Per-gene score: each gene has individual signal/noise
- Better than plain t-tests
- Assumes no underlying model
- Based on repeat consistency
- Low false discovery rate: 12% in radiation test
$$d_i = \frac{\bar{x}_{A,i} - \bar{x}_{B,i}}{s_i + s_0}$$
where $\bar{x}_{A,i}$ is the signal for gene $i$ in condition A, $\bar{x}_{B,i}$ is the signal in condition B, $s_i$ is the standard deviation and $s_0$ is a pseudocount.
Tusher, Tibshirani & Chu, Proc Natl Acad Sci USA. 2001
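The per-gene SAM score (difference between the condition A and condition B signals, divided by the standard deviation plus the pseudocount s0) can be sketched as follows. This assumes that s_i is the pooled standard deviation of a two-sample t-statistic and treats s0 as a user-supplied constant; the full method also chooses s0 from the data and estimates the false discovery rate by permutation, neither of which is shown. The function name is hypothetical.

```python
import math
from statistics import mean


def sam_statistic(a, b, s0):
    """d_i = (mean(a) - mean(b)) / (s_i + s0) for one gene's replicate measurements."""
    mean_a, mean_b = mean(a), mean(b)
    # Pooled sum of squared deviations from each condition's own mean.
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s_i = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_a - mean_b) / (s_i + s0)


# Toy replicate values for one gene in two conditions (not data from the source).
print(sam_statistic([1, 2, 3], [4, 5, 6], s0=1.0))
# A gene with no variance at all: s0 keeps the score finite.
print(sam_statistic([5, 5], [5, 5], s0=0.5))
```

The pseudocount stops genes with very small measured variance from producing very large scores, which is the main difference from a plain t-statistic.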