Identification and Quantification of Polypeptide Similarity Tim - PowerPoint PPT Presentation

1
Bioinformatics
Tim Stevens, November 2008
Department of Biochemistry, University of Cambridge
http://www.bio.cam.ac.uk/tjs23/ tjs23_at_mole.bio.cam.ac.uk
2

Genome assembly
Sequence annotation
Comparative genomics
Computational methods
Sequence similarity
Homology detection
Protein structure prediction
Protein-protein interactions

3
Genome Sequence to Biological Knowledge
ACATTTGCTTCTGACACAACTGTGTTCACT AGCAACCTCAAACAGACAC
CATGGTGCACC TGACTCCTGAGGAGAAGTCTGCCGTTACTG CCCTGTG
GGGCAAGGTGAACGTGGATGAAG
Whole Genome Sequence
Gene prediction
Open Reading Frames
MVHLTPEEKSAVTALWGKVNVDEVGGEALG RLLVVYPWTQRFFESFGDL
STPDAVMGNPK VKAHGKKVLGAFSDGLAHLDNLKGTFATLS
Regulation
Expression
Expressed Proteins
Homology Detection
Structure Prediction
Functional Knowledge
Genome Context
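As a minimal sketch of the step from genome sequence to open reading frame to protein shown on this slide, the Python snippet below translates the slide's DNA fragment starting at its first ATG, using the standard genetic code. The function name `translate_from_first_atg` is an illustrative assumption; real gene prediction must also consider both strands, all reading frames and introns.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate_from_first_atg(dna):
    """Translate from the first ATG until a stop codon or the end of the sequence."""
    dna = dna.upper().replace(" ", "")
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# DNA fragment as printed on the slide (spaces and line breaks removed).
dna = ("ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACAC"
       "CATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTG"
       "GGGCAAGGTGAACGTGGATGAAG")
protein = translate_from_first_atg(dna)
print(protein)
```

The translated fragment agrees with the start of the protein sequence printed on the slide (MVHLTPEEKSAVTALWGKVNVDE...).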
4
Genome Assembly
5
The Genome Milestones
Year  Class      Organism          Size (Mb)  Genes
1976  RNA Virus  Phage MS2         0.003      4
1977  DNA Virus  Phage Phi X174    0.005      11
1995  Bacterium  H. influenzae     2          1,700
1996  Eukaryote  S. cerevisiae     13         6,000
1998  Metazoan   C. elegans        100        18,000
2000  Plant      A. thaliana       157        27,000
2003  Mammal     H. sapiens        3000       30,000
6
Value of Genomic Sequence

Complete information: we can get it right once and for all
Complete gene catalogue
An index reference
A reference archive
New entry points
Extend families across species
Gene disruption and expression studies in model organisms
Comparative studies: conservation between organisms
Genome structure and archaeology
Long-range structure: chromosome organisation and function
Evolutionary studies: fossil genes
Materials
Design experiments in advance
Computational knowledge extraction (i.e. via databases)

7
Hierarchical Sequencing Strategy (human)
Chromosome
24
Overlapping BACs
354,510
29,298
15 contigs per clone
1 contig less than one error in 10,000
8
Whole Genome Shotgun (WGS) Sequencing Strategy
Reads
Contigs
Read pairs
Scaffold
Order scaffolds on chromosomes using genetic markers and other maps
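The step from reads to contigs can be illustrated with a toy greedy overlap merge in Python. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats, and the hypothetical helper names `overlap` and `greedy_assemble`); production assemblers handle sequencing errors, reverse complements, read pairs and repeats, none of which appear here.

```python
def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads taken from the start of the DNA fragment on slide 3.
reads = ["GACACAACTGTG", "ACATTTGCTTCT", "GCTTCTGACACA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Reads that share no sufficiently long overlap stay as separate contigs, which is where read pairs and marker maps come in to order them into scaffolds.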
9
WGS v Hierarchical

Hierarchical
High quality continuous sequence
FPC (FingerPrint Contig) maps provide valuable experimental BAC clone resources
Slow/more expensive: need to construct FPC map before starting
WGS
Fast/cheaper: no initial maps required (though marker map ultimately required to anchor assemblies)
Quality variable, depending on genome (e.g. amount of segmental duplication)
Hard to finish and close gaps from initial WGS
In practice
WGS initially, with hierarchical if high quality required
Frequently WGS with targeted clone-based sequencing in regions of interest
Various hybrid strategies

10
Completed Genomes 2008

Viruses
2100
Prokaryotes
21 Archeans
200 Eubacteria
Organelles
1900 Mitochondria
160 Plastids
Eukaryotes
17 Protists
8 Plants
15 Fungi
23 Animals

Pedersen et al. J Mol Biol. 2000
11
Sequencing Status

Human Genome Sequence
Finished, at least the first one?
Nearly all chromosomes now independently published (Nature)
Global checks reveal very little missing (find all RefSeq mRNAs, check against independent fosmid library)
Genome maintenance system being set up
Other Vertebrate Genome Sequences
Range of quality from Mouse (originally whole genome shotgun (WGS), now mostly finished with clone sequencing) to Elephant et al. (currently low coverage WGS)
Automatic annotations available via genome browsers
Ensembl: http://www.ensembl.org
Curated annotation available
Vega: human, mouse, zebrafish
Model organism databases: FlyBase, WormBase, ZFIN
Also through Ensembl

12
[Figure: phylogenetic tree of sequenced species with genome sizes; time axis in million years (100, 200, 300, 400, 500, 1000). Node ages as extracted, in million years, not assignable to individual branches from this transcript: 5, 23, 91, 92, 41, 45, 170, 74, 83, 310, 65, 360, 20, 450, 197, 550, 140, 70, 25, 990, ?, 200?, 1500?, 250, 340]
H. sapiens (human) 3Gb
P. troglodytes (common chimpanzee) 3Gb
M. mulatta (rhesus macaque)
M. musculus (house mouse) 2.6Gb
R. norvegicus (Norway rat) 2.6Gb
C. familiaris (domestic dog) 2.5Gb
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle)
O. aries (domestic sheep)
M. domestica (opossum)
G. gallus (domestic fowl) 1.2Gb
X. laevis (African clawed frog) 3.1Gb
X. tropicalis (tropical clawed frog) 1.7Gb
D. rerio (zebrafish) 1.7Gb
O. latipes (Japanese medaka) 800Mb
T. nigroviridis (freshwater pufferfish) 400Mb
F. rubripes (tiger pufferfish) 400Mb
C. savignyi (sea squirt) 180Mb
C. intestinalis (sea squirt) 180Mb
A. aegypti (yellow fever mosquito)
A. gambiae (African malaria mosquito) 230Mb
D. melanogaster (fruitfly, FLYBASE) 125Mb
A. mellifera (honey bee) 200Mb
I. scapularis (tick)
C. elegans (nematode, WORMBASE) 100Mb
S. cerevisiae (yeast, SGD) 12Mb
40 species currently in Ensembl (includes Elephant 2x, not shown)
Blue: finished assembly available or planned. Red: whole genome assembly available. Green: whole genome assembly due in the next 2 years.
13
DNA sequencing revolution

Genome sequencing costs are falling very fast
ABI 3730: old Sanger technology
80kb per run in 800bp reads, 500/Mb
454: introduced in 2005
100Mb per run in 250bp reads, 100/Mb
Illumina/Solexa: introduced in 2006; ABI SOLiD in 2007
1Gb per run in 35bp reads, 5/Mb
Substantial informatics requirements
Raw output of each run is 1 Tb (every 3 days)
Storage of output after processing (trace format) from 30 Illumina machines: 200 Tb per year
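To put the per-run yields and per-Mb costs above in perspective, here is a small worked calculation in Python. It assumes a 3000 Mb genome (the human figure from the milestones slide) and a single-fold (1x) pass, which is purely illustrative since real projects need several-fold coverage. The slide does not state the currency, so costs are left as bare numbers; `one_fold_coverage` is a hypothetical helper name.

```python
GENOME_MB = 3000  # human genome size in Mb, from the milestones slide

# Platform: (yield per run in kb, cost per Mb) as quoted on the slide.
PLATFORMS = {
    "ABI 3730 (Sanger)": (80, 500),
    "454": (100_000, 100),
    "Illumina/Solexa": (1_000_000, 5),
}


def one_fold_coverage(genome_mb, yield_kb, cost_per_mb):
    """Runs and cost needed to sequence genome_mb of raw bases once (1x)."""
    genome_kb = genome_mb * 1000
    runs = -(-genome_kb // yield_kb)  # integer ceiling division
    cost = genome_mb * cost_per_mb
    return runs, cost


summary = {name: one_fold_coverage(GENOME_MB, yield_kb, cost)
           for name, (yield_kb, cost) in PLATFORMS.items()}
for name, (runs, cost) in summary.items():
    print(f"{name}: {runs} runs, cost {cost:,} for 1x coverage")
```

Under these assumptions a single pass drops from 37,500 Sanger runs to 3 Illumina/Solexa runs, and the raw cost falls a hundredfold.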

14
Appetite for Natural Variation Data

Reference genome
Sequence variation (collecting SNPs)
Sanger ExoSeq project: 35,000 novel rare SNPs identified from exons from 14 human chromosomes in 48 Caucasian individuals
Cancer Genome Project: Greenman et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153 (2007)
Haplotypes (genotyping from reference SNPs)
HapMap project
Wellcome Trust Case Control Consortium (WTCCC): Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661 (2007)
Copy number variations (CNVs)
Redon et al. Global variation in copy number in the human genome. Nature 444, 444 (2006)
Multiple complete genomes of individuals

15
Planned use of new technology

Sequencing 200 individuals as part of proposed 1000 humans consortium (Richard Durbin)
Ancestral Recombination Graph (ARG) algorithm will allow low coverage sequencing on many individuals, with missing data inferred to high accuracy. Piloted on yeast strains and human chromosome X.
Opportunities for larger scale cancer resequencing
Cancer Genomics Consortium Meeting in Toronto held to plan resequencing of entire cancer cell lines
Sequence all expressed RNA in a cell in a single step

16
Changing healthcare research

Genome sequencing costs are falling fast
2000: 1,000,000,000 per genome
2004: 10,000,000 per genome
2008: 100,000 per genome
2012: ?
Sequencing expected to displace genotyping as costs drop
Already 1,000,000-SNP chips, which allow whole genome association studies through genotyping but will not necessarily identify causal variations
Already seeing companies starting to sell personal genome services (23andMe, deCODE)
Future human health research will be increasingly driven by the availability of these data

17
Craig Venter Goes Boating

Trawl the ocean for microbes
Atlantic via Panama to Pacific
Untargeted environmental sampling
6.3 GB of sequences
Assembled into contigs/genomes where possible
Mass sequence comparisons
Massive sequence diversity
60 common ribotypes
Cladistics of oceanic microbial taxa
Large sampling of sequence space
Little genome assembly

The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007

18
Genome sequence annotation
19
Methods for gene annotation

Ab initio gene prediction
Use general knowledge of gene structure: rules and statistics
Current best methods are all based on hidden Markov models, which use dynamic programming
Genscan (Burge)
FGENES (Solovyev)
HMMGene (Krogh)
Similarity based annotation
Comparison to known proteins, cDNAs, ESTs
Better, but only possible if you have similar data to compare to
Best if the sequence comes from this gene (verification, not prediction)
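The link between hidden Markov models and dynamic programming mentioned above can be sketched with the Viterbi algorithm on a toy two-state model. Everything in the model is a hypothetical assumption for illustration: the states ("N" for AT-rich non-coding, "C" for GC-rich coding), the probabilities and the test sequence. Gene finders such as Genscan use far richer generalised HMMs with many states and length distributions.

```python
import math

STATES = ("N", "C")  # N: non-coding (AT-rich), C: coding (GC-rich); toy labels
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}


def viterbi(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for seq, as a string of state labels."""
    # best[s] = log-probability of the best path ending in state s at this position
    best = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    pointers = []  # pointers[k][s] = best previous state for s at position k + 1
    for base in seq[1:]:
        new_best, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            ptr[s] = prev
            new_best[s] = (best[prev] + math.log(trans[prev][s])
                           + math.log(emit[s][base]))
        pointers.append(ptr)
        best = new_best
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


seq = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
path = viterbi(seq)
print(seq)
print(path)
```

Each position only needs the best scores from the previous position, which is what makes the search over all possible labelings tractable.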

20
Searching for Genes: Bacteria vs Human
Bacterial gene: continuous coding region, known signals
Human gene: fragmented coding region, unknown signals, contained in much more DNA
[Figure: gene structure diagrams showing promoter, 5' UTR, coding region, 3' UTR and polyA site for a bacterial gene, and for predicted versus real human gene structures]
21
Searching for Exons
22
Genscan (Burge and Karlin, 1998)

Dramatic improvement over previous methods
Generalised HMM
Different parameter sets for different GC content regions (intron length distribution and exon stats)

23
Performance of ab initio methods

Can confirm gene structures experimentally by sequencing cDNA
Current methods are not really good enough
75% correct per exon, worse with initial and final exons
20% correct per gene
Easier for simpler organisms, e.g. C. elegans
Options are to improve methods, or get extra information
An attractive source of new information is whole genome sequence from a related species, e.g. mouse for man

24
GeneWise (Birney)

GeneWise aligns a protein sequence (or HMM) to genomic DNA, taking into account splicing information

25
[Figure: labels only; base-pair offsets shown are -20bp, 3bp, -6bp, -8bp, -66bp, -1bp, 0bp, -3bp, -1bp, 2bp, 1bp, 1bp]
26
Other Comparative Approaches

Procrustes (Gelfand, Mironov and Pevzner, 1996)
Find possible exons, align and piece together homologue
Similar sensitivity to GeneWise, lower specificity
GenomeScan (Yeh, Lim and Burge, 2001)
Extension of Genscan to use protein matches where available to add to the Genscan score for an exon
Higher sensitivity, especially when match is weak (it always predicts something); lower specificity

27
Targeted Genewise UTRs

PMATCH all genome
Genewise
BLAST
Human Protein Seqs Uniprot/TrEmbl/RefSeq
Est2Genome
cDNAs
BLAST
Genewise: phases, no UTRs
Est2genome: UTRs, no phases
Translatable gene with UTRs
28
Conservation Helps Gene Structure Prediction
Test on Chromosome 22: 13472 mouse hits, 4978 exons

Specificity (accuracy)
Coding
79% correct
21% wrong
Non coding
85% correct
15% wrong

Sensitivity (coverage)
Coding
1266 out of 2991 exons found (42%)?
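The two measures on this slide can be reproduced with a few lines of Python. Note that "specificity" is used here in the gene-prediction sense of the slide (the fraction of predictions that are correct), and the function names are illustrative assumptions.

```python
def sensitivity(found, total_real):
    """Coverage: fraction of real exons that were found."""
    return found / total_real


def specificity(correct, wrong):
    """Accuracy in the slide's sense: fraction of predictions that are correct."""
    return correct / (correct + wrong)


sens_coding = sensitivity(1266, 2991)   # exon counts from the slide
spec_coding = specificity(79, 21)       # percentages from the slide
spec_noncoding = specificity(85, 15)

print(f"Coding sensitivity: {100 * sens_coding:.1f}%")
print(f"Coding specificity: {100 * spec_coding:.0f}%")
print(f"Non-coding specificity: {100 * spec_noncoding:.0f}%")
```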

29
Twinscan (Korf et al., 2001)
Fit a conservation sequence alongside the target sequence
30
Alternative Splicing

Alternative splicing is very prevalent in vertebrates; historically underestimated
For Human Genome paper: reconstruct full (coding) length transcripts from cDNAs and ESTs on two chromosomes
Chr 22: 642 transcripts map to 245 genes
Average 2.6 transcripts per gene
Two or more transcripts for 145 (90) genes
Chr 19: 1859 transcripts map to 544 genes
Average 3.2 transcripts per gene
70% of alternatives affect coding sequence
Compare C. elegans data
22% of genes have multiple transcripts, average 1.34 transcripts per gene

31
Genome Annotation Strategy
32
Annotation process

Automated analysis
Repeat detection
RepeatMasker (Smit), tandem, inverted
Gene prediction
Genscan (Burge), FGENESH (Solovyev)
Database searches
Initial protein and DNA matches using BLAST
Refined protein matches using genewise (wise2)
Refined EST matches using ESTGENOME, spangle
Pfam annotation using halfwise (wise2)
Integrate results, display, annotate
ACEDB, web-based tools (e.g. spangle)
Investigate gene predictions experimentally
Submit to EMBL

33
Ensembl: What do you get?

Genome Annotation
Protein coding gene structure
Consistent with genome, predicted across all vertebrates
RNA genes (including miRNA)
Consistent with genome, predicted across mammals
Additional identifiers per genes
Affymetrix, EntrezGene, Uniprot
Comparative Genomics
Genome alignments
Blastz, Blat, coordinated with UCSC
Orthologs between genomes
Protein evolution rates
dN/dS rates between species
Variants (SNPs), strains, genotypes
Functional Genomics datasets
Infrastructure
Website, Data mining tool, database and data dump
Portable, extendable, open source system with
database, API, website, pipeline

34
Beyond Classical ab initio Computational Gene Prediction

Ensembl-style automatic gene annotation relies on alignment of supporting evidence: mRNA, EST and protein sequences that have been independently experimentally determined
Classical ab initio gene prediction partly relies on statistics of coding potentials, derived from a database
From the point of view of the cellular transcription machinery, genes are just a series of short signals
Transcription start site
Translation start site
5' and 3' intron splicing signals
Termination signals
Short signal sequences are historically difficult to recognise over background noise in large genomes: can we recognise them better with today's machine learning approaches?

35
Machine Learning of Promoter Elements
Method        Predictions  Starts found (%)  Accuracy (%)
Eponine       215          53.5              73.0
Pro'spector   278          55.5              64.0
CpG islands   306          65.8              62.0
TATA -6.5     39869        99.6              5.7
TATA -2.6     540          13.0              7.4
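The TATA rows above show how a short signal scored against a threshold trades coverage for accuracy. A minimal Python sketch of such a scan with a position weight matrix follows. The matrix is hypothetical (consensus TATAAA with an assumed 0.85 probability for the consensus base at every position), as are the thresholds and the test sequence; it is not the matrix behind the numbers in the table.

```python
import math

CONSENSUS = "TATAAA"


def make_pwm(consensus, p_match=0.85, background=0.25):
    """Log-odds matrix: one dict of base scores per motif position (toy weights)."""
    p_other = (1.0 - p_match) / 3.0
    return [{base: math.log2((p_match if base == c else p_other) / background)
             for base in "ACGT"}
            for c in consensus]


def score_window(window, pwm):
    """Sum of per-position log-odds scores for one window."""
    return sum(column[base] for base, column in zip(window, pwm))


def scan(seq, pwm, threshold):
    """Start positions whose window scores at or above the threshold."""
    width = len(pwm)
    return [i for i in range(len(seq) - width + 1)
            if score_window(seq[i:i + width], pwm) >= threshold]


pwm = make_pwm(CONSENSUS)
seq = "GGGCTATAAAGGC"
strict_hits = scan(seq, pwm, threshold=0.0)
relaxed_hits = scan(seq, pwm, threshold=-2.0)
print("strict:", strict_hits)
print("relaxed:", relaxed_hits)
```

Lowering the threshold admits partial matches on either side of the true site, which mirrors how the more permissive TATA setting in the table makes many more predictions at low accuracy.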
36
Gene Expression
37
Ensembl Regulatory Build

Assumes punctuated regulatory sites (elements)
The union of all sites used in any cell/tissue type is assigned a start and end on the genome
Element may have a cell/tissue specific annotation
Build steps
Define focus elements (DNase, FAIRE, CTCF, TFBS)
Create functional annotations with overlap elements (histone modifications)
First build: Ensembl release 45 (June 2007)
110,000 elements, 2 Mb of DNA
6,000 promoter-associated by inherent pattern (DNaseI H3K36me3)

Flicek, Birney et al.
38
Microarray Experiments

Multiplex On-chip binding
Hybridisation
Dual fluorescent tag red/green
Relatively new technology
Computational issues
Normalisation
Clustering
Significance
False positive/negative
Reference spikes
Relation to test axis (e.g. disease, drug, etc.)
Data standards
MIAME (Minimum Information About a Microarray Experiment)

39
Microarrays: SAM

Significance Analysis of Microarrays
Per-gene score: each gene has individual signal/noise
Better than plain t-tests
Assumes no underlying model
Based on repeat consistency
Low false discovery rate: 12% in radiation test

$d_i = \frac{x_{Ai} - x_{Bi}}{s_i + s_0}$
$x_{Ai}$: signal, condition A; $x_{Bi}$: signal, condition B; $s_i$: standard deviation; $s_0$: pseudocount
Tusher, Tibshirani & Chu. Proc Natl Acad Sci USA. 2001
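The per-gene score defined above can be computed as in the Python sketch below. It makes two assumptions that are not stated on the slide: the signals x_Ai and x_Bi are taken as the mean over repeats in each condition, and s_i is the pooled standard deviation over both conditions. The pseudocount s0 is passed in as a fixed value, whereas the SAM method chooses it from the data. `sam_d` is a hypothetical function name.

```python
import math


def sam_d(cond_a, cond_b, s0):
    """Relative difference d_i for one gene from repeat measurements in two conditions."""
    n_a, n_b = len(cond_a), len(cond_b)
    mean_a = sum(cond_a) / n_a
    mean_b = sum(cond_b) / n_b
    # Pooled scatter around each condition mean (assumed form of s_i).
    squares = (sum((x - mean_a) ** 2 for x in cond_a)
               + sum((x - mean_b) ** 2 for x in cond_b))
    s_i = math.sqrt((1.0 / n_a + 1.0 / n_b) / (n_a + n_b - 2) * squares)
    return (mean_a - mean_b) / (s_i + s0)


# Toy expression values for a single gene, three repeats per condition.
d_no_pseudo = sam_d([5.0, 6.0, 7.0], [1.0, 2.0, 3.0], s0=0.0)
d_pseudo = sam_d([5.0, 6.0, 7.0], [1.0, 2.0, 3.0], s0=1.0)
print(round(d_no_pseudo, 3), round(d_pseudo, 3))
```

The pseudocount damps the score of genes whose standard deviation is very small, so that low-signal genes with tiny variance do not dominate the ranking.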
