Identification and Quantification of Polypeptide Similarity Tim

Transcript and Presenter's Notes

1
Bioinformatics
Tim Stevens, November 2008
Department of Biochemistry, University of Cambridge
http://www.bio.cam.ac.uk/tjs23/ tjs23_at_mole.bio.cam.ac.uk
2
  • Genome assembly
  • Sequence annotation
  • Comparative genomics
  • Computational methods
  • Sequence similarity
  • Homology detection
  • Protein structure prediction
  • Protein-protein interactions

3
Genome Sequence to Biological Knowledge
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG
Whole Genome Sequence
Gene prediction
Open Reading Frames
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLS
Regulation
Expression
Expressed Proteins
Homology Detection
Structure Prediction
Functional Knowledge
Genome Context
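The first step of this flow, from genome sequence to an open reading frame, can be sketched in a few lines of Python. This is only an illustration, not a gene predictor: it assumes the standard genetic code, reads the forward strand only, and starts at the first ATG it finds. The function name `translate_first_orf` is made up for the example; the DNA and protein strings are the ones shown on the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_first_orf(dna):
    """Translate from the first ATG until a stop codon or the end of the sequence."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)

# Sequences as shown on the slide (fragments joined).
SLIDE_DNA = (
    "ACATTTGCTTCTGACACAACTGTGTTCACT" "AGCAACCTCAAACAGACAC"
    "CATGGTGCACC" "TGACTCCTGAGGAGAAGTCTGCCGTTACTG" "CCCTGTG"
    "GGGCAAGGTGAACGTGGATGAAG"
)
SLIDE_PROTEIN = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALG" "RLLVVYPWTQRFFESFGDL"
    "STPDAVMGNPK" "VKAHGKKVLGAFSDGLAHLDNLKGTFATLS"
)

orf = translate_first_orf(SLIDE_DNA)
print(orf)
print(SLIDE_PROTEIN.startswith(orf))
```

The translated fragment is a prefix of the protein on the slide; the DNA shown simply stops before the rest of the coding sequence (and, in a real human gene, before the first intron would have to be handled).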
4
Genome Assembly
5
The Genome Milestones
Year  Class      Organism        Size (Mb)  Genes
1976  RNA Virus  Phage MS2       0.003      4
1977  DNA Virus  Phage Phi X174  0.005      11
1995  Bacterium  H. influenzae   2          1,700
1996  Eukaryote  S. cerevisiae   13         6,000
1998  Metazoan   C. elegans      100        18,000
2000  Plant      A. thaliana     157        27,000
2003  Mammal     H. sapiens      3000       30,000
6
Value of Genomic Sequence
  • Complete information: we can get it right once
    and for all
  • Complete gene catalogue
  • An index reference
  • A reference archive
  • New entry points
  • Extend families across species
  • Gene disruption and expression studies in model
    organisms
  • Comparative studies: conservation between
    organisms
  • Genome structure and archaeology
  • Long range structure: chromosome organisation
    and function
  • Evolutionary studies: fossil genes
  • Materials
  • Design experiments in advance
  • Computational knowledge extraction (i.e. via
    databases)

7
Hierarchical Sequencing Strategy (human)
Chromosome
24
Overlapping BACs
354,510
29,298
15 contigs per clone
1 contig: less than one error in 10,000
8
Whole Genome Shotgun (WGS) Sequencing Strategy
Reads
Contigs
Read pairs
Scaffold
Order scaffolds on chromosomes using genetic
markers and other maps
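The reads-to-contigs step can be illustrated with a toy greedy overlap assembler. This is a sketch under strong assumptions: error-free reads, a single strand, no repeats, and no use of read pairs, none of which hold for real shotgun data. The function names and the minimum overlap of 3 bases are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Three overlapping reads drawn from the start of the sequence on slide 3.
reads = ["TGACACAACTG", "ACATTTGCTTC", "TGCTTCTGACAC"]
print(greedy_assemble(reads))
```

With these reads the three fragments collapse into a single contig. Segmental duplications break exactly this kind of greedy merging, which is one reason WGS quality varies between genomes.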
9
WGS v Hierarchical
  • Hierarchical
  • High quality continuous sequence
  • FPC (FingerPrint Contig) maps provide valuable
    experimental BAC clone resources
  • Slow/more expensive: need to construct FPC map
    before starting
  • WGS
  • Fast/cheaper: no initial maps required (though
    marker map ultimately required to anchor
    assemblies)
  • Quality variable, depending on genome (e.g.
    amount of segmental duplication)
  • Hard to finish and close gaps from initial WGS
  • In Practice
  • WGS initially, with Hierarchical if high quality
    required.
  • Frequently WGS with targeted clone based
    sequencing in regions of interest
  • Various hybrid strategies

10
Completed Genomes 2008
  • Viruses
  • 2100
  • Prokaryotes
  • 21 Archaeans
  • 200 Eubacteria
  • Organelles
  • 1900 Mitochondria
  • 160 Plastids
  • Eukaryotes
  • 17 Protists
  • 8 Plants
  • 15 Fungi
  • 23 Animals

Pedersen et al. J Mol Biol. 2000
11
Sequencing Status
  • Human Genome Sequence
  • Finished, at least the first one ?
  • Nearly all chromosomes now independently
    published (Nature)
  • Global checks reveal very little missing (find
    all RefSeq mRNAs, check against independent
    fosmid library)
  • Genome maintenance system being set up
  • Other Vertebrate Genome Sequences
  • Range of quality from Mouse (originally whole
    genome shotgun (WGS), now mostly finished with
    clone sequencing) to Elephant et al. (currently
    low coverage WGS).
  • Automatic annotations available via genome
    browsers
  • Ensembl: http://www.ensembl.org
  • Curated annotation available
  • Vega: human, mouse, zebrafish
  • Model organism databases: FlyBase, WormBase, ZFIN
  • Also through Ensembl.

12
[Figure: phylogenetic tree of sequenced genomes on a divergence-time axis in million years (ticks at 100, 200, 300, 400, 500 and 1000). Species and genome sizes, where given:]
H. sapiens (human) 3Gb
P. troglodytes (common chimpanzee) 3Gb
M. mulatta (rhesus macaque)
M. musculus (house mouse) 2.6Gb
R. norvegicus (Norway rat) 2.6Gb
C. familiaris (domestic dog) 2.5Gb
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle)
O. aries (domestic sheep)
M. domestica (opossum)
G. gallus (domestic fowl) 1.2Gb
X. laevis (African clawed frog) 3.1Gb
X. tropicalis (tropical clawed frog) 1.7Gb
D. rerio (zebrafish) 1.7Gb
O. latipes (Japanese medaka) 800Mb
T. nigroviridis (freshwater pufferfish) 400Mb
F. rubripes (tiger pufferfish) 400Mb
C. savignyi (sea squirt) 180Mb
C. intestinalis (sea squirt) 180Mb
A. aegypti (yellow fever mosquito)
A. gambiae (African malaria mosquito) 230Mb
D. melanogaster (fruitfly, FLYBASE) 125Mb
A. mellifera (honey bee) 200Mb
I. scapularis (tick)
C. elegans (nematode, WORMBASE) 100Mb
S. cerevisiae (yeast, SGD) 12Mb
40 species currently in Ensembl (includes Elephant 2x, not shown)
Blue: finished assembly available or planned. Red: whole genome assembly available. Green: whole genome assembly due in the next 2 years.
13
DNA sequencing revolution
  • Genome sequencing costs are falling very fast
  • ABI 3730: old Sanger technology
  • 80kb per run in 800bp reads, cost 500/Mb
  • 454: introduced in 2005
  • 100Mb per run in 250bp reads, cost 100/Mb
  • Illumina/Solexa introduced in 2006, ABI SOLiD in
    2007
  • 1Gb per run in 35bp reads, cost 5/Mb
  • Substantial informatics requirements
  • Raw output of each run is 1 Tb (every 3 days)
  • Storage of output after processing (trace format)
    from 30 Illumina machines: 200 Tb per year
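To put the per-run yields and per-Mb costs in proportion, the arithmetic for a single pass (1x coverage) over a 3 Gb human genome can be written out as below. The yields and costs are the figures from the slide (the currency is not stated there); the 3 Gb genome size comes from the genome table earlier in the talk. The `coverage` parameter and the function name are additions for illustration, and real projects need many-fold coverage, more so with short reads.

```python
GENOME_KB = 3_000_000  # 3 Gb human genome, expressed in kb

# platform: (yield per run in kb, cost per Mb), as quoted on the slide
PLATFORMS = {
    "ABI 3730 (Sanger)": (80, 500),
    "454": (100_000, 100),
    "Illumina/Solexa": (1_000_000, 5),
}

def runs_and_cost(yield_kb, cost_per_mb, coverage=1):
    """Number of runs and total cost to sequence the genome at the given coverage."""
    total_kb = GENOME_KB * coverage
    runs = -(-total_kb // yield_kb)          # ceiling division
    cost = (total_kb // 1000) * cost_per_mb  # kb -> Mb, then cost per Mb
    return runs, cost

for name, (yield_kb, cost_per_mb) in PLATFORMS.items():
    runs, cost = runs_and_cost(yield_kb, cost_per_mb)
    print(f"{name}: {runs:,} runs, cost {cost:,} per 1x genome")
```

At 1x this gives 37,500 Sanger runs against 3 Illumina/Solexa runs, and a hundredfold difference in cost.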

14
Appetite for Natural Variation Data
  • Reference genome
  • Sequence variation (collecting SNPs)
  • Sanger ExoSeq project: 35,000 novel rare SNPs
    identified from exons from 14 human chromosomes
    in 48 Caucasian individuals.
  • Cancer Genome Project: Greenman et al., Patterns
    of somatic mutation in human cancer genomes,
    Nature 446, 153 (2007).
  • Haplotypes (genotyping from reference SNPs)
  • HapMap project
  • Wellcome Trust Case Control Consortium (WTCCC):
    Genome-wide association study of 14,000 cases of
    seven common diseases and 3,000 shared controls,
    Nature 447, 661 (2007)
  • Copy number variations (CNVs)
  • Redon et al., Global variation in copy number in
    the human genome, Nature 444, 444 (2006)
  • Multiple complete genomes of individuals

15
Planned use of new technology
  • Sequencing 200 individuals as part of proposed
    1000 humans consortium (Richard Durbin)
  • Ancestral Recombination Graph (ARG) algorithm
    will allow low coverage sequencing on many
    individuals with missing data inferred to high
    accuracy. Piloted on yeast strains and human
    chromosome X.
  • Opportunities for larger scale cancer
    resequencing
  • Cancer Genomics Consortium Meeting in Toronto
    held to plan resequencing entire cancer cell
    lines.
  • Sequence all expressed RNA in a cell in a single
    step

16
Changing healthcare research
  • Genome sequencing costs are falling fast
  • 2000: 1,000,000,000 per genome
  • 2004: 10,000,000 per genome
  • 2008: 100,000 per genome
  • 2012: ?
  • Sequencing expected to displace genotyping as
    costs drop
  • Chips with 1,000,000 SNPs already allow whole
    genome association studies through genotyping,
    but will not necessarily identify causal
    variations.
  • Companies are already starting to sell
    personal genome services (23andMe, deCODE)
  • Future human health research will be increasingly
    driven by the availability of this data
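The rate implied by the three cost figures above can be made explicit. The short calculation below assumes a constant exponential decline between the quoted years, which is a simplification; the function name is made up for the example, and nothing here predicts the 2012 value.

```python
import math

def halving_time_years(cost_start, cost_end, years):
    """Time for the cost to halve, assuming a constant exponential decline."""
    fold_change = cost_start / cost_end
    return years / math.log2(fold_change)

# Per-genome costs quoted on the slide for 2000, 2004 and 2008.
t_2000_2004 = halving_time_years(1_000_000_000, 10_000_000, 4)
t_2004_2008 = halving_time_years(10_000_000, 100_000, 4)
print(f"{t_2000_2004 * 12:.1f} months, {t_2004_2008 * 12:.1f} months")
```

Both intervals show a hundredfold drop in four years, which corresponds to the cost halving roughly every seven months.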

17
Craig Venter Goes Boating
  • Trawl the ocean for microbes
  • Atlantic via Panama to Pacific
  • Untargeted environmental sampling
  • 6.3 GB of sequences
  • Assembled into contigs/genomes where possible
  • Mass sequence comparisons
  • Massive sequence diversity
  • 60 common ribotypes
  • Cladistics of oceanic microbial taxa
  • Large sampling of sequence space
  • Little genome assembly
  • The Sorcerer II Global Ocean Sampling expedition:
    northwest Atlantic through eastern tropical
    Pacific. PLoS Biol. 2007

18
Genome sequence annotation
19
Methods for gene annotation
  • Ab initio gene prediction
  • Use general knowledge of gene structure: rules
    and statistics
  • Current best methods are all based on hidden
    Markov models, which use Dynamic Programming
  • Genscan (Burge)
  • FGENES (Solovyev)
  • HMMGene (Krogh)
  • Similarity based annotation
  • Comparison to known proteins, cDNAs, ESTs
  • Better, but only possible if you have similar
    data to compare to
  • Best if the sequence comes from this gene
    (verification, not prediction)
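The dynamic programming behind HMM-based gene finders is the Viterbi algorithm, which finds the most probable path of hidden states for an observed sequence. The sketch below uses a toy two-state model (N for non-coding, C for coding) with invented probabilities in which coding sequence is simply GC-rich. Genscan, FGENES and HMMGene use far richer models, so this shows only the recursion, not any of those programs.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, computed in log space."""
    # column[s] = best log-probability of any path ending in state s
    column = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_column, pointer = {}, {}
        for s in states:
            # best previous state from which to enter s
            prev = max(states, key=lambda r: column[r] + math.log(trans_p[r][s]))
            new_column[s] = (column[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointer[s] = prev
        column = new_column
        backpointers.append(pointer)
    # trace back from the best final state
    path = [max(states, key=lambda s: column[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]

# Hypothetical toy parameters, chosen only for illustration.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

print("".join(viterbi("ATATGCGCGC", STATES, START, TRANS, EMIT)))
```

For this input the path switches from N to C where the sequence turns GC-rich: one state change costs less than explaining six G/C bases from the AT-rich state.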

20
Searching for Genes: Bacteria vs Human
Bacterial gene: continuous coding region, known signals
Promoter, 5' UTR, 3' UTR
Human gene: fragmented coding region, unknown signals, contained in much more DNA
??, 5' UTR, 3' UTR, polyA site
[Figure: predicted versus real human gene structure, each drawn as ?? 5' UTR ----------- 3' UTR polyA site]
21
Searching for Exons
22
Genscan (Burge and Karlin, 1998)
  • Dramatic improvement over previous methods
  • Generalised HMM
  • Different parameter sets for different GC content
    regions (intron length distribution and exon
    stats)

23
Performance of ab initio methods
  • Can confirm gene structures experimentally by
    sequencing cDNA
  • Current methods are not really good enough
  • 75% correct per exon, worse with initial and
    final exons
  • 20% correct per gene
  • Easier for simpler organisms, e.g. C. elegans
  • Options are to improve methods, or get extra
    information
  • An attractive source of new information is whole
    genome sequence from a related species, e.g.
    mouse for man

24
GeneWise (Birney)
  • GeneWise aligns a protein sequence (or HMM) to
    genomic DNA taking into account splicing
    information

25
[Figure: diagram annotated with base-pair offsets: -20bp, 3bp, -6bp, -8bp, -66bp, -1bp, 0bp, -3bp, -1bp, 2bp, 1bp, 1bp]
26
Other Comparative Approaches
  • Procrustes (Gelfand, Mironov and Pevzner, 1996)
  • Find possible exons, align and piece together
    homologue
  • Similar sensitivity to GeneWise, lower
    specificity
  • GenomeScan (Yeh, Lim and Burge, 2001)
  • Extension of Genscan to use protein matches where
    available to add to the Genscan score for an exon
  • Higher sensitivity, especially when match is weak
    (it always predicts something); lower specificity

27
Targeted Genewise UTRs
Human protein seqs (Uniprot/TrEMBL/RefSeq): PMATCH against all genome, BLAST, Genewise
cDNAs: BLAST, Est2Genome
Genewise: phases, no UTRs
Est2Genome: UTRs, no phases
Result: translatable gene with UTRs
28
Conservation Helps Gene Structure Prediction
Test on chromosome 22: 13472 mouse hits, 4978 exons
  • Specificity (accuracy)
  • Coding: 79% correct, 21% wrong
  • Non coding: 85% correct, 15% wrong
  • Sensitivity (coverage)
  • Coding: 1266 out of 2991 exons found (42%) ?
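The two measures on this slide follow the usual gene-prediction convention: sensitivity is the fraction of real features that are found, and specificity is the fraction of predictions that are correct. A minimal sketch follows; the function names are invented, and the 79-in-100 count passed to `specificity` is a stand-in for the slide's coding figure, since the underlying counts are not given.

```python
def sensitivity(real_found, real_total):
    """Fraction of real features recovered (coverage)."""
    return real_found / real_total

def specificity(correct_predictions, total_predictions):
    """Fraction of predictions that are correct (accuracy, as used on the slide)."""
    return correct_predictions / total_predictions

# Exon counts from the slide: 1266 of 2991 coding exons found.
print(f"sensitivity = {100 * sensitivity(1266, 2991):.0f}%")

# Hypothetical counts standing in for the coding figure quoted on the slide.
print(f"specificity = {100 * specificity(79, 100):.0f}%")
```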

29
Twinscan (Korf et al., 2001)
Fit a conservation sequence alongside the
target sequence
30
Alternative Splicing
  • Alternative splicing is very prevalent in
    vertebrates and was historically underestimated.
  • For the Human Genome paper, full-length (coding)
    transcripts were reconstructed from cDNAs and
    ESTs on two chromosomes
  • Chr 22: 642 transcripts map to 245 genes
  • Average 2.6 transcripts per gene
  • Two or more transcripts for 145 (59%) genes
  • Chr 19: 1859 transcripts map to 544 genes
  • Average 3.2 transcripts per gene
  • 70% of alternatives affect coding sequence
  • Compare C. elegans data
  • 22% of genes have multiple transcripts, average
    1.34 transcripts per gene

31
Genome Annotation Strategy
32
Annotation process
  • Automated analysis
  • Repeat detection
  • RepeatMasker (Smit), tandem, inverted
  • Gene prediction
  • Genscan (Burge), FGENESH (Solovyev)
  • Database searches
  • Initial protein and DNA matches using BLAST
  • Refined protein matches using genewise (wise2)
  • Refined EST matches using ESTGENOME, spangle
  • Pfam annotation using halfwise (wise2)
  • Integrate results, display, annotate
  • ACEDB, web-based tools (e.g. spangle)
  • Investigate gene predictions experimentally
  • Submit to EMBL

33
Ensembl: What do you get?
  • Genome Annotation
  • Protein coding gene structure
  • Consistent with genome, predicted across all
    vertebrates
  • RNA genes (including miRNA)
  • Consistent with genome, predicted across
    mammals
  • Additional identifiers per genes
  • Affymetrix, EntrezGene, Uniprot
  • Comparative Genomics
  • Genome alignments
  • Blastz, Blat, coordinated with UCSC
  • Orthologs between genomes
  • Protein evolution rates
  • dN/dS ratios between species
  • Variants (SNPs), strains, genotypes
  • Functional Genomics datasets
  • Infrastructure
  • Website, Data mining tool, database and data dump
  • Portable, extendable, open source system with
    database, API, website, pipeline

34
Beyond Classical ab initio Computational Gene
Prediction
  • Ensembl-style automatic gene annotation relies on
    alignment of supporting evidence: mRNA, EST and
    protein sequences that have been independently
    experimentally determined.
  • Classical ab initio gene prediction partly relies
    on statistics of coding potentials, derived from
    a database
  • From the point of view of the cellular
    transcription machinery, genes are just a series
    of short signals
  • Transcription start site
  • Translation start site
  • 5' and 3' intron splicing signals
  • Termination signals
  • Short signal sequences have historically been
    difficult to recognise over background noise in
    large genomes: can we recognise them better with
    today's machine learning approaches?
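One classical way to recognise a short signal is to score every window of the sequence against a position weight matrix and keep windows above a threshold. The sketch below uses an invented six-base TATA-like matrix, a uniform background and an arbitrary threshold; all of these are assumptions for illustration and are not the parameters of any promoter predictor named in this talk.

```python
import math

# Hypothetical per-position base frequencies for a TATAAA-like motif.
CONSENSUS = "TATAAA"
MOTIF = [
    {base: (0.85 if base == wanted else 0.05) for base in "ACGT"}
    for wanted in CONSENSUS
]
BACKGROUND = 0.25  # uniform background frequency, an assumption

def window_score(window):
    """Log-odds score (bits) of one window against the motif."""
    return sum(math.log2(column[base] / BACKGROUND)
               for column, base in zip(MOTIF, window))

def scan(sequence, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(MOTIF)
    hits = []
    for i in range(len(sequence) - width + 1):
        score = window_score(sequence[i:i + width])
        if score >= threshold:
            hits.append((i, score))
    return hits

for position, score in scan("GCGCTATAAAGCGC", threshold=5.0):
    print(position, round(score, 2))
```

A six-base motif carries little information, so in a genome of billions of bases a lenient threshold returns a flood of chance matches; that is the background-noise problem described above.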

35
Machine Learning of Promoter Elements
Method       Predictions  Starts found (%)  Accuracy (%)
Eponine      215          53.5              73.0
Pro'spector  278          55.5              64.0
CpG islands  306          65.8              62.0
TATA -6.5    39869        99.6              5.7
TATA -2.6    540          13.0              7.4
36
Gene Expression
37
Ensembl Regulatory Build
  • Assumes punctuated regulatory sites (elements)
  • Union of all sites used in any cell/tissue type
    are assigned start and ends on the genome
  • Element may have a cell/tissue specific
    annotation
  • Build steps
  • Define focus elements (DNase, FAIRE, CTCF, TFBS)
  • Create functional annotations with overlap
    elements (Histone modifications)
  • First build: Ensembl release 45 (June 2007)
  • 110,000 elements, 2 Mb of DNA
  • 6,000 promoter associated by inherent pattern
    (DNaseI H3K36me3)
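Taking the union of sites seen in any cell or tissue type and giving each resulting element one start and end amounts to merging overlapping intervals. The sketch below shows that operation on made-up coordinates; it is an assumption about the general idea, not the Ensembl pipeline's actual code or data model.

```python
def merge_sites(sites):
    """Merge overlapping (start, end) intervals into non-overlapping elements."""
    merged = []
    for start, end in sorted(sites):
        if merged and start <= merged[-1][1]:
            # overlaps the previous element: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical sites on one chromosome, from two cell types.
cell_type_a = [(100, 200), (500, 600)]
cell_type_b = [(150, 300)]
print(merge_sites(cell_type_a + cell_type_b))
```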

Flicek, Birney et al.
38
Microarray Experiments
  • Multiplex On-chip binding
  • Hybridisation
  • Dual fluorescent tag red/green
  • Relatively new technology
  • Computational issues
  • Normalisation
  • Clustering
  • Significance
  • False positive/negative
  • Reference spikes
  • Relation to test axis (e.g. disease, drug, etc.)
  • Data standards
  • MIAME (Minimum Information About a Microarray
    Experiment)

39
Microarrays: SAM
  • Significance Analysis of Microarrays
  • Per-gene score: each gene has individual
    signal/noise
  • Better than plain t-tests
  • Assumes no underlying model
  • Based on repeat consistency
  • Low false discovery rate: 12% in radiation test

$$d_i = \frac{x_{Ai} - x_{Bi}}{s_i + s_0}$$
x_{Ai}: signal, condition A; x_{Bi}: signal, condition B; s_i: standard deviation; s_0: pseudocount
Tusher, Tibshirani and Chu, Proc Natl Acad Sci USA. 2001
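The per-gene score can be computed directly from replicate measurements. In the sketch below, the signal for each condition is taken as the mean over replicates and s_i as the pooled standard error used in a two-sample t-statistic; both choices, the function name and the example values are assumptions for illustration. Choosing s_0 and estimating the false discovery rate by permutation, as SAM does, are not shown.

```python
import math

def sam_score(a, b, s0):
    """d = (mean(a) - mean(b)) / (s + s0), with s the pooled standard error."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    sum_sq = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / n_a + 1 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s + s0)

# Made-up replicate signals for two genes with the same t-statistic.
strong_a, strong_b = [10, 12, 14], [4, 6, 8]
weak_a, weak_b = [1.00, 1.01, 1.02], [0.97, 0.98, 0.99]

for s0 in (0.0, 1.0):
    print(s0, round(sam_score(strong_a, strong_b, s0), 3),
          round(sam_score(weak_a, weak_b, s0), 3))
```

With s_0 = 0 the score is an ordinary t-statistic and the two genes are indistinguishable; a positive pseudocount suppresses the gene whose difference is tiny in absolute terms, which is the point of adding s_0.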