Title: Genome Annotation
Laura Clarke 18/12/02
- What is Genome Annotation
- Manual Curation
- Automatic Annotation
- Conclusions
Genome Annotation
Our Aim
Genome Annotation
Gene Identification
Known genes
- where?
- genomic structure?
- transcript(s)?
- protein(s)?
- attach useful links
Novel genes
- how to predict?
- require evidence
- transcript(s)?
- protein(s)?
- attach useful links
Manual Curation
- Who?
- The G16
- Wormbase
- Flybase
Manual Curation
Identifying Genes
- Known
- Novel
- Novel transcript
- Putative
- Pseudogene
Automatic Annotation
Raw Compute
Sequence data arrives in contigs
Repeat masking
Ab initio predictions (Genscan)
Blast the predictions against swall, vertebrate RNA, unigene
ePCR places markers on the sequence
Assembly information is used to position contigs on a golden path
EnsEMBL core
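The repeat masking step above can be illustrated with a minimal Python sketch. It assumes the repeat positions have already been found by some repeat-finding tool and are supplied as 0-based, half-open intervals; the function name `mask_repeats` and the toy sequence are hypothetical and are not taken from the pipeline itself.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Return `sequence` with every (start, end) repeat interval hard-masked.

    Intervals are assumed to be 0-based and half-open; they are clipped
    to the bounds of the sequence.
    """
    masked = list(sequence)
    for start, end in repeats:
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


# Toy contig with a poly-A run at positions 8..15 (hypothetical data).
contig = "ACGTACGTAAAAAAAAGGCC"
print(mask_repeats(contig, [(8, 16)]))
```

Masking before the ab initio and Blast steps keeps repetitive sequence from generating spurious predictions and hits.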
Automatic Annotation
human proteins
Other proteins
Add UTRs
Genscan exons
Automatic Annotation
Protein Sequences
Aligned to the Genome
Blast and MiniSeq
Automatic Annotation
ESTs and cDNA
Map cDNAs and ESTs using Exonerate (determine coverage, identity and location in genome)
Store hits and filter on percentage identity and length coverage
Blast sequence and create a miniseq
Map transcripts back into genome assembly
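The filtering of stored hits on percentage identity and length coverage might look like the following Python sketch. The `Hit` record, its field names and the threshold values (97% identity, 90% coverage) are assumptions chosen for illustration; the slides do not state the cut-offs actually used.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str            # cDNA or EST identifier
    query_length: int        # full length of the cDNA/EST in bases
    aligned_length: int      # number of query bases covered by the alignment
    percent_identity: float  # identity over the aligned region, 0-100


def coverage(hit):
    """Percentage of the query sequence covered by the alignment."""
    return 100.0 * hit.aligned_length / hit.query_length


def filter_hits(hits, min_identity=97.0, min_coverage=90.0):
    """Keep hits that pass both thresholds (threshold values are assumed)."""
    return [
        h for h in hits
        if h.percent_identity >= min_identity and coverage(h) >= min_coverage
    ]


# Hypothetical hits: only the first passes both filters.
hits = [
    Hit("cdna_1", 2000, 1950, 99.1),
    Hit("est_7", 500, 300, 99.5),     # coverage too low (60%)
    Hit("cdna_2", 1500, 1480, 91.0),  # identity too low
]
print([h.query_id for h in filter_hits(hits)])
```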
Automatic Annotation
Miniseq - the need for speed
Minigenomic: 1 kb on either side of the hit, then run Genewise
Map back to genomic
Spliced alignment
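The miniseq idea can be sketched in Python as cutting a padded window out of the genomic sequence and remembering the offset needed to map results back. This is a simplified single-window version under stated assumptions: hits are 0-based, half-open intervals, and the names `make_miniseq` and `to_genomic` are hypothetical. The alignment step itself (Genewise) is not shown.

```python
def make_miniseq(genomic, hits, padding=1000):
    """Cut out the region spanned by `hits`, padded by `padding` bases.

    Returns the shorter sequence plus the genomic offset of its first base,
    so coordinates found on the miniseq can be mapped back afterwards.
    """
    if not hits:
        raise ValueError("at least one hit is required")
    start = max(min(s for s, _ in hits) - padding, 0)
    end = min(max(e for _, e in hits) + padding, len(genomic))
    return genomic[start:end], start


def to_genomic(mini_pos, offset):
    """Map a 0-based miniseq coordinate back to the genomic sequence."""
    return mini_pos + offset


# Hypothetical 10 kb genomic sequence with one short hit in the middle.
genomic = "A" * 5000 + "CCCC" + "A" * 5000
miniseq, offset = make_miniseq(genomic, [(5000, 5004)])
hit_in_mini = miniseq.find("CCCC")
print(len(miniseq), offset, to_genomic(hit_in_mini, offset))
```

Running the slower spliced aligner on the short window rather than the whole genomic sequence is where the speed gain comes from.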
Automatic Annotation
- 8x ES40 Alpha (667 MHz) with 2 TB fibre channel storage
- 6x ES45 Alpha (1 GHz) with 4 TB fibre channel storage
- 360x DS10L (467 MHz) farm with 60 GB local disk storage
- 767x RLX800i with 80 GB of local disk storage
- Further 21 TB storage on farm
- Tru64 UNIX (avoids the 2 GB file limit)
- 7 MySQL (v. 3) instances
- Most binaries and all sequence databases stored
locally (avoids using NFS)
Automatic Annotation
Latest full Human Build
- NCBI 30 build, released Sept 2002
- Ensembl genes 22,980
- Ensembl transcripts 27,628
- Ensembl exons 204,542
- Made from
- 41,955 proteins, 84,079 cDNAs
- Transcripts from human proteins 43,418
- Transcripts from homology 4,818
- cDNA alignments 75,668
- Transcripts with UTRs 32,661
- Genscan predictions 73,128 (375,361 exons)
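As a quick sanity check on the build figures above, a few ratios can be derived from the counts. The ratios themselves are not stated on the slide; this small Python sketch only does the arithmetic on the quoted numbers.

```python
# Counts quoted for the NCBI 30 build.
GENES = 22_980
TRANSCRIPTS = 27_628
EXONS = 204_542
GENSCAN_PREDICTIONS = 73_128
GENSCAN_EXONS = 375_361


def ratio(numerator, denominator):
    """Simple ratio rounded to two decimal places."""
    return round(numerator / denominator, 2)


print("transcripts per gene:", ratio(TRANSCRIPTS, GENES))
print("exons per gene:", ratio(EXONS, GENES))
print("exons per Genscan prediction:", ratio(GENSCAN_EXONS, GENSCAN_PREDICTIONS))
```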
Automatic Annotation
Other Analyses
- Protein Analysis
- Interpro
- Other Algorithms
- Comparative Analysis
- Homologous Gene Pairing
- Synteny finding
Automatic Annotation
Other Species
- Mouse
- Mosquito
- Zebrafish
- Fugu
Automatic Annotation
Coming Soon
- Human NCBI 31
- Rat
- C. briggsae
- C. elegans
- Mosquito
- Drosophila
EnsEMBL Website