Title: http://creativecommons.org/licenses/by-sa/2.0/
1http://creativecommons.org/licenses/by-sa/2.0/
2Lecture 5.2 Genome Annotation
- Francis Ouellette
- francis_at_bioinformatics.ubc.ca
3New slides
- All slides which are different than what you have in your binder have a mark in the left-hand corner of the page. If it's a modification of the slide you have in your binder, I've tried to note it as such.
4Outline
- What we have
- How we use it
- What we should get out of it
5What else?
- Genome sequences (DNA only) by themselves are not as useful as genomes that are fully annotated.
- Functions of many processes reside in 3D proteins, and the structure of proteins and RNA is known for only a few sequences.
- Need to know where the protein coding sequences are, and what they do: this is a very big challenge in bioinformatics.
- Proteins are not everything; there are also parts, components, and information in DNA and RNA (and carbohydrates and lipids).
- All of this becomes part of the parts list, from which all biology will be understood.
6Challenges in building the Parts List
- Finding genes involves computational methods as well as experimental validation
- Computational methods are often inadequate, and often generate erroneous gene (false positive) sequences which
- Are missing exons
- Have incorrect exons
- Over predict genes
- Are missing the 5' and 3' UTR
7Assumptions we make
- Reductionist approach still works, albeit we are now becoming more and more systems biologists
- Evolution drives everything, and will be the way we figure things out. Or, said another way:
- Evolutionary relationships and comparisons will be essential in our efforts to solve and understand genomes
8How we got started
- GenBank database was populated by common genes
- rRNA, tRNA
- Globin
- Histone
- ATPases
- Actin
- Tubulin and others
10Things we are looking to annotate?
- CDS
- mRNA
- Alternative RNA
- Promoter and Poly-A Signal
- Pseudogenes
- ncRNA
11Pseudogenes
- As many as 20-30% of all genomic sequence predictions could be pseudogenes
- Non-functional copy of a gene
- Processed pseudogene
- Retro-transposon derived
- No 5' promoters
- No introns
- Often includes polyA tail
- Non-processed pseudogene
- Gene duplication derived
- Both include events that make the gene non-functional
- Frameshift
- Stop codons
- We assume pseudogenes have no function, but we really don't know!
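The two disabling events named above (frameshifts and stop codons) can be screened for with a few lines of code. This is a minimal sketch, not a pseudogene detector: the function names are hypothetical, the input is assumed to be a candidate coding sequence already in frame 0, and a length that is not a multiple of three is used only as a crude hint of a frameshift.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stops before the last codon."""
    cds = cds.upper()
    # Split into complete codons, ignoring any trailing 1-2 bases.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The final codon is expected to be a stop, so it is not reported.
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def length_breaks_frame(cds):
    """Crude frameshift hint: the sequence length is not a multiple of 3."""
    return len(cds) % 3 != 0


if __name__ == "__main__":
    candidate = "ATGTAAGGGTGA"  # made-up sequence with an internal TAA
    print(premature_stops(candidate))
    print(length_breaks_frame(candidate))
```

A real comparison would align the candidate against its functional parent gene; the check here only flags sequences that deserve that closer look.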
12LOCUS NG_005487 1850 bp DNA linear ROD 14-FEB-2006
DEFINITION  Mus musculus ubiquitin-conjugating enzyme E2 variant 2 pseudogene
            (LOC625221) on chromosome 6.
ACCESSION   NG_005487
VERSION     NG_005487.1  GI:87239965
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 1850)
  AUTHORS   Wilson,R.
  TITLE     Mus musculus BAC clone RP24-201D17 from 6
  JOURNAL   Unpublished (2003)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AC121925.2.
FEATURES             Location/Qualifiers
     source          1..1850
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10090"
                     /chromosome="6"
                     /note="AC121925.2 32277..34126"
     gene            101..1750
                     /gene="LOC625221"
                     /pseudo
                     /db_xref="GeneID:625221"
     repeat_region   1792..1827
                     /rpt_family="ID"
ORIGIN
        1 tcttctgcct caattcctca agtgctagta tcatatgccc atgccattat ttttaactcc
       61 cctttttcat gctaagaatt gaacacacgg ccctgcgtgc ggtggtgcgt ctggtagcag
      121 gagaagatgg cggtctccac aggagttaaa gttcctcgta attttcgctt gttggaagaa
13Noncoding RNA (ncRNA)
- ncRNA represent 98% of all transcripts in a mammalian cell
- ncRNA have not been taken into account in gene counts
- cDNA
- ORF computational prediction
- Comparative genomics looking at ORF
- ncRNA can be
- Structural
- Catalytic
- Regulatory
14From NW_632744.1
     gene            complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="synonym CR_tc_AT13310"
                     /db_xref="GeneID:3354945"
     misc_RNA        complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="This annotation is identical to the ncRNA
                     CR_tc_AT13310 annotation, also mapped identically to 2L
                     20224138,20223553 last curated on Thu Jan 15 13:37:02 PST
                     2004"
                     /db_xref="FlyBase:FBgn0058465"
                     /db_xref="GeneID:3354945"
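Feature locations such as complement(55100..55691) can be read mechanically. The sketch below handles only the two simplest forms of a feature-table location (a plain start..end range and its complement); joins, partial ends, and remote references are deliberately left out, and the function name is hypothetical.

```python
import re

# Two simple location forms: "start..end" and "complement(start..end)".
_LOCATION = re.compile(r"^(?:complement\((\d+)\.\.(\d+)\)|(\d+)\.\.(\d+))$")


def parse_simple_location(text):
    """Return (start, end, strand, length) with 1-based inclusive coordinates."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError("unsupported location: " + text)
    if match.group(1) is not None:
        # complement(...) means the feature lies on the reverse strand.
        start, end, strand = int(match.group(1)), int(match.group(2)), "-"
    else:
        start, end, strand = int(match.group(3)), int(match.group(4)), "+"
    return start, end, strand, end - start + 1


if __name__ == "__main__":
    print(parse_simple_location("complement(55100..55691)"))
    print(parse_simple_location("101..1750"))
```

For production work an established parser is the safer choice; this only illustrates how strand and length follow from the location string.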
15Noncoding RNA (ncRNA)
- tRNA: transfer RNA, involved in translation
- rRNA: ribosomal RNA, structural component of ribosome, where translation takes place
- snoRNA: small nucleolar RNA, functional/catalytic in RNA maturation
- Antisense RNA: gene regulation/silencing?
16Rfam
- Covariance model searches are extremely compute
intensive. A small model (like tRNA) can search a
sequence database at a rate of around 300
bases/sec. The compute time scales roughly to the
4th power of the length of the RNA, so larger
models quickly become infeasible without
significant compute resources.
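The fourth-power scaling can be turned into a rough back-of-the-envelope estimate. This is a sketch under stated assumptions: the reference model length of 75 nucleotides for tRNA is an assumption (the slide gives only the rate of about 300 bases/sec), the function names are hypothetical, and the pure L^4 law is the slide's approximation rather than a measured benchmark.

```python
def relative_search_cost(model_length, reference_length):
    """Cost of a model relative to a reference, assuming time ~ length**4."""
    return (model_length / reference_length) ** 4


def estimated_seconds(db_bases, model_length,
                      reference_length=75, reference_rate=300.0):
    """Estimate search time in seconds.

    reference_length: assumed tRNA model length in nucleotides (assumption).
    reference_rate: bases/sec for the reference model (from the slide).
    """
    rate = reference_rate / relative_search_cost(model_length, reference_length)
    return db_bases / rate


if __name__ == "__main__":
    # Doubling the model length multiplies the cost by 2**4 = 16.
    print(relative_search_cost(150, 75))
    # Time to scan one megabase with a model twice the reference length.
    print(estimated_seconds(1_000_000, 150))
```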
17BLAST
- Seeks high-scoring segment pairs (HSP)
- pair of sequences that can be aligned without gaps
- when aligned, have maximal aggregate score (score cannot be improved by extension or trimming)
- score must be above score threshold S
- Public Search engines
- WWW search form: http://www.ncbi.nlm.nih.gov/BLAST
- Unix command line: blastall -p progname -d db -i query > outfile
- Making your own search space
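The idea of a maximal ungapped segment can be illustrated on a single diagonal. The sketch below is not BLAST: it assumes two sequences already placed position against position, uses illustrative match/mismatch scores (+1/-2, an assumption), and finds the best-scoring ungapped segment with a maximum-subarray scan, so the returned score cannot be raised by extending or trimming the segment.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-2):
    """Return (score, start, end) of the best ungapped segment; end is exclusive.

    The two sequences are compared position by position (one diagonal only).
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if running <= 0:
            # A non-positive prefix can never help: restart the segment here.
            running, start = 0, i
        running += match if x == y else mismatch
        if running > best[0]:
            best = (running, start, i + 1)
    return best


if __name__ == "__main__":
    score, start, end = best_ungapped_segment("ACGTTTGACC", "ACGAATGACC")
    threshold_s = 4  # hypothetical score threshold S
    print(score, start, end, score > threshold_s)
```

BLAST itself avoids scanning every diagonal by seeding from short word hits and extending outward; the scoring principle of the reported segment is the part shown here.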
18So many matrices...
- Triple-PAM strategy (Altschul, 1991)
- PAM 40: Short alignments, highly similar
- tblastn against ESTs
- PAM 120
- PAM 250: Longer, weaker local alignments
- Looking in the twilight zone
- BLOSUM (Henikoff, 1993)
- BLOSUM 90: Short alignments, highly similar
- BLOSUM 62: Most effective in detecting known members of a protein family
- Standard on NCBI server; works in most cases
- BLOSUM 30: Longer, weaker local alignments
19Protein coding genes in prokaryotes, and simple
eukaryotes
- Use ORF finder
- http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
- Simple ATG/Stop
- Simple link to FASTA formatted files and BLAST.
- Problems
- In frame Methionine
- Small protein
- Solution: comparative genomics
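The simple ATG/Stop rule can be sketched in a few lines. This is an illustrative toy, not the NCBI ORF finder: it scans only the three forward-strand frames, starts each ORF at the first ATG after the previous stop, and uses a hypothetical min_codons cutoff. Both problems listed above show up directly: a later in-frame ATG might be the true start, and a low cutoff reports many spurious small ORFs.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, 0-based and end-exclusive.

    Each ORF runs from an ATG to the first in-frame stop codon (inclusive).
    Only the forward strand is scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the last stop in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATGGTAACC"  # made-up example sequence
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

Scanning the reverse strand would need the same loop run over the reverse complement.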
20Figure 11 from: Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. Kellis M, Patterson N, Birren B, Berger B, Lander ES. J Comput Biol. 2004;11(2-3):319-55.
Saccharomyces cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces bayanus
21Ab initio gene identification
- Goals
- Identify coding exons
- Seek gene structure information
- Get a protein sequence for further analysis
- Relevance
- Characterization of anonymous DNA genomic sequences
- Works on all DNA sequences
22Gene-Finding Strategies
Genomic Sequence
Content-Based
- Bulk properties of sequence
- Open reading frames
- Codon usage
- Repeat periodicity
- Compositional complexity
Site-Based
- Absolute properties of sequence
- Consensus sequences
- Donor and acceptor splice sites
- Transcription factor binding sites
- Polyadenylation signals
- Right ATG start
- Stop codons out-of-context
Comparative
- Inferences based on sequence homology
- Protein sequence with similarity to translated product of query
- Modular structure of proteins usually precludes finding complete gene
23Gene-Finding Methods
Genomic Sequence
Rule-Based
- Cutoff method
- Criteria applied sequentially to identify possible exons
- Rank or eliminate candidates from consideration based on pre-determined cutoff at each step
Neural Network
- Composite method
- Criteria applied in parallel
- Training sets used to optimize performance
- Weight scores in order of importance
24Evaluation Statistics
[Figure: actual vs. predicted coding regions, with segments labelled TP, FP, TN, and FN]
- Sensitivity: Fraction of actual coding regions that are correctly predicted as coding
- Specificity: Fraction of the prediction that is actually correct
- Correlation Coefficient: Combined measure of sensitivity and specificity, ranging from -1 (always wrong) to 1 (always right)
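These three measures follow directly from the TP/FP/TN/FN counts. The sketch below uses the definitions as stated on the slide: note that specificity here is TP/(TP+FP), the gene-finding convention, not the TN/(TN+FP) used in other fields. The correlation coefficient is written in its usual form for this setting (the Matthews correlation coefficient); the counts in the example are invented for illustration.

```python
import math


def sensitivity(tp, fn):
    """Fraction of actual coding positions that were predicted as coding."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted coding positions that are actually coding."""
    return tp / (tp + fp)


def correlation_coefficient(tp, fp, tn, fn):
    """Combined measure ranging from -1 (always wrong) to 1 (always right)."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # undefined when a whole row or column is empty
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    tp, fp, tn, fn = 80, 20, 880, 20  # made-up nucleotide counts
    print(sensitivity(tp, fn))
    print(specificity(tp, fp))
    print(round(correlation_coefficient(tp, fp, tn, fn), 3))
```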
25Relative Performance
                  Claverie 1997              Rogic 2000
Method            Sn (%)   Sp (%)   CC       CC
Individual Exons
MZEF              78       86       0.79
HEXON             71       65       0.64
SorFind           42       47       0.62
GRAIL II          51       57       0.47
Gene Structure
GENSCAN           78       81       0.86     0.91
FGENES            73       78       0.74     0.83
GRAIL II/Gap      51       52       0.66
GeneParser        35       40       0.54
HMMgene                                      0.91
26What works best when?
- Genome survey (draft) data: expect only a single exon in any given stretch of contiguous sequence
- BLASTN vs. dbEST (3' UTR)
- BLASTX vs. nr (protein CDS)
- Finished data: large contigs are available, providing context
- GENSCAN
- HMMgene
27What you need
- Compute the prediction
- Confirm with biological sequences (also with computational tools)
- Integrate all of this
- Annotate genome (often via a GUI, Graphical User Interface)
- Validate
- Re-annotate/Update
- Check it twice
- Submit to GenBank
28Some of the things available
- EnsEMBL (EBI)
- Sequin (NCBI)
- PseudoCAP (SFU)
- GMOD (CSHL)
- Pegasys (UBiC)
- Apollo (EBI/Berkeley)
29ENSEMBL
30http://www.pseudomonas.com/
34http://bioinformatics.ubc.ca/pegasys/
35Pegasys
36Example output: GAME XML (Genome Annotation Markup Elements XML)
- Input to Apollo
- Genome editor created by Berkeley Drosophila group and Ensembl
- Simultaneously view heterogeneous computational evidence
- Manually create and/or edit annotations
37Apollo
- Apollo is a collaborative project between the Berkeley Drosophila Genome Project (www.bdgp.org) and Ensembl (www.ensembl.org). The collaboration was set up to create a tool to initially annotate fly but which would also be able to annotate and browse any large eukaryotic genome. There is a sister developers' website at www.fruitfly.org/annot/apollo to download the fly-specific Apollo annotation tool.
- All the code is open source and freely downloadable.
38Features of Apollo include
- Zoomable and scrollable feature display down to sequence level, optimized for display of large regions of genome.
- User configurable feature types (colour, appearance, size, order, score threshold)
- Can connect directly to the Ensembl web site for the latest human genome annotation
- Reads/writes GFF format
- Searchable for feature names or sequence string
- Ability to select features and sort by different feature attributes
- All features are linked out to their source database web sites (Ensembl, Swiss-Prot, EMBL, UniGene, etc.)
- Display of genomic sequence and any associated start and stop codons
- Prints PostScript output
- Display is reversible, allowing easy interpretation of reverse strand features.
40GenBank Features
-10_signal -35_signal 3'clip 3'UTR 5'clip 5'UTR attenuator CAAT_signal CDS conflict C_region D-loop D_segment enhancer exon
GC_signal gene iDNA intron J_segment LTR mat_peptide misc_binding misc_difference misc_feature misc_recomb misc_RNA misc_signal misc_structure modified_base
mRNA N_region old_sequence polyA_signal polyA_site precursor_RNA primer_bind prim_transcript promoter protein_bind RBS repeat_region repeat_unit rep_origin rRNA
satellite scRNA sig_peptide snoRNA snRNA S_region stem_loop STS TATA_signal terminator transit_peptide tRNA unsure variation V_region V_segment
41GenBank Features the important ones
42GenBank Features the abundant one
43Gene Prediction Caveats
- Predictions are of protein coding regions
- Do not detect non-coding areas (5' and 3' UTR)
- Non-coding RNA genes are missed
- Predictions are for typical genes
- Must predict a beginning and an end
- Partial or multiple genes are often missed
- Training sets may be biased
- Methods are sensitive to GC content
- Weighting of factors may be inordinately biased
44Moving along
- Sequencing technology led genomics, and to some extent bioinformatics
- ESTs complicated things, and were the beginning of specialized methods or functional division in GenBank.
- Yeast chromosomes and bacterial chromosomes rapidly exposed our obvious ineptitude at genome annotation, and these genomes were simple!
- A controlled vocabulary was necessary, albeit slow to be created: Gene Ontology (GO), Sequence Ontology (SO).
45Genome annotation problems
- Assembling the genome
- Analysis and interpretation
- Lack of consistency from gene to gene
- Lack of consistency from person to person
- Lack of controlled vocabulary
- Parts we don't know
- Bacteria vs mammals
- Graphical user interface
- Gene expression/molecular interactions
- Dimensions
- Updates and maintenance
46Some comments about the human genome
- Finished February 15, 2001
- Finished April 25, 2003
- Still not fully understood and definitely not finished.
- We are still in the genomic era.
- To get a full parts list, we need, as a community, to develop a system to rigorously find all of the parts of the human genome
- Genes
- Protein coding sequences
- Non coding RNAs (ncRNA)
- Identify and understand regulatory sequences
- Many other cool things we don't know about!
47The ideal annotation of MyGene
All clones
All SNPs
Promoter(s)
MyGene
All mRNAs
All proteins
- All protein modifications
- Ontologies
- Interactions (complexes, pathways, networks)
- Expression (where and when, and how much)
- Evolutionary relationships
All structures
48Things we will need to integrate in the future
- Better gene predictions
- Haplotypes to map complex diseases
- Micro-array/gene expression data
- SAGE data
- Protein-protein interaction data
- GFP (Green Fluorescent Protein)
- Human-base (Entrez Gene)
- Better standardization of annotation protocols.
- Integration!
49Some Concluding remarks
- Trust but verify
- Beware of gene prediction tools!
- Always use more than one gene prediction tool and more than one genome when possible.
- Active area of bioinformatics research, so be mindful of the new literature in this area.
50http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=113
51http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=39
52Finding records needing to be updated?
- Who updates? Submitters, Journals, 3rd party
- What to update? Gene names, citations, new product, sequencing errors
- Where? update_at_ncbi.nlm.nih.gov
- Why update?
53example
56From francis Wed Mar 3 22:32:19 1999
To: ddbjupdt_at_ddbj.nig.ac.jp
Subject: D25291 mito

Dear colleagues,

it appears that DDBJ record D25291 is contaminated with mitochondrial sequences from nucleotide 673 to 1803, as it is identical to mouse mitochondrial sequence (EMBL V00711) for more than 1100 nucleotides.

I would recommend deleting that segment of the record, or removing the record altogether, as it leads to unfortunate misinterpretation of the data when using GenBank or DDBJ. The protein sequence (which is erroneous, as it is all of mitochondrial origin) should definitely be removed as well.
.
LOCUS       MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
57Sequence Updated
LOCUS       MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
VERSION     D25291.1  GI:1850791

Changed fields: length, Date, DEF, GI, Version

LOCUS       MUSNGH 619 bp mRNA ROD 12-MAR-1999
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            mRNA.
ACCESSION   D25291
VERSION     D25291.2  GI:4520413
60Courses in program
- Required courses
- MBB 505/MEDG 548C PROBLEM BASED LEARNING IN BIOINFORMATICS
- MBB 659 SPECIAL TOPICS IN BIOINFORMATICS
- MBB 841 BIOINFORMATICS
- CMPT 881 THEORETICAL COMPUTING
- CMPT 889 BIOINFORMATICS ALGORITHMS
- CPSC 545 ALGORITHMS FOR BIOINFORMATICS
- Electives
- CMPT 354 DATABASE SYSTEMS I
- CMPT 740 DATA MINING
- CMPT 880 SPECIAL TOPICS IN MEDICAL IMAGE ANALYSIS
- CPSC 304 INTRODUCTION TO RELATIONAL DATABASES
- CPSC 504 DATABASE DESIGN
- HCEP 511 CANCER EPIDEMIOLOGY
- CPSC 53A TOPICS IN ALGORITHMS AND COMPLEXITY BIOINFORMATICS
- INFO 506 CRITICAL RESEARCH ANALYSIS
- MATH 561 MATHEMATICAL BIOLOGY
- MATH 612D TOPICS IN MATHEMATICAL BIOLOGY - MATHEMATICS OF INFECTIOUS DISEASES AND IMMUNOLOGY
- MBB 823 PROTEIN STRUCTURE AND FUNCTION PROTEOMIC BIOINFORMATICS
- MBB 831 MOLECULAR EVOLUTION OF EUKARYOTE GENOMES
- MBB 835 GENOMIC ANALYSIS
- MEDG 505 GENOME ANALYSIS
- STAT 540 STATISTICAL METHODS FOR HIGH DIMENSIONAL BIOLOGY
- STAT 802 MULTIVARIATE ANALYSIS
- STAT 805 NON-PARAMETRIC STATISTICS AND DISCRETE DATA ANALYSIS
- STAT 890 STATISTICS SELECTED TOPICS - BIOMETRICAL GENETICS
61Bioinformatics Faculty/Mentors
- David Baillie, Bioinformatics, Molecular Biology & Biochemistry, SFU
- Fiona Brinkman (on maternity leave June - Sept/06), Molecular Biology & Biochemistry, SFU
- Ryan Brinkman, Medical Genetics, UBC; BC Cancer Research Centre, BCCA
- Jenny Bryan (on maternity leave beginning Jan/06), Statistics and Michael Smith Laboratories, UBC
- Artem Cherkasov, Medicine, Division of Infectious Diseases, UBC
- Ann Condon (on sabbatical until Sept/06), Computer Science, UBC
- Martin Ester, Computing Science, SFU
- Arvind Gupta, Computing Science, SFU
- Phil Hieter, Michael Smith Laboratories, UBC
- Holger Hoos, Computer Science, UBC
- Steven Jones, Program Director, Bioinformatics; Genome Sciences Centre, BCCA
- Marco Marra, Genome Sciences Centre, BCCA
- Francis Ouellette, Director, UBC Bioinformatics Centre (UBiC); Michael Smith Laboratories and Medical Genetics, UBC
- Paul Pavlidis, UBC Bioinformatics Centre (UBiC); Psychiatry, UBC
- Frederic Pio, Molecular Biology & Biochemistry, SFU
- Cenk Sahinalp, Computing Science, SFU
- Wyeth Wasserman, Centre for Molecular Medicine & Therapeutics, UBC
- Mark Wilkinson, Medical Genetics, UBC
62Associate Faculty
- Patrick Keeling, Botany, UBC
- Leah Keshet, Math, UBC
- Ted Kirkpatrick, Computing Science, SFU
- Michael Kobor, Medical Genetics, UBC
- Ben Koop, Biology, UVic
- Jim Kronstad, Michael Smith Laboratories, UBC
- Gerry Krystal, Terry Fox Laboratory, BCCA
- Wan Lam, Cancer Genetics, BCCA
- Peter Lansdorp, Terry Fox Laboratory, BCCA
- Nhu Le, Cancer Control Research, BCCA
- Michel Leroux, Molecular Biology & Biochemistry, SFU
- Victor Ling, Cancer Genetics, BCCA
- Calum MacAuley, Cancer Imaging, BCCA
- Dixie Mager, Terry Fox Laboratory, BCCA
- Brad McNeney, Statistics & Actuarial Science, SFU
- Don Moerman, Zoology, UBC
- Ed Moore, Physiology, UBC
- Gregg Morin, Genome Sciences Centre, BCCA
- Colleen Nelson, Surgery, UBC
- Chris Bajdik, Cancer Control Research, BCCA
- Andrew Beckenbach, Biological Sciences, SFU
- Christopher Beh, Molecular Biology & Biochemistry, SFU
- Bruce Brandhorst, Molecular Biology & Biochemistry, SFU
- Felix Breden, Biological Sciences, SFU
- Hugh Brock, Zoology, UBC
- Angela Brooks-Wilson, Genome Sciences Centre, BCCA
- Andy Coldman, Cancer Control Strategy, BCCA
- Veronica Dahl, Computing Science, SFU
- William Davidson, Molecular Biology & Biochemistry, SFU
- Charmaine Dean, Statistics & Actuarial Science, SFU
- Allen Eaves, Terry Fox Laboratory, BCCA
- Connie Eaves, Terry Fox Laboratory, BCCA
- Eldon Emberly, Physics, SFU
- Joanne Emerman, Anatomy, UBC
- Brett Finlay, Michael Smith Laboratories, UBC
- Rick Gallagher, Cancer Control Research, BCCA
- Raphael Gottardo, Statistics, UBC
63http://bioinformatics.ubc.ca/faculty/
64- Application Deadlines
- Feb 14, 2006 for International applicants
- Mar 21, 2006 for North American applicants.
- For more information, please see the graduate training website at http://bioinformatics.bcgsc.ca.