1
http://creativecommons.org/licenses/by-sa/2.0/
2
Lecture 5.2 Genome Annotation
  • Francis Ouellette
  • francis_at_bioinformatics.ubc.ca

3
New slides
  • All slides which are different from what you have
    in your binder have a mark in the left-hand
    corner of the page. If it's a modification of the
    slide you have in your binder, I've tried to note
    it as such.

4
Outline
  • What we have
  • How we use it
  • What we should get out of it

5
What else?
  • Genome sequences (DNA only) by themselves are not
    as useful as genomes that are fully annotated.
  • The functions of many processes reside in 3D proteins,
    and the structure of proteins and RNA is known
    for only a few sequences.
  • We need to know where the protein-coding sequences
    are, and what they do; this is a very big
    challenge in bioinformatics.
  • Proteins are not everything: there are also
    parts, components and information in DNA and RNA
    (and carbohydrates and lipids).
  • All of this becomes part of the "parts list",
    from which all biology will be understood.

6
Challenges in building the Parts List
  • Finding genes involves computational methods as
    well as experimental validation
  • Computational methods are often inadequate, and
    often generate erroneous (false positive) gene
    sequences which:
  • Are missing exons
  • Have incorrect exons
  • Over-predict genes
  • Are missing the 5' and 3' UTRs

7
Assumptions we make
  • Reductionist approach still works, albeit we are
    now becoming more and more systems biologists
  • Evolution drives everything, and will be the way
    we figure things out. Or, said another way:
  • Evolutionary relationships and comparisons will
    be essential in our efforts to solve and
    understand genomes

8
How we got started
  • GenBank database was populated by common genes
  • rRNA, tRNA
  • Globin
  • Histone
  • ATPases
  • Actin
  • Tubulin and others

10
Things we are looking to annotate?
  • CDS
  • mRNA
  • Alternative RNA
  • Promoter and Poly-A Signal
  • Pseudogenes
  • ncRNA

11
Pseudogenes
  • As many as 20-30% of all genomic sequence
    predictions could be pseudogenes
  • Non-functional copy of a gene
  • Processed pseudogene
  • Retrotransposon derived
  • No 5' promoters
  • No introns
  • Often includes polyA tail
  • Non-processed pseudogene
  • Gene duplication derived
  • Both include events that make the gene
    non-functional
  • Frameshift
  • Stop codons
  • We assume pseudogenes have no function, but we
    really don't know!
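
The two disabling events listed on this slide (frameshifts and stop codons) can be screened for mechanically once a candidate sequence is laid out in the reading frame of its parent gene. The following is only a minimal sketch: the function names are hypothetical, the sequence is assumed to start in frame at position 0, and the stop codons are those of the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def premature_stops(cds):
    """Return 0-based offsets of in-frame stop codons, ignoring the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The last complete codon is allowed to be a stop, so it is skipped.
    return [i * 3 for i in range(n_codons - 1)
            if cds[i * 3:i * 3 + 3] in STOP_CODONS]

def length_breaks_frame(cds):
    """True if the length is not a multiple of 3, which is consistent with
    (but does not prove) a net frameshifting insertion or deletion."""
    return len(cds) % 3 != 0
```

A sequence that passes both checks is not thereby shown to be functional; the checks only flag the obvious disabling events.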

12
LOCUS       NG_005487               1850 bp    DNA     linear   ROD 14-FEB-2006
DEFINITION  Mus musculus ubiquitin-conjugating enzyme E2 variant 2 pseudogene
            (LOC625221) on chromosome 6.
ACCESSION   NG_005487
VERSION     NG_005487.1  GI:87239965
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 1850)
  AUTHORS   Wilson,R.
  TITLE     Mus musculus BAC clone RP24-201D17 from 6
  JOURNAL   Unpublished (2003)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AC121925.2.
FEATURES             Location/Qualifiers
     source          1..1850
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10090"
                     /chromosome="6"
                     /note="AC121925.2 32277..34126"
     gene            101..1750
                     /gene="LOC625221"
                     /pseudo
                     /db_xref="GeneID:625221"
     repeat_region   1792..1827
                     /rpt_family="ID"
ORIGIN
        1 tcttctgcct caattcctca agtgctagta tcatatgccc atgccattat ttttaactcc
       61 cctttttcat gctaagaatt gaacacacgg ccctgcgtgc ggtggtgcgt ctggtagcag
      121 gagaagatgg cggtctccac aggagttaaa gttcctcgta attttcgctt gttggaagaa
13
Noncoding RNA (ncRNA)
  • ncRNA represents 98% of all transcripts in a
    mammalian cell
  • ncRNA has not been taken into account in gene
    counts
  • cDNA
  • ORF computational prediction
  • Comparative genomics looking at ORFs
  • ncRNA can be:
  • Structural
  • Catalytic
  • Regulatory

14
From NW_632744.1
     gene            complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="synonym: CR_tc_AT13310"
                     /db_xref="GeneID:3354945"
     misc_RNA        complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="This annotation is identical to the ncRNA
                     CR_tc_AT13310 annotation, also mapped idenitcally to 2L
                     20224138,20223553 last curated on Thu Jan 15 13:37:02
                     PST 2004"
                     /db_xref="FlyBase:FBgn0058465"
                     /db_xref="GeneID:3354945"
15
Noncoding RNA (ncRNA)
  • tRNA: transfer RNA, involved in translation
  • rRNA: ribosomal RNA, structural component of the
    ribosome, where translation takes place
  • snoRNA: small nucleolar RNA, functional/catalytic
    in RNA maturation
  • Antisense RNA: gene regulation/silencing?

16
Rfam
  • Covariance model searches are extremely compute
    intensive. A small model (like tRNA) can search a
    sequence database at a rate of around 300
    bases/sec. The compute time scales roughly to the
    4th power of the length of the RNA, so larger
    models quickly become infeasible without
    significant compute resources.
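
To make the fourth-power scaling concrete, here is a rough back-of-the-envelope sketch. The 300 bases/sec figure is the one quoted on the slide; the reference model length of 72 nt for a tRNA and the function names are assumptions made for illustration only.

```python
def relative_search_cost(model_len, reference_len):
    """Cost of a model relative to a reference model, assuming time ~ length**4."""
    return (model_len / reference_len) ** 4

def estimated_rate(model_len, reference_len=72, reference_rate=300.0):
    """Estimated search rate in bases/sec.

    reference_len=72 (a typical tRNA length) is an assumption;
    reference_rate=300.0 bases/sec is the figure quoted for a small model.
    """
    return reference_rate / relative_search_cost(model_len, reference_len)
```

Under these assumptions, doubling the model length to 144 nt costs 16 times as much, dropping the estimated rate from 300 to about 19 bases/sec.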

17
BLAST
  • Seeks high-scoring segment pairs (HSP)
  • pair of sequences that can be aligned without
    gaps
  • when aligned, have maximal aggregate score
    (score cannot be improved by extension or
    trimming)
  • score must be above score threshold S
  • Public Search engines
  • WWW search form: http://www.ncbi.nlm.nih.gov/BLAST
  • Unix command line: blastall -p progname -d db -i
    query > outfile
  • Making your own search space
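
The HSP definition above (ungapped, score not improvable by extension or trimming, score at least S) can be illustrated with a small sketch. This is not how BLAST itself is implemented (BLAST seeds on short words and extends them using a substitution matrix); it simply scans one fixed diagonal with assumed match/mismatch scores of +1/-2, and both function names are hypothetical.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-2):
    """Return (score, start, end) of the highest-scoring ungapped segment
    when a[i] is compared with b[i]; end is exclusive."""
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if run_score <= 0:
            # A non-positive prefix can only lower the score, so trim it.
            run_score, run_start = 0, i
        run_score += match if x == y else mismatch
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best

def is_hsp(a, b, threshold):
    """True if the best segment reaches the score threshold S."""
    score, _, _ = best_ungapped_segment(a, b)
    return score >= threshold
```

For example, comparing "AAAACCCC" with "AAAAGGGG" yields the segment covering the first four positions with score 4; extending it into the mismatching tail would only lower the score.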

18
So many matrices...
  • Triple-PAM strategy (Altschul, 1991)
  • PAM 40: Short alignments, highly similar
  • tblastn against ESTs
  • PAM 120
  • PAM 250: Longer, weaker local alignments
  • Looking in the twilight zone
  • BLOSUM (Henikoff, 1993)
  • BLOSUM 90: Short alignments, highly similar
  • BLOSUM 62: Most effective in detecting
    known members of a protein family
  • Standard on NCBI server; works in most cases
  • BLOSUM 30: Longer, weaker local alignments

19
Protein coding genes in prokaryotes, and simple
eukaryotes
  • Use ORF finder
  • http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
  • Simple ATG/Stop
  • Simple link to FASTA formatted files and BLAST.
  • Problems:
  • In-frame methionine
  • Small protein
  • Solution: comparative genomics
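
The "simple ATG/Stop" approach can be sketched in a few lines. This is an illustrative sketch, not the NCBI ORF finder: it scans only the three forward frames, uses the standard stop codons, and the function name and min_codons parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs running from the
    first ATG to the next in-frame stop codon; end is exclusive and
    includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG wins; later in-frame ATGs are ignored
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

The sketch makes both listed problems visible: it always takes the first ATG, so a true start at a downstream in-frame methionine is missed, and any length cutoff such as min_codons will discard genuinely small proteins.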

20
Figure 11 from "Methods in comparative genomics:
genome correspondence, gene identification and
regulatory motif discovery". Kellis M, Patterson
N, Birren B, Berger B, Lander ES. J Comput Biol.
2004;11(2-3):319-55.
Species compared: Saccharomyces cerevisiae, Saccharomyces
paradoxus, Saccharomyces mikatae, Saccharomyces
bayanus
21
Ab initio gene identification
  • Goals
  • Identify coding exons
  • Seek gene structure information
  • Get a protein sequence for further analysis
  • Relevance
  • Characterization of anonymous DNA genomic
    sequences
  • Works on all DNA sequences

22
Gene-Finding Strategies
Genomic Sequence: Content-Based, Site-Based, Comparative
Content-Based
  • Bulk properties of sequence
  • Open reading frames
  • Codon usage
  • Repeat periodicity
  • Compositional complexity
Site-Based
  • Absolute properties of sequence
  • Consensus sequences
  • Donor and acceptor splice sites
  • Transcription factor binding sites
  • Polyadenylation signals
  • Right ATG start
  • Stop codons out-of-context
Comparative
  • Inferences based on sequence homology
  • Protein sequence with similarity
    to translated product of query
  • Modular structure of proteins
    usually precludes finding complete gene
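
As a small illustration of the difference between a content-based and a site-based signal, the sketch below computes codon usage (a bulk property) and checks an intron for the canonical GT...AG donor/acceptor dinucleotides (an absolute signal). Both function names are hypothetical, and real gene finders use far richer statistical models for each.

```python
from collections import Counter

def codon_usage(cds):
    """Content-based: relative frequency of each codon in an in-frame sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def has_canonical_splice_sites(intron):
    """Site-based: True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")
```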

23
Gene-Finding Methods
Genomic Sequence
Neural Network
Rule-Based
  • Cutoff method
  • Criteria applied sequentially to identify
    possible exons
  • Rank or eliminate candidates from
    consideration based on pre-determined cutoff
    at each step
  • Composite method
  • Criteria applied in parallel
  • Training sets used to optimize performance
  • Weight scores in order of importance

24
Evaluation Statistics
[Figure: actual vs. predicted coding regions along a sequence, with stretches labelled TP, FP, TN and FN]
  • Sensitivity: fraction of actual coding regions
    that are correctly predicted as coding
  • Specificity: fraction of the prediction
    that is actually correct
  • Correlation coefficient: combined measure of
    sensitivity and specificity, ranging from
    -1 (always wrong) to +1 (always right)
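
These three statistics can be computed directly from the four counts. The sketch below follows the definitions given on this slide, so specificity is TP / (TP + FP), as is conventional in the gene-prediction literature; the correlation coefficient is assumed to be the usual Matthews-style formula, and the function name is hypothetical.

```python
import math

def evaluation_statistics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity here is the fraction of the prediction that is correct,
    TP / (TP + FP), matching the definition on the slide.
    """
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # The coefficient is undefined when any marginal total is zero.
    cc = (tp * tn - fp * fn) / denominator if denominator else float("nan")
    return sensitivity, specificity, cc
```

For instance, with 80 true positives, 20 false positives, 880 true negatives and 20 false negatives (made-up counts), sensitivity and specificity are both 0.8 and the correlation coefficient is about 0.78.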
25
Relative Performance
                       Claverie 1997             Rogic 2000
                       Sn (%)   Sp (%)   CC      CC
Individual Exons
  MZEF                 78       86       0.79
  HEXON                71       65       0.64
  SorFind              42       47       0.62
  GRAIL II             51       57       0.47
Gene Structure
  GENSCAN              78       81       0.86    0.91
  FGENES               73       78       0.74    0.83
  GRAIL II/Gap         51       52       0.66
  GeneParser           35       40       0.54
  HMMgene                                        0.91

26
What works best when?
  • Genome survey (draft) data: expect only a single
    exon in any given stretch of contiguous sequence
  • BLASTN vs. dbEST (3' UTR)
  • BLASTX vs. nr (protein CDS)
  • Finished data: large contigs are available,
    providing context
  • GENSCAN
  • HMMgene

27
What you need
  • Compute the prediction
  • Confirm with biological sequences (also with
    computational tools)
  • Integrate all of this
  • Annotate genome (often via a GUI Graphical User
    Interface)
  • Validate
  • Re-annotate/Update
  • Check it twice
  • Submit to GenBank

28
Some of the things available
  • EnsEMBL (EBI)
  • Sequin (NCBI)
  • PseudoCAP (SFU)
  • GMOD (CSHL)
  • Pegasys (UBiC)
  • Apollo (EBI/Berkeley)

29
ENSEMBL
30
http://www.pseudomonas.com/
34
http://bioinformatics.ubc.ca/pegasys/
35
Pegasys
36
Example output: GAME XML (Genome Annotation
Markup Elements XML)
  • Input to Apollo
  • Genome editor created by Berkeley Drosophila
    group and Ensembl
  • Simultaneously view heterogeneous computational
    evidence
  • Manually create and/or edit annotations

37
Apollo
  • Apollo is a collaborative project between the
    Berkeley Drosophila Genome Project (www.bdgp.org)
    and Ensembl (www.ensembl.org). The collaboration
    was set up to create a tool to initially annotate
    fly, but which would also be able to annotate and
    browse any large eukaryotic genome. There is
    a sister developers' website at
    www.fruitfly.org/annot/apollo to download the
    fly-specific Apollo annotation tool.
  • All the code is open source and freely
    downloadable.

38
Features of Apollo include
  • Zoomable and scrollable feature display down to
    sequence level optimized for display of large
    regions of genome.
  • User configurable feature types (colour,
    appearance, size, order, score threshold)
  • Connects directly to the Ensembl web site for
    the latest human genome annotation
  • Reads/writes GFF format
  • Searchable for feature names or sequence string
  • Ability to select features and sort by
    different feature attributes
  • All features are linked out to their source
    database web sites (Ensembl, Swiss-Prot, EMBL,
    UniGene, etc.)
  • Display of genomic sequence and any associated
    start and stop codons
  • Prints postscript output
  • Display is reversible allowing easy
    interpretation of reverse strand features.

40
GenBank Features
-10_signal  -35_signal  3'clip  3'UTR  5'clip  5'UTR  attenuator
CAAT_signal  CDS  conflict  C_region  D-loop  D_segment  enhancer  exon
GC_signal  gene  iDNA  intron  J_segment  LTR  mat_peptide
misc_binding  misc_difference  misc_feature  misc_recomb  misc_RNA
misc_signal  misc_structure  modified_base
mRNA  N_region  old_sequence  polyA_signal  polyA_site  precursor_RNA
primer_bind  prim_transcript  promoter  protein_bind  RBS  repeat_region
repeat_unit  rep_origin  rRNA
satellite  scRNA  sig_peptide  snoRNA  snRNA  S_region  stem_loop  STS
TATA_signal  terminator  transit_peptide  tRNA  unsure  variation
V_region  V_segment
41
GenBank Features the important ones
42
GenBank Features the abundant one
43
Gene Prediction Caveats
  • Predictions are of protein coding regions
  • Do not detect non-coding areas (5' and 3' UTR)
  • Non-coding RNA genes are missed
  • Predictions are for typical genes
  • Must predict a beginning and an end
  • Partial or multiple genes are often missed
  • Training sets may be biased
  • Methods are sensitive to GC content
  • Weighting of factors may be inordinately biased

44
Moving along
  • Sequencing technology led genomics, and to some
    extent bioinformatics
  • ESTs complicated things, and were the beginning
    of specialized methods or functional divisions
    in GenBank.
  • Yeast chromosomes and bacterial chromosomes
    rapidly exposed our obvious ineptitude at
    genome annotation, and these genomes were
    simple!
  • A controlled vocabulary was necessary, albeit
    slow to be created: Gene Ontology (GO), Sequence
    Ontology (SO).

45
Genome annotation problems
  • Assembling the genome
  • Analysis interpretation
  • Lack of consistency from gene to gene
  • Lack of consistency from person to person
  • Lack of controlled vocabulary
  • Parts we don't know
  • Bacteria vs mammals
  • Graphical user interface
  • Gene expression/molecular interactions
  • Dimensions
  • Updates and maintenance

46
Some comments about the human genome
  • Finished February 15, 2001
  • Finished April 25, 2003
  • Still not fully understood and definitely not
    finished.
  • We are still in the genomic era.
  • To get a full parts list, we need, as a
    community, to develop a system to rigorously find
    all of the parts of the human genome
  • Genes
  • Protein coding sequences
  • Non-coding RNAs (ncRNA)
  • Identify and understand regulatory sequences
  • Many other cool things we don't know about!

47
The ideal annotation of MyGene
All clones
All SNPs
Promoter(s)
MyGene
All mRNAs
All proteins
  • All protein modifications
  • Ontologies
  • Interactions (complexes, pathways, networks)
  • Expression (where and when, and how much)
  • Evolutionary relationships

All structures
48
Things we will need to integrate in the future
  • Better gene predictions
  • Haplotypes to map complex diseases
  • Micro-array/gene expression data
  • SAGE data
  • Protein-protein interaction data
  • GFP (Green Fluorescent Protein)
  • Human-base (Entrez Gene)
  • Better standardization of annotation protocols.
  • Integration!

49
Some Concluding remarks
  • Trust but verify
  • Beware of gene prediction tools!
  • Always use more than one gene prediction tool and
    more than one genome when possible.
  • This is an active area of bioinformatics research, so be
    mindful of the new literature in this area.

50
http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=113
51
http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=39
52
Finding records needing to be updated?
  • Who updates? Submitters, Journals, 3rd party
  • What to update? Gene names, citations, new
    product, sequencing errors
  • Where? update_at_ncbi.nlm.nih.gov
  • Why update?

53
example
56
From francis Wed Mar 3 22:32:19 1999
To: ddbjupdt_at_ddbj.nig.ac.jp
Subject: D25291 mito

Dear colleagues,

It appears that DDBJ record D25291 is contaminated
with mitochondrial sequences from nucleotide
673 to 1803, as it is identical to mouse
mitochondrial sequence (EMBL V00711) for more
than 1100 nucleotides.

I would recommend deleting that segment of the
record, or removing the record altogether, as
it leads to unfortunate misinterpretation of the
data when using GenBank or DDBJ. The protein
sequence (which is erroneous, as it is
all of mitochondrial origin) should definitely be
removed as well.

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
57
Sequence Updated
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
VERSION     D25291.1  GI:1850791
[Callouts marking what changed between the two versions: length, date, DEF, GI, version]
LOCUS       MUSNGH        619 bp    mRNA            ROD       12-MAR-1999
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            mRNA.
ACCESSION   D25291
VERSION     D25291.2  GI:4520413
58
(No Transcript)
59
(No Transcript)
60
Courses in program
  • Required courses
  • MBB 505/MEDG 548C PROBLEM BASED LEARNING IN
    BIOINFORMATICS
  • MBB 659  SPECIAL TOPICS IN BIOINFORMATICS
  • MBB 841 BIOINFORMATICS
  • CMPT 881 THEORETICAL COMPUTING
  • CMPT 889 BIOINFORMATICS ALGORITHMS
  • CPSC 545 ALGORITHMS FOR BIOINFORMATICS
  • Electives
  • CMPT 354 DATABASE SYSTEMS I
  • CMPT 740 DATA MINING
  • CMPT 880 SPECIAL TOPICS IN MEDICAL IMAGE
    ANALYSIS
  • CPSC 304 INTRODUCTION TO RELATIONAL DATABASES
  • CPSC 504 DATABASE DESIGN
  • HCEP 511 CANCER EPIDEMIOLOGY
  • CPSC 53A TOPICS IN ALGORITHMS AND COMPLEXITY
    BIOINFORMATICS
  • INFO 506 CRITICAL RESEARCH ANALYSIS
  • MATH 561 MATHEMATICAL BIOLOGY
  • MATH 612D TOPICS IN MATHEMATICAL BIOLOGY
    -MATHEMATICS OF INFECTIOUS DISEASES AND
    IMMUNOLOGY
  • MBB 823 PROTEIN STRUCTURE AND FUNCTION
    PROTEOMIC BIOINFORMATICS
  • MBB 831 MOLECULAR EVOLUTION OF EUKARYOTE
    GENOMES
  • MBB 835 GENOMIC ANALYSIS
  • MEDG 505 GENOME ANALYSIS
  • STAT 540 STATISTICAL METHODS FOR HIGH
    DIMENSIONAL BIOLOGY
  • STAT 802 MULTIVARIATE ANALYSIS
  • STAT 805 NON-PARAMETRIC STATISTICS AND DISCRETE
    DATA ANALYSIS
  • STAT 890 STATISTICS SELECTED TOPICS -
    BIOMETRICAL GENETICS

61
Bioinformatics Faculty/Mentors
  • David Baillie Bioinformatics, Molecular Biology
    Biochemistry, SFU
  • Fiona Brinkman (on maternity leave June -
    Sept/06) Molecular Biology Biochemistry, SFU
  • Ryan Brinkman Medical Genetics, UBC, BC Cancer
    Research Centre, BCCA
  • Jenny Bryan (on maternity leave beginning Jan/06)
    Statistics and Michael Smith Laboratories, UBC
  • Artem Cherkasov Medicine, Division of Infectious
    Diseases, UBC
  • Ann Condon  (on sabbatical until Sept/06)
    Computer Science, UBC
  • Martin Ester Computing Science, SFU
  • Arvind Gupta Computing Science, SFU
  • Phil Hieter Michael Smith Laboratories, UBC
  • Holger Hoos Computer Science, UBC
  • Steven Jones Program Director, Bioinformatics;
    Genome Sciences Centre, BCCA
  • Marco Marra Genome Sciences Centre, BCCA
  • Francis Ouellette Director, UBC Bioinformatics
    Centre (UBiC) Michael Smith Laboratories and
    Medical Genetics, UBC
  • Paul Pavlidis UBC Bioinformatics Centre (UBiC)
    Psychiatry, UBC
  • Frederic Pio Molecular Biology Biochemistry,
    SFU
  • Cenk Sahinalp Computing Science, SFU
  • Wyeth Wasserman Centre for Molecular Medicine
    Therapeutics, UBC
  • Mark Wilkinson Medical Genetics, UBC

62
Associate Faculty
  • Patrick Keeling, Botany, UBC
  • Leah Keshet, Math, UBC
  • Ted Kirkpatrick, Computing Science, SFU 
  • Michael Kobor, Medical Genetics, UBC 
  • Ben Koop, Biology, UVic
  • Jim Kronstad, Michael Smith Laboratories, UBC  
  • Gerry Krystal, Terry Fox Laboratory, BCCA
  • Wan Lam, Cancer Genetics, BCCA
  • Peter Lansdorp, Terry Fox Laboratory, BCCA
  • Nhu Le, Cancer Control Research, BCCA
  • Michel Leroux, Molecular Biology Biochemistry,
    SFU  
  • Victor Ling, Cancer Genetics, BCCA
  • Calum MacAuley, Cancer Imaging, BCCA
  • Dixie Mager, Terry Fox Laboratory, BCCA
  • Brad McNeney, Statistical Actuarial Science,
    SFU  
  • Don Moerman, Zoology, UBC
  • Ed Moore, Physiology, UBC
  • Gregg Morin, Genome Sciences Centre, BCCA
  • Colleen Nelson, Surgery, UBC
  • Chris Bajdik, Cancer Control Research, BCCA
  • Andrew Beckenbach, Biological Sciences, SFU  
  • Christopher Beh, Molecular Biology
    Biochemistry, SFU  
  • Bruce Brandhorst, Molecular Biology
    Biochemistry, SFU  
  • Felix Breden, Biological Sciences, SFU
  • Hugh Brock, Zoology, UBC
  • Angela Brooks-Wilson, Genome Sciences Centre,
    BCCA
  • Andy Coldman, Cancer Control Strategy, BCCA
  • Veronica Dahl, Computing Science, SFU  
  • William Davidson, Molecular Biology
    Biochemistry, SFU  
  • Charmaine Dean, Statistical Actuarial Science,
    SFU  
  • Allen Eaves, Terry Fox Laboratory, BCCA
  • Connie Eaves, Terry Fox Laboratory, BCCA
  • Eldon Emberly, Physics, SFU
  • Joanne Emerman, Anatomy, UBC
  • Brett Finlay, Michael Smith Laboratories, UBC  
  • Rick Gallagher, Cancer Control Research, BCCA
  • Raphael Gottardo, Statistics, UBC

63
http://bioinformatics.ubc.ca/faculty/
64
  • Application Deadlines
  • Feb 14, 2006 for International applicants
  • Mar 21, 2006 for North American applicants.
  • For more information, please see the graduate
    training website at
    http://bioinformatics.bcgsc.ca.