About This Presentation

Slides: 64

Provided by: FrancisO6

Transcript and Presenter's Notes

1
http://creativecommons.org/licenses/by-sa/2.0/
2
Lecture 5.2 Genome Annotation

Francis Ouellette
francis_at_bioinformatics.ubc.ca

3
New slides

All slides which are different than what you have
in your binder are marked in the left-hand
corner of the page. If it's a modification of the
slide you have in your binder, I've tried to note
it as such.

4
Outline

What we have
How we use it
What we should get out of it

5
What else?

Genome sequences (DNA only) by themselves are not
as useful as genomes that are fully annotated.
The functions of many processes reside in 3D
proteins, and the structure of proteins and RNA is
known for only a few sequences.
We need to know where the protein-coding sequences
are, and what they do: this is a very big
challenge in bioinformatics.
Proteins are not everything: there are also
parts, components and information in DNA and RNA
(and carbohydrates and lipids).
All of this becomes part of the "parts list",
from which all biology will be understood.

6
Challenges at building the Parts List

Finding genes involves computational methods as
well as experimental validation
Computational methods are often inadequate, and
often generate erroneous (false positive) gene
sequences which:
Are missing exons
Have incorrect exons
Over-predict genes
Are missing the 5' and 3' UTRs

7
Assumptions we make

Reductionist approach still works, albeit we are
now becoming more and more systems biologists
Evolution drives everything, and will be the way
we figure things out. Or, said another way:
evolutionary relationships and comparisons will
be essential in our efforts to solve and
understand genomes

8
How we got started

GenBank database was populated by common genes
rRNA, tRNA
Globin
Histone
ATPases
Actin
Tubulin and others

10
Things we are looking to annotate?

CDS
mRNA
Alternative RNA
Promoter and Poly-A Signal
Pseudogenes
ncRNA

11
Pseudogenes

As many as 20-30% of all genomic sequence
predictions could be pseudogenes
Non-functional copy of a gene
Processed pseudogene
Retro-transposon derived
No 5' promoters
No introns
Often includes polyA tail
Non-processed pseudogene
Gene duplication derived
Both include events that make the gene
non-functional
Frameshift
Stop codons
We assume pseudogenes have no function, but we
really don't know!
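
As a rough illustration of the disabling events listed above (frameshifts and stop codons), the following Python sketch reads a candidate coding sequence in frame and reports simple warning signs. The function name `disabling_features` is hypothetical, and the check is deliberately naive: real pseudogene detection relies on aligning the candidate to a functional parent gene, which this sketch does not attempt.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disabling_features(cds):
    """Return simple hints that a CDS-like sequence may be disabled.

    Assumes the sequence is meant to start at the first base of a codon.
    """
    cds = cds.upper()
    hints = []
    # A length that is not a multiple of 3 is a crude frameshift hint.
    if len(cds) % 3 != 0:
        hints.append("length not a multiple of 3 (possible frameshift)")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Any stop codon before the final complete codon is premature.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            hints.append(f"premature stop {codon} at codon {index + 1}")
    return hints
```

An empty list only means that none of these two naive checks fired; it does not show that the sequence is functional.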

12
LOCUS       NG_005487               1850 bp    DNA     linear   ROD 14-FEB-2006
DEFINITION  Mus musculus ubiquitin-conjugating enzyme E2 variant 2 pseudogene
            (LOC625221) on chromosome 6.
ACCESSION   NG_005487
VERSION     NG_005487.1  GI:87239965
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 1850)
  AUTHORS   Wilson,R.
  TITLE     Mus musculus BAC clone RP24-201D17 from 6
  JOURNAL   Unpublished (2003)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AC121925.2.
FEATURES             Location/Qualifiers
     source          1..1850
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10090"
                     /chromosome="6"
                     /note="AC121925.2 32277..34126"
     gene            101..1750
                     /gene="LOC625221"
                     /pseudo
                     /db_xref="GeneID:625221"
     repeat_region   1792..1827
                     /rpt_family="ID"
ORIGIN
        1 tcttctgcct caattcctca agtgctagta tcatatgccc atgccattat ttttaactcc
       61 cctttttcat gctaagaatt gaacacacgg ccctgcgtgc ggtggtgcgt ctggtagcag
      121 gagaagatgg cggtctccac aggagttaaa gttcctcgta attttcgctt gttggaagaa
13
Noncoding RNA (ncRNA)

ncRNA represent 98% of all transcripts in a
mammalian cell
ncRNA have not been taken into account in gene
counts
cDNA
ORF computational prediction
Comparative genomics looking at ORF
ncRNA can be
Structural
Catalytic
Regulatory

14
From NW_632744.1
     gene            complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="synonym: CR_tc_AT13310"
                     /db_xref="GeneID:3354945"
     misc_RNA        complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="This annotation is identical to the ncRNA
                     CR_tc_AT13310 annotation, also mapped idenitcally to 2L
                     20224138,20223553 last curated on Thu Jan 15 13:37:02 PST
                     2004"
                     /db_xref="FlyBase:FBgn0058465"
                     /db_xref="GeneID:3354945"
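
Feature locations such as `complement(55100..55691)` in the record above carry both coordinates and strand. A minimal Python sketch of reading them follows. The function name `parse_simple_location` is hypothetical, and it handles only the two simplest forms (`start..end` and `complement(start..end)`); joined, fuzzy, or remote locations are assumed to be out of scope and are rejected.

```python
import re

# Matches either complement(start..end) or a plain start..end.
_LOCATION = re.compile(
    r"complement\((?P<cs>\d+)\.\.(?P<ce>\d+)\)|(?P<s>\d+)\.\.(?P<e>\d+)"
)


def parse_simple_location(text):
    """Return (start, end, strand) for a simple feature location.

    Coordinates are 1-based and inclusive, as written in the record.
    """
    match = _LOCATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    if match.group("cs") is not None:
        return int(match.group("cs")), int(match.group("ce")), "-"
    return int(match.group("s")), int(match.group("e")), "+"
```

With inclusive coordinates, the feature length is `end - start + 1`, so the misc_RNA feature above spans 592 bases on the reverse strand.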
15
Noncoding RNA (ncRNA)

tRNA (transfer RNA): involved in translation
rRNA (ribosomal RNA): structural component of
ribosome, where translation takes place
snoRNA (small nucleolar RNA): functional/catalytic
in RNA maturation
Antisense RNA: gene regulation/silencing?

16
Rfam

Covariance model searches are extremely compute
intensive. A small model (like tRNA) can search a
sequence database at a rate of around 300
bases/sec. The compute time scales roughly to the
4th power of the length of the RNA, so larger
models quickly become infeasible without
significant compute resources.
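
To get a feel for the scaling described above, here is a back-of-the-envelope Python sketch. It takes the quoted figure of about 300 bases/sec for a tRNA-sized model and applies the rough 4th-power rule. The reference length of 72 nucleotides for a tRNA model is an assumption made here for illustration, and the function name is hypothetical; the result is an order-of-magnitude estimate, not a benchmark.

```python
def estimated_search_seconds(db_bases, model_length,
                             ref_length=72, ref_rate=300.0):
    """Estimate covariance-model search time in seconds.

    ref_rate (bases/sec) is the quoted speed for a small model;
    ref_length (nt) is an ASSUMED typical tRNA model length.
    Time per base is taken to grow with (model_length / ref_length) ** 4.
    """
    rate = ref_rate * (ref_length / model_length) ** 4
    return db_bases / rate
```

Under these assumptions, doubling the model length makes the same search 16 times slower, which is why larger models quickly become impractical on modest hardware.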

17
BLAST

Seeks high-scoring segment pairs (HSPs)
A pair of sequences that can be aligned without
gaps
When aligned, have maximal aggregate score
(score cannot be improved by extension or
trimming)
Score must be above score threshold S
Public search engines
WWW search form: http://www.ncbi.nlm.nih.gov/BLAST
Unix command line: blastall -p progname -d db -i query > outfile
Making your own search space
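
The HSP definition above (an ungapped aligned pair whose score cannot be improved by extension or trimming) can be illustrated on a single fixed diagonal with a maximum-scoring-segment scan. This is a minimal sketch and not how BLAST itself works internally (BLAST seeds from word hits and then extends them). The +1/-1 match/mismatch scores and the function name `best_ungapped_segment` are illustrative assumptions.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-1):
    """Return (score, start, end) of the best ungapped segment.

    Position i of `a` is compared with position i of `b` (one diagonal).
    `end` is exclusive. A score of 0 means no positive-scoring segment.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        # A non-positive running score cannot help a later segment,
        # so trim it away and restart here.
        if running <= 0:
            running, start = 0, i
        running += match if x == y else mismatch
        if running > best[0]:
            best = (running, start, i + 1)
    return best


score, start, end = best_ungapped_segment("TTACGTACGTT", "GGACGTACGAA")
print(score, start, end)  # prints: 7 2 9
```

The segment would then be kept only if its score is above the chosen threshold S.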

18
So many matrices...

Triple-PAM strategy (Altschul, 1991)
PAM 40: Short alignments, highly similar
tblastn against ESTs
PAM 120
PAM 250: Longer, weaker local alignments
Looking in the twilight zone
BLOSUM (Henikoff, 1993)
BLOSUM 90: Short alignments, highly similar
BLOSUM 62: Most effective in detecting
known members of a protein family
Standard on NCBI server; works in most cases
BLOSUM 30: Longer, weaker local alignments

19
Protein coding genes in prokaryotes, and simple
eukaryotes

Use ORF finder
http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
Simple ATG/Stop
Simple link to FASTA-formatted files and BLAST.
Problems
In-frame methionine
Small proteins
Solution: comparative genomics
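
The "simple ATG/Stop" idea above can be sketched in a few lines of Python. This is an illustrative sketch, not the NCBI ORF finder: it scans only the forward strand, in three frames, and the function name `find_orfs` and the `min_codons` parameter are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs, forward strand only.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # the first ATG in frame is taken as the start
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

The sketch makes both listed problems visible: it always takes the first in-frame ATG, which need not be the true start methionine, and any length cutoff chosen to suppress noise will also discard genuinely small proteins.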

20
Figure 11 from "Methods in comparative genomics:
genome correspondence, gene identification and
regulatory motif discovery." Kellis M, Patterson
N, Birren B, Berger B, Lander ES. J Comput Biol.
2004;11(2-3):319-55.
Saccharomyces cerevisiae, Saccharomyces
paradoxus, Saccharomyces mikatae, Saccharomyces
bayanus
21
Ab initio gene identification

Goals
Identify coding exons
Seek gene structure information
Get a protein sequence for further analysis
Relevance
Characterization of anonymous DNA genomic
sequences
Works on all DNA sequences

22
Gene-Finding Strategies
Genomic Sequence: Content-Based, Site-Based, Comparative

Content-Based: bulk properties of sequence
Open reading frames
Codon usage
Repeat periodicity
Compositional complexity

Site-Based: absolute properties of sequence
Consensus sequences
Donor and acceptor splice sites
Transcription factor binding sites
Polyadenylation signals
Right ATG start
Stop codons out-of-context

Comparative: inferences based on sequence homology
Protein sequence with similarity
to translated product of query
Modular structure of proteins
usually precludes finding complete gene

23
Gene-Finding Methods
Genomic Sequence: Rule-Based, Neural Network

Rule-Based: cutoff method
Criteria applied sequentially to identify
possible exons
Rank or eliminate candidates from
consideration based on pre-determined cutoff
at each step

Neural Network: composite method
Criteria applied in parallel
Training sets used to optimize performance
Weight scores in order of importance
24
Evaluation Statistics
(Figure: actual versus predicted coding regions,
with each stretch of sequence scored as TP, FP, TN
or FN)
Sensitivity: Fraction of actual coding regions
that are correctly predicted as coding
Specificity: Fraction of the prediction
that is actually correct
Correlation Coefficient: Combined measure of
sensitivity and specificity, ranging from -1
(always wrong) to +1 (always right)
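
The three statistics above can be computed directly from the TP, FP, TN and FN counts. The Python sketch below follows the slide's definitions: sensitivity is TP/(TP+FN), and specificity is used in the gene-finding sense of TP/(TP+FP). The slide does not give a formula for the correlation coefficient; the one used here is the commonly used form of it and is an assumption on that point, as is the function name `evaluate`.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Counts are typically per nucleotide or per exon.
    Assumes no marginal total is zero (otherwise division fails).
    """
    sensitivity = tp / (tp + fn)   # fraction of real coding found
    specificity = tp / (tp + fp)   # fraction of prediction that is right
    denominator = math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    correlation = (tp * tn - fp * fn) / denominator
    return sensitivity, specificity, correlation
```

For example, with 80 true positives, 20 false positives, 880 true negatives and 20 false negatives, both sensitivity and specificity are 0.8 and the correlation coefficient is about 0.78.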
25
Relative Performance

                   Claverie 1997            Rogic 2000
                   Sn (%)   Sp (%)   CC     CC
Individual Exons
  MZEF             78       86       0.79
  HEXON            71       65       0.64
  SorFind          42       47       0.62
  GRAIL II         51       57       0.47
Gene Structure
  GENSCAN          78       81       0.86   0.91
  FGENES           73       78       0.74   0.83
  GRAIL II/Gap     51       52       0.66
  GeneParser       35       40       0.54
  HMMgene                                   0.91

26
What works best when?

Genome survey (draft) data: expect only a single
exon in any given stretch of contiguous sequence
BLASTN vs. dbEST (3' UTR)
BLASTX vs. nr (protein CDS)
Finished data: large contigs are available,
providing context
GENSCAN
HMMgene

27
What you need

Compute the prediction
Confirm with biological sequences (also with
computational tools)
Integrate all of this
Annotate genome (often via a GUI, a Graphical User
Interface)
Validate
Re-annotate/Update
Check it twice
Submit to GenBank

28
Some of the things available

EnsEMBL (EBI)
Sequin (NCBI)
PseudoCAP (SFU)
GMOD (CSHL)
Pegasys (UBiC)
Apollo (EBI/Berkeley)

29
ENSEMBL
30
http://www.pseudomonas.com/
34
http://bioinformatics.ubc.ca/pegasys/
35
Pegasys
36
Example output GAME XML (Genome Annotation
Markup Elements XML)

Input to Apollo
Genome editor created by Berkeley Drosophila
group and Ensembl
Simultaneously view heterogeneous computational
evidence
Manually create and/or edit annotations

37
Apollo

Apollo is a collaborative project between the
Berkeley Drosophila Genome Project (www.bdgp.org)
and Ensembl (www.ensembl.org). The collaboration
was set up to create a tool to initially annotate
fly but which would also be able to annotate and
browse any large eukaryotic genome. There is
a sister developers' website at
www.fruitfly.org/annot/apollo to download the fly
specific Apollo annotation tool.
All the code is open source and freely
downloadable.

38
Features of Apollo include

Zoomable and scrollable feature display down to
sequence level optimized for display of large
regions of genome.
User configurable feature types (colour,
appearance, size, order, score threshold)
Can connect directly to the Ensembl web site for
the latest human genome annotation
Reads/writes GFF format
Searchable for feature names or sequence string
Ability to select features and sort by
different feature attributes
All features are linked out to their source
database web sites (Ensembl, Swiss-Prot, EMBL,
UniGene, etc.)
Display of genomic sequence and any associated
start and stop codons
Prints postscript output
Display is reversible allowing easy
interpretation of reverse strand features.

40
GenBank Features
-10_signal -35_signal 3'clip 3'UTR 5'clip 5'UTR
attenuator CAAT_signal CDS conflict C_region D-loop
D_segment enhancer exon
GC_signal gene iDNA intron J_segment LTR
mat_peptide misc_binding misc_difference misc_feature
misc_recomb misc_RNA misc_signal misc_structure
modified_base
mRNA N_region old_sequence polyA_signal polyA_site
precursor_RNA primer_bind prim_transcript promoter
protein_bind RBS repeat_region repeat_unit
rep_origin rRNA
satellite scRNA sig_peptide snoRNA snRNA S_region
stem_loop STS TATA_signal terminator
transit_peptide tRNA unsure variation V_region V_segment
43
Gene Prediction Caveats

Predictions are of protein coding regions
Do not detect non-coding areas (5' and 3' UTR)
Non-coding RNA genes are missed
Predictions are for typical genes
Must predict a beginning and an end
Partial or multiple genes are often missed
Training sets may be biased
Methods are sensitive to GC content
Weighting of factors may be inordinately biased

44
Moving along

Sequencing technology led genomics, and to some
extent bioinformatics
ESTs complicated things, and were the beginning
of specialized methods and functional divisions
in GenBank.
Yeast chromosomes and bacterial chromosomes
rapidly exposed our obvious ineptitude at
genome annotation, and these genomes were
simple!
A controlled vocabulary was necessary, albeit
slow to be created: Gene Ontology (GO), Sequence
Ontology (SO).

45
Genome annotation problems

Assembling the genome
Analysis interpretation
Lack of consistency from gene to gene
Lack of consistency from person to person
Lack of controlled vocabulary
Parts we don't know
Bacteria vs mammals
Graphical user interface
Gene expression/molecular interactions
Dimensions
Updates and maintenance

46
Some comments about the human genome

Finished February 15, 2001
Finished April 25, 2003
Still not fully understood and definitely not
finished.
We are still in the genomic era.
To get a full parts list, we need, as a
community, to develop a system to rigorously find
all of the parts of the human genome:
Genes
Protein-coding sequences
Non-coding RNAs (ncRNA)
Identify and understand regulatory sequences
Many other cool things we don't know about!

47
The ideal annotation of MyGene
All clones
All SNPs
Promoter(s)
MyGene
All mRNAs
All proteins

All protein modifications
Ontologies
Interactions (complexes, pathways, networks)
Expression (where and when, and how much)
Evolutionary relationships

All structures
48
Things we will need to integrate in the future

Better gene predictions
Haplotypes to map complex diseases
Micro-array/gene expression data
SAGE data
Protein-protein interaction data
GFP (Green Fluorescent Protein)
Human-base (Entrez Gene)
Better standardization of annotation protocols.
Integration!

49
Some Concluding remarks

Trust but verify
Beware of gene prediction tools!
Always use more than one gene prediction tool and
more than one genome when possible.
Active area of bioinformatics research, so be
mindful of the new literature in this area.
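
One simple way to act on the advice to use more than one gene prediction tool is to keep only the regions that several tools agree on. The Python sketch below does this for predicted exon intervals. The function name, the tool labels in the usage note, and the half-open interval convention are assumptions for illustration; it walks every base, so it suits small regions rather than whole genomes.

```python
from collections import Counter


def consensus_intervals(predictions, min_support=2):
    """Return merged (start, end) intervals predicted by enough tools.

    predictions: dict mapping a tool label to a list of (start, end)
    half-open intervals. A base counts once per tool, even if that
    tool's own intervals overlap.
    """
    support = Counter()
    for intervals in predictions.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        support.update(covered)
    positions = sorted(p for p, n in support.items() if n >= min_support)
    merged = []
    for p in positions:
        if merged and p == merged[-1][1]:
            merged[-1][1] = p + 1      # extend the current interval
        else:
            merged.append([p, p + 1])  # start a new interval
    return [tuple(interval) for interval in merged]
```

For instance, if a hypothetical "toolA" predicts (10, 20), "toolB" predicts (15, 25) and "toolC" predicts (40, 50), the consensus with `min_support=2` is the single interval (15, 20). Agreement between tools raises confidence but is not validation, since tools can share the same biases.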

50
http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=113
51
http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=39
52
Finding records needing to be updated?

Who updates? Submitters, journals, 3rd party
What to update? Gene names, citations, new
product, sequencing errors
Where? update_at_ncbi.nlm.nih.gov
Why update?

53
example
56
From francis Wed Mar 3 22:32:19 1999
To: ddbjupdt_at_ddbj.nig.ac.jp
Subject: D25291 mito

Dear colleagues,

It appears that DDBJ record D25291 is contaminated
with mitochondrial sequences from nucleotide
673 to 1803, as it is identical to mouse
mitochondrial sequence (EMBL V00711) for more
than 1100 nucleotides.

I would recommend deleting that segment of the
record, or removing the record altogether, as
it leads to unfortunate misinterpretation of the
data when using GenBank or DDBJ. The protein
sequence (which is erroneous, as it is
all of mitochondrial origin) should definitely be
removed as well.

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
57
Sequence Updated

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
VERSION     D25291.1  GI:1850791

Changed fields: length, Date, DEF, GI, Version

LOCUS       MUSNGH        619 bp    mRNA            ROD       12-MAR-1999
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            mRNA.
ACCESSION   D25291
VERSION     D25291.2  GI:4520413
60
Courses in program

Required courses
MBB 505/MEDG 548C PROBLEM BASED LEARNING IN
BIOINFORMATICS
MBB 659 SPECIAL TOPICS IN BIOINFORMATICS
MBB 841 BIOINFORMATICS
CMPT 881 THEORETICAL COMPUTING
CMPT 889 BIOINFORMATICS ALGORITHMS
CPSC 545 ALGORITHMS FOR BIOINFORMATICS

Electives
CMPT 354 DATABASE SYSTEMS I
CMPT 740 DATA MINING
CMPT 880 SPECIAL TOPICS IN MEDICAL IMAGE
ANALYSIS
CPSC 304 INTRODUCTION TO RELATIONAL DATABASES
CPSC 504 DATABASE DESIGN
HCEP 511 CANCER EPIDEMIOLOGY
CPSC 53A TOPICS IN ALGORITHMS AND COMPLEXITY
BIOINFORMATICS
INFO 506 CRITICAL RESEARCH ANALYSIS
MATH 561 MATHEMATICAL BIOLOGY
MATH 612D TOPICS IN MATHEMATICAL BIOLOGY
-MATHEMATICS OF INFECTIOUS DISEASES AND
IMMUNOLOGY
MBB 823 PROTEIN STRUCTURE AND FUNCTION
PROTEOMIC BIOINFORMATICS
MBB 831 MOLECULAR EVOLUTION OF EUKARYOTE
GENOMES
MBB 835 GENOMIC ANALYSIS
MEDG 505 GENOME ANALYSIS
STAT 540 STATISTICAL METHODS FOR HIGH
DIMENSIONAL BIOLOGY
STAT 802 MULTIVARIATE ANALYSIS
STAT 805 NON-PARAMETRIC STATISTICS AND DISCRETE
DATA ANALYSIS
STAT 890 STATISTICS SELECTED TOPICS -
BIOMETRICAL GENETICS

61
Bioinformatics Faculty/Mentors

David Baillie: Bioinformatics, Molecular Biology
and Biochemistry, SFU
Fiona Brinkman (on maternity leave June -
Sept/06): Molecular Biology and Biochemistry, SFU
Ryan Brinkman: Medical Genetics, UBC; BC Cancer
Research Centre, BCCA
Jenny Bryan (on maternity leave beginning Jan/06):
Statistics and Michael Smith Laboratories, UBC
Artem Cherkasov: Medicine, Division of Infectious
Diseases, UBC
Ann Condon (on sabbatical until Sept/06):
Computer Science, UBC
Martin Ester: Computing Science, SFU
Arvind Gupta: Computing Science, SFU
Phil Hieter: Michael Smith Laboratories, UBC
Holger Hoos: Computer Science, UBC
Steven Jones: Program Director, Bioinformatics;
Genome Sciences Centre, BCCA
Marco Marra: Genome Sciences Centre, BCCA
Francis Ouellette: Director, UBC Bioinformatics
Centre (UBiC); Michael Smith Laboratories and
Medical Genetics, UBC
Paul Pavlidis: UBC Bioinformatics Centre (UBiC);
Psychiatry, UBC
Frederic Pio: Molecular Biology and Biochemistry,
SFU
Cenk Sahinalp: Computing Science, SFU
Wyeth Wasserman: Centre for Molecular Medicine
and Therapeutics, UBC
Mark Wilkinson: Medical Genetics, UBC