Genome Annotation and Databases - PowerPoint PPT Presentation

Provided by: csta2

Transcript and Presenter's Notes

1
Genome Annotation and Databases
  • Genomic DNA sequence
  • Genomic annotation

BIO520 Bioinformatics Jim Lund
2
Genome Annotation
  • Find known repeats
  • Search for new repeated sequences
  • Predict Genes
  • BLASTX
  • Genewise, Fgenes, Genscan
  • Integrate other data sources.

Accuracy highest in high homology class
3
Genome annotation servers
  • Integrate information from several maps
  • DNA sequence (contigs, quality).
  • Physical (cytogenetic, STS content).
  • Genes (Predicted and known).
  • Several prediction programs.
  • Expressed sequence tags (ESTs, Unigene clusters)
  • Evidence (Predicted, confirmed)
  • Non-coding RNA (ncRNA) transcripts.
  • Regions of shared synteny.

4
Data Release
  • Human genome sequence released under 1996 Bermuda
    rules
  • Assembled sequence greater than 1000bp long is
    deposited in public database (GenBank/EMBL/DDBJ)
    every 24 hours
  • No patents are filed
  • Bermuda principles reaffirmed at January 2003
    WT/NIH meeting
  • Pre-release of data for all community projects
  • Nature 421, 875 (2003)
  • NHGRI: http://www.genome.gov/page.cfm?pageID=10506376
  • WT: http://www.wellcome.ac.uk/en/1/awtpubrepdat.html
  • Benefits of Open Data Access supported by OECD
    report
  • http://dataaccess.ucsd.edu

5
Accessing the Genome
  • Genome sequences are becoming available very
    rapidly
  • Large and difficult to handle computationally
  • Everyone expects to be able to access them
    immediately
  • Bench Biologists
  • Has my gene been sequenced?
  • What are the genes in this region?
  • Where are all the GPCRs?
  • Connect the genome to other resources
  • Research Bioinformatics
  • Give me a dataset of human genomic DNA
  • Give me a protein dataset
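
A question such as "What are the genes in this region?" comes down to an interval-overlap query against the annotation. The following is a minimal sketch, assuming a small in-memory list of gene records; the gene names and coordinates are hypothetical and the coordinates are treated as 1-based and inclusive.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_in_region(genes, chrom, start, end):
    """Return the genes on `chrom` that overlap the closed interval [start, end]."""
    return [g for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]


# Hypothetical annotations, for illustration only.
GENES = [
    Gene("geneA", "chr1", 1_000, 5_000),
    Gene("geneB", "chr1", 4_500, 9_000),
    Gene("geneC", "chr2", 2_000, 3_000),
]

hits = genes_in_region(GENES, "chr1", 4_800, 6_000)
print([g.name for g in hits])
```

A genome server would answer the same query from an indexed database rather than a linear scan; the overlap test itself is the same.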

6
Getting information out
  • Search/browse to find the gene or region.
  • Export formats
  • Screen shot
  • FASTA seq.
  • GenBank file with features annotated
  • Feature list (GFF, tab-delimited text)
  • PIP (plot of sequence identity between organisms).
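
A tab-delimited feature list is the easiest of these exports to process locally. The sketch below assumes the common nine-column GFF layout (sequence, source, feature type, start, end, score, strand, phase, attributes); the example records are made up for illustration.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: str


def parse_gff(text):
    """Parse nine-column, tab-delimited GFF text into Feature records."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        features.append(Feature(seqid, source, ftype, int(start), int(end),
                                score, strand, phase, attrs))
    return features


# Made-up example records, for illustration only.
EXAMPLE = (
    "##gff-version 2\n"
    "chr1\tGenscan\texon\t1200\t1350\t.\t+\t0\tgene1\n"
    "chr1\tGenscan\texon\t1800\t1999\t.\t+\t2\tgene1\n"
)

exons = parse_gff(EXAMPLE)
print(sum(f.end - f.start + 1 for f in exons))  # total exon length in bp
```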

7
Challenges
  • Scale and data flow
      • Mainly engineering problems
  • Presentation, ease of use
      • Engineering problems
      • User interface design
  • Algorithmic
      • Partly engineering
      • Partly research

8
NCBI sequence assembly (sequence to chromosome)
  • Remove contaminants
  • Bin by chromosome arms
  • Sequence Layout
  • Sequence Building
  • Place on chromosomes

9
NCBI sequence assembly - a modified greedy approach
  • Sequence Layout
  • Curated Finished Regions
  • Curated assembly instructions
  • MegaBLAST hits
  • Consider clone order
  • BAC chromosome assignment
  • annotation
  • STS markers
  • personal communication
  • Remove conflicting overlaps, redundant BACs
  • Sequence Building
  • Consider fragment-fragment sequence overlaps for
    each BAC pair in layout
  • Meld overlapping sequence
  • Order and Orient (O&O)
  • alignments (mRNA, EST)
  • BAC annotation
  • paired plasmid reads
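
The "meld overlapping sequence" step can be illustrated with a toy greedy merger. This is a sketch of the general greedy idea only, not NCBI's procedure, which also uses curated layouts, MegaBLAST hits, and clone order. It assumes exact suffix-prefix overlaps between short strings, and the function names are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_meld(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest exact overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave the fragments as separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


print(greedy_meld(["ATGCGT", "CGTTAC", "TACGGA"]))
```

Real assemblers must also tolerate sequencing errors and repeats, which is why a curated layout and clone order constrain which overlaps are considered.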

10
NCBI Genome Build Process
[Flowchart. Recoverable labels: input data (sequences, curated NTs, TPF, BLAST hits); input resources; STS; dbSNP; clones; GenomeScan; GenBank; LocusLink; RefSeq; collaboration and curation; exclude problem accessions; freeze; assembly; contig build; annotation; update links, gis; prepare for release; public release; sequences (contig, mRNA, protein); Map Viewer; FTP; BLAST; resource updates; analysis, review, and corrections for next build.]
11
What is being annotated?
Feature                        Method
Genes                          By alignment, by prediction
Markers                        By ePCR
Variation                      By alignment
Clones/Cytogenetic location    By alignment (BAC ends)
Phenotype (MIM)                Via gene identification, associated markers
Cytogenetic position           By annotated BAC-end sequenced clones; by FISH-mapped clones used in assembly
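
Placing markers "by ePCR" means searching the sequence for a primer pair that sits in the right orientation and at roughly the expected product size. The sketch below is a simplified illustration of that idea, not the e-PCR program itself: it assumes exact primer matches on the forward strand only, and the sequence and primers are invented.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def epcr_hits(seq, fwd, rev, expected_size, tolerance=50):
    """Find exact primer-pair hits whose product size is near `expected_size`.

    Returns (start, end, size) tuples with 0-based, half-open coordinates.
    """
    rev_site = reverse_complement(rev)  # how the reverse primer appears on this strand
    hits = []
    start = seq.find(fwd)
    while start != -1:
        pos = seq.find(rev_site, start + len(fwd))
        while pos != -1:
            size = pos + len(rev_site) - start
            if abs(size - expected_size) <= tolerance:
                hits.append((start, pos + len(rev_site), size))
            pos = seq.find(rev_site, pos + 1)
        start = seq.find(fwd, start + 1)
    return hits


# Invented sequence and primers, for illustration only.
SEQ = "TTT" + "ACGTAC" + "A" * 20 + "GGTCAA" + "TTT"
print(epcr_hits(SEQ, fwd="ACGTAC", rev="TTGACC", expected_size=32, tolerance=5))
```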
12
RefSeq: a reagent for Contig Annotation
  • Potential Problems
      • Gene Families
      • Partial
      • Chimeric
      • Intron read-through
      • Linker
      • Vector
      • Wrong organism
  • RefSeq Advantages
      • Separate Gene Families
      • Not Partial
      • Means to correct problem sequences
  • RefSeq process results in excluding problem GenBank
    sequences from annotation pipeline

[Diagram labels: genome, RefSeq mRNAs, GenBank mRNAs, ESTs, TBLASTN, RPSBLAST, GenomeScan]
13
NCBI Products of annotation
  • RefSeqs (transcripts, proteins)
  • Gene id (LocusID)
  • features in chromosome coordinates
  • features in contig (NT accession) coordinates
  • Available in
      • Map Viewer: graphical display, tabular display,
        sequence downloads
      • FTP: RefSeqs (contigs, transcripts, proteins),
        mapping data
      • LocusLink and other resources
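
Because features are reported in both contig and chromosome coordinates, it helps to see how one maps to the other. The following is a minimal sketch under stated assumptions: each contig has a single placement on a chromosome, coordinates are 1-based and inclusive, and `chrom_start` is the lowest chromosome position the contig covers. The accession and numbers are hypothetical.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    contig: str       # hypothetical contig accession
    chrom: str
    chrom_start: int  # lowest chromosome position covered by the contig (1-based)
    length: int       # contig length in bp
    strand: int       # +1 if the contig is placed forward, -1 if reversed


def contig_to_chrom(p, start, end):
    """Map a 1-based inclusive contig interval to chromosome coordinates."""
    if not (1 <= start <= end <= p.length):
        raise ValueError("interval lies outside the contig")
    if p.strand == 1:
        return p.chrom_start + start - 1, p.chrom_start + end - 1
    chrom_end = p.chrom_start + p.length - 1
    # On a reversed contig the interval flips around the contig's far end.
    return chrom_end - end + 1, chrom_end - start + 1


# Hypothetical placements, for illustration only.
forward = Placement("NT_EXAMPLE", "chr1", 1_000_001, 50_000, 1)
reverse = Placement("NT_EXAMPLE", "chr1", 1_000_001, 50_000, -1)
print(contig_to_chrom(forward, 101, 200))
print(contig_to_chrom(reverse, 101, 200))
```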

14
NCBI Map Viewer
15
NCBI Map Viewer Tabular report
16
Genes in regions of conserved synteny
17
Query by sequence Review the alignment
  • A click away
  • Alignments (BLAST hit)
  • Gene Description (LocusLink)
  • Report of all features in the region
  • Contig sequence
  • Sequence in the region
  • other mRNAs aligning in the region
  • Define your own gene model based on alignments in
    the region

18
Quality Control - Genome review
  • Is the sequence correct?
  • Is the feature correctly placed?
  • Is there a feature that should be placed?
  • Are the attributes of the feature correct?
  • Approaches
  • In-house analysis review (manual curation)
  • Shared information (UCSC/Ensembl)
  • Solicited review by experts in local regions

19
Ensembl Analysis
  • Set of high quality gene predictions
  • From known human mRNAs aligned against genome
  • From similar protein and mRNAs aligned against
    genome
  • From Genscan predictions confirmed via BLAST of
    Protein, cDNA, ESTs databases.
  • Initial functional annotation from Interpro
  • Integration with external resources (SNPs, SAGE,
    OMIM)
  • Comparative analysis between mouse/human
  • DNA sequence alignment
  • Protein orthologs

20
Ensembl prediction pipeline
DNA
RepeatMasker
Genscan
Pmatch all human proteins and cDNAs
BLAST Genscan peptides vs. protein, UniGene, EST, vert mRNA
MiniGenewise, MiniEst2genome
Genes
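
The pipeline starts with repeat masking so that later gene-prediction and alignment steps do not spend effort on repeats. The sketch below shows only what masking does to the sequence, assuming the repeat intervals are already known; it does not call RepeatMasker, and the intervals are invented.

```python
def mask_repeats(seq, repeats, soft=False):
    """Mask repeat intervals in a DNA string.

    `repeats` holds (start, end) pairs, 0-based and half-open.
    Hard masking replaces bases with N; soft masking lowercases them.
    """
    chars = list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


# Invented sequence and repeat interval, for illustration only.
print(mask_repeats("ACGTACGTAC", [(2, 5)]))
print(mask_repeats("ACGTACGTAC", [(2, 5)], soft=True))
```

Soft masking keeps the original bases available to tools that can still align through repeats, while hard masking hides them entirely.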
21
Genome Annotation
The generic structure of an automatic genome
annotation pipeline and delivery system
22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
Chromosome
  • Overview: genes and markers (1 Mb)
  • Detailed view: genes, ESTs, CpG etc. (100 kb)
26
Useful genomic annotation and browser URLs
Automated annotation pipelines
  • EBI/Sanger Institute: Ensembl Project
    http://www.ensembl.org/Homo_sapiens/
  • NCBI: Human Genome Browser
    http://proxy.library.uiuc.edu:3367/genome/guide/human/
  • The Oak Ridge National Laboratories: Genome Channel
    http://compbio.ornl.gov/channel/
  • Celera: Discovery System
    http://cds.celera.com/
  • Incyte Genomics: Genomics Knowledge Platform
    http://www.incyte.com/incyte_science/technology/gkp/
  • Paracel: GeneMatcher2 System
    http://www.paracel.com/products/gm2.html
Human genome browsers
  • UCSC: Human Genome Browser
    http://genome.cse.ucsc.edu/cgi-bin/hgGateway/
  • Softberry: Genome Explorer
    http://www.softberry.com/berry.phtml?topic=genomexp
  • Viaken: Enterprise Ensembl Solution
    http://www.viaken.com/ns/solutions/ensembl.html
  • LabBook Inc.: Genomic Explorer Suite
    http://www.labbook.com/products/ExplorerSuite.asp
  • University of Tokyo: Gene Resource Locator Browser
    http://grl.gi.k.u-tokyo.ac.jp/
Other useful sites
  • The Institute for Genomic Research (TIGR)
    http://www.tigr.org/
  • Human Genome Central
    http://www.ensembl.org/genome/central/ and
    http://proxy.library.uiuc.edu:3528/genome/central/

27
Genome annotation issues
  • Annotation servers.
  • Pro: make genomics information accessible to
    biologists without expert bioinformatics skills.
  • Con: make it difficult to perform large-scale
    data mining.
  • Solution: enable more experienced users to
    retrieve the data they require and to run
    analyses locally.
  • Open annotation systems.
  • Biologists need to have access to annotations
    available in the community and to share their own
    contributions with the community.
  • A common protocol between systems that enables
    genome data to be freely exchanged
  • AGAVE (Architecture for Genomic Annotation,
    Visualization and Exchange)
  • Distributed Annotation System (DAS) projects


28
Genome annotation servers
  • Several ways to find information
  • Search by clone, gene, EST, marker.
  • Browse sequence.
  • BLAST searches.
  • Homology, start in one organism, jump to the
    syntenic region of another.

29
UCSC Genome Browser

http://genome.ucsc.edu/cgi-bin/hgGateway