Genes, Genomes, and Genomics

Provided by: uwe96

1
Genes, Genomes, and Genomics
Bioinformatics in the Classroom, plagiarized from
http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt, June 2003
2
Two. Again
Francis Collins, HGP
Craig Venter, Celera Inc.
3
What's in a chromosome?
4
Hierarchical vs. Whole Genome
5
The value of genome sequences lies in their annotation

Annotation: characterizing genomic features using computational and experimental methods
Genes: four levels of annotation
Gene prediction: Where are genes?
What do they look like?
Domains: What do the proteins do?
Role: What pathway(s) are they involved in?

6
How many genes?

Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique ESTs?

7
Current consensus (in flux)

15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
17,000 predicted genes (GenScan, GeneFinder, GRAIL)
Based on, and limited to, previous knowledge

8
How do we get from here
9
to here,
10
What are genes? - 1

Complete DNA segments responsible for making functional products
Products
Proteins
Functional RNA molecules
RNAi (interfering RNA)
rRNA (ribosomal RNA)
snRNA (small nuclear)
snoRNA (small nucleolar)
tRNA (transfer RNA)

11
What are genes? - 2

Definition vs. dynamic concept
Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Genes-in-genes
Genes-ad-genes
Posttranslational modifications
Multi-subunit proteins

12
Prokaryotic gene model: ORF-genes

Small genomes, high gene density
Haemophilus influenzae genome: 85% genic
Operons
One transcript, many genes
No introns.
One gene, one protein
Open reading frames
One ORF per gene
ORFs begin with a start codon and end with a stop codon (by definition)

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
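
To make the "ORFs begin with start, end with stop codon" definition concrete, here is a minimal sketch of an ORF scan. It assumes the standard start codon ATG and the three standard stop codons, and it reads only the three forward frames; a real ORF detector would also scan the reverse complement and handle alternative start codons. The function name and the min_codons parameter are illustrative, not taken from any tool listed on the slide.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not a production ORF detector).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, end exclusive, for ORFs in the 3 forward frames.

    An ORF here is: a start codon, any number of in-frame triplets, then a stop codon.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open start codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next start
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"
for begin, end in find_orfs(example):
    print(begin, end, example[begin:end])
```

In a prokaryotic genome with one ORF per gene, a scan like this is already a rough gene finder; in eukaryotes the ORF only appears after introns have been spliced out.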
13
Eukaryotic gene model: spliced genes

Posttranscriptional modification
5'-cap, polyA tail, splicing
Open reading frames
Mature mRNA contains ORF
All internal exons contain open read-through
Pre-start and post-stop sequences are UTRs
Multiple translates
One gene, many proteins via alternative splicing

14
Expansions and Clarifications

ORFs
Start, triplets, stop
Prokaryotes: gene = ORF
Eukaryotes: spliced genes or ORF genes
Exons
Remain after introns have been removed
Flanking parts contain non-coding sequence (5'- and 3'-UTRs)

15
Where do genes live?

In genomes
Example: human genome
Ca. 3,200,000,000 base pairs
25 chromosomes: 1-22, X, Y, mt
28,000-45,000 genes (current estimate)
128 nucleotides (RNA gene) to 2,800 kb (DMD)
Ca. 25% of the genome is genes (introns, exons)
Ca. 1% of the genome codes for amino acids (CDS)
30 kb gene length (average)
1.4 kb ORF length (average)
3 transcripts per gene (average)
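
As a quick plausibility check, the slide's own numbers can be multiplied out: gene count times average gene length should land near the stated genic share of about a quarter of the genome, and gene count times average ORF length near the stated coding share of about one percent. The sketch below uses only the figures from this slide; the function name is illustrative, and the calculation ignores overlap between genes and variation around the averages.

```python
# Back-of-the-envelope check of the slide's human genome figures.
GENOME_BP = 3_200_000_000          # ca. 3.2 billion base pairs
GENE_COUNT_RANGE = (28_000, 45_000)  # current estimate from the slide
AVG_GENE_BP = 30_000               # 30 kb average gene length
AVG_ORF_BP = 1_400                 # 1.4 kb average ORF length


def fraction_of_genome(n_genes, avg_len_bp, genome_bp=GENOME_BP):
    """Fraction of the genome covered by n_genes features of avg_len_bp each."""
    return n_genes * avg_len_bp / genome_bp


for n in GENE_COUNT_RANGE:
    genic = fraction_of_genome(n, AVG_GENE_BP)
    coding = fraction_of_genome(n, AVG_ORF_BP)
    print(f"{n} genes: {genic:.0%} genic, {coding:.1%} coding")
```

The low end of the gene-count range gives roughly 26 percent genic and 1.2 percent coding, and the high end roughly 42 percent and 2.0 percent, so the quoted shares sit at the conservative end of the estimate.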

16
Sample genomes
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
17
So much DNA, so few genes
18
Genomic sequence features

Repeats (Junk DNA)
Transposable elements, simple repeats
RepeatMasker
Genes
Vary in density, length, structure
Identification depends on evidence and methods
and may require concerted application of
bioinformatics methods and lab research
Pseudogenes
Look-alikes of genes that obstruct gene-finding efforts.
Non-coding RNAs (ncRNA)
tRNA, rRNA, snRNA, snoRNA, miRNA
tRNAscan-SE, COVE

19
Gene identification

Homology-based gene prediction
Similarity Searches (e.g. BLAST, BLAT)
Genome Browsers
RNA evidence (ESTs)
Ab initio gene prediction
Gene prediction programs
Prokaryotes
ORF identification
Eukaryotes
Promoter prediction
PolyA-signal prediction
Splice site, start/stop-codon predictions

20
Gene prediction through comparative genomics

Highly similar (conserved) regions between two genomes are likely functional; otherwise they would have diverged
If genomes are too closely related, all regions are similar, not just genes
If genomes are too far apart, analogous regions may be too dissimilar to be found
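
One simple way to look for conserved regions is to slide a window along a pairwise alignment and compute percent identity per window. The sketch below assumes the two sequences have already been aligned to equal length (with "-" for gaps); the function name, the window size, and the choice to count gap columns as mismatches are all illustrative assumptions rather than the method of any specific tool.

```python
def window_identity(aligned_a, aligned_b, window=10):
    """Fraction of identical, non-gap columns in each sliding window of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    a, b = aligned_a.upper(), aligned_b.upper()
    scores = []
    for i in range(len(a) - window + 1):
        matches = sum(
            x == y and x != "-"  # a gap never counts as a match
            for x, y in zip(a[i:i + window], b[i:i + window])
        )
        scores.append(matches / window)
    return scores


# Hypothetical aligned fragments: identical at the left, diverged at the right.
seq_a = "ATGGCCAAGTTTGACCTA"
seq_b = "ATGGCCAAGTCAGTTCGA"
for pos, score in enumerate(window_identity(seq_a, seq_b, window=6)):
    print(pos, f"{score:.2f}")
```

Windows that stay near 1.0 while the rest of the alignment drops are the candidates for conserved, and therefore possibly genic, regions. If the genomes are too close every window scores high, and if they are too far apart no window stands out, which is the trade-off the slide describes.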

21
Genome Browsers
NCBI Map Viewer: www.ncbi.nlm.nih.gov/mapview/
Generic Genome Browser (CSHL): www.wormbase.org/db/seq/gbrowse
Ensembl Genome Browser: www.ensembl.org/
UCSC Genome Browser: genome.ucsc.edu/cgi-bin/hgGateway?org=human
Apollo Genome Browser: www.bdgp.org/annot/apollo/
22
Gene discovery using ESTs

Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
If a region matches an EST with high stringency, then the region is probably a gene or pseudogene.
An EST that overlaps an exon boundary gives an accurate prediction of that boundary.

23
Ab initio gene prediction

Prokaryotes
ORF-Detectors
Eukaryotes
Position, extent, and direction through promoter and polyA-signal predictors
Structure through splice site predictors
Exact location of coding sequences through
determination of relationships between potential
start codons, splice sites, ORFs, and stop codons

24
Tools

ORF detectors
NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
Promoter predictors
CSHL: http://rulai.cshl.org/software/index1.htm
BDGP: fruitfly.org/seq_tools/promoter.html
ICG: TATA-Box predictor
PolyA signal predictors
CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors
BDGP: http://www.fruitfly.org/seq_tools/splice.html
Start-/stop-codon identifiers
DNALC: Translator/ORF-Finder
BCM: Searchlauncher

25
How it works I: Motif identification

Exon-intron borders: splice sites

Exon        Intron                Exon
gaggcatcag  gtttgtagactgtgtttcag  tgcacccact
ccgccgctga  gtgagccgtgtctattctag  gacgcgcggg
tgtgaattag  gtaagaggttatatctccag  atggagatca
ccatgaggag  gtgagtgccattatttccag  gtatgagacg
Splice sites: both ends of each intron

Exon        Intron                Exon
gaggcatcag  GTttgtagactgtgtttcAG  tgcacccact
ccgccgctga  GTgagccgtgtctattctAG  gacgcgcggg
tgtgaattag  GTaagaggttatatctccAG  atggagatca
ccatgaggag  GTgagtgccattatttccAG  gtatgagacg
Splice sites: each intron begins with GT and ends with AG

Motif extraction programs at http://www-btls.jst.go.jp/
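
The GT...AG pattern in the four example sequences can be checked with a few lines of code. The sketch below lists every stretch that starts with GT and ends with AG as a candidate intron; the function name and the minimum length are illustrative assumptions, and real splice site predictors score a wider sequence context rather than the two dinucleotides alone.

```python
# The four exon/intron/exon examples from the slide, each joined into one string.
EXAMPLES = [
    "gaggcatcaggtttgtagactgtgtttcagtgcacccact",
    "ccgccgctgagtgagccgtgtctattctaggacgcgcggg",
    "tgtgaattaggtaagaggttatatctccagatggagatca",
    "ccatgaggaggtgagtgccattatttccaggtatgagacg",
]


def candidate_introns(seq, min_len=20):
    """Return (start, end) pairs, end exclusive, that begin with GT and end with AG."""
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptor_ends = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "ag"]
    return [(d, a) for d in donors for a in acceptor_ends if a - d >= min_len]


for seq in EXAMPLES:
    hits = candidate_introns(seq)
    # In every slide example the intron occupies positions 10..29 (0-based).
    print(seq[:10], seq[10:30], seq[30:], "annotated intron found:", (10, 30) in hits)
```

On longer sequences GT and AG occur by chance very often, so a dinucleotide match is only a starting point; this is why the motif has to be combined with other evidence.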
26
How it works II - Movies
Pribnow-Box Finder 0/1 Pribnow-Box Finder all
27
How it works III: The (ugly) truth
28
Gene prediction programs

Rule-based programs
Use an explicit set of rules to make decisions.
Example: GeneFinder
Neural network-based programs
Use a data set to build rules.
Examples: Grail, GrailEXP
Hidden Markov Model-based programs
Use probabilities of states and transitions between these states to predict features.
Examples: Genscan, GenomeScan
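
To illustrate what "probabilities of states and transitions" means, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. Everything in it is an assumption made for illustration: the two states (noncoding and coding), the made-up probabilities, and the idea that coding sequence is slightly GC-rich. It is not the model used by Genscan or GenomeScan, which have many more states for exons, introns, and signals.

```python
import math

# Toy model: all numbers below are hypothetical, chosen only for illustration.
STATES = ("N", "C")  # N = noncoding, C = coding
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq (A/C/G/T only) as a string."""
    seq = seq.upper()
    if not seq:
        return ""
    # score[t][s] = best log-probability of any path ending in state s at position t
    score = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]]) for s in STATES}]
    back = []  # back[t - 1][s] = best previous state for state s at position t
    for base in seq[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS_P[p][s]))
            row[s] = prev[best] + math.log(TRANS_P[best][s]) + math.log(EMIT_P[s][base])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAAAAAAGGGGGGGGGGGG"))
```

The sticky transition probabilities (0.9 to stay, 0.1 to switch) are what keep the path from flipping state at every base: a switch only pays off when enough consecutive bases favour the other state.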

29
Evaluating prediction programs

Sensitivity vs. Specificity
Sensitivity
How many genes were found out of all present?
Sn = TP / (TP + FN)
Specificity
How many predicted genes are indeed genes?
Sp = TP / (TP + FP)
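
The two measures are easy to compute once predictions have been compared with a reference annotation. The sketch below takes sensitivity as TP divided by (TP plus FN) and specificity as TP divided by (TP plus FP), following the slide's wording; the counts in the example are hypothetical. Note that specificity in this gene-prediction sense is what other fields call precision or positive predictive value.

```python
def sensitivity(tp, fn):
    """Share of real genes that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Share of predicted genes that are real: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical evaluation: 30 real genes found, 10 missed, 20 false predictions.
tp, fn, fp = 30, 10, 20
print(f"Sn = {sensitivity(tp, fn):.2f}")
print(f"Sp = {specificity(tp, fp):.2f}")
```

A program can trade one measure for the other: predicting more genes tends to raise Sn and lower Sp, which is why both are reported together.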

30
Gene prediction accuracies

Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
Exon level: 75% Sn, 68% Sp (lows less than 30%)
Gene level: 40% Sn, 30% Sp (lows less than 10%)
Programs that combine statistical evaluations with similarity searches are the most powerful.

31
Common difficulties

First and last exons difficult to annotate
because they contain UTRs.
Smaller genes do not reach statistical significance, so they are thrown out.
Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
Masking repeats frequently removes potentially
indicative chunks from the untranslated regions
of genes that contain repetitive elements.

32
The annotation pipeline

Mask repeats using RepeatMasker.
Run sequence through several programs.
Take predicted genes and do similarity search
against ESTs and genes from other organisms.
Do similarity search for non-coding sequences to
find ncRNA.

33
Annotation nomenclature

Known Gene: predicted gene matches the entire length of a known gene.
Putative Gene: predicted gene contains a region conserved with a known gene. Also referred to as "like" or "similar to".
Unknown Gene: predicted gene matches a gene or EST of which the function is not known.
Hypothetical Gene: predicted gene that does not contain significant similarity to any known gene or EST.
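
The four labels can be read as a small decision rule. The sketch below is one possible reading: the function name, the three boolean inputs, and the order in which the checks are applied (strongest evidence first) are assumptions, since the slide defines the categories but not a procedure for deciding between them.

```python
def annotation_label(full_length_known_match, conserved_region_with_known,
                     matches_uncharacterized):
    """Map similarity evidence for a predicted gene to a nomenclature label.

    full_length_known_match: prediction matches the entire length of a known gene
    conserved_region_with_known: prediction shares a conserved region with a known gene
    matches_uncharacterized: prediction matches a gene or EST of unknown function
    """
    if full_length_known_match:
        return "Known Gene"
    if conserved_region_with_known:
        return "Putative Gene"
    if matches_uncharacterized:
        return "Unknown Gene"
    return "Hypothetical Gene"  # no significant similarity to any known gene or EST


print(annotation_label(False, True, False))
```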