Title: Biological Motivation: Gene Finding
1Biological Motivation: Gene Finding
- Anne R. Haake
- Rhys Price Jones
2Gene Finding
- Why do it?
- Find and annotate all the genes within the large volume of DNA sequence data
- How many genes are in an organism? What homologies exist?
- Gain understanding of problems in basic science
- e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
- Different emphasis on these goals has some effect on the design of computational approaches for gene finding.
3Gene Finding by Biological Methods
- Extract mRNA
- Reverse-transcribe the mRNA to cDNA
- Label the cDNA
- Use the labeled cDNA as a probe against a DNA library
- Gene found
4Gene Finding by Computational Methods
- Dependent on good experimental data to build reliable predictive models
- Various aspects of gene structure/function provide information used in gene finding programs
5Figure 12.3
6The Informatics View of Genes
- Genes are character strings embedded in much larger strings called the genome
- Genes are composed of ordered elements associated with the fundamental genetic processes, including transcription, splicing, and translation.
7Gene Finding
- Cells recognize genes from DNA sequence
- find genes via their bioprocesses
- Not so easy for us..
8 CTAGCAGGGACCCCAGCGCCC
GAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAA
CTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGC
CCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAA
GCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTG
GGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAAC
TGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGA
ATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCT
AAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTT
TTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGT
GCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTC
AGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTA
TTGTTATGAGACTGGATATAT...
9 G CTAGCAGGGACCCCAGCGCCCGAGAGACCAT
GCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCA
GGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGA
GGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTA
AAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAG
AATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGAT
TGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGAC
TCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAA
AAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTC
ATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTG
CATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAA
AATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAG
ACTGGATATAT...
10Types of Genes
- Protein coding
- most genes
- RNA genes
- rRNA
- tRNA
- snRNA (small nuclear RNA)
- snoRNA (small nucleolar RNA)
11 3 Major Categories of Information used in Gene
Finding Programs
- Signals/features: a sequence pattern with functional significance, e.g. splice donor and acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands
- Content/composition: statistical properties of coding vs. non-coding regions
- e.g. codon bias, length of ORFs in prokaryotes, GC content
- Similarity: compare the DNA sequence to known sequences in databases
- Not only known proteins but also ESTs, cDNAs
12Looking for Protein Coding Genes
- Look for an ORF (begins with a start codon, ends with a stop codon, no internal stops!)
- long (usually > 60-100 aa)
- If homologous to known protein more likely
- Look for basal signals
- Transcription, splicing, translation
- Look for regulatory signals
- Depends on organism
- Prokaryotes vs Eukaryotes
- Vertebrate vs fungi
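The ORF criterion in the list above (start codon, in-frame stop codon, no internal stops, minimum length) can be sketched as a simple scan. This is a minimal illustration rather than a production gene finder: it assumes ATG as the only start codon, the standard stop codons TAA, TAG and TGA, and the forward strand only. The function name `find_orfs` and its parameters are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def find_orfs(seq, min_codons=60, starts=("ATG",)):
    """Yield (start, end, frame) for forward-strand ORFs.

    An ORF here runs from the first start codon seen in a frame to the next
    in-frame stop codon (stop included in `end`), and must contain at least
    `min_codons` codons before the stop.
    """
    seq = seq.upper()
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i                      # open a candidate ORF
            elif orf_start is not None and codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    yield orf_start, i + 3, frame  # long enough: report it
                orf_start = None                   # close the candidate

# Made-up toy sequence: 2 bases, ATG, five GCT codons, TAA, 2 bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(list(find_orfs(toy, min_codons=3)))
```

Scanning the reverse strand would mean running the same function on the reverse complement; homology and signal checks would then be applied to the candidates this scan returns.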
13Easier problem: Gene Finding in Bacterial Genomes
- Why?
- Dense Genomes
- Short intergenic regions
- Uninterrupted ORFs
- Conserved signals
- Abundant comparative information
- Complete Genomes available for many
14What do Prokaryotic Genes look like?
[Figure: a prokaryotic gene drawn 5' to 3'. Labels: promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe)]
15Prokaryotic Gene Expression
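[Figure: prokaryotic gene expression. RNA polymerase transcribes an operon (promoter, cistron 1, cistron 2, ..., cistron N, terminator) into one mRNA, 5' to 3', carrying cistrons 1, 2, ..., N. Label: SD in polycistronic message. The cistrons are translated into separate polypeptides 1, 2, 3, each drawn from N terminus to C terminus.]
Slide modified from http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt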
16Open Reading Frame (ORF)
- Any stretch of DNA that potentially encodes a protein
- The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene
17Open Reading Frames
- Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5' to 3' direction and a further three in the reverse direction on the opposite strand.
- A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF)
A C G T A A C T G A C T A G G T G A A T
CGT AAC TGA CTA GGT GAA
GTA ACT GAC TAG GTG AAT
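The triplet groupings shown above can be reproduced with a few lines of code. The sketch below splits a sequence into its three forward frames and the three frames of the reverse complement. The helper names (`codons`, `six_frames`) are illustrative, and incomplete trailing triplets are simply dropped.

```python
# Translation table that maps each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq, offset):
    """Split seq into complete triplets, starting at offset 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Return (forward, reverse): three reading frames for each strand."""
    rc = reverse_complement(seq)
    forward = [codons(seq, k) for k in range(3)]
    reverse = [codons(rc, k) for k in range(3)]
    return forward, reverse

forward, reverse = six_frames("ACGTAACTGACTAGGTGAAT")
for frame in forward:
    print(" ".join(frame))
```

The second and third forward frames printed by this sketch are the two triplet lines shown on the slide; each forward frame of this particular sequence contains at least one stop codon (TAA, TGA or TAG).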
18ORFs as gene candidates
- A gene candidate is an open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
- Most prokaryotic genes code for proteins that are 60 or more amino acids in length
- The probability that a random sequence of n codons contains no stop codon is (61/64)^n
- When n is 50, there is a probability of about 91% that the random sequence contains a stop codon
- When n is 100, this probability exceeds 99%
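The stop-codon probabilities can be checked numerically. The sketch below assumes n counts codons and that all 64 codons are equally likely, so 61 of 64 are not stops. The function name `prob_no_stop` is illustrative.

```python
def prob_no_stop(n_codons):
    """Probability that n random codons (uniform over all 64) contain no stop codon."""
    return (61 / 64) ** n_codons

for n in (50, 100):
    p_stop = 1 - prob_no_stop(n)
    print(f"n = {n}: P(at least one stop codon) = {p_stop:.3f}")
```

Under these assumptions the chance of at least one stop codon is roughly 0.9 for n = 50 and above 0.99 for n = 100, which is why a long stop-free stretch is unlikely to arise by chance. Real genomes are not uniform random sequences (GC-rich genomes, for example, have fewer AT-rich stop codons), so these figures are only a rough guide.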
19Codon Bias
- Genetic code is degenerate
- Equivalent triplet codons code for the same amino acid
- http://www.pangloss.com/seidel/Protocols/codon.html
- Codon usage varies
- organism to organism
- gene to gene
- Biological basis
- Avoidance of codons similar to stop
- Preference for codons that correspond to abundant tRNAs within the organism
20Codon Bias Gene Differences
Codon      GAL4   ADH1
Gly GGG    0.21   0
Gly GGA    0.17   0
Gly GGT    0.38   0.93
Gly GGC    0.24   0.07
Slide modified from http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt
21Codon Bias: Organism differences
- Yeast genome: Arg is specified by AGA 48% of the time (the other five equivalent codons about 10% each)
- Fruit fly genome: Arg is specified by CGC 33% of the time (the other five about 13% each)
- A complete set of codon usage biases can be found at http://www.kazusa.or.jp/codon/
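Relative codon usage of this kind can be tabulated directly from a coding sequence. The sketch below counts in-frame codons and reports, for one amino acid, the fraction contributed by each synonymous codon. The function names and the toy sequence are made up for illustration and do not come from any real gene.

```python
from collections import Counter

def codon_counts(cds):
    """Count in-frame codons of a coding sequence (frame 0, trailing bases ignored)."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

def relative_usage(counts, synonymous):
    """Fraction of each codon among all codons for one amino acid."""
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    return {c: counts[c] / total for c in synonymous}

# Synonymous codon families in the standard genetic code.
GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")
ARG_CODONS = ("AGA", "AGG", "CGT", "CGC", "CGA", "CGG")

# Made-up toy sequence: ATG GGT GGT GGC AGA GGT AGA CGC TAA
toy_cds = "ATGGGTGGTGGCAGAGGTAGACGCTAA"
counts = codon_counts(toy_cds)
print(relative_usage(counts, GLY_CODONS))
print(relative_usage(counts, ARG_CODONS))
```

Applied to all annotated genes of an organism, the same tabulation gives genome-wide usage figures; applied gene by gene, it exposes gene-to-gene differences.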
22GC content
- GC content relative to AT content is a distinguishing feature of bacterial genomes
- Varies dramatically across species
- Serves as a means to identify bacterial species
- For various biological reasons
- Mutational bias of particular DNA polymerases
- DNA repair mechanisms
- horizontal gene transfer (transformation,
transduction, conjugation)
23GC Content
- GC content may be different in recently acquired genes than elsewhere
- This can lead to variations in the frequency of codon usage within coding regions
- There may be significant differences in codon bias within different genes of a single bacterium's genome
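One simple way to look for regions whose composition differs from the rest of a genome is to compute GC content in sliding windows. The sketch below is a minimal version. The window and step sizes are arbitrary illustrative defaults, and ambiguous bases such as N are excluded from the denominator.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

def windowed_gc(seq, window=1000, step=500):
    """Yield (start, GC fraction) for successive windows along seq."""
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        yield start, gc_content(seq[start:start + window])

# Toy example: an AT-only stretch followed by a GC-only stretch.
for start, gc in windowed_gc("AAAATTTTGGGGCCCC", window=8, step=8):
    print(start, gc)
```

Windows whose GC fraction departs markedly from the genome-wide value are candidates for recently acquired DNA, although composition alone is not conclusive.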
24Ribosome Binding Sites
- The RBS, also known as the Shine-Dalgarno sequence (species-dependent), should bind well with the 3' end of the 16S rRNA (part of the ribosome)
- Usually found within 4-18 nucleotides of the start codon of a true gene
25Shine-Dalgarno Sequence
- A nucleotide sequence (consensus AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.
- This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
- If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF's start codon, that improves the ORF's candidacy.
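The Shine-Dalgarno check can be sketched as a search for the best ungapped match to AGGAGG upstream of a candidate start codon. Two assumptions are made here: the 4-18 nucleotide distance is measured as the number of bases between the end of the site and the start codon, and a match is scored by counting identical positions. The function name `best_sd_match` is hypothetical, and real tools typically score binding to the 16S rRNA 3' end instead.

```python
SD_CONSENSUS = "AGGAGG"

def best_sd_match(seq, start_codon_pos, min_gap=4, max_gap=18):
    """Find the best ungapped match to the SD consensus upstream of a start codon.

    gap = number of bases between the end of the candidate site and the
    start codon (an assumption about how the 4-18 nt distance is measured).
    Returns (matches, site_start, gap), or None if no window fits.
    """
    k = len(SD_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        site_start = start_codon_pos - gap - k
        if site_start < 0:
            break  # ran off the 5' end of the sequence
        site = seq[site_start:site_start + k].upper()
        matches = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        if best is None or matches > best[0]:
            best = (matches, site_start, gap)
    return best

# Made-up toy sequence: a perfect SD site 6 bases upstream of ATG.
toy = "TTT" + "AGGAGG" + "TTTTTT" + "ATG" + "AAA"
print(best_sd_match(toy, toy.find("ATG")))
```

A candidate ORF could then be kept or up-weighted when the number of matching positions passes a chosen threshold, for example 5 of 6; that threshold is an arbitrary illustrative choice.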
26Bacterial Promoter
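-35 box: T82 T84 G78 A65 C54 A45
spacer: 16-18 bp
-10 box: T80 A95 T45 A60 A50 T96 (A,G)
+1: transcription start
- The number after each base is its percent occurrence at that position
- Not so simple: remember, these are consensus sequences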
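Because these are consensus sequences, a promoter search has to tolerate mismatches. The sketch below scores every placement of a -35 hexamer and a -10 hexamer separated by 16-18 bp. It treats the number attached to each consensus base as its percent occurrence and sums those numbers over matching positions. This weighting scheme is an illustrative assumption, not an established scoring method, and the function names are hypothetical.

```python
# Consensus bases with the numbers given on the slide (taken as percent occurrence).
MINUS_35 = [("T", 82), ("T", 84), ("G", 78), ("A", 65), ("C", 54), ("A", 45)]
MINUS_10 = [("T", 80), ("A", 95), ("T", 45), ("A", 60), ("A", 50), ("T", 96)]

def hexamer_score(site, consensus):
    """Sum the weights of the positions where site matches the consensus base."""
    return sum(w for base, (cons, w) in zip(site, consensus) if base == cons)

def scan_promoters(seq, spacers=(16, 17, 18)):
    """Yield (score, pos_35, spacer) for every placement of the two hexamers."""
    seq = seq.upper()
    for spacer in spacers:
        span = 6 + spacer + 6
        for i in range(len(seq) - span + 1):
            s35 = hexamer_score(seq[i:i + 6], MINUS_35)
            s10 = hexamer_score(seq[i + 6 + spacer:i + span], MINUS_10)
            yield s35 + s10, i, spacer

# Made-up toy sequence: perfect -35 and -10 boxes separated by 17 bp.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(max(scan_promoters(toy)))
```

In practice, position weight matrices built from aligned known promoters are the usual refinement of this idea, since they also score the non-consensus bases at each position.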
27Termination Sequences
- 3' U tail
- Stem/loop
- Inverted repeat immediately preceding the run of uracils
- Termination sequence
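The pattern described above, an inverted repeat that can fold into a stem/loop followed at once by a run of U (T in the DNA), can be searched for with a simple scan. The sketch below is deliberately crude: it requires a perfect inverted repeat of fixed stem length, and the stem length, loop range and minimum T-run length are arbitrary illustrative parameters. The function name `find_terminators` is hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_terminators(seq, stem=6, loop_range=(3, 8), min_t_run=4):
    """Yield (stem_start, loop_len, t_run_start) for hairpin + T-run candidates.

    Looks for a perfect inverted repeat of length `stem`, separated by a
    short loop, that ends immediately before a run of at least `min_t_run` T's.
    """
    seq = seq.upper()
    for m in re.finditer("T{%d,}" % min_t_run, seq):
        t_start = m.start()
        right = seq[t_start - stem:t_start] if t_start >= stem else ""
        for loop in range(loop_range[0], loop_range[1] + 1):
            left_start = t_start - (2 * stem + loop)
            if left_start < 0:
                continue  # not enough sequence upstream for this loop size
            left = seq[left_start:left_start + stem]
            if right == reverse_complement(left):
                yield left_start, loop, t_start

# Made-up toy sequence: stem GCCGCC, loop GAAA, stem GGCGGC, then six T's.
toy = "AC" + "GCCGCC" + "GAAA" + "GGCGGC" + "TTTTTT" + "AC"
print(list(find_terminators(toy)))
```

Real terminator hairpins tolerate mismatches and variable stem lengths, so a practical search would score stem stability rather than demand a perfect repeat.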