Title: Predicting Genes in Eukaryotic Genomes By Computer
1Predicting Genes in Eukaryotic Genomes By Computer
- Hao Bailin
- T-Life Research Center, Fudan University
- Beijing Genomics Institute, Academia Sinica
- Institute of Theoretical Physics, Academia Sinica
- (www.itp.ac.cn/hao/)
2The Central Dogma of Molecular Biology
- DNA → DNA: replication
- DNA → mRNA: transcription (mRNA → cDNA: reverse transcription)
- mRNA → Protein/Enzyme: translation
- Protein → Structure → Function: folding, interaction
6DNA
- Written in an alphabet of 4 letters (a, c, g, t)
7Large-Scale DNA Sequencing Since 1977
- Sanger method: polymerization stopping
- Maxam-Gilbert: chemical degradation
- Each reaction: 500-600 bp (a single read)
- Clone by clone vs. whole-genome shotgun
- Sequence assembling: reads → contigs → scaffolds → superscaffolds
- Automatic sequencer: MegaBACE, 96 or 384 channels
8Letter production at BGI (Beijing and Hangzhou)
Daily: 5 × 10^7 letters. Yearly: 10^10 letters.
9Sequenced Eukaryotic Genomes
- Budding yeast (Saccharomyces cerevisiae)
- Fission yeast (Schizosaccharomyces pombe)
- Nematode (Caenorhabditis elegans)
- Fruit fly (Drosophila melanogaster)
- Malaria parasite (Plasmodium falciparum)
- Mosquito (Anopheles gambiae)
- Human (Homo sapiens), chimpanzee (Pan troglodytes)
- Mouse (Mus musculus), rat (Rattus norvegicus)
- Dog (Canis familiaris), chicken (Gallus gallus), pig (Sus scrofa)
- Pufferfish (Fugu rubripes)
- Silkworm (Bombyx mori), honeybee (Apis mellifera)
- Thale cress (Arabidopsis thaliana), rice (Oryza sativa)
- Maize (Zea mays)
10cccaatatcttgcttcagcaagatattgggtatttctagctttcctttct
tcaaaaattgctatatgttagcagaaaagccttatccattaagagatgga
acttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtg
cattacttccataccaagattagcacggttgatgatatcagcccaagtat
taataacgcgaccttggctatcaactacagattggttgaaattgaatccg
tttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccc
tactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaa
aactagcatattggaagattaatcggccaaaataaccatgagcggccaca
atattataagtttcttcctcttgaccaaatctgtaaccctcattagcaga
ttcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccat
gcatagcactgaatagggaaccgccgaatacaccagctacacctaacatg
tgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaa
gttgaaagtaccagatattcctaaaggcataccatcagagaaacttcctt
gaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagct
gaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttc
ccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaa
ttagctcataaggaccaccattgtataaccactcatcaacagatgcagct
tcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataat
ggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggct
cacgaataccatcaatatctactggaggggcagcgatgaaggcgataata
aatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaacca
tccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgac
cccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaaga
tcttggtttattcaaattgcaaggactcccaagcacacgtattaactaga
aagataatagaaggcttgttatttaacagtataatatagactatatacca
atgtcaaccaagccagccccgacagttgtatatccatacaacaaaattta
ccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcaga
ttgctcctttctagtttccatatgggttgcccgggactcgaacccggaac
tagtcggatggagtagataattattccttgttacaatagagaaaaaacct
ctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgta
gaaataggctatttctattccgaagaggaagtctactaatttttttagta
gtaagttgattcacttactatttattatagtacagagaacatttcagaat
ggaaactgtgaaagttttaccttgatcatttatcaatcatttctagttta
ttagttttgtttaatgattaattaagaggattcaccagatcattgatacg
gagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaa
agtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgta
aaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtac
cgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagta
tatactttagtcgatacaaagtcttcttttttgaagatccactgtgataa
tgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtac
aaaattgggcttttgctaaagatccaatgagaggagtaacagggactttg
gtatcgaattttttcatttgagtatctattagaaatgaattctccagcat
ttgattccttactaacaaagaatttattggtacacttgaaaagtacccca
gaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggt
tgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacattt
ccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttc
cttgatatcgaacataatgcataaggggatccataacgaaccatatggtt
ttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctaga
aaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaag
aagattgtttacgaagaaacaacaagaaaaattcatattctgatacataa
gagttatataggaaccgaaatagtcttttattttcttttttcaaaataaa
aatggatttcattgaagtaataaaactattccaattcgagtagtagttga
gaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattga
aggagttgaagcaagatatccaaatggataggatagggtatttctatatg
tgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattg
aatgaatagatcgtaaattctgaaactttggtatttctttttcttccgga
caagactgttctcgtagcgagaatgggatttctacaacgatcgcaaaccc
ctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaat
ccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattc
tgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaattt
cttgttattagaaccaataatttcgacaagttcggaaccatttaatccat
aatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacg
aaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttc
catttgtatttctacttgaatcagagagagagaaatatttctcggtttat
caaatggtgatacatagtacaatatggtcagaacagggtgttgcattttt
taatacaaacccctggggaagaaaaggagtctaatccacggatctttttc
cgctccttttctatccaatttgtttatgtttgttctaattacaaaagaga
acaaatcctttatttttgcaggccaattgctcttttgactttgggataca
gtctctttatcaatatactgcttcttttacacattcaatccataacatcc
ttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaa
aatcaaaggtctactcataggaaaaccagcttttccctacatcaggcact
aatctatttttaacgtctaattagatcagggagttcttccaattaagaag
ttaagctcgttgctttttgttttaccagaattggagccaggctctatcca
tttattcattagacccagaaaatcagaatttttttattccattccaaaaa
tccaaaataagaaattgattttattacgacatgctattttttccattcat
tacccttgaggatcagtcgcggtcttatagactctaccaagagtctggac
gaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaa
gccgagtactctaccattgagttagcaacccagataaactaggatcttag
atacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaa
acactcagataccaaaaggaacgggtctggttaaatttcactaaggttaa
aagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttat
ttaaataaataaataaatcttgtatgagagtacaaacaagagggacaacc
ctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggata
aagagacttatccatctacaaattctagatgttcaatggacctttgtcaa
tggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaata
aaggcttatgttggattggcacgacataaatccagtcaaaaataggatta
agaaagaggcaaattatttctaaatagttagacaacaagggatactagtg
agcctctcctagttttttattcatttagttcttcaattaactcaaagttc
tttctttttctttaaagaattccgccttccttaaaatatcagaaacggtt
cttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatt
taaacaagtttgattctttatcggatcataaaaacctacttttcgaagat
ctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagaca
gcttattgggatagatgtagataaataaagccccccctagaaacgtatag
gaggttttctcctcatacggctcgagaatatgacttgcattaatttccgt
acagaaaaaacaaatttcatttatactcatgactcaagttgactaatttt
gattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtct
ctaaactcttttctttgcctcatctcgaacaaattcacttttattcctta
ttccggtccaattctattgttgagacagttgaaaatcgtgtttacttgtt
cgggaatcctttatctttgatttgtgaaatccttgggtttaaacattact
tcgggaattcttattcttttttctttcaaaagagtagcaacatacccttt
tttcttatttccttcgataaagcatttccctcttctatagaaatcgaata
tgagcgattgattctgatagactttaatcaaaagagttttcccatatctt
ccaaaattggactttcttcttattttaaccttttgatttctatattattt
cgatttctatattaagggtagaatgacaaagttggcctaatttattagtt
ttcactaaccctagattctttcccttgataaaaaataaattctgtcctct
cgagctccatcgtgtactatttacttagcttacttacaaacaacccagcg
aaaattcggttcgggacgaatagaacagactatgtcgagccaagagcatt
ttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtc
cttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaa
cataacattcctctaatttcattgcaaagtgttatagggaattgatccaa
tatggatggaatcatgaatagtcattagtttcgttttttgtatactaatt
caaacttgctttgctatctatggagaaatatgaataaaagaaattaagta
tttatcgggaaagactccgcaaagagccaatttatttaaacccatattct
atcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgctt
aagacttatttattatggaatttccatcctcaacagaggactcgagatga
tcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaa
taaactatcaacctcccgtttaattaatttaattaatatattagattagc
aatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaatt
ctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaa
aaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttatt
ttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattct
atctaacgagcagttcttatcttatctttaccgggatggatcattctgga
tatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaa
aaagaagaaggaaccttttttactaataaaatactataaaaaaaatttat
ctctatcataaatctatctctaccataaaggaataggtctcgttttttat
acaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttca
atttgactggacttgacactggattatgttttctgagacagaaaatgaac
gcattaggactgcatcgaatctaagagtttataagagaaaaaaattctct
ttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgt
ttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacggg
accaaaacccgctgccttaccacttggccacgccccatttcgggttttat
gcgacactaataaacagtattatgtttatttcttattcgtcaatcctact
tcaattacataaaaatggggggtattctcttggtaggattctagacatgc
gaataatatagaatccaaaaaatgcattgatcattacatggaattctatt
aagatattatatgaaagtcgaatttcttccactctcatttgagagtgcga
atacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggatttt
gaatcctccttttcctttttcccttagaaaaataactcaatcaaaatcca
attatctactctacaagaacgaaacgcttgttatgcctaatatacttagt
ttaacctgtatttgttttaattctgttatttatccgactagttttttctt
cgccaaattgcccgaagcttatgccattttcaatccaatcgtggatttta
tgcctgtcatacctgtactcttttttctattagcctttgtttggcaagct
gctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatc
atgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtct
tatattatgaaccttcgattctaaaattcaaattcttctacattgaatgt
atagctgcagcaataaatttggatcagcctttctactccctgcatctacg
ttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattga
taagagtgcttattataaatcaattcttgcaatttttttcaaaaattgat
ttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttg
tgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggaga
ttatgtaatgcttactctcaaactttttgtttatacagtagtgatattct
ttgtttccctctttatctttggattcttatctaatgatccaggacgtaat
cctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggattt
gtttcatacatttatctacgagaaaatccgggggtcagaattccttccaa
ttcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaacc
ctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtcc
actcagccatctctccccgttccaaatcgaaaggtttccgtgatatgaca
gaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttctta
atgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaa
aattggatattggctaaaagacaatcagatagattttctcttcagcaggc
atttccatataggacttgttataataaaacaagcaggttatagaaaaaaa
ctcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaacc
aacccaccccataaaattggaaagaaagataaagtaagtggacctgactc
cttgaatgaggcctctatccgctattctgatatataaattcgatgtagat
gaaattgtataagtggatttttttgtatttccttagacttagaccacgca
aggcaagaatttctcgctatttactatttcatattcttgttactagatgt
tctataggaataagaagaaatcgcaacccctttccgctacacataaaaat
ggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaa
tcctatttttgttcttatacccatgcaatagagagcgagtgggaaaaggg
aggttactttttttcattttttccttaaaaaataggctttcttggaaata
ggaatcatggaataatctgaattccaatgtttatttctatagtataagaa
aaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccc
catagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggca
ttgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagt
gatataaggtgctcggaaatggttgaagtaattgaataggaggatcacta
tgactatagcccttggtagagttactaaagaagaaaatgatttatttgat
attatggacgactggttacgaagggaccgttttgtttttgtaggatggtc
tggcctattgctttttccttgtgcttatttcgctttaggaggttggttta
cagggacaacttttgtaacttcttggtatacccatggattggcgagttcc
tatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaa
tagtttagcacactctttgttgctactatggggcccggaagcacaagggg
attttactcgttggtgtcaattaggtggtctgtggacttttgttgctctc
catggggcttttgcactaataggtttcatgttacgtcaatttgaacttgc
tcggtctgttcaattgcggccttataatgcaatttcattctctggcccaa
tcgctgtttttgtttccgtattcctgatttatccactggggcaatccggt
tggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcct
cttcttccaaggatttcataattggacgttgaacccatttcatatgatgg
gagttgccggagtattaggcgcggctctgctatgcgctattcatggggca
accgtgga
12Proteins: 20 amino acids (AA); lengths from 50 to 6000 AA
ID   A1BG_HUMAN     STANDARD;      PRT;   495 AA.
...
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
...
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//
13Gene-Finding by Computer
- Starting from the early 1980s
- Ab initio or de novo algorithms: GeneMark, GenScan, FgeneSH, Genie, based on gene-structure models and training data. (Our ongoing project: BGF, the BGI Gene Finder)
- Homology methods: based on sequence alignment with known genes in databases and comparative genomics of not-too-distant species
- Mixed approach using both strategies: TwinScan
14Different Stages of Gene-Finding
- Use all possible existing programs and services on the web, with a public-domain or home-made genome viewer
- Write your own gene-finder, trained for the specific organism
- A dream for the time being: design a self-training and self-developing program for any species, which would improve itself iteratively, starting from a few available reads, cDNAs, and ESTs
15Performance of Gene-Finders in Eukaryote Genomes
- M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome)
- Nucleotide level: 80%
- Exon level: 45%
- Whole gene structure: 20%
- FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences)
- Nucleotide level: 90%
- Exon level: 60%
- Whole gene structure: 40%
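The slide does not say how the three accuracy levels are measured. A minimal sketch, assuming the conventional definitions (sensitivity = TP / (TP + FN), specificity = TP / (TP + FP), counted over coding nucleotides, with an exon counted as correct only if both of its boundaries match). The function names and the toy coordinates are hypothetical.

```python
def positions(exons):
    """Expand a list of (start, end) exons, end exclusive, into a set of positions."""
    return {i for start, end in exons for i in range(start, end)}


def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) for two sets of coding positions."""
    tp = len(true_coding & predicted_coding)   # coding, predicted coding
    fn = len(true_coding - predicted_coding)   # coding, missed
    fp = len(predicted_coding - true_coding)   # non-coding, predicted coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Toy annotation: two true exons, one predicted exactly, one shifted by 20 bp.
true_exons = [(100, 200), (300, 400)]
pred_exons = [(100, 200), (320, 420)]

sn, sp = nucleotide_accuracy(positions(true_exons), positions(pred_exons))
exact_exons = len(set(true_exons) & set(pred_exons))  # exon-level hits
print(f"nucleotide Sn={sn:.2f} Sp={sp:.2f}, exact exons={exact_exons}/{len(true_exons)}")
```

The toy case shows why exon-level figures are lower than nucleotide-level ones: a 20 bp boundary shift still recovers most coding nucleotides but counts as a missed exon.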
16(Figure: the two strands of DNA, one drawn 5' to 3' and the other 3' to 5')
- Each strand carries the same amount of information, but different sets of genes.
- The two strands are equivalent in information content.
- The two strands are not equivalent in gene content.
- Biological processing (duplication, transcription) goes from 5' to 3'.
- Finding genes on one strand at a time or on both strands at the same time: one-pass or two-pass programs.
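Searching the second strand amounts to searching the reverse complement of the first, read 5' to 3'. A minimal sketch of that step; the helper names are hypothetical and ambiguity symbols are left untouched.

```python
# Translation table for the four bases, in lower and upper case.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def both_strands(seq):
    """Return the two strands a gene-finder has to scan."""
    return {"+": seq, "-": reverse_complement(seq)}


strands = both_strands("cccaatatcttgcttcagcaag")
for name, s in strands.items():
    print(name, s)
```

A program that scans one strand at a time would call its search routine once per entry of `both_strands`.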
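17(Figure: from genomic DNA to folded protein)
- Genomic DNA (5' to 3', with start and stop marked) → transcribe (RNA Pol II) → pre-mRNA
- Pre-mRNA → splice (spliceosome: U1, U2, U4, U5, U6 snRNPs) → mRNA (with 5'-UTR and 3'-UTR)
- mRNA → translate (ribosome; initiation, elongation, and termination factors) → AA seq (protein primary sequence)
- AA seq → fold (chaperonins) → protein fold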
18Three Scales of Search
- Local: signals with a minimal signature (start, stop, splicing) and movable signals (caps, promoters, polyAs, branching points, some very weak). Tools: clustering, discriminant analysis, various statistical models
- Intermediate: exons, introns, intergenic regions. Tools: Markov, semi-Markov, and hidden Markov models; intron length distribution
- Global: optimal combination of the above. Tool: dynamic programming
19()?(.)(.)(.)?()
(Figure: the symbol string above, with transcription start, translation start, translation end, and transcription end marked in that order)
- Signals
- transcription start (downstream of promoters)
- transcription end (upstream of poly-A)
- ? translation start (atg, 1/64 in a random seq.)
- ? translation end (tag, tga, taa, 3/64)
- ( splicing donor site (minimal signal gt, 1/16)
- ) splicing acceptor site (ag, 1/16)
- branching point (very weak signal, a)
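The quoted frequencies (1/64, 3/64, 1/16) follow from treating the sequence as random with equal base probabilities. A minimal sketch that computes them and checks them against a simulated sequence; it assumes the usual start codon atg, and the names and sequence length are hypothetical.

```python
import random

# Minimal signatures of the local signals listed above.
SIGNALS = {
    "translation start": ["atg"],
    "translation end": ["tag", "tga", "taa"],
    "donor": ["gt"],
    "acceptor": ["ag"],
}


def expected_rate(patterns):
    """Chance that a random position starts one of the patterns (equal base frequencies)."""
    return sum(0.25 ** len(p) for p in patterns)


def observed_rate(seq, patterns):
    """Fraction of positions in seq at which one of the patterns starts."""
    hits = sum(
        1
        for p in patterns
        for i in range(len(seq) - len(p) + 1)
        if seq.startswith(p, i)
    )
    return hits / len(seq)


random.seed(0)
random_seq = "".join(random.choice("acgt") for _ in range(100_000))
for name, patterns in SIGNALS.items():
    print(f"{name:18s} expected {expected_rate(patterns):.4f} "
          f"observed {observed_rate(random_seq, patterns):.4f}")
```

Signals this short occur by chance every few dozen bases, which is why they have to be combined with statistical models of the surrounding sequence.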
20()?(.)(.)(.)?()
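(Figure: the symbol string above, with transcription start, translation start, translation end, and transcription end marked in that order)
- ?( First exon (translation start to donor site)
- )( Internal exon (acceptor site to donor site)
- )? Last exon (acceptor site to translation end)
- ( Non-coding 5' exon (transcription start to donor site)
- )? Non-coding 5' exon (acceptor site to translation start)
- (.) Intron
- ?( Non-coding 3' exon (rare) (translation end to donor site)
- ) Non-coding 3' exon (rare) (acceptor site to transcription end)
- Intergenic region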
21Signal and Sequence Models
- eiid: equal-probability, independently and identically distributed
- niid: non-equal-probability, independently and identically distributed
- WWM: windowed weight matrix, etc.
- MMn: Markov chain model of order n; homogeneous and period-3 MM5 are used in many gene-finders
- Consensus sequence
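A minimal sketch of a homogeneous Markov chain model of order n: each base is conditioned on the n bases before it, and a sequence is scored by its log-likelihood. The add-one pseudocount, the function names, and the toy training strings are assumptions for illustration; a real gene-finder would train an order-5 model (MM5) on known coding and non-coding sequences.

```python
import math
from collections import defaultdict


def train_markov(seqs, order, alphabet="acgt", pseudocount=1.0):
    """Count (context -> next base) transitions and return a probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(alphabet)
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order):
    """Log-probability of seq given its first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq)))


ORDER = 2  # kept small for the toy data; MM5 would use ORDER = 5
model_a = train_markov(["atgatgatgatg"], ORDER)
model_b = train_markov(["acgtacgtacgt"], ORDER)

query = "atgatg"
log_odds = log_likelihood(query, model_a, ORDER) - log_likelihood(query, model_b, ORDER)
print(f"log-odds of {query} (model A vs. model B): {log_odds:.3f}")
```

A period-3 model would keep three such tables, one per codon position, and pick the table by `i % 3`.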
22Consensus Sequences
- TATAAT (Pribnow or -10 box)
- T80 A95 T45 A60 A50 T96 (numbers: percent occurrence of each base)
- TTGACA (-35 box)
- T82 T84 G78 A65 C54 A45
- CAAT (CAAT or -75 box)
- GGYCAATCT
- TATA (TATA or Goldberg-Hogness box)
- TATAWAW
- ATG (translation start point)
- However, in Aful (Archaeoglobus fulgidus): ATG 76%, GTG 22%, TTG 2%
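The base percentages attached to a consensus can be turned into a weight matrix and used to scan a sequence. A minimal sketch for the -10 box, using the percentages quoted above (T 80, A 95, T 45, A 60, A 50, T 96). The slide gives only the consensus base at each position, so the remaining probability is split equally among the other three bases here; that split, the uniform background, and the function names are assumptions.

```python
import math

CONSENSUS = "TATAAT"
PERCENT = [80, 95, 45, 60, 50, 96]  # occurrence of the consensus base per position


def build_pwm(consensus, percent):
    """One dict per position: probability of each base (assumed equal split of the rest)."""
    pwm = []
    for base, pct in zip(consensus, percent):
        p = pct / 100
        rest = (1 - p) / 3
        pwm.append({b: (p if b == base else rest) for b in "ACGT"})
    return pwm


def log_odds(window, pwm, background=0.25):
    """Log2 odds of the window under the matrix versus a uniform background."""
    return sum(math.log2(col[b] / background) for b, col in zip(window, pwm))


def best_hit(seq, pwm):
    """Return (score, position) of the best-scoring window in seq."""
    k = len(pwm)
    return max((log_odds(seq[i:i + k], pwm), i) for i in range(len(seq) - k + 1))


pwm = build_pwm(CONSENSUS, PERCENT)
score, pos = best_hit("GGCGTATAATGCGC", pwm)
print(f"best window at {pos}, score {score:.2f} bits")
```

Unlike a plain consensus match, the score degrades gradually, so a window with one weakly conserved mismatch (position 3, 45%) is penalized less than one with a mismatch at position 6 (96%).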
24GT-AG Rule for Intron
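- 5' splicing (donor) site: exon ... A64 G73 | G100 T100 A62 A68 G84 T63 ... intron
- 3' splicing (acceptor) site: intron ... 12Py N C65 A100 G100 | N ... exon
- Numbers give the percent occurrence of each base; | marks the exon-intron boundary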
26Exon and intron size distribution
27Algorithms
- Sequence models and scores for signals
- Dynamic programming: optimal parse
- Hidden Markov Model: geometric distribution of intron lengths
- Semi-Hidden Markov Model: needs sequence-generating models and a length probability for each node
- Language theory approach
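Why a plain HMM implies geometric lengths: a state that loops back to itself with probability p is left after exactly L steps with probability p^(L-1) (1 - p). A minimal sketch of that distribution next to the explicit length table a semi-HMM would use instead; the function names and the example numbers are hypothetical.

```python
def geometric_pmf(length, p_stay):
    """P(duration == length) for an HMM state with self-transition probability p_stay."""
    if length < 1:
        return 0.0
    return p_stay ** (length - 1) * (1.0 - p_stay)


def mean_duration(p_stay):
    """Expected number of steps spent in the state."""
    return 1.0 / (1.0 - p_stay)


def explicit_pmf(length, table):
    """Semi-HMM alternative: look the probability up in an explicit length table."""
    return table.get(length, 0.0)


p_stay = 0.99  # gives a mean length of 100
print(f"mean {mean_duration(p_stay):.0f}, "
      f"P(10)={geometric_pmf(10, p_stay):.4f}, P(100)={geometric_pmf(100, p_stay):.4f}")
```

The geometric form always makes the shortest length the most probable one; a semi-HMM can instead use whatever length distribution is observed for each node.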
28Flow Chart of GenScan
- Chris Burge (1996): a 27-state semi-HMM
- A simpler model: 19-state
- A model taking UTR introns into account: 35-state
29(Figure legend) N, intergenic region; P, promoter; F, 5' UTR; Esngl, single-exon gene; Einit, initial exon; Ek, phase k internal exon; Eterm, terminal exon; T, 3' UTR; A, polyadenylation signal; and Ik, phase k intron. (+) and (-) denote the strand.
30Problems: Minor and Major
- Ambiguity symbols (N, W, S, R, ...)
- (1-p) at flanking D-type nodes
- Indels and frame-shifts
- Gradient effects in gene structure
- Introns in 5'-UTRs and 3'-UTRs, leading to 35-state Markov models
- Alternative splicing and sub-optimal paths
- Limit of probabilistic models
- Deterministic approaches
31Dyck language: a language of nested parentheses
- Many types of parentheses
- Finite depth of nesting
- Context-free language
- Our case
- Only 3 types of parentheses
- Shallow nesting
- Conjecture: may be a regular language
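A minimal sketch of a Dyck-language recognizer with three bracket types and an optional bound on nesting depth. The glyphs ( ), [ ], { } are stand-ins chosen for illustration, not the symbols used on the slides, and the function name is hypothetical. With the depth bounded, only finitely many stack contents are possible, which is what makes a finite-state (regular) description plausible.

```python
# Closing bracket -> matching opening bracket (three types, as in "our case").
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def is_dyck(word, max_depth=None):
    """True if word is a well-nested bracket string, optionally with bounded depth."""
    stack = []
    for ch in word:
        if ch in OPENERS:
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False  # nesting deeper than allowed
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False  # closing bracket without its opener
        else:
            return False  # symbol outside the alphabet
    return not stack  # every opener must have been closed


for w in ["([]){}", "([)]", "(())"]:
    print(w, is_dyck(w), is_dyck(w, max_depth=1))
```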
32Two Subspecies of Rice
- Oryza sativa ssp. indica (xian in Chinese)
- Oryza sativa ssp. japonica (geng in Chinese)
- The difference was described in Xu Shen's Shuowen Jiezi, a Chinese dictionary of the Eastern Han Dynasty (2nd century AD)
- J.H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J., 53(4), 2002, 3 (in Chinese)
33Two Test Datasets for Rice Gene-Finders
- The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301 (18 July 2003))
- Select a high-quality subset without overlaps with publicly available cDNAs
- A single-gene set: 500 sequences with one gene in each
- A multi-gene set: 46 sequences with 199 genes in total (at least 4 genes in a sequence)
34Assessment of Gene-Finders
- Test done between 22 July and 2 August 2003
- FgeneSH (trained on monocotyledons)
- GeneMark.hmm
- RiceHMM
- GlimmerR
- GenScan (trained on maize)
- BGF (rise.genomics.org.cn/bgf/)
35Our Ultimate Goal
- An iterative, self-training, self-improving gene-finder for any species, starting from a small number of reads, with or without EST and cDNA support
- Annotation and re-annotation of the rice genomes
- Plant comparative genomics, especially that of Gramineae and Crucifers
36tRNA features
- tRNA gene → pre-tRNA → mature tRNA
- Mature tRNA: 75-95 bases
- Cloverleaf-like structure
- Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm
37How many tRNA genes are present in an organism?
- Codon ↔ tRNA ↔ amino acid
- 61 coding codons
- 20 amino acids
- Are there 61 species of tRNA with all possible anticodons?
- Met (M) has one codon but two tRNAs
38Wobble hypothesis (Crick, 1966)
- Many tRNAs recognize more than one codon
- Through non-Watson-Crick base pairings
- Fewer than 61 tRNAs are needed
39The Modified Wobble Hypothesis (Guthrie and Abelson, 1982)
- In eukaryotes, 46 different tRNA species would be enough.
- The modified wobble hypothesis holds almost perfectly in H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete collections of tRNAs are now known.
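40tRNA copies in Arabidopsis, C. elegans, and Human
(Figure: codon table of tRNA gene copy numbers, with boxes labeled by amino acid: F, C, Y, S, W, L, H, R, P, Q, I, N, S, T, K, R, M, D, V, A, G, E)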
41tRNA Genes in the Rice Genome (Found by tRNAscan-SE and BLASTN)
42Chloroplast tRNA genes in ssp. indica and japonica
- 33 tRNA genes were found in each of the indica and japonica chloroplast genomes.
- They are completely identical; no mutation is found (E. C. Kemmerer and Ray Wu found two tRNA genes perfectly conserved).
- It is remarkable that, in spite of more than 9000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.
43The End
52Some Informatics Work Related to the Rice (Oryza sativa L. ssp. indica) Draft Genome
- HAO Bailin
- Beijing Genomics Institute (BGI)
- Institute of Theoretical Physics (ITP)
- T-Life Research Center, Fudan University
- http://www.itp.ac.cn/hao/
53Informatics Problems
- Collection and quality control of data
- Assembling of reads, dealing with repeats
- Gene-finding and annotation
- RNA genes
- Protein-coding genes
- Prediction of structure and function
- Connection to gene expression data
54The Central Dogma of Molecular Biology
- DNA → DNA: replication
- DNA → mRNA: transcription (mRNA → cDNA: reverse transcription)
- mRNA → Protein/Enzyme: translation
- Protein → Structure → Function: folding, interaction
55Genetic Material
- DNA: linear or circular
- Chromosome: DNA + histones
- Mitochondria
- Chloroplast
- Plasmids: linear or circular
56Two Kinds of Tasks
- Developing a new method of gene-finding: a more or less academic job
- Finding genes in a given genomic sequence: a practical job
57The Transfer RNA Genes in the Rice (Oryza sativa ssp. indica) Collection of Contigs
- WANG Xiyin, SHI Xiaoli
- (Peking U and BGI)
- HAO Bailin
- (BGI, Fudan University, and ITP)
58tRNA function
- tRNAs are the actual translators from mRNA to the amino acids of a protein.
- Bridge between the RNA world and the protein world
- Naming convention
- trnQ-UUG
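A minimal sketch of how such a name can be read, assuming it is built as "trn" + one-letter amino acid code + "-" + anticodon (written 5' to 3'), which is consistent with the other names used in these slides (trnY-GUA, trnM-CAU, trnT-CGU). The Watson-Crick partner codon is the reverse complement of the anticodon. The helper names are hypothetical.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def parse_trna_name(name):
    """Split a name such as 'trnQ-UUG' into (amino acid letter, anticodon)."""
    prefix, anticodon = name.split("-")
    return prefix[3:], anticodon


def anticodon_to_codon(anticodon):
    """Codon (5' to 3') that pairs with the anticodon by Watson-Crick rules."""
    return anticodon.translate(RNA_COMPLEMENT)[::-1]


for name in ["trnQ-UUG", "trnY-GUA", "trnM-CAU", "trnT-CGU"]:
    amino_acid, anticodon = parse_trna_name(name)
    print(name, amino_acid, anticodon, "reads", anticodon_to_codon(anticodon))
```

For trnQ-UUG this gives the codon CAA, one of the two glutamine (Q) codons.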
59tRNA features
- tRNA gene → pre-tRNA → mature tRNA
- Mature tRNA: 75-95 bases
- Cloverleaf-like structure
- Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm
60tRNA structure
mRNA
61How many tRNA genes are present in an organism?
- Codon ↔ tRNA ↔ amino acid
- 61 coding codons
- 20 amino acids
- Are there 61 species of tRNA with all possible anticodons?
- Met (M) has one codon but two tRNAs
62The Wobble Hypothesis
- The Wobble Hypothesis (Crick, 1966)
- The Modified Wobble Hypothesis (1982): 46 tRNA species would be enough
- What has been found in yeast, worm, and human
63Wobble hypothesis (Crick, 1966)
- Many tRNAs recognize more than one codon
- Through non-Watson-Crick base pairings
- Fewer than 61 tRNAs are needed
64Wobble rules by Crick
Codon (base 3)    Anticodon (base 1)
U                 A, G, I
C                 G, I
A                 U, I
G                 C, U
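A minimal sketch that applies the table above: the first two codon bases must pair by Watson-Crick rules with anticodon bases 3 and 2, while codon base 3 may pair with any anticodon base 1 the table allows. Anticodons and codons are written 5' to 3'; the function names are hypothetical.

```python
# Codon base 3 -> anticodon base 1 values allowed by Crick's wobble rules.
WOBBLE = {
    "U": {"A", "G", "I"},
    "C": {"G", "I"},
    "A": {"U", "I"},
    "G": {"C", "U"},
}
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reads(anticodon, codon):
    """True if the anticodon can read the codon under the wobble rules."""
    a1, a2, a3 = anticodon
    c1, c2, c3 = codon
    return WATSON_CRICK[c1] == a3 and WATSON_CRICK[c2] == a2 and a1 in WOBBLE[c3]


def codons_read(anticodon):
    """All codons a given anticodon can read."""
    bases = "UCAG"
    return [c1 + c2 + c3 for c1 in bases for c2 in bases for c3 in bases
            if reads(anticodon, c1 + c2 + c3)]


for anticodon in ["GUA", "IGC", "CAU"]:
    print(anticodon, "reads", codons_read(anticodon))
```

One tRNA with anticodon GUA covers both tyrosine codons, and one with inosine in the first position covers three alanine codons, which is why fewer than 61 tRNA species suffice.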
65The Modified Wobble Hypothesis
- In eukaryotes, 46 different tRNA species would be enough.
- The revised wobble hypothesis is almost perfectly obeyed by H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete collections of tRNAs are now known.
66Revised wobble hypothesis in eukaryotes (Guthrie and Abelson, 1982)
Two-codon boxes
Codon (base 3)    Anticodon (base 1)
U                 G, I
C                 G, I
A                 U
G                 C
67Modified wobble hypothesis in eukaryotes
- In two-codon boxes: codon base 3 U or C is read by anticodon base 1 G
- In four-codon boxes: codon base 3 U or C is read by anticodon base 1 A (I)
- One exceptional four-codon box, for Gly: codon base 3 U or C is read by anticodon base 1 G
68Human codon usage and tRNA genes
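69tRNA copies in Arabidopsis, C. elegans, and Human
(Figure: codon table of tRNA gene copy numbers, with boxes labeled by amino acid: F, C, Y, S, W, L, H, R, P, Q, I, N, S, T, K, R, M, D, V, A, G, E)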
70Distribution of tRNA genes in a genome
- A kind of repeats
- Usually clustered together
- Distributed unevenly among chromosomes
- For example, in the human genome, 140 tRNA genes, about 25% of the total, form a cluster in a narrow region of only 4 Mbp on chr. 6
71BGI Rice Contigs
- 127,550 contigs with a total length of 361 Mb (out of an estimated genome size of 466 Mb)
- N50 size: 6690 bp
- It makes sense to look for tRNA genes: they are only about 75-95 bp long, so it is possible to catch most of them.
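A minimal sketch of the two figures quoted above. N50 is taken here in its usual sense: the contig length such that contigs at least that long contain half of the assembled bases. The function name and the toy contig lengths are hypothetical; only 361 Mb and 466 Mb come from the slide.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig")


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print("toy N50:", n50(toy_contigs))

# Share of the estimated genome covered by the contigs.
coverage = 361 / 466
print(f"contig coverage: {coverage:.1%}")
```

With an N50 of 6690 bp against a gene length under 100 bp, few tRNA genes should be split across contig ends; the larger loss comes from the part of the genome not yet in contigs.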
72How many tRNA genes are there in the rice genome?
- Is the revised wobble hypothesis obeyed?
- Are there 46 species of tRNA genes?
- How many copies for each tRNA?
73With the help of the tRNAscan-SE program and
BLASTN, a collection of tRNA genes was obtained
- 592 canonical tRNA genes
- 3 possible selenocysteine tRNA genes
- 1 possible suppressor tRNA gene
- 27 possible pseudo-tRNA-genes
74Canonical tRNA genes: 592
- BLASTN-confirmed tRNA: 467
- Probable novel tRNA: 74
- Putative novel tRNA: 51
- Novel means more adapted to rice
75Pseudo-tRNA genes: 27
- Genomic sequences structurally related to tRNA
- Unable to yield active gene products
- May have insertions, deletions
- May lack functional promoters
- Experiments needed to test if they are really functionally inactive
- Divided into four classes
- End-truncated type
- Insertion-disrupted
- Non-maintained
- Non-tRNA but pol III-like elements
76Rice codons and tRNA genes
77Wobble hypothesis is perfectly obeyed by the rice genome!
- 45 species of tRNA genes found so far.
78On the absence of trnT-CGU gene
- In fact, six possible trnT-CGU genes were found but discarded for low similarity to known tRNA genes.
- The data are incomplete: only 361 Mb in contigs.
- tRNA genes tend to cluster together in a genome.
- The gene will almost surely be found (3 trnT-CGU genes were found in japonica).
- Rice is not an exception to the wobble hypothesis.
79Introns: 36 tRNA genes have an intron
- 6% of the total
- All have only one intron
- Intron length: 12-20 bp generally
- 38 bp the longest
- All trnY-GUA genes have an intron
- All non-initiator trnM-CAU genes have an intron, while all initiator trnM-CAU genes have no intron
80One possible suppressor tRNA gene
- A suppressor tRNA is a mutant tRNA that recognizes a stop codon (UAA/UAG) instead of the codon for its cognate amino acid, sometimes, but not always, because of a base substitution in the anticodon.
- Here, it recognizes the stop codon UAA.
81Three possible selenocysteine tRNA genes
- Selenocysteine is the 21st amino acid, found in every domain of life on Earth.
- While there are many more amino acids than the twenty that are part of the standard genetic code, only selenocysteine and pyrrolysine have been found to be genetically encoded.
- Selenocysteine is encoded by the UGA codon, the umber termination codon.
- Prokaryotes and eukaryotes use different mechanisms to tell the translation machinery of the cell whether to continue or terminate translation.
- AIDS patients have been found to carry several low-molecular-mass selenium compounds, which are thought to be selenoproteins encoded by the HIV genome.
82Chloroplast (46) and mitochondrial (10) tRNA genes found
- Due to sequencing contamination
- Some chloroplast tRNA genes must be identical copies
- There are about 33 tRNA genes in the rice chloroplast genome, as predicted by tRNAscan-SE
83Codon Bias in Rice Genome
- Codon bias exists in the rice genome.
- Codon bias in rice resembles that in human; however, XCG-form codons are used less in the human genome.
- Codons ending with G or C are preferred to those ending with A or U, respectively.
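Codon bias of this kind is measured by counting codons in coding sequences and tallying their third bases. A minimal sketch on a made-up coding sequence; the function names and the example sequence are hypothetical, and a real analysis would run over all annotated coding regions.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def third_base_fractions(counts):
    """Fraction of codons ending with each base (requires at least one codon)."""
    total = sum(counts.values())
    by_base = Counter()
    for codon, n in counts.items():
        by_base[codon[2]] += n
    return {b: by_base[b] / total for b in "ACGU"}


counts = codon_counts("ATGGCCGCGAAATAA")
fractions = third_base_fractions(counts)
print(dict(counts))
print(fractions)
```

Comparing these fractions between G/C-ending and A/U-ending codons, or restricting the count to XCG-form codons, gives the kind of comparison made in the bullets above.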
84A roughly positive correlation between codon
usage and the corresponding tRNA gene number
85tRNA genes are dispersed in the whole genome
- Based on the public data for japonica, Chr. 10, Chr. 7, Chr. 6, and Chr. 3 may have many more tRNA genes than the other chromosomes.
- Many of them may form a few clusters.
- In fact, many tRNA genes are repeats; for example, 8 almost identical trnQ-UUG genes are found on one contig of indica.
- Many tRNA genes are identical in sequence. They may be repeated copies of genes or may be caused by assembly errors.
86tRNA genes in eukaryotes
Species            tRNA genes   Genome size (Mbp)   tRNA genes per Mbp of genome   CDS size (Mbp)   tRNA genes per Mbp of CDS
S. cerevisiae      273          12                  22.75                          8.45             32
S. pombe           174          14                  12.48                          6.9              25
C. elegans         584          100                 5.84                           26.1             22
A. thaliana        620          125                 4.96                           33.5             18
D. melanogaster    284          180                 1.58                           24.1             12
O. sativa          596          464                 1.48                           --               --
H. sapiens         648          3400                0.19                           58.5(?)          11(?)
87Chloroplast genome of Oryza sativa ssp. indica and japonica
- Almost the same genome size
- indica: 134559 bp (2001 data)
- japonica: 134525 bp (1989 data, CHOSXX, X15901)
- Elizabeth C. Kemmerer and Ray Wu (2001)
- Very few differences between the sequences of 11 chloroplast genes from indica and japonica, including 2 tRNA genes.
- The coding regions and flanking regions up to 100 bp are highly conserved.
- More differences in intron regions than in coding regions.
88Chloroplast tRNA genes in ssp. indica and japonica
- 33 tRNA genes were found in each of the indica and japonica chloroplast genomes.
- They are completely identical; no mutation is found (E. C. Kemmerer and Ray Wu found two tRNA genes perfectly conserved).
- It is remarkable that, in spite of more than 7000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.
89References
- 1. J. Yu et al., A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79 (2002).
- 2. S. A. Goff et al., A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92 (2002).
- 3. F. Crick, Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19, 548-555 (1966).
- 4. C. Guthrie and J. Abelson, Organization and expression of tRNA genes in Saccharomyces cerevisiae. In The Molecular Biology of the Yeast Saccharomyces: Metabolism and Gene Expression (ed. J. Strathern et al.), Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, pp. 487-528 (1982).
- 5. International Human Genome Sequencing Consortium, Nature 409, 860 (2001).
- 6. S. Eddy et al., http://rna.wustl.edu/GtRDB/
90Transcriptional or functional efficiency of tRNA
genes
                              Codon frequency   tRNA gene number   Codon frequency per tRNA gene
Codons ending with U and C    5332              277                19.25
Codons ending with A          1678              132                12.67
Codons ending with G          2978              183                16.26
91tRNA gene sub-species studied
92We have to point out
- The collection of tRNA genes we obtained may be redundant
- Novel tRNA genes may be found
- Experimental work is needed to prove whether they are genuine tRNA genes or not
- As the scientists sequencing the human genome pointed out, looking for novel ncRNA genes would still be challenging even if the complete, finished sequence of the genome were available
93GC-Gradient Effect in Rice
- Jun Wang, Gane Wong, et al.
- Genome Research (2002)
94Fly
95References on language theory
- Huimin Xie, Grammatical Complexity and 1D Dynamical Systems, Vol. 6 in Directions in Chaos, WSPC, 1996.
- ??? ????????
- ?????????, 1994
- J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.
96THE END. THANK YOU!