Title: Finding Genes in the Rice Genome
1. Finding Genes in the Rice Genome
- Hao Bailin
- T-Life Research Center, Fudan University
- Beijing Genomics Institute , Academia Sinica
- Institute of Theoretical Physics, Academia Sinica
- (www.itp.ac.cn/hao/)
- Ongoing work by a team of 10-12 people since August 2001: Zheng Weimou, Xie Huimin, Liu Jinsong, Xu Zhao, Fang Lin, Li Heng, Gao Lei, Jin Jiao, et al. Nothing written yet.
2. Two Cultivars of Rice
- Oryza sativa ssp. indica (xian)
- Oryza sativa ssp. japonica (geng)
- The difference was described in Xu Shen's Chinese dictionary Shuowen Jiezi, Eastern Han Dynasty (2nd century AD)
- J.H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J., 53(4), 2002, 3 (in Chinese)
3. (Sequence listing)
cccaatatcttgcttcagcaagatattgggtatttctagctttcctttct
tcaaaaattgctatatgttagcagaaaagccttatccattaagagatgga
acttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtg
cattacttccataccaagattagcacggttgatgatatcagcccaagtat
taataacgcgaccttggctatcaactacagattggttgaaattgaatccg
tttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccc
tactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaa
aactagcatattggaagattaatcggccaaaataaccatgagcggccaca
atattataagtttcttcctcttgaccaaatctgtaaccctcattagcaga
ttcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccat
gcatagcactgaatagggaaccgccgaatacaccagctacacctaacatg
tgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaa
gttgaaagtaccagatattcctaaaggcataccatcagagaaacttcctt
gaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagct
gaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttc
ccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaa
ttagctcataaggaccaccattgtataaccactcatcaacagatgcagct
tcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataat
ggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggct
cacgaataccatcaatatctactggaggggcagcgatgaaggcgataata
aatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaacca
tccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgac
cccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaaga
tcttggtttattcaaattgcaaggactcccaagcacacgtattaactaga
aagataatagaaggcttgttatttaacagtataatatagactatatacca
atgtcaaccaagccagccccgacagttgtatatccatacaacaaaattta
ccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcaga
ttgctcctttctagtttccatatgggttgcccgggactcgaacccggaac
tagtcggatggagtagataattattccttgttacaatagagaaaaaacct
ctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgta
gaaataggctatttctattccgaagaggaagtctactaatttttttagta
gtaagttgattcacttactatttattatagtacagagaacatttcagaat
ggaaactgtgaaagttttaccttgatcatttatcaatcatttctagttta
ttagttttgtttaatgattaattaagaggattcaccagatcattgatacg
gagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaa
agtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgta
aaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtac
cgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagta
tatactttagtcgatacaaagtcttcttttttgaagatccactgtgataa
tgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtac
aaaattgggcttttgctaaagatccaatgagaggagtaacagggactttg
gtatcgaattttttcatttgagtatctattagaaatgaattctccagcat
ttgattccttactaacaaagaatttattggtacacttgaaaagtacccca
gaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggt
tgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacattt
ccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttc
cttgatatcgaacataatgcataaggggatccataacgaaccatatggtt
ttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctaga
aaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaag
aagattgtttacgaagaaacaacaagaaaaattcatattctgatacataa
gagttatataggaaccgaaatagtcttttattttcttttttcaaaataaa
aatggatttcattgaagtaataaaactattccaattcgagtagtagttga
gaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattga
aggagttgaagcaagatatccaaatggataggatagggtatttctatatg
tgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattg
aatgaatagatcgtaaattctgaaactttggtatttctttttcttccgga
caagactgttctcgtagcgagaatgggatttctacaacgatcgcaaaccc
ctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaat
ccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattc
tgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaattt
cttgttattagaaccaataatttcgacaagttcggaaccatttaatccat
aatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacg
aaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttc
catttgtatttctacttgaatcagagagagagaaatatttctcggtttat
caaatggtgatacatagtacaatatggtcagaacagggtgttgcattttt
taatacaaacccctggggaagaaaaggagtctaatccacggatctttttc
cgctccttttctatccaatttgtttatgtttgttctaattacaaaagaga
acaaatcctttatttttgcaggccaattgctcttttgactttgggataca
gtctctttatcaatatactgcttcttttacacattcaatccataacatcc
ttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaa
aatcaaaggtctactcataggaaaaccagcttttccctacatcaggcact
aatctatttttaacgtctaattagatcagggagttcttccaattaagaag
ttaagctcgttgctttttgttttaccagaattggagccaggctctatcca
tttattcattagacccagaaaatcagaatttttttattccattccaaaaa
tccaaaataagaaattgattttattacgacatgctattttttccattcat
tacccttgaggatcagtcgcggtcttatagactctaccaagagtctggac
gaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaa
gccgagtactctaccattgagttagcaacccagataaactaggatcttag
atacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaa
acactcagataccaaaaggaacgggtctggttaaatttcactaaggttaa
aagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttat
ttaaataaataaataaatcttgtatgagagtacaaacaagagggacaacc
ctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggata
aagagacttatccatctacaaattctagatgttcaatggacctttgtcaa
tggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaata
aaggcttatgttggattggcacgacataaatccagtcaaaaataggatta
agaaagaggcaaattatttctaaatagttagacaacaagggatactagtg
agcctctcctagttttttattcatttagttcttcaattaactcaaagttc
tttctttttctttaaagaattccgccttccttaaaatatcagaaacggtt
cttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatt
taaacaagtttgattctttatcggatcataaaaacctacttttcgaagat
ctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagaca
gcttattgggatagatgtagataaataaagccccccctagaaacgtatag
gaggttttctcctcatacggctcgagaatatgacttgcattaatttccgt
acagaaaaaacaaatttcatttatactcatgactcaagttgactaatttt
gattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtct
ctaaactcttttctttgcctcatctcgaacaaattcacttttattcctta
ttccggtccaattctattgttgagacagttgaaaatcgtgtttacttgtt
cgggaatcctttatctttgatttgtgaaatccttgggtttaaacattact
tcgggaattcttattcttttttctttcaaaagagtagcaacatacccttt
tttcttatttccttcgataaagcatttccctcttctatagaaatcgaata
tgagcgattgattctgatagactttaatcaaaagagttttcccatatctt
ccaaaattggactttcttcttattttaaccttttgatttctatattattt
cgatttctatattaagggtagaatgacaaagttggcctaatttattagtt
ttcactaaccctagattctttcccttgataaaaaataaattctgtcctct
cgagctccatcgtgtactatttacttagcttacttacaaacaacccagcg
aaaattcggttcgggacgaatagaacagactatgtcgagccaagagcatt
ttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtc
cttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaa
cataacattcctctaatttcattgcaaagtgttatagggaattgatccaa
tatggatggaatcatgaatagtcattagtttcgttttttgtatactaatt
caaacttgctttgctatctatggagaaatatgaataaaagaaattaagta
tttatcgggaaagactccgcaaagagccaatttatttaaacccatattct
atcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgctt
aagacttatttattatggaatttccatcctcaacagaggactcgagatga
tcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaa
taaactatcaacctcccgtttaattaatttaattaatatattagattagc
aatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaatt
ctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaa
aaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttatt
ttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattct
atctaacgagcagttcttatcttatctttaccgggatggatcattctgga
tatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaa
aaagaagaaggaaccttttttactaataaaatactataaaaaaaatttat
ctctatcataaatctatctctaccataaaggaataggtctcgttttttat
acaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttca
atttgactggacttgacactggattatgttttctgagacagaaaatgaac
gcattaggactgcatcgaatctaagagtttataagagaaaaaaattctct
ttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgt
ttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacggg
accaaaacccgctgccttaccacttggccacgccccatttcgggttttat
gcgacactaataaacagtattatgtttatttcttattcgtcaatcctact
tcaattacataaaaatggggggtattctcttggtaggattctagacatgc
gaataatatagaatccaaaaaatgcattgatcattacatggaattctatt
aagatattatatgaaagtcgaatttcttccactctcatttgagagtgcga
atacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggatttt
gaatcctccttttcctttttcccttagaaaaataactcaatcaaaatcca
attatctactctacaagaacgaaacgcttgttatgcctaatatacttagt
ttaacctgtatttgttttaattctgttatttatccgactagttttttctt
cgccaaattgcccgaagcttatgccattttcaatccaatcgtggatttta
tgcctgtcatacctgtactcttttttctattagcctttgtttggcaagct
gctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatc
atgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtct
tatattatgaaccttcgattctaaaattcaaattcttctacattgaatgt
atagctgcagcaataaatttggatcagcctttctactccctgcatctacg
ttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattga
taagagtgcttattataaatcaattcttgcaatttttttcaaaaattgat
ttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttg
tgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggaga
ttatgtaatgcttactctcaaactttttgtttatacagtagtgatattct
ttgtttccctctttatctttggattcttatctaatgatccaggacgtaat
cctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggattt
gtttcatacatttatctacgagaaaatccgggggtcagaattccttccaa
ttcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaacc
ctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtcc
actcagccatctctccccgttccaaatcgaaaggtttccgtgatatgaca
gaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttctta
atgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaa
aattggatattggctaaaagacaatcagatagattttctcttcagcaggc
atttccatataggacttgttataataaaacaagcaggttatagaaaaaaa
ctcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaacc
aacccaccccataaaattggaaagaaagataaagtaagtggacctgactc
cttgaatgaggcctctatccgctattctgatatataaattcgatgtagat
gaaattgtataagtggatttttttgtatttccttagacttagaccacgca
aggcaagaatttctcgctatttactatttcatattcttgttactagatgt
tctataggaataagaagaaatcgcaacccctttccgctacacataaaaat
ggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaa
tcctatttttgttcttatacccatgcaatagagagcgagtgggaaaaggg
aggttactttttttcattttttccttaaaaaataggctttcttggaaata
ggaatcatggaataatctgaattccaatgtttatttctatagtataagaa
aaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccc
catagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggca
ttgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagt
gatataaggtgctcggaaatggttgaagtaattgaataggaggatcacta
tgactatagcccttggtagagttactaaagaagaaaatgatttatttgat
attatggacgactggttacgaagggaccgttttgtttttgtaggatggtc
tggcctattgctttttccttgtgcttatttcgctttaggaggttggttta
cagggacaacttttgtaacttcttggtatacccatggattggcgagttcc
tatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaa
tagtttagcacactctttgttgctactatggggcccggaagcacaagggg
attttactcgttggtgtcaattaggtggtctgtggacttttgttgctctc
catggggcttttgcactaataggtttcatgttacgtcaatttgaacttgc
tcggtctgttcaattgcggccttataatgcaatttcattctctggcccaa
tcgctgtttttgtttccgtattcctgatttatccactggggcaatccggt
tggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcct
cttcttccaaggatttcataattggacgttgaacccatttcatatgatgg
gagttgccggagtattaggcgcggctctgctatgcgctattcatggggca
accgtgga
4. Gene-Finding by Computer
- Starting from the early 1980s
- Ab initio (de novo) algorithms: GeneMark, GenScan, FgeneSH, Genie, based on gene-structure models and training data. (Our ongoing project: BGF, the BGI Gene Finder)
- Homology methods: based on sequence alignment with known genes in databases
- Mixed approach using both strategies: TwinScan
5. Different Stages of Gene-Finding
- Use all possible existing programs and services on the web with a public-domain or home-made genome viewer
- Write your own gene-finder, trained for the specific organism
- A dream for the time being: design a self-training and self-developing program for any species, which would improve itself iteratively starting from a few available reads, cDNAs, and ESTs
6. Performance of Gene-Finders in Eukaryote Genomes
- M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome):
- Nucleotide level: 80%
- Exon level: 45%
- Whole gene structure: 20%
- FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences):
- Nucleotide level: 90%
- Exon level: 60%
- Whole gene structure: 40%
7. [Figure: the two DNA strands, each with its 5' and 3' ends labeled]
- Each strand carries the same amount of information, but different sets of genes.
- The two strands are equivalent in information content.
- The two strands are not equivalent in gene content.
- Biological processing (replication, transcription) goes from 5' to 3'.
- Genes can be sought on one strand at a time or on both strands at the same time: one-pass or two-pass programs.
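Because both strands are read 5' to 3', a finder that handles one strand at a time can reach the other strand through the reverse complement. The following is a minimal Python sketch of that idea only; the function names are illustrative and are not taken from BGF or any other program named in these slides.

```python
# Complement table for the four bases plus N (unknown base).
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_on_both_strands(seq: str, motif: str):
    """Yield (strand, position) for every occurrence of motif.

    Positions are 0-based offsets on the forward strand of the
    leftmost base covered by the match.
    """
    seq, motif = seq.lower(), motif.lower()
    n, m = len(seq), len(motif)
    # First pass: forward strand.
    for i in range(n - m + 1):
        if seq[i:i + m] == motif:
            yield "+", i
    # Second pass: reverse strand, scanned as its reverse complement.
    rc = reverse_complement(seq)
    for j in range(n - m + 1):
        if rc[j:j + m] == motif:
            # Map the reverse-strand offset back to forward coordinates.
            yield "-", n - m - j
```

For example, `list(find_on_both_strands("aaatgccat", "atg"))` reports one hit on each strand, since `cat` on the forward strand reads `atg` on the reverse strand.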
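8. [Figure: from genomic DNA to folded protein]
- Genomic DNA (5' to 3', with start and stop marked)
- transcribe (RNA Pol II) -> pre-mRNA
- splice (spliceosome: U1, U2, U4, U5, U6 RNPs) -> mRNA (with 5'-UTR and 3'-UTR)
- translate (ribosome; initiation, elongation and termination factors; chaperonins) -> amino-acid sequence (protein primary sequence)
- fold -> folded protein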
9. Three Scales of Search
- Local: signals with minimal signature (start, stop, splicing) and movable signals (caps, promoters, polyAs, branching points, some very weak). Methods: clustering, discriminant analysis, various statistical models
- Intermediate: exons, introns, intergenic regions. Methods: Markov, semi-Markov and hidden Markov models; intron length distribution
- Global: optimal combination of the above by dynamic programming
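The global step can be illustrated with the Viterbi recursion, the standard dynamic-programming way to pick the most probable labeling of a sequence under a hidden Markov model. The sketch below is a toy, not the parse used by any gene-finder named here: the two states and every probability in it are made up for illustration.

```python
import math

def log_table(table):
    """Convert a nested dict of probabilities into log probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()}
            for k, row in table.items()}

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path for obs under an HMM given in log space."""
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for x in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][x]
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy two-state model (all numbers are hypothetical).
STATES = ("N", "C")  # N: AT-rich background, C: GC-rich "coding-like"
LOG_START = {"N": math.log(0.5), "C": math.log(0.5)}
LOG_TRANS = log_table({"N": {"N": 0.9, "C": 0.1},
                       "C": {"N": 0.1, "C": 0.9}})
LOG_EMIT = log_table({"N": {"a": 0.4, "t": 0.4, "g": 0.1, "c": 0.1},
                      "C": {"a": 0.1, "t": 0.1, "g": 0.4, "c": 0.4}})
```

With this model a run of `g` after a run of `a` is labeled as a switch from N to C, while a single isolated `g` is not worth the two transitions it would cost.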
10. Symbolic gene structure: ()?(.)(.)(.)?()
Labels, left to right: transcription start, translation start, translation end, transcription end
- Signals
- transcription start (downstream of promoters)
- transcription end (upstream of poly-A)
- ? translation start (atg, 1/64 in a random sequence)
- ? translation end (tag, tga, taa; 3/64)
- ( splicing donor site (minimal signal gt, 1/16)
- ) splicing acceptor site (ag, 1/16)
- branching point (very weak; a)
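The odds quoted for the minimal signals follow from treating the four bases as equally likely and independent: a given k-letter word starts at a random position with probability 1/4^k. A small Python sketch of that arithmetic, with a naive counter for comparison against a real sequence; the dictionary keys and function names are illustrative.

```python
from fractions import Fraction

# Minimal signatures of the signals, written in lowercase DNA.
SIGNALS = {
    "translation start": ["atg"],
    "translation end": ["tag", "tga", "taa"],
    "splice donor": ["gt"],
    "splice acceptor": ["ag"],
}

def chance_per_position(words):
    """Probability that a random position starts one of the words,
    assuming the four bases are equally likely and independent."""
    return sum(Fraction(1, 4 ** len(w)) for w in words)

def count_signal(seq, words):
    """Count positions in seq where any of the words begins."""
    seq = seq.lower()
    return sum(seq.startswith(w, i) for i in range(len(seq)) for w in words)
```

Such short words occur by chance every few dozen bases, which is why the local signals alone cannot locate a gene and have to be combined with sequence models.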
11. Symbolic gene structure: ()?(.)(.)(.)?()
Labels, left to right: transcription start, translation start, translation end, transcription end
- ?( First exon
- )( Internal exon
- )? Last exon
- ( Non-coding 5' exon
- )? Non-coding 5' exon
- (.) Intron
- ?( Non-coding 3' exon (rare)
- ) Non-coding 3' exon (rare)
- Intergenic region
12. Signal and Sequence Models
- eiid: equal-probability, independently and identically distributed
- niid: non-equal-probability, independently and identically distributed
- WWM: windowed weight matrix, etc.
- MMn: Markov chain model of order n; homogeneous and period-3 MM5 are used in many gene-finders
- Consensus sequence
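An order-n Markov chain scores each base given the n bases before it. The sketch below trains a homogeneous model from example sequences and compares two models by log-odds; the add-one pseudocount, the four-letter alphabet and the function names are assumptions of this illustration. A period-3 variant would keep three such tables, one per codon position.

```python
import math
from collections import defaultdict

def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). Unseen contexts fall back
    to the pseudocount, i.e. to a uniform distribution over a, c, g, t.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.lower()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, model_a, model_b, order):
    """Sum of log(P_a / P_b) over the sequence; positive favors model_a."""
    s = seq.lower()
    return sum(
        math.log(model_a(s[i - order:i], s[i]) / model_b(s[i - order:i], s[i]))
        for i in range(order, len(s))
    )
```

In a gene-finder, `model_a` would typically be trained on coding sequence and `model_b` on non-coding sequence, with `order=5` for an MM5.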
13. Consensus Sequences
- TATAAT (Pribnow or -10 box)
- T80 A95 T45 A60 A50 T96 (subscripts: percent occurrence of each base)
- TTGACA (-35 box)
- T82 T84 G78 A65 C54 A45
- CAAT (CAAT or -75 box)
- GGYCAATCT
- TATA (TATA or Goldberg-Hogness box)
- TATAWAW
- ATG (translation start point)
- However, in Aful: ATG 76%, GTG 22%, TTG 2%
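Consensus sequences such as GGYCAATCT and TATAWAW use IUPAC ambiguity codes (Y = C or T, W = A or T). One simple way to search for them, sketched here in Python, is to expand each code into a regular-expression character class; the helper name is illustrative, and exact consensus matching ignores the per-position frequencies that a weight matrix would use.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_consensus(seq, consensus):
    """Return 0-based start positions of all matches of an IUPAC
    consensus in seq, including overlapping ones."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    # A zero-width lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

For instance, `find_consensus("ccTATATATgg", "TATAWAW")` reports a single hit at position 2.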
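15. GT-AG Rule for Intron
- 5' splicing donor site: exon ... A64 G73 | G100 T100 A62 A68 G84 T63 ... intron
- 3' splicing acceptor site: intron ... 12Py N C65 A100 G100 | N ... exon
- Subscripts are percent occurrence; | marks the exon-intron boundary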
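Taken literally, the GT-AG rule says only that an intron starts with gt and ends with ag. The Python sketch below enumerates every span that satisfies this minimal rule; `min_len` and `max_len` are hypothetical parameters of the illustration, and a real finder would score the flanking consensus positions rather than accept every pair.

```python
def candidate_introns(seq, min_len=4, max_len=None):
    """List (start, end) half-open spans that begin with gt and end with ag.

    min_len and max_len bound the span length in bases; both are
    illustrative knobs, not values taken from any gene-finder.
    """
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "ag"]
    spans = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d  # the span includes the closing ag
            if length < min_len:
                continue
            if max_len is not None and length > max_len:
                continue
            spans.append((d, a + 2))
    return spans
```

The number of candidates grows with the product of donor and acceptor counts, which is one reason the choice among them is left to the global dynamic-programming step.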
17. Exon and intron size distribution
18. Algorithms
- Sequence models and scores for signals
- Dynamic programming: optimal parse
- Hidden Markov Model: geometric distribution of intron lengths
- Semi-Hidden Markov Model: needs sequence-generating models and a length probability for each node
- Language theory approach
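The geometric length distribution is a direct consequence of the HMM self-loop: if a state is kept with probability p at each step, it lasts exactly k steps with probability p^(k-1) (1 - p), and its mean duration is 1 / (1 - p). A short Python sketch of these formulas; the function names are illustrative.

```python
def geometric_length_pmf(stay_prob, length):
    """P(a state lasts exactly `length` steps) for an HMM state whose
    self-transition probability is stay_prob."""
    if length < 1:
        return 0.0
    return stay_prob ** (length - 1) * (1.0 - stay_prob)

def mean_length(stay_prob):
    """Expected duration of the state, in steps."""
    return 1.0 / (1.0 - stay_prob)

def stay_prob_for_mean(mean):
    """Self-transition probability that gives the requested mean duration."""
    return 1.0 - 1.0 / mean
```

Whatever the value of p, the most probable length is always 1 and the probability only decays from there. A semi-HMM avoids this constraint by attaching an explicit length probability to each node.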
19. Flow Chart of GenScan
- Chris Burge (1996): a 27-state semi-HMM
- A simpler model: 19-state
- A model taking UTR introns into account: 35-state
20. [Figure: state diagram of the gene model]
- Legend: N, intergenic region; P, promoter; F, 5'-UTR; single-exon gene; initial exon; phase-k internal exon; terminal exon; T, 3'-UTR; A, polyadenylation signal; phase-k intron. States are shown for each strand.
21. Problems: Minor and Major
- Ambiguity symbols (N, W, S, R, ...)
- (1-p) at flanking D-type nodes
- Indels and frame-shifts
- Gradient effects in gene structure
- Introns in 5'-UTRs and 3'-UTRs, leading to 35-state Markov models
- Alternative splicing and sub-optimal paths
- Limits of probabilistic models
- Deterministic approaches
22. Dyck language: a language of nested parentheses
- Many types of parentheses
- Finite depth of nesting
- Context-free language
- Our case
- Only 3 types of parentheses
- Shallow nesting
- Conjecture: may be a regular language
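A membership test for a Dyck language with three bracket types and a bounded nesting depth can be sketched in a few lines of Python. The bracket characters below are stand-ins chosen for this illustration, not the symbols used on the slides, and any other character (for example `.` for sequence content) is ignored.

```python
# Closing bracket -> matching opening bracket (three types).
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def is_dyck(word, max_depth=None):
    """True if the brackets in word are properly nested and, when
    max_depth is given, never nested deeper than max_depth."""
    stack = []
    for ch in word:
        if ch in OPENERS:
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        # Any other symbol is treated as content and skipped.
    return not stack
```

When the depth is bounded, the stack can take only finitely many configurations, so a finite automaton can do the same job. This is consistent with the conjecture that the shallow, three-bracket case is a regular language.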
23. Two Test Datasets for Rice Gene-Finders
- The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301, 18 July 2003)
- Select a high-quality subset without overlaps with publicly available cDNAs
- A single-gene set: 500 sequences with one gene in each
- A multi-gene set: 46 sequences with 199 genes in total (at least 4 genes in a sequence)
24. Assessment of Gene-Finders
- Tests done between 22 July and 2 August 2003
- FgeneSH (trained on monocotyledons)
- GeneMark.hmm
- RiceHMM
- GlimmerR
- GenScan (trained on maize)
- BGF
25. Our Ultimate Goal
- An iterative, self-training, self-improving gene-finder for any species, starting from a small number of reads, with or without EST and cDNA support
- Annotation and re-annotation of the rice genomes
- Plant comparative genomics, especially that of Gramineae and Crucifers
26. tRNA features
- tRNA gene -> pre-tRNA -> mature tRNA
- Mature tRNA: 75-95 bases
- Cloverleaf-like structure
- Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm
27. How many tRNA genes are present in an organism?
- Codon <-> tRNA <-> amino acid
- 61 coding codons
- 20 amino acids
- Are there 61 species of tRNA with all possible anticodons?
- Met (M) has one codon but two tRNAs
28. Wobble hypothesis (Crick, 1966)
- Many tRNAs recognize more than one codon
- Through non-Watson-Crick base pairings
- Fewer than 61 tRNAs are needed
29. The Modified Wobble Hypothesis (Guthrie and Abelson, 1982)
- In eukaryotes, 46 different tRNA species would be enough.
- The modified wobble hypothesis holds almost perfectly in H. sapiens, S. cerevisiae, A. thaliana and C. elegans, whose complete collections of tRNAs are now known.
30. tRNA copies in Arabidopsis, C. elegans, and Human
[Table: tRNA copy numbers laid out as a codon table; amino-acid labels F, C, Y, S, W, L, H, R, P, Q, I, N, S, T, K, R, M, D, V, A, G, E]
31. tRNA Genes in the Rice Genome (found by tRNAscan-SE and BLASTN)
32. Chloroplast tRNA genes in ssp. indica and japonica
- 33 tRNA genes were found in each of the indica and japonica chloroplast genomes.
- They are completely identical; no mutation is found (E. C. Kemmerer and Ray Wu found two tRNA genes perfectly conserved).
- It is remarkable that, in spite of more than 9000 years of separation, no mutation could be observed in the chloroplast tRNA genes of the two subspecies.