Predicting Genes in Eukaryotic Genomes By Computer - PowerPoint PPT Presentation

Title: Predicting Genes in Eukaryotic Genomes By Computer


1
Predicting Genes in Eukaryotic Genomes By Computer
  • Hao Bailin
  • T-Life Research Center, Fudan University
  • Beijing Genomics Institute, Academia Sinica
  • Institute of Theoretical Physics, Academia Sinica
  • (www.itp.ac.cn/hao/)

2
The Central Dogma of Molecular Biology
  • DNA → DNA (replication)
  • DNA → mRNA (transcription); mRNA → cDNA (reverse transcription)
  • mRNA → Protein/Enzyme (translation)
  • Protein → Structure (folding)
  • Structure → Function (interaction)

6
DNA
  • Written with 4 letters (a, c, g, t)
  • [The remaining slide text was in Chinese and is lost to encoding damage]

7
Large-Scale DNA Sequencing Since 1977
  • Sanger method: polymerization stopping
  • Maxam-Gilbert: chemical degradation
  • Each reaction: 500-600 bp (a single read)
  • Clone by clone vs. whole-genome shotgun
  • Sequence assembling: reads → contigs → scaffolds
    → superscaffolds
  • Automatic sequencer: MegaBACE, 96 or 384 channels

8
Letter production at BGI (Beijing, Hangzhou)
Daily: 5 x 10^7; yearly: 10^10
9
Organisms with Sequenced Genomes
  • Budding yeast (Saccharomyces cerevisiae)
  • Fission yeast (Schizosaccharomyces pombe)
  • Nematode (Caenorhabditis elegans)
  • Fruit fly (Drosophila melanogaster)
  • Malaria parasite (Plasmodium falciparum)
  • Malaria mosquito (Anopheles gambiae)
  • Human (Homo sapiens), chimpanzee (Pan troglodytes)
  • Mouse (Mus musculus), rat (Rattus norvegicus)
  • Dog (Canis familiaris), chicken (Gallus gallus), pig (Sus
    scrofa)
  • Pufferfish (Fugu rubripes)
  • Silkworm (Bombyx mori), honeybee (Apis mellifera)
  • Thale cress (Arabidopsis thaliana), rice (Oryza sativa)
  • Maize (Zea mays)

10
cccaatatcttgcttcagcaagatattgggtatttctagctttcctttct
tcaaaaattgctatatgttagcagaaaagccttatccattaagagatgga
acttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtg
cattacttccataccaagattagcacggttgatgatatcagcccaagtat
taataacgcgaccttggctatcaactacagattggttgaaattgaatccg
tttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccc
tactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaa
aactagcatattggaagattaatcggccaaaataaccatgagcggccaca
atattataagtttcttcctcttgaccaaatctgtaaccctcattagcaga
ttcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccat
gcatagcactgaatagggaaccgccgaatacaccagctacacctaacatg
tgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaa
gttgaaagtaccagatattcctaaaggcataccatcagagaaacttcctt
gaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagct
gaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttc
ccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaa
ttagctcataaggaccaccattgtataaccactcatcaacagatgcagct
tcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataat
ggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggct
cacgaataccatcaatatctactggaggggcagcgatgaaggcgataata
aatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaacca
tccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgac
cccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaaga
tcttggtttattcaaattgcaaggactcccaagcacacgtattaactaga
aagataatagaaggcttgttatttaacagtataatatagactatatacca
atgtcaaccaagccagccccgacagttgtatatccatacaacaaaattta
ccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcaga
ttgctcctttctagtttccatatgggttgcccgggactcgaacccggaac
tagtcggatggagtagataattattccttgttacaatagagaaaaaacct
ctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgta
gaaataggctatttctattccgaagaggaagtctactaatttttttagta
gtaagttgattcacttactatttattatagtacagagaacatttcagaat
ggaaactgtgaaagttttaccttgatcatttatcaatcatttctagttta
ttagttttgtttaatgattaattaagaggattcaccagatcattgatacg
gagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaa
agtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgta
aaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtac
cgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagta
tatactttagtcgatacaaagtcttcttttttgaagatccactgtgataa
tgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtac
aaaattgggcttttgctaaagatccaatgagaggagtaacagggactttg
gtatcgaattttttcatttgagtatctattagaaatgaattctccagcat
ttgattccttactaacaaagaatttattggtacacttgaaaagtacccca
gaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggt
tgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacattt
ccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttc
cttgatatcgaacataatgcataaggggatccataacgaaccatatggtt
ttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctaga
aaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaag
aagattgtttacgaagaaacaacaagaaaaattcatattctgatacataa
gagttatataggaaccgaaatagtcttttattttcttttttcaaaataaa
aatggatttcattgaagtaataaaactattccaattcgagtagtagttga
gaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattga
aggagttgaagcaagatatccaaatggataggatagggtatttctatatg
tgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattg
aatgaatagatcgtaaattctgaaactttggtatttctttttcttccgga
caagactgttctcgtagcgagaatgggatttctacaacgatcgcaaaccc
ctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaat
ccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattc
tgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaattt
cttgttattagaaccaataatttcgacaagttcggaaccatttaatccat
aatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacg
aaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttc
catttgtatttctacttgaatcagagagagagaaatatttctcggtttat
caaatggtgatacatagtacaatatggtcagaacagggtgttgcattttt
taatacaaacccctggggaagaaaaggagtctaatccacggatctttttc
cgctccttttctatccaatttgtttatgtttgttctaattacaaaagaga
acaaatcctttatttttgcaggccaattgctcttttgactttgggataca
gtctctttatcaatatactgcttcttttacacattcaatccataacatcc
ttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaa
aatcaaaggtctactcataggaaaaccagcttttccctacatcaggcact
aatctatttttaacgtctaattagatcagggagttcttccaattaagaag
ttaagctcgttgctttttgttttaccagaattggagccaggctctatcca
tttattcattagacccagaaaatcagaatttttttattccattccaaaaa
tccaaaataagaaattgattttattacgacatgctattttttccattcat
tacccttgaggatcagtcgcggtcttatagactctaccaagagtctggac
gaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaa
gccgagtactctaccattgagttagcaacccagataaactaggatcttag
atacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaa
acactcagataccaaaaggaacgggtctggttaaatttcactaaggttaa
aagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttat
ttaaataaataaataaatcttgtatgagagtacaaacaagagggacaacc
ctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggata
aagagacttatccatctacaaattctagatgttcaatggacctttgtcaa
tggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaata
aaggcttatgttggattggcacgacataaatccagtcaaaaataggatta
agaaagaggcaaattatttctaaatagttagacaacaagggatactagtg
agcctctcctagttttttattcatttagttcttcaattaactcaaagttc
tttctttttctttaaagaattccgccttccttaaaatatcagaaacggtt
cttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatt
taaacaagtttgattctttatcggatcataaaaacctacttttcgaagat
ctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagaca
gcttattgggatagatgtagataaataaagccccccctagaaacgtatag
gaggttttctcctcatacggctcgagaatatgacttgcattaatttccgt
acagaaaaaacaaatttcatttatactcatgactcaagttgactaatttt
gattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtct
ctaaactcttttctttgcctcatctcgaacaaattcacttttattcctta
ttccggtccaattctattgttgagacagttgaaaatcgtgtttacttgtt
cgggaatcctttatctttgatttgtgaaatccttgggtttaaacattact
tcgggaattcttattcttttttctttcaaaagagtagcaacatacccttt
tttcttatttccttcgataaagcatttccctcttctatagaaatcgaata
tgagcgattgattctgatagactttaatcaaaagagttttcccatatctt
ccaaaattggactttcttcttattttaaccttttgatttctatattattt
cgatttctatattaagggtagaatgacaaagttggcctaatttattagtt
ttcactaaccctagattctttcccttgataaaaaataaattctgtcctct
cgagctccatcgtgtactatttacttagcttacttacaaacaacccagcg
aaaattcggttcgggacgaatagaacagactatgtcgagccaagagcatt
ttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtc
cttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaa
cataacattcctctaatttcattgcaaagtgttatagggaattgatccaa
tatggatggaatcatgaatagtcattagtttcgttttttgtatactaatt
caaacttgctttgctatctatggagaaatatgaataaaagaaattaagta
tttatcgggaaagactccgcaaagagccaatttatttaaacccatattct
atcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgctt
aagacttatttattatggaatttccatcctcaacagaggactcgagatga
tcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaa
taaactatcaacctcccgtttaattaatttaattaatatattagattagc
aatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaatt
ctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaa
aaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttatt
ttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattct
atctaacgagcagttcttatcttatctttaccgggatggatcattctgga
tatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaa
aaagaagaaggaaccttttttactaataaaatactataaaaaaaatttat
ctctatcataaatctatctctaccataaaggaataggtctcgttttttat
acaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttca
atttgactggacttgacactggattatgttttctgagacagaaaatgaac
gcattaggactgcatcgaatctaagagtttataagagaaaaaaattctct
ttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgt
ttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacggg
accaaaacccgctgccttaccacttggccacgccccatttcgggttttat
gcgacactaataaacagtattatgtttatttcttattcgtcaatcctact
tcaattacataaaaatggggggtattctcttggtaggattctagacatgc
gaataatatagaatccaaaaaatgcattgatcattacatggaattctatt
aagatattatatgaaagtcgaatttcttccactctcatttgagagtgcga
atacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggatttt
gaatcctccttttcctttttcccttagaaaaataactcaatcaaaatcca
attatctactctacaagaacgaaacgcttgttatgcctaatatacttagt
ttaacctgtatttgttttaattctgttatttatccgactagttttttctt
cgccaaattgcccgaagcttatgccattttcaatccaatcgtggatttta
tgcctgtcatacctgtactcttttttctattagcctttgtttggcaagct
gctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatc
atgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtct
tatattatgaaccttcgattctaaaattcaaattcttctacattgaatgt
atagctgcagcaataaatttggatcagcctttctactccctgcatctacg
ttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattga
taagagtgcttattataaatcaattcttgcaatttttttcaaaaattgat
ttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttg
tgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggaga
ttatgtaatgcttactctcaaactttttgtttatacagtagtgatattct
ttgtttccctctttatctttggattcttatctaatgatccaggacgtaat
cctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggattt
gtttcatacatttatctacgagaaaatccgggggtcagaattccttccaa
ttcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaacc
ctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtcc
actcagccatctctccccgttccaaatcgaaaggtttccgtgatatgaca
gaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttctta
atgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaa
aattggatattggctaaaagacaatcagatagattttctcttcagcaggc
atttccatataggacttgttataataaaacaagcaggttatagaaaaaaa
ctcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaacc
aacccaccccataaaattggaaagaaagataaagtaagtggacctgactc
cttgaatgaggcctctatccgctattctgatatataaattcgatgtagat
gaaattgtataagtggatttttttgtatttccttagacttagaccacgca
aggcaagaatttctcgctatttactatttcatattcttgttactagatgt
tctataggaataagaagaaatcgcaacccctttccgctacacataaaaat
ggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaa
tcctatttttgttcttatacccatgcaatagagagcgagtgggaaaaggg
aggttactttttttcattttttccttaaaaaataggctttcttggaaata
ggaatcatggaataatctgaattccaatgtttatttctatagtataagaa
aaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccc
catagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggca
ttgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagt
gatataaggtgctcggaaatggttgaagtaattgaataggaggatcacta
tgactatagcccttggtagagttactaaagaagaaaatgatttatttgat
attatggacgactggttacgaagggaccgttttgtttttgtaggatggtc
tggcctattgctttttccttgtgcttatttcgctttaggaggttggttta
cagggacaacttttgtaacttcttggtatacccatggattggcgagttcc
tatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaa
tagtttagcacactctttgttgctactatggggcccggaagcacaagggg
attttactcgttggtgtcaattaggtggtctgtggacttttgttgctctc
catggggcttttgcactaataggtttcatgttacgtcaatttgaacttgc
tcggtctgttcaattgcggccttataatgcaatttcattctctggcccaa
tcgctgtttttgtttccgtattcctgatttatccactggggcaatccggt
tggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcct
cttcttccaaggatttcataattggacgttgaacccatttcatatgatgg
gagttgccggagtattaggcgcggctctgctatgcgctattcatggggca
accgtgga
12
Proteins: 20 letters (amino acids, AA); lengths of 50 to 6000 AA
ID   A1BG_HUMAN     STANDARD;      PRT;   495 AA.
...
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
...
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//
13
Gene-Finding by Computer
  • Starting from early 1980s
  • Ab initio or de novo algorithms: GeneMark,
    GenScan, FgeneSH, Genie, based on gene-structure
    models and training data. (Our ongoing project:
    BGF, the BGI Gene Finder)
  • Homology-based methods: sequence alignment with
    known genes in databases and comparative genomics
    of not-too-distant species
  • Mixed approaches using both strategies: TwinScan

14
Different Stages of Gene-Finding
  • Use all possible existing programs and services
    on the web with a public-domain or home-made
    genome viewer
  • Write your own gene-finder, trained for the
    specific organism
  • A dream for the time being: design a
    self-training and self-developing program for
    any species, which would improve itself
    iteratively starting from a few available reads,
    cDNAs, and ESTs

15
Performance of Gene-Finders in Eukaryote Genomes
  • M. Q. Zhang, Nature Reviews Genetics, 3 (2002)
    698-710 (mostly for the human genome):
  • Nucleotide level: 80%
  • Exon level: 45%
  • Whole gene structure: 20%
  • FgeneSH and BGF for rice (our tests on 128
    cDNA-confirmed single-gene genomic sequences):
  • Nucleotide level: 90%
  • Exon level: 60%
  • Whole gene structure: 40%

16
[Figure: double-stranded DNA with the 5' and 3' ends of each strand labeled]
  • Each strand carries the same amount of
    information, but different sets of genes.
  • The two strands are equivalent in information
    content.
  • The two strands are not equivalent in gene content.
  • Biological processing (replication,
    transcription) goes from 5' to 3'.
  • Genes can be found on one strand at a time
    (two-pass programs) or on both strands at the
    same time (one-pass programs).
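
A two-pass program scans the given strand and then its reverse complement, so that both are read 5' to 3'. The sketch below illustrates that idea only; the function names are hypothetical and are not taken from any gene-finder named in these slides.

```python
# Hypothetical helper names; a minimal sketch of a two-pass strand scan.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def both_strands(seq: str):
    """Yield (strand label, sequence read 5'->3') for a two-pass scan."""
    yield "+", seq
    yield "-", reverse_complement(seq)

for label, strand in both_strands("atggcctaa"):
    print(label, strand)
```

A one-pass program would instead model both strands in a single scan, which is what the doubled state diagram of GenScan does.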

17
Genomic DNA (5' to 3', with start and stop signals)
  ↓ transcribe (RNA Pol II)
Pre-mRNA
  ↓ splice (spliceosome: U1, U2, U4, U5, U6 snRNPs)
mRNA (5'-UTR ... 3'-UTR)
  ↓ translate (ribosome; initiation, elongation, and termination factors)
AA seq (protein primary seq)
  ↓ fold (chaperonin)
Protein fold
18
Three Scales of Search
  • Local: signals with minimal signature (start,
    stop, splicing); movable signals (caps,
    promoters, polyAs, branching points, some very
    weak) --- clustering, discriminant analysis,
    various statistical models
  • Intermediate: exons, introns, intergenic ---
    Markov, semi-Markov, hidden Markov models;
    intron length distribution
  • Global: optimal combination of the above ---
    dynamic programming

19
[ ( ) < ( . ) ( . ) ( . ) > ( ) ]
Transcription start [, translation start <, translation end >, transcription end ]
  • Signals
  • [ transcription start (downstream of
    promoters)
  • ] transcription end (upstream of poly-A)
  • < translation start (atg, 1/64 in a random
    seq.)
  • > translation end (tag, tga, taa, 3/64)
  • ( splicing donor site (minimal signal gt,
    1/16)
  • ) splicing acceptor site (ag, 1/16)
  • . branching point (very weak a)
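
The 1/64, 3/64, and 1/16 figures follow from treating each base as independent and equally likely. The sketch below checks them against a simulated random sequence; it takes atg as the start codon, and the dictionary layout and function names are illustrative assumptions.

```python
import random

# Minimal signals from the slide; the grouping into a dict is illustrative.
SIGNALS = {
    "translation start": ["atg"],
    "translation end": ["tag", "tga", "taa"],
    "donor": ["gt"],
    "acceptor": ["ag"],
}

def expected_frequency(motifs):
    """Chance that a random position starts one of the motifs (uniform iid bases)."""
    return sum(0.25 ** len(m) for m in motifs)

def observed_frequency(seq, motifs):
    """Fraction of positions in seq at which one of the motifs begins."""
    k = len(motifs[0])
    positions = len(seq) - k + 1
    hits = sum(seq[i:i + k] in motifs for i in range(positions))
    return hits / positions

random.seed(0)
seq = "".join(random.choice("acgt") for _ in range(200_000))
for name, motifs in SIGNALS.items():
    print(f"{name:18s} expected {expected_frequency(motifs):.4f}  "
          f"observed {observed_frequency(seq, motifs):.4f}")
```

Signals this frequent cannot identify genes on their own, which is why they are combined with sequence models and a global parse.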

20
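[ ( ) < ( . ) ( . ) ( . ) > ( ) ]
Transcription start [, translation start <, translation end >, transcription end ]
  • <( First exon (translation start to donor site)
  • )( Internal exon (acceptor site to donor site)
  • )> Last exon (acceptor site to translation end)
  • [( Non-coding 5' exon (transcription start to donor site)
  • )< Non-coding 5' exon (acceptor site to translation start)
  • (.) Intron
  • >( Non-coding 3' exon (rare; translation end to donor site)
  • )] Non-coding 3' exon (rare; acceptor site to transcription end)
  • ][ Intergenic region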

21
Signal and Sequence Models
  • eiid: equal probability, independently and
    identically distributed
  • niid: non-equal probability, independently and
    identically distributed
  • WWM: windowed weight matrix, etc.
  • MMn: Markov chain model of order n; homogeneous
    and period-3 MM5 are used in many gene-finders
  • Consensus sequence
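
As an illustration of the MMn idea, the sketch below trains a homogeneous Markov chain of order n on example sequences and scores a query by log-odds between two such models. It is a minimal sketch, not the model used by any program named here: the add-one smoothing, the function names, and the omission of period-3 structure are all assumptions. With order 0 it reduces to the niid model.

```python
import math
from collections import defaultdict

def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order):
    """Sum of log conditional probabilities along the sequence."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def log_odds(seq, model_a, model_b, order):
    """Positive when seq looks more like model_a's training data."""
    return log_likelihood(seq, model_a, order) - log_likelihood(seq, model_b, order)
```

A period-3 version would keep three such tables, one per codon position, and pick the table from the position modulo 3.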

22
Consensus Sequences
  • TATAAT (Pribnow or -10 box)
  • T80 A95 T45 A60 A50 T96 (percent occurrence at each position)
  • TTGACA (-35 box)
  • T82 T84 G78 A65 C54 A45
  • CAAT (CAAT or -75 box)
  • GGYCAATCT
  • TATA (TATA or Goldberg-Hogness box)
  • TATAWAW
  • ATG (translation start point)
  • However, in Aful (Archaeoglobus fulgidus): ATG 76%,
    GTG 22%, TTG 2%
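
The per-position percentages for the -10 box can be turned into a simple weight matrix. The sketch below does this under a loudly stated assumption: the slide gives only the frequency of the consensus base, so the remaining probability is split equally among the other three bases. The function names and the uniform background are also assumptions.

```python
import math

CONSENSUS = "TATAAT"
PERCENT = [80, 95, 45, 60, 50, 96]  # consensus-base frequencies from the slide

def build_matrix(consensus, percent):
    """One column per position; non-consensus bases share the remainder equally (assumption)."""
    matrix = []
    for base, pct in zip(consensus, percent):
        p = pct / 100
        column = {b: (1 - p) / 3 for b in "ACGT"}
        column[base] = p
        matrix.append(column)
    return matrix

def log_odds_score(window, matrix, background=0.25):
    """Log2 odds of the window under the matrix versus a uniform background."""
    return sum(math.log2(col[b] / background) for b, col in zip(window, matrix))

def best_hit(seq, matrix):
    """Return (score, position) of the best-scoring window in seq."""
    k = len(matrix)
    return max((log_odds_score(seq[i:i + k], matrix), i)
               for i in range(len(seq) - k + 1))

matrix = build_matrix(CONSENSUS, PERCENT)
print(best_hit("GGCGTATAATGCGC", matrix))
```

Unlike a plain consensus match, the score degrades gradually: a mismatch at the weakly conserved third position costs much less than one at the second or sixth.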

24
GT-AG Rule for Intron
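  • 5' splicing (donor) site:
    exon ... A64 G73 | G100 T100 A62 A68 G84 T63 ...
  • 3' splicing (acceptor) site:
    ... 12 Py N C65 A100 G100 | N ... exon
  • Numbers give the percent occurrence of each base; | marks the exon-intron boundary.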
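
Taken alone, the GT-AG rule only proposes candidates. The sketch below lists every span that starts with gt and ends with ag within a length window; the function name and the default length bounds are illustrative assumptions, not values from the slides.

```python
def candidate_introns(seq, min_len=20, max_len=2000):
    """List (start, end) pairs such that seq[start:end] begins with gt and ends with ag."""
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptor_ends = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "ag"]
    return [(d, a) for d in donors for a in acceptor_ends
            if min_len <= a - d <= max_len]

example = "atgaag" + "gt" + "c" * 20 + "ag" + "gcctaa"
print(candidate_introns(example, min_len=20, max_len=30))
```

On real genomic sequence this list is very large, so candidates are ranked with the position-specific frequencies around each site and with a length distribution.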

26
Exon and intron size distribution
27
Algorithms
  • Sequence models and scores for signals
  • Dynamic programming: optimal parse
  • Hidden Markov Model: geometric distribution of
    intron lengths
  • Semi-Hidden Markov Model: needs
    sequence-generating models and a length probability
    for each node
  • Language theory approach
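
The optimal parse of a hidden Markov model is found by dynamic programming (the Viterbi algorithm). The sketch below uses a toy two-state model; every probability in it is an illustrative assumption rather than a trained value, and real gene-finders use many more states. It also shows why a plain HMM implies geometric lengths: a state with self-transition probability p is left after 1/(1-p) steps on average.

```python
import math

STATES = ("exon", "intron")
# Illustrative parameters only, not trained values.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"a": 0.2, "c": 0.3, "g": 0.3, "t": 0.2},
        "intron": {"a": 0.35, "c": 0.15, "g": 0.15, "t": 0.35}}

def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def mean_length(stay_probability):
    """Mean of the geometric length distribution implied by a self-transition."""
    return 1.0 / (1.0 - stay_probability)

print(viterbi("gcgcgcgcgc" + "atatatatat"))
```

A semi-Markov model replaces the self-transition with an explicit length distribution per state, which matters because observed exon and intron lengths are not geometric.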

28
Flow Chart of GenScan
  • Chris Burge (1996): a 27-state semi-HMM
  • A simpler model: 19-state
  • A model taking UTR introns into account:
    35-state

29
  • Figure: N, intergenic region; P, promoter; F,
    5' UTR; Esngl, single-exon gene; Einit, initial
    exon; Ek, phase k internal exon; Eterm, terminal
    exon; T, 3' UTR; A, polyadenylation signal; and
    Ik, phase k intron. (+) and (-) denote the strand.

30
Problems Minor and Major
  • Ambiguity symbols (N, W, S, R, ...)
  • (1-p) at flanking D-type nodes
  • Indels and frame-shifts
  • Gradient effects in gene structure
  • Introns in 5'-UTRs and 3'-UTRs, leading to
    35-state Markov models
  • Alternative splicing and sub-optimal paths
  • Limit of probabilistic models
  • Deterministic approaches

31
Dyck language: a language of nested parentheses
  • Many types of parentheses
  • Finite depth of nesting
  • Context-free language
  • Our case:
  • Only 3 types of parentheses
  • Shallow nesting
  • Conjecture: may be a regular language
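
The link between shallow nesting and regularity can be made concrete. A checker for nested brackets needs a stack, but once the nesting depth is bounded the stack has only finitely many possible contents, so the checker is a finite automaton. The sketch below is illustrative: the three bracket glyphs and the function name are assumptions, and characters that are not brackets are ignored.

```python
# Three bracket types; the glyphs are illustrative stand-ins for the gene signals.
PAIRS = {")": "(", ">": "<", "]": "["}

def is_dyck(word, max_depth=None):
    """Check proper nesting; optionally reject nesting deeper than max_depth."""
    stack = []
    for ch in word:
        if ch in PAIRS.values():
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack

print(is_dyck("[()<(.)(.)(.)>()]", max_depth=3))
```

With three bracket types and depth at most 3, the stack can hold at most 1 + 3 + 9 + 27 = 40 distinct contents, so 40 states suffice.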

32
Two Subspecies of Rice
  • Oryza sativa ssp. indica (xian rice)
  • Oryza sativa ssp. japonica (geng rice)
  • The difference was described in Xu Shen's
    Shuowen Jiezi, a Chinese dictionary of the
    Eastern Han Dynasty (2nd century AD)
  • J. H. Zhang et al., Rice cultivation at the Jiahu
    site in Henan Province, Science J.
    (Kexue), 53(4), 2002, 3 (in Chinese)

33
Two Test Datasets for Rice Gene-Finders
  • The 28469 japonica full-length cDNAs (Kikuchi et
    al., Science 301 (18 July 2003))
  • Select a high-quality subset without overlaps
    with publicly available cDNAs
  • A single-gene set: 500 sequences with one gene in
    each
  • A multi-gene set: 46 sequences with 199 genes in
    total (at least 4 genes in a sequence)

34
Assessment of Gene-Finders
  • Tests done between 22 July and 2 August 2003
  • FgeneSH (trained on monocotyledons)
  • GeneMark.hmm
  • RiceHMM
  • GlimmerR
  • GenScan (trained on maize)
  • BGF (rise.genomics.org.cn/bgf/)

35
Our Ultimate Goal
  • An iterative, self-training, self-improving
    gene-finder for any species, starting from a
    small number of reads with or without EST or cDNA
    support
  • Annotation and re-annotation of the rice genomes
  • Plant comparative genomics, especially that of
    Gramineae and Cruciferae

36
tRNA features
  • tRNA gene → pre-tRNA → mature tRNA
  • Mature tRNA: 75-95 bases
  • Cloverleaf-like structure
  • Five arms: acceptor arm, D arm, anticodon arm, V
    loop (extra arm), TΨC arm

37
How many tRNA genes are present in an
organism?
  • Codon ↔ tRNA ↔ amino acid
  • 61 coding codons
  • 20 amino acids
  • Are there 61 species of tRNA with all possible
    anticodons?
  • Met (M) has one codon but two tRNAs

38
Wobble hypothesis (Crick, 1966)
  • Many tRNAs recognize more than one codon
  • Through non-Watson-Crick base pairings
  • Fewer than 61 tRNAs are needed

39
The Modified Wobble Hypothesis (Guthrie & Abelson,
1982)
  • In eukaryotes, 46 different tRNA species would be
    enough.
  • The modified wobble hypothesis holds almost
    perfectly in H. sapiens, S. cerevisiae, A.
    thaliana, and C. elegans, whose complete
    collections of tRNAs are now known.

40
tRNA copies in Arabidopsis, C. elegans, and Human
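[Figure: genetic code table, one cell per amino acid (one-letter codes), with tRNA gene copy numbers for each species; the numbers are not preserved in this transcript]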
41
tRNA Genes in the Rice Genome (Found by
tRNAscan-SE and BLASTN)
42
Chloroplast tRNA genes in ssp. indica and japonica
  • 33 tRNA genes were found in each of the indica
    and japonica chloroplast genomes.
  • They are completely identical; no mutation was
    found (E. C. Kemmerer and Ray Wu found two tRNA
    genes perfectly conserved).
  • It is remarkable that, in spite of more than 9000
    years of separation, no mutation could be observed
    in the chloroplast tRNA genes of the two ssp.

43
The End
  • Thank you!

52
Some Informatics Work Related to the Rice (Oryza
sativa L. ssp. indica) Draft Genome
  • HAO Bailin
  • Beijing Genomics Institute (BGI)
  • Institute of Theoretical Physics (ITP)
  • T-Life Research Center, Fudan University
  • http://www.itp.ac.cn/hao/

53
Informatics Problems
  • Collection and quality control of data
  • Assembling of reads, dealing with repeats
  • Gene-finding and annotation
  • RNA genes
  • Protein-coding genes
  • Prediction of structure and function
  • Connection to gene expression data

54
The Central Dogma of Molecular Biology
  • DNA → DNA (replication)
  • DNA → mRNA (transcription); mRNA → cDNA (reverse transcription)
  • mRNA → Protein/Enzyme (translation)
  • Protein → Structure (folding)
  • Structure → Function (interaction)

55
Genetic Material
  • DNA: linear or circular
  • Chromosome: DNA + histones
  • Mitochondria
  • Chloroplast
  • Plasmids: linear or circular

56
Two Kinds of Tasks
  • Developing a new method of gene-finding: a more
    or less academic job
  • Finding genes in a given genomic sequence: a
    practical job

57
The Transfer RNA Genes in the Rice (Oryza sativa ssp.
indica) Collection of Contigs
  • WANG Xiyin, SHI Xiaoli
  • (Peking U and BGI)
  • HAO Bailin
  • (BGI, Fudan University, and ITP)

58
tRNA function
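  • tRNAs are the actual translators from mRNA to
    amino acids in proteins.
  • Bridge between the RNA world and the protein world
  • Naming convention: trn + one-letter amino acid
    code + anticodon, e.g.
  • trnQ-UUG (tRNA gene for Gln with anticodon UUG)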
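
A name such as trnQ-UUG can be unpacked mechanically. The sketch below splits it into the amino acid letter and the anticodon, then derives the codon that pairs with the anticodon by strict Watson-Crick rules (both written 5' to 3'). The function name and the strict name pattern are assumptions for illustration.

```python
import re

RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def parse_trn_name(name):
    """Split a name like 'trnQ-UUG' into (amino acid, anticodon, Watson-Crick codon)."""
    match = re.fullmatch(r"trn([A-Z])-([ACGU]{3})", name)
    if match is None:
        raise ValueError(f"unrecognized tRNA gene name: {name}")
    amino_acid, anticodon = match.groups()
    # Codon and anticodon pair antiparallel, so complement and then reverse.
    codon = anticodon.translate(RNA_COMPLEMENT)[::-1]
    return amino_acid, anticodon, codon

for gene in ("trnQ-UUG", "trnM-CAU", "trnY-GUA"):
    print(gene, parse_trn_name(gene))
```

Wobble pairing lets the same anticodon read further codons beyond the strict Watson-Crick partner returned here.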

59
tRNA features
  • tRNA gene → pre-tRNA → mature tRNA
  • Mature tRNA: 75-95 bases
  • Cloverleaf-like structure
  • Five arms: acceptor arm, D arm, anticodon arm, V
    loop (extra arm), TΨC arm

60
tRNA structure
mRNA
61
How many tRNA genes are present in an
organism?
  • Codon ↔ tRNA ↔ amino acid
  • 61 coding codons
  • 20 amino acids
  • Are there 61 species of tRNA with all possible
    anticodons?
  • Met (M) has one codon but two tRNAs

62
The Wobble Hypothesis
  • The Wobble Hypothesis (Crick, 1966)
  • The Modified Wobble Hypothesis (1982): 46 tRNA
    species would be enough
  • What has been found in yeast, worm, and human

63
Wobble hypothesis (Crick, 1966)
  • Many tRNAs recognize more than one codon
  • Through non-Watson-Crick base pairings
  • Fewer than 61 tRNAs are needed

64
Wobble rules by Crick
Codon (base 3)    Anticodon (base 1)
U                 A, G, I
C                 G, I
A                 U, I
G                 C, U
  • A → I (inosine)
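
Crick's table can be read in the other direction to ask which codons a given anticodon reads. The sketch below inverts the table exactly as it appears on this slide and applies it to anticodons written 5' to 3'; the function names are illustrative, and inosine is only handled at the wobble position (anticodon base 1).

```python
# Crick's table as given on the slide: codon base 3 -> allowed anticodon base 1.
CODON3_TO_ANTICODON1 = {"U": {"A", "G", "I"}, "C": {"G", "I"},
                        "A": {"U", "I"}, "G": {"C", "U"}}
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

def readable_codon_bases(table):
    """Invert the table: anticodon base 1 -> codon third bases it can read."""
    inverse = {}
    for codon_base, anticodon_bases in table.items():
        for a in anticodon_bases:
            inverse.setdefault(a, set()).add(codon_base)
    return inverse

def codons_read(anticodon, table=CODON3_TO_ANTICODON1):
    """Codons (5'->3') read by an anticodon (5'->3') under the wobble table."""
    first = WATSON_CRICK[anticodon[2]]   # codon base 1 pairs with anticodon base 3
    second = WATSON_CRICK[anticodon[1]]  # codon base 2 pairs with anticodon base 2
    thirds = readable_codon_bases(table)[anticodon[0]]
    return sorted(first + second + t for t in thirds)

print(codons_read("GUA"))  # anticodon of trnY-GUA
print(codons_read("UUG"))  # anticodon of trnQ-UUG
```

Passing a different table, such as the revised two-codon-box rules, changes the answer without changing the code.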

65
The Modified Wobble Hypothesis
  • In eukaryotes, 46 different tRNA species would be
    enough.
  • The revised wobble hypothesis is almost perfectly
    obeyed by H. sapiens, S. cerevisiae, A. thaliana,
    and C. elegans, whose complete collections of
    tRNAs are now known.

66
Revised wobble hypothesis in eukaryotes
(Guthrie & Abelson, 1982)
Two-codon boxes
Codon (base 3)    Anticodon (base 1)
U                 G, I
C                 G, I
A                 U
G                 C
  • Four-codon boxes

67
Modified wobble hypothesis
in eukaryotes
  • In two-codon boxes:
    Codon base 3: U, C    Anticodon base 1: G
  • In four-codon boxes:
    Codon base 3: U, C    Anticodon base 1: A (I)
  • One exceptional four-codon box, for Gly:
    Codon base 3: U, C    Anticodon base 1: G
68
Human codon usage and tRNA genes

69
tRNA copies in Arabidopsis, C. elegans, and Human
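[Figure: genetic code table, one cell per amino acid (one-letter codes), with tRNA gene copy numbers for each species; the numbers are not preserved in this transcript]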
70
Distribution of tRNA genes in a genome
  • A kind of repeats
  • Usually clustered together
  • Distributed unevenly among chromosomes
  • For example, in the human genome, 140 tRNA genes,
    making up 25% of the total, form a cluster in
    a narrow region of only 4 Mbp on chr. 6

71
BGI Rice Contigs
  • 127,550 contigs of total length 361 Mb (from the
    estimated genome of 466 Mb)
  • N50 size: 6690 bp
  • It makes sense to look for tRNAs: their
    length is around 75-95 bp, so it is possible to
    catch most of them.
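
N50 is the contig length at which, going from the longest contig downwards, half of the total assembled length has been covered. The sketch below computes it from a list of contig lengths; the function name is illustrative.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

print(n50([2, 3, 4, 5, 6]))
```

An N50 of 6690 bp is far longer than a 75-95 bp tRNA gene, which is why most tRNA genes are expected to lie entirely within a contig.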

72
How many tRNA genes are there in the rice genome?
  • Is revised wobble hypothesis obeyed ?
  • Are there 46 species of tRNA genes ?
  • How many copies for each tRNA?

73
With the help of the tRNAscan-SE program and
BLASTN, a collection of tRNA genes was obtained
  • 592 canonical tRNA genes
  • 3 possible selenocysteine tRNA genes
  • 1 possible suppressor tRNA gene
  • 27 possible pseudo-tRNA-genes

74
592 canonical tRNA genes
  • BLASTN-confirmed tRNA: 467
  • Probable novel tRNA: 74
  • Putative novel tRNA: 51
  • "Novel" means more adapted to rice

75
27 pseudo-tRNA genes
  • Genomic sequences structurally related to tRNA
  • Unable to yield active gene products
  • May have insertions, deletions
  • May lack functional promoters
  • Experiments needed to test if they are really
    functionally inactive
  • Divided into four classes
  • End-truncated type
  • Insertion-disrupted
  • Non-maintained
  • Non-tRNA but pol III-like elements

76
Rice codons and tRNA genes

77
The wobble hypothesis is perfectly obeyed by the rice
genome!
  • 45 species of tRNA genes found so far.

78
On the absence of trnT-CGU gene
  • In fact, six possible trnT-CGU genes were found
    but discarded for low similarity to known tRNA
    genes.
  • The data are incomplete: only 361 Mb in
    contigs.
  • tRNA genes tend to cluster together in
    a genome.
  • Almost sure to be found (3 trnT-CGU genes were
    found in japonica).
  • Rice is not an exception to the wobble hypothesis.

79
36 tRNA genes have an intron
  • 6% of the total
  • All have only one intron
  • Intron length: 12-20 bp generally
  • 38 bp the longest
  • All trnY-GUA genes have an intron
  • All non-initiator trnM-CAU genes have an intron,
    while all initiator trnM-CAU genes have
    no intron

80
One possible suppressor tRNA gene
  • A suppressor tRNA is a mutant tRNA that recognizes
    a stop codon (UAA/UAG) instead of the codon for
    its cognate amino acid, sometimes, but not always,
    because of a base substitution in the anticodon.
  • Here, it recognizes the stop codon UAA.

81
3 possible selenocysteine tRNA genes
  • Selenocysteine is the 21st amino acid, found in
    every domain of life on Earth.
  • While there are many more amino acids than the
    twenty that are part of the standard genetic
    code, only selenocysteine and pyrrolysine have
    been discovered to be genetically encoded.
  • Selenocysteine is encoded by the UGA codon,
    the opal (umber) termination codon.
  • Prokaryotes and eukaryotes use different
    mechanisms to tell the translation machinery
    whether to continue or terminate translation
    at UGA.
  • AIDS patients are found to contain several low
    molecular mass selenium compounds, which are
    thought to be selenoproteins encoded by the HIV
    genome.

82
46 chloroplast and 10 mitochondrial tRNA genes
found
  • Due to sequencing contamination
  • Some chloroplast tRNA genes must be identical
    copies
  • There are about 33 tRNA genes in rice chloroplast
    genome as predicted by tRNAscan-SE

83
Codon Bias in Rice Genome
  • Codon bias exists in the rice genome.
  • Codon bias in rice resembles that in human;
    however, XCG-form codons are used less in the
    human genome.
  • Codons ending with G or C are preferred to those
    ending with A or U, respectively.
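
A preference for codons ending in G or C can be measured directly from coding sequences. The sketch below counts codons in an in-frame sequence and reports the fraction whose third base is G or C; the function names are illustrative, and the input is assumed to start in frame.

```python
from collections import Counter

def codon_usage(cds):
    """Count codons in an in-frame coding sequence (DNA or RNA letters)."""
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def gc3_fraction(usage):
    """Fraction of codons whose third base is G or C."""
    total = sum(usage.values())
    gc3 = sum(n for codon, n in usage.items() if codon[2] in "GC")
    return gc3 / total

usage = codon_usage("ATGGCCGCGAAATAA")
print(usage, gc3_fraction(usage))
```

Summing such counts over many genes gives the codon frequencies that can be compared with tRNA gene numbers.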


84
A roughly positive correlation between codon
usage and the corresponding tRNA gene number
85
tRNA genes are dispersed in the whole genome
  • Based on the public data of japonica, Chr. 10,
    Chr. 7, Chr. 6, and Chr. 3 may have many more tRNA
    genes than the other chromosomes.
  • Many of them may form a few clusters.
  • In fact, many tRNA genes are repeats; for
    example, 8 almost identical trnQ-UUG genes
    are found on one contig of indica.
  • Many tRNA genes are identical in sequence.
    They may be repeated copies of genes or may be
    caused by assembly errors.

86
tRNA genes in eukaryotes
Species          tRNA genes  Genome size (Mbp)  tRNA genes per Mbp (genome)  CDS size (Mbp)  tRNA genes per Mbp (CDS)
S. cerevisiae    273         12                 22.75                        8.45            32
S. pombe         174         14                 12.48                        6.9             25
C. elegans       584         100                5.84                         26.1            22
A. thaliana      620         125                4.96                         33.5            18
D. melanogaster  284         180                1.58                         24.1            12
O. sativa        596         464                1.48                         --              --
H. sapiens       648         3400               0.19                         58.5(?)         11(?)
87
Chloroplast genome of Oryza sativa ssp. indica
and japonica
  • Almost the same genome size
  • indica: 134,559 bp (2001 data)
  • japonica: 134,525 bp (1989 data, CHOSXX, X15901)
  • Elizabeth C. Kemmerer and Ray Wu (2001):
  • Very few differences between the sequences of 11
    chloroplast genes from indica and japonica,
    including 2 tRNA genes.
  • The coding regions and flanking regions up to 100
    bp are highly conserved.
  • More differences in intron regions than in coding
    regions.

88
Chloroplast tRNA genes in ssp. indica and japonica
  • 33 tRNA genes were found in each of the indica
    and japonica chloroplast genomes.
  • They are completely identical; no mutation was
    found (E. C. Kemmerer and Ray Wu found two tRNA
    genes perfectly conserved).
  • It is remarkable that, in spite of more than 7000
    years of separation, no mutation could be observed
    in the chloroplast tRNA genes of the two ssp.

89
References
  • 1. J. Yu et al., A draft sequence of the rice
    genome (Oryza sativa L. ssp. indica). Science
    296, 79 (2002).
  • 2. S. A. Goff et al., A draft sequence of the
    rice genome (Oryza sativa L. ssp. japonica).
    Science 296, 92 (2002).
  • 3. F. Crick, Codon-anticodon pairing: the
    wobble hypothesis. J. Mol. Biol. 19,
    548-555 (1966).
  • 4. C. Guthrie and J. Abelson, Organization and
    expression of tRNA genes in Saccharomyces
    cerevisiae. In The Molecular Biology of the
    Yeast Saccharomyces: Metabolism and Gene
    Expression (ed. J. Strathern et al.), Cold
    Spring Harbor Laboratory, Cold Spring Harbor, New
    York, pp. 487-528 (1982).
  • 5. International Human Genome Sequencing
    Consortium, Nature 409, 860 (2001).
  • 6. http://rna.wustl.edu/GtRDB/ S. Eddy et al.

90
Transcriptional or functional efficiency of tRNA
genes
Codon class                  Codon frequency  tRNA gene number  Codon frequency per tRNA gene
Codons ending with U and C   5332             277               19.25
Codons ending with A         1678             132               12.67
Codons ending with G         2978             183               16.26
91
tRNA gene sub-species studied

92
We have to point out
  • The collection of tRNA genes we obtained may be
    redundant
  • Novel tRNA genes may be found
  • Experimental work is needed to prove whether they
    are genuine tRNA genes or not
  • As the scientists sequencing the human genome
    pointed out, looking for novel ncRNA genes would
    still be challenging even if the complete finished
    sequence of the genome were available

93
GC-Gradient Effect in Rice
  • Jun Wang, Gane Wong, et al.
  • Genome Research (2002)

94
Fly
95
  • Huimin Xie, Grammatical Complexity and 1D
    Dynamical Systems, Vol. 6 in Directions in
    Chaos, WSPC, 1996.
  • [Chinese-language reference, 1994; title lost to encoding damage]
  • J. Hopcroft, J. Ullman, Introduction to Automata
    Theory, Languages and Computation,
    Addison-Wesley, 1979.

96
THE END. THANK YOU!