Finding Genes in the Rice Genome

About This Presentation
Title:

Finding Genes in the Rice Genome

Description:

Title: Language Theory Combinatorics and Bioinformatics. Author: aaa. Last modified by: Hao Bailin. Created: 12/4/2001 3:15:40 AM. Document presentation format: PowerPoint PPT presentation.

Slides: 33

Transcript and Presenter's Notes



1
Finding Genes in the Rice Genome
  • Hao Bailin
  • T-Life Research Center, Fudan University
  • Beijing Genomics Institute, Academia Sinica
  • Institute of Theoretical Physics, Academia Sinica
  • (www.itp.ac.cn/hao/)
  • On-going work by a team of 10-12 people since
    August 2001: Zheng Weimou, Xie Huimin, Liu
    Jinsong, Xu Zhao, Fang Lin, Li Heng, Gao Lei, Jin
    Jiao, et al. Nothing written yet.

2
Two Cultivars of Rice
  • Oryza sativa ssp. indica
  • Oryza sativa ssp. japonica
  • The difference was described in Xu Shen's
    Chinese dictionary (Shuowen Jiezi) of the
    Eastern Han Dynasty (2nd century AD)
  • J.H. Zhang et al., Rice cultivation of Jianhu
    Remains in Henan Province, Science J.
    53(4), 2002, 3 (in Chinese)

3
cccaatatcttgcttcagcaagatattgggtatttctagctttcctttct
tcaaaaattgctatatgttagcagaaaagccttatccattaagagatgga
acttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtg
cattacttccataccaagattagcacggttgatgatatcagcccaagtat
taataacgcgaccttggctatcaactacagattggttgaaattgaatccg
tttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccc
tactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaa
aactagcatattggaagattaatcggccaaaataaccatgagcggccaca
atattataagtttcttcctcttgaccaaatctgtaaccctcattagcaga
ttcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccat
gcatagcactgaatagggaaccgccgaatacaccagctacacctaacatg
tgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaa
gttgaaagtaccagatattcctaaaggcataccatcagagaaacttcctt
gaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagct
gaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttc
ccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaa
ttagctcataaggaccaccattgtataaccactcatcaacagatgcagct
tcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataat
ggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggct
cacgaataccatcaatatctactggaggggcagcgatgaaggcgataata
aatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaacca
tccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgac
cccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaaga
tcttggtttattcaaattgcaaggactcccaagcacacgtattaactaga
aagataatagaaggcttgttatttaacagtataatatagactatatacca
atgtcaaccaagccagccccgacagttgtatatccatacaacaaaattta
ccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcaga
ttgctcctttctagtttccatatgggttgcccgggactcgaacccggaac
tagtcggatggagtagataattattccttgttacaatagagaaaaaacct
ctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgta
gaaataggctatttctattccgaagaggaagtctactaatttttttagta
gtaagttgattcacttactatttattatagtacagagaacatttcagaat
ggaaactgtgaaagttttaccttgatcatttatcaatcatttctagttta
ttagttttgtttaatgattaattaagaggattcaccagatcattgatacg
gagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaa
agtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgta
aaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtac
cgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagta
tatactttagtcgatacaaagtcttcttttttgaagatccactgtgataa
tgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtac
aaaattgggcttttgctaaagatccaatgagaggagtaacagggactttg
gtatcgaattttttcatttgagtatctattagaaatgaattctccagcat
ttgattccttactaacaaagaatttattggtacacttgaaaagtacccca
gaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggt
tgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacattt
ccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttc
cttgatatcgaacataatgcataaggggatccataacgaaccatatggtt
ttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctaga
aaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaag
aagattgtttacgaagaaacaacaagaaaaattcatattctgatacataa
gagttatataggaaccgaaatagtcttttattttcttttttcaaaataaa
aatggatttcattgaagtaataaaactattccaattcgagtagtagttga
gaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattga
aggagttgaagcaagatatccaaatggataggatagggtatttctatatg
tgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattg
aatgaatagatcgtaaattctgaaactttggtatttctttttcttccgga
caagactgttctcgtagcgagaatgggatttctacaacgatcgcaaaccc
ctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaat
ccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattc
tgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaattt
cttgttattagaaccaataatttcgacaagttcggaaccatttaatccat
aatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacg
aaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttc
catttgtatttctacttgaatcagagagagagaaatatttctcggtttat
caaatggtgatacatagtacaatatggtcagaacagggtgttgcattttt
taatacaaacccctggggaagaaaaggagtctaatccacggatctttttc
cgctccttttctatccaatttgtttatgtttgttctaattacaaaagaga
acaaatcctttatttttgcaggccaattgctcttttgactttgggataca
gtctctttatcaatatactgcttcttttacacattcaatccataacatcc
ttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaa
aatcaaaggtctactcataggaaaaccagcttttccctacatcaggcact
aatctatttttaacgtctaattagatcagggagttcttccaattaagaag
ttaagctcgttgctttttgttttaccagaattggagccaggctctatcca
tttattcattagacccagaaaatcagaatttttttattccattccaaaaa
tccaaaataagaaattgattttattacgacatgctattttttccattcat
tacccttgaggatcagtcgcggtcttatagactctaccaagagtctggac
gaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaa
gccgagtactctaccattgagttagcaacccagataaactaggatcttag
atacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaa
acactcagataccaaaaggaacgggtctggttaaatttcactaaggttaa
aagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttat
ttaaataaataaataaatcttgtatgagagtacaaacaagagggacaacc
ctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggata
aagagacttatccatctacaaattctagatgttcaatggacctttgtcaa
tggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaata
aaggcttatgttggattggcacgacataaatccagtcaaaaataggatta
agaaagaggcaaattatttctaaatagttagacaacaagggatactagtg
agcctctcctagttttttattcatttagttcttcaattaactcaaagttc
tttctttttctttaaagaattccgccttccttaaaatatcagaaacggtt
cttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatt
taaacaagtttgattctttatcggatcataaaaacctacttttcgaagat
ctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagaca
gcttattgggatagatgtagataaataaagccccccctagaaacgtatag
gaggttttctcctcatacggctcgagaatatgacttgcattaatttccgt
acagaaaaaacaaatttcatttatactcatgactcaagttgactaatttt
gattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtct
ctaaactcttttctttgcctcatctcgaacaaattcacttttattcctta
ttccggtccaattctattgttgagacagttgaaaatcgtgtttacttgtt
cgggaatcctttatctttgatttgtgaaatccttgggtttaaacattact
tcgggaattcttattcttttttctttcaaaagagtagcaacatacccttt
tttcttatttccttcgataaagcatttccctcttctatagaaatcgaata
tgagcgattgattctgatagactttaatcaaaagagttttcccatatctt
ccaaaattggactttcttcttattttaaccttttgatttctatattattt
cgatttctatattaagggtagaatgacaaagttggcctaatttattagtt
ttcactaaccctagattctttcccttgataaaaaataaattctgtcctct
cgagctccatcgtgtactatttacttagcttacttacaaacaacccagcg
aaaattcggttcgggacgaatagaacagactatgtcgagccaagagcatt
ttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtc
cttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaa
cataacattcctctaatttcattgcaaagtgttatagggaattgatccaa
tatggatggaatcatgaatagtcattagtttcgttttttgtatactaatt
caaacttgctttgctatctatggagaaatatgaataaaagaaattaagta
tttatcgggaaagactccgcaaagagccaatttatttaaacccatattct
atcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgctt
aagacttatttattatggaatttccatcctcaacagaggactcgagatga
tcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaa
taaactatcaacctcccgtttaattaatttaattaatatattagattagc
aatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaatt
ctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaa
aaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttatt
ttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattct
atctaacgagcagttcttatcttatctttaccgggatggatcattctgga
tatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaa
aaagaagaaggaaccttttttactaataaaatactataaaaaaaatttat
ctctatcataaatctatctctaccataaaggaataggtctcgttttttat
acaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttca
atttgactggacttgacactggattatgttttctgagacagaaaatgaac
gcattaggactgcatcgaatctaagagtttataagagaaaaaaattctct
ttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgt
ttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacggg
accaaaacccgctgccttaccacttggccacgccccatttcgggttttat
gcgacactaataaacagtattatgtttatttcttattcgtcaatcctact
tcaattacataaaaatggggggtattctcttggtaggattctagacatgc
gaataatatagaatccaaaaaatgcattgatcattacatggaattctatt
aagatattatatgaaagtcgaatttcttccactctcatttgagagtgcga
atacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggatttt
gaatcctccttttcctttttcccttagaaaaataactcaatcaaaatcca
attatctactctacaagaacgaaacgcttgttatgcctaatatacttagt
ttaacctgtatttgttttaattctgttatttatccgactagttttttctt
cgccaaattgcccgaagcttatgccattttcaatccaatcgtggatttta
tgcctgtcatacctgtactcttttttctattagcctttgtttggcaagct
gctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatc
atgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtct
tatattatgaaccttcgattctaaaattcaaattcttctacattgaatgt
atagctgcagcaataaatttggatcagcctttctactccctgcatctacg
ttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattga
taagagtgcttattataaatcaattcttgcaatttttttcaaaaattgat
ttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttg
tgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggaga
ttatgtaatgcttactctcaaactttttgtttatacagtagtgatattct
ttgtttccctctttatctttggattcttatctaatgatccaggacgtaat
cctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggattt
gtttcatacatttatctacgagaaaatccgggggtcagaattccttccaa
ttcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaacc
ctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtcc
actcagccatctctccccgttccaaatcgaaaggtttccgtgatatgaca
gaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttctta
atgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaa
aattggatattggctaaaagacaatcagatagattttctcttcagcaggc
atttccatataggacttgttataataaaacaagcaggttatagaaaaaaa
ctcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaacc
aacccaccccataaaattggaaagaaagataaagtaagtggacctgactc
cttgaatgaggcctctatccgctattctgatatataaattcgatgtagat
gaaattgtataagtggatttttttgtatttccttagacttagaccacgca
aggcaagaatttctcgctatttactatttcatattcttgttactagatgt
tctataggaataagaagaaatcgcaacccctttccgctacacataaaaat
ggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaa
tcctatttttgttcttatacccatgcaatagagagcgagtgggaaaaggg
aggttactttttttcattttttccttaaaaaataggctttcttggaaata
ggaatcatggaataatctgaattccaatgtttatttctatagtataagaa
aaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccc
catagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggca
ttgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagt
gatataaggtgctcggaaatggttgaagtaattgaataggaggatcacta
tgactatagcccttggtagagttactaaagaagaaaatgatttatttgat
attatggacgactggttacgaagggaccgttttgtttttgtaggatggtc
tggcctattgctttttccttgtgcttatttcgctttaggaggttggttta
cagggacaacttttgtaacttcttggtatacccatggattggcgagttcc
tatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaa
tagtttagcacactctttgttgctactatggggcccggaagcacaagggg
attttactcgttggtgtcaattaggtggtctgtggacttttgttgctctc
catggggcttttgcactaataggtttcatgttacgtcaatttgaacttgc
tcggtctgttcaattgcggccttataatgcaatttcattctctggcccaa
tcgctgtttttgtttccgtattcctgatttatccactggggcaatccggt
tggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcct
cttcttccaaggatttcataattggacgttgaacccatttcatatgatgg
gagttgccggagtattaggcgcggctctgctatgcgctattcatggggca
accgtgga
4
Gene-Finding by Computer
  • Starting from the early 1980s
  • Ab initio or de novo algorithms: GeneMark,
    GenScan, FgeneSH, Genie, based on gene-structure
    models and training data. (Our on-going project:
    BGF, the BGI Gene Finder)
  • Homology methods based on sequence alignment with
    known genes in databases
  • Mixed approach using both strategies: TwinScan

5
Different Stages of Gene-Finding
  • Use all possible existing programs and services
    on the web with a public-domain or home-made
    genome viewer
  • Write your own gene-finder, trained for the
    specific organism
  • A dream for the time being: design a
    self-training and self-developing program for
    any species, which would improve itself
    iteratively starting from a few available reads,
    cDNAs, and ESTs

6
Performance of Gene-Finders in Eukaryote Genomes
  • M. Q. Zhang, Nature Reviews Genetics, 3 (2002)
    698-710 (mostly for the human genome)
  • Nucleotide level: 80%
  • Exon level: 45%
  • Whole gene structure: 20%
  • FgeneSH and BGF for rice (our tests on 128
    cDNA-confirmed single-gene genomic sequences)
  • Nucleotide level: 90%
  • Exon level: 60%
  • Whole gene structure: 40%
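
The slide does not say how the three levels are scored. A minimal sketch of one common convention follows, assuming exons are given as half-open (start, end) intervals and that a predicted exon or gene counts as correct only when its boundaries match exactly. The function names and the toy coordinates are hypothetical, not taken from the presentation.

```python
def nucleotide_stats(true_exons, pred_exons):
    """Return (sensitivity, precision) over individual coding nucleotides."""
    true_nt = {i for s, e in true_exons for i in range(s, e)}
    pred_nt = {i for s, e in pred_exons for i in range(s, e)}
    tp = len(true_nt & pred_nt)
    sn = tp / len(true_nt) if true_nt else 0.0
    pr = tp / len(pred_nt) if pred_nt else 0.0
    return sn, pr


def exon_stats(true_exons, pred_exons):
    """Return (sensitivity, precision); an exon must match at both ends."""
    t, p = set(true_exons), set(pred_exons)
    tp = len(t & p)
    return (tp / len(t) if t else 0.0, tp / len(p) if p else 0.0)


def gene_correct(true_exons, pred_exons):
    """A gene structure is correct only if every exon is exactly right."""
    return sorted(true_exons) == sorted(pred_exons)


# Toy example: one acceptor site is misplaced by 10 bases.
truth = [(0, 100), (200, 300), (400, 500)]
pred = [(0, 100), (210, 300), (400, 500)]
nt_sn, nt_pr = nucleotide_stats(truth, pred)
ex_sn, ex_pr = exon_stats(truth, pred)
whole_gene_ok = gene_correct(truth, pred)
```

A single misplaced boundary barely changes the nucleotide-level score, costs one exon out of three, and makes the whole gene wrong, which is consistent with the falling numbers from nucleotide to exon to whole-gene level.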

7
[Figure: double-stranded DNA, each strand labeled 5' to 3']
  • Each strand carries the same amount of
    information, but different sets of genes.
  • Two strands are equivalent in information
    content.
  • Two strands are not equivalent in gene content.
  • Biological processing (duplication,
    transcription) goes from 5' to 3'.
  • Finding genes on one strand at a time or on two
    strands at the same time: one-pass or two-pass
    programs.
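
As an illustration of the strand symmetry, the sketch below finds a motif on both strands, either by scanning the reverse complement separately or by looking for the reverse-complemented motif on the given strand. The helper names are hypothetical, and the motif "atg" is used only as an example signal.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_all(seq, motif):
    """Return all 0-based start positions of motif (overlaps allowed)."""
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(motif, pos + 1)
    return hits


def motif_on_both_strands(seq, motif):
    """Scan each strand in turn; report hits in forward-strand coordinates."""
    forward = find_all(seq, motif)
    n, k = len(seq), len(motif)
    # A hit at position p on the reverse strand covers forward
    # positions n - p - k .. n - p - 1.
    reverse = sorted(n - p - k for p in find_all(reverse_complement(seq), motif))
    return forward, reverse


seq = "atgcccat"
fwd_hits, rev_hits = motif_on_both_strands(seq, "atg")
# Same reverse-strand hits, found on the forward string alone:
same_scan_hits = find_all(seq, reverse_complement("atg"))
```

Both routes give the same reverse-strand positions, so a program can either process each strand separately or handle both strands in a single scan of one string.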

8
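[Figure: from gene to protein]
Genomic DNA (5' to 3', with start and stop) → transcribe (RNA Pol II) → pre-mRNA → splice (spliceosome: U1, U2, U4, U5, U6 snRNPs) → mRNA (with 5'-UTR and 3'-UTR) → translate (ribosome; initiation, elongation, and termination factors; chaperonins) → amino acid sequence (protein primary sequence) → fold → folded protein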
9
Three Scales of Search
  • Local: signals with minimal signature (start,
    stop, splicing); movable signals (caps,
    promoters, polyAs, branching points, some very
    weak) --- clustering, discriminant analysis,
    various statistical models
  • Intermediate: exons, introns, intergenic ---
    Markov, semi-Markov, hidden Markov models;
    intron length distribution
  • Global: optimal combination of the above ---
    dynamic programming
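
The global step can be illustrated with the Viterbi algorithm, the standard dynamic program for finding the most probable state path in a hidden Markov model. This is a toy sketch, not the model used by any of the gene-finders named in the talk: the two states ("E" for a GC-rich exon-like region, "I" for an AT-rich intron-like region) and all probabilities are invented for illustration, and the observation sequence is assumed to be non-empty.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    # best[i][s]: best log-probability of any path ending in s at position i
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for x in obs[1:]:
        prev = best[-1]
        row, ptr = {}, {}
        for s in states:
            r = max(states, key=lambda q: prev[q] + log_trans[q][s])
            row[s] = prev[r] + log_trans[r][s] + log_emit[s][x]
            ptr[s] = r
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


def logs(table):
    return {k: math.log(v) for k, v in table.items()}


states = "EI"
log_start = logs({"E": 0.5, "I": 0.5})
log_trans = {"E": logs({"E": 0.9, "I": 0.1}), "I": logs({"E": 0.1, "I": 0.9})}
log_emit = {
    "E": logs({"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1}),
    "I": logs({"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4}),
}
parse = viterbi("ggccggaattaatt", states, log_start, log_trans, log_emit)
```

Each position is scored locally, but the choice of where to switch state is made globally, which is the role dynamic programming plays in combining signal and content scores.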

10
()?(.)(.)(.)?()
[Figure labels, left to right: transcription start, translation start, translation end, transcription end]
  • Signals
  • transcription start (downstream of
    promoters)
  • transcription end (upstream of poly-A)
  • ? translation start (atg, 1/64 in a random
    seq.)
  • ? translation end (tag, tga, taa, 3/64)
  • ( splicing donor site (minimal signal gt,
    1/16)
  • ) splicing acceptor site (ag, 1/16)
  • branch point (very weak signal, a)
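
The fractions quoted for these signals follow from counting k-mers in a random sequence with equiprobable bases. The sketch below reproduces that arithmetic, assuming the standard start codon atg and the three standard stop codons; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = "acgt"
STOPS = {"tag", "tga", "taa"}


def kmer_probability(kmers, k):
    """Chance that a random k-mer (equiprobable bases) is one of `kmers`."""
    all_kmers = {"".join(p) for p in product(BASES, repeat=k)}
    return Fraction(len(all_kmers & set(kmers)), len(all_kmers))


p_start = kmer_probability({"atg"}, 3)     # one codon out of 64
p_stop = kmer_probability(STOPS, 3)        # three codons out of 64
p_donor = kmer_probability({"gt"}, 2)      # one dinucleotide out of 16
p_acceptor = kmer_probability({"ag"}, 2)   # one dinucleotide out of 16

# In a fixed reading frame of random sequence, the expected number of
# codons read up to and including the first stop is 1 / p_stop.
mean_codons_to_stop = 1 / p_stop
```

Because such minimal signals occur so often by chance, they cannot locate genes on their own and must be combined with the intermediate and global scales of search.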

11
()?(.)(.)(.)?()
[Figure labels, left to right: transcription start, translation start, translation end, transcription end]
  • ?( First exon
  • )( Internal exon
  • )? Last exon
  • ( Non-coding 5' exon
  • )? Non-coding 5' exon
  • (.) Intron
  • ?( Non-coding 3' exon (rare)
  • ) Non-coding 3' exon (rare)
  • Intergenic region

12
Signal and Sequence Models
  • eiid: equal probability, independently and
    identically distributed
  • niid: non-equal probability, independently and
    identically distributed
  • WWM: windowed weight matrix, etc.
  • MMn: Markov chain model of order n; homogeneous
    and period-3 MM5 are used in many gene-finders
  • Consensus sequence
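
To make the MMn entry concrete, here is a minimal sketch of a homogeneous Markov chain of order n, estimated from example sequences with add-one pseudocounts and used to score a sequence by log-likelihood. The function names, the pseudocount choice, and the toy training data are assumptions for illustration. A period-3 model would keep three such tables, one per codon position.

```python
import math
from collections import defaultdict


def train_markov(seqs, order, alphabet="acgt", pseudocount=1.0):
    """Estimate P(base | previous `order` bases); return a lookup function."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudocount * len(alphabet)
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order):
    """Sum of log P(base | context) over all positions with a full context."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq))
    )


def log_odds(seq, model, background, order):
    """Positive when `model` explains seq better than `background`."""
    return log_likelihood(seq, model, order) - log_likelihood(seq, background, order)


toy_model = train_markov(["acacacac"], order=1)
```

In a gene-finder, one model would be trained on coding sequence and another on non-coding sequence, and the log-odds of a candidate region would serve as its content score.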

13
Consensus Sequences
  • TATAAT (Pribnow or -10 box)
  • T80 A95 T45 A60 A50 T96 (percent occurrence of each base)
  • TTGACA (-35 box)
  • T82 T84 G78 A65 C54 A45
  • CAAT (CAAT or -75 box)
  • GGYCAATCT
  • TATA (TATA or Goldberg-Hogness box)
  • TATAWAW
  • ATG (translation start point)
  • However, in Aful: ATG 76%, GTG 22%, TTG 2%

14
(No Transcript)
15
GT-AG Rule for Intron
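  • 5' splicing donor site:
    exon ... A64 G73 | G100 T100 A62 A68 G84 T63 ... intron
  • 3' splicing acceptor site:
    intron ... (Py)12 N C65 A100 G100 | N ... exon
  • Numbers give the percent occurrence of each base;
    "|" marks the exon-intron boundary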
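
The GT-AG rule on its own gives only a very permissive filter. The sketch below enumerates every span that starts with gt and ends with ag, as a hypothetical first step before any scoring. The function name and the `min_len` and `max_len` parameters are assumptions; realistic length limits would have to come from observed intron length distributions.

```python
def candidate_introns(seq, min_len=4, max_len=None):
    """Return half-open (start, end) spans that begin with gt and end with ag."""
    seq = seq.lower()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "gt"]
    acceptor_ends = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "ag"]
    spans = []
    for d in donors:
        for e in acceptor_ends:
            length = e - d
            # min_len >= 4 keeps the gt and ag dinucleotides from overlapping.
            if length >= min_len and (max_len is None or length <= max_len):
                spans.append((d, e))
    return spans


toy = "ccgtaaaagcc"
spans = candidate_introns(toy)
```

On real genomic sequence the number of such spans grows roughly with the square of the sequence length, which is why the positional percentages around each site are needed to rank candidates.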

16
(No Transcript)
17
Exon and intron size distribution
18
Algorithms
  • Sequence models and scores for signals
  • Dynamic programming optimal parse
  • Hidden Markov Model: geometric distribution of
    intron lengths
  • Semi-Hidden Markov Model: needs
    sequence-generating models and length probability
    for each node
  • Language theory approach
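
The geometric length distribution mentioned for ordinary hidden Markov models follows directly from the self-transition probability of a state. A short sketch, with hypothetical function names and example values chosen only for illustration:

```python
def geometric_length_pmf(p_stay, length):
    """P(a state is occupied for exactly `length` steps) given P(stay) = p_stay."""
    if length < 1:
        return 0.0
    return (p_stay ** (length - 1)) * (1.0 - p_stay)


def mean_length(p_stay):
    """Expected number of consecutive steps spent in the state."""
    return 1.0 / (1.0 - p_stay)


# Staying with probability 0.99 gives a mean length of about 100 bases,
# yet the single most likely length is still 1.
pmf_values = [geometric_length_pmf(0.99, k) for k in range(1, 6)]
average = mean_length(0.99)
```

The probability always decreases with length, so an ordinary HMM cannot represent a length distribution that peaks at some typical size. A semi-HMM avoids this by attaching an explicit length probability to each node.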

19
Flow Chart of GenScan
  • Chris Burge (1996): a 27-state semi-HMM
  • A simpler model: 19-state
  • A model taking UTR introns into account:
    35-state

20
  • Figure: N, intergenic region; P, promoter; F,
    5' UTR; single-exon gene; initial exon; phase k
    internal exon; terminal exon; T, 3' UTR; A,
    polyadenylation signal; and phase k intron, for
    each strand.

21
Problems Minor and Major
  • Ambiguity symbols (N, W, S, R, etc.)
  • (1-p) at flanking D-type nodes
  • Indels and frame-shifts
  • Gradient effects in gene structure
  • Introns in 5'-UTRs and 3'-UTRs leading to
    35-state Markov Models
  • Alternative splicing and sub-optimal paths
  • Limits of probabilistic models
  • Deterministic approaches

22
Dyck language: a language of nested parentheses
  • Many types of parentheses
  • Finite depth of nesting
  • Context-free language
  • Our case
  • Only 3 types of parentheses
  • Shallow nesting
  • Conjecture: may be a regular language
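
The link between shallow nesting and regularity can be sketched as follows. A Dyck word is normally checked with a stack, but if the nesting depth is bounded, the stack can hold only finitely many different contents, so a finite automaton with one state per stack content suffices. The three bracket characters below are stand-ins for the three signal pairs and are an assumption of this example, as are the function names.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def is_dyck(word, max_depth=None):
    """Check balanced nesting, optionally rejecting depth above max_depth."""
    stack = []
    for ch in word:
        if ch in OPENERS:
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False
    return not stack


def count_states(max_depth, n_types=3):
    """Number of distinct stack contents of depth 0..max_depth."""
    return sum(n_types ** d for d in range(max_depth + 1))
```

With three bracket types and depth at most 2, `count_states(2)` gives 13 stack contents, so 13 states plus a rejecting state are enough to recognize the language.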

23
Two Test Datasets for Rice Gene-Finders
  • The 28469 japonica full-length cDNAs (Kikuchi et
    al., Science 301 (18 July 2003))
  • Select a high-quality subset without overlaps
    with publicly available cDNAs
  • A single-gene set: 500 sequences with one gene in
    each
  • A multi-gene set: 46 sequences with 199 genes in
    total (at least 4 genes in a sequence)

24
Assessment of Gene-Finders
  • Tests done between 22 July and 2 August 2003
  • FgeneSH (trained on monocotyledons)
  • GeneMark.hmm
  • RiceHMM
  • GlimmerR
  • GenScan (trained on maize)
  • BGF

25
Our Ultimate Goal
  • An iterative, self-training, self-improving
    gene-finder for any species, starting from a
    small number of reads with or without EST or cDNA
    support
  • Annotation and re-annotation of the rice genomes
  • Plant comparative genomics, especially that of
    Gramene and Crucifers

26
tRNA features
  • tRNA gene → pre-tRNA → mature tRNA
  • Mature tRNA: 75-95 bases
  • Cloverleaf-like structure
  • Five arms: acceptor arm, D arm, anticodon arm, V
    loop (extra arm), TψC arm

27
How many tRNA genes are present in an
organism?
  • Codon ↔ tRNA ↔ amino acid
  • 61 coding codons
  • 20 amino acids
  • Are there 61 species of tRNA with all possible
    anticodons ?
  • Met (M) has one codon but two tRNAs
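
The count of 61 follows from the 64 possible triplets minus the three stop codons of the standard genetic code. A small sketch of that arithmetic, which also lists the exactly complementary anticodon for each codon (variable and function names are hypothetical):

```python
from itertools import product

STOP_CODONS = {"taa", "tag", "tga"}  # standard genetic code
COMPLEMENT = str.maketrans("acgt", "tgca")


def anticodon(codon):
    """Exactly complementary anticodon, written 5' to 3' (DNA alphabet)."""
    return codon.translate(COMPLEMENT)[::-1]


all_codons = ["".join(c) for c in product("acgt", repeat=3)]
sense_codons = [c for c in all_codons if c not in STOP_CODONS]
anticodons = {c: anticodon(c) for c in sense_codons}
```

If every codon needed its own exactly complementary tRNA, 61 distinct anticodons would be required, which is the upper bound the question on this slide refers to.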

28
Wobble hypothesis (Crick, 1966)
  • Many tRNAs recognize more than one codon
  • Through non-Watson-Crick base pairings
  • Fewer than 61 tRNAs are needed
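
A sketch of how wobble pairing lets one anticodon read several codons is given below. The pairing table is the commonly cited form of Crick's rules for the first (5') anticodon base, with I standing for inosine. It is supplied here as an assumption, since the slide does not list the rules, and the function name is hypothetical.

```python
# First anticodon base (5' end) -> codon third bases it can pair with.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codons_read_by(anticodon):
    """Codons (5' to 3', RNA) read by an anticodon written 5' to 3'."""
    first, second, third = anticodon
    # Pairing is antiparallel: codon base 1 pairs with anticodon base 3,
    # codon base 2 with anticodon base 2, and codon base 3 (the wobble
    # position) with anticodon base 1.
    c1 = WATSON_CRICK[third]
    c2 = WATSON_CRICK[second]
    return sorted(c1 + c2 + c3 for c3 in WOBBLE[first])


phe = codons_read_by("GAA")  # reads both phenylalanine codons
ala = codons_read_by("IGC")  # inosine reads three alanine codons
met = codons_read_by("CAU")  # reads only AUG
```

Under these rules a single tRNA covers up to three codons that differ only in the third base, so the 61 coding codons can be served by fewer tRNA species.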

29
The Modified Wobble Hypothesis (Guthrie and Abelson,
1982)
  • In eukaryotes, 46 different tRNA species would be
    enough.
  • The modified wobble hypothesis holds almost
    perfectly in H. sapiens, S. cerevisiae, A.
    thaliana, and C. elegans, whose complete
    collections of tRNAs are now known.

30
tRNA copies in Arabidopsis, C. elegans, and Human
[Table: tRNA gene copy numbers arranged by amino acid as in the codon table: F, C, Y, S, W, L, H, R, P, Q, I, N, S, T, K, R, M, D, V, A, G, E]
31
tRNA Genes in the Rice Genome (Found by
tRNAscan-SE and BLASTN)
32
Chloroplast tRNA genes in ssp. indica and japonica
  • 33 tRNA genes were found in each of the indica and
    japonica chloroplast genomes.
  • They are completely identical; no mutation was
    found (E. C. Kemmerer and Ray Wu found two tRNA
    genes perfectly conserved).
  • It is remarkable that, in spite of more than 9000
    years of separation, no mutation could be observed
    in the chloroplast tRNA genes of the two ssp.