Title: Universitat Aut
1 explosion of biological data
2 genome technologies
- DNA sequencing
- DNA microarrays
- mass spectroscopy and 2-D gels
- yeast two-hybrid
- X-ray crystallography and NMR
3 growth of sequence data
4 Moore's law
5 Google hits for X-informatics
bioinformatics 2,270,000
chemoinformatics 10,600
astroinformatics 31
neuroinformatics 49,300
socioinformatics 318
geoinformatics 38,000
meteoinformatics 2
econoinformatics 83
ecoinformatics 36,400
biology 17,000,000
6 decoding the genome
the genome sequence
- ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGA
AGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGATGTA
GCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACT
CAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCAGCAGGG
GACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAA
GGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACAGCAGGACCC
CCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTG
TCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGACACGGTAGCTAG
CCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGA
AAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAATGTGAAGGGAGAATGA
GGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGG
GGTGTGGAATTTGAACCCCGGGAGAGAAAGATGGAATTTTGGCTATGGAG
GCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAG
GGAATGGGTTGGGGGCGGCTTGGTAACTGTTTGTGCTGGGATTAGGCTGT
TGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGT
TGGGGTCGGGCTGGGGGCGGGAGGAGTCCTCACTGGCGGTTGATTGACAG
TTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTG
TGGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCT
CGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCC
CATTCAAGCACACCCTGGGCCCCCTCTTCTTCTGCTGGTCTGTCCCCTGA
GGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAG
CGATTTGACGCTCTGTGAGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCA
GCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCA
GCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGC
CTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTT
TTTGTACTTTGAGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCT
CTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAA
TTACAGGCGTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGT
TAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGAT
GAAGTCCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGT
TCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGC
CATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCA
CCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCACC
ATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCT
TCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCCGGCT
GGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGAC
AGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTGGGGCACA
CAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATC
CCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTCAGGAGATCGA
GACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAA
AACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGAGG
CTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCC
GAGATCGCGCCACTGCACTCCAGCCTGGGTGACACAGCGCGAGACTCCGT
CTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTA
GGCACGGTGGCTCACCCCTGTAATCCCAGCATTTTGGGAGGCCAAGGTGG
GAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGA
GACTCTCTCTACTAAAAATACAAAAATTAGCCAGGCCTGGTGCCACACAC
CTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCC
AGAAGGTCAAGGTTGCAGTGAACCACGTTCAGGCCACTGCAGTCCAGCCT
GGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGA
TTAAACAGACTCCCCCCTCACCCTGCCCACCATGGCTCCAAAGCAGCATT
TGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTC
AGAAGGAACCTGAACCACCTTCAACTTGTTCCACACAGGATGCCAGGCCA
AGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAG
GCTGAGTGGCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTG
GGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGC
TGCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGG
CCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCAT
TCTGCCCCTGCCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGC
TTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCT
CTCAGTTCTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGT
CCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAG
ATGGGGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAG
CGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCA
CCTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCG
TCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGTTC
ACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCC
TTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC
7 the genome sequence
8 the genome sequence
10
gagttttatcgcttccatgacgcagaagttaacactttcggatatttctg
atgagtcgaaaaattatcttgataaagcaggaattactactgcttgttta
cgaattaaatcgaagtggactgctggcggaaaatgagaaaattcgaccta
tccttgcgcagctcgagaagctcttactttgcgacctttcgccatcaact
aacgattctgtcaaaaactgacgcgttggatgaggagaagtggcttaata
tgcttggcacgttcgtcaaggactggtttagatatgagtcacattttgtt
catggtagagattctcttgt
MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFY
TPKARREVEGPQVGALELAGGPGAGGLEGPPQKRGIVEQCCASVCSLYQL
ENYCN
11 probabilistic patterns in gene prediction
Roderic Guigó Serra
Robert Castelo
12 decoding the genome
the genome sequence
13 the amino acid sequence of the proteins
- QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQES
KPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERI
EKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDL
FIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSP
ESEQFRADHPFLFLIKHNPTNTIVYFGRYWS
14 eukaryotic gene structure
16 eukaryotic gene structure
acceptor
donor
17 modeling donor sites
GGG GTGAGCCCAG
GTG GTAAGAGACA
TAG GTGAGTGTGA
GCG GTAGGTACTC
CAG GTAATTTTCT
AAG GTAGGCTCTG
AGG GTGAGTCCAG
GAG GTGGGTCACA
CAG GTCAGTCTTT
ACG GTAAGACCTG
CAG GTGGGTGCTG
CAG GTAAGCAGTG
AGG GTGAGTTCAG
CAG GTAAGCATTG
AGG GTGAGTTCAG
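One simple way to turn aligned donor sites like these into a model is a position weight matrix (PWM). The sketch below is only an illustration, not the authors' implementation: it assumes each site is read as the last 3 exon nucleotides followed by the first 10 intron nucleotides, a uniform background of 0.25 per base, and a pseudocount of 1. The function names are hypothetical.

```python
import math

# The 15 donor sites from the slide, each written as 3 exon + 10 intron bases.
DONOR_SITES = [
    "GGGGTGAGCCCAG", "GTGGTAAGAGACA", "TAGGTGAGTGTGA", "GCGGTAGGTACTC",
    "CAGGTAATTTTCT", "AAGGTAGGCTCTG", "AGGGTGAGTCCAG", "GAGGTGGGTCACA",
    "CAGGTCAGTCTTT", "ACGGTAAGACCTG", "CAGGTGGGTGCTG", "CAGGTAAGCAGTG",
    "AGGGTGAGTTCAG", "CAGGTAAGCATTG", "AGGGTGAGTTCAG",
]
ALPHABET = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Estimate per-position base frequencies (pseudocount is an assumption)."""
    freqs = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        freqs.append({base: counts[base] / total for base in ALPHABET})
    return freqs


def log_odds_score(candidate, freqs, background=0.25):
    """Sum of per-position log2 odds against a uniform background (assumed)."""
    return sum(
        math.log2(freqs[pos][base] / background)
        for pos, base in enumerate(candidate)
    )


freqs = position_frequencies(DONOR_SITES)
with_gt = log_odds_score("CAGGTAAGTATTG", freqs)
without_gt = log_odds_score("CAGCCAAGTATTG", freqs)
print(with_gt > without_gt)
```

Because every training site has GT at intron positions 1 and 2, a candidate lacking GT there is penalised heavily. A PWM treats positions as independent, which is the limitation the later slides on modeling dependencies address.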
18 the donor site pattern reflects underlying
biological constraints
19 the donor site pattern reflects underlying
biological constraints
20 the donor site pattern
21 prediction of splice sites
22 modeling dependencies
23 modeling dependencies, first order Markov models
Weight Array Models (WAM), Zhang and Marr (1993)
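A first-order model of this kind conditions each position on the base immediately before it, so the probability of a site x is P(x1) multiplied by the product of P(xi | xi-1) over the remaining positions, with a separate conditional table per position. The following is a minimal sketch under stated assumptions (a pseudocount of 1, a toy training set of five of the donor sites shown on slide 17, hypothetical function names); it is not the published WAM implementation.

```python
import math

ALPHABET = "ACGT"


def train_wam(sites, pseudocount=1.0):
    """Estimate P(x1) and position-specific P(x_i | x_{i-1}) from aligned sites."""
    first = {base: pseudocount for base in ALPHABET}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {base: count / total for base, count in first.items()}

    # cond[i][prev][cur] is the probability of `cur` at position i + 1
    # given `prev` at position i (0-based).
    cond = []
    for pos in range(1, len(sites[0])):
        counts = {p: {c: pseudocount for c in ALPHABET} for p in ALPHABET}
        for site in sites:
            counts[site[pos - 1]][site[pos]] += 1
        table = {}
        for prev in ALPHABET:
            row_total = sum(counts[prev].values())
            table[prev] = {c: counts[prev][c] / row_total for c in ALPHABET}
        cond.append(table)
    return first, cond


def wam_log_prob(seq, first, cond):
    """Natural-log probability of `seq` under the first-order model."""
    logp = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        logp += math.log(cond[pos - 1][seq[pos - 1]][seq[pos]])
    return logp


training = [
    "CAGGTAATTTTCT", "CAGGTCAGTCTTT", "CAGGTGGGTGCTG",
    "CAGGTAAGCAGTG", "CAGGTAAGCATTG",
]
first, cond = train_wam(training)
seen = wam_log_prob("CAGGTAAGCATTG", first, cond)
broken = wam_log_prob("CAGCCAAGCATTG", first, cond)
print(seen > broken)
```

Compared with a PWM, the number of parameters per position grows from 4 to 16, so the pseudocount matters more when the training set is small.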
25 extending the Markov order
- Salzberg et al. (1998), Interpolated Markov Models
- Cawley (2000), Variable Length Markov Models
26 modeling non-local dependencies in splice sites
- Burge and Karlin, 1997. Maximal Dependence Decomposition (MDD)
- Agarwal and Bafna, 1998
- Yeo and Burge, 2003
- Zhao et al., 2004. Permuted Variable Length Markov Models (PVLMM)
- Cai et al., 2000; Dash and Gopalakrishnan, 2001. Bayesian Networks
- Castelo and Guigó, 2004. Inclusion-Driven Learned Bayesian Networks (idlBNs)
27 idlBNs
- Bayesian Networks allow one to learn from the
data those (in)dependencies that conform to an
acyclic digraph (DAG).
- Inclusion-driven structure learning algorithms
(Castelo and Kocka, 2003): under the assumption
that the data are sampled from a DAG-distribution,
and in the limit of the sample size, they
learn a correct DAG structure using a consistent
scoring metric.
29 prediction of splice sites vs. gene prediction
30 the gene prediction problem
sites, exons, genes
32 gene prediction accuracy
BG-570   SN     SP     (SN+SP)/2
PWM      0.36   0.35   0.355
FMM      0.38   0.43   0.405
idlBN    0.45   0.37   0.410
SN: fraction of true exons predicted correctly
SP: fraction of predicted exons that are correct
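The two definitions above, and the averaged column (SN+SP)/2, can be computed as in the sketch below. It assumes that an exon counts as "predicted correctly" only when both of its boundaries match exactly; that matching criterion, the coordinates, and the function name are illustrative assumptions rather than details given on the slide.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (SN, SP, (SN + SP) / 2) at the exon level.

    Exons are (start, end) tuples; a prediction is correct only if both
    coordinates match a true exon exactly (an assumption of this sketch).
    """
    true_set = set(true_exons)
    pred_set = set(predicted_exons)
    correct = len(true_set & pred_set)
    sn = correct / len(true_set) if true_set else 0.0
    sp = correct / len(pred_set) if pred_set else 0.0
    return sn, sp, (sn + sp) / 2


# Made-up coordinates: two of four true exons are recovered exactly,
# and one prediction has a wrong start coordinate.
true_exons = [(100, 200), (300, 450), (600, 720), (900, 1000)]
predicted_exons = [(100, 200), (300, 450), (610, 720)]
sn, sp, avg = exon_accuracy(true_exons, predicted_exons)
print(round(sn, 3), round(sp, 3), round(avg, 3))
```

The same arithmetic reproduces the table: for the PWM row, (0.36 + 0.35) / 2 = 0.355.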
33 (codon usage table)
34 coding statistics
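The slide titles do not say which coding statistics are used. As one common example, a codon usage table can be built by counting in-frame triplets in known coding sequence, as in the hedged sketch below; the input sequence is made up and the function name is hypothetical.

```python
from collections import Counter


def codon_usage(coding_sequence):
    """Relative frequency of each in-frame codon; trailing partial codons are ignored."""
    seq = coding_sequence.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy reading frame: ATG GCC GCC AAG TAA
usage = codon_usage("ATGGCCGCCAAGTAA")
print(usage["GCC"])
```

A gene finder can then compare how well a candidate region matches such frequencies against how well it matches non-coding background.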
35 the real accuracy
Accuracy on human chromosome 22
           sensitivity   specificity
genscan    0.79          0.53
twinscan   0.80          0.62
SGP        0.79          0.66
36 search for additional patterns
- real exons with weak splice sites, Fairbrother et al., 2002
- pseudoexons with strong splice sites, Zhang and Chasin, 2004
37 Fairbrother et al., 2002. splicing enhancers in
exons with weak sites
38 Zhang and Chasin, 2004. splicing silencers in
pseudoexons with strong sites
39 Bioinformatic approach scheme
40 G-rich motifs are able to influence 5' splice
site recognition
(figure labels: NE, ?U1, U1; constructs (1), (2), (3))
41 in collaboration with Juan Valcárcel, CRG
42 INHIBITORY EFFECT OF [...] ON 5'SS RECOGNITION BY U1
snRNP
(1) Weak 5'ss followed by a G-rich element
(2) Deletion of the G-rich element in (1)
(3) Strong 5'ss
43 TIA-1 promotes U1 snRNP binding to weak 5' splice
sites followed by uridine-rich sequences
45 the second genetic code
- genetic code
  - mapping of nucleotide triplets into the
    twenty amino acids
  - highly deterministic: a given triplet always
    codes for the same amino acid
- splicing code
  - mapping of nucleotide sequences into 3' and 5'
    intron boundaries
  - inherently stochastic: the probability of a
    splicing sequence participating in the
    definition of an intron boundary ranges from zero
    to one, and it is conditioned on very many
    different factors (which could be other sequences)
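The deterministic side of this contrast fits in a plain lookup table. The sketch below builds the standard genetic code from the conventional TCAG ordering and translates a made-up reading frame; the function name is hypothetical. No comparable table can be written for the splicing code, where a sequence would map to a context-dependent probability rather than to a fixed outcome.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG;
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 0; the same triplet always yields the same residue."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGCCTGGTGA"))  # ATG GCC TGG TGA
```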