1
Phylogenetic Shadowing
  • Daniel L. Ong

2
Abstract
  • The human genome contains about 3 billion base
    pairs!
  • Algorithms to analyze these sequences must run in
    linear time to be tractable
  • Finding genes is important to molecular
    biologists: it is a first step to understanding
    the genome

3
Outline
  • Introduction
  • Alignments
  • Phylogenetic trees
  • Sequence models
  • Example: mRNA and scRNA models
  • Conclusions

4
Introduction to Biosequences
  • 4 nucleotides: A pairs with T, G pairs with C
  • In RNA, U replaces T
  • The NIH GenBank has 188 GB of sequence data; UC
    Santa Cruz has another 128 GB
  • The central dogma

http://web.mit.edu/esgbio/www/dogma/dogma.html
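The pairing rules and the T-to-U substitution on this slide can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this example; the input is assumed to contain only the uppercase letters A, C, G, and T.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding (sense) strand: same letters, U replaces T."""
    return coding_strand.replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(transcribe("ATGC"))
```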
5
Alignments
  • Alignment: given two sequences, insert gaps or
    allow mismatches in the input sequences to
    minimize a cost function
  • Similar to edit distance
  • Generalizes to n sequences
  • Exploited to predict genes
  • Protein-coding genes show greater sequence
    similarity
  • In structural RNA genes, base-paired positions
    mutate as a pair

http://hanuman.math.berkeley.edu/kbrowser
(Chakrabarti & Pachter, 2004)
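The "similar to edit distance" remark can be made concrete with a small dynamic program. This is a sketch under assumed unit costs (a mismatch costs 1, a gap costs 1, a match costs 0); the slides do not specify a cost function, and `alignment_cost` is a hypothetical name.

```python
def alignment_cost(a: str, b: str, mismatch: int = 1, gap: int = 1) -> int:
    """Minimum cost of globally aligning a and b (costs are assumptions)."""
    # prev[j] holds the best cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix needs i gaps
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(min(substitute, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_cost("ACGT", "AGT"))
```

With the default unit costs this is exactly the Levenshtein edit distance; other gap or mismatch penalties give other alignment scores. The table takes time proportional to the product of the two lengths, which is why whole-genome alignment needs faster heuristics.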
6
Multiple alignment
  • Considering multiple sequences allows us to
    leverage the comparative genomics paradigm
  • Functionally important regions of the genome are
    more likely to be conserved across species
  • The converse is also true
  • Genomes should be closely related
  • About 5-7 species of a family (Boffelli et al.,
    2003)
  • Additional genomes increase sensitivity (true
    positive rate) and decrease specificity (true
    negative rate)
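As a rough illustration of "conserved across species", the sketch below scores each column of a multiple alignment by the fraction of rows that agree with the most common character. This is a deliberately crude stand-in that ignores the phylogenetic tree; it is not the scoring used in the cited work. The function name and the toy alignment are made up for the example, and a gap character is treated like any other symbol.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Per-column fraction of rows matching the most common character."""
    n_rows = len(alignment)
    scores = []
    # zip(*alignment) walks the alignment column by column.
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / n_rows)
    return scores


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACCT"]  # hypothetical aligned rows of equal length
    print(column_conservation(toy))
```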

7
Phylogenetic Trees
Durbin et al., 1998
  • Use a directed binary tree to track the
    relationships between organisms
  • Each node represents the nucleotide at a
    particular position in an aligned sequence
  • Current organisms are leaves of tree (observed)
  • Internal nodes are the common ancestor
    (unobserved)
  • Edges are speciation events and represent
    evolutionary distance as an extra parameter
  • Assume each nucleotide evolves independently
    (site independent evolution)
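The tree model above (observed leaves, unobserved ancestors, branch lengths on the edges, independent sites) can be sketched with Felsenstein's pruning recursion. Everything specific here is an assumption made for illustration: the Jukes-Cantor substitution model, the uniform root distribution, the tuple-based tree encoding, and the branch lengths. Gapped columns are not handled.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x becomes y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial(node, column):
    """Return {base: P(leaves below node | node carries base)}.

    A leaf is a species name (str); an internal node is the tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    p_left, p_right = partial(left, column), partial(right, column)
    return {
        x: sum(jc_prob(x, y, t_left) * p_left[y] for y in BASES)
        * sum(jc_prob(x, y, t_right) * p_right[y] for y in BASES)
        for x in BASES
    }


def column_likelihood(tree, column) -> float:
    """Probability of one alignment column, summing over unobserved ancestors."""
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in BASES)


def alignment_log_likelihood(tree, columns) -> float:
    """Site independence: the log-likelihoods of the columns simply add."""
    return sum(math.log(column_likelihood(tree, c)) for c in columns)


if __name__ == "__main__":
    tree = (("human", 0.01, "chimp", 0.01), 0.02, "baboon", 0.05)
    col = {"human": "A", "chimp": "A", "baboon": "G"}
    print(column_likelihood(tree, col))
```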

8
Phylogenetic Tree
  • Site independent model computes probability of
    independent columns
  • Used for protein-coding genes
  • Pairwise site dependent model computes
    probability of base-paired columns
  • Used for scRNA genes

Marty Yanofsky, http://www-biology.ucsd.edu/labs/yanofsky/images/mads/phylogenetic20tree.jpg
9
How to find a Phylogenetic Tree?
  • Given n sequences, we want to find the correct
    tree topology
  • Exhaustive search works for small n
  • Maximum likelihood: choose the tree that
    maximizes the probability of the alignment
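To see why search only works for small n, it helps to count the candidates. The number of distinct rooted binary trees on n labeled leaves is the double factorial (2n-3)!! = 3 · 5 · ... · (2n-3), a standard combinatorial result. The sketch below just evaluates that product; the function name is hypothetical.

```python
def count_rooted_topologies(n_leaves: int) -> int:
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10, 20):
        print(n, count_rooted_topologies(n))
```

Already at 10 species there are over 34 million topologies, so a maximum-likelihood search over larger sets has to rely on heuristics rather than enumeration.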

10
Biosequence analysis
  • Phylogenetic trees encapsulate evolutionary time
    across sequences
  • Sequence model predicts changes along the length
    of a particular sequence
  • Sequence models are typically HMMs

11
Example: mRNA genes
  • Suppose we want to identify coding genes with an
    HMM
  • Exon: DNA segment that is transcribed and
    retained in the mature mRNA
  • Have states in HMM corresponding to exon regions
    (Alexandersson et al., 2003)
  • Other types of RNA that get transcribed from DNA
    but not translated into protein are noncoding
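A toy two-state HMM shows the idea of states corresponding to exon regions. This is only a sketch: the cited gene finder uses a much richer generalized pair HMM, and every number below (start, transition, and emission probabilities) is invented for illustration, with the "exon" state arbitrarily made GC-rich. Decoding uses the standard Viterbi algorithm in log space.

```python
import math

STATES = ("exon", "noncoding")
# All probabilities below are made-up illustration values, not fitted parameters.
START = {"exon": 0.5, "noncoding": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "noncoding": 0.1},
    "noncoding": {"exon": 0.1, "noncoding": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq: str) -> list:
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = best_prev
            new_score[s] = (
                score[best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][base])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGC"))
```

The running time is linear in the sequence length for a fixed number of states, which matches the tractability requirement stated in the abstract.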

12
Structural RNA (scRNA)
  • A sequence with many self-binding sites, forming
    a stable structure.
  • Implicated in regulating critical biochemical
    pathways

Michael W. King, http://www.indstate.edu/thcme/mwking/trna.gif
13
Example: Structural RNA
Chakrabarti & Ong, 2004
  • Due to semi-palindromic structure, sequence model
    would be a PCFG
  • Violates the site-independent assumption of
    phylogenetic trees
  • Modify to allow pairwise site-dependencies in
    addition to non-matches
  • Gene length can be in the thousands
  • Limit the length of scRNA to a constant L: time
    O(L^3 + N L^2), where N is the length of the
    multiple alignment
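The slides do not give the grammar, so the sketch below swaps in a simpler, well-known relative: the Nussinov base-pair maximization recurrence on a single RNA sequence. It is not the PCFG from the cited poster, but it has the same cubic-time, span-by-span dynamic-programming shape as CYK-style parsing and shows why nested pairings need more than an HMM. The function name, the `min_loop` parameter, and the restriction to Watson-Crick pairs are assumptions for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in rna (Nussinov-style DP).

    min_loop is the minimum number of unpaired bases required between
    the two partners of a pair.
    """
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = max pairs within rna[i..j], filled by increasing span.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))
```

The three nested loops give O(L^3) time on a sequence of length L, which is why capping the structure length matters when scanning an alignment with thousands of columns.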

14
Example: completed
Chakrabarti & Ong, 2004
  • Can combine HMM and the PCFG to form a supermodel
  • Use a generic framework to identify mRNA, scRNA,
    and other regions

15
Phylogenetic shadowing
  • Use multiple alignment of several closely related
    genomes
  • Analysis of data becomes more reliable (Boffelli
    et al., 2003)
  • More genomes reduce probability of false
    positives
  • Still need closely related species to decrease
    chance of false negatives
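The false-positive and false-negative language above maps onto two standard rates. The sketch below states them explicitly; the counts in the usage example are invented, not results from the cited study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """True positive rate: fraction of real functional regions that are found."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """True negative rate: fraction of non-functional regions correctly rejected."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(sensitivity(true_pos=8, false_neg=2))
    print(specificity(true_neg=9, false_pos=1))
```

Fewer false positives raise specificity, and fewer false negatives raise sensitivity.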

16
Conclusions
  • Phylogenetic shadowing uses a multiple alignment
    to analyze multiple genomes simultaneously,
    increasing success
  • AI techniques have proven useful in
    Computational Biology
  • Still many more problems to solve

17
References
  • M. Alexandersson, S. Cawley, and L. Pachter.
    SLAM: Cross-Species Gene Finding and Alignment
    with a Generalized Pair Hidden Markov Model.
    Genome Research, 13 (2003), p. 496-502.
    http://www.genome.org/cgi/content/abstract/13/3/496
  • D. Boffelli, J. McAuliffe, D. Ovcharenko, K.D.
    Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin.
    Phylogenetic shadowing of primate sequences to
    find functional regions of the human genome.
    Science, 299 (2003), p. 1391-1394.
    http://www.sciencemag.org/cgi/content/short/299/5611/1391
  • K. Chakrabarti and D.L. Ong. Computational
    Identification of Noncoding RNA Genes through
    Phylogenetic Shadowing. ACM/ISCB RECOMB 8
    (2004), poster.
    http://recomb04.sdsc.edu/posters/kushalcATuclink.berkeley.edu_168.pdf
  • K. Chakrabarti and L. Pachter. Visualization of
    multiple genome annotations and alignments with
    the K-BROWSER. Genome Research, 14 (2004),
    p. 716-720.
    http://www.genome.org/cgi/content/abstract/14/4/716
  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.
    Biological Sequence Analysis: Probabilistic
    Models of Proteins and Nucleic Acids. New York:
    Cambridge University Press, 1998.