Data representation: techniques and tradeoffs

Title: Data representation: techniques and tradeoffs


1
Data representation: techniques and trade-offs
  • Rob Knight
  • Dept. Chem. Biochem.
  • CU Boulder

2
Two conflicting goals for representing data
  • Machine-readable
  • Dot-bracket (structures)
  • Array of chars (alignments)
  • Allows standardization across algorithms,
    high-throughput analyses, but
  • Difficult to relate to expert knowledge
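
To make the machine-readable side concrete, here is a minimal sketch of how a dot-bracket structure string can be turned into a list of base pairs. The function name `dot_bracket_to_pairs` and the example string are illustrative assumptions, not part of the presentation, and only the plain `(`, `)` and `.` symbols are handled.

```python
def dot_bracket_to_pairs(structure):
    """Return sorted (i, j) base pairs (0-based) from a dot-bracket string."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical toy structure: two stacked pairs closing a 2-base loop.
example_pairs = dot_bracket_to_pairs("((..))..")
```

The output is easy for a program to consume, but it says nothing about why those pairs matter, which is the gap to expert knowledge noted above.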

WTF?
3
Two conflicting goals for representing data
  • Human-readable
  • Secondary structure pictures
  • Mutated alignments based on hand-crafted rules
  • Prohibits standardization across algorithms,
    high-throughput analyses, but
  • Allows efficient exploitation of expert knowledge

WTF?
4
Can't we all just get along?
  • NO - inherent conflicts in representation goals
  • Need to either
  • (a) find middle ground that is acceptable for
    both purposes, or
  • (b) agree to separate representations that cannot
    be interconverted without substantial manual
    intervention and/or loss of data

5
Example 1: motif representation
  • Conflicting goals:
  • Want human-readable, unique specification of
    motif
  • Want to be able to recapture arbitrary
    interactions among bases in the motif

6
What are we prepared to give up?
  • Example: kink-turn motif and reverse-kink-turn
    motif
  • If we discard out-of-order interactions, can use
    something like:
  • GCWWGASHGASHAGHS,GAAGCWWCGWW

Leontis et al. 2006
7
What are we prepared to give up?
  • ...but if we include the long-range interactions,
    we must number the bases and include arbitrary
    pairs:
  • GiCj8WWGi1Aj7SHGi2Aj6SHAi3Gj5HS,GAAGi6
    Cj1WWCi7GjWW-Ai5Gj2SSGi6Aj6SS
  • Not really human-readable or machine-readable!
  • (problem: representing arbitrary graphs is hard)
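
One way to see why numbered bases and arbitrary pairs become unavoidable is to store the motif as a graph and check which interactions cross. The sketch below is an assumption-laden illustration: the `Interaction` type, the helper names, and the toy interaction list are hypothetical and do not encode the real kink-turn; the two-letter edge labels simply follow the WW/SH/SS style used on the slide.

```python
from itertools import combinations
from typing import NamedTuple


class Interaction(NamedTuple):
    i: int       # index of the first base (0-based)
    j: int       # index of the second base (0-based)
    edges: str   # interacting edges, e.g. "WW", "SH", "SS"


def crosses(a, b):
    """True if two interactions interleave, so neither nests inside the other."""
    lo_a, hi_a = sorted((a.i, a.j))
    lo_b, hi_b = sorted((b.i, b.j))
    return lo_a < lo_b < hi_a < hi_b or lo_b < lo_a < hi_b < hi_a


def find_crossings(interactions):
    """Return every pair of interactions that a nested linear string cannot express."""
    return [(a, b) for a, b in combinations(interactions, 2) if crosses(a, b)]


# Hypothetical toy motif, for illustration only.
toy_motif = [
    Interaction(0, 5, "WW"),
    Interaction(1, 4, "SH"),   # nested inside (0, 5)
    Interaction(2, 7, "SS"),   # long-range: crosses both of the above
]
crossings = find_crossings(toy_motif)
```

Any motif with a non-empty `crossings` list needs explicit indices, which is where the linear notation stops being readable.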

8
So what's the solution?
  • Two proposals:
  • Tiered system of less complex to more complex
    linear representations depending on the type of
    motif (think of chemical nomenclature for
    substituents or cycles)
  • Use common names but require deposition of formal
    motif definition with unique accession in
    central db

9
Advantages and disadvantages
  • Tiered system
  • More human-readable
  • Difficult to parse (unless readability is
    expended), liable to incomplete or ambiguous
    specifications
  • Probably won't be able to do text search anyway
    because of journal formatting
  • Accession system
  • Easy to parse (store machine-readable connect
    list, incl. ambiguities) so can automate
    analyses
  • Can generate human-readable diagrams as output
  • Can generate specification using graphical tools
    so need not require familiarity with the file
    format
  • Requires central repository and enforcement of
    deposition
  • Question: is the community prepared to reify and
    enforce the current motif nomenclature?
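
As a rough sketch of what the accession system could look like in practice, the block below stores a motif definition under an accession and scans a sequence for it, honoring ambiguity codes. Everything named here is an assumption for illustration: the accession `MOTIF0001`, the record layout, the `MOTIF_DB` dictionary and `find_hits` are hypothetical, and the interaction annotation on the record is illustrative only. The ambiguity table is the standard IUPAC nucleotide code set, written for RNA.

```python
# Standard IUPAC ambiguity codes, written for RNA (U instead of T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}

# Hypothetical central database: accession -> machine-readable definition.
MOTIF_DB = {
    "MOTIF0001": {
        "common_name": "example tetraloop",
        "pattern": "GNRA",
        "interactions": [(0, 3, "SH")],  # illustrative connect list only
    },
}


def matches(pattern, window):
    """True if every base in the window is allowed by the pattern's code."""
    return len(pattern) == len(window) and all(
        base in IUPAC[code] for code, base in zip(pattern, window)
    )


def find_hits(accession, sequence):
    """Return 0-based start positions where the deposited pattern matches."""
    pattern = MOTIF_DB[accession]["pattern"]
    width = len(pattern)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if matches(pattern, sequence[start:start + width])
    ]


hits = find_hits("MOTIF0001", "CCGAAAGG")
```

A paper could then cite the common name plus the accession, while the analysis code relies only on the deposited record.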

Leontis et al. 2006, RNA
10
Example 2: homology
  • Fundamental problem: systems that are homologous
    at one level are not necessarily homologous at
    other levels
  • E.g. bat wings and bird wings: homologous as
    pentadactyl limbs, but not homologous as wings
  • Homology is hierarchical and can partially
    overlap at any level (e.g. Griffiths 2006)

[Figure labels: Bat forelimbs; Bird forelimbs; Frog forelimbs; Rodent forelimbs; Mammal forelimbs; Tetrapod forelimbs. Source: Ridley, Evolution, 3rd ed.]
11
Is one alignment enough?
  • Example: Lorsch & Szostak (1994), evolution of
    polynucleotide kinase from ATP aptamer (15 mut.)
  • Some recovered classes retained sequence
    similarity to the ATP aptamer, but formed new
    active sites that ignored this similarity
  • Therefore, aligning the functional regions
    is incompatible with aligning the most
    similar sequences -- need diff. alignments for
    ancestry and functionality
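
A minimal sketch of what "different alignments for ancestry and functionality" could mean as a data structure: the same ungapped sequences, stored once per purpose with different gap placement. The level names, sequence names, toy sequences and the `check_alignment_set` helper are all hypothetical and are not taken from the Lorsch and Szostak data.

```python
def degap(row):
    """Remove gap characters from one aligned row."""
    return row.replace("-", "")


def check_alignment_set(alignments):
    """Check that every level aligns the same underlying sequences.

    `alignments` maps a level name to {sequence name: gapped row}.
    Returns the shared ungapped sequences, or raises ValueError.
    """
    reference = None
    for level, rows in alignments.items():
        if len({len(row) for row in rows.values()}) != 1:
            raise ValueError(f"rows in level {level!r} differ in length")
        ungapped = {name: degap(row) for name, row in rows.items()}
        if reference is None:
            reference = ungapped
        elif ungapped != reference:
            raise ValueError(f"level {level!r} changes the sequences themselves")
    return reference


# Hypothetical toy data: one pair of sequences, two alignments.
alignments = {
    "ancestry": {"parent": "GGAUCC", "variant": "GG-UCC"},
    "function": {"parent": "GGAUCC", "variant": "-GGUCC"},
}
sequences = check_alignment_set(alignments)
```

The check enforces the one invariant that must hold across levels: the alignments may disagree about columns, but not about the sequences.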

12
Multi-level alignments
  • Can align aptamer to Class IV and Class V by
    sequence alone, but cannot align to the others
    without structural info (except in small regions)
  • General problem (for other sequences, not shown
    here): coarse-grained structure varies too, so
    cannot match up the right stems and loops w/o
    sequence information

[Figure labels: S1, B1, S2; S2, B2, S1]
13
Solution: iterative approach?
  • Need alignment to produce reasonable tree and
    structure
  • Need structure and tree to produce
    reasonable alignment
  • Need sequence sim. to anchor structure sim. b/c
    structure changes
  • Main drawback to current techniques: assume that
    all parts of sequence are equally important for
    aln!
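
The last point, that all positions are treated as equally important, can be illustrated with a position-weighted identity score. This is only a sketch under stated assumptions: `weighted_identity`, the toy rows and the weight vectors are hypothetical, and the presentation does not specify any particular scoring scheme.

```python
def weighted_identity(row_a, row_b, weights):
    """Fraction of total column weight at which two aligned rows agree.

    Columns where both rows hold the same non-gap character count as
    matches. Uniform weights give ordinary percent identity over columns.
    """
    if not (len(row_a) == len(row_b) == len(weights)):
        raise ValueError("rows and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    matched = sum(
        w for a, b, w in zip(row_a, row_b, weights) if a == b and a != "-"
    )
    return matched / total


# Hypothetical toy rows; columns 2 and 3 stand in for a functional region.
row_a = "GGAUCC"
row_b = "GGCUCA"
uniform = weighted_identity(row_a, row_b, [1, 1, 1, 1, 1, 1])
functional = weighted_identity(row_a, row_b, [0, 0, 3, 3, 0, 0])
```

With uniform weights the rows agree at four of six columns, but when only the assumed functional columns carry weight the score drops to one half, so the choice of weights changes which alignment looks best.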

14
Conclusions
  • Goal in both cases should be to connect expert
    knowledge with automated approaches -- too big a
    gap at present
  • Motifs: central database with accessions has many
    advantages, but will the community support it?
  • Alignments: probably need to move away from the
    "annotate one alignment" model towards a "many
    alignments for the same set of sequences
    depending on task" model -- needs hierarchical
    view of homology, new techniques for connecting
    levels (not Clustal!)