Title: Initial Proposal for the RNA Alignment Ontology
1Initial Proposal for the RNA Alignment Ontology
- Rob Knight
- Dept Chem Biochem
- CU Boulder
2What do we want to do?
- Represent detailed structural info and other
metadata on alignment
- Avoid horizontal and vertical expansion
- Explicitly annotate correspondences at the level
where they occur
3What do alignments look like now?
4Why is this a problem?
5so real alignments look like this, to shoehorn
everything into columns that are assumed to be
homologous
6Homology is problematic
- Fundamental problem: systems that are homologous
at one level are not necessarily homologous at
other levels
- E.g. bat wings and bird wings are homologous as
pentadactyl limbs, but not homologous as wings
- Homology is hierarchical and can partially
overlap at any level (e.g. Griffiths 2006)
[Figure: forelimb comparison with labels Bat, Bird, Frog, Rodent, Mammal, and Tetrapod forelimbs. Source: Ridley, Evolution, 3rd ed.]
7and correspondence need not be homology at all!
- Example from SELEX: hammerhead ribozymes
independently evolved at least three times in
nature, and in Jack Szostak's and Ron Breaker's
labs
- However, we still want to be able to align the
functionally equivalent sequences although there
is no evolutionary relationship
8So what are we going to use the alignment ontology
for?
9Use case 1: aligning rRNA
10Problem: have millions of fragments, want to
align (incl. noncanonical pairs) and assign
named regions
11Solution
- Use existing alignment, try to fit new seqs in
- Would be improved if we could explicitly annotate
helices, noncanonical pairs, etc. on the sequence
overall
- For display, need to easily show/hide groups of
sequences and/or regions of the sequence
12Use case 2: SELEX
- From large number of unaligned sequences, want to
identify motifs like this (Majerfeld & Yarus 2005)
13How is this currently done?
- Find regions that are similar in more sequences
than chance
- Group these sequences centered on the motif
- See if the parts of the motif can be related by
helices
- See if anything else is reliably found by the
motif
- Repeat for other families and see if there are
relationships between them
- Group these families together, then iterate
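The first two steps above (find regions shared by more sequences than chance, then group sequences centered on the motif) might be sketched as follows. This is only an illustration under simplifying assumptions: exact k-mer matches stand in for "similar regions", the null model assumes uniform and independent bases, and all function names are hypothetical rather than taken from any existing tool.

```python
from collections import defaultdict


def kmer_support(seqs, k):
    """Map each k-mer to the set of sequence indices that contain it."""
    support = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for start in range(len(seq) - k + 1):
            support[seq[start:start + k]].add(idx)
    return support


def expected_support(seqs, k):
    """Expected number of sequences containing one given k-mer by chance.

    Assumption: uniform, independent bases (p = 0.25 each). A real
    analysis would use a composition-aware null model.
    """
    p_kmer = 0.25 ** k
    total = 0.0
    for seq in seqs:
        positions = max(len(seq) - k + 1, 0)
        total += 1.0 - (1.0 - p_kmer) ** positions
    return total


def enriched_motifs(seqs, k, fold=3.0):
    """Return (kmer, sequence indices) for k-mers seen in >= fold * expected sequences."""
    threshold = fold * expected_support(seqs, k)
    hits = [(kmer, sorted(ids)) for kmer, ids in kmer_support(seqs, k).items()
            if len(ids) >= threshold]
    # Most widely shared k-mers first; ties broken alphabetically.
    return sorted(hits, key=lambda hit: (-len(hit[1]), hit[0]))


def center_on_motif(seqs, kmer, flank=3):
    """Cut a window around the first occurrence of kmer in each sequence that has it."""
    windows = []
    for seq in seqs:
        pos = seq.find(kmer)
        if pos >= 0:
            windows.append(seq[max(pos - flank, 0):pos + len(kmer) + flank])
    return windows
```

The later steps (relating motif parts by helices, merging families, iterating) would operate on the windows returned by `center_on_motif`; they are not attempted here.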
14e.g. here we discovered unpaired G important
15So how do we handle all this? A proposal
- Entities
- sequence_region: a thing that defines a set of
bases relative to some sequence (i.e. with
indices for each base)
- paired_sequence_region: two regions linked by
pairs
- helical_sequence_region: two regions completely
paired
- base: region that consists of a single nucleotide
- base_pair: region that consists of two paired
bases
- canonical_base_pair: base pair that is cis-WW
- loop: contiguous sequence_region stretching from
i to j such that i-1 and j+1 are a base pair
- etc. (bulge, internal_loop, junction, etc.)
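To make the entity definitions concrete, here is a minimal sketch of how a few of them could be written down as data types. The class and field names are hypothetical, indices are assumed to be 0-based, and base pairs are assumed to lie within a single sequence; none of this is prescribed by the proposal itself.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SequenceRegion:
    """A set of bases given by indices into one named sequence."""
    seq_id: str
    indices: Tuple[int, ...]

    def is_contiguous(self) -> bool:
        # True when every index follows the previous one directly.
        return all(b - a == 1 for a, b in zip(self.indices, self.indices[1:]))


@dataclass(frozen=True)
class BasePair:
    """Two paired bases; lw_class holds a Leontis-Westhof label."""
    i: int
    j: int
    lw_class: str = "cis-WW"

    def is_canonical(self) -> bool:
        # Follows the slide's definition: a canonical pair is cis-WW.
        return self.lw_class == "cis-WW"


def is_loop(region: SequenceRegion, pairs: Iterable[BasePair]) -> bool:
    """Contiguous region i..j such that i-1 and j+1 form a base pair."""
    if not region.indices or not region.is_contiguous():
        return False
    i, j = region.indices[0], region.indices[-1]
    return any({bp.i, bp.j} == {i - 1, j + 1} for bp in pairs)
```

For example, with base pairs (1, 9) and (2, 8), the region covering indices 3 to 7 satisfies `is_loop`, while a region with a gap in its indices does not.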
16So how do we handle all this? A proposal
- Relationships
- correspondence: relation among set of
sequence_regions implying all share a feature
(with metadata about how determined)
- homology: correspondence implying continuous
chain of descent preserving the relation
- sequence_similarity: correspondence implying
regions are similar in primary sequence
- two_d_structure_similarity: correspondence
implying regions are similar in 2D structure,
i.e. nested canonical base pairs
- secondary_structure_similarity: correspondence
implying regions are similar in secondary
structure, i.e. incl. pseudoknots/noncanonicals
- tertiary_structure_similarity: correspondence
implying regions are similar in 3D structure
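One way to picture the correspondence hierarchy is as a base type with one subtype per kind of shared feature. The sketch below is an assumption-laden illustration: the `Region` tuple encoding, the `method` field (standing in for "metadata about how determined"), the example sequence names, and the helper `shared_features` are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical encoding of a sequence_region: (sequence id, start, end).
Region = Tuple[str, int, int]


@dataclass(frozen=True)
class Correspondence:
    """A set of regions that all share some feature."""
    regions: FrozenSet[Region]
    method: str  # metadata: how the correspondence was determined


class Homology(Correspondence):
    """Continuous chain of descent preserving the relation."""


class SequenceSimilarity(Correspondence):
    """Regions are similar in primary sequence."""


class TwoDStructureSimilarity(Correspondence):
    """Regions are similar in nested canonical base pairs."""


class SecondaryStructureSimilarity(Correspondence):
    """Similar in secondary structure incl. pseudoknots/noncanonicals."""


class TertiaryStructureSimilarity(Correspondence):
    """Regions are similar in 3D structure."""


def shared_features(a, b, annotations):
    """Names of the correspondence types under which a and b are grouped."""
    return sorted({type(c).__name__ for c in annotations
                   if a in c.regions and b in c.regions})
```

With this layout, two independently evolved hammerhead ribozymes could be annotated as structurally similar without any `Homology` annotation, which matches the point that correspondence need not be homology.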
17So how do we handle all this? A proposal
- Relationships
- pairing: relation that asserts that two
sequence_regions each have parts of at least one
base_pair that connects them
- helical_pairing: pairing that includes several
base_pairs (not necessarily contiguous) between
two sequence_regions
- unbroken_helical_pairing: helical_pairing that
includes no bases in the sequence_regions that
are not paired with the other sequence_region, in
order
- base_pairing: pairing that connects exactly two
bases, annotated with the Leontis-Westhof
classification
- More exotic uses for alignment
- microrna_target: pairing relation in which one
member is a miRNA and the other is an mRNA
according to SO
- same_microrna_target: a relation among a set of
sequences that have microrna_target relation to
the same miRNA
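The pairing definitions nest inside one another, so a small classifier can return the most specific one that applies. This is a sketch under stated assumptions: "several" is read as at least two base pairs, "in order" is read as antiparallel pairing (first base of one region with last base of the other), each base has at most one partner in the other region, and the function name is hypothetical.

```python
def classify_pairing(region_a, region_b, pairs):
    """Name the most specific pairing relation between two regions.

    region_a, region_b: sequences of base indices in 5'->3' order.
    pairs: iterable of (i, j) index tuples, one per base pair.
    Returns None when no base pair connects the regions.
    """
    a_set, b_set = set(region_a), set(region_b)
    partner = {}  # base in region_a -> its partner in region_b
    for i, j in pairs:
        if i in a_set and j in b_set:
            partner[i] = j
        elif j in a_set and i in b_set:
            partner[j] = i
    if not partner:
        return None
    if len(region_a) == 1 and len(region_b) == 1:
        return "base_pairing"
    if len(partner) < 2:  # assumption: "several" means at least two
        return "pairing"
    every_base_paired = (set(partner) == a_set
                         and set(partner.values()) == b_set)
    # Assumption: "in order" means antiparallel, so the first base of
    # region_a pairs with the last base of region_b, and so on.
    if every_base_paired and (
            [partner[i] for i in region_a] == list(reversed(region_b))):
        return "unbroken_helical_pairing"
    return "helical_pairing"
```

A `microrna_target` relation could then be modeled as any non-None result in which one region belongs to a miRNA and the other to an mRNA; that check is left out because it depends on how sequences are typed.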
18Implementation notes
- Must be able to name regions (e.g. P3 in RNaseP)
and subclass them (e.g. P3 in firmicutes)
- Must be able to subclass homologies, e.g.
homologous as wing vs. homologous as limb
- Correspondences are all symmetric and transitive,
so can implement as set of regions that share the
correspondence
- (probably) don't want to reify names of parts of
well-known RNAs in the overall RNAO?
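Because correspondences are symmetric and transitive, pairwise assertions can be collapsed into disjoint sets of regions. One possible sketch uses a union-find structure, with one instance per correspondence type; the class name, method names, and region labels are hypothetical, and regions only need to be hashable.

```python
class CorrespondenceSets:
    """Disjoint sets of regions that share one kind of correspondence."""

    def __init__(self):
        self._parent = {}

    def _find(self, region):
        # Register unseen regions as their own singleton set.
        self._parent.setdefault(region, region)
        root = region
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[region] != root:
            next_region = self._parent[region]
            self._parent[region] = root
            region = next_region
        return root

    def assert_corresponds(self, a, b):
        """Record that regions a and b share the correspondence."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def corresponds(self, a, b):
        """True if a and b are linked directly or through other regions."""
        if a not in self._parent or b not in self._parent:
            return a == b
        return self._find(a) == self._find(b)

    def group(self, region):
        """All regions in the same set as region."""
        if region not in self._parent:
            return {region}
        root = self._find(region)
        return {r for r in list(self._parent) if self._find(r) == root}
```

Asserting A with B and then B with C makes `corresponds(A, C)` true without storing that pair explicitly, which is the saving the set-based implementation is after.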
19Acknowledgements
- RNA Alignment Ontology working group
- James W. Brown
- Fabrice Jossinet
- Rym Kachouri
- B. Franz Lang
- Neocles Leontis
- Gerhard Steger
- Jesse Stombaugh
- Eric Westhof
- Other coauthors
- Amanda Birmingham
- Paul Griffiths
- Franz Lang
- Knight Lab members
- Cathy Lozupone
- Micah Hamady
- Chris Lauber
- Jesse Zaneveld
- Jeremy Widmann
- Elizabeth Costello
- Jens Reeder
- Daniel McDonald
- Anh Vu
- Ryan Kennedy
- Julia Goodrich
- Meg Pirrung
- Reece Gesumaria
- Trp project
- Irene Majerfeld
- Jana Chochosolousova
- Vikas Malaiya
- Matthew Iyer
- Mike Yarus
NSF RCN grant 0443508