Title: Inferring phylogenetic trees from unaligned molecular sequences

1. Inferring phylogenetic trees from unaligned molecular sequences
Presented by Mark Ragan, based on work by Michael Höhl
ARC Centre of Excellence in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland
2. Accepted approaches to phylogenetic inference involve multiple sequence alignment
(Workflow) Unaligned homologous sequences → MSA → Tree(s),
either via distance methods (MSA → distances → tree)
or directly by maximum parsimony, maximum likelihood or Bayesian inference
3. A multiple sequence alignment is a hypothesis of homology at each and every sequence position (alignment column)
4. But multiple sequence alignment is hard
- NP-hard, in fact
- Hence the diversity of heuristic approaches (local vs global, anchored, tree-based etc.)
- How to estimate parameters?
- ...and available datasets are getting much bigger
5. Can we skip multiple sequence alignment?
(Workflow) Unaligned sequences → ? → Tree(s)
Alignment-free methods
6. What advantages might we hope to realise by taking an alignment-free approach?
- At minimum, we could eliminate the NP-hard MSA step (and perhaps replace it with something more computationally tractable) for analysis of nucleotide and protein datasets
- We should be able to make inferences based directly on rearranged, recombined, shuffled or permuted sequences
- Perhaps we can also make better inferences from incomplete and/or noisy sequences
- If results are promising, there may be scope for further improvement (unlike MSA, which has been so well-studied that most future improvements will probably be marginal)
7. Proteins with circular permutations in ProDom and Pfam
From J Weiner III & E Bornberg-Bauer, MBE 23:734-743 (2006)
8. Two key steps in alignment-free approaches (lots more detail to follow!)
1. Extract homology information from a set of homologous but unaligned sequences
2. Then use this information to infer trees
9. Types of information we can extract
- Shared words (k-mers)
- Shared patterns
- Features (e.g. lengths) of shared substrings
- Complexity measures (from information theory)
- (Non-sequence information, e.g. structure)
To what extent are these homology information?
10. Step 1: Extracting homology information from a set of homologous but unaligned sequences
(Workflow) Unaligned homologous sequences → extracted homology information → Tree(s),
either via distance methods (information → distances → tree) or by direct alignment-free methods
11. Words in sequences
- Alphabet A with c characters
- For example, c = 20 for amino acids in proteins
- w = c^k different words (k-mers) of length k
- 20^1 = 20 different 1-mers (A, R, ..., V)
- 20^2 = 400 different 2-mers (AA, AR, ..., RA, ..., VV)
- 20^3 = 8000 different 3-mers (AAA, AAR, ..., VVV)
Words (unlike patterns, introduced next) are non-degenerate
12. Finding words in sequences
- Example: 3-mers in MACADAMIA
- Sliding a window of width 3 along the sequence gives MAC, ACA, CAD, ADA, DAM, AMI, MIA
- There are L - k + 1 words (occurrences), where L = length of sequence
- In the example above: 9 - 3 + 1 = 7 words
- Finding them is fast: O(L)
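The window-sliding step above can be sketched in a few lines of Python (illustrative only; the function name is not from the decafpy package):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count the L - k + 1 overlapping k-mers of seq in a single O(L) pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("MACADAMIA", 3)
print(sum(counts.values()))  # prints 7, i.e. 9 - 3 + 1
```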
13. Generalising words to patterns (1): Elementary patterns
- L matching residues in a window of width W
- Example (L = 3, W = 3): X = ..ILM.... and Y = ....ILM.. share the pattern ILM
- Example (L = 2, W = 3): X = ..IVM..... and Y = ...ILM.. share the degenerate pattern I.M
- The pattern must occur in at least K = 2 sequences
- We discover patterns using the Teiresias algorithm (Rigoutsos & Floratos 1998), roughly O(W^L m log m), where m = size of the input set
14. Generalising words to patterns (2): Maximal patterns
- Example: the elementary patterns G.RE, REA. and EA.T combine into the maximal pattern G.REA.T (L = 3, W = 4)
- for X = ...GrREAaT and Y = GgREAtT.....
15. False positive (non-homologous) patterns
- Short degenerate patterns can occur by chance (i.e. non-homologously) multiple times in a sequence
- Response: filter patterns
- Remove patterns with > 1 instance in a sequence
- Example: X = ..BATH...BATH... and Y = .....BOTH....... (the pattern B.TH has two instances in X)
- Reduces false positives (and some true positives!)
- A variant of this filtering will be introduced later
16. Converting sets of words (k-mers) or patterns to distances and trees
(Workflow) Unaligned homologous sequences → extracted homology information → distances (distance methods) → Tree(s); direct alignment-free methods bypass the distance step
17. (Squared) Euclidean distance
- d_E: Blaisdell (1986)
- Based on words
- Example: points p1 = (x1, y1) and p2 = (x2, y2)
- len(p1, p2) = sqrt((x1 - x2)^2 + (y1 - y2)^2)
- d_E(p1, p2) = (x1 - x2)^2 + (y1 - y2)^2
- Sequences instead of points
- k-mer counts are the coordinates (*)
- X = (AAA: 0, ..., MAC: 1, ADA: 1, MIA: 1, ..., VVV: 0)
- (*) if sequences are modelled as Markov chains, the k-mer counts are generalisable as the corresponding transition matrices
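Treating k-mer counts as coordinates, d_E is a direct sum of squared count differences; a minimal sketch (illustrative, not the decafpy implementation):

```python
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d_euclidean(x, y, k=3):
    """Squared Euclidean distance between k-mer count vectors (Blaisdell 1986):
    sum of squared count differences over every word seen in either sequence."""
    cx, cy = count_kmers(x, k), count_kmers(y, k)
    return sum((cx[w] - cy[w]) ** 2 for w in set(cx) | set(cy))

print(d_euclidean("MACADAMIA", "ACADEMIA"))  # prints 7; the shared 3-mers ACA, CAD, MIA contribute 0
```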
18. Standardised Euclidean distance
- d_S: Wu, Burke & Davison (1997)
- As d_E, but each coordinate is divided by its standard deviation s_i
- Problem: words may overlap (affects variance probabilities)
- Example: 4-mers in AAAAAAAA, LALALALA, RACQRACQ
- Counts: AAAA = 5, LALA = 3, RACQ = 2
- To account for variance, we need equilibrium frequencies
- Variance: Gentleman & Mullin (1989)
19. Fraction of common k-mer counts
- d_F: Edgar (2004)
- Distance based on the fraction of common k-mer counts
- Idea: similar sequences share k-mers
- Example, 2-mers:
  at 100% ID, X = RACQ and Y = RACQ share all 2-mers (RA, AC, CQ)
  at 50% ID, X = RACQ and Y = RAKI (2-mers RA, AK, KI) share only RA
- (e is often set to 0.1)
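A sketch of the idea in Python; the denominator and the -ln(F + e) transform below are one common formulation and should be read as assumptions about the exact variant used here:

```python
import math
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d_fraction(x, y, k=2, e=0.1):
    """Distance from the fraction F of common k-mer counts.
    F = shared counts / k-mers in the shorter sequence (an assumption);
    e > 0 keeps the log defined when no k-mers are shared."""
    cx, cy = count_kmers(x, k), count_kmers(y, k)
    common = sum(min(cx[w], cy[w]) for w in cx.keys() & cy.keys())
    F = common / (min(len(x), len(y)) - k + 1)
    return -math.log(F + e)
```

Identical sequences give F = 1 (minimal distance); sharing only RA out of three 2-mers gives F = 1/3 and a larger distance.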
20. Probabilities of common k-mer counts
- d_P: adapted from van Helden (2004)
- Distance based on probabilities of common k-mer counts under a multiplicative Poisson model
- Idea: weight common words by their probabilities under a Poisson distribution; self-overlapping words are removed
- Equilibrium frequencies are required
21. Composition distance
- d_C: Hao & Qi (2004)
- Idea: predict each k-mer count from shorter words
- Expected count under a Markov model of order k-2: E(RACQ) = c(RAC) * c(ACQ) / c(AC), from the observed counts c(RAC), c(ACQ), c(AC)
- How different are these k-mer counts? v(RACQ) = (c - E) / E
- Measure the angle between the two composition vectors v(X) = (v-values for words in X) and v(Y) = (v-values for words in Y)
(Derivation of the composition vectors v isn't shown, but goes pretty much as expected)
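The recipe can be sketched as follows (illustrative; the published derivation treats normalisation more carefully than this does):

```python
import math
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def composition_vector(seq, k=3):
    """v(w) = (c(w) - E(w)) / E(w), with E(w) predicted from the two
    overlapping (k-1)-mers of w, i.e. a Markov model of order k-2."""
    ck, ck1, ck2 = (count_kmers(seq, j) for j in (k, k - 1, k - 2))
    v = {}
    for w, c in ck.items():
        overlap = ck2[w[1:-1]]
        if overlap:
            E = ck1[w[:-1]] * ck1[w[1:]] / overlap
            v[w] = (c - E) / E
    return v

def d_composition(x, y, k=3):
    """Angle-based distance between composition vectors, in (1 - cos) / 2 form."""
    vx, vy = composition_vector(x, k), composition_vector(y, k)
    dot = sum(vx.get(w, 0.0) * vy.get(w, 0.0) for w in set(vx) | set(vy))
    nx = math.sqrt(sum(t * t for t in vx.values()))
    ny = math.sqrt(sum(t * t for t in vy.values()))
    if nx == 0.0 or ny == 0.0:
        return 0.5  # undefined angle; treat as uncorrelated
    return (1.0 - dot / (nx * ny)) / 2.0
```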
22. W-metric
- d_W: Vinga, Gouveia-Oliveira & Almeida (2004)
- Based on frequencies f of words of length 1 (i.e. letters)
- Idea: weight 1-mers according to a similarity matrix
- Here, we base the pairwise weights on BLOSUM62
- Thus the approach incorporates a model of sequence change
23. Pattern-based approach
- Find all instances of patterns meeting the criteria (L, W, K)
- Extract and concatenate these for each pair of sequences (this necessarily yields a pair of strings of identical length)
- X = GrREAaTPATTeRNiNSTaNCES, Y = GgREAtTPATTiRNaNSTeNCES
- Variant d_PB-ML: estimate pairwise distances by ML under a defined model of sequence change (we use JTT)
- Variant d_PB-SIM: transform the similarity matrix S into a distance matrix D (D_ij = S_ii + S_jj - 2 S_ij), then compute distances using the BLOSUM62 matrix
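The similarity-to-distance transform used by d_PB-SIM is a one-liner; a sketch with a toy 2x2 similarity matrix (values made up for illustration):

```python
def sim_to_dist(S):
    """D_ij = S_ii + S_jj - 2 * S_ij, turning pairwise similarities into distances."""
    n = len(S)
    return [[S[i][i] + S[j][j] - 2 * S[i][j] for j in range(n)] for i in range(n)]

S = [[4, 1], [1, 3]]  # toy similarity matrix; in practice S comes from BLOSUM62 scores
print(sim_to_dist(S))  # prints [[0, 5], [5, 0]]
```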
24. Patterns can conflict
- For two sequences X = ...ACADEMIA... and Y = .AHEM.MACADAMIA.
- Patterns: A.EM, ACAD.MIA
- The two patterns align residue E in X differently (probably contradicting the notion of homology)
- Position matters!
Extension to three sequences: X = ...ACADEMIA..., Y = .AHEM.MACADAMIA., Z = .........ADAM.....
- Patterns: A.EM, ACAD.MIA, AD.M
- The X residue E aligns to residues E, A, A; the majority consensus is A
- This is the approach taken by variant d_PBMC
25. Average Common Substrings
- d_ACS: Ulitsky, Burstein, Tuller & Chor (2006)
- Idea: sum over lengths of (maximal) common substrings
- Example: X = MACADAMIA....., Y = ...ACADEMIA..
- Step 1: M matches (length 1)
- Step 2: ACAD matches (length 4)
- ...and so on along X
- Finding substrings is fast (suffix arrays)
- The sums are normalised by m and n, the lengths of X and Y
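A naive sketch of the ACS idea (quadratic; suffix arrays give the speed claimed above). The normalisation in `d_acs` is one common formulation and an assumption about the exact variant used:

```python
import math

def avg_common_substring(x, y):
    """For each position i of x, length of the longest substring of x
    starting at i that also occurs in y; return the average over i."""
    total = 0
    for i in range(len(x)):
        l = 0
        while i + l < len(x) and x[i:i + l + 1] in y:
            l += 1
        total += l
    return total / len(x)

def d_acs(x, y):
    """Symmetrised ACS distance; the log(n)/ACS term and the self-match
    correction follow one common formulation (an assumption here)."""
    def half(a, b):
        return math.log(len(b)) / avg_common_substring(a, b) - 2 * math.log(len(a)) / len(a)
    return (half(x, y) + half(y, x)) / 2

# Matches the slide's example: M contributes 1, ACAD contributes 4, etc.
print(avg_common_substring("MACADAMIA", "ACADEMIA"))  # prints 2.0
```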
26. Lempel-Ziv complexity
- d_LZ: Otu & Sayood (2003)
- Not based on words or patterns
- Idea: AAAAAAAAA is simple, MACADAMIA is complex
- Related: algorithmic complexity; pkzip, gzip etc.
- The distance is built from c(X), the number of components required to produce X, and c(XY), where (XY) refers to a concatenation of X and Y
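A sketch of an LZ76-style component count c(X) and one normalised distance built from it; Otu & Sayood define several variants, so read both the parse and the normalisation as illustrative:

```python
def c_lz(s):
    """Number of components in an LZ76-style exhaustive parse of s:
    each component extends while it can be copied from earlier text."""
    i, n, comps = 0, len(s), 0
    while i < n:
        l = 1
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        comps += 1
        i += l
    return comps

def d_lz(x, y):
    """One normalised LZ distance: extra components needed to produce the
    concatenations, scaled by the larger single-sequence complexity."""
    cx, cy, cxy, cyx = c_lz(x), c_lz(y), c_lz(x + y), c_lz(y + x)
    return max(cxy - cx, cyx - cy) / max(cx, cy)

print(c_lz("AAAAAAAAA"), c_lz("MACADAMIA"))  # prints 2 7: simple vs complex
```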
27. BAliBASE 2.0 (Thompson, Plewniak & Poch 1999 ff.)
- Reference set 1: equidistant sequences at various degrees of divergence
- Reference set 2: families aligned with a highly divergent orphan sequence
- Reference set 3: families with low (<25%) sequence identity
- Reference set 4: sequences with N- and C-terminal extensions
- Reference set 5: sequences with internal insertions
- (Reference set 8: families with circular permutations)
BAliBASE 3.0 is now available
28. Simple k-mer distances
- d_E (Euclidean distance) is sequence-length dependent
- d_S (standardised Euclidean distance) and d_W (W-metric) likewise show little linearity with sequence divergence
(Plots shown for d_E, d_S and d_W)
29. Simple k-mer distances (cont.)
- d_C (composition distance) is somewhat linear over a narrow divergence range for the AA alphabet, but this is lost with the CE alphabet
- d_LZ shows limited linearity with both AA (shown) and CE
(Plots shown for d_C with AA, d_C with CE, and d_LZ)
30. Simple k-mer distances (cont.)
- d_ACS is linear over a wider divergence range than is d_C
- d_F behaves much like d_C, but with greater dynamic range; however, 25% of distances are undefined unless e > 0
- d_P saturates: 30% of distances are 1.0 (numerical instability from multiplying small probabilities)
(Plots shown for d_ACS, d_F and d_P)
31. Do patterns perform better than words (k-mers)? First, we must parameterise Teiresias
- Remember: L matches in a window of width W, in at least K sequences
- Parameterisation using BAliBASE families (all sequences ≥ 49 amino acids in length)
- For 1 < L < 4, there are too many false positives
- For L = 4, we examined W = 8, 9, ..., 16, i.e. identity 50% to 25%
- All pairwise distances are defined only at W = 16:

  W    % defined     W    % defined     W    % defined
  8    97.36         11   99.63         14   99.94
  9    98.53         12   99.73         15   99.98
  10   99.27         13   99.91         16   100.0
32. Parameterising Teiresias (cont.)
If we could make the sequences more similar, all distances should be defined at smaller W
- Encode using chemical equivalences (CE): AG, DE, FY, KR, ILMV, QN, ST
- X: DDELPMKEGDCMTI → DDDIPIKDADCISI
- Y: EEDIDLHLGDILTV → DDDIDIHIADIISI
- With BAliBASE, all distances are now defined at W = 8
- For comparison purposes, base distances on the original AA data
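The CE recoding on this slide is a simple character map; a sketch (each group is collapsed to its first member, purely as a convention):

```python
# Chemical-equivalence (CE) groups as listed on this slide.
CE_GROUPS = ["AG", "DE", "FY", "KR", "ILMV", "QN", "ST"]
CE_MAP = {aa: group[0] for group in CE_GROUPS for aa in group}

def recode_ce(seq):
    """Collapse the 20-letter amino-acid alphabet to chemical equivalences;
    residues outside the groups (e.g. P, H, C, W) are left unchanged."""
    return "".join(CE_MAP.get(aa, aa) for aa in seq)

print(recode_ce("DDELPMKEGDCMTI"))  # prints DDDIPIKDADCISI, as on the slide
```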
33. Pattern-based distances
- AA alphabet: linear up to 2.5 substitutions/site, acceptable dynamic range
- CE alphabet: similar linearity with greater dynamic range
- Variant d_PBMC (CE): high variance at higher sequence divergence
(Plots shown for AA, d_PBMC and CE)
34. Step 2: Using extracted homology information to infer trees
35. One alternative: having computed all pairwise distances from the homology information, we can generate a distance tree using e.g. neighbour-joining or Fitch-Margoliash
(Workflow) Unaligned homologous sequences → extracted homology information → distances → Tree(s)
36. Other alternatives: there are also ways to infer trees from words and patterns that don't involve computing a distance
(Workflow) Unaligned homologous sequences → extracted homology information → Tree(s), via direct alignment-free methods
37. One non-distance approach to inference from extracted homology information
Every k-mer or pattern can be represented as a distinct character with state 1 (present) or 0 (absent):

  k-mers:      AA AC AD AE AF . . . ZZ
  Sequence A:   1  1  0  0  0 . . .  0
  Sequence B:   1  0  0  1  1 . . .  0
  Sequence Y:   0  1  1  0  1 . . .  0

A tree can then be inferred from this character-state matrix using parsimony or (better) a Bayesian approach
Other such strategies can be imagined
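Building the presence/absence matrix is straightforward; a sketch using k-mers (pattern columns would work the same way; names are illustrative):

```python
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def binary_matrix(seqs, k=2):
    """Presence (1) / absence (0) matrix: one row per sequence,
    one column per k-mer seen anywhere in the input set."""
    kmer_sets = {name: set(count_kmers(s, k)) for name, s in seqs.items()}
    columns = sorted(set().union(*kmer_sets.values()))
    rows = {name: [int(w in ks) for w in columns] for name, ks in kmer_sets.items()}
    return columns, rows

cols, rows = binary_matrix({"X": "RACQ", "Y": "RAKI"}, k=2)
print(cols)  # prints ['AC', 'AK', 'CQ', 'KI', 'RA']
```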
38. Comparing and evaluating alignment-free approaches
We want to examine:
- Alignment-free distances: are they linear? metric?
- Pattern-based methods: how to estimate parameters?
- Influence of alphabet (full amino acid vs restricted)
- Is the correct topology returned in standard cases?
- Is the correct topology returned in difficult cases, e.g. shuffled domains?
- Computational complexity
- Implementation in software
39. Synthetic datasets
- Synthetic trees generated using PhyloGen (Rambaut)
- Seven 8-taxon tree distributions with 100 trees each
- Pairwise distances, medians (± quartiles): sets 1-7 span 0.75 ... 3.42 ± 0.38 substitutions/site
- Sequences evolved along the trees under the JTT model using SEQ-GEN (Rambaut & Grassly)
- Control: 1000 AA, no across-sites rate variation (ASRV)
- ASRV: 1000 AA, high ASRV (alpha = 0.5)
40. How accurate are phylogenetic trees inferred using alignment-free distances?
- Alignment-free approach: synthetic datasets → compute distances → Tree(s)
- Classical approach: synthetic datasets → MSA → Tree(s)
- The resulting trees are compared using tree-comparison metrics
41. Robinson-Foulds topology-comparison metric
- RF(T1, T2) = 2
- RF(T1, T3) = 6 (the maximum)
(Figure: example trees T1, T2 and T3)
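Given each tree's non-trivial bipartitions (splits), RF is just a symmetric set difference; a sketch that represents each split by the set of taxa on one side:

```python
def rf_distance(splits1, splits2, taxa):
    """Robinson-Foulds distance: number of non-trivial bipartitions present in
    one tree but not the other. Each side of a split names the same bipartition,
    so we canonicalise by picking a deterministic side first."""
    def canon(split):
        side = frozenset(split)
        other = frozenset(taxa) - side
        return min(side, other, key=sorted)
    s1 = {canon(s) for s in splits1}
    s2 = {canon(s) for s in splits2}
    return len(s1 ^ s2)

# Toy 4-taxon example (names illustrative): identical trees differ by 0;
# trees whose single internal splits disagree differ by 2.
taxa = {"A", "B", "C", "D"}
print(rf_distance([{"A", "B"}], [{"A", "C"}], taxa))  # prints 2
```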
42. Tree reconstruction accuracy: control data (no ASRV)
43. Tree reconstruction accuracy: ASRV data
44. Statistical assessment (1)
All pattern-based (PB) variants are significantly more accurate than any other alignment-free method
45. Statistical assessment (2)
All pattern-based (PB) variants are significantly more accurate than any other alignment-free method
Most alignment-free methods are statistically indistinguishable from each other
-- including Bayesian binary (B-bin)
-- all are better than d_C and d_W
46. Statistical assessment (3)
All pattern-based (PB) variants are significantly more accurate than any other alignment-free method
Most alignment-free methods are statistically indistinguishable from each other
-- including Bayesian binary (B-bin)
-- all are better than d_C and d_W
d_ML is significantly more accurate than any alignment-free method
47. Empirical (putative ortholog) datasets
- 22,432 optimised MSAs and the corresponding protein-family trees (N ≥ 4 taxa) from 144 prokaryote genomes (Beiko, Harlow & Ragan)
- We further require:
-- all sequences from taxa with ≥ 4 representative genomes in the dataset
-- strong support for a clade (deep branch with BPP ≥ 0.95)
-- alignment length ≥ 200 AA with ≤ 10 weak columns
- Sort trees by number and divergence of sequences:
-- number: few (4-8) vs many (12-20)
-- divergence: short (0.5-1.0) vs long (2.5-3.0) mean subst/site, SD ≤ 0.5
- Thus four subsets:
-- few-short (50 alignments)
-- few-long (52 alignments)
-- many-short (80 alignments)
-- many-long (38 alignments)
48. False negative distances (×10) for the empirical (144-genome) dataset, ordered by ranksum
49. Another way of looking at performance: counts of unrecovered deep branches, ordered by
50. Selection of optimal k (k-mer size)
- Data generated with/without across-sites rate variation (ASRV)
- Calculate distances varying k (k-mer word length)
- Trees via neighbour-joining (NJ) and Fitch-Margoliash (FM)
- Robinson-Foulds (RF) and Quartet (Q) distance comparison metrics
- 4 combinations: RF-NJ, RF-FM, Q-NJ, Q-FM
(Workflow) Synthetic data → k-mer distance approach, varying k (with alphabet reduction) → Tree(s); compared via tree-comparison metrics against Tree(s) from the MSA approach
51. d_E with sets 1, 4 and 7
- Alphabets: AA and CE
- Distance minimised at k = 3-5
- Representative of most word-based AF methods
(Plots shown for sets 1, 4 and 7)
52. d_C with sets 1, 4 and 7
- Alphabets: AA and CE
- Distance minimised at k = 4-5
- Rough parameter space
(Plots shown for sets 1, 4 and 7)
53. B-bin for set 2: control vs ASRV data
- Alphabets: AA and CE
- Distance minimised at k = 3-4
- Representative of most word-based AF methods
(Plots shown for control and ASRV)
54. B-bin for set 4: control vs ASRV data
- Distances minimised at k = 3-5
- RF distance halved with ASRV
(Plots shown for control and ASRV)
55. B-bin for set 6: control vs ASRV data
- Distance minimised at k = 3-4
- Much smaller distance with ASRV
- Marked optimum with ASRV
(Plots shown for control and ASRV)
56. A broadly optimal k (k-mer size)?
The topological difference between alignment-free and MSA trees is minimised, for all approaches and datasets, within a narrow range of k
- For the full AA alphabet, k = 3-5
- For the reduced (CE) alphabet, k = 4-6
A pleasantly surprising result (but no theory to indicate why this might be the case)
57. Summary
- It is indeed possible to compute reasonably accurate phylogenetic trees from unaligned molecular sequences
- These trees are less accurate than those inferred from aligned sequences using the best methods
- Properly parameterised patterns extract homology information more fully than do non-degenerate k-mers (words)
- Word length (k) shows an unexpectedly tight range of optimality across the datasets we examined
- A reduced alphabet (CE) yields more accurate trees when distances are relatively large
- A Bayesian alternative yields good (but not the best) trees
- We introduce an order-of-magnitude faster implementation (PB-SIM) at a small cost in accuracy
58. Perspective and outlook
- Unlike MSA algorithms, which have been under development for decades, alignment-free methods are newer, suggesting that there may be substantial scope for improvement
- However, we should be cautious about introducing computationally intensive refinements, as these could undermine one of the major motivations for alignment-free methods in the first place
- Although the Bayesian approach didn't yield more accurate trees, it may nonetheless offer other advantages, e.g. based on posterior probabilities or ML estimates of branch lengths
- Alignment-free methods might find other applications, e.g. in phylogenetic inference based on metagenomic data, low-coverage genomic sequence, ESTs, or unalignable (e.g. intergenic) regions
59. Software available
- Package: decafpy (DistancE Calculation using Alignment-Free methods in Python)
- Available from http://www.bioinformatics.org.au/
- Free (under the GPL)
- Suite of command-line tools, complete with a description of all options, an object-oriented library, and programmatic access (API)
60. Acknowledgements
Michael Höhl !!!
- Isidore Rigoutsos, IBM TJ Watson Research Center
- Rob Beiko, IMB / Dalhousie University
- Denis Baurain, Université de Liège
- Tamir Tuller, Tel-Aviv University
61. Special thanks to: Australian Research Council, ARC Centre of Excellence in Bioinformatics, Institute for Molecular Bioscience