Title: Inferring phylogenetic trees from unaligned molecular sequences

1. Inferring phylogenetic trees from unaligned molecular sequences
Presented by Mark Ragan, based on work by Michael Höhl
ARC Centre of Excellence in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland
2. Accepted approaches to phylogenetic inference involve multiple sequence alignment
(Workflow) Unaligned homologous sequences → MSA → Tree(s),
either via distance methods (MSA → distances → tree)
or directly by maximum parsimony, maximum likelihood or Bayesian inference
3. A multiple sequence alignment is a hypothesis of homology at each and every sequence position (alignment column)
4. But multiple sequence alignment is hard
- NP-hard, in fact
- Hence the diversity of heuristic approaches (local vs global, anchored, tree-based etc.)
- How to estimate parameters?
- ...and available datasets are getting much bigger
5. Can we skip multiple sequence alignment?
(Workflow) Unaligned sequences → ? → Tree(s)
Alignment-free methods
6. What advantages might we hope to realise by taking an alignment-free approach?
- At minimum, we could eliminate the NP-hard MSA step (and perhaps replace it with something more computationally tractable) for analysis of nucleotide and protein datasets
- We should be able to make inferences based directly on rearranged, recombined, shuffled or permuted sequences
- Perhaps we can also make better inferences from incomplete and/or noisy sequences
- If results are promising, there may be scope for further improvement (unlike MSA, which has been so well-studied that most future improvements will probably be marginal)
7. Proteins with circular permutations in ProDom and Pfam
From J Weiner III & E Bornberg-Bauer, MBE 23:734-743 (2006)
8. Two key steps in alignment-free approaches (lots more detail to follow!)
1. Extract homology information from a set of homologous but unaligned sequences
2. Then use this information to infer trees
9. Types of information we can extract
- Shared words (k-mers)
- Shared patterns
- Features (e.g. lengths) of shared substrings
- Complexity measures (from information theory)
- (Non-sequence information, e.g. structure)
To what extent are these homology information?
10. Step 1: Extracting homology information from a set of homologous but unaligned sequences
(Workflow) Unaligned homologous sequences → extracted homology information → Tree(s),
either via distance methods (information → distances → tree) or by direct alignment-free methods
11. Words in sequences
- Alphabet A with c characters
- For example, c = 20 for amino acids in proteins
- w = c^k different words (k-mers) of length k
- 20^1 = 20 different 1-mers (A, R, ..., V)
- 20^2 = 400 different 2-mers (AA, AR, ..., RA, ..., VV)
- 20^3 = 8000 different 3-mers (AAA, AAR, ..., VVV)
Words (unlike patterns, introduced next) are non-degenerate
12. Finding words in sequences
- Example: 3-mers in MACADAMIA
- Sliding a window of width 3 along the sequence gives MAC, ACA, CAD, ADA, DAM, AMI, MIA
- There are L - k + 1 words (occurrences), where L = length of sequence
- In the example above: 9 - 3 + 1 = 7 words
- Finding them is fast: O(L)
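The window-sliding step above can be sketched in a few lines of Python (illustrative only; the function name is not from the decafpy package):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count the L - k + 1 overlapping k-mers of seq in a single O(L) pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("MACADAMIA", 3)
print(sum(counts.values()))  # prints 7, i.e. 9 - 3 + 1
```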
13. Generalising words to patterns (1): Elementary patterns
- L matching residues in a window of width W
- Example (L = 3, W = 3): X = ..ILM.... and Y = ....ILM.. share the pattern ILM
- Example (L = 2, W = 3): X = ..IVM..... and Y = ...ILM.. share the degenerate pattern I.M
- The pattern must occur in at least K = 2 sequences
- We discover patterns using the Teiresias algorithm (Rigoutsos & Floratos 1998), roughly O(W^L m log m), where m = size of the input set
14. Generalising words to patterns (2): Maximal patterns
- Example: the elementary patterns G.RE, REA. and EA.T combine into the maximal pattern G.REA.T (L = 3, W = 4)
- for X = ...GrREAaT and Y = GgREAtT.....
15. False positive (non-homologous) patterns
- Short degenerate patterns can occur by chance (i.e. non-homologously) multiple times in a sequence
- Response: filter patterns
- Remove patterns with > 1 instance in a sequence
- Example: X = ..BATH...BATH... and Y = .....BOTH....... (the pattern B.TH has two instances in X)
- Reduces false positives (and some true positives!)
- A variant of this filtering will be introduced later
16. Converting sets of words (k-mers) or patterns to distances and trees
(Workflow) Unaligned homologous sequences → extracted homology information → distances (distance methods) → Tree(s); direct alignment-free methods bypass the distance step
17. (Squared) Euclidean distance
- d_E: Blaisdell (1986)
- Based on words
- Example: points p1 = (x1, y1) and p2 = (x2, y2)
- len(p1, p2) = sqrt((x1 - x2)^2 + (y1 - y2)^2)
- d_E(p1, p2) = (x1 - x2)^2 + (y1 - y2)^2
- Sequences instead of points
- k-mer counts are the coordinates (*)
- X = (AAA: 0, ..., MAC: 1, ADA: 1, MIA: 1, ..., VVV: 0)
- (*) if sequences are modelled as Markov chains, the k-mer counts are generalisable as the corresponding transition matrices
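Treating k-mer counts as coordinates, d_E is a direct sum of squared count differences; a minimal sketch (illustrative, not the decafpy implementation):

```python
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d_euclidean(x, y, k=3):
    """Squared Euclidean distance between k-mer count vectors (Blaisdell 1986):
    sum of squared count differences over every word seen in either sequence."""
    cx, cy = count_kmers(x, k), count_kmers(y, k)
    return sum((cx[w] - cy[w]) ** 2 for w in set(cx) | set(cy))

print(d_euclidean("MACADAMIA", "ACADEMIA"))  # prints 7; the shared 3-mers ACA, CAD, MIA contribute 0
```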
18. Standardised Euclidean distance
- d_S: Wu, Burke & Davison (1997)
- As d_E, but each coordinate is divided by its standard deviation s_i
- Problem: words may overlap (affects variance probabilities)
- Example: 4-mers in AAAAAAAA, LALALALA, RACQRACQ
- Counts: AAAA = 5, LALA = 3, RACQ = 2
- To account for variance, we need equilibrium frequencies
- Variance: Gentleman & Mullin (1989)
19. Fraction of common k-mer counts
- d_F: Edgar (2004)
- Distance based on the fraction of common k-mer counts
- Idea: similar sequences share k-mers
- Example, 2-mers:
  at 100% ID, X = RACQ and Y = RACQ share all 2-mers (RA, AC, CQ)
  at 50% ID, X = RACQ and Y = RAKI (2-mers RA, AK, KI) share only RA
- (e is often set to 0.1)
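A sketch of the idea in Python; the denominator and the -ln(F + e) transform below are one common formulation and should be read as assumptions about the exact variant used here:

```python
import math
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d_fraction(x, y, k=2, e=0.1):
    """Distance from the fraction F of common k-mer counts.
    F = shared counts / k-mers in the shorter sequence (an assumption);
    e > 0 keeps the log defined when no k-mers are shared."""
    cx, cy = count_kmers(x, k), count_kmers(y, k)
    common = sum(min(cx[w], cy[w]) for w in cx.keys() & cy.keys())
    F = common / (min(len(x), len(y)) - k + 1)
    return -math.log(F + e)
```

Identical sequences give F = 1 (minimal distance); sharing only RA out of three 2-mers gives F = 1/3 and a larger distance.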
20. Probabilities of common k-mer counts
- d_P: adapted from van Helden (2004)
- Distance based on probabilities of common k-mer counts under a multiplicative Poisson model
- Idea: weight common words by their probabilities under a Poisson distribution; self-overlapping words are removed
- Equilibrium frequencies are required
21. Composition distance
- d_C: Hao & Qi (2004)
- Idea: predict each k-mer count from shorter words
- Expected count under a Markov model of order k-2: E(RACQ) = c(RAC) * c(ACQ) / c(AC), from the observed counts c(RAC), c(ACQ), c(AC)
- How different are these k-mer counts? v(RACQ) = (c - E) / E
- Measure the angle between the two composition vectors v(X) = (v-values for words in X) and v(Y) = (v-values for words in Y)
(Derivation of the composition vectors v isn't shown, but goes pretty much as expected)
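The recipe can be sketched as follows (illustrative; the published derivation treats normalisation more carefully than this does):

```python
import math
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def composition_vector(seq, k=3):
    """v(w) = (c(w) - E(w)) / E(w), with E(w) predicted from the two
    overlapping (k-1)-mers of w, i.e. a Markov model of order k-2."""
    ck, ck1, ck2 = (count_kmers(seq, j) for j in (k, k - 1, k - 2))
    v = {}
    for w, c in ck.items():
        overlap = ck2[w[1:-1]]
        if overlap:
            E = ck1[w[:-1]] * ck1[w[1:]] / overlap
            v[w] = (c - E) / E
    return v

def d_composition(x, y, k=3):
    """Angle-based distance between composition vectors, in (1 - cos) / 2 form."""
    vx, vy = composition_vector(x, k), composition_vector(y, k)
    dot = sum(vx.get(w, 0.0) * vy.get(w, 0.0) for w in set(vx) | set(vy))
    nx = math.sqrt(sum(t * t for t in vx.values()))
    ny = math.sqrt(sum(t * t for t in vy.values()))
    if nx == 0.0 or ny == 0.0:
        return 0.5  # undefined angle; treat as uncorrelated
    return (1.0 - dot / (nx * ny)) / 2.0
```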
22. W-metric
- d_W: Vinga, Gouveia-Oliveira & Almeida (2004)
- Based on frequencies f of words of length 1 (i.e. letters)
- Idea: weight 1-mers according to a similarity matrix
- Here, we base the pairwise weights on BLOSUM62
- Thus the approach incorporates a model of sequence change
23. Pattern-based approach
- Find all instances of patterns meeting the criteria (L, W, K)
- Extract and concatenate these for each pair of sequences (this necessarily yields a pair of strings of identical length)
- X = GrREAaTPATTeRNiNSTaNCES, Y = GgREAtTPATTiRNaNSTeNCES
- Variant d_PB-ML: estimate pairwise distances by ML under a defined model of sequence change (we use JTT)
- Variant d_PB-SIM: transform the similarity matrix S into a distance matrix D (D_ij = S_ii + S_jj - 2 S_ij), then compute distances using the BLOSUM62 matrix
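The similarity-to-distance transform used by d_PB-SIM is a one-liner; a sketch with a toy 2x2 similarity matrix (values made up for illustration):

```python
def sim_to_dist(S):
    """D_ij = S_ii + S_jj - 2 * S_ij, turning pairwise similarities into distances."""
    n = len(S)
    return [[S[i][i] + S[j][j] - 2 * S[i][j] for j in range(n)] for i in range(n)]

S = [[4, 1], [1, 3]]  # toy similarity matrix; in practice S comes from BLOSUM62 scores
print(sim_to_dist(S))  # prints [[0, 5], [5, 0]]
```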
24. Patterns can conflict
- For two sequences X = ...ACADEMIA... and Y = .AHEM.MACADAMIA.
- Patterns: A.EM, ACAD.MIA
- The two patterns align residue E in X differently (probably contradicting the notion of homology)
- Position matters!
Extension to three sequences: X = ...ACADEMIA..., Y = .AHEM.MACADAMIA., Z = .........ADAM.....
- Patterns: A.EM, ACAD.MIA, AD.M
- The X residue E aligns to residues E, A, A; the majority consensus is A
- This is the approach taken by variant d_PBMC
25. Average Common Substrings
- d_ACS: Ulitsky, Burstein, Tuller & Chor (2006)
- Idea: sum over lengths of (maximal) common substrings
- Example: X = MACADAMIA....., Y = ...ACADEMIA..
- Step 1: M matches (length 1)
- Step 2: ACAD matches (length 4)
- ...and so on along X
- Finding substrings is fast (suffix arrays)
- The sums are normalised by m and n, the lengths of X and Y
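A naive sketch of the ACS idea (quadratic; suffix arrays give the speed claimed above). The normalisation in `d_acs` is one common formulation and an assumption about the exact variant used:

```python
import math

def avg_common_substring(x, y):
    """For each position i of x, length of the longest substring of x
    starting at i that also occurs in y; return the average over i."""
    total = 0
    for i in range(len(x)):
        l = 0
        while i + l < len(x) and x[i:i + l + 1] in y:
            l += 1
        total += l
    return total / len(x)

def d_acs(x, y):
    """Symmetrised ACS distance; the log(n)/ACS term and the self-match
    correction follow one common formulation (an assumption here)."""
    def half(a, b):
        return math.log(len(b)) / avg_common_substring(a, b) - 2 * math.log(len(a)) / len(a)
    return (half(x, y) + half(y, x)) / 2

# Matches the slide's example: M contributes 1, ACAD contributes 4, etc.
print(avg_common_substring("MACADAMIA", "ACADEMIA"))  # prints 2.0
```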
26. Lempel-Ziv complexity
- d_LZ: Otu & Sayood (2003)
- Not based on words or patterns
- Idea: AAAAAAAAA is simple, MACADAMIA is complex
- Related: algorithmic complexity; pkzip, gzip etc.
- The distance is built from c(X), the number of components required to produce X, and c(XY), where (XY) refers to a concatenation of X and Y
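A sketch of an LZ76-style component count c(X) and one normalised distance built from it; Otu & Sayood define several variants, so read both the parse and the normalisation as illustrative:

```python
def c_lz(s):
    """Number of components in an LZ76-style exhaustive parse of s:
    each component extends while it can be copied from earlier text."""
    i, n, comps = 0, len(s), 0
    while i < n:
        l = 1
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        comps += 1
        i += l
    return comps

def d_lz(x, y):
    """One normalised LZ distance: extra components needed to produce the
    concatenations, scaled by the larger single-sequence complexity."""
    cx, cy, cxy, cyx = c_lz(x), c_lz(y), c_lz(x + y), c_lz(y + x)
    return max(cxy - cx, cyx - cy) / max(cx, cy)

print(c_lz("AAAAAAAAA"), c_lz("MACADAMIA"))  # prints 2 7: simple vs complex
```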
27. BAliBASE 2.0 (Thompson, Plewniak & Poch 1999 ff.)
- Reference set 1: equidistant sequences at various degrees of divergence
- Reference set 2: families aligned with a highly divergent orphan sequence
- Reference set 3: families with low (<25%) sequence identity
- Reference set 4: sequences with N- and C-terminal extensions
- Reference set 5: sequences with internal insertions
- (Reference set 8: families with circular permutations)
BAliBASE 3.0 is now available
28. Simple k-mer distances
- d_E (Euclidean distance) is sequence-length dependent
- d_S (standardised Euclidean distance) and d_W (W-metric) likewise show little linearity with sequence divergence
(Plots shown for d_E, d_S and d_W)
29. Simple k-mer distances (cont.)
- d_C (composition distance) is somewhat linear over a narrow divergence range for the AA alphabet, but this is lost with the CE alphabet
- d_LZ shows limited linearity with both AA (shown) and CE
(Plots shown for d_C with AA, d_C with CE, and d_LZ)
30. Simple k-mer distances (cont.)
- d_ACS is linear over a wider divergence range than is d_C
- d_F behaves much like d_C, but with greater dynamic range; however, 25% of distances are undefined unless e > 0
- d_P saturates: 30% of distances are 1.0 (numerical instability from multiplying small probabilities)
(Plots shown for d_ACS, d_F and d_P)
31. Do patterns perform better than words (k-mers)? First, we must parameterise Teiresias
- Remember: L matches in a window of width W, in at least K sequences
- Parameterisation using BAliBASE families (all sequences ≥ 49 amino acids in length)
- For 1 < L < 4, there are too many false positives
- For L = 4, we examined W = 8, 9, ..., 16, i.e. identity 50% to 25%
- All pairwise distances are defined only at W = 16:

  W    % defined     W    % defined     W    % defined
  8    97.36         11   99.63         14   99.94
  9    98.53         12   99.73         15   99.98
  10   99.27         13   99.91         16   100.0
32. Parameterising Teiresias (cont.)
If we could make the sequences more similar, all distances should be defined at smaller W
- Encode using chemical equivalences (CE): AG, DE, FY, KR, ILMV, QN, ST
- X: DDELPMKEGDCMTI → DDDIPIKDADCISI
- Y: EEDIDLHLGDILTV → DDDIDIHIADIISI
- With BAliBASE, all distances are now defined at W = 8
- For comparison purposes, base distances on the original AA data
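The CE recoding on this slide is a simple character map; a sketch (each group is collapsed to its first member, purely as a convention):

```python
# Chemical-equivalence (CE) groups as listed on this slide.
CE_GROUPS = ["AG", "DE", "FY", "KR", "ILMV", "QN", "ST"]
CE_MAP = {aa: group[0] for group in CE_GROUPS for aa in group}

def recode_ce(seq):
    """Collapse the 20-letter amino-acid alphabet to chemical equivalences;
    residues outside the groups (e.g. P, H, C, W) are left unchanged."""
    return "".join(CE_MAP.get(aa, aa) for aa in seq)

print(recode_ce("DDELPMKEGDCMTI"))  # prints DDDIPIKDADCISI, as on the slide
```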
33. Pattern-based distances
- AA alphabet: linear up to 2.5 substitutions/site, acceptable dynamic range
- CE alphabet: similar linearity with greater dynamic range
- Variant d_PBMC (CE): high variance at higher sequence divergence
(Plots shown for AA, d_PBMC and CE)
34. Step 2: Using extracted homology information to infer trees
35. One alternative: having computed all pairwise distances from the homology information, we can generate a distance tree using e.g. neighbour-joining or Fitch-Margoliash
(Workflow) Unaligned homologous sequences → extracted homology information → distances → Tree(s)
36. Other alternatives: there are also ways to infer trees from words and patterns that don't involve computing a distance
(Workflow) Unaligned homologous sequences → extracted homology information → Tree(s), via direct alignment-free methods
37. One non-distance approach to inference from extracted homology information
Every k-mer or pattern can be represented as a distinct character with state 1 (present) or 0 (absent):

  k-mers:      AA AC AD AE AF . . . ZZ
  Sequence A:   1  1  0  0  0 . . .  0
  Sequence B:   1  0  0  1  1 . . .  0
  Sequence Y:   0  1  1  0  1 . . .  0

A tree can then be inferred from this character-state matrix using parsimony or (better) a Bayesian approach
Other such strategies can be imagined
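Building the presence/absence matrix is straightforward; a sketch using k-mers (pattern columns would work the same way; names are illustrative):

```python
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def binary_matrix(seqs, k=2):
    """Presence (1) / absence (0) matrix: one row per sequence,
    one column per k-mer seen anywhere in the input set."""
    kmer_sets = {name: set(count_kmers(s, k)) for name, s in seqs.items()}
    columns = sorted(set().union(*kmer_sets.values()))
    rows = {name: [int(w in ks) for w in columns] for name, ks in kmer_sets.items()}
    return columns, rows

cols, rows = binary_matrix({"X": "RACQ", "Y": "RAKI"}, k=2)
print(cols)  # prints ['AC', 'AK', 'CQ', 'KI', 'RA']
```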
38. Comparing and evaluating alignment-free approaches
We want to examine:
- Alignment-free distances: are they linear? metric?
- Pattern-based methods: how to estimate parameters?
- Influence of alphabet (full amino acid vs restricted)
- Is the correct topology returned in standard cases?
- Is the correct topology returned in difficult cases, e.g. shuffled domains?
- Computational complexity
- Implementation in software
39. Synthetic datasets
- Synthetic trees generated using PhyloGen (Rambaut)
- Seven 8-taxon tree distributions with 100 trees each
- Pairwise distances, medians (± quartiles): sets 1-7 span 0.75 ... 3.42 ± 0.38 substitutions/site
- Sequences evolved along the trees under the JTT model using SEQ-GEN (Rambaut & Grassly)
- Control: 1000 AA, no across-sites rate variation (ASRV)
- ASRV: 1000 AA, high ASRV (alpha = 0.5)
40. How accurate are phylogenetic trees inferred using alignment-free distances?
- Alignment-free approach: synthetic datasets → compute distances → Tree(s)
- Classical approach: synthetic datasets → MSA → Tree(s)
- The resulting trees are compared using tree-comparison metrics
41. Robinson-Foulds topology-comparison metric
- RF(T1, T2) = 2
- RF(T1, T3) = 6 (the maximum)
(Figure: example trees T1, T2 and T3)
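Given each tree's non-trivial bipartitions (splits), RF is just a symmetric set difference; a sketch that represents each split by the set of taxa on one side:

```python
def rf_distance(splits1, splits2, taxa):
    """Robinson-Foulds distance: number of non-trivial bipartitions present in
    one tree but not the other. Each side of a split names the same bipartition,
    so we canonicalise by picking a deterministic side first."""
    def canon(split):
        side = frozenset(split)
        other = frozenset(taxa) - side
        return min(side, other, key=sorted)
    s1 = {canon(s) for s in splits1}
    s2 = {canon(s) for s in splits2}
    return len(s1 ^ s2)

# Toy 4-taxon example (names illustrative): identical trees differ by 0;
# trees whose single internal splits disagree differ by 2.
taxa = {"A", "B", "C", "D"}
print(rf_distance([{"A", "B"}], [{"A", "C"}], taxa))  # prints 2
```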
42. Tree reconstruction accuracy: control data (no ASRV)
43. Tree reconstruction accuracy: ASRV data
44. Statistical assessment (1)
All pattern-based (PB) variants are significantly more accurate than any other alignment-free method
45. Statistical assessment (2)
All pattern-based (PB) variants are significantly more accurate than any other alignment-free method
Most alignment-free methods are statistically indistinguishable from each other
-- including Bayesian binary (B-bin)
-- all are better than d_C and d_W
46. Statistical assessment (3)
All pattern-based (PB) variants are significantly more accurate than any other alignment-free method
Most alignment-free methods are statistically indistinguishable from each other
-- including Bayesian binary (B-bin)
-- all are better than d_C and d_W
d_ML is significantly more accurate than any alignment-free method
47. Empirical (putative ortholog) datasets
- 22,432 optimised MSAs and the corresponding protein-family trees (N ≥ 4 taxa) from 144 prokaryote genomes (Beiko, Harlow & Ragan)
- We further require:
-- all sequences from taxa with ≥ 4 representative genomes in the dataset
-- strong support for a clade (deep branch with BPP ≥ 0.95)
-- alignment length ≥ 200 AA with ≤ 10 weak columns
- Sort trees by number and divergence of sequences:
-- number: few (4-8) vs many (12-20)
-- divergence: short (0.5-1.0) vs long (2.5-3.0) mean subst/site, SD ≤ 0.5
- Thus four subsets:
-- few-short (50 alignments)
-- few-long (52 alignments)
-- many-short (80 alignments)
-- many-long (38 alignments)
48. False negative distances (×10) for the empirical (144-genome) dataset, ordered by ranksum
49. Another way of looking at performance: counts of unrecovered deep branches, ordered by
50. Selection of optimal k (k-mer size)
- Data generated with/without across-sites rate variation (ASRV)
- Calculate distances varying k (k-mer word length)
- Trees via neighbour-joining (NJ) and Fitch-Margoliash (FM)
- Robinson-Foulds (RF) and Quartet (Q) distance comparison metrics
- 4 combinations: RF-NJ, RF-FM, Q-NJ, Q-FM
(Workflow) Synthetic data → k-mer distance approach, varying k (with alphabet reduction) → Tree(s); compared via tree-comparison metrics against Tree(s) from the MSA approach
51. d_E with sets 1, 4 and 7
- Alphabets: AA and CE
- Distance minimised at k = 3-5
- Representative of most word-based AF methods
(Plots shown for sets 1, 4 and 7)
52. d_C with sets 1, 4 and 7
- Alphabets: AA and CE
- Distance minimised at k = 4-5
- Rough parameter space
(Plots shown for sets 1, 4 and 7)
53. B-bin for set 2: control vs ASRV data
- Alphabets: AA and CE
- Distance minimised at k = 3-4
- Representative of most word-based AF methods
(Plots shown for control and ASRV)
54. B-bin for set 4: control vs ASRV data
- Distances minimised at k = 3-5
- RF distance halved with ASRV
(Plots shown for control and ASRV)
55. B-bin for set 6: control vs ASRV data
- Distance minimised at k = 3-4
- Much smaller distance with ASRV
- Marked optimum with ASRV
(Plots shown for control and ASRV)
56. A broadly optimal k (k-mer size)?
The topological difference between alignment-free and MSA trees is minimised, for all approaches and datasets, within a narrow range of k
- For the full AA alphabet, k = 3-5
- For the reduced (CE) alphabet, k = 4-6
A pleasantly surprising result (but no theory to indicate why this might be the case)
57. Summary
- It is indeed possible to compute reasonably accurate phylogenetic trees from unaligned molecular sequences
- These trees are less accurate than those inferred from aligned sequences using the best methods
- Properly parameterised patterns extract homology information more fully than do non-degenerate k-mers (words)
- Word length (k) shows an unexpectedly tight range of optimality across the datasets we examined
- A reduced alphabet (CE) yields more accurate trees when distances are relatively large
- A Bayesian alternative yields good (but not the best) trees
- We introduce an order-of-magnitude faster implementation (PB-SIM) at a small cost in accuracy
58. Perspective and outlook
- Unlike MSA algorithms, which have been under development for decades, alignment-free methods are newer, suggesting that there may be substantial scope for improvement
- However, we should be cautious about introducing computationally intensive refinements, as these could undermine one of the major motivations for alignment-free methods in the first place
- Although the Bayesian approach didn't yield more accurate trees, it may nonetheless offer other advantages, e.g. based on posterior probabilities or ML estimates of branch lengths
- Alignment-free methods might find other applications, e.g. in phylogenetic inference based on metagenomic data, low-coverage genomic sequence, ESTs, or unalignable (e.g. intergenic) regions
59. Software available
- Package: decafpy (DistancE Calculation using Alignment-Free methods in Python)
- Available from http://www.bioinformatics.org.au/
- Free (under the GPL)
- Suite of command-line tools, complete with a description of all options, an object-oriented library, and programmatic access (API)
60. Acknowledgements
Michael Höhl !!!
- Isidore Rigoutsos, IBM TJ Watson Research Center
- Rob Beiko, IMB / Dalhousie University
- Denis Baurain, Université de Liège
- Tamir Tuller, Tel-Aviv University
61. Special thanks to: Australian Research Council, ARC Centre of Excellence in Bioinformatics, Institute for Molecular Bioscience