Title: 1 of 47
1 CleanTAX: An Infrastructure for Reasoning about Biological Taxonomies
Dave Thau and Bertram Ludäscher
Keywords: knowledge management, automatic reasoning, semantic integration, biological classification
2 Outline
- Brief Overview of Taxonomies
- Impact of Different Taxonomic Views on Data Analysis
- Taxonomies and Relations Between Them
- Using Logic to Determine Inconsistencies and Discover New Relations
- Initial Results of Large Scale Analysis
- Some Optimizations
- Future Work
3 Beginnings of Biological Taxonomy
Egypt, 1500 BC: Ebers medical papyrus, classification of medicinal plants
China, 350 BC: Erh-ya dictionary (second century BC) classifies trees, grasses, herbs, grains, vegetables
Greece, 300 BC: Theophrastus, Historia plantarum and Causae plantarum; 500 plants: trees, herbs, fruiting plants, perennials
4 Taxonomies are Everywhere: Systematics
kingdom: Plantae
phylum: Tracheophyta
class: Magnoliopsida
order: Ranunculales
family: Ranunculaceae
genus: Ranunculus
species: Ranunculus asiaticus
5 Taxonomies are Everywhere: The Dewey Decimal System
- 000 Computers and general reference
- 100 Philosophy and psychology
- 200 Religion
- 300 Social sciences
- 400 Language
- 500 Science
- 600 Technology
- 700 Arts and Recreation
- 800 Literature
- 900 History and geography
6 Taxonomies are Everywhere: Phylogenies
From Thomas D. Als, Roger Vila, Nikolai P.
Kandul, David R. Nash, Shen-Horn Yen, Yu-Feng
Hsu, André A. Mignault, Jacobus J. Boomsma and
Naomi E. Pierce. Nature 432, 386-390.
7 Taxonomies are Everywhere: Protein Structure
From Ed Green, http://compbio.berkeley.edu/people/ed/SeqCompEval/
8 Taxonomies are Useful, But Slippery
- In all of these cases, taxonomies
- Help us organize information
- Allow us to make inferences at many levels of generality
- However, taxonomies are simply "views" of real data
- Dewey Decimal or Library of Congress?
- Benson's view of Ranunculus or Kartesz's view?
- Conflicting phylogenies are common
- SCOP versus CATH
9 Different Taxonomies Can Lead To Different Results
[Figure: two maps, the predicted distribution of Anhinga melanogaster based on Clement's 4th Edition and the predicted distribution of Anhinga melanogaster based on Clement's 5th Edition. Each map is paired with a taxonomy diagram in which species are linked to Anhinga by "is a" edges; the taxa shown are Anhinga rufa, Anhinga nova., and Anhinga melanogaster. "contained in" articulations link the two editions, and some relations are marked "?".]
Articulations by Santa Barbara Software Products. Georeferenced observation data retrieved from The Global Biodiversity Information Facility, www.gbif.org. Distribution maps created using the GARP niche modeling algorithm embedded in a Kepler workflow.
10 Different Taxonomies Complicate Data Analysis
- What was the average number of Ranunculus arizonicus seen in transect 1 in 2005?
11 Reasoning With Taxonomic Concepts
- Peet05 articulates relations between Benson48 and Kartesz04 names
- Is that articulation consistent?
- Can we infer additional information?
12 Problem Statement
- What are taxonomies, anyway?
- How do you know a taxonomy makes sense?
- Given some articulations meant to translate between taxonomies
- do they make sense, or are there internal contradictions?
- have they left out anything which may be inferred logically?
13 What are Taxonomies?
- A simple definition: a directed acyclic graph of nodes and edges, where the edges represent a "subtype" relation
[Figure: Anhinga with three "is a" children: Anhinga rufa, Anhinga nova., and Anhinga melanogaster]
Potential additional constraints:
- children are disjoint (child-disjointness, D)
- children partition their parents (coverage, C)
- nodes are non-empty (non-emptiness, N)
We call these "latent taxonomic assumptions" (LTAs)
- More than one LTA may apply
- 8 combinations: none, C, D, N, CD, CN, DN, CDN
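The eight LTA combinations are simply the subsets of {C, D, N}. A minimal sketch of enumerating them; the function name `lta_combinations` is illustrative and not part of CleanTAX:

```python
from itertools import combinations

LTAS = ("C", "D", "N")  # coverage, child-disjointness, non-emptiness

def lta_combinations(ltas=LTAS):
    """Return every subset of the LTAs, smallest subsets first."""
    return [frozenset(combo)
            for size in range(len(ltas) + 1)
            for combo in combinations(ltas, size)]

for combo in lta_combinations():
    # The empty subset corresponds to "none" on the slide.
    print("".join(sorted(combo)) or "none")
```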
14 Inconsistency in a Taxonomy
- Inconsistent under the ND (non-emptiness and disjoint children) LTAs.
[Figure: a taxonomy over nodes A, B, C, and D]
If B and C are children of A, then they must be disjoint. However, they both contain elements of D, and non-emptiness requires D to have at least one element.
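This inconsistency can be checked mechanically without a theorem prover, because every taxon is a one-place predicate: it is enough to enumerate which sets of taxa a single specimen may belong to. The sketch below is a brute-force illustration, not the CleanTAX implementation, and the edge list (D isa B, D isa C) is an assumption read off the slide's description.

```python
from itertools import product

def allowed_patterns(nodes, isa_edges, disjoint_pairs):
    """List the sets of taxa that one specimen may belong to.

    A pattern is allowed when it respects every isa edge (child, parent)
    and contains no pair of taxa declared disjoint.
    """
    patterns = []
    for bits in product([False, True], repeat=len(nodes)):
        members = {n for n, bit in zip(nodes, bits) if bit}
        if any(c in members and p not in members for c, p in isa_edges):
            continue
        if any(a in members and b in members for a, b in disjoint_pairs):
            continue
        patterns.append(frozenset(members))
    return patterns

def is_consistent(nodes, isa_edges, disjoint_pairs, non_empty):
    """Consistent iff each taxon required to be non-empty fits some pattern."""
    patterns = allowed_patterns(nodes, isa_edges, disjoint_pairs)
    return all(any(n in p for p in patterns) for n in non_empty)

NODES = ["A", "B", "C", "D"]
ISA = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]  # assumed edges

print(is_consistent(NODES, ISA, [("B", "C")], NODES))  # N and D together
print(is_consistent(NODES, ISA, [("B", "C")], []))     # D only
print(is_consistent(NODES, ISA, [], NODES))            # N only
```

Only the combination of N and D fails: D alone is satisfied by leaving D empty, and N alone by letting B and C share D's specimens.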
15 How do Taxonomies Relate?
- Articulations relate nodes between taxonomies
Between any two nodes in the taxonomies, one, and only one, of the following five relations must hold:
(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N
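For two concrete, non-empty sets of specimens, exactly one of the five relations holds. A small illustrative classifier (the function name `relation` is hypothetical; the slides do not give one):

```python
def relation(m, n):
    """Name the one relation that holds between two non-empty specimen sets."""
    m, n = set(m), set(n)
    if not m or not n:
        raise ValueError("both taxa must be non-empty")
    if m == n:
        return "congruence"
    if m > n:                      # M strictly contains N
        return "proper inclusion"
    if m < n:                      # N strictly contains M
        return "proper inverse inclusion"
    if m & n:                      # shared specimens, plus some on each side
        return "partial overlap"
    return "exclusion"

print(relation({1, 2, 3}, {1, 2}))
print(relation({1, 2}, {2, 3}))
print(relation({1}, {2}))
```

The non-emptiness check matters: with an empty taxon, more than one of the relations would be satisfied at once.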
16 Many Possible Articulation Sets
[Figure: FNA-03, 1997 and Benson, 1948, each rooted at Ranunculus aquatilis. Varieties shown: R.a. var aquatilis, R.a. var diffusus, R.a. var hispidulus, R.a. var capillaceus, R.a. var calvescens. The drawn articulations are labeled ≡ and <.]
Five relationships, plus "unknown/unstated relation", and 3 x 4 nodes result in 6^12 = 2,176,782,336 (over 2 billion) sets of articulations.
17 Articulations: Some Make Sense
Taxonomy 1: A, with children B and C (isa)
Taxonomy 2: D, with children E and F (isa)
Articulations: A < D, C ≡ E, B < F
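18 Articulations: Some Are Impossible
Taxonomy 1: A, with children B and C (isa)
Taxonomy 2: D, with children E and F (isa)
Articulations: C > F, B < F
Assuming the non-emptiness and disjoint children LTAs: B < F and C > F together place B inside C, so a non-empty B cannot be disjoint from C.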
19 Articulations: Some Imply other Articulations
Taxonomy 1: A, with children B and C (isa)
Taxonomy 2: D, with children E and F (isa)
Articulations: A ≡ D, C ≡ E
Implies: B ≡ F
Assuming non-emptiness, disjoint children and coverage LTAs
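The implied articulation can be confirmed by brute force: under the C and D assumptions plus the two stated congruences, no specimen can belong to B without F, or to F without B. The sketch below illustrates the idea and is not the prover-based method used by CleanTAX; it handles congruence articulations only, and all function names are hypothetical.

```python
from itertools import combinations, product

def allowed_patterns(children, congruences):
    """Enumerate the sets of taxa one specimen may belong to.

    children maps parent -> list of children; congruences is a list of
    (m, n) pairs stated to be congruent.
    """
    nodes = sorted(set(children) | {k for kids in children.values() for k in kids})
    patterns = []
    for bits in product([False, True], repeat=len(nodes)):
        t = {n for n, bit in zip(nodes, bits) if bit}
        ok = all((m in t) == (n in t) for m, n in congruences)
        for parent, kids in children.items():
            ok = ok and all(k not in t or parent in t for k in kids)   # isa
            ok = ok and not any(a in t and b in t
                                for a, b in combinations(kids, 2))     # D
            ok = ok and (parent not in t or any(k in t for k in kids)) # C
        if ok:
            patterns.append(frozenset(t))
    return nodes, patterns

def is_consistent(children, congruences):
    """Non-emptiness (N): every taxon must occur in some allowed pattern."""
    nodes, patterns = allowed_patterns(children, congruences)
    return all(any(n in p for p in patterns) for n in nodes)

def entails_congruence(children, congruences, m, n):
    """True when no allowed pattern holds one of m, n without the other.

    Meaningful only when is_consistent(...) is also True.
    """
    _, patterns = allowed_patterns(children, congruences)
    return all((m in p) == (n in p) for p in patterns)

TAXONOMIES = {"A": ["B", "C"], "D": ["E", "F"]}
STATED = [("A", "D"), ("C", "E")]

print(is_consistent(TAXONOMIES, STATED))
print(entails_congruence(TAXONOMIES, STATED, "B", "F"))
print(entails_congruence(TAXONOMIES, [("A", "D")], "B", "F"))
```

Dropping the C ≡ E articulation breaks the inference, which is why the last call reports that B ≡ F no longer follows.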
20 The Relation Lattice
- Sometimes, a single relation between two nodes is unknown.
- The relation lattice shows all 32 possible combined relations.
- Each node represents a disjunction of relations.
21 The Complexity of Developing Articulations
The Ranunculus data set: 9 taxonomies, 654 taxa, 704 articulations. Visualization by Martin Graham.
22 Example Articulation Set
[Figure: articulations between Benson, 1948 and Kartesz, 2004 over nodes labeled A through M. Node key: A = R. petiolaris, B = R. macranthus, C = R. fascicularis. Edge legend: is included in; equals; O = overlaps; X = disjoint.]
23 Goal: To Help Bob Know
- that the taxonomies he's working with are consistent
- when he's introduced an articulation that leads to inconsistency
- when an articulation is implied by others
- about ambiguous articulations
24 Berendsohn et al., 2003: MoReTaX
25 Logic Based Approach
- Devise a language LTax
- First-order logic constraints on single-place predicates, where each predicate is a "taxon"
- Render taxonomies and articulations between them into a set of first-order formulas
- Then we can ask:
- does a taxonomy follow your definition of taxonomy?
- is a pair of taxonomies plus articulations between them consistent?
- are there unstated articulations?
26 Translating Taxonomy into Logic
Taxonomy and latent taxonomic assumption (LTA) formulas
isa: for each edge M isa N, add $\forall x\,(M(x) \rightarrow N(x))$
Non-Emptiness (N): for each node N, add $\exists x\,N(x)$
Child Disjointness (D): for each two children $N_1$, $N_2$ of M, add $\forall x\,(N_1(x) \rightarrow \lnot N_2(x))$, equivalently $\lnot\exists x\,(N_1(x) \land N_2(x))$
Coverage (C): for each node M with children $N_1, \ldots, N_L$, add $\forall x\,(M(x) \rightarrow N_1(x) \lor \cdots \lor N_L(x))$
Articulation formulas
Congruence, $M \equiv N$: $\forall x\,(M(x) \leftrightarrow N(x))$
Proper Inclusion, $M > N$: $\forall x\,(N(x) \rightarrow M(x)) \land \exists a\,(M(a) \land \lnot N(a))$
Proper Inverse Inclusion, $M < N$: $\forall x\,(M(x) \rightarrow N(x)) \land \exists a\,(N(a) \land \lnot M(a))$
Partial Overlap, $M \; o \; N$: $\exists a\,\exists b\,\exists c\,(M(a) \land N(a) \land M(b) \land \lnot N(b) \land \lnot M(c) \land N(c))$
Exclusion, $M \; x \; N$: $\lnot\exists x\,(M(x) \land N(x))$
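The translation rules are mechanical enough to sketch as a small generator. The version below emits formula strings in Prover9-style syntax (`all`, `exists`, `->`, `<->`, `&`, `|`, `-`). The function names, the dictionary-based taxonomy encoding, and the single-character relation codes (`=` for congruence) are illustrative assumptions, not the actual CleanTAX code or its input format.

```python
from itertools import combinations

def taxonomy_formulas(children, ltas=""):
    """Translate a taxonomy into formula strings.

    children maps each parent taxon to its list of child taxa; ltas is a
    string holding any of "N", "D", "C".
    """
    formulas = []
    nodes = sorted(set(children) | {k for kids in children.values() for k in kids})
    for parent, kids in children.items():
        for kid in kids:  # isa edges
            formulas.append(f"all x ({kid}(x) -> {parent}(x)).")
        if "D" in ltas:  # child disjointness
            for a, b in combinations(kids, 2):
                formulas.append(f"all x ({a}(x) -> -{b}(x)).")
        if "C" in ltas and kids:  # coverage
            union = " | ".join(f"{kid}(x)" for kid in kids)
            formulas.append(f"all x ({parent}(x) -> ({union})).")
    if "N" in ltas:  # non-emptiness
        formulas.extend(f"exists x {node}(x)." for node in nodes)
    return formulas

ARTICULATION_TEMPLATES = {
    "=": "all x ({m}(x) <-> {n}(x)).",
    ">": "(all x ({n}(x) -> {m}(x))) & (exists y ({m}(y) & -{n}(y))).",
    "<": "(all x ({m}(x) -> {n}(x))) & (exists y ({n}(y) & -{m}(y))).",
    "o": ("exists x exists y exists z "
          "({m}(x) & {n}(x) & {m}(y) & -{n}(y) & -{m}(z) & {n}(z))."),
    "x": "-(exists x ({m}(x) & {n}(x))).",
}

def articulation_formula(m, relation, n):
    """Render one articulation `m relation n` as a formula string."""
    return ARTICULATION_TEMPLATES[relation].format(m=m, n=n)

for line in taxonomy_formulas({"A": ["B", "C"]}, ltas="ND"):
    print(line)
print(articulation_formula("B", "<", "F"))
```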
27 Theorem Proving
28 CleanTax Methodology
Given a set of taxonomies and articulations between them:
- Check each taxonomy under each LTA set to see if it's consistent
- Check the articulations under each LTA set to see if they are consistent
- Check the taxonomies plus the articulations under the LTA sets from above and make sure the combination is consistent
- If so, for each pair-wise combination of nodes, try to prove each possible relationship under each consistent LTA set.
Implemented in Python. The theorem prover Prover9 and the model searcher Mace4 are used to prove relationships and check consistency.
29 The CleanTAX Infrastructure
- Features
- Designed to plug in a variety of reasoners
- Works with computer clusters (Sun Grid Engine)
- Can work with whole taxonomies or subsets
- Command line options
- Specify taxonomies and articulation sets to test
- Specify relations to test
- Specify LTAs to test
- Specify nodes to test
- Pass parameters to the reasoners
- Inputs
- Taxonomic Concept Schema (an XML spec)
- Individual reasoner files
- Internal representation
- Example Reports
- Which taxonomies are consistent under which LTAs
- For each pair of nodes tested, for each relation, under each LTA, whether or not it can be proven true
- For each set of taxonomies and articulations, under each LTA, a graph showing new inferred relations
30 Initial results
We ran two Ranunculus taxonomies (Benson 1948, 218 taxa, and Kartesz 2004, 142 taxa) and 206 articulations from Peet 2005. When the taxonomies and the articulations were analyzed as a whole, only two LTA combinations were provably consistent: no LTAs and non-emptiness. This involved 928,680 judgments and took 46.0 hours.
To get a better sense for the impact of LTAs, the combined taxonomies and articulations were divided into 82 connected subgraphs. Among these we found 5 inconsistencies and 1946 new articulations. This involved 166,920 judgments and took 4.8 hours.
31 Discovered Inconsistent Mapping under the coverage, disjointness, non-emptiness LTA set
[Figure: Benson, 1948 and Kartesz, 2004 views of Ranunculus hydrocharoides. Varieties shown: R.h. var stolonifer and R.h. var typicus in both views, plus R.h. var natans. Edge labels: three ≡ and one >.]
Peet, 2005:
- B.1948 R.h. stolonifer is congruent to K.2004 R.h. stolonifer
- B.1948 R.h. typicus is congruent to K.2004 R.h. typicus
- B.1948 R. hydrocharoides is congruent to K.2004 R. hydrocharoides
The most likely fix here is to change the congruence relation between the top two nodes to instead state that Benson's R. hydrocharoides includes Kartesz's.
32 Formal Proof of Inconsistency
33 Inferring Additional Knowledge
Does C ≡ E? Or, is C > E?
[Figure: Benson, 1948 and Kartesz, 2004 taxonomies over nodes A through K, with five articulations labeled < and two labeled ≡.]
Node key: A = Ranunculus hispidus; B = R.h. var caricetorum; C = R.h. var hispidus; D = R.h. var nitidus; E = Ranunculus hispidus; F = R.h. var eurylobus; G = R.h. var greenmanii; H = R.h. var marilandicus; I = R.h. var typicus; J = R. septentrionalis; K = R. carolinanis
Edge legend: taxonomy-provided isa; articulated proper inverse inclusion (<); articulated congruence (≡)
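34 Most Informative Relation (MIR)
[Figure: the relation lattice, listed here by rank (number of disjuncts)]
Rank 5: {≡, >, <, o, x}
Rank 4: {≡, >, <, o}  {≡, >, <, x}  {≡, >, o, x}  {>, <, o, x}  {≡, <, o, x}
Rank 3: {<, o, x}  {≡, >, o}  {>, <, o}  {>, <, x}  {≡, <, x}  {≡, <, o}  {>, o, x}  {≡, >, x}  {≡, o, x}  {≡, >, <}
Rank 2: {>, <}  {>, o}  {<, o}  {o, x}  {>, x}  {≡, x}  {≡, <}  {<, x}  {≡, o}  {≡, >}
Rank 1: {<}  {>}  {o}  {≡}  {x}
Rank 0: ∅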
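The lattice is the set of all subsets of the five basic relations, ordered by inclusion, which is where the count of 32 comes from. A short sketch that builds it rank by rank; the ASCII code `=` stands in for congruence and the function name is illustrative:

```python
from itertools import combinations

BASE = ("=", ">", "<", "o", "x")  # the five basic relations

def relation_lattice(base=BASE):
    """Map each rank (number of disjuncts) to the disjunctions of that size."""
    return {rank: [frozenset(combo) for combo in combinations(base, rank)]
            for rank in range(len(base) + 1)}

lattice = relation_lattice()
for rank in sorted(lattice, reverse=True):
    print(rank, len(lattice[rank]))
print(sum(len(nodes) for nodes in lattice.values()))
```

A disjunction lower in the lattice is more informative: it rules out more of the five relations.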
35 Latent Taxonomic Assumptions vs New Maximally Informative Relations
LTAs | The Basic Five Relations | The Other 28 Relations
No LTAs | 245 | 304
All Three LTAs | 475 | 74
Numbers represent novel provably true relations within 75 sub-taxonomies.
Main finding: More constraints lead to more specificity in provably true relations.
36 Optimizations
LTA Optimization
If a set of axioms is inconsistent under one node of the LTA lattice (one combination of LTAs), it will be inconsistent under all the supersets of that node.
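Because adding LTAs only adds axioms, inconsistency propagates upward, so supersets of a failed LTA set never need to be tested. A minimal sketch of that pruning; `check` is a hypothetical callable standing in for a real consistency test (for example a model search), and the function name is illustrative:

```python
from itertools import combinations

def consistent_lta_sets(check, ltas=("C", "D", "N")):
    """Return (consistent LTA sets, number of calls made to check).

    Subsets are visited smallest first; any superset of a set already
    found inconsistent is skipped without calling check.
    """
    known_bad, good, calls = [], [], 0
    for size in range(len(ltas) + 1):
        for combo in combinations(ltas, size):
            lta_set = frozenset(combo)
            if any(bad <= lta_set for bad in known_bad):
                continue  # inherits the inconsistency of a subset
            calls += 1
            if check(lta_set):
                good.append(lta_set)
            else:
                known_bad.append(lta_set)
    return good, calls

# Toy oracle mirroring the full-taxonomy result reported in the talk:
# only "no LTAs" and "non-emptiness" are consistent.
good, calls = consistent_lta_sets(lambda s: s <= {"N"})
print(good, calls)
```

With this toy oracle only four of the eight combinations are actually tested.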
37 Finding the MIR: Algorithm 1, Bottom Up (A↑)
[Figure: the relation lattice from slide 34]
Try relations on the bottom rank in order, then, if none is true, go to the next rank.
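A sketch of the bottom-up search, assuming a hypothetical oracle `proves(disjunction)` that reports whether the reasoner can prove a given disjunction of relations for the node pair. The names and the judgment counter are illustrative, not taken from CleanTAX.

```python
from itertools import combinations

BASE = ("=", ">", "<", "o", "x")  # "=" stands for congruence

def mir_bottom_up(proves, base=BASE):
    """Return (most informative provable disjunction, judgments used).

    Ranks are tried from single relations upward. The full disjunction of
    all five relations always holds, so it is returned untested.
    """
    judgments = 0
    for rank in range(1, len(base)):
        for combo in combinations(base, rank):
            judgments += 1
            if proves(frozenset(combo)):
                return frozenset(combo), judgments
    return frozenset(base), judgments

# Toy oracle: suppose only "=" and ">" cannot be ruled out, so a
# disjunction is provable exactly when it contains both of them.
possible = frozenset({"=", ">"})
print(mir_bottom_up(lambda d: possible <= d))
```

The search is cheap when the answer sits near the bottom of the lattice and expensive when little can be proven.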
38 Finding the MIR: Algorithm 2, Top Down (A↓)
[Figure: the relation lattice from slide 34]
$((A \lor B \lor C \lor D) \land \lnot E) \land ((B \lor C \lor D \lor E) \land \lnot A) \rightarrow (B \lor C \lor D)$
Just check the relations in the penultimate rank.
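A sketch of the top-down idea under the same assumption of a hypothetical `proves(disjunction)` oracle: each four-relation disjunction that can be proven rules out the one relation it omits, and the MIR is whatever survives. Exactly five judgments are needed per node pair.

```python
BASE = ("=", ">", "<", "o", "x")  # "=" stands for congruence

def mir_top_down(proves, base=BASE):
    """Return (most informative provable disjunction, judgments used).

    Only the rank-4 disjunctions are tested; proving the disjunction that
    omits a relation shows that relation cannot hold.
    """
    remaining = set(base)
    for omitted in base:
        if proves(frozenset(base) - {omitted}):
            remaining.discard(omitted)
    return frozenset(remaining), len(base)

# Toy oracle: only "=" and ">" cannot be ruled out.
possible = frozenset({"=", ">"})
print(mir_top_down(lambda d: possible <= d))
```

An empty result would mean every relation was ruled out, which can only happen when the underlying axioms are inconsistent.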
39 Relation Lattice Optimization Results 1
Comparing the two full taxonomies under the non-emptiness LTA shows a strong improvement for the top-down optimization.
Measure | A0 | A↑ (bottom-up) | A↓ (top-down)
Number of Judgments | 928,680 | 912,779 | 154,780
Time (hours) | 46.0 | 45.3 | 7.8 (a 5.8x speedup)
Logical Steps (millions) | 2,634 | 2,589 | 442
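The judgment counts for the full taxonomies line up with simple arithmetic on the taxon counts given earlier (218 Benson taxa, 142 Kartesz taxa). The per-pair figures below are an inference from the numbers, not something the slides state: 30 relations per pair for A0 (the 32 lattice nodes minus the top and bottom) and 5 per pair for top-down (the penultimate rank).

```python
benson_taxa, kartesz_taxa = 218, 142
pairs = benson_taxa * kartesz_taxa   # every cross-taxonomy node pair

a0_judgments = pairs * 30            # assumed: 30 lattice relations per pair
top_down_judgments = pairs * 5       # assumed: 5 rank-4 relations per pair

print(pairs)               # 30956
print(a0_judgments)        # 928680
print(top_down_judgments)  # 154780
```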
40 Relation Lattice Optimization Results 2
Under more restrictive constraints, the bottom-up optimization improves. Results are for 75 sub-taxonomies under the NDC LTA.
Measure | A0 | A↑ (bottom-up) | A↓ (top-down)
Number of Judgments | 17,019 | 2,194 | 2,745
Time (seconds) | 574.59 | 83.61 (a 6.9x speedup) | 100.47 (a 5.7x speedup)
Logical Steps (thousands) | 2,484 | 384 | 394
41 Summary: Contributions To Date
- Represented taxonomies and articulations between them in logic
- Clarified and represented latent taxonomic assumptions
- Created an infrastructure capable of applying reasoners to large taxonomies and articulation sets
- discovering inconsistencies
- discovering interesting new relations
- elucidating impact of LTAs on reasoning
- Described and tested three optimizations
42 Future Work: Applications
- Paul Craig and Jessie Kennedy (2007), School of
Computing, Napier University, Edinburgh
43 Future Work: Suggesting Fixes
[Figure: Benson, 1948 and Kartesz, 2004 views of Ranunculus hydrocharoides, with varieties R.h. var natans, R.h. var stolonifer, and R.h. var typicus; three articulations drawn as ≡]
- Inconsistency found, suggested fixes
- Change relation between Ranunculus hydrocharoides (Benson, 1948) and Ranunculus hydrocharoides (Kartesz, 2004) from ≡ to >.
- Relax Non-Emptiness constraint, allowing Ranunculus hydrocharoides var. natans to be empty.
- Relax Coverage constraint, allowing R. hydrocharoides to contain specimens not contained in its children
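The first kind of fix can be searched for automatically: swap the suspect articulation for each of the five relations and keep the ones that restore consistency. The sketch below does this by brute force under the C, D, and N assumptions. It illustrates the idea and is not a CleanTAX feature; the node names (`B_Rh`, `K_Rh`, and so on) are made-up identifiers, and placing var natans under Benson's R. hydrocharoides is an assumption based on the suggested fixes.

```python
from itertools import combinations, product

RELATIONS = ("=", ">", "<", "o", "x")  # "=" stands for congruence

def constraints(m, rel, n):
    """Universal and existential tests (on one specimen's taxa) for `m rel n`."""
    both = lambda t: m in t and n in t
    only_m = lambda t: m in t and n not in t
    only_n = lambda t: n in t and m not in t
    return {
        "=": ([lambda t: not only_m(t) and not only_n(t)], []),
        ">": ([lambda t: not only_n(t)], [only_m]),
        "<": ([lambda t: not only_m(t)], [only_n]),
        "o": ([], [both, only_m, only_n]),
        "x": ([lambda t: not both(t)], []),
    }[rel]

def consistent(children, articulations):
    """Check taxonomies plus articulations under the C, D, and N assumptions."""
    nodes = sorted(set(children) | {k for ks in children.values() for k in ks})
    universal = []
    existential = [lambda t, n=n: n in t for n in nodes]              # N
    for parent, kids in children.items():
        universal += [lambda t, k=k, p=parent: k not in t or p in t
                      for k in kids]                                  # isa
        universal += [lambda t, a=a, b=b: not (a in t and b in t)
                      for a, b in combinations(kids, 2)]              # D
        universal.append(lambda t, p=parent, ks=tuple(kids):
                         p not in t or any(k in t for k in ks))       # C
    for m, rel, n in articulations:
        u, e = constraints(m, rel, n)
        universal += u
        existential += e
    types = [frozenset(n for n, bit in zip(nodes, bits) if bit)
             for bits in product((0, 1), repeat=len(nodes))]
    allowed = [t for t in types if all(u(t) for u in universal)]
    return all(any(e(t) for t in allowed) for e in existential)

CHILDREN = {"B_Rh": ["B_stol", "B_typ", "B_nat"],   # Benson, 1948 (assumed)
            "K_Rh": ["K_stol", "K_typ"]}            # Kartesz, 2004
FIXED = [("B_stol", "=", "K_stol"), ("B_typ", "=", "K_typ")]

fixes = [rel for rel in RELATIONS
         if consistent(CHILDREN, FIXED + [("B_Rh", rel, "K_Rh")])]
print(fixes)
```

Under these assumptions the only relation that survives is proper inclusion, matching the first suggested fix.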
44 Future Work: Other Logics (DL)
[Figure: Benson, 1948 and Kartesz, 2004 views of Ranunculus. Ranunculus petiolaris appears in both views and Ranunculus macranthus in one. Edge labels: < and {≡, >}.]
45 Other Future Work
- Better parallelization
- Better interfaces (GUI, Web Services)
- Applications to other domains
- Enhancing reporting tools to better support data
curation
46 Conclusions
- Taxonomies are more complicated than you may have thought.
- Logic is a useful tool for discovering inconsistencies and new relations in taxonomies and articulations between them.
- This is an interesting interdisciplinary line of research combining elements from systematics, artificial intelligence, and high-performance computing.
47 Thanks!
Invaluable Consultation: Bertram Ludäscher and Shawn Bowers
Ranunculus Data Set: Bob Peet
Visualization Tools: Jessie Kennedy, Martin Graham and Paul Craig
Niche Modeling: Kirsten Menger-Anderson
Funding and Context: The SEEK project
References
D. Thau and B. Ludäscher. Reasoning about Taxonomies in First-Order Logic. Journal of Ecological Informatics (accepted for publication in 2007).
D. Thau and B. Ludäscher. Toward Optimizing CleanTAX: An Automated Reasoning Method for Taxonomies and Articulations (submitted to the 2007 IEEE/WIC/ACM International Conference on Web Intelligence).
SEEK is supported by the National Science Foundation under awards 0225676, 0225665, 0225635, and 0533368.