1
CleanTAX: An Infrastructure for Reasoning about
Biological Taxonomies
Dave Thau and Bertram Ludäscher
Keywords: knowledge management, automatic
reasoning, semantic integration, biological
classification
2
Outline

Brief Overview of Taxonomies
Impact of Different Taxonomic Views on Data
Analysis
Taxonomies and Relations Between Them
Using Logic to Determine Inconsistencies and
discover new relations
Initial Results of Large Scale Analysis
Some Optimizations
Future Work

3
Beginnings of Biological Taxonomy
Egypt, 1500 BC: Ebers medical papyrus,
classification of medicinal plants
China, 350 BC: Erh-ya dictionary (second century
BC) classifies trees, grasses, herbs, grains,
vegetables
Greece, 300 BC: Theophrastus, Historia plantarum
and Causae plantarum; 500 plants: trees, herbs,
fruiting plants, perennials
4
Taxonomies are Everywhere: Systematics
kingdom: Plantae
phylum: Tracheophyta
class: Magnoliopsida
order: Ranunculales
family: Ranunculaceae
genus: Ranunculus
species: Ranunculus asiaticus
5
Taxonomies are Everywhere: The Dewey Decimal
System

000 Computers and general reference
100 Philosophy and psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts and Recreation
800 Literature
900 History and geography

6
Taxonomies are Everywhere: Phylogenies
From Thomas D. Als, Roger Vila, Nikolai P.
Kandul, David R. Nash, Shen-Horn Yen, Yu-Feng
Hsu, André A. Mignault, Jacobus J. Boomsma and
Naomi E. Pierce. Nature 432, 386-390.
7
Taxonomies are Everywhere: Protein Structure
From Ed Green, http://compbio.berkeley.edu/people/ed/SeqCompEval/
8
Taxonomies are Useful, But Slippery

In all of these cases, taxonomies
Help us organize information
Allow us to make inferences at many levels of
generality
However, taxonomies are simply "views" of real
data
Dewey Decimal or Library of Congress?
Benson's view of Ranunculus or Kartesz's view?
Conflicting phylogenies are common
SCOP versus CATH

9
Different Taxonomies Can Lead To Different Results
[Figure: two distribution maps with their taxonomy fragments.
Left: predicted distribution of Anhinga melanogaster based on Clement's 4th Edition.
Right: predicted distribution of Anhinga melanogaster based on Clement's 5th Edition.
Taxonomy nodes: Anhinga (in both editions), Anhinga rufa, Anhinga nova., Anhinga melanogaster (in both editions). Edge labels: "is a", "contained in", and "?".]
Articulations by Santa Barbara Software Products
Georeferenced observation data retrieved from The
Global Biodiversity Information Facility
www.gbif.org. Distribution maps created using
the GARP niche modeling algorithm embedded in a
Kepler workflow.
10
Different Taxonomies Complicate Data Analysis

What was the average number of Ranunculus
arizonicus seen in transect 1 in 2005?

11
Reasoning With Taxonomic Concepts

Peet05 articulates relation between Benson48 and
Kartesz04 names
Is that articulation consistent?
Can we infer additional information?

12
Problem Statement

What are taxonomies, anyway?
How do you know a taxonomy makes sense?
Given some articulations meant to translate
between taxonomies
do they make sense, or are there internal
contradictions?
have they left out anything which may be inferred
logically?

13
What are Taxonomies?

A simple definition: a directed acyclic graph of
nodes and edges, where the edges represent a
"subtype" relation

[Figure: Anhinga rufa, Anhinga nova., and Anhinga melanogaster, each joined to Anhinga by an "is a" edge]
Potential additional constraints

children are disjoint (child-disjointness, D)
children partition their parents (coverage, C)
nodes are non-empty (non-emptiness, N)

We call these "latent taxonomic assumptions"

More than one LTA may apply
8 combinations: none, C, D, N, CD, CN, DN, CDN
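
The eight combinations are simply the subsets of the three assumptions. A minimal sketch, with illustrative names that are not taken from CleanTAX:

```python
from itertools import combinations

# The three latent taxonomic assumptions named on the slide.
LTAS = ("C", "D", "N")  # coverage, child-disjointness, non-emptiness

def lta_combinations(ltas=LTAS):
    """Return every subset of the LTAs, smallest first, as short labels."""
    labels = []
    for size in range(len(ltas) + 1):
        for subset in combinations(ltas, size):
            labels.append("".join(subset) or "none")
    return labels

print(lta_combinations())
# ['none', 'C', 'D', 'N', 'CD', 'CN', 'DN', 'CDN']
```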

14
Inconsistency in a Taxonomy

Inconsistent under the ND (non-emptiness and
disjoint children) LTA.

[Figure: B and C are children of A; D is a child of both B and C]
If B and C are children of A, then they must be
disjoint. However, they both contain the elements
of D, which must be non-empty.
15
How do Taxonomies Relate?

Articulations relate nodes between taxonomies

Between any two nodes in the taxonomies, one, and
only one, of the following five relations must
hold:
(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N
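
When the members of two taxa are known explicitly, the relation between them can be read off with ordinary set operations. The sketch below assumes both taxa are non-empty Python sets; the function name and the returned labels are illustrative only.

```python
def relation(m, n):
    """Classify two non-empty sets into exactly one of the five relations."""
    m, n = set(m), set(n)
    if not m or not n:
        raise ValueError("both taxa must be non-empty")
    if m == n:
        return "congruence"                # M and N have the same members
    if n < m:
        return "proper inclusion"          # N is a proper subset of M
    if m < n:
        return "proper inverse inclusion"  # M is a proper subset of N
    if m & n:
        return "partial overlap"           # shared members, plus extras on both sides
    return "exclusion"                     # no shared members

print(relation({"s1", "s2"}, {"s1"}))        # proper inclusion
print(relation({"s1", "s2"}, {"s2", "s3"}))  # partial overlap
```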
16
Many Possible Articulation Sets
[Figure: Ranunculus aquatilis as treated in FNA-03, 1997 and in Benson, 1948, with the varieties R.a. var aquatilis, R.a. var diffusus, R.a. var hispidulus, R.a. var capillaceus, and R.a. var calvescens; articulations between the two taxonomies are labeled < and ≡]
Five relationships plus an "unknown/unstated
relation" give six choices per pair of nodes, and 3 x 4 nodes
give 12 pairs, resulting in 6^12 (over 2 billion) sets of
articulations.
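
The count on this slide can be checked with a few lines of arithmetic: six choices for each of the 12 node pairs. The variable names are illustrative.

```python
RELATION_CHOICES = 5 + 1   # five basic relations plus "unknown/unstated"
NODE_PAIRS = 3 * 4         # each node of one taxonomy paired with each node of the other

articulation_sets = RELATION_CHOICES ** NODE_PAIRS
print(f"{articulation_sets:,}")  # 2,176,782,336
```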
17
Articulations: Some Make Sense
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: A < D, C ≡ E, B < F
18
Articulations: Some Are Impossible
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: C > F, B < F
Assuming non-emptiness and disjoint children LTAs
19
Articulations: Some Imply Other Articulations
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: A ≡ D, C ≡ E
Implies B ≡ F
Assuming non-emptiness, disjoint children and
coverage LTAs
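
The three small examples (some articulations make sense, some are impossible, some imply others) can be explored with a brute-force search for models. This is only an illustrative sketch, not how CleanTAX works: it assumes Taxonomy 1 is A with children B and C, Taxonomy 2 is D with children E and F, and a made-up universe of three individuals, with each taxon's extension stored as a bitmask. Finding no model in such a tiny universe suggests inconsistency but does not prove it in general. "=" stands in for congruence.

```python
from itertools import product

BITS = 3                                   # assumed universe of three individuals
NONEMPTY = range(1, 1 << BITS)             # non-empty subsets as bitmasks (N)
NODES = ("A", "B", "C", "D", "E", "F")
CHILDREN = {"A": ("B", "C"), "D": ("E", "F")}

def holds(rel, m, n):
    """Check one basic relation between two bitmask sets."""
    if rel == "=":
        return m == n
    if rel == ">":
        return (n & ~m) == 0 and m != n
    if rel == "<":
        return (m & ~n) == 0 and m != n
    if rel == "o":
        return bool(m & n) and bool(m & ~n) and bool(n & ~m)
    if rel == "x":
        return (m & n) == 0
    raise ValueError(f"unknown relation {rel!r}")

def models(articulations, coverage=False):
    """Yield assignments satisfying isa, N, D, optionally C, and the articulations."""
    for values in product(NONEMPTY, repeat=len(NODES)):
        ext = dict(zip(NODES, values))
        for parent, (left, right) in CHILDREN.items():
            union = ext[left] | ext[right]
            if union & ~ext[parent]:                # isa: children lie inside the parent
                break
            if ext[left] & ext[right]:              # D: siblings are disjoint
                break
            if coverage and union != ext[parent]:   # C: children cover the parent
                break
        else:
            if all(holds(rel, ext[m], ext[n]) for m, rel, n in articulations):
                yield ext

def consistent(articulations, coverage=False):
    """True if at least one model exists in the small universe."""
    return next(models(articulations, coverage), None) is not None

def implied(articulations, m, rel, n, coverage=False):
    """True if models exist and the relation holds in every one of them."""
    found = list(models(articulations, coverage))
    return bool(found) and all(holds(rel, ext[m], ext[n]) for ext in found)

print(consistent([("A", "<", "D"), ("C", "=", "E"), ("B", "<", "F")]))            # True
print(consistent([("C", ">", "F"), ("B", "<", "F")]))                              # False
print(implied([("A", "=", "D"), ("C", "=", "E")], "B", "=", "F", coverage=True))  # True
```

The second case fails for a reason that holds in any universe: B inside F inside C would force the non-empty B to overlap its sibling C.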
20
The Relation Lattice

Sometimes, a single relation between two nodes
is unknown.
The relation lattice shows all 32 possible
combined relations.
Each node represents a disjunction of relations.
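
The 32 nodes are the subsets of the five basic relations, which can be grouped into ranks by size. A small sketch, using "=" as an ASCII stand-in for congruence (the names are illustrative):

```python
from itertools import combinations

BASIC = ("=", ">", "<", "o", "x")

def lattice_ranks(basic=BASIC):
    """Group every subset of the basic relations by size (rank 0 is the empty set)."""
    return [
        [frozenset(combo) for combo in combinations(basic, size)]
        for size in range(len(basic) + 1)
    ]

ranks = lattice_ranks()
print([len(rank) for rank in ranks])     # [1, 5, 10, 10, 5, 1]
print(sum(len(rank) for rank in ranks))  # 32
```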

21
The Complexity of Developing Articulations
The Ranunculus data set: 9 taxonomies, 654 taxa, 704
articulations. Visualization by Martin Graham.
22
Example Articulation Set
[Figure: the Benson, 1948 and Kartesz, 2004 taxonomies, with nodes labeled A through M, joined by articulations]
Key: A = R. petiolaris, B = R. macranthus, C = R. fascicularis
Legend: "is included in", "equals", O = overlaps, X = disjoint
23
Goal: To Help Bob Know

that the taxonomies he's working with are
consistent
when he's introduced an articulation that leads
to inconsistency
when an articulation is implied by others
about ambiguous articulations

24
Berendsohn et al., 2003 - MoReTaX
25
Logic Based Approach

Devise a language LTax
First-order logic constraints on single-place
predicates, where each predicate is a "taxon"
Render taxonomies and articulations between them
into a set of first-order formulas
Then can ask,
does a taxonomy follow your definition of
taxonomy?
is a pair of taxonomies plus articulations
between them consistent?
are there unstated articulations?

26
Translating Taxonomy into Logic
Taxonomy and LTA Formulas
isa: for each edge M isa N, add ∀x (M(x) → N(x))
Non-Emptiness (N): for each node N, add ∃x N(x)
Child Disjointness (D): for each two children N1, N2 of M, add ∀x (N1(x) → ¬N2(x)), or equivalently ¬∃x (N1(x) ∧ N2(x))
Coverage (C): for each node M with children N1, ..., NL, add ∀x (M(x) → N1(x) ∨ ... ∨ NL(x))
Articulation Formulas
Congruence, M ≡ N: ∀x (M(x) ↔ N(x))
Proper Inclusion, M > N: ∀x (N(x) → M(x)) ∧ ∃a (M(a) ∧ ¬N(a))
Proper Inverse Inclusion, M < N: ∀x (M(x) → N(x)) ∧ ∃a (N(a) ∧ ¬M(a))
Partial Overlap, M o N: ∃a ∃b ∃c (M(a) ∧ N(a) ∧ M(b) ∧ ¬N(b) ∧ ¬M(c) ∧ N(c))
Exclusion, M x N: ¬∃x (M(x) ∧ N(x))
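
A translation of this kind is mechanical, so it can be sketched as string generation. The sketch below is an assumption-laden illustration, not the CleanTAX code: the function names are hypothetical, "=" stands in for congruence, and the output uses a Prover9-style notation (all, exists, ->, <->, &, |, and - for negation) that should be checked against the reasoner's manual before use.

```python
def taxonomy_formulas(children, ltas=""):
    """Render isa edges and the selected LTAs (any of 'N', 'D', 'C') as formula strings."""
    formulas = []
    nodes = set(children)
    for parent, kids in children.items():
        nodes.update(kids)
        for kid in kids:
            formulas.append(f"all x ({kid}(x) -> {parent}(x)).")        # isa
        if "D" in ltas:
            for i, first in enumerate(kids):
                for second in kids[i + 1:]:
                    formulas.append(f"all x ({first}(x) -> -{second}(x)).")  # child disjointness
        if "C" in ltas and kids:
            disjunction = " | ".join(f"{kid}(x)" for kid in kids)
            formulas.append(f"all x ({parent}(x) -> ({disjunction})).")      # coverage
    if "N" in ltas:
        for node in sorted(nodes):
            formulas.append(f"exists x ({node}(x)).")                        # non-emptiness
    return formulas

ARTICULATION_TEMPLATES = {
    "=": "all x ({m}(x) <-> {n}(x)).",
    ">": "(all x ({n}(x) -> {m}(x))) & (exists a ({m}(a) & -{n}(a))).",
    "<": "(all x ({m}(x) -> {n}(x))) & (exists a ({n}(a) & -{m}(a))).",
    "o": "exists a exists b exists c "
         "({m}(a) & {n}(a) & {m}(b) & -{n}(b) & -{m}(c) & {n}(c)).",
    "x": "-(exists x ({m}(x) & {n}(x))).",
}

def articulation_formula(m, rel, n):
    """Render one articulation between taxa m and n."""
    return ARTICULATION_TEMPLATES[rel].format(m=m, n=n)

for line in taxonomy_formulas({"A": ["B", "C"]}, ltas="NDC"):
    print(line)
print(articulation_formula("A", "<", "D"))
```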
27
Theorem Proving
28
CleanTax Methodology
Given a set of taxonomies and articulations
between them

Check each taxonomy under each LTA set to see if
it's consistent
Check the articulations under each LTA set to see
if they are consistent
Check the taxonomies plus the articulations under
the LTA sets from above and make sure the
combination is consistent
If so, for each pair-wise combination of nodes,
try to prove each possible relationship under
each consistent LTA set.

Implemented in Python. The theorem prover
Prover9 and the model searcher Mace4 are used
to prove relationships and check consistency.
29
The CleanTAX Infrastructure

Features
Designed to plug in a variety of reasoners
Works with computer clusters (Sun Grid Engine)
Can work with whole taxonomies or subsets
Command line options
Specify taxonomies and articulation sets to test
Specify relations to test
Specify LTAs to test
Specify nodes to test
Pass parameters to the reasoners
Inputs
Taxonomic Concept Schema (an XML spec)
Individual reasoner files
Internal representation
Example Reports
Which taxonomies are consistent under which LTAs
For each pair of nodes tested, for each relation,
under each LTA, whether or not it can be proven
true
For each set of taxonomies and articulations,
under each LTA, a graph showing new inferred
relations

30
Initial results
We ran two Ranunculus taxonomies (Benson 1948,
218 taxa, and Kartesz 2004, 142 taxa) and 206
articulations from Peet 2005.
When the taxonomies and the articulations were analyzed as
a whole, only two LTA combinations were provably
consistent: no LTAs, and non-emptiness. This
involved 928,680 judgments and took 46.0 hours.
To get a better sense of the impact of
LTAs, the combined taxonomies and articulations
were divided into 82 connected subgraphs. Among
these we found 5 inconsistencies and 1946 new
articulations. This involved 166,920 judgments
and took 4.8 hours.
31
Discovered Inconsistent Mapping under the
coverage, disjointness, non-emptiness LTA set
Benson, 1948: Ranunculus hydrocharoides, with children R.h. var natans, R.h. var stolonifer, R.h. var typicus
Kartesz, 2004: Ranunculus hydrocharoides, with children R.h. var stolonifer, R.h. var typicus
Articulations (Peet, 2005):
B.1948 R.h. stolonifer is congruent to K.2004 R.h. stolonifer
B.1948 R.h. typicus is congruent to K.2004 R.h. typicus
B.1948 R. hydrocharoides is congruent to K.2004 R. hydrocharoides
The most likely fix here is to change the
congruence relation between the top two nodes to
instead state that Benson's R. hydrocharoides
includes Kartesz's (≡ becomes >).
32
Formal Proof of Inconsistency
33
Inferring Additional Knowledge
Does C ≡ E? Or is C > E?
[Figure: the Benson, 1948 and Kartesz, 2004 taxonomies, with nodes labeled A through K, joined by articulations labeled < and ≡]
Key:
A Ranunculus hispidus
B R.h. var caricetorum
C R.h. var hispidus
D R.h. var nitidus
E Ranunculus hispidus
F R.h. var eurylobus
G R.h. var greenmanii
H R.h. var marilandicus
I R.h. var typicus
J R. septentrionalis
K R. carolinanis
Legend: isa (provided by the taxonomy); articulated proper inverse inclusion (<); articulated congruence (≡)
34
Most Informative Relation (MIR)
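[Figure: the relation lattice of all 32 combinations of the five basic relations, from the full disjunction {≡, >, <, o, x} at the top, through the ranks of four, three, and two relations, down to the five single relations and the empty set at the bottom]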
35
Latent Taxonomic Assumptions vs New Maximally
Informative Relations
                 The Basic Five Relations   The Other 28 Relations
No LTAs          245                        304
All Three LTAs   475                        74
Numbers represent novel provably true relations
within 75 sub-taxonomies.
Main finding: more constraints lead to more specificity in provably
true relations.
36
Optimizations
LTA Optimization
If a set of axioms is inconsistent under one
LTA set (a node in the lattice of LTA combinations),
it will be inconsistent under all the
supersets of that LTA set.
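
One way to exploit this observation is to test LTA sets from smallest to largest and skip any set that contains an already inconsistent one. The sketch below is an illustration under assumptions: `is_consistent` is a hypothetical stand-in for a call to a reasoner, and the example oracle mimics the reported outcome in which only "no LTAs" and "non-emptiness" were consistent.

```python
from itertools import combinations

def consistent_lta_sets(is_consistent, ltas=("C", "D", "N")):
    """Test LTA sets smallest first, skipping supersets of known-inconsistent sets."""
    consistent, inconsistent, calls = [], [], 0
    for size in range(len(ltas) + 1):
        for combo in combinations(ltas, size):
            candidate = frozenset(combo)
            if any(bad <= candidate for bad in inconsistent):
                inconsistent.append(candidate)   # inherited, no reasoner call needed
                continue
            calls += 1
            if is_consistent(candidate):
                consistent.append(candidate)
            else:
                inconsistent.append(candidate)
    return consistent, calls

# Hypothetical oracle: anything beyond non-emptiness is inconsistent.
found, calls = consistent_lta_sets(lambda lta_set: lta_set <= {"N"})
print([sorted(s) for s in found], calls)  # [[], ['N']] 4
```

In this example only four of the eight LTA sets reach the reasoner.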
37
Finding the MIR: Algorithm 1, Bottom Up (A↑)
[Figure: the relation lattice of all 32 combinations of the five basic relations, searched from the bottom rank upward]
Try relations on the bottom rank in order, then,
if none is true, go to the next rank.
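
A bottom-up search can be sketched as follows. This is one reading of the slide, not the authors' code: `provable` is a hypothetical stand-in for a theorem-prover call on a disjunction of relations, "=" stands in for congruence, and the toy oracle treats a disjunction as provable when it covers every relation that remains possible.

```python
from itertools import combinations

BASIC = ("=", ">", "<", "o", "x")

def mir_bottom_up(provable, basic=BASIC):
    """Return (MIR, judgments), testing small disjunctions before larger ones."""
    judgments = 0
    for rank in range(1, len(basic)):            # single relations first, then pairs, ...
        for combo in combinations(basic, rank):
            judgments += 1
            if provable(frozenset(combo)):
                return frozenset(combo), judgments
    return frozenset(basic), judgments           # nothing narrower is provable

def make_oracle(possible):
    """Hypothetical reasoner: provable if the disjunction covers all possible relations."""
    possible = frozenset(possible)
    return lambda disjunction: possible <= disjunction

for possible in ({"<"}, {"=", ">"}, set(BASIC)):
    mir, judgments = mir_bottom_up(make_oracle(possible))
    print(sorted(mir), judgments)
```

The search is cheap when a single relation is provable and costs 30 judgments in this sketch when nothing narrower than the full disjunction can be proven.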
38
Finding the MIR: Algorithm 2, Top Down (A↓)
[Figure: the relation lattice of all 32 combinations of the five basic relations, with the penultimate rank (the five four-relation disjunctions) highlighted]
((A ∨ B ∨ C ∨ D) ∧ ¬E) ∧ ((B ∨ C ∨ D ∨ E) ∧
¬A) → (B ∨ C ∨ D)
Just check the relations in penultimate rank
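
One reading of this step: each four-relation disjunction leaves out exactly one basic relation, so proving it rules that relation out, and the MIR is whatever remains after all five checks. The sketch below follows that reading; `provable` is a hypothetical stand-in for a theorem-prover call and "=" stands in for congruence.

```python
BASIC = ("=", ">", "<", "o", "x")

def mir_top_down(provable, basic=BASIC):
    """Find the MIR by testing only the five four-relation disjunctions."""
    remaining = set(basic)
    judgments = 0
    for excluded in basic:
        judgments += 1
        if provable(frozenset(basic) - {excluded}):
            remaining.discard(excluded)   # the prover ruled this relation out
    return frozenset(remaining), judgments

def make_oracle(possible):
    """Hypothetical reasoner: provable if the disjunction covers all possible relations."""
    possible = frozenset(possible)
    return lambda disjunction: possible <= disjunction

mir, judgments = mir_top_down(make_oracle({"=", ">"}))
print(sorted(mir), judgments)
```

Under this reading every node pair costs exactly five judgments, which would match the reported 154,780 judgments for 218 x 142 = 30,956 node pairs.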
39
Relation Lattice Optimization Results 1
Comparing the two full taxonomies under the
non-emptiness LTA shows a strong improvement for
the top-down optimization.
                           A0        A↑ (bottom-up)   A↓ (top-down)
Number of Judgments        928,680   912,779          154,780
Time (hours)               46.0      45.3             7.8 (a 5.8x speedup)
Logical Steps (millions)   2,634     2,589            442
40
Relation Lattice Optimization Results 2
Under more restrictive constraints, the bottom-up
optimization improves. Results are for 75
sub-taxonomies under the NDC LTA.
                            A0       A↑ (bottom-up)          A↓ (top-down)
Number of Judgments         17,019   2,194                   2,745
Time (seconds)              574.59   83.61 (a 6.9x speedup)  100.47 (a 5.7x speedup)
Logical Steps (thousands)   2,484    384                     394
41
Summary: Contributions To Date

Represented taxonomies and articulations between
them in logic
Clarified and represented latent taxonomic
assumptions
Created an infrastructure capable of applying
reasoners to large taxonomies and articulation sets
discovering inconsistencies
discovering interesting new relations
elucidating impact of LTAs on reasoning
Described and tested three optimizations

42
Future Work: Applications

Paul Craig and Jessie Kennedy (2007), School of
Computing, Napier University, Edinburgh

43
Future Work: Suggesting Fixes
Benson, 1948: Ranunculus hydrocharoides, with children R.h. var natans, R.h. var stolonifer, R.h. var typicus
Kartesz, 2004: Ranunculus hydrocharoides, with children R.h. var stolonifer, R.h. var typicus
Articulations: ≡ between the two R. hydrocharoides nodes, between the two R.h. var stolonifer nodes, and between the two R.h. var typicus nodes

Inconsistency found, suggested fixes
Change relation between Ranunculus hydrocharoides
(Benson, 1948) and Ranunculus hydrocharoides
(Kartesz, 2004) from ≡ to >.
Relax Non-Emptiness constraint, allowing
Ranunculus hydrocharoides var. natans to be
empty.
Relax Coverage constraint, allowing R.
hydrocharoides to contain specimens not contained
in its children.

44
Future Work: Other Logics (DL)
[Figure: Ranunculus and Ranunculus petiolaris in both Benson, 1948 and Kartesz, 2004, plus Ranunculus macranthus; articulations labeled "<" and "≡ >"]
45
Other Future Work

Better parallelization
Better interfaces (GUI, Web Services)
Applications to other domains
Enhancing reporting tools to better support data
curation

46
Conclusions

Taxonomies are more complicated than you may have
thought.
Logic is a useful tool for discovering
inconsistencies and new relations in taxonomies
and articulations between them.
This is an interesting interdisciplinary line of
research combining elements from systematics,
artificial intelligence, and high-performance
computing.

47
Thanks!

Acknowledgements

Invaluable Consultation: Bertram Ludäscher and Shawn Bowers
Ranunculus Data Set: Bob Peet
Visualization Tools: Jessie Kennedy, Martin Graham and Paul Craig
Niche Modeling: Kirsten Menger-Anderson
Funding and Context: The SEEK project
References
D. Thau and B. Ludäscher. Reasoning about
Taxonomies in First-Order Logic. Journal of
Ecological Informatics (accepted for publication
in 2007).
D. Thau and B. Ludäscher. Toward
Optimizing CleanTAX: An Automated Reasoning
Method for Taxonomies and Articulations
(submitted to the 2007 IEEE/WIC/ACM International
Conference on Web Intelligence).
SEEK is supported by the National Science
Foundation under awards 0225676, 0225665,
0225635, and 0533368.
