1
CleanTAX: An Infrastructure for Reasoning about
Biological Taxonomies
Dave Thau and Bertram Ludäscher
Keywords: knowledge management, automatic
reasoning, semantic integration, biological
classification
2
Outline
  • Brief Overview of Taxonomies
  • Impact of Different Taxonomic Views on Data
    Analysis
  • Taxonomies and Relations Between Them
  • Using Logic to Determine Inconsistencies and
    discover new relations
  • Initial Results of Large Scale Analysis
  • Some Optimizations
  • Future Work

3
Beginnings of Biological Taxonomy
Egypt, 1500 BC: Ebers medical papyrus, classification of medicinal plants
China, 350 BC: Erh-ya dictionary (second century BC) classifies trees, grasses, herbs, grains, vegetables
Greece, 300 BC: Theophrastus, Historia plantarum and Causae plantarum; 500 plants: trees, herbs, fruiting plants, perennials
4
Taxonomies are Everywhere: Systematics
kingdom: Plantae
phylum: Tracheophyta
class: Magnoliopsida
order: Ranunculales
family: Ranunculaceae
genus: Ranunculus
species: Ranunculus asiaticus
5
Taxonomies are Everywhere: The Dewey Decimal
System
  • 000 Computers and general reference
  • 100 Philosophy and psychology
  • 200 Religion
  • 300 Social sciences
  • 400 Language
  • 500 Science
  • 600 Technology
  • 700 Arts and Recreation
  • 800 Literature
  • 900 History and geography

6
Taxonomies are Everywhere: Phylogenies
From Thomas D. Als, Roger Vila, Nikolai P.
Kandul, David R. Nash, Shen-Horn Yen, Yu-Feng
Hsu, André A. Mignault, Jacobus J. Boomsma and
Naomi E. Pierce. Nature 432, 386-390.
7
Taxonomies are Everywhere: Protein Structure
From Ed Green, http://compbio.berkeley.edu/people/ed/SeqCompEval/
8
Taxonomies are Useful, But Slippery
  • In all of these cases, taxonomies:
      • help us organize information
      • allow us to make inferences at many levels of
        generality
  • However, taxonomies are simply "views" of real
    data:
      • Dewey Decimal or Library of Congress?
      • Benson's view of Ranunculus or Kartesz's view?
      • Conflicting phylogenies are common
      • SCOP versus CATH

9
Different Taxonomies Can Lead To Different Results
[Figure: two predicted distribution maps of Anhinga melanogaster, one based on Clement's 4th Edition and one based on Clement's 5th Edition. Each map is shown with its taxonomy of the genus Anhinga ("is a" edges to species including Anhinga rufa, Anhinga nova., and Anhinga melanogaster) and "contained in" edges.]
Articulations by Santa Barbara Software Products.
Georeferenced observation data retrieved from The
Global Biodiversity Information Facility,
www.gbif.org. Distribution maps created using
the GARP niche modeling algorithm embedded in a
Kepler workflow.
10
Different Taxonomies Complicate Data Analysis
  • What was the average number of Ranunculus
    arizonicus seen in transect 1 in 2005?

11
Reasoning With Taxonomic Concepts
  • Peet05 articulates relation between Benson48 and
    Kartesz04 names
  • Is that articulation consistent?
  • Can we infer additional information?

12
Problem Statement
  • What are taxonomies, anyway?
  • How do you know a taxonomy makes sense?
  • Given some articulations meant to translate
    between taxonomies:
      • do they make sense, or are there internal
        contradictions?
      • have they left out anything which may be inferred
        logically?

13
What are Taxonomies?
  • A simple definition: a directed acyclic graph of
    nodes and edges, where the edges represent a
    "subtype" relation

[Figure: example taxonomy in which Anhinga rufa, Anhinga nova., and Anhinga melanogaster each have an "is a" edge to Anhinga.]
Potential additional constraints:
  • children are disjoint (child-disjointness, D)
  • children partition their parents (coverage, C)
  • nodes are non-empty (non-emptiness, N)

We call these "latent taxonomic assumptions" (LTAs)
  • More than one LTA may apply
  • 8 combinations: none, C, D, N, CD, CN, DN, CDN
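
A minimal sketch of the definition and of the eight LTA combinations. The edge-list representation and the names `EDGES`, `is_acyclic`, and `lta_combinations` are illustrative assumptions, not part of CleanTAX.

```python
from itertools import combinations

# Hypothetical (child, parent) edge list for the Anhinga example.
EDGES = [
    ("Anhinga rufa", "Anhinga"),
    ("Anhinga nova.", "Anhinga"),
    ("Anhinga melanogaster", "Anhinga"),
]

def is_acyclic(edges):
    """Return True if the isa edges form a directed acyclic graph."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:          # we came back to a node on the current path
            return False
        visiting.add(node)
        ok = all(visit(p) for p in parents.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return ok

    nodes = {name for edge in edges for name in edge}
    return all(visit(n) for n in nodes)

def lta_combinations(ltas="CDN"):
    """Enumerate every subset of the three LTAs, smallest first."""
    return ["".join(c) for r in range(len(ltas) + 1)
            for c in combinations(ltas, r)]

print(is_acyclic(EDGES))
print(lta_combinations())
```

The empty string stands for "none"; the remaining seven entries follow the order used on the slide.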

14
Inconsistency in a Taxonomy
  • Inconsistent under the ND (non-emptiness and
    disjoint children) LTA.

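[Figure: taxonomy with nodes A, B, C and D, in which B and C are children of A and D sits below both B and C.]
If B and C are children of A, then they must be
disjoint. However, they both contain elements
of D.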
15
How do Taxonomies Relate?
  • Articulations relate nodes between taxonomies

Between any two nodes in the taxonomies, one, and
only one, of the following five relations must
hold:
(i) congruence: M ≡ N
(ii) proper inclusion: M > N
(iii) proper inverse inclusion: M < N
(iv) partial overlap: M o N
(v) exclusion: M x N
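
If each taxon is read as a set of specimens, the five relations correspond to the five ways two non-empty sets can relate. The sketch below assumes that set reading; the function name `relation` is hypothetical.

```python
def relation(m, n):
    """Classify two non-empty sets into exactly one of the five relations."""
    m, n = set(m), set(n)
    if not m or not n:
        raise ValueError("both taxa must be non-empty")
    if m == n:
        return "congruence"
    if n < m:                       # N is a proper subset of M
        return "proper inclusion"
    if m < n:                       # M is a proper subset of N
        return "proper inverse inclusion"
    if m & n:                       # shared specimens, but neither contains the other
        return "partial overlap"
    return "exclusion"

print(relation({1, 2, 3}, {2, 3}))
```

Because the tests are tried in order and the last branch is a catch-all, every pair of non-empty sets receives one and only one label.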
16
Many Possible Articulation Sets
[Figure: Ranunculus aquatilis in FNA-03, 1997 and in Benson, 1948, with the varieties R.a. var aquatilis, R.a. var diffusus, R.a. var hispidulus, R.a. var capillaceus and R.a. var calvescens; the articulations shown are labeled < and ≡.]
Five relationships, plus "unknown/unstated
relation", and 3 x 4 nodes results in 6^12 (over
2 billion) sets of articulations.
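
The count on this slide is six choices per node pair raised to the number of pairs (3 x 4 = 12). A quick check, with variable names chosen only for illustration:

```python
RELATION_CHOICES = 6      # five basic relations plus "unknown/unstated"
NODE_PAIRS = 3 * 4        # every node on one side paired with every node on the other

articulation_sets = RELATION_CHOICES ** NODE_PAIRS
print(f"{articulation_sets:,}")   # 2,176,782,336
```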
17
Articulations: Some Make Sense
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: A < D, C ≡ E, B < F
18
Articulations: Some Are Impossible
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: C > F, B < F
Assuming non-emptiness and disjoint children LTAs
19
Articulations: Some Imply Other Articulations
Taxonomy 1: B isa A, C isa A
Taxonomy 2: E isa D, F isa D
Articulations: A ≡ D, C ≡ E
Implies B ≡ F
Assuming non-emptiness, disjoint children and
coverage LTAs
20
The Relation Lattice
  • Sometimes, a single relation between two nodes
    is unknown.
  • The relation lattice shows all 32 possible
    combined relations.
  • Each node represents a disjunction of relations.
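
The figure of 32 matches the number of subsets of the five basic relations (2^5). The sketch below enumerates them by rank, meaning the number of relations in each disjunction; the ASCII symbols and the name `lattice_by_rank` are assumptions made for illustration.

```python
from itertools import combinations

# ASCII stand-ins for congruence, proper inclusion, proper inverse
# inclusion, partial overlap and exclusion.
BASIC = ("=", ">", "<", "o", "x")

def lattice_by_rank(basic=BASIC):
    """Map each rank k to the disjunctions made of exactly k basic relations."""
    return {k: [frozenset(c) for c in combinations(basic, k)]
            for k in range(len(basic) + 1)}

ranks = lattice_by_rank()
sizes = {k: len(nodes) for k, nodes in ranks.items()}
print(sizes)                 # {0: 1, 1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
print(sum(sizes.values()))   # 32
```

Rank 0 is the empty disjunction at the bottom of the lattice, and rank 5 is the disjunction of all five relations at the top.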

21
The Complexity of Developing Articulations
The Ranunculus data set: 9 taxonomies, 654 taxa, 704
articulations. Visualization by Martin Graham.
22
Example Articulation Set
Benson, 1948 and Kartesz, 2004
[Figure: articulations between lettered nodes (A to M) of the two taxonomies.]
Node key: A = R. petiolaris, B = R. macranthus, C = R. fascicularis
Edge key: is included in; equals; O = overlaps; X = disjoint
23
Goal: To Help Bob Know
  • that the taxonomies he's working with are
    consistent
  • when he's introduced an articulation that leads
    to inconsistency
  • when an articulation is implied by others
  • about ambiguous articulations

24
Berendsohn et al., 2003 - MoReTaX
25
Logic Based Approach
  • Devise a language LTax
      • First-order logic constraints on single-place
        predicates, where each predicate is a "taxon"
  • Render taxonomies and articulations between them
    into a set of first-order formulas
  • Then we can ask:
      • does a taxonomy follow your definition of
        taxonomy?
      • is a pair of taxonomies plus articulations
        between them consistent?
      • are there unstated articulations?

26
Translating Taxonomy into Logic
Taxonomy and latent taxonomic assumption (LTA) formulas
  • isa: for each edge M isa N, add ∀x (M(x) → N(x))
  • Non-Emptiness (N): for each node N, add ∃x N(x)
  • Child Disjointness (D): for each two children N1, N2 of M, add ∀x (N1(x) → ¬N2(x))
  • Coverage (C): for each node M with children N1, ..., NL, add ∀x (M(x) → N1(x) ∨ ... ∨ NL(x))
Articulation formulas
  • Congruence, M ≡ N: ∀x (M(x) ↔ N(x))
  • Proper Inclusion, M > N: ∀x (N(x) → M(x)) ∧ ∃a (M(a) ∧ ¬N(a))
  • Proper Inverse Inclusion, M < N: ∀x (M(x) → N(x)) ∧ ∃a (N(a) ∧ ¬M(a))
  • Partial Overlap, M o N: ∃a ∃b ∃c (M(a) ∧ N(a) ∧ M(b) ∧ ¬N(b) ∧ ¬M(c) ∧ N(c))
  • Exclusion, M x N: ¬∃x (M(x) ∧ N(x))
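
A sketch of how this translation might be mechanized. It emits formula strings in a Prover9-style ASCII syntax (`all`, `exists`, `->`, `<->`, `&`, `|`, `-`); the function names, the edge-list input, and the ASCII relation symbols are assumptions for illustration, and the output has not been run through a prover here.

```python
from itertools import combinations

def taxonomy_formulas(edges, ltas=""):
    """Render (child, parent) isa edges plus the chosen LTAs as formula strings."""
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    nodes = sorted({name for edge in edges for name in edge})
    formulas = [f"all x ({c}(x) -> {p}(x))." for c, p in edges]        # isa
    if "N" in ltas:                                                    # non-emptiness
        formulas += [f"exists x ({n}(x))." for n in nodes]
    if "D" in ltas:                                                    # child disjointness
        for kids in children.values():
            formulas += [f"all x ({a}(x) -> -{b}(x))."
                         for a, b in combinations(kids, 2)]
    if "C" in ltas:                                                    # coverage
        for parent, kids in children.items():
            body = " | ".join(f"{k}(x)" for k in kids)
            formulas.append(f"all x ({parent}(x) -> ({body})).")
    return formulas

# One template per basic relation; "=" stands in for congruence.
ARTICULATION_TEMPLATES = {
    "=": "all x ({m}(x) <-> {n}(x)).",
    ">": "(all x ({n}(x) -> {m}(x))) & (exists a ({m}(a) & -{n}(a))).",
    "<": "(all x ({m}(x) -> {n}(x))) & (exists a ({n}(a) & -{m}(a))).",
    "o": "(exists a ({m}(a) & {n}(a))) & (exists b ({m}(b) & -{n}(b)))"
         " & (exists c (-{m}(c) & {n}(c))).",
    "x": "-(exists x ({m}(x) & {n}(x))).",
}

def articulation_formula(m, rel, n):
    """Render one articulation between taxa m and n."""
    return ARTICULATION_TEMPLATES[rel].format(m=m, n=n)

for line in taxonomy_formulas([("B", "A"), ("C", "A")], "CDN"):
    print(line)
print(articulation_formula("B", "<", "F"))
```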
27
Theorem Proving
28
CleanTax Methodology
Given a set of taxonomies and articulations
between them:
  1. Check each taxonomy under each LTA set to see if
    it's consistent
  2. Check the articulations under each LTA set to see
    if they are consistent
  3. Check the taxonomies plus the articulations under
    the LTA sets from above and make sure the
    combination is consistent
  4. If so, for each pair-wise combination of nodes,
    try to prove each possible relationship under
    each consistent LTA set.

Implemented in Python. The theorem prover
Prover9 and the model searcher Mace4 are used
to prove relationships and check consistency.
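
The consistency checks in steps 1 to 3 can be illustrated on very small inputs without a theorem prover. The sketch below is a brute-force stand-in for Prover9 and Mace4, not the CleanTAX implementation: it enumerates every specimen "type" (a choice of which taxa a specimen belongs to), keeps the types that the universal constraints allow, and asks whether every existential constraint still has a witness. The cost grows as 2^(number of taxa), so it only suits toy examples. All function names and the ASCII relation symbols are assumptions.

```python
from itertools import combinations, product

def consistent(taxa, edges, articulations=(), ltas=""):
    """True if the isa edges, LTAs ("C", "D", "N") and articulations have a model."""
    idx = {name: i for i, name in enumerate(taxa)}
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    universal, existential = [], []   # per-specimen rules / required witnesses

    def subset(a, b):                 # every specimen of a is also in b
        i, j = idx[a], idx[b]
        universal.append(lambda t: not t[i] or t[j])

    def disjoint(a, b):               # no specimen is in both a and b
        i, j = idx[a], idx[b]
        universal.append(lambda t: not (t[i] and t[j]))

    def witness(inside, outside=None):  # some specimen in `inside`, not in `outside`
        i = idx[inside]
        o = None if outside is None else idx[outside]
        existential.append(lambda t: t[i] and (o is None or not t[o]))

    for child, parent in edges:
        subset(child, parent)
    if "N" in ltas:
        for name in taxa:
            witness(name)
    if "D" in ltas:
        for kids in children.values():
            for a, b in combinations(kids, 2):
                disjoint(a, b)
    if "C" in ltas:
        for parent, kids in children.items():
            p, ks = idx[parent], [idx[k] for k in kids]
            universal.append(lambda t, p=p, ks=ks: not t[p] or any(t[k] for k in ks))
    for m, rel, n in articulations:
        if rel in ("=", "<"):
            subset(m, n)
        if rel in ("=", ">"):
            subset(n, m)
        if rel in (">", "o"):
            witness(m, n)
        if rel in ("<", "o"):
            witness(n, m)
        if rel == "o":                # overlap also needs a shared specimen
            i, j = idx[m], idx[n]
            existential.append(lambda t, i=i, j=j: t[i] and t[j])
        if rel == "x":
            disjoint(m, n)

    # Populating every allowed type gives the largest model, so it contains
    # a witness whenever any model does.
    allowed = [t for t in product((False, True), repeat=len(taxa))
               if all(rule(t) for rule in universal)]
    return all(any(need(t) for t in allowed) for need in existential)

def consistent_lta_sets(taxa, edges, articulations=()):
    """Return the LTA combinations under which the input is consistent."""
    combos = ["".join(c) for r in range(4) for c in combinations("CDN", r)]
    return [s for s in combos if consistent(taxa, edges, articulations, s)]

# Two small taxonomies (B, C under A; E, F under D) with articulations
# C > F and B < F.
TAXA = ["A", "B", "C", "D", "E", "F"]
EDGES = [("B", "A"), ("C", "A"), ("E", "D"), ("F", "D")]
ARTICULATIONS = [("C", ">", "F"), ("B", "<", "F")]
print(consistent_lta_sets(TAXA, EDGES, ARTICULATIONS))
```

On this input the combinations that include both D and N are rejected: B would have to sit inside F, and F inside C, while B and C must be disjoint and B non-empty.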
29
The CleanTAX Infrastructure
  • Features
      • Designed to plug in a variety of reasoners
      • Works with computer clusters (Sun Grid Engine)
      • Can work with whole taxonomies or subsets
  • Command line options
      • Specify taxonomies and articulation sets to test
      • Specify relations to test
      • Specify LTAs to test
      • Specify nodes to test
      • Pass parameters to the reasoners
  • Inputs
      • Taxonomic Concept Schema (an XML spec)
      • Individual reasoner files
      • Internal representation
  • Example Reports
      • Which taxonomies are consistent under which LTAs
      • For each pair of nodes tested, for each relation,
        under each LTA, whether or not it can be proven
        true
      • For each set of taxonomies and articulations,
        under each LTA, a graph showing new inferred
        relations

30
Initial results
We ran two Ranunculus taxonomies (Benson 1948,
218 taxa, and Kartesz 2004, 142 taxa) and 206
articulations from Peet 2005.
When the taxonomies and the articulations were analyzed as
a whole, only two LTA combinations were provably
consistent: no LTAs and non-emptiness. This
involved 928,680 judgments and took 46.0
hours.
To get a better sense of the impact of
LTAs, the combined taxonomies and articulations
were divided into 82 connected subgraphs. Among
these we found 5 inconsistencies and 1946 new
articulations. This involved 166,920 judgments
and took 4.8 hours.
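
A quick arithmetic check on the whole-taxonomy run. Reading the judgments as tests on cross-taxonomy node pairs is an assumption, but under that reading the reported total divides evenly into 30 judgments per pair.

```python
benson_taxa, kartesz_taxa = 218, 142
judgments, hours = 928_680, 46.0

pairs = benson_taxa * kartesz_taxa                 # cross-taxonomy node pairs
judgments_per_pair = judgments / pairs
seconds_per_judgment = hours * 3600 / judgments

print(pairs)                                       # 30956
print(judgments_per_pair)                          # 30.0
print(round(seconds_per_judgment, 2))              # 0.18
```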
31
Discovered Inconsistent Mapping under the
coverage, disjointness, non-emptiness LTA set
[Figure: Ranunculus hydrocharoides in Benson, 1948 and in Kartesz, 2004, with the varieties R.h. var stolonifer, R.h. var typicus and R.h. var natans; edge labels shown are > and ≡.]
Peet, 2005:
  • B.1948:R.h.stolonifer is congruent to K.2004:R.h.stolonifer
  • B.1948:R.h.typicus is congruent to K.2004:R.h.typicus
  • B.1948:R. hydrocharoides is congruent to K.2004:R. hydrocharoides
The most likely fix here is to change the
congruence relation between the top two nodes to
instead state that Benson's R. hydrocharoides
includes Kartesz's.
32
Formal Proof of Inconsistency
33
Inferring Additional Knowledge
Does C E? Or, is C > E?
Benson, 1948 and Kartesz, 2004
[Figure: nodes A to K from the two taxonomies, joined by isa edges and by articulations labeled < and ≡.]
Node key:
A: Ranunculus hispidus
B: R.h. var caricetorum
C: R.h. var hispidus
D: R.h. var nitidus
E: Ranunculus hispidus
F: R.h. var eurylobus
G: R.h. var greenmanii
H: R.h. var marilandicus
I: R.h. var typicus
J: R. septentrionalis
K: R. carolinanis
Edge key: isa (provided by the taxonomy); articulated proper inverse inclusion (<); articulated congruence (≡)
34
Most Informative Relation (MIR)
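[Figure: the relation lattice. The top node is the disjunction of all five relations (≡ > < o x). Below it are the 5 four-relation disjunctions, the 10 three-relation disjunctions and the 10 two-relation disjunctions, then the five basic relations (≡, >, <, o, x), and a single bottom node.]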
35
Latent Taxonomic Assumptions vs New Maximally
Informative Relations
                  The Basic Five Relations   The Other 28 Relations
No LTAs           245                        304
All Three LTAs    475                        74
Numbers represent novel provably true relations
within 75 sub-taxonomies. Main finding: more
constraints lead to more specificity in provably
true relations.
36
Optimizations
LTA Optimization
If a set of axioms is inconsistent under one
LTA combination (a node in the lattice of LTA sets),
it will be inconsistent under all the
supersets of that combination.
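
A minimal sketch of this pruning rule. The reasoner is represented by a caller-supplied function `is_consistent`, which is a hypothetical stand-in for a real prover call, and the toy oracle below simply declares any set containing both D and N inconsistent.

```python
from itertools import combinations

def check_lta_sets(is_consistent, ltas="CDN"):
    """Test LTA subsets smallest first, skipping supersets of inconsistent sets."""
    subsets = [frozenset(c) for r in range(len(ltas) + 1)
               for c in combinations(ltas, r)]
    results, calls, inconsistent = {}, 0, []
    for s in subsets:
        if any(bad <= s for bad in inconsistent):
            results[s] = False            # inherited from a subset, no reasoner call
            continue
        calls += 1
        results[s] = is_consistent(s)
        if not results[s]:
            inconsistent.append(s)
    return results, calls

# Toy oracle (an assumption for the example): D together with N is inconsistent.
def toy_oracle(lta_set):
    return not {"D", "N"} <= lta_set

results, calls = check_lta_sets(toy_oracle)
print(calls)                              # 7 reasoner calls instead of 8
print(results[frozenset("CDN")])          # False, without calling the oracle
```

Processing the subsets in order of size guarantees that every subset of a combination has been examined before the combination itself.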
37
Finding the MIR: Algorithm 1, Bottom Up (A↑)
[Figure: the relation lattice of 32 combined relations.]
Try relations on the bottom rank in order, then,
if none is true, go to the next rank.
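
One reading of the bottom-up search, sketched with a caller-supplied `provable` function standing in for the theorem prover. The toy oracle assumes that a disjunction is provable exactly when it contains every relation that is still possible; that oracle, the ASCII symbols, and the function names are illustrative assumptions.

```python
from itertools import combinations

BASIC = ("=", ">", "<", "o", "x")   # ASCII stand-ins for the five relations

def mir_bottom_up(provable, basic=BASIC):
    """Return (most informative relation, number of judgments made)."""
    judgments = 0
    for rank in range(1, len(basic) + 1):          # single relations first
        for combo in combinations(basic, rank):
            judgments += 1
            if provable(frozenset(combo)):
                return frozenset(combo), judgments
    return None, judgments

def make_oracle(possible):
    """Toy prover: a disjunction is provable if it covers all possible relations."""
    possible = frozenset(possible)
    return lambda disjunction: possible <= disjunction

print(mir_bottom_up(make_oracle({"<"})))           # found on the bottom rank
print(mir_bottom_up(make_oracle({"=", ">"})))      # needs the second rank
```

When the answer is a single basic relation the search stops within five judgments, which fits the observation that this strategy does better under more restrictive constraints.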
38
Finding the MIR: Algorithm 2, Top Down (A↓)
[Figure: the relation lattice of 32 combined relations.]
((A ∨ B ∨ C ∨ D) ∧ ¬E) ∧ ((B ∨ C ∨ D ∨ E) ∧ ¬A) → (B ∨ C ∨ D)
Just check the relations in the penultimate rank.
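
One reading of the top-down search: test only the five four-relation disjunctions. Each one that can be proven rules out the single relation it omits, and the MIR is whatever remains. The `provable` function and the toy oracle are the same kind of hypothetical stand-in for a prover, and all names are illustrative.

```python
BASIC = ("=", ">", "<", "o", "x")   # ASCII stand-ins for the five relations

def mir_top_down(provable, basic=BASIC):
    """Return (most informative relation, number of judgments made)."""
    remaining = set(basic)
    judgments = 0
    for r in basic:
        judgments += 1
        # If the other four relations provably cover every case, r is ruled out.
        if provable(frozenset(basic) - {r}):
            remaining.discard(r)
    return frozenset(remaining), judgments

def make_oracle(possible):
    """Toy prover: a disjunction is provable if it covers all possible relations."""
    possible = frozenset(possible)
    return lambda disjunction: possible <= disjunction

print(mir_top_down(make_oracle({"<"})))
print(mir_top_down(make_oracle({"=", ">", "o"})))
```

The number of judgments is always five per node pair, regardless of how specific the answer turns out to be.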
39
Relation Lattice Optimization Results 1
Comparing the two full taxonomies under the
non-emptiness LTA shows a strong improvement for
the top-down optimization.
                           A0        A↑ (bottom up)   A↓ (top down)
Number of Judgments        928,680   912,779          154,780
Time (hours)               46.0      45.3             7.8 (a 5.8x speedup)
Logical Steps (millions)   2,634     2,589            442
40
Relation Lattice Optimization Results 2
Under more restrictive constraints, the bottom-up
optimization improves. Results are for 75
sub-taxonomies under the NDC LTA.
                            A0       A↑ (bottom up)          A↓ (top down)
Number of Judgments         17,019   2,194                   2,745
Time (seconds)              574.59   83.61 (a 6.9x speedup)  100.47 (a 5.7x speedup)
Logical Steps (thousands)   2,484    384                     394
41
Summary: Contributions To Date
  • Represented taxonomies and articulations between
    them in logic
  • Clarified and represented latent taxonomic
    assumptions
  • Created an infrastructure capable of applying
    reasoners to large taxonomies and articulation sets:
      • discovering inconsistencies
      • discovering interesting new relations
      • elucidating impact of LTAs on reasoning
  • Described and tested three optimizations

42
Future Work: Applications
  • Paul Craig and Jessie Kennedy (2007), School of
    Computing, Napier University, Edinburgh

43
Future Work: Suggesting Fixes
[Figure: Ranunculus hydrocharoides in Benson, 1948 and in Kartesz, 2004, with the varieties R.h. var natans, R.h. var stolonifer and R.h. var typicus; the articulations shown are labeled ≡.]
  • Inconsistency found, suggested fixes:
      • Change relation between Ranunculus hydrocharoides
        (Benson, 1948) and Ranunculus hydrocharoides
        (Kartesz, 2004) from ≡ to >.
      • Relax Non-Emptiness constraint, allowing
        Ranunculus hydrocharoides var. natans to be
        empty.
      • Relax Coverage constraint, allowing R.
        hydrocharoides to contain specimens not contained
        in its children.
44
Future Work: Other Logics (DL)
[Figure: Ranunculus and Ranunculus petiolaris in both Benson, 1948 and Kartesz, 2004, plus Ranunculus macranthus; edge labels shown are < and ≡>.]
45
Other Future Work
  • Better parallelization
  • Better interfaces (GUI, Web Services)
  • Applications to other domains
  • Enhancing reporting tools to better support data
    curation

46
Conclusions
  • Taxonomies are more complicated than you may have
    thought.
  • Logic is a useful tool for discovering
    inconsistencies and new relations in taxonomies
    and articulations between them.
  • This is an interesting interdisciplinary line of
    research combining elements from systematics,
    artificial intelligence, and high-performance
    computing.

47
Thanks!
  • Acknowledgements

Invaluable Consultation: Bertram Ludäscher and Shawn Bowers
Ranunculus Data Set: Bob Peet
Visualization Tools: Jessie Kennedy, Martin Graham and Paul Craig
Niche Modeling: Kirsten Menger-Anderson
Funding and Context: The SEEK project
References
D. Thau and B. Ludäscher. Reasoning about
Taxonomies in First-Order Logic. Journal of
Ecological Informatics (accepted for publication
in 2007).
D. Thau and B. Ludäscher. Toward
Optimizing CleanTAX: An Automated Reasoning
Method for Taxonomies and Articulations
(submitted to the 2007 IEEE/WIC/ACM International
Conference on Web Intelligence).
SEEK is supported by the National Science
Foundation under awards 0225676, 0225665,
0225635, and 0533368.