Title: Benchmarking ontology-based annotation tools for the Semantic Web
1 Benchmarking ontology-based annotation tools for the Semantic Web
- Diana Maynard
- University of Sheffield, UK
3 What?
- Work in the context of the EU Network of Excellence KnowledgeWeb
- Case studies in the field of bioinformatics
- Developing benchmarking tools and test suites for ontology generation and evaluation
- New metrics for evaluation
- New visualisation tools
- Development of usability criteria
4 Why?
- Increasing interest in the use of ontologies in bioinformatics, as a means of accessing information automatically from large databases
- Ontologies such as the Gene Ontology (GO) enable annotation and querying of large databases such as SWISS-PROT.
- Methods for information extraction (IE) have become extremely important in these fields.
- Development of ontology-based IE (OBIE) applications is hampered by a lack of standardisation and of suitable metrics for testing and evaluation
- Main focus until now has been on performance rather than practical aspects such as usability and accessibility.
5 Gene Ontology
- Collaborative ontology construction has been practised in the Gene Ontology community for a long time compared with other communities.
- This makes it a good case study for testing applications and metrics.
- Used in KnowledgeWeb to show that the state-of-the-art (SOA) tools supporting communities creating their own ontologies can be further advanced by suitable evaluation techniques, amongst other things.
6 Automatic Annotation Tools
- Semantic annotation is used to create metadata linking the text to one or more ontologies
- Enables us to combine and associate existing ontologies, to perform more detailed analysis of the text, and to extract deeper and more accurate knowledge
- Semantic annotation generally relies on ontology-based IE techniques
- Suitable evaluation metrics and tools for these new techniques are currently lacking
8 Requirements for Semantic Annotation Tools
- Expected functionality: level of automation, target domain, text size, speed
- Interoperability: ontology format, annotation format, platform, browser
- Usability: installation, documentation, ease of use, aesthetics
- Accessibility: flexibility of design, input and display alternatives
- Scalability: text and ontology size
- Reusability: range of applications
9 Performance Evaluation Metrics
- Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard
- Scoring program implements the metric and provides performance measures
  - for each document and over the entire corpus
  - for each type of annotation
  - may also evaluate changes over time
- A gold standard reference set also needs to be provided; this may be time-consuming to produce
- Visualisation tools show the results graphically and enable easy comparison
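A minimal sketch of such a scoring program, assuming annotations are represented as (start, end, type) tuples and that only exact matches count as correct. The function names, the data layout and the sample documents are illustrative assumptions, not part of GATE or of the KnowledgeWeb suite.

```python
from collections import defaultdict


def prf(correct, spurious, missing):
    """Return (precision, recall, F1) from raw counts."""
    responses = correct + spurious   # everything the system produced
    keys = correct + missing         # everything in the gold standard
    p = correct / responses if responses else 0.0
    r = correct / keys if keys else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def score_corpus(gold, system):
    """Score system output against the gold standard.

    gold / system: {doc_id: set of (start, end, annotation_type)}.
    Returns (per-document scores, per-type scores, corpus-level scores).
    """
    per_doc = {}
    per_type = defaultdict(lambda: [0, 0, 0])  # type -> [correct, spurious, missing]
    totals = [0, 0, 0]
    for doc_id in sorted(set(gold) | set(system)):
        key = gold.get(doc_id, set())
        resp = system.get(doc_id, set())
        buckets = (key & resp, resp - key, key - resp)
        per_doc[doc_id] = prf(*(len(b) for b in buckets))
        for i, bucket in enumerate(buckets):
            totals[i] += len(bucket)
            for (_, _, ann_type) in bucket:
                per_type[ann_type][i] += 1
    by_type = {t: prf(*c) for t, c in per_type.items()}
    return per_doc, by_type, prf(*totals)


# Hypothetical gold standard and system output for two documents.
gold = {
    "doc1": {(0, 5, "Person"), (10, 16, "Location")},
    "doc2": {(3, 9, "Person")},
}
system = {
    "doc1": {(0, 5, "Person"), (10, 16, "Person")},
    "doc2": {(3, 9, "Person"), (20, 25, "Location")},
}

per_doc, by_type, corpus = score_corpus(gold, system)
for doc_id, scores in per_doc.items():
    print(doc_id, scores)
print("corpus", corpus)
```

Note that the span (10, 16) in doc1 is counted once as spurious (wrong type) and once as missing. This strict right/wrong scoring is the behaviour that the later slides argue is too coarse for ontology-based IE.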
10 GATE AnnotationDiff Tool
11 Correct and incorrect instances attached to concepts
12 Evaluation of instances by source
13 Methods of evaluation
- Traditional IE is evaluated in terms of Precision, Recall and F-measure.
- But these are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious
- Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong
- Similarity metrics need to be integrated so that a wrong response scores higher the closer it is to the key in the hierarchy
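To illustrate the idea of hierarchy-sensitive scoring, the sketch below uses a Wu-Palmer-style similarity (twice the depth of the deepest common ancestor, divided by the summed depths of the two concepts). This is a stand-in chosen for brevity, not the metric proposed in these slides; the toy taxonomy and the choice to count depth in edges from the root are assumptions.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "ResearchAssistant": "Academic",
}


def path_to_root(concept):
    """Return the chain [concept, parent, ..., root]."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Number of edges between the concept and the root."""
    return len(path_to_root(concept)) - 1


def similarity(key, response):
    """Wu-Palmer-style similarity in [0, 1]; 1 means identical concepts."""
    response_path = path_to_root(response)
    # First concept on the key's chain that also subsumes the response.
    common = next(c for c in path_to_root(key) if c in response_path)
    total = depth(key) + depth(response)
    return 1.0 if total == 0 else 2 * depth(common) / total


print(similarity("Person", "Location"))             # share only the root
print(similarity("Lecturer", "ResearchAssistant"))  # siblings deep in the tree
```

In this toy hierarchy the Person/Location confusion scores 0, while the Lecturer/ResearchAssistant confusion scores about 0.67, matching the intuition that the second error is "not so wrong".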
14 Learning Accuracy
- LA [Hahn98] was originally defined to measure how well a concept had been added at the right level of the ontology, i.e. ontology generation
- Later used to measure how well an instance has been added in the right place in the ontology, i.e. ontology population.
- Main snag is that it doesn't consider the height of the Key concept, only the height of the Response concept.
- This also means that similarity is not bidirectional, which is intuitively wrong.
15 Balanced Distance Metric
- We propose BDM as an improvement over LA
- Considers the relative specificity of the taxonomic positions of the key and response
- Does not depend on the direction of this relative specificity: the Key can be a specific concept (e.g. 'car') and the Response a general concept (e.g. 'relation'), or vice versa.
- Distances are normalised with respect to the average length of chain
- This makes the penalty in terms of node traversal relative to the semantic density of the concepts in question
16 BDM: the metric
- BDM is calculated for all correct and partially correct responses
CP: distance from root to MSCA
DPK: distance from MSCA to Key
DPR: distance from MSCA to Response
n1: average length of the set of chains containing the key or the response concept, computed from the root concept.
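The formula itself is not reproduced in this transcript. The sketch below assumes the form BDM = (CP/n1) / (CP/n1 + DPK/n2 + DPR/n3), where MSCA is taken to be the most specific concept subsuming both key and response. The normalisers n2 and n3 (average chain lengths below the key and below the response) are not defined on this slide and are assumptions, as are the toy taxonomy and all function names. With every normaliser left at 1 the score reduces to CP / (CP + DPK + DPR).

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "ResearchAssistant": "Academic",
}


def path_to_root(concept):
    """Return the chain [concept, parent, ..., root]."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def distances(key, response):
    """Return (CP, DPK, DPR), all counted in edges."""
    key_path = path_to_root(key)
    response_path = path_to_root(response)
    # MSCA: first concept on the key's chain that also subsumes the response.
    msca = next(c for c in key_path if c in response_path)
    cp = len(path_to_root(msca)) - 1   # root to MSCA
    dpk = key_path.index(msca)         # MSCA to Key
    dpr = response_path.index(msca)    # MSCA to Response
    return cp, dpk, dpr


def bdm(cp, dpk, dpr, n1=1.0, n2=1.0, n3=1.0):
    """Assumed form of BDM; n2 and n3 are assumptions (see lead-in)."""
    if dpk == 0 and dpr == 0:
        return 1.0                     # key and response are the same concept
    shared = cp / n1
    return shared / (shared + dpk / n2 + dpr / n3)


print(bdm(*distances("Lecturer", "ResearchAssistant")))
print(bdm(*distances("Lecturer", "Person")))
print(bdm(*distances("Person", "Lecturer")))
```

Swapping key and response swaps DPK with DPR (and, under the assumed form, n2 with n3), so the score is unchanged. This is the bidirectional behaviour that LA lacks.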
17 Augmented Precision and Recall
BDM is integrated with traditional Precision and
Recall in the following way
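The formulas are not reproduced in this transcript. A minimal sketch, assuming the form AP = BDM / (n + Spurious) and AR = BDM / (n + Missing), where BDM is the sum of the per-pair BDM scores and n is the number of key-response pairs. Both the formulas and the function name are assumptions made for illustration.

```python
def augmented_pr(bdm_scores, spurious, missing):
    """Return (augmented precision, augmented recall).

    bdm_scores: one BDM value in [0, 1] per matched key-response pair.
    spurious:   responses with no corresponding key.
    missing:    keys with no corresponding response.
    """
    n = len(bdm_scores)
    total = sum(bdm_scores)
    ap = total / (n + spurious) if n + spurious else 0.0
    ar = total / (n + missing) if n + missing else 0.0
    return ap, ar


# Two exact matches, one partially correct response, 1 spurious, 2 missing.
print(augmented_pr([1.0, 1.0, 0.5], spurious=1, missing=2))

# With every BDM score equal to 1 this reduces to traditional P and R.
print(augmented_pr([1.0, 1.0, 1.0], spurious=1, missing=2))
```

Under this assumed form, a partially correct response contributes its BDM score instead of counting as fully right or fully wrong.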
18 Conclusions
- Semantic annotation evaluation requires
  - New metrics
  - Usability evaluation
  - Visualisation software
- Bioinformatics field is a good testbench, e.g. evaluation of protein name taggers
- Implementation in GATE
- Knowledge Web benchmarking suite for evaluating ontologies and ontology-based tools
19 A final thought on evaluation
We didn't underperform. You overexpected.