Title: Driving the Terminology Hub
1Driving the Terminology Hub
- RDF Triplets as a means to express lexical and referential data.
- Therese Vachon, NIBR, Unit Head UltraLink Technologies
- W3C Workshop on RDF Access to Relational Databases
- 25-26 October, 2007, Boston, MA, USA
2Requirements
- Cross-linking database information (e.g. on genes, proteins, metabolic pathways, compounds and ligands) to the original sources is a key issue.
- Productivity in accessing, sharing, searching, navigating, cross-linking and analyzing internal and external data relevant to the pharmaceutical industry should be increased.
3Strategy
- At NIBR, we have been developing a semantic integration layer on top of knowledge resources; it has been implemented within various services and applications.
- It uses
  - A rich domain-specific terminology (biology, chemistry and medicine) containing 1.6 million terms
  - A Terminology Hub containing 8 GB of referential data (cross-references between data repositories)
- Using that knowledge, the scientist can access all data at hand with a single mouse click.
4Application Areas for Terminologies
- Categorization of documents (via associated taxonomies)
- Search for concepts
- Semantic expansion of queries using synonyms and related terms
- Identification and extraction of relevant concepts (e.g. targets, genes, diseases, products) from texts
- Annotation of textual data with controlled terms as referential anchors
- Construction of a semantic layer on top of information sources, allowing context-sensitive navigation (UltraLink)
5Application Areas for the Terminology Hub
- Coherent mapping between terminologies and coding systems (e.g. UniProt Accession Number for a protein)
- Coherent mapping between internal knowledge repositories (e.g. Biological Assays and Chemical Compounds)
- Coherent mapping between external knowledge repositories (e.g. HUGO and OMIM)
- Coherent mapping between internal and external knowledge repositories (e.g. Internal Project Code and Product Name)
6Activation Ultralink
[Screenshot with callouts: UltraLink plug-in icon; activation of the concept types frame]
8The Landscape of Knowledge - Rooting the
Ultralink in Data Sources/Terminologies
- The UltraLink makes use of a broad range of knowledge sources, both internal to Novartis and external. The linkage of these terminologies provides the routes along which you can navigate when using the UltraLink.
- The linkage between the resources is created automatically via a rule-based mapping procedure and manually by annotation. The latter is extremely important for connecting internal knowledge sources to each other and to external ones.
- The annotations built on the fly by the UltraLink could be stored as RDF annotations associated with a document and accessed by other computer programs, in the spirit of the Semantic Web.
10Underlying terminologies used at NIBR
- > 15,000 Companies with > 35,000 terms
- > 2,000 Diseases with > 19,000 terms
- > 150,000 Genes with about 400,000 terms
- > 5,000 Modes of Action with > 12,000 terms
- > 95,000 Products with > 380,000 terms
- > 170,000 Targets with > 250,000 terms
- > 310,000 Species with > 435,000 terms
- Complete MeSH and EMTREE
- More than 1,600,000 terms
- The terminology consists of terms and relations between terms (main entry normalized terms, synonyms, broader terms, narrower terms)
11Principles used for the construction of the
terminology and organization of terms
- To create the reference terminology, terms are extracted from available terminologies (e.g. UniProt, EntrezGene, HGNC) and the references to the source systems are preserved.
- Terms specific to a database are referred to as local terms. These local terms are stored in a dedicated data structure, the Metastore. Besides the flat set of terms, thesaurus relations such as synonymy, broader term and narrower term are extracted as well, which makes it possible to build a thesaurus.
- For each entry in the terminology (e.g. a gene name or a product), one term is chosen from the list of synonyms and declared the normalized term.
- Normalized / global terms, synonyms / local terms, and broader and narrower terms, together with their sources of reference, constitute the terminology content behind the UltraLink and are used by the Terminology Hub.
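The organization described above can be sketched as a small in-memory data structure. This is only an illustration, not the Metastore's actual schema: the class names `Entry` and `Terminology`, the source labels `source_a` and `source_b`, and the choice of "imatinib" as the normalized term are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    """One terminology entry: a normalized term plus its local variants."""
    normalized: str
    concept_type: str
    # synonym -> names of the source systems it was extracted from
    synonyms: Dict[str, Set[str]] = field(default_factory=dict)
    broader: Set[str] = field(default_factory=set)
    narrower: Set[str] = field(default_factory=set)


class Terminology:
    """Hypothetical lookup layer over a set of entries."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}
        self._index: Dict[str, str] = {}  # lower-cased term -> normalized term

    def add_entry(self, entry: Entry) -> None:
        self._entries[entry.normalized] = entry
        self._index[entry.normalized.lower()] = entry.normalized
        for synonym in entry.synonyms:
            self._index[synonym.lower()] = entry.normalized

    def normalize(self, term: str) -> Optional[str]:
        """Map any known term (case-insensitive) to its normalized term."""
        return self._index.get(term.lower())

    def synonyms(self, term: str) -> List[str]:
        """Return all synonyms of the entry that `term` belongs to."""
        normalized = self.normalize(term)
        if normalized is None:
            return []
        return sorted(self._entries[normalized].synonyms)


def build_example() -> Terminology:
    # Illustrative data only; the source names are placeholders.
    terminology = Terminology()
    terminology.add_entry(Entry(
        normalized="imatinib",
        concept_type="Product",
        synonyms={
            "Glivec": {"source_a"},
            "Gleevec": {"source_b"},
            "STI571": {"source_a", "source_b"},
        },
        broader={"protein kinase inhibitor"},
    ))
    return terminology
```

With this sketch, `build_example().normalize("glivec")` resolves a local term to the normalized term, and each synonym keeps the set of sources it came from.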
12Creating References: the Terminology Hub
- Different knowledge repositories have different ways to encode a concept
  - Registry Number
  - Unique Internal ID
  - Concept Identifier
  - Enumerating terms
  - Just using different terms without any constraints
- More than 8 GB of cross-referencing information
- Searching for a term T in both source A and source B may lead to different results because of different naming/referencing conventions (false negatives in IR)
- Terminology Hub ensures coherent mapping
  - Between coding systems
  - Between different representation levels (e.g. ID vs. Concept)
  - Between local terms and global terms
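The false-negative problem and the mapping between coding systems can be illustrated with a minimal sketch. The class `TerminologyHub`, its methods, the repository names `source_a` and `source_b`, and the concept identifier are hypothetical; the real hub is a relational database whose schema is not shown here.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


class TerminologyHub:
    """Maps (coding system, code) pairs to a shared concept and back."""

    def __init__(self) -> None:
        self._concept_of: Dict[Tuple[str, str], str] = {}
        self._codes: Dict[str, Dict[str, str]] = defaultdict(dict)

    def link(self, concept: str, system: str, code: str) -> None:
        """Register that `code` in `system` refers to `concept`."""
        self._concept_of[(system, code.lower())] = concept
        self._codes[concept][system] = code

    def translate(self, system: str, code: str, target: str) -> Optional[str]:
        """Return the code used for the same concept in `target`, if any."""
        concept = self._concept_of.get((system, code.lower()))
        if concept is None:
            return None
        return self._codes[concept].get(target)


# Two hypothetical repositories that name the same compound differently.
source_a = {"FTY720": "assay record"}
source_b = {"fingolimod": "literature record"}

hub = TerminologyHub()
hub.link("concept:fingolimod", "source_a", "FTY720")
hub.link("concept:fingolimod", "source_b", "fingolimod")

# Searching source B with source A's code misses the record (false negative).
naive_hit = source_b.get("FTY720")

# Translating the code through the hub first recovers it.
mapped = hub.translate("source_a", "FTY720", "source_b")
hub_hit = source_b.get(mapped) if mapped else None
```

The design choice here is to route every code through a shared concept rather than storing pairwise mappings, so adding a new repository needs one link per concept instead of one per existing repository.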
13Classes of objects covered by the Terminology Hub
- Coding systems
  - A coding system provides a predefined set of (sometimes hierarchical) codes to represent a classification, a nomenclature, a controlled vocabulary, a thesaurus or chemical structures. For example, you can use the MeSH Tree number C06.405.205.697 to refer to Gastritis in a specific sub-tree of MeSH.
- References
  - Unique and unequivocal identifiers based on a coding system create references in their corresponding data repository. By nature, they are technical artifacts and not part of our scientific natural language (e.g. FTY720); nevertheless, most of them deserve to be identified because they are used in the scientific literature.
- Pointers and cross-referencing information
  - The Metastore contains pointers that make it possible to cross-reference knowledge sources and applications.
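To make the notion of a hierarchical code concrete: in a MeSH tree number, each dot-separated segment adds one level, so the broader positions in the tree can be read off the code itself. The helper names below are illustrative and not part of any MeSH tooling.

```python
from typing import List


def tree_number_ancestors(tree_number: str) -> List[str]:
    """Return the tree numbers of all broader nodes, from the top down."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(tree_number: str, ancestor: str) -> bool:
    """True if `tree_number` lies strictly below `ancestor` in the tree."""
    return tree_number.startswith(ancestor + ".")


gastritis = "C06.405.205.697"
print(tree_number_ancestors(gastritis))
# -> ['C06', 'C06.405', 'C06.405.205']
```

Because the hierarchy is encoded in the code, a query for everything under `C06.405` reduces to a prefix test such as `is_descendant(code, "C06.405")`.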
14Classes of objects covered by the Terminology Hub
- Terms
  - A term is the smallest meaningful linguistic unit on which our domains of discourse (biology, chemistry, medicine) are based. A term differs from a word because a term can consist of multiple meaningful words, such as chronic obstructive pulmonary disease.
- Concepts
  - A concept is an abstraction based on properties of individuals that we observe in the world. Individuals that belong to the same concept share a set of common properties. For example, targets share the property that they should be druggable.
- Data Repositories, also named Knowledge Sources
  - For all kinds of different data, we use the general notion of a data repository. By using the term data repository we emphasize that there is a source where some data resides, without making any commitment about physical representation (e.g. database or text file) or format of representation (e.g. structured or free text).
15Classes of objects covered by the Terminology Hub
[Diagram of the object classes and the relations between them]
- Terms: spinal cord, vascular endothelial growth factor, CCR5, Glivec, ovarian cancer, Novartis, Cytomegalovirus, ...
- Encoding: IUPAC, Structures, IDs, GIF, Symbols, Formulas, Registry Numbers, ...
- Data Repositories: Internal Chemistry DB, CI sources, Literature, Patents, ...
- Reference: Compound nos, Project codes, Competitor codes, PMID 9683255, EntrezGene 450128, CAS 439-14-5, Patent numbers
- Concepts: Species, Products, Companies, Diseases, Genes, Targets, Mammalian Genes, ...
- Relation labels: synonym-of, broader, narrower, encodes, has-type, points-to, is-a
16Achievements and Improvements
- All information about terminologies and cross-references is stored in a relational database (Oracle 10.2.0.2).
- The data in the database can be accessed through Web Services that allow users to find normalized terms, pointers for a specific concept type, etc.
17Metastore Web Service: Get all synonyms for a normalized form
18UltraLink Web Services: Get all accessible pointer types for a normalized form
19Achievements and Improvements
- We intend to improve the semantic representation of the data in order to facilitate reuse, interoperability and exchange.
- RDF notation and RDF coding standards provide an adequate means for a richer semantic representation.
- We use SKOS, Dublin Core and other RDF-based coding standards and supplement them with our own RDF vocabulary.
20Simple Knowledge Organisation System (example)
21Terminology for Diseases (SKOS fragment)
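The slide's own SKOS fragment is not reproduced here. As a minimal sketch of what such a fragment could look like, the following Python emits Turtle for one disease concept, assuming that normalized terms map to `skos:prefLabel`, synonyms to `skos:altLabel`, and broader/narrower terms to `skos:broader`/`skos:narrower`. The `disease:` namespace, the local identifiers and the function names are hypothetical; only the SKOS namespace and property names are standard.

```python
from typing import Iterable

# Standard SKOS namespace plus a hypothetical namespace for disease terms.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix disease: <http://example.org/terminology/disease/> .\n"
)


def turtle_literal(text: str, lang: str = "en") -> str:
    """Quote a label as a language-tagged Turtle literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def concept_to_turtle(
    local_id: str,
    pref_label: str,
    alt_labels: Iterable[str] = (),
    broader: Iterable[str] = (),
    narrower: Iterable[str] = (),
) -> str:
    """Serialize one concept as a Turtle subject block."""
    pairs = [
        ("a", "skos:Concept"),
        ("skos:prefLabel", turtle_literal(pref_label)),
    ]
    pairs += [("skos:altLabel", turtle_literal(label)) for label in alt_labels]
    pairs += [("skos:broader", f"disease:{b}") for b in broader]
    pairs += [("skos:narrower", f"disease:{n}") for n in narrower]
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in pairs)
    return f"disease:{local_id}\n{body} ."


# Illustrative data only.
fragment = PREFIXES + "\n" + concept_to_turtle(
    "copd",
    "chronic obstructive pulmonary disease",
    alt_labels=["COPD"],
    broader=["lung_disease"],
)
print(fragment)
```

String templating keeps the sketch free of third-party dependencies; a production workflow would more likely use an RDF library to build and serialize the graph.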
22Converting Terminologies to RDF
- Clear separation of terminologies from ontologies. We assign a type (rdf:type) to the URI of a term as a reference to a concept in an ontology.
- Conversion to RDF increased the amount of data roughly by a factor of 3.
- We obtained more than 5 million RDF triples as a preliminary representation of our terminologies.
- We are currently setting up the entire workflow for generating, storing and querying RDF.
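The conversion step can be sketched as follows: one relational row for a term becomes several triples, with `rdf:type` linking the term URI to a concept in a separate ontology. The row layout, the `http://example.org/...` namespaces and the function names are assumptions for illustration; the actual Oracle schema and RDF vocabulary are not described in the slides.

```python
from typing import Dict, List, NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
TERM_NS = "http://example.org/term/"      # hypothetical namespace for terms
ONTO_NS = "http://example.org/ontology/"  # hypothetical namespace for concepts


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    literal: bool  # True if obj is a plain literal, False if it is a URI


def row_to_triples(row: Dict) -> List[Triple]:
    """Turn one term row into triples; the type points into the ontology."""
    subject = TERM_NS + str(row["id"])
    triples = [
        Triple(subject, RDF_TYPE, ONTO_NS + row["concept_type"], False),
        Triple(subject, SKOS_PREF_LABEL, row["normalized"], True),
    ]
    triples += [Triple(subject, SKOS_ALT_LABEL, s, True) for s in row["synonyms"]]
    return triples


def to_ntriples(triples: List[Triple]) -> List[str]:
    """Serialize triples as N-Triples lines."""
    lines = []
    for t in triples:
        if t.literal:
            escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        else:
            obj = f"<{t.obj}>"
        lines.append(f"<{t.subject}> <{t.predicate}> {obj} .")
    return lines


# Illustrative row; the column names are assumptions.
row = {"id": 1, "concept_type": "Gene", "normalized": "CCR5",
       "synonyms": ["CD195", "CMKBR5"]}
for line in to_ntriples(row_to_triples(row)):
    print(line)
```

A single row with two synonyms already yields four triples, each repeating the full subject and predicate URIs, which gives some intuition for why an RDF serialization can be noticeably larger than the relational original.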
23Conclusion
- The first phase of transforming the terminology to RDF/XML is completed.
- We are currently developing a model for representing the Terminology Hub in RDF. We expect that an RDF notation of the Terminology Hub will comprise approximately 50 million RDF triples.
- We intend to test the framework thoroughly (performance, effective semantic gain compared to the current technology).
- Closer collaboration with the W3C Healthcare group
24Acknowledgements
Thanks to the ULT team
- Semantic Text Analytics Layer: Martin Romacker, Pierre Parisot, Nicolas Grandjean
- Data Integration Services Layer: Alexander Fromm, Laurent Mentek
- Application Layer: Daniel Cronenberger, Olivier Kreim
Thanks to Manuel Peitsch