Driving the Terminology Hub


1
Driving the Terminology Hub
  • RDF Triplets as a means to express lexical and
    referential data.
  • Therese Vachon, NIBR, Unit Head UltraLink
    Technologies
  • W3C Workshop on RDF Access to Relational
    Databases
  • 25-26 October 2007, Boston, MA, USA

2
Requirements
  • Cross-linking of database information (e.g. on
    genes, proteins, metabolic pathways, compounds
    and ligands) to the original sources is a key
    issue.
  • Productivity in accessing, sharing, searching,
    navigating, cross-linking and analyzing internal
    and external data relevant for the
    pharmaceutical industry should be increased.

3
Strategy
  • In NIBR, we have been developing a semantic
    integration layer on top of knowledge resources
    that has been implemented within various services
    and applications.
  • It uses:
  • A rich domain-specific terminology (biology,
    chemistry and medicine) containing 1.6 million
    terms
  • A Terminology Hub containing 8 GB of referential
    data (cross-references between data
    repositories)
  • Using that knowledge, the scientist can access
    all data at hand with just a single mouse-click.

4
Application Areas for Terminologies
  • Categorization of documents (via associated
    taxonomies)
  • Search for concepts
  • Semantic expansion of queries using synonyms and
    related terms
  • Identification and extraction of relevant
    concepts (like e.g. targets, genes, diseases,
    products) from texts
  • Annotation of textual data with controlled terms
    as referential anchors
  • Construction of a semantic layer on top of
    information sources, allowing context-sensitive
    navigation (UltraLink)
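
The semantic query expansion listed above can be sketched in a few lines of Python. This is only an illustration, not the NIBR implementation: the miniature thesaurus, the function name `expand_query` and the quoted OR-query syntax are all assumptions made for the example.

```python
# Hypothetical miniature thesaurus: normalized term -> list of synonyms.
# The entries are illustrative and not taken from the NIBR terminology.
THESAURUS = {
    "vascular endothelial growth factor": ["VEGF"],
    "chronic obstructive pulmonary disease": ["COPD"],
}


def expand_query(term: str) -> str:
    """Return an OR-query covering the normalized term and all its synonyms."""
    wanted = term.casefold()
    for normalized, synonyms in THESAURUS.items():
        variants = [normalized, *synonyms]
        # Match case-insensitively against the normalized form and synonyms.
        if wanted in (v.casefold() for v in variants):
            return " OR ".join(f'"{v}"' for v in variants)
    # Unknown term: search for it unchanged.
    return f'"{term}"'


print(expand_query("copd"))
```

A query for a synonym is thus rewritten to cover every known variant, while a term that is not in the thesaurus passes through unchanged.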

5
Application Areas for the Terminology Hub
  • Coherent mapping between Terminologies and Coding
    Systems (e.g. UniProt Accession Number for a
    Protein)
  • Coherent mapping between internal knowledge
    repositories (e.g. Biological Assays and Chemical
    Compounds)
  • Coherent mapping between external knowledge
    repositories (e.g. HUGO and OMIM)
  • Coherent mapping between internal and external
    knowledge repositories (e.g. Internal Project
    Code and Product Name)

6
Activating the UltraLink
[Screenshots: activation via the UltraLink plug-in
icon; (2) activation of the Concept Types frame]
8
The Landscape of Knowledge - Rooting the
Ultralink in Data Sources/Terminologies
  • The UltraLink makes use of a broad range of
    knowledge sources, both internal to Novartis and
    external. The linkage of these terminologies
    provides the routes along which you can navigate
    when using the UltraLink.
  • The linkage between the resources is created
    automatically via a rule-based mapping procedure
    and manually by annotation. The latter is
    extremely important for connecting internal
    knowledge sources to each other and to external
    ones.
  • The annotations built on the fly by the UltraLink
    could be stored as RDF annotations associated
    with a document and accessed by other computer
    programs, in the spirit of the Semantic Web.

9
The Landscape of Knowledge - Rooting the
Ultralink in Data Sources/Terminologies
10
Underlying terminologies used at NIBR
  • > 15,000 Companies with > 35,000 terms
  • > 2,000 Diseases with > 19,000 terms
  • > 150,000 Genes with about 400,000 terms
  • > 5,000 Modes of Action with > 12,000 terms
  • > 95,000 Products with > 380,000 terms
  • > 170,000 Targets with > 250,000 terms
  • > 310,000 Species with > 435,000 terms
  • Complete MeSH and EMTREE
  • More than 1,600,000 terms
  • The terminology consists of terms, and relations
    between terms (main entry normalized terms,
    synonyms, broader terms, narrower terms)

11
Principles used for the construction of the
terminology and organization of terms
  • To create the terminology of reference, terms
    are extracted from available terminologies
    (e.g. UniProt, EntrezGene, HGNC) and the
    references to the source systems are preserved.
  • Terms specific to a database are referred to as
    local terms. These local terms are stored in a
    dedicated data structure, the Metastore. Besides
    the flat set of terms, thesaurus relations such
    as synonymy, broader term and narrower term are
    extracted as well, which makes it possible to
    build a thesaurus.
  • For each entry in the terminology (e.g. a gene
    name or a product), one term is chosen from the
    list of synonyms and declared the normalized
    term.
  • Normalized / global terms, synonyms / local terms
    as well as broader and narrower terms together
    with their sources of reference constitute the
    terminology content behind the UltraLink and are
    used by the Terminology Hub.
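
One way to picture a terminology entry as described above is the following minimal Python sketch. The class and field names (`LocalTerm`, `Entry`, `synonyms`, `sources_of`) are hypothetical and do not reflect the actual Metastore schema; the sample terms are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LocalTerm:
    text: str    # the term as it appears in the source terminology
    source: str  # the source system it was extracted from, e.g. "HGNC"


@dataclass
class Entry:
    normalized: str  # the term chosen as the normalized (global) term
    local_terms: list[LocalTerm] = field(default_factory=list)
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)

    def synonyms(self) -> list[str]:
        """Distinct local terms other than the normalized term, in input order."""
        seen, result = {self.normalized}, []
        for local in self.local_terms:
            if local.text not in seen:
                seen.add(local.text)
                result.append(local.text)
        return result

    def sources_of(self, text: str) -> list[str]:
        """Source systems in which the given term text occurs."""
        return [local.source for local in self.local_terms if local.text == text]


# Illustrative entry; the choice of terms and sources is an assumption.
entry = Entry(
    normalized="CCR5",
    local_terms=[
        LocalTerm("CCR5", "HGNC"),
        LocalTerm("C-C chemokine receptor type 5", "UniProt"),
        LocalTerm("CCR5", "EntrezGene"),
    ],
)
print(entry.synonyms(), entry.sources_of("CCR5"))
```

Keeping the source next to every local term preserves the reference to the originating system, which is the property the slide emphasizes.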

12
Creating Reference: the Terminology Hub
  • Different knowledge repositories have different
    ways to encode a concept:
      • Registry Number
      • Unique Internal ID
      • Concept Identifier
      • Enumerating terms
      • Just using different terms without any
        constraints
  • Searching for a term T in both source A and
    source B may lead to different results because
    of different naming/referencing conventions
    (false negatives in IR).
  • The Terminology Hub holds more than 8 GB of
    cross-referencing information and ensures
    coherent mapping:
      • Between coding systems
      • Between different representation levels
        (e.g. ID vs. Concept)
      • Between local terms and global terms
13
Classes of objects covered by the Terminology Hub
  • Coding systems
      • A coding system provides a predefined set of
        (sometimes hierarchical) codes to represent a
        classification, a nomenclature, a controlled
        vocabulary, a thesaurus or chemical
        structures. For example, you can use the MeSH
        tree number C06.405.205.697 to refer to
        Gastritis in a specific sub-tree of MeSH.
  • References
      • Unique and unequivocal identifiers based on a
        coding system create references in their
        corresponding data repository. By nature,
        they are technical artifacts and not part of
        our scientific natural language (e.g.
        FTY720). Nevertheless, most of them deserve
        to be identified because they are used in the
        scientific literature.
  • Pointers and cross-referencing information
      • The Metastore contains pointers that allow
        knowledge sources and applications to be
        cross-referenced.
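
To illustrate the hierarchical coding system mentioned above: in a dotted code such as the MeSH tree number C06.405.205.697, the broader positions in the tree are obtained by dropping trailing segments. A minimal Python sketch (the function name `broader_codes` is hypothetical):

```python
def broader_codes(tree_number: str) -> list[str]:
    """List the ancestor codes of a dotted hierarchical code, nearest first."""
    parts = tree_number.split(".")
    # Drop one trailing segment at a time until only the first segment is left.
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


print(broader_codes("C06.405.205.697"))
# → ['C06.405.205', 'C06.405', 'C06']
```

Because the hierarchy is encoded in the code itself, broader terms can be found without a separate lookup table.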

14
Classes of objects covered by the Terminology Hub
  • Terms
      • A term is the smallest meaningful linguistic
        unit on which our domains of discourse
        (biology, chemistry, medicine) are based. A
        term differs from a word because a term can
        consist of multiple meaningful words, such as
        chronic obstructive pulmonary disease.
  • Concepts
      • A concept is an abstraction based on
        properties of individuals that we observe in
        the world. Individuals that belong to the
        same concept share a set of common
        properties. For example, targets share the
        property that they should be druggable.
  • Data Repositories, also named Knowledge Sources
      • For all kinds of different data, we use the
        general notion of a data repository. The term
        emphasizes that there is a source where some
        data resides, without making any commitment
        about physical representation (e.g. database
        or text file) or format of representation
        (e.g. structured or free text).

15
Classes of objects covered by the Terminology Hub
[Diagram: classes of objects and the relations
between them]
  • Terms: spinal cord, vascular endothelial growth
    factor, CCR5, Glivec, ovarian cancer, Novartis,
    Cytomegalovirus, ...
  • Encoding: IUPAC, Structures, IDs, GIF, Symbols,
    Formulas, Registry Numbers, ...
  • Data Repositories: Internal Chemistry DB, CI
    sources, Literature, Patents, ...
  • Reference: Compound nos, Project codes,
    Competitor codes, PMID 9683255, EntrezGene
    450128, CAS 439-14-5, Patent numbers
  • Concepts: Species, Products, Companies, Diseases,
    Genes, Targets, Mammalian Genes, ...
  • Relation labels: synonym-of, broader, narrower,
    encodes, has-type, points-to, is-a
16
Achievements and Improvements
  • All information about terminologies and
    cross-references is stored in a relational
    database (Oracle 10.2.0.2).
  • The data in the database can be accessed through
    Web Services that allow users to find normalized
    terms, pointers for a specific concept type, etc.

17
Metastore Web Service: Get all synonyms for a
normalized form
18
UltraLink Web Services: Get all accessible pointer
types for a normalized form
19
Achievements and Improvements
  • We intend to improve the semantic representation
    of the data in order to facilitate reuse,
    interoperability and exchange.
  • RDF notation and RDF coding standards provide an
    adequate means for a richer semantic
    representation.
  • We use SKOS, Dublin Core and other RDF-based
    coding standards and supplement them with our own
    RDF vocabulary.

20
Simple Knowledge Organization System (SKOS) example
21
Terminology for Diseases (SKOS fragment)
22
Converting Terminologies to RDF
  • Clear separation of terminologies from
    ontologies. We assign a type (rdf:type) to the
    URI of a term as a reference to a concept in an
    ontology.
  • Conversion to RDF increased the amount of data
    roughly by a factor of 3.
  • We obtained more than 5 million RDF triplets as a
    preliminary representation of our terminologies.
  • We are currently setting up the entire workflow
    for generating, storing and querying RDF.
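
The conversion described above can be sketched as follows: each term gets a URI, an rdf:type pointing to a concept in an ontology, and SKOS properties for its labels and broader terms. The rdf:type and SKOS property URIs are the standard W3C ones; the `example.org` namespaces, the function names and the sample entry are assumptions for illustration, not the vocabulary actually used at NIBR.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/terminology/"  # hypothetical term namespace
ONTO = "http://example.org/ontology/"     # hypothetical ontology namespace


def term_uri(term: str) -> str:
    """Build a URI for a term (spaces become underscores, the rest is escaped)."""
    return BASE + quote(term.replace(" ", "_"))


def literal(text: str) -> str:
    """Serialize a string as an English-tagged N-Triples literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def entry_to_triples(normalized, concept_type, synonyms=(), broader=()):
    """Convert one terminology entry into (subject, predicate, object) triples."""
    subject = f"<{term_uri(normalized)}>"
    triples = [
        # The type links the term to a concept in a separate ontology.
        (subject, f"<{RDF_TYPE}>", f"<{ONTO}{concept_type}>"),
        (subject, f"<{SKOS}prefLabel>", literal(normalized)),
    ]
    triples += [(subject, f"<{SKOS}altLabel>", literal(s)) for s in synonyms]
    triples += [(subject, f"<{SKOS}broader>", f"<{term_uri(b)}>") for b in broader]
    return triples


def to_ntriples(triples) -> str:
    """Serialize triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Illustrative entry: one normalized term, one synonym, one broader term.
print(to_ntriples(entry_to_triples(
    "chronic obstructive pulmonary disease", "Disease",
    synonyms=["COPD"], broader=["lung disease"])))
```

Even this small entry yields four triples from a single terminology record, which gives a feel for why the RDF representation is larger than the source data.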

23
Conclusion
  • The first phase of transforming the terminology
    to RDF/XML is completed.
  • We are currently developing a model for
    representing the Terminology Hub in RDF. We
    expect that an RDF notation of the Terminology
    Hub will comprise approximately 50 million RDF
    triples.
  • We intend to test the framework thoroughly
    (performance, effective semantic gain compared to
    the current technology).
  • Closer collaboration with the W3C Healthcare
    group

24
Acknowledgements
Thanks to the ULT team
  • Semantic Text Analytics Layer: Martin Romacker,
    Pierre Parisot, Nicolas Grandjean
  • Data Integration Services Layer: Alexander Fromm,
    Laurent Mentek
  • Application Layer: Daniel Cronenberger, Olivier
    Kreim
Thanks to Manuel Peitsch