Literature Based Discovery - PowerPoint PPT Presentation

About This Presentation
Title:

Literature Based Discovery

Description:

Literature Based Discovery Dimitar Hristovski dimitar.hristovski_at_mf.uni-lj.si Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Literature Based Discovery


1
Literature Based Discovery
  • Dimitar Hristovski dimitar.hristovski_at_mf.uni-lj.si
  • Institute of Biomedical Informatics, Faculty of
    Medicine, University of Ljubljana, Ljubljana,
    Slovenia

2
Let me introduce myself: Research and Development
  • BS Biomedicina Slovenica database
  • Research Evaluation Decision Support System
  • Medical Information Systems
  • Surgical clinics
  • Genetic laboratory
  • Biochemical laboratory
  • Web User Behaviour Analysis
  • Data warehousing and OLAP

3
Motivation
  • Overspecialization
  • Information overload
  • Large databases
  • For many diseases the chromosomal region is known,
    but not the exact gene

4
Background
  • Literature-based discovery (Swanson)

New Relation?
Concept X (Disease)
Concepts Y (Pathological or Cell Function, ...)
Concepts Z (Genes)
5
Biomedical Discovery Support System (BITOLA)
  • Goal
  • discover potentially new relations (knowledge)
    between biomedical concepts
  • to be used as research idea generator and/or as
    an alternative way to search Medline
  • System user (researcher or intermediary)
  • interactively guides the discovery process
  • evaluates the proposed relations

6
Extending and Enhancing Literature Based
Discovery
  • Goal
  • Make literature based discovery more suitable for
    disease candidate gene discovery
  • Decrease the number of candidate relations
  • Method
  • Integrate background knowledge
  • Chromosomal location of diseases and genes
  • Gene expression location
  • Disease manifestation location

7
Usage Scenarios
  • For a disease with known chromosomal location,
    find a candidate gene
  • For a gene, find a disease that might be
    influenced
  • For a disease and gene found to be related by
    linkage study, find the mechanism of the relation
    (intermediate concepts should help)

8
System Overview
Knowledge Base
Concepts
Background Knowledge (Chromosomal Locations, ...)
Discovery Algorithm
Association Rules
User Interface
Knowledge Extraction
Databases (Medline, LocusLink, HUGO, OMIM, ...)
9
Databases
  • Medline: source of known relationships between
    biomedical concepts
  • Set of concepts
  • MeSH (Medical Subject Headings): controlled
    dictionary and thesaurus used for indexing and
    searching the Medline database
  • HUGO: official gene symbols, names and aliases
  • LocusLink: gene symbols, aliases and
    chromosomal locations
  • OMIM: genetic diseases
  • UMLS (Unified Medical Language System)
  • Entrez: used to search PubMed, GenBank, ...
  • UniGene: gene expression

10
Knowledge Extraction
  • Build master set of concepts (MeSH terms and gene
    symbols)
  • Extract occurrence of concepts from each Medline
    record (MeSH terms from MH field, gene symbols
    from Title and Abstract)
  • Association rule mining (concept co-occurrence)
  • Chromosomal location extraction (from LocusLink
    and HUGO)
  • Load into knowledge base
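
The co-occurrence step above can be sketched roughly as follows. This is a minimal illustration, not the BITOLA implementation: it assumes each Medline record has already been reduced to a set of concept names (MeSH terms plus detected gene symbols), and the function name and toy records are made up for the example.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(records):
    """Count concept frequencies and pairwise co-occurrences.

    `records` is a hypothetical input: one collection of concept
    names per Medline record.
    """
    concept_freq = Counter()
    pair_freq = Counter()
    for concepts in records:
        # Sort so that each unordered pair is always stored the same way.
        unique = sorted(set(concepts))
        concept_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return concept_freq, pair_freq


# Toy records, invented for illustration only.
records = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Optic Neuritis", "Interferon-beta"},
]
concept_freq, pair_freq = count_cooccurrences(records)
print(concept_freq.most_common(1))
print(pair_freq[("Multiple Sclerosis", "Optic Neuritis")])
```

Counting each pair once per record keeps the pair count equal to the number of records containing both concepts, which is the quantity the association rules need.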

11
Terminology Problems during Knowledge Extraction
  • Gene names
  • Gene symbols
  • MeSH and genetic diseases

12
Detected Gene Symbols by Frequency
  • type: 666548
  • II: 552584
  • III: 201776
  • component: 179643
  • CT: 175973
  • AT: 151337
  • ATP: 147357
  • IV: 123429
  • CD4: 99657
  • p53: 89357
  • MR: 88682
  • SD: 85889
  • GH: 84797
  • LPS: 68982
  • 59: 67272
  • E2: 64616
  • 82: 63521
  • AMP: 61862
  • TNF: 59343
  • RA: 58818
  • CD8: 57324
  • O2: 56847
  • ACTH: 54933
  • CO2: 53171
  • PKC: 51057
  • EGF: 50483
  • T3: 49632
  • MS: 46813
  • A2: 44896
  • ER: 43212
  • upstream: 41820
  • PRL: 41599

13
Gene Symbol Disambiguation
  • Find MEDLINE docs in which we can expect to find
    gene symbols
  • JD indexing (Susanne Humphrey) as a possible
    solution
  • Identifies the semantic context of docs
  • If the semantic context is not genetic, then the
    gene symbol is probably a false positive
  • Example of a false positive
  • Ethics in a twist "Life Support", BBC1. BMJ 1999
    Aug 7;319(7206):390
  • breast basic conserved 1 (BBC1) gene vs. the BBC1
    television station featuring the new drama series
    Life Support

14
JD Indexing
  • JDs are 127 Journal Descriptors (e.g., JDs for
    the journal Hum Mol Genet: Cytogenetics; Genetics,
    Medical)
  • Training set docs (435,000) inherit JDs from
    their journals
  • Training set provides co-occurrence data between
    inherited JDs and
  • indexing terms assigned to docs directly
  • words in docs
  • Docs having indexing terms/words that often occur
    with genetics JDs in the training set are assumed
    to have a genetics context
  • Extended to indexing by 134 UMLS semantic types
    (e.g., Gene or Genome, Gene Function, ...)
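
The idea of inheriting JDs from journals and then scoring a new document by its words can be sketched as below. This is a simplified illustration under stated assumptions, not the actual JD indexing method: the averaging formula, the function names, the toy documents and the "Broadcasting" descriptor are all invented for the example.

```python
from collections import Counter, defaultdict


def train(training_docs):
    """Collect word-to-JD co-occurrence counts.

    `training_docs` is a list of (words, jds) pairs, where `jds`
    are the descriptors the document inherits from its journal.
    """
    word_jd = defaultdict(Counter)
    word_total = Counter()
    for words, jds in training_docs:
        for word in set(words):
            word_total[word] += 1
            for jd in jds:
                word_jd[word][jd] += 1
    return word_jd, word_total


def jd_scores(words, word_jd, word_total):
    """Average, over the known words of a doc, the share of training
    docs containing that word that carried each JD (assumed scoring)."""
    known = [w for w in set(words) if w in word_total]
    if not known:
        return {}
    scores = Counter()
    for word in known:
        for jd, n in word_jd[word].items():
            scores[jd] += n / word_total[word]
    return {jd: total / len(known) for jd, total in scores.items()}


# Toy training set, invented for illustration only.
training_docs = [
    (["gene", "mutation", "locus"], ["Genetics, Medical"]),
    (["gene", "expression"], ["Genetics, Medical"]),
    (["television", "drama", "ethics"], ["Broadcasting"]),
]
word_jd, word_total = train(training_docs)

genetic_doc = jd_scores(["gene", "locus", "drama"], word_jd, word_total)
tv_doc = jd_scores(["television", "drama"], word_jd, word_total)
print(genetic_doc)
print(tv_doc)
```

A gene symbol found in a document like `tv_doc`, which gets no genetics score at all, would then be treated as a probable false positive.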

15
System Overview
Knowledge Base
Concepts
Background Knowledge (Chromosomal Locations, ...)
Discovery Algorithm
Association Rules
User Interface
Knowledge Extraction
Databases (Medline, LocusLink, HUGO, OMIM, ...)
16
Binary Association Rules
  • X → Y (confidence, support)
  • If X Then Y (confidence, support)
  • Confidence: % of docs containing Y within the X
    docs
  • Support: number (or %) of docs containing both X
    and Y
  • The nature of the relation between X and Y is not
    known.
  • Examples
  • Multiple Sclerosis → Optic Neuritis (2.02%, 117)
  • Multiple Sclerosis → Interferon-beta (5.17%, 300)
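
Confidence and support as defined above can be computed from document counts along these lines. This is a minimal sketch: the function name and the counts are invented, and deriving two directed rules from each co-occurring pair is an assumption of the sketch.

```python
def rules_from_counts(concept_freq, pair_freq):
    """Derive directed rules A -> B as (confidence in %, support).

    concept_freq[a] is the number of docs containing a;
    pair_freq[(a, b)] is the number of docs containing both.
    """
    rules = {}
    for (a, b), n_ab in pair_freq.items():
        # Confidence of A -> B: share of the A docs that also contain B.
        rules[(a, b)] = (100.0 * n_ab / concept_freq[a], n_ab)
        rules[(b, a)] = (100.0 * n_ab / concept_freq[b], n_ab)
    return rules


# Toy counts, invented for illustration only.
concept_freq = {"X": 200, "Y": 50}
pair_freq = {("X", "Y"): 5}
rules = rules_from_counts(concept_freq, pair_freq)
print(rules[("X", "Y")])  # confidence 2.5 %, support 5
print(rules[("Y", "X")])  # confidence 10.0 %, support 5
```

Support is the same in both directions, while confidence depends on which concept is on the left-hand side.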

17
Discovery Algorithm
Candidate Gene?
Concepts Y (Pathological or Cell Function, ...)
Concept X (Disease)
Concepts Z(Genes)
Chromosomal Region
Chromosomal Location
Match
Manifestation Location
Expression Location
Match
18
Discovery Algorithm
  • Let X be the starting concept of interest.
  • Find all Y for which X → Y.
  • Find all Z for which Y → Z.
  • Eliminate those Z for which X → Z already exists.
  • Eliminate those Z that do not match the
    chromosomal region of X
  • Eliminate those Z that do not match the
    expression location of X
  • Remaining Z are candidates for a new relation
    between X and Z.
  • In general
  • X → Y1 → ... → Yn → Z, but not X → Z
  • Example
  • X: disease
  • Y: (patho)physiology of X
  • Z: (de)regulators of Y (drugs, proteins, genes)
  • New relation example: Z is a candidate gene for
    disease X
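
The steps above can be sketched as a small filter over a rule base. This is an illustration under assumptions, not the system's code: the dictionary layout, the prefix match on cytogenetic bands, and the gene names GENE_A to GENE_K are all hypothetical.

```python
def discover(x, related, gene_location, gene_expression, x_region, x_tissue):
    """Return candidate Z concepts for X, with the linking Y concepts.

    related[c] is the set of concepts that have a rule c -> other.
    """
    known = related.get(x, set())
    candidates = {}
    for y in known:
        for z in related.get(y, set()):
            # Skip X itself and anything already directly related to X.
            if z == x or z in known:
                continue
            candidates.setdefault(z, set()).add(y)

    result = {}
    for z, via in candidates.items():
        # Assumed matching rule: the gene's band starts with X's region.
        if not gene_location.get(z, "").startswith(x_region):
            continue
        if x_tissue not in gene_expression.get(z, set()):
            continue
        result[z] = via
    return result


# Toy knowledge base, invented for illustration only.
related = {
    "Disease X": {"Cell Movement", "GENE_K"},
    "Cell Movement": {"GENE_A", "GENE_B", "GENE_C", "GENE_K", "Disease X"},
}
gene_location = {"GENE_A": "Xq28", "GENE_B": "Xq28",
                 "GENE_C": "7p21", "GENE_K": "Xq28"}
gene_expression = {"GENE_A": {"brain"}, "GENE_B": {"liver"},
                   "GENE_C": {"brain"}, "GENE_K": {"brain"}}

found = discover("Disease X", related, gene_location, gene_expression,
                 x_region="Xq28", x_tissue="brain")
print(found)
```

Keeping the linking Y concepts with each candidate lets the user inspect why a gene was proposed, in line with the interactive evaluation the slides describe.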

19
Ranking Concepts Z
20
Results: Concepts in Medline
  • Full Medline (end 2001) analyzed (11,226,520
    records)
  • Looking for 19,781 MeSH terms and 22,252 human
    genes (14,659 from HUGO and 7,593 from
    LocusLink). 24,613 alias gene symbols added
  • Gene symbols found in 2,689,958 Medline records.
    Most frequent: ambiguous symbols (CT, MR, CO2, ...)
    or format errors

21
Results: Co-occurring Concepts in Medline
  • 29,851,448 distinct pairs of co-occurring
    concepts
  • In 7,106,099 at least one gene symbol appeared
  • In 679,159 pairs both concepts are gene symbols
  • Total co-occurrence frequency 798,366,684
  • 59,702,986 association rules calculated and
    stored

22
Bilateral Perisylvian Polymicrogyria - BPP (OMIM
300388)
  • Polymicrogyria of the cerebral cortex is a
    developmental abnormality characterized by
    excessive surface convolution
  • Clinical characteristics
  • Mental retardation
  • Epilepsy
  • Pseudobulbar palsy (paralysis of the face,
    throat, tongue and the chewing process)
  • X-linked dominant inheritance

23
(No Transcript)
24
BPP - pathogenesis
  • It is considered a disorder of neuronal migration
    (unlayered type) or a consequence of intrauterine
    ischemia (layered type)

25
Finding Candidate Genes for Polymicrogyria,
bilateral perisylvian
26
237 genes in Xq28
relation between semantic types Cell Movement and
Gene or gene products
18 gene candidates
Sublocalisation in the Xq28
15 gene candidates
Tissue specific expression
2 gene candidates: L1CAM and FLNA
27
(No Transcript)
28
User Interface: cgi-bin version
29
Automatically search for supporting Medline
Citations
30
Cleft Palate: Predicting Candidate Genes
31
(No Transcript)
32
(No Transcript)
33
Summary and Conclusions
  • We extend and enhance an existing discovery
    support system (BITOLA)
  • The system can be used as
  • Research idea generator, or
  • Alternative method of searching Medline
  • Genetic knowledge about the chromosomal locations
    of diseases and genes is included to make BITOLA
    more suitable for disease candidate gene discovery

34
Further Work
  • Increase the number of concepts
  • Gene symbol disambiguation
  • Semantic relations extraction
  • System evaluation
  • Improve the Web version of the system

35
System Availability
  • URL www.mf.uni-lj.si/bitola/

36
Related work: SemGen (Tom Rindflesch et al.)
  • Extract semantic predications on the genetic basis
    of disease
  • "Deletions of INK4 occur in malignant tumors"
  • INK4 ASSOCIATED_WITH Malignant Tumors
  • Evaluation and visualization of SemGen output

37
Semantic Structures
<genetic phenomenon>
ETIOLOGY_OF
<disorder>
CAUSE
PREDISPOSE
ASSOCIATED_WITH
38
(No Transcript)
39
Statistical Evaluation
  • Assoc. rule base divided into 2 segments: older
    (1990-1995) and newer (1996-1999)
  • The system predicts new relations based on the
    older segment
  • Predictions compared with actual new relations in
    the newer segment
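
The time-sliced comparison above can be sketched as a set computation. This is a minimal illustration: representing relations as (X, Z) tuples and reporting precision and recall are assumptions of the sketch, and the toy pairs are invented; the slides do not state which measures were computed.

```python
def evaluate(predicted, old_pairs, new_pairs):
    """Compare predictions made from the older segment with relations
    that first appear in the newer segment.

    All arguments are sets of (X, Z) tuples.
    """
    truly_new = new_pairs - old_pairs
    correct = predicted & truly_new
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(truly_new) if truly_new else 0.0
    return precision, recall


# Toy relation sets, invented for illustration only.
old_pairs = {("X", "Y")}
new_pairs = {("X", "Y"), ("X", "A"), ("X", "E")}
predicted = {("X", "A"), ("X", "B"), ("X", "C"), ("X", "D")}

precision, recall = evaluate(predicted, old_pairs, new_pairs)
print(precision, recall)
```

Here precision corresponds to the (correct / all predictions) ratio, and recall to the share of actual new relations that were predicted.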

40
Summary Statistical Evaluation Results
41
Statistical Evaluation Results
  • With no assoc. rule constraints
  • predicts almost all new relations, but with too
    many candidate relations
  • With constraints
  • predicts new relations 6.9 times better than
    random predictions
  • the tighter the constraints, the better the
    (correct / all predictions) ratio (6.5)