Mining Medical Literature - PowerPoint PPT Presentation

1
Mining Medical Literature

Vignesh Ganapathy
(CS 374 Algorithms in Biology)
(FALL 2005)

2
Outline

Introduction and Background
Mining Technique 1
Identifying Functionally Coherent Gene Groups
Mining Technique 2
Extracting Synonymous gene and protein terms
Conclusions

4
Introduction

Medical Literature has vast amounts of knowledge
and information
PubMed Central (PMC), the U.S. National
Institutes of Health (NIH) free digital archive
of biomedical and life sciences journal
literature
Amedeo.com (The Medical Literature Guide)
Journals like Science, Nature, Cell, EMBO, Cell
Biology, PNAS
(and many more...)

5
The Problem

The major task is finding ways to extract useful
information from these resources.

6
What is Data Mining?

Data mining is the process of discovering
meaningful new correlations, patterns and trends
by sifting through large amounts of data stored in
repositories, using pattern recognition
techniques as well as statistical and
mathematical techniques.

7
Example Data!

Large amounts of data but no information
Daily transactions at a supermarket
Daily website visit histories
Books/videos rented at a Library
Newspaper, Journal archives

8
Amazon.com
9
Google News

Clustering News items (Google News)

10
More Applications

Improving Sales strategy
Finding items that sell together
(a common example is beer and diapers being
related: a supermarket found that 50% of the
time beer was purchased together with diapers)
Anomaly Detection and many more
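
The "items that sell together" idea can be illustrated with a small sketch. It computes the confidence of a rule such as beer -> diapers, that is, the fraction of baskets containing beer that also contain diapers. The function name and the toy baskets are hypothetical and are not taken from the presentation.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [basket for basket in transactions if antecedent in basket]
    if not with_antecedent:
        return 0.0
    both = sum(1 for basket in with_antecedent if consequent in basket)
    return both / len(with_antecedent)


# Toy data, invented for illustration only.
baskets = [
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk"},
    {"beer", "diapers", "milk"},
    {"beer"},
]
beer_to_diapers = rule_confidence(baskets, "beer", "diapers")
print(beer_to_diapers)
```

A high confidence alone does not show that the rule is interesting; in practice it is usually read together with how often each item is bought overall.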

11
Information Retrieval (IR)

Collecting information from text data
(Unstructured Data)
Applications
Search web documents
Natural Language Processing
Term also extends to include multimedia or other
forms of unstructured data

12
Simple flow of Retrieval Process
13
IR System Evaluation

Some measures are
Precision
Recall
F1 measure: a combined measure, the (equally
weighted) harmonic mean of precision and recall
Sensitivity
Specificity
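
As a rough illustration of the first three measures, the sketch below computes precision, recall and the F measure from a set of retrieved documents and a set of relevant documents. The function name, the `beta` parameter and the document ids are assumptions made for this example; with `beta=1` the result is the F1 measure.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical query result: 4 documents returned, 3 documents truly relevant.
p, r, f1 = precision_recall_f(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r, f1)
```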

14
Precision and Recall

How are Precision and Recall related?

15
Problems with Precision and Recall

Deciding which documents are relevant and which
are not is not easy
For recall, it is difficult to measure the number
of relevant documents in the database
Creating a pool of relevant records is one solution
In practice, these are still good measures

16
Sensitivity and Specificity

Sensitivity: probability that a positive example
is classified as positive
Specificity: probability that a negative example
is classified as negative
What is the relation between Sensitivity,
Specificity, Precision and Recall?
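
One way to approach the question is to write all four measures in terms of a confusion matrix. The sketch below assumes the usual counts of true/false positives and negatives; the function name and the numbers are made up. It shows that sensitivity and recall are the same quantity, while precision also depends on the false positives and specificity has no direct counterpart among the retrieval measures.

```python
def confusion_measures(tp, fp, fn, tn):
    """Compute the four measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # positives correctly found
        "specificity": tn / (tn + fp),  # negatives correctly rejected
        "precision": tp / (tp + fp),    # returned items that are correct
        "recall": tp / (tp + fn),       # same formula as sensitivity
    }


# Hypothetical counts for a collection of 1000 documents.
m = confusion_measures(tp=40, fp=10, fn=20, tn=930)
print(m)
```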

17
Outline

Introduction and Background
Mining Technique 1
Identifying Functionally Coherent Gene Groups
Mining Technique 2
Extracting Synonymous gene and protein terms
Conclusion

18
Introduction

Analysis is shifting from single genes to families
of genes
Examples of these are
Sequence Data
Gene Expression Clustering
Deletion Phenotypes
Yeast-2-Hybrid screens

19
HOVERGEN a Database of Homologous Vertebrate
Genes

Useful for comparative sequence analysis, or
molecular evolution studies

10 biggest gene families
20
Why identify functional gene groups?

Interesting to know functionally relevant groups
for large gene group sets
Helps to assess the significance of
experimentally derived gene sets
Refine gene groups to find more functionally
relevant groups
Existing algorithms can make use of this
information in finding gene groups

21
Existing Approaches

Use of co-occurrence of gene names in abstracts
to create networks of related genes automatically
Use of an existing vocabulary of gene functions
and assigned genes to decide on a functionally
relevant group
(Gene Ontology (GO) consortium and Munich
Information Center for Protein Sequences (MIPS))

22
Statistical NLP approach

Used for annotating individual genes
Determining gene and protein interactions
Assigning keywords to genes or groups of genes

23
Neighbor Divergence Approach

Statistical NLP technique
Will always be up to date if provided with a
current literature base
Cannot specify what the actual function is!

24
Challenges in the Problem

Large number of genes
Genes have multiple functions
Some genes have been extensively studied, others
recently discovered
So the literature about genes reflects these
differences

25
Neighbor Divergence Intuition
26
Neighbor Divergence Algorithm

Representation Of Articles
Identifying Semantic Neighbors for Corpus
Articles
Scoring Articles Relative to Gene Group
Calculating a Theoretical distribution of Scores
Calculating the Difference between empirical and
theoretical distribution

27
ND- Article Representation

Words in articles are weighted by their inverse
document frequency (to reduce the impact of
common words)
$W_{i,j} = (1 + \log_2 tf_{i,j}) \cdot \log_2(N / df_i)$ if $tf_{i,j} > 0$
$W_{i,j} = 0$ if $tf_{i,j} = 0$
where $W_{i,j}$ = weighted count of word i in document j,
$tf_{i,j}$ = the number of times word i occurs in document j,
$df_i$ = the number of documents containing word i,
$N$ = the total number of documents
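
A minimal sketch of this weighting, assuming each article is already tokenized into a list of words. It reads the formula as (1 + log2 tf) * log2(N / df), the standard form of this weighting; the function name and the toy articles are hypothetical.

```python
import math
from collections import Counter


def weight_articles(articles):
    """articles: list of token lists. Returns one {word: weight} dict per article."""
    n_docs = len(articles)
    doc_freq = Counter()
    for tokens in articles:
        doc_freq.update(set(tokens))  # count each word once per article
    weighted = []
    for tokens in articles:
        term_freq = Counter(tokens)
        # Words absent from an article are simply left out (weight 0).
        weighted.append({
            word: (1 + math.log2(count)) * math.log2(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weighted


# Toy corpus, invented for illustration.
docs = [
    ["kinase", "kinase", "cell"],
    ["cell", "growth"],
    ["cell", "kinase", "yeast", "yeast"],
]
weights = weight_articles(docs)
print(weights[0])
```

In this toy corpus "cell" appears in every article, so its weight is zero, which is the intended effect of the inverse document frequency term.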

28
ND Identifying Semantic Neighbors

For each article, the k most similar articles are
precomputed (k = 20 was used)
The cosine similarity measure is used (cosine of the
angle between two weighted article vectors)
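
The neighbor search can be sketched as a brute-force comparison of sparse word-weight vectors. This is only an illustration: the function names are hypothetical, excluding the article itself from its own neighbor list is an assumption, and a real corpus would need something faster than comparing all pairs.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(vectors, k=20):
    """For each article index, return the indices of its k most similar articles."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy weighted vectors, invented for illustration.
vectors = [
    {"kinase": 1.0, "cell": 0.5},
    {"kinase": 0.9, "cell": 0.6},
    {"yeast": 1.0},
    {"kinase": 0.2, "yeast": 0.8},
]
neighbors = semantic_neighbors(vectors, k=2)
print(neighbors)
```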

29
ND Scoring articles

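Given a gene group g, ND assigns a score $S_{i,g}$
to each article i
The score is a count of the semantic neighbors that
refer to group genes
$fr_{k,g} = n_{k,g} / n_k$ (fractional reference for each neighbor k)
$S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{sem_{i,j},\,g}\right)$ (score value, where $sem_{i,j}$ is the j-th semantic neighbor of article i)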
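
The scoring step might look like the sketch below. It assumes that n_k is the number of genes an article refers to and n_{k,g} the number of those that belong to the group, which the slide does not spell out. The function names and the toy data are hypothetical.

```python
def fractional_reference(article_genes, group):
    """fr_{k,g}: share of the genes referenced by an article that belong to the group."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(i, neighbors, genes_by_article, group):
    """S_{i,g}: rounded sum of fractional references over article i's semantic neighbors."""
    total = sum(
        fractional_reference(genes_by_article[k], group) for k in neighbors[i]
    )
    # Python rounds exact halves to the nearest even integer; the slides do not
    # say which rounding rule was used, so this detail is an assumption.
    return round(total)


# Toy data: genes referenced by each article, and a gene group of interest.
genes_by_article = [{"A"}, {"A", "B"}, {"C"}, {"A", "C", "D"}]
toy_neighbors = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
group = {"A", "B"}
score_0 = article_score(0, toy_neighbors, genes_by_article, group)
print(score_0)
```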

30
ND Difference in Distributions

Calculating a theoretical distribution of scores
A Poisson distribution represents the scores expected
when the group has no coherent functional structure
$P(S = n) = \frac{\lambda^n}{n!} e^{-\lambda}$
KL divergence
If the two distributions are the same, the divergence is zero
The more disparate the distributions, the larger the
divergence
$D(g \| h) = \sum_i g_i \log(g_i / h_i)$
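
The comparison of the two distributions can be sketched as follows. Two choices here are assumptions of the sketch, not statements about the original method: the Poisson rate is estimated as the mean observed score, and the Poisson probabilities are renormalized over the score range 0..20 so that both distributions sum to one.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with rate lam."""
    return lam ** n / math.factorial(n) * math.exp(-lam)


def kl_divergence(g, h):
    """D(g || h) = sum_i g_i * log(g_i / h_i); terms with g_i = 0 contribute 0."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def neighbor_divergence(scores, max_score=20):
    """KL divergence between the observed score distribution and a fitted Poisson."""
    counts = Counter(scores)
    total = len(scores)
    empirical = [counts[n] / total for n in range(max_score + 1)]
    lam = sum(scores) / total  # assumption: rate taken as the mean score
    raw = [poisson_pmf(n, lam) for n in range(max_score + 1)]
    norm = sum(raw)
    expected = [p / norm for p in raw]
    return kl_divergence(empirical, expected)


# Toy scores: a group where a few articles score very high versus a flat group.
coherent_like = neighbor_divergence([0, 0, 0, 0, 10, 12])
random_like = neighbor_divergence([0, 1, 1, 2, 1, 0])
print(coherent_like, random_like)
```

A few articles with unusually high scores pull the observed distribution away from the Poisson shape, which is what produces a large divergence for the first toy group.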

31
Observed and Expected Distribution of Article
Scores
32
Results
33
Other methods

Word Divergence

34
Other methods

Best Article Score
The highest article score is used as a measure of the
gene group's functional coherence
Best p-Value
Summed probability of an article having at least as
many neighbors as it actually has
Neighborhood Divergence No Filter
The filter normally used is: when calculating semantic
neighbors, only articles that refer to different
genes are considered. This variant omits it.

35
Evaluation
36
Corrupting Functional Groups
37
Outline

Introduction and Background
Mining Technique 1
Identifying Functionally Coherent Gene Groups
Mining Technique 2
Extracting Synonymous gene and protein terms
Conclusion

38
Introduction

Genes and proteins are associated with multiple
names
LARD , DR3 , TR3 , Wsl, DDR3, APO-3, TRAMP,
WSL-1, WSL-LR, Tnfrsf12,
PS2, Alg2, MA-3, alg-2, Pdcd6
GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2
(http://bioinformatics.org/textknowledge/synonym.php)

39
Advantage

Automated method will keep the database updated
Extracting synonyms will help
Information retrieval and extraction
Human curators of biological resources

40
Existing approaches

Detecting semantically related words
beer and wine are related terms
Use of WORDNET (a large lexical database of
English words) to evaluate semantic similarity
Most synonym identification methods do not
consider the surrounding context of words

41
Information Extraction and Machine Learning

Information extraction requires a large amount of
manual labor to construct and tune extraction systems
Machine learning techniques help to reduce the
manual labor by automatically acquiring rules from
labeled and unlabeled data

42
ML techniques

Supervised Learning
Labeled Training Data available
Semi supervised Learning
Small number of labeled training data
Unsupervised Learning
Data with no labeling
Reinforcement Learning
Learn a mapping from situations to actions by
trial-and-error interactions

43
Approach Used here

Obtain tagged genes and proteins in text using
existing gene taggers
Four approaches used
Unsupervised Learning
Partially Supervised Learning
Supervised Learning
Hand-Crafted System
Use of a final COMBINED system

44
Unsupervised Learning Contextual Similarity

Finds sets of words that appear in similar contexts,
using the mutual information between the words

45
Unsupervised Learning Contextual Similarity

Mutual Information
Similarity Measure
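
The formulas on this slide did not survive in the transcript. As a generic illustration only, the sketch below uses pointwise mutual information between a term and its context words, keeps the positive values, and compares terms by the cosine of those vectors. It is not claimed to be the measure used in the presentation; all names and the toy co-occurrence pairs are hypothetical.

```python
import math
from collections import Counter


def ppmi_vectors(pairs):
    """pairs: list of (term, context_word) co-occurrences.

    Returns {term: {context_word: positive PMI}}.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    term_counts = Counter(term for term, _ in pairs)
    context_counts = Counter(context for _, context in pairs)
    total = len(pairs)
    vectors = {}
    for (term, context), count in pair_counts.items():
        # PMI = log2( P(term, context) / (P(term) * P(context)) )
        pmi = math.log2(count * total / (term_counts[term] * context_counts[context]))
        if pmi > 0:
            vectors.setdefault(term, {})[context] = pmi
    return vectors


def context_similarity(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(weight * v[c] for c, weight in u.items() if c in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Toy co-occurrences, invented for illustration.
pairs = [
    ("DR3", "receptor"), ("DR3", "apoptosis"),
    ("LARD", "receptor"), ("LARD", "apoptosis"),
    ("TIF2", "coactivator"), ("TIF2", "nuclear"),
]
vecs = ppmi_vectors(pairs)
print(context_similarity(vecs["DR3"], vecs["LARD"]))
print(context_similarity(vecs["DR3"], vecs["TIF2"]))
```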

46
Contextual Similarity

Computing similarity for all terms takes
$O(|lexicon|^3)$ time, so a heuristic search is used
Lots of false positives are returned, so it is useful
to incorporate some domain knowledge

47
Partially supervised Learning- Snowball
48
Snowball

Confidence of a pattern
Calculates confidence of extracted tuples and
discards low confidence tuples
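
The slide's own formulas are missing from the transcript. The sketch below follows the definitions commonly given for Snowball: a pattern's confidence is the fraction of its matches that are correct, and a tuple's confidence is one minus the probability that every supporting pattern is wrong. Treat it as an assumption about what the slide showed; the function names, the threshold and the toy candidates are hypothetical.

```python
def pattern_confidence(positive, negative):
    """Fraction of a pattern's matches that agree with known tuples."""
    matches = positive + negative
    return positive / matches if matches else 0.0


def tuple_confidence(supporting):
    """supporting: list of (pattern_confidence, match_degree) pairs for one tuple."""
    all_wrong = 1.0
    for confidence, match in supporting:
        all_wrong *= 1.0 - confidence * match
    return 1.0 - all_wrong


def filter_tuples(candidates, threshold):
    """Keep tuples whose confidence reaches the threshold."""
    scored = {t: tuple_confidence(s) for t, s in candidates.items()}
    return {t: c for t, c in scored.items() if c >= threshold}


# Toy candidates: each synonym pair with the patterns that produced it.
candidates = {
    ("DR3", "LARD"): [(0.5, 1.0), (0.5, 1.0)],
    ("DR3", "cell"): [(0.25, 0.8)],
}
kept = filter_tuples(candidates, threshold=0.6)
print(kept)
```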

49
Supervised Learning Text classification

User-provided positive and negative example gene
and protein pairs
An SVM is trained on this data (SVMLight with a
radial basis function kernel)
Classifies pairs of identified genes and proteins
using a confidence score Conf(s) (the score assigned
by the classifier)
Does not combine evidence from multiple
occurrences of the same gene or protein pair

50
Hand Crafted Extraction System- GPE system

Most labor-intensive approach, but gives
high-quality results
Starts with a set of known pairs of synonyms
Manual examination to find patterns of
occurrences
Use of phrases such as "known as" or "also called"
Scans for more synonyms and uses heuristics and
filters to ignore non-gene/protein terms
A confidence value of 1 is assigned to every returned
result
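
A toy version of the pattern step can be written with a regular expression. This is only a sketch of the idea: the actual GPE patterns, heuristics and filters are not given in the slides, so the pattern, the function name and the example sentences are hypothetical.

```python
import re

# A "term" here is a run of letters, digits and hyphens (a crude assumption).
TERM = r"[A-Za-z0-9][A-Za-z0-9-]*"
SYNONYM_PATTERN = re.compile(
    rf"\b({TERM})\s*[,(]?\s*(?:also\s+)?(?:known\s+as|called)\s+({TERM})"
)


def extract_synonym_pairs(sentence):
    """Return (term, synonym, confidence) triples; confidence is fixed at 1.0."""
    return [
        (match.group(1), match.group(2), 1.0)
        for match in SYNONYM_PATTERN.finditer(sentence)
    ]


print(extract_synonym_pairs("DR3, also known as LARD, is a death receptor."))
print(extract_synonym_pairs("TIF2 (also called GRIP-1) binds nuclear receptors."))
```

Without further filtering such a pattern also fires on ordinary words (for example "is called"), which is why the filters for non-gene/protein terms matter.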

51
Combined System

Exploits advantages of knowledge based and
machine learning based systems
ConfE(s) represents the confidence score assigned
to s by system E
Combined confidence: 1 - (probability that all
systems extracted s incorrectly)
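
Read literally, the combination rule can be sketched as below. It assumes the individual systems err independently, so the probability that all of them are wrong is the product of (1 - ConfE(s)). The function name and the example scores are hypothetical.

```python
def combined_confidence(confidences):
    """1 minus the probability that every system extracted the pair incorrectly."""
    all_wrong = 1.0
    for confidence in confidences:
        all_wrong *= 1.0 - confidence  # independence of systems is assumed
    return 1.0 - all_wrong


# Hypothetical scores for one synonym pair from two learned systems.
print(combined_confidence([0.5, 0.5]))
# A pair also returned by the hand-crafted system (confidence 1) is always kept.
print(combined_confidence([0.2, 1.0]))
```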

52
Final parameters used for the different systems
53
Running Times
54
Results and Evaluation
55
Results and Evaluation
56
Outline

Introduction and Background
Mining Technique 1
Identifying Functionally Coherent Gene Groups
Mining Technique 2
Extracting Synonymous gene and protein terms
Conclusion

57
Conclusion and Future Work

There is a lot of interest in using knowledge from
medical literature to guide bioinformatics algorithms
Functional Gene Groups
Can be used to connect data analysis algorithms
to scientific literature
ND may be used to define new functional groups,
annotate genes and organize genes in a
functional hierarchy
Use of full-text articles instead of only
abstracts

58
Conclusion and Future Work

Synonym Extraction
Extracted synonyms could be used as a valuable
supplement to the SWISSPROT database
The techniques could build on the existing systems to
find other biological relations between genes and
proteins, small molecules, drugs and diseases.