Title: Mining Medical Literature
1Mining Medical Literature
- Vignesh Ganapathy
- (CS 374 Algorithms in Biology)
- (FALL 2005)
2Outline
- Introduction and Background
- Mining Technique 1
- Identifying Functionally Coherent Gene Groups
- Mining Technique 2
- Extracting Synonymous gene and protein terms
- Conclusions
4Introduction
- Medical literature holds vast amounts of knowledge and information
- PubMed Central (PMC): the U.S. National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature
- Amedeo.com (The Medical Literature Guide)
- Journals like Science, Nature, Cell, EMBO, Cell Biology, PNAS
- (and many more...)
5The Problem
- The major task is finding ways to extract useful information from these resources.
6What is Data Mining?
- Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.
7Example Data!
- Large amounts of data but no information
- Daily transactions at a supermarket
- Daily website visit histories
- Books/videos rented at a Library
- Newspaper, Journal archives
8Amazon.com
9Google News
- Clustering News items (Google News)
10More Applications
- Improving Sales strategy
- Finding items that sell together
- (A common example relates beer and diapers: a supermarket found that 50% of the time beer was purchased together with diapers.)
- Anomaly detection and many more
11Information Retrieval (IR)
- Collecting information from text data (unstructured data)
- Applications
- Search web documents
- Natural Language Processing
- The term also extends to include multimedia or other forms of unstructured data
12Simple flow of Retrieval Process
13IR System Evaluation
- Some measures are
- Precision
- Recall
- F1 measure: a combined measure, the harmonic mean of precision and recall (the general F measure is a weighted harmonic mean)
- Sensitivity
- Specificity
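A minimal sketch of how precision, recall, and the F measure can be computed from a set of retrieved documents and a set of relevant documents. The function name, the document ids, and the `beta` parameter are illustrative assumptions, not part of the original slides; `beta = 1` gives the F1 measure.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

Raising `beta` above 1 weights recall more heavily, lowering it below 1 favors precision.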
14Precision and Recall
- How are Precision and Recall related?
15Problems with Precision and Recall
- Deciding which documents are relevant and which are not is not easy
- For recall, it is difficult to measure the number of relevant documents in the database
- Creating a pool of relevant records is one solution
- In practice, these are still good measures
16Sensitivity and Specificity
- Sensitivity: probability that a positive example is classified as positive, TP / (TP + FN)
- Specificity: probability that a negative example is classified as negative, TN / (TN + FP)
- What is the relation between Sensitivity, Specificity, Precision and Recall?
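One way to explore the question above is to compute all four measures from the same confusion matrix. The sketch below assumes the standard definitions in terms of true/false positives and negatives; the counts are made up for illustration and all denominators are assumed to be nonzero.

```python
def rates(tp, fp, fn, tn):
    """Compute the four measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of positives that are found
        "specificity": tn / (tn + fp),  # fraction of negatives that are rejected
        "precision": tp / (tp + fp),    # fraction of returned items that are positive
        "recall": tp / (tp + fn),       # same formula as sensitivity
    }


# Hypothetical counts: 50 relevant documents in a collection of 1000.
m = rates(tp=30, fp=10, fn=20, tn=940)
```

Under these definitions recall and sensitivity are the same quantity, while precision and specificity differ: specificity depends on the (usually very large) number of true negatives, precision does not.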
17Outline
- Introduction and Background
- Mining Technique 1
- Identifying Functionally Coherent Gene Groups
- Mining Technique 2
- Extracting Synonymous gene and protein terms
- Conclusion
18Introduction
- Analysis is shifting from single genes to families of genes
- Examples of these are
- Sequence Data
- Gene Expression Clustering
- Deletion Phenotypes
- Yeast-2-Hybrid screens
19HOVERGEN: a Database of Homologous Vertebrate Genes
- Useful for comparative sequence analysis or molecular evolution studies
- [Figure: the 10 biggest gene families]
20Why identify functional gene groups?
- Interesting to know the functionally relevant groups within large gene group sets
- Helps to assess the significance of experimentally derived gene sets
- Refine gene groups to find more functionally relevant groups
- Existing algorithms can make use of this information in finding gene groups
21Existing Approaches
- Use of co-occurrence of gene names in abstracts to create networks of related genes automatically
- Use of an existing vocabulary of gene functions and assigned genes to decide a functionally relevant group
- (Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS))
22Statistical NLP approach
- Used for annotating individual genes
- Determining gene and protein interactions
- Assigning keywords to genes or groups of genes
23Neighbor Divergence Approach
- Statistical NLP technique
- Will always be up to date if provided with a current literature base
- Cannot specify what the actual function is!
24Challenges in the Problem
- Large number of genes
- Genes have multiple functions
- Some genes have been extensively studied, others only recently discovered
- So the literature about genes reflects these differences
25Neighbor Divergence Intuition
26Neighbor Divergence Algorithm
- Representation of articles
- Identifying semantic neighbors for corpus articles
- Scoring articles relative to the gene group
- Calculating a theoretical distribution of scores
- Calculating the difference between the empirical and theoretical distributions
27ND- Article Representation
- Words in articles are weighted by their inverse document frequency (to reduce the impact of common words)
- $W_{i,j} = (1 + \log_2 tf_{i,j}) \log_2 (N / df_i)$ if $tf_{i,j} > 0$
- $W_{i,j} = 0$ if $tf_{i,j} = 0$
- where $W_{i,j}$ is the weighted count of word $i$ in document $j$,
- $tf_{i,j}$ is the number of times word $i$ occurs in document $j$,
- $df_i$ is the number of documents containing word $i$,
- $N$ is the total number of documents
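A minimal sketch of this weighting, reading the formula as W = (1 + log2(tf)) * log2(N / df) for words that occur in a document and 0 otherwise. The function name and the toy corpus are assumptions for illustration; tokenization is assumed to have been done already.

```python
import math
from collections import Counter


def weight_documents(docs):
    """docs: list of token lists. Returns one sparse {word: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                      # df[w] = number of documents containing w
    for tokens in docs:
        df.update(set(tokens))
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)            # tf[w] = occurrences of w in this document
        weighted.append({
            w: (1 + math.log2(count)) * math.log2(n_docs / df[w])
            for w, count in tf.items()
        })                              # absent words implicitly have weight 0
    return weighted


# Toy corpus of four "abstracts" (hypothetical).
docs = [
    ["gene", "kinase", "kinase"],
    ["gene", "yeast"],
    ["gene", "kinase", "binding", "binding"],
    ["gene", "cell"],
]
weights = weight_documents(docs)
```

A word that appears in every document ("gene" here) gets weight 0, which is the intended damping of common words.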
28ND Identifying Semantic Neighbors
- For each article, the k most similar articles are precomputed (k = 20 was used)
- The cosine similarity measure is used (cosine of the angle between two weighted article vectors)
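The neighbor search can be sketched as a brute-force ranking by cosine similarity over sparse word-weight vectors. Function names are illustrative, and excluding an article from its own neighbor list is an assumption; a real corpus would need an index rather than this quadratic loop.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_neighbors(vectors, k=20):
    """For each article index, return the indices of its k most similar articles."""
    neighbors = []
    for i, u in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)  # assumption: skip self
        ranked = sorted(others, key=lambda j: cosine(u, vectors[j]), reverse=True)
        neighbors.append(ranked[:k])
    return neighbors


# Tiny hypothetical example with k = 2 instead of 20.
vectors = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}, {"a": 1.0}]
nbrs = semantic_neighbors(vectors, k=2)
```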
29ND Scoring articles
- Given a gene group, ND assigns a score to each article ($S_{i,g}$)
- The score is a count of semantic neighbors that refer to group genes
- $fr_{k,g} = n_{k,g} / n_k$ (fractional reference for each neighbor $k$)
- $S_{i,g} = \mathrm{round}\left(\sum_{j=1}^{20} fr_{sem_{i,j},g}\right)$ (score value)
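A sketch of the scoring step. It assumes that n_k is the number of genes article k refers to and n_{k,g} is the number of those genes that belong to group g; the slide does not spell these out, so treat the reading, the function names, and the toy data as assumptions.

```python
def fractional_reference(article_genes, group):
    """fr = (genes of the article that are in the group) / (genes of the article)."""
    if not article_genes:
        return 0.0
    return len(article_genes & group) / len(article_genes)


def article_score(i, neighbors, genes_by_article, group):
    """Rounded sum of fractional references over the semantic neighbors of article i."""
    total = sum(
        fractional_reference(genes_by_article[k], group) for k in neighbors[i]
    )
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(total)


# Hypothetical data: article 0 has three semantic neighbors.
genes_by_article = [{"A"}, {"A", "B"}, {"C"}, {"A", "C", "D"}]
neighbors = {0: [1, 2, 3]}
group = {"A", "B"}
score = article_score(0, neighbors, genes_by_article, group)
```

Here the neighbors contribute 1, 0, and 1/3, so the article behaves as if roughly one neighbor refers to the group.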
30ND Difference in Distributions
- Calculating a theoretical distribution of scores
- Use of the Poisson distribution to represent the non-coherent functional structure
- $P(S = n) = \frac{\lambda^n}{n!} e^{-\lambda}$
- KL divergence
- If the two distributions are the same, the divergence is zero
- The more disparate the distributions, the larger the divergence
- $D(g \| h) = \sum_i g_i \log (g_i / h_i)$
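The comparison of the two distributions can be sketched as follows: build the empirical histogram of article scores, build a Poisson distribution over the same score range, and take the KL divergence between them. Estimating lambda as the mean score, truncating the Poisson at the largest observed score, and using the natural logarithm are all assumptions of this sketch, not details given in the slides.

```python
import math
from collections import Counter


def poisson_pmf(n, lam):
    """P(S = n) for a Poisson distribution with mean lam."""
    return lam ** n / math.factorial(n) * math.exp(-lam)


def kl_divergence(g, h):
    """D(g || h) = sum_i g_i * log(g_i / h_i); terms with g_i = 0 contribute 0."""
    return sum(gi * math.log(gi / hi) for gi, hi in zip(g, h) if gi > 0)


def neighbor_divergence(scores):
    """Divergence between observed article scores and a Poisson with the same mean."""
    lam = sum(scores) / len(scores)     # assumption: lambda estimated as the mean
    max_score = max(scores)
    counts = Counter(scores)
    empirical = [counts[n] / len(scores) for n in range(max_score + 1)]
    theoretical = [poisson_pmf(n, lam) for n in range(max_score + 1)]
    return kl_divergence(empirical, theoretical)


# Hypothetical scores: most articles score 0, a few score very high.
coherent = neighbor_divergence([0] * 90 + [10] * 10)
```

A group whose articles share many group-related neighbors produces a heavy tail of high scores, which a Poisson with the same mean cannot explain, so the divergence is large.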
31Observed and Expected Distribution of Article Scores
32Results
33Other methods
34Other methods
- Best Article Score
- The highest article score is used as a measure of the gene group's functional coherence
- Best p-Value
- Summed probability of an article having at least as many neighbors as it has
- Neighborhood Divergence No Filter
- The filter used is: when calculating semantic neighbors, only articles that refer to different genes are considered.
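The first two alternatives can be sketched in a few lines. The p-value is read here as a Poisson upper-tail probability P(S >= s), with the smallest tail over all articles taken as the group's "best" p-value; that reading, the function names, and the example scores are assumptions.

```python
import math


def poisson_tail(s, lam):
    """P(S >= s) under a Poisson distribution with mean lam."""
    below = sum(lam ** n / math.factorial(n) * math.exp(-lam) for n in range(s))
    return 1.0 - below


def best_article_score(scores):
    """Highest article score in the corpus for this gene group."""
    return max(scores)


def best_p_value(scores, lam):
    """Smallest tail probability over all article scores."""
    return min(poisson_tail(s, lam) for s in scores)


# Hypothetical article scores and Poisson mean.
scores = [0, 1, 3]
top = best_article_score(scores)
p = best_p_value(scores, lam=1.0)
```

Both summaries depend on a single extreme article, whereas the divergence uses the whole score distribution.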
35Evaluation
36Corrupting Functional Groups
37Outline
- Introduction and Background
- Mining Technique 1
- Identifying Functionally Coherent Gene Groups
- Mining Technique 2
- Extracting Synonymous gene and protein terms
- Conclusion
38Introduction
- Genes and proteins are associated with multiple names
- LARD, DR3, TR3, Wsl, DDR3, APO-3, TRAMP, WSL-1, WSL-LR, Tnfrsf12
- PS2, Alg2, MA-3, alg-2, Pdcd6
- GRIP-1, TIF2, 9530095N19, D1Ertd433e, Ncoa2
- (http://bioinformatics.org/textknowledge/synonym.php)
39Advantage
- An automated method will keep the database updated
- Extracting synonyms will help
- Information retrieval and extraction
- Human curators of biological resources
40Existing approaches
- Detecting semantically related words
- "beer" and "wine" are related terms
- Use of WordNet (a large lexical database of English words) to evaluate semantic similarity
- Most synonym identification methods do not consider the surrounding context of words
41Information Extraction and Machine Learning
- Requires a large amount of manual labor to construct and tune extraction systems
- Machine learning techniques help to reduce the manual labor by automatically acquiring rules from labeled and unlabeled data
42ML techniques
- Supervised Learning
- Labeled Training Data available
- Semi supervised Learning
- Small number of labeled training data
- Unsupervised Learning
- Data with no labeling
- Reinforcement Learning
- Learn a mapping from situations to actions by trial-and-error interactions
43Approach Used here
- Obtain tagged genes and proteins in text using existing gene taggers
- Four approaches used
- Unsupervised Learning
- Partially Supervised Learning
- Supervised Learning
- Hand-Crafted System
- Use of a final COMBINED system
44Unsupervised Learning Contextual Similarity
- Finds set of words that appear in similar context
using mutual information between the words
45Unsupervised Learning Contextual Similarity
- Mutual Information
- Similarity Measure
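The formulas on this slide did not survive extraction, so the sketch below swaps in a common stand-in rather than the slide's own measures: pointwise mutual information (PMI) between a term and the words in a small window around it, followed by cosine similarity between the resulting context vectors. The window size, function names, and example sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def pmi_vectors(sentences, window=2):
    """Map each term to a sparse vector of positive PMI values with its context words."""
    pair, word, ctx = Counter(), Counter(), Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair[(w, tokens[j])] += 1
                    word[w] += 1
                    ctx[tokens[j]] += 1
    total = sum(pair.values())
    vectors = defaultdict(dict)
    for (w, c), n in pair.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) ), estimated from counts.
        pmi = math.log2(n * total / (word[w] * ctx[c]))
        if pmi > 0:
            vectors[w][c] = pmi
    return vectors


def context_similarity(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(x * v.get(c, 0.0) for c, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical tagged sentences: DR3 and TRAMP appear in the same context.
sentences = [
    ["DR3", "binds", "TL1A"],
    ["TRAMP", "binds", "TL1A"],
    ["actin", "forms", "filaments"],
]
vec = pmi_vectors(sentences)
same = context_similarity(vec["DR3"], vec["TRAMP"])
different = context_similarity(vec["DR3"], vec["actin"])
```

Terms used in the same contexts score high even if they never co-occur, which is the property that makes contextual similarity a candidate signal for synonymy.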
46Contextual Similarity
- Computing the similarity for all terms takes time $O(|\text{lexicon}|^3)$, so a heuristic search is used
- Lots of false positives are returned, so it is useful to incorporate some domain knowledge
47Partially supervised Learning- Snowball
48Snowball
- Confidence of a pattern
- Calculates confidence of extracted tuples and discards low-confidence tuples
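The slide's formulas were lost in extraction. The sketch below assumes the definitions commonly cited for Snowball: a pattern's confidence is the fraction of its matches that are positive, and a tuple's confidence is 1 minus the product, over the patterns that generated it, of (1 - pattern confidence x match degree). The threshold value and the example tuples are hypothetical.

```python
def pattern_confidence(positive, negative):
    """Fraction of a pattern's matches that agree with known tuples."""
    total = positive + negative
    return positive / total if total else 0.0


def tuple_confidence(supports):
    """supports: iterable of (pattern_confidence, match_degree) pairs for one tuple."""
    miss = 1.0
    for conf, match in supports:
        miss *= 1.0 - conf * match      # chance this pattern's evidence is wrong
    return 1.0 - miss


def filter_tuples(candidates, threshold=0.8):
    """Keep only tuples whose confidence reaches the (hypothetical) threshold."""
    scored = {t: tuple_confidence(s) for t, s in candidates.items()}
    return {t: c for t, c in scored.items() if c >= threshold}


# Hypothetical candidate synonym pairs with their supporting patterns.
candidates = {
    ("DR3", "TRAMP"): [(0.75, 1.0), (0.5, 1.0)],
    ("PS2", "actin"): [(0.5, 0.4)],
}
kept = filter_tuples(candidates)
```

A tuple supported by several reasonably confident patterns survives, while one supported by a single weak match is discarded.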
49Supervised Learning Text classification
- User-provided positive and negative example gene and protein pairs
- Use an SVM to train on this data (radial basis kernel function of SVMLight)
- Classifies pairs of identified genes and proteins using a confidence score Conf(s) (the score assigned by the classifier)
- Does not combine evidence from multiple occurrences of the same gene or protein pair
50Hand Crafted Extraction System- GPE system
- Most labor-intensive approach, but gives high-quality results
- Starts with a set of known pairs of synonyms
- Manual examination to find patterns of occurrences
- Use of phrases such as "known as" or "also called"
- Scans for more synonyms and uses heuristics and filters to ignore non-gene/protein terms
- A confidence value of 1 is assigned to every returned result
51Combined System
- Exploits the advantages of knowledge-based and machine-learning-based systems
- $Conf_E(s)$ represents the confidence score assigned to s by system E
- $Conf(s) = 1 - \prod_E (1 - Conf_E(s))$, i.e. 1 minus the probability that all systems extracted s incorrectly
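A minimal sketch of the combination rule, assuming the individual systems err independently so that the probability that all of them are wrong is the product of (1 - Conf_E(s)). The system names and scores in the example are hypothetical.

```python
def combined_confidence(conf_by_system):
    """1 - probability that every system extracted the pair incorrectly."""
    all_wrong = 1.0
    for conf in conf_by_system.values():
        all_wrong *= 1.0 - conf
    return 1.0 - all_wrong


# Hypothetical scores for one candidate synonym pair.
pair_scores = {"similarity": 0.5, "snowball": 0.5, "svm": 0.0}
combined = combined_confidence(pair_scores)

# A pair returned by the hand-crafted system (confidence 1) is always kept.
with_gpe = combined_confidence({**pair_scores, "gpe": 1.0})
```

Agreement between weak systems raises the combined score above any single one, and a system that did not extract the pair (confidence 0) leaves the result unchanged.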
52Final parameters used for the different systems
53Running Times
54Results and Evaluation
55Results and Evaluation
56Outline
- Introduction and Background
- Mining Technique 1
- Identifying Functionally Coherent Gene Groups
- Mining Technique 2
- Extracting Synonymous gene and protein terms
- Conclusion
57Conclusion and Future Work
- Lot of interest in using knowledge from medical literature to guide bioinformatics algorithms
- Functional Gene Groups
- Can be used to connect data analysis algorithms to scientific literature
- ND may be used to define new functional groups, annotate genes, and organize genes in a functional hierarchy
- Use of full-text articles instead of only abstracts
58Conclusion and Future Work
- Synonym Extraction
- Extracted synonyms could be used as a valuable supplement to the SWISSPROT database
- Techniques could use the existing systems to find other biological relations between genes and proteins, small molecules, drugs and diseases.