Title: Automating Discovery from Biomedical Texts
1Automating Discovery from Biomedical Texts
- Marti Hearst, Barbara Rosario
- UC Berkeley
- Agyinc Visit
- August 16, 2000
2The LINDI Project: Linking Information for New Discoveries
Two Main Thrusts
- UIs for building and reusing hypothesis-seeking strategies.
- Statistical language analysis techniques for extracting propositions
3Scenario: Explore Functions of a Gene
- Objective
- Determine the functions of a newly sequenced Gene X.
- Known facts
- Gene X co-expresses (is activated in the same cell) with Genes A, B, C
- The relationship of Genes A, B, C with certain types of diseases (from medical literature)
- Question
- What types of diseases is Gene X related to?
4Gene Co-expression: Role in the genetic pathway
[Figure: alternative pathway diagrams relating Kall. (Kallikrein), PSA, PAP, and unknown genes g? and h?]
Other possibilities as well
5Make use of the literature
- Look up what is known about the other genes.
- Different articles in different collections
- Look for commonalities
- Similar topics indicated by Subject Descriptors
- Similar words in titles and abstracts
- adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
6Developing Strategies
- Different strategies seem needed for different situations
- First see what is known about Kallikrein.
- 7341 documents. Too many
- AND the result with disease category
- If result is non-empty, this might be an interesting gene
- Now get 803 documents
7Explore Functions of New Gene X
Medical Literature
Query
Projection
Mapping
Slide adapted from K. Patel
8Developing Strategies
- Different strategies seem needed for different situations
- First see what is known about Kallikrein.
- 7341 documents. Too many
- AND the result with disease category
- If result is non-empty, this might be an interesting gene
- Now get 803 documents
- AND the result with PSA
- Get 11 documents. Better!
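The narrowing steps above amount to successive Boolean AND queries. A minimal sketch, assuming a toy inverted index; the document IDs are made up, the `and_query` helper is hypothetical, and the single term "disease" stands in for a whole disease category:

```python
# Toy inverted index: term -> set of document IDs (all IDs are invented).
index = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 5, 8, 9},   # stand-in for the disease category
    "psa": {3, 5, 7},
}

def and_query(index, terms):
    """Return the IDs of documents that contain every term in `terms`."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Each added term can only shrink (or keep) the result set.
step1 = and_query(index, ["kallikrein"])
step2 = and_query(index, ["kallikrein", "disease"])
step3 = and_query(index, ["kallikrein", "disease", "psa"])
print(len(step1), len(step2), len(step3))  # 6 3 2
```

A non-empty result after the last AND is the signal that the gene may be worth a closer look.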
9Explore Functions of New Gene X
Medical Literature
Query
Projection
Intersection
10Developing Strategies
- Look for commonalities among these documents
- Manual scan through 100 category labels
- Would have been better if
- Automatically organized
- Intersections of important categories scanned first
11Explore Functions of New Gene X
Medical Literature
Query
Projection
Intersection
Slicing
Mapping
Slide adapted from K. Patel
12Try a new tack
- Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
- New tack: intersect search on all three known genes
- Hope they all talk about diagnostics and prostate cancer
- Fortunately, 7 documents returned
- Bingo! A relation to regulation of this cancer
13Explore Functions of New Gene X
Medical Literature
Possible Function For Gene-X
Query
Query
Projection
Intersection
Slicing
Mapping
Slide adapted from K. Patel
14Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
- New tack: do some lab tests
- See if mystery gene is similar in molecular structure to the others
- If so, it might do some of the same things they do
15Strategies again
- In hindsight, combining all three genes was a good strategy.
- Store this for later
- Might not have worked
- Need a suite of strategies
- Build them up via experience and a good UI
16The System
- Doing the same query with slightly different values each time is time-consuming and tedious
- Same goes for cutting and pasting results
- IR systems don't support varying queries like this very well.
- Each situation is a bit different
- Some automatic processing is needed in the background to eliminate/suggest hypotheses
17The User Interface
- A general search interface should support
- History
- Context
- Comparison
- Operators: Intersection, Union, Slicing
- Operator Reuse
- Visualization (where appropriate)
- We have an initial implementation
- It needs lots of work
18Architecture of LINDI UI
- Data Layer
- Annotation Layer
- User Interface Layer
19Data Layer
- Purpose
- Hide different formats of text collections
- Components
- Data abstractions representing records of a text collection
- Operations performed on the data
- Data
- A set of records
- Each record is a set of tuples with types
- Operations
- union, intersection, projection, mapping
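One way to picture this data model is as plain Python sets. The sketch below is an assumption, not the LINDI implementation: the names `record`, `projection`, and `mapping`, the field types "title" and "mesh", and the sample records are all invented for illustration.

```python
from typing import Callable, FrozenSet, Set, Tuple

# A field is a (type, value) tuple; a record is a frozen set of such tuples.
Field = Tuple[str, str]
Record = FrozenSet[Field]
DataSet = Set[Record]

def record(*fields: Field) -> Record:
    return frozenset(fields)

def union(a: DataSet, b: DataSet) -> DataSet:
    return a | b

def intersection(a: DataSet, b: DataSet) -> DataSet:
    return a & b

def projection(data: DataSet, field_type: str) -> DataSet:
    """Keep only the tuples of one type in each record; drop empty records."""
    projected = {frozenset(f for f in rec if f[0] == field_type) for rec in data}
    projected.discard(frozenset())
    return projected

def mapping(data: DataSet, fn: Callable[[Record], Record]) -> DataSet:
    """Apply a record-to-record function to every record."""
    return {fn(rec) for rec in data}

# Invented sample records from two hypothetical result sets.
doc1 = record(("title", "PSA in prostate cancer"), ("mesh", "Prostatic Neoplasms"))
doc2 = record(("title", "Kallikrein expression"), ("mesh", "Prostatic Neoplasms"))
doc3 = record(("title", "Unrelated study"), ("mesh", "Antibodies"))

# Whole records differ, but their projections onto "mesh" can still overlap.
shared = intersection(projection({doc1, doc3}, "mesh"), projection({doc2}, "mesh"))
print(shared)
```

Projecting before intersecting is what lets two result sets with no documents in common still reveal a shared subject heading.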
20Annotation Layer
- Purpose
- Associate data sets with the operations that produced them (history)
- History is a first-class object
- Advantage
- Streamline a sequence of operations
- Reuse operations
- Parameterize operations
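Treating history as a first-class object can be sketched as a data set that carries the list of operations that produced it, so the same sequence can be replayed on new input. This is a hypothetical illustration; the `Annotated` class and its `apply` and `replay` methods are assumed names, not part of the LINDI prototype.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class Annotated:
    """A data set together with the named operations that produced it."""
    data: Any
    history: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)

    def apply(self, name: str, op: Callable[[Any], Any]) -> "Annotated":
        """Run `op` on the data and record it in the history."""
        return Annotated(op(self.data), self.history + [(name, op)])

    def replay(self, new_data: Any) -> "Annotated":
        """Reuse the recorded operations on a different starting data set."""
        result = Annotated(new_data)
        for name, op in self.history:
            result = result.apply(name, op)
        return result

# Build a two-step sequence once, then replay it with a different starting value.
first = Annotated([5, 3, 8, 1]).apply("sort", sorted).apply("top2", lambda xs: xs[:2])
second = first.replay([9, 7, 2])
print(first.data, [name for name, _ in first.history])
print(second.data)
```

Replaying with a new starting value is the simplest form of parameterization: the operations stay fixed while the input changes.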
21User Interface
- Direct manipulation of information objects and access operations
- Query
- Intersection
- Union
- Mapping
- Slicing
- Record and reuse of past operations
- Parameterization of operations
- Streamlining of operations
22Initial Palette
23Query Structure Determined by Collection Type
24Query Operation Results
25Projection Operation and Subsequent Results
26Parameterized Query: Repeat operations with different values
GA
GB
GC
27Intersection over Projected Attribute
28Intersection over Projected Attribute
29Example Interaction with UI Prototype
1. Query on gene names
2. Project out only MeSH headings
3. Intersect the results
4. Map to create a ranking
5. Slice out the top-ranked
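The five steps can be sketched end to end on a toy collection. Everything below is assumed for illustration: the documents, their headings, the helper names, and the choice to rank a heading by how many retrieved documents carry it (the slides do not say how the ranking is computed).

```python
from collections import Counter

# Invented mini-collection; headings are toy labels, not checked against MeSH.
DOCS = [
    {"id": 1, "genes": {"kallikrein"}, "mesh": {"Prostatic Neoplasms", "Tumor Markers"}},
    {"id": 2, "genes": {"PSA"}, "mesh": {"Prostatic Neoplasms", "Antibodies"}},
    {"id": 3, "genes": {"PAP"}, "mesh": {"Prostatic Neoplasms", "Tumor Markers"}},
    {"id": 4, "genes": {"PSA"}, "mesh": {"Prostatic Neoplasms", "Tumor Markers"}},
    {"id": 5, "genes": {"kallikrein"}, "mesh": {"Adenocarcinoma"}},
]

def query(gene):
    """Step 1: documents that mention the gene."""
    return [d for d in DOCS if gene in d["genes"]]

def project_mesh(results):
    """Step 2: keep only the headings found in a result set."""
    return set().union(*(d["mesh"] for d in results))

def rank(headings, results):
    """Step 4: count how many retrieved documents carry each shared heading."""
    counts = Counter(h for d in results for h in d["mesh"] if h in headings)
    return counts.most_common()

genes = ["kallikrein", "PSA", "PAP"]
results = {g: query(g) for g in genes}
shared = set.intersection(*(project_mesh(r) for r in results.values()))  # step 3
retrieved = list({d["id"]: d for r in results.values() for d in r}.values())
ranking = rank(shared, retrieved)
top = ranking[:1]                                                        # step 5
print(top)  # [('Prostatic Neoplasms', 4)]
```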
30Future Work on UI
- As currently designed
- Better labeling
- Better layout
- Intuitive
- Scalable
- Connection to real backend
- User Testing
- Does direct manipulation work?
- What operator sequences help?
- How to improve parameterization?
- More advanced
- Support for strategies
- Incorporation of NLP
31Language Analysis Component
- Goals
- Extract Propositions from Text
- Make Inferences
32Language Analysis Component
- Why Extract Propositions from Text?
- Text is how knowledge at the propositional level is communicated
- Text is continually being created and updated by the outside world
33Example: Statistical Semantic Grammar
- To detect causal relationships between medical concepts
- Title
- Magnesium deficiency implicated in increased stress levels.
- Interpretation
- <nutrient><reduction> related-to <increase><symptom>
- Inference
- Increase(stress, decrease(mg))
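The path from title to inference can be illustrated with a deliberately simple, rule-based stand-in. It is not statistical: a real statistical semantic grammar would attach probabilities to category assignments and rules. The lexicon, the single pattern, and the `interpret` function are all assumptions made for this sketch.

```python
# Hypothetical lexicon: word -> (semantic category, normalized concept).
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "implicated": ("related-to", None),
    "increased": ("increase", "Increase"),
    "stress": ("symptom", "stress"),
}

# The one category sequence this toy grammar recognizes.
PATTERN = ["nutrient", "reduction", "related-to", "increase", "symptom"]

def interpret(title):
    """Tag known words, match the pattern, and build the inference string."""
    tokens = [word.strip(".,").lower() for word in title.split()]
    tagged = [LEXICON[t] for t in tokens if t in LEXICON]
    if [category for category, _ in tagged] != PATTERN:
        return None  # the title does not fit the pattern
    nutrient, reduction, _, increase, symptom = (concept for _, concept in tagged)
    return f"{increase}({symptom}, {reduction}({nutrient}))"

print(interpret("Magnesium deficiency implicated in increased stress levels."))
# Increase(stress, decrease(mg))
```

Words outside the lexicon ("in", "levels") are skipped, which is one reason a hand-built grammar like this is brittle.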
34Statistical Semantic Grammars
- Empirical NLP has made great strides
- But mainly applied to syntactic structure
- Semantic grammars are powerful, but
- Brittle
- Time-consuming to construct
- Idea
- Use what we now know about statistical NLP to
build up a probabilistic grammar
35LINDI Target Components
- Special UI for retrieving appropriate docs
- Language analysis on docs to detect causal relationships between concepts
- Probabilistic representation of concepts and relationships
- UI: User hypothesis creation