Title: Towards Evidence-Based Discovery
1Towards Evidence-Based Discovery
- Catherine Blake
- School of Information and Library Science
- University of North Carolina at Chapel Hill
- http://www.ils.unc.edu/cablake
- cablake_at_email.unc.edu
2Motivation
- Relentless increase in electronically available text
- Life Sciences
- 17 millionth entry added in April 2007
- 5,200 journals indexed
- 12,000 new articles each week!
- Chemistry: more than 110,000 articles in 1 year alone
- Consequences
- Hundreds of thousands of relevant articles
- Implicit connections between literature go unnoticed
Shift from Retrieval to Synthesis
3Information Overload
- "One of the diseases of this age is the multiplicity of books; they doth so overcharge the world that it is not able to digest the abundance of idle matter that is every day hatched and brought forth into the world." (Barnaby Rich, 1613)
4Evidence-Based Discovery
"If I have seen further than others, it is by standing upon the shoulders of giants." (Sir Isaac Newton)
"We can't solve problems using the same kind of thinking we used when we created them." (Albert Einstein)
6Outline
- Motivation
- Case Studies
- METIS
- Human synthesis
- Natural language processing
- Claim Jumping through Scientific Literature
- Next Steps
- Summary
7Systematic Review Process
- Formulate the problem
- Locate and select studies
- Assess quality of studies
- Collect data
- Analyze and present results
- Interpret results
- Improve and update review
28 months from initial idea to publication
Increased demand due to evidence-based medicine
8Manual Synthesis
"Guesswork guided by scientifically trained intuition" (Rescher, 1978)
Select
Verify
Extract
Analyze
9Context Information
- Study Information
- e.g. date, location, ...
- Population Information
- e.g. gender, age, ...
- Risk Factor or Intervention
- e.g. duration of exposure, confounders
- Disease
- e.g. stage, confounders
Loosely coupled to review focus
Tightly coupled to review focus
10Collaborative Information Synthesis
11Key Estimate Missing Information
Studies with Breast Cancer patients: What are people with Breast Cancer exposed to?
- Facts for each study
- number of patients
- age of patients
- geographic location
- risk-factor exposure
Database of risk factors (BRFSS): What are people in a similar population exposed to?
- Codebook
- question asked
- age, gender
- responses
Are these rates significantly different?
T. Tengs and N. D. Osgood (2001). The link between smoking and impotence: Two decades of evidence. Preventive Medicine, 32:447-52.
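The final question on this slide, whether the exposure rate among patients differs significantly from the rate in a similar population, can be illustrated with a two-proportion z-test. This is a minimal sketch and not necessarily the test METIS applies; the function name and all counts are hypothetical.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only (not taken from any study):
# 120 of 400 patients exposed vs. 2,000 of 10,000 in the comparison population.
z, p = two_proportion_z_test(120, 400, 2000, 10000)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A small p-value would suggest that the two exposure rates differ by more than sampling variation alone would explain.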
12More than Automated Meta-Analysis
- Traditional analysis
- same study design
- medicine: RCT
- epidemiology: cohort
- Information Synthesis
- any study that includes required information
- augment missing information
Systematic Review
Key: Main topic, Entire study, Secondary Information, External database
14METIS Information Extractor
- Semantic Grammar
- Features: words, numbers, and semantic types in the Unified Medical Language System (UMLS)
- Information extracted
- risk factor exposure (tobacco and alcohol)
- gender
- age (min, max, mean)
- start and end dates
- number of subjects with medical condition
- geographical location
Example rule: term "age", term "of", number (10 < n2 < 110), term "to", number (10 < n2 < 110)
Example sentence: The age of breast cancer subjects ranged between 20 to 64 years old.
Semantic type: neoplastic process, or disease
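To make the rule-based idea concrete, here is a minimal sketch of an age-range extractor. It is not the METIS grammar: it uses a plain regular expression, omits UMLS semantic types entirely, and the pattern and function name are assumptions made for illustration. Only the bounds check (each number between 10 and 110) mirrors the rule on the slide.

```python
import re

# Hypothetical pattern: the word "age" (or "ages"/"aged"), then two numbers
# joined by "to", "and", or a hyphen, all within one sentence.
AGE_RANGE = re.compile(
    r"\bage[sd]?\b[^.]*?\b(\d{1,3})\s*(?:to|and|-)\s*(\d{1,3})\b",
    re.IGNORECASE,
)

def extract_age_range(sentence):
    """Return (min_age, max_age) if a plausible age range is found, else None."""
    for match in AGE_RANGE.finditer(sentence):
        low, high = int(match.group(1)), int(match.group(2))
        # Keep only values inside the bounds used by the slide's rule.
        if 10 < low < 110 and 10 < high < 110 and low <= high:
            return low, high
    return None

sentence = "The age of breast cancer subjects ranged between 20 to 64 years old."
print(extract_age_range(sentence))  # (20, 64)
```

A grammar that also checks semantic types would additionally confirm that the phrase between "age of" and the numbers refers to a disease, which a bare regular expression cannot do.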
15METIS Info Extractor Evaluation
- Diverse text corpus
- epidemiology, surgery, biology, ...
- cohort studies, case-control trials, ...
- Evaluation
- Metrics (precision, recall)
- Annotators (developer, domain expert, expert annotator, novice)
- Primary topic (breast cancer, impotence)
- Secondary information (tobacco and alcohol consumption)
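The two metrics named above can be computed by comparing extracted facts with an annotator's gold standard. The sketch below assumes each fact is represented as an (article, field, value) tuple; that representation and the sample facts are hypothetical and not taken from the METIS evaluation.

```python
def precision_recall(extracted, gold):
    """Compare a set of extracted facts with a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted facts that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard facts that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical facts, for illustration only.
gold = {("a1", "min_age", 20), ("a1", "max_age", 64), ("a1", "gender", "female")}
extracted = {
    ("a1", "min_age", 20),
    ("a1", "max_age", 64),
    ("a1", "gender", "male"),
    ("a1", "subjects", 100),
}
precision, recall = precision_recall(extracted, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```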
16METIS Info Extractor Recall
17METIS Info Extractor Precision
18METIS Verifier
Converted Article
Electronic version of article
Verify information extracted
19METIS Verifier
20METIS Analyzer
- Meta-Analysis
- Developed for agricultural application
- Requires empirical studies with a quantitative outcome
- Unit of study is an article, not a person
- Result: a unitless metric called an effect size
- Two common meta-analysis techniques
- Fixed effects
- Random-effects model
Evaluation: compared generated effect size with examples in textbooks and published articles. Result: same effect size.
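The two techniques named on this slide can be sketched as follows. This is an illustration, not the METIS Analyzer: it assumes inverse-variance weighting for the fixed-effect estimate and the DerSimonian-Laird estimator for the between-study variance, and the function names and input numbers are hypothetical.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted mean effect and its variance."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1 / total

def random_effects(effects, variances):
    """DerSimonian-Laird estimate: returns (pooled, variance, tau2)."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    pooled_fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    k = len(effects)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    pooled, variance = fixed_effect(effects, [v + tau2 for v in variances])
    return pooled, variance, tau2

# Hypothetical effect sizes and variances from three articles.
effects = [0.20, 0.35, 0.50]
variances = [0.010, 0.020, 0.015]
print(fixed_effect(effects, variances))
print(random_effects(effects, variances))
```

When the studies agree closely, the estimated between-study variance is zero and the two models return the same pooled effect.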
21Synthetic Estimate Evaluation
Tobacco Consumption
24Outline
- Motivation
- Case Studies
- METIS
- Claim Jumping
- Human discovery
- Natural language processing
- Human-assisted discovery and synthesis
- Next Steps
- Summary
26Human Discovery
- Day-to-day activities of scientists reflect
- the complex socio-technical environments in which successful creativity tools will eventually be embedded
- the human cognitive processing surrounding creativity
- Unit of analysis: a paper or grant proposal
How do chemists arrive at their research question?
How do chemists transform an idea into a publication?
27Approach
- Recruitment
- experienced scientists (7-45 yrs)
- local chemists and chemical engineers
- response rate 84% (21/25)
- Semi-structured interviews
- Critical incident technique
- seminal paper in their field
- recent paper authored by the participant
- paper authored by the participant that they were particularly proud of
28Interview Questions
- Discovery Questions
- What is your definition of discovery?
- What evidence convinced you that the paper addressed the initial research questions?
- What factors limited the adoption and deployment of the discovery?
- How did you arrive at the research question?
- What, if any, existing evidence prompted the study/experiment?
- Were there any alternative explanations?
- Information Usage Questions
- Other than the scientific literature, what information resources do you draw from to aid in your research processes?
- How many articles did you read last month that related to each of those projects?
- Is that typical of how many articles you read in a month for research projects?
- Do you read articles for another purpose? If so, what?
- How many hours do you spend reading journal articles for research projects?
- Which journals do you typically read and draw from?
- How would you characterize the journals that you read: are they only within your domain, or do you read journals that would be considered non-traditional in your research?
- If you only have a few minutes to read an article, what parts would you read?
- What do you do with the article once you have read it?
29Chemists and Chemical Engineers
- Compared with other scientists, chemists and chemical engineers
- read more (Brown, 1999)
- have more personal subscriptions to journals (Noble & Coughlin, 1997)
- spend more time reading (Tenopir & King, 2003)
- visit the library more often (Brown, 1999)
- Consequences
- information disseminated quickly
- information has a relatively short lifespan
30Human Discovery Findings
- Discovery definition
- Novelty
- Balance theory and experimentation
- Build on existing ideas
- Practical application
- Simplicity
- Hypothesis generation
- Discussion
- Previous experiments
- Combine expertise
- Read literature
- Hypothesis validation
- Iterative
- Tightly coupled
32Causal Relationships
- Newspaper genre
- Causal relationships (Khoo, Chan, Niu, 1998)
- Biomedical genre
- Causes and treats (Price & Delcambre, 2005)
- Causal knowledge (Khoo, Chan, Niu, 2000)
- Universal Grammar
- Causatives (Comrie, 1974, 1981)
- Action verbs (Thomson, 1987)
33Claim Definition
- To assert in the face of possible contradiction
- Example sentence reporting a claim
- This study showed that Tamoxifen reduces the breast cancer risk
- Example Claim Framework
- Tamoxifen (agent)
- reduces (change)
- breast cancer risk (object)
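One way to picture the example is as a small record with one field per facet. The sketch below is illustrative only: the class name, field names, and helper method are hypothetical, and it covers just the three facets shown in this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """One claim, split into the three facets shown in the example."""
    agent: str    # agent facet
    change: str   # change facet
    object_: str  # object facet (trailing underscore avoids the builtin name)

    def as_text(self) -> str:
        """Join the facets back into a short readable statement."""
        return f"{self.agent} {self.change} {self.object_}"

claim = Claim(agent="Tamoxifen", change="reduces", object_="breast cancer risk")
print(claim.as_text())  # Tamoxifen reduces breast cancer risk
```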
34The Claim Framework
- Goal
- go beyond genes and proteins
- differentiate between different levels of confidence in the claim
- consider claims made in the full text
- Working hypothesis
- literature will report findings using constructs within the Claim Framework
- human annotators will agree on facets
35Preliminary Results
- 29 articles from TREC Genomics
- Total number of sentences: 5535
- Sentences with ≥1 claim: 1250 (22.6%)
- Total number of claims: 3228
- Average claims per sentence: 2.51
- Claims that did not fit in the Framework: 31
- Per article
- Average number of sentences: 191
- Average number of sentences with ≥1 claim: 43
36Distribution of Claim Categories
Category      Total (n)  Total (%)  Pilot (n)  Pilot (%)  Main (n)  Main (%)
Explicit      2489       77.11      332        83.42      2157      76.63
Implicit      87         2.70       3          0.75       84        2.98
Observation   298        9.23       24         6.03       274       9.73
Correlation   174        5.39       12         3.02       162       5.75
Comparison    165        5.11       27         6.85       138       4.9
Total         3228       100        398        100        2830      100
37All Documents
Annotation         Total (n)  Total (%)  Words (n)  Words (Avg)
Agent              2894       89.65      5221       1.80
Agent Direction    285        8.83       291        1.02
Agent Modifier     1246       38.60      4448       3.57
Object             3197       99.04      6849       2.14
Object Direction   271        8.40       283        1.04
Object Modifier    1561       48.36      5383       3.44
Change             1897       58.77      1953       1.03
Change Direction   1337       41.42      1358       1.02
Change Modifier    1147       35.53      1618       1.41
Claim Basis        165        5.11       394        2.39
Claim Basis Dir.   42         1.30       43         1.02
Claim Basis Mod.   86         2.66       266        3.09
Total              3228                  28107      8.70
38Inter Annotator Agreement
- Information facet: kappa (agreement)
- Agent: 0.71 (substantial)
- Object: 0.77 (substantial)
- Change: 0.57 (moderate)
- Change + Change Direction: 0.88 (almost perfect)
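For readers unfamiliar with kappa, the sketch below computes Cohen's kappa for two annotators and maps the value to the commonly used Landis and Koch bands, which match the labels on this slide. Whether the study used Cohen's kappa or another variant is an assumption here, and the function names and sample labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Landis and Koch style descriptive bands."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical annotations of whether each token belongs to the Agent facet.
annotator_1 = ["y", "y", "n", "n"]
annotator_2 = ["y", "n", "n", "n"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa, interpret(kappa))
```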
39Location of Claims
Section        Sentences with claim  Total sentences  % of section  % of claims
Abstract       98                    309              31.72         7.84
Introduction   357                   979              36.47         28.56
Method         6                     1121             0.54          0.48
Result         293                   1829             16.02         23.44
Discussion     539                   1406             38.34         43.12
Total          1250                  5535             22.58         100.00
41User Study
- Steven W. Matson, Ph.D.: Professor and Chair, Department of Biology
- Robert C Millikan, DVM, PhD: Barbara Sorenson Hulka Distinguished Professor, Department of Epidemiology, School of Public Health
- Dr. Rosa Perelmuter, PhD: Director, Moore Undergraduate Research Apprentice Program; Professor of Spanish and Assistant Dean, Academic Advising Program
- Jan F. Prins, PhD: Professor of Computer Science and Chairman, Department of Computer Science
- Alexander Tropsha, Ph.D.: Professor and Chair; Director, Laboratory for Molecular Modeling
- Suzanne West, PhD: Researcher, Health, Social and Economics Research, RTI International
- Timothy S. Carey, MD, MPH: Sarah Graham Kenan Professor of Medicine; Director, Cecil G Sheps Center for Health Services Research
- Ila Cote, PhD, DABT: Acting Division Director, US Environmental Protection Agency, National Center for Environmental Assessment
- Michael T Crimmins, PhD: Mary Ann Smith Distinguished Professor of Chemistry, UNC, and Department Chair, Department of Chemistry
- Paul Jones: Clinical Associate Professor, School of Information and Library Science; Director of ibiblio.org
- Rudy L Juliano, PhD: Boshamer Distinguished Professor of Pharmacology
43Closing Comments
- Accelerate synthesis
- Breast cancer study without METIS would take >13 years
- Without synthetic estimate systematic review
- Accelerate discovery
- Connections between literature
- Speculative and orthogonal views
- Human discovery and synthesis
- As important if not more so than automation
"Tap the vast reservoir of human knowledge" (Louis Round Wilson, 1929)
44Acknowledgements
- Claim Jumping
- Funded in part by
- Faculty fellowship from the Renaissance Computing Institute
- UNC Faculty Award
- Thanks to collaborators
- Nassib Nassar and Mats Rynge (RENCI)
- Amol Bapat and Ryan Jones (SILS)
- Chemists and Chemical Engineers Study
- Funded in part by
- NSF Center for Environmentally Responsible
Solvents and Processes
- METIS
- Funded in part by
- California Breast Cancer Research program
- University of California, Irvine
- Thanks to user groups
- Particularly to Dr. Adams and Dr. Tengs
- Academic mentoring
- Primary Advisor: Dr. Wanda Pratt
- Medical Mentor: Dr. Catherine Carpenter
- Co-Advisors: Dr. Dennis Kibler and Dr. Michael Pazzani
- Committee Member: Dr. Paul Dourish
45Questions and Comments Welcome
- Catherine Blake
- cablake_at_email.unc.edu
- School of Information and Library Science
- University of North Carolina at Chapel Hill
- http://www.ils.unc.edu/cablake
46Publication Bias
- Studies that find a correlation between a risk factor and disease are more likely to be published (Easterbrook et al., 1991; Ingelfinger et al., 1994)
- METIS provides a new way to explore this bias
Bias introduced by authors, editors, funding, ...