Title: Text Mining: Challenges and Opportunities
1 Text Mining: Challenges and Opportunities
- Padmini Srinivasan
- School of Library and Information Science
- Department of Management Sciences
- The University of Iowa, Iowa City, IA
- http://mingo.info-science.uiowa.edu/padmini
- padmini-srinivasan_at_uiowa.edu
2 Acknowledgements
- University of Iowa Faculty Scholar Award
- NSF ITR grant, 2003-2006
- Students: Aditya Kumar Sehgal, Xin Ying Qiu, Micah Wedemeyer, Li Zhou
- Colleagues: Bisharah Libbus, Olivier Bodenreider, David Eichmann, Marc Light
3 Outline
1. What is text mining?
2. Our profile-based approach
3. Studies conducted: Turmeric
4. The search for a general model of text mining
4 Text Mining
- Contributes to knowledge discovery: hypothesis generation, hypothesis exploration
- Mine a collection of texts for novel and interesting ideas and associations
- Propositions/hypotheses need follow-up verification
5 Consider a local example
- Weinstock, gastroenterologist (U. Iowa): drinking a concoction containing thousands of pig whipworm eggs could protect people against inflammatory bowel disease (IBD)
- IBD is rare in countries where parasitic infections are more common
- BioCure (German): trials with the drink
- Make serendipity more likely!!
6 Interesting links
- Finding a plausible hypothesis about a new treatment for a disease
- Finding possible new products in which particular components may be used
- Looking for connections:
  - between people
  - between genes and diseases
  - between drugs and cellular functions
  - between products and consumers
  - between politicians and campaign contributors
- All from semi-structured texts
7 Single link paths through texts
Person A --(leads)-- Political organization B
Political organization B --(member of)-- Person C
Person C --(member of)-- Political organization D
Political organization D --(member of)-- Person E
Person E --(funds)-- Organization F
Person A --?-- Organization F
8 Finding these links in texts?
Person A --(co-occurs with, 0.8)-- Political Organization B
Political Organization B --(co-occurs with, 0.5)-- Person C
Person C --(co-occurs with, 0.9)-- Political Organization D
Political Organization D --(co-occurs with, 0.4)-- Person E
Person E --(co-occurs with, 0.4)-- Organization F
9 In fact, co-occurrence is often exploited
Jenssen et al., PubGene, Nature Genetics, 2001
(Screenshot from PubGene web site)
10 In fact, co-occurrence is often exploited
Jenssen et al., PubGene, Nature Genetics, 2001
- Does the pair co-occur more than expected by chance alone?
- Is the co-occurrence at the sentence level?
- Is the sentence in a key location?
(Screenshot from PubGene web site)
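One way to read the question "more than expected by chance alone" is to compare the observed co-occurrence count with the count expected if the two entities were independent. The sketch below is only an illustration under that assumption; it is not PubGene's method. The function name `cooccurrence_stats` and the toy units are made up, and each unit could be a document or a sentence.

```python
import math

def cooccurrence_stats(units, a, b):
    """Compare the observed co-occurrence of a and b with the count expected by chance."""
    n = len(units)
    n_a = sum(1 for u in units if a in u)
    n_b = sum(1 for u in units if b in u)
    n_ab = sum(1 for u in units if a in u and b in u)
    # Expected joint count if a and b occurred independently of each other
    expected = n_a * n_b / n
    # Pointwise mutual information: positive when the pair co-occurs more than chance
    pmi = math.log2(n_ab / expected) if n_ab and expected else float("-inf")
    return {"observed": n_ab, "expected": expected, "pmi": pmi}

# Toy units (documents or sentences), each reduced to the set of entities it mentions
units = [
    {"Person A", "Organization B"},
    {"Person A", "Organization B"},
    {"Person A", "Person C"},
    {"Organization B"},
    {"Person C", "Organization D"},
    {"Organization D"},
]
stats = cooccurrence_stats(units, "Person A", "Organization B")
print(stats)
```

Running the same function over sentences rather than whole documents gives a stricter test, which is one way to address the sentence-level question on the slide.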
11 Extract the particular relationships
Person A --(co-occur, 0.8)-- Organization B
Person A --(leads, 0.8)-- Organization B
- A, the president of B
- As president, A has the choicest parking spot in B
- The CEO A of B has to ..
- A set the vision for B in ..
- During his term as CEO, A made B into a success
- One of A's responsibilities as president is to see that B ..
- etc.
12
- F gave a check for 10,000 pounds to E ..
- E has been paid several times by the charitable group, F
- F continues to financially support E's activities.
- etc.
Person E --(funds, 0.4)-- Organization F
13 Extraction is a precursor to text mining
Person A --(co-occurs with, 0.8)-- Organization B
Organization B --(co-occurs with, 0.5)-- Person C
Person C --(co-occurs with, 0.9)-- Organization D
Organization D --(co-occurs with, 0.4)-- Person E
Person E --(co-occurs with, 0.4)-- Organization F
14 Instead of single link paths
You extract the following set of links in the text collection.
Person A -- Location 1, Location 2, Location 3, Location 4, Location 5, Location 6 -- Person B
(present at the same time)
Person A --?-- Person B
15
Person A -- location1, location2, location3, location4, opinion1, opinion2, opinion3, organization1, organization2, affiliate1, affiliate2, affiliate3 -- Person B
- Know about person A, looking for similar people
- Is this a clustering problem? Should we start clustering all people?
- But then which objects are we going to be interested in next?
16 Interested in dishes similar to rice pudding
rice pudding --?--
17 Interested in recipes similar to rice pudding
(Figure: rice pudding linked to the similar recipes sutlac, arroz con leche, ris a l'amande, payasam, kheer, riz bhaleeb and grod, with similarity weights ranging from 0.5 to 0.9. Ingredients shown: rice, milk, sugar, almonds, pistachio, cinnamon, vanilla.)
18 Flexible solution needed, and in fact ..
19 Different objects
Person A -- prior experience, address, degrees, projects, expected salary, hobbies -- Company B
Start with person A, looking for appropriate companies, or the other way around.
20
Drug A -- molecular function, cellular function, pathologic function, genomic function, physiologic function, tissue function -- Disease B
Have some intuition about a particular drug and a disease: start from both directions.
21 Generalizing this a bit, we can say that in our approach we ..
22 Build profiles
Drug A -- molecular function, cellular function, pathologic function, genomic function, physiologic function, tissue function -- Disease B
Build a profile for drug A and a profile for disease B, then compare them to see to what extent they overlap in the dimensions that are interesting.
23 General approach: Topics and Profiles
- Topic: e.g. a gene, a disease, a company or product; represented by an appropriate subset of documents from the text collection (MEDLINE)
- Profile: set of terms/concepts characterizing the topic, with weights to represent their relative importance
- Text mining involves comparing and connecting profiles.
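As a rough illustration of building and comparing profiles, the sketch below weights each term by its frequency in the topic's documents times a smoothed inverse document frequency, then compares two profiles with cosine similarity. Both choices are assumptions made for the example; the slides do not specify the weighting or the comparison measure. The names `build_profile` and `cosine`, and the toy collection, are hypothetical.

```python
import math
from collections import Counter

def build_profile(topic_docs, all_docs):
    """Weight terms by frequency in the topic's documents times a smoothed IDF."""
    n = len(all_docs)
    df = Counter(term for doc in all_docs for term in set(doc))
    tf = Counter(term for doc in topic_docs for term in doc)
    return {t: c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items()}

def cosine(p, q):
    """Cosine similarity between two sparse term-weight profiles."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

# Toy collection: each document is a list of concept labels (invented for illustration)
collection = [
    ["drug_a", "molecular_function_1", "cellular_function_1"],
    ["drug_a", "cellular_function_1"],
    ["disease_b", "cellular_function_1", "tissue_function_1"],
    ["disease_b", "tissue_function_1"],
    ["other_topic", "genomic_function_1"],
]
# A topic is represented by the subset of documents that mention it
drug_profile = build_profile([d for d in collection if "drug_a" in d], collection)
disease_profile = build_profile([d for d in collection if "disease_b" in d], collection)

shared = set(drug_profile) & set(disease_profile)
similarity = cosine(drug_profile, disease_profile)
print(shared, round(similarity, 3))
```

Restricting both profiles to one concept type (for example, only cellular functions) before comparing them would show overlap along a single dimension of interest.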
24 Open Discovery
A (Drug) -- B1, B2, B3, B4, B5, .., Bn -- C1, C2, .., Cm (Disease)
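The open discovery pattern (start from A, move through intermediate B terms, arrive at candidate C topics) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Manjal implementation: profiles are plain dictionaries of term weights, `types` is a hypothetical lookup from term to concept type, and a candidate C is scored by summing the products of the A-B and B-C weights.

```python
from collections import defaultdict

def open_discovery(a, profiles, types, b_type, c_type, top_b=5):
    """Rank C topics reachable from A through B terms, skipping C terms already in A's profile."""
    a_profile = profiles[a]
    # Keep the highest-weighted terms of the intermediate type from A's profile
    b_terms = [t for t in sorted(a_profile, key=a_profile.get, reverse=True)
               if types.get(t) == b_type][:top_b]
    scores = defaultdict(float)
    for b in b_terms:
        for c, w in profiles.get(b, {}).items():
            # Only novel candidates: the right type and not directly linked to A
            if types.get(c) == c_type and c != a and c not in a_profile:
                scores[c] += a_profile[b] * w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy profiles and types (all names and weights are invented for illustration)
profiles = {
    "drug_A": {"B1": 0.9, "B2": 0.4, "C0": 0.3},
    "B1": {"C1": 0.8, "C0": 0.5},
    "B2": {"C1": 0.2, "C2": 0.7},
}
types = {"B1": "gene", "B2": "gene", "C0": "disease", "C1": "disease", "C2": "disease"}

ranked = open_discovery("drug_A", profiles, types, b_type="gene", c_type="disease")
print(ranked)
```

C0 is dropped because it already appears in A's profile, so it would not be a novel connection.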
25 Closed Discovery
A (Drug) -- B1, B2, B3, B4, B5, .., Bn -- C (Disease)
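In closed discovery both A and C are given, and the task is to find the B terms that connect them. A minimal sketch, assuming profiles are dictionaries of term weights and that a shared term is ranked by the smaller of its two weights (an assumption chosen for the example, not taken from the slides):

```python
def closed_discovery(a_profile, c_profile, top_n=10):
    """Return the B terms shared by both profiles, strongest connections first."""
    shared = set(a_profile) & set(c_profile)
    # A connecting term is only as strong as its weaker link
    ranked = sorted(shared, key=lambda b: min(a_profile[b], c_profile[b]), reverse=True)
    return [(b, a_profile[b], c_profile[b]) for b in ranked[:top_n]]

# Toy profiles (term names and weights are invented for illustration)
drug_profile = {"B1": 0.9, "B2": 0.4, "B3": 0.2}
disease_profile = {"B2": 0.7, "B3": 0.6, "B4": 0.5}

links = closed_discovery(drug_profile, disease_profile)
print(links)
```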
26 Larger groups of topics
A1, A2, ..: common properties?
Genes from a microarray experiment, or a set of diseases, or a group of drugs
27 Manjal: prototype text mining tool
http://sulu.info-science.uiowa.edu/Manjal.html
w/ Sehgal, Qiu
29
- Open Discovery
- Closed Discovery
- Larger Group of Topics
32 Current work with Topic Profiles
1. Building profiles for genes (humans, ..). Gene name ambiguity: BAD, CART, I, .. (w/ Sehgal)
2. Using different kinds of profiles for analysis of microarray data (w/ Qiu, Bodenreider, Zheng)
3. Exploring differences between the discourses of patients, journalists and researchers (journals and clinical trials) on diseases/health problems: Autism (w/ Zhou)
33
4. Extracting speculative statements (w/ Qiu and Light)
- Build profiles for topics
- Use these to go back into the documents
- Pull out the speculative statements
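The slides do not say how speculative statements are recognized. Purely as an illustration, the sketch below keeps sentences that mention a topic term and contain a hedging cue such as "may" or "suggest". The cue list, the function name `speculative_sentences`, and the sample text are all assumptions made for the example.

```python
import re

# Small, hand-picked list of hedging cues (an assumption, not from the slides)
HEDGE_CUES = re.compile(
    r"\b(may|might|could|suggests?|suggested|possibly|potentially|appears?)\b",
    re.IGNORECASE,
)

def speculative_sentences(text, topic_terms):
    """Return sentences that mention a topic term and contain a hedging cue."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    terms = [t.lower() for t in topic_terms]
    return [
        s for s in sentences
        if HEDGE_CUES.search(s) and any(t in s.lower() for t in terms)
    ]

# Made-up sample text about a placeholder "drug X"
text = (
    "Drug X is used for pain. "
    "These results suggest that drug X may reduce inflammation. "
    "Drug Y might help too."
)
found = speculative_sentences(text, ["drug x"])
print(found)
```

In a profile-based setting, `topic_terms` would come from the topic's profile, so that the search goes back into the documents with the terms that characterize the topic.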
34 Profile for VIOXX (Rofecoxib)
- Cornea, 2004: This case report suggests that oral rofecoxib may trigger Stevens-Johnson syndrome, potentially causing symblepharons, corneal neovascularization and cicatricial ectropions.
- Drug Safety, 2004: Do some inhibitors of COX-2 increase the risk of thromboembolic events?
- Am J Clinical Oncology, Aug 2003 (Merck): Rofecoxib (Vioxx) is used clinically for osteoarthritis and pain, and in addition the results described here suggest that Vioxx may be useful as a chemopreventive in humans at risk for colorectal neoplasia.
35 Open Discovery: Exploring Turmeric (Curcuma longa)
2004 ISMB Bioinformatics paper, w/ Libbus (NLM)
- Widely used spice in Asia for hundreds of years
- Used for treating burns, ulcers, various skin diseases, etc.
- Can we use open discovery to suggest novel uses for turmeric?
36 Open Discovery
Curcumin -- (genes, enzymes, proteins) -- Diseases?
37 Open Discovery
Curcumin -- (genes, enzymes, proteins) -- Diseases?
Retinal diseases, Crohn's disease, problems related to the spinal cord
38 Retinal Diseases
Curcumin -- TNF-alpha, IL1-beta, COX-2, JNK, ERK, MAPK, NF-kappaB
- TNF-alpha elevated in early stages of diabetic retinopathy
- Activation of TNF-alpha may lead to glaucoma
- Anti-TNF-alpha treatment reduced leukocyte adhesion to eye blood vessels and vascular leakage (problems in retinopathy)
- TNF-alpha activation followed by NF-kappaB transcription (suppressed by curcumin)
39 General Model: Where does one start? Finding Haystacks with Needles?
- Take a substance
- Identify key functions/mechanisms
- Identify other diseases in which the mechanism is significant
- Challenge: properties of a good text mining problem?
40 Challenge!
- Weinstock, gastroenterologist (U. Iowa): drinking a concoction containing thousands of pig whipworm eggs could protect people against inflammatory bowel disease (IBD)
- IBD is rare in developing countries where parasitic infections are more common
- BioCure (German): trials with the drink
41 Summary
- Text mining approach: topic profiles
- Highlights of our research with profiles
- Current challenge: finding a model for interesting text mining problems
Thank you!