Title: Text Mining
1Text Mining
- Walter Daelemans
- CNTS
- Department of Linguistics
- University of Antwerp
- walter.daelemans_at_ua.ac.be
2Centre for Dutch Language and Speech (CNTS)
- Part of the Department of Linguistics, University of Antwerp
- Staff
- 2 tenured, 10-15 with temporary funding from EU, IWT, FWO, NTU, language industry, BOF, ...
- Topics
- Corpus Linguistics (mainly Dutch)
- Child language acquisition / computational psycholinguistics
- Language Technology
- machine learning of language
- shallow parsing
- text mining
3Information Overload
- Language is the most natural and most used knowledge representation formalism
- Non-structured or weakly structured information
- Text
- Databases with text fields
- Web pages, e-mail messages, blogs, chat, ...
- (Non-structured) information overload
- Doubles every three months (Gardner)
- Hampers knowledge management and business intelligence
- Translation bottleneck
4Natural Language Understanding?
- Word meaning
- Morphological analysis
- Complex Word Interpretation
- Word Sense Disambiguation
- Sentence Meaning
- Syntactic structure (parsing)
- Sentence interpretation
- Discourse Meaning
- World Knowledge
- Frames, scenarios, grounding, intentions, ...
- Fremdzugehen
- External train marriages
- The box is in the pen
- I eat a pizza with extra cheese
- I eat a pizza with a fork
- I eat a pizza with my daughter
- The mayors didn't want the students to strike because they feared violence
- The mayors didn't want the students to strike because they preached the revolution
5State of the Art
- Robust, efficient, accurate, unrestricted language understanding will not be available for a long time
- AI-complete problem
- Alternative
- Text mining: automatic extraction of reusable knowledge from text, based on linguistic analysis of the text
6Approach
- Text analysis tools (shallow instead of deep understanding)
- Robust / Efficient / Accurate
- Text Mining applications
- Question Answering
- Summarization
- Ontology extraction
- Information extraction
- Text categorization
- For embedding in
- End user applications related to knowledge search / management / discovery / communication
7Examples
- Application Areas
- Data mining (KDD) from unstructured and semi-structured data
- (Corporate) Knowledge Management
- Intelligence
- Example Applications
- Email routing and filtering (spam filtering)
- Finding protein interactions in biomedical text
- Brokering
- Matching on-line resumes and vacancies
- Buying and selling property
8Text Data Mining (Discovery)
- Find relevant information
- Information extraction
- Text categorization
- Analyze the text
- Text mining
- Discover new information
- Integrate different sources
- Data mining
9Don Swanson 1981: medical hypothesis generation
- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability
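The chain of statements on this slide can be read as a small association graph: migraine is linked to several intermediate concepts, and each of those is linked to magnesium, although no statement links the two directly. The sketch below is only an illustration of that idea, not Swanson's actual procedure. The pair list is transcribed from the bullets above, and the names `LINKS`, `build_graph` and `hidden_links` are assumptions made for this example.

```python
from collections import defaultdict

# Associations transcribed from the slide as undirected (term, term) pairs.
LINKS = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]


def build_graph(links):
    """Map each term to the set of terms it is directly associated with."""
    graph = defaultdict(set)
    for left, right in links:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def hidden_links(graph, start):
    """Rank terms that are not directly linked to `start` by the number
    of intermediate terms they share with it."""
    direct = set(graph[start])
    candidates = {}
    for middle in direct:
        for end in graph[middle]:
            if end != start and end not in direct:
                candidates.setdefault(end, set()).add(middle)
    return sorted(candidates.items(), key=lambda item: (-len(item[1]), item[0]))


for term, via in hidden_links(build_graph(LINKS), "migraine"):
    print(term, "via", sorted(via))
```

On this toy graph the only candidate is magnesium, reached through all four intermediate concepts, which is the hypothesis the slide builds up to.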
10CNTS text analysis tools
- MBSP
- Flexible and adaptable
- Dutch and English
- State of the Art accuracy and efficiency
- 90 sentences / 1000 words/sec
- Configurable combination of linguistic modules
- Modules developed using Machine Learning
- TiMBL
- Adaptation through re-training and semi-supervised learning
- Client-server set-up
11CNTS shallow understanding
12Insulatard is an isophane insulin suspension (NPH).
13Insulatard is an isophane insulin suspension (NPH).
Insulatard is an isophane insulin suspension ( NPH ) .
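The tokenization step shown on this slide separates punctuation from words. A minimal regex-based sketch is given below. It is a stand-in for illustration, not the CNTS tokenizer, and the names `TOKEN_PATTERN` and `tokenize` are assumptions.

```python
import re

# A token is either a run of word characters or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = tokenize("Insulatard is an isophane insulin suspension (NPH).")
print(" ".join(tokens))
# prints: Insulatard is an isophane insulin suspension ( NPH ) .
```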
14Insulatard is an isophane insulin suspension (NPH).
Insulatard/NNP is/VBZ an/DT isophane/JJ insulin/NN suspension/NN (/Punc NPH/NNP )/Punc ./Punc
15Insulatard is an isophane insulin suspension (NPH).
[NP Insulatard] [VP is] [NP an isophane insulin suspension] ( NPH )
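Chunking groups the tagged tokens into noun and verb phrases. The CNTS modules are described elsewhere in these slides as machine-learned, so the sketch below deliberately swaps in a much simpler hand-written rule: adjacent tokens whose tags map to the same chunk type are merged. The tag-to-chunk mapping and the names `chunk_type`, `chunk` and `render` are assumptions, and the rule would wrongly merge two noun phrases that happen to be adjacent.

```python
# Tagged tokens as shown on the part-of-speech slide.
TAGGED = [
    ("Insulatard", "NNP"), ("is", "VBZ"), ("an", "DT"), ("isophane", "JJ"),
    ("insulin", "NN"), ("suspension", "NN"), ("(", "Punc"), ("NPH", "NNP"),
    (")", "Punc"), (".", "Punc"),
]


def chunk_type(tag):
    """Map a part-of-speech tag to a chunk label, or None for no chunk."""
    if tag in {"DT", "JJ"} or tag.startswith("NN"):
        return "NP"
    if tag.startswith("VB"):
        return "VP"
    return None


def chunk(tagged):
    """Merge adjacent tokens that share a chunk label."""
    chunks = []
    for word, tag in tagged:
        label = chunk_type(tag)
        if label is not None and chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks


def render(chunks):
    """Format chunks in bracket notation; unchunked tokens stay bare."""
    parts = []
    for label, words in chunks:
        text = " ".join(words)
        parts.append(f"[{label} {text}]" if label else text)
    return " ".join(parts)


print(render(chunk(TAGGED)))
```

Unlike the analysis on the slide, this rule also brackets NPH as a noun phrase of its own, because it only looks at tags.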
16Insulatard is an isophane insulin suspension (NPH).
Insulatard: Medicine name, NPH: Hormone
17Insulatard is an isophane insulin suspension (NPH).
[SBJ Insulatard] is [PREDC an isophane insulin suspension] ( NPH )
18Application: Question Answering
- Give answer to question
- (document retrieval: find documents relevant to query)
- Who invented the telephone?
- Alexander Graham Bell
- When was the telephone invented?
- 1876
19QA System: Shapaqa
- Parse question
- When was the telephone invented?
- Which slots are given?
- Verb: invented
- Object: telephone
- Which slots are asked?
- Temporal phrase linked to verb
- Document retrieval on internet with given slot keywords
- Parsing of sentences with all given slots
- Count most frequent entry found in asked slot (temporal phrase)
20Shapaqa example
- When was the telephone invented?
- Google: invented AND the telephone
- produces 835 pages
- 53 parsed sentences with both input slots and with a temporal phrase
- ... is through his interest in Deafness and fascination with acoustics that the telephone was invented in 1876, with the intent of helping Deaf and hard of hearing
- The telephone was invented by Alexander Graham Bell in 1876
- When Alexander Graham Bell invented the telephone in 1876, he hoped that these same electrical signals could ...
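The slot-filling idea can be sketched on the three retrieved sentences quoted above. This is a rough illustration under stated assumptions, not the Shapaqa implementation: a keyword check stands in for parsing the given slots, and the regular expression `TEMPORAL` stands in for recognizing a temporal phrase. The names `GIVEN_SLOTS`, `TEMPORAL` and `rank_answers` are hypothetical.

```python
import re
from collections import Counter

# The three example sentences quoted on this slide.
SENTENCES = [
    ("is through his interest in Deafness and fascination with acoustics "
     "that the telephone was invented in 1876 , with the intent of helping "
     "Deaf and hard of hearing"),
    "The telephone was invented by Alexander Graham Bell in 1876",
    ("When Alexander Graham Bell invented the telephone in 1876 , he hoped "
     "that these same electrical signals could"),
]

# Slots given by the question "When was the telephone invented?".
GIVEN_SLOTS = {"verb": "invented", "object": "telephone"}

# Crude stand-in for the asked slot: a four-digit year introduced by "in".
TEMPORAL = re.compile(r"\bin (\d{4})\b")


def rank_answers(sentences, given_slots, asked_pattern):
    """Count asked-slot fillers in sentences that contain all given slots."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if all(value in lowered for value in given_slots.values()):
            counts.update(asked_pattern.findall(sentence))
    return counts.most_common()


print(rank_answers(SENTENCES, GIVEN_SLOTS, TEMPORAL))
# prints: [('1876', 3)]
```

Because the answer is chosen by frequency, a few wrongly extracted fillers do not change the top-ranked result as long as the correct one recurs across sources.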
21Shapaqa frequency ranking
- So when was the phone invented?
- Internet answer is noisy, but robust
- 17 1876
- 3 1874
- 2 ago
- 2 later
- 1 Bell
- ...
- System was developed quickly
- Precision 76 (Google 31)
- International competition (TREC) MRR 0.45
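MRR stands for mean reciprocal rank: each question scores 1/rank of the first correct answer in the system's ranked list (0 if none is correct), and the scores are averaged over all questions. The sketch below uses the ranking from this slide for the first question. The other two questions are hypothetical placeholders added only to show the averaging, and the function names are assumptions.

```python
def reciprocal_rank(ranked_answers, correct):
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_answers, correct) pairs."""
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in runs) / len(runs)


runs = [
    # Ranking taken from the frequency list on this slide.
    (["1876", "1874", "ago", "later", "Bell"], {"1876"}),
    # Hypothetical question whose correct answer is ranked second.
    (["answer_a", "answer_b"], {"answer_b"}),
    # Hypothetical question with no correct answer returned.
    (["answer_c"], {"answer_d"}),
]

print(mean_reciprocal_rank(runs))
# prints: 0.5
```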
22Application: Biomedical text mining (EU project BioMinT)
[Diagram labels: Medline abstracts; Text Analysis; Linguistic / Semantic Features; IR; IE; Templates; Factoids]
23(Partial) Factoids
- The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.
[Diagram labels: CELL-LINE: The mouse lymphoma assay, MLA; DNA part: the Tk gene; verb groups: utilizing, is widely used, to identify; chemical mutagens; relation links marked S and O]
24
<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
  <S cnt="s1">
    <NP rel="SBJ" of="s1_1">
      <W pos="DT">The</W>
      <W pos="NN" sem="cell_line">mouse</W>
      <W pos="NN" sem="cell_line">lymphoma</W>
      <W pos="NN">assay</W>
    </NP>
    <W pos="openparen">(</W>
    <NP>
      <W pos="NN" sem="cell_line">MLA</W>
    </NP>
    <W pos="closeparen">)</W>
    <VP id="s1_1">
      <W pos="VBG">utilizing</W>
    </VP>
    <NP rel="OBJ" of="s1_1">
      <W pos="DT">the</W>
      <W pos="NN" sem="DNA_part">Tk</W>
      <W pos="NN" sem="DNA_part">gene</W>
    </NP>
    <VP id="s1_2">
      <W pos="VBZ">is</W>
      <W pos="RB">widely</W>
      <W pos="VBN">used</W>
    </VP>
    <VP id="s1_3">
      <W pos="TO">to</W>
      <W pos="VB">identify</W>
    </VP>
    <NP rel="OBJ" of="s1_3">
      <W pos="JJ">chemical</W>
      <W pos="NNS">mutagens</W>
    </NP>
    <W pos="period">.</W>
  </S>
</MBSP>
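Output of this shape can be read with a standard XML parser to recover which noun phrase stands in which relation to which verb group. The sketch below assumes the parser output is well-formed XML with `NP`, `VP` and `W` elements and `rel`, `of` and `id` attributes as on this slide. It uses an abbreviated fragment without the DOCTYPE line, and the helper names `chunk_text` and `relations` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment in the format shown on the slide (DOCTYPE omitted).
FRAGMENT = """
<MBSP>
  <S cnt="s1">
    <NP rel="SBJ" of="s1_1">
      <W pos="DT">The</W>
      <W pos="NN" sem="cell_line">mouse</W>
      <W pos="NN" sem="cell_line">lymphoma</W>
      <W pos="NN">assay</W>
    </NP>
    <VP id="s1_1">
      <W pos="VBG">utilizing</W>
    </VP>
    <NP rel="OBJ" of="s1_1">
      <W pos="DT">the</W>
      <W pos="NN" sem="DNA_part">Tk</W>
      <W pos="NN" sem="DNA_part">gene</W>
    </NP>
  </S>
</MBSP>
"""


def chunk_text(chunk):
    """Join the words of a chunk element into a single string."""
    return " ".join(word.text for word in chunk.iter("W"))


def relations(xml_text):
    """Return (relation, noun phrase, verb group) triples from the parse."""
    root = ET.fromstring(xml_text.strip())
    verbs = {vp.get("id"): chunk_text(vp) for vp in root.iter("VP")}
    triples = []
    for noun_phrase in root.iter("NP"):
        verb_id = noun_phrase.get("of")
        if noun_phrase.get("rel") and verb_id in verbs:
            triples.append((noun_phrase.get("rel"), chunk_text(noun_phrase), verbs[verb_id]))
    return triples


for triple in relations(FRAGMENT):
    print(triple)
```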
25Extracted IEX Templates from shallow parser output
- NP(<X protein>) contain NP(Y "domain")
- EVENT: contain
- PROTEIN: <protein>
- DOMAIN: domainf
- NP(<X protein>) be associated with NP(Y disease)
- EVENT: associated_with
- PROTEIN: <protein>
- DISEASE: head
- NP(<X protein>) regulate NP(Y)
- EVENT: regulate
- PROTEIN: <protein>
- Y
Jee-Hyub Kim (Geneva)
() to be extracted, <> semantic constraint, "" lexical constraint
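One way to picture how such a template applies to shallow parser output is as a test on a subject chunk, a verb lemma and an object chunk. The sketch below covers only the "contain" template and is an illustration under assumptions, not the extraction system credited on this slide. The `Chunk` and `Clause` classes, the `match_contain` function, the choice to fill DOMAIN with the full object text, and the example clause are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str                  # full text of the noun phrase
    head: str                  # head noun of the phrase
    sem: Optional[str] = None  # semantic class, e.g. "protein"


@dataclass
class Clause:
    subject: Chunk
    verb_lemma: str
    obj: Chunk


def match_contain(clause):
    """Apply the template NP(<X protein>) contain NP(Y "domain").

    The subject must carry the semantic class "protein" (semantic
    constraint) and the object head must be the word "domain"
    (lexical constraint).
    """
    if (clause.subject.sem == "protein"
            and clause.verb_lemma == "contain"
            and clause.obj.head == "domain"):
        return {
            "EVENT": "contain",
            "PROTEIN": clause.subject.text,
            "DOMAIN": clause.obj.text,
        }
    return None


# Hypothetical clause, invented purely to exercise the template.
example = Clause(
    subject=Chunk(text="protein A", head="A", sem="protein"),
    verb_lemma="contain",
    obj=Chunk(text="a binding domain", head="domain"),
)
print(match_contain(example))
```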
26Application: Ontology Extraction
- Clustering of head nouns of Subject-Verb and Verb-Object relations
- Combine with pattern matching and heuristics
- Case study: Medline (4 million words, hepatitis), SwissProt corpus
- Results
- Better clusters with shallow parsing
- Useful in knowledge management, thesaurus development, ...
Ontobasis (IWT)
27Example (SwissProt corpus)
- gene show significant homology
- amino_acid_sequence have/indicate/lack/reveal/show homology
- protein show homology, immunoreactivity, reactivity, sequence similarity
- protein inhibit catalytic activity, apoptosis, protein synthesis...
- protein exhibit significant homology
- protein bind copper, ubiquitin
- protein correspond isoelectric point
- induction requires protein synthesis
- Edman degradation of intact protein
- regulatory subunit of cAMP-dependent protein kinase
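Clustering head nouns by the verbs they occur with starts from a similarity measure between nouns. The sketch below represents each subject noun from the example above by the set of verbs it takes and compares nouns with Jaccard similarity. The slides do not state which measure or clustering algorithm was used, so Jaccard is an assumption chosen for simplicity, as are the names `SUBJECT_VERBS`, `jaccard` and `similar_pairs`.

```python
from itertools import combinations

# Subject noun -> verbs it occurs with, transcribed from the example slide.
SUBJECT_VERBS = {
    "gene": {"show"},
    "amino_acid_sequence": {"have", "indicate", "lack", "reveal", "show"},
    "protein": {"show", "inhibit", "exhibit", "bind", "correspond"},
}


def jaccard(left, right):
    """Size of the intersection divided by size of the union."""
    return len(left & right) / len(left | right)


def similar_pairs(contexts):
    """Return (similarity, noun, noun) triples, most similar first."""
    pairs = [
        (jaccard(contexts[a], contexts[b]), a, b)
        for a, b in combinations(sorted(contexts), 2)
    ]
    return sorted(pairs, reverse=True)


for score, first, second in similar_pairs(SUBJECT_VERBS):
    print(f"{first} ~ {second}: {score:.2f}")
```

A clustering step would then merge the most similar nouns first. With real corpus counts the verb sets are much larger, and weighting by frequency becomes worthwhile.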
29Further development
- Semantic roles
- Faster adaptation to new domains
- Domain semantics (NER / concept tagging)
- Active Learning / semi-supervised learning
- More analytic power
- Negation, modality, quantification
- Limited event and scenario recognition
30Conclusions
- Text Mining tasks benefit from text analysis
- Understanding can be formulated as a flexible heterarchy of classifiers
- These classifiers can be trained / adapted on annotated corpora and can eventually approximate deep understanding
31Questions?
- Walter Daelemans
- A1.10 Campus Drie Eiken
- (September Stadscampus)
- Walter.daelemans_at_ua.ac.be