Title: Mining Text Data
1 Mining Text Data
- Antje Wolf
- Seminar: Applications and Special Topics in Data Mining
- 6 July 2005
2 Overview
- Introduction
- Architecture of Text Mining Systems
- Tagging
- Statistical Tagging
- Semantic Tagging
- Structural Tagging
- Taxonomy Construction
- Implementation Issues
- Visualizations and Analytics for Text Mining
- Summary
- Example of a Text Mining Tool in Bioinformatics: ProMiner
3 Introduction
- about 80% of digital data is unstructured
- much information exists in textual form with little or no formatting
- growing interest in text mining
- different approaches
- use the entire set of words in the documents as input
- use tags associated with the documents
- information extraction
- Workflow: Input → Preprocessing → Data mining algorithms
4 Architecture of Text Mining Systems
- 3 major components
- Business intelligence suite
- Intelligent tagging
- Information feeders
5 Architecture of Text Mining Systems: Intelligent tagging component
- Statistical Tagging: categorization and term extraction
- Semantic Tagging: information extraction
- Structural Tagging: extraction from the visual layout of documents
- each tagger has a separate training module based on annotated examples
6 Statistical Tagging: Text Categorization
- activity of labeling natural language texts with thematic categories from a predefined set
- knowledge engineering approach
- the user manually defines a set of rules encoding expert knowledge on how to classify documents under the given categories
- typical rule of the CONSTRUE system (Hayes, 1992): if DNF formula (disjunction of conjunctive clauses) then category
- example:
  if ((wheat AND farm) OR (wheat AND commodity) OR (bushels AND export) OR (wheat AND tonnes)) then WHEAT else NOT WHEAT
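A DNF rule of this kind can be sketched in a few lines of Python. This is only an illustration, not the CONSTRUE implementation: it reads each parenthesized pair of the wheat example as a conjunction, and the names `WHEAT_RULE`, `tokenize`, and `matches_dnf` are hypothetical.

```python
import re

# Each inner set is one conjunctive clause; the rule fires if any clause is fully present.
# The clauses mirror the wheat example; the data structure itself is an assumption.
WHEAT_RULE = [
    {"wheat", "farm"},
    {"wheat", "commodity"},
    {"bushels", "export"},
    {"wheat", "tonnes"},
]


def tokenize(text):
    """Lowercase the text and return the set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches_dnf(text, rule):
    """Return True if at least one conjunctive clause is a subset of the document's words."""
    words = tokenize(text)
    return any(clause <= words for clause in rule)


print(matches_dnf("Wheat exports rose to 2m tonnes", WHEAT_RULE))
print(matches_dnf("Bushels of corn were sold", WHEAT_RULE))
```

The first document satisfies the clause (wheat AND tonnes) and is labeled WHEAT; the second satisfies no clause. Matching is on exact word forms, so "exports" does not satisfy a clause that requires "export".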
7 Statistical Tagging: Text Categorization
- machine learning approach
- training set of documents that are pretagged
using the predefined set of categories
8 Statistical Tagging: Text Categorization
- Example-based classifiers: k nearest neighbor (KNN)
- the decision whether document dj belongs to category ci depends on the k training documents most similar to dj
- best value for k?
- k = 20 (Larkey & Croft, 1996)
- 29 < k < 46 (Yang & Chute, 1994)
- distance-weighted version
- the categorization status value (CSV) is a similarity-weighted vote over the k nearest training documents
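One common way to compute a distance-weighted CSV is to sum the similarities of those among the k nearest training documents that belong to the category. The sketch below assumes cosine similarity over raw word counts and a tiny hypothetical training set. The function names are illustrative and not taken from the cited papers.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def csv_knn(doc, category, training, k=3):
    """Sum the similarities of the k nearest training documents that carry `category`."""
    query = vectorize(doc)
    scored = sorted(
        ((cosine(query, vectorize(text)), labels) for text, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return sum(sim for sim, labels in scored[:k] if category in labels)


# Hypothetical pretagged training documents.
TRAINING = [
    ("wheat harvest tonnes export", {"wheat"}),
    ("wheat farm prices rise", {"wheat"}),
    ("bank merger announced today", {"finance"}),
    ("central bank raises interest rates", {"finance"}),
]

print(csv_knn("wheat export prices", "wheat", TRAINING, k=2))
print(csv_knn("wheat export prices", "finance", TRAINING, k=2))
```

A document is then assigned to a category when its CSV exceeds a threshold, which has to be tuned on held-out data.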
9 Statistical Tagging: Text Categorization
- Support Vector Machines
- hyperplane that separates with maximum margin a
set of positive examples from a set of negative
examples
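For reference, the maximum-margin hyperplane for linearly separable data is usually written as the optimization problem below. The notation is the standard textbook one and is not taken from the slides: x_i are document vectors, y_i in {+1, -1} mark positive and negative examples, and w and b define the hyperplane.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the margin between the two classes is 2 / ||w||, so minimizing ||w|| maximizes the margin.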
10 Statistical Tagging: Term Extraction
- labeling each document with a set of terms extracted from the document
- Linguistic preprocessing
- tokenization: identifying text structure at the subparagraph level, that is, word boundaries, sentence boundaries, dates, abbreviations, etc.
- part-of-speech tagging: associating morpho-syntactic categories such as noun, adjective, verb, along with case, number, person
- lemmatization: assignment of the lemma, i.e., the base form of an inflected word, to every word token
- Term generation
- Term filtering
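The three stages can be illustrated with a deliberately simplified Python sketch. It assumes regex tokenization, word n-grams as candidate terms, and a frequency threshold as the filter. Part-of-speech tagging and lemmatization are left out because they need linguistic resources. The stopword list and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "is", "are", "for", "with"}


def tokenize(text):
    """Split lowercased text into word tokens, keeping hyphenated words together."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def generate_terms(tokens, max_len=2):
    """Candidate terms: word n-grams up to max_len that neither start nor end with a stopword."""
    terms = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                terms.append(" ".join(gram))
    return terms


def filter_terms(terms, min_count=2):
    """Keep only candidate terms that occur at least min_count times."""
    counts = Counter(terms)
    return {term: count for term, count in counts.items() if count >= min_count}


text = "Text mining extracts terms. Text mining of the text needs preprocessing."
print(filter_terms(generate_terms(tokenize(text))))
```

On this sample the surviving terms are "text" (3 occurrences), "mining" (2), and "text mining" (2). The tokenizer ignores sentence boundaries, so n-grams can span two sentences; a real system would segment sentences first.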
11 Semantic Tagging
- allows for mining of the actual information present within the text
- requires trained developers, very laborious
- extracted information is specific and precise
12 Semantic Tagging
- DIAL (Declarative Information Analysis Language)
- basic elements
- predefined strings, e.g. merger
- word class elements, e.g. a list of countries
- part-of-speech tags, like noun, adjective
- scanner features, e.g. Capital, HtmlTag
- constraints
- boolean checks for specific attributes
- IE rule bases
- logic program
- example
  FMergerCCM(C1, C2) :-
    Company(Comp1) OptCompanyDetails "and"
    skip(Company(x), SkipFail, 10) Company(Comp2) OptCompanyDetails
    skip(WCMergerVerbs, SkipFailComp, 20) WCMergerVerbs
    skip(WCMerger, SkipFail, 20) WCMerger
    verify(WholeNotInPredicate(Comp1, @PersonName))
    verify(WholeNotInPredicate(Comp2, @PersonName))
    C1 = Comp1; C2 = Comp2
13 Semantic Tagging
- rulebooks of DIAL
- Financial rulebook (11,500 rules)
- can identify more than 50 entity types such as company names, people names, organizations, products, locations
- can identify events such as mergers, joint ventures
- Business intelligence rulebook (7,000 rules)
- Intellectual property rulebook (100 rules)
- can identify 30 different types of entities in patent files
- Protein relationship rulebook (500 rules)
- can identify 30 different types of entities, including proteins
- can identify 10 different relationships, including phosphorylation
14 Structural Tagging
- ignores the content of words
- focuses on superficial features, like size and position on the page
- GIVEN
- template document A
- set of primitives in A (annotated fields), denoted PA
- like "AUTHOR...TITLE..."
- query document B
- FIND
- degree of similarity between A and B
- set of primitives in B that corresponds to PA
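The GIVEN/FIND formulation can be made concrete with a very rough sketch. It assumes each primitive is a bounding box (x, y, width, height) expressed as fractions of the page, matches every annotated template field to the geometrically closest box in the query document, and derives a similarity from the mean matching distance. The representation, the distance measure, and all names are assumptions for illustration, not the method of any particular system.

```python
import math


def box_distance(a, b):
    """Euclidean distance between two boxes given as (x, y, width, height)."""
    return math.dist(a, b)


def match_primitives(template, query_boxes):
    """Map each annotated template field to the index of the closest query box."""
    mapping = {}
    for field, box in template.items():
        mapping[field] = min(
            range(len(query_boxes)),
            key=lambda i: box_distance(box, query_boxes[i]),
        )
    return mapping


def similarity(template, query_boxes):
    """Similarity in (0, 1]: 1 means every template field has an identical box in the query."""
    mapping = match_primitives(template, query_boxes)
    mean = sum(box_distance(template[f], query_boxes[i]) for f, i in mapping.items()) / len(mapping)
    return 1.0 / (1.0 + mean)


# Hypothetical template A with annotated fields, and unlabeled boxes of query document B.
TEMPLATE_A = {"TITLE": (0.1, 0.05, 0.8, 0.08), "AUTHOR": (0.3, 0.15, 0.4, 0.03)}
QUERY_B = [(0.32, 0.16, 0.38, 0.03), (0.1, 0.30, 0.8, 0.5), (0.12, 0.06, 0.76, 0.08)]

print(match_primitives(TEMPLATE_A, QUERY_B))
print(similarity(TEMPLATE_A, QUERY_B))
```

The sketch lets two template fields map to the same query box. A more careful version would enforce a one-to-one assignment.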
15 Taxonomy Construction
- tree with terms as leaves
- enables construction of high-level association rules
- rules between groups of terms rather than between individual terms
- time-consuming task → need for semiautomatic construction
- many taxonomies exist for different domains, such as the Gene Ontology, which provides a controlled vocabulary to describe gene and gene product attributes in any organism
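To show how a taxonomy lets rules refer to groups of terms instead of individual terms, the sketch below lifts each leaf term to an ancestor node before any rule mining takes place. The toy taxonomy and the names `PARENT`, `ancestors`, and `generalize` are hypothetical.

```python
# Hypothetical toy taxonomy stored as child -> parent; terms are the leaves.
PARENT = {
    "Pfizer": "pharma companies",
    "Novartis": "pharma companies",
    "pharma companies": "companies",
    "PCR": "lab technologies",
    "microarray": "lab technologies",
    "lab technologies": "technologies",
}


def ancestors(node):
    """Return the chain of ancestors from the node's parent up to the root."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def generalize(terms, level):
    """Replace each term by its ancestor `level` steps up (0 keeps the term, capped at the root)."""
    result = set()
    for term in terms:
        chain = [term] + ancestors(term)
        result.add(chain[min(level, len(chain) - 1)])
    return result


print(generalize({"Pfizer", "PCR"}, 1))
print(generalize({"Pfizer", "Novartis"}, 1))
```

After generalization, documents mentioning different pharma companies all contribute to the same group, so an association rule between "pharma companies" and "lab technologies" can gain support that no single pair of terms would reach.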
16 Implementation Issues: Soft Matching
- problem: matching synonyms that refer to the same entity
- examples: punctuation variations, spelling mistakes, abbreviations, formal vs. informal names
- solutions
- the Soundex algorithm can match words that have a similar phonetic pronunciation
- lookup table for all abbreviations and nicknames of a given entity
- coding name conversion rules
- example: "X Corporation" and "X Corp." are mapped to "X"
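Two of these solutions are easy to sketch in Python: American Soundex codes and a name conversion rule that strips company suffixes. The suffix list and the function names are assumptions for illustration. A production system would combine these with the lookup table mentioned above.

```python
import re

# Soundex digit for each consonant; vowels, h, w, and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character American Soundex code of a word."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate letters that share a code
            prev = code
    return (result + "000")[:4]


# Hypothetical list of company suffixes to strip.
SUFFIX = re.compile(r"\s+(corporation|corp\.?|inc\.?|ltd\.?)$", re.IGNORECASE)


def canonical_name(name):
    """Apply a name conversion rule: drop a trailing company suffix."""
    return SUFFIX.sub("", name.strip())


print(soundex("Robert"), soundex("Rupert"))
print(canonical_name("X Corporation"), canonical_name("X Corp."))
```

"Robert" and "Rupert" both map to R163, and both company name variants are reduced to "X", so each pair would be treated as the same entity.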
17 Implementation Issues: Temporal Resolution
- "time-stamp" documents for temporal analysis (Trend Graphs)
- problems
- large variety of possible date formats
- relative date formats ("yesterday", "last month")
- fuzzy temporal phrases ("in the very near future")
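A minimal sketch of temporal resolution, assuming a small hypothetical list of absolute date formats and a handful of relative expressions that are resolved against a reference date such as the document's publication date. Fuzzy phrases are left unresolved here.

```python
from datetime import date, datetime, timedelta

# Assumed set of absolute formats; real collections need many more.
DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"]


def resolve_date(expr, reference):
    """Resolve a date expression to a date, or return None if it cannot be resolved."""
    expr = expr.strip()
    lowered = expr.lower()
    if lowered == "today":
        return reference
    if lowered == "yesterday":
        return reference - timedelta(days=1)
    if lowered == "last month":
        # First day of the month preceding the reference date.
        first_of_month = reference.replace(day=1)
        return (first_of_month - timedelta(days=1)).replace(day=1)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(expr, fmt).date()
        except ValueError:
            continue
    return None  # e.g. fuzzy phrases such as "in the very near future"


reference = date(2005, 7, 6)
for expr in ["06.07.2005", "2005-07-06", "yesterday", "last month", "in the very near future"]:
    print(expr, "->", resolve_date(expr, reference))
```

The order of `DATE_FORMATS` matters for ambiguous inputs: a string like 07/06/2005 is read as month/day/year only because that is the slash format listed, which is exactly the kind of ambiguity the slide refers to.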
18 Implementation Issues: Anaphora Resolution
- resolve coreferences
- pronominals ("he", "she", "we")
- definite noun phrases ("the ruthless man")
- solution
- collect all accessible antecedents for each referring phrase
- heuristics: prefer the candidate that appears ...
- ... earlier in the current sentence.
- ... earlier in the previous sentence.
- ... later within other sentences.
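The three preference heuristics can be expressed as a sort key. The sketch assumes each candidate antecedent is a tuple (text, sentence index, token position) and that "accessible" simply means "occurs before the referring phrase". Agreement checks such as gender and number are omitted, and the function name is hypothetical.

```python
def rank_antecedents(candidates, ref_sentence, ref_position):
    """Order accessible antecedents from most to least preferred."""

    def key(candidate):
        _, sentence, position = candidate
        if sentence == ref_sentence:
            return (0, position, 0)       # current sentence: earlier is better
        if sentence == ref_sentence - 1:
            return (1, position, 0)       # previous sentence: earlier is better
        return (2, -sentence, -position)  # other sentences: later is better

    accessible = [c for c in candidates if (c[1], c[2]) < (ref_sentence, ref_position)]
    return sorted(accessible, key=key)


# Hypothetical candidates for a pronoun at sentence 2, token position 4.
CANDIDATES = [
    ("the company", 0, 1),
    ("Mr. Smith", 1, 0),
    ("the board", 1, 5),
    ("John", 2, 0),
    ("the deal", 2, 3),
]

for text, sentence, position in rank_antecedents(CANDIDATES, 2, 4):
    print(text)
```

With these inputs the ranking is John, the deal, Mr. Smith, the board, the company. The first candidate that passes the agreement checks would be chosen as the antecedent.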
19 Visualization and Analytics: Visualizations
- Category Connection Maps
- concise visual representation of connections between different categories (taxonomy nodes), for example between companies and technologies
- the user chooses a number of categories from the taxonomy
- the system finds all connections between terms in the categories
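The data behind such a map can be sketched as co-occurrence counts. The example assumes that a "connection" between two terms means they are tagged in the same document, and that each document is available as a set of tags. The category contents and documents are hypothetical.

```python
from collections import Counter
from itertools import product


def category_connections(documents, category_a, category_b):
    """Count, for each (term_a, term_b) pair, the documents in which both terms occur."""
    counts = Counter()
    for terms in documents:
        found_a = sorted(terms & category_a)
        found_b = sorted(terms & category_b)
        counts.update(product(found_a, found_b))
    return counts


# Hypothetical categories chosen by the user and tagged documents.
COMPANIES = {"IBM", "Intel"}
TECHNOLOGIES = {"grid computing", "wireless"}
DOCS = [
    {"IBM", "grid computing"},
    {"IBM", "grid computing", "wireless"},
    {"Intel", "wireless"},
    {"Intel"},
]

for pair, count in category_connections(DOCS, COMPANIES, TECHNOLOGIES).most_common():
    print(pair, count)
```

Each counted pair would become an edge in the map, with the count available as the edge weight.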
20 Visualization and Analytics: Visualizations
- Relationship Maps
- concise representation of the relationships between many terms in a given context
- taxonomy category determines the nodes of the circle graph
- optional context node determines the type of connection
21 Visualization and Analytics: Visualizations
- Relationship Maps
- Spring graph: two-dimensional graph in which the distance between two elements should reflect the strength of their relationship
22 Visualization and Analytics: Analytics
- Clustering
- identify nodes that are strongly interrelated
- find dense subgraphs in a given graph
- Trend Graphs
- view changes in relationships over time
- → identify trends and patterns
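As a loose stand-in for dense-subgraph detection, the sketch below keeps only edges whose weight reaches a threshold and returns the connected components of what remains. This is a simplification chosen for brevity, not the clustering method of any specific tool. The graph and the function name are hypothetical.

```python
def strong_clusters(edge_weights, threshold):
    """Group nodes that are linked by edges with weight >= threshold."""
    neighbors = {}
    for (a, b), weight in edge_weights.items():
        if weight >= threshold:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    clusters, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical relationship graph: edge -> strength (e.g. co-occurrence count).
EDGES = {("A", "B"): 5, ("B", "C"): 4, ("C", "D"): 1, ("D", "E"): 6}

print(strong_clusters(EDGES, threshold=3))
```

With a threshold of 3 the weak C-D edge is dropped, leaving the clusters {A, B, C} and {D, E}. Running the same computation on edge weights from successive time windows gives the raw material for a trend graph.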
23 Summary
- due to the abundance of available textual data, there is a growing need for efficient text mining tools
- textual data require preprocessing
- information extraction has proven to be an efficient method for this task
- analysis techniques like clustering and trend graphs, in combination with visualization tools, facilitate trend and pattern detection
24 ProMiner
- aim: detection of protein and gene names in scientific articles
- nomenclature is highly variable and ambiguous
- mostly composed entries
- phenotypical descriptions as protein names
- definition of gene aliases as convenient abbreviations of corresponding protein names
- parallel naming of genes and proteins
- ProMiner consists of three parts
- dictionary generation
- occurrence detection
- filtering of matches
25 ProMiner: Implementation
- Dictionary generation: construction and curation
- gene names from HUGO, protein names from SWISSPROT and TREMBL
- definition of token classes for the curation of the dictionary and for the matching procedure
- curation: expansion and pruning phase
- tagging each token in the dictionary with its corresponding class
- after curation
- 38,200 entries with 151,700 synonyms
26 ProMiner: Implementation
- Occurrence detection
- processing one token at a time and keeping a set of candidate solutions for the present position
- two scoring measures
- boundary score: increased on a token mismatch; if it rises above a threshold, the candidate is pruned from the candidate set and checked for reporting
- acceptance score: determines whether the candidate is reported as a match; it is a linear combination of token-class-specific match and mismatch terms; weights for match terms are set to a small value for non-descriptive tokens and a high one for modifier tokens
27 ProMiner: Implementation
- Filtering of matches: match disambiguation
- set of synonyms
- overlapping matches: the match with the higher acceptance score, the larger fraction of matches, or the largest number of matched tokens is accepted
- ambiguous synonym: only those matches are kept for which most additional synonym occurrences can be found
- Parameter optimization
- computation of weights with robust linear programming (RLP)
- training set of positive and negative examples
- computes a separating hyperplane in the vector space of scoring contributions
28 Literature
- Giouli, V., Piperidis, S., Current trends in corpus processing and annotation. http://www.larflast.bas.bg/balric/eng_files/corpora7_1.php (website as of 5 July 2005)
- Hanisch, D., Fluck, J., Mevissen, HT., Zimmer, R., Playing biology's name game: Identifying protein names in scientific text. Pacific Symposium on Biocomputing 2003, pages 403-414.
- Hanisch, D., Fundel, K., Mevissen, HT., Zimmer, R., Fluck, J., ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14.
- Ye, N. (ed.), The Handbook of Data Mining. Lawrence Erlbaum Publishers, 2003, ch. 21.