Title: Emerging Technologies of Text Mining
1 Emerging Technologies of Text Mining
Masrah Azrifah Azmi Murad
Department of Information Systems
2 Outline
- Introduction
- Text vs. Data Mining
- Motivation and challenges of TM
- Text mining processes
- Techniques in text mining
- Some application areas
- TM Commercial Tools
3 Definition
- The non-trivial extraction of implicit, previously unknown, and
  potentially useful information from (large amounts of) textual data
- Goal: discover unknown information from any sources or documents
  (unstructured data) to form new facts
- Cousin to data mining
- E.g., e-mails, memos, reports, contracts, phone calls, and documents
  on the WWW
Source: Automated Learning Group, NCSA
4 Search vs Discover
                         | Search (goal-oriented) | Discover (opportunistic)
Structured Data          | Data Retrieval         | Data Mining
Unstructured Data (Text) | Information Retrieval  | Text Mining
Source: AvaQuest Inc, 2002
5 Data Retrieval
- Find records within a structured database.
Source: Swanson and Smalheiser, 1994
6 Information Retrieval
- Find relevant information in an unstructured information source
  (usually text)
Source: Swanson and Smalheiser, 1994
7 Data Mining
- Discover new knowledge through analysis of data
Source: Swanson and Smalheiser, 1994
8 Text Mining
- Discover new knowledge through analysis of text
Source: Swanson and Smalheiser, 1994
9 Motivation for Text Mining
- Approximately 90% of the world's data is held in unstructured formats
  (source: Oracle Corporation)
- Information-intensive business processes demand that we move from
  simple document retrieval to knowledge discovery.
Structured Numerical or Coded Information: 10%
Unstructured or Semi-structured Information: 90%
Source: Swanson and Smalheiser, 1994
10 Challenges of Text Mining (1)
- Large textual database
  - Web is growing
  - Publications are electronic
- High dimensionality
  - Consider each word/phrase as a dimension
- Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"
Source: Automated Learning Group, NCSA
11 Challenges of Text Mining (2)
- Ambiguity
  - Word ambiguity
    - Pronouns (he, she, ...)
    - Synonyms (automobile, car, vehicle, Proton)
    - Words with multiple meanings (bat: baseball or mammal)
  - Semantic ambiguity
    - "The king saw the rabbit with his glasses" (multiple meanings)
- Noisy data
  - Spelling mistakes
  - Abbreviations
  - Acronyms
Source: Automated Learning Group, NCSA
12 Challenges of Text Mining (3)
- Not well-structured text
  - Email/Chat rooms
    - "r u available ?"
    - "Hey whazzzzzz up"
  - Speech
- Order of words in the query
  - "hot dog stand in the amusement park"
  - "hot amusement stand in the dog park"
Source: Automated Learning Group, NCSA
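The word-order problem can be illustrated with a minimal sketch. It assumes a naive whitespace tokenizer, and the helper names (bag_of_words, bigrams) are illustrative rather than taken from any tool named in these slides. The two queries from the slide produce identical word counts, while their adjacent word pairs differ.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())

def bigrams(text):
    """Set of adjacent word pairs, which keeps a little local word order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)   # True: same seven words
different_pairs = bigrams(q1) != bigrams(q2)      # True: the pairings differ
print(same_bag, different_pairs)
```

A representation that ignores order cannot tell these queries apart, which is one reason phrase recognition is sometimes added on top of single-word features.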
13 More Issues in Natural Language
- Ambiguity, e.g., "Squad helps dog bite victim"
- Anaphora (substitute for a preceding word), e.g., "After Mary
  proposed to John, they found a preacher and got married. For the
  honeymoon, they went to Hawaii."
- Indexicality (points to some state of affairs), e.g., "I am over
  here. Why did you do that?"
- Noncompositionality, e.g., "baby shoes"
- Discourse structure: speech or writing that runs longer than a
  single sentence
- Metonymy (using one noun phrase to stand for another), e.g.,
  "Chrysler announced record profits"
- Metaphor (a way to describe something), e.g., "Brian was a wall"
  (bouncing every tennis ball back over the net)
14 Text Mining Process
- Text Preprocessing
  - Syntactic/Semantic Text Analysis
- Feature Generation
  - Bag of Words
- Feature Selection
  - Simple Counting
  - Statistics
- Text/Data Mining
  - Classification - Supervised Learning
  - Clustering - Unsupervised Learning
- Analyzing Results
15 Text Preprocessing: Syntactic / Semantic Text Analysis
- Part of Speech (PoS) Tagging
  - Find the corresponding PoS for each word
  - e.g., John (noun) gave (verb) the (det) ball (noun)
- Word Sense Disambiguation
  - Context based or proximity based
- Parsing
  - [plastic bottle] holder / plastic [bottle holder]
- Phrase Recognition/Collocations
  - Kuala Lumpur, interest rate
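One simple way to find collocation candidates such as "Kuala Lumpur" or "interest rate" is to count adjacent word pairs and keep those that repeat. The sketch below assumes whitespace tokenization, a tiny hand-made stop list, and invented sample sentences; real systems usually add a statistical association test on top of raw counts.

```python
from collections import Counter

STOP_WORDS = {"the", "in", "of"}  # illustrative stop list, not exhaustive

def bigram_counts(sentences):
    """Count every pair of adjacent words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def collocation_candidates(sentences, min_count=2):
    """Pairs seen at least min_count times, excluding pairs with a stop word."""
    counts = bigram_counts(sentences)
    return sorted(
        pair for pair, n in counts.items()
        if n >= min_count and not (set(pair) & STOP_WORDS)
    )

# Invented sample sentences for illustration only.
sentences = [
    "the interest rate rose in kuala lumpur",
    "kuala lumpur banks cut the interest rate",
    "the rate of growth in kuala lumpur slowed",
]
candidates = collocation_candidates(sentences)
print(candidates)  # [('interest', 'rate'), ('kuala', 'lumpur')]
```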
16 Feature Generation: Bag of Words
- Text document is represented by the words it contains (and their
  occurrences)
  - e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
  - Highly efficient
  - Makes learning far simpler and easier
  - Order of words is not that important for certain applications
- Stemming
  - Reduce dimensionality
  - Identifies a word by its root
  - e.g., flying, flew -> fly
- Stop words
  - Identifies the most common words that are unlikely to help with
    text mining
  - e.g., the, a, an, you
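The three ideas on this slide (bag of words, stemming, stop words) can be combined in a short sketch. The stop list is the one from the slide. The stemmer is a deliberately crude assumption: it strips the suffixes "ing" and "s" and uses a small hypothetical lookup table for irregular forms such as "flew". It is not a standard stemming algorithm.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you"}   # examples listed on the slide
IRREGULAR = {"flew": "fly"}              # hypothetical lookup for irregular forms

def stem(word):
    """Very crude stemmer: irregular lookup, then strip 'ing' or 's'."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Count stemmed words, skipping stop words."""
    tokens = text.lower().split()
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)

title_bag = bag_of_words("Lord of the rings")
bird_bag = bag_of_words("The bird flew and the birds are flying")
print(title_bag)
print(bird_bag)
```

After stemming, "flew" and "flying" share the feature "fly" and "bird" and "birds" share "bird", so the second sentence uses four dimensions instead of six.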
17 Feature Selection
- Reduce Dimensionality
  - Learners have difficulty addressing tasks with high dimensionality
- Irrelevant Features
  - Not all features help!
  - Remove features that occur in only a few documents
  - Reduce features that occur in too many documents
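Filtering by document frequency is one simple-counting approach to the two rules above. In this sketch the thresholds (min_df, max_df_ratio) and the sample documents are assumptions chosen for illustration. Words in too many documents are dropped outright here, although the slide only says to reduce them.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep words that appear in at least min_df documents and in at most
    max_df_ratio of all documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))   # count each word once per document
    limit = max_df_ratio * len(docs)
    return sorted(w for w, n in df.items() if min_df <= n <= limit)

# Invented sample documents for illustration only.
docs = [
    "the match ended in a draw",
    "the striker scored in the match",
    "the court heard the fraud case",
    "the fraud trial ended",
]
features = select_features(docs)
print(features)  # ['ended', 'fraud', 'in', 'match']
```

Here "the" is dropped because it occurs in every document, and words such as "draw" or "trial" are dropped because they occur in only one.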
18 Supervised vs. Unsupervised Learning
- What Is Good Classification?
  - Correct classification
    - Known label of test example is identical to the predicted class
      from the model
  - Accuracy ratio
    - Percent of test set examples that are correctly classified by
      the model
  - Distance measure between classes can be used
    - e.g., classifying a football document as a basketball document
      is not as bad as classifying it as crime
- What Is Good Clustering?
  - Produce high quality clusters with
    - high intra-class similarity
    - low inter-class similarity
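The accuracy ratio and the class-distance idea can be compared on a toy test set. The distance values in CLASS_DISTANCE are hypothetical numbers chosen only to encode that football and basketball are close while crime is far from both; the labels and predictions are invented.

```python
def accuracy(true_labels, predicted):
    """Fraction of test examples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

# Hypothetical distances between classes (0 = identical, 1 = unrelated).
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}

def mean_error_cost(true_labels, predicted):
    """Average distance between true and predicted class (0 for correct ones)."""
    total = 0.0
    for t, p in zip(true_labels, predicted):
        if t != p:
            total += CLASS_DISTANCE[frozenset([t, p])]
    return total / len(true_labels)

truth   = ["football", "football", "crime", "basketball"]
model_a = ["basketball", "football", "crime", "basketball"]  # near miss
model_b = ["crime", "football", "crime", "basketball"]       # far miss

print(accuracy(truth, model_a), accuracy(truth, model_b))
print(mean_error_cost(truth, model_a), mean_error_cost(truth, model_b))
```

Both models make one mistake, so their accuracy is the same, but the distance-weighted cost is lower for the model that confuses football with basketball.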
19 Classification Techniques
- Neural networks
- Decision trees
- Bayesian classification
- Nearest-neighbor
- Support vector machines (SVM)
- Hidden Markov models (HMM)
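As one concrete instance of Bayesian classification, the following is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing over bag-of-words counts. The training examples and the "spam"/"ham" labels are invented, and the function names train and predict are illustrative. The slides do not prescribe this particular variant.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns the counts used for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)   # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:   # words never seen in training are skipped
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
model = train(training)
print(predict(model, "free prize offer"))        # spam
print(predict(model, "monday project meeting"))  # ham
```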
20 Clustering Techniques
- k-Means
- Fuzzy c-Means
- Hierarchical clustering
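A minimal k-Means sketch follows. It assumes documents have already been turned into numeric vectors (toy 2-D points here), uses squared Euclidean distance, and initializes the centroids naively with the first k points so the run is deterministic. Practical implementations usually use better initialization and a convergence check.

```python
def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]   # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy vectors standing in for document feature vectors.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```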
21 Techniques in Text Mining (1)
- Information extraction - find particular pieces of information in
  text documents and determine the relationships that hold between
  them
- Thematic indexing - identify the dominant theme of a particular
  group of documents by automatically extracting the key features of
  the group. For example, documents about oranges and apples could be
  classified under fruit or plantation
- Categorization - automatically organizes documents into
  user-defined categories or taxonomies
- Clustering - groups together conceptually related documents,
  enabling identification of duplicates and near-duplicates
Source: Sullivan, 2000
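Near-duplicate identification can be sketched with word shingles and Jaccard similarity. This is one common approach, not necessarily the one the cited source has in mind. The shingle size and the sample documents are assumptions for illustration.

```python
def shingles(text, size=2):
    """Set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Invented sample documents for illustration only.
doc1 = "text mining discovers new knowledge from text"
doc2 = "text mining discovers new knowledge from documents"
doc3 = "the bank received many angry calls"

sim_12 = jaccard(shingles(doc1), shingles(doc2))   # 5 shared of 7 shingles
sim_13 = jaccard(shingles(doc1), shingles(doc3))   # no shared shingles
print(round(sim_12, 3), sim_13)
```

Pairs whose similarity exceeds a chosen threshold would be flagged as near-duplicates. The threshold is application specific.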
22 Techniques in Text Mining (2)
- Summarization - get the gist of a document or document collection.
  E.g., newspaper and TV news headlines, movie previews, weather
  forecast bulletins, and minutes of a meeting
- Foreign Language Text Mining - enables an organization to make
  effective use of foreign language data, even in the absence of staff
  with foreign language skills
- Topic Modeling - looks for patterns of words that tend to occur
  together in documents, then automatically categorizes those words
  into topics
23 Application Areas (1)
- Information Retrieval
  - Indexing and retrieval of textual documents
  - Finding a set of (ranked) documents that are relevant to the query
- Bioinformatics
  - Mining scientific journals for critical information associated
    with genes and proteins, e.g., genes and their associated
    functions, diseases, and tissues
- Email
  - Spam filtering
Source: Automated Learning Group, NCSA
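Ranked retrieval can be sketched with TF-IDF scoring: a document scores higher when it contains query terms often, and rare terms count for more than common ones. The weighting formula (raw term frequency times log(N / document frequency)) is one textbook variant picked as an assumption, and the documents are invented.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return indices of matching documents, best match first."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter()
    for words in tokenized:
        df.update(set(words))           # document frequency of each term
    scores = []
    for index, words in enumerate(tokenized):
        tf = Counter(words)
        score = sum(
            tf[term] * math.log(n / df[term])
            for term in query.lower().split()
            if term in df
        )
        scores.append((score, index))
    # Highest score first; ties keep the original document order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for score, index in scores if score > 0]

# Invented sample documents for illustration only.
docs = [
    "gene expression in liver tissue",
    "spam filtering for email",
    "gene and protein function in disease",
    "email spam and phishing",
]
ranking = rank("gene protein", docs)
print(ranking)  # [2, 0]
```

Document 2 ranks first because it matches both query terms, and "protein" is rarer in the collection than "gene".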
24 Application Areas (2)
- News Feeds
  - Discover what is interesting; provide a summary
- Homeland Security and Intelligence (US)
  - Analysis of terrorist networks
  - Rapid identification of critical information about such topics as
    weapons of mass destruction from very large collections of text
    documents
  - Surveillance of the Web, e-mails, or chat rooms
- Marketing
  - Discover distinct groups of potential buyers and make suggestions
    for other products
Source: Automated Learning Group, NCSA
25 What is Information Extraction?
Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Responsibilities: This is an exciting opportunity
with Siemens Wireless Terminals, a start-up
venture fully capitalized by a Global Leader in
Advanced Technologies. Qualified candidates will
Responsible for assisting with requirements
definition, analysis, design and implementation
that meet objectives, codes difficult and
sophisticated routines. Develops project plans,
schedules and cost data. Develop test plans and
implement physical design of databases. Develop
shell scripts for administrative and background
tasks, stored procedures and triggers. Using
Oracle's Designer 2000, assist with Data Model
maintenance and assist with applications
development using Oracle Forms.
Qualifications: BSCS, BSMIS or closely related
field or related equivalent knowledge normally
obtained through technical education programs.
5-8 years of professional experience in
development, system design analysis, programming,
installation using Oracle development
- Given
  - Source of textual documents
  - Well-defined limited query (text based)
- Find
  - Sentences with relevant information
  - Extract the relevant information and ignore non-relevant
    information (important!)
  - Link related information and output in a predetermined format
Source: Automated Learning Group, NCSA
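For a posting like the one on this slide, a rule-based extractor can pull a few fields into a predetermined record format. The sketch below uses regular expressions on short excerpts of the posting. The field names, the patterns, and the decision to treat everything before the parenthesis as the title are assumptions made for illustration. The response-code pattern accepts the label with or without a colon.

```python
import re

PATTERNS = {
    # "<title> (<city>, <two-letter state>)" at the start of a line
    "title_location": re.compile(r"^(.+?) \(([^,()]+), ([A-Z]{2})\)", re.MULTILINE),
    "response_code": re.compile(r"Response Code:?\s*(\S+)"),
    "experience": re.compile(r"(\d+)-(\d+) years"),
}

def extract(text):
    """Fill a fixed-format record; fields stay None when nothing matches."""
    record = {"title": None, "city": None, "state": None,
              "response_code": None, "min_years": None, "max_years": None}
    m = PATTERNS["title_location"].search(text)
    if m:
        record["title"], record["city"], record["state"] = m.groups()
    m = PATTERNS["response_code"].search(text)
    if m:
        record["response_code"] = m.group(1)
    m = PATTERNS["experience"].search(text)
    if m:
        record["min_years"], record["max_years"] = int(m.group(1)), int(m.group(2))
    return record

# Excerpts of the posting shown on the slide.
posting = "\n".join([
    "Advisory Programmer - Oracle (Austin, TX)",
    "Response Code: 1008-0074-97-iexc-jcn",
    "Qualifications: BSCS, BSMIS or closely related field",
    "5-8 years of professional experience in development",
])
record = extract(posting)
print(record)
```

Everything that does not match a pattern is ignored, which mirrors the "ignore non-relevant information" step above.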
26 Extra-semantic Information
- Extracting hidden meaning or sentiment based on use of language.
  - Examples
    - "Customer is unhappy with their service!"
    - Sentiment: discontent
- Sentiment is
  - Emotions: fear, love, hate, sorrow
  - Feelings: warmth, excitement
  - Mood, disposition, temperament, ...
- Or even (someday)
  - Lies, sarcasm
Source: Swanson and Smalheiser, 1994
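A very simple way to attach a sentiment label to text is a lexicon lookup. The lexicon below is hypothetical and tiny. It only shows how the slide's example sentence could be tagged as "discontent", and it does not handle negation, sarcasm, or context.

```python
import re

# Hypothetical cue-word lexicon mapping words to sentiment labels.
SENTIMENT_LEXICON = {
    "unhappy": "discontent",
    "hates": "discontent",
    "angry": "anger",
    "love": "satisfaction",
    "pleased": "satisfaction",
}

def tag_sentiments(text):
    """Return the sorted set of sentiment labels whose cue words occur in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sorted({SENTIMENT_LEXICON[w] for w in words if w in SENTIMENT_LEXICON})

tags = tag_sentiments("Customer is unhappy with their service!")
print(tags)  # ['discontent']
```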
27 Web Mining - Content
- Enormous wealth of textual information on the Web
  - Book/CD/Video stores (e.g., Amazon)
  - Restaurant information (e.g., Zagats)
  - Car prices (e.g., Carpoint)
- Hyper-link information
- Web is very dynamic
  - Web pages are constantly being generated (and removed)
  - Web pages are generated from database queries
- Technologies used
  - NLP
  - IR
Source: Automated Learning Group, NCSA
28 Medical Research
- Find causal links between symptoms or diseases and drugs or
  chemicals
- E.g.,
  - Objective: follow chains of causal implication to discover a
    relationship between migraines and biochemical levels
  - Data: medical research papers, medical news
  - Key concept types: symptoms, drugs, diseases, chemicals
Source: Swanson and Smalheiser, 1994
29 Example Application
- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability
Source: Swanson and Smalheiser, 1994
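The chain of implication in this example can be sketched as a search for intermediate concepts: if A is linked to B in one set of findings and B is linked to C in another, then A and C are candidates for a relationship that is not stated directly anywhere. The pairs below encode the eight statements on the slide as undirected links. The function name and the simplification to undirected, unlabeled links are assumptions.

```python
from collections import defaultdict

# Each pair is one statement from the slide, reduced to the two concepts it links.
findings = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blocker", "migraine"),
    ("magnesium", "calcium channel blocker"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
]

def linking_concepts(pairs, a, c):
    """Concepts connected to both a and c when a and c are not linked directly."""
    neighbours = defaultdict(set)
    for x, y in pairs:
        neighbours[x].add(y)
        neighbours[y].add(x)
    if c in neighbours[a]:
        return []   # already directly linked; nothing new to suggest
    return sorted(neighbours[a] & neighbours[c])

links = linking_concepts(findings, "magnesium", "migraine")
print(links)
```

No single statement links magnesium to migraine, yet four intermediate concepts connect them, which is what makes the pair a candidate hypothesis.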
30 Business Applications
- Ex 1: Decision Support in CRM
  - What are customers' typical complaints?
  - What is the trend in the number of satisfied customers in
    Cleveland? NY?
- Ex 2: Knowledge Management
  - People Finder
- Ex 3: Personalization in eCommerce
  - Suggest products that fit a user's interest profile (even based
    on personality info).
Source: Swanson and Smalheiser, 1994
31 Example 1: Decision Support using Bank Call Center Data
- The Needs
  - Analysis of call records as input into the decision-making
    process of the bank's management
  - Quick answers to important questions
    - Which offices receive the most angry calls?
    - What products have the fewest satisfied customers?
    - ("Angry" and "Satisfied" are recognizable sentiments)
  - User friendly interface and visualization tools
Source: Swanson and Smalheiser, 1994
32 Example 1: Decision Support using Bank Call Center Data
- The Information Source
  - Call center records
  - Example:
AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK,
NY, H-SUPRVR8, STMT, mr stark has been with the
company for about 20 yrs. He hates his stmt
format and wishes that we would show a daily
balance to help him know when he falls below
the required balance on the account.
Source: Swanson and Smalheiser, 1994
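A record like the one above mixes coded fields with free text. The sketch below assumes that the first ten comma-separated values are coded fields and that everything after them is the agent's note. The slides do not document the record layout, so the field positions are guesses, and the cue-word list for "angry" is hypothetical.

```python
RECORD = (
    "AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, "
    "mr stark has been with the company for about 20 yrs. He hates his stmt "
    "format and wishes that we would show a daily balance to help him know "
    "when he falls below the required balance on the account."
)

ANGRY_CUES = {"hates", "angry", "furious"}   # hypothetical cue words

def parse_record(line, n_coded_fields=10):
    """Split off the coded fields; the remainder is the free-text note."""
    parts = line.split(", ", n_coded_fields)   # at most n_coded_fields splits
    return parts[:n_coded_fields], parts[n_coded_fields]

def is_angry(note):
    """True if any cue word appears in the note."""
    return any(word in ANGRY_CUES for word in note.lower().split())

fields, note = parse_record(RECORD)
print(fields[6], fields[7], fields[9])   # NEW YORK NY STMT
print(is_angry(note))                    # True
```

Grouping such flags by the office or product fields would give the kind of "angry calls per office" summary described on the previous slide.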
33 Example 1: Call Volume by Sentiment
Source: Swanson and Smalheiser, 1994
34 Example 2: KM People Finder
- The Needs
  - Find people as well as documents that can address my information
    need
  - Promote collaboration and knowledge sharing
  - Leverage existing information access system
- The Information Sources
  - Email, groupware, online reports, ...
Source: Swanson and Smalheiser, 1994
35 Example 2: Simple KM People Finder
Query -> Search or Navigation System -> Relevant Docs
Relevant Docs + Authority List -> Name Extractor -> Ranked People Names
Source: Swanson and Smalheiser, 1994
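The name-extraction step of a simple people finder can be sketched as follows, assuming the search system has already returned the relevant documents and that the authority list is a plain list of known names. The names, documents, and the ranking rule (number of relevant documents mentioning each name) are all hypothetical.

```python
from collections import Counter

# Hypothetical authority list of known people.
AUTHORITY_LIST = ["Mary Tan", "John Lee", "Aisha Rahman"]

def rank_people(relevant_docs, authority_list):
    """Rank names by how many relevant documents mention them."""
    counts = Counter()
    for doc in relevant_docs:
        for name in authority_list:
            if name in doc:
                counts[name] += 1
    return [name for name, _ in counts.most_common()]

# Invented documents standing in for the output of the search system.
relevant_docs = [
    "Report on text mining pilot, prepared by Mary Tan and John Lee",
    "Email from Mary Tan: clustering results attached",
    "Minutes: Aisha Rahman presented the retrieval system",
]
ranked = rank_people(relevant_docs, AUTHORITY_LIST)
print(ranked[0])  # Mary Tan
```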
36 Example 2: KM People Finder
Source: Swanson and Smalheiser, 1994
37 Example 3: Personalized Movie Matcher
- The Need
  - Match movies to individuals based on preference profile
- The Information
  - Written reviews of movies
  - Users' lists of favorite movies.
Movie Reviews -> Sentiment Analysis -> Typed and Tagged Reviews
Source: Swanson and Smalheiser, 1994
38 Sentiment Analysis of Movies: Visualization
[Figure labels: Action, Romance; absurdity, conflict, insecurity, crime,
injustice, inferiority, death, deception, immorality, horror,
destruction, fear; scale 0 to 1]
Source: Swanson and Smalheiser, 1994
39 Commercial Tools
- IBM Intelligent Miner for Text
- Semio Map
- InXight LinguistX / ThingFinder
- LexiQuest
- ClearForest
- Teragram
- SRA NetOwl Extractor
- Autonomy
Source: Swanson and Smalheiser, 1994