Title: Introduction to Text Mining
1Introduction to Text Mining
- By
- Soumyajit Manna
- 11/10/08
2Outline
- Text Mining Definition
- Text Mining Application
- Text Characteristics
- Text Mining Process
- Future of text mining
3Text Mining Definition
- The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.
- An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.
- What is previously unknown information?
- Strict definition
- Information that not even the writer knows.
- e.g., discovering a new method for hair growth that is described as a side effect of a different procedure
- Lenient definition
- Rediscovering the information that the author encoded in the text
- e.g., automatically extracting a product's name from a web page.
4Definition Cont
- Then the question arises:
- Is text mining similar to data mining?
- or
- Can we apply data mining techniques to text mining?
5Answer
- Structured data: The data to be used are clearly described over a range of all possibilities, or can be described by a spreadsheet.
- Types:
- 1. Ordered numerical: values where greater-than and less-than comparisons have meaning.
- 2. Categorical: values that can be measured as true or false.
- A typical data mining application uses structured data.
- Unstructured data: data that does not meet the above criteria (text mining).
Gender  BP   Weight  Code
M       175  65      3
F       141  72      1
...     ...  ...     ...
F       160  59      2
6Answer Contd...
- Classical data mining techniques can be applied by transforming text into numerical data and placing it in a spreadsheet, e.g., one row per document and one 0/1 column per word.
Company  Income  Job  Overseas
0        1       0    1
1        0       1    1
1        1       1    0
0        0       0    1
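The transformation into a 0/1 spreadsheet can be sketched in Python. This is a minimal illustration, assuming whitespace tokenization and a vocabulary taken from the column headings of the table above; the two sample documents and the function name `to_binary_row` are hypothetical and not part of the slides.

```python
# Turn each document into a row of 0/1 values, one column per vocabulary word.
# The vocabulary mirrors the column headings of the table above; the sample
# documents are made up for illustration.
VOCABULARY = ["company", "income", "job", "overseas"]


def to_binary_row(text, vocabulary=VOCABULARY):
    """Return 1 for each vocabulary word present in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


documents = [
    "Income rose for workers overseas",
    "The company posted a job overseas",
]
rows = [to_binary_row(doc) for doc in documents]

for row in rows:
    print(row)
```

Once text is in this form, any method that works on a numeric spreadsheet can in principle be applied to it.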
7Text Mining Applications
- Marketing: Discover distinct groups of potential buyers according to a user's text-based profile
- e.g., Amazon
- Industry: Identify groups of competitors' web pages
- e.g., competing products and their prices
- Job seeking: Identify parameters in searching for jobs
- e.g., www.flipdog.com
8Text Mining Methods
- Document Classification (Web Mining)
- Indexing and retrieval of textual documents and extraction of partial knowledge using the web
- Information Extraction
- Extraction of partial knowledge in the text
- Information Retrieval
- Indexing and retrieval of textual documents
- Clustering
- Generating collections of similar text documents
9Document Classification
- Purest embodiment of the spreadsheet model with labeled answers
- Documents are organized into folders, one folder for each topic.
- The application is almost always binary classification because a document can appear in multiple folders.
- The problem can be viewed as a form of indexing, like the index of a book.
[Figure: a new document is checked against each topic folder. Labels: New Document, Household, Finance, School]
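One way to read the binary-classification point is as one yes/no decision per folder. The Python sketch below assumes a very simple scoring rule (word counts inside the folder minus word counts outside it), which is an illustrative stand-in and not a method named in the slides; the training documents, labels, and function names are hypothetical.

```python
from collections import Counter


def train_binary(docs, labels, topic):
    """Count words in documents inside and outside one topic folder."""
    inside, outside = Counter(), Counter()
    for text, doc_topics in zip(docs, labels):
        target = inside if topic in doc_topics else outside
        target.update(text.lower().split())
    return inside, outside


def belongs(text, model):
    """Binary decision: more word evidence for the folder than against it."""
    inside, outside = model
    score = sum(inside[w] - outside[w] for w in text.lower().split())
    return score > 0


# Hypothetical labeled documents; one document sits in two folders.
train_docs = [
    "pay the electricity bill and fix the kitchen sink",
    "stock prices and bank interest rates",
    "homework for the math class",
    "bank loan to pay for kitchen repairs",
]
train_labels = [{"Household"}, {"Finance"}, {"School"}, {"Household", "Finance"}]
topics = ["Household", "Finance", "School"]

# One independent binary classifier per folder.
models = {t: train_binary(train_docs, train_labels, t) for t in topics}


def classify(text):
    """Return every folder whose binary classifier accepts the document."""
    return [t for t in topics if belongs(text, models[t])]


print(classify("kitchen repairs and a bank loan"))
print(classify("math homework"))
```

Because each folder is decided independently, a new document can land in several folders or in none.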
10Information Retrieval
- Given
- A source of textual documents
- A user query (text based)
- Find
- A ranked set of documents that are relevant to the query
[Figure: IR system. Labels: Document Collection, Test Document, Query (e.g., Spam / Text), IR System, Match Documents]
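Ranked retrieval can be sketched in a few lines of Python. This sketch assumes a TF-IDF style score (term count times a log-scaled inverse document frequency), which is one common choice and not something the slides specify; the document collection, the query, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def rank(query, documents):
    """Return (score, document) pairs for matching documents, best first."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n = len(documents)
    scored = []
    for doc, counts in zip(documents, doc_counts):
        score = 0.0
        for term in set(tokenize(query)):
            # Number of documents that contain the query term.
            df = sum(1 for c in doc_counts if term in c)
            if df:
                score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


collection = [
    "win a free prize now",
    "meeting notes for the project",
    "free free prize inside",
]
results = rank("free prize", collection)

for score, doc in results:
    print(round(score, 2), doc)
```

Documents that share no term with the query are left out of the ranked set entirely.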
11Intelligent Information Retrieval
- Meaning of words
- Synonyms: buy / purchase
- Ambiguity: bat (baseball vs. mammal)
- Order of words in the query
- hot dog stand in the amusement park
- hot amusement stand in the dog park
- User dependency for the data
- direct feedback
- indirect feedback
- Authority of the source
- IBM is more likely to be an authoritative source than my distant second cousin
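The word-order point can be checked directly on the two example queries. The Python sketch below compares them first as unordered word counts and then as adjacent word pairs (bigrams); the helper names are hypothetical, and bigrams are only one simple way to capture order.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts with order discarded."""
    return Counter(text.lower().split())


def bigrams(text):
    """Set of adjacent word pairs, which keeps local word order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

# Without word order the two queries are indistinguishable.
same_bag = bag_of_words(q1) == bag_of_words(q2)

# With word order, only the pairs around "stand in the" are shared.
shared = bigrams(q1) & bigrams(q2)

print(same_bag)
print(sorted(shared))
```

A system that ignores order treats the two queries as identical, which is why an intelligent retrieval system needs more than word matching.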
12Information Extraction
- Given
- A source of textual documents
- A well defined limited query (text based)
- Find
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output it in a predetermined format
13Information Extraction Model
[Figure: information extraction model. Labels: Document Source, Extraction System, Query 1 (e.g., revenue), Query 2 (e.g., profit), Combine Query Result, Sorted Data]
14Information Extraction Example.
- Input document: "...on revenues of twenty five million dollars, the company reported a profit of 4.5 million for the fiscal year..."
- Extracted data:
Revenue    Profit
25000000   4500000
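The revenue/profit example can be reproduced with a small pattern-based extractor. This Python sketch assumes fixed phrasings ("revenues of ...", "profit of ...") and a tiny number-word vocabulary without teens or hundreds; the patterns, dictionaries, and function names are hypothetical, and a real extraction system would need far broader coverage.

```python
import re

# Small illustrative vocabularies; teens and "hundred" are deliberately omitted.
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

NUMBER_WORDS = "|".join(list(UNITS) + list(TENS))
SCALE_WORDS = "|".join(SCALES)
# One or more number words or digits, followed by a scale word.
AMOUNT = rf"((?:(?:{NUMBER_WORDS}|\d+(?:\.\d+)?)[ -]?)+(?:{SCALE_WORDS}))"

FIELDS = {
    "Revenue": re.compile(r"revenues? of " + AMOUNT, re.IGNORECASE),
    "Profit": re.compile(r"profit of " + AMOUNT, re.IGNORECASE),
}


def parse_amount(phrase):
    """Convert 'twenty five million' or '4.5 million' to an integer."""
    total = 0.0
    for word in phrase.lower().replace("-", " ").split():
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        elif word in SCALES:
            total *= SCALES[word]
        else:
            total += float(word)
    return round(total)


def extract(text):
    """Fill the predetermined output format, ignoring everything else."""
    record = {}
    for field, pattern in FIELDS.items():
        match = pattern.search(text)
        if match:
            record[field] = parse_amount(match.group(1))
    return record


sentence = ("on revenues of twenty five million dollars, the company "
            "reported a profit of 4.5 million for the fiscal year")
print(extract(sentence))
```

Text that matches no pattern is simply ignored, which mirrors the "ignore non-relevant information" step.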
15Clustering
- Given
- A source of textual documents
- Similarity measure
- e.g., how many words these documents have in common
- Find
- Several clusters of documents that are relevant to each other
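A word-overlap similarity measure and a simple grouping procedure can be sketched in Python. The sketch assumes Jaccard similarity (shared words divided by all distinct words) and a greedy single-pass grouping with a fixed threshold; both are illustrative choices rather than the method of the slides, and the documents, threshold, and function names are hypothetical.

```python
def words(text):
    """Set of lowercase words in a document."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def cluster(documents, threshold=0.25):
    """Greedy single pass: join the first cluster whose first document is
    similar enough, otherwise start a new cluster."""
    clusters = []
    for doc in documents:
        for group in clusters:
            if similarity(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


docs = [
    "the team won the football match",
    "interest rates and bank loans",
    "football team loses the match",
    "bank raises interest rates",
]
groups = cluster(docs)

for group in groups:
    print(group)
```

No labels are used at any point, which is what makes clustering an unsupervised task.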
16Clustering Model
[Figure: a document organizer sorts documents into Group1, Group2, Group3, Group4, and Group5]
17Text Characteristics
- Large textual data base
- High dimensionality
- Several input modes
- Dependency
- Ambiguity
- Noisy data
- Not well structured text
18Text Characteristics Cont..
- Large textual data base
- Efficiency consideration
- over 2,000,000,000 web pages
- almost all publications are also in electronic form
- High dimensionality (sparse input)
- Consider each word/phrase as a dimension
- Several input modes
- e.g., Web mining: information about the user is generated by semantics, browsing patterns, and outside knowledge bases.
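Treating each word as a dimension produces vectors that are mostly zeros. A minimal Python sketch, assuming a dictionary-based sparse representation in which absent words are implicit zeros; the sample documents and function names are hypothetical.

```python
from collections import Counter


def sparse_vector(text):
    """Map each word that occurs to its count; absent words are implicit zeros."""
    return Counter(text.lower().split())


def dot(u, v):
    """Dot product that only visits the non-zero entries of the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v[word] for word, count in u.items())


documents = [
    "text mining finds patterns in text",
    "data mining uses structured data",
    "chat rooms contain noisy text",
]
vectors = [sparse_vector(d) for d in documents]

# Every distinct word in the collection is one dimension.
dimensions = len(set().union(*vectors))

print(dimensions, [len(v) for v in vectors])
print(dot(vectors[0], vectors[2]))
```

Even in this tiny collection no document uses more than half of the dimensions, and the gap widens quickly as the vocabulary grows.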
19Text Characteristics Cont..
- Dependency
- relevant information is a complex conjunction of words/phrases
- e.g., Document categorization.
- Pronoun disambiguation.
- Ambiguity
- Word ambiguity
- Pronouns (he, she)
- buy, purchase
- Semantic ambiguity
- The king saw the rabbit with his glasses. (8 meanings)
20Text Characteristics Cont..
- Noisy data
- Example: spelling mistakes
- Not well structured text
- Chat rooms
- r u available ?
- Hey whazzzzzz up
- Speech
21Text Mining Process
22Text Mining Process Cont..
- Text preprocessing
- Syntactic/Semantic text analysis
- Features Generation
- Bag of words
- Features Selection
- Simple counting
- Statistics
- Text/Data Mining
- Classification: supervised learning
- Clustering: unsupervised learning
- Analyzing results
23Text preprocessing
- Part-of-speech (POS) tagging
- Find the corresponding POS for each word
- e.g., John (noun) gave (verb) the (det) ball (noun)
- 98% accurate.
- Word sense disambiguation
- Context based or proximity based
- Very accurate
- Parsing
- Generates a parse tree (graph) for each sentence
- Each sentence is a stand-alone graph
24Features Generation
- A text document is represented by the words it contains (and their occurrences)
- e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications
- Stemming: identifies a word by its root
- e.g., flying, flew -> fly
- Reduces dimensionality
- Stop words: the most common words are unlikely to help text mining
- e.g., the, a, an, you
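The three ideas on this slide (bag of words, stemming, stop words) can be combined in a short Python sketch. It assumes a naive suffix-stripping stemmer plus a one-entry lookup for the irregular form "flew"; real stemmers are considerably more involved, and the names `STOP_WORDS`, `IRREGULAR`, `stem`, and `features` are hypothetical.

```python
from collections import Counter

# Stop words taken from the slide's example list.
STOP_WORDS = {"the", "a", "an", "you"}

# Suffix stripping cannot map "flew" to "fly", so irregular forms need a
# lookup table; this one-entry table is purely illustrative.
IRREGULAR = {"flew": "fly"}
SUFFIXES = ("ing", "s")


def stem(word):
    """Very rough stemmer: irregular lookup, then strip a known suffix."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def features(text, remove_stop_words=True, use_stemming=True):
    """Bag of words: word counts with word order discarded."""
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if use_stemming:
        words = [stem(w) for w in words]
    return Counter(words)


print(features("Lord of the rings"))
print(features("flying flew fly"))
```

Stemming and stop-word removal both shrink the number of distinct features, which is the dimensionality reduction the slide refers to.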
25Features Generation with XML
- Current keyword-oriented search engines cannot handle rich queries like
- Find all books authored by "Scooby-Doo".
- XML: Extensible Markup Language
- XML documents have a nested structure in which each element is associated with a tag.
- Tags describe the semantics of elements.
<book>
<title> The making of a bad movie </title>
<author>
<name> Scooby-Doo </name>
<affiliation> Cartoons </affiliation>
</author>
</book>
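Because the tags carry the semantics, the "find all books authored by" query becomes a short traversal. A minimal Python sketch using the standard-library `xml.etree.ElementTree` module; the enclosing `<library>` element, the second book, and the function name `titles_by_author` are hypothetical additions so the query has something to filter out.

```python
import xml.etree.ElementTree as ET

# The first <book> follows the slide; the <library> wrapper and the second
# book are made up for illustration.
XML = """
<library>
  <book>
    <title>The making of a bad movie</title>
    <author>
      <name>Scooby-Doo</name>
      <affiliation>Cartoons</affiliation>
    </author>
  </book>
  <book>
    <title>A second title</title>
    <author>
      <name>Someone Else</name>
      <affiliation>Unknown</affiliation>
    </author>
  </book>
</library>
"""


def titles_by_author(xml_text, author_name):
    """Return the titles of all books whose author/name matches exactly."""
    root = ET.fromstring(xml_text.strip())
    return [
        book.findtext("title").strip()
        for book in root.iter("book")
        if any(
            name.text.strip() == author_name
            for name in book.iterfind("author/name")
        )
    ]


print(titles_by_author(XML, "Scooby-Doo"))
```

A keyword search for "Scooby-Doo" would also match books that merely mention the name, whereas the tag structure restricts the match to the author field.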
26Feature Selection
- Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Irrelevant features
- Not all features help!
- e.g., the existence of a noun in a news article is unlikely to help classify it as politics or sport
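Feature selection by simple counting can be sketched in Python. The sketch assumes a document-frequency filter: a word is kept only if it occurs in at least `min_docs` documents but not in nearly all of them. The thresholds, sample documents, and the function name `select_features` are hypothetical, and statistical selection methods would score features differently.

```python
from collections import Counter


def select_features(documents, min_docs=2, max_fraction=0.9):
    """Keep words that are neither too rare nor present in almost every document."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document.
        doc_freq.update(set(text.lower().split()))
    limit = max_fraction * len(documents)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= limit)


docs = [
    "the election results the vote",
    "the team won the match",
    "the vote count for the election",
    "the match ended in a draw",
]
selected = select_features(docs)

print(selected)
```

Words that occur once carry too little evidence to learn from, and a word that occurs everywhere, like "the" here, cannot separate one class from another.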
27Challenges of Text Mining
- Access to raw text in gated collections (i.e., collections that require payment to permit access to resources).
- Tools that are too difficult for non-programmers to use.
- Questions relating to the validity of text mining as a technique for drawing legitimate conclusions.
28Future Of Text Mining
- Develop focused, easy-to-use tools that bridge the gap between computer programmers and humanities researchers
- Different tools and data, but common dimensions
- Example
- Find sales trends by product and correlate with occurrences of company name in business news articles
- Dimensions: time, company names (or stock symbols), product names, regions
29Thanks