Introduction to Text Mining - PowerPoint PPT Presentation

Provided by: Soumy
Learn more at: https://www.cs.kent.edu
1
Introduction to Text Mining
  • By
  • Soumyajit Manna
  • 11/10/08

2
Outline
  • Text Mining Definition
  • Text Mining Application
  • Text Characteristics
  • Text Mining Process
  • Future of text mining

3
Text Mining Definition
  • The nontrivial extraction of implicit,
    previously unknown, and potentially useful
    information from (large amounts of) textual data.
  • An exploration and analysis of textual
    (natural-language) data by automatic and
    semi-automatic means to discover new knowledge.
  • What is previously unknown information?
  • Strict definition
  • Information that not even the writer knows.
  • e.g., discovering a new method for hair growth
    that is described as a side effect of a
    different procedure
  • Lenient definition
  • Rediscovering the information that the author
    encoded in the text
  • e.g., automatically extracting a product's name
    from a web page.

4
Definition Cont
  • Then the question arises:
  • Is text mining similar to data mining?
  • or
  • Can we apply data mining techniques to
    text mining?

5
Answer
  • Structured data: the data are clearly described
    over a range of all possibilities, or can be
    laid out in a spreadsheet. Types:
  • 1. Ordered numerical: values where
    greater-than and less-than comparisons have
    meaning.
  • 2. Categorical: values that can
    be measured as true or false.
  • Typical data mining
    applications use structured data.
  • Unstructured data: data that does not meet the
    above criteria (the text mining case).

Gender  BP   Weight  Code
M       175  65      3
F       141  72      1
...     ...  ...     ...
F       160  59      2
6
Answer Contd...
  • Classical data mining techniques can be applied
    to text by transforming the text into numerical
    data and placing it in a spreadsheet.

Company  Income  Job  Overseas
0        1       0    1
1        0       1    1
1        1       1    0
0        0       0    1
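
One way to picture the text-to-spreadsheet step is a binary word-presence matrix. The sketch below is only an illustration: the four toy documents and the helper name `to_binary_row` are assumptions, chosen so that the output reproduces the rows of the table above.

```python
# Vocabulary taken from the column headers of the table above.
VOCABULARY = ["company", "income", "job", "overseas"]


def to_binary_row(text, vocabulary=VOCABULARY):
    """Return one spreadsheet row: 1 if the word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical documents, invented for this example.
documents = [
    "Income rose for workers overseas",
    "The company posted a job overseas",
    "company income and job growth",
    "Moving overseas",
]

matrix = [to_binary_row(doc) for doc in documents]
for row in matrix:
    print(row)
```

Once every document is a fixed-length row of numbers, the same learners used on structured data can be applied to it.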
7
Text Mining Applications
  • Marketing: discover distinct groups of potential
    buyers according to a text-based user profile
  • e.g., Amazon
  • Industry: identify groups of competitors' web
    pages
  • e.g., competing products and their prices
  • Job seeking: identify parameters in searching for
    jobs
  • e.g., www.flipdog.com

8
Text Mining Methods
  • Document Classification (Web Mining)
  • Indexing and retrieval of textual documents and
    extraction of partial knowledge using the web
  • Information Extraction
  • Extraction of partial knowledge in the text
  • Information Retrieval
  • Indexing and retrieval of textual documents
  • Clustering
  • Generating collections of similar text documents

9
Document Classification
  • The purest embodiment of the spreadsheet model
    with labeled answers
  • Documents are organized into folders, one folder
    for each topic.
  • The application is almost always binary
    classification, because a document can appear in
    multiple folders.
  • The problem can be viewed as a form of indexing,
    like the index of a book.

[Figure: a New Document is tested against one binary decision per folder: Household, Finance, School]
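
The "one binary decision per folder" idea can be sketched as follows. This is a minimal illustration, not the presenter's method: the keyword sets stand in for trained classifiers, and `FOLDER_KEYWORDS`, `belongs_to`, and `classify` are hypothetical names.

```python
# Hypothetical keyword lists standing in for trained per-folder classifiers.
FOLDER_KEYWORDS = {
    "Household": {"rent", "groceries", "repair", "utilities"},
    "Finance": {"loan", "tax", "rent", "invoice"},
    "School": {"homework", "tuition", "exam"},
}


def belongs_to(folder, text, min_hits=1):
    """Binary decision: does the document belong in this folder or not?"""
    words = set(text.lower().split())
    return len(words & FOLDER_KEYWORDS[folder]) >= min_hits


def classify(text):
    """Run one independent yes/no test per folder; several may say yes."""
    return [folder for folder in FOLDER_KEYWORDS if belongs_to(folder, text)]


new_document = "The rent and the tax invoice are both due"
print(classify(new_document))
```

Because each folder is tested independently, the same document can land in more than one folder, which is why the task decomposes into binary problems.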
10
Information Retrieval
  • Given
  • A source of textual documents
  • A user query (text based)
  • Find
  • A ranked set of documents that
    are relevant to the query

[Figure: a Document Collection and a Query (e.g., spam / text) feed the IR System, which returns the Matched Documents; the figure also labels a Test Document]
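
Ranked retrieval can be sketched with a simple TF-IDF score. The slides do not name a scoring function, so TF-IDF is an assumption here, as are the toy collection and the function names.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least relevant."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in doc_tokens for word in set(tokens))
    query_words = set(tokenize(query))
    scored = []
    for doc, tokens in zip(documents, doc_tokens):
        tf = Counter(tokens)
        # Sum term frequency times inverse document frequency over query words.
        score = sum(
            tf[word] * math.log(n_docs / df[word])
            for word in query_words
            if word in tf
        )
        scored.append((score, doc))
    # Keep only documents with a positive score, best first.
    return sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)


# Hypothetical collection, invented for this example.
collection = [
    "win cash now claim your free prize",
    "meeting agenda for the text mining seminar",
    "free text mining tutorial",
    "lunch menu for friday",
]

results = rank("free text mining", collection)
for score, doc in results:
    print(round(score, 3), doc)
```

Words that occur in fewer documents receive a higher weight, so a document matching several distinctive query words rises to the top of the ranking.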
11
Intelligent Information Retrieval
  • Meaning of words
  • Synonyms: buy / purchase
  • Ambiguity: bat (baseball vs. mammal)
  • Order of words in the query
  • hot dog stand in the amusement park
  • hot amusement stand in the dog park
  • User dependency for the data
  • direct feedback
  • indirect feedback
  • Authority of the source
  • IBM is more likely to be an authoritative source
    than my distant second cousin
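
The word-order point can be checked directly: under a pure bag-of-words view the two queries above are indistinguishable, while word pairs (bigrams) tell them apart. The helper names in this sketch are illustrative assumptions.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts only; positions are discarded."""
    return Counter(text.lower().split())


def bigrams(text):
    """Set of adjacent word pairs, which preserves local word order."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)
print(same_bag)
print(("hot", "dog") in bigrams(q1), ("hot", "dog") in bigrams(q2))
```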

12
Information Extraction
  • Given
  • A source of textual documents
  • A well-defined, limited query (text based)
  • Find
  • Sentences with relevant information
  • Extract the relevant information and
    ignore non-relevant information (important!)
  • Link related information and output it in a
    predetermined format

13
Information Extraction Model
[Figure: a Document Source feeds the Extraction System, which produces Sorted Data; the results of the individual queries are then combined]
  • Query 1 (e.g., revenue)
  • Query 2 (e.g., profit)

14
Information Extraction Example
  • ...on revenues of twenty five million dollars, the
    company reported a profit of 4.5
    million for the fiscal year

Input Documents
Revenue     Profit
25000000    4500000
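
A minimal sketch of how the example sentence could be turned into the Revenue/Profit record, assuming simple regular-expression patterns and a tiny number-word lexicon. Everything here (`NUMBER_WORDS`, `PATTERNS`, `parse_amount`, `extract`) is a hypothetical illustration rather than the system described in the slides.

```python
import re

# Small hypothetical lexicon; a real system would cover far more number words.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "twenty": 20,
    "thirty": 30, "forty": 40, "fifty": 50,
}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

WORD = "|".join(list(NUMBER_WORDS) + list(SCALES))
# An amount is a digit string or number word, followed by more number/scale words.
AMOUNT = rf"((?:\d+(?:\.\d+)?|{WORD})(?:[ -](?:{WORD}))*)\b"

# One pattern per field of the predetermined output format.
PATTERNS = {
    "Revenue": rf"revenues? of {AMOUNT}",
    "Profit": rf"profit of {AMOUNT}",
}


def parse_amount(phrase):
    """Convert e.g. 'twenty five million' or '4.5 million' to an integer."""
    total = 0.0
    for token in phrase.lower().replace("-", " ").split():
        if token in NUMBER_WORDS:
            total += NUMBER_WORDS[token]
        elif token in SCALES:
            total *= SCALES[token]
        else:
            total += float(token)
    return int(round(total))


def extract(text):
    """Fill the fixed record {Revenue, Profit}; everything else is ignored."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        record[field] = parse_amount(match.group(1)) if match else None
    return record


sentence = (
    "on revenues of twenty five million dollars, the company "
    "reported a profit of 4.5 million for the fiscal year"
)
print(extract(sentence))
```

The patterns pick out only the two requested fields and skip the rest of the sentence, which mirrors the "ignore non-relevant information" requirement.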


15
Clustering
  • Given
  • A source of textual documents
  • Similarity measure
  • e.g., how many words the documents have in
    common
  • Find
  • Several clusters of documents that are relevant
    to each other

16
Clustering Model
[Figure: a Document Organizer assigns incoming documents to Group1, Group2, Group3, Group4, and Group5]
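
As a rough sketch of clustering by shared words, the code below uses Jaccard similarity and a greedy single-pass grouping. Both choices, along with the threshold and the toy documents, are assumptions made for illustration; the slides only ask for some similarity measure.

```python
def jaccard(a, b):
    """Shared words divided by all distinct words in the two documents."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def cluster(documents, threshold=0.25):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []
    for doc in documents:
        for group in clusters:
            # Compare against the first (seed) document of each cluster.
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([doc])
    return clusters


# Hypothetical documents, invented for this example.
docs = [
    "stock market prices fall",
    "football team wins match",
    "stock prices rise in market",
    "team wins football final",
]

groups = cluster(docs)
for group in groups:
    print(group)
```

No labels are supplied; the groups emerge from the similarity measure alone, which is what makes this unsupervised.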
17
Text Characteristics
  • Large textual database
  • High dimensionality
  • Several input modes
  • Dependency
  • Ambiguity
  • Noisy data
  • Not well structured text

18
Text Characteristics Cont..
  • Large textual database
  • Efficiency considerations
  • over 2,000,000,000 web pages
  • almost all publications are also in electronic
    form
  • High dimensionality (sparse input)
  • Consider each word/phrase as a dimension
  • Several input modes
  • e.g., in Web mining, information about the user
    is derived from semantics, browsing patterns, and
    an outside knowledge base.

19
Text Characteristics Cont..
  • Dependency
  • relevant information is a complex conjunction of
    words/phrases
  • e.g., Document categorization.
  • Pronoun disambiguation.
  • Ambiguity
  • Word ambiguity
  • Pronouns (he, she)
  • buy, purchase
  • Semantic ambiguity
  • The king saw the rabbit with his glasses. (8
    meanings)

20
Text Characteristics Cont..
  • Noisy data
  • Example: spelling mistakes
  • Not well structured text
  • Chat rooms
  • r u available ?
  • Hey whazzzzzz up
  • Speech

21
Text Mining Process
22
Text Mining Process Cont..
  • Text preprocessing
  • Syntactic/Semantic text analysis
  • Features Generation
  • Bag of words
  • Features Selection
  • Simple counting
  • Statistics
  • Text/Data Mining
  • Classification: supervised learning
  • Clustering: unsupervised learning
  • Analyzing results

23
Text preprocessing
  • Part-of-speech (POS) tagging
  • Find the corresponding POS for each word
  • e.g., John (noun) gave (verb) the (det) ball
    (noun)
  • 98% accurate.
  • Word sense disambiguation
  • Context based or proximity based
  • Very accurate
  • Parsing
  • Generates a parse tree (graph) for each sentence
  • Each sentence is a standalone graph

24
Features Generation
  • A text document is represented by the words it
    contains (and their occurrences)
  • e.g., "Lord of the rings" → {the, Lord,
    rings, of}
  • Highly efficient
  • Makes learning far simpler and easier
  • Order of words is not that important for certain
    applications
  • Stemming: identifies a word by its root
  • e.g., flying, flew → fly
  • Reduces dimensionality
  • Stop words: the most common words are unlikely to
    help text mining
  • e.g., the, a, an, you
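
The three ideas on this slide (bag of words, stemming, stop words) can be combined in a few lines. This is a toy sketch: the suffix rules and the `IRREGULAR` lookup are assumptions that cover only the slide's examples, and real stemmers are considerably more careful.

```python
from collections import Counter

# Stop words listed on the slide.
STOP_WORDS = {"the", "a", "an", "you"}

# Hypothetical lookup for irregular forms that suffix stripping cannot handle.
IRREGULAR = {"flew": "fly"}


def stem(word):
    """Very naive stemmer: irregular lookup, then strip 'ing' or a final 's'."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "s"):
        # Require at least three remaining letters so "ring" stays intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, remove_stop_words=False, use_stemming=False):
    """Represent a document by its word counts, ignoring word order."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    return Counter(tokens)


plain = bag_of_words("Lord of the rings")
reduced = bag_of_words(
    "the bird flew and you saw birds flying",
    remove_stop_words=True,
    use_stemming=True,
)
print(plain)
print(reduced)
```

In the second call, eight tokens collapse to four distinct features, which is the dimensionality reduction the slide refers to.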

25
Features Generation with XML
  • Current keyword-oriented search engines cannot
    handle rich queries like
  • "Find all books authored by Scooby-Doo."
  • XML: Extensible Markup Language
  • XML documents have a nested structure in which
    each element is associated with a tag.
  • Tags describe the semantics of elements.

<book>
  <title> The making of a bad movie </title>
  <author>
    <name> Scooby-Doo </name>
    <affiliation> Cartoons </affiliation>
  </author>
</book>
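
With tagged elements, the Scooby-Doo query above becomes a structural lookup rather than a keyword match. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<catalog>` wrapper, the second book, and the function name `books_by` are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The first <book> follows the slide; the wrapper and second book are hypothetical.
CATALOG = """
<catalog>
  <book>
    <title> The making of a bad movie </title>
    <author>
      <name> Scooby-Doo </name>
      <affiliation> Cartoons </affiliation>
    </author>
  </book>
  <book>
    <title> A hypothetical second book </title>
    <author>
      <name> Someone Else </name>
      <affiliation> Elsewhere </affiliation>
    </author>
  </book>
</catalog>
"""


def books_by(author_name, xml_text=CATALOG):
    """Return titles of all books whose <author>/<name> equals author_name."""
    root = ET.fromstring(xml_text.strip())
    titles = []
    for book in root.iter("book"):
        # The tag path tells us this text is an author name, not just a word.
        names = [n.text.strip() for n in book.findall("author/name") if n.text]
        if author_name in names:
            titles.append(book.findtext("title", default="").strip())
    return titles


print(books_by("Scooby-Doo"))
```

A keyword search for "Scooby-Doo" would also match books that merely mention the name; the tag path restricts the match to the author field.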

26
Feature Selection
  • Reduce dimensionality
  • Learners have difficulty addressing tasks with
    high dimensionality
  • Irrelevant features
  • Not all features help!
  • e.g., the existence of a noun in a news article
    is unlikely to help classify it as politics or
    sport
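
One simple counting-based way to drop irrelevant features is to compare how often a word appears in each class and keep only words whose document frequency differs. The criterion, the `min_gap` threshold, and the toy news snippets below are assumptions for illustration.

```python
from collections import Counter


def document_frequency(documents):
    """Fraction of documents in which each word occurs at least once."""
    counts = Counter(w for doc in documents for w in set(doc.lower().split()))
    return {w: c / len(documents) for w, c in counts.items()}


def select_features(docs_by_class, min_gap=0.5):
    """Keep words whose document frequency differs between the two classes."""
    docs_a, docs_b = docs_by_class.values()
    df_a, df_b = document_frequency(docs_a), document_frequency(docs_b)
    vocabulary = set(df_a) | set(df_b)
    return sorted(
        w for w in vocabulary
        if abs(df_a.get(w, 0.0) - df_b.get(w, 0.0)) >= min_gap
    )


# Hypothetical labeled snippets, invented for this example.
training = {
    "politics": ["the minister won the vote", "the minister lost the debate"],
    "sport": ["the striker won the match", "the striker lost the match"],
}

selected = select_features(training)
print(selected)
```

Words such as "the", "won", and "lost" occur equally often in both classes, so they carry no signal for this task and are discarded.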

27
Challenges of Text Mining
  • Access to raw text in gated collections (i.e.,
    collections that require payment to permit
    access to resources).
  • Tools that are too difficult for non-programmers
    to use.
  • Questions relating to the validity of text mining
    as a technique for drawing legitimate
    conclusions.

28
Future Of Text Mining
  • Develop focused, easy-to-use tools that bridge
    the gap between computer programmers and
    humanities researchers
  • Different tools and data, but common dimensions
  • Example
  • Find sales trends by product and correlate with
    occurrences of company name in business news
    articles
  • Dimensions Time, Company names (or stock
    symbols), Product names, Regions

29
Thanks
  • Questions?