Title: Text Mining
1. Text Mining
2. Text Mining Definition
- Many definitions in the literature
- The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data
- An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge
- What is previously unknown information?
- Strict definition
- Information that not even the writer knows
- Lenient definition
- Rediscovering the information that the author encoded in the text
3. Text Mining Views from T2K and ThemeWeaver
4. Text Mining: ThemeScape and ThemeRiver
- Visualizing relationships between documents
- Images from Pacific Northwest Laboratory
5. Text Characteristics (1)
- Large textual databases
- Enormous wealth of textual information on the Web
- Publications are electronic
- High dimensionality
- Consider each word/phrase as a dimension
- Noisy data
- Spelling mistakes
- Abbreviations
- Acronyms
- Text data is very dynamic
- Web pages are constantly being generated (and removed)
- Web pages are generated from database queries
- Text is often not well structured
- Email/chat rooms
- "r u available ?"
- "Hey whazzzzzz up"
- Speech
6. Text Characteristics (2)
- Dependency
- Relevant information is a complex conjunction of words/phrases
- Order of words matters in a query
- hot dog stand in the amusement park
- hot amusement stand in the dog park
- Ambiguity
- Word ambiguity
- Pronouns (he, she)
- Synonyms (buy, purchase)
- Words with multiple meanings (bat: related to baseball or a mammal)
- Semantic ambiguity
- The king saw the rabbit with his glasses. (multiple meanings)
- Authority of the source
- IBM is more likely to be an authoritative source than my distant second cousin
7. Text Mining Process
- Text Preprocessing
- Syntactic/semantic text analysis
- Part-of-Speech (POS) tagging
- Feature Generation
- Bag of words
- Feature Selection
- Simple counting
- Statistics
- Selection based on POS
- Text/Data Mining
- Classification (supervised learning)
- Clustering (unsupervised learning)
- Information extraction
- Analyzing Results
8. Text Preprocessing: Syntactic/Semantic Text Analysis
- Part-of-Speech (POS) tagging
- Find the corresponding POS for each word
- e.g., John (noun) gave (verb) the (det) ball (noun)
- Word sense disambiguation
- Context-based or proximity-based
- Very accurate
- Parsing
- Generates a parse tree (graph) for each sentence
- Each sentence is a stand-alone graph
9. Feature Generation: Bag of Words
- A text document is represented by the words it contains (and their occurrences)
- e.g., "Lord of the rings" → {the, Lord, rings, of}
- Highly efficient
- Makes learning far simpler and easier
- Order of words is not that important for certain applications
- Stemming
- Reduces dimensionality
- Identifies a word by its root
- e.g., flying, flew → fly
- Stop words
- Identify the most common words that are unlikely to help with text mining
- e.g., the, a, an, you
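The pipeline above (tokenize, drop stop words, stem, count) can be sketched in a few lines. The stop-word list and suffix-stripping rules below are toy placeholders, not those of a real stemmer such as Porter's:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you", "of"}  # toy list for illustration

def crude_stem(word):
    # Toy suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def bag_of_words(text):
    # Lowercase, drop stop words, stem, then count occurrences.
    tokens = [t.lower() for t in text.split()]
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

print(bag_of_words("the Lord of the rings"))  # Counter({'lord': 1, 'ring': 1})
```

Note how order is discarded: only the multiset of (stemmed) words survives.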
10. Feature Selection
- Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Irrelevant features
- Not all features help!
- Remove features that occur in only a few documents
- Remove features that occur in too many documents
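Both pruning rules can be implemented from document frequencies alone. This sketch (the function name and thresholds are illustrative, not from T2K) keeps a term only if it appears in at least `min_df` documents and in no more than a fraction `max_df_ratio` of them:

```python
def filter_by_doc_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms appearing in >= min_df documents and in
    at most max_df_ratio of all documents."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [["mono", "list"], ["mono", "spam"], ["jini", "list"], ["jini"]]
print(filter_by_doc_frequency(docs, min_df=2, max_df_ratio=0.75))
```

Here "spam" is dropped for being too rare (one document); a term appearing in every document would be dropped for being too common.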
11. Text Mining: General Application Areas
- Information Retrieval
- Indexing and retrieval of textual documents
- Finding a set of (ranked) documents that are relevant to the query
- Information Extraction
- Extraction of partial knowledge from the text
- Web Mining
- Indexing and retrieval of textual documents and extraction of partial knowledge using the Web
- Classification
- Predict a class for each text document
- Clustering
- Generating collections of similar text documents
12. Text Mining Applications
- Email: spam filtering
- News feeds: discover what is interesting
- Medical: identify relationships and link information from different medical fields
- Homeland security
- Marketing: discover distinct groups of potential buyers and make suggestions for other products
- Industry: identifying groups of competitors' web pages
- Job seeking: identify parameters in searching for jobs
13. Text Mining: Supervised vs. Unsupervised Learning
- Supervised learning (classification)
- Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- Split into training data and test data for the model-building process
- New data are classified based on the model built with the training data
- Unsupervised learning (clustering)
- Class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
14. Text Mining: Classification Definition
- Given: a collection of labeled records
- Each record contains a set of features (attributes) and the true class (label)
- Create a training set to build the model
- Create a testing set to test the model
- Find: a model for the class as a function of the values of the features
- Goal: assign a class (as accurately as possible) to previously unseen records
- Evaluation: what is good classification?
- Correct classification
- The known label of a test example is identical to the class predicted by the model
- Accuracy ratio
- Percentage of test-set examples that are correctly classified by the model
- A distance measure between classes can be used
- e.g., classifying a football document as a basketball document is not as bad as classifying it as crime
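Both evaluation measures mentioned above are a few lines of code; the class labels below are invented for illustration:

```python
from collections import Counter

def accuracy(true_labels, predicted):
    """Fraction of test examples whose predicted class matches the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

def confusion_matrix(true_labels, predicted):
    """Counts of (true class, predicted class) pairs."""
    return Counter(zip(true_labels, predicted))

true = ["football", "football", "basketball", "crime"]
pred = ["football", "basketball", "basketball", "crime"]
print(accuracy(true, pred))                                      # 0.75
print(confusion_matrix(true, pred)[("football", "basketball")])  # 1
```

The confusion matrix shows which classes get mixed up, which is where a class-distance measure (football vs. basketball vs. crime) would come into play.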
15. Text Mining: Clustering Definition
- Given: a set of documents and a similarity measure among documents
- Find: clusters such that
- Documents in one cluster are more similar to one another
- Documents in separate clusters are less similar to one another
- Goal
- Finding a correct set of documents
- Similarity measures
- Euclidean distance if attributes are continuous
- Other problem-specific measures
- e.g., how many words these documents have in common
- Evaluation: what is good clustering?
- Produce high-quality clusters with
- high intra-class similarity
- low inter-class similarity
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
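The "how many words in common" measure can be made precise as, for example, Jaccard similarity over word sets (one common choice; the slides do not prescribe a specific formula):

```python
def jaccard_similarity(doc_a, doc_b):
    """Shared distinct words divided by total distinct words (0.0 to 1.0)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard_similarity("hot dog stand", "hot dog park"))  # 0.5
print(jaccard_similarity("hot dog stand", "quarterly earnings report"))  # 0.0
```

Documents in the same cluster should score high under the chosen measure; documents in different clusters should score low.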
16. Classification Techniques
- Bayesian classification
- Decision trees
- Neural networks
- Instance-Based Methods
- Support Vector Machines
17. What is Information Extraction?
Advisory Programmer - Oracle (Austin, TX)
Response Code 1008-0074-97-iexc-jcn
Responsibilities This is an exciting opportunity
with Siemens Wireless Terminals a start-up
venture fully capitalized by a Global Leader in
Advanced Technologies. Qualified candidates will
Responsible for assisting with requirements
definition, analysis, design and implementation
that meet objectives, codes difficult and
sophisticated routines . Develops project plans,
schedules and cost data. Develop test plans and
implement physical design of databases. Develop
shell scripts for administrative and background
tasks, stored procedures and triggers. Using
Oracles Designer 2000, assist with Data Model
maintenance and assist with applications
development using Oracle Forms. Qualifications
BSCS, BSMIS or closely related field or related
equivalent knowledge normally obtained through
technical education programs. 5-8 years of
professional experience in development, system
design analysis, programming, installation using
Oracle development
- Given
- Source of textual documents
- Well-defined, limited query (text based)
- Find
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output in a predetermined format
Example from Dan Roth's web page
18. T2K and Using GATE
- Load the GATE_IE_Viz itinerary to see information extraction in action
19. Information Extraction from Streaming Text
- Information extraction
- The process of using advanced automated machine-learning approaches
- to identify entities in text documents
- and to extract this information along with the relationships these entities may have in the text documents
- This project demonstrates information extraction of names, places, and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.
20. Bayesian Classification
- Idea: assign to example X the class label C such that P(C|X) is maximal
- Computes the distribution of an input associated with each class; for example, given the variable X with value xi, the probability of it being in class A may be greater than the probability of it being in class B
- Mathematically speaking: if the class-conditional densities P(X|C) and the priors P(xi) and P(cj) are known, then the classifier assigns class cj to datum xi if cj has the highest posterior probability given the data: P(cj|xi) = P(xi|cj) P(cj) / P(xi)
21. Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct
- Prior knowledge: can be combined with observed data
- Standard
- Provides a standard of optimal decision making against which other methods can be measured
- In a simpler form, provides a baseline against which other methods can be measured
22. Naïve Bayesian Classification
- Naïve assumption
- Feature independence
- P(xi|C) is estimated as the relative frequency of examples having value xi as a feature in class C
- Computationally easy!
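Under the independence assumption the classifier reduces to counting. Below is a minimal multinomial Naïve Bayes sketch with Laplace (add-one) smoothing, working in log space to avoid underflow; the class names and documents are invented for illustration, and this is not T2K's implementation:

```python
import math
from collections import Counter

class NaiveBayesText:
    """Minimal multinomial Naive Bayes with Laplace smoothing (a sketch)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        # Prior P(c): fraction of training documents with label c.
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        # Per-class word counts for estimating P(w | c).
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        def log_posterior(c):
            total = sum(self.word_counts[c].values())
            # log P(c) + sum of log P(w | c), with add-one smoothing.
            score = math.log(self.priors[c])
            for w in doc:
                score += math.log((self.word_counts[c][w] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.classes, key=log_posterior)

nb = NaiveBayesText().fit(
    [["buy", "pills", "now"], ["meeting", "agenda", "today"]],
    ["spam", "work"])
print(nb.predict(["buy", "now"]))  # spam
```

Training and prediction are just counting and a few multiplications per word, which is what makes the method so cheap.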
23. Classification by Decision Tree
- Decision tree
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or a class distribution
- Decision tree generation consists of two phases
- Tree construction
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of a decision tree
- Classifying an unknown example
- Test the attributes of the example against the decision tree
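As a sketch of the "test attributes against the tree" step, a tree can be stored as nested dicts; the attributes and class labels below are invented for illustration:

```python
# A decision tree stored as nested dicts: an internal node tests an
# attribute, branches map test outcomes to subtrees, leaves are labels.
tree = {"attribute": "contains_free",
        "branches": {True: "spam",
                     False: {"attribute": "contains_meeting",
                             "branches": {True: "work", False: "other"}}}}

def classify(node, example):
    # Follow the branch matching the example's attribute value
    # until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(classify(tree, {"contains_free": False, "contains_meeting": True}))  # work
```

Tree construction (choosing which attribute to test at each node) and pruning are the expensive phases; classification itself is a single root-to-leaf walk.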
24. Text Mining in D2K
- Email classification
- Naïve Bayesian
25. Email Classification
- Input
- Multiple mailboxes, where each mailbox represents a class
- Output
- Results of the model on the testing set
- A model that classifies future email
26. Mailbox Files
- MONO.mbx
- Mono Developer Discussion List
- mono-list_at_lists.ximian.com
- 216 messages
- SPAM.mbx
- Spam Mailbox
- 100 messages
- JINI.mbx
- JINI-Users mail list
- JINI-USERS_at_JAVA.SUN.COM
- 104 messages
27. Opening the Itinerary
- Click on the Itinerary pane in the Resource Panel
- Expand the T2K directory with a single click
- Double-click on EmailClassification-T2K
28. D2K: A Few Features
- Properties indicate that a module has settings that can be changed before execution
- Indicated by a P in the lower-left corner of a module
- e.g., filename, maximum iterations, etc.
- Resource Manager
- Loads data that is accessible by all modules
29. EmailClassification Itinerary
- Uses D2K's Resource Manager to store data that will serve as a dictionary
- Contextual rule file
- Lexical rule file
- Stop words
- Lexicon
30. EmailClassification Itinerary (2)
- Load the mailbox data
- Input File Name
- Specify a directory by changing the module's properties
- ReadFileNames
- Sends each filename in this directory as output
- MBX Email Parser
- Parses the mailbox files
- Email → Document
- Converts the email document to the standard document object
- Flags control whether to include sender/receiver info
31. EmailClassification Itinerary (3)
- Pre-process text data
- Tokenizer
- Forms word tokens for each word or symbol
- Brill Pre-Tagger
- Assigns a part-of-speech tag to each token
- Can be used without the following two modules
- Brill Lexical Tagger
- Adjusts tags based on lexical rules
- Lexical must precede Contextual
- Brill Contextual Tagger
- Adjusts tags based on contextual rules
32. EmailClassification Itinerary (4)
- More text pre-processing
- Filter Stop Words
- Removes stop words
- Stemmer
- Transforms words into their word stem
- Removes plurals, etc.
- Select Tokens By Speech Tag
- Removes tokens that do not match the speech tags of interest
- Document → TermList
- Counts the frequency of terms in the document
- Adjusts counts for title weighting
33. EmailClassification Itinerary (5)
- Creation of a sparse table for learning
- Add Series of Ints
- Counts all values it receives
- Outputs the sum
- TermLists → SparseTable
- Creates a sparse table to be used for mining
- Term counts across documents are sparse
- Conserves memory and usage
- Feature Filter
- Eliminates terms that occur in only one document
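Why sparse storage matters: with thousands of terms in the vocabulary, each document contains only a handful, so storing just the nonzero counts avoids a mostly-zero dense matrix. A minimal sketch (one dict per document; this mirrors the idea, not T2K's actual data structure):

```python
from collections import Counter

def term_lists_to_sparse(term_lists):
    """Map each document's term list to a {column_index: count} dict,
    storing only nonzero entries."""
    vocab = sorted({t for tl in term_lists for t in tl})
    col = {t: i for i, t in enumerate(vocab)}  # term -> column index
    rows = [{col[t]: c for t, c in Counter(tl).items()} for tl in term_lists]
    return rows, vocab

rows, vocab = term_lists_to_sparse([["jini", "java", "jini"], ["spam"]])
print(vocab)  # ['java', 'jini', 'spam']
print(rows)   # [{1: 2, 0: 1}, {2: 1}]
```

A dense representation of the same data would store a full row of vocabulary-length zeros per document.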
34. EmailClassification Itinerary (6)
- Selecting input and output attributes for classification
- Choose Attributes
- Select the input attributes
- Select the output attribute: classification_DOCPROP
35. EmailClassification Itinerary (7)
- Setting testing and training sets
- Simple Train Test
- Property window to set train and test percentages
36. EmailClassification Itinerary (8)
- Model building and testing
- Naïve Bayes Text Model
- Builds a Naïve Bayesian classification model
- Model Predict
- Applies the model to the testing data
37. EmailClassification Itinerary (9)
- Results of the model on the testing data
- Prediction Table Report
- Shows the classification error
- Shows the confusion matrix
38. EmailClassification Itinerary (10)
- Table Viewer
- Shows the original data
- Shows the predicted column
39. Execute the Itinerary
- Check the properties of the Input File Name modules
- Click the Run button
40. Results
- Results of the model on the testing data
- Prediction Table Report
- Shows the classification error
- Shows the confusion matrix
- The original table had 6,718 attributes
- After filtering, the table had 2,830 attributes
41. Other Scenarios
- Change the weight for words in titles
- Change whether or not sender/receiver info is included
- Take the filter module out
- What happens to the accuracy of the model?
- What happens to performance?
42. Scenario: Verbs Only
- The original table had 1,020 attributes
- After filtering, the table had 636 attributes
43. T2K in Review
- Review T2K modules
- Review T2K itineraries
44. The ALG Team
- Staff
- Bernie Acs
- Loretta Auvil
- David Clutter
- Vered Goren
- Eugene Grois
- Luigi Marini
- Robert McGrath
- Chris Navarro
- Greg Pape
- Barry Sanders
- Andrew Shirk
- David Tcheng
- Michael Welge
- Students
- Chen Chen
- Hong Cheng
- Yaniv Eytani
- Fang Guo
- Govind Kabra
- Chao Liu
- Haitao Mo
- Xuanhui Wang
- Qian Yang
- Feida Zhu