DATA, TEXT,

About This Presentation

Slides: 59

Provided by: Jud8187

Transcript and Presenter's Notes

Title: DATA, TEXT,

1
Chapter 7

DATA, TEXT,
AND WEB MINING

2
Learning Objectives

Define data mining and list its objectives and
benefits
Understand different purposes and applications of
data mining
Understand different methods of data mining,
especially clustering and decision tree models
Build expertise in use of some data mining
software

3
Learning Objectives

Learn the process of data mining projects
Understand data mining pitfalls and myths
Define text mining and its objectives and
benefits
Appreciate use of text mining in business
applications
Define Web mining and its objectives and benefits

4
Data Mining Concepts and Applications

Six factors behind the sudden rise in popularity
of data mining
General recognition of the untapped value in
large databases
Consolidation of database records tending toward
a single customer view
Consolidation of databases, including the
concept of an information warehouse
Reduction in the cost of data storage and
processing, providing for the ability to collect
and accumulate data
Intense competition for a customer's attention in
an increasingly saturated marketplace
The movement toward the de-massification of
business practices

5
Data Mining Concepts and Applications

Data mining (DM)
A process that uses statistical, mathematical,
artificial intelligence and machine-learning
techniques to extract and identify useful
information and subsequent knowledge from large
databases

6
Data Mining Concepts and Applications

Major characteristics and objectives of data
mining
Data are often buried deep within very large
databases, which sometimes contain data from
several years; sometimes the data are cleansed
and consolidated in a data warehouse
The data mining environment is usually
client/server architecture or a Web-based
architecture

7
Data Mining Concepts and Applications

Major characteristics and objectives of data
mining
Sophisticated new tools help to remove the
information "ore" buried in corporate files or
archival public records; finding it involves
massaging and synchronizing the data to get the
right results.
The miner is often an end user, empowered by data
drills and other power query tools to ask ad hoc
questions and obtain answers quickly, with little
or no programming skill

8
Data Mining Concepts and Applications

Major characteristics and objectives of data
mining
Striking it rich often involves finding an
unexpected result and requires end users to think
creatively
Data mining tools are readily combined with
spreadsheets and other software development
tools; the mined data can be analyzed and
processed quickly and easily
Parallel processing is sometimes used because of
the large amounts of data and massive search
efforts

9
Data Mining Concepts and Applications

How data mining works
Data mining tools find patterns in data and may
even infer rules from them
Three methods are used to identify patterns in
data
Simple models
Intermediate models
Complex models

10
Data Mining Concepts and Applications

Classification
Supervised induction used to analyze the
historical data stored in a database and to
automatically generate a model that can predict
future behavior
Common tools used for classification are
Neural networks
Decision trees
If-then-else rules

11
Data Mining Concepts and Applications

Clustering
Cluster analysis is an exploratory data
analysis tool that aims at sorting different
objects into groups in such a way that the degree
of association between two objects is maximal if
they belong to the same group and minimal
otherwise.
Cluster analysis simply discovers structures in
data without explaining why they exist.
The term cluster analysis (first used by Tryon,
1939) encompasses a number of different
algorithms and methods for grouping objects of
similar kind into respective categories.
Example: classification of people and animals
Methods: Joining (Tree Clustering), Two-way
Joining (Block Clustering), and k-Means Clustering

12
Data Mining Concepts and Applications

k-Means Clustering: the k-means method will
produce exactly k different clusters of greatest
possible distinction.
Algorithms
Given a set of observations $(x_1, x_2, \ldots, x_n)$,
where each observation is a d-dimensional real
vector, k-means clustering aims to partition the
n observations into k sets ($k \le n$),
$S = \{S_1, S_2, \ldots, S_k\}$, so as to minimize the
within-cluster sum of squares (WCSS):
$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
where $\mu_i$ is the mean of points in $S_i$.
See paper.

13
Data Mining Concepts and Applications

1) k initial "means" (in this case k = 3) are
randomly generated within the data domain (shown
in color).
2) k clusters are created by associating every
observation with the nearest mean. The partitions
here represent the Voronoi diagram generated by
the means.
3) The centroid of each of the k clusters becomes
the new mean.
4) Steps 2 and 3 are repeated until convergence
has been reached.
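
The four steps above can be sketched in plain Python. This is a minimal illustration, not the presenter's code: the function names, the sample points, and the choice to draw the initial means from the observations themselves (rather than from anywhere in the data domain) are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: k initial means (assumption: sampled from the observations)
    means = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, means[i]))
            clusters[nearest].append(p)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean)
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        # Step 4: repeat until the means stop changing
        if new_means == means:
            break
        means = new_means
    return means, clusters

def wcss(means, clusters):
    """Within-cluster sum of squares for a given partition."""
    return sum(squared_distance(p, m) for m, c in zip(means, clusters) for p in c)

# Hypothetical two-dimensional observations, invented for illustration
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, clusters = k_means(points, k=2)
print(means)
print(wcss(means, clusters))
```

Because the result depends on the initial means, k-means is usually run several times with different seeds and the partition with the lowest WCSS is kept.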

14
Data Mining Concepts and Applications

EM clustering on an artificial dataset ("mouse").
The tendency of k-means to produce equi-sized
clusters leads to bad results, while EM benefits
from the Gaussian distribution present in the
data set

15
Data Mining Concepts and Applications

EM (Expectation Maximization) Clustering: used to
detect clusters in observations (or variables) and
to assign those observations to the clusters.
A typical example application: a number of
consumer-behavior-related variables are measured
for a large sample of respondents.

16
Data Mining Concepts and Applications

Association
A category of data mining algorithm that
establishes relationships about items that occur
together in a given record
These powerful exploratory techniques have a wide
range of applications in many areas of business
practice and also research - from the analysis of
consumer preferences or human resource
management, to the history of language.
These techniques enable analysts and researchers
to uncover hidden patterns in large data sets,
such as "customers who order product A often also
order product B or C" or "employees who said
positive things about initiative X also
frequently complain about issue Y but are happy
with issue Z."
For example, if (Car = Porsche and Gender = Male and
Age < 20) then (Risk = High and Insurance = High).
Book store recommendation.
The implementation of the so-called a-priori
algorithm (see Agrawal and Swami, 1993; Agrawal
and Srikant, 1994; Han and Lakshmanan, 2001; see
also Witten and Frank, 2000) allows us to process
rapidly huge data sets for such associations,
based on predefined "threshold" values for
detection.
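
The role of the "threshold" can be illustrated with a short Python sketch. It is not a full a-priori implementation: it only counts single items and pairs, and it applies the pruning idea that a pair is considered only if both of its items are frequent on their own. The baskets, the function names, and the 0.5 support threshold are hypothetical values chosen for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration only
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: count} for frequent 1- and 2-itemsets."""
    min_count = min_support * len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    frequent_items = {i for i, c in item_counts.items() if c >= min_count}
    # Pruning step: pairs are built only from items that are frequent alone
    pair_counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(t & frequent_items), 2)
    )
    counts = {(i,): item_counts[i] for i in frequent_items}
    counts.update({p: c for p, c in pair_counts.items() if c >= min_count})
    return counts

def confidence(counts, antecedent, consequent):
    """Share of baskets containing the antecedent that also hold the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return counts[pair] / counts[(antecedent,)]

counts = frequent_itemsets(transactions, min_support=0.5)
print(counts)
print(confidence(counts, "A", "B"))  # rule "A -> B"
```

With these baskets only the pair (A, B) passes the threshold, so "customers who order product A often also order product B" is the one rule that survives.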

17
Data Mining Concepts and Applications

Association
Sequence Analysis. Sequence analysis is
concerned with a subsequent purchase of a product
or products given a previous buy. For instance,
buying an extended warranty is more likely to
follow (in that specific sequential order) the
purchase of a TV or other electric appliances.
Sequence rules, however, are not always that
obvious, and sequence analysis helps you to
extract such rules no matter how hidden they may
be in your market basket data.
Link Analysis. In retailing or
marketing, knowledge of purchase "patterns" can
help with the direct marketing of special offers
to the "right" or "ready" customers (i.e., those
who, according to the rules, are most likely to
purchase specific items given their observed past
consumption patterns). "Link analysis" is often
used when these techniques - for extracting
sequential or non-sequential association rules -
are applied to organize complex "evidence." It is
easy to see how the "transactions" or "shopping
basket" metaphor can be applied to situations
where individuals engage in certain actions, open
accounts, contact other specific individuals, and
so on.
Unique data analysis requirements.
Crosstabulation tables, and in particular
Multiple Response tables

18
Data Mining Concepts and Applications

Visualization can be used in conjunction with
data mining to gain a clearer understanding of
many underlying relationships

19
Data Mining Concepts and Applications
20
Data Mining Concepts and Applications

a-priori algorithm
See paper.

21
Data Mining Concepts and Applications

Regression is a well-known statistical technique
that is used to map data to a prediction value
Forecasting estimates future values based on
patterns within large sets of data

22
Data Mining Concepts and Applications

Hypothesis-driven data mining
Begins with a proposition by the user, who then
seeks to validate the truthfulness of the
proposition
Discovery-driven data mining
Finds patterns, associations, and relationships
among the data in order to uncover facts that
were previously unknown or not even contemplated
by an organization

23
Data Mining Concepts and Applications
Data mining applications

Marketing
Banking
Retailing and sales
Manufacturing and production
Brokerage and securities trading
Insurance

Computer hardware and software
Government and defense
Airlines
Health care
Broadcasting
Police
Homeland security

24
Data Mining Techniques and Tools

Data mining tools and techniques can be
classified based on the structure of the data and
the algorithms used
Statistical methods
Decision trees
Defined as a root followed by internal nodes.
Each node (including root) is labeled with a
question and arcs associated with each node cover
all possible responses

25
Data Mining Techniques and Tools

Data mining tools and techniques can be
classified based on the structure of the data and
the algorithms used
Case-based reasoning
Neural computing
Intelligent agents
Genetic algorithms
Other tools
Rule induction
Data visualization

26
Data Mining Techniques and Tools

A general algorithm for building a decision tree
Create a root node and select a splitting
attribute.
Add a branch to the root node for each split
candidate value and label
Take the following iterative steps
Classify data by applying the split value.
If a stopping point is reached, then create leaf
node and label it. Otherwise, build another
subtree
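
The general algorithm above can be sketched as a short recursive function. This is one possible reading of the steps, under stated assumptions: attributes are categorical, each branch corresponds to one attribute value (a multiway split), the splitting attribute is the one with the lowest weighted Gini index, and ties go to the first attribute listed. The rows reuse the loan sample from this chapter with income reduced to a hypothetical two-valued `income_band` column; all names are illustrative.

```python
from collections import Counter

def gini(rows, target):
    """Gini index of the target labels in a list of row dicts."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def build_tree(rows, attributes, target):
    labels = Counter(r[target] for r in rows)
    # Stopping point: pure node or no attribute left -> leaf with majority label
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]

    def split_score(attr):
        # Weighted Gini index of the groups produced by splitting on attr
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r)
        return sum(len(g) / len(rows) * gini(g, target) for g in groups.values())

    # Select the splitting attribute, then add one branch per value
    best = min(attributes, key=split_score)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Loan sample with income banded at 27.5 (the band column is an assumption)
rows = [
    {"income_band": "<=27.5", "credit": "High", "risk": "High"},
    {"income_band": "<=27.5", "credit": "Low", "risk": "High"},
    {"income_band": ">27.5", "credit": "Low", "risk": "High"},
    {"income_band": ">27.5", "credit": "High", "risk": "Low"},
    {"income_band": ">27.5", "credit": "Moderate", "risk": "Low"},
    {"income_band": "<=27.5", "credit": "High", "risk": "High"},
]
tree = build_tree(rows, ["income_band", "credit"], "risk")
print(tree)
```

The nested dictionary returned here is only a convenient representation; a leaf is a class label and an internal node maps an attribute name to its branches.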

27
Data Mining Techniques and Tools

Gini index
Used in economics to measure the diversity of
the population. The same concept can be used to
determine the purity of a specific class as a
result of a decision to branch along a particular
attribute/variable
Formula:
$\mathrm{Gini}(S) = 1 - \sum_{j=1}^{n} p_j^2$
where S is a data set that contains examples from
n classes and $p_j$ is the relative frequency of
class j in S.
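
As a quick check of the formula, the Gini index of a list of class labels can be computed as follows. The function name and the sample labels are illustrative, not taken from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini(["High", "High", "High"]))   # a pure set
print(gini(["High", "Low"]))            # an even two-class mix
```

A pure set gives 0, and an even mix of two classes gives 0.5, the largest value possible with two classes.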

28
Data Mining Techniques and Tools

Example

Sample patterns for training a decision tree to predict loan risk

Pattern | Income | Credit Rating | Loan Risk
0 | 23 | High | High
1 | 17 | Low | High
2 | 43 | Low | High
3 | 68 | High | Low
4 | 32 | Moderate | Low
5 | 20 | High | High

There are only two classes, High and Low. If the
data set S has p High and n Low elements, then the
Gini computation is as follows:
29
Data Mining Techniques and Tools

$P_{High} = p/(p+n)$
$P_{Low} = n/(n+p)$
$\mathrm{Gini}(S) = 1 - P_{High}^2 - P_{Low}^2$
If data set S is split into $S_1$ and $S_2$, the
splitting index is defined as follows:
$\mathrm{Gini}_{SPLIT}(S) = \frac{p_1+n_1}{p+n}\,\mathrm{Gini}(S_1) + \frac{p_2+n_2}{p+n}\,\mathrm{Gini}(S_2)$
where $p_1, n_1$ ($p_2, n_2$) denote the $p_1$ High
elements and $n_1$ Low elements in the data set $S_1$ ($S_2$).
In this definition, the best split point is the
one with the lowest value of the GiniSPLIT index.
For our example, reorder the data according to
the income

Income | Pattern | Loan Risk
17 | 1 | High
20 | 5 | High
23 | 0 | High
32 | 4 | Low
43 | 2 | High
68 | 3 | Low
30
Data Mining Techniques and Tools

Possible values of a split point for the Income
attribute are Income ≤ 17, Income ≤ 20, Income ≤ 23,
Income ≤ 32, Income ≤ 43, and Income ≤ 68.
Now we can compute the Gini index for each of
these levels of splits.
Consider the choice of dividing the data at
Income ≤ 17. We have the following choices of
classifications:

Pattern Count | High | Low
Income ≤ 17 | 1 | 0
Income > 17 | 3 | 2

So the Gini index for Income ≤ 17 and Income > 17
will be
$G(\mathrm{Income} \le 17) = 1 - \left((\text{proportion of records with High risk})^2 + (\text{proportion of records with Low risk})^2\right) = 1 - (1^2 + 0^2) = 0$.
Similarly,
$G(\mathrm{Income} > 17) = 1 - \left((3/5)^2 + (2/5)^2\right) = 12/25$.
31
Data Mining Techniques and Tools

The Gini index for the split choice is computed as
follows:
$G_{SPLIT} = (\text{proportion of records at Income} \le 17) \cdot G(\mathrm{Income} \le 17) + (\text{proportion of records at Income} > 17) \cdot G(\mathrm{Income} > 17)$
That is,
$G_{SPLIT} = (1/6)(0) + (5/6)(12/25) = 2/5$.
Now consider the choice Income ≤ 20.

Pattern Count | High | Low
Income ≤ 20 | 2 | 0
Income > 20 | 2 | 2

So the Gini index for Income ≤ 20 and Income > 20
will be
$G(\mathrm{Income} \le 20) = 1 - (1^2 + 0^2) = 0$.
$G(\mathrm{Income} > 20) = 1 - \left((2/4)^2 + (2/4)^2\right) = 1/2$.
$G_{SPLIT} = (2/6)(0) + (4/6)(1/2) = 1/3$.
32
Data Mining Techniques and Tools

For the split choice at Income ≤ 23:

Pattern Count | High | Low
Income ≤ 23 | 3 | 0
Income > 23 | 1 | 2

$G(\mathrm{Income} \le 23) = 1 - (1^2 + 0^2) = 0$.
$G(\mathrm{Income} > 23) = 1 - \left((1/3)^2 + (2/3)^2\right) = 4/9$.
$G_{SPLIT} = (3/6)(0) + (3/6)(4/9) = 2/9$.
For the split choice at Income ≤ 32:

Pattern Count | High | Low
Income ≤ 32 | 3 | 1
Income > 32 | 1 | 1

$G(\mathrm{Income} \le 32) = 1 - \left((3/4)^2 + (1/4)^2\right) = 3/8$.
$G(\mathrm{Income} > 32) = 1 - \left((1/2)^2 + (1/2)^2\right) = 1/2$.
$G_{SPLIT} = (4/6)(3/8) + (2/6)(1/2) = 5/12$.
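
The split computations for the Income attribute can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are illustrative, the records are the six (income, loan risk) pairs from the sample table, and exact fractions are used so that the results can be compared with the hand calculation.

```python
from fractions import Fraction

# (income, loan risk) pairs from the sample table
records = [(23, "High"), (17, "High"), (43, "High"),
           (68, "Low"), (32, "Low"), (20, "High")]

def gini(labels):
    """Gini index of a list of labels, as an exact fraction."""
    total = len(labels)
    if total == 0:
        return Fraction(0)
    return 1 - sum(Fraction(labels.count(c), total) ** 2 for c in set(labels))

def gini_split(records, threshold):
    """Weighted Gini index of the split Income <= threshold / Income > threshold."""
    left = [risk for income, risk in records if income <= threshold]
    right = [risk for income, risk in records if income > threshold]
    n = len(records)
    return Fraction(len(left), n) * gini(left) + Fraction(len(right), n) * gini(right)

candidates = sorted(income for income, _ in records)
for t in candidates:
    print(f"Income <= {t}: GSPLIT = {gini_split(records, t)}")

best = min(candidates, key=lambda t: gini_split(records, t))
print("best split candidate:", best)
```

The smallest value is obtained at Income ≤ 23, which is the candidate the next slide uses to place the split point.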
33
Data Mining Techniques and Tools

The lowest value of $G_{SPLIT}$ is for Income ≤ 23. So
we take the two nearest values and average them.
Thus, we have a split point at
Income = (23 + 32)/2 = 27.5.
Attribute lists are divided at the split point.
That is, we expect to have a rule that says
If Income ≤ 27.5
Then
Else if Income > 27.5
Then
The following is the attribute list for
Income ≤ 27.5

Income | Pattern | Loan Risk | Credit Rating
17 | 1 | High | Low
20 | 5 | High | High
23 | 0 | High | High

So the conclusion is: if Income ≤ 27.5, the
loan risk is High.
34
Data Mining Techniques and Tools

But what about Income > 27.5?
The following table suggests that Income > 27.5 is
not a definitive indicator of Loan Risk.

Income | Pattern | Loan Risk | Credit Rating
32 | 4 | Low | Moderate
43 | 2 | High | Low
68 | 3 | Low | High

So we can examine credit rating to
develop the subtree for the Income > 27.5
case. However, credit rating is a categorical
variable. The rules for a categorical variable are
slightly different from those for a continuous
variable. The Gini index formula will be
$\mathrm{Gini}(\text{two proportions}) = 1 - p^2_{\text{one proportion}} - p^2_{\text{the other proportion}}$
35
Data Mining Techniques and Tools

In the case of a categorical variable, one
proportion is the set of records with Credit
Rating = Low, and the other proportion is the set
of records with Credit Rating ≠ Low, that is,
{Moderate, High}. Thus we have to compute the
proportion of each category and its complement.

Pattern Count | Loan Risk High | Loan Risk Low
Credit Rating = Low | 1 | 0
Credit Rating = Moderate | 0 | 1
Credit Rating = High | 0 | 1

First, compute the Gini index for each category:
$G(\text{Credit Rating} = \text{Low}) = 1 - (1^2 + 0^2) = 0$
$G(\text{Credit Rating} = \text{Moderate}) = 1 - (0^2 + 1^2) = 0$
$G(\text{Credit Rating} = \text{High}) = 1 - (0^2 + 1^2) = 0$
36
Data Mining Techniques and Tools

Next, compute the Gini index for the complement
categories:
$G(\text{Credit Rating} \in \{\text{Low, Moderate}\}) = 1 - \left((1/2)^2 + (1/2)^2\right) = 1/2$
$G(\text{Credit Rating} \in \{\text{Low, High}\}) = 1/2$
$G(\text{Credit Rating} \in \{\text{Moderate, High}\}) = 1 - (0^2 + 1^2) = 0$

Third, compute the Gini index for the possible
branches. For the branch choice of Credit Rating =
Low versus {Moderate, High}, we would have
$G_{SPLIT} = (\text{proportion of records with Credit Rating} = \text{Low}) \cdot G(\text{Credit Rating} = \text{Low}) + (\text{proportion of records with Credit Rating} \in \{\text{Moderate, High}\}) \cdot G(\text{Credit Rating} \in \{\text{Moderate, High}\})$
$G_{SPLIT}(\text{Credit Rating} = \text{Low}) = (1/3)(0) + (2/3)(0) = 0$.
37
Data Mining Techniques and Tools

Last, compute the split index for the other
branch choices:
$G_{SPLIT}(\text{Credit Rating} = \text{Moderate}) = (1/3)(0) + (2/3)(1/2) = 1/3$
$G_{SPLIT}(\text{Credit Rating} = \text{High}) = (1/3)(0) + (2/3)(1/2) = 1/3$
$G_{SPLIT}(\text{Credit Rating} \in \{\text{Low, Moderate}\}) = (2/3)(1/2) + (1/3)(0) = 1/3$
$G_{SPLIT}(\text{Credit Rating} \in \{\text{Low, High}\}) = (2/3)(1/2) + (1/3)(0) = 1/3$
$G_{SPLIT}(\text{Credit Rating} \in \{\text{Moderate, High}\}) = (2/3)(0) + (1/3)(0) = 0$
The lowest value of the Gini index for the split
is zero, at Credit Rating = Low versus Credit
Rating ∈ {Moderate, High}; thus this is the split
point and these are the next branches of the
subtree. See figure.
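
The categorical case can be checked the same way by trying every subset of credit ratings against its complement. This is a sketch under stated assumptions: the function names are illustrative, and the three records are the Income > 27.5 patterns (4, 2 and 3) with credit rating and loan risk read from the sample table.

```python
from fractions import Fraction
from itertools import combinations

# (credit rating, loan risk) for the records with Income > 27.5,
# read from the sample table (patterns 4, 2 and 3)
records = [("Moderate", "Low"), ("Low", "High"), ("High", "Low")]

def gini(labels):
    """Gini index of a list of labels, as an exact fraction."""
    total = len(labels)
    if total == 0:
        return Fraction(0)
    return 1 - sum(Fraction(labels.count(c), total) ** 2 for c in set(labels))

def gini_split(records, subset):
    """Weighted Gini index of splitting into `subset` and its complement."""
    inside = [risk for rating, risk in records if rating in subset]
    outside = [risk for rating, risk in records if rating not in subset]
    n = len(records)
    return (Fraction(len(inside), n) * gini(inside)
            + Fraction(len(outside), n) * gini(outside))

ratings = sorted({rating for rating, _ in records})
results = {}
# Every non-empty proper subset of the categories is a candidate branch
for size in range(1, len(ratings)):
    for subset in combinations(ratings, size):
        results[subset] = gini_split(records, set(subset))
        print(subset, results[subset])
```

A subset and its complement describe the same branch, so {Low} and {Moderate, High} both come out at zero, and every other choice gives 1/3.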

38
Data Mining Techniques and Tools
39
Data Mining Techniques and Tools

The ID3 algorithm decision tree approach
Entropy
Measures the extent of uncertainty or randomness
in a data set. If all the data in a subset belong
to just one class, then there is no uncertainty
or randomness in that dataset, therefore the
entropy is zero
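
A small sketch of the entropy measure, assuming the usual Shannon definition in bits, $-\sum_j p_j \log_2 p_j$ (the slide does not spell the formula out, so the exact form used here is an assumption). The function name and sample labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["High", "High", "High"]))          # one class only
print(entropy(["High", "Low", "High", "Low"]))    # evenly mixed
```

A subset with a single class has entropy zero, as stated above, while an even two-class mix reaches the two-class maximum of 1 bit.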

40
Data Mining Techniques and Tools

Cluster analysis for data mining
Cluster analysis is an exploratory data analysis
tool for solving classification problems
The object is to sort cases into groups so that
the degree of association is strong between
members of the same cluster and weak between
members of different clusters

41
Data Mining Techniques and Tools

Cluster analysis results may be used to
Help identify a classification scheme
Suggest statistical models to describe
populations
Indicate rules for assigning new cases to classes
for identification, targeting, and diagnostic
purposes
Provide measures of definition, size, and change
in what were previously broad concepts
Find typical cases to represent classes

42
Data Mining Techniques and Tools

Cluster analysis methods
Statistical methods
Optimal methods
Neural networks
Fuzzy logic
Genetic algorithms
Each of these methods generally works with one of
two general method classes
Divisive
Agglomerative

43
Data Mining Techniques and Tools

Hierarchical clustering method and example
1) Decide which data to record from the items
2) Calculate the distances between all initial
clusters. Store the results in a distance matrix
3) Search through the distance matrix and find the
two most similar clusters
4) Fuse those two clusters together to produce a
cluster that has at least two items
5) Calculate the distances between this new cluster
and all the other clusters
6) Repeat steps 3 to 5 until you have reached the
prespecified maximum number of clusters
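
The agglomerative procedure above can be sketched for one-dimensional data. Several choices here are assumptions made for illustration: distance between clusters is the smallest pairwise distance (single linkage; the slide does not name a linkage rule), distances are recomputed on each pass instead of being stored in a matrix, and the data are the six income values from the loan example.

```python
def single_link(c1, c2):
    """Cluster distance = smallest pairwise distance (single linkage)."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(items, target_clusters):
    # Every item starts as its own cluster
    clusters = [[x] for x in items]
    while len(clusters) > target_clusters:
        # Find the two most similar (closest) clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # Fuse them into one cluster and keep the rest unchanged
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

incomes = [17, 20, 23, 32, 43, 68]
result = agglomerate(incomes, target_clusters=3)
print(sorted(sorted(c) for c in result))
```

Stopping at a different number of clusters only changes `target_clusters`; the merge order itself does not depend on it.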

44
Data Mining Techniques and Tools

Classes of data mining tools and techniques as
they relate to information and business
intelligence (BI) technologies
Mathematical and statistical analysis packages
Personalization tools for Web-based marketing
Analytics built into marketing platforms
Advanced CRM tools
Analytics added to other vertical
industry-specific platforms
Analytics added to database tools (e.g., OLAP)
Standalone data mining tools

45
Data Mining Project Processes
46
Data Mining Project Processes
47
Data Mining Project Processes

Knowledge discovery in databases (KDD)
A comprehensive process of using data mining
methods to find useful information and patterns
in data

48
Data Mining Project Processes

KDD process
Selection
Preprocessing
Transformation
Data mining
Interpretation/evaluation

49
Text Mining

Text mining
Application of data mining to nonstructured or
less structured text files. It entails the
generation of meaningful numerical indices from
the unstructured text and then processing these
indices using various data mining algorithms

50
Text Mining

Text mining helps organizations
Find the hidden content of documents, including
additional useful relationships
Relate documents across previously unnoticed
divisions
Group documents by common themes

51
Text Mining

Applications of text mining
Automatic detection of e-mail spam or phishing
through analysis of the document content
Automatic processing of messages or e-mails to
route a message to the most appropriate party to
process that message
Analysis of warranty claims, help desk
calls/reports, and so on to identify the most
common problems and relevant responses

52
Text Mining

Applications of text mining
Analysis of related scientific publications in
journals to create an automated summary view of a
particular discipline
Creation of a relationship view of a document
collection
Qualitative analysis of documents to detect
deception

53
Text Mining

How to mine text
Eliminate commonly used words (stop-words)
Replace words with their stems or roots (stemming
algorithms)
Consider synonyms and phrases
Calculate the weights of the remaining terms
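
These steps can be sketched as a tiny pipeline. Everything specific in it is an assumption for illustration: the stop-word list is a short hand-picked sample, the suffix stripper is a crude stand-in for a real stemming algorithm, the synonym and phrase step is skipped, and the weight is plain relative term frequency (the slide does not name a weighting scheme).

```python
from collections import Counter

# Tiny illustrative stop-word list, not a standard one
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}
# Crude stand-in for a real stemming algorithm
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_weights(text):
    """Relative frequency of each stem after stop-word removal."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    stems = [stem(t) for t in tokens if t and t not in STOP_WORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

weights = term_weights("Mining the claims: the claim is mined for warranty claims.")
print(weights)
```

The resulting weights are the kind of numerical indices that can then be passed to clustering or classification algorithms.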

54
Web Mining

Web mining
The discovery and analysis of interesting and
useful information from the Web, about the Web,
and usually through Web-based tools

55
Data Mining Project Processes
56
Web Mining

Web content mining
The extraction of useful information from Web
pages
Web structure mining
The development of useful information from the
links included in the Web documents
Web usage mining
The extraction of useful information from the
data generated through Web page visits,
transactions, etc.

57
Web Mining

Uses for Web mining
Determine the lifetime value of clients
Design cross-marketing strategies across products
Evaluate promotional campaigns
Target electronic ads and coupons at user groups
Predict user behavior
Present dynamic information to users

58
Data Mining Project Processes
