DATA, TEXT, - PowerPoint PPT Presentation

Provided by: Jud8187

Title: DATA, TEXT,


1
Chapter 7
  • DATA, TEXT,
  • AND WEB MINING

2
Learning Objectives
  • Define data mining and list its objectives and
    benefits
  • Understand different purposes and applications of
    data mining
  • Understand different methods of data mining,
    especially clustering and decision tree models
  • Build expertise in use of some data mining
    software

3
Learning Objectives
  • Learn the process of data mining projects
  • Understand data mining pitfalls and myths
  • Define text mining and its objectives and
    benefits
  • Appreciate use of text mining in business
    applications
  • Define Web mining and its objectives and benefits

4
Data Mining Concepts and Applications
  • Six factors behind the sudden rise in popularity
    of data mining
  • General recognition of the untapped value in
    large databases
  • Consolidation of database records tending toward
    a single customer view
  • Consolidation of databases, including the
    concept of an information warehouse
  • Reduction in the cost of data storage and
    processing, providing for the ability to collect
    and accumulate data
  • Intense competition for a customer's attention in
    an increasingly saturated marketplace
  • The movement toward the de-massification of
    business practices

5
Data Mining Concepts and Applications
  • Data mining (DM)
  • A process that uses statistical, mathematical,
    artificial intelligence and machine-learning
    techniques to extract and identify useful
    information and subsequent knowledge from large
    databases

6
Data Mining Concepts and Applications
  • Major characteristics and objectives of data
    mining
  • Data are often buried deep within very large
    databases, which sometimes contain data from
    several years; sometimes the data are cleansed
    and consolidated in a data warehouse
  • The data mining environment is usually
    client/server architecture or a Web-based
    architecture

7
Data Mining Concepts and Applications
  • Major characteristics and objectives of data
    mining
  • Sophisticated new tools help to remove the
    information "ore" buried in corporate files or
    archival public records; finding it involves
    massaging and synchronizing the data to get the
    right results.
  • The miner is often an end user, empowered by data
    drills and other power query tools to ask ad hoc
    questions and obtain answers quickly, with little
    or no programming skill

8
Data Mining Concepts and Applications
  • Major characteristics and objectives of data
    mining
  • Striking it rich often involves finding an
    unexpected result and requires end users to think
    creatively
  • Data mining tools are readily combined with
    spreadsheets and other software development
    tools; the mined data can be analyzed and
    processed quickly and easily
  • Parallel processing is sometimes used because of
    the large amounts of data and massive search
    efforts

9
Data Mining Concepts and Applications
  • How data mining works
  • Data mining tools find patterns in data and may
    even infer rules from them
  • Three methods are used to identify patterns in
    data
  • Simple models
  • Intermediate models
  • Complex models

10
Data Mining Concepts and Applications
  • Classification
  • Supervised induction used to analyze the
    historical data stored in a database and to
    automatically generate a model that can predict
    future behavior
  • Common tools used for classification are
  • Neural networks
  • Decision trees
  • If-then-else rules

11
Data Mining Concepts and Applications
  • Clustering
  • Cluster analysis is an exploratory data
    analysis tool which aims at sorting different
    objects into groups in a way that the degree of
    association between two objects is maximal if
    they belong to the same group and minimal
    otherwise
  • Cluster analysis simply discovers structures in
    data without explaining why they exist.
  • The term cluster analysis (first used by Tryon,
    1939) encompasses a number of different
    algorithms and methods for grouping objects of
    similar kind into respective categories.
  • Example, people and animal classification
  • Joining (Tree Clustering), Two-way Joining (Block
    Clustering), and k-Means Clustering

12
Data Mining Concepts and Applications
  • k-Means Clustering: the k-means method will
    produce exactly k different clusters of greatest
    possible distinction.
  • Algorithms
  • Given a set of observations (x_1, x_2, ..., x_n),
    where each observation is a d-dimensional real
    vector, k-means clustering aims to partition the
    n observations into k sets (k ≤ n)
    S = {S_1, S_2, ..., S_k} so as to minimize the
    within-cluster sum of squares (WCSS):
    $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
  • where µ_i is the mean of points in S_i.
  • See paper.

13
Data Mining Concepts and Applications
  • 1) k initial "means" (in this case k = 3) are
    randomly generated within the data domain (shown
    in color).
  • 2) k clusters are created by associating every
    observation with the nearest mean. The partitions
    here represent the Voronoi diagram generated by
    the means.
  • 3) The centroid of each of the k clusters becomes
    the new mean.
  • 4) Steps 2 and 3 are repeated until convergence
    has been reached.
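
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to draw the initial means from the data points themselves (rather than from anywhere in the data domain) are assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: choose k initial means (assumption: sampled from the data).
    means = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: repeat until the means stop moving.
        if new_means == means:
            break
        means = new_means
    return means, clusters


def wcss(means, clusters):
    """Within-cluster sum of squares for a finished clustering."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, m))
        for m, cl in zip(means, clusters)
        for p in cl
    )
```

Calling `kmeans(points, 3)` on a list of 2-D tuples returns the final means and the clusters; `wcss` then reports the quantity the algorithm tries to minimize.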

14
Data Mining Concepts and Applications
  • EM clustering on an artificial dataset ("mouse").
    The tendency of k-means to produce equi-sized
    clusters leads to bad results, while EM benefits
    from the Gaussian distribution present in the
    data set

15
Data Mining Concepts and Applications
  • EM (Expectation Maximization) clustering is used
    to detect clusters in observations (or variables)
    and to assign those observations to the clusters.
  • A typical example application: a number of
    consumer behavior related variables are measured
    for a large sample of respondents.
16
Data Mining Concepts and Applications
  • Association
  • A category of data mining algorithm that
    establishes relationships about items that occur
    together in a given record
  • These powerful exploratory techniques have a wide
    range of applications in many areas of business
    practice and also research - from the analysis of
    consumer preferences or human resource
    management, to the history of language.
  • These techniques enable analysts and researchers
    to uncover hidden patterns in large data sets,
    such as "customers who order product A often also
    order product B or C" or "employees who said
    positive things about initiative X also
    frequently complain about issue Y but are happy
    with issue Z."
  • For example, if (Car = Porsche and Gender = Male
    and Age < 20) then (Risk = High and Insurance =
    High). Book store recommendations are another
    example.
  • The implementation of the so-called a-priori
    algorithm (see Agrawal and Swami, 1993; Agrawal
    and Srikant, 1994; Han and Lakshmanan, 2001; see
    also Witten and Frank, 2000) allows us to rapidly
    process huge data sets for such associations,
    based on predefined "threshold" values for
    detection.
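
The idea of finding item combinations that exceed a predefined threshold can be sketched as a level-wise search in the spirit of the a-priori algorithm. This is a simplified illustration, not the published algorithm verbatim: the function name `apriori`, the `min_support` parameter, and the use of support (the fraction of records containing an itemset) as the only threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of records that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent
```

A rule such as "customers who order A often also order B" can then be scored by dividing the support of {A, B} by the support of {A}.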

17
Data Mining Concepts and Applications
  • Association
  • Sequence Analysis. Sequence analysis is
    concerned with a subsequent purchase of a product
    or products given a previous buy. For instance,
    buying an extended warranty is more likely to
    follow (in that specific sequential order) the
    purchase of a TV or other electric appliances.
    Sequence rules, however, are not always that
    obvious, and sequence analysis helps you to
    extract such rules no matter how hidden they may
    be in your market basket data.
  • Link Analysis. In retailing or
    marketing, knowledge of purchase "patterns" can
    help with the direct marketing of special offers
    to the "right" or "ready" customers (i.e., those
    who, according to the rules, are most likely to
    purchase specific items given their observed past
    consumption patterns). "Link analysis" is often
    used when these techniques - for extracting
    sequential or non-sequential association rules -
    are applied to organize complex "evidence." It is
    easy to see how the "transactions" or "shopping
    basket" metaphor can be applied to situations
    where individuals engage in certain actions, open
    accounts, contact other specific individuals, and
    so on.
  • Unique data analysis requirements.
    Crosstabulation tables, and in particular
    Multiple Response tables

18
Data Mining Concepts and Applications
  • Visualization can be used in conjunction with
    data mining to gain a clearer understanding of
    many underlying relationships

19
Data Mining Concepts and Applications
20
Data Mining Concepts and Applications
  • a-priori algorithm
  • See paper.

21
Data Mining Concepts and Applications
  • Regression is a well-known statistical technique
    that is used to map data to a prediction value
  • Forecasting estimates future values based on
    patterns within large sets of data
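
As a small illustration of mapping data to a prediction value, the sketch below fits a straight line by ordinary least squares. The slides do not say which regression model is meant, so the single-predictor linear model and the function names `fit_line` and `predict` are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Map a new input value to its predicted value."""
    slope, intercept = model
    return slope * x + intercept
```

Forecasting in this simple setting amounts to calling `predict` with an input beyond the observed range.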

22
Data Mining Concepts and Applications
  • Hypothesis-driven data mining
  • Begins with a proposition by the user, who then
    seeks to validate the truthfulness of the
    proposition
  • Discovery-driven data mining
  • Finds patterns, associations, and relationships
    among the data in order to uncover facts that
    were previously unknown or not even contemplated
    by an organization

23
Data Mining Concepts and Applications
Data mining applications
  • Marketing
  • Banking
  • Retailing and sales
  • Manufacturing and production
  • Brokerage and securities trading
  • Insurance
  • Computer hardware and software
  • Government and defense
  • Airlines
  • Health care
  • Broadcasting
  • Police
  • Homeland security

24
Data Mining Techniques and Tools
  • Data mining tools and techniques can be
    classified based on the structure of the data and
    the algorithms used
  • Statistical methods
  • Decision trees
  • Defined as a root followed by internal nodes.
    Each node (including root) is labeled with a
    question and arcs associated with each node cover
    all possible responses

25
Data Mining Techniques and Tools
  • Data mining tools and techniques can be
    classified based on the structure of the data and
    the algorithms used
  • Case-based reasoning
  • Neural computing
  • Intelligent agents
  • Genetic algorithms
  • Other tools
  • Rule induction
  • Data visualization

26
Data Mining Techniques and Tools
  • A general algorithm for building a decision tree
  • Create a root node and select a splitting
    attribute.
  • Add a branch to the root node for each split
    candidate value and label
  • Take the following iterative steps
  • Classify data by applying the split value.
  • If a stopping point is reached, then create leaf
    node and label it. Otherwise, build another
    subtree
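
The general algorithm above can be sketched as a short recursive function. This is a minimal illustration under stated assumptions: rows are dictionaries of categorical values, the splitting attribute is chosen by lowest weighted Gini index (one possible criterion), and the names `build_tree`, `attributes`, and `target` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stopping point: the node is pure or no attributes remain, so
    # create a leaf labeled with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def weighted_gini(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())

    # Select the splitting attribute for this node.
    best = min(attributes, key=weighted_gini)
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}}
    # Add a branch per split value, classify the data by that value,
    # and build another subtree on each part.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        node["branches"][value] = build_tree(subset, remaining, target)
    return node
```

The result is a nested dictionary whose leaves are class labels, which is enough to read off if-then-else rules.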

27
Data Mining Techniques and Tools
  • Gini index
  • Used in economics to measure the diversity of
    the population. The same concept can be used to
    determine the purity of a specific class as a
    result of a decision to branch along a particular
    attribute/variable
  • Formula
  • $\mathrm{Gini}(S) = 1 - \sum_{j=1}^{n} p_j^2$
  • Where S is a data set that contains examples from
    n classes.
  • p_j is the relative frequency of class j in S.
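
The formula can be written as a one-line computation. This is a minimal sketch; the function name `gini` and the sample labels in the usage note are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum of squared relative class frequencies in S."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A set with a single class gives 0 (perfectly pure), while `gini(["a", "a", "b", "b"])` gives 0.5, the maximum for two classes.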

28
Data Mining Techniques and Tools
  • Example

Sample Patterns for Training a Decision Tree to Predict Loan Risk

Pattern  Income  Credit Rating  Loan Risk
0        23      High           High
1        17      Low            High
2        43      Low            High
3        68      High           Low
4        32      Moderate       Low
5        20      High           High

There are only two classes, High and Low. If the data
set S has p High and n Low elements, then the
Gini computation is as follows

29
Data Mining Techniques and Tools
  • $p_{High} = p/(p + n)$,
    $p_{Low} = n/(n + p)$
  • $\mathrm{Gini}(S) = 1 - p_{High}^2 - p_{Low}^2$
  • If data set S is split into S1 and S2, the
    splitting index is defined as follows
  • $\mathrm{Gini}_{SPLIT}(S) = \frac{p_1 + n_1}{p + n}\,\mathrm{Gini}(S_1)
    + \frac{p_2 + n_2}{p + n}\,\mathrm{Gini}(S_2)$
  • Where p_1, n_1 (p_2, n_2) denote the p_1 High
    elements and n_1 Low elements in the data set
    S1 (S2).
  • In this definition, the best split point is the
    one with the lowest value of the Gini_SPLIT index.
    For our example, reorder the data according to
    the income

Income  Pattern  Loan Risk
17      1        High
20      5        High
23      0        High
32      4        Low
43      2        High
68      3        Low

30
Data Mining Techniques and Tools
  • Possible values of a split point for the Income
    attribute are Income ≤ 17, Income ≤ 20,
    Income ≤ 23, Income ≤ 32, Income ≤ 43, and
    Income ≤ 68.
  • Now we can compute the Gini index for each of
    these levels of splits
  • Consider the choice of dividing the data at
    Income ≤ 17. We have the following choices of
    classifications

Pattern Count  High  Low
Income ≤ 17    1     0
Income > 17    3     2

So the Gini index for Income ≤ 17 and Income > 17
will be
$G(\text{Income} \le 17) = 1 - (p_{High}^2 + p_{Low}^2) = 1 - (1^2 + 0^2) = 0$,
where $p_{High}$ and $p_{Low}$ are the proportions of
records with High and Low risk. Similarly,
$G(\text{Income} > 17) = 1 - ((3/5)^2 + (2/5)^2) = 12/25$.

31
Data Mining Techniques and Tools
  • Gini index for the split choice is computed as
    follows
  • $\mathrm{Gini}_{SPLIT} = (\text{proportion of records at Income} \le 17) \cdot G(\text{Income} \le 17)
    + (\text{proportion of records at Income} > 17) \cdot G(\text{Income} > 17)$
  • That is
  • $G_{SPLIT} = (1/6) \cdot 0 + (5/6) \cdot (12/25) = 2/5$.
  • Now consider the choice Income ≤ 20.

Pattern Count  High  Low
Income ≤ 20    2     0
Income > 20    2     2

So the Gini index for Income ≤ 20 and Income > 20
will be
$G(\text{Income} \le 20) = 1 - (1^2 + 0^2) = 0$.
$G(\text{Income} > 20) = 1 - ((2/4)^2 + (2/4)^2) = 1/2$.
$G_{SPLIT} = (2/6) \cdot 0 + (4/6) \cdot (1/2) = 1/3$.

32
Data Mining Techniques and Tools
  • For choice split at Income 23

Pattern Count  High  Low
Income ≤ 23    3     0
Income > 23    1     2

$G(\text{Income} \le 23) = 1 - (1^2 + 0^2) = 0$.
$G(\text{Income} > 23) = 1 - ((1/3)^2 + (2/3)^2) = 4/9$.
$G_{SPLIT} = (3/6) \cdot 0 + (3/6) \cdot (4/9) = 2/9$.

For choice split at Income 32

Pattern Count  High  Low
Income ≤ 32    3     1
Income > 32    1     1

$G(\text{Income} \le 32) = 1 - ((3/4)^2 + (1/4)^2) = 3/8$.
$G(\text{Income} > 32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2$.
$G_{SPLIT} = (4/6) \cdot (3/8) + (2/6) \cdot (1/2) = 5/12$.
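
The hand computations for the Income split points can be checked with a short script. This is a minimal sketch: the function names are illustrative, the records are the six training patterns from the sample table, and exact fractions are used so the results can be compared with the slides directly.

```python
from fractions import Fraction

# (income, loan risk) for patterns 0-5 of the training table.
records = [(23, "High"), (17, "High"), (43, "High"),
           (68, "Low"), (32, "Low"), (20, "High")]


def gini(labels):
    """Gini index of a list of class labels, as an exact fraction."""
    n = len(labels)
    return 1 - sum(Fraction(labels.count(c), n) ** 2 for c in set(labels))


def gini_split(rows, threshold):
    """Weighted Gini index of splitting rows at income <= threshold."""
    left = [risk for income, risk in rows if income <= threshold]
    right = [risk for income, risk in rows if income > threshold]
    total = Fraction(0)
    for part in (left, right):
        if part:  # an empty side contributes nothing
            total += Fraction(len(part), len(rows)) * gini(part)
    return total


# Evaluate every observed income as a candidate split point.
scores = {t: gini_split(records, t) for t in sorted({inc for inc, _ in records})}
best = min(scores, key=scores.get)
for threshold, score in scores.items():
    print(f"Income <= {threshold}: G_SPLIT = {score}")
print("Best split:", best)
```

The script reports the lowest splitting index, 2/9, at Income ≤ 23.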
33
Data Mining Techniques and Tools
  • The lowest value of G_SPLIT is for Income ≤ 23. So
    we take the two nearest values and average them.
    Thus, we have a split point at Income =
    (23 + 32)/2 = 27.5.
  • Attribute lists are divided at the split point.
    That is, we expect to have a rule that says
  • If Income ≤ 27.5
  • Then
  • Else if Income > 27.5
  • Then
  • The following is the attribute list for
    Income ≤ 27.5

Income  Pattern  Loan Risk  Credit Rating
17      1        High       Low
20      5        High       High
23      0        High       High

So the conclusion is: if Income ≤ 27.5, the
loan risk is High.

34
Data Mining Techniques and Tools
  • But what about Income > 27.5?
  • The following table suggests that Income > 27.5 is
    not a definitive indicator of Loan Risk.

Income  Pattern  Loan Risk  Credit Rating
32      4        Low        Moderate
43      2        High       Low
68      3        Low        High

So we can examine Credit Rating to develop the
subtree for the Income > 27.5 case. However,
Credit Rating is a categorical variable, and the
rules for a categorical variable are slightly
different from those for a continuous variable.
The Gini index formula will be
$\mathrm{Gini}(\text{two proportions}) = 1 - p_{\text{one proportion}}^2 - p_{\text{the other proportion}}^2$

35
Data Mining Techniques and Tools
  • In the case of a categorical variable, one
    proportion is the set of records with Credit
    Rating = Low, and the other proportion is the
    set of records with Credit Rating not Low, that
    is, Credit Rating ∈ {Moderate, High}. Thus we
    have to compute the proportion of each category
    and its complement.
  • The following table counts the Loan Risk classes
    for each Credit Rating among the records with
    Income > 27.5.

Pattern Count             Loan Risk High  Loan Risk Low
Credit Rating = Low       1               0
Credit Rating = Moderate  0               1
Credit Rating = High      0               1

First, compute the Gini index for each category:
$G(\text{Credit Rating} = \text{Low}) = 1 - 1^2 - 0^2 = 0$
$G(\text{Credit Rating} = \text{Moderate}) = 1 - 0^2 - 1^2 = 0$
$G(\text{Credit Rating} = \text{High}) = 1 - 0^2 - 1^2 = 0$

36
Data Mining Techniques and Tools
  • Next, compute the Gini index for complement
    categories
  • $G(\text{Credit Rating} \in \{\text{Low}, \text{Moderate}\}) = 1 - (1/2)^2 - (1/2)^2 = 1/2$
  • $G(\text{Credit Rating} \in \{\text{Low}, \text{High}\}) = 1/2$
  • $G(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\}) = 1 - 0^2 - 1^2 = 0$

Third, compute the Gini index for possible
branches. For the branch choice of Credit Rating =
Low versus Credit Rating ∈ {Moderate, High}, we
would have
$G_{SPLIT} = (\text{proportion of records with Credit Rating} = \text{Low}) \cdot G(\text{Credit Rating} = \text{Low})
+ (\text{proportion of records with Credit Rating} \in \{\text{Moderate}, \text{High}\}) \cdot G(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\})$
$G_{SPLIT}(\text{Credit Rating} = \text{Low}) = (1/3) \cdot 0 + (2/3) \cdot 0 = 0$.

37
Data Mining Techniques and Tools
  • Last, compute the Gini index for other
    categories
  • $G_{SPLIT}(\text{Credit Rating} = \text{Moderate}) = (1/3) \cdot 0 + (2/3) \cdot (1/2) = 1/3$
  • $G_{SPLIT}(\text{Credit Rating} = \text{High}) = (1/3) \cdot 0 + (2/3) \cdot (1/2) = 1/3$
  • $G_{SPLIT}(\text{Credit Rating} \in \{\text{Low}, \text{Moderate}\}) = (2/3) \cdot (1/2) + (1/3) \cdot 0 = 1/3$
  • $G_{SPLIT}(\text{Credit Rating} \in \{\text{Low}, \text{High}\}) = (2/3) \cdot (1/2) + (1/3) \cdot 0 = 1/3$
  • $G_{SPLIT}(\text{Credit Rating} \in \{\text{Moderate}, \text{High}\}) = (2/3) \cdot 0 + (1/3) \cdot 0 = 0$
  • The lowest value of the Gini index for the split
    is zero at Credit Rating = Low and Credit Rating
    ∈ {Moderate, High}; thus this is the split point
    and these are the next branches of the subtree.
    See figure.
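
The category-versus-complement comparison can be sketched in a few lines. This is an illustrative check only: the function names are hypothetical, and the three records are the training patterns with Income > 27.5 (patterns 4, 2, and 3) as listed in the sample table.

```python
from fractions import Fraction

# (credit rating, loan risk) for the training patterns with Income > 27.5.
subset = [("Moderate", "Low"), ("Low", "High"), ("High", "Low")]


def gini(labels):
    """Gini index of a list of class labels, as an exact fraction."""
    n = len(labels)
    return 1 - sum(Fraction(labels.count(c), n) ** 2 for c in set(labels))


def gini_split_category(rows, category):
    """Weighted Gini index of splitting rows into category vs. the rest."""
    inside = [risk for rating, risk in rows if rating == category]
    outside = [risk for rating, risk in rows if rating != category]
    total = Fraction(0)
    for part in (inside, outside):
        if part:  # an empty side contributes nothing
            total += Fraction(len(part), len(rows)) * gini(part)
    return total


# Score each category against its complement and keep the lowest.
scores = {c: gini_split_category(subset, c) for c in ("Low", "Moderate", "High")}
best = min(scores, key=scores.get)
```

Splitting off Credit Rating = Low gives a splitting index of 0, while the other two choices give 1/3, which agrees with the values computed by hand.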

38
Data Mining Techniques and Tools
39
Data Mining Techniques and Tools
  • The ID3 algorithm decision tree approach
  • Entropy
  • Measures the extent of uncertainty or randomness
    in a data set. If all the data in a subset belong
    to just one class, then there is no uncertainty
    or randomness in that dataset, therefore the
    entropy is zero
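
The entropy of a set of class labels can be computed as follows. The slide does not state a formula; this sketch assumes the standard Shannon entropy in bits, which is the usual choice in ID3, and the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a data set."""
    n = len(labels)
    probabilities = [count / n for count in Counter(labels).values()]
    # Each class contributes -p * log2(p); a single class contributes 0.
    return sum(-p * math.log2(p) for p in probabilities)
```

A subset whose records all belong to one class has entropy 0, and an even two-class mix has entropy 1.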

40
Data Mining Techniques and Tools
  • Cluster analysis for data mining
  • Cluster analysis is an exploratory data analysis
    tool for solving classification problems
  • The object is to sort cases into groups so that
    the degree of association is strong between
    members of the same cluster and weak between
    members of different clusters

41
Data Mining Techniques and Tools
  • Cluster analysis results may be used to
  • Help identify a classification scheme
  • Suggest statistical models to describe
    populations
  • Indicate rules for assigning new cases to classes
    for identification, targeting, and diagnostic
    purposes
  • Provide measures of definition, size, and change
    in what were previously broad concepts
  • Find typical cases to represent classes

42
Data Mining Techniques and Tools
  • Cluster analysis methods
  • Statistical methods
  • Optimal methods
  • Neural networks
  • Fuzzy logic
  • Genetic algorithms
  • Each of these methods generally works with one of
    two general method classes
  • Divisive
  • Agglomerative

43
Data Mining Techniques and Tools
  • Hierarchical clustering method and example
  • 1) Decide which data to record from the items
  • 2) Calculate the distances between all initial
    clusters. Store the results in a distance matrix
  • 3) Search through the distance matrix and find the
    two most similar clusters
  • 4) Fuse those two clusters together to produce a
    cluster that has at least two items
  • 5) Calculate the distances between this new cluster
    and all the other clusters
  • 6) Repeat steps 3 to 5 until you have reached the
    prespecified maximum number of clusters
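
The procedure above can be sketched as a small agglomerative routine. Several choices here are assumptions made for the example: distances are Euclidean, the distance between two clusters is that of their closest members (single linkage), and distances are recomputed on each pass instead of being stored in a matrix. The name `agglomerate` is hypothetical.

```python
import math


def agglomerate(points, target_clusters):
    """Merge the closest clusters until target_clusters (>= 1) remain."""
    # Every recorded item starts as its own cluster.
    clusters = [[p] for p in points]

    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > target_clusters:
        # Find the two most similar clusters among all pairs.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Fuse them; distances to the new cluster are recomputed on
        # the next pass through the loop.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

Storing the pairwise distances in a matrix, as the steps describe, avoids the repeated recomputation and matters for larger data sets.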

44
Data Mining Techniques and Tools
  • Classes of data mining tools and techniques as
    they relate to information and business
    intelligence (BI) technologies
  • Mathematical and statistical analysis packages
  • Personalization tools for Web-based marketing
  • Analytics built into marketing platforms
  • Advanced CRM tools
  • Analytics added to other vertical
    industry-specific platforms
  • Analytics added to database tools (e.g., OLAP)
  • Standalone data mining tools

45
Data Mining Project Processes
46
Data Mining Project Processes
47
Data Mining Project Processes
  • Knowledge discovery in databases (KDD)
  • A comprehensive process of using data mining
    methods to find useful information and patterns
    in data

48
Data Mining Project Processes
  • KDD process
  • Selection
  • Preprocessing
  • Transformation
  • Data mining
  • Interpretation/evaluation

49
Text Mining
  • Text mining
  • Application of data mining to nonstructured or
    less structured text files. It entails the
    generation of meaningful numerical indices from
    the unstructured text and then processing these
    indices using various data mining algorithms

50
Text Mining
  • Text mining helps organizations
  • Find the hidden content of documents, including
    additional useful relationships
  • Relate documents across previously unnoticed
    divisions
  • Group documents by common themes

51
Text Mining
  • Applications of text mining
  • Automatic detection of e-mail spam or phishing
    through analysis of the document content
  • Automatic processing of messages or e-mails to
    route a message to the most appropriate party to
    process that message
  • Analysis of warranty claims, help desk
    calls/reports, and so on to identify the most
    common problems and relevant responses

52
Text Mining
  • Applications of text mining
  • Analysis of related scientific publications in
    journals to create an automated summary view of a
    particular discipline
  • Creation of a relationship view of a document
    collection
  • Qualitative analysis of documents to detect
    deception

53
Text Mining
  • How to mine text
  • Eliminate commonly used words (stop-words)
  • Replace words with their stems or roots (stemming
    algorithms)
  • Consider synonyms and phrases
  • Calculate the weights of the remaining terms
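
These steps can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the stop-word list is a tiny sample, the suffix-stripping `stem` function is a crude stand-in for a real stemming algorithm, synonym and phrase handling is omitted, and TF-IDF is used as one common way of weighting terms.

```python
import math
from collections import Counter

# Illustrative subset of stop-words; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for"}


def stem(word):
    """Strip a few common suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, drop punctuation and stop-words, then stem."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

The resulting numeric weights are the kind of indices that a data mining algorithm such as clustering can then process.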

54
Web Mining
  • Web mining
  • The discovery and analysis of interesting and
    useful information from the Web, about the Web,
    and usually through Web-based tools

55
Data Mining Project Processes
56
Web Mining
  • Web content mining
  • The extraction of useful information from Web
    pages
  • Web structure mining
  • The development of useful information from the
    links included in the Web documents
  • Web usage mining
  • The extraction of useful information from the
    data being generated through webpage visits,
    transactions, etc.

57
Web Mining
  • Uses for Web mining
  • Determine the lifetime value of clients
  • Design cross-marketing strategies across products
  • Evaluate promotional campaigns
  • Target electronic ads and coupons at user groups
  • Predict user behavior
  • Present dynamic information to users

58
Data Mining Project Processes