Master of Science - PowerPoint PPT Presentation

Slides: 32
Provided by: dream1
Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes


1
DATA MINING OVERVIEW

Margaret H. Dunham
CSE Department, Southern Methodist University
Dallas, Texas 75275
mhd@engr.smu.edu
2
  • Data is growing at a phenomenal rate
  • Users expect more sophisticated information
  • How?

UNCOVER HIDDEN INFORMATION: DATA MINING
3
Data Mining Definition
  • Finding hidden information in a database
  • Fit data to a model
  • Similar terms
    • Exploratory data analysis
    • Data driven discovery
    • Deductive learning

4
Database Processing vs. Data Mining Processing
  • Database processing
    • Query: well defined, expressed in SQL
    • Data: operational data
    • Output: precise; a subset of the database
  • Data mining processing
    • Query: poorly defined; no precise query language
    • Data: not operational data
    • Output: fuzzy; not a subset of the database

5
Data Mining Development
6
KDD Process
Modified from FPSS96C
  • Selection: Obtain data from various sources.
  • Preprocessing: Cleanse data.
  • Transformation: Convert to common format.
    Transform to new format.
  • Data Mining: Obtain desired results.
  • Interpretation/Evaluation: Present results to
    user in meaningful manner.

7
KDD Process Example: Web Log
  • Selection
    • Select log data (dates and locations) to use
  • Preprocessing
    • Remove identifying URLs
    • Remove error logs
  • Transformation
    • Sessionize (sort and group)
  • Data Mining
    • Identify and count patterns
    • Construct data structure
  • Interpretation/Evaluation
    • Identify and display frequently accessed
      sequences.
  • Potential User Applications
    • Cache prediction
    • Personalization
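
The preprocessing, transformation, and data mining steps listed for the Web log example can be sketched in Python. This is only an illustration under stated assumptions: the record layout (user, timestamp, URL, status), the 30-minute inactivity timeout, and the helper names `sessionize` and `count_pairs` are hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user_id, timestamp_in_seconds, url, http_status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 60, "/products", 200),
    ("u2", 30, "/home", 200),
    ("u1", 90, "/missing", 404),
    ("u2", 100, "/products", 200),
    ("u1", 4000, "/home", 200),
    ("u1", 4050, "/products", 200),
]

def sessionize(records, timeout=1800):
    """Drop error entries, then sort and group each user's requests into sessions."""
    by_user = defaultdict(list)
    for user, ts, url, status in records:
        if status < 400:  # preprocessing: remove error logs
            by_user[user].append((ts, url))
    sessions = []
    for user in sorted(by_user):
        current, last_ts = [], None
        for ts, url in sorted(by_user[user]):  # sort by time
            # assumed rule: a gap longer than `timeout` starts a new session
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

def count_pairs(sessions):
    """Data mining step: count consecutive page pairs across all sessions."""
    counts = Counter()
    for s in sessions:
        counts.update(zip(s, s[1:]))
    return counts

sessions = sessionize(LOG)
pair_counts = count_pairs(sessions)
print(sessions)
print(pair_counts.most_common(1))
```

The most frequent pairs are the kind of frequently accessed sequence that could feed cache prediction or personalization.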

8
Basic Data Mining Tasks
  • Classification maps data into predefined groups
  • Pattern Recognition
  • Regression
  • Clustering partitions database into groups
  • Groups not known a priori
  • Determined by the data (similarity)
  • Link Analysis uncovers relationships among data
  • Association Rules
  • Ex: 60% of the time bread is sold, so is peanut
    butter
  • Sequence Analysis
  • Ex: Most people who purchase CD players will
    purchase a CD within one week
  • Not causal
  • Not functional dependencies

9
Survey of Data Mining Tasks
  • Classification
    • Decision Trees
    • Neural Networks
  • Clustering
    • Agglomerative
    • Partitional
  • Association Rules
  • Web Mining

10
Classification Problem
  • Given a database $D = \{t_1, t_2, \ldots, t_n\}$ and a set of
    classes $C = \{C_1, \ldots, C_m\}$, the Classification Problem
    is to define a mapping $f: D \to C$ where each $t_i$ is
    assigned to one class.
  • Actually divides D into equivalence classes.
  • Prediction is similar, but may be viewed as
    having infinite number of classes.

11
Classification Examples
  • Pattern matching
  • Fraud detection
  • Identification of plant/animal species
  • Profiling (this is not a bad word)
  • Predicting terrorists or potential terrorist
    events
  • Web searches (Information Retrieval)

12
Defining Classes
13
Decision Trees
  • Decision Tree (DT)
    • Tree where the root and each internal node is
      labeled with a question.
    • The arcs represent each possible answer to the
      associated question.
    • Each leaf node represents a prediction of a
      solution to the problem.
  • Popular technique for classification: the leaf node
    indicates the class to which the corresponding tuple
    belongs.
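
The structure described above (questions at internal nodes, answers on arcs, classes at leaves) can be sketched as a small hand-built tree. The attribute name `height_m`, the thresholds, and the class labels below are invented for illustration and are not taken from the slides.

```python
# Hypothetical tree: internal nodes hold a question, arcs are keyed by the
# possible answers, and leaves hold the predicted class.
TREE = {
    # question: "Is height >= 1.8 m?"
    "question": lambda t: "yes" if t["height_m"] >= 1.8 else "no",
    "arcs": {
        "yes": {"leaf": "tall"},
        "no": {
            # question: "Is height >= 1.6 m?"
            "question": lambda t: "yes" if t["height_m"] >= 1.6 else "no",
            "arcs": {"yes": {"leaf": "medium"}, "no": {"leaf": "short"}},
        },
    },
}

def classify(node, record):
    """Follow the arc matching each answer until a leaf node is reached."""
    while "leaf" not in node:
        answer = node["question"](record)
        node = node["arcs"][answer]
    return node["leaf"]

people = [{"height_m": 1.9}, {"height_m": 1.7}, {"height_m": 1.5}]
labels = [classify(TREE, p) for p in people]
print(labels)  # ['tall', 'medium', 'short']
```

This sketch only applies a tree that already exists; learning the questions from training data is a separate step not shown here.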

14
Decision Tree Example
15
Neural Networks
  • Based on observed functioning of human brain.
  • Artificial Neural Networks (ANN)
  • Our view of neural networks is very simplistic.
  • We view a neural network (NN) from a graphical
    viewpoint.
  • Alternatively, an NN may be viewed from the
    perspective of matrices.
  • Used in pattern recognition, speech recognition,
    computer vision, and classification.

16
Classification Using Neural Networks
  • Typical NN structure for classification
    • One output node per class
    • Output value is class membership function value
  • Supervised learning
    • For each tuple in training set, propagate it
      through NN. Adjust weights on edges to improve
      future classification.
  • Algorithms: Propagation, Backpropagation,
    Gradient Descent
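
As a rough sketch of the supervised loop described above, the following Python trains a network with one sigmoid output node per class by gradient descent on squared error. The training tuples, learning rate, epoch count, and zero initial weights are assumptions made for illustration. Because this toy network has no hidden layer, the update reduces to the delta rule; backpropagation would carry the same gradient computation back through hidden layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def propagate(weights, x):
    """Forward pass: one sigmoid output per class; a bias input of 1 is appended."""
    inputs = list(x) + [1.0]
    return [sigmoid(sum(w * v for w, v in zip(node_w, inputs))) for node_w in weights]

def train(samples, n_classes, epochs=2000, rate=0.5):
    """Propagate each tuple, then adjust the edge weights by gradient descent."""
    n_inputs = len(samples[0][0]) + 1
    weights = [[0.0] * n_inputs for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in samples:
            inputs = list(x) + [1.0]
            outputs = propagate(weights, x)
            for c in range(n_classes):
                target = 1.0 if c == label else 0.0
                # gradient of the squared error through the sigmoid
                delta = (target - outputs[c]) * outputs[c] * (1.0 - outputs[c])
                for i in range(n_inputs):
                    weights[c][i] += rate * delta * inputs[i]
    return weights

def classify(weights, x):
    """Pick the class whose output node has the largest value."""
    outputs = propagate(weights, x)
    return outputs.index(max(outputs))

# Hypothetical training set: the class equals the first input attribute.
SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
weights = train(SAMPLES, n_classes=2)
predictions = [classify(weights, x) for x, _ in SAMPLES]
print(predictions)
```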

17
Neural Network Example
18
Propagation
19
Backpropagation
20
Clustering Problem
  • Given a database $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and an
    integer value $k$, the Clustering Problem is to
    define a mapping $f: D \to \{1, \ldots, k\}$ where each $t_i$ is
    assigned to one cluster $K_j$, $1 \le j \le k$.
  • A Cluster, $K_j$, contains precisely those tuples
    mapped to it.
  • Unlike classification problem, clusters are not
    known a priori.

21
Clustering Examples
  • Segment customer database based on similar buying
    patterns.
  • Group houses in a town into neighborhoods based
    on similar features.
  • Identify new plant species
  • Identify similar Web usage patterns

22
Agglomerative Example
[Figure: agglomerative clustering of items A, B, C, D, E, with threshold levels 1 through 5]
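
A threshold-driven agglomerative pass over five items can be sketched as follows. The pairwise distances are assumed values chosen for illustration, since the figure's actual distances are not recoverable from this transcript, and single-link distance is one of several possible choices for comparing clusters.

```python
from itertools import combinations

# Illustrative pairwise distances (assumed, not taken from the slide figure).
DIST = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
    ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3,
}

def distance(a, b):
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def single_link(c1, c2):
    """Single-link distance: the smallest distance between members of two clusters."""
    return min(distance(a, b) for a in c1 for b in c2)

def agglomerate(items, threshold):
    """Start from singleton clusters and merge while any pair is within the threshold."""
    clusters = [frozenset([i]) for i in items]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if single_link(c1, c2) <= threshold:
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break  # restart the scan over the updated cluster list
    return sorted(sorted(c) for c in clusters)

levels = {t: agglomerate("ABCDE", t) for t in range(1, 4)}
for t, clusters in levels.items():
    print(t, clusters)
```

With these assumed distances, raising the threshold merges the clusters step by step until a single cluster remains at threshold 3.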
23
Association Rule Problem
  • Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$ and a
    database of transactions $D = \{t_1, t_2, \ldots, t_n\}$ where
    $t_i = \{I_{i1}, I_{i2}, \ldots, I_{ik}\}$ and $I_{ij} \in I$, the Association
    Rule Problem is to identify all association rules
    $X \Rightarrow Y$ with a minimum support and confidence.
  • Link Analysis
  • NOTE: Support of $X \Rightarrow Y$ is the same as support of
    $X \cup Y$.

24
Example Market Basket Data
  • Items frequently purchased together
    • $\text{Bread} \Rightarrow \text{PeanutButter}$
  • Uses
    • Placement
    • Advertising
    • Sales
    • Coupons
  • Objective: increase sales and reduce costs

25
Association Rule Definitions
  • Set of items: $I = \{I_1, I_2, \ldots, I_m\}$
  • Transactions: $D = \{t_1, t_2, \ldots, t_n\}$, $t_j \subseteq I$
  • Itemset: $\{I_{i1}, I_{i2}, \ldots, I_{ik}\} \subseteq I$
  • Support of an itemset: Percentage of transactions
    which contain that itemset.
  • Large (Frequent) itemset: Itemset whose number of
    occurrences is above a threshold.

26
Association Rules Example
$I = \{\text{Beer}, \text{Bread}, \text{Jelly}, \text{Milk}, \text{PeanutButter}\}$
Support of $\{\text{Bread}, \text{PeanutButter}\}$ is 60%
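
The support figure above can be reproduced with a short Python sketch. The transaction table did not survive in this transcript, so the five transactions below are an assumption chosen to give a support of 60 percent for Bread and PeanutButter. The `confidence` helper uses the standard definition, support of X ∪ Y divided by support of X, which the slides mention but do not spell out.

```python
# Assumed transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter};
# they are illustrative and not recovered from the original slide.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Standard rule confidence: support(X union Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

s = support({"Bread", "PeanutButter"}, TRANSACTIONS)
c = confidence({"Bread"}, {"PeanutButter"}, TRANSACTIONS)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

An itemset would count as large (frequent) here whenever `support` meets whatever minimum threshold the analyst picks.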
27
Web Data
  • Web pages
  • Intra-page structures
  • Inter-page structures
  • Usage data
  • Supplemental data
    • Profiles
    • Registration information
    • Cookies

28
Web Structure Mining
  • Mine structure (links, graph) of the Web
  • PageRank
  • Create a model of the Web organization.
  • May be combined with content mining to more
    effectively retrieve important pages.

29
PageRank
  • Used by Google
  • Prioritize pages returned from search by looking
    at Web structure.
  • Importance of page is calculated based on number
    of pages which point to it (backlinks).
  • Weighting is used to provide more importance to
    backlinks coming from important pages.
  • $PR(p) = c \left( PR(1)/N_1 + \cdots + PR(n)/N_n \right)$
    • $PR(i)$: PageRank for a page $i$ which points to
      target page $p$.
    • $N_i$: number of links coming out of page $i$
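
The simplified PageRank formula above can be evaluated iteratively. The sketch below makes several assumptions: the three-page link graph is hypothetical, every page has at least one outgoing link, and the constant c is treated as a normalization factor that keeps the ranks summing to 1. It does not include the damping factor used in later formulations of PageRank.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(links, iterations=100):
    """Repeatedly apply PR(p) = c * sum(PR(i) / N_i) over the backlinks i of p."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for i, targets in links.items():
            share = rank[i] / len(targets)  # PR(i) / N_i
            for p in targets:
                new_rank[p] += share
        total = sum(new_rank.values())
        # c is applied as a normalization so the ranks keep summing to 1
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

ranks = pagerank(LINKS)
for page, value in ranks.items():
    print(page, round(value, 3))
```

In this graph, page C collects backlinks from both A and B, and A receives all of C's weight, so both end up ranked above B.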

30
Web Usage Mining
  • Extends work of basic search engines
  • Search Engines
    • IR application
    • Keyword based
    • Similarity between query and document
    • Crawlers
    • Indexing
    • Profiles
    • Link analysis

31
Web Usage Mining Applications
  • Personalization
  • Improve structure of a site's Web pages
  • Aid in caching and prediction of future page
    references
  • Improve design of individual pages
  • Improve effectiveness of e-commerce (sales and
    advertising)