Data warehousing and mining

Data warehousing and mining. Session VII (Part 1), 15:45 - 16:10. Sunita Sarawagi, School of IT, IIT Bombay.



1
Data warehousing and mining
  • Session VII (Part 1), 15:45 - 16:10
  • Sunita Sarawagi
  • School of IT, IIT Bombay

2
Introduction
  • Organizations are getting larger and amassing
    ever-increasing amounts of data.
  • Historical data encodes useful information about
    the working of an organization.
  • However, the data is scattered across multiple
    sources, in multiple formats.
  • Data warehousing: the process of consolidating
    data in a centralized location.
  • Data mining: the process of analyzing data to
    find useful patterns and relationships.

3
Typical data analysis tasks
  • Report the per-capita deposits broken down by
    region and profession.
  • Are deposits from rural coastal areas increasing
    over the last five years?
  • What percent of small business loans were
    cleared?
  • Why is it less than last year's? How did similar
    businesses that did not take loans perform?
  • What should be the new rules for loan
    eligibility?
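The first task above is, in effect, a group-by aggregate. A minimal sketch in Python, assuming a hypothetical list of (region, profession, deposit) records with one row per customer; the field names and figures are illustrative and do not come from the presentation:

```python
from collections import defaultdict

# Hypothetical customer records: (region, profession, deposit amount).
records = [
    ("coastal", "farmer", 1000.0),
    ("coastal", "farmer", 3000.0),
    ("coastal", "trader", 5000.0),
    ("inland", "farmer", 2000.0),
]

def per_capita_deposits(rows):
    """Average deposit per customer for each (region, profession) group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for region, profession, deposit in rows:
        totals[(region, profession)] += deposit
        counts[(region, profession)] += 1
    return {key: totals[key] / counts[key] for key in totals}

report = per_capita_deposits(records)
```

The later questions on the slide (why a figure dropped, what the new rules should be) are not simple aggregates, which is where OLAP and mining tools come in.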

4
Decision support tools
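  • [Figure: decision support architecture]
  • Operational data sources: Oracle, SAS, IMS
  • Merge, clean, summarize into the data warehouse
  • Data warehouse: detailed transactional data in a
    relational DBMS, e.g. Redbrick
  • Decision support tools on the warehouse: direct
    query, reporting tools, mining tools
  • Example products: Crystal reports, Essbase,
    Intelligent Miner
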
5
Data warehouse construction
  • Heterogeneous data integration
      • merge from various sources, fuzzy matches
      • remove inconsistencies
  • Data cleaning
      • missing data, outliers, clean fields, e.g.
        names/addresses
  • Data mining techniques
  • Data loading: summarize, create indices
  • Products: Prism warehouse manager, Platinum info
    refiner, info pump, QDB, Vality
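As an illustration of the "fuzzy matches" step when merging sources, here is a minimal sketch using Python's standard `difflib`. The normalization rules, the 0.85 threshold, and the function names are assumptions made for this example; the products listed above use their own, more elaborate matching methods.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def is_fuzzy_match(a, b, threshold=0.85):
    """True if the normalized names are similar enough (threshold is an assumption)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def merge_sources(source_a, source_b):
    """Union of two name lists, skipping names that fuzzily match one already kept."""
    merged = []
    for name in source_a + source_b:
        if not any(is_fuzzy_match(name, kept) for kept in merged):
            merged.append(name)
    return merged

merged = merge_sources(["R. K. Sharma"], ["R K Sharma", "Anita Desai"])
```

Comparing every name against every kept name is quadratic, so a real warehouse load would need blocking or indexing; the sketch only shows the matching idea.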

6
Warehouse maintenance
  • Data refresh
      • when to refresh, and in what form to send
        updates?
  • Materialized view maintenance with batch updates
  • Query evaluation using materialized views
  • Monitoring and reporting tools
      • HP intelligent warehouse advisor
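To make "materialized view maintenance with batch updates" concrete, the sketch below keeps a per-branch total as a stored aggregate and folds each batch of inserted and deleted rows into it instead of recomputing from the base data. The class, its methods, and the branch names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

class BranchTotalsView:
    """Materialized aggregate: total and row count of amounts per branch."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def apply_batch(self, inserted, deleted=()):
        """Fold a batch of inserted and deleted (branch, amount) rows into the view."""
        for branch, amount in inserted:
            self.total[branch] += amount
            self.count[branch] += 1
        for branch, amount in deleted:
            self.total[branch] -= amount
            self.count[branch] -= 1
            if self.count[branch] == 0:  # no base rows left for this group
                del self.total[branch], self.count[branch]

    def query_total(self, branch):
        """Answer an aggregate query from the view instead of the base data."""
        return self.total.get(branch, 0.0)

view = BranchTotalsView()
view.apply_batch(inserted=[("Powai", 100.0), ("Powai", 50.0), ("Andheri", 70.0)])
view.apply_batch(inserted=[("Powai", 25.0)], deleted=[("Andheri", 70.0)])
```

Keeping the row count alongside the sum is what lets the view drop a group once its last base row is deleted.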

7
Decision support tools
Mining tools
Direct Query
Reporting tools
Intelligent Miner
Essbase
Crystal reports
Merge Clean Summarize
Relational DBMS e.g. Redbrick
Data warehouse
Detailed transactional data
Operational data
Oracle
SAS
IMS
8
OLAP
  • Fast, interactive answers to large aggregate
    queries.
  • Multidimensional model: dimensions with
    hierarchies
      • Dim 1: Bank location
        (branch --> city --> state)
      • Dim 2: Customer
        (sub-profession --> profession)
      • Dim 3: Time
        (month --> quarter --> year)
  • Measures: loan amount, transactions, balance
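A small sketch of how a measure can be aggregated at different levels of the location hierarchy (branch, city, state). The branch and city names, the months, and the amounts are invented for the example; real OLAP servers precompute and index such aggregates rather than scanning the facts on every query.

```python
from collections import defaultdict

# Hypothetical location hierarchy: branch -> city -> state.
BRANCH_TO_CITY = {"Powai": "Mumbai", "Andheri": "Mumbai", "Shivajinagar": "Pune"}
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra"}

# Fact rows: (branch, month, loan amount); the measure is loan amount.
facts = [
    ("Powai", "1999-01", 10),
    ("Andheri", "1999-01", 20),
    ("Shivajinagar", "1999-01", 5),
    ("Powai", "1999-02", 7),
]

def roll_up(rows, level):
    """Sum loan amount per (location at the given level, month)."""
    cube = defaultdict(int)
    for branch, month, amount in rows:
        location = branch
        if level in ("city", "state"):
            location = BRANCH_TO_CITY[location]
        if level == "state":
            location = CITY_TO_STATE[location]
        cube[(location, month)] += amount
    return dict(cube)

by_city = roll_up(facts, "city")
by_state = roll_up(facts, "state")
```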

9
OLAP
  • Navigational operators: pivot, drill-down,
    roll-up, select.
  • Hypothesis-driven search, e.g. factors affecting
    defaulters
      • view defaulting rate by age, aggregated over
        other dimensions
      • for a particular age segment, detail along
        profession
  • Need interactive response to aggregate queries.
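The two-step defaulter analysis above can be sketched as an aggregate by age followed by a select and a drill-down along profession. The loan records and the age segments below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical loan records: (age segment, profession, defaulted?).
loans = [
    ("20-30", "trader", True),
    ("20-30", "trader", False),
    ("20-30", "farmer", False),
    ("20-30", "farmer", False),
    ("30-40", "trader", False),
    ("30-40", "farmer", False),
]

def default_rate(rows, key):
    """Defaulting rate per group, where key(row) picks the grouping value."""
    defaults = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += 1
        defaults[key(row)] += row[2]  # True counts as 1
    return {group: defaults[group] / totals[group] for group in totals}

# Step 1: view the rate by age, aggregated over the other dimensions.
by_age = default_rate(loans, key=lambda row: row[0])

# Step 2: select one age segment and drill down along profession.
segment = [row for row in loans if row[0] == "20-30"]
by_profession = default_rate(segment, key=lambda row: row[1])
```

Each step is an aggregate query over the same facts, which is why interactive response time matters for this style of exploration.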

10
OLAP products
  • About 30 OLAP vendors
  • Dominant ones:
      • Oracle Express: largest market share (20%)
      • Arbor Essbase: technology leader
      • Microsoft Plato: introduced late last year,
        rapidly taking over...

11
Microsoft OLAP strategy
  • Plato OLAP server: powerful, integrating various
    operational sources
  • OLE-DB for OLAP: emerging industry standard based
    on MDX (an extension of SQL for OLAP)
  • Pivot-table services: integrate with Office 2000
  • Every desktop will have OLAP capability.
  • Client-side caching and calculations
  • Partitioned and virtual cubes
  • Hybrid relational and multidimensional storage

12
Data mining
  • Process of semi-automatically analyzing large
    databases to find interesting and useful patterns
  • Overlaps with machine learning, statistics,
    artificial intelligence and databases, but
      • more scalable in number of features and
        instances
      • more automated to handle heterogeneous data

13
Some basic operations
  • Predictive
      • Regression
      • Classification
  • Descriptive
      • Clustering / similarity matching
      • Association rules and variants
      • Deviation detection

14
Classification
  • Given old data about customers and payments,
    predict new applicants' loan eligibility.
  • [Figure: records of previous customers (age,
    salary, profession, location, customer type) train
    a classifier; the resulting decision rules, e.g.
    Salary > 5 L and Prof. = Exec, label new
    applicants' data as good/bad]

15
Classification methods
  • Nearest neighbor
  • Regression (linear or any polynomial)
      • a*salary + b*age + c = eligibility score
  • Decision tree classifier
  • Probabilistic/generative models
  • Neural networks
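Two of the methods listed above are simple enough to sketch in a few lines: a one-nearest-neighbor classifier and a linear eligibility score. The customer records, the coefficient values, and the function names are assumptions made for this example.

```python
import math

# Hypothetical previous customers: ((age, salary in lakhs), label).
previous_customers = [
    ((25, 2.0), "bad"),
    ((30, 3.0), "bad"),
    ((40, 8.0), "good"),
    ((50, 9.0), "good"),
]

def nearest_neighbor(train, applicant):
    """Label of the training point closest to the applicant (Euclidean distance)."""
    _, label = min(train, key=lambda item: math.dist(item[0], applicant))
    return label

def linear_score(age, salary, a=1.0, b=0.1, c=0.0):
    """Eligibility score a*salary + b*age + c; the coefficients are placeholders."""
    return a * salary + b * age + c

prediction = nearest_neighbor(previous_customers, (45, 7.5))
score = linear_score(age=40, salary=8.0)
```

The features are used unscaled here, so age dominates the distance; in practice the attributes would be normalized first and the regression coefficients fitted from the old data.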

16
Clustering
  • Unsupervised learning: used when old data with
    class labels is not available, e.g. when
    introducing a new product.
  • Group/cluster existing customers based on time
    series of payment history, such that similar
    customers are in the same cluster.
  • Key requirement: need a good measure of
    similarity between instances.
  • Identify micro-markets and develop policies for
    each.
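As one possible instance of the idea, the sketch below clusters customers by payment history using plain k-means with Euclidean distance as the similarity measure. The presentation does not name a specific algorithm, so k-means, the distance choice, and the sample data are all assumptions.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length payment histories."""
    return math.dist(a, b)

def k_means(points, k, iterations=10):
    """Plain k-means, seeded with the first k points for determinism."""
    centers = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign each customer to the nearest center.
        assignment = [
            min(range(k), key=lambda j: distance(p, centers[j])) for p in points
        ]
        # Move each center to the mean of its members (keep it if empty).
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical monthly payments for four customers.
histories = [(100, 100, 100), (0, 10, 0), (90, 110, 100), (5, 0, 10)]
clusters = k_means(histories, k=2)
```

Swapping `distance` for a measure better suited to time series changes the grouping, which is the point of the "key requirement" bullet.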

17
Association rules
  • Given a set T of groups of items
      • Example: set of item sets purchased
      • T (from the slide figure): {Milk, cereal},
        {Tea, milk}, {Tea, rice, bread}, {cereal}
  • Goal: find all rules on itemsets of the form
    a --> b such that
      • support of a and b > user threshold s
      • conditional probability (confidence) of b
        given a > user threshold c
  • Example: Milk --> bread
  • Purchase of product A --> service B

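Support and confidence can be computed directly from their definitions. The sketch below assumes the four item sets shown on this slide make up T, restricts rules to single items on each side for brevity, and picks the thresholds s and c arbitrarily.

```python
from itertools import permutations

# Example set T, read from the slide (this reconstruction is an assumption).
T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def single_item_rules(transactions, s, c):
    """Rules a -> b over single items with support > s and confidence > c."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        both = support({a, b}, transactions)
        if both > s and both / support({a}, transactions) > c:
            rules.append((a, b))
    return rules

rules = single_item_rules(T, s=0.2, c=0.6)
```

This brute-force enumeration is only workable for a handful of items; practical miners prune candidate itemsets by support before generating rules.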
18
Mining market
  • Around 20 to 30 mining tool vendors
  • Major players:
      • Clementine,
      • IBM's Intelligent Miner,
      • SGI's MineSet,
      • SAS's Enterprise Miner.
  • All offer pretty much the same set of tools
  • Many embedded products: fraud detection,
    electronic commerce applications

19
Conclusions
  • The value of warehousing and mining in effective
    decision making based on concrete evidence from
    old data
  • Challenges of heterogeneity and scale in
    warehouse construction and maintenance
  • Grades of data analysis tools: straight querying,
    reporting tools, multidimensional analysis, and
    mining.