ITEC 423 Data Warehousing and Data Mining - PowerPoint PPT Presentation

1 / 45
About This Presentation
Title:

ITEC 423 Data Warehousing and Data Mining

Description:

ITEC 423 Data Warehousing and Data Mining Lecture 2 – PowerPoint PPT presentation

Number of Views:671
Avg rating:3.0/5.0
Slides: 46
Provided by: edut1299
Category:

less

Transcript and Presenter's Notes

Title: ITEC 423 Data Warehousing and Data Mining


1
ITEC 423 Data Warehousing and Data Mining
  • Lecture 2

DATA WAREHOUSEBUILDING BLOCKS
2
(No Transcript)
3
What is a data warehouse?
  • A data warehouse is a
  • subject-oriented,
  • Integrated (consolidated)
  • time-variant, and
  • nonvolatile
  • collection of data in support of managements
    decision-making process.
  • W. H. Inmon

4
Subject-Oriented Data
  • Warehouse is organized around major subjects of
    the enterprise rather than major application
    areas
  • Contains decision-support data rather than
    application-oriented data.
  • The focus of the design is
  • providing users easy access to data so that
    current and future questions can be answered

5
Application-Oriented vs Subject-Oriented
6
Integrated or Consolidated Data
  • Integrates corporate level application-oriented
    data from different source systems
  • data is often inconsistent or missing
  • Integrated data source must be made consistent to
    present a unified view of the data to the users.

7
Integrated Data
8
Time-Variant Data
This is necessary to support trending,
forecasting, and time-based performance
reporting, such as current year versus previous
year.
  • Data in the warehouse is only accurate and valid
    at some point in time or over some time interval.
  • contains slices of data across different periods
    of time.
  • With these data slices, the user can view current
    and past reports.
  • data represents a series of snapshots.
  • Time-variance is also shown in the extended time
    that data is stored
  • contains several years worth of data
  • implicit or explicit association of time with all
    data

9
Non-Volatile Data
  • Data in the warehouse is not updated in real-time
    but is refreshed from operational systems on a
    regular basis
  • New data is always added as a supplement to the
    database, rather than a replacement.

10
Data Granularity
  • operational database
  • data is usually kept at the lowest level of
    detail
  • the units of sale are captured at the level of
    units of a product per transaction at the check
    -out counter.
  • the quantity ordered is captured and stored at
    the level of units of a product per order
    received from the customer.
  • If summary data is needed, the individual
    transactions are grouped.
  • data warehouse
  • Initial requests are for summary data to use for
    analysis.
  • total sale units of a product in an entire
    region.
  • Progessively more details may be required
  • breakdown by states in the region.
  • Examine sale units at individual stores.

11
Data Granularity
  • Is the level of detail
  • keep data summarized at different levels of
    detail

THREE DATA LEVELS IN A BANKING DATA WAREHOUSE
12
Typical Properties of a data warehouse
  • A data warehouse is housed on an enterprise
    mainframe server.
  • Data from various online transaction processing
    (OLTP) applications and other sources is
    selectively extracted and organized
  • Read-only Copy
  • re-structured
  • Data warehouse database is used for processing
    analytical applications and user queries.

13
OLTP vs. Warehousing
  • Organized by transactions vs. Organized by
    particular subject
  • More number of users vs. less
  • Accesses few records vs. entire table
  • Smaller database vs. Large database
  • Normalized data structure vs. Unnormalized
  • Continuous update vs. periodic update (load)

14
Data Warehouse Compared to OLTP
PROPERTY OLTP DATA WAREHOUSE
ACTIVITIES Processes Analysis
RESPONSE TIME Subseconds to seconds Seconds to hours
OPERATIONS DML Primarily read-only
NATURE OFDATA 30-60 days Snapshots over time
DATA ORGANIZE. By application By subject,time
SIZE Small to large Large to very large
DATA SOURCES Operational Internal Operational, Internal, External
15
Data Warehouse Compared to OLTP
PROPERTY
OLTP
DATA WAREHOUSE
ACTIVITIES
Processes
Analysis
RESPONSE TIME
Subseconds
to
Seconds to hours
seconds
OPERATIONS
DML
Primarily read
-
only
NATURE OF
30
-
60 days
Snapshots over time
DATA
DATA ORGANIZE.
By application
By
subject,time
SIZE
Small to large
Large to very large
DATA SOURCES
Operational
Operational,
Internal
Internal, External
Predictable
Unpredictable
USAGE CURVEE
16
Data Warehouse or Data Mart
17
Data Warehouse Compared to Data Mart
Property Data Warehouse Data Mart
Scope Enterprise Department
Subjcts Multiple Single-subject, line of business (LOB)
Data Source Many Few
Size (typical) 100 GB to gt 1 TB lt 100 GB
Implementationtime Months to years Months
Data Warehouse
Data Mart
18
Data Warehouse Compared to Data Mart
Property
Data Warehouse
Data Mart
Scope
Enterprise
Department
Subjcts
Multiple
Single
-
subject, line
of business (LOB)
Data Source
Many
Few
Size (typical)
100 GB to gt 1 TB
lt 100 GB
Implementation
Months to years
Months
time
Data Warehouse
Data Mart
19
Which one to build first? Data warehouse or Data
Mart?
20
Top Down Approach
21
Top Down Approach
  • Build the overall, big, enterprise- wide data
    warehouse.
  • Instead of collection of fragmented islands of
    information.
  • Data warehouse is large and integrated.
  • would take longer to build and has a high risk
    of failure.
  • If you do not have experienced professionals on
    your team, this approach could be hazardous.
  • Difficult to sell this approach to senior
    management and sponsors
  • They are not likely to see results soon enough.

22
Pros and Cons of Top Down Approach
  • Advantages
  • Disadvantages
  • A truly corporate effort, an enterprise view of
    data
  • Inherently architected, not a union of disparate
    data marts
  • Single, central storage of data about the content
  • Centralized rules and control
  • May see quick results if implemented with
    iterations
  • Takes longer to build even with an iterative
    method
  • High exposure to risk of failure
  • Needs high level of cross-functional skills
  • High outlay without proof of concept

23
Bottom Up Approach
24
Bottom Up Approach
  • Build departmental data marts one by one based on
    priority
  • Collection of data marts make up the data
    warehouse
  • Beware of data fragmentation
  • Independent data marts are blind to the overall
    requirements of the entire organization.
  • Data marts contain
  • data at the lowest level of granularity
  • summaries depending on the needs for analysis
  • Data marts are joined or unioned together by
    conforming the dimensions

25
Pros and Cons of Bottom Up Approach
  • Disadvantages
  • Advantages
  • Faster and easier implementation of manageable
    pieces
  • Favorable return on investment and proof of
    concept
  • Less risk of failure
  • Inherently incremental
  • can schedule important data marts first
  • Allows project team to learn and grow
  • Each data mart has its own narrow view of data
  • Permeates redundant data in every data mart
  • Perpetuates inconsistent and irreconcilable data
  • Proliferates unmanageable interfaces

26
Architectural Types
27
Architectural Types
  • Centralized Data Warehouse
  • Takes into account the enterprise-level
    information requirements
  • Atomic level data at the lowest level of
    granularity is stored
  • Some summarized data may be included
  • Queries and applications access the central data
    warehouse.
  • No separate data marts
  • Independent Data Marts
  • Evolves in companies where the organizational
    units develop their own data marts for their own
    specific purposes
  • Each data mart serves a particular organizational
    unit
  • More than one version of the truth may be found
  • Data marts are independent of one another
  • Different data marts may have inconsistent data
    definitions and standards
  • Such variances hinder analysis of data across
    data marts.

28
Architectural Types
  • Federated
  • An existing legacy of an assortment of DSS in the
    form of operational systems, extracted datasets,
    primitive data marts,
  • May not be possible to discard investment and
    start from scratch
  • Practical solution is a federated architectural
    type
  • data may be physically or logically integrated
    through shared key fields, overall global
    metadata , distributed queries, and such other
    methods
  • No one overall data warehouse
  • Data-Mart Bus
  • Conformed supermarts approach
  • Analyzing requirements for a specific business
    subject such as orders, shipments, billings,
    insurance claims, car rentals, and ...
  • Build the first data mart (supermart) using
    business dimensions and metrics
  • These business dimensions will be shared in the
    future data marts.
  • Conform dimensions among the various data marts
  • Result would be logically integrated supermarts
    that will provide an enterprise view of the data
  • Data marts contain atomic data organized as a
    dimensional data model
  • Results from adopting an enhanced bottom-up
    approach to data warehouse development

29
Architectural Types
  • Hub-and-S poke
  • Similar to the centralized data warehouse
    architecture
  • Overall enterprise-wide data warehouse
  • Atomic data is stored in the centralized data
    warehouse
  • Major and useful difference is the presence of
    dependent data marts in this architectural type
  • Dependent data marts obtain data from the
    centralized data warehouse
  • The centralized data warehouse forms the hub to
    feed data to the data marts on the spokes
  • Dependent data marts may be developed for a
    variety of purposes
  • departmental analytical needs, specialized
    queries, data mining, and ...
  • Dependent data mart may have normalized,
    denormalized, summarized, or dimensional data
    structures based on individual requirements
  • Most queries are directed to the dependent data
    marts although the centralized data warehouse may
    itself be used for querying
  • Result s from adopting a top-down approach to
    data warehouse development.

30
Building Blocks of Data Warehouses
31
What is OLAP?
  • online analytical processing
  • Approach to answer multi-dimensional analytical
    queries.
  • part of the broader category of
  • business intelligence reporting, data mining.
  • Applications include
  • business reporting for sales,
  • management reporting,
  • budgeting
  • forecasting

32
Typical Data Warehousing Process
Phase I - STRATEGY Identify business
requirements Define objectives purpose of DW
Phase II - DEFINITION Project scoping and
planning Using building block approach
Phase III - ANALYSIS Information requirements are
defined
Phase IV - DESIGN Database structures to hold
base data and summaries are created Translation
mechanisms are designed
Phase V - BUILD DOCUMENT The warehouse is built
and documentation is developed
Phase VI - POPULATE, TEST TRAIN The warehouse
is populated and tested the users are trained on
system and tools
Iterative
Phase VII - DISCOVERY EVOLUTION The warehouse
is monitored and adjustments are applied, or
future extensions are planned
33
What Does All This Mean?
  • On a daily basis, organizations turn to their
    data warehouses to answer a limitless variety of
    questions.
  • Nothing is free
  • these benefits do come with a cost.
  • The value of a data warehouse is a result of the
    new and changed business processes it enables.
  • There are limitations
  • A DW cannot correct problems with the data,
    although it may help to clearly identify them.

34
Comparison of Typical DW Costs and Benefits
  • Costs
  • Hardware, software, development personnel and
    consultant costs.
  • Operational costs like ongoing systems
    maintenance.
  • Benefits
  • Added Revenue
  • Will the new (business objective) process
    generate new customers (what is the estimated
    value?)
  • Will the new (business objective) process
    increase the buying propensity of existing
    customers (by how much?)
  • Is the new process necessary to ensure that the
    competition doesn't offer a demanded service that
    you can't match?
  • Reduced costs
  • What costs of current systems will be eliminated?
  • Is the new process intended to make some
    operation more efficient? If so, how and what is
    the dollar value?

35
The Cost of Warehousing Data
  • Expenditures can be categorized as one-time
    initial costs or as recurring, ongoing costs.
  • The initial costs can further be identified as
    for hardware or software.
  • Expenditures can also be categorized as capital
    costs (associated with acquisition of the
    warehouse) or as operational costs (associated
    with running and maintaining the warehouse)

36
Expenditures Associated with Building a DW
Recurring Costs One-Time Costs
Capital Hardware maintenance Software maintenance Terminal analysis Middleware Hardware Software Disk DBMS CPU Terminal analysis Network Middleware Terminal analysis Network Log utility Processing Metadata Infrastructure
Operational Ongoing refreshment Integration transformation Data model maintenance Record identification maintenance Metadata infrastructure maintenance Archival of data Data aging within the DW Integration/transformation processing specification Metadata infrastructure population System of record definition Data dictionary language definition Network transfer definition CASE/Repository interface Initial data warehouse population Data model definition Database design definition
37
Cost is Highly Variable
  • A company that spends less money for their data
    warehouse is often happier with it.
  • The main justification for the development
    expense is that a DW reduces the cost of
    accessing the information owned by the
    organization.
  • Since information has to be retrieved just once
    (when it is placed in the warehouse), DW users
    see a lower cost on each report generated.

38
Typical Multidatabase Report and Screen Generation
Data download and transformation contribute to
retrieval costs for every report or screen
generated
39
Typical DW Report and Screen Generation
Data upload and transformation costs occur just
once. Retrieval costs are lower.
40
Farmers and Explorers
  • Every corporation has two types of DW users.
  • Farmers know what they want before they set out
    to find it. They submit small queries and
    retrieve small nuggets of information.
  • Explorers are quite unpredictable. They often
    submit large queries. Sometimes they find
    nothing, sometimes they find priceless nuggets.
  • Cost justification for the DW is usually done on
    the basis of the results obtained by farmers
    since explorers are unpredictable.

41
Data Marts and the Data Warehouse
Legacy systems feed data to the warehouse. The
warehouse feeds specialized information to
departments.
42
The Data Mart is More Specialized
The data mart serves the needs of one business
unit, not the organization.
43
Foundations of Data Mining
  • Data mining is the process of using raw data to
    infer important business relationships.
  • Despite a consensus on the value of data mining,
    a great deal of confusion exists about what it
    is.
  • It is a collection of powerful techniques
    intended for analyzing large datasets.
  • There is no single data mining approach, but
    rather a set of techniques that can be used in
    combination with each other.

44
The Roots of Data Mining
  • The approach has roots in practice dating back
    over 30 years.
  • In the early 1960s, data mining was called
    statistical analysis, and the pioneers were
    statistical software companies such as SAS and
    SPSS.
  • By the 1980s, the traditional techniques had been
    augmented by new methods such as fuzzy logic,
    heuristics and neural networks.

45
A General Approach
  • Although all data mining endeavors are unique,
    they possess a common set of process steps
  • Infrastructure preparation choice of hardware
    platform, the database system and one or more
    mining tools
  • Exploration looking at summary data, sampling
    and applying intuition
  • Analysis each discovered pattern is analyzed
    for significance and trends

46
A General Approach (continued)
  1. Interpretation Once patterns have been
    discovered and analyzed, the next step is to
    interpret them. Considerations include business
    cycles, seasonality and the population the
    pattern applies to.
  2. Exploitation this is both a business and a
    technical activity. One way to exploit a pattern
    is to use it for prediction. Others are to
    package, price or advertise the product in a
    different way.

47
Review Vocabulary
  • Data warehouse
  • Data mart
  • OLTP
  • OLAP
  • Dimensional Model
  • Subject Oriented
  • Time variant
  • Non volatile
  • Integrated/consolidate
Write a Comment
User Comments (0)
About PowerShow.com