COMP 578 Data Warehouse and Data Warehousing: An Introduction - PowerPoint PPT Presentation

About This Presentation
Title:

COMP 578 Data Warehouse and Data Warehousing: An Introduction

Description:

... advantage requires that companies accelerate their decision making process so ... One key to this accelerated decision making is having the right information, at ... – PowerPoint PPT presentation

Number of Views:276
Avg rating:3.0/5.0
Slides: 34
Provided by: keithc5
Category:

less

Transcript and Presenter's Notes

Title: COMP 578 Data Warehouse and Data Warehousing: An Introduction


1
COMP 578Data Warehouse and Data WarehousingAn
Introduction
  • Keith C.C. Chan
  • Department of Computing
  • The Hong Kong Polytechnic University

2
What is A Data Warehouse?
A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of managements
decision-making process. W. H. Inmon
HRO
Health
Data Warehouse
Student
ITs
3
Data Warehousing and Industry
  • One of the hottest topic in IS.
  • Over 90 of larger companies either have a DW or
    are starting one.
  • Warehousing is big business
  • Old statistics from Megroup.
  • 3.5 billion in early 1997
  • 8 billion in 1998 Metagroup
  • over 200 billion over next 5 years.
  • Latest by IDC on DW tools.
  • 5 billion in 1999.
  • 16 billion in 2004.
  • Latest by IDC on CRM applications
  • 61 billion in 2001
  • 148 billion in 2005

4
Data Warehousing and Industry (2)
  • A 1996 study of 62 data warehousing projects
    showed an average return on investment of 321,
    with an average payback period of 2.73 years.
  • In 2003, some people are skeptical.
  • WalMart has largest warehouse
  • 900-CPU, 2,700 disk, 23 TB Teradata system
  • 7TB in warehouse
  • 40-50GB per day

5
Why Data Warehouse?Why The Hype?
6
Information vs. Data
  • Information is pivotal in todays business
    environment. Success is dependent on its early
    and decisive use. A lack of information is a
    sure sign for failure. The rapidly changing
    environment in which business operates demands
    ever more immediate access to data. (Devlin,
    1997)
  • Many corporations are actively looking for new
    technologies that will assist them in becoming
    more profitable and competitive. Gaining
    competitive advantage requires that companies
    accelerate their decision making process so that
    they can respond quickly to change. One key to
    this accelerated decision making is having the
    right information, at the right time, easily
    accessible (Poe, 1996).

7
The Information Gap
  • The information gap is a result of
  • Fragmented way in which ISs and supporting DBs
    have been developed.
  • One-thing-at-a-time due to constraints on time
    and resources.
  • DBs on a variety of hardware and software
    platforms.
  • Difficult to locate and use accurate information.
  • Most systems developed to support operational
    processing.
  • Operational processing (a.k.a. TP) captures,
    stores and manipulates data to support daily
    operations.
  • Little thought given to the information or
    analytical tools needed for decision making.

8
Bridging The Information Gap
  • Data warehouses (DW) consolidate and integrate
    information from many different sources and
    arrange it in a meaningful format for making
    accurate business decisions (Martin, 1997a).
  • They support complex business decisions through
    analysis of trends, target marketing, competitive
    analysis, and so on.
  • Data warehousing has evolved to meet these needs
    without disturbing existing operational
    processing.

9
What Are The Issues?
  • How DW relates to existing operational systems.
  • Data architecture appropriate for most DW
    environments.
  • Extracting data from existing operational systems
    and loading them into a DW.
  • Interact with DW using OLAP, data mining and data
    visualization.

10
Data Warehouse Data Warehousing as Solution
11
The Need for Data Warehouses
  • Two major factors drive the need for data
    warehousing in most organizations today
  • Business requires an integrated company-wide view
    of high-quality information.
  • The IS department must separate informational
    from operational systems in order to dramatically
    improve performance in managing company data.

12
Need for a Company Wide View
  • Data in operational systems typically fragmented
    and of poor quality.
  • Generally distributed on a variety of
    incompatible HW and SW platforms
  • Unix running oracle DBMS
  • IBM MVS running the DB2 DBMS
  • Often necessary to provide a single, corporate
    view of that information for decision making.

13
Deriving a Single Corporate View
  • Develop a profile for each student from
  • STUDENT_DATA, STUDENT-EMPLOYEE, STUDENT_HEALTH
  • Some issues to resolve
  • Inconsistent key structures HKID and student
    name
  • Synonyms Student_No and Student_ID.
  • Free-form vs. structured fields Last name, first
    name.
  • Inconsistent data values different phone
    numbers.
  • Missing data how will the value for insurance be
    located?

14
Need to Separate Operational and Informational
Systems
  • Operational system used to run a business in real
    time based on current data.
  • E.g. sales order processing, reservation systems,
    patient registration,
  • Process large volumes of relatively simple
    read/write transactions, while providing fast
    response.
  • Information systems designed to support decision
    making based on historical data.
  • Designed for complex and read-only queries or
    data mining application.
  • Sales trend analysis, customer segmentation, and
    human resource planning.

15
Need to Separate Operational and Informational
Systems (2)
  • It is essential to separate informational
    processing from operational processing by
    creating a data warehouse.
  • A DW centralizes data (at least logically) that
    are scattered throughout disparate operational
    systems and makes them readily available for
    decision support.
  • A properly designed DW adds value to data by
    improving their quality and consistency.
  • A separate data warehouse eliminates much of the
    contention for resources that results when
    informational applications are cofounded with
    operational processing.

16
Data Warehouse vs. Operational DB Systems
  • OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations purchasing, inventory,
    banking, manufacturing, payroll, registration,
    accounting, etc.
  • OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making

17
Data Warehouse vs. Operational DB Systems
  • Distinct features (OLTP vs. OLAP)
  • User and system orientation customer vs. market
  • Data contents current, detailed vs. historical,
    consolidated
  • Database design ER application vs. star
    subject
  • View current, local vs. evolutionary, integrated
  • Access patterns update vs. read-only but complex
    queries

18
Why Separate Data Warehouse?
  • High performance for both systems
  • DBMS tuned for OLTP access methods, indexing,
    concurrency control, recovery
  • Warehouse tuned for OLAP complex OLAP queries,
    multidimensional view, consolidation.
  • Different functions and different data
  • missing data Decision support requires
    historical data which operational DBs do not
    typically maintain
  • data consolidation DS requires consolidation
    (aggregation, summarization) of data from
    heterogeneous sources
  • data quality different sources typically use
    inconsistent data representations, codes and
    formats which have to be reconciled.

19
Advantages of Warehousing Approach
  • High query performance
  • But not necessarily most current information
  • Doesnt interfere with local processing at
    sources
  • Complex queries at warehouse
  • OLTP at information sources
  • Information copied at warehouse
  • Can modify, annotate, summarize, restructure,
    etc.
  • Can store historical information
  • Security, no auditing
  • Has caught on in industry

20
The Terms The Definitions
21
The Data Warehouse
  • Strategic response to customer requirement for
    providing and processing information
  • at various levels of abstraction
  • using history for trend analysis
  • with high performance
  • What it provides
  • - A protected business decision support
    environment
  • - A repository of consolidated corporate data
  • - A staging area for revitalizing operational
    systems

22
  • What is A Data Warehouse

Multi-Dimensional Database
Data Rotation
Middleware
O L A P
M e t a D a t a
Data Scrubbers
Data Warehouse Manager
D S S
Data Mart
E I S
Dimensional Data Modeling
ESS
Data Mining
Star Schema
Data Propagation
Multi-relational tools
23
What is a Data Warehouse?A Practitioners
Viewpoint
  • A data warehouse is simply a single, complete,
    and consistent store of data obtained from a
    variety of sources and made available to end
    users in a way they can understand and use it in
    a business context.
  • -- Barry Devlin, IBM Consultant

24
What is a Data Warehouse?An Alternative Viewpoint
  • A DW is a
  • subject-oriented,
  • integrated,
  • time-varying,
  • non-volatile
  • collection of data that is used primarily in
    organizational decision making.
  • -- W.H. Inmon, Building the Data Warehouse, 1992

25
The Data Warehouse
  • Key characteristics
  • Subject-oriented
  • Integrated
  • Time-variant
  • Nonvolatile

26
Subject Oriented
Operational Applications/ Databases
Data Warehouse Subjects
  • Data is stored by business subject rather than by
    application
  • Order Billing
  • Accounts Receivable
  • Accounts Payable
  • Loans
  • Savings
  • Life Insurance Claims Processing
  • Auto Insurance
  • Customer
  • Claims
  • Sales
  • Product

27
Integrated
  • Data is stored once in a single integrated
    location

Operational Environment
Decision Support Environment
Savings Database
Data Warehouse Database
Savings Application
No Application Flavor
Customer data stored in several Databases
Current Accounts Database
Current Accounts Application
Personal Loans Database
Personal Loans Application
Subject Customer
28
Time-variant
  • Data is stored as a series of snapshots or views
    which records data content and context across
    time.

Data Warehouse Data

Time
Data
Key, Version and Date timestamp
- Data is tagged with some element of time -
creation date, as of date/to , etc. - Data is
available for long periods of time. For example,
five or more years
29
Non-volatile
  • Existing data in the warehouse is not overwritten
    or updated.

External Source Systems
Create Update Delete Transactions
Internal Source Systems
Data Warehouse
READ-ONLY
Data Warehouse Business Users Applications
30
How the Data Warehouse evolved
Operational Reporting
Data Extraction/Replication
Data Warehouses Data Marts OLAP
Servers Data Mining
31
Line of Business Data Marts extend the concept
Business Source Systems
Data Staging/Replication Layer
Line of Business Systems
External Data
Other
Data Warehouse
Data Marts
  • Extends the concept of Data Warehousing into the
    various lines-of-business in support of specific
    needs for business intelligence

32
Data Mining further extends the concepts of
tactical access to data in support of specific
business objectives
Business Source Systems
Data Staging/Replication Layer
Line of Business Systems
External Data
Other
Data Warehouse
  • Specialized applications which run on OLAP
    servers for drill-down processing
  • Can include access by neural nets, gophers and
    agents

33
Data Warehousing
  • Definition 1
  • The process of constructing and using data
    warehouses
  • Definition 2
  • The process whereby organizations extract meaning
    from their informational assets through the use
    of data warehouses.
Write a Comment
User Comments (0)
About PowerShow.com