Data Warehousing Version 6.0 - 04/18/2000 - PowerPoint PPT Presentation

1 / 60
About This Presentation
Title:

Data Warehousing Version 6.0 - 04/18/2000

Description:

What is the cost of staff to break into a new line of business? ... painlessly and 'seamlessly' and at the same time, combine data from modern ... – PowerPoint PPT presentation

Number of Views:36
Avg rating:3.0/5.0
Slides: 61
Provided by: davidf86
Learn more at: http://mis.njit.edu
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing Version 6.0 - 04/18/2000


1
Data WarehousingVersion 6.0 - 04/18/2000
2
Can your database answer questions like these?
  • What is the cost of staff to break into a new
    line of business?
  • What are the travel routes of my competitions
    inventory?
  • At what velocity is my competitor moving toward a
    common goal?
  • How will a transaction on a certain date be
    affected by currency exchange rates?
  • Is a foreign labor source likely to produce a
    higher quality product?
  • Which 20 of the problem creates 80 of the
    problems?

3
Can your database answer questions like these?
  • By product and location, how can we regain a lost
    customer base?
  • Which skill and staff levels are most likely to
    accept the voluntary layoff package?

4
Data is Difficult to Manage
  • Amount of data increases exponentially past data
    must be kept for long times new data are added
    rapidly.
  • Data are scattered throughout organizations
    collected by many individuals using different
    methods and devices.
  • Only small portions of an organizations data are
    relevant for specific decisions.
  • An ever-increasing amount of external data needs
    to be considered in making organizational
    decisions.

5
Data is Difficult to Manage - contd.
  • Raw data may be stored in different computing
    systems, formats, and human and computer
    languages.
  • Legal requirements relating to data differ among
    countries and change frequently.
  • Selecting data management tools can be difficult
    because of the large number of tools available.
  • Data security, quality, and integrity are
    critical, yet easily jeopardized.

6
Data and Knowledge Managementment
  • Businesses do not run on data. They run on
    information and their knowledge of how to put
    that information to use successfully.

7
Some Information Concepts
  • Data Unorganized facts and figures. (raw
    material)
  • Information Data that has been processed into a
    form that is meaningful to the recipient and is
    of real of perceived value in current or
    prospective actions or decisions.
  • Information
  • adds to a representation
  • corrects or confirms previous information
  • has surprise value in that it tells us
    something we did not know, or could not predict.
  • What is a finished product to one, may be raw
    materials to someone else.

8
Definitions Information vs. Knowledge
  • Knowledge a combination of instincts, ideas,
    rules, and procedures that guide actions and
    decisions.
  • Helping to provide the best available knowledge
    to decision-making is another role of information
    systems

9
Relationship Between Data, Information, and
Knowledge
  • The difference between data and information is
    easy to remember.
  • It is often cited as the reason why systems that
    collect large amounts of information fail to meet
    managements information needs.
  • There are many methods of converting data into
    information for decision making.
  • Managers take action based on information about a
    current situation plus their accumulated
    knowledge. Actions taken feed the process of
    accumulating more knowledge (experience).
  • Example How do medical students become competent
    physicians?

10
Relationship Between Data, Information, and
Knowledge
11
Attributes of Quality Information
  • Timeliness
  • Completeness
  • Conciseness
  • Relevance
  • Accuracy
  • Precision
  • Appropriateness of Form

12
Special Characteristics of Information
  • Usefulness - depends on combination of
    quality,accessibility,and presentation.
  • One persons information may be another persons
    noise.
  • Soft data may be as important as hard data.
  • Ownership of information may be hard to maintain.
  • More information is not always better
    (information overload).
  • Politics can often hide or distort information.

13
Sources of Data
  • Internal Data
  • Data about people, products, services, processes.
  • Often stored in corporate data bases (e.g. Sales
    or HR).
  • Some data may be disparate in different regions,
    but accessible by networks.
  • Personal Data
  • Individuals document expertise by creating
    personal data - subjective estimates. Some is
    kept in heads, or mental models.
  • Can be store on PCs, or available on the Web.
  • External Data
  • Many sources.

14
Some Sources of Business External Data
  • Federal Publications
  • Survey of Current Business
  • Monthly Labor Review
  • Federal Reserve Bulletin
  • Employment and Earnings
  • Commerce Business Daily
  • Census Bureau
  • Other
  • International Monetary Fund
  • Moodys
  • Standard Poors
  • Advertising Age
  • Dialog and Lexis/Nexis
  • Other-contd.
  • ABI/Inform
  • Annual Editor Publisher Market Guide
  • Thomas Register On-line
  • Indexes
  • Encyclopedia of Business Information Sources

15
What is a data warehouse?
  • A data warehouse is a pool of data organized in a
    format that enables users to interpret data and
    convert it into useful information to gain
    knowledge from this interpretation.
  • It is a single place that contains complete and
    consistent data from multiple sources.
  • Data warehousing is the act of a business person
    extracting business value from the data stored in
    the data warehouse.

16
Collecting Raw Data
  • Is not Easy
  • collect in the field
  • elicit from people
  • collect manually, electronically, or by sensors.
  • Data collection technology has not kept pace with
    advances of data storage technology.
  • Data collection from external sources is not easy
    either.
  • Bottom Line Garbage IN, Garbage OUT - GIGO.
  • Data Quality is an Important Issue.

17
Data Quality Issues
  • Intrinsic DQ
  • accuracy, objectivity, believability, reputation.
  • Accessibility DQ
  • Accessibility and access security
  • Contextual DQ
  • relevancy, value-added, timeliness,completeness,
    amount of data.
  • Representation DQ
  • Interpretability, ease of understanding,concise
    representation, consistent representation

18
Why Data Warehousing?
  • Managers do not make decisions that are good or
    bad, they make decisions on the basis of good
    or bad information.
  • Management information
  • a. the right information
  • b. in the right form
  • c. at the right time.
  • Most transaction-based information systems have
    difficulty delivering this information.

19
Why Data Warehousing - 2
  • Not the right information
  • data not easily accessible
  • meaning is subtly (or significantly different)
    from the question context.
  • Information is presented with too much or too
    little detail, covers the wrong time spans, or is
    in the wrong intervals.

20
Why Data Warehousing - 3
  • Not the right time
  • Getting this information may require the efforts
    of highly skilled professionals who are not
    generally available at the whim of business
    managers.
  • Data comes from a variety of different systems
    which are resident on a variety of different
    technology platforms.

21
Why Data Warehousing - 4
  • Not the right format
  • If data is extracted, merged, and converted into
    a meaningful information, often it is not in a
    usable format.
  • Users will want it loadable into a particular PC
    tool or spreadsheet with which he/she is
    familiar.
  • Printouts weighing 10 pounds are not in the right
    format.
  • a diskette with a COBOL file description is not
    in the right format.

22
The Dilemma for Corporate IT
  • How to control scarce IT resources consumed by
    insatiable user demand for ad-hoc reports.
  • Each ad-hoc report generated by IT and analyzed
    by the user generates three more reports to
    further illuminate the insights gleamed in the
    first.
  • Often the extract programs have few reusable
    components.
  • The user is on a voyage of discovery in a sea of
    data.

23
The Response of Corporate IT
  • New methodologies Align the IT systems with the
    business goals and requirements.
  • These techniques concentrate on business process
    requirements, not decision support requirements.
  • Transaction systems must be rigorously specified
    in advance. The are an intersection between the
    organization and the customer.
  • These systems should not be a voyage of
    discovery for either.

24
Transaction Systems vs. Analytical Support Systems
  • Transaction Systems
  • Insert an order for 300 baseballs
  • Update this passengers airline reservation.
  • close-out accounts payable records for this
    vendor.
  • What is the current checking account balance for
    this customer?
  • Analytical Support Systems
  • Did the sales promotion last quarter do better
    than the same promotion last year?
  • Is the five-day moving average for this security
    leading or trailing actual prices?
  • Which product line sells best in middle-America
    and how does this correlate to demographic data.

25
Analytical Processing
  • Analytical Processing today includes what in the
    past have been called
  • DSS (Decision Support Systems)
  • EIS (Executive Information Systems)
  • ESS (Executive Support Systems)
  • It is an evolution of End-User Computing
  • Placing strategic data access in the hands of
    decision makers aids productivity and enables
    them to be better decision makers.

26
Key Difference OLTP vs. OLAP
  • OLTP Processing specific functions
  • OLAP providing flexibility for undetermined
    analysis.

27
A Multidimensional database
28
Data for Decision Support
  • The data must be integrated - requires data from
    many separate internal corporate databases.
  • The data must be enriched - through integration
    with other external data.
  • The data must be available - and not constrained
    by machine resources.

29
Sources of Data
  • Internal Data
  • Financial Systems
  • Logistics Systems
  • Sales Systems
  • Production Systems
  • Personnel Systems
  • Billing Systems
  • Information Systems
  • External Data Needs
  • to recognize opportunities
  • to detect threats
  • to identify synergies

30
Sources of Data - 2
  • External Data Categories
  • Competitor Data
  • Economic Data
  • Industry Data
  • Credit Data
  • Commodity Data
  • Econometric Data
  • Psychometric Data
  • Meteorological Data
  • Demographic Data
  • Sales Marketing Data

31
Operational Control vs. Operational Strategy
  • Data is a source not just of operational control,
    but of operational strategy.
  • Operational strategy is an attempt to describe
    the need, in a competitive and turbulent market,
    to continually innovate and re-align strategy
    with time scales too short to be comprehended by
    strategic planning in the conventional corporate
    sense.

32
Comparison of Control and Strategy Data
  • Operational Data
  • short-lived, rapidly changing
  • requires record-level access
  • repetitive standard transactions and access
    patterns
  • updated in real-time
  • event-driven process generates data
  • Strategic Data
  • long-living, static
  • data aggregated into sets (which is why warehouse
    data is friendly to RDBMS).
  • ad-hoc queries with some periodic reporting
  • updated periodically with mass loads
  • data-driven data governs process.

33
Information Requirements by Management
Level(Source Gorry and Scott Morton)
34
Dimensional Modeling
  • Dimensional Modeling gives us a way to visualize
    data.
  • The CEOs perspective
  • We sell products in various markets, and we
    measure our performance over time.
  • From the data warehouse designers perspective,
    we hear three dimensions
  • We sell Products
  • in various Markets
  • and measure performance over time.

35
Dimensional Modeling - contd.
  • Management may be interested in examining sales
    figures in a certain city by product, by time
    period, by salesperson, and by store.
  • The more dimensions involved, the more difficult
    it is to represent in a single table or graph.
  • The ability to add and modify the dimensions used
    in a table or graph is often known as slicing
    and dicing the data.

36
Dimensional Model of the Business
e
m
i
T
M a r k e t
Product
37
Data Dependencies Model of a Business
Product
Ship To
Shipper
Ship Type
Product Line
District Credit
Contact Location
Order Item
Sales Order
Customer Location
Product Group
Contact
Contract
Customer
Contract Type
Sales Rep
Sales District
Sales Region
Sales Division
38
Transaction Processing
  • The Relation Model was full of promises for equal
    access to data.
  • In the early 1980s the relational model was a
    dream. Typical transaction rates were one per
    second.
  • Today the SABRE system typically processes 4,000
    transactions per second, with peak bursts of
    13,000 per second.
  • OLTP - (On-line Transaction Processing) The point
    is to get data in to the database.

39
Segregating Operational and Warehouse Data
  • In the past, data administrators were constantly
    told to build data sharing, normalization, and
    non-redundant corporate databases.
  • Early attempts at data warehousing tried to share
    the data with transaction-based systems. This
    resulted in LONG response times for complex
    queries.
  • The idea today is to keep the two separate.
  • Separate databases, and perhaps separate DBMS
    products and processor platforms are used.
  • Controlled and practical redundancy is better
    than out-of-control theoretical purity.

40
Fundamental Obstacles With Traditional Systems
  • Systems Integration - Disintegration grew slowly
    from islands of automation.
  • ownership, planning, economic, organizational
    development issues all contribute.
  • Hardware Architecture
  • Inconsistent Data
  • Data Pollution
  • Bad Application Design (semantic and syntactical
    differences).
  • Ownership
  • Data Entry Conventions

41
The Data Warehouse
  • Active, tactical, and current events flow from
    the operational systems to the data warehouse to
    become static, strategic, and historical data.
  • The data warehouse becomes a middle ground
    where a large number of disparate and
    incompatible legacy systems are tied to an
    equally diverse collection of end-user
    workstations.
  • Legacy systems usually comprise a hodge-podge of
    assorted hardware, software, and operational
    systems accumulated over many decades, are by
    nature, incompatible with one another and unique
    to each organization.

42
Practical Facts About the Warehouse
  • The chances are remote that any single vendor
    will be able to develop a product that can
    interface with all legacy systems painlessly
    and seamlessly and at the same time, combine
    data from modern platforms and data external to
    the organization.
  • Instead warehouse product vendors develop
    specialized capabilities to work with various
    environments.

43
Typical Dimensional Model
Product Dimension
Sales Fact
Product_key description brand category
Time_key product_key store_key dollars_sold units_
sold dollars_cost
Time Dimension
Time_key day-of-week month quarter year holiday_fl
ag
Store Dimension
Store_key store_name address floor_plan_type
44
Fact Table
  • Fact Table is where numerical measurements of the
    business are stored.
  • Each measurement is taken at the intersection of
    all the dimensions.
  • The best facts are numeric, continuously valued
    and additive.
  • For every query made against the fact table may
    use hundreds of thousands of individual records
    to construct an answer set.

45
Dimension Tables
  • Dimension tables are where textual descriptions
    of the business are stored.
  • Each textual description helps to describe a
    member of the dimension.
  • Example each member in the product dimension is
    a specific product. The product dimension
    database has many attributes to describe the
    product. A key role of the dimension table
    attribute is to serve as the source of
    constraints in a query.

46
Example
47
Example Query
  • Find all product brands that were sold in the
    first quarter of 1995 and present the total
    dollar sales as well as the number of units.
  • Brand is a collection of individual products.
  • To construct
  • A. Drag attribute brand from product dimension.
    Place as Row Header.
  • B. Drag Dollar Sales and Units Sold from the
    Fact Table, and place to the right of the Brand
    row header.
  • C. Specify row constraint 1st Q 1995 on the
    quarter attribute in the Time Dimension Table.

48
Multidimensionality
  • Examples of dimensions
  • products, salespeople, market segments, business
    units, geographical locations, distribution
    channels, country, industry
  • Examples of Facts or Measures
  • money, sales volume, head count, inventory,
    profit, actual vs. forecast.
  • Examples of Time
  • daily, weekly, monthly, quarterly, yearly

49
Components of a Data Warehouse - 1
  • Acquisition - The first component handles
    acquisition of data from legacy systems and
    outside sources.
  • Data is identified, copied, formatted, and
    prepared for loading into a warehouse.
  • Vendors provide tools for extraction and
    preparation.

50
Components of a Data Warehouse - 2
  • Storage Area - The second component is the
    storage area managed by relational databases,
    multi-dimensional databases, specialized hardware
    - symmetric multiprocessor (SMP) or massively
    parallel processors (MPP) machines - or by
    software.
  • The storage component hold the data so that many
    different data mining, executive information and
    decision support systems can make use of it
    effectively.

51
Components of a Data Warehouse - 3
  • Access - The third component of the warehouse is
    the access area.
  • Different end-user PCs and workstations draw data
    from the warehouse with the help of
    multi-dimensional analysis tools, neural
    networks, data discover tools, or analysis tools.
  • These smart data-mining tools are the driving
    force behind the data warehouse concept.
  • What good is it to store all the information
    without some way to understand it in new and
    different ways.

52
Data Warehouse Access Tools
  • Intelligent Agents and Agencies - tools work and
    think for user.
  • Query Facilities and Managed Query environments.
  • Statistical Analysis - One of the biggest
    surprises in the data warehousing marketplace is
    the resurgence of interest in traditional
    statistical analysis, and the concomitant
    resurrection of the popularity to products like
    SAS and SPSS.

53
Data Warehouse Access Tools - 2
  • Data Discovery - A large class of tools formerly
    classified as decision support, artificial
    intelligence and expert systems. They now make
    use of neural networks, fuzzy logic, decision
    trees, and other tools from advanced mathematics
    to allow a user to sift through massive amounts
    of raw data to discover new, interesting,
    insightful, and in many cases useful things about
    the organization, its operations, and its
    markets.
  • Currently there are nearly 60 different data
    discovery tools/products on the market.

54
Data Warehouse Access Tools - 3
  • OLAP - On-line Analytical Processing often uses
    multi-dimensional spreadsheet tools allowing
    users to look at information from many different
    angles.
  • Users are able to slice and dice reports and to
    look at the same kinds of information at
    different levels at the same time.
  • Typical OLAP application might allow a product
    manager to view sales figures for a given product
    at the national level, see them broken down by
    division, drill down to see territories within a
    division, check sales numbers for each store
    within a territory, and then compare them against
    sales of stores from another territory.

55
Data Warehouse Access Tools - 4
  • Data Visualization - These tools turn ugly,
    boring numbers into exciting visual
    presentations.
  • These tools bring graphical representation to new
    heights. Example Geographical information
    systems turn data about stores, individuals, or
    anything else into compelling, easy to
    understand, dynamic maps.
  • PC-based Geographic Information systems have the
    ability to display spatial occurrences and the
    relationship between and among geographically
    specific variables.

56
Developing the Data Warehouse
  • The most expensive warehousing ventures involve
    major new hardware acquisitions and significant
    investments in training, analysis, and systems
    development costs.
  • Typical startup projects allocate 60 of budget
    for hardware and software for creation of a
    powerful storage component. 30 on data mining
    and acquisition tools.
  • Budgeting for Systems Analysis and Development
    has 50 of budget on acquisition capabilities,
    30 fund user solutions, 20 creation of
    databases in the storage component.

57
Developing the Data Warehouse
  • Clarify what you want to do with the Warehouse -
    How Will It be Used.
  • Scrutinize the offerings of vendors and systems
    integrators. Make sure you understand which
    functions they provide, and which you must build.
  • Most successful projects start as small, tightly
    defined tactical systems to solve pressing
    business needs, and develop into larger systems
    over time.

58
DW Summary Key Concepts
  • The DW is a collection of integrated,
    subject-oriented databases designed to support
    the decision support function where each unit of
    data is non-volatile and relevant to some moment
    in time (W.H. Inmon, 1992).
  • Implicit Assumptions
  • physically separate from operational systems
  • hold aggregated data and transactional (atomic)
    data for management separate from those used for
    OLTP.

59
DW Summary Characteristics
  • Subject-orientation
  • integrated
  • non-volatile (i.e. not updated)
  • time variant (kept for long periods, for
    forecasting and trend analysis)
  • summarized
  • large volume
  • not normalized
  • metadata
  • data sources

60
Data Mining
Write a Comment
User Comments (0)
About PowerShow.com