Title: Data Warehousing
1Data Warehousing
- University of California, Berkeley
- School of Information Management and Systems
- SIMS 257 Database Management
2Lecture Outline
- Review
- Application of Object Relational DBMS the
Berkeley Environmental Digital Library - Data Warehouses
- Introduction to Data Warehouses
- Data Warehousing
- (Based on lecture notes from Joachim Hammer,
University of Florida, and Joe Hellerstein and
Mike Stonebraker of UCB) -
3Lecture Outline
- Review
- Application of Object Relational DBMS the
Berkeley Environmental Digital Library - Data Warehouses
- Introduction to Data Warehouses
- Data Warehousing
- (Based on lecture notes from Joachim Hammer,
University of Florida, and Joe Hellerstein and
Mike Stonebraker of UCB) -
4A Digital Library Infrastructure Model
5UC Berkeley Digital Library Project
- Focus Work-centered digital information
services - Testbed Digital Library for the California
Environment - Research Technical agenda supporting
user-oriented access to large distributed
collections of diverse data types. - Part of the NSF/NASA/DARPA Digital Library
Initiative (Phases 1 and 2)
6The Environmental Library - Contents
- As of late 2002, the collection represents over
one terabyte of data, including over 183,000
digital images, about 300,000 pages of
environmental documents, and over 2 million
records in geographical and botanical databases.
7Botanical Data
- The CalFlora Database contains taxonomical and
distribution information for more than 8000
native California plants. The Occurrence Database
includes over 600,000 records of California plant
sightings from many federal, state, and private
sources. The botanical databases are linked to
the CalPhotos collection of California plants,
and are also linked to external collections of
data, maps, and photos.
8Geographical Data
- Much of the geographical data in the collection
has been used to develop our web-based GIS
Viewer. The Street Finder uses 500,000 Tiger
records of S.F. Bay Area streets along with the
70,000-records from the USGS GNIS database.
California Dams is a database of information
about the 1395 dams under state jurisdiction. An
additional 11 GB of geographical data represents
maps and imagery that have been processed for
inclusion as layers in our GIS Viewer. This
includes Digital Ortho Quads and DRG maps for the
S.F. Bay Area.
9Documents
- Most of the 300,000 pages of digital documents
are environmental reports and plans that were
provided by California state agencies. This
collection includes documents, maps, articles,
and reports on the California environment
including Environmental Impact Reports (EIRs),
educational pamphlets, water usage bulletins, and
county plans. Documents in this collection come
from the California Department of Water Resources
(DWR), California Department of Fish and Game
(DFG), San Diego Association of Governments
(SANDAG), and many other agencies. Among the most
frequently accessed documents are County General
Plans for every California county and a survey of
125 Sacramento Delta fish species.
10Multivalent Documents
11(No Transcript)
12(No Transcript)
13(No Transcript)
14GIS Viewer Example
http//elib.cs.berkeley.edu/annotations/gis/buildi
ngs.html
15(No Transcript)
16(No Transcript)
17(No Transcript)
18Blobworld use regions for retrieval
- We want to find general objects? Represent
images based on coherent regions
19(No Transcript)
20(No Transcript)
21Lecture Outline
- Review
- Application of Object Relational DBMS the
Berkeley Environmental Digital Library - Data Warehouses
- Introduction to Data Warehouses
- Data Warehousing
- (Based on lecture notes from Joachim Hammer,
University of Florida, and Joe Hellerstein and
Mike Stonebraker of UCB) -
22Overview
- Data Warehouses and Merging Information Resources
- What is a Data Warehouse?
- History of Data Warehousing
- Types of Data and Their Uses
- Data Warehouse Architectures
- Data Warehousing Problems and Issues
23Problem Heterogeneous Information Sources
Heterogeneities are everywhere
Personal Databases
World Wide Web
Scientific Databases
Digital Libraries
- Different interfaces
- Different data representations
- Duplicate and inconsistent information
Slide credit J. Hammer
24Problem Data Management in Large Enterprises
- Vertical fragmentation of informational systems
(vertical stove pipes) - Result of application (user)-driven development
of operational systems
Sales Planning
Suppliers
Num. Control
Stock Mngmt
Debt Mngmt
Inventory
...
...
...
Sales Administration
Finance
Manufacturing
...
Slide credit J. Hammer
25Goal Unified Access to Data
Personal Databases
Digital Libraries
Scientific Databases
- Collects and combines information
- Provides integrated view, uniform user interface
- Supports sharing
Slide credit J. Hammer
26The Traditional Research Approach
- Query-driven (lazy, on-demand)
Clients
Metadata
Integration System
. . .
Wrapper
Wrapper
Wrapper
. . .
Source
Source
Source
Slide credit J. Hammer
27Disadvantages of Query-Driven Approach
- Delay in query processing
- Slow or unavailable information sources
- Complex filtering and integration
- Inefficient and potentially expensive for
frequent queries - Competes with local processing at sources
- Hasnt caught on in industry
Slide credit J. Hammer
28The Warehousing Approach
- Information integrated in advance
- Stored in WH for direct querying and analysis
Slide credit J. Hammer
29Advantages of Warehousing Approach
- High query performance
- But not necessarily most current information
- Doesnt interfere with local processing at
sources - Complex queries at warehouse
- OLTP at information sources
- Information copied at warehouse
- Can modify, annotate, summarize, restructure,
etc. - Can store historical information
- Security, no auditing
- Has caught on in industry
Slide credit J. Hammer
30Not Either-Or Decision
- Query-driven approach still better for
- Rapidly changing information
- Rapidly changing information sources
- Truly vast amounts of data from large numbers of
sources - Clients with unpredictable needs
Slide credit J. Hammer
31Data Warehouse Evolution
Building the DW Inmon (1992)
Data Replication Tools
Relational Databases
Company DWs
2000
1995
1990
1985
1980
1960
1975
Information- Based Management
Data Revolution
Middle Ages
Prehistoric Times
TIME
PCs and Spreadsheets
End-user Interfaces
1st DW Article
DW Confs.
Vendor DW Frameworks
Slide credit J. Hammer
32What is a Data Warehouse?
- A Data Warehouse is a
- subject-oriented,
- integrated,
- time-variant,
- non-volatile
- collection of data used in support of management
decision making processes. - -- Inmon Hackathorn, 1994 viz. Hoffer, Chap 11
33DW Definition
- Subject-Oriented
- The data warehouse is organized around the key
subjects (or high-level entities) of the
enterprise. Major subjects include - Customers
- Patients
- Students
- Products
- Etc.
34DW Definition
- Integrated
- The data housed in the data warehouse are defined
using consistent - Naming conventions
- Formats
- Encoding Structures
- Related Characteristics
35DW Definition
- Time-variant
- The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business
36DW Definition
- Non-volatile
- Data in the data warehouse are loaded and
refreshed from operational systems, but cannot be
updated by end-users
37What is a Data Warehouse?A Practitioners
Viewpoint
- A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it in
a business context. - -- Barry Devlin, IBM Consultant
Slide credit J. Hammer
38A Data Warehouse is...
- Stored collection of diverse data
- A solution to data integration problem
- Single repository of information
- Subject-oriented
- Organized by subject, not by application
- Used for analysis, data mining, etc.
- Optimized differently from transaction-oriented
db - User interface aimed at executive decision makers
and analysts
39 Contd
- Large volume of data (Gb, Tb)
- Non-volatile
- Historical
- Time attributes are important
- Updates infrequent
- May be append-only
- Examples
- All transactions ever at WalMart
- Complete client histories at insurance firm
- Stockbroker financial information and portfolios
Slide credit J. Hammer
40Warehouse is a Specialized DB
- Standard DB
- Mostly updates
- Many small transactions
- Mb - Gb of data
- Current snapshot
- Index/hash on p.k.
- Raw data
- Thousands of users (e.g., clerical users)
- Warehouse
- Mostly reads
- Queries are long and complex
- Gb - Tb of data
- History
- Lots of scans
- Summarized, reconciled data
- Hundreds of users (e.g., decision-makers,
analysts)
Slide credit J. Hammer
41Summary
Business Information Guide
Business Information Interface
Data Warehouse
Data Warehouse Catalog
Data Warehouse Population
Operational Systems
Enterprise Modeling
Slide credit J. Hammer
42Warehousing and Industry
- Warehousing is big business
- 2 billion in 1995
- 3.5 billion in early 1997
- Predicted 8 billion in 1998 Metagroup
- WalMart has largest warehouse
- 900-CPU, 2,700 disk, 23 TB Teradata system
- 7TB in warehouse
- 40-50GB per day
Slide credit J. Hammer
43Types of Data
- Business Data - represents meaning
- Real-time data (ultimate source of all business
data) - Reconciled data
- Derived data
- Metadata - describes meaning
- Build-time metadata
- Control metadata
- Usage metadata
- Data as a product - intrinsic meaning
- Produced and stored for its own intrinsic value
- e.g., the contents of a text-book
Slide credit J. Hammer
44Data Warehousing Architecture
45Ingest
46Data Warehouse Architectures Conceptual View
- Single-layer
- Every data element is stored once only
- Virtual warehouse
- Two-layer
- Real-time derived data
- Most commonly used approach in
- industry today
Slide credit J. Hammer
47Three-layer Architecture Conceptual View
- Transformation of real-time data to derived data
really requires two steps
View level Particular informational needs
Physical Implementation of the Data Warehouse
Slide credit J. Hammer
48Issues in Data Warehousing
- Warehouse Design
- Extraction
- Wrappers, monitors (change detectors)
- Integration
- Cleansing merging
- Warehousing specification Maintenance
- Optimizations
- Miscellaneous (e.g., evolution)
Slide credit J. Hammer
49Data Warehousing Two Distinct Issues
- (1) How to get information into warehouse
- Data warehousing
- (2) What to do with data once its in warehouse
- Warehouse DBMS
- Both rich research areas
- Industry has focused on (2)
Slide credit J. Hammer
50Data Extraction
- Source types
- Relational, flat file, WWW, etc.
- How to get data out?
- Replication tool
- Dump file
- Create report
- ODBC or third-party wrappers
Slide credit J. Hammer
51Wrapper
- Converts data and queries from one data model to
another
Queries
Data Model A
Data
- Extends query capabilities for sources with
limited capabilities
Queries
Slide credit J. Hammer
52Wrapper Generation
- Solution 1 Hard code for each source
- Solution 2 Automatic wrapper generation
Wrapper Generator
Definition
Wrapper
Slide credit J. Hammer
53Data Transformations
- Convert data to uniform format
- Byte ordering, string termination
- Internal layout
- Remove, add reorder attributes
- Add key
- Add data to get history
- Sort tuples
Slide credit J. Hammer
54Monitors
- Goal Detect changes of interest and propagate to
integrator - How?
- Triggers
- Replication server
- Log sniffer
- Compare query results
- Compare snapshots/dumps
Slide credit J. Hammer
55Data Integration
- Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse - Rule-based
- Actions
- Resolve inconsistencies
- Eliminate duplicates
- Integrate into warehouse (may not be empty)
- Summarize data
- Fetch more data from sources (wh updates)
- etc.
Slide credit J. Hammer
56Data Cleansing
- Find ( remove) duplicate tuples
- e.g., Jane Doe vs. Jane Q. Doe
- Detect inconsistent, wrong data
- Attribute values that dont match
- Patch missing, unreadable data
- Notify sources of errors found
Slide credit J. Hammer
57Warehouse Maintenance
- Warehouse data ? materialized view
- Initial loading
- View maintenance
- View maintenance
Slide credit J. Hammer
58Differs from Conventional View Maintenance...
- Warehouses may be highly aggregated and
summarized - Warehouse views may be over history of base data
- Process large batch updates
- Schema may evolve
Slide credit J. Hammer
59Differs from Conventional View Maintenance...
- Base data doesnt participate in view maintenance
- Simply reports changes
- Loosely coupled
- Absence of locking, global transactions
- May not be queriable
Slide credit J. Hammer
60Warehouse Maintenance Anomalies
- Materialized view maintenance in loosely coupled,
non-transactional environment - Simple example
Slide credit J. Hammer
61Warehouse Maintenance Anomalies
Slide credit J. Hammer
62Maintenance Anomaly - Solutions
- Incremental update algorithms (ECA, Strobe, etc.)
- Research issues Self-maintainable views
- What views are self-maintainable
- Store auxiliary views so original auxiliary
views are self-maintainable
Slide credit J. Hammer
63Self-Maintainability Examples
- Sold(item,clerk,age)
- Sale(item,clerk) Emp(clerk,age)
- Inserts into Emp
- If Emp.clerk is key and Sale.clerk is foreign key
(with ref. int.) then no effect - Inserts into Sale
- Maintain auxiliary view
- Emp-?clerk,age(Sold)
- Deletes from Emp
- Delete from Sold based on clerk
Slide credit J. Hammer
64Self-Maintainability Examples
- Deletes from Sale
- Delete from Sold based on item,clerk
- Unless age at time of sale is relevant
- Auxiliary views for self-maintainability
- Must themselves be self-maintainable
- One solution all source data
- But want minimal set
-
Slide credit J. Hammer
65Partial Self-Maintainability
- Avoid (but dont prohibit) going to sources
- SoldSale(item,clerk) Emp(clerk,age)
- Inserts into Sale
- Check if clerk already in Sold, go to source if
not - Or replicate all clerks over age 30
- Or ...
Slide credit J. Hammer
66Warehouse Specification (ideally)
View Definitions
Warehouse Configuration Module
Warehouse
Integration rules
Integrator
Metadata
Change Detection Requirements
Extractor/ Monitor
Extractor/ Monitor
Extractor/ Monitor
...
Slide credit J. Hammer
67Optimization
- Update filtering at extractor
- Similar to irrelevant updates in constraint and
view maintenance - Multiple view maintenance
- If warehouse contains several views
- Exploit shared sub-views
Slide credit J. Hammer
68Additional Research Issues
- Historical views of non-historical data
- Expiring outdated information
- Crash recovery
- Addition and removal of information sources
- Schema evolution
Slide credit J. Hammer
69More Information on DW
- Agosta, Lou, The Essential Guide to Data
Warehousing. Prentise Hall PTR, 1999. - Devlin, Barry, Data Warehouse, from Architecture
to Implementation. Addison-Wesley, 1997. - Inmon, W.H., Building the Data Warehouse. John
Wiley, 1992. - Widom, J., Research Problems in Data
Warehousing. Proc. of the 4th Intl. CIKM Conf.,
1995. - Chaudhuri, S., Dayal, U., An Overview of Data
Warehousing and OLAP Technology. ACM SIGMOD
Record, March 1997.