Chapters 24, 26, 27 Database System Architectures - PowerPoint PPT Presentation

1 / 102
About This Presentation
Title:

Chapters 24, 26, 27 Database System Architectures

Description:

Chapters 24, 26, 27 Database System Architectures Prof. Steven A. Demurjian, Sr. Computer Science & Engineering Department The University of Connecticut – PowerPoint PPT presentation

Number of Views:548
Avg rating:3.0/5.0
Slides: 103
Provided by: Stev114
Category:

less

Transcript and Presenter's Notes

Title: Chapters 24, 26, 27 Database System Architectures


1
Chapters 24, 26, 27Database System Architectures
Prof. Steven A. Demurjian, Sr. Computer Science
Engineering Department The University of
Connecticut 191 Auditorium Road, Box
U-155 Storrs, CT 06269-3155
steve_at_engr.uconn.edu http//www.engr.uconn.edu/st
eve (860) 486 - 4818
  • A portion of these slides are being used with the
    permission of Dr. Ling Lui, Associate Professor,
    College of Computing, Georgia Tech.
  • Remaining slides represent new material.

2
Classical and Distributed Architectures
  • Classic/Centralized DBMS Dominated the Commercial
    Market from 1970s Forward
  • Problems of this Approach
  • Difficult to Scale w.r.t. Performance Gains
  • If DB Overloaded, replace with a Faster Computer
  • this can Only Go So Far - Disk Bottlenecks
  • Distributed DBMS have Evolved to Address a Number
    of Issues
  • Improved Performance
  • Putting Data Near Location where it is Needed
  • Replication of Data for Fault Tolerance
  • Vertical and Horizontal Partitioning of DB Tuples

3
Common Features of Centralized DBMS
  • Data Independence
  • High-Level Representation via Conceptual and
    External Schemas
  • Physical Representation (Internal Schema) Hidden
  • Program Independence
  • Multiple Applications can Share Data
  • Views/External Schema Support this Capability
  • Reduction of Program/Data Redundancy
  • Single, Unique, Conceptual Schema
  • Shared Database
  • Almost No Data Redundancy
  • Controlled Data Access Reduces Inconsistencies
  • Programs Execute with Consistent Results

4
Common Features of Centralized DBMS
  • Promote Sharing Automatically Provided via CC
  • No Longer Programmatic Issue
  • Most DBMS Offer Locking for Key Shared Data
  • Oracle Allows Locks on Data Item (Attributes)
  • For Example, Controlling Access to Shared
    Identifier
  • Coherent and Central DB Administration
  • Semantic DB Integrity via the Automatic
    Enforcement of Data Consistency via Integrity
    Constraints/Rules
  • Data Resiliency
  • Physical Integrity of Data in the Presence of
    Faults and Errors
  • Supported by DB Recovery
  • Data Security Control Access for Authorized
    Users Against Sensitive Data

5
Shared Nothing Architecture
  • In this Architecture, Each DBMS Operates
    Autonomously
  • there is No Sharing
  • Three Separate DBMSs on Three Different Computers
  • Applications/Clients Must Know About the External
    Schemas of all Three DBMSs for
  • Database Retrieval
  • Client Processing
  • Complicates Client
  • Different DBMS Platforms(Oracle, Sybase,
    Informix, ..)
  • Different Access Modes(Query, Embedded, ODBC)
  • Difficult for SWE to Code

6
NW Architecture with Centralized DB
  • High-Speed NWs/WANs Spawned Centralized DB
    Accessible Worldwide
  • Clients at Any Site can Access Repository
  • Data May be Far Away - Increased Access Time
  • In Practice, Each Remote Site Needs only Portion
    of the Data in DB1 and/or DB2
  • Inefficient, no Replication w.r.t. Failure

7
Fully Distributed Architecture
  • The Five Sites (Chicago, SF, LA, NY, Atlanta)
    each have a Portion of the Database - its
    Distributed
  • Replication is Possible for Fault Tolerance
  • Queries at one Site May Need to Access Data at
    Another Site (e.g., for a Join)
  • Increased Transaction Processing Complexity

8
Goals of DDBMS
  • Support User Distribution Across Multiple Sites
  • Remote Access by Users Regardless of Location
  • Distribution and Replication of Database Content
  • Provide Location Transparency
  • Users Manipulate their Own Data
  • Non-Local Sites Appear Local to Any User
  • Provide Transaction Control Akin to Centralized
    Case
  • Transaction Control Hides Distribution
  • CC and Serializability - Must be Extended
  • Minimize Communications Cost
  • Optimize Use of Network - a Critical Issue
  • Distribute DB Design Supported by Partitioning
    (Fragmentation) and Replication

9
Goals of DDBMS
  • Improve Response Time for DB Access
  • Use a More Sophisticated Load Control for
    Transaction Processing
  • However, Synchronization Across Sites May
    Introduce Additional Overhead
  • System Availability
  • Site Independence in the Presence of Site Failure
  • Subset of Database is Always Available
  • Replication can Keep All Data Available, Even
    When Multiple Sites Fail
  • Modularity
  • Incremental Growth with the Addition of Sites
  • Dedicate Sites to Specific Tasks

10
Advantages of DDBMS
  • There are Four Major Advantages
  • Transparency
  • Distribution/NW Transparency
  • User Doesnt Know about NW Configuration
    (Location Transparency)
  • User can Find Object at any Site (Naming
    Transparency)
  • Replication Transparency (see next PPT)
  • User Doesnt Know Location of Data
  • Replicas are Transparently Accessible
  • Fragmentation Transparency
  • Horizontal Fragmentation (Distribute by Row)
  • Vertical Fragmentation (Distribute by Column)

11
Data Distribution and Replication
12
Advantages of DDBMS
  • Increased Reliability and Availability
  • Reliability - System Always Running
  • Availability - Data Always Present
  • Achieved via Replication and Distribution
  • Improved Performance
  • Sites Able to Utilize Data that is Local for
    Majority of Queries
  • Easier Expansion
  • Improve Performance of Site by
  • Upgrading Processor of Computer
  • Adding Additional Disks
  • Splitting a Site into Two or More Sites
  • Expansion over Time as Business Grows

13
Challenges of DDBMS
  • Tracking Data - Meta Data More Complex
  • Must Track Distribution (where is the Data)
  • V H Fragmentation (How is Data Split)
  • Replication (Multiple Copies for Consistency)
  • Distributed Query Processing
  • Optimization, Accessibility, etc., More Complex
  • Block Analysis of Data Size Must also Now
    Consider the NW Transmitting Time
  • Distributed Transaction Processing
  • TP Potentially Spans Multiple Sites
  • Submit Query to Multiple Sites
  • Collect and Collate Results

14
Challenges of DDBMS
  • Replicated Data Management
  • TP Must Choose the Replica to Access
  • Updates Must Modify All Replica Copies
  • Distributed Database Recovery
  • Recovery of Individual Sites
  • Recovery Across DDBMS
  • Security
  • Local and Remote Authorization
  • During TP, be Able to Verify Remote Privileges
  • Distributed Directory Management
  • Meta-Data on Database - Local and Remote
  • Must maintain Replicas of this - Every Site
    Tracks the Meta-Data for All Sites

15
A Complete Schema with Keys ...
Keys Allow us to Establish Links Between
Relations
what is this Similar to in ER?
16
and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
A S C null W B null null
1 4 5 5
17
with Remaining DB Tables
18
What is Fragmentation?
  • Fragmentation Divides a DB Across Multiple Sites
  • Two Types of Fragmentation
  • Horizontal Fragmentation
  • Given a Relation R with n Total Tuples, Spread
    Entire Tuples Across Multiple Sites
  • Each Site has a Subset of the n Tuples
  • Essentially Fragmentation is a Selection
  • Vertical Fragmentation
  • Given a Relation R with m Attributes and n Total
    Tuples, Spread the Columns Across Multiple Sites
  • Essentially Fragmentation is a Projection
  • In Both Cases, Sites can Overlap for Replication

19
Horizontal Fragmentation
  • Site 2 Tracks All Information Related to Dept. 5

20
Horizontal Fragmentation
  • Site 3 Tracks All Information Related to Dept. 4
  • Note that an Employee Could be Listed in Both
    Cases, if s/he Works on a Project for Both
    Departments

21
Refined Horizontal Fragmentation
  • Further Fragment from Site 2 based on Dept. that
    Employee Works in
  • Notice that G1 G2 G3 is the Same as WORKS_ON5
  • there is no Overlap

22
Refined Horizontal Fragmentation
  • Further Fragment from Site 3 based on Dept. that
    Employee Works in
  • Notice that G4 G5 G6 is the Same as WORKS_ON4
  • Note Some Fragments can be Empty

23
Homogeneous DDBMS
  • Homogeneous
  • Identical Software (w.r.t. Database)
  • One DB Product (e.g., Oracle) is Distributed and
    Available at All Sites
  • Uniformity w.r.t. Administration, Maintenance,
    Client Access, Users, Security, etc.
  • Interaction by Programmatic Clients is Consistent
    (e.g., JDBC or ODBC or )

24
Non-Federated Heterogeneous DDBMS
  • Non-Federated Heterogeneous
  • Different Software (w.r.t. Database)
  • Multiple DB Products (e.g., Oracle at One Site,
    Access another, Sybase, Informix, etc.)
  • Replicated Administration (e.g., Users Needs
    Accounts on Multiple Systems)
  • Varied Programmatic Access - SWEs Must Know All
    Platforms/Client Software Complicated

25
Federated DDBMS
  • Federated
  • Multiple DBMS Platforms Overlaid with a Global
    Schema View
  • Single External Schema Combines Schemas from all
    Sites
  • Multiple Data Models
  • Relational in one Component DBS
  • Object in another Component DBS
  • Hierarchical in a 3rd DBS

26
Federated DBMS Issues
  • Differences in Data Models
  • Reconcile Relation vs. Object-Oriented Models
  • Each Different Model has Different Capabilities
  • These Differences Must be Addressed in Order to
    Present a Federated Schema
  • Differences in Constraints
  • Referential Integrity Constraints in Different
    DBSs
  • Different Constraints on Similar Data
  • Federated Schema Must Deal with these Conflicts
  • Differences in Query Languages
  • SQL-89, SQL-92, SQL2, SQL3
  • Specific Types in Different DBMS (Blobs in Oracle)

27
Databases on WWW
  • Web has changed the way we do Business Research
  • Facts
  • Industry Saw an Opportunity, knew it had to Move
    Quickly to Capitalize
  • Lots of Action, Lots of Money, Lots of Releases
  • Line Between RD is Very Narrow
  • Many Researchers Moved to Industry (Trying to
    Return Back to Academia)
  • Emergence of Java
  • Java changed the way that Software was Designed,
    Developed, and Utilized
  • Particularly w.r.t. Web-Based Applications,
    Database Interoperability, Web Architectures,
    etc.
  • Emergence of Enterprise Computing

28
Internet and the Web
  • A Major Opportunity for Business
  • A Global Marketplace
  • Business Across State and Country Boundaries
  • A Way of Extending Services
  • Online Payment vs. VISA, Mastercard
  • A Medium for Creation of New Services
  • Publishers, Travel Agents, Teller, Virtual Yellow
    Pages, Online Auctions
  • A Boon for Academia
  • Research Interactions and Collaborations
  • Free Software for Classroom/Research Usage
  • Opportunities for Exploration of Technologies in
    Student Projects

29
Market Forecast New Internet Servers
30
WWW Three Market Segments
Server
  • Business to Business
  • Information sharing
  • Ordering info./status
  • Targeted electronic commerce

Corporate Network
Server
Internet
  • Intranet
  • Decision support
  • Mfg.. System monitoring
  • corporate repositories
  • Workgroups

Corporate Network
Server
Server
  • Internet
  • Sales
  • Marketing
  • Information
  • Services


31
Information Delivery Problems on the Net
  • Everyone can Publish Information on the Web
    Independently at Any Time
  • Consequently, there is an Information Explosion
  • Identifying Information Content More Difficult
  • There are too Many Search Engines but too Few
    Capable of Returning High Quality Data
  • Most Search Engines are Useful for Ad-hoc
    Searches but Awkward for Tracking Changes

32
Example Web Applications
  • Scenario 1 World Wide Wait
  • A Major Event is Underway and the Latest,
    Up-to-the Minute Results are Being Posted on the
    Web
  • You Want to Monitor the Results for this
    Important Event, so you Fire up your Trusty Web
    Browser, Pointing at the Result Posting Site, and
    Wait, and Wait, and Wait
  • What is the Problem?
  • The Scalability Problems are the Result of a
    Mismatch Between the Data Access Characteristics
    of the Application and the Technology Used to
    Implement the Application

33
Example Web Applications
  • Scenario 2
  • Many Applications Today have the Need for
    Tracking Changes in Local and Remote Data Sources
    and Notifying Changes If Some Condition Over the
    Data Source(s) is Met
  • If You Want to Monitor the Changes on Web, You
    Need to Fire Your Trusty Web Browser from Time to
    Time, and Cache the Most Recent Result, and do
    the Difference Manually Each Time You Poll the
    Data Source(s)
  • What is the Problem?
  • Pure Pull is Not the Answer to All Problems

34
What is the Problem?
  • Applications are Asymmetric but the Web is Not
  • Computation Centric vs. Information Flow Centric
  • Type of Asymmetry
  • Network Asymmetry
  • Satellite, CATV, Mobile Clients, Etc.
  • Client to Server Ratio
  • Too Many Clients can Swamp Servers
  • Data Volume
  • Mouse and Key Click vs. Content Delivery
  • Update and Information Creation
  • Clients Need to be Informed or Must Poll

35
Useful Solutions
  • Combination/Interleave of Pull and Push Protocols
  • User-initiated, Comprehensive Search-based
    Information Delivery (Pull)
  • Server-initiated Information Dissemination (Push)
  • Provide Support for a Variety of Data Delivery
    Protocols, Frequencies, and Delivery Modes
  • Information Delivery Frequencies
  • Periodic, Conditional, Ad-Hoc
  • Information Delivery Modes
  • Information Delivery Protocols (IDP)
  • Request/Respond, Polling, Publish/Subscribe,
    Broadcast
  • Information Delivery Styles (IDS)
  • Pull, Push, Hybrid

36
Information Delivery Frequencies
  • Periodic
  • Data is Delivered from a Server to Clients
    Periodically
  • Period can be Defined by System-default or by
    Clients Using their Profiles
  • Period can be Influenced by Client and Bandwidth
  • PDA/Cell Phone vs. PC w/Modem
  • PC w/DSL vs. PC w/Cable Modem
  • Conditional (Aperiodic)
  • Data is Delivered from a Server when Conditions
    Installed by Clients in their Profiles are
    Satisfied
  • Ad-hoc (or Irregular)

37
Information Delivery Modes
  • Uni-cast
  • Data is Sent from a Data Source (a Single Server)
    to Another Machine
  • 1-to-n
  • Data is Sent by a Single Data Source and Received
    by Multiple Machines
  • Multicast vs. Broadcast
  • Multicast Data is Sent to a Specific Set of
    Clients
  • Broadcast Sending Data Over a Medium which an
    Unidentified or Unbounded Set of Clients can
    Listen

38
IDP Request/Respond
  • Semantics of Request/Respond
  • Clients Send their Request to Servers to Ask the
    Information of their Interest
  • Servers Respond to the Client Request by
    Delivering the Information Requested
  • Client can Wait (Synchronous) or Not
  • Applications
  • Most Database Systems and Web Search Engines are
    Using the Request/Respond Protocol for
    Client-Server Communication

39
IDP Programmed Polling vs. User Polling
  • Semantics
  • Programmed Polling a System Periodically Sends
    Requests to Other Sites to Obtain Status
    Information or Detect Changed Values
  • User Polling a User or Application Periodically
    or Aperiodically Polls the Data Sites and Obtains
    the Changes
  • Applications
  • Programmed Polling Save the Users from having to
    Click, but does Nothing to Solve the Scalability
    Problems Caused by the Request/Respond Mechanism

40
IDP Publish/Subscribe
  • Semantics Servers Publish/Clients Subscribe
  • Servers Publish Information Online
  • Clients Subscribe to the Information of Interest
    (Subscription-based Information Delivery)
  • Data Flow is Initiated by the Data Sources
    (Servers) and is Aperiodic
  • Danger Subscriptions can Lead to Other Unwanted
    Subscriptions
  • Applications
  • Unicast Database Triggers and Active Databases
  • 1-to-n Online News Groups

41
IDP Broadcast
  • Semantics Server Broadcast/Client(s) Listen
  • Clients who Require Access to a Data Item Need to
    Wait Until the Item Appears
  • Selective Broadcast vs. Random Broadcast
  • Applications
  • Unicast Doesnt Address Network Bandwidth
    Problems
  • 1-to-n Most Likely to Prevail

42
Information Delivery Styles
  • Pull-Based System
  • Transfer of Data from Server to Client is
    Initiated by a Client Pull
  • Clients Determine when to Get Information
  • Potential for Information to be Old Unless Client
    Periodically Pulls
  • Push-Based System
  • Transfer of Data from Server to Client is
    Initiated by a Server Push
  • Clients may get Overloaded if Push is Too
    Frequent
  • Hybrid
  • Pull and Push Combined
  • Pull First and then Push Continually

43
Summary Pull vs. Push
Request/
Publish/
Broadcast
Periodic
Conditional
Ad-hoc
Respond
Subscribe
Pure Pull
Y
Y
Pure Push
Y
Y
Y
Y
Hybrid
Y
Y
Y
Y
Y
Y
44
Design Options for Nodes
  • Three Types of Nodes
  • Data Sources
  • Provide Base Data which is to be Disseminated
  • Clients
  • Who are the Net Consumers of the Information
  • Information Brokers
  • which Acquire Information from Other Data
    Sources, Add Value to that Information and then
    Distribute this Information to Other Consumers
  • By Creating a Hierarchy of Brokers, Information
    Delivery can be Tailored to the Need of Many Users

45
The Next Big Challenge
  • Interoperability
  • Heterogeneous Distributed Databases
  • Heterogeneous Distributed Systems
  • Autonomous Applications
  • Scalability
  • Rapid and Continuous Growth
  • Amount of Data
  • Variety of Data Types
  • Different Privacy Levels or Ownerships of Data

46
Interoperability A Classic View
Simple Federation
Multiple Nested Federation
FDB Global Schema
FDB Global Schema 4
Federated Integration
Federated Integration
Local Schema
Local Schema
Local Schema
FDB 1
Local Schema
FDB3
Federation
Federation
47
CORBA-Based Development
IDL file
Object Implementation
Client Application
IDL Compiler
IDL Compiler
Stub
Skeleton
ORB/IIOP
ORB/IIOP
48
ORB IntegrationJava Client Legacy Application
Java Client
Legacy Application
Java Wrapper
Object Request Broker (ORB)
CORBA is the Medium of Info. Exchange Requires
Java/CORBA Capabilities
49
Java Client with Wrapper to Legacy Application
Java Client
Interactions Between Java Client and Legacy Appl.
via C and RPC C is the Medium of Info.
Exchange Java Client with C/C Wrapper
Java Application Code
WRAPPER
Mapping Classes
JAVA LAYER
NATIVE LAYER
Native Functions (C) RPC Client Stubs (C)
Legacy Application
Network
50
COTS and Legacy Appls. to Java Clients
COTS Application
Legacy Application
Java Application Code
Java Application Code
Native Functions that Map to COTS Appl
Native Functions that Map to Legacy Appl
NATIVE LAYER
NATIVE LAYER
JAVA LAYER
JAVA LAYER
Mapping Classes
Mapping Classes
JAVA NETWORK WRAPPER
JAVA NETWORK WRAPPER
Network
Java Client
Java Client
Java is Medium of Info. Exchange - C/C Appls
with Java Wrappers
51
Database Interoperability in the Internet
  • Technology
  • Web/HTTP, JDBC/ODBC, CORBA (ORBs IIOP), XML
  • Architecture

52
Java Client to Legacy App via RDBS
Transformed Legacy Data
Java Client
Relational Database System(RDS)
Updated Data
Extract and Generate Data
Transform and Store Data
Legacy Application
53
JDBC
  • JDBC API Provides DB Access Protocols for Open,
    Query, Close, etc.
  • Different Drivers for Different DB Platforms

JDBC API
Java Application
Driver Manager
Driver
Driver
Oracle
Sybase
Access
54
Connecting a DB to the Web
  • Web Server are Stateless
  • DB Interactions Tend to be Stateful
  • Invoking a CGI Script on Each DB Interaction is
    Very Expensive, Mainly Due to the Cost of DB Open

DBMS
CGI Script Invocation or JDBC Invocation
Web Server
Internet
Browser
55
Connecting More Efficiently
  • To Avoid Cost of Opening Database, One can Use
    Helper Processes that Always Keep Database Open
    and Outlive Web Connection
  • Newly Invoked CGI Scripts Connect to a
    Preexisting Helper Process
  • System is Still Stateless

DBMS
Helper Processes
CGI Script or JDBC Invocation
Web Server
Internet
Browser
56
DB-Internet Architecture
WWW Client (Netscape)
WWW Client (HotJava)
WWW client (Info. Explore)
Internet
HTTP Server
DBWeb Gateway
DBWeb Gateway
DBWeb Dispatcher
DBWeb Gateway
DBWeb Gateway
57
EJB Architecture
58
Technology Push
  • Computer/Communication Technology (Almost Free)
  • Plenty of Affordable CPU, Memory, Disk, Network
    Bandwidth
  • Next Generation Internet Gigabit Now
  • Wireless Ubiquitous, High Bandwidth
  • Information Growth
  • Massively Parallel Generation of Information on
    the Internet and from New Generation of Sensors
  • Disk Capacity on the Order of Peta-bytes
  • Small, Handy Devices to Access Information

59
More Information Moves to Cyberspace
  • What Grows Faster than Internet?
  • Why?
  • Low Rent 10x Cheaper100 Letters (1MB) Disk
    10 File Cabinet 5001 Picture Disk
    10 Printed 40
  • Easy Access and Search
  • Robot can Find All Docs Matching a Predicate
  • Access from Anywhere
  • Human Costs 15/hr
  • How Much Information is there?

Gray, Turing Lecture 1998, ACM FCRC Atlanta
60
Sizes of Search Engines
KEY AVAltaVista, NLNorthern Light,
INKInktomi, EXExcite, LYLycos, ISInfoseek,
WCWebCrawler
as of February, 1999
61
Research Challenges
  • Inherent Complexity
  • Coping with Latency (Sometimes Unpredictable)
  • Failure Detection and Recovery (Partial Failure)
  • Concurrency, Load Balancing, Availability, Scale
  • Service Partitioning
  • Ordering of Distributed Events
  • Accidental Complexity
  • Heterogeneity Beyond the Local Case Platform,
    Protocol, Plus All Local Heterogeneity in Spades.
  • Autonomy Change and Evolve Autonomously
  • Tool Deficiencies Language Support
    (Sockets,rpc), Debugging, Etc.

Ubiquitous/Pervasive Many computers and
information appliances everywhere, networked
together
62
Infosphere
Problem too many sources,too much information
63
Current State-of-Art
Web Server
Mainframe Database Server
Thin Client
64
Infosphere Scenario
Infotaps Fat Clients
Sensors
Many sources
Database Server
65
Heterogeneity and Autonomy
  • Heterogeneity
  • How Much can we Really Integrate?
  • Syntactic Integration
  • Different Formats and Models
  • Web/SQL Query Languages
  • Semantic Interoperability
  • Basic Research on Ontology, Etc
  • Autonomy
  • No Central DBA on the Net
  • Independent Evolution of Schema and Content
  • Interoperation is Voluntary
  • Interface Technology (Support for Isvs)
  • DCOM Microsoft Standard
  • CORBA, Etc...

66
Security and Data Quality
  • Security
  • System Security in the Broad Sense
  • Attacks Penetrations, Denial of Service
  • System (and Information) Survivability
  • Security Fault Tolerance
  • Replication for Performance, Availability, and
    Survivability
  • Data Quality
  • Web Data Quality Problems
  • Local Updates with Global Effects
  • Unchecked Redundancy (Mutual Copying)
  • Registration of Unchecked Information
  • Spam on the Rise

67
Legacy Data Challenge
  • Legacy Applications and Data
  • Definition Important and Difficult to Replace
  • Typically, Mainframe Mission Critical Code
  • Most are OLTP and Database Applications
  • Evolution of Legacy Databases
  • Client-server Architectures
  • Wrappers
  • Expensive and Gradual in Any Case

68
Potential Value Added/Jumping on Bandwagon
  • Sophisticated Query Capability
  • Combining SQL with Keyword Queries
  • Consistent Updates
  • Atomic Transactions and Beyond
  • But Everything has to be in a Database!
  • Only If we Stick with Classic DB Assumptions
  • Relaxing DB Assumptions
  • Interoperable Query Processing
  • Extended Transaction Updates
  • Commodities DB Software
  • A Little Help is Still Good If it is Cheap
  • Internet Facilitates Software Distribution
  • Databases as Middleware

69
Data Warehousing and Data Mining
  • Data Warehousing
  • Provide Access to Data for Complex Analysis,
    Knowledge Discovery, and Decision Making
  • Underlying Infrastructure in Support of Mining
  • Provides Means to Interact with Multiple DBs
  • OLAP (on-Line Analytical Processing) vs. OLTP
  • Data Mining
  • Discovery of Information in a Vast Data Sets
  • Search for Patterns and Common Features based
  • Discover Information not Previously Known
  • Medical Records Accessible Nationwide
  • Research/Discover Cures for Rare Diseases
  • Relies on Knowledge Discovery in DBs (KDD)

70
Data Warehousing and OLAP
  • A Data Warehouse
  • Database is Maintained Separately from an
    Operational Database
  • A Subject-Oriented, Integrated, Time-Variant,
    and Non-Volatile Collection of Data in Support
    for Managements Decision Making Process
    W.H.Inmon
  • OLAP (on-Line Analytical Processing)
  • Analysis of Complex Data in the Warehouse
  • Attempt to Attain Value through Analysis
  • Relies on Trained and Adept Skilled Knowledge
    Workers who Discover Information
  • Data Mart
  • Organized Data for a Subset of an Organization

71
Building a Data Warehouse
  • Option 1
  • Leverage Existing Repositories
  • Collate and Collect
  • May Not Capture All Relevant Data
  • Option 2
  • Start from Scratch
  • Utilize Underlying Corporate Data

Corporate data warehouse
Option 1 Consolidate Data Marts
Option 2 Build from scratch
Data Mart
Data Mart
Data Mart
Data Mart
...
Corporate data
72
Data Warehouse Characteristics
  • Utilizes a Multi-Dimensional Data Model
  • Warehouse Comprised of
  • Store of Integrated Data from Multiple Sources
  • Processed into Multi-Dimensional Model
  • Warehouse Supports of
  • Times Series and Trend Analysis
  • Super-Excel Integrated with DB Technologies
  • Data is Less Volatile than Regular DB
  • Doesnt Dramatically Change Over Time
  • Updates at Regular Intervals
  • Specific Refresh Policy Regarding Some Data

73
Three Tier Architecture
monitor
OLAP Server
integrator
External data sources
Summarization report
Extraxt Transform Load Refresh
Operational databases
Data Warehouse
serve
Query report
Data mining
metadata
Data marts
74
Data Warehouse Design
  • Most of Data Warehouses use a Start Schema to
    Represent Multi-Dimensional Data Model
  • Each Dimension is Represented by a Dimension
    Table that Provides its Multidimensional
    Coordinates and Stores Measures for those
    Coordinates
  • A Fact Table Connects All Dimension Tables with a
    Multiple Join
  • Each Tuple in Fact Table Represents the Content
    of One Dimension
  • Each Tuple in the Fact Table Consists of a
    Pointer to Each of the Dimensional Tables
  • Links Between the Fact Table and the Dimensional
    Tables for a Shape Like a Star

75
What is a Multi-Dimensional Data Cube?
  • Representation of Information in Two or More
    Dimensions
  • Typical Two-Dimensional - Spreadsheet
  • In Practice, to Track Trends or Conduct Analysis,
    Three or More Dimensions are Useful

76
Multi-Dimensional Schemas
  • Supporting Multi-Dimensional Schemas Requires Two
    Types of Tables
  • Dimension Table Tuples of Attributes for Each
    Dimension
  • Fact Table Measured/Observed Variables with
    Pointers into Dimension Table
  • Star Schema
  • Characterizes Data Cubes by having a Single Fact
    Table for Each Dimension
  • Snowflake Schema
  • Dimension Tables from Star Schema are Organized
    into Hierarchy via Normalization
  • Both Represent Storage Structures for Cubes

77
Example of Star Schema
Date
Sale Fact Table
Date Month Year
Date Product Store Customer Unit_Sales Dollar_Sale
s
Store
StoreID City State Country Region
78
A Second Example of Star Schema
79
and Corresponding Snowflake Schema
80
Data Warehouse Issues
  • Data Acquisition
  • Extraction from Heterogeneous Sources
  • Reformatted into Warehouse Context - Names,
    Meanings, Data Domains Must be Consistent
  • Data Cleaning for Validity and Qualityis the
    Data as Expected w.r.t. Content? Value?
  • Transition of Data into Data Model of Warehouse
  • Loading of Data into the Warehouse
  • Other Issues Include
  • How Current is the Data? Frequency of Update?
  • Availability of Warehouse? Dependencies of Data?
  • Distribution, Replication, and Partitioning
    Needs?
  • Loading Time (Clean, Format, Copy, Transmit,
    Index Creation, etc.)?

81
OLAP Strategies
  • OLAP Strategies
  • Roll-Up Summarization of Data
  • Drill-Down from the General to Specific
    (Details)
  • Pivot Cross Tabulate the Data Cubes
  • Slide and Dice Projection Operations Across
    Dimensions
  • Sorting Ordering Result Sets
  • Selection Access by Value or Value Range
  • Implementation Issues
  • Persistent with Infrequent Updates (Loading)
  • Optimization for Performance on Queries is More
    Complex - Across Multi-Dimensional Cubes
  • Recovery Less Critical - Mostly Read Only
  • Temporal Aspects of Data (Versions) Important

82
Knowledge Discovery
  • Data Warehousing Requires Knowledge Discovery to
    Organize/Extract Information Meaningfully
  • Knowledge Discovery
  • Technology to Extract Interesting Knowledge
    (Rules, Patterns, Regularities, Constraints) from
    a Vast Data Set
  • Process of Non-trivial Extraction of Implicit,
    Previously Unknown, and Potentially Useful
    Information from Large Collection of Data
  • Data Mining
  • A Critical Step in the Knowledge Discovery
    Process
  • Extracts Implicit Information from Large Data Set

83
Steps in a KDD Process
  • Learning the Application Domain (goals)
  • Gathering and Integrating Data
  • Data Cleaning
  • Data Integration
  • Data Transformation/Consolidation
  • Data Mining
  • Choosing the Mining Method(s) and Algorithm(s)
  • Mining Search for Patterns or Rules of Interest
  • Analysis and Evaluation of the Mining Results
  • Use of Discovered Knowledge in Decision Making
  • Important Caveats
  • This is Not an Automated Process!
  • Requires Significant Human Interaction!

84
On-Line Analytical Processing
  • Data Cube
  • A Multidimensonal Array
  • Each Attribute is a Dimension
  • In Example Below, the Data Must be Interpreted so
    that it Can be Aggregated by Region/Product/Date

Product
Pants
Diapers
Beer
Nuts
West East Central Mountain South
Region
Jan Feb March April
Date
85
Examples of Data Mining
  • The Slicing Action
  • A Vertical or Horizontal Slice Across Entire Cube

Slice on city Atlanta
Multi-Dimensional Data Cube
86
Examples of Data Mining
  • The Dicing Action
  • A Slide First Identifies on Dimension
  • A Selection of Any Cube within the Slice which
    Essentially Constrains All Three Dimensions

March 2000
Atlanta
Electronics
Dice on Electronics and Atlanta
87
Examples of Data Mining
  • Drill Down - Takes a Facet (e.g., Q1)
  • and Decomposes into Finer Detail

Drill down on Q1
Atlanta Columbus Gainesville Savannah
Arizona California Georgia Iowa
Q1 Q2 Q3 Q4
Roll Up on Location (State, USA)
Products Sales
Roll Up Combines Multiple Dimensions From
Individual Cities to State
88
Mining Other Types of Data
  • Analysis and Access Dramatically More Complicated!

Spatial databases Multimedia databases World
Wide Web
Time series data
Geographical and Satellite Data
89
Advantages/Objectives of Data Mining
  • Descriptive Mining
  • Discover and Describe General Properties
  • 60 People who buy Beer on Friday also have
    Bought Nuts or Chips in the Past Three Months
  • Predictive Mining
  • Infer Interesting Properties based on Available
    Data
  • People who Buy Beer on Friday usually also Buy
    Nuts or Chips
  • Result of Mining
  • Order from Chaos
  • Mining Large Data Sets in Multiple Dimensions
    Allows Businesses, Individuals, etc. to Learn
    about Trends, Behavior, etc.
  • Impact on Marketing Strateg

90
Data Mining Methods (1)
  • Association
  • Discover the Frequency of Items Occurring
    Together in a Transaction or an Event
  • Example
  • 80 Customers who Buy Milk also Buy BreadHence -
    Bread and Milk Adjacent in Supermarket
  • 50 of Customers Forget to Buy Milk/Soda/DrinksHe
    nce - Available at Register
  • Prediction
  • Predicts Some Unknown or Missing Information
    based on Available Data
  • Example
  • Forecast Sale Value of Electronic Products for
    Next Quarter via Available Data from Past Three
    Quarters

91
Association Rules
  • Motivated by Market Analysis
  • Rules of the Form
  • Item1Item2 Itemk?Itemk1 Itemn
  • Example
  • Beer Soft Drink ? Pop Corn
  • Problem Discovering All Interesting Association
    Rules in a Large Database is Difficult!
  • Issues
  • Interestingness
  • Completeness
  • Efficiency
  • Basic Measurement for Association Rules
  • Support of the Rule
  • Confidence of the Rule

92
Data Mining Methods (2)
  • Classification
  • Determine the Class or Category of an Object
    based on its Properties
  • Example
  • Classify Companies based on the Final Sale
    Results in the Past Quarter
  • Clustering
  • Organize a Set of Multi-dimensional Data Objects
    in Groups to Minimize Inter-group Similarity is
    and Maximize Intra-group Similarity
  • Example
  • Group Crime Locations to Find Distribution
    Patterns

93
Classification
  • Two Stages
  • Learning Stage Construction of a Classification
    Function or Model
  • Classification Stage Predication of Classes of
    Objects Using the Function or Model
  • Tools for Classification
  • Decision Tree
  • Bayesian Network
  • Neural Network
  • Regression
  • Problem
  • Given a Set of Objects whose Classes are Known
    (Training Set), Derive a Classification Model
    which can Correctly Classify Future Objects

94
An Example
  • Attributes
  • Class Attribute - Play/Dont Play the Game
  • Training Set
  • Values that Set the Condition for the
    Classification
  • What are the Pattern Below?

95
Data Mining Methods (3)
  • Summarization
  • Characterization (Summarization) of General
    Features of Objects in the Target Class
  • Example
  • Characterize Peoples Buying Patterns on the
    Weekend
  • Potential Impact on Sale Items When Sales
    Start
  • Department Stores with Bonus Coupons
  • Discrimination
  • Comparison of General Features of Objects Between
    a Target Class and a Contrasting Class
  • Example
  • Comparing Students in Engineering and in Art
  • Attempt to Arrive at Commonalities/Differences

96
Summarization Technique
  • Attribute-Oriented Induction
  • Generalization using Concert hierarchy (Taxonomy)

barcode category brand content
size
14998 milk diaryland Skim
2L
food
12998 mechanical MotorCraft valve 23a 12in
Milk bread
...
Skim milk 2 milk
White whole bread wheat
Category Content Count
Lucern Dairyland
Wonder Safeway
milk skim 280 milk 2
98 ...
97
Why is Data Mining Popular?
  • Technology Push
  • Technology for Collecting Large Quantity of Data
  • Bar Code, Scanners, Satellites, Cameras
  • Technology for Storing Large Collection of Data
  • Databases, Data Warehouses
  • Variety of Data Repositories, such as Virtual
    Worlds, Digital Media, World Wide Web
  • Corporations want to Improve Direct Marketing and
    Promotions - Driving Technology Advances
  • Targeted Marketing by Age, Region, Income, etc.
  • Exploiting User Preferences/Customized Shopping

98
Requirements Challenges in Data Mining
  • Security and Social
  • What Information is Available to Mine?
  • Preferences via Store Cards/Web Purchases
  • What is Your Comfort Level with Trends?
  • User Interfaces and Visualization
  • What Tools Must be Provided for End Users of Data
    Mining Systems?
  • How are Results for Multi-Dimensional Data
    Displayed?
  • Performance Guarantees
  • Range from Real-Time for Some Queries to
    Long-Term for Other Queries
  • Data Sources of Complex Data Types or
    Unstructured Data - Ability to Format, Clean, and
    Load Data Sets

99
Data Mining Visualization
  • Leverage Improving 3D Graphics and Increased PC
    Processing Power for Displaying Results
  • Significant Research in Visualization w.r.t.
    Displaying Multi-Dimensional Data

100
Successful Data Mining Applications
  • Business Data Analysis and Decision Support
  • Marketing, Customer Profiling, Market Analysis
    and Management, Risk Analysis and Management
  • Fraud Detection
  • Detecting Telephone Fraud, Automotive and Health
    Insurance Fraud, Credit-card Fraud, Suspicious
    Money Transactions (Money Laundering)
  • Text Mining
  • Message Filtering (Email, Newsgroups, Etc.)
  • Newspaper Articles Analysis
  • Sports
  • IBM Advanced Scout Analyzed NBA Game Statistics
    (Shots Blocked, Assists and Fouls) to Gain
    Competitive Advantage

101
Select Data Mining Products
102
Concluding Remarks
  • Four-Fold Objective
  • Distributed Database Processing
  • Web-Based Architectures for DB Interoperability
  • Data Warehouses
  • Data Mining of Vast Information Repositories
  • All Three are Tightly Related
  • DDBMS can Improve Performance of Mining
    Repositories as Backend Database Processors
  • Web-Based Architectures Provide Access Means for
    DDBMS or Mining
  • Warehouses are Infrastructure to Facilitate
    Mining
  • Clinical Data Repositories, Geographic
    Information Systems, Deductive DBMS, Multi-Media
    DBMS, Mobile DBMS, Embedded/Real-Time DBMS, etc.
Write a Comment
User Comments (0)
About PowerShow.com