CS511 Design of Database Management Systems - PowerPoint PPT Presentation

1 / 44
About This Presentation
Title:

CS511 Design of Database Management Systems

Description:

A way of thinking about the world and the phenomena. Fundamental 'paradigm' of data management ... records of same/neighboring key are packed physically together ... – PowerPoint PPT presentation

Number of Views:48
Avg rating:3.0/5.0
Slides: 45
Provided by: kevinc65
Category:

less

Transcript and Presenter's Notes

Title: CS511 Design of Database Management Systems


1
CS511Design of Database Management Systems
  • Lecture 02 Relational Model
  • System R / Ingres

2
(No Transcript)
3
Relational Model
4
The Concept of Data Model
  • Mathematical formalism providing
  • notation structure for describing data
  • set of operations used to manipulate that data
  • A way of organizing a collection of facts
    pertaining to a system being managed
  • A way of thinking about the world and the
    phenomena
  • Fundamental paradigm of data management

5
The State of the World Before Rel. Model
  • Network/hierarchical DBMS, 1960s
  • IDS network DBMS Bachman at GE, 1961
  • IMS hierarchical DBMS IBM in 1968 Still exists
    today!
  • CODASYL approach to data management, 1960s
  • CODASYL Conf. Of Data System Languages, set up
    by US DOD, to standardize software applications
  • COBOL (comm. bus. oriented lang.) defined by
    CODASYL
  • ruled the business data processing world
  • incorporating prior data definition languages
  • DBTG (Database Task Group, under CODASYL), 1971
  • closely aligned with COBOL
  • DBTG Report would standardize network model
  • Bachman (for network model) got Turing award in
    1973

6
Network Model DBTG Report
  • Network DB
  • a collection of records
  • record collection of fields
  • similar to an entity in E-R model
  • records connected by binary, many-to-one links
  • similar to binary relationships in E-R model
  • simulate one-to-one, many-to-many by many-to-one

7
Network Model Implementation
  • Student record linked to enrollment record
  • A lot of pointer linkage
  • ring-structured ptrs implements many-one links
  • Data manipulation is thus navigational

Johnson

CS001
A
CS308
B
8
DBTG Query Example
  • select name from student where dept CS
  • student.dept CS
  • find any student using dept
  • while DB-status 0 do
  • begin
  • get student
  • print (student.name)
  • find duplicate student using dept
  • end

9
DBTG Query Example Predicates
  • select name from student where dept CS and
    grade gt A-
  • student.dept CS
  • find any student using dept
  • while DB-status 0 do
  • begin
  • get student
  • if student.grade gt A-
  • print (student.name)
  • find duplicate student using dept
  • end

10
DBTG Query Example Navigation
  • select E.grade from student S, enrollment E
  • where S.name Johnson and E.id S.id
  • student.name Johnson
  • find any student using name
  • find first enrollment within StudentEnroll
  • while DB-status 0 do
  • begin
  • get enrollment
  • print (enrollment.grade)
  • find next enrollment within StudentEnroll
  • end

11
Whats Wrong?
  • Mix presentation and access in modeling
  • very much like HTML today -)
  • Difficult and complex to program
  • Data dependence

12
Data Dependence
  • Ordering dependence
  • stored ordering should be independent of access
    presentation
  • Indexing dependence
  • indices should be redundant and for
    optimization
  • Access path dependence
  • access paths to data should not be hard-coded

13
Codd's Paper
  • Concepts of data independence and declarative
    queries
  • argues need for more natural data-centric
    modeling
  • argues declarative queries
  • Complete definition of relational model
  • data structure based on relations
  • algebra for manipulating data
  • Formal theoretic notions
  • expressive power, redundancy, and consistency
  • starting research on database theories

14
The State of the World Relational camp
  • Codds paper in 1970
  • resistance even within IBM
  • First implementations, 1973
  • System R at IBM San Jose Lab
  • INGRES at UC.Berkeley
  • The Great Debate in 1975 SIGMOD conf.
  • Codd got Turing award in 1981

15
Arguments Against the Other Side?
  • COBOL/CODASYL ? Relational
  • too mathematical (to understand)
  • Relational ? COBOL/CODASYL
  • too complicated (to program)

16
All About Paradigm Shift
  • Bachman
  • offline sequential access --gt online navigation
  • This revolution in thinking is changing the
    programmer from a stationery viewer of objects
    passing before him in core into mobile navigator
    who is able to probe and traverse a database at
    will.
  • The Programmer as Navigator, 1973 Turing
    Lecture
  • Codd
  • navigation over records links --gt declaration
    over relations
  • The most important motivation for this research
    work is to provide a sharp and clear boundary
    between the logical and physical aspects of data
    management. We call this data independence
    objective. Relational Databases A Practical
    Foundation for Productivity, 1981 Turing Award
    Lecture

17
Relational Model Provide Independence?
  • Ordering dependence?
  • Indexing dependence?
  • Access path dependence?

18
System R / Ingres
19
System R
  • System R 1974-1978
  • IBM San Jose Labs, lots of PhD researchers
  • Won Kim is UIUC alum
  • Dissertation Query Processing for Relational
    Database Systems

20
INGRES
  • INGRES 1973-1977
  • U.C. Berkeley faculty graduate students
  • Mike Stonebraker (then an asst. prof)
  • ACM SIGMOD Innovation Award 1991
  • Eugene Wong
  • ? Postgres ? PostgreSQL

21
System R and INGRES Gray
  • Jim Gray (see System R 25th Reunion page)
  • Hostility developed between the San Jose IBM
    group and the Berkeley group because they were
    working on very, very similar things and had
    very, very similar ideas
  • As a consequence we came to the conclusion that
    the best thing was not to talk to each other.
  • The Berkeley folks thought the IBM guys were
    ripping off ideas from the INGRES project. We had
    a strained relationship.

22
Joint ACM Software System Award 1988
  • System R
  • Donald Chamberlin, James Gray, Raymond Lorie,
    Gianfranco Putzolu,
  • Patrici Selinger, Irving Traiger
  • INGRES
  • Gerald Held, Michael Stonebraker, Eugene Wong
  • Citation
  • The INGRES and System R systems demonstrated
    that a practical and efficient database
    management system (DBMS) could be implemented
    based on the relational data model. These systems
    were full-function DBMS's that supported
    non-procedural query languages (QUEL and SQL),
    automatic query optimization, alternative storage
    structures, transactions, crash recovery, views,
    integrity, and protection. They have
    revolutionized the database system industry by
    showing how data stored in a computer can be
    conveniently accessed by end users and while at
    the same time it can be used by production
    application programs.

23
Contributions of System R?
  • Bringing theory to practice
  • nice theory implemented into practical system
  • High level query language (SQL)
  • Codds relational algebra/calculus were
    criticized as too mathematical
  • System research in action
  • macro design a complete system architecture
  • micro identify key problems and provide
    solutions
  • Defining database landscape
  • industry product spec. and research directions

24
Complete System Study
  • Phase 0 1974, 1975
  • initial single-user prototype
  • try out ideas and find issues
  • felt a good idea to plan to throw away ver. 1.0
  • Phase 1 1976, 1977
  • full-function, multi-user prototype
  • Phase 2 1978, 1979
  • evaluation and feedback
  • lots of good lessons learned
  • Very similar process took place in INGRES

25
System R Architecture
  • RDS/RSS divide remains in many systems
  • RDS query processing (logical)
  • view, query parser, rewriter, optimizer, executor
  • RSS storage/access methods (physical)
  • storage, access methods, buffer manager, lock,
    log/recovery

26
Views
  • View defined as a query
  • another consistent use of SQL (no separate DDL)
  • Query on views
  • query rewriter to flatten view (unfold def.)
  • form a composite query tree
  • View transparency
  • Almost any queries on any views
  • Not fully transparent though
  • update only for single-relation views
  • no right meanings in some cases
  • many-to-one nature of view def. (ambiguity)
  • even none-to-one (some view state has no
    correspondence)

27
SQL as Query Language
  • High-level declarative, English-based language
  • declarative language what not how
  • well-founded/simple semantics based on relational
    algebra
  • small set of well-understood operators, so
    optimizer knows
  • how operators can be interchanged/transformed
  • what equivalent implementations are for each op
  • Consistent for different functionalities
  • data definition e.g., table creation, view
    definition
  • data manipulation e.g., queries, updates
  • Uniform for different usage scenarios
  • embedding from different host languages (canned
    queries)
  • ad hoc user queries from command lines
  • Unexpected benefit Standardized DB interface
    (mid 80s)

28
What makes SQL possible?
  • Query parse, access-path selection, code gen,
    execute
  • Cost-based access-path selection (optimization)
  • Pre-compilation for canned queries
  • remove preprocessing/optimization from run time
  • data indexes and statistics may change
  • reoptimize and recompile by observing
    dependencies
  • alternative approaches?
  • caching of recent used query plans
  • trigger to invalidate cached plans on relevant
    events
  • e.g., on rebuilding system statistics, on index
    creation
  • in contrast, interpreted QUEL in INGRES admitted
    mistake

29
Query Optimizer Cost Based
  • Cost-based optimizer set up paradigm
  • largely unchanged since
  • Cost model
  • C weighted-sum(CPU time, IO)
  • CPU time modeled as number of RSS calls

30
Query Optimizer Access Path Selection
  • Access path selection based on expected costs
  • select people where job programmer and city
    champaign
  • path 1 job index --gt check city path 2 city
    index --gt check job
  • more paths?
  • Data independence what are hidden from users?
  • Cost estimation based on
  • index selectivity
  • job (programmer) more selective or city
    (chamapign)?
  • index clustering
  • records of same/neighboring key are packed
    physically together
  • minimize IO to fetch records of same key or a
    range
  • a relation can typically has at most one
    clustering index. Why?

31
Query Optimizer Join Strategies
  • To evaluate R.a S.a
  • Nested-loop join
  • for each tuple r in R
  • use index, fetch S tuples s s.t. s.ar.a
  • Q B-tree or hashing index better?
  • Sort-merge join
  • sort R, sort S
  • merge tuples in R and S in order
  • Q Use B-tree index to speed up? Hashing?
  • All joins two-way n-ary joins as binary trees
  • prune away lots of alternative plans for n-ary
    joins

32
Storage Phase 0
  • Tuples have TID, containing page number
  • direct access by TID to fetch the page
  • Tuples contain pointers to values in domain
  • pros
  • clean and original domain notion in Codds
    model
  • ultimate data normalization
  • cons inefficient
  • idea picked up by some web search engines to
    store cached documents
  • Design of Database Systems --gt
  • 0AF1 00A0 A0B8 B001

33
Access Methods Phase 0
  • Inverted Index (an inversion is also an index)
  • domain value --gt list of TIDs
  • similar idea used in text information retrieval
    for word to document ID mapping
  • Q find documents with cat and dog
  • inverted index
  • cat --gt d1, d2, d3
  • dog --gt d1, d3, d5, d6
  • A d1, d3
  • How is this different from, say, B-trees?

34
Access Methods Phase 1
  • B-trees
  • Hashing rejected
  • place records to buckets based on hashing values
  • reason of rejection cannot support range queries
  • Advantages?
  • to reach a record, you

35
Lock Manager Influential
  • Multiple granularity (hierarchy) of locks
  • records, relations, entire database
  • Intention locks for traversing data hierarchies
  • lock table with intention before locking tuples
  • acquiring locks from top down
  • Deadlock handling detection
  • We will study this in more detail later (Gray)

36
Lock Manager Predicate-Lock Abandoned
  • Lock all the tuples satisfying a predicate
  • e.g., lock student.dept CS
  • Problems
  • hard to determine if locks conflict
  • dept CS vs. GPA gt 3.0?
  • involving semantics
  • data snapshot also matters
  • locked set can be changing during locking

37
Lock Manager Convoy Problem
  • Lock resource may interfere with OS resources
  • Transaction T1 holds lock while to be dispatched
    by OS
  • Transaction T2, , Tn all wait convoy formed
  • T1 dispatched, and then release lock
  • say 10K instructions for dispatching
  • lock granted to T2, which is waiting to be
    dispatched by OS
  • T1 soon need the lock again, go back to end of
    convoy
  • say only executed 1k instructions between locking
  • T1 is now wasting its CPU time slice
  • Most cycles are for dispatching only!
  • T1 10K dispatching, 1K execution, waiting in
    convoy

38
Log/Recovery Failures
  • Transaction failures
  • i.e., transaction rollback if cannot be committed
  • System failures
  • all data updates in main memory buffer lost
  • Media failures
  • data on disk lost

39
Log/Recovery Before Logging
  • Before (undo) logging record old value
  • after crash, bring DB back to consistency
  • undo all uncommitted transactions
  • problems
  • data must be flushed to disk before transaction
    commit
  • after crash, cannot bring to the most recent
    status

40
Log/Recovery After Logging
  • After (redo) logging record new value
  • after crash, bring DB forward to consistency
  • redo all committed transactions
  • problem
  • must hold data in buffer until commit made
  • long history of committed transactions slow in
    recovery

41
Log/Recovery Before/After Logging
  • Combined before/after logging to recover
  • checkpoints to archive current healthy state
  • redo all committed transactions
  • undo all uncommitted transactions
  • Dual logs to prevent log failure
  • Shadow pages not a good idea
  • tend to lose physical clustering
  • too much overhead in maintaining
  • alternative in-place updates separate logs

42
System R Influences Other Ideas
  • catalog as relations-- metadata managed as data
  • security management (grant/revoke)
  • integrity constraints, triggers

43
System R Influence
  • Ellison's Oracle beats IBM to market by reading
    white papers.
  • IBM releases multiple RDBMSs, settles down to
    DB2. 
  • Gray (System R), Jerry Held (Ingres) and others
    join Tandem (Non-Stop SQL), Kapali Eswaran starts
    EsVal, which begets HP Allbase and Cullinet
  • Relational Technology Inc (Ingres Corp),
    Britton-Lee/Sybase, Wang PACE grow out of Ingres
    group
  • CA releases CA-Universe, a commercialization of
    INGRES

44
A Comment from Students
  • I think it is remarkable how this System R
    influenced many actual Relational Database
    systems architecture, with it's various
    components as the locking , recovery, security
    and views subsystems. To read this is not much
    different than for example reading the
    description of the architecture of Oracle (I used
    to work for them).
Write a Comment
User Comments (0)
About PowerShow.com