Title: CS511 Design of Database Management Systems
1CS511Design of Database Management Systems
- Lecture 02 Relational Model
- System R / Ingres
-
2(No Transcript)
3Relational Model
4The Concept of Data Model
- Mathematical formalism providing
- notation structure for describing data
- set of operations used to manipulate that data
- A way of organizing a collection of facts
pertaining to a system being managed - A way of thinking about the world and the
phenomena - Fundamental paradigm of data management
5The State of the World Before Rel. Model
- Network/hierarchical DBMS, 1960s
- IDS network DBMS Bachman at GE, 1961
- IMS hierarchical DBMS IBM in 1968 Still exists
today! - CODASYL approach to data management, 1960s
- CODASYL Conf. Of Data System Languages, set up
by US DOD, to standardize software applications - COBOL (comm. bus. oriented lang.) defined by
CODASYL - ruled the business data processing world
- incorporating prior data definition languages
- DBTG (Database Task Group, under CODASYL), 1971
- closely aligned with COBOL
- DBTG Report would standardize network model
- Bachman (for network model) got Turing award in
1973
6Network Model DBTG Report
- Network DB
- a collection of records
- record collection of fields
- similar to an entity in E-R model
- records connected by binary, many-to-one links
- similar to binary relationships in E-R model
- simulate one-to-one, many-to-many by many-to-one
7Network Model Implementation
- Student record linked to enrollment record
- A lot of pointer linkage
- ring-structured ptrs implements many-one links
- Data manipulation is thus navigational
Johnson
CS001
A
CS308
B
8DBTG Query Example
- select name from student where dept CS
- student.dept CS
- find any student using dept
- while DB-status 0 do
- begin
- get student
- print (student.name)
- find duplicate student using dept
- end
9DBTG Query Example Predicates
- select name from student where dept CS and
grade gt A- - student.dept CS
- find any student using dept
- while DB-status 0 do
- begin
- get student
- if student.grade gt A-
- print (student.name)
- find duplicate student using dept
- end
10DBTG Query Example Navigation
- select E.grade from student S, enrollment E
- where S.name Johnson and E.id S.id
- student.name Johnson
- find any student using name
- find first enrollment within StudentEnroll
- while DB-status 0 do
- begin
- get enrollment
- print (enrollment.grade)
- find next enrollment within StudentEnroll
- end
11Whats Wrong?
- Mix presentation and access in modeling
- very much like HTML today -)
- Difficult and complex to program
- Data dependence
12Data Dependence
- Ordering dependence
- stored ordering should be independent of access
presentation - Indexing dependence
- indices should be redundant and for
optimization - Access path dependence
- access paths to data should not be hard-coded
13Codd's Paper
- Concepts of data independence and declarative
queries - argues need for more natural data-centric
modeling - argues declarative queries
- Complete definition of relational model
- data structure based on relations
- algebra for manipulating data
- Formal theoretic notions
- expressive power, redundancy, and consistency
- starting research on database theories
14The State of the World Relational camp
- Codds paper in 1970
- resistance even within IBM
- First implementations, 1973
- System R at IBM San Jose Lab
- INGRES at UC.Berkeley
- The Great Debate in 1975 SIGMOD conf.
- Codd got Turing award in 1981
15Arguments Against the Other Side?
- COBOL/CODASYL ? Relational
- too mathematical (to understand)
- Relational ? COBOL/CODASYL
- too complicated (to program)
16All About Paradigm Shift
- Bachman
- offline sequential access --gt online navigation
- This revolution in thinking is changing the
programmer from a stationery viewer of objects
passing before him in core into mobile navigator
who is able to probe and traverse a database at
will. - The Programmer as Navigator, 1973 Turing
Lecture - Codd
- navigation over records links --gt declaration
over relations - The most important motivation for this research
work is to provide a sharp and clear boundary
between the logical and physical aspects of data
management. We call this data independence
objective. Relational Databases A Practical
Foundation for Productivity, 1981 Turing Award
Lecture
17Relational Model Provide Independence?
- Ordering dependence?
- Indexing dependence?
- Access path dependence?
18System R / Ingres
19System R
- System R 1974-1978
- IBM San Jose Labs, lots of PhD researchers
- Won Kim is UIUC alum
- Dissertation Query Processing for Relational
Database Systems
20INGRES
- INGRES 1973-1977
- U.C. Berkeley faculty graduate students
- Mike Stonebraker (then an asst. prof)
- ACM SIGMOD Innovation Award 1991
- Eugene Wong
- ? Postgres ? PostgreSQL
21System R and INGRES Gray
- Jim Gray (see System R 25th Reunion page)
- Hostility developed between the San Jose IBM
group and the Berkeley group because they were
working on very, very similar things and had
very, very similar ideas - As a consequence we came to the conclusion that
the best thing was not to talk to each other. - The Berkeley folks thought the IBM guys were
ripping off ideas from the INGRES project. We had
a strained relationship.
22Joint ACM Software System Award 1988
- System R
- Donald Chamberlin, James Gray, Raymond Lorie,
Gianfranco Putzolu, - Patrici Selinger, Irving Traiger
- INGRES
- Gerald Held, Michael Stonebraker, Eugene Wong
- Citation
- The INGRES and System R systems demonstrated
that a practical and efficient database
management system (DBMS) could be implemented
based on the relational data model. These systems
were full-function DBMS's that supported
non-procedural query languages (QUEL and SQL),
automatic query optimization, alternative storage
structures, transactions, crash recovery, views,
integrity, and protection. They have
revolutionized the database system industry by
showing how data stored in a computer can be
conveniently accessed by end users and while at
the same time it can be used by production
application programs. -
23Contributions of System R?
- Bringing theory to practice
- nice theory implemented into practical system
- High level query language (SQL)
- Codds relational algebra/calculus were
criticized as too mathematical - System research in action
- macro design a complete system architecture
- micro identify key problems and provide
solutions - Defining database landscape
- industry product spec. and research directions
24Complete System Study
- Phase 0 1974, 1975
- initial single-user prototype
- try out ideas and find issues
- felt a good idea to plan to throw away ver. 1.0
- Phase 1 1976, 1977
- full-function, multi-user prototype
- Phase 2 1978, 1979
- evaluation and feedback
- lots of good lessons learned
- Very similar process took place in INGRES
25System R Architecture
- RDS/RSS divide remains in many systems
- RDS query processing (logical)
- view, query parser, rewriter, optimizer, executor
- RSS storage/access methods (physical)
- storage, access methods, buffer manager, lock,
log/recovery
26Views
- View defined as a query
- another consistent use of SQL (no separate DDL)
- Query on views
- query rewriter to flatten view (unfold def.)
- form a composite query tree
- View transparency
- Almost any queries on any views
- Not fully transparent though
- update only for single-relation views
- no right meanings in some cases
- many-to-one nature of view def. (ambiguity)
- even none-to-one (some view state has no
correspondence)
27SQL as Query Language
- High-level declarative, English-based language
- declarative language what not how
- well-founded/simple semantics based on relational
algebra - small set of well-understood operators, so
optimizer knows - how operators can be interchanged/transformed
- what equivalent implementations are for each op
- Consistent for different functionalities
- data definition e.g., table creation, view
definition - data manipulation e.g., queries, updates
- Uniform for different usage scenarios
- embedding from different host languages (canned
queries) - ad hoc user queries from command lines
- Unexpected benefit Standardized DB interface
(mid 80s)
28What makes SQL possible?
- Query parse, access-path selection, code gen,
execute - Cost-based access-path selection (optimization)
- Pre-compilation for canned queries
- remove preprocessing/optimization from run time
- data indexes and statistics may change
- reoptimize and recompile by observing
dependencies - alternative approaches?
- caching of recent used query plans
- trigger to invalidate cached plans on relevant
events - e.g., on rebuilding system statistics, on index
creation - in contrast, interpreted QUEL in INGRES admitted
mistake
29Query Optimizer Cost Based
- Cost-based optimizer set up paradigm
- largely unchanged since
- Cost model
- C weighted-sum(CPU time, IO)
- CPU time modeled as number of RSS calls
30Query Optimizer Access Path Selection
- Access path selection based on expected costs
- select people where job programmer and city
champaign - path 1 job index --gt check city path 2 city
index --gt check job - more paths?
- Data independence what are hidden from users?
- Cost estimation based on
- index selectivity
- job (programmer) more selective or city
(chamapign)? - index clustering
- records of same/neighboring key are packed
physically together - minimize IO to fetch records of same key or a
range - a relation can typically has at most one
clustering index. Why?
31Query Optimizer Join Strategies
- To evaluate R.a S.a
- Nested-loop join
- for each tuple r in R
- use index, fetch S tuples s s.t. s.ar.a
- Q B-tree or hashing index better?
- Sort-merge join
- sort R, sort S
- merge tuples in R and S in order
- Q Use B-tree index to speed up? Hashing?
- All joins two-way n-ary joins as binary trees
- prune away lots of alternative plans for n-ary
joins
32Storage Phase 0
- Tuples have TID, containing page number
- direct access by TID to fetch the page
- Tuples contain pointers to values in domain
- pros
- clean and original domain notion in Codds
model - ultimate data normalization
- cons inefficient
- idea picked up by some web search engines to
store cached documents - Design of Database Systems --gt
- 0AF1 00A0 A0B8 B001
33Access Methods Phase 0
- Inverted Index (an inversion is also an index)
- domain value --gt list of TIDs
- similar idea used in text information retrieval
for word to document ID mapping - Q find documents with cat and dog
- inverted index
- cat --gt d1, d2, d3
- dog --gt d1, d3, d5, d6
- A d1, d3
- How is this different from, say, B-trees?
34Access Methods Phase 1
- B-trees
- Hashing rejected
- place records to buckets based on hashing values
- reason of rejection cannot support range queries
- Advantages?
- to reach a record, you
35Lock Manager Influential
- Multiple granularity (hierarchy) of locks
- records, relations, entire database
- Intention locks for traversing data hierarchies
- lock table with intention before locking tuples
- acquiring locks from top down
- Deadlock handling detection
- We will study this in more detail later (Gray)
36Lock Manager Predicate-Lock Abandoned
- Lock all the tuples satisfying a predicate
- e.g., lock student.dept CS
- Problems
- hard to determine if locks conflict
- dept CS vs. GPA gt 3.0?
- involving semantics
- data snapshot also matters
- locked set can be changing during locking
37Lock Manager Convoy Problem
- Lock resource may interfere with OS resources
- Transaction T1 holds lock while to be dispatched
by OS - Transaction T2, , Tn all wait convoy formed
- T1 dispatched, and then release lock
- say 10K instructions for dispatching
- lock granted to T2, which is waiting to be
dispatched by OS - T1 soon need the lock again, go back to end of
convoy - say only executed 1k instructions between locking
- T1 is now wasting its CPU time slice
- Most cycles are for dispatching only!
- T1 10K dispatching, 1K execution, waiting in
convoy
38Log/Recovery Failures
- Transaction failures
- i.e., transaction rollback if cannot be committed
- System failures
- all data updates in main memory buffer lost
- Media failures
- data on disk lost
39Log/Recovery Before Logging
- Before (undo) logging record old value
- after crash, bring DB back to consistency
- undo all uncommitted transactions
- problems
- data must be flushed to disk before transaction
commit - after crash, cannot bring to the most recent
status
40Log/Recovery After Logging
- After (redo) logging record new value
- after crash, bring DB forward to consistency
- redo all committed transactions
- problem
- must hold data in buffer until commit made
- long history of committed transactions slow in
recovery
41Log/Recovery Before/After Logging
- Combined before/after logging to recover
- checkpoints to archive current healthy state
- redo all committed transactions
- undo all uncommitted transactions
- Dual logs to prevent log failure
- Shadow pages not a good idea
- tend to lose physical clustering
- too much overhead in maintaining
- alternative in-place updates separate logs
42System R Influences Other Ideas
- catalog as relations-- metadata managed as data
- security management (grant/revoke)
- integrity constraints, triggers
43System R Influence
- Ellison's Oracle beats IBM to market by reading
white papers. - IBM releases multiple RDBMSs, settles down to
DB2. - Gray (System R), Jerry Held (Ingres) and others
join Tandem (Non-Stop SQL), Kapali Eswaran starts
EsVal, which begets HP Allbase and Cullinet - Relational Technology Inc (Ingres Corp),
Britton-Lee/Sybase, Wang PACE grow out of Ingres
group - CA releases CA-Universe, a commercialization of
INGRES
44A Comment from Students
- I think it is remarkable how this System R
influenced many actual Relational Database
systems architecture, with it's various
components as the locking , recovery, security
and views subsystems. To read this is not much
different than for example reading the
description of the architecture of Oracle (I used
to work for them).