CS511 Design of Database Management Systems - PowerPoint PPT Presentation

1 / 44

About This Presentation

Title:

CS511 Design of Database Management Systems

Description:

A way of thinking about the world and the phenomena. Fundamental 'paradigm' of data management ... records of same/neighboring key are packed physically together ... – PowerPoint PPT presentation

Number of Views:48

Avg rating:3.0/5.0

Slides: 45

Provided by: kevinc65

Category:

more less

Transcript and Presenter's Notes

Title: CS511 Design of Database Management Systems

1
CS511Design of Database Management Systems

Lecture 02 Relational Model
System R / Ingres

2
(No Transcript)
3
Relational Model
4
The Concept of Data Model

Mathematical formalism providing
notation structure for describing data
set of operations used to manipulate that data
A way of organizing a collection of facts
pertaining to a system being managed
A way of thinking about the world and the
phenomena
Fundamental paradigm of data management

5
The State of the World Before Rel. Model

Network/hierarchical DBMS, 1960s
IDS network DBMS Bachman at GE, 1961
IMS hierarchical DBMS IBM in 1968 Still exists
today!
CODASYL approach to data management, 1960s
CODASYL Conf. Of Data System Languages, set up
by US DOD, to standardize software applications
COBOL (comm. bus. oriented lang.) defined by
CODASYL
ruled the business data processing world
incorporating prior data definition languages
DBTG (Database Task Group, under CODASYL), 1971
closely aligned with COBOL
DBTG Report would standardize network model
Bachman (for network model) got Turing award in
1973

6
Network Model DBTG Report

Network DB
a collection of records
record collection of fields
similar to an entity in E-R model
records connected by binary, many-to-one links
similar to binary relationships in E-R model
simulate one-to-one, many-to-many by many-to-one

7
Network Model Implementation

Student record linked to enrollment record
A lot of pointer linkage
ring-structured ptrs implements many-one links
Data manipulation is thus navigational

Johnson

CS001
A
CS308
B
8
DBTG Query Example

select name from student where dept CS
student.dept CS
find any student using dept
while DB-status 0 do
begin
get student
print (student.name)
find duplicate student using dept
end

9
DBTG Query Example Predicates

select name from student where dept CS and
grade gt A-
student.dept CS
find any student using dept
while DB-status 0 do
begin
get student
if student.grade gt A-
print (student.name)
find duplicate student using dept
end

10
DBTG Query Example Navigation

select E.grade from student S, enrollment E
where S.name Johnson and E.id S.id
student.name Johnson
find any student using name
find first enrollment within StudentEnroll
while DB-status 0 do
begin
get enrollment
print (enrollment.grade)
find next enrollment within StudentEnroll
end

11
Whats Wrong?

Mix presentation and access in modeling
very much like HTML today -)
Difficult and complex to program
Data dependence

12
Data Dependence

Ordering dependence
stored ordering should be independent of access
presentation
Indexing dependence
indices should be redundant and for
optimization
Access path dependence
access paths to data should not be hard-coded

13
Codd's Paper

Concepts of data independence and declarative
queries
argues need for more natural data-centric
modeling
argues declarative queries
Complete definition of relational model
data structure based on relations
algebra for manipulating data
Formal theoretic notions
expressive power, redundancy, and consistency
starting research on database theories

14
The State of the World Relational camp

Codds paper in 1970
resistance even within IBM
First implementations, 1973
System R at IBM San Jose Lab
INGRES at UC.Berkeley
The Great Debate in 1975 SIGMOD conf.
Codd got Turing award in 1981

15
Arguments Against the Other Side?

COBOL/CODASYL ? Relational
too mathematical (to understand)
Relational ? COBOL/CODASYL
too complicated (to program)

16
All About Paradigm Shift

Bachman
offline sequential access --gt online navigation
This revolution in thinking is changing the
programmer from a stationery viewer of objects
passing before him in core into mobile navigator
who is able to probe and traverse a database at
will.
The Programmer as Navigator, 1973 Turing
Lecture
Codd
navigation over records links --gt declaration
over relations
The most important motivation for this research
work is to provide a sharp and clear boundary
between the logical and physical aspects of data
management. We call this data independence
objective. Relational Databases A Practical
Foundation for Productivity, 1981 Turing Award
Lecture

17
Relational Model Provide Independence?

Ordering dependence?
Indexing dependence?
Access path dependence?

18
System R / Ingres
19
System R

System R 1974-1978
IBM San Jose Labs, lots of PhD researchers
Won Kim is UIUC alum
Dissertation Query Processing for Relational
Database Systems

20
INGRES

INGRES 1973-1977
U.C. Berkeley faculty graduate students
Mike Stonebraker (then an asst. prof)
ACM SIGMOD Innovation Award 1991
Eugene Wong
? Postgres ? PostgreSQL

21
System R and INGRES Gray

Jim Gray (see System R 25th Reunion page)
Hostility developed between the San Jose IBM
group and the Berkeley group because they were
working on very, very similar things and had
very, very similar ideas
As a consequence we came to the conclusion that
the best thing was not to talk to each other.
The Berkeley folks thought the IBM guys were
ripping off ideas from the INGRES project. We had
a strained relationship.

22
Joint ACM Software System Award 1988

System R
Donald Chamberlin, James Gray, Raymond Lorie,
Gianfranco Putzolu,
Patrici Selinger, Irving Traiger
INGRES
Gerald Held, Michael Stonebraker, Eugene Wong
Citation
The INGRES and System R systems demonstrated
that a practical and efficient database
management system (DBMS) could be implemented
based on the relational data model. These systems
were full-function DBMS's that supported
non-procedural query languages (QUEL and SQL),
automatic query optimization, alternative storage
structures, transactions, crash recovery, views,
integrity, and protection. They have
revolutionized the database system industry by
showing how data stored in a computer can be
conveniently accessed by end users and while at
the same time it can be used by production
application programs.

23
Contributions of System R?

Bringing theory to practice
nice theory implemented into practical system
High level query language (SQL)
Codds relational algebra/calculus were
criticized as too mathematical
System research in action
macro design a complete system architecture
micro identify key problems and provide
solutions
Defining database landscape
industry product spec. and research directions

24
Complete System Study

Phase 0 1974, 1975
initial single-user prototype
try out ideas and find issues
felt a good idea to plan to throw away ver. 1.0
Phase 1 1976, 1977
full-function, multi-user prototype
Phase 2 1978, 1979
evaluation and feedback
lots of good lessons learned
Very similar process took place in INGRES

25
System R Architecture

RDS/RSS divide remains in many systems
RDS query processing (logical)
view, query parser, rewriter, optimizer, executor
RSS storage/access methods (physical)
storage, access methods, buffer manager, lock,
log/recovery

26
Views

View defined as a query
another consistent use of SQL (no separate DDL)
Query on views
query rewriter to flatten view (unfold def.)
form a composite query tree
View transparency
Almost any queries on any views
Not fully transparent though
update only for single-relation views
no right meanings in some cases
many-to-one nature of view def. (ambiguity)
even none-to-one (some view state has no
correspondence)

27
SQL as Query Language

High-level declarative, English-based language
declarative language what not how
well-founded/simple semantics based on relational
algebra
small set of well-understood operators, so
optimizer knows
how operators can be interchanged/transformed
what equivalent implementations are for each op
Consistent for different functionalities
data definition e.g., table creation, view
definition
data manipulation e.g., queries, updates
Uniform for different usage scenarios
embedding from different host languages (canned
queries)
ad hoc user queries from command lines
Unexpected benefit Standardized DB interface
(mid 80s)

28
What makes SQL possible?

Query parse, access-path selection, code gen,
execute
Cost-based access-path selection (optimization)
Pre-compilation for canned queries
remove preprocessing/optimization from run time
data indexes and statistics may change
reoptimize and recompile by observing
dependencies
alternative approaches?
caching of recent used query plans
trigger to invalidate cached plans on relevant
events
e.g., on rebuilding system statistics, on index
creation
in contrast, interpreted QUEL in INGRES admitted
mistake

29
Query Optimizer Cost Based

Cost-based optimizer set up paradigm
largely unchanged since
Cost model
C weighted-sum(CPU time, IO)
CPU time modeled as number of RSS calls

30
Query Optimizer Access Path Selection

Access path selection based on expected costs
select people where job programmer and city
champaign
path 1 job index --gt check city path 2 city
index --gt check job
more paths?
Data independence what are hidden from users?
Cost estimation based on
index selectivity
job (programmer) more selective or city
(chamapign)?
index clustering
records of same/neighboring key are packed
physically together
minimize IO to fetch records of same key or a
range
a relation can typically has at most one
clustering index. Why?

31
Query Optimizer Join Strategies

To evaluate R.a S.a
Nested-loop join
for each tuple r in R
use index, fetch S tuples s s.t. s.ar.a
Q B-tree or hashing index better?
Sort-merge join
sort R, sort S
merge tuples in R and S in order
Q Use B-tree index to speed up? Hashing?
All joins two-way n-ary joins as binary trees
prune away lots of alternative plans for n-ary
joins

32
Storage Phase 0

Tuples have TID, containing page number
direct access by TID to fetch the page
Tuples contain pointers to values in domain
pros
clean and original domain notion in Codds
model
ultimate data normalization
cons inefficient
idea picked up by some web search engines to
store cached documents
Design of Database Systems --gt
0AF1 00A0 A0B8 B001

33
Access Methods Phase 0

Inverted Index (an inversion is also an index)
domain value --gt list of TIDs
similar idea used in text information retrieval
for word to document ID mapping
Q find documents with cat and dog
inverted index
cat --gt d1, d2, d3
dog --gt d1, d3, d5, d6
A d1, d3
How is this different from, say, B-trees?

34
Access Methods Phase 1

B-trees
Hashing rejected
place records to buckets based on hashing values
reason of rejection cannot support range queries
Advantages?
to reach a record, you

35
Lock Manager Influential

Multiple granularity (hierarchy) of locks
records, relations, entire database
Intention locks for traversing data hierarchies
lock table with intention before locking tuples
acquiring locks from top down
Deadlock handling detection
We will study this in more detail later (Gray)

36
Lock Manager Predicate-Lock Abandoned

Lock all the tuples satisfying a predicate
e.g., lock student.dept CS
Problems
hard to determine if locks conflict
dept CS vs. GPA gt 3.0?
involving semantics
data snapshot also matters
locked set can be changing during locking

37
Lock Manager Convoy Problem

Lock resource may interfere with OS resources
Transaction T1 holds lock while to be dispatched
by OS
Transaction T2, , Tn all wait convoy formed
T1 dispatched, and then release lock
say 10K instructions for dispatching
lock granted to T2, which is waiting to be
dispatched by OS
T1 soon need the lock again, go back to end of
convoy
say only executed 1k instructions between locking
T1 is now wasting its CPU time slice
Most cycles are for dispatching only!
T1 10K dispatching, 1K execution, waiting in
convoy

38
Log/Recovery Failures

Transaction failures
i.e., transaction rollback if cannot be committed
System failures
all data updates in main memory buffer lost
Media failures
data on disk lost

39
Log/Recovery Before Logging

Before (undo) logging record old value
after crash, bring DB back to consistency
undo all uncommitted transactions
problems
data must be flushed to disk before transaction
commit
after crash, cannot bring to the most recent
status

40
Log/Recovery After Logging

After (redo) logging record new value
after crash, bring DB forward to consistency
redo all committed transactions
problem
must hold data in buffer until commit made
long history of committed transactions slow in
recovery

41
Log/Recovery Before/After Logging

Combined before/after logging to recover
checkpoints to archive current healthy state
redo all committed transactions
undo all uncommitted transactions
Dual logs to prevent log failure
Shadow pages not a good idea
tend to lose physical clustering
too much overhead in maintaining
alternative in-place updates separate logs

42
System R Influences Other Ideas

catalog as relations-- metadata managed as data
security management (grant/revoke)
integrity constraints, triggers

43
System R Influence

Ellison's Oracle beats IBM to market by reading
white papers.
IBM releases multiple RDBMSs, settles down to
DB2.
Gray (System R), Jerry Held (Ingres) and others
join Tandem (Non-Stop SQL), Kapali Eswaran starts
EsVal, which begets HP Allbase and Cullinet
Relational Technology Inc (Ingres Corp),
Britton-Lee/Sybase, Wang PACE grow out of Ingres
group
CA releases CA-Universe, a commercialization of
INGRES

44
A Comment from Students

I think it is remarkable how this System R
influenced many actual Relational Database
systems architecture, with it's various
components as the locking , recovery, security
and views subsystems. To read this is not much
different than for example reading the
description of the architecture of Oracle (I used
to work for them).

Write a Comment

User Comments (0)