Title: Zachary G. Ives
1Introduction
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- September 4, 2003
Some slide content courtesy of Susan Davidson
Raghu Ramakrishnan
2Welcome!
- To CIS 550, officially Database Information
Systems in the course catalog - A tour of the data management field
- A question for you what does this really
mean? What is this course (and the field) about?
3What The Course Is Not(A Few Warnings)
- This is not a course on Oracle or SQL
- It may not directly impact your marketability
- Its an investigation into the principles of data
management which will improve your
understanding - This course will not be a cakewalk!
- The data management field is broad, and well
touch on many subjects at a rapid pace - 8 homeworks, paper summaries, term project,
midterm, final, - This course is not suitable for people with a
limited programming background need skills in - Algorithms data structures
- Logic
- Programming languages
- Perhaps even a little complexity theory!
4What The Course Will Do
- Most CS courses concentrate on code now youll
understand data management and representation - In the end, its all about the data!
- Background in most of the important areas
- Data design, modeling
- Understanding of DB system internals, performance
- Understanding of data-driven systems (e.g., web
sites) - An understanding of the complexities of
integrating data perhaps the biggest CS problem
today - Understanding of what research topics in data
management are
5Administrivia
- Instructor Zachary Ives, zives_at_cis
- Levine 611 (until end of Sept. then 5th floor
GRW) - Office hours Tuesday, 300-400PM (after class)
- TA Dinkar Gupta, dinkar_at_grasp.cis
- Office hours on Monday time 300-400PM (office
TBA) - Newsgroup upenn.cis.cis550
- Home page www.seas.upenn.edu/zives/cis550/
- Text(s)
- Ramakrishnan Gerke, Database Systems, 3rd ed.
- Supplementary papers (to be handed out in class)
- Other books may be useful (see web page)
6Course Format and Grading
- Well cover roughly one major topic area per week
to two weeks - Readings in the text research papers
- Occasionally, summaries/commentary on papers (5)
- Homework assignment for each topic area (30)
- One midterm (10), one final exam (20)
- Project (30) groups of 3-4
- Build a blogging system on top of a database, or
- Build a P2P data sharing system for XML data
- (Or propose your own idea)
- General participation, discussion, intangibles
(5)
7Diving In
- What is a database and a DBMS?
- Why do we need a DBMS?
- Database and data management architectures
- Process of building a DB
- DBMS components
8Whats Data Management?
- In the 1960s and early 70s
- file formats, traversals, indexes
- In the 1980s (mid- to late-70s in research) and
90s - Separation of logical physical data
representations - Well-defined general purpose, declarative data
manipulation language (DML) and data definition
language (DDL) - Reliable, consistent storage concurrency
control - Sophisticated system that takes DDL statements
and knowledge of physical data representations
and produces answers in optimized way - Today
- All that plus managing and manipulating data in
many models and representations
9What is a DBMS?
- A database (DB) is a large, integrated collection
of data - Generally is cohesive in some way
- A DB models a real-world organization or unit
- A database management system (DBMS) is a software
package designed to store and manage databases - Reliable storage recovery of 100s of GB
- Querying/updating interface and API
- Support for many concurrent users
10Connection to OtherAreas of CS
- Programming languages and software engineering
(obviously) - Algorithms (obviously)
- Logic, discrete math, and theory of computation
- Systems concurrency, operating systems, file
organization and networks, peer-to-peer, - Web (and Semantic Web), information retrieval,
digital libraries, software agents, - AI planning and machine learning
11But 80 of the Worlds Data is Not in a DB!
- Examples
- scientific data (large images, complex programs
that analyze the data) - personal data
- WWW
- Data management is expanding to tackle these
problems - Flexibility data management imposes many
constraints to make problems solvable - Must deal with entities outside our control
- In this course, well start by focusing on
databases, but eventually look outside the box
12Why Not Program up Databases As Needed?
- For simple (single-concept) and small databases
this is often the best solution - Flat files and grep get us a long way
- But there are limits
- The structure is complicated (more than a simple
table) - The database gets large (e.g., bigger than RAM)
- Many people want to use it simultaneously
- Need for reliable recovery from crashes
- Updates generally require complete rewrite of file
13Example Palm-Style Calendar
- We might start by building a file with the
following structure - This text file is easy to deal with. So there's
no need for a DBMS! Right?
14Problem 1 Data Organization
- Consider the all-important who field. Do we also
want to keep e-mail addresses, telephone numbers
etc.? - Expand our file to look like
- Now we are keeping our address book in our
calendar and doing so redundantly
15Link Calendar with Address Book?
- Two conceptual entities contact information
and calendar with a relationship between them,
linking people in the calendar to their contact
information - This link could be based on something as simple
as the person's name
16Problem 2 Efficiency
- Size of personal address book is probably less
than one hundred entries, but there are things
we'd like to do quickly and efficiently - Give me all appointments on 10/28
- When am I next meeting Jim?
- Program these as quickly as possible (and make
them resilient to data format changes) - Have these programs executed efficiently
- What would happen if you were using a corporate
calendar with hundreds of thousands of entries?
17Problem 3 Concurrency and Reliability
- Suppose other people are allowed access to your
calendar and are allowed to modify it? How do we
stop two people changing the file at the same
time and leaving it in a physical (or logical)
mess? - Suppose the system crashes while we are changing
the calendar. How do we recover our work?
18Transactions
- Key concept for concurrency is that of a
transaction an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry). - Key concept for recoverability is that of a log
keeping track of all actions carried out by the
db. - Sounds like operating systems all over again!
19Database Architecture The Traditional View
- It is common to describe databases in two ways
- Logical structure
- What users see. The program or query language
interface. - Physical structure
- How files are organized. What indexing
mechanisms are used. - The logical level is further split into two
components - Overall database design (conceptual seen by the
DB designer) - Views that various users get to see
20Three-level Architecture
View 1 View 2 View N
Schema
Conceptual Level
21Data Independence
- A user of a relational database system should be
able to use query the database without knowing
about how the precisely how data is stored, e.g. - After all, you don't worry much how numbers are
stored when you program some arithmetic or use a
computer-based calculator
SELECT When, Where FROM Calendar WHERE Who
"Bill"
22More on Data Independence
- Logical data independence
- Protects the user from changes in the logical
structure of the data could reorganize the
calendar schema without changing how I query it - Physical data independence
- Protects the user from changes in the physical
structure of data could add an index on who (or
sort by when) without changing how the user would
write the query, but the query would execute
faster (query optimization)
23That's the Traditional View, But ...
- Three-level architecture is not always
achievable when databases get big, queries must
be carefully written to achieve efficiency - Also, may need a 4th tier Sometimes this is
called middleware
24Combining Databases with Mediators(a kind of
middleware)
Mediated Schema
XML
- May need to add further layers to combine
multiple databases/sources on the Web - Some of these are databases over which we have
no control - Some must be accessed in special ways
25Data-Driven Web Sites Consumers of Database
Output
HTML
view
Processing
- Data driven web sites also add an HTML
presentation layer on top of what weve seen
26The Process of Building a Database
- Start with a conceptual model
- Design implement schema
- Write applications using DBMS and other tools
- Many ways of doing this where the hard problems
are taken care of by other people (DBMS, API
writers, library authors, web server, etc.)
27Conceptual Design
PROFESSOR
fid
name
Teaches
Takes
STUDENT
COURSE
cid
name
semester
sid
name
28Designing a Schema (Set of Relations)
STUDENT
COURSE
Takes
- Convert to tables constraints
- Then need to do physical design the layout on
disk, indices, etc.
cid name sem
550-0103 DB F03
700-1003 AI S03
501-0103 Arch F03
sid name
1 Jill
2 Qun
3 Nitin
sid cid
1 550-0103
1 700-1003
3 500-0103
PROFESSOR
Teaches
fid name
1 Ives
2 Saul
8 Roth
fid cid
1 550-0103
2 700-1003
8 501-0103
29Applications Use Queries in SQL
- Structured Query Language
- Based on restricted first-order logic expressions
over relations - Not procedural defines constraints on the
output - Converted into a query plan that exploits
properties run over the data by the query
optimizer and query execution engine
- lthtmlgt
- ltbodygt
- lt!-- hypotheticalEmbeddedSQL
- SELECT FROM STUDENT, Takes, COURSE
- WHERE STUDENT.sid Takes.sID
- AND Takes.cID cid
- --gt
- lt/bodygt
- lt/htmlgt
30Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
31DBMS in a Bit More Detail
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Logging, recovery
Schemas
Catalog
Requests
Data/etc
Index/file/rec Mgr
Data/etc
Requests
Buffer Mgr
Pages
Pages
Storage Mgr
Requests
Data
Storage
32DBMSs in the Wild
- A huge industry for 20 of the worlds data!
- Big, mature relational databases
- IBM, Oracle, Microsoft
- Middleware above these
- SAP, PeopleSoft, dozens of special-purpose apps
- Application servers
- Integration and warehousing systems
- Trends
- More integration web services XML everywhere
- Smarter, self-tuning systems
33The Research World
- Conventional databases arent interesting!
- Understanding whats possible to do with XML
- Better query processing
- Better languages for meta-info (e.g.,
constraints) - Data streams
- Peer-to-peer
- Integrating data from different formats
- Lots of theory and systems-building
34In this Course...
- Study relational databases, their design, how to
query, what forms of indices to use. - Beyond relational algebra a logical model of
data (Datalog), recursion - XML and semi-structured data models
- Understanding DB internals
- How DBs are built
- Performance implications
- Integrating and mediating between databases (a
huge problem today)
35Questions?
Dilbert, 8/9/2003 (via online archive)