Title: Zachary G. Ives
1Introduction
- Zachary G. Ives
- University of Pennsylvania
- CIS 550 Database Information Systems
- November 21, 2009
Some slide content courtesy of Susan Davidson
Raghu Ramakrishnan
2Welcome to CIS 550, Database and Information
Systems!
- Instructor Zachary Ives, zives_at_cis.upenn.edu
- 576 Levine Hall North
- Office hours Wednesdays, 330-430PM (before
class) - TA Kareem Amin, akareem_at_seas.upenn.edu
- Office hours Tuesdays, 330-430PM, Levine N
4th Floor Lounge - Discussion group groups.google.com/group/cis-550
-fall09 - Home page www.seas.upenn.edu/zives/cis550/
- Texts and readings
- Ramakrishnan Gerke, Database Systems, 3rd ed.
- Supplementary papers (to be handed out in class)
3Course Format and Grading
- Roughly one major topic area per week to two
weeks - Readings in the text current or influential
research papers - Occasionally, summaries/commentary on papers (5)
- Homework assignment for each topic area and
mini-projects (30) - One midterm (10), one final exam (20)
- Project (30) groups of 3 or 4
- Build a facebook clone
- (Or propose your own idea)
- General participation, discussion, intangibles
(5)
4Whats this Course About?
- Most CS courses concentrate on code our
interest is managing and representing data - Warning this course doesnt focus on teaching
SQL or how to be an Oracle DBA (though it will
get you started) - Its not an application course!
- So what in the world are we studying for 14
weeks???
5Data What Do We Need to Do with It?
6Ways Information Is Represented
7Example An Encyclopedia Entry(www.wikipedia.com)
- A database is a collection of data elements
(facts) stored in a computer in a systematic way,
such that a computer program can consult it to
answer questions. The answers to those questions
become information that can be used to make
decisions that may not be made with the data
elements alone. The computer program used to
manage and query a database is known as a
database management system (DBMS). The properties
of database systems are studied in information
science. - At the core of the concept of a database is the
idea of a collection of generic facts, or pieces
of knowledge. Facts may be structured in a number
of ways, known as database models. For instance,
one database model is to associate each fact with
a record representing an entity (such as a
person), and to arrange these entities into trees
or hierarchies the hierarchical database model.
Another model is to arrange facts into sets of
values which satisfy logical predicates the
relational database model. - The first database management systems were
developed in the 1960s. A pioneer in the field
was Charles Bachman. Two key data models arose at
this time the network model (developed by
CODASYL) followed by the hierarchical model (as
implemented in IMS). These were later usurped by
the relational model, which was contemporary with
the so-called flat model designed for very small
tasks.
8Example To-Do List
- Buy school supplies due 9/4
- Go to orientation on 9/4
- Exercise every M/W/F
- Buy Philly postcards
- Take photo with Ben F.
- How does this differ from the plain text model?
What might you do with it that you couldnt?
9Example Your PDA/Cell Phone
Event Day When Who Where Lunch 10/24 1pm Zack C
avanaughs Advice 10/25 9am Dr. Smith 599
Levine Biking 10/26 9am Jane Pottruck Dinner 10/2
6 6PM Jane Food Court
Calendar
Who Phone Email Office Zack 6-2789 zives 576
Levine N Dr. Smith 6-1234 drsmith 599
Levine Jane 543-2198 jane 2220 Walnut St.
Contacts
10What If We Want to Include Contact Info on Our
Calendar?
- Do we also want to keep e-mail addresses,
telephone numbers etc.? - Should we expand the number of columns in our
table - Event When Who-name Who-email Who-tel . Where
- Lunch 1pm Zack zives 6-2789 . Cav
-
- What is the trade-off in terms of entering data?
11Link Calendar with Contacts?
- Why cant we link calendar entries with contact
info, and show the results of the two? - The link could be based on something as simple as
the person's name - (Whats the danger here? What else might work
better?) - This brings up an issue how to follow links
- If we were to do this in C, how might it be done?
In Java?
12Another Kind of Link Classes and Subclasses
- Person has attributes
- ssn
- PennID
- set of user IDs
- given name
- family name
-
- Student IS A person who
- takes courses
- is given grades
- learns an academic topic
- listens to lectures in class, OR over the Web, OR
on videotape - This is yet another kind of information
- How have you previously seen such relationships
encoded?
13Data Representation and Modeling
- All of the data weve seen have an implicit data
model - The data model includes some basic assumptions
about whats an item of data, how to interpret
it, and so on - The relational data model was the first model
that is independent of its data structures and
implementation - A theory of normalization guides you in designing
relations - Concepts have been adapted to form
object-oriented data models, XML, etc. - There are sibling fields to databases with
semantic models - natural language (meanings of bags or lists of
words) - information retrieval (associations between words
and documents) - ontologies (inferences about relationships
between classes, and classes and subclasses of
relationships) - Well mostly focus on the relational model and
its descendants
14The DBMS Provides an Interfaceover the Database
- A database (DB) is a large, integrated collection
of data - Generally is cohesive in some way
- A DB models a real-world organization or unit
- A database management system (DBMS) is a software
package designed to store and manage databases - Reliable storage recovery of 100s of GB
- Querying/updating interface and API (for
applications and Web pages) - Support for many concurrent users
- Why do we need a DBMS, instead of coding in Java?
15DBMS Benefit 1 Generality and Declarativity
- Dont require the programmer or user to know
details like indices, sort orders, machine
speeds, disk speeds, concurrent users, etc. - Instead, the programmer/user programs with a
logical model in mind - The DBMS makes it happen based on an
understanding of relative costs of different
methods
16Benefit 2 Efficiency and Scale
- Size of personal address book is probably less
than 100 entries, but there are things we'd like
to do quickly and efficiently - Give me all appointments on 10/28
- When am I next meeting Jim?
- Program these as quickly as possible (and make
them resilient to data format changes) - Scale to a corporate calendar with hundreds of
thousands of entries
17Benefit 3 Management of Concurrency and
Reliability
- Suppose other people are allowed access to your
calendar and are allowed to modify it? How do we
stop two people changing the file at the same
time and leaving it in a physical (or logical)
mess? - Suppose the system crashes while we are changing
the calendar. How do we recover our work? - This requires a basic concept
18Transactions
- Key concept for concurrency is that of a
transaction an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry). - Key concept for recoverability is that of a log
keeping track of all actions carried out by the
db.
19Anatomy of a Typical DBMS
(Simplification!)
API/GUI
Query
Optimizer
Stats
Physical plan
Logging, Recovery,
Exec. Engine
Schemas
Catalog
Requests
Data/etc
Index/file/rec Mgr
Data/etc
Requests
Buffer Mgr
Red logical Blue physical
Pages
Pages
Storage Mgr
Requests
Data
Storage
20The Database Abstraction Provided by the DBMS
- We think of databases at two levels
- Logical structure
- What users/programmers see program or query
interface - Physical structure
- Organization on disk, indices, etc.
- The logical level is further split into
- Overall database design (conceptual seen by the
DB designer) - Views that various users get to see
21The Three-level Architecture forDatabases
View 1 View 2 View N
Schema
Logical, Conceptual Level
Physical Level (file organization, indexing)
22Data Independence
- A user of a relational database system should be
able to use the database without knowing about
how the precisely how data is stored, e.g. - After all, you don't worry about the IEEE
floating-point specs when you do division in a
Java program or with a calculator
SELECT When, Where FROM Calendar WHERE Who
Jane"
23More on Data Independence
- Logical data independence
- Protects the user from changes in the logical
structure of the data - could reorganize the calendar schema without
changing how we query it - Physical data independence
- Protects the user from changes in the physical
structure of data - could add an index on who (or sort by when)
without changing how the user would write the
query, but the query would execute faster (query
optimization)
24Presentation Layer (4th Tier) Data-Driven Web
Sites
HTML
view
Processing
- Data driven web sites also add an HTML
presentation layer on top of what weve seen - Model (data model) / View (query) /
Controller (GUI) - Or they use XML plus style sheets to get the
same effect
25An Issue 80 of the Worlds Data is Not in a
DB!
- Examples
- Scientific data (large images, complex programs
that analyze the data) - Personal data
- WWW and email (some of it is stored in something
resembling a DBMS) - Data management is expanding to tackle these
problems - Flexibility data management imposes many
constraints to make problems solvable - Must deal with entities outside our control
- In this course, well start by focusing on
databases, but eventually look outside the box
at the Web and at gluing together data from many
places
26Combining Databases with Mediators(a kind of
middleware)
Mediated Schema
XML
- A layer above the three-tiered architecture, to
combine multiple databases/sources on the Web - Some of these are databases over which we have no
control - Some must be accessed in special ways
- We generally need to think about how to translate
between different database formats - This problem of data integration is a particular
focus here at Penn (and other top research
universities)
27How Does One Build a Database?
- Start with a conceptual model
- Design implement schema
- Write applications using DBMS and other tools
- Many ways of doing this where the hard problems
are taken care of by other people (DBMS, API
writers, library authors, web server, etc.) - Common applications include PHP/JSP/servlet-driven
web sites - The DBMS takes care of query optimization and
execution
28Conceptual Design
PROFESSOR
fid
name
Teaches
Takes
STUDENT
COURSE
cid
name
semester
sid
name
29Designing a Schema (Set of Relations)
STUDENT
COURSE
Takes
PROFESSOR
Teaches
- Convert to tables constraints
- Then need to do physical design the layout on
disk, indices, etc.
30Applications Use Queries in SQL
- Structured Query Language, often embedded (e.g.,
in servlets, JSP, to some extent LINQ) - Based on restricted first-order logic expressions
over relations - Not procedural defines constraints on the
output - Converted into a query plan that exploits
properties run over the data by the query
optimizer and query execution engine
- lthtmlgt
- ltbodygt
- lt!-- hypotheticalEmbeddedSQL
- SELECT FROM STUDENT, Takes, COURSE
- WHERE STUDENT.sid Takes.sID
- AND Takes.cID cid
- --gt
- lt/bodygt
- lt/htmlgt
31Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
32DBMSs in the Real World
- A huge industry for 20 of the worlds data!
- Big, mature relational databases
- IBM DB2, Oracle, Microsoft SQL Server
- Adding advanced features, including native XML
support - Middleware above these systems
- SAP, Siebel, PeopleSoft, dozens of
special-purpose apps - Application servers, aka TP Monitors
- WebSphere, WebLogic, Tomcat,
- Support transactional applications in EJB, .NET,
etc. - Integration and warehousing systems
- BEA AquaLogic, DB2 Information Integrator
- Current trends
- Web services XML everywhere
- Smarter, self-tuning systems
33So What about Database Research?
- Not focusing on the problems of Oracle
- Understanding whats possible to do with XML
- Better query processing
- Better languages for meta-info (e.g.,
constraints) - Better interaction with the non-database world
- Data streams and sensors
- Peer-to-peer architectures and data cooperatives
- Integrating data from different formats
- Lots of theory and systems-building
- Youll see familiar concepts in this course from
operating systems and from complexity
theory/logic - And from programming languages, AI planning,
34In this Course...
- Study relational databases, their design, how to
query, what forms of indices to use. - Beyond relational algebra a logical model of
data (Datalog), recursion - XML and semi-structured data models
- Understanding DB internals
- How DBs are built
- Performance implications
- Integrating and mediating between databases (a
huge problem today)
35Your First Assignment
- Sign up to the Google group (theres a link on
the homepage) - Read the Codd paper
- Write a one-paragraph summary of the key ideas in
this paper and post to the discussion group