Title: Zachary G' Ives
1Introduction
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Implementing Data Management Systems
- January 10, 2005
2Welcome to CIS 650, Database and Information
Systems!
- Instructor Zachary Ives, zives_at_cis
- 576 Levine Hall North
- Office hours Tuesday, 230-330PM (before
colloquium) - Home page www.seas.upenn.edu/zives/cis650/
- Texts and readings
- Hellerstein and Stonebraker Readings in Database
Systems, 4th ed. - (Should be available soon)
- Supplementary papers (will be linked via schedule)
3Course Format and Grading
- Very discussion-oriented about one topic area
per week or two - Readings in the text other research papers
summaries/commentary on papers (20) - Midterm report (25)
- Youll take one of the topics weve discussed and
write a summary and synthesis paper - Graded for organization, clarity, grammar, etc.
as well as content - Project (50) -- may choose to work in teams
- Implementation
- Experimentation / validation
- Project report (should be in the style of a
research paper) - Brief (15-minute) presentation
- Participation, discussion, intangibles (5)
- At the end, you should be equipped to do research
in this field, or to take ideas from databases
and apply them to your field
4So What Is This Course About?
- Not how to build an Oracle-driven Web site
- nor even how to build Oracle
5What Is Unique about Data Management?
- Its been said that databases and data management
focus on scalability to huge volumes of data - What is it that makes this possible and what
makes the work interesting if NOT at huge scale? - Why are data management techniques useful in
situations where scale isnt the bottleneck?
6The Key Principle Data Independence
- Most methods of programming dont separate the
logical and physical representations of data - The data structures, access methods, etc. are all
given via interfaces! - The relational data model was the first model for
data that is independent of its data structures
and implementation
7What Is Data Independence?
- Codd points out that previous methods had
- Order dependence
- Index dependence
- Access path dependence
- Still true in todays Java/C what is the
drawback? - What might you be able to do in removing those?
8The Relational Data Model
- More than just tables!
- True relations sets of tuples
- The only data representation a user/programmer
sees - Explicit encoding of everything in values
- Additional integrity constraints
- Key constraints, functional dependencies,
- General and universal means of encoding
everything! - (Semantics are pushed to queries)
- A secondary concept views
- Define virtual, derived relations that are always
live - A way of encapsulating, abstracting data
9Constraints and Normalization
- Fundamental idea we dont want to build
semantics into the data model, but we want to be
able to encode certain constraints - Functional dependencies, key constraints,
foreign-key constraints, multivalued
dependencies, join dependencies, etc. - Allows limited data validation, plus
opportunities for optimization - The theory of normalization (see CSE 330, CIS
550) makes use of known constraints - Idea eliminate redundancy, in order to maintain
consistency in the presence of updates - (Note that theres no reason for normalization of
data in views!) - Ergo, XML???
10Relational Completeness(Plus Extensions)
Declarativity
- What is special about relational query languages
that makes them amenable to scalability? - Limited expressiveness particularly when we
consider conjunctive queries (even with
recursion) - Guaranteed polytime execution in size of data
- Can reason about containment, invert them, etc.
- Magic sets
- (What about XQuerys Turing-completeness???)
- Equivalence between relational calculus and
algebra - Calculus ? fully declarative, basis of query
languages - Algebra ? imperative but polytime, basis of
runtime systems - Predictability of operations ? cost models
- Ability to supplement data with auxiliary
structures for performance
11Concurrency and Reliability(Generally requires
full control)
- Another key element of databases ACID
properties - Atomicity, Consistency, Isolation, Durability
- Transaction an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry) - Recoverability via a log keeping track of all
actions carried out by the database - How do distributed systems, Web services,
service-oriented architectures, and the like
affect these properties?
12Other Data Models
- Concepts from the relational data model have been
adapted to form object-oriented data models (with
classes and subclasses), XML models, etc. - But doesnt this result in some loss of
logical-physical independence? - GMAP and answering queries using views?
13What Is a Data Management System?
- Of course, there are traditional databases
- The focus of most work in the past 25 years
- Tight loops due to locally controlled data
- Indexing, transactions, concurrency, recovery,
optimization - But
1480 of the Worlds Data is Not in Databases!
- Examples
- Scientific data (large images, complex programs
that analyze the data) - Personal data
- WWW and email (some of it is stored in something
resembling a DBMS) - Network traffic logs
- Sensor data
- Are there benefits to declarative techniques and
data independence in tackling these issues? - XML is a great way to make this data available
- Also need to deal with data we dont control and
cant guarantee consistency over
15An Example of Data Management with Heterogeneity
Data Integration
Mediated Schema
XML
- A layer above heterogeneous sources, to combine
them under a unified logical abstraction - Some of these are databases over which we have no
control - Some must be accessed in special ways
- Data integration system translates queries over
mediated schema to the languages of the sources
converts answers to mediated schema
16Other Interesting Points
- Data streams and sensor data
- How do we process infinite amounts of data?
- Peer-to-peer architectures
- Whats the best way of finding data here?
- Personal information management
- Can we use integration-style concepts and a bit
of AI to manage associations between our data? - Web search
- Whats the back-end behind Google?
- Semantic Web
- How do we semantically interrelate data to build
a better Web?
17Layers of a Typical Data Management System
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Logging, recovery
Schemas
Catalog
Requests
Data/etc
Access Methods
Data/etc
Requests
Buffer Mgr
Red logical Blue physical
Pages
Pages
Physical retrieval
Requests
Data
Source
18Query Answering in a Data Management System
- Based on declarative query languages
- Based on restricted first-order logic expressions
over relations - Not procedural defines constraints on the
output - Converted into a query plan that exploits
properties run over the data by the query
optimizer and query execution engine - Data may be local or remote
- Data may be heterogeneous or homogeneous
- Data sources may have different interfaces,
access methods, etc. - Most common query languages
- SQL (based on tuple relational calculus)
- Datalog (based on domain relational calculus,
plus fixpoint) - XQuery (functional language has an XML calculus
core)
19Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
20DBMSs in the Real World
- Big, mature relational databases
- IBM, Oracle, Microsoft
- Middleware above these
- SAP, PeopleSoft, dozens of special-purpose apps
- Application servers
- Integration and warehousing systems
- Current trends
- Web services XML everywhere
- Smarter, self-tuning systems
- Stream systems
21Our Agenda this Semester
- Reading the canonical papers in the data
management literature - Some are very systems-y
- Some are very experimental
- Some are highly algorithmic, complexity-oriented
- Gaining an understanding of the principles of
building systems to handle declarative queries
over large volumes of data
22For Next Time
- Skim Codd if you havent already
- Read the overview papers of the two first
database systems - Astrahan et al., pp. 117-
- Wong et al. (skip Section 2 focus on pp. 200-)
- Write a summary of your assigned paper and email
it to me at zives_at_cis - Key question how well did this system mesh with
Codds relational model? (You may need to skim
through other aspects of your assigned paper to
help answer that question)