Title: Zachary G. Ives
1Introduction
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Implementing Data Management Systems
- September 4, 2008
2Welcome to CIS 650, Database and Information
Systems!
- Instructor Zachary Ives, zives_at_cis
- 576 Levine Hall North
- Office hours Wednesdays, 2PM
- Home page www.cis.upenn.edu/zives/cis650/
- Discussion group cis650-fall08_at_googlegroups.com
- Texts and readings
- Hellerstein Stonebraker Readings in Database
Systems, 4th ed. - Most papers will be linked via the Web, but its
often nice to have the book - Supplementary papers (will be linked via schedule)
3Course Format and Prerequisites
- Read classic and important research papers
- Lectures will be very discussion-oriented about
one topic area per week or two - Gain experience building some sort of data
management engine and experimentally validating
it this is a systems course - At end you should be equipped to do research in
this field, or apply ideas from data management
to your field - Prerequisites
- Strong undergraduate DB course, or CIS 550
- SQL, data modeling, basics of query optimization
and execution, ACID, - Strong Java coding abilities
4Grading
- Summaries/commentary on papers (20)
- Midterm report (25)
- Take one of the topics weve discussed and write
a summary and synthesis paper - Graded for organization, clarity, grammar, etc.
as well as content - Project (50) team or individual
- One focus a SIGMOD demo to build a smart
research environment instrumented machines,
labs, building (more shortly) - Implementation
- Experimentation / validation
- Project report (should be in the style of a
research paper) - Brief (15-minute) presentation for each group /
project - Participation, discussion, intangibles (5)
5Potential Projects
- Smart CIS integrate data from sensors,
machines, power monitoring, calendars, etc. to
build a queriable building, labs, machines - Goal demo at SIGMOD, possibly some research
papers! - Cloud computing adapt a query processor to run
on Hadoop - Sensors build a real app with Crossbow motes
- Data visualizer help understand and manipulate
data - Transformation reverse engineering create data
instances to determine what a Perl or other tool
is doing when converting from one format to
another
6So What Is This Course About?
- Not how to build an Oracle-driven Web site
- nor even how to build Oracle
7What Is Unique about Data Management?
- Its been said that databases and data management
focus on scalability to huge volumes of data - What is it that makes this possible and what
makes the work interesting if NOT at huge scale?
8The Key Principle Data Independence
- Most methods of programming dont separate the
logical and physical representations of data - The data structures, access methods, etc. are all
given via interfaces! - The relational data model was the first model for
data that is independent of its data structures
and implementation
9What Is Data Independence?
- Codd points out that previous methods had
- Order dependence
- Index dependence
- Access path dependence
- Still true in todays Java/C what is the
drawback? - What might you be able to do in removing those?
10The Relational Data Model
- More than just tables!
- True relations sets of tuples
- The only data representation a user/programmer
sees - Explicit encoding of everything in values
- General and universal means of encoding
everything! - Connections are explicitly represented as values
- All semantics are pushed to queries
- Additional integrity constraints
- Key constraints, functional dependencies,
- A secondary concept views
- Define derived relations that are always live
- A way of encapsulating, abstracting, protecting,
integrating data
11Constraints and Normalization
- Fundamental idea we dont want to build
semantics into the data model, but we want to be
able to encode certain constraints - Functional dependencies, key constraints,
foreign-key constraints, multivalued
dependencies, join dependencies, etc. - Allows limited data validation, plus
opportunities for optimization - The theory of normalization (see CSE 330, CIS
550) makes use of known constraints - Idea eliminate redundancy, in order to maintain
consistency in the presence of updates - (Note that theres no reason for normalization of
data in views!) - Ergo, XML???
12Relational Completeness(Plus Extensions)
Declarativity
- What is special about relational query languages
that makes them amenable to scalability? - Limited expressiveness particularly when we
consider conjunctive queries (even with
recursion) - Guaranteed polytime execution in size of data
- Can reason about containment, invert them, etc.
- Equivalence between relational calculus and
algebra - Calculus ? fully declarative, basis of query
languages - Algebra ? imperative but polytime, basis of
runtime systems - Predictability of operations (in bulk) ? cost
models - Ability to supplement data with auxiliary
structures for performance - Interfaces to other external languages
13Concurrency and Consistency
- Traditionally, DB efforts provide ACID
properties - Atomicity, Consistency, Isolation, Durability
- Transaction an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry) - Recoverability via a log keeping track of all
actions carried out by the database - But theres a cost to all of this!
- How do distributed systems, Web services,
service-oriented architectures, and the like
affect these properties? - Well consider one relaxation of these
properties the MapReduce / BigTable style of
computing
14Other Data Models
- Concepts from the relational data model have been
adapted to form object-oriented data models (with
classes and subclasses), XML models, etc. - Doesnt this result in some loss of
logical-physical independence?
15What Is a Data Management System?
- Of course, there are traditional databases
- The focus of most work in the past 25 years
- Tight loops due to locally controlled data
- Indexing, transactions, concurrency, recovery,
optimization - Also, today there are DB-like components in
- Your email client and server (transactional
storage) - Enterprise Java Beans (distributed transactions)
- Google Base, BigTable, (distributed indexing,
storage) - But
1680 of the Worlds Data is Not in Databases!
- Examples
- Scientific data (large images, complex programs
that analyze the data) - Personal data
- WWW and email
- Network traffic logs
- Sensor data, network router data, stream data,
- Are there benefits to declarative techniques and
data independence in tackling these issues? - Need to deal with data we dont control and cant
guarantee consistency over - In recent years increasing connection between
databases, data integration, information
retrieval, information extraction, sensors
17Some Questions Well Consider
- What are the right architectures for data
sharing? How do they change as consistency needs
(or other requirements) change? - How much can we abstract away heterogeneity,
physical properties, etc.? - How do we get good performance from declarative
queries?
18Some Classes of Systems Well Consider
- Databases
- How do we optimize and execute queries or ensure
ACID? - Data integration
- How do we handle heterogeneity in data and
meaning? - Data streams and sensor data
- How do we process infinite amounts of data?
- Cloud computing, Web search
- How do we partition computation along 1000s of
machines and achieve reliable execution? - Peer-to-peer architectures
- Whats the best way of finding data?
19Our Agenda this Semester
- Reading the canonical papers in the data
management literature, starting with databases
and later going to other data management systems - Some are very systems-y
- Some are very experimental
- Some are highly algorithmic, complexity-oriented
- Gaining an understanding of the principles of
building systems to handle declarative queries
over large volumes of data
20Recap Query Answering in a Data Management System
- Based on declarative query languages
- Based on restricted first-order logic expressions
over relations - Not procedural defines constraints on the
output - Converted into a query plan that exploits
properties run over the data by the query
optimizer and query execution engine - Data may be local or remote
- Data may be heterogeneous or homogeneous
- Data sources may have different interfaces,
access methods, etc. - Most common query languages
- SQL (based on tuple relational calculus)
- Datalog (based on domain relational calculus,
plus fixpoint) - XQuery (functional language has an XML calculus
core)
21Recap Layers of a Typical Data Management
System
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Logging, recovery
Schemas
Catalog
Requests
Data/etc
Access Methods
Data/etc
Requests
Buffer Mgr
Red logical Blue physical
Pages
Pages
Physical retrieval
Requests
Data
Source
22Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
23DBMSs in the Real World
- Big, mature relational databases
- IBM, Oracle, Microsoft
- Middleware above these
- SAP, PeopleSoft, dozens of special-purpose apps
- Application servers
- Integration and warehousing systems
- DB2 Integrator, Oracle Fusion,
- Current trends
- Web services XML everywhere
- Smarter, self-tuning systems (AutoAdmin, )
- Stream systems (Vertica, Microsoft, IBM)
24For Next Time
- Skim Codd if you havent already
- Read the overview papers of the two first
database systems - Astrahan et al., pp. 117-
- Wong et al. (skip Section 2 focus on pp. 200-)
- Write a summary of your assigned paper and post
to the Google Group cis650-fall08_at_googlegroups.c
om - Key question how well did this system mesh with
Codds relational model? (You may need to skim
through other aspects of your assigned paper to
help answer that question)