Zachary G. Ives - PowerPoint PPT Presentation

1 / 24

About This Presentation

Title:

Zachary G. Ives

Description:

Database and Information Systems! Instructor: Zachary Ives, zives_at_cis. 576 ... Lectures will be very discussion-oriented; about one topic area per week or two ... – PowerPoint PPT presentation

Number of Views:37

Avg rating:3.0/5.0

Slides: 25

Provided by: zack4

Learn more at: https://www.seas.upenn.edu

Category:

more less

Transcript and Presenter's Notes

Title: Zachary G. Ives

1
Introduction

Zachary G. Ives
University of Pennsylvania
CIS 650 Implementing Data Management Systems
September 4, 2008

2
Welcome to CIS 650, Database and Information
Systems!

Instructor Zachary Ives, zives_at_cis
576 Levine Hall North
Office hours Wednesdays, 2PM
Home page www.cis.upenn.edu/zives/cis650/
Discussion group cis650-fall08_at_googlegroups.com
Texts and readings
Hellerstein Stonebraker Readings in Database
Systems, 4th ed.
Most papers will be linked via the Web, but its
often nice to have the book
Supplementary papers (will be linked via schedule)

3
Course Format and Prerequisites

Read classic and important research papers
Lectures will be very discussion-oriented about
one topic area per week or two
Gain experience building some sort of data
management engine and experimentally validating
it this is a systems course
At end you should be equipped to do research in
this field, or apply ideas from data management
to your field
Prerequisites
Strong undergraduate DB course, or CIS 550
SQL, data modeling, basics of query optimization
and execution, ACID,
Strong Java coding abilities

4
Grading

Summaries/commentary on papers (20)
Midterm report (25)
Take one of the topics weve discussed and write
a summary and synthesis paper
Graded for organization, clarity, grammar, etc.
as well as content
Project (50) team or individual
One focus a SIGMOD demo to build a smart
research environment instrumented machines,
labs, building (more shortly)
Implementation
Experimentation / validation
Project report (should be in the style of a
research paper)
Brief (15-minute) presentation for each group /
project
Participation, discussion, intangibles (5)

5
Potential Projects

Smart CIS integrate data from sensors,
machines, power monitoring, calendars, etc. to
build a queriable building, labs, machines
Goal demo at SIGMOD, possibly some research
papers!
Cloud computing adapt a query processor to run
on Hadoop
Sensors build a real app with Crossbow motes
Data visualizer help understand and manipulate
data
Transformation reverse engineering create data
instances to determine what a Perl or other tool
is doing when converting from one format to
another

6
So What Is This Course About?

Not how to build an Oracle-driven Web site
nor even how to build Oracle

7
What Is Unique about Data Management?

Its been said that databases and data management
focus on scalability to huge volumes of data
What is it that makes this possible and what
makes the work interesting if NOT at huge scale?

8
The Key Principle Data Independence

Most methods of programming dont separate the
logical and physical representations of data
The data structures, access methods, etc. are all
given via interfaces!
The relational data model was the first model for
data that is independent of its data structures
and implementation

9
What Is Data Independence?

Codd points out that previous methods had
Order dependence
Index dependence
Access path dependence
Still true in todays Java/C what is the
drawback?
What might you be able to do in removing those?

10
The Relational Data Model

More than just tables!
True relations sets of tuples
The only data representation a user/programmer
sees
Explicit encoding of everything in values
General and universal means of encoding
everything!
Connections are explicitly represented as values
All semantics are pushed to queries
Additional integrity constraints
Key constraints, functional dependencies,
A secondary concept views
Define derived relations that are always live
A way of encapsulating, abstracting, protecting,
integrating data

11
Constraints and Normalization

Fundamental idea we dont want to build
semantics into the data model, but we want to be
able to encode certain constraints
Functional dependencies, key constraints,
foreign-key constraints, multivalued
dependencies, join dependencies, etc.
Allows limited data validation, plus
opportunities for optimization
The theory of normalization (see CSE 330, CIS
550) makes use of known constraints
Idea eliminate redundancy, in order to maintain
consistency in the presence of updates
(Note that theres no reason for normalization of
data in views!)
Ergo, XML???

12
Relational Completeness(Plus Extensions)
Declarativity

What is special about relational query languages
that makes them amenable to scalability?
Limited expressiveness particularly when we
consider conjunctive queries (even with
recursion)
Guaranteed polytime execution in size of data
Can reason about containment, invert them, etc.
Equivalence between relational calculus and
algebra
Calculus ? fully declarative, basis of query
languages
Algebra ? imperative but polytime, basis of
runtime systems
Predictability of operations (in bulk) ? cost
models
Ability to supplement data with auxiliary
structures for performance
Interfaces to other external languages

13
Concurrency and Consistency

Traditionally, DB efforts provide ACID
properties
Atomicity, Consistency, Isolation, Durability
Transaction an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry)
Recoverability via a log keeping track of all
actions carried out by the database
But theres a cost to all of this!
How do distributed systems, Web services,
service-oriented architectures, and the like
affect these properties?
Well consider one relaxation of these
properties the MapReduce / BigTable style of
computing

14
Other Data Models

Concepts from the relational data model have been
adapted to form object-oriented data models (with
classes and subclasses), XML models, etc.
Doesnt this result in some loss of
logical-physical independence?

15
What Is a Data Management System?

Of course, there are traditional databases
The focus of most work in the past 25 years
Tight loops due to locally controlled data
Indexing, transactions, concurrency, recovery,
optimization
Also, today there are DB-like components in
Your email client and server (transactional
storage)
Enterprise Java Beans (distributed transactions)
Google Base, BigTable, (distributed indexing,
storage)
But

16
80 of the Worlds Data is Not in Databases!

Examples
Scientific data (large images, complex programs
that analyze the data)
Personal data
WWW and email
Network traffic logs
Sensor data, network router data, stream data,
Are there benefits to declarative techniques and
data independence in tackling these issues?
Need to deal with data we dont control and cant
guarantee consistency over
In recent years increasing connection between
databases, data integration, information
retrieval, information extraction, sensors

17
Some Questions Well Consider

What are the right architectures for data
sharing? How do they change as consistency needs
(or other requirements) change?
How much can we abstract away heterogeneity,
physical properties, etc.?
How do we get good performance from declarative
queries?

18
Some Classes of Systems Well Consider

Databases
How do we optimize and execute queries or ensure
ACID?
Data integration
How do we handle heterogeneity in data and
meaning?
Data streams and sensor data
How do we process infinite amounts of data?
Cloud computing, Web search
How do we partition computation along 1000s of
machines and achieve reliable execution?
Peer-to-peer architectures
Whats the best way of finding data?

19
Our Agenda this Semester

Reading the canonical papers in the data
management literature, starting with databases
and later going to other data management systems
Some are very systems-y
Some are very experimental
Some are highly algorithmic, complexity-oriented
Gaining an understanding of the principles of
building systems to handle declarative queries
over large volumes of data

20
Recap Query Answering in a Data Management System

Based on declarative query languages
Based on restricted first-order logic expressions
over relations
Not procedural defines constraints on the
output
Converted into a query plan that exploits
properties run over the data by the query
optimizer and query execution engine
Data may be local or remote
Data may be heterogeneous or homogeneous
Data sources may have different interfaces,
access methods, etc.
Most common query languages
SQL (based on tuple relational calculus)
Datalog (based on domain relational calculus,
plus fixpoint)
XQuery (functional language has an XML calculus
core)

21
Recap Layers of a Typical Data Management
System
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Logging, recovery
Schemas
Catalog
Requests
Data/etc
Access Methods
Data/etc
Requests
Buffer Mgr
Red logical Blue physical
Pages
Pages
Physical retrieval
Requests
Data
Source
22
Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
23
DBMSs in the Real World

Big, mature relational databases
IBM, Oracle, Microsoft
Middleware above these
SAP, PeopleSoft, dozens of special-purpose apps
Application servers
Integration and warehousing systems
DB2 Integrator, Oracle Fusion,
Current trends
Web services XML everywhere
Smarter, self-tuning systems (AutoAdmin, )
Stream systems (Vertica, Microsoft, IBM)

24
For Next Time

Skim Codd if you havent already
Read the overview papers of the two first
database systems
Astrahan et al., pp. 117-
Wong et al. (skip Section 2 focus on pp. 200-)
Write a summary of your assigned paper and post
to the Google Group cis650-fall08_at_googlegroups.c
om
Key question how well did this system mesh with
Codds relational model? (You may need to skim
through other aspects of your assigned paper to
help answer that question)