Zachary G. Ives - PowerPoint PPT Presentation

1 / 24
About This Presentation
Title:

Zachary G. Ives

Description:

Database and Information Systems! Instructor: Zachary Ives, zives_at_cis. 576 ... Lectures will be very discussion-oriented; about one topic area per week or two ... – PowerPoint PPT presentation

Number of Views:37
Avg rating:3.0/5.0
Slides: 25
Provided by: zack4
Category:
Tags: ives | modeling | zachary

less

Transcript and Presenter's Notes

Title: Zachary G. Ives


1
Introduction
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 650 Implementing Data Management Systems
  • September 4, 2008

2
Welcome to CIS 650, Database and Information
Systems!
  • Instructor Zachary Ives, zives_at_cis
  • 576 Levine Hall North
  • Office hours Wednesdays, 2PM
  • Home page www.cis.upenn.edu/zives/cis650/
  • Discussion group cis650-fall08_at_googlegroups.com
  • Texts and readings
  • Hellerstein Stonebraker Readings in Database
    Systems, 4th ed.
  • Most papers will be linked via the Web, but its
    often nice to have the book
  • Supplementary papers (will be linked via schedule)

3
Course Format and Prerequisites
  • Read classic and important research papers
  • Lectures will be very discussion-oriented about
    one topic area per week or two
  • Gain experience building some sort of data
    management engine and experimentally validating
    it this is a systems course
  • At end you should be equipped to do research in
    this field, or apply ideas from data management
    to your field
  • Prerequisites
  • Strong undergraduate DB course, or CIS 550
  • SQL, data modeling, basics of query optimization
    and execution, ACID,
  • Strong Java coding abilities

4
Grading
  • Summaries/commentary on papers (20)
  • Midterm report (25)
  • Take one of the topics weve discussed and write
    a summary and synthesis paper
  • Graded for organization, clarity, grammar, etc.
    as well as content
  • Project (50) team or individual
  • One focus a SIGMOD demo to build a smart
    research environment instrumented machines,
    labs, building (more shortly)
  • Implementation
  • Experimentation / validation
  • Project report (should be in the style of a
    research paper)
  • Brief (15-minute) presentation for each group /
    project
  • Participation, discussion, intangibles (5)

5
Potential Projects
  • Smart CIS integrate data from sensors,
    machines, power monitoring, calendars, etc. to
    build a queriable building, labs, machines
  • Goal demo at SIGMOD, possibly some research
    papers!
  • Cloud computing adapt a query processor to run
    on Hadoop
  • Sensors build a real app with Crossbow motes
  • Data visualizer help understand and manipulate
    data
  • Transformation reverse engineering create data
    instances to determine what a Perl or other tool
    is doing when converting from one format to
    another

6
So What Is This Course About?
  • Not how to build an Oracle-driven Web site
  • nor even how to build Oracle

7
What Is Unique about Data Management?
  • Its been said that databases and data management
    focus on scalability to huge volumes of data
  • What is it that makes this possible and what
    makes the work interesting if NOT at huge scale?

8
The Key Principle Data Independence
  • Most methods of programming dont separate the
    logical and physical representations of data
  • The data structures, access methods, etc. are all
    given via interfaces!
  • The relational data model was the first model for
    data that is independent of its data structures
    and implementation

9
What Is Data Independence?
  • Codd points out that previous methods had
  • Order dependence
  • Index dependence
  • Access path dependence
  • Still true in todays Java/C what is the
    drawback?
  • What might you be able to do in removing those?

10
The Relational Data Model
  • More than just tables!
  • True relations sets of tuples
  • The only data representation a user/programmer
    sees
  • Explicit encoding of everything in values
  • General and universal means of encoding
    everything!
  • Connections are explicitly represented as values
  • All semantics are pushed to queries
  • Additional integrity constraints
  • Key constraints, functional dependencies,
  • A secondary concept views
  • Define derived relations that are always live
  • A way of encapsulating, abstracting, protecting,
    integrating data

11
Constraints and Normalization
  • Fundamental idea we dont want to build
    semantics into the data model, but we want to be
    able to encode certain constraints
  • Functional dependencies, key constraints,
    foreign-key constraints, multivalued
    dependencies, join dependencies, etc.
  • Allows limited data validation, plus
    opportunities for optimization
  • The theory of normalization (see CSE 330, CIS
    550) makes use of known constraints
  • Idea eliminate redundancy, in order to maintain
    consistency in the presence of updates
  • (Note that theres no reason for normalization of
    data in views!)
  • Ergo, XML???

12
Relational Completeness(Plus Extensions)
Declarativity
  • What is special about relational query languages
    that makes them amenable to scalability?
  • Limited expressiveness particularly when we
    consider conjunctive queries (even with
    recursion)
  • Guaranteed polytime execution in size of data
  • Can reason about containment, invert them, etc.
  • Equivalence between relational calculus and
    algebra
  • Calculus ? fully declarative, basis of query
    languages
  • Algebra ? imperative but polytime, basis of
    runtime systems
  • Predictability of operations (in bulk) ? cost
    models
  • Ability to supplement data with auxiliary
    structures for performance
  • Interfaces to other external languages

13
Concurrency and Consistency
  • Traditionally, DB efforts provide ACID
    properties
  • Atomicity, Consistency, Isolation, Durability
  • Transaction an atomic sequence of database
    actions (read/write) on data items (e.g. calendar
    entry)
  • Recoverability via a log keeping track of all
    actions carried out by the database
  • But theres a cost to all of this!
  • How do distributed systems, Web services,
    service-oriented architectures, and the like
    affect these properties?
  • Well consider one relaxation of these
    properties the MapReduce / BigTable style of
    computing

14
Other Data Models
  • Concepts from the relational data model have been
    adapted to form object-oriented data models (with
    classes and subclasses), XML models, etc.
  • Doesnt this result in some loss of
    logical-physical independence?

15
What Is a Data Management System?
  • Of course, there are traditional databases
  • The focus of most work in the past 25 years
  • Tight loops due to locally controlled data
  • Indexing, transactions, concurrency, recovery,
    optimization
  • Also, today there are DB-like components in
  • Your email client and server (transactional
    storage)
  • Enterprise Java Beans (distributed transactions)
  • Google Base, BigTable, (distributed indexing,
    storage)
  • But

16
80 of the Worlds Data is Not in Databases!
  • Examples
  • Scientific data (large images, complex programs
    that analyze the data)
  • Personal data
  • WWW and email
  • Network traffic logs
  • Sensor data, network router data, stream data,
  • Are there benefits to declarative techniques and
    data independence in tackling these issues?
  • Need to deal with data we dont control and cant
    guarantee consistency over
  • In recent years increasing connection between
    databases, data integration, information
    retrieval, information extraction, sensors

17
Some Questions Well Consider
  • What are the right architectures for data
    sharing? How do they change as consistency needs
    (or other requirements) change?
  • How much can we abstract away heterogeneity,
    physical properties, etc.?
  • How do we get good performance from declarative
    queries?

18
Some Classes of Systems Well Consider
  • Databases
  • How do we optimize and execute queries or ensure
    ACID?
  • Data integration
  • How do we handle heterogeneity in data and
    meaning?
  • Data streams and sensor data
  • How do we process infinite amounts of data?
  • Cloud computing, Web search
  • How do we partition computation along 1000s of
    machines and achieve reliable execution?
  • Peer-to-peer architectures
  • Whats the best way of finding data?

19
Our Agenda this Semester
  • Reading the canonical papers in the data
    management literature, starting with databases
    and later going to other data management systems
  • Some are very systems-y
  • Some are very experimental
  • Some are highly algorithmic, complexity-oriented
  • Gaining an understanding of the principles of
    building systems to handle declarative queries
    over large volumes of data

20
Recap Query Answering in a Data Management System
  • Based on declarative query languages
  • Based on restricted first-order logic expressions
    over relations
  • Not procedural defines constraints on the
    output
  • Converted into a query plan that exploits
    properties run over the data by the query
    optimizer and query execution engine
  • Data may be local or remote
  • Data may be heterogeneous or homogeneous
  • Data sources may have different interfaces,
    access methods, etc.
  • Most common query languages
  • SQL (based on tuple relational calculus)
  • Datalog (based on domain relational calculus,
    plus fixpoint)
  • XQuery (functional language has an XML calculus
    core)

21
Recap Layers of a Typical Data Management
System
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Logging, recovery
Schemas
Catalog
Requests
Data/etc
Access Methods
Data/etc
Requests
Buffer Mgr
Red logical Blue physical
Pages
Pages
Physical retrieval
Requests
Data
Source
22
Processing the Query
Web Server / UI / etc
Execution Engine
Optimizer
Storage Subsystem
SELECT FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid Takes.sID AND
Takes.cID cid
23
DBMSs in the Real World
  • Big, mature relational databases
  • IBM, Oracle, Microsoft
  • Middleware above these
  • SAP, PeopleSoft, dozens of special-purpose apps
  • Application servers
  • Integration and warehousing systems
  • DB2 Integrator, Oracle Fusion,
  • Current trends
  • Web services XML everywhere
  • Smarter, self-tuning systems (AutoAdmin, )
  • Stream systems (Vertica, Microsoft, IBM)

24
For Next Time
  • Skim Codd if you havent already
  • Read the overview papers of the two first
    database systems
  • Astrahan et al., pp. 117-
  • Wong et al. (skip Section 2 focus on pp. 200-)
  • Write a summary of your assigned paper and post
    to the Google Group cis650-fall08_at_googlegroups.c
    om
  • Key question how well did this system mesh with
    Codds relational model? (You may need to skim
    through other aspects of your assigned paper to
    help answer that question)
Write a Comment
User Comments (0)
About PowerShow.com