Title: Database Systems
1Database Systems
2Management Challenges
- Organizational obstacles to a database
environment - Integrating data and ensuring quality
3OS Support for Data Management
- Data can be stored in RAM
- this is what every programming language offers!
- RAM is fast, and random access
- Every OS includes a File System
- manages files on a magnetic disk
- allows open, read, seek, close on a file
- allows protections to be set on a file
- drawbacks relative to RAM?
4COMMUNICATIONS
PROCESSING
Control
A/L
Register
OUTPUT
INPUT
Memory
Secondary STORAGE
5Memory Hierarchy
Register
Cache
MEMORY
RAM
SECONDARY STORAGE
Magnetic Tapes Magnetic Disks Optical Disks Zip,
PCMCIA
6File Organization
- Bit Smallest unit of data binary digit (0,1)
- Byte Group of bits that represents a single
character - Field Group of words or complete number
- Record Group of related fields
- File Group of records of the same type
7File Organization
- FILE
- A Collection of Similar RECORDS
- Contains data about similar objects
- DATABASE
- An Organizations Electronic Library of FILES
- A repository containing organizational data
8File Organization
- Entity person, place, thing, event about which
data must be kept - Attribute description of a particular entity
- Key field field used to retrieve, update, sort
record
9Entities and Attributes
10Key field
- Field in each record
- Uniquely identifies this record for
- Retrieval
- Updating
- Sorting
11Sequential vs. Direct File Organization
- Sequential
- Tape oriented
- One record follows another
- Follows physical sequence
- Direct
- Disk oriented
- Can be accessed without regard to physical
sequence
12Direct File Access-Indexing
- Indexed sequential access method (ISAM)
- Each record identified by key
- Grouped in blocks and cylinders
- Keys in index
- Virtual storage access method (VSAM)
- Memory divided into areas and intervals
- Dynamic file space
- Btree widely used for relational databases
- Direct file access method
13The Indexed Sequential Access Method
14Direct or Random File Access Method
- Each record has key field
- Key field fed into transform algorithm
- Algorithm generates physical storage location of
record (record address)
15The Direct File Access Method
16Traditional File Environment (Flat File)
- Data redundancy The presence of duplicate data
in multiple data files so that the same data are
stored in more than one place or location - Data inconsistency The same attribute may have
different values.
17Traditional File Environment (Flat File)
- Program-data dependence
- The coupling of data stored in files and the
specific programs required to update and maintain
those files - changes in programs require changes to the data
- Lack of flexibility
- can deliver routine scheduled reports after
extensive programming efforts, but - it cannot deliver ad-hoc reports or
- respond to unanticipated information requirements
in a timely fashion.
18Traditional File Environment (Flat File)
- Poor security
- little control or management of data,
- management will have no knowledge of
- who is accessing or
- even making changes to the organizations data.
- Lack of data sharing and availability
- Information cannot flow freely across different
functional areas or different parts of the
organization. - Users find different values of the same piece of
information in two different systems, and - hence they may not use these systems because they
cannot trust the accuracy of the data.
19Traditional File Processing
20Example A Bank
Checking File
Reports
Savings File
Reports
Investment File
Reports
Loan File
Reports
Home loan system
21Example A Bank
Checking Acct. Applications.
D B M S
Saving Acct. Applications
Securities Acct. Apps
Ad hoc Applications
By using a DBMS, the Bank can quickly and easily
create customer profiles
22Database
- Collection of centralized data
- Controls redundant data
- Data stored so as to appear to users in one
location - Services multiple applications
23The Contemporary Database Environment
24Database Management System (DBMS)
- Software for creating and maintaining databases
- Permits firms to rationally manage data for the
entire firm - Acts as interface between application programs
and physical data files - Separates logical and design views of data
- Solves many problems of the traditional data file
approach - Examples
- Oracle
- DB2
- Microsoft SQL Server
- Sybase
25DBMS Features
- Optimize Queries
- Manage memory
- Control concurrent data access
26Elements of Database Environment
27DB Stakeholders
- DBMS vendors, programmers
- Oracle, IBM, Microsoft, Sybase, Informix,
- End users in many fields
- Business, education, science,
- DB application programmers
- Build data entry and analysis tools on top of
DBMSs - Build web services that run off DBMSs
- Database administrators (DBAs)
- Design logical/physical schemas
- Handle security and authorization
- Data availability, crash recovery
- Database tuning as needs evolve
28Database Administration
- Defines and organizes database structure and
content - Develops security procedures
- Develops database documentation
- Maintains DBMS
29Three Components to a DBMS
- Data definition language Specifies content and
structure of database and defines each data
element - Data manipulation language Used to process data
in a database, e.g. SQL - Data dictionary Stores definitions of data
elements and data characteristics
30Advantages of DBMS
- Reduces development / maintenance costs
- Enhances system flexibility
- Increases access / availability of information
- Allows central management of data, data use, and
security - Improved data security
- Improved data integrity
- Uncouples programs from data
- Improved programmer productivity
- Improved data sharing
- Reduces data redundancy
- Eliminates data inconsistency
31Do not use a DBMS when
- The initial investment in hardware, software, and
training is too high - The generality a DBMS provides is not needed
- The overhead for security, concurrency control,
and recovery is too high - Data and applications are simple and stable
- Real-time requirements cannot be met by it
- Multiple user access is not needed
32Databases Model the Real World
- Data Model allows us to translate real world
objects into structures computers can store - Many models
- Relational,
- E-R,
- O-O,
- Network,
- Hierarchical, etc.
- Relational
- Rows and Columns
- Keys and Foreign Keys to link Relations
33Hierarchical DBMS
- Organizes data in a tree-like structure
- Supports one-to-many parent-child relationships
- Prevalent in large legacy systems
34Hierarchical Data Model
35Pointer
- Field in one record is address of next record in
sequence
36Types of Relations
37Network Data Model
- Depicts data logically as many-to-many
relationships - A coordination variation of hierarchical model
- Useful for many-to-many relationships
38The Relational Data Model
39Object-Oriented Model(s)
- Based on the object-oriented paradigm,
- e.g., C, Java, smalltalk
- Object-oriented DBMS Stores data and procedures
as objects that can be retrieved and shared
automatically - Object-relational DBMS Provides capabilities of
both object-oriented and relational DBMS
40Comparison of Database Alternatives
- Hierarchical
- Processing Efficiency High
- Flexibility Low
- User Friendliness Low
- Program Complexity High
41Comparison of Database Alternatives
- Network
- Processing Efficiency Medium / High
- Flexibility Low / Medium
- User Friendliness Low / Moderate
- Program Complexity High
42Comparison of Database Alternatives
- Relational
- Processing Efficiency Low But Improving
- Flexibility High
- User Friendliness High
- Program Complexity Low
43Structure of a Relational DBMS
- A typical DBMS has a layered architecture.
- Concurrency control and recovery components not
shown. - Each system has its own variations.
44Describing Data Data Models
- A data model is a collection of concepts for
describing data. - A schema is a description of a particular
collection of data, using the given data model. - The relational model of data is the most widely
used model today. - Main concept relation, basically a table with
rows and columns. - Every relation has a schema, which describes the
columns, or fields.
45Data Modeling
DATABASE SYSTEM
- REALITY
- structures
- processes
MODEL
data modeling
- The model represents a perception of structures
of reality - The data modeling process is to fix a perception
of structures of reality and represent this
perception - In the data modeling process we select aspects
and we abstract
46Process Modeling
DATABASE SYSTEM
- REALITY
- structures
- processes
MODEL
process modeling
- The use of the model reflects processes of
reality - Processes may be represented by programs with
embedded database queries and updates - Processes may be represented by ad-hoc database
queries and updates at run-time
DML
PROG
DML
47Levels of Abstraction
- Views describe how users see the data.
- Conceptual schema defines logical structure
- Physical schema describes the files and indexes
used. - Sometimes called the ANSI/SPARC model
48Example University Database
- Conceptual schema
- Students(sid string, name string, login
string, age integer, gpareal) - Courses(cid string, cnamestring,
creditsinteger) - Enrolled(sidstring, cidstring, gradestring)
- Physical schema
- Relations stored as unordered files.
- Index on first column of Students.
- External Schema (View)
- Course_info(cidstring,enrollmentinteger)
49Data Independence
- Applications are insulated from how data is
structured and stored. - Logical data independence Protection from
changes in logical structure of data. - Physical data independence Protection from
changes in physical structure of data. - Q Why is this particularly important for DBMS?
- A Because databases and their associated
applications persist.
50Concurrency Control
- Concurrent execution of user programs a key to
good DBMS performance. - Disk accesses frequent, pretty slow
- Keep the CPU working on several programs
concurrently. - Interleaving actions of different programs
trouble! - e.g., deposit and withdrawal on same account
- DBMS ensures such problems dont arise
- users can pretend they are using a single-user
system. (called Isolation)
51Transaction An Execution of a DB Program
- Key concept is a transaction an atomic sequence
of database actions (reads/writes). - Each transaction, executed completely, must take
the DB between consistent states. - Users can specify simple integrity constraints on
the data. The DBMS enforces these. - Beyond this, the DBMS does not understand the
semantics of the data. - Ensuring that a single transaction (run alone)
preserves consistency is ultimately the users
responsibility!
52Advanced DBMS Features
- Integrity maintenance
- Entity Integrity (not null, uniqueness)
- Referential Integrity (foreign key references)
53Advanced DBMS Features
- Access Control
- Grant and Revoke
- Access to tables
- Capability to perform operations (Insert, update,
delete, etc.)
54Desirable ACID Properties
- Atomicity - Transactions are atomic
- Consistency - Transformations preserve database
consistency - Isolation - concurrent users updates dont
interfere - Durability - Once a transaction commits, its
update survives (a system crash, not a hard drive
failure)
55Ensuring Transaction Properties
- DBMS ensures atomicity (all-or-nothing property)
even if system crashes in the middle of a
transaction. - DBMS ensures durability of committed transactions
even if system crashes. - Idea Keep a log (history) of all actions carried
out by the DBMS while executing a set of
transactions - Before a change is made to the database, the
corresponding log entry is forced to a safe
location. - After a crash, using the log, the effects of
- partially executed transactions are undone
- committed transactions are redone.
56Scheduling Concurrent Transactions
- DBMS ensures that execution of T1, .. , Tn is
equivalent to some serial execution T1 ... Tn. - Before reading/writing an object, a transaction
requests a lock on the object, and waits until
the DBMS gives it the lock. All locks are held
until the end of the transaction. - Idea If an action of Ti (say, writing X) affects
Tj (which perhaps reads X), one of them, say Ti,
will obtain the lock on X first and Tj is forced
to wait until Ti completes this effectively
orders the transactions. - What if Tj already has a lock on Y and Ti later
requests a lock on Y? (Deadlock!) Ti or Tj is
aborted and restarted!
57Deadlocks
- When concurrent transactions hold locks on common
objects - A and B want to update both x and y
- A locks x, B locks y, neither can complete the
transaction - Solution
- Select one transaction as a victim and roll it
back - Deadlock detection and resolution is built into
DBMS
58Database Design
- Requirements Analysis
- user needs what must the database do?
- Conceptual Design
- Abstract model of database from a business
perspective (often done with ER model) - Logical Design
- translate ER into DBMS data model
- Schema Refinement
- consistency, normalization
- Physical Design
- Detailed description of business information
needs - indexes, disk layout
- Security Design
- who accesses what
59Conceptual Design
- What are the entities and relationships in the
enterprise? - What information about these entities and
relationships should we store in the database? - What are the integrity constraints or business
rules that hold? - A database schema in the ER Model can be
represented pictorially (ER diagrams). - Can map an ER diagram into a relational schema.
60ER Model Basics
- Entity-relationship diagram Methodology for
documenting databases illustrating relationships
between database entities - Entity Real-world object distinguishable from
other objects. - An entity is described (in DB) using a set of
attributes. - Entity Set A collection of similar entities.
e.g., all employees. - All entities in an entity set have the same set
of attributes. - Each entity set has a key (underlined).
- Each attribute has a domain.
61ER Model Basics
- Relationship Association among two or more
entities. - e.g., Ahmet works in CRM department.
- relationships can have their own attributes.
- Relationship Set Collection of similar
relationships. - An n-ary relationship set R relates n entity sets
E1 ... En - Each relationship in R involves entities e1? E1,
..., en ? En - Same entity set can participate in different
relationship sets, or in different roles in the
same set.
62Key Constraints
- Example
- An employee can work in many departments
- A department can have many employees.
- In contrast, each department has at most one
manager, according to the key constraint on
Manages.
63Conceptual Design Using the ER Model
- Design choices
- Should a concept be modeled as an entity or an
attribute? - Should a concept be modeled as an entity or a
relationship? - Identifying relationships Binary or ternary?
Aggregation? - Constraints in the ER Model
- A lot of data semantics can (and should) be
captured. - But some constraints cannot be captured in ER
diagrams.
64Entity- Relationship Diagram
65Entity vs. Attribute
- Should address be an attribute of Employees or an
entity (related to Employees)? - Depends upon how we want to use address
information, and the semantics of the data - If we have several addresses per employee,
address must be an entity (since attributes
cannot be set-valued). - If the structure (city, street, etc.) is
important, address must be modeled as an entity
(since attribute values are atomic).
66Relational Database Definitions
- Relational database a set of relations.
- Relation made up of 2 parts
- Schema specifies name of relation, plus name
and type of each column. - e.g. Students(sid string, name string, login
string, age integer, gpa real) - Instance a table, with rows and columns.
- rows cardinality
- fields degree / arity
- Can think of a relation as a set of rows or
tuples. - i.e., all rows are distinct
67Terminology related to relations
68Example Instance of Students Relation
- Cardinality 3, arity 5 , all rows distinct
- Do all values in each column of a relation
instance have to be distinct?
69Data
- Example
- Products
- ProductID, name, price, quantity
- Orders
- OrderID, ProductID, userID
- Users
- userID, name, credit card
70The Table Perspective
- Viewed by the user as
- Rows records
- Columns fields
- Stored value cell
- Relations are abstract tables
- Hence the name relational databases
- Describes users perspective, not the internal
storage format
71Mathematical Foundations
- A relation is a collection of tuples
- Usually a subset of the data
- A tuple is a collection of attributes
- ltA1vi1gt ltAn vingt
- An attribute is a scalar
- Each scalar has a domain
72Scalars
- Smallest syntactic unit of data
- Atomic
- No internal structure
- Typical attribute types
- Integer
- Float
- String
- Time/Date
73Properties of relations
- There are no duplicate tuples
- Tuples are unordered (Top to Bottom)
- Attributes are unordered (Left to Right)
- Implies that they are referenced by name, not
position - All attribute values are atomic
74Named Relations
- Base relation references a non-derived relation
- View refers to a named (derived) relation
75Relational Operators
- Operations that (conceptually) return a table!
- Select (restrict)
- Returns a subset of rows
- Project
- Returns a subset of fields
- Join
- Returns a combined set of tables
- Division
- Returns rows of A that contain all elements of B
76Relational Query Languages
- A major strength of the relational model
supports simple, powerful querying of data. - Queries can be written intuitively.
- The DBMS is responsible for efficient evaluation.
- The key precise semantics for relational
queries. - Allows the optimizer to extensively re-order
operations, and still ensure that the answer does
not change.
77SQL
- SQL (Sequel), standard language for Relational
DBs - Standard Query Language
- Developed in early 1970s by IBM
- Now an international standard
- SQL can be used (invoked) in a number of ways
- Interactive
- Embedded in various applications both static and
dynamic - SQL is not 100 in accordance with what is
theoretically correct - Current std is SQL99, SQL92 is a basic subset
78Components of SQL
- Data Definition Language (DDL)
- create, modify, delete relations
- specify constraints
- administer users, security, etc.
- Data Manipulation Language (DML)
- Specify queries to find tuples that satisfy
criteria - add, modify, remove tuples
79Three Basic Operations in a Relational Database
- Select Creates subset of rows that meet specific
criteria - Join Combines relational tables to provide users
with information - Project Enables users to create new tables
containing only relevant information
80Project Operation
81SQL Overview
- CREATE TABLE ltnamegt ( ltfieldgt ltdomaingt, )
- INSERT INTO ltnamegt (ltfield namesgt)
- VALUES (ltfield valuesgt)
- DELETE FROM ltnamegt
- WHERE ltconditiongt
- UPDATE ltnamegt
- SET ltfield namegt ltvaluegt
- WHERE ltconditiongt
- SELECT ltfieldsgt
- FROM ltnamegt
- WHERE ltconditiongt
82Creating Relations in SQL
- Creates the Students relation.
- Note the type (domain) of each field is
specified, and enforced by the DBMS whenever
tuples are added or modified. - CREATE TABLE Students
- (sid CHAR(20),
- name CHAR(20),
- login CHAR(10),
- age INTEGER,
- gpa FLOAT)
83Table Creation (continued)
- Another example the Enrolled table holds
information about courses students take. - CREATE TABLE Enrolled
- (sid CHAR(20),
- cid CHAR(20),
- grade CHAR(2))
84Adding and Deleting Tuples
- Can insert a single tuple using
- INSERT INTO Students (sid, name, login, age, gpa)
- VALUES (2001100, Ali, alik, 18, 1.89)
- Can delete all tuples satisfying some condition
(e.g., name Ali) - DELETE
- FROM Students S
- WHERE S.name Ali
85Keys
- Keys are a way to associate tuples in different
relations - Keys are one form of integrity constraint (IC)
- PRIMARY Key
- FOREIGN Key
86Primary Keys
- A set of fields is a superkey if
- No two distinct tuples can have same values in
all key fields - A set of fields is a key for a relation if
- It is a superkey
- No subset of the fields is a superkey
- what if there are many keys for a relation?
- one of the keys is chosen (by DBA) to be the
primary key. Other keys are called candidate
keys. - e.g.
- sid is a key for Students.
- What about name?
- The set sid, gpa is a superkey.
87Primary and Candidate Keys in SQL
- Possibly many candidate keys (specified using
UNIQUE), one of which is chosen as the primary
key. - Keys must be used carefully!
- For a given student and course, there is a
single grade. - CREATE TABLE Enrolled
- (sid CHAR(20)
- cid CHAR(20),
- grade CHAR(2),
- PRIMARY KEY (sid,cid))
88Foreign Keys, Referential Integrity
- Foreign key Set of fields in one relation that
is used to refer to a tuple in another
relation. - Must correspond to the primary key of the other
relation. - Like a logical pointer.
- If all foreign key constraints are enforced,
referential integrity is achieved (i.e., no
dangling references.)
89Foreign Keys in SQL
- Only students listed in the Students relation
should be allowed to enroll for courses. - sid is a foreign key referring to Students
- CREATE TABLE Enrolled
- (sid CHAR(20),cid CHAR(20),grade CHAR(2),
- PRIMARY KEY (sid,cid),
- FOREIGN KEY (sid) REFERENCES Students )
90Enforcing Referential Integrity
- Consider Students and Enrolled
- sid in Enrolled is a foreign key that references
Students. - What should be done if an Enrolled tuple with a
non-existent student id is inserted? (Reject it!) - What should be done if a Students tuple is
deleted? - Also delete all Enrolled tuples that refer to it?
- Disallow deletion of a Students tuple that is
referred to? - Set sid in Enrolled tuples that refer to it to a
default sid? - (In SQL, also Set sid in Enrolled tuples that
refer to it to a special value null, denoting
unknown or inapplicable.) - Similar issues arise if primary key of Students
tuple is updated.
91Integrity Constraints
- IC condition that must be true for any instance
of the database e.g., domain constraints. - ICs are specified when schema is defined.
- ICs are checked when relations are modified.
- A legal instance of a relation is one that
satisfies all specified ICs. - DBMS should not allow illegal instances.
- If the DBMS checks ICs, stored data are more
faithful to real-world meaning. - Avoids data entry errors, too!
92Where do ICs Come From?
- ICs are based upon the semantics of the
real-world that is being described in the
database relations. - We can check a database instance to see if an IC
is violated, but we can NEVER infer that an IC is
true by looking at an instance. - An IC is a statement about all possible
instances! - From example, we know name is not a key, but the
assertion that sid is a key is given to us. - Key and foreign key ICs are the most common more
general ICs supported, too.
93Querying Multiple Relations
- What is the meaning of the following query?
- SELECT S.name, E.cid
- FROM Students S, Enrolled E
- WHERE S.sidE.sid AND E.gradeB'
94Basic SQL Query
- SELECT DISTINCT target-list
- FROM relation-list
- WHERE qualification
- relation-list A list of relation names
- possibly with a range-variable after each name
- target-list A list of attributes of tables in
relation-list - qualification Comparisons combined using AND,
OR and NOT. - Comparisons are Attr op const or Attr1 op Attr2,
- DISTINCT optional keyword indicating that the
answer should not contain duplicates. - In SQL SELECT, the default is that duplicates are
not eliminated! (Result is called a multiset)
95Query Semantics
- Semantics of an SQL query are defined in terms of
the following conceptual evaluation strategy - 1. do FROM clause compute cross-product of
tables (e.g., Students and Enrolled). - 2. do WHERE clause Check conditions, discard
tuples that fail. (called selection). - 3. do SELECT clause Delete unwanted fields.
(called projection). - 4. If DISTINCT specified, eliminate duplicate
rows. - Probably the least efficient way to compute a
query! - An optimizer will find more efficient strategies
to get the same answer.
96SQL Select and Project
- Syntax
- SELECT column_nameFROM table_nameWHERE
where_clause - Example
- SELECT FROM orders WHERE user_ID 1
- SELECT name FROM products WHERE price lt 1.00
97SQL More about WHERE
- Comparisons
- gt, gt, lt, lt,
- Logical Operators (AND, OR, NOT)
- SELECT name FROM products WHERE product_id gt
1234 ANDproduct_id gt 1237 -
- Between
- WHERE product_ID BETWEEN 1234 AND 1237
98SQL Join
- Merges tables that have common column(s)
- A new row for every pair of rows with the common
value in the field - Typically operates on Foreign/candidate keys
- Example
- Users JOIN Orders
99SQL Select with (Implicit) Join
- Example Suppose we wanted the names of everyone
who ordered carrots - SELECT name FROM orders, users WHERE
orders.user_ID users.user_ID AND product_ID
1234
100SQL 3-way Join
- But this assumes we know carrots are product
1234!, - SELECT users.name FROM orders, users, products
WHERE orders.user_ID users.user_IDAND
products.product_ID orders.product_IDAND
product.name carrot
101SQL In
- Another way to achieve the previous request
- SELECT name FROM usersWHERE user_ID in(SELECT
user_ID FROM ordersWHERE product_ID (SELECT
product_IDFROM products WHERE name carrot))
102Functions
- Min, Max
- SELECT MAX (price) FROM products
- Count
- SELECT COUNT (product_ID) FROM products WHERE
price gt 1.00 - Average
- SELECT AVG(price) FROM products
103Other Elements of Select
- Distinct
- SELECT DISTINCT user_ID FROM orders
- Order by
- SELECT name FROM products ORDER BY name
- Group by, having
104Inserting New Data
- Syntax
- INSERT INTO table_name (column_names) VALUES
(values) - Example
- INSERT INTO users (user_ID, name, credit_card)
VALUES (4, David, 234-55-6776)
105SQL Select All
- To select all columns
- SELECT FROM orders
106SQL In
- Set membership operator
- Another way to achieve the previous request
- SELECT name FROM usersWHERE user_ID IN(SELECT
user_ID FROM ordersWHERE product_ID IN (SELECT
product_IDFROM products WHERE name carrot))
107SQL In
- Notice the nesting of SELECT statements
- In general, the IN subquery works only if the
relation returned is the same attribute - i.e. belongs to the same table
- Is a foreign key to the table
- Orders.user_ID is a foreign key referring to
users.user_ID
108SQL Like
- To perform a pattern match search in character
fields - SELECT FROM users WHERE name LIKE A
- Two pattern matching symbols
- represents any sequence of zero or more
characters - _ (underscore) represents a single character
109Inserting New Data
- Syntax
- INSERT INTO table_name (column_names) VALUES
(values) - Example
- INSERT INTO users (user_ID, name, credit_card)
VALUES (4, David, 234-55-6776) - Can drop the explicit column references if they
will be supplied in order defined
110SQL Update
- Modifies existing data
- UPDATE table_name SET column_name some_value
WHERE where_clause - The most dangerous command in SQL!
- Example
- UPDATE users SET credit_card 122-12-1212
WHERE user_ID 1
111Delete
- Syntax
- DELETE FROM table_name WHERE where_clause
- Example
- DELETE FROM products WHERE product_ID 1234
- Would this command succeed?
- Depends upon the configuration of the database
112Normalization
- Process of creating small stable data structures
from complex groups of data - Relations that reduce data redundancy are created
so the insertion, deletion and update operations
can be safely done. - EXAMPLE
- SP (S, SNAME, STATUSCITY, QTY, P, PNAME, COLOR,
WEIGHT) - Can be decomposed to
- S (S, SNAME, STATUSCITY)
- P (P, PNAME, COLOR, WEIGHT)
- SP (S, P, QTY)
113An Unnormalized Relation for ORDER
114Normalized Tables Created from ORDER
115Database Design
- Think about how data is related
- Entity-Relation diagrams
- Think about how queries will be executed
- Avoid repeated data
- Changing the structure of tables
- ALTER Adds/Deletes columns, constraints, defaults
- Cannot change field sizes
116Triggers
- Trigger procedure that starts automatically if
specified changes occur to the DBMS - Three parts
- Event (activates the trigger)
- Condition (tests whether the triggers should run)
- Action (what happens if the trigger runs)
- Triggers (in some form) are supported by most
DBMSs Assertions are not. - Support for triggers is defined in the SQL1999
standard.
117Triggers Example
- CREATE TRIGGER member_delete
- ON member FOR DELETE
- AS
- IF (Select COUNT () FROM loan INNER JOIN deleted
- ON loan.member_no deleted.member_no) gt 0
- BEGIN
- PRINT ERROR - member has books on loan.
- ROLLBACK TRANSACTION
- END
- ELSE
- DELETE reservation WHERE reservation.member_no
deleted.member_no
118Using SQL
- Interactively
- Via a command line program
- Embedded SQL
- Embed complete SQL statements in code
- Precompiler modifies the source code
- Dynamic SQL
- Embedded in scripts or other programs
- Dynamically interpreted
- Variables bound at run time
119What is Embedded SQL?
- The standard SQL embedded in general purpose
programming languages such as Pascal, Fortran,
Cobol, C and C. - A language in which SQL queries are embedded is
referred to as a host language, and the SQL
structures permitted in the host language
constitute embedded SQL.
120Why Embedded SQL?
- There exist queries that can be expressed in a
language such as C, Cobol, or Fortran that can
not be expressed in SQL. - Non-declarative actions-such as printing a
report, interacting with a user, or sending the
results of a query to a graphical user
interface-cannot be done from within SQL.
121How embedded SQL works?
- An embedded SQL program must be processed by a
special preprocessor prior to compilation. - Embedded SQL requests are replaced with
host-language declarations and procedure calls
that allow run-time execution of the database
accesses. - Then, the resulting program is compiled by the
host-language compiler.
122Embedding SQL in C An Example
- char SQLSTATE6
- EXEC SQL BEGIN DECLARE SECTION
- char c_sname20 short c_minrating float c_age
- EXEC SQL END DECLARE SECTION
- c_minrating random()
- EXEC SQL DECLARE sinfo CURSOR FOR
- SELECT S.sname, S.age FROM Sailors S
- WHERE S.rating gt c_minrating
- ORDER BY S.sname
- EXEC SQL OPEN sinfo
- do
- EXEC SQL FETCH sinfo INTO c_sname, c_age
- printf(s is d years old\n, c_sname, c_age)
- while (SQLSTATE ! 02000)
- EXEC SQL CLOSE sinfo
123Database APIs alternative to embedding
- Rather than modify compiler, add a library with
database calls (API) - special procedures/objects
- passes SQL strings from language, presents result
sets in a language-friendly way - Microsofts ODBC becoming C/C standard on
Windows - Suns JDBC a Java equivalent
- For Perl there is DBI or oraPerl
- Mostly DBMS-neutral (or at least they try to hide
the complexities of dealing with different
database systems).
124Evolution of Database Technology
- 1960s
- Data collection, database creation, IMS and
network DBMS - 1970s
- Relational data model, relational DBMS
implementation - 1980s
- RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented
DBMS (spatial, scientific, engineering, etc.) - 1990s2000s
- Data mining and data warehousing, multimedia
databases, and Web databases
125Centralized database
- Used by single central processor or multiple
processors in client/server network - There are advantages and disadvantages to having
all corporate data in one location. - Security is higher in central environments, risks
lower. - If data demands are highly decentralized, then a
decentralized design is less costly, and more
flexible.
126Distributed database
- Databases can be decentralized either by
partitioning or by replicating - Partitioned database Database is divided into
segments or regions. For example, a customer
database can be divided into Eastern customers
and Western customers, and two separate databases
maintained in the two regions. - Duplicated database The database is completely
duplicated at two or more locations. The
separate databases are synchronized in off hours
on a batch basis.
127Distributed Databases
128Ensuring Data Quality
- Corporate and government databases have
unexpectedly poor levels of data quality. - National consumer credit reporting databases have
error rates of 20-35. - 32 of the records in the FBIs Computerized
Criminal History file are inaccurate, incomplete,
or ambiguous. - Gartner Group estimates that consumer data in
corporate databases degrades at the rate of 2 a
month.
129Ensuring Data Quality (Continued)
- The quality of decision making in a firm is
directly related to the quality of data in its
databases. - Data Quality Audit Structured survey of the
accuracy and level of completeness of the data in
an information system - Data Cleansing Consists of activities for
detecting and correcting data in a database or
file that are incorrect, incomplete, improperly
formatted, or redundant
130Online Analytical Processing (OLAP)
- Multidimensional data analysis
- Supports manipulation and analysis of large
volumes of data from multiple dimensions/perspecti
ves
131Multidimensional Data Model
132Data warehouse
- Supports reporting and query tools
- Stores current and historical data
- Consolidates data for management analysis and
decision making
133Components of a Data Warehouse
134Data mart
- Subset of data warehouse
- Contains summarized or highly focused portion of
data for a specified function or group of users
135Benefits of Data Warehouses
- Improved and easy accessibility to information
- Ability to model and remodel the data
136What Is Data Mining?
- Data mining (knowledge discovery in databases)
- Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases - Alternative names and their inside stories
- Data mining a misnomer?
- Knowledge discovery(mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc. - What is not data mining?
- (Deductive) query processing.
- Expert systems or small ML/statistical programs
137Why Data Mining? Potential Applications
- Database analysis and decision support
- Market analysis and management
- target marketing, customer relation management,
market basket analysis, cross selling, market
segmentation - Risk analysis and management
- Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis - Fraud detection and management
- Other Applications
- Text mining (news group, email, documents) and
Web analysis. - Intelligent query answering
138Market Analysis and Management
- Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies - Target marketing
- Find clusters of model customers who share the
same characteristics interest, income level,
spending habits, etc. - Determine customer purchasing patterns over time
- Conversion of single to a joint bank account
marriage, etc. - Cross-market analysis
- Associations/co-relations between product sales
- Prediction based on the association information
139Market Analysis and Management
- Customer profiling
- data mining can tell you what types of customers
buy what products (clustering or classification) - Identifying customer requirements
- identifying the best products for different
customers - use prediction to find what factors will attract
new customers - Provides summary information
- various multidimensional summary reports
- statistical summary information (data central
tendency and variation)
140Corporate Analysis and Risk Management
- Finance planning and asset evaluation
- cash flow analysis and prediction
- contingent claim analysis to evaluate assets
- cross-sectional and time series analysis
(financial-ratio, trend analysis, etc.) - Resource planning
- summarize and compare the resources and spending
- Competition
- monitor competitors and market directions
- group customers into classes and a class-based
pricing procedure - set pricing strategy in a highly competitive
market
141Fraud Detection and Management
- Applications
- widely used in health care, retail, credit card
services, telecommunications (phone card fraud),
etc. - Approach
- use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances - Examples
- auto insurance detect a group of people who
stage accidents to collect on insurance - money laundering detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network) - medical insurance detect professional patients
and ring of doctors and ring of references
142Fraud Detection and Management
- Detecting inappropriate medical treatment
- Australian Health Insurance Commission identifies
that in many cases blanket screening tests were
requested (save Australian 1m/yr). - Detecting telephone fraud
- Telephone call model destination of the call,
duration, time of day or week. Analyze patterns
that deviate from an expected norm. - British Telecom identified discrete groups of
callers with frequent intra-group calls,
especially mobile phones, and broke a
multimillion dollar fraud. - Retail
- Analysts estimate that 38 of retail shrink is
due to dishonest employees.
143Other Applications
- Sports
- IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for New York Knicks and
Miami Heat - Astronomy
- JPL and the Palomar Observatory discovered 22
quasars with the help of data mining - Internet Web Surf-Aid
- IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preference and behavior pages,
analyzing effectiveness of Web marketing,
improving Web site organization, etc.
144Data Mining A KDD Process
Knowledge
- Data mining the core of knowledge discovery
process.
Pattern Evaluation
Data Mining
Task-relevant Data
Data Warehouse
Selection
Data Cleaning
Data Integration
Databases
145Steps of a KDD Process
- Learning the application domain
- relevant prior knowledge and goals of application
- Creating a target data set data selection
- Data cleaning and preprocessing (may take 60 of
effort!) - Data reduction and transformation
- Find useful features, dimensionality/variable
reduction, invariant representation. - Choosing functions of data mining
- summarization, classification, regression,
association, clustering. - Choosing the mining algorithm(s)
- Data mining search for patterns of interest
- Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant
patterns, etc. - Use of discovered knowledge
146Data Mining and Business Intelligence
Increasing potential to support business decisions
End User
Making Decisions
Business Analyst
Data Presentation
Visualization Techniques
Data Mining
Data Analyst
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
DBA
Data Sources
Paper, Files, Information Providers, Database
Systems, OLTP
147Architecture of a Typical Data Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data warehouse server
Filtering
Data cleaning and data integration
Data Warehouse
Databases
148Databases and the Web
- The Web and Hypermedia database
- Organizes data as network of nodes
- Links nodes in pattern specified by user
- Supports text, graphic, sound, video, and
executable programs
149- Database server
- Computer in a client/server environment runs a
DBMS to process SQL statements and perform
database management tasks. - Application server
- Software handling all application operations
150Linking Internal Databases to the Web
151Databases and the Web
- The Web extends the database interface
- Consider making a plane reservation
- Typically involves multiple forms (choose flight,
choose seat, choose payment method) - Information from the beginning may be invalid by
the end
152Keeping Track of Users
- HTTP is stateless
- Request-response paradigm
- If we need to keep track of users between
requests, we can - Embedding hidden fields with values in sequential
forms - Cookies
153Hidden Fields
- What are some of the problems?
- Only work in forms
- Bookmarkable? If form uses GET, but not if POST
- Can be manipulated
154Cookies
- Two HTTP headers
- COOKIE
- Used by browser sending cookie info back to
server - SET-COOKIE2
- Used by server to request that a cookie be stored
- Associated with hosts
- Typically one line/cookie in a text file
- Body contains keyvalue pairs
155Cookie Format
- The header includes a line
- Set-Cookie2 NAMEVALUE ( attributevalues)
- Attributes include
- Comments
- Discard the cookie when client quits
- Maximum age
- Domain, path, and port of relevance
156Cookie Example
- Netscape stores in a MagicCookie file
- .ncsu.edu TRUE / FALSE 2051222363 SITESERVER
IDbd2f6b1d2e2eed30af221fe4a64eb077
157What Can You Do With a Cookie?
- Can store data on client,
- Store a user ID
- Store a transaction ID
- Store preferences on client machine
- A server can store more than one cookie (using
different names)
158When is the Data Returned?
- Whenever the client is sending a request to a
server and - The host, URL, and port match an existing cookie
- The cookie has not expired
- It should include the cookie data in the request
header
159ACID Properties For Web Transactions?
- Awkward, at best
- Must be carefully managed by the developers
- unlike databases, in which it is a part of the
system
160Dynamic SQL Basics
- Typically
- Prepare phase
- Parses, validates, optimizes the query
- Creates an execution plan
- Execute Phase
- Executes the query
- Can be called repeatedly once prepared
- Syntax varies
161Dynamic SQL Example
- Using VBScript and ASP, ActiveX Data Object
- ltSQLCAR"SELECT FROM tableset conn
server.createobject("ADODB.Connection")conn.open
database_nameset carsconn.execute(SQLCAR)gt
162Cursors
- Allow application to access an arbitrary number
of rows - FETCH to retrieve the next row
- Placed in a loop to retrieve next row
- Until an SQLCODE of Not Found is returned
163TPC
164TPC-C
- TPC-C simulates a complete computing environment
where a population of users executes transactions
against a database. - The benchmark is centered around the principal
activities (transactions) of an order-entry
environment. - These transactions include entering and
delivering orders, recording payments, checking
the status of orders, and monitoring the level of
stock at the warehouses. - TPC-C performance is measured in new-order
transactions per minute. The primary metrics are
the transaction rate (tpmC), the associated price
per transaction (/tpmC), and the availability
date of the priced configuration.
165Management Opportunities
- Business firms have exceptional opportunities to
exploit modern relational database technologies
to improve decision making, and to increase the
efficiency of their business processes.
166Management Challenges
- Organizational obstacles to a database
environment - Need for cooperation in developing corporate-wide
data administration - Cost/benefit considerations
- Bringing about significant change in the
database environment of a firm can be very
expensive and time consuming.
167Solution Guidelines
- The critical elements for creating a database
environment are - Data administration
- Data-planning and modeling methodology
- Database technology and management
- Users
168Key Organizational Elements in the Database
Environment