Title: The Software Infrastructure for Electronic Commerce
1The Software Infrastructurefor Electronic
Commerce
- Databases and Data Mining
- Lecture 1 A Managers View of Database
Technology - Johannes Gehrke
- johannes_at_cs.cornell.edu
- http//www.cs.cornell.edu/johannes
2Goal Of Lectures One and Two
- Understand the basics of modern data management
- DBMS technology
- Data models
- Transaction processing versus decision support
- An data management architecture of ane-business
3Goals of the DBMS Lectures (Contd.)
- Gain technical understanding to feel confident to
ask questions. - Learn to ask the right questions.
4The Big Picture
WWW SiteVisitor
Internal User
INTRANET,VPN
THE WEB
InternalWeb Server
MainMemory Cache
Public Web Server
DataWarehouseApplicationServer
BusinessTransactionServer
DBMS
5Web Pages with Database Contents
- Web pages contain the results of database
queries. How do we generate such pages? - Web server creates a new process for a program
interacts with the database. - Web server communicates with this program via
some data exchange protocol - Program generates result page with content from
the database
6Business Transaction Server
- Application server Piece of software between the
web server and the applications - Functionality
- Hold a set of pre-forked threads or processes for
performance - Database connection pooling (reuse a set of
existing connections) - Integration of heterogeneous data sources
- Transaction management involving several data
sources - Session management
7Other Server-Side Processing
- Java Servlets Java programs that run on the
server and interact with the server through a
well-defined API - JavaBeans Reusable software components written
in Java - Java Server Pages and Active Server Pages Code
inside a web page that is interpreted by the web
server
8What Happens If You Click On A Link?
www.company.com
1
Company web serverwww.company.com
Johannes Gehrke
2
Banner web serverwww.banners.com
3
Additional log serverwww.mycookies.com
Hidden link
4
SERVER SIDE
CLIENT SIDE
9Web Server Log Data
- Common Log Formathost ident authuser date
request status bytes - host domain name or IP address of the client
- date date and time of the request
- request the request line from the client
- status three digit status code returned to the
client - bytes number of bytes in the object returned to
the client
10Web Server Log Data (Contd.)
- Usually, even more data available
- URL of the referring server
- Name and version of the browser
- Name of the file requested
- Time to serve the request
- IP address of the requesting browser
- Cookie
11Cookies
- The communication protocol (HTTP) between your
browser and the web server is stateless. (Compare
to a vending machine.) - Remedy Store information (called a cookie) at
the browser of the user - Example cookie (from www.google.com)
- PREFID3415aaaf73b7bfe3,TM956724506google.com/
0261887833632111634266272280029339450(namev
aluedomainsecureexpiration dateexpiration
timelast used datelast used time)
12Cookies (Contd.)
- A cookie is always associated with a specific
domain (amazon.com, bn.com, preferences.com,
doubleclick.net, cornell.edu) - Cookies have expiration dates
- The secrets are in the (namevalue) pairs
(usually encrypted)PREFID3415aaaf73b7bfe3,TM95
6724506 - Cookies have their own life on your computer
- \Documents and Settings\UserName\Cookies,\Windows
\Cookies\Windows\Profiles\UserName\Cookies\Progr
amFiles\Netscape\Users\Default\cookies.txt
13Cookies (Contd.)
- Applications of cookies
- Shopping carts
- Log in once (example New York Times)
- Personalization (amazon.com Welcome back,
Johannes) - General tracking of user behavior
- User privacy
- Other personalization/tracking techniques Hidden
fields in html pages, unique page names - Online demonstration
14And Then You Click Purchase
- (Simplified process)
- Insert customer data into the database/check
customer data - Check order availability
- Insert order data into the database
- Return order confirmation number to the customer
- All this data is stored in a database system
(DBMS).
15Why Store Data in a DBMS?
- Benefits
- Transactions (concurrent data access, recovery
from system crashes) - High-level abstractions for data access,
manipulation, and administration - Data integrity and security
- Performance and scalability
16Transactions
- A transaction is an atomic sequence of database
actions (reading, writing or updating a database
object). - Each transaction must leave the DB in a
consistent state (if DB is consistent when the
transaction starts). - The ACID Properties
- Atomicity
- Consistency
- Isolation
- Durability
17Example Transaction Online Store
- Your purchase transaction
- Atomicity Either the complete purchase happens,
or nothing - Consistency The inventory and internal accounts
are updated correctly - Isolation It does not matter whether other
customers are also currently making a purchase - Durability Once you have received the order
confirmation number, your order information is
permanent, even if the site crashes
18Transactions (Contd.)
- A transaction will commit after completing all
its actions, or it could abort (or be aborted by
the DBMS) after executing some actions.
19Example Transaction ATM
- You withdraw money from the ATM machine
- Atomicity
- Consistency
- Isolation
- Durability
- Commit versus Abort?
- What are reasons for commit or abort?
20Concurrency Control
- (Start A100 B100)
- Consider two transactions
- T1 START, AA100, BB-100, COMMIT
- T2 START, A1.06A, B1.06B, COMMIT
- The first transaction is transferring 100 from
Bs account to As account. The second
transaction is crediting both accounts with a 6
interest payment.
21Example (Contd.)
- (Start A100 B100)
- Consider a possible interleaving (schedule)
- T1 AA100, BB-100 COMMIT
- T2 A1.06A, B1.06B COMMIT
- End result A106 B0
- Another possible interleaving
- T1 AA100, BB-100 COMMIT
- T2 A1.06A, B1.06B COMMIT
- End result A106 B6
- The second interleaving is incorrect! Concurrency
control of a database system makes sure that the
second schedule does not happen.
22Ensuring Atomicity
- DBMS ensures atomicity (all-or-nothing property)
even if the system crashes in the middle of a
transaction. - Idea Keep a log (history) of all actions carried
out by the DBMS while executing - Before a change is made to the database, the
corresponding log entry is forced to a safe
location. - After a crash, the effects of partially executed
transactions are undone using the log.
23Recovery
- A DBMS logs all elementary events on stable
storage. This data is called the log. - The log contains everything that changes data
Inserts, updates, and deletes. - Reasons for logging
- Need to UNDO transactions
- Recover from a systems crash
24Recovery Example
- (Simplified process)
- Insert customer data into the database
- Check order availability
- Insert order data into the database
- Write recovery data (the log) to stable storage
- Return order confirmation number to the customer
25Tips
- Transactions, concurrency control, and recovery
are important aspects of the functionality of a
database system - Tips for capacity planning
- Load influences level of concurrency ?Determines
hardware requirements - Insufficient resources for concurrency and
recovery can force serialization of transactions
? Bad performance - Need ample space for the log, often mirrored onto
two disks at the same time
26Why Store Data in a DBMS?
- Benefits
- Transactions (concurrent data access, recovery
from system crashes) - High-level abstractions for data access,
manipulation, and administration - Data integrity and security
- Performance and scalability
27Data Independence
- Applications should be insulated from how data is
structured and stored. Thus the DBMS needs to
provide high-level abstractions to applications!
View 1
View 2
View 3
Conceptual Schema
Physical Schema
28Data Model
- A data model is a collection of concepts for
describing data. - Examples
- ER model (used for conceptual modeling)
- Relational model, object-oriented model,
object-relational model (actually implemented in
current DBMS)
29The Relational Data Model
- A relational database is a set of relations.
Turing Award (Nobel Price in CS) for Codd in 1980
for his work on the relational model - Example relationCustomers(cid integer, name
string, byear integer, state string)
30The Relational Model Terminology
- Relation instance and schema
- Field (column)
- Record or tuple (row)
- Cardinality
31Customer Relation (Contd.)
- In your enterprise, you are more likely to have a
schema similar to the followingCustomers(cid,
identifier, nameType, salutation, firstName,
middleNames, lastName, culturalGreetingStyle,
gender, customerType, degrees, ethnicity,
companyName, departmentName, jobTitle,
primaryPhone, primaryFax, email, website,
building, floor, mailstop, addressType,
streetNumber, streetName, streetDirection, POBox,
city, state, zipCode, region, country,
assembledAddressBlock, currency, maritalStatus,
bYear, profession)
32Product Relation
- Relation schemaProducts(pid integer, pname
string, price float, category string) - Relation instance
33Transaction Relation
- Relation schemaTransactions( tid
integer, tdate date, cid integer, pid
integer)
34TIPS
- Any enterprise has an abundance of different
relations in its DBMS. Good management of this
meta-data is crucial - Documentation
- Evolution
- Assignment of responsibilities
- Security
- (ERP packages usually create several thousand
relations)
35The Relational DBMS Market
36The Relational DBMS Market (Contd.)
37The Relational DBMS Market (Contd.)
38The Object-Oriented Data Model
- Richer data model. Goal Bridge impedance
mismatch between programming languages and the
database system. - Example components of the data model
Relationships between objects directly as
pointers. - Result Can store abstract data types directly in
the DBMS - Pictures
- Geographic coordinates
- Movies
- CAD objects
39Object-Oriented DBMS
- Advantages Engineering applications (CAD and CAM
and CASE computer aided software engineering),
multimedia applications. - Disadvantages
- Technology not as mature as relational DMBS
- Not suitable for decision support, weak security
- Vendors are much smaller companies and their
financial stability is questionable.
40Object-Oriented DBMS (Contd.)
- Vendors
- Gemstone (www.gemstone.com)
- Objectivity (www.objy.com)
- Object Design (www.odi.com)
- POET (www.poet.com)
- Versant (www.versant.com)
- Organizations
- OMDG Object Database Management
Group(www.omdg.org) - OMG Object Management Group (www.omg.org)
41The OO DBMS Market
42Object-Relational DBMS
- Mixture between the object-oriented and the
object-relational data model - Combines ease of querying with ability to store
abstract data types - Conceptually, the relational model, but every
field - All major relational vendors are currently
extending their relational DBMS to the
object-relational model
43Query Languages
- We need a high-level language to describe and
manipulate the data - Requirements
- Precise semantics
- Easy integration into applications written in
C/Java/Visual Basic/etc. - Easy to learn
- DBMS needs to be able to efficiently evaluate
queries written in the language
44Relational Query Languages
- The relational model supports simple, powerful
querying of data. - Precise semantics for relational queries
- Efficient execution of queries by the DBMS
- Independent of physical storage
45SQL Structured Query Language
- Developed by IBM (System R) in the 1970s
- ANSI standard since 1986
- SQL-86
- SQL-89 (minor revision)
- SQL-92 (major revision, current standard)
- SQL-99 (major extensions)
46Why Store Data in a DBMS?
- Benefits
- Transactions (concurrent data access, recovery
from system crashes) - High-level abstractions for data access,
manipulation, and administration - Data integrity and security
- Performance and scalability
47Integrity Constraints
- Integrity Constraints (ICs) Condition that must
be true for any instance of the database. - ICs are specified when schema is defined.
- ICs are checked when relations are modified.
- A legal instance of a relation is one that
satisfies all specified ICs. - DBMS should only allow legal instances.
- Example Domain constraints.
48Primary Key Constraints
- A set of fields is a superkey for a relation if
no two distinct tuples can have same values in
all key fields. - A set of fields is a key if the set is a
superkey, and none of its subsets is a superkey. - Example
- cid, name is a superkey for Customers
- cid is a key for Customers
- Where do primary key constraints come from?
49Primary Key Constraints (Contd.)
- Can there be more than one key for a relation?
- What is the maximum number of superkeys for a
relation? - What is the primary key of the Products relation?
How about the Transactions relation?
50Foreign Keys, Referential Integrity
- Foreign key Set of fields in one relation that
is refers to a unique tuple in another relation.
(The foreign key must be a superkey of the second
relation.) - Example The field cid in the Transactions
relation is a foreign key referring to Customers. - If all foreign key constraints are enforced, we
say that referential integrity is achieved. - No dangling references.
- Compare to links in HTML.
51Foreign Keys Example
- The pid field of the Transactions relation refers
to the cid field of the Customer relation.
52Enforcing Referential Integrity
- What should be done if a Transaction tuple with a
non-existent Customer id is inserted? (Reject
it!) - What should be done if a Customer tuple is
deleted? - Also delete all Transaction tuples that refer to
it. - Disallow deletion of a Customer tuple that has
associated Transactions. - Set cid in Transactions tuples to a default or
special cid. - SQL supports all three choices
53Where do ICs Come From?
- ICs are based upon the semantics of the
real-world enterprise that is being described in
the database relations. - We can check a database instance to see if an IC
is violated, but we can NEVER infer that an IC is
true by looking at an instance. - An IC is a statement about all possible
instances! - From example, we know state cannot be a key, but
the assertion that cid is a key is given to us. - Key and foreign key ICs are very common a DBMS
supports more general ICs.
54Security
- Secrecy Users should not be able to see things
they are not supposed to. - E.g., A student cant see other students grades.
- Integrity Users should not be able to modify
things they are not supposed to. - E.g., Only instructors can assign grades.
- Availability Users should be able to see and
modify things they are allowed to.
55Discretionary Access Control
- Based on the concept of access rights or
privileges for objects (tables and views), and
mechanisms for giving users privileges (and
revoking privileges). - Creator of a table or a view automatically gets
all privileges on it. - DMBS keeps track of who subsequently gains and
loses privileges, and ensures that only requests
from users who have the necessary privileges (at
the time the request is issued) are allowed. - Users can grant and revoke privileges
56Role-Based Authorization
- In SQL-92, privileges are actually assigned to
authorization ids, which can denote a single user
or a group of users. - In SQL1999 (and in many current systems),
privileges are assigned to roles. - Roles can then be granted to users and to other
roles. - Reflects how real organizations work.
- Illustrates how standards often catch up with de
facto standards embodied in popular systems.
57Security Is Important
- Financial estimate of database losses, by
activity in 1998 - (ENT/CSI/FBI Computer Crime Survey, 5/1999)
- Activity Loss in million
- Theft of Proprietary Information 28.51
- System Penetration by Outsiders 13.39
- Financial Fraud 11.24
- Unauthorized Insider Access 50.57
- Laptop Theft 0.16
- Total 103.87
58Why Store Data in a DBMS?
- Benefits
- Transactions (concurrent data access, recovery
from system crashes) - High-level abstractions for data access,
manipulation, and administration - Data integrity and security
- Performance and scalability
59Why Parallel Access To Data?
At 10 MB/s 1.2 days to scan
1,000 x parallel 1.5 minute to scan.
1 Terabyte
Bandwidth
1 Terabyte
10 MB/s
Parallelism Divide a big problem into many
smaller ones to be solved in parallel.
60Parallel DBMS Intro
- Parallelism is natural to DBMS processing
- Pipeline parallelism many machines each doing
one step in a multi-step process. - Partition parallelism many machines doing the
same thing to different pieces of data. - Both are natural in DBMS!
Any
Any
Sequential
Sequential
Pipeline
Program
Program
Sequential
Any
Any
Partition
Sequential
Sequential
Sequential
Sequential
Sequential
Program
Program
outputs split N ways, inputs merge M ways
61DBMS The Success Story
- DBMSs are the most (only?) successful application
of parallelism. - Every major DBMS vendor has some server
- Workstation manufacturers now depend on DB
server sales. - Reasons for success
- Bulk-processing ( partition -ism).
- Natural pipelining.
- Inexpensive hardware can do the trick!
- Users/app-programmers dont need to think in
62Some Terminology
Ideal
- Speed-Up
- More resources means proportionally less time for
given amount of data. - Scale-Up
- If resources increased in proportion to increase
in data size, time is constant.
Xact/sec. (throughput)
degree of -ism
Ideal
sec./Xact (response time)
degree of -ism
63Architecture Issue Shared What?
Hard to program Cheap to build Easy to scaleup
Easy to program Expensive to build Difficult to
scaleup
Oracle
Compaq, Teradata, SP2
Sybase
64Distributed Database Systems
- Data is stored at several sites, each managed by
a DBMS that can run independently. - Distributed data independence Users should not
have to know where data is located (extends
physical and logical data independence
principles). - Distributed transaction atomicity Users should
be able to write transactions accessing multiple
sites just like local transactions.
65Types of Distributed Databases
- Homogeneous Every site runs same type of DBMS.
- Heterogeneous Different sites run different
DBMSs (different RDBMSs or even non-relational
DBMSs).
Gateway
DBMS1
DBMS2
DBMS3
66Distributed DBMS Architectures
QUERY
CLIENT
CLIENT
Client ships query to single site. All
query processing at server. - Thin vs. fat
clients. - Set-oriented communication,
client side caching.
SERVER
SERVER
SERVER
SERVER
SERVER
Query can span multiple sites.
SERVER
QUERY
67DBMS and Performance
- Efficient implementation of all database
operations - Indexes Auxiliary structures that allow fast
access to the portion of data that a query is
about - Smart buffer management
- Query optimization Finds the best way to execute
a query - Automatic query parallelization
68Performance Tips
- Bad connection management is the number one
scalability problem (can make two order of
magnitude difference) - Reuse once-executed queries as much as possible
- Physical performance database tuning is crucial
for good performance - Performance tuning is an art and a science
- Recent trend DBMS include tuning wizards for
automatic performance tuning - Include space for indexes during space planning
69Summary Of DBMS Benefits
- Transactions
- ACID properties, concurrency control, recovery
- High-level abstractions for data access
- Data models
- Data integrity and security
- Key constraints, foreign key constraints, access
control - Performance and scalability
- Parallel DBMS, distributed DBMS, performance
tuning
70The Future of Data Exchange Markup
- From HTML to XML
- Markup Languages HTML
- Simple markup language
- Text is annotated with language commands called
tags, usually consisting of a start tag and an
end tag
71HTML Example Book Listing
- ltHTMLgtltBODYgt
- Fiction
- ltULgtltLIgtAuthor Milan Kunderalt/LI?
- ltLIgtTitle Identitylt/LIgt
- ltLIgtPublished 1998lt/LIgt
- lt/ULgt
- Science
- ltULgtltLIgtAuthor Richard Feynmanlt/LIgt
- ltLIgtTitle The Character of Physical
Lawlt/LIgt - ltLIgtHardcoverlt/LIgt
- lt/ULgtlt/BODYgtlt/HTMLgt
72Beyond HTML XML
- Extensible Markup Language (XML) Extensible
HTML - Confluence of SGML and HTML The power of SGML
with the simplicity of HTML - Allows definition of new markup languages, called
document type declarations (DTDs)
73XML Language Constructs
- Elements
- Main structural building blocks of XML
- Start and end tag
- Must be properly nested
- Element can have attributes that provide
additional information about the element - Entities like macros, represent common text.
- Comments
- Document type declarations (DTDs)
74Booklist Example in XML
- lt?XML version1.0 standaloneyes?gt
- lt!DOCTYPE BOOKLIST SYSTEM booklist.dtdgt
- ltBOOKLISTgt
- ltBOOK genreFictiongt
- ltAUTHORgt
- ltFIRSTgtMilanlt/FIRSTgtltLASTgtKunderalt/LASTgt
- lt/AUTHORgt
- ltTITLEgtIdentitylt/TITLEgt
- ltPUBLISHEDgt1998lt/PUBLISHEDgt
- ltBOOK genreScience formatHardcovergt
- ltAUTHORgt
- ltFIRSTgtRichardlt/FIRSTgtltLASTgtFeynmanlt/LASTgt
- lt/AUTHORgt
- ltTITLEgtThe Character of Physical Lawlt/TITLEgt
- lt/BOOKgtlt/BOOKLISTgt
75XML DTDs
- A DTD is a set of rules that defines the
elements, attributes, and entities that are
allowed in the document. - An XML document is well-formed if it does not
have an associated DTD but it is properly nested. - An XML document is valid if it has a DTD and the
document follows the rules in the DTD.
76An Example DTD
- lt!DOCTYPE BOOKLIST
- lt!ELEMENT BOOKLIST (BOOK)gt
- lt!ELEMENT BOOK (AUTHOR, TITLE, PUBLISHED?)gt
- lt!ELEMENT AUTHOR (FIRST, LAST)gt
- lt!ELEMENT FIRST (PCDATA)gt
- lt!ELEMENT LAST (PCDATA)gt
- lt!ELEMENT TITLE (PCDATA)gt
- lt!ELEMENT PUBLISHED (PCDATA)gt
- lt!ATTLIST BOOK genre (ScienceFiction)
REQUIREDgt - lt!ATTLIST BOOK format (PaperbackHardcover)
Paperbackgt - gt
77Domain-Specific DTDs
- Development of standardized DTDs for specialized
domains enables data exchange between
heterogeneous sources - Example Mathematical Markup Language (MathML)
- Encodes mathematical material on the web
- In HTML ltIMG SRCxysquare.gif ALT(xy)2gt
- In MathML ltapplygt ltpower/gt ltapplygt ltplus/gt
ltcigtxlt/cigt ltcigtylt/cigt lt/applygt ltcngt2lt/cngt
lt/applygt
78XML The Future
- Seamless integration of heterogeneous B2B data
sources and business processes - XML-specified services, discovered by autonomous
software agents - Industry-specific DTDs that simplify data
exchange - Only XML-industry revenue forecasts1999 31
millions, 2001 93 millions(Interactive Week,
4/1999)
79Summary Of Lecture One
- Database systems are powerful, but also complex
pieces of software - Transactions (concurrent data access, recovery
from system crashes) - High-level abstractions for data access,
manipulation, and administration - Data integrity and security
- Performance and scalability
- XML is the Future of Data Exchange
80In The Second Lecture
- Database design
- A short introduction to SQL
- Transaction processing versus decision support
On to the data warehouse!
81Questions?