Title: CSE490i Advanced Internet Systems
1CSE494 - Information Retrieval, Mining and
Integration on the InternetDatabase Concepts -
A Refresher30th March 2004
2This Day in History
- 1867 US purchases Alaska from Russia for
7.2 million (2 cents/acre) - 1953 Einstein announces revised unified field
theory - 1954 Test Cricket debut of Sir Garry Sobers
vs. England - 1981 President Reagan shot wounded by John
W Hinckley Jr - 2004 The first ever regular class of Rao
taught by someone other than Rao
3Concepts covered so far
- Information Retrieval
- Text retrieval
- Hyper-linked text retrieval
- Improvements
- Information Mining
- Clustering techniques to improve result
presentation - Classification and filtering techniques
4Structured data..
- Focus on text data till date.
- However, a lot of the data available on the web
is actually from (semi-)structured databases !!!! - They do their best to look like they are text
sources - What are the issues and opportunities brought up
by the presence of such sources on the web?
5Databases !!!??? you may have used
6Is the a DBMS?
Skeptics corner
- Fairly sophisticated search available
- crawler indexes pages on the web
- Keyword-based search for pages
- But, currently
- data is mostly unstructured and untyped
- search only
- cant modify the data
- cant get summaries, complex combinations of data
- Web sites typically have a DBMS in the background
to provide these functions. - They dynamically convert (wrap) the structured
data into readable English - ltIndia, New Delhigt gt The capital of India is
New Delhi. - So, if we can unwrap the text, we have
structured data! - Note also that such dynamic pages cannot be
crawled... - The (coming) Semi-structured web
- Most pages are at least semi-structured
- XML standard is expected to ease the
presentation/on-the-wire transfer of such pages.
(BUT..) - The Services
- Travel services, mapping services
- The Sensors
7Structure
An employee record
A generic web page containing text
A movie review
- How will search and querying on these three types
of data differ?
Semi-Structured
8Search vs. Query
- What if you wanted to find out which actors
donated to Al Gores presidential campaign? - Try actors donated to gore in your favorite
search engine.
9Structure helps querying
- Expressive queries
- Give me all pages that have key words Get Rich
Quick - Give me the social security numbers of all the
employees who have stayed with the company for
more than 5 years, and whose yearly salaries are
three standard deviations away from the average
salary - Give me all mails from people from ASU written
this year, which are relevant to get rich quick
- Efficient searching
- equality vs. similarity
- range-limited search
10Why use a DBMS in your website?
- Suppose we are building web-based music
distribution site. - Several questions arise
- How do we store the data? (file organization,
etc.) - How do we query the data? (write programs)
- Make sure that updates dont mess things up?
- Provide different views on the data? (registrar
versus students) - How do we deal with crashes?
- Way too complicated!
- Buy a database system!
11What Is a Database System?
- Database
a very
large, integrated collection of data. - Models a real-world enterprise
- Entities (e.g., teams, games)
- Relationships
- (e.g., The Patriots are playing in The
Superbowl) - More recently, also includes active components ,
often called business logic. (e.g., the BCS
ranking system) - A Database Management System (DBMS) is a software
system designed to store, manage, and facilitate
access to databases.
12Functionality of a DBMS
- Data Dictionary Management
- Storage management
- Data storage Definition Language (DDL)
- High level query and data manipulation language
- SQL/XQuery etc.
- May tell us what we are missing in text-based
search - Efficient query processing
- May change in the internet scenario
- Transaction processing
- Resiliency recovery from crashes,
- Different views of the data, security
- May be useful to model a collection of databases
together - Interface with programming languages
13Traditional Database Architecture
14Building an Application with a Database System
- Requirements modeling (conceptual, pictures)
- Decide what entities should be part of the
application and how they should be linked. - Schema design and implementation
- Decide on a set of tables, attributes.
- Define the tables in the database system.
- Populate database (insert tuples).
- Write application programs using the DBMS
- Now much easier, with data management API
15 Conceptual Modeling
ssn
16Data Models
- A data model is a collection of concepts for
describing data. - A schema is a description of a particular
collection of data, using a given data model. - The relational model of data is the most widely
used model today. - Main concept relation, basically a table with
rows and columns. - Every relation has a schema, which describes the
columns, or fields.
17Levels of Abstraction
- Views describe how users see the data.
-
- Conceptual schema defines logical structure
- Physical schema describes the files and indexes
used.
18Example University Database
- Conceptual schema
- Students(sid string, name string,
login string, age integer, gpareal) - Courses(cid string, cnamestring,
creditsinteger) - External Schema (View)
- Course_info(cidstring,enrollmentinteger)
- Physical schema
- Relations stored as unordered files.
- Index on first column of Students.
19Data Independence
- Applications insulated from
- how data is structured and stored.
- Logical data independence Protection from
changes in logical structure of data. - Physical data independence Protection from
changes in physical structure of data. - Q Why are these particularly important for DBMS?
20Schema Design Implementation
- Table Students
- Separates the logical view from the physical view
of the data.
21Terminology
Attribute names
Students
tuples
(Arity3)
22Querying a Database
- Find all the students taking CSE594 in Q1, 2004
- S(tructured) Q(uery) L(anguage)
- select E.name
- from Enroll E
- where E.courseCS490i and
- E.quarterWinter, 2000
- Query processor figures out how to answer the
query efficiently.
23Relational Algebra
- Operators
- tuple sets as input, new set as output
- Basic Binary Set Operators
- Result is table (set) with same attributes
- Sets must be compatible!
- R1(A1,A2,A3) ? R2(B1,B2,B3)
- ? Domain(Ai) Domain(Bi)
- Union
- All tuples in either R1 or in R2
- Intersection
- All tuples in both R1 and R2
- Difference
- All tuples in R1 but not in R2
- Complement
- All tuples not in R1
- Selection, Projection, Cartesian Product, Join
whats the universe?
24Selection s
- Grab a subset of the tuples in a relation that
satisfy a given condition - Use and, or, not, gt, lt to build condition
- Unary operation returns set with same
attributes, but selects rows
25Selection Example
Employee
SSN
Name
DepartmentID
Salary
999999999
John
1
30,000
777777777
Tony
1
32,000
888888888
Alice
2
45,000
26Projection p
- Unary operation, selects columns
- Returned schema is different,
- So returned tuples are not subset of original set
- Contrast with selection
- Eliminates duplicate tuples
27(No Transcript)
28Cartesian Product X
- Binary Operation
- Result is set of tuples combining all elements of
R1 with all elements of R2, for R1 ? R2 - Schema is union of Schema(R1) Schema(R2)
- Notice we could do selection on result to get
meaningful info!
29Cartesian Product Example
30Join
- Most common (and exciting!) operator
- Combines 2 relations
- Selecting only related tuples
- Result has all attributes of the two relations
- Equivalent to
- Cross product followed by selection followed by
Projection - Equijoin
- Join condition is equality between two attributes
- Natural join
- Equijoin on attributes of same name
- result has only one copy of join condition
attribute
31Example Natural Join
32Complex Queries
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
prodname) Company (cname, stock price,
country) Person( per-name, phone number, city)
Find phone numbers of people who bought gizmos
from Fred. Find telephony products that
somebody bought
33Exercises
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
prodname) Company (cname, stock price,
country) Person( per-name, phone number,
city) Ex 1 Find people who bought telephony
products. Ex 2 Find names of people who bought
American products Ex 3 Find names of people who
bought American products and did not
buy French products Ex 4 Find names of people
who bought American products and they
live in Seattle. Ex 5 Find people who bought
stuff from Joe or bought products
from a company whose stock prices is more than
50.
34SQL Introduction
Standard language for querying and manipulating
data Structured Query
Language
Many standards out there SQL92, SQL2, SQL3,
SQL99 Vendors support various subsets of
these (but well only discuss a subset of what
they support) Basic form syntax on relational
algebra (but many other features too) Select
attributes From relations (possibly
multiple, joined) Where conditions
(selections)
35Selections s
SELECT FROM
Company WHERE countryUSA AND
stockPrice gt 50 You can use
Attribute names of the relation(s) used in the
FROM. Comparison operators , ltgt,
lt, gt, lt, gt Apply arithmetic
operations stockprice2 Operations
on strings (e.g., for concatenation).
Lexicographic order on strings.
Pattern matching s LIKE p Special
stuff for comparing dates and times.
36Projection p
Select only a subset of the attributes
SELECT name, stock price
FROM Company WHERE
countryUSA AND stockPrice gt 50
Rename the attributes in the resulting table
SELECT name AS company,
stockprice AS price FROM
Company WHERE countryUSA AND
stockPrice gt 50
37Ordering the Results
SELECT name, stock price
FROM Company WHERE
countryUSA AND stockPrice gt 50
ORDERBY country, name
Ordering is ascending, unless you specify the
DESC keyword. Ties are broken by the second
attribute on the ORDERBY list, etc.
38Join
SELECT name, store
FROM Person, Purchase WHERE
per-namebuyer AND citySeattle
AND
productgizmo Product ( pname, price,
category, maker) Purchase (buyer, seller,
store, product) Company (cname, stock price,
country) Person( per-name, phone number, city)
39Disambiguating Attributes
Find names of people buying telephony products
SELECT Person.name FROM
Person, Purchase, Product WHERE
Person.namebuyer
AND productProduct.name
AND Product.categorytelephony Product (
name, price, category, maker) Purchase (buyer,
seller, store, product) Person( name, phone
number, city)
40Tuple Variables
Find pairs of companies making products in the
same category
SELECT product1.maker, product2.maker
FROM Product AS product1, Product AS
product2 WHERE
product1.category product2.category
AND product1.maker ltgt
product2.maker
Product ( name, price, category, maker)
41Exercises
Product ( pname, price, category,
maker) Purchase (buyer, seller, store,
product) Company (cname, stock-price,
country) Person( per-name, phone number,
city) Ex 1 Find people who bought telephony
products. Ex 2 Find names of people who bought
American products Ex 3 Find names of people who
bought American products and did not
buy French products Ex 4 Find names of people
who live in Seattle and who bought American
products. Ex 5 Find people who bought stuff
from Joe or bought products from a
company whose stock prices is more than 50.
42Views
43Defining Views
(Virtual) Views are relations, except that they
are not physically stored. They are used
mostly in order to simplify complex queries
and to define conceptually different views of the
database to different classes of users. View
purchases of telephony products CREATE VIEW
telephony-purchases AS SELECT product, buyer,
seller, store FROM Purchase, Product WHERE
Purchase.product Product.name
AND Product.category telephony
44A Different View
CREATE VIEW Seattle-view AS SELECT
buyer, seller, product, store FROM
Person, Purchase WHERE Person.city
Seattle AND
Person.name Purchase.buyer
We can later use the views SELECT
name, store FROM Seattle-view,
Product WHERE Seattle-view.product
Product.name AND
Product.category shoes
Whats really happening when we query a view??
45Updating Views
How can I insert a tuple into a table that
doesnt exist? CREATE VIEW bon-purchase AS
SELECT store, seller, product FROM
Purchase WHERE store The Bon
Marche If we make the following insertion
INSERT INTO bon-purchase VALUES
(the Bon Marche, Joe, Denby Mug) We can
simply add a tuple (the Bon Marche,
Joe, NULL, Denby Mug) to relation Purchase.
46Materialized Views
- Views whose corresponding queries have been
executed and the data is stored in a separate
database - Uses Caching
- Issues
- Using views in answering queries
- Normally, the views are available in addition to
database - (so, views are local caches)
- In information integration, views may be the only
things we have access to. - An internet source that specializes in woody
allen movies can be seen as a view on a database
of all movies. Except, there is no database out
there which contains all movies.. - Maintaining consistency of materialized views
47Non-Updatable Views
Given Purchase (buyer, seller, store,
product) Person( name, phone-num, city)
CREATE VIEW Seattle-view AS SELECT
seller, product, store FROM Person,
Purchase WHERE Person.city Seattle
AND Person.name
Purchase.buyer
Why non-updatable?
How can we add the following tuple to the view?
(Joe, Shoe Model 12345, Nine West)
48Issues w.r.t. Databases on the Web
- Information Extraction (invert the tuple to text
transformation) - Support lay user queries
- More flexible queries
- Exact (SQL) vs Approximate/Similar (Text search?)
- On semi-structured databases
- Joins over text attributes?
- Exact (SQL) vs Approximate/Similar !!!!!
- Support integration/aggregation of multiple
databases - Take a query from the user and send it to all
relevant databases - TONS of challenges
49Imprecise Queries
- Increasing number of Web accessible databases
- E.g. bibliographies, reservation systems,
department catalogs etc - Support for precise queries only exactly
matching tuples - Difficulty in extracting desired information
- Limited query capabilities provided by form based
query interface - Lack of schema/domain information
- Increasing complexity of types of data e.g.
hyptertext, images etc - Often times user wants about the same instead
of exact - Bibliography search find similar publications
Solution Provide answers closely matching query
constraints
50Query Optimization
51Query Optimization
Goal
Imperative query execution plan
Declarative SQL query
SELECT S.buyer FROM Purchase P, Person Q WHERE
P.buyerQ.name AND Q.cityseattle AND
Q.phone gt 5430000
- Inputs
- the query
- statistics about the data (indexes,
cardinalities, selectivity factors) - available memory
Ideally Want to find best plan. Practically
Avoid worst plans!
52(On-the-fly)
sname
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
- Goal of optimization To find more efficient
plans that compute the same answer.
(On-the-fly)
rating gt 5
with pipelining )
sidsid
(Use hash
Sailors
bid100
index do
not write
result to
temp)
Reserves
53Optimizing Joins
- Q(u,x) - R(u,v), S(v,w), T(w,x)
- R S T
- Many ways of doing a single join R S
- Symmetric vs. asymmetric join operations
- Nested join, hash join, double pipe-lined hash
join etc. - Processing costs alone vs. processing transfer
costs - Get R and S together vs, get R, get just the
tuples of S that will join with R (semi-join) - Many orders in which to do the join
- (R join S) join T
- (S join R) join T
- (T join S) join R etc.
- All with different costs
54Determining Join Order
- In principle, we need to consider all possible
join orderings - As the number of joins increases, the number of
alternative plans grows rapidly we need to
restrict the search space. - System-R consider only left-deep join trees.
- Left-deep trees allow us to generate all fully
pipelined plansIntermediate results not written
to temporary files. - Not all left-deep trees are fully pipelined
(e.g., SM join).
55Query Optimization Process(simplified a bit)
- Parse the SQL query into a logical tree
- identify distinct blocks (corresponding to nested
sub-queries or views). - Query rewrite phase
- apply algebraic transformations to yield a
cheaper plan. - Merge blocks and move predicates between blocks.
- Optimize each block join ordering.
- Complete the optimization select scheduling
(pipelining strategy).
56Cost Estimation
- For each plan considered, must estimate cost
- Must estimate cost of each operation in plan
tree. - Depends on input cardinalities.
- Must estimate size of result for each operation
in tree! - Use information about the input relations.
- For selections and joins, assume independence of
predicates. - System R cost estimation approach.
- Very inexact, but works ok in practice.
- More sophisticated techniques known now.
57Key Lessons in Optimization
- There are many approaches and many details to
consider in query optimization - Classic search/optimization problem!
- Not completely solved yet!
- Main points to take away are
- Algebraic rules and their use in transformations
of queries. - Deciding on join ordering System-R style
(Selinger style) optimization. - Estimating cost of plans and sizes of
intermediate results.
58Concurrency Control
- Concurrent execution of user programs key to
good DBMS performance. - Disk accesses frequent, pretty slow
- Keep the CPU working on several programs
concurrently. - Interleaving actions of different programs
trouble! - e.g., account-transfer print statement at same
time - DBMS ensures such problems dont arise.
- Users/programmers can pretend they are using a
single-user system. (called Isolation) - Thank goodness! Dont have to program very,
very carefully.
59Transactions ACID Properties
- Key concept is a transaction a sequence of
database actions (reads/writes). - DBMS ensures atomicity (all-or-nothing property)
even if system crashes in the middle of a Xact. - Each transaction, executed completely, must take
the DB between consistent states or must not run
at all. - DBMS ensures that concurrent transactions appear
to run in isolation. - DBMS ensures durability of committed Xacts even
if system crashes. -
- Note can specify simple integrity constraints on
the data. The DBMS enforces these. - Beyond this, the DBMS does not understand the
semantics of the data. - Ensuring that a single transaction (run alone)
preserves consistency is largely the users
responsibility!
60Scheduling Concurrent Transactions
- DBMS ensures that execution of T1, ... , Tn is
equivalent to some serial execution T1 ... Tn. - Before reading/writing an object, a transaction
requests a lock on the object, and waits till the
DBMS gives it the lock. All locks are held
until the end of the transaction. (Strict 2PL
locking protocol.) - Idea If an action of Ti (say, writing X) affects
Tj (which perhaps reads X), say Ti obtains the
lock on X first so Tj is forced to wait until
Ti completes.This effectively orders the
transactions. - What if Tj already has a lock on Y and Ti
later requests a lock on Y? (Deadlock!) Ti or Tj
is aborted and restarted!
61Ensuring Transaction Properites
- DBMS ensures atomicity (all-or-nothing property)
even if system crashes in the middle of a Xact. - DBMS ensures durability of committed Xacts even
if system crashes. - Idea Keep a log (history) of all actions carried
out by the DBMS while executing a set of Xacts - Before a change is made to the database, the
corresponding log entry is forced to a safe
location. (WAL protocol OS support for this is
often inadequate.) - After a crash, the effects of partially executed
transactions are undone using the log. Effects of
committed transactions are redone using the log. - trickier than it sounds!