Title: Overview of Query Evaluation
1Overview of Query Evaluation
- Chapter 12 Ramakrishnan Gehrke
- (Sections 12.1-12.3)
2What will you learn from this lecture?
- Query Evaluation basics
- System catalog (revisit)
- Generic techniques for operator evaluation
- Access paths
- Implementation choices for select/project.
3Some basics to be borne in mind.
- Recall ?3 alternatives for data entry
representation. - An index may be clustered or unclustered.
- An index may be dense or sparse
- E.g., students(ID, name, addr, dept) file sorted
on ID and indexed on ID (alternative 2) and on
dept. (alt. 3). ID index sparse dept. index
dense. - Index on ID is a primary index while that on
dept. is a secondary index. - A file may have multiple indexes.
4Schema for Examples
Songs (sid integer, sname string, genre
string, year date) Ratings (uid integer, sid
integer, time date, rating integer)
- Assume the following sizes
- Songs
- Each tuple is 50 bytes long, 80 tuples per page,
500 pages. ( index pages if it had a hash
index with bktpg, and with data entries
alternative 1). what are we assuming here? - Ratings
- Each tuple is 40 bytes long, 100 tuples per
page, 1000 pages. ( leaf pages if it had a
Btree index, with data entries alternative 1,
and pages filled to capacity).
5System catalog
- maintains important metadata about all data in
the db. - what is data?
- application data in tables
- also all indexes for the tables
- system catalog catalog tables data dictionary
catalog.
6System catalog
- what info. is maintained in the catalog?
- static info.
- ?table table name, file name, file structure
(e.g., sorted/heap file), attribute name(s) and
type(s), index name(s), primary and foreign key
constraints.
- ?index index name and structure (e.g., hash vs.
Btrees, clustered or not), search key
attribute(s). - auxiliary info. on views.
7System catalog
- statistical info. ? relation
- cardinality tuples
- size pages
- index cardinality distinct keys
- index size index pages (leaf)
- index height only for tree indexes
- index range min/max key values.
- most of the static info. stored as a table
(already discussed).
- may maintain additional info. e.g., Histograms.
8Generic Techniques for Operator Evaluation
- indexing use index to examine only tuples
satisfying given selection/join condition. (also
called index probe.)
- iteration scan a table (i.e., data pages) to
retrieve tuples satisfying condition (table scan)
OR scan index pages and examine data entries
therein (index scan). - when is the latter feasible? why is it a good
idea? - the actual index structure is NOT used. (i.e.,
index scan ! index probe.)
- partitioning sorting and hashing used for this
purpose. (commonly needed for group-bys,
duplicate elimination, etc.)
9Generic Techniques reviewed
- Table scan read the whole table (usually, from
disk). - Index scan read all data entries in the index
file (makes sense for alternatives 2 and 3). - Have to read much less than for table scan.
- Can we use index scan for alt. 1?
- What if we had alternative 2 but a sparse index?
- Index probe use index to home in on tuples
satisfying selection condition(s). - Typically, best option for equality selection.
10Some important terms
- access path method for accessing tuples of a
table - Table/index scan OR
- index plus matching selection condition. (only
consider conjunctions C of attr op value.) ?
consider probing the index. - when does an index match a selection condition C
? - hash index for each search key attribute A, C
has a conjunct of the form A v. - Btree index for some prefix of search key, for
every attribute A in the prefix, C has a conjunct
of the form A op v. - why treat match differently? what does match do
for us?
11Access Paths
- suppose we have a hash index on uid, sid, time.
does it match uid1 sid2 time3? what about
uid1 sid2? and uid1 time3? uid1 sid2
time gt 2?
- when a hash index matches a selection condition,
we can fetch just those tuples that satisfy it.
- when the match (between the index and selection
condition) covers some (but not all) conjuncts in
it, we can still fetch those tuples that satisfy
all matched conjuncts and check them for
satisfaction of remaining conditions. Clustered
vs. unclustered index makes a big difference!
12Access Paths
- suppose we have a Btree index on (uid, sid,
time). does it match uid1 sid2 timegt3?
what about uid 1 sid2? and uid1 timegt3?
- when there is a match, we can exploit the Btree
index to fetch tuples satisfying the matched
conjuncts.
- if we had separate indexes on uid, sid and on
time what are our options for timegt3 uid1
sid2? - use one of the indexes and verify unmatched
conditions. - use both indexes and intersect rid sets.
13Reduction Factor
- when there is total match, can fetch exactly
tuples that satisfy given condition. - if not, can fetch tuples satisfying the matched
conjuncts. - reduction factor fraction of tuples in table
that satisfy a conjunct. - smaller the RF ? more selective the access path
using the index. - RF often estimated using independence assumption.
14Operator Take 1 Algorithms (Selection)
- ?A op val(R).
- No index? ? table scan (plan 1).
- If there are indexes, use one to fetch tuples
satisfying matched conjuncts and verify rest
(plan 2). - E.g. sname lt S year2007 on Songs.
- Approx. 70 RF for sname lt S (uniformity
assumption extremely approximate for this
example). - Clustered Btree index on sname 70 x 500 350
pages. - Unclustered Btree index 70 x 40,000 28,000
tuples could be 28,000 pages (I/Os) in the
worst case! - Better off scanning table when RF gt 1.25 in this
example! - Can you construct a similar situation for
Ratings?
15Some explanations
- Where did RF70 come from? Assuming snames
starting with each letter are equally likely, for
sname lt S, there are 18 possible cases in
range out of a total of 26 possible cases (26
letters of the alphabet). This gives 18/26 which
is approx. 70. - Why are we better off scanning the table when RF
gt 1.25? Because 40000 1.25 500 tuples which
might in the worst case cost us 500 I/Os. Even if
we did a table scan it would cost us the same!
16Projection -- ??uid,sid(Ratings).
- Straightforward op key challenge duplicate
elimination can be expensive. - Avoid DE if you can. (e.g., no DISTINCT in select
clause.) - Simple table scan or index scan when index keys
include project fields. - DE Plan 1 use partitioning
- Scan Ratings and write out (uid,sid) pairs
handshake with phase 1 of external sorting. - Handshake last pass of phase 2 with DE.
17Projection (contd.)
- Plan 2 If index data entries include uid, sid,
then sort data entries as opposed to records. - Plan 3 Suppose there is a clustered index on
(uid,sid,ltwhatevergt), simply retireve data
entries and DE on the fly. (since they are
already sorted.) - Note both plans above scan index file rather
than data file! (But you look at the data entries
in the index file.)