Title: Communicating with the Outside
1Communicating withthe Outside
2Application
Query Processor
Indexes
Storage Subsystem
Concurrency Control
Recovery
Operating System
HardwareProcessor(s), Disk(s), Memory
3The ways applications accessing the Database
System
- 4GL(fourth generation language)
- Execute at high level resulting in small programs
- Hider portability, different vendors offers
different 4GLs - Programming language Call Level Interface
- ODBC(Open DataBase Connectivity) for c/c
- JDBC Java based API, for Java
- Database specific ODBC drivers OCI (C/Oracle),
CLI (C/ DB2) - Perl/DBI
4API pitfalls
- Cost of portability
- Layer of abstraction on top of ODBC drivers to
hide discrepancies across drivers with different
conformance levels. - Beware of performance problems in this layer of
abstraction - Use of meta-data description when submitting
queries, accessing the result set - Iterations over the result set
5ODBC vs. OCI
- ODBC vs. OCI on Oracle8iEE on Windows 2000
- Iteration over a result set one record at a time.
Prefetching is performed. - Low OCI overhead when number of records
transferred is small - ODBC performs better when number of records
transferred increases due to its good at
prefetching.
6Client-Server Mechanisms
- Connection pooling and multiplexing when multiple
clients access a server - Communication buffer on the database server. One
per connection. - If a client does not consume results fast enough,
then the server holds resources until it can
output the result. - Data is sent either when the communication buffer
is full or when a batch is finished executing. - Small buffer frequent transfer overhead
- Large buffer time to first record increases.
- No actual impact on a 100 Mb network. More
sensitive in an intranet with low bandwidth.
7Communication buffer
8Harmful small objects
- authorized(user, type)
- doc(id, type, date)
- What are the document instances a user can see?
- SQL
- select doc.id, doc.datefrom authorized,
docwhere doc.type authorized.typeand
authorized.user ltinputgt
- If documents and doctypes are encapsulated in
objects, the risk is the following - Find types t authorized for user input
- select doc.type as tfrom authorizedwhere user
ltinputgt - For each type t issue the query
- select id, datefrom docwhere type lttgt
- The join is executed in the application and not
in the DB!
9Tuning the Application Interface
- Avoid user interaction within a transaction
- Minimize the number of roundtrips between the
application and the database - Retrieve needed columns only
- Retrieve needed rows only
- Minimize the number of query compilations
10Avoid User Interaction within a Transaction
- User interaction within a transaction forces
locks to be held for a long time. - Careful transaction design (possibly transaction
chopping) to avoid this problem.
11Minimize the Number of Roundtrips to the Database
- Rule of thumb crossing the interface between the
application and the database server is costly - Avoid Loops
- Application programming languages offer looping
facilities (SQL statements, cursors, positioned
updates) - Positioned updates ODBC allows updating the rows
that are obtained as the result of a query, this
forces the processing of updates one row at a
time - Rigid object-oriented programming might force
such loops.
12Avoid External Loops
- No loop
- sqlStmt select from lineitem where l_partkey
lt 200 - odbc-gtprepareStmt(sqlStmt)
- odbc-gtexecPrepared(sqlStmt)
-
- Loop
- sqlStmt select from lineitem where l_partkey
? - odbc-gtprepareStmt(sqlStmt)
- for (int i1 ilt200 i)
-
- odbc-gtbindParameter(1, SQL_INTEGER, i)
- odbc-gtexecPrepared(sqlStmt)
-
13Avoid External Loops
Let the DBMS optimize set operations Fetch 2000
records Loop 200 queries No loop 1
query Crossing the application interface too
often hurts performances.
- SQL Server 2000 on Windows 2000
- Crossing the application interface has a
significant impact on performance
14Avoid Cursors
- No cursor
- select from employees
- Cursor
- DECLARE d_cursor CURSOR FOR select from
employees - OPEN d_cursorwhile (_at__at_FETCH_STATUS 0)
- BEGIN
- FETCH NEXT from d_cursorEND
- CLOSE d_cursor
- go
15Avoiding Cursors
- Cursors are terribly slow in almost all systems
- Using the cursor, records are transmitted from
the database server to the application one at a
time. - Query fetches 200000 56 bytes records
- Response time is a few seconds with a SQL query
and more than an hour iterating over a cursor.
16Package several SQL statements within one call to
the database server
- Embedded procedural language (Transact SQL) with
control flow facilities. - Use User Defined Functions (UDFs) when they
select out a high number of records. - Do not use positioned updates, which update rows
that are obtained as the result of a query,
forcing updates one row at a time.
17Packing sqls into one call to db server
create table Temp1 (parent varchar(200)) create
table Temp2(parent varchar(200)) create table
Ancestor (person varchar(200)) / Temp2will hold
the latest generation discovered. / INSERT INTO
Temp1 SELECT parent FROM Parental WHERE child
Nicholas Bennet WHILE EXISTS(SELECT FROM
Temp1) BEGIN INSERT INTO Ancestor SELECT FROM
Temp1 INSERT INTO Temp2 SELECT FROM
Temp1 DELETE FROM Temp1 INSERT INTO
Temp1 SELECT Parental.parent FROM Parental,
Temp2 WHERE Parental.child Temp2.parent DELETE
FROM Temp2
IF EXISTS ( SELECT FROM Ancestor WHERE person
Carol Diane ) PRINT Carol Diane is an ancestor
of Nicholas Bennet. ELSE PRINT Carol Diane is
not an ancestor of Nicholas Bennet.
T-SQL to determines whether Carol Diane is an
ancestor ofNicholas Bennet. Assume a genealogical
database containing at least the relation
Parental(parent, child).
18User Defined Functions
- Function computes the number of working days
between two dates. - Function executed either on the database site
(UDF) or on the application site - Applying the UDF yields good performances when it
helps reduce significantly the amount of data
sent back to the application.
19Retrieve Needed Columns Only
- Avoid transferring unnecessary data
- Might prevent the use of a covering index, i.e.
index-only scan - In the experiment the subset contains ¼ of the
attributes. - Reducing the amount of data that crosses the
application interface yields significant
performance improvement.
20Retrieve Needed Rows Only
- If the user is only viewing a small subset of a
very large result set, it is best to - Only transfer that subset
- Only compute that subset
- In cases when users only want a feel of the
data, use TOP or FETCH FIRST to fetch just a few
rows - Applications that allow the formulation of ad-hoc
queries(An Ad Hoc Query is a query that cannot be
determined prior to the moment the query is
issued I ) should permit users to cancel them.
21Minimize the Number of Query Compilations
- Prepared execution yields better performance when
the query is executed more than once - No compilation
- No access to catalog.
- Prepared execution plans become obsolete if
indexes are added or the size of the relation
changes.
Experiment performed on Oracle8iEE on Windows
2000.
Direct ODBC direct execution Prepared ODBC
prepared command, then execute it repeatedly .
22Tuning the Application Interface
- Avoid user interaction within a transaction
- Minimize the number of roundtrips between the
application and the database - Retrieve needed columns only
- Retrieve needed rows only
- Minimize the number of query compilations
23Bulk Loading Data
- Purpose loading large volumes of data into a
database - Tools
- SQL server bcp and a Transact-SQL command BULK
INSERT. - Oracle SQL Loader
- DB2 Load utility
- Tool parameters
- Bypass query engine
- Avoid logging
- No index update
- No constraint check
- Frequency of commits
24Use Direct Path for Bulk Loading
- sqlldr directpathtrue controlload_lineitem.ctl
dataE\Data\lineitem.tbl - load data
- infile "lineitem.tbl"
- into table LINEITEM append
- fields terminated by ''
- (
- L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE DATE
"YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD",
L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT,
L_SHIPMODE, L_COMMENT - )
25Direct Path
- Loading 600000 records into the lineitem relation
from TPCH - Direct path loading bypasses the query engine and
the storage manager. It is orders of magnitude
faster than conventional path (with a commit
every 100 records) and inserts (with a commit for
each record).
Experiment performed on Oracle8iEE on Windows
2000.
26Batch Size
- Bulk load of 600000 records.
- Throughput increases steadily when the batch size
increases to 100000 records.Throughput remains
constant afterwards. - Trade-off between performance and amount of data
that has to be reloaded in case of problem.
Experiment performed on SQL Server 2000 on
Windows 2000.
27Storage Engine Parameters
- Bulk load of 600000 records.
- As expected
- Turning off logging helps.
- Collecting statistics hurts
- Maintaining indexes incrementally hurts a lot.
Experiment performed on IBM DB2 UDB V7.1 on
Windows 2000.
28Connecting to Multiple Databases
- Shared connection to reduce start-up cost
- Connection pooling
- Pass-through statements when performance is CPU
bound - E.g. dblink in Oracle
- Eliminates query rewriting, translation and
postprocesing to adapt to specific SQL dialect - Transfer large block of data to minimize
roundtrips when performance is network bound,