Title: 22. Parallel and Distributed Access
1. Parallel and Distributed Access
- Parallel vs. distributed DBMSs
- Parallel DBMSs
  - Measuring success: scalability
  - Hardware architectures
  - Types of parallelism; partitioning
  - Parallelizing operators: scan, select, sort, aggregate, join
- Distributed DBMSs
  - Classifying distributed DBMSs: hetero-/homogeneous, client-server vs. collaborating servers, horizontal vs. vertical fragmentation
  - Distributed catalog management
  - Distributed query processing: general queries, joins, optimization
  - Replication
    - Synchronous: voting, read-any write-all
    - Asynchronous: peer-to-peer, primary site (Capture and Apply)
  - Distributed locking, deadlock detection, transactions
2. Parallel vs. Distributed DBMSs
- DBMS may be spread among the nodes of a network
- Parallel DBMS
- Nodes are under a central control
- Typically nodes are in one location
- Motivated by performance
- Distributed DBMS
- Nodes are autonomous, may run different DBMSs
- Nodes are physically distributed
- Motivated by need for remote access to data in
spite of network failures
3. Motivation for Parallel DBMSs: What if one CPU is not enough?
- For a large web application, one CPU is never enough
- A large web application may need to scan a 1TB file or serve 100s of customers at one time
- At 50 MB/second, scanning a 1TB file takes over 5 hours
- Using 1000 CPUs scanning 1000 disks, the scan takes 20 seconds
- Similarly, 100s of customers accessing a database can push response times into the minutes
- Solution: parallel DBMSs
4. DBMS: The Success Story
- DBMSs, and information retrieval, are the most (only?) successful application of parallelism
- Every major DBMS vendor has some parallel server
- Large DBMS back ends for web applications are all parallelized
- Large IR applications are parallelized
- Reasons for DBMS success:
  - Relational algebra has few operators: just parallelize each one and compose them
  - Bulk processing (partition parallelism)
  - Natural pipelining in operator trees
5. How do we measure the success of parallel algorithms and architectures?
- Scalability. There are two kinds (made precise below):
  - Speedup: when the number of nodes grows by a factor n, efficiency improves by a factor n
  - Scaleup: when the number of nodes and the size of the database both grow by a factor n, efficiency does not change
- Efficiency can be measured in terms of:
  - Elapsed time, or average time per transaction
  - Number of transactions per second (throughput)
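One conventional way to make these precise (standard definitions, consistent with the bullets above but not spelled out on the slide):

    speedup(n) = T(1 node, problem size s) / T(n nodes, problem size s)      ideal: n
    scaleup(n) = T(1 node, problem size s) / T(n nodes, problem size n*s)    ideal: 1

Linear speedup means n nodes finish the same job n times faster; linear scaleup means an n-times-larger job on n nodes takes no longer than the original job on one node.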
6. Architecture Issue: Shared What?
[Figure: the shared-memory, shared-disk, and shared-nothing architectures]
- Shared memory (e.g., Sun, many others): easy to program; expensive to build; difficult to scale beyond ~30 processors
- Shared nothing (e.g., clusters, NCR Teradata, IBM SP): hard to program; cheap to build; easy to scale to 1000s of processors
7. How would you parallelize this plan?
- Assume Sailors and Reservations are already sorted on sid
[Plan: an on-the-fly selection σ_rank<5 over a sort-merge join ⋈_sid of Sailors and Reservations]
8. Different Types of DBMS Parallelism
- Pipeline parallelism (inter-operator): each operator may run concurrently on a different node (may exploit pipelining or bushy plans)
- Partition parallelism (intra-operator): multiple nodes work to compute a given operation (scan, sort, join)
- We'll focus on intra-operator parallelism
- Challenge: how to partition the data!
9. Three Types of Data Partitioning
[Figure: range, hash, and round-robin partitioning spreading key ranges A...E, F...J, K...N, O...S, T...Z across nodes; range keeps each node's keys contiguous, hash and round robin scatter them]
- Range partitioning: good for equijoins, range queries, group-by
- Hash partitioning: good for equijoins, group-by
- Round robin: good to spread load
- Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits more from "good" partitioning (a sketch of the three schemes follows below)
10. How to Parallelize the Scan Operator
- Given a partition, what is the parallel version of the scan operator?
  - Scan each partition, then merge the outputs
- What kind of partitioning will make the scan scalable?
  - Equal-size partitions
11. Parallelizing Selection
- What is the parallel version of the selection operator?
  - Range partition, range selection
  - Hash partition, range selection
  - Hash partition, equality selection
  - Round-robin partition
- Which of the above is scalable for speedup? (See the routing sketch below.)
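The standard routing reasoning, as a sketch (the function name and signature are invented for illustration): a selection speeds up only when its work is spread over many nodes, so hash partitioning routes an equality selection to exactly one node, range partitioning routes range and equality predicates to just the overlapping nodes, and hash-with-range-predicate or round robin must scan every node.

    import bisect

    def nodes_for_selection(partitioning, kind, value=None, lo=None, hi=None,
                            partition_vector=None, n_nodes=1):
        # Which nodes must participate in a selection?
        if partitioning == "range":
            if kind == "equality":
                return [bisect.bisect_left(partition_vector, value)]
            if kind == "range":
                # Only nodes whose key ranges overlap [lo, hi].
                first = bisect.bisect_left(partition_vector, lo)
                last = bisect.bisect_left(partition_vector, hi)
                return list(range(first, last + 1))
        if partitioning == "hash" and kind == "equality":
            return [hash(value) % n_nodes]          # exactly one node
        # Hash partition with a range predicate, or round robin with any
        # predicate: the key gives no routing information, so all nodes scan.
        return list(range(n_nodes))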
12. Parallel Sorting: Two Algorithms
- Distributive sort: range-partition the data, then sort each range (a sketch follows below)
  - Is it scalable? Only if the partitions are equal-sized
  - How to choose the partitioning vector?
- Parallel external sort-merge: sort the data at each node, then merge
  - Problem: how to merge in parallel
  - One solution: range-partition the data
  - Is it scalable?
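A toy sketch of distributive sort (illustrative Python; each list in partitions stands for one node, and the local sorts would run in parallel on real hardware):

    import bisect

    def choose_partition_vector(sample, n_nodes):
        # Pick splitters from a sample so the ranges come out roughly
        # equal-sized -- the condition the slide gives for scalability.
        sample = sorted(sample)
        step = max(1, len(sample) // n_nodes)
        return [sample[i] for i in range(step, len(sample), step)][:n_nodes - 1]

    def distributive_sort(tuples, partition_vector):
        n_nodes = len(partition_vector) + 1
        # Step 1: range-partition the input across the nodes.
        partitions = [[] for _ in range(n_nodes)]
        for t in tuples:
            partitions[bisect.bisect_left(partition_vector, t)].append(t)
        # Step 2: each node sorts its own range locally (in parallel).
        for p in partitions:
            p.sort()
        # The ranges are disjoint and ordered, so concatenation is the answer.
        return [t for p in partitions for t in p]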
13. Parallel Sorting, ctd.
- How to choose the partitioning vector? Sampling
- What if sampling is not possible?
  - For example, if the input is another process, such as another operator in a query plan
  - In this case, use external sort-merge and set up a tree of merging steps
  - See US Patent 5,852,826
14. Parallel Aggregates
- Problem: compute an aggregate function in parallel
- For each aggregate function, we need a decomposition:
  - sum(S) = sum( sum(s(i)) )
  - count(S) = sum( count(s(i)) )
  - avg(S) = sum( sum(s(i)) ) / sum( count(s(i)) )
- Is it scalable?
- Sometimes it's not so simple: median()
- Similar for groups
(A sketch of the decomposition follows below.)
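A minimal sketch of the decomposition (illustrative Python; partitions stands for the fragments s(i), and each partial result would be computed at its own node):

    def parallel_avg(partitions):
        # Each node computes a small partial state over its fragment...
        partials = [(sum(p), len(p)) for p in partitions]   # in parallel
        # ...and one cheap combine step yields the global answer.
        total_sum = sum(s for s, _ in partials)
        total_count = sum(c for _, c in partials)
        return total_sum / total_count

    # median() has no such small partial state: combining would need the
    # data itself, which is why the slide calls it "not so simple".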
15. Parallel Hash Join
- Recall Phase 1 of hash join: partition each input relation by hashing on the join key
- Do Phase 1 in parallel, for R and then S:
  - Perform the partitioning at each node
  - Send hash bucket i's output to node i, bucket j's to node j, etc.
- Phase 2 is simpler: perform the join of Ri and Si at each node i
(A sketch of both phases follows below.)
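A toy sketch of both phases (illustrative Python; relations are lists of (key, payload) pairs, and the list of buckets stands in for the network shuffle between nodes):

    from collections import defaultdict

    def shuffle(relation, n_nodes):
        # Phase 1: every node hash-partitions its tuples and ships
        # bucket i to node i; here the shipping is just list appends.
        buckets = [[] for _ in range(n_nodes)]
        for key, payload in relation:
            buckets[hash(key) % n_nodes].append((key, payload))
        return buckets

    def local_join(r_i, s_i):
        # Phase 2 at node i: join fragments Ri and Si.  The shared hash
        # function guarantees matching keys ended up at the same node.
        index = defaultdict(list)
        for key, payload in r_i:
            index[key].append(payload)
        return [(key, pr, ps) for key, ps in s_i for pr in index[key]]

    def parallel_hash_join(r, s, n_nodes=4):
        r_parts, s_parts = shuffle(r, n_nodes), shuffle(s, n_nodes)
        return [row for i in range(n_nodes)
                    for row in local_join(r_parts[i], s_parts[i])]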
16. Dataflow Network for Join A ⋈ B
[Figure: split and merge operators routing tuples between Node I and Node J]
- Good use of split/merge makes it easier to build parallel versions of sequential join code
17. Complex Parallel Query Plans
- Complex queries: inter-operator parallelism
  - Pipelining between operators
  - Note that sort and Phase 1 of hash join block the pipeline!
  - Bushy trees
[Figure: a bushy plan with subtrees running at Sites 1-4 and Sites 5-8, combined at Sites 1-8]
18. Distributed Databases
- Data is stored at several sites, each managed by a DBMS that can run independently
- Distributed data independence: users should not have to know where data is located (extends the physical and logical data independence principles)
- Distributed transaction ACIDity: users should be able to write atomic and durable Xacts accessing multiple sites
19. Types of Distributed Databases
- Homogeneous: every site runs the same type of DBMS
- Heterogeneous: different sites run different DBMSs (different RDBMSs or even non-relational DBMSs)
[Figure: a gateway connecting DBMS1, DBMS2, and DBMS3]
20. Network Architectures
- Definitions:
  - Client: requests a service
  - Server: provides a service
- Client-server architecture: one of each
  - Fat client: applications run on the client
  - Thin client: applications run on the server
  - Business applications use thin clients; fat clients are too hard to manage
  - Personal applications use fat clients. Does anyone use Google Docs?
- 3-tier architecture: the middle tier runs middleware
  - Middleware includes business logic and DBMS coordination
- The DBMS back end may involve multiple collaborating servers
21. Availability
- Definition: the percentage of time the system can answer queries
- Sometimes only local queries can be answered
- What detracts from availability?
  - Node crashes
  - Network crashes
- During a (perhaps partial) network crash, the DBMS at each node should continue to operate
22. Storing Data: Fragmentation
- Horizontal fragmentation
  - Usually disjoint
  - Example
  - Motivation: more efficient to place data where the queries are located, if possible
- Vertical fragmentation
  - Remember normalization?
  - Not as common in DDBMSs as horizontal fragmentation
23. Storing Data: Replication
- Disadvantages (redundancy!):
  - Wasted space
  - Possibly inconsistent data values
- Advantages:
  - Increased availability
  - Faster query evaluation
- How do we keep track of all replicas of data?
24. Distributed Catalog Management
- The catalog contains the schema, authorization, statistics, and location of each relation/replica/fragment
- Problem: where to store the catalog?
  - Example: I want to query sales. Where do I find it? The location info is in the catalog, but where is the catalog?
- Solutions:
  - Store the entire catalog at every site: catalog updates are too expensive
  - Store the entire catalog at a single master site: single point of failure, performance bottleneck
  - Store catalog info for all replicas and fragments of a relation at the birth site of each relation: but how do I find the birth site of a relation?
25. Distributed Catalog Management (ctd.)
- Intergalactic standard solution:
  - Name each relation with <local-name, birth-site>
  - E.g., <sales, Portland01>
  - Keep the catalog of a relation at its birth site
- How many network I/Os are required to update a catalog?
- If a data item's location is changed, does the birth site change? How many network I/Os are required?
26. Distributed Queries

SELECT AVG(S.age)
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7

- Horizontally fragmented: tuples with rating < 5 at Shanghai, rating >= 5 at Tokyo
  - Must compute SUM(age) and COUNT(age) at both sites
  - If the WHERE clause contained just S.rating > 6, only one site is needed
- Vertically fragmented: sid and rating at Shanghai, sname and age at Tokyo, tid at both
  - How to evaluate the query?
- Replicated: Sailors copies at both sites
  - Choice of site based on local costs and network costs
27. Distributed Joins

Sailors at LONDON (500 pages); Reserves at PARIS (1000 pages)

SELECT *
FROM Sailors S, Reserves R
WHERE S.sid = R.sid AND S.rank = 5

- Fetch as needed, page NL, Sailors as outer:
  - Cost = 500 D + 500 * 1000 * (D + S) / 10
    - The /10 comes from the rank = 5 condition
    - D is the cost to read/write a page; S is the cost to ship a page
  - S is very large, so the cost is essentially 50,000 S
- Ship to one site: ship Reserves to London
  - Cost = 1000 S + 4500 D (sort-merge join cost = 3 * (500 + 1000))
  - Essentially 1000 S
28. Semijoin Technique
- At London, project σ_rank=5(Sailors) onto the join column sid and ship this to Paris. Cost = 5 S
  - /10 for the rank = 5 selection, /10 for the projection onto the join column
- At Paris, join the Sailors projection with Reserves
  - The result is called the reduction of Reserves with respect to Sailors
- Ship the reduction of Reserves to London. Cost = 100 S
- At London, join Sailors with the reduction of Reserves
- Total cost = 105 S
- Idea: trade the cost of computing and shipping the projection for the cost of shipping the full Reserves relation
29. Bloomjoin
- At London:
  - Create a 64K-bit vector V (one 8K page)
  - Select h, a hash function from sids to [0, 64K-1]
  - If there's a sailor with rank = 5, set V[h(her sid)] = true
  - Ship V to Paris. Cost = 1 S
- At Paris:
  - If a reservation has V[h(its sid)] = true, ship it to London. Cost = 100 S
  - Perhaps a bit more if the hash function is not effective, but 64K is a pretty big bit vector
- At London, join Sailors with the reduced Reservations
- Total cost = 101 S
(A sketch of the bit-vector trick follows below.)
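A sketch in Python (illustrative only; sailors and reserves are lists of tuples whose first field is sid, and Python's hash() stands in for the chosen function h):

    M = 64 * 1024                       # 64K bits = one 8K-byte page

    def build_bit_vector(sailors):
        v = bytearray(M // 8)
        for sid, rank in sailors:
            if rank == 5:               # only sids that can join matter
                b = hash(sid) % M
                v[b // 8] |= 1 << (b % 8)
        return v                        # this single page is shipped to Paris

    def reduce_reserves(reserves, v):
        # Paris keeps only reservations whose sid *might* match.  Hash
        # collisions cause false positives (the "bit more" on the slide)
        # but never false negatives, so the final join at London is correct.
        return [(sid, rest) for sid, rest in reserves
                if v[(hash(sid) % M) // 8] & (1 << ((hash(sid) % M) % 8))]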
30. Distributed Query Optimization
- Cost-based approach: consider plans, pick the cheapest; similar to centralized optimization
  - Difference 1: communication costs must be considered
  - Difference 2: local site autonomy must be respected
  - Difference 3: new distributed join methods
- The query site constructs a global plan, with suggested local plans describing the processing at each site
- If a site can improve its suggested local plan, it is free to do so
31. Updating Distributed Data
- Synchronous replication: all copies of a modified relation (fragment) must be updated before the modifying Xact commits
  - Data distribution is made transparent to users
- Asynchronous replication: copies of a modified relation are only periodically updated; different copies may get out of sync in the meantime
  - Users must be aware of data distribution
32. Synchronous Replication
- Read-any write-all: writes are slower and reads are faster, relative to voting
  - The most common approach to synchronous replication
- Voting: an Xact must write a majority of copies to modify an object, and must read enough copies to be sure of seeing at least one most recent copy
  - E.g., with 10 copies: 7 written for an update, 4 copies read
  - Each copy has a version number
  - Usually not attractive because reads are common
- The choice of technique determines which locks to set
(A quorum check follows below.)
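The numbers in the voting example follow a quorum rule (standard reasoning, not spelled out on the slide): with N copies, write quorum W, and read quorum R, any two writes must overlap (2W > N) and any read must overlap any write (R + W > N). A tiny check in Python:

    def valid_quorum(n, w, r):
        # Any two writes must conflict, and any read must see a write.
        return 2 * w > n and r + w > n

    assert valid_quorum(10, 7, 4)       # the slide's example: 7 written, 4 read
    assert not valid_quorum(10, 7, 3)   # 3 readers could miss all 7 writers
    assert valid_quorum(10, 10, 1)      # read-any write-all: W = N, R = 1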
33. Cost of Synchronous Replication
- Before an update Xact can commit, it must obtain locks on all modified copies:
  - It sends lock requests to remote sites, and while waiting for the responses, holds its other locks!
  - If sites or links fail, the Xact cannot commit until they are back up
  - Even if there is no failure, committing must follow an expensive commit protocol with many messages
- So the alternative, asynchronous replication, is becoming widely used
34. Asynchronous Replication
- Allows a modifying Xact to commit before all copies have been changed (and readers nonetheless look at just one copy)
- Users must be aware of which copy they are reading, and that copies may be out of sync for short periods of time
- Two approaches: Primary Site and Peer-to-Peer replication
  - The difference lies in how many copies are updatable ("master copies")
35. Peer-to-Peer Replication
- More than one copy of an object can be a master in this approach
- Changes to a master copy must be propagated to the other copies somehow
- If two master copies are changed in a conflicting manner, this must be resolved (e.g., Site 1: Joe's age changed to 35; Site 2: to 36)
- Best used when conflicts do not arise
  - E.g., each master site owns a disjoint fragment
36. Primary Site Replication
- Exactly one copy of a relation is designated the primary or master copy; replicas at other sites cannot be directly updated
  - The primary copy is published
  - Other sites subscribe to (fragments of) this relation; these are secondary copies
- Main issue: how are changes to the primary copy propagated to the secondary copies?
  - Done in two steps: first, capture the changes made by committed Xacts; then apply these changes
37. Implementing the Capture Step
- Log-based capture: the log (kept for recovery) is used to generate a Change Data Table (CDT)
  - If this is done when the log tail is written to disk, we must somehow remove changes due to subsequently aborted Xacts
- Procedural capture: a procedure that is automatically invoked does the capture; typically, it just takes a snapshot
- Log-based capture is better (cheaper, faster) but relies on proprietary log details
38. Implementing the Apply Step
- The Apply process at the secondary site periodically obtains (a snapshot or) changes to the CDT table from the primary site and updates the copy
  - The period can be timer-based or user/application-defined
- The replica can be a view over the modified relation!
  - If so, replication consists of incrementally updating the materialized view as the relation changes
- Log-based capture plus continuous Apply minimizes the delay in propagating changes
- Procedural capture plus application-driven Apply is the most flexible way to process changes
(A sketch of Capture and Apply follows below.)
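A minimal sketch of the two steps for a primary-site scheme (illustrative Python; the record format and the in-memory CDT are invented for the example, and a real implementation would batch by timer and force-write its state):

    def log_based_capture(log_tail, cdt):
        # Keep only updates by Xacts whose commit record is in the tail,
        # sidestepping the aborted-Xact problem the slide mentions.
        committed = {r["xact"] for r in log_tail if r["type"] == "commit"}
        for r in log_tail:
            if r["type"] == "update" and r["xact"] in committed:
                cdt.append((r["key"], r["new_value"]))

    def apply_step(cdt, secondary_copy):
        # The secondary periodically pulls the CDT and replays it in order.
        while cdt:
            key, value = cdt.pop(0)
            secondary_copy[key] = value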
39. Distributed Locking
- How do we manage locks for objects across many sites?
  - Centralized: one site does all locking
    - Vulnerable to single-site failure
  - Primary copy: all locking for an object is done at the primary copy site for that object
    - Reading requires access to the locking site as well as the site where the object (copy/fragment) is stored
  - Fully distributed: locking for a copy is done at the site where the copy/fragment is stored
    - Requires locks at all or many sites when writing an object
40. Distributed Deadlock Detection
- Each site maintains a local waits-for graph
- A global deadlock may exist even if the local graphs contain no cycles
[Figure: at Site A, T1 waits for T2; at Site B, T2 waits for T1; neither local graph has a cycle, but the global graph does]
- Three solutions: Centralized (send all local graphs to one site); Hierarchical (organize sites into a hierarchy and send local graphs to the parent in the hierarchy); Timeout (abort an Xact if it waits too long)
(A sketch of the centralized check follows below.)
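The centralized solution amounts to unioning the shipped local graphs and testing the result for a cycle; a sketch in Python (graphs map each waiting Xact to the set of Xacts it waits for):

    def global_deadlock(local_graphs):
        # Union the local waits-for graphs collected from every site.
        waits_for = {}
        for g in local_graphs:
            for t, targets in g.items():
                waits_for.setdefault(t, set()).update(targets)
        # Depth-first search for a back edge (i.e., a cycle).
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {}
        def dfs(t):
            color[t] = GRAY
            for u in waits_for.get(t, ()):
                c = color.get(u, WHITE)
                if c == GRAY or (c == WHITE and dfs(u)):
                    return True
            color[t] = BLACK
            return False
        return any(color.get(t, WHITE) == WHITE and dfs(t) for t in waits_for)

    # Site A: T1 waits for T2.  Site B: T2 waits for T1.  No local cycle,
    # yet the global graph deadlocks -- exactly the figure's scenario.
    assert global_deadlock([{"T1": {"T2"}}, {"T2": {"T1"}}])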
41. Distributed Transactions
- Problem: atomicity and durability in DDBMSs
- Example: A = A + 100 at Paris, B = B - 100 at London
  - Each update is called a subtransaction
- The query originates at a third site, called the coordinator
- Who keeps the log? Each site
- How does each subtransaction know when to commit?
42. Two-Phase Commit (2PC) Protocol
- The coordinator sends a prepare message to each subtransaction
- Each subtransaction responds with yes or no
  - Yes means it has done everything but write the end record
- The coordinator sends commit or abort to all subtransactions
- The subtransactions send ack messages to the coordinator
(A sketch of the coordinator follows below.)
43. 2PC Issues
- What if...
  - only one subtransaction says no?
  - all subtransactions say yes, and then one cannot commit?
  - all subtransactions say yes, and then one crashes?
44. Learning Objectives
- Definitions: parallel and distributed DBMSs, scalability, the 3 parallel architectures, pipeline and partition parallelism, the 3 types of data partitioning and their uses
- How to parallelize operators and determine their scalability: scan, select, sort, aggregate, hash join
- Definitions of distributed data independence and ACIDity, network architectures, availability, fragmentation, replication
- How is replication managed?
- How are distributed queries processed?
- How is distributed data updated?
- How are distributed locks and deadlocks managed?
- How are distributed transactions managed?