Title: 22. Parallel and Distributed Access
1. Parallel and Distributed Access
- Parallel vs. distributed DBMSs
- Parallel DBMSs
  - Measuring success: scalability
  - Hardware architectures
  - Types of parallelism; partitioning
  - Parallelizing operators: scan, select, sort, aggregate, join
- Distributed DBMSs
  - Classifying distributed DBMSs: hetero-/homogeneous, client-server vs. collaborating servers, horizontal vs. vertical fragmentation
  - Distributed catalog management
  - Distributed query processing: general queries, joins, optimization
  - Replication
    - Synchronous: voting, read-any write-all
    - Asynchronous: peer-to-peer, primary site (Capture and Apply)
  - Distributed locking, deadlock detection, transactions
2. Parallel vs. Distributed DBMSs
- DBMS may be spread among the nodes of a network
- Parallel DBMS
- Nodes are under a central control
- Typically nodes are in one location
- Motivated by performance
- Distributed DBMS
- Nodes are autonomous, may run different DBMSs
- Nodes are physically distributed
- Motivated by need for remote access to data in
spite of network failures
3. Motivation for Parallel DBMSs: What if one CPU is not enough?
- For a large web application, one CPU is never enough
- A large web application may need to scan a 1TB file or serve 100s of customers at one time
- At 50 MB/second, scanning a 1TB file takes over 5 hours
- Using 1000 CPUs scanning 1000 disks, the scan takes 20 seconds
- Similarly, 100s of customers accessing a database can push response times into the minutes
- Solution: parallel DBMSs
4. DBMS: The Success Story
- DBMSs, and information retrieval, are the most (only?) successful application of parallelism
- Every major DBMS vendor has some parallel server
- Large DBMS back ends for web applications are all parallelized
- Large IR applications are parallelized
- Reasons for DBMS success:
  - Relational algebra has few operators: just parallelize each one and compose them
  - Bulk processing (partition parallelism)
  - Natural pipelining in operator trees
5. How do we measure the success of parallel algorithms and architectures?
- Scalability. There are two kinds (made precise below):
  - Speedup: when the number of nodes grows by a factor n, efficiency improves by a factor n
  - Scaleup: when the number of nodes and the size of the database both grow by a factor n, efficiency does not change
- Efficiency can be measured in terms of:
  - Elapsed time, or average time per transaction
  - Number of transactions per second (throughput)
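One conventional way to make these precise (standard definitions, consistent with the bullets above but not spelled out on the slide):

    speedup(n) = T(1 node, problem size s) / T(n nodes, problem size s)      ideal: n
    scaleup(n) = T(1 node, problem size s) / T(n nodes, problem size n*s)    ideal: 1

Linear speedup means n nodes finish the same job n times faster; linear scaleup means an n-times-larger job on n nodes takes no longer than the original job on one node.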
6. Architecture Issue: Shared What?
[Figure: the shared-memory, shared-disk, and shared-nothing architectures]
- Shared memory (e.g., Sun, many others): easy to program; expensive to build; difficult to scale beyond ~30 processors
- Shared nothing (e.g., clusters, NCR Teradata, IBM SP): hard to program; cheap to build; easy to scale to 1000s of processors
7. How would you parallelize this plan?
- Assume Sailors and Reservations are already sorted on sid
[Plan: an on-the-fly selection σ_rank<5 over a sort-merge join ⋈_sid of Sailors and Reservations]
8. Different Types of DBMS Parallelism
- Pipeline parallelism (inter-operator): each operator may run concurrently on a different node (may exploit pipelining or bushy plans)
- Partition parallelism (intra-operator): multiple nodes work to compute a given operation (scan, sort, join)
- We'll focus on intra-operator parallelism
- Challenge: how to partition the data!
9. Three Types of Data Partitioning
[Figure: range, hash, and round-robin partitioning spreading key ranges A...E, F...J, K...N, O...S, T...Z across nodes; range keeps each node's keys contiguous, hash and round robin scatter them]
- Range partitioning: good for equijoins, range queries, group-by
- Hash partitioning: good for equijoins, group-by
- Round robin: good to spread load
- Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits more from "good" partitioning (a sketch of the three schemes follows below)
10. How to Parallelize the Scan Operator
- Given a partition, what is the parallel version of the scan operator?
  - Scan each partition, then merge the outputs
- What kind of partitioning will make the scan scalable?
  - Equal-size partitions
11. Parallelizing Selection
- What is the parallel version of the selection operator?
  - Range partition, range selection
  - Hash partition, range selection
  - Hash partition, equality selection
  - Round-robin partition
- Which of the above is scalable for speedup? (See the routing sketch below.)
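The standard routing reasoning, as a sketch (the function name and signature are invented for illustration): a selection speeds up only when its work is spread over many nodes, so hash partitioning routes an equality selection to exactly one node, range partitioning routes range and equality predicates to just the overlapping nodes, and hash-with-range-predicate or round robin must scan every node.

    import bisect

    def nodes_for_selection(partitioning, kind, value=None, lo=None, hi=None,
                            partition_vector=None, n_nodes=1):
        # Which nodes must participate in a selection?
        if partitioning == "range":
            if kind == "equality":
                return [bisect.bisect_left(partition_vector, value)]
            if kind == "range":
                # Only nodes whose key ranges overlap [lo, hi].
                first = bisect.bisect_left(partition_vector, lo)
                last = bisect.bisect_left(partition_vector, hi)
                return list(range(first, last + 1))
        if partitioning == "hash" and kind == "equality":
            return [hash(value) % n_nodes]          # exactly one node
        # Hash partition with a range predicate, or round robin with any
        # predicate: the key gives no routing information, so all nodes scan.
        return list(range(n_nodes))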
12. Parallel Sorting: Two Algorithms
- Distributive sort: range-partition the data, then sort each range (a sketch follows below)
  - Is it scalable? Only if the partitions are equal-sized
  - How to choose the partitioning vector?
- Parallel external sort-merge: sort the data at each node, then merge
  - Problem: how to merge in parallel
  - One solution: range-partition the data
  - Is it scalable?
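A toy sketch of distributive sort (illustrative Python; each list in partitions stands for one node, and the local sorts would run in parallel on real hardware):

    import bisect

    def choose_partition_vector(sample, n_nodes):
        # Pick splitters from a sample so the ranges come out roughly
        # equal-sized -- the condition the slide gives for scalability.
        sample = sorted(sample)
        step = max(1, len(sample) // n_nodes)
        return [sample[i] for i in range(step, len(sample), step)][:n_nodes - 1]

    def distributive_sort(tuples, partition_vector):
        n_nodes = len(partition_vector) + 1
        # Step 1: range-partition the input across the nodes.
        partitions = [[] for _ in range(n_nodes)]
        for t in tuples:
            partitions[bisect.bisect_left(partition_vector, t)].append(t)
        # Step 2: each node sorts its own range locally (in parallel).
        for p in partitions:
            p.sort()
        # The ranges are disjoint and ordered, so concatenation is the answer.
        return [t for p in partitions for t in p]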
13. Parallel Sorting, ctd.
- How to choose the partitioning vector? Sampling
- What if sampling is not possible?
  - For example, if the input is another process, such as another operator in a query plan
  - In this case, use external sort-merge and set up a tree of merging steps
  - See US Patent 5,852,826
14. Parallel Aggregates
- Problem: compute an aggregate function in parallel
- For each aggregate function, we need a decomposition:
  - sum(S) = sum( sum(s(i)) )
  - count(S) = sum( count(s(i)) )
  - avg(S) = sum( sum(s(i)) ) / sum( count(s(i)) )
- Is it scalable?
- Sometimes it's not so simple: median()
- Similar for groups
(A sketch of the decomposition follows below.)
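A minimal sketch of the decomposition (illustrative Python; partitions stands for the fragments s(i), and each partial result would be computed at its own node):

    def parallel_avg(partitions):
        # Each node computes a small partial state over its fragment...
        partials = [(sum(p), len(p)) for p in partitions]   # in parallel
        # ...and one cheap combine step yields the global answer.
        total_sum = sum(s for s, _ in partials)
        total_count = sum(c for _, c in partials)
        return total_sum / total_count

    # median() has no such small partial state: combining would need the
    # data itself, which is why the slide calls it "not so simple".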
15. Parallel Hash Join
- Recall Phase 1 of hash join: partition each input relation by hashing on the join key
- Do Phase 1 in parallel, for R and then S:
  - Perform the partitioning at each node
  - Send hash bucket i's output to node i, bucket j's to node j, etc.
- Phase 2 is simpler: perform the join of Ri and Si at each node i
(A sketch of both phases follows below.)
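A toy sketch of both phases (illustrative Python; relations are lists of (key, payload) pairs, and the list of buckets stands in for the network shuffle between nodes):

    from collections import defaultdict

    def shuffle(relation, n_nodes):
        # Phase 1: every node hash-partitions its tuples and ships
        # bucket i to node i; here the shipping is just list appends.
        buckets = [[] for _ in range(n_nodes)]
        for key, payload in relation:
            buckets[hash(key) % n_nodes].append((key, payload))
        return buckets

    def local_join(r_i, s_i):
        # Phase 2 at node i: join fragments Ri and Si.  The shared hash
        # function guarantees matching keys ended up at the same node.
        index = defaultdict(list)
        for key, payload in r_i:
            index[key].append(payload)
        return [(key, pr, ps) for key, ps in s_i for pr in index[key]]

    def parallel_hash_join(r, s, n_nodes=4):
        r_parts, s_parts = shuffle(r, n_nodes), shuffle(s, n_nodes)
        return [row for i in range(n_nodes)
                    for row in local_join(r_parts[i], s_parts[i])]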
16. Dataflow Network for Join A ⋈ B
[Figure: split and merge operators routing tuples between Node I and Node J]
- Good use of split/merge makes it easier to build parallel versions of sequential join code
17. Complex Parallel Query Plans
- Complex queries: inter-operator parallelism
  - Pipelining between operators
  - Note that sort and Phase 1 of hash join block the pipeline!
  - Bushy trees
[Figure: a bushy plan with subtrees running at Sites 1-4 and Sites 5-8, combined at Sites 1-8]
18. Distributed Databases
- Data is stored at several sites, each managed by a DBMS that can run independently
- Distributed data independence: users should not have to know where data is located (extends the physical and logical data independence principles)
- Distributed transaction ACIDity: users should be able to write atomic and durable Xacts accessing multiple sites
19. Types of Distributed Databases
- Homogeneous: every site runs the same type of DBMS
- Heterogeneous: different sites run different DBMSs (different RDBMSs or even non-relational DBMSs)
[Figure: a gateway connecting DBMS1, DBMS2, and DBMS3]
20. Network Architectures
- Definitions:
  - Client: requests a service
  - Server: provides a service
- Client-server architecture: one of each
  - Fat client: applications run on the client
  - Thin client: applications run on the server
  - Business applications use thin clients; fat clients are too hard to manage
  - Personal applications use fat clients. Does anyone use Google Docs?
- 3-tier architecture: the middle tier runs middleware
  - Middleware includes business logic and DBMS coordination
- The DBMS back end may involve multiple collaborating servers
21. Availability
- Definition: the percentage of time the system can answer queries
- Sometimes only local queries can be answered
- What detracts from availability?
  - Node crashes
  - Network crashes
- During a (perhaps partial) network crash, the DBMS at each node should continue to operate
22. Storing Data: Fragmentation
- Horizontal fragmentation
  - Usually disjoint
  - Example
  - Motivation: more efficient to place data where the queries are located, if possible
- Vertical fragmentation
  - Remember normalization?
  - Not as common in DDBMSs as horizontal fragmentation
23. Storing Data: Replication
- Disadvantages (redundancy!):
  - Wasted space
  - Possibly inconsistent data values
- Advantages:
  - Increased availability
  - Faster query evaluation
- How do we keep track of all replicas of data?
24. Distributed Catalog Management
- The catalog contains the schema, authorization, statistics, and location of each relation/replica/fragment
- Problem: where to store the catalog?
  - Example: I want to query sales. Where do I find it? The location info is in the catalog, but where is the catalog?
- Solutions:
  - Store the entire catalog at every site: catalog updates are too expensive
  - Store the entire catalog at a single master site: single point of failure, performance bottleneck
  - Store catalog info for all replicas and fragments of a relation at the birth site of each relation: but how do I find the birth site of a relation?
25. Distributed Catalog Management (ctd.)
- Intergalactic standard solution:
  - Name each relation with <local-name, birth-site>
  - E.g., <sales, Portland01>
  - Keep the catalog of a relation at its birth site
- How many network I/Os are required to update a catalog?
- If a data item's location is changed, does the birth site change? How many network I/Os are required?
26. Distributed Queries

SELECT AVG(S.age)
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7

- Horizontally fragmented: tuples with rating < 5 at Shanghai, rating >= 5 at Tokyo
  - Must compute SUM(age) and COUNT(age) at both sites
  - If the WHERE clause contained just S.rating > 6, only one site is needed
- Vertically fragmented: sid and rating at Shanghai, sname and age at Tokyo, tid at both
  - How to evaluate the query?
- Replicated: Sailors copies at both sites
  - Choice of site based on local costs and network costs
27. Distributed Joins

Sailors at LONDON (500 pages); Reserves at PARIS (1000 pages)

SELECT *
FROM Sailors S, Reserves R
WHERE S.sid = R.sid AND S.rank = 5

- Fetch as needed, page NL, Sailors as outer:
  - Cost = 500 D + 500 * 1000 * (D + S) / 10
    - The /10 comes from the rank = 5 condition
    - D is the cost to read/write a page; S is the cost to ship a page
  - S is very large, so the cost is essentially 50,000 S
- Ship to one site: ship Reserves to London
  - Cost = 1000 S + 4500 D (sort-merge join cost = 3 * (500 + 1000))
  - Essentially 1000 S
28. Semijoin Technique
- At London, project σ_rank=5(Sailors) onto the join column sid and ship this to Paris. Cost = 5 S
  - /10 for the rank = 5 selection, /10 for the projection onto the join column
- At Paris, join the Sailors projection with Reserves
  - The result is called the reduction of Reserves with respect to Sailors
- Ship the reduction of Reserves to London. Cost = 100 S
- At London, join Sailors with the reduction of Reserves
- Total cost = 105 S
- Idea: trade the cost of computing and shipping the projection for the cost of shipping the full Reserves relation
29. Bloomjoin
- At London:
  - Create a 64K-bit vector V (one 8K page)
  - Select h, a hash function from sids to [0, 64K-1]
  - If there's a sailor with rank = 5, set V[h(her sid)] = true
  - Ship V to Paris. Cost = 1 S
- At Paris:
  - If a reservation has V[h(its sid)] = true, ship it to London. Cost = 100 S
  - Perhaps a bit more if the hash function is not effective, but 64K is a pretty big bit vector
- At London, join Sailors with the reduced Reservations
- Total cost = 101 S
(A sketch of the bit-vector trick follows below.)
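A sketch in Python (illustrative only; sailors and reserves are lists of tuples whose first field is sid, and Python's hash() stands in for the chosen function h):

    M = 64 * 1024                       # 64K bits = one 8K-byte page

    def build_bit_vector(sailors):
        v = bytearray(M // 8)
        for sid, rank in sailors:
            if rank == 5:               # only sids that can join matter
                b = hash(sid) % M
                v[b // 8] |= 1 << (b % 8)
        return v                        # this single page is shipped to Paris

    def reduce_reserves(reserves, v):
        # Paris keeps only reservations whose sid *might* match.  Hash
        # collisions cause false positives (the "bit more" on the slide)
        # but never false negatives, so the final join at London is correct.
        return [(sid, rest) for sid, rest in reserves
                if v[(hash(sid) % M) // 8] & (1 << ((hash(sid) % M) % 8))]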
30. Distributed Query Optimization
- Cost-based approach: consider plans, pick the cheapest; similar to centralized optimization
  - Difference 1: communication costs must be considered
  - Difference 2: local site autonomy must be respected
  - Difference 3: new distributed join methods
- The query site constructs a global plan, with suggested local plans describing the processing at each site
- If a site can improve its suggested local plan, it is free to do so
31. Updating Distributed Data
- Synchronous replication: all copies of a modified relation (fragment) must be updated before the modifying Xact commits
  - Data distribution is made transparent to users
- Asynchronous replication: copies of a modified relation are only periodically updated; different copies may get out of sync in the meantime
  - Users must be aware of data distribution
32. Synchronous Replication
- Read-any write-all: writes are slower and reads are faster, relative to voting
  - The most common approach to synchronous replication
- Voting: an Xact must write a majority of copies to modify an object, and must read enough copies to be sure of seeing at least one most recent copy
  - E.g., with 10 copies: 7 written for an update, 4 copies read
  - Each copy has a version number
  - Usually not attractive because reads are common
- The choice of technique determines which locks to set
(A quorum check follows below.)
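The numbers in the voting example follow a quorum rule (standard reasoning, not spelled out on the slide): with N copies, write quorum W, and read quorum R, any two writes must overlap (2W > N) and any read must overlap any write (R + W > N). A tiny check in Python:

    def valid_quorum(n, w, r):
        # Any two writes must conflict, and any read must see a write.
        return 2 * w > n and r + w > n

    assert valid_quorum(10, 7, 4)       # the slide's example: 7 written, 4 read
    assert not valid_quorum(10, 7, 3)   # 3 readers could miss all 7 writers
    assert valid_quorum(10, 10, 1)      # read-any write-all: W = N, R = 1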
33. Cost of Synchronous Replication
- Before an update Xact can commit, it must obtain locks on all modified copies:
  - It sends lock requests to remote sites, and while waiting for the responses, holds its other locks!
  - If sites or links fail, the Xact cannot commit until they are back up
  - Even if there is no failure, committing must follow an expensive commit protocol with many messages
- So the alternative, asynchronous replication, is becoming widely used
34. Asynchronous Replication
- Allows a modifying Xact to commit before all copies have been changed (and readers nonetheless look at just one copy)
- Users must be aware of which copy they are reading, and that copies may be out of sync for short periods of time
- Two approaches: Primary Site and Peer-to-Peer replication
  - The difference lies in how many copies are updatable ("master copies")
35. Peer-to-Peer Replication
- More than one copy of an object can be a master in this approach
- Changes to a master copy must be propagated to the other copies somehow
- If two master copies are changed in a conflicting manner, this must be resolved (e.g., Site 1: Joe's age changed to 35; Site 2: to 36)
- Best used when conflicts do not arise
  - E.g., each master site owns a disjoint fragment
36. Primary Site Replication
- Exactly one copy of a relation is designated the primary or master copy; replicas at other sites cannot be directly updated
  - The primary copy is published
  - Other sites subscribe to (fragments of) this relation; these are secondary copies
- Main issue: how are changes to the primary copy propagated to the secondary copies?
  - Done in two steps: first, capture the changes made by committed Xacts; then apply these changes
37. Implementing the Capture Step
- Log-based capture: the log (kept for recovery) is used to generate a Change Data Table (CDT)
  - If this is done when the log tail is written to disk, we must somehow remove changes due to subsequently aborted Xacts
- Procedural capture: a procedure that is automatically invoked does the capture; typically, it just takes a snapshot
- Log-based capture is better (cheaper, faster) but relies on proprietary log details
38. Implementing the Apply Step
- The Apply process at the secondary site periodically obtains (a snapshot or) changes to the CDT table from the primary site and updates the copy
  - The period can be timer-based or user/application-defined
- The replica can be a view over the modified relation!
  - If so, replication consists of incrementally updating the materialized view as the relation changes
- Log-based capture plus continuous Apply minimizes the delay in propagating changes
- Procedural capture plus application-driven Apply is the most flexible way to process changes
(A sketch of Capture and Apply follows below.)
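A minimal sketch of the two steps for a primary-site scheme (illustrative Python; the record format and the in-memory CDT are invented for the example, and a real implementation would batch by timer and force-write its state):

    def log_based_capture(log_tail, cdt):
        # Keep only updates by Xacts whose commit record is in the tail,
        # sidestepping the aborted-Xact problem the slide mentions.
        committed = {r["xact"] for r in log_tail if r["type"] == "commit"}
        for r in log_tail:
            if r["type"] == "update" and r["xact"] in committed:
                cdt.append((r["key"], r["new_value"]))

    def apply_step(cdt, secondary_copy):
        # The secondary periodically pulls the CDT and replays it in order.
        while cdt:
            key, value = cdt.pop(0)
            secondary_copy[key] = value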
39. Distributed Locking
- How do we manage locks for objects across many sites?
  - Centralized: one site does all locking
    - Vulnerable to single-site failure
  - Primary copy: all locking for an object is done at the primary copy site for that object
    - Reading requires access to the locking site as well as the site where the object (copy/fragment) is stored
  - Fully distributed: locking for a copy is done at the site where the copy/fragment is stored
    - Requires locks at all or many sites when writing an object
40. Distributed Deadlock Detection
- Each site maintains a local waits-for graph
- A global deadlock may exist even if the local graphs contain no cycles
[Figure: at Site A, T1 waits for T2; at Site B, T2 waits for T1; neither local graph has a cycle, but the global graph does]
- Three solutions: Centralized (send all local graphs to one site); Hierarchical (organize sites into a hierarchy and send local graphs to the parent in the hierarchy); Timeout (abort an Xact if it waits too long)
(A sketch of the centralized check follows below.)
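The centralized solution amounts to unioning the shipped local graphs and testing the result for a cycle; a sketch in Python (graphs map each waiting Xact to the set of Xacts it waits for):

    def global_deadlock(local_graphs):
        # Union the local waits-for graphs collected from every site.
        waits_for = {}
        for g in local_graphs:
            for t, targets in g.items():
                waits_for.setdefault(t, set()).update(targets)
        # Depth-first search for a back edge (i.e., a cycle).
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {}
        def dfs(t):
            color[t] = GRAY
            for u in waits_for.get(t, ()):
                c = color.get(u, WHITE)
                if c == GRAY or (c == WHITE and dfs(u)):
                    return True
            color[t] = BLACK
            return False
        return any(color.get(t, WHITE) == WHITE and dfs(t) for t in waits_for)

    # Site A: T1 waits for T2.  Site B: T2 waits for T1.  No local cycle,
    # yet the global graph deadlocks -- exactly the figure's scenario.
    assert global_deadlock([{"T1": {"T2"}}, {"T2": {"T1"}}])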
41. Distributed Transactions
- Problem: atomicity and durability in DDBMSs
- Example: A = A + 100 at Paris, B = B - 100 at London
  - Each update is called a subtransaction
- The query originates at a third site, called the coordinator
- Who keeps the log? Each site
- How does each subtransaction know when to commit?
42. Two-Phase Commit (2PC) Protocol
- The coordinator sends a prepare message to each subtransaction
- Each subtransaction responds with yes or no
  - Yes means it has done everything but write the end record
- The coordinator sends commit or abort to all subtransactions
- The subtransactions send ack messages to the coordinator
(A sketch of the coordinator follows below.)
43. 2PC Issues
- What if...
  - only one subtransaction says no?
  - all subtransactions say yes, and then one cannot commit?
  - all subtransactions say yes, and then one crashes?
44. Learning Objectives
- Definitions: parallel and distributed DBMSs, scalability, the 3 parallel architectures, pipeline and partition parallelism, the 3 types of data partitioning and their uses
- How to parallelize operators and determine their scalability: scan, select, sort, aggregate, hash join
- Definitions of distributed data independence and ACIDity, network architectures, availability, fragmentation, replication
- How is replication managed?
- How are distributed queries processed?
- How is distributed data updated?
- How are distributed locks and deadlocks managed?
- How are distributed transactions managed?