Title: ICS 214B: Transaction Processing and Distributed Data Management
1ICS 214B Transaction Processing and Distributed
Data Management
- Lecture 8 Distributed Database Systems
- Professor Chen Li
2So far Centralized DB systems
P
M
...
- Simplifications
- single front end
- one place to keep locks
- if processor fails, system fails, ...
3Next distributed database systems
- Multiple processors ( memories)
- Heterogeneity and autonomy of components
4Why do we need Distributed Databases?
- Example Big Corp. has offices in London, New
York, and Hong Kong. - Employee data
- EMP(ENO, NAME, TITLE, SALARY, )
- Where should the employee data table reside?
5Big Corp. Data Access Pattern
- Mostly, employee data is managed at the office
where the employee works - E.g., payroll, benefits, hire and fire
- Periodically, Big Corp needs consolidated access
to employee data - E.g., Big Corp. changes benefit plans and that
affects all employees. - E.g., Annual bonus depends on global net profit.
6New York Payroll app
London Payroll app
EMP
London
New York
Hong Kong Payroll app
Problem NY and HK payroll apps run very slowly!
Hong Kong
7New York Payroll app
London Payroll app
London Emp
NY Emp
London
New York
Hong Kong Payroll app
Much better!!
Hong Kong
HK Emp
8New York Payroll app
London Payroll app
Annual Bonus app
London Emp
NY Emp
London
New York
Hong Kong Payroll app
Distribution provides opportunities for parallel
execution
Hong Kong
HK Emp
9New York Payroll app
London Payroll app
Annual Bonus app
London Emp
NY Emp
London
New York
Hong Kong Payroll app
Hong Kong
HK Emp
10New York Payroll app
London Payroll app
Annual Bonus app
Lon, NY Emp
NY, HK Emp
London
New York
Hong Kong Payroll app
Replication improves availability
Hong Kong
HK, Lon Emp
11Heterogeneity and Autonomy
Application
Files
Stock ticker tape
RDBMS
Portfolio
History of dividends, ratios,...
12- We will study data management
- with multiple processors and possible
- autonomy, heterogeneity
- Impact on
- Data organization
- Query processing
- Access structures
- Concurrency control
- Recovery
13- We also study transaction monitors
- Coordinate transaction execution
- Multiple DBMSs
- High performance
- Have workflow facilities
- Manage communications with client terminals
14DB architectures
P
P
P
...
M
15DB architectures
...
M
M
...
16DB architectures
...
17DB architectures
- (4) Hybrid example Hierarchical or Clustered
18Issues for selecting architecture
- Reliability
- Scalability
- Geographic distribution of data
- Data clusters
- Performance
- Cost
19Parallel or distributed DB system?
- More similarities than differences!
20- Typically, parallel DBs
- Fast interconnect
- Homogeneous software
- High performance is goal
- Transparency is goal
21- Typically, distributed DBs
- Geographically distributed
- Data sharing is goal (may run into
- heterogeneity, autonomy)
- Disconnected operation possible
22Distributed Database Challenges
- Distributed Database Design
- Deciding what data goes where
- Depends on data access patterns of major
applications - Two subproblems
- Fragmentation partition tables into fragments
- Allocation allocate fragments to nodes
23Distributed Database Challenges
- Distributed Query Processing
- Centralized query plan goal minimize number of
disk I/Os - Additional factors in distributed scenario
- Communication costs
- Opportunity for parallelism
- Space of possible query plans is much larger!
24Distributed Database Challenges
- Distributed Concurrency Control
- Transactions span nodes
- Must be globally serializable
- Two main approaches
- Locking
- Timestamps
- Distributed Deadlock Management
- Multiple data copies need to be kept in sync
when updates occur
25Distributed Database Challenges
- Reliability of Distributed Databases
- Centralized database failure model
- processor fails
- Distributed database failure model
- One or more processors may fail
- Network may fail
- Network may be partitioned
- Data must be kept in sync
-
26? To illustrate synchronization problems
Two Generals Problem
27The one general problem (Trivial!)
G
Troops
28The two general problem
- lt-------------------------------gt
Blue army
Red army
Enemy
Blue G
Red G
messengers
29Rules
- Blue and red army must attack at same time
- Blue and red generals synchronize through
messengers - Messengers can be lost
30Distributed Database Challenges
Application
Files
Stock ticker tape
RDBMS
Portfolio
History of dividends, ratios,...
31Distributed Database Challenges
- Example unable to get statistics
- for query optimization
- Example blue general may have mind of his (or
her) own!
32Next Topic
33Distributed DB Design
- Top-down approach
- - have DB
- - how to split and allocate the sites
- Bottom-up approach
- - multi-database (possibly heterogeneous,
autonomous) - - no design issues!
34Two issues in DDB design
- Fragmentation
- Allocation
- Note issues not independent,
- but will cover separately
35- Employee relation E (,name,loc,sal,)
- 40 of queries 40 of queries
- Qa select Qb select
- from E from E
- where locSa where locSb
- and and ...
365
Sa
10
Joe
7
Sally
Sb
25
8
Tom
Sa
15
..
..
F
NM Loc Sal
NM Loc Sal
5
Sa
10
Joe
7
Sb
25
Sally
..
8
Tom
Sa
15
..
At Sb
At Sa
37- F F1, F2
- F1 ?locSa(E) F2 ?locSb(E)
? called primary horizontal fragmentation
38Fragmentation
- Horizontal Primary
- depends on local attributes
- R Derived
- depends on foreign relation
- Vertical
-
- R
39Three common horizontal fragmentation techniques
- Round robin
- Hash partitioning
- Range partitioning
40 Round robin
- R D0 D1 D2
- t1 t1
- t2 t2
- t3 t3
- t4 t4
- ... t5
- Evenly distributes data
- Good for scanning full relation
- Not good for point or range queries
- Not suitable for databases distributed over WAN
41 Hash partitioning
- R D0 D1 D2
- t1?h(k1)2 t1
- t2?h(k2)0 t2
- t3?h(k3)0 t3
- t4?h(k4)1 t4
- ...
- Good for point queries on key also for joins on
key - Not good for range queries point queries not on
key - If hash function good, even distribution
- Not suitable for databases distributed over a WAN
42 Range partitioning
- R D0 D1 D2
- t1 A5 t1
- t2 A8 t2
- t3 A2 t3
- t4 A3 t4
- ...
partitioning vector
4
7
V0 V1
- Good for point queries on A also for joins on A
- Good for some range queries on A
- Need to select good vector else unbalanced
- data skew, execution skew
43Which are good fragmentations?
- Example
- F F1, F2
- F1 ? sallt10 E F2 ? salgt20 E
? Problem Some tuples lost!
44Which are good fragmentations?
Second example F F3, F4 F3 ? sallt10 E
F4 ? salgt5 E
? Tuples with 5 lt sal lt 10 are duplicated...
45- Better design
- Example F F5, F6, F7
- F5 ?sal ? 5 E F6 ?5ltsallt10 EF7
?sal ? 10 E - ? Then replicate F6 if convenient
- (part of allocation problem)
46Desired properties for fragmentation
- R ? F F1, F2, , Fn
- Completeness
- For every data item x ? R, ? Fi?F such that
x?Fi - Disjointness
- ?x?Fi, ?? Fj such that x?Fj, i ? j
- Reconstruction
- There is function g such that
- R g(F1, F2, , Fn)
47Desired properties for horizontal fragmentation
- R ? F F1, F2, , Fn
- Completeness
- For every tuple t?R, ? Fi?F such that t?Fi
- Disjointness
- ?t?Fi, ?? Fj such that t?Fj, i ? j
- Reconstruction can safely ignore
- Completeness ? R
48How do we get completeness and disjointness?
- (1) Check it manually!
- e.g., F1 ?sallt10 E F2 ?sal?10 E
49How do we get completeness and disjointness?
- (2) Automatically generate fragments
- with these properties
-
- Horizontal fragments are defined by selection
predicates - Generate a set of selection predicates with the
desired properties
50Example of generation
- Say queries use predicates
- Alt10, Agt5, Loc SA, Loc SB
- Next - generate minterm predicates
- - eliminate useless ones
- Given simple predicates Pr p1, p2,.. pn
- minterm predicates are of the form
- p1 ? p2 ? ? pn
- where pk is pk or is pk
51Minterm predicates (part I)
- (1) Alt10 ? Agt5 ? LocSA ? LocSB
- (2) Alt10 ? Agt5 ? LocSA ? (LocSB)
- (3) Alt10 ? Agt5 ? (LocSA) ? LocSB
- (4) Alt10 ? Agt5 ? (LocSA) ? (LocSB)
- (5) Alt10 ? (Agt5) ? LocSA ? LocSB
- (6) Alt10 ? (Agt5) ? LocSA ? (LocSB)
- (7) Alt10 ? (Agt5) ? (LocSA) ? LocSB
- (8) Alt10 ? (Agt5) ? (LocSA) ? (LocSB)
52Minterm predicates (part II)
- (9) (Alt10) ? Agt5 ? LocSA ? LocSB
- (10) (Alt10) ? Agt5 ? LocSA ?(LocSB)
- (11) (Alt10) ? Agt5 ?(LocSA) ? LocSB
- (12) (Alt10) ? Agt5 ?(LocSA) ?(LocSB)
- (13) (Alt10) ?(Agt5) ? LocSA ? LocSB
- (14) (Alt10) ?(Agt5) ? LocSA ?(LocSB)
- (15) (Alt10) ?(Agt5) ?(LocSA) ? LocSB
- (16) (Alt10) ?(Agt5) ?(LocSA) ?(LocSB)
53Final fragments
- F2 5 lt A lt 10 ? LocSA
- F3 5 lt A lt 10 ? LocSB
- F6 A ? 5 ? LocSA
- F7 A ? 5 ? LocSB
- F10 A ? 10 ? LocSA
- F11 A ? 10 ? LocSB
-
54Note elimination of useless fragments depends on
application semantics
- e.g. if LOC could be ? SA, ? SB, we need to
add fragments - F4 5 ltA lt10 ? Loc ? SA ? Loc ? SB
- F8 A ? 5 ? Loc ? SA ? Loc ? SB
- F12 A ? 10 ? Loc ? SA ? Loc ? SB
55Why does this algorithm work?
- Must prove that the set of fragments is
- Complete
- Disjoint
56Summary
- Given simple predicates Pr p1, p2,.. pn
- minterm predicates are
- Mm m ? pk, 1 ? k ?n
- where pk is pk or is pk
pk?Pr
- Fragments ?m R for all m ? M are
- complete and disjoint