Title: Distributed Data Storage and Processing over Commodity Clusters
1. Distributed Data Storage and Processing over Commodity Clusters
Sector / Sphere
Yunhong Gu, Univ. of Illinois at Chicago; presented at Univ. of Chicago, Feb. 17, 2009
2. What is Sector/Sphere?
- Sector: a distributed storage system.
- Sphere: run-time middleware that supports simplified distributed data processing.
- Open source software, GPL, written in C++.
- Started in 2006; current version 1.18.
- http://sector.sf.net
3. Overview
- Motivation
- Sector
- Sphere
- Experimental studies
- Future work
4. Motivation
Supercomputer model: expensive, with a data I/O bottleneck.
Sector/Sphere model: inexpensive, with parallel data I/O.
5. Motivation
Parallel/distributed programming with MPI, etc.: flexible and powerful, but complicated, with no data locality.
Sector/Sphere model: the cluster appears as a single entity to the developer, with a simplified programming interface and data-locality support from the storage layer. Limited to certain data-parallel applications.
6. Motivation
Systems designed for a single data center: require additional effort to locate and move data.
Sector/Sphere model: supports wide-area data collection and distribution.
7. Sector: Distributed Storage System
[Architecture diagram: the client connects to the security server and the master over SSL; the master provides storage system management, processing scheduling, and service provisioning; the security server handles user accounts, data protection, and system security; the client uses system access tools and application programming interfaces; data moves between the client and the slave nodes over UDT, with optional encryption; the slaves provide storage and processing.]
8. Sector: Distributed Storage System
- Sector stores files on the native/local file system of each slave node.
- Sector does not split files into blocks.
- Pro: simple and robust, suitable for wide-area deployment.
- Con: file size is limited (a file must fit on a single node).
- Sector uses replication for better reliability and availability.
- The master node maintains the file system metadata; no permanent metadata is needed.
- Topology aware.
9. Sector: Write/Read
- Write is exclusive.
- Replicas are updated in a chained manner: the client updates one replica, then that replica updates the next, and so on. All replicas are up to date upon completion of a write operation (a sketch of this pattern follows this slide).
- Read: different replicas can serve different clients at the same time. The replica nearest to the client is chosen whenever possible.
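A minimal sketch of the chained replica update described above, simulated in-process. The Replica type, the write_local stub, and chained_write are illustrative assumptions, not Sector's actual interfaces; in the real system the data moves between slaves over UDT.

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical replica descriptor; Sector's internal structures differ.
struct Replica { std::string host; };

// Stub standing in for the local disk write on one slave.
static bool write_local(const Replica& r, const char* /*data*/, std::size_t size)
{
    std::printf("%s stored %lu bytes\n", r.host.c_str(), (unsigned long)size);
    return true;
}

// Chained update: replica i stores its copy, then forwards the same data to
// replica i+1; the write is complete only when the last replica has stored it.
// In a real deployment each step runs on a different slave; here the chain is
// simulated in-process by a recursive call.
static bool chained_write(const std::vector<Replica>& chain, std::size_t i,
                          const char* data, std::size_t size)
{
    if (i >= chain.size()) return true;                 // past the last link: done
    if (!write_local(chain[i], data, size)) return false;
    return chained_write(chain, i + 1, data, size);     // hand off to the next link
}

int main()
{
    std::vector<Replica> chain(3);
    chain[0].host = "slave-a";
    chain[1].host = "slave-b";
    chain[2].host = "slave-c";

    const char block[] = "record data";
    // The client contacts only the first replica; the others follow in a chain.
    return chained_write(chain, 0, block, sizeof(block)) ? 0 : 1;
}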
10. Sector: Tools and API
- Supported file system operations: ls, stat, mv, cp, mkdir, rm, upload, download.
- Wildcard characters are supported.
- System monitoring: sysinfo.
- C++ API: list, stat, move, copy, mkdir, remove, open, close, read, write, sysinfo.
11. Sphere: Simplified Data Processing
- Data-parallel applications.
- Data is processed where it resides, or on the nearest possible node (locality).
- The same user-defined function (UDF) can be applied to all elements (records, blocks, or files).
- Processing output can be written to Sector files, on the same node or on other nodes.
- Generalized Map/Reduce.
12. Sphere: Simplified Data Processing
[Diagram: three Sphere processing patterns: a single UDF mapping Input to Output; two chained UDFs mapping Input to an Intermediate result and then to Output; and one UDF combining Input 1 and Input 2 into a single Output.]
13. Sphere: Simplified Data Processing
Serial version:
for each file F in (SDSS datasets)
    for each image I in F
        findBrownDwarf(I, ...);

Sphere version:
SphereStream sdss;
sdss.init("sdss files");
SphereProcess myproc;
myproc->run(sdss, "findBrownDwarf", ...);
myproc->read(result);

UDF signature:
findBrownDwarf(char* image, int isize, char* result, int rsize);
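For illustration, a sketch of what the body of a UDF with the signature above might look like. Only the (image, isize, result, rsize) parameter pattern comes from the slide; the brightness test, the return convention, and the example input are hypothetical.

#include <cstdio>
#include <cstring>

// Illustrative UDF body: scan the raw image bytes, apply a stand-in
// "bright pixel" test, and write a small answer into the caller's buffer.
int findBrownDwarf(char* image, int isize, char* result, int rsize)
{
    int bright = 0;
    for (int i = 0; i < isize; ++i)
        if (static_cast<unsigned char>(image[i]) > 200)   // hypothetical threshold
            ++bright;

    const char* verdict = (bright > 0) ? "candidate" : "none";
    if (rsize > 0) {
        std::strncpy(result, verdict, rsize - 1);
        result[rsize - 1] = '\0';
    }
    return 0;   // assumed convention: 0 on success
}

int main()
{
    unsigned char raw[4] = {10, 250, 20, 30};             // fake pixel values
    char result[16];
    findBrownDwarf(reinterpret_cast<char*>(raw), 4, result, sizeof(result));
    std::printf("%s\n", result);                          // prints "candidate"
    return 0;
}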
14. Sphere: Data Movement
- Slave -> Slave (Local)
- Slave -> Slaves (Shuffle/Hash)
- Slave -> Client
15. Load Balancing & Fault Tolerance
- The number of data segments is much larger than the number of Sphere Processing Engines (SPEs). When an SPE completes a data segment, a new segment is assigned to it (see the scheduling sketch after this list).
- If an SPE fails, the data segment assigned to it is re-assigned to another SPE and processed again.
- Faulty nodes are detected and removed.
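A small, self-contained sketch of the scheduling policy described above, using a plain in-memory queue of segment IDs. The Scheduler type and its method names are illustrative assumptions, not the Sphere scheduler.

#include <deque>
#include <map>
#include <string>

// Illustrative scheduler state: segments waiting to be processed, plus the
// segment each SPE is currently working on.
struct Scheduler {
    std::deque<int> pending;                 // segment IDs not yet assigned
    std::map<std::string, int> running;      // SPE ID -> segment in progress

    // When an SPE becomes idle (its previous segment is done), hand it the
    // next pending segment; returns -1 when nothing is left.
    int on_spe_idle(const std::string& spe) {
        if (pending.empty()) return -1;
        int seg = pending.front();
        pending.pop_front();
        running[spe] = seg;
        return seg;
    }

    // When an SPE fails, its in-progress segment goes back to the queue so
    // another SPE will process it again.
    void on_spe_failure(const std::string& spe) {
        std::map<std::string, int>::iterator it = running.find(spe);
        if (it != running.end()) {
            pending.push_back(it->second);
            running.erase(it);
        }
    }
};

int main()
{
    Scheduler s;
    for (int seg = 0; seg < 6; ++seg) s.pending.push_back(seg);

    s.on_spe_idle("spe-1");                  // spe-1 gets segment 0
    s.on_spe_idle("spe-2");                  // spe-2 gets segment 1
    s.on_spe_failure("spe-1");               // segment 0 returns to the queue
    return s.on_spe_idle("spe-2") == 2 ? 0 : 1;   // spe-2 moves on to segment 2
}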
16. Open Cloud Testbed
- 4 racks in Baltimore (JHU), Chicago (StarLight and UIC), and San Diego (Calit2).
- 10Gb/s inter-site connections over CiscoWave.
- 1Gb/s inter-rack connections.
- Each node: two dual-core AMD CPUs, 12GB RAM, a single 1TB disk.
17. Open Cloud Testbed
18. Example: Sorting a Terabyte
- The data is split into small files scattered across all slaves.
- Stage 1: on each slave, an SPE scans its local files and sends each record to a bucket file on a remote node according to the record's key, so that the buckets themselves are in sorted order.
- Stage 2: on each destination node, an SPE sorts all data inside each bucket (a sketch of the Stage 1 bucketing follows this list).
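A minimal sketch of the Stage 1 bucketing described above and pictured on the next slide: the bucket ID comes from the first 10 bits of the 10-byte key, giving 1024 range-ordered buckets. The record layout follows the slides; the bit extraction and the example are illustrative.

#include <cstdio>

// TeraSort record layout from the slides: 10-byte key + 90-byte value.
struct Record {
    unsigned char key[10];
    unsigned char value[90];
};

// Stage 1 rule: the first 10 bits of the key select one of 1024 buckets,
// so bucket 0 holds the smallest keys and bucket 1023 the largest; sorting
// each bucket locally in Stage 2 then yields a globally sorted result.
int bucket_id(const Record& r)
{
    return (r.key[0] << 2) | (r.key[1] >> 6);   // 8 bits + top 2 bits = 10 bits
}

int main()
{
    Record r = {};
    r.key[0] = 0xAB;                            // 1010 1011
    r.key[1] = 0x40;                            // 01.. ....
    std::printf("bucket = %d\n", bucket_id(r)); // prints "bucket = 685"
    return 0;
}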
19. TeraSort
[Diagram: each binary record is 100 bytes, a 10-byte key followed by a 90-byte value. Stage 1: hash each record into one of 1024 buckets (0-1023) based on the first 10 bits of its key. Stage 2: sort each bucket on its local node.]
20. Performance Results: TeraSort
Run time in seconds, Sector v1.16 vs. Hadoop 0.17.

Racks                          | Data Size | Sphere | Hadoop (3 replicas) | Hadoop (1 replica)
UIC                            | 300GB     | 1265   | 2889                | 2252
UIC + StarLight                | 600GB     | 1361   | 2896                | 2617
UIC + StarLight + Calit2       | 900GB     | 1430   | 4341                | 3069
UIC + StarLight + Calit2 + JHU | 1.2TB     | 1526   | 6675                | 3702
21. Performance Results: TeraSort
- Sorting 1.2TB on 120 nodes.
- Hash vs. local sort: 981 sec vs. 545 sec.
- Hash stage:
- Per rack: 220GB in/out; per node: 10GB in/out.
- CPU: 130%, memory: 900MB.
- Local sort stage:
- No network I/O.
- CPU: 80%, memory: 1.4GB.
- Hadoop: CPU: 150%, memory: 2GB.
22. CreditStone
[Diagram: each text record is a transaction with fields Trans ID, Time, Merchant ID, Fraud flag, and Amount. Stage 1: transform each record into a (key, value) pair whose 3-byte key (000-999) is derived from the merchant ID, and hash it into the corresponding bucket (merch-000 ... merch-999). Stage 2: compute the fraudulent-transaction rate for each merchant.]
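A minimal sketch of the Stage 1 transform just described. Only the idea of hashing each transaction into one of 1000 merchant buckets comes from the slide; the Transaction field types, the modulo-based key derivation, and the bucket file naming are illustrative assumptions.

#include <cstdio>
#include <string>

// Illustrative transaction record; the real CreditStone layout differs.
struct Transaction {
    std::string trans_id;
    std::string time;
    long merchant_id;
    bool fraud;
    double amount;
};

// Stage 1 rule: derive a 3-digit key (000-999) from the merchant ID so that
// all of a merchant's transactions land in the same bucket file; Stage 2 can
// then compute each merchant's fraud rate without any further data movement.
std::string bucket_name(const Transaction& t)
{
    char buf[16];
    std::snprintf(buf, sizeof(buf), "merch-%03ld", t.merchant_id % 1000);
    return std::string(buf);
}

int main()
{
    Transaction t = {"tx-000001", "2007-09-27", 123456, false, 66.49};
    std::printf("%s\n", bucket_name(t).c_str());   // prints "merch-456"
    return 0;
}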
23. Performance Results: CreditStone

Racks                    | JHU | JHU, SL | JHU, SL, Calit2 | JHU, SL, Calit2, UIC
Number of Nodes          | 30  | 59      | 89              | 117
Size of Dataset (GB)     | 840 | 1652    | 2492            | 3276
Size of Dataset (rows)   | 15B | 29.5B   | 44.5B           | 58.5B
Hadoop (min)             | 179 | 180     | 191             | 189
Sector with Index (min)  | 46  | 47      | 64              | 71
Sector w/o Index (min)   | 36  | 37      | 53              | 55

Courtesy of Jonathan Seidman of Open Data Group.
24. System Monitoring (Testbed)
25. System Monitoring (Sector/Sphere)
26. Future Work
- High Availability
- Multiple master servers
- Scheduling
- Optimize data channel
- Enhance compute model and fault tolerance
27. For More Information
- Sector/Sphere code and docs: http://sector.sf.net
- Open Cloud Consortium: http://www.opencloudconsortium.org
- NCDM: http://www.ncdm.uic.edu
28. Inverted Index
[Diagram: Stage 1: process each HTML page and hash each (word, page_id) pair into buckets A-Z keyed by the word's first letter. Stage 2: sort each bucket on its local node and merge entries for the same word, e.g. (word_z, 1), (word_z, 5), (word_z, 10) merge into (word_z, {1, 5, 10}).]
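A small, self-contained sketch of the two stages just pictured, run in-process with standard containers. The tokenizer, the function names, and the driver are illustrative, not the Sphere UDFs used for this benchmark.

#include <cctype>
#include <cstddef>
#include <cstdio>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, int> > Bucket;   // (word, page_id) pairs

// Stage 1 (per page): emit (word, page_id) pairs and route each pair to a
// bucket keyed by the word's first letter, mirroring the A-Z buckets above.
void stage1(const std::string& text, int page_id, std::map<char, Bucket>& buckets)
{
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        char key = std::toupper(static_cast<unsigned char>(word[0]));
        buckets[key].push_back(std::make_pair(word, page_id));
    }
}

// Stage 2 (per bucket): merge entries for the same word into one posting list.
std::map<std::string, std::set<int> > stage2(const Bucket& bucket)
{
    std::map<std::string, std::set<int> > index;
    for (std::size_t i = 0; i < bucket.size(); ++i)
        index[bucket[i].first].insert(bucket[i].second);
    return index;
}

int main()
{
    std::map<char, Bucket> buckets;
    stage1("data storage data", 1, buckets);     // pretend HTML page 1
    stage1("storage cluster", 5, buckets);       // pretend HTML page 5

    // Print every bucket's posting lists, e.g. "storage: 1 5".
    for (std::map<char, Bucket>::iterator b = buckets.begin(); b != buckets.end(); ++b) {
        std::map<std::string, std::set<int> > idx = stage2(b->second);
        for (std::map<std::string, std::set<int> >::iterator w = idx.begin(); w != idx.end(); ++w) {
            std::printf("%s:", w->first.c_str());
            for (std::set<int>::iterator p = w->second.begin(); p != w->second.end(); ++p)
                std::printf(" %d", *p);
            std::printf("\n");
        }
    }
    return 0;
}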