Title: ACMS: The Akamai Configuration Management System
1. ACMS: The Akamai Configuration Management System
Alex Sherman, Philip A. Lisiecki, Andy Berkheimer, and Joel Wein
(Akamai Technologies, Inc.; Columbia University; Polytechnic University). Presented at NSDI 2005.
Most slides are reproduced from the authors' original presentation.
2. The Akamai Platform
- Akamai operates a Content Delivery Network of 15,000 servers distributed across 1,200 ISPs in 60 countries
- Web properties (Akamai's customers) use these servers to bring their web content and applications closer to end users
3. Problem: configuration and control
- Even with a widely distributed platform, customers need to maintain control over how their content is served
- Customers need to configure their service options with the same ease and flexibility as if it were a centralized, locally hosted system
4. Why is this difficult?
- 15,000 servers must synchronize to the latest configurations within a few minutes
- Some servers may be down or partitioned off at the time of reconfiguration
- A server that comes up after some downtime must re-synchronize quickly
- A configuration change may be initiated from anywhere on the network and must reach all other servers
5. Proposed Architecture
- Front end: a small collection of Storage Points (SPs) responsible for accepting, storing, and synchronizing configuration files
- Back end: reliable and efficient delivery of configuration files to all of the edge servers; leverages the Akamai CDN
[Diagram: a Publisher submits to the Storage Points (SPs), which feed 15,000 edge servers]
6. Agreement
- A publisher contacts an accepting SP
- The accepting SP replicates a temporary file to a majority of SPs
- If replication succeeds, the accepting SP initiates an agreement algorithm called Vector Exchange
- Upon success, the accepting SP accepts the submission and all SPs upload the new file
[Diagram: the Publisher submits to an accepting SP, which replicates the file to the other SPs]
7. Vector Exchange
- For each agreement, the SPs exchange a bit vector
- Each bit corresponds to the commitment status of one SP
- Once a majority of bits are set, we say that agreement takes place
- When any SP learns of an agreement, it can upload the submission (see the sketch below)
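A minimal Python sketch of the bit-vector merging idea behind Vector Exchange; the StoragePoint class and the broadcast() and upload() names are illustrative assumptions, not Akamai's actual implementation.

    MAJORITY = 3  # e.g., 3 out of 5 Storage Points

    class StoragePoint:
        def __init__(self, sp_id, peers):
            self.sp_id = sp_id
            self.peers = peers          # the other StoragePoint instances
            self.vectors = {}           # update_id -> set of SP ids that have committed

        def start_agreement(self, update_id):
            # The accepting SP seeds the vector with its own bit and broadcasts it.
            self.vectors[update_id] = {self.sp_id}
            self.broadcast(update_id)

        def on_vector(self, update_id, vector):
            # Merge the incoming vector, set our own bit, rebroadcast only if it grew.
            known = self.vectors.setdefault(update_id, set())
            merged = known | vector | {self.sp_id}
            if merged != known:
                self.vectors[update_id] = merged
                self.broadcast(update_id)
            if len(merged) >= MAJORITY:
                self.upload(update_id)  # a majority of bits are set: agreement reached

        def broadcast(self, update_id):
            for peer in self.peers:
                peer.on_vector(update_id, self.vectors[update_id])

        def upload(self, update_id):
            pass  # hand the accepted file to the back-end delivery system

Because each SP's vector only grows, the rebroadcasts terminate, and any SP that sees a majority of bits set can safely upload.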
8. Vector Exchange Guarantees
- If a submission is accepted, at least a majority of SPs have stored and agreed on the submission
- The agreement is never lost by a future quorum. Why?
- Any future quorum contains at least one SP that saw an initiated agreement (e.g., with 5 SPs, any two majorities of 3 share at least one member)
9. Recovery Routine
- Each SP continuously runs a recovery routine that queries the other SPs for missed agreements
- If an SP finds that it missed an agreement, it downloads the corresponding configuration file
- Over time, each SP maintains a snapshot: a list of the latest versions of all accepted files (see the sketch below)
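A sketch, under assumed interfaces, of this recovery loop; peer.get_snapshot() and peer.download() are hypothetical helpers standing in for the ACMS query and download steps.

    import time

    def recovery_loop(local_snapshot, peers, poll_interval=30):
        # local_snapshot: filename -> latest accepted version known locally
        while True:
            for peer in peers:
                remote_snapshot = peer.get_snapshot()      # filename -> version
                for fname, version in remote_snapshot.items():
                    if local_snapshot.get(fname, -1) < version:
                        peer.download(fname, version)      # fetch the missed agreement's file
                        local_snapshot[fname] = version
            time.sleep(poll_interval)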
10. Back-end Delivery
- Processes on edge servers subscribe to specific configurations via their local Receiver process
- Receivers periodically query the snapshots on the SPs to learn of any updates
- If the updates match any subscriptions, the Receivers download the files via HTTP IMS (If-Modified-Since) requests (see the sketch below)
[Diagram: Receivers on the edge servers poll the Storage Points (SPs)]
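A sketch of a Receiver-style poll using an HTTP If-Modified-Since (IMS) request; the URL layout and the use of the requests library are assumptions made for illustration.

    import requests

    def poll_config(url, last_modified=None):
        headers = {"If-Modified-Since": last_modified} if last_modified else {}
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 304:
            return None, last_modified                   # file unchanged since last poll
        resp.raise_for_status()
        return resp.content, resp.headers.get("Last-Modified")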
11. Evaluation
- Evaluated on the real Akamai network
- 48 hours in the middle of a week
- 14,276 file submissions with five SPs
- Most (about 40%) were small, less than 1 KB
- Some (about 3%) were larger, in the 10 MB to 100 MB range
12. Submission and propagation
- Randomly sampled 250 edge servers to measure propagation time
- 55 seconds on average
- Dominated by cache TTLs and polling intervals
13. Propagation vs. File Sizes
- Mean and 95th-percentile propagation time vs. file size
- 99.95% of updates arrived within 3 minutes
- The rest were delayed due to temporary connectivity issues
14. Discussion
- Availability of a majority of SPs guarantees agreement
- Could an SP availability metric somehow be used to construct a smaller quorum?
- How many SPs should there be for 15,000 edge servers?
- The evaluation uses 5. Is that too small?
- How are these SPs selected?
15. MapReduce: Simplified Data Processing on Large Clusters
- Jeffrey Dean and Sanjay Ghemawat
- Google, Inc.
16. What is MapReduce?
- A programming model, or design framework/pattern
- Allows distributed back-end processing with parallelization, fault tolerance, data distribution, and load balancing
- Several projects are built on MapReduce, such as Skynet and Hadoop
17. What is MapReduce?
- The terms are borrowed from functional languages (e.g., Lisp):
- (map square '(1 2 3 4))
- => (1 4 9 16)
- (reduce + '(1 4 9 16))
- => (+ 16 (+ 9 (+ 4 1)))
- => 30
- A Python equivalent is sketched below
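For comparison, the same two steps expressed with Python's built-in map and functools.reduce:

    from functools import reduce

    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squares)          # 30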
18. Map
- map(): processes a key/value pair to generate intermediate key/value pairs
- Example: the input "Welcome Everyone Hello Everyone" maps to (Welcome, 1), (Everyone, 1), (Hello, 1), (Everyone, 1)
19. Reduce
- reduce(): merges all intermediate values associated with the same key
- Example: the intermediate pairs (Welcome, 1), (Everyone, 1), (Hello, 1), (Everyone, 1) reduce to (Everyone, 2), (Hello, 1), (Welcome, 1); a word-count sketch follows below
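A toy, single-process word-count sketch in the spirit of the paper's map/reduce signatures; the run() driver below is a stand-in for the framework, not part of MapReduce itself.

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name, value: document contents
        for word in value.split():
            yield word, 1

    def reduce_fn(key, values):
        # key: a word, values: all counts emitted for that word
        yield key, sum(values)

    def run(documents):
        intermediate = defaultdict(list)
        for name, contents in documents.items():
            for k, v in map_fn(name, contents):
                intermediate[k].append(v)
        return {k: next(reduce_fn(k, vs))[1] for k, vs in intermediate.items()}

    print(run({"doc1": "Welcome Everyone Hello Everyone"}))
    # {'Welcome': 1, 'Everyone': 2, 'Hello': 1}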
20. Some Applications
- Distributed grep
- Map: emits a line if it matches the supplied pattern
- Reduce: copies the intermediate data to the output
- Count of URL access frequency
- Map: processes web request logs and outputs (URL, 1)
- Reduce: adds the values for each URL and emits (URL, total count)
- Reverse web-link graph
- Map: processes web pages and outputs (target, source) for each link (sketched below)
- Reduce: emits (target, list(source))
- Google News, Google Search
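A sketch of the reverse web-link graph example; here a "page" is simplified to a whitespace-separated list of the target URLs it links to, an assumption made purely for illustration.

    def map_fn(source_url, page_contents):
        for target in page_contents.split():
            yield target, source_url                  # emit (target, source)

    def reduce_fn(target, sources):
        yield target, sorted(sources)                 # emit (target, list(source))

    pages = {"a.com": "cnn.com bbc.co.uk", "b.com": "cnn.com"}
    intermediate = {}
    for src, body in pages.items():
        for tgt, s in map_fn(src, body):
            intermediate.setdefault(tgt, []).append(s)
    for tgt, srcs in intermediate.items():
        print(next(reduce_fn(tgt, srcs)))             # e.g. ('cnn.com', ['a.com', 'b.com'])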
21. What is behind MapReduce?
- Make it distributed:
- Partition the input key/value pairs into chunks and run map() tasks in parallel
- After all map() tasks are complete, partition the intermediate values
- Run reduce() tasks in parallel
22. How MapReduce Works
- User to-do list:
- Indicate the input/output files
- M: number of map tasks
- R: number of reduce tasks
- W: number of machines
- Write the map and reduce functions
- Submit the job
- This requires no knowledge of parallel/distributed systems!
- What about everything else?
24. How MapReduce Works
- Input slices are typically 16 MB to 64 MB
- Map workers use a partitioning function to store intermediate key/value pairs on local disk, e.g., hash(key) mod R (see the sketch below)
[Diagram: map workers partition intermediate data for the reduce workers, which write the output files]
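A minimal sketch of the hash(key) mod R partitioning rule. A stable hash (here MD5) is used so that the same key maps to the same reduce partition on every worker; Python's built-in hash() is randomized per process and would not do.

    import hashlib

    def partition(key: str, R: int) -> int:
        # Map an intermediate key to one of R reduce partitions.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % R

    print(partition("Everyone", 4))   # always the same partition in 0..3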
25. Fault Tolerance
- Worker failure:
- The master keeps 3 states for each task: idle, in-progress, completed
- Periodic pings from the master detect failures
- If a worker fails while a task is in progress, the task is marked idle
- If a map worker fails after its tasks completed, they are marked idle again
- The reduce tasks are notified about the map worker failure
- Master failure:
- Checkpointing
26. Locality and Backup Tasks
- Locality:
- GFS stores 3 replicas of each 64 MB chunk
- The scheduler attempts to run a map task on a machine that holds a replica of the corresponding input data
- Stragglers:
- Caused by bad disks, network bandwidth, CPU, or memory
- Handled by backup execution of the remaining in-progress tasks
27. Refinements and Extensions
- Combiner function:
- User defined
- Runs within the map task (see the sketch after this list)
- Saves network bandwidth
- Skipping bad records:
- The best solution is to debug and fix, but that is not always possible (e.g., third-party source libraries)
- On a segmentation fault, the worker sends a UDP packet to the master from its signal handler, including the sequence number of the record being processed
- If the master sees two failures for the same record, the next worker is told to skip that record
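A word-count combiner sketch: partial sums are computed inside the map task so fewer (word, count) pairs cross the network to the reducers. This is an illustration, not the paper's code.

    from collections import Counter

    def map_with_combiner(document: str):
        # Combine repeated words locally before shipping to the reducers.
        local_counts = Counter(document.split())
        return list(local_counts.items())

    print(map_with_combiner("Welcome Everyone Hello Everyone"))
    # [('Welcome', 1), ('Everyone', 2), ('Hello', 1)]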
28. Refinements and Extensions
- Local execution:
- For debugging purposes
- Users have control over specific map tasks
- Status information:
- The master runs an HTTP server
- The status page shows the state of the computation, links to the output files, and a list of standard-error output
29. Performance
- Tests run on a cluster of 1,800 machines:
- 4 GB of memory
- Dual-processor 2 GHz Xeons with Hyper-Threading
- Dual 160 GB IDE disks
- Gigabit Ethernet per machine
- Approximately 100 Gbps of aggregate bandwidth
- Two benchmarks:
- Grep: 10^10 100-byte records, extracting records matching a rare pattern (about 92K matching records)
- Sort: 10^10 100-byte records (modeled after the TeraSort benchmark)
30. Grep
- The locality optimization helps:
- 1,800 machines read 1 TB at a peak of 31 GB/s
- Without it, rack switches would limit the rate to 10 GB/s
- Startup overhead is significant for short jobs
31. Sort
- M = 15,000, R = 4,000
- Three configurations compared: normal execution, no backup tasks, and 200 processes killed
- Backup tasks reduce job completion time significantly
- The system deals well with failures
32. Discussion
- Single point of failure
- Limits on M and R: the master makes O(M + R) scheduling decisions and keeps O(M x R) state in memory
- Restricted programming model
- MapReduce vs. River
33. Bigtable: A Distributed Storage System for Structured Data
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
- Google, Inc.
34.
- Lots of data: copies of the web (crawls), satellite images, user data, email, and USENET
- No commercial system is big enough
- Couldn't afford it if there were one
- It might not have made the appropriate design choices anyway
- About 450,000 machines (NYTimes estimate, June 14th, 2006)
35.
- Scheduler (Google WorkQueue)
- Google File System (GFS)
- Chubby lock service
- Other tools:
- Sawzall scripting language
- MapReduce parallel processing
- Bigtable is built out of these tools
36. Bigtable cell
[Architecture diagram]
- Bigtable client, using the Bigtable client library (Open())
- Master server: performs metadata ops and load balancing
- Tablet servers: serve data
- Cluster Scheduling Master: handles failover and monitoring
- GFS: holds tablet data and logs
- Lock service (Chubby): holds metadata and handles master election
37.
- Used by more than sixty products and projects
- Deals with enormous amounts of data:
- Crawl: 800 TB
- Google Analytics: 200 TB
- Google Maps: 0.5 TB
- Google Earth: 200 TB
- Millions of requests per second
38. Google data
[Chart: data volumes for the Crawl, email accounts, and other Google datasets]
39.
- Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map
- Not a relational database table, just a table!
- No integrity constraints or relational semantics
- No multi-row transactions: all transactions happen within a single row
- It is not a database so much as a storage system!
40.
- A (row key, column key, timestamp) triple locates a cell in the table (sketched below)
- Each cell contains several timestamped versions of its contents
- A column key has the form family:qualifier
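A sketch of the data model as a plain Python dict: the whole table behaves like one sorted map from (row key, column key, timestamp) to an uninterpreted string. The row and column names follow the paper's web-table example; the stored values are illustrative.

    table = {
        ("com.cnn.www", "contents:", 3): "<html>...",
        ("com.cnn.www", "contents:", 2): "<html>...",
        ("com.cnn.www", "anchor:cnnsi.com", 1): "CNN",
    }

    def lookup(row, column, timestamp):
        # A cell is addressed by the (row, column, timestamp) triple.
        return table.get((row, column, timestamp))

    print(lookup("com.cnn.www", "anchor:cnnsi.com", 1))   # "CNN"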
41.
[Example table: columns "contents" and "language"; rows are URLs such as aaa.com, cnn.com (language: EN), cnn.com/sports.html, ..., yahoo.com/kids.html?D, zuppa.com/menu.html; contiguous row ranges are grouped into tablets]
42. Tablet
- Contains some range of rows of the table
- Built out of multiple SSTables
[Diagram: a tablet covering the row range aaa.com to bbc.uk, composed of SSTables; each SSTable consists of 64K blocks plus an index]
43. SSTable
- An immutable, sorted file of key/value pairs
- No simultaneous read/write issues
- Chunks of data plus an index
- The index is over block ranges, not individual values
- Each block is 64 KB (a lookup sketch follows below)
[Diagram: an SSTable made of 64K blocks plus an index]
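A sketch of the SSTable lookup path: binary-search the in-memory block index for the right block, then scan only that block. The in-memory layout here is a simplification, not the real on-disk format.

    import bisect

    class SSTable:
        def __init__(self, blocks):
            # blocks: sorted [(key, value), ...] chunks standing in for 64K blocks
            self.blocks = blocks
            self.index = [block[0][0] for block in blocks]   # first key of each block

        def get(self, key):
            i = bisect.bisect_right(self.index, key) - 1     # candidate block
            if i < 0:
                return None
            for k, v in self.blocks[i]:                      # scan a single block only
                if k == key:
                    return v
            return None

    sst = SSTable([[("aaa.com", "A"), ("abc.com", "B")], [("bbc.uk", "C")]])
    print(sst.get("bbc.uk"))   # "C"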
44. Bigtable
[Diagram: a Bigtable split by row range into Tablet 1, Tablet 2, and Tablet 3 (covering roughly aaa to bbc, cbc to cnn, and mlp to ppp); each tablet is built from several SSTables]
45. Bigtable
[Same diagram, with the tablets assigned to tablet servers: Tablet server 1 and Tablet server 2 each serve some of the tablets]
46.
- Tablet location uses a three-level hierarchy
- Find which row range belongs to which tablet
- A sort of binary search
- Similar to the Unix file-block lookup through an inode
47.
- Tablet servers manage tablets, multiple tablets per server; each tablet is 100-200 MB
- Each tablet lives at only one server
- A tablet server splits tablets that get too big
- The master is responsible for load balancing and fault tolerance
- Chubby is used to monitor the health of tablet servers and restart failed servers
- GFS replicates the data
48.
- The tablet is first located, then accessed
- For a write, the mutation is logged (see the sketch below)
- Chubby authorization is checked
- The write is applied to an in-memory version (the memtable)
- The log file is stored in GFS
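A sketch of the write path just described: the mutation is appended to a commit log (which lives in GFS) and then applied to the in-memory memtable. The class and the local log file are illustrative stand-ins, not Bigtable's interfaces.

    class Tablet:
        def __init__(self, log_path):
            self.log = open(log_path, "a")   # stands in for the commit log in GFS
            self.memtable = {}               # kept sorted in the real system

        def write(self, row, column, timestamp, value):
            # 1. Log the mutation durably before acknowledging it.
            self.log.write(f"{row}\t{column}\t{timestamp}\t{value}\n")
            self.log.flush()
            # 2. Apply it to the in-memory memtable.
            self.memtable[(row, column, timestamp)] = value

    t = Tablet("/tmp/tablet.log")
    t.write("com.cnn.www", "contents:", 1, "<html>")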
49.
- Minor compaction:
- Converts the memtable into a new SSTable
- Merging compaction:
- Reads the contents of a few SSTables and the memtable
- A good place to apply policy, e.g., keep only N versions
- Major compaction:
- A merging compaction that results in only one SSTable
- No deleted records, only live data
50.
- Locality groups:
- Multiple column families are grouped into a locality group
- A separate SSTable is created to store each group
- Compression and caching
- Bloom filters:
- Filter on whether an SSTable contains a particular row/column (see the sketch after this list)
- Exploiting immutability:
- SSTables are immutable, so no concurrency control is needed for reads
- The only mutable structure is the memtable
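A minimal Bloom filter sketch for the "does this SSTable possibly contain this row/column?" check; the bit-array size and hashing scheme are illustrative, not Bigtable's.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = 0

        def _positions(self, key):
            for i in range(self.num_hashes):
                h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
                yield int(h, 16) % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            # False means the SSTable definitely lacks the key, so skip the disk read.
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("com.cnn.www/anchor:cnnsi.com")
    print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
    print(bf.might_contain("missing-row/contents:"))          # very likely False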
51.
- A good number of tablet servers, each with:
- 1 GB RAM, 2 GB of disk, and a dual-core 2 GHz Opteron
- 100-200 Gbps of backbone bandwidth
- Tested with millions of read/write hits
53.
- Real industrial systems manage to run complex, large-scale services on top of very simple designs. How?
- How does immutability preserve consistency?
- Immutable SSTables may waste storage.