1. Using Grid Technologies on the Cloud for High Scalability
A Practitioner's Report for the Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics
vlivschitz_at_griddynamics.com
September 17th, 2008
2. A word about Grid Dynamics
- Who we are: a global leader in scalability engineering
- Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
- Value proposition: fusion of innovation with best practices
- Focused on the physics, economics and engineering of extreme scale
- Founded in 2006; 30 people and growing; HQ in Silicon Valley
- Services
  - Technology consulting
  - Application and systems architecture, design, development
- Customers
  - Users of scalable applications: eBay, Bank of America, web start-ups
  - Makers of scalable middleware: GigaSpaces, Sun, Microsoft
- Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS
3. Why am I speaking here tonight?
- We do scalability engineering for a living
- Cloud computing is new, very exciting and terribly over-hyped
- There is not a lot of solid data on performance, scalability, usability, stability
- Many of our customers are early adopters or enablers
- Their pains, discoveries and lessons are worth sharing
- The practitioner's perspective
  - Recently completed 3 benchmark projects that we can make public
  - Results are presented here tonight
4. Exploring Scalability through Benchmarking
5. Benchmark 1: Scalability of a Simple Map/Reduce Application on EC2
6. Basic Scalability of Simple Map/Reduce
- Goal: establish the upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
- Why Monte-Carlo: a simple, widely-used, perfectly scalable problem (sketched below this list)
- Why EC2: the most popular public cloud
- Why GridGain: simple, open-source map-reduce middleware
- Intended claims
  - EC2 scales linearly as a grid execution platform
  - GridGain scales linearly as map-reduce middleware
  - Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies
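
As a rough sketch of the benchmark's shape (not the actual GridGain code): an embarrassingly parallel Monte-Carlo pi estimate whose independent "map" tasks are summed in a "reduce" step. On the real grid, each task runs on a separate EC2 node rather than a local thread pool.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.*;

    public class MonteCarloMapReduce {
        public static void main(String[] args) throws Exception {
            int nodes = 8;              // stand-in for EC2 worker nodes
            long itersPerNode = 5000;   // matches the scaling script's itersPerNode
            ExecutorService pool = Executors.newFixedThreadPool(nodes);

            List<Future<Long>> results = new ArrayList<>();
            for (int n = 0; n < nodes; n++) {
                results.add(pool.submit(() -> {   // "map": one independent task
                    Random rnd = new Random();
                    long hits = 0;
                    for (long i = 0; i < itersPerNode; i++) {
                        double x = rnd.nextDouble(), y = rnd.nextDouble();
                        if (x * x + y * y <= 1.0) hits++;
                    }
                    return hits;
                }));
            }

            long totalHits = 0;                    // "reduce": sum partial results
            for (Future<Long> f : results) totalHits += f.get();
            pool.shutdown();

            System.out.println("pi ~= " + 4.0 * totalHits / (nodes * itersPerNode));
        }
    }

Because each task is independent and the reduce step is a trivial sum, doubling both the nodes and the total iteration count should leave the completion time unchanged, which is exactly the property the benchmark tests.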
7. Other Goals
- Understand process bottlenecks of the EC2 platform
  - Changes to the programming, deployment and management model
  - Ease of use
  - Security
  - Metering and payment
- Identify scalability bottlenecks at any level in the stack
  - EC2
  - GridGain
  - Glueware
- Robustness
- Stability
- Predictability
8. Architecture
[Architecture diagram: a controller on the corporate intranet holds the configuration and task repository and controls grid operation; it manages worker nodes and tasks in the Amazon EC2 cloud via discovery and task assignment, with job execution messages flowing over JMS and spare EC2 instances held in reserve.]
- Technology stack
  - EC2
  - GridGain
  - Typica
  - OpenMQ
9. Performance Methodology and Results
- The same algorithm exercised on a wide range of nodes
  - 2, 4, 8, 16, ..., 256, 512; limited by Amazon's permission of 550 nodes
- Simultaneously double the amount of computation and the number of nodes
- Measure completion time
- Repeat several times to get statistical averages
- Conclusions
  - Total degradation from 13 to 16 seconds, or about 20%
  - Discarding the first 8 nodes, near-perfect scale-up to 128
  - Slight degradation from 128 to 256 (3%) and from 256 to 512 (7%)
=> Proves the point of near-linear scalability end-to-end
10. Simple scaling script

    var itersPerNode = 5000;
    var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
    for (var i in cnode) {
        var n = cnode[i];
        grid.growEC2Grid(n, true);        // provision n EC2 worker instances
        grid.waitForGridInstances(n);     // block until all n nodes have joined
        runTask(itersPerNode * n, n, 3);  // total iterations, node count, repetitions
    }
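
Note the weak-scaling design: the total iteration count grows in proportion to the node count, so under perfect scalability the completion time stays flat across every configuration, which is exactly what the 13-to-16-second spread on the previous slide measures.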
11. Observations
- Deployment considerations
  - Start-up for the whole grid in different configurations is 0.5-3 min
  - 2-step deployment process
    - First, bring up one EC2 node as a controller
    - Next, use the controller on-the-inside to coordinate bootstrapping
  - Some EC2 nodes don't finish bootstrapping successfully
    - On average, about 0.5% of nodes come up in an incomplete state
    - The nature of the problem is not clear
    - If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation
- IP address deadlock issue
  - IP addresses of the nodes are needed to start and configure the grid
  - IP addresses are not available until the grid is up and configured
  - Need to carefully choreograph bootstrapping and pass IPs as parameters into the controlling scripts (a sketch follows below)
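
A hedged sketch of that choreography, assuming Typica's Jec2.describeInstances API roughly as it stood in 2008 (treat the exact class and method names as recollections): after launch, poll EC2 for the running instances' addresses, then feed them as parameters into the grid start-up scripts.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import com.xerox.amazonws.ec2.Jec2;
    import com.xerox.amazonws.ec2.ReservationDescription;

    public class CollectNodeAddresses {
        public static List<String> runningPrivateNames(String accessKey,
                String secretKey) throws Exception {
            Jec2 ec2 = new Jec2(accessKey, secretKey);
            List<String> names = new ArrayList<String>();
            // An empty id list asks EC2 to describe all of the account's instances.
            for (ReservationDescription res
                    : ec2.describeInstances(Collections.<String>emptyList())) {
                for (ReservationDescription.Instance inst : res.getInstances()) {
                    if ("running".equals(inst.getState())) {
                        names.add(inst.getPrivateDnsName()); // usable inside EC2
                    }
                }
            }
            return names; // pass these into the controlling/bootstrap scripts
        }
    }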
12. Observations
- Monitoring considerations
  - Connecting to each node from the outside is possible, but not efficient
  - Check heartbeats from the internal management nodes
  - Local scripts must be stored on S3 or passed back before termination
- Programming model considerations
  - EC2 does not support IP multicast
    - Switched to JMS instead (a heartbeat sketch follows below)
    - Luckily, GridGain supports multiple protocols
  - Typica: a 3rd-party connectivity library that uses the EC2 query interface
    - An undocumented limit on URL length is hit with 100s of nodes
    - Amazon just drops the connection on improper URLs without specifying the error, so debugging was hard
    - Workaround: rewrote some parts of our framework to inquire about individual running nodes. Works, but is less efficient
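
A minimal sketch of the multicast-free alternative: each worker publishes a periodic heartbeat to a JMS topic that the management node subscribes to. The broker address and topic name are illustrative assumptions; the concrete ConnectionFactory is OpenMQ's, matching the stack on slide 8.

    import javax.jms.*;
    import com.sun.messaging.ConnectionConfiguration;

    public class HeartbeatPublisher {
        public static void main(String[] args) throws Exception {
            com.sun.messaging.ConnectionFactory factory =
                    new com.sun.messaging.ConnectionFactory();
            factory.setProperty(ConnectionConfiguration.imqAddressList,
                    "mq://controller-host:7676");   // hypothetical broker address
            Connection conn = factory.createConnection();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createTopic("grid.heartbeat"));
            conn.start();
            while (true) {                           // publish a heartbeat every 5s
                TextMessage msg = session.createTextMessage(
                        java.net.InetAddress.getLocalHost().getHostName());
                producer.send(msg);
                Thread.sleep(5000);
            }
        }
    }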
13. Observations
- Metering and payment
  - Amazon sets a limit on concurrent VMs
    - Eventually approved for 550 VMs after some due diligence from Amazon
  - Amazon charges by full or partial VM-hours
    - Sometimes, short usage of VMs is not metered
    - Not clear why
    - One hypothesis: metering sweeps happen every so often
  - Be careful with usage bills for testing (a worked example follows below)
    - A test may need to be run multiple times
    - Beware of rogue scripts
    - Test everything on smaller configurations first
    - Scale gradually, or you will miss the bottlenecks
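
A back-of-the-envelope example of why this matters, assuming 2008's $0.10 per small-instance VM-hour: because partial hours are billed as full hours, a single 10-minute run on 512 nodes is metered as 512 VM-hours, roughly $51, and repeating that test five times costs about $256.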
14. Achieving scalability
- Software breaks at scale. Including the glueware
- Barrier 1 was hit at 100 nodes because of ActiveMQ scalability
  - Correction: switched ActiveMQ for OpenMQ
  - Comment: some users report better ActiveMQ scalability with 5.x
- Barrier 2 was hit at 300 nodes because of Typica's URL length limit
  - Correction: changed our use of the API
- Security considerations
  - EC2 credentials are passed to the Head Node
  - 3rd-party GridGain tasks can access them
  - Sounds like a potential vulnerability
15. What have we learned?
- EC2 is ready for production usage on large-scale stateless computations
  - Price/performance
  - Strong linear scale curve
- GridGain showed itself very well
  - Scale, stability, ease-of-use, pluggability
  - A solid open-source choice of map-reduce middleware
- Some level of effort is required to port a grid system to EC2
  - Deployment, monitoring, programming model, metering, security
- What's next?
  - Can we go higher than 512?
  - What is the behavior of more complex applications?
16. Benchmark 2: Scalability of a Data-Driven Risk Management Application on EC2
17. Data-Driven Risk Management on EC2
- Goal: investigate the scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
- Why risk management: a class of problems widely used in financial services
- Why GigaSpaces: the leading middleware platform for compute and data grids
- Intended claims
  - EC2 scales linearly for data-driven HPC applications
  - GigaSpaces scales well as both compute and data grid middleware
  - Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies
18. Architecture
[Architecture diagram: the user manages the grid with ec2-gdc-tools; the master writes tasks into the data grid and waits for results; workers in the Amazon EC2 grid take tasks, perform calculations, and write results back.]
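
Below is a minimal sketch of this master-worker flow in plain Java, with BlockingQueues standing in for the GigaSpaces task and result spaces; runSimulation() is a hypothetical placeholder for one Monte-Carlo run, and the real benchmark used GigaSpaces' space operations rather than in-process queues.

    import java.util.concurrent.*;

    public class MasterWorkerSketch {
        public static void main(String[] args) throws Exception {
            BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();  // "task space"
            BlockingQueue<Double> results = new LinkedBlockingQueue<>(); // "result space"
            int workers = 4, taskCount = 100;

            for (int w = 0; w < workers; w++) {   // workers: take, compute, write back
                new Thread(() -> {
                    try {
                        while (true) {
                            int task = tasks.take();
                            results.put(runSimulation(task));
                        }
                    } catch (InterruptedException ignored) { }
                }).start();
            }

            for (int t = 0; t < taskCount; t++) tasks.put(t);  // master writes tasks

            double sum = 0;                                    // master waits for results
            for (int t = 0; t < taskCount; t++) sum += results.take();
            System.out.println("aggregate risk measure: " + sum / taskCount);
            System.exit(0);                                    // stop the worker threads
        }

        static double runSimulation(int seed) {   // placeholder for a Monte-Carlo run
            return new java.util.Random(seed).nextGaussian();
        }
    }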
19. Performance methodology and results
- The same algorithm exercised on a wide range of nodes
  - 16, 32, 128, 256, 512; still limited by Amazon's permission of 550
- Constant size of the data grid (4 large EC2 nodes)
- Double the nodes with a constant amount of work
- Measure completion time (strive for linear time reduction: doubling the nodes should halve the run time)
- Conclusions
  - Near-perfect scaling from 16 to 256 nodes
  - 28% degradation from 256 to 512, as the data cache becomes a bottleneck
20. What have we learned?
- EC2 is ready for production usage for classes of large-scale data-driven HPC applications common to Risk Management
- GigaSpaces showed itself very well
  - The compute and data grid scales well in the master-worker pattern
- Some level of effort is required to port a grid system to EC2
  - Deployment, monitoring, programming model, metering, security
  - Bootstrapping this system is far more complex than GridGain's. For more details, contact me offline
- What's next?
  - How does the data grid scale?
  - What about more complex applications?
  - What's the scalability of a co-located compute-data grid configuration?
21. Benchmark 3: Performance Implications of Data in the Cloud vs. Outside the Cloud for Data-Intensive Analytics Applications
22. Data-Intensive Analytics on the MS Cloud
- Goal: investigate performance improvements from data in the cloud vs. outside the cloud for complex data-intensive analytical applications, in the context of the HPC CompFin Labs environment, using Velocity
- What is CompFin Labs: an MS-funded incubator compute cloud for the exploration of modern compute and data challenges at massive scale
- What is Velocity: MS's new in-memory data grid middleware, still in CTP1
- The model: computes correlation between stock prices over time. The algorithms use a significant amount of data which can be cached; the maximum cache hit ratio for the model is around 90%. (A sketch of this hot path follows below.)
- Intended claims
  - Measure the impact of the data's closeness to the computation on the cloud
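
As a hedged illustration of the model's hot path (not CompFin's actual code): Pearson correlation over two cached price series, with a local map standing in for Velocity. loadSeries() is a hypothetical stand-in for the tick-data store; with a ~90% hit ratio, most calls never leave the process.

    import java.util.HashMap;
    import java.util.Map;

    public class CorrelationWithCache {
        private final Map<String, double[]> cache = new HashMap<>();

        double[] series(String ticker) {      // read-through cache of price series
            return cache.computeIfAbsent(ticker, CorrelationWithCache::loadSeries);
        }

        double correlation(String a, String b) {   // Pearson correlation coefficient
            double[] x = series(a), y = series(b);
            int n = Math.min(x.length, y.length);
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
            }
            return (sxy - sx * sy / n)
                    / Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
        }

        static double[] loadSeries(String ticker) { // stand-in for the tick-data store
            java.util.Random rnd = new java.util.Random(ticker.hashCode());
            double[] prices = new double[1000];
            for (int i = 0; i < prices.length; i++) prices[i] = 100 + rnd.nextGaussian();
            return prices;
        }
    }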
23. Architecture: CompFin
24. Architecture: Anticipated Bottlenecks
25. Architecture: CompFin + Velocity
26. Benchmarked configurations
- The same analytical model with complex queries
- Perfect linear scale curve (baseline)
- Original CompFin
- Distributed cache (original CompFin + Velocity distributed cache for financial data)
- Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing); see the sketch after this list
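
A minimal sketch of what the "local cache" tier adds, under the assumption that it behaves as a read-through near cache in front of the distributed tier; RemoteCache is a hypothetical stand-in for the Velocity client API. Because repeated reads are served from local memory, the distributed tier sees only the misses, which is how the benchmark's local-cache configuration reduced its load.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class NearCache<K, V> {
        interface RemoteCache<K, V> { V get(K key); }    // stand-in for Velocity

        private final Map<K, V> local = new ConcurrentHashMap<>();
        private final RemoteCache<K, V> remote;

        NearCache(RemoteCache<K, V> remote) { this.remote = remote; }

        V get(K key) {
            V value = local.get(key);                     // 1. in-process lookup
            if (value == null) {
                value = remote.get(key);                  // 2. fall through to the grid
                if (value != null) local.put(key, value); // 3. populate the near cache
            }
            return value;
        }
    }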
27. Test methodology
- 3 ways of measuring scalability were used
  - Fixed amount of computation, increasing amount of data
  - Fixed amount of data, increasing amount of computation
  - Proportional increase of computation and nodes
- Node = 1 core
- Data unit = 32 million records, or 512 megabytes, of tick data
28. Performance results
29. Performance results
30. Conclusions
- Data on the cloud definitely matters!
  - Performance improvements of up to 31 times over data outside the cloud
- The Velocity distributed cache has some scalability challenges
  - Failure on a 50-node cluster with 200 concurrent clients
  - Good news: it's a very young product and MS is actively improving it
- Compute-data affinity matters too!
  - Significant performance gain of the local cache over the distributed cache
  - The local cache resolved the distributed cache's scalability issue by reducing its load
31. Final Remarks
- Clouds are proving themselves out
  - Early adopters are there already
  - The rest of the real world will join soon
- There are still significant adoption challenges
  - Technology immaturity
  - Lack of real data, best practices, robust design patterns
  - Fitting application middleware to cloud platforms is just starting
- Amazon is the leading commercial cloud provider, but is not the only game in town
  - Companies are building public, private, dedicated and special-purpose clouds
32. Thank You!
- Victoria Livschitz, vlivschitz_at_griddynamics.com