1
Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for the Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics
vlivschitz@griddynamics.com
September 17th, 2008
2
A word about Grid Dynamics
  • Who we are: global leader in scalability engineering
  • Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
  • Value proposition: fusion of innovation with best practices
  • Focused on the physics, economics and engineering of extreme scale
  • Founded in 2006, 30 people and growing, HQ in Silicon Valley
  • Services
    • Technology consulting
    • Application systems architecture, design, development
  • Customers
    • Users of scalable applications: eBay, Bank of America, web start-ups
    • Makers of scalable middleware: GigaSpaces, Sun, Microsoft
  • Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS

3
Why am I speaking here tonight?
  • We do scalability engineering for a living
  • Cloud computing is new, very exciting and
    terribly over-hyped
  • Not a lot of solid data on performance,
    scalability, usability, stability
  • Many of our customers are early adopters or
    enablers
  • Their pains, discoveries and lessons are worth
    sharing
  • The practitioner's perspective
  • Recently completed 3 benchmark projects that we
    can make public
  • Results are presented here tonight

4
Exploring Scalability through Benchmarking
5
Benchmark 1: Scalability of a Simple Map/Reduce Application on EC2
6
Basic Scalability of Simple Map/Reduce
  • Goal: Establish an upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
  • Why Monte-Carlo: simple, widely-used, perfectly scalable problem
  • Why EC2: the most popular public cloud
  • Why GridGain: simple, open-source map-reduce middleware
  • Intended Claims
    • EC2 scales linearly as a grid execution platform
    • GridGain scales linearly as map-reduce middleware
    • Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies (a simplified sketch of the map/reduce split follows)
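As an aside on why Monte-Carlo splits so cleanly: each batch of trials is independent, so the work maps naturally onto grid nodes and the partial results reduce by simple aggregation. Below is a minimal plain-Java sketch of that map/reduce shape; the local thread pool is only a stand-in for GridGain's task splitting across EC2 nodes, not the GridGain API.

  import java.util.*;
  import java.util.concurrent.*;

  // Map/reduce sketch of a Monte-Carlo run (estimating pi as a stand-in model).
  // A local thread pool plays the role of the grid nodes.
  public class MonteCarloSketch {
      // "Map": one independent batch of random trials per node.
      static long countHits(long iterations, long seed) {
          Random rnd = new Random(seed);
          long hits = 0;
          for (long i = 0; i < iterations; i++) {
              double x = rnd.nextDouble(), y = rnd.nextDouble();
              if (x * x + y * y <= 1.0) hits++;
          }
          return hits;
      }

      public static void main(String[] args) throws Exception {
          int nodes = 16;                // stand-in for EC2 grid nodes
          long itersPerNode = 5_000_000; // work per node
          ExecutorService pool = Executors.newFixedThreadPool(nodes);
          List<Future<Long>> parts = new ArrayList<>();
          for (int n = 0; n < nodes; n++) {
              final long seed = n;
              parts.add(pool.submit(() -> countHits(itersPerNode, seed)));
          }
          long hits = 0;
          for (Future<Long> f : parts) hits += f.get(); // "Reduce": aggregate partials
          pool.shutdown();
          System.out.printf("pi ~= %.5f%n", 4.0 * hits / ((double) itersPerNode * nodes));
      }
  }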

7
Other Goals
  • Understand process bottlenecks of the EC2 platform
    • Changes to the programming, deployment and management model
    • Ease of use
    • Security
    • Metering and payment
  • Identify scalability bottlenecks at any level in the stack
    • EC2
    • GridGain
    • Glueware
  • Robustness
  • Stability
  • Predictability

8
Architecture
  [Architecture diagram: a controller on the Corporate Intranet (controls grid operation; configuration and task repository) handles discovery and task assignment, managing worker nodes and tasks in the Amazon EC2 Cloud; jobs execute on worker nodes with JMS message processing, and spare EC2 instances are held in reserve]
  • Technology Stack
    • EC2
    • GridGain
    • Typica
    • OpenMQ
9
Performance Methodology & Results
  • Same algorithm exercised on a wide range of nodes
    • 2, 4, 8, 16, …, 256, 512. Limited by Amazon permission of 550 nodes
  • Simultaneously double the amount of computations and nodes
  • Measure completion time
  • Repeat several times to get statistical averages
  • Conclusions
    • Total degradation from 13 to 16 seconds, or 20% (worked example below)
    • Discarding the first 8 nodes, near-perfect scaling up to 128
    • Slight degradation from 128 to 256 (3%), from 256 to 512 (7%)

=> Proves the point of near-linear scalability end-to-end
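For readers checking the arithmetic: under this weak-scaling setup (work and nodes doubled together), the ideal is a flat completion time, so the degradation is simply the relative growth of the run time. With the reported 13 s and 16 s end points that is (16 - 13) / 16 ≈ 19%, or about 23% if measured against the 13-second baseline, consistent with the roughly 20% quoted above.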
10
Simple scaling script
  var itersPerNode = 5000;
  var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
  for (var i in cnode) {
      var n = cnode[i];
      grid.growEC2Grid(n, true);       // grow the grid to n EC2 instances
      grid.waitForGridInstances(n);    // wait until all n instances have joined
      runTask(itersPerNode * n, n, 3); // total iterations scale with n (weak scaling)
  }

11
Observations
  • Deployment considerations
    • Start-up time for the whole grid in different configurations is 0.5-3 min
    • 2-step deployment process: first, bring up one EC2 node as the controller; next, use the controller on the inside to coordinate bootstrapping
    • Some EC2 nodes don't finish bootstrapping successfully
      • On average, 0.5 nodes come up in an incomplete state; the nature of the problem is not clear
      • If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation
    • IP address deadlock issue
      • The IP addresses of the nodes are needed to start and configure the grid, but they are not available until the grid is up and configured
      • Need to carefully choreograph bootstrapping and pass IPs as parameters into the controlling scripts (a sketch follows)
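A minimal sketch of that choreography, assuming a hypothetical Ec2Client helper defined below (not Typica's or Amazon's actual API): the controller is started first, its IP is polled for, and only then is the address passed to the workers as a bootstrap parameter.

  // Hypothetical sketch of the 2-step bootstrap that sidesteps the IP deadlock.
  // Ec2Client is a stand-in interface for illustration, not the real EC2/Typica API.
  interface Ec2Client {
      String startInstance(String imageId, String userData); // returns an instance id
      String describeInstanceIp(String instanceId);           // null until assigned
  }

  class Bootstrap {
      static String waitForIp(Ec2Client ec2, String id) throws InterruptedException {
          String ip;
          while ((ip = ec2.describeInstanceIp(id)) == null) {
              Thread.sleep(5_000); // poll until EC2 has assigned an address
          }
          return ip;
      }

      static void bringUpGrid(Ec2Client ec2, String imageId, int workers) throws InterruptedException {
          // Step 1: the controller comes up first, so its address becomes known...
          String controllerIp = waitForIp(ec2, ec2.startInstance(imageId, "role=controller"));
          // Step 2: ...and can be handed to every worker as a parameter.
          for (int i = 0; i < workers; i++) {
              ec2.startInstance(imageId, "role=worker;controller=" + controllerIp);
          }
      }
  }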

12
Observations
  • Monitoring considerations
    • Connecting to each node from the outside is possible, but not efficient
    • Check heartbeats from the internal management nodes instead
    • Local scripts must be stored on S3 or passed back before termination
  • Programming model considerations
    • EC2 does not support IP multicast
      • Switched to JMS instead (see the sketch below); luckily, GridGain supports multiple protocols
    • Typica: a 3rd-party connectivity library that uses the EC2 query interface
      • An undocumented limit on URL length is hit with 100s of nodes
      • Amazon just disconnects on improper URLs without specifying the error, so debugging was hard
      • Workaround: rewrote some parts of our framework to enquire about individual running nodes. Works, but is less efficient
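Because multicast is unavailable, discovery and heartbeat traffic has to flow through a broker instead. The following is a generic javax.jms sketch of a worker publishing heartbeats to a topic; the JNDI names are illustrative, and this is not GridGain's actual discovery SPI nor the benchmark's OpenMQ configuration.

  import java.net.InetAddress;
  import javax.jms.*;
  import javax.naming.InitialContext;

  // Minimal JMS heartbeat publisher: a worker announces itself on a topic
  // instead of relying on IP multicast (which EC2 does not support).
  // The JNDI names below are illustrative, not the benchmark's actual config.
  public class HeartbeatPublisher {
      public static void main(String[] args) throws Exception {
          InitialContext jndi = new InitialContext();
          ConnectionFactory factory = (ConnectionFactory) jndi.lookup("ConnectionFactory");
          Topic topic = (Topic) jndi.lookup("grid.heartbeat");

          Connection connection = factory.createConnection();
          Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
          MessageProducer producer = session.createProducer(topic);
          connection.start();
          try {
              String self = InetAddress.getLocalHost().getHostAddress();
              for (int i = 0; i < 3; i++) {                      // a few sample heartbeats
                  producer.send(session.createTextMessage("alive:" + self));
                  Thread.sleep(10_000);                          // heartbeat interval
              }
          } finally {
              connection.close();                                // closes session and producer too
          }
      }
  }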

13
Observations
  • Metering and payment
    • Amazon sets a limit on concurrent VMs; we eventually got approval for 550 VMs after some due diligence from Amazon
    • Amazon charges by full or partial VM-hours (worked example below)
    • Sometimes, short usage of VMs is not metered; it is not clear why. One hypothesis: metering sweeps happen every so often
  • Be careful with usage bills for testing
    • A test may need to be run multiple times
    • Beware of rogue scripts
    • Test everything on smaller configurations first
    • Scale gradually, or you will miss the bottlenecks
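A concrete illustration of why test bills add up, assuming EC2's billing model at the time (per instance-hour, with partial hours counted as full hours): a 10-minute run on 512 instances is billed as 512 instance-hours, and tearing the grid down and bringing it back up for a second 10-minute run starts another 512 billed hours, even though only about 20 minutes of compute were actually used.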

14
Achieving scalability
  • Software breaks at scale, including the glueware
    • Barrier 1 was hit at 100 nodes because of ActiveMQ scalability
      • Correction: switched ActiveMQ for OpenMQ
      • Comment: some users report better ActiveMQ scalability with 5.x
    • Barrier 2 was hit at 300 nodes because of the Typica URL length limit
      • Correction: changed our use of the API
  • Security considerations
    • EC2 credentials are passed to the head node
    • 3rd-party GridGain tasks can access them
    • Sounds like a potential vulnerability

15
What have we learned?
  • EC2 is ready for production usage on large-scale stateless computations
    • Price/performance
    • Strong linear scale curve
  • GridGain showed itself very well
    • Scale, stability, ease of use, pluggability
    • A solid open-source choice of map-reduce middleware
  • Some level of effort is required to port a grid system to EC2
    • Deployment, monitoring, programming model, metering, security
  • What's next?
    • Can we go higher than 512?
    • What is the behavior of more complex applications?

16
Benchmark 2: Scalability of a Data-Driven Risk Management Application on EC2
17
Data-driven Risk Management on EC2
  • Goal: Investigate the scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations, executed on EC2 using GigaSpaces
  • Why risk management: a class of problems widely used in financial services
  • Why GigaSpaces: leading middleware platform for compute and data grids
  • Intended Claims
    • EC2 scales linearly for data-driven HPC applications
    • GigaSpaces scales well as both compute and data grid middleware
    • Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies (a master-worker sketch follows)
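To show the master-worker pattern the benchmark relies on, here is a minimal plain-Java sketch: the master writes tasks into a shared space and collects results, while workers take tasks and write results back. A pair of in-memory queues stands in for the GigaSpaces data grid; this is not the GigaSpaces API.

  import java.util.concurrent.*;

  // Master-worker sketch: the master writes tasks into a shared "space" and
  // waits for results; workers take tasks, compute, and write results back.
  // Two BlockingQueues stand in for the GigaSpaces data grid used in the benchmark.
  public class MasterWorkerSketch {
      public static void main(String[] args) throws Exception {
          BlockingQueue<double[]> tasks = new LinkedBlockingQueue<>(); // scenario batches
          BlockingQueue<Double> results = new LinkedBlockingQueue<>(); // partial risk numbers
          int workers = 4, taskCount = 16;

          // Worker loop: take a batch of scenarios, aggregate it, write the result back.
          Runnable worker = () -> {
              try {
                  while (true) {
                      double[] batch = tasks.take();
                      double sum = 0;
                      for (double scenario : batch) sum += scenario;  // stand-in for a pricing model
                      results.put(sum);
                  }
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();                 // exit when the pool shuts down
              }
          };
          ExecutorService pool = Executors.newFixedThreadPool(workers);
          for (int w = 0; w < workers; w++) pool.execute(worker);

          // Master: write tasks into the "space", then wait for all results.
          for (int t = 0; t < taskCount; t++) tasks.put(new double[]{t, t + 0.5, t + 1.0});
          double total = 0;
          for (int t = 0; t < taskCount; t++) total += results.take();
          pool.shutdownNow();                                         // interrupts the blocked workers
          System.out.println("aggregated result = " + total);
      }
  }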

18
Architecture
  [Architecture diagram: the user manages the grid with ec2-gdc-tools; inside the Amazon EC2 grid, the master writes tasks into the data grid and waits for results, while workers take tasks, perform calculations, and write results back]
19
Performance Methodology & Results
  • Same algorithm exercised on a wide range of nodes
    • 16, 32, 128, 256, 512. Still limited by Amazon permission of 550
  • Constant size of the data grid (4 large EC2 nodes)
  • Double the nodes with a constant amount of work
  • Measure completion time (strive for linear time reduction; see the note below)
  • Conclusions
    • Near-perfect scaling from 16 to 256 nodes
    • 28% degradation from 256 to 512 as the data cache becomes a bottleneck
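Note that, unlike Benchmark 1, this run holds the total work constant while doubling the node count, so ideal linear behavior halves the completion time at every step; the quoted degradation can be read as how far the observed completion time falls short of that ideal halving.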

20
What have we learned?
  • EC2 is ready for production usage for classes of large-scale data-driven HPC applications common to Risk Management
  • GigaSpaces showed itself very well
    • The compute + data grid scales well in the master-worker pattern
  • Some level of effort is required to port a grid system to EC2
    • Deployment, monitoring, programming model, metering, security
    • Bootstrapping this system is far more complex than GridGain's. For more details, contact me offline
  • What's next?
    • How does the data grid scale?
    • What about more complex applications?
    • What's the scalability of a co-located compute-data grid configuration?

21
Benchmark 3: Performance implications of data in the cloud vs. outside the cloud for data-intensive analytics applications
22
Data-intensive Analytics on the MS Cloud
  • Goal: Investigate the performance improvements from data in the cloud vs. outside the cloud for complex data-intensive analytical applications, in the context of the HPC CompFin Labs environment, using Velocity
  • What is CompFin Labs: an MS-funded incubator compute cloud for the exploration of modern compute and data challenges at massive scale
  • What is Velocity: MS's new in-memory data grid middleware, still in CTP1
  • The Model: computes correlation between stock prices over time. The algorithms use a significant amount of data which could be cached; the maximum cache hit ratio for the model is around 90% (a simplified sketch of the cached-correlation idea follows)
  • Intended Claims
    • Measure the impact of data closeness to the computation on the cloud
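To illustrate the kind of workload behind the ~90% cache-hit figure, here is a minimal Java sketch: a Pearson correlation over two price series where the (expensive) series fetch is memoized in a local map. The data source and cache here are simple stand-ins, not Velocity or the CompFin Labs environment.

  import java.util.*;

  // Sketch of the analytical model: correlate two stocks' price series, caching
  // fetched series so repeated pairs mostly hit the cache. The "fetch" below is
  // a stand-in for the real tick-data store; the HashMap stands in for Velocity.
  public class CorrelationSketch {
      private final Map<String, double[]> cache = new HashMap<>();

      // Stand-in for loading a price series from the data store (the slow path).
      private double[] fetchSeries(String symbol) {
          Random rnd = new Random(symbol.hashCode());
          double[] prices = new double[1000];
          for (int i = 0; i < prices.length; i++) prices[i] = 100 + rnd.nextGaussian();
          return prices;
      }

      private double[] getSeries(String symbol) {
          return cache.computeIfAbsent(symbol, this::fetchSeries); // a cache hit skips the fetch
      }

      // Pearson correlation of two equally long series.
      double correlate(String a, String b) {
          double[] x = getSeries(a), y = getSeries(b);
          int n = x.length;
          double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
          for (int i = 0; i < n; i++) {
              sx += x[i]; sy += y[i];
              sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
          }
          double cov = sxy - sx * sy / n;
          double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
          return cov / Math.sqrt(vx * vy);
      }

      public static void main(String[] args) {
          CorrelationSketch model = new CorrelationSketch();
          System.out.printf("corr(MSFT, EBAY) = %.4f%n", model.correlate("MSFT", "EBAY"));
      }
  }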

23
Architecture: CompFin
24
Architecture: Anticipated Bottlenecks
25
Architecture: CompFin + Velocity
26
Benchmarked configurations
  • Same analytical model with complex queries
  • Perfect linear scale curve (baseline)
  • Original CompFin
  • Distributed cache (original CompFin + Velocity distributed cache for financial data)
  • Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing; see the sketch below)
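A rough sketch of the idea behind the "local cache" configuration: a small per-node near cache answers repeated reads locally and falls back to the distributed cache only on a miss. The DistributedCache interface here is a generic stand-in, not Velocity's API.

  import java.util.*;

  // Near-cache sketch: reads are served from a per-node local map when possible,
  // falling back to the (remote) distributed cache on a miss.
  public class NearCache<K, V> {
      /** Stand-in for the remote distributed cache tier. */
      public interface DistributedCache<K, V> {
          V get(K key);
      }

      private final Map<K, V> local = new HashMap<>(); // per-node near cache
      private final DistributedCache<K, V> remote;

      public NearCache(DistributedCache<K, V> remote) {
          this.remote = remote;
      }

      public V get(K key) {
          V value = local.get(key);
          if (value == null) {                          // near-cache miss:
              value = remote.get(key);                  // go to the distributed tier
              if (value != null) local.put(key, value); // keep a local copy for next time
          }
          return value;                                 // a hit needs no network round trip
      }
  }

Data-aware routing then tries to send each computation to the node whose local cache already holds the data it needs, which is consistent with the conclusion later that the local cache reduced the load on the distributed cache.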

27
Test methodology
  • 3 ways of measuring scalability were used
    • Fixed amount of computations, increasing amount of data
    • Fixed amount of data, increasing amount of computations
    • Proportional increase of computations and nodes
  • Node = 1 core
  • Data unit = 32 million records, or 512 megabytes of tick data (about 16 bytes per record)

28
Performance results
29
Performance results
30
Conclusions
  • Data on the cloud definitely matters!
    • Performance improvements of up to 31 times over data outside the cloud
  • The Velocity distributed cache has some scalability challenges
    • Failure on a 50-node cluster with 200 concurrent clients
    • Good news: it is a very young product and MS is actively improving it
  • Compute-data affinity matters too!
    • Significant performance gain of the local cache over the distributed cache
    • The local cache resolved the distributed cache scalability issue by reducing its load

31
Final Remarks
  • Clouds are proving themselves out
    • Early adopters are there already
    • The rest of the real world will join soon
  • There are still significant adoption challenges
    • Technology immaturity
    • Lack of real data, best practices, robust design patterns
    • Fitting of application middleware to cloud platforms is just starting
  • Amazon is the leading commercial cloud provider, but it is not the only game in town
    • Companies are building public, private, dedicated and special-purpose clouds

32
Thank You!
  • Victoria Livschitz, vlivschitz@griddynamics.com