1
Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for the Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics
vlivschitz@griddynamics.com
September 17th, 2008
2
A word about Grid Dynamics
  • Who we are: global leader in scalability engineering
  • Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
  • Value proposition: fusion of innovation with best practices
  • Focused on the physics, economics and engineering of extreme scale
  • Founded in 2006, 30 people and growing, HQ in Silicon Valley
  • Services
    • Technology consulting
    • Application systems architecture, design, development
  • Customers
    • Users of scalable applications: eBay, Bank of America, web start-ups
    • Makers of scalable middleware: GigaSpaces, Sun, Microsoft
  • Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS

3
Why am I speaking here tonight?
  • We do scalability engineering for a living
  • Cloud computing is new, very exciting and
    terribly over-hyped
  • Not a lot of solid data on performance,
    scalability, usability, stability
  • Many of our customers are early adopters or
    enablers
  • Their pains, discoveries and lessons are worth
    sharing
  • The practitioner's perspective
  • Recently completed 3 benchmark projects that we
    can make public
  • Results are presented here tonight

4
Exploring Scalability through Benchmarking
5
Benchmark 1: Scalability of a Simple Map/Reduce Application on EC2
6
Basic Scalability of Simple Map/Reduce
  • Goal: Establish an upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
  • Why Monte-Carlo: simple, widely-used, perfectly scalable problem
  • Why EC2: the most popular public cloud
  • Why GridGain: simple, open-source map-reduce middleware
  • Intended Claims
    • EC2 scales linearly as a grid execution platform
    • GridGain scales linearly as map-reduce middleware
    • Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies (a simplified sketch of the map/reduce split follows)
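As an aside on why Monte-Carlo splits so cleanly: each batch of trials is independent, so the work maps naturally onto grid nodes and the partial results reduce by simple aggregation. Below is a minimal plain-Java sketch of that map/reduce shape; the local thread pool is only a stand-in for GridGain's task splitting across EC2 nodes, not the GridGain API.

  import java.util.*;
  import java.util.concurrent.*;

  // Map/reduce sketch of a Monte-Carlo run (estimating pi as a stand-in model).
  // A local thread pool plays the role of the grid nodes.
  public class MonteCarloSketch {
      // "Map": one independent batch of random trials per node.
      static long countHits(long iterations, long seed) {
          Random rnd = new Random(seed);
          long hits = 0;
          for (long i = 0; i < iterations; i++) {
              double x = rnd.nextDouble(), y = rnd.nextDouble();
              if (x * x + y * y <= 1.0) hits++;
          }
          return hits;
      }

      public static void main(String[] args) throws Exception {
          int nodes = 16;                // stand-in for EC2 grid nodes
          long itersPerNode = 5_000_000; // work per node
          ExecutorService pool = Executors.newFixedThreadPool(nodes);
          List<Future<Long>> parts = new ArrayList<>();
          for (int n = 0; n < nodes; n++) {
              final long seed = n;
              parts.add(pool.submit(() -> countHits(itersPerNode, seed)));
          }
          long hits = 0;
          for (Future<Long> f : parts) hits += f.get(); // "Reduce": aggregate partials
          pool.shutdown();
          System.out.printf("pi ~= %.5f%n", 4.0 * hits / ((double) itersPerNode * nodes));
      }
  }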

7
Other Goals
  • Understand process bottlenecks of the EC2 platform
    • Changes to the programming, deployment and management model
    • Ease of use
    • Security
    • Metering and payment
  • Identify scalability bottlenecks at any level in the stack
    • EC2
    • GridGain
    • Glueware
  • Robustness
  • Stability
  • Predictability

8
Architecture
  [Architecture diagram: a controller on the Corporate Intranet (controls grid operation; configuration and task repository) handles discovery and task assignment, managing worker nodes and tasks in the Amazon EC2 Cloud; jobs execute on worker nodes with JMS message processing, and spare EC2 instances are held in reserve]
  • Technology Stack
    • EC2
    • GridGain
    • Typica
    • OpenMQ
9
Performance Methodology & Results
  • Same algorithm exercised on a wide range of nodes
    • 2, 4, 8, 16, …, 256, 512. Limited by Amazon permission of 550 nodes
  • Simultaneously double the amount of computations and nodes
  • Measure completion time
  • Repeat several times to get statistical averages
  • Conclusions
    • Total degradation from 13 to 16 seconds, or 20% (worked example below)
    • Discarding the first 8 nodes, near-perfect scaling up to 128
    • Slight degradation from 128 to 256 (3%), from 256 to 512 (7%)

=> Proves the point of near-linear scalability end-to-end
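For readers checking the arithmetic: under this weak-scaling setup (work and nodes doubled together), the ideal is a flat completion time, so the degradation is simply the relative growth of the run time. With the reported 13 s and 16 s end points that is (16 - 13) / 16 ≈ 19%, or about 23% if measured against the 13-second baseline, consistent with the roughly 20% quoted above.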
10
Simple scaling script
  var itersPerNode = 5000;
  var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
  for (var i in cnode) {
      var n = cnode[i];
      grid.growEC2Grid(n, true);       // grow the grid to n EC2 instances
      grid.waitForGridInstances(n);    // wait until all n instances have joined
      runTask(itersPerNode * n, n, 3); // total iterations scale with n (weak scaling)
  }

11
Observations
  • Deployment considerations
    • Start-up time for the whole grid in different configurations is 0.5-3 min
    • 2-step deployment process: first, bring up one EC2 node as the controller; next, use the controller on the inside to coordinate bootstrapping
    • Some EC2 nodes don't finish bootstrapping successfully
      • On average, 0.5 nodes come up in an incomplete state; the nature of the problem is not clear
      • If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation
    • IP address deadlock issue
      • The IP addresses of the nodes are needed to start and configure the grid, but they are not available until the grid is up and configured
      • Need to carefully choreograph bootstrapping and pass IPs as parameters into the controlling scripts (a sketch follows)
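A minimal sketch of that choreography, assuming a hypothetical Ec2Client helper defined below (not Typica's or Amazon's actual API): the controller is started first, its IP is polled for, and only then is the address passed to the workers as a bootstrap parameter.

  // Hypothetical sketch of the 2-step bootstrap that sidesteps the IP deadlock.
  // Ec2Client is a stand-in interface for illustration, not the real EC2/Typica API.
  interface Ec2Client {
      String startInstance(String imageId, String userData); // returns an instance id
      String describeInstanceIp(String instanceId);           // null until assigned
  }

  class Bootstrap {
      static String waitForIp(Ec2Client ec2, String id) throws InterruptedException {
          String ip;
          while ((ip = ec2.describeInstanceIp(id)) == null) {
              Thread.sleep(5_000); // poll until EC2 has assigned an address
          }
          return ip;
      }

      static void bringUpGrid(Ec2Client ec2, String imageId, int workers) throws InterruptedException {
          // Step 1: the controller comes up first, so its address becomes known...
          String controllerIp = waitForIp(ec2, ec2.startInstance(imageId, "role=controller"));
          // Step 2: ...and can be handed to every worker as a parameter.
          for (int i = 0; i < workers; i++) {
              ec2.startInstance(imageId, "role=worker;controller=" + controllerIp);
          }
      }
  }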

12
Observations
  • Monitoring considerations
    • Connecting to each node from the outside is possible, but not efficient
    • Check heartbeats from the internal management nodes instead
    • Local scripts must be stored on S3 or passed back before termination
  • Programming model considerations
    • EC2 does not support IP multicast
      • Switched to JMS instead (see the sketch below); luckily, GridGain supports multiple protocols
    • Typica: a 3rd-party connectivity library that uses the EC2 query interface
      • An undocumented limit on URL length is hit with 100s of nodes
      • Amazon just disconnects on improper URLs without specifying the error, so debugging was hard
      • Workaround: rewrote some parts of our framework to enquire about individual running nodes. Works, but is less efficient
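Because multicast is unavailable, discovery and heartbeat traffic has to flow through a broker instead. The following is a generic javax.jms sketch of a worker publishing heartbeats to a topic; the JNDI names are illustrative, and this is not GridGain's actual discovery SPI nor the benchmark's OpenMQ configuration.

  import java.net.InetAddress;
  import javax.jms.*;
  import javax.naming.InitialContext;

  // Minimal JMS heartbeat publisher: a worker announces itself on a topic
  // instead of relying on IP multicast (which EC2 does not support).
  // The JNDI names below are illustrative, not the benchmark's actual config.
  public class HeartbeatPublisher {
      public static void main(String[] args) throws Exception {
          InitialContext jndi = new InitialContext();
          ConnectionFactory factory = (ConnectionFactory) jndi.lookup("ConnectionFactory");
          Topic topic = (Topic) jndi.lookup("grid.heartbeat");

          Connection connection = factory.createConnection();
          Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
          MessageProducer producer = session.createProducer(topic);
          connection.start();
          try {
              String self = InetAddress.getLocalHost().getHostAddress();
              for (int i = 0; i < 3; i++) {                      // a few sample heartbeats
                  producer.send(session.createTextMessage("alive:" + self));
                  Thread.sleep(10_000);                          // heartbeat interval
              }
          } finally {
              connection.close();                                // closes session and producer too
          }
      }
  }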

13
Observations
  • Metering and payment
    • Amazon sets a limit on concurrent VMs; we eventually got approval for 550 VMs after some due diligence from Amazon
    • Amazon charges by full or partial VM-hours (worked example below)
    • Sometimes, short usage of VMs is not metered; it is not clear why. One hypothesis: metering sweeps happen every so often
  • Be careful with usage bills for testing
    • A test may need to be run multiple times
    • Beware of rogue scripts
    • Test everything on smaller configurations first
    • Scale gradually, or you will miss the bottlenecks
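A concrete illustration of why test bills add up, assuming EC2's billing model at the time (per instance-hour, with partial hours counted as full hours): a 10-minute run on 512 instances is billed as 512 instance-hours, and tearing the grid down and bringing it back up for a second 10-minute run starts another 512 billed hours, even though only about 20 minutes of compute were actually used.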

14
Achieving scalability
  • Software breaks at scale, including the glueware
    • Barrier 1 was hit at 100 nodes because of ActiveMQ scalability
      • Correction: switched ActiveMQ for OpenMQ
      • Comment: some users report better ActiveMQ scalability with 5.x
    • Barrier 2 was hit at 300 nodes because of the Typica URL length limit
      • Correction: changed our use of the API
  • Security considerations
    • EC2 credentials are passed to the head node
    • 3rd-party GridGain tasks can access them
    • Sounds like a potential vulnerability

15
What have we learned?
  • EC2 is ready for production usage on large-scale stateless computations
    • Price/performance
    • Strong linear scale curve
  • GridGain showed itself very well
    • Scale, stability, ease of use, pluggability
    • A solid open-source choice of map-reduce middleware
  • Some level of effort is required to port a grid system to EC2
    • Deployment, monitoring, programming model, metering, security
  • What's next?
    • Can we go higher than 512?
    • What is the behavior of more complex applications?

16
Benchmark 2: Scalability of a Data-Driven Risk Management Application on EC2
17
Data-driven Risk Management on EC2
  • Goal: Investigate the scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations, executed on EC2 using GigaSpaces
  • Why risk management: a class of problems widely used in financial services
  • Why GigaSpaces: leading middleware platform for compute and data grids
  • Intended Claims
    • EC2 scales linearly for data-driven HPC applications
    • GigaSpaces scales well as both compute and data grid middleware
    • Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies (a master-worker sketch follows)
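To show the master-worker pattern the benchmark relies on, here is a minimal plain-Java sketch: the master writes tasks into a shared space and collects results, while workers take tasks and write results back. A pair of in-memory queues stands in for the GigaSpaces data grid; this is not the GigaSpaces API.

  import java.util.concurrent.*;

  // Master-worker sketch: the master writes tasks into a shared "space" and
  // waits for results; workers take tasks, compute, and write results back.
  // Two BlockingQueues stand in for the GigaSpaces data grid used in the benchmark.
  public class MasterWorkerSketch {
      public static void main(String[] args) throws Exception {
          BlockingQueue<double[]> tasks = new LinkedBlockingQueue<>(); // scenario batches
          BlockingQueue<Double> results = new LinkedBlockingQueue<>(); // partial risk numbers
          int workers = 4, taskCount = 16;

          // Worker loop: take a batch of scenarios, aggregate it, write the result back.
          Runnable worker = () -> {
              try {
                  while (true) {
                      double[] batch = tasks.take();
                      double sum = 0;
                      for (double scenario : batch) sum += scenario;  // stand-in for a pricing model
                      results.put(sum);
                  }
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();                 // exit when the pool shuts down
              }
          };
          ExecutorService pool = Executors.newFixedThreadPool(workers);
          for (int w = 0; w < workers; w++) pool.execute(worker);

          // Master: write tasks into the "space", then wait for all results.
          for (int t = 0; t < taskCount; t++) tasks.put(new double[]{t, t + 0.5, t + 1.0});
          double total = 0;
          for (int t = 0; t < taskCount; t++) total += results.take();
          pool.shutdownNow();                                         // interrupts the blocked workers
          System.out.println("aggregated result = " + total);
      }
  }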

18
Architecture
  [Architecture diagram: the user manages the grid with ec2-gdc-tools; inside the Amazon EC2 grid, the master writes tasks into the data grid and waits for results, while workers take tasks, perform calculations, and write results back]
19
Performance Methodology & Results
  • Same algorithm exercised on a wide range of nodes
    • 16, 32, 128, 256, 512. Still limited by Amazon permission of 550
  • Constant size of the data grid (4 large EC2 nodes)
  • Double the nodes with a constant amount of work
  • Measure completion time (strive for linear time reduction; see the note below)
  • Conclusions
    • Near-perfect scaling from 16 to 256 nodes
    • 28% degradation from 256 to 512 as the data cache becomes a bottleneck
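Note that, unlike Benchmark 1, this run holds the total work constant while doubling the node count, so ideal linear behavior halves the completion time at every step; the quoted degradation can be read as how far the observed completion time falls short of that ideal halving.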

20
What have we learned?
  • EC2 is ready for production usage for classes of large-scale data-driven HPC applications common to Risk Management
  • GigaSpaces showed itself very well
    • The compute + data grid scales well in the master-worker pattern
  • Some level of effort is required to port a grid system to EC2
    • Deployment, monitoring, programming model, metering, security
    • Bootstrapping this system is far more complex than GridGain's. For more details, contact me offline
  • What's next?
    • How does the data grid scale?
    • What about more complex applications?
    • What's the scalability of a co-located compute-data grid configuration?

21
Benchmark 3: Performance implications of data in the cloud vs. outside the cloud for data-intensive analytics applications
22
Data-intensive Analytics on the MS Cloud
  • Goal: Investigate the performance improvements from data in the cloud vs. outside the cloud for complex data-intensive analytical applications, in the context of the HPC CompFin Labs environment, using Velocity
  • What is CompFin Labs: an MS-funded incubator compute cloud for the exploration of modern compute and data challenges at massive scale
  • What is Velocity: MS's new in-memory data grid middleware, still in CTP1
  • The Model: computes correlation between stock prices over time. The algorithms use a significant amount of data which could be cached; the maximum cache hit ratio for the model is around 90% (a simplified sketch of the cached-correlation idea follows)
  • Intended Claims
    • Measure the impact of data closeness to the computation on the cloud
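To illustrate the kind of workload behind the ~90% cache-hit figure, here is a minimal Java sketch: a Pearson correlation over two price series where the (expensive) series fetch is memoized in a local map. The data source and cache here are simple stand-ins, not Velocity or the CompFin Labs environment.

  import java.util.*;

  // Sketch of the analytical model: correlate two stocks' price series, caching
  // fetched series so repeated pairs mostly hit the cache. The "fetch" below is
  // a stand-in for the real tick-data store; the HashMap stands in for Velocity.
  public class CorrelationSketch {
      private final Map<String, double[]> cache = new HashMap<>();

      // Stand-in for loading a price series from the data store (the slow path).
      private double[] fetchSeries(String symbol) {
          Random rnd = new Random(symbol.hashCode());
          double[] prices = new double[1000];
          for (int i = 0; i < prices.length; i++) prices[i] = 100 + rnd.nextGaussian();
          return prices;
      }

      private double[] getSeries(String symbol) {
          return cache.computeIfAbsent(symbol, this::fetchSeries); // a cache hit skips the fetch
      }

      // Pearson correlation of two equally long series.
      double correlate(String a, String b) {
          double[] x = getSeries(a), y = getSeries(b);
          int n = x.length;
          double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
          for (int i = 0; i < n; i++) {
              sx += x[i]; sy += y[i];
              sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
          }
          double cov = sxy - sx * sy / n;
          double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
          return cov / Math.sqrt(vx * vy);
      }

      public static void main(String[] args) {
          CorrelationSketch model = new CorrelationSketch();
          System.out.printf("corr(MSFT, EBAY) = %.4f%n", model.correlate("MSFT", "EBAY"));
      }
  }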

23
Architecture: CompFin
24
Architecture: Anticipated Bottlenecks
25
Architecture: CompFin + Velocity
26
Benchmarked configurations
  • Same analytical model with complex queries
  • Perfect linear scale curve (baseline)
  • Original CompFin
  • Distributed cache (original CompFin + Velocity distributed cache for financial data)
  • Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing; see the sketch below)
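A rough sketch of the idea behind the "local cache" configuration: a small per-node near cache answers repeated reads locally and falls back to the distributed cache only on a miss. The DistributedCache interface here is a generic stand-in, not Velocity's API.

  import java.util.*;

  // Near-cache sketch: reads are served from a per-node local map when possible,
  // falling back to the (remote) distributed cache on a miss.
  public class NearCache<K, V> {
      /** Stand-in for the remote distributed cache tier. */
      public interface DistributedCache<K, V> {
          V get(K key);
      }

      private final Map<K, V> local = new HashMap<>(); // per-node near cache
      private final DistributedCache<K, V> remote;

      public NearCache(DistributedCache<K, V> remote) {
          this.remote = remote;
      }

      public V get(K key) {
          V value = local.get(key);
          if (value == null) {                          // near-cache miss:
              value = remote.get(key);                  // go to the distributed tier
              if (value != null) local.put(key, value); // keep a local copy for next time
          }
          return value;                                 // a hit needs no network round trip
      }
  }

Data-aware routing then tries to send each computation to the node whose local cache already holds the data it needs, which is consistent with the conclusion later that the local cache reduced the load on the distributed cache.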

27
Test methodology
  • 3 ways of measuring scalability were used
    • Fixed amount of computations, increasing amount of data
    • Fixed amount of data, increasing amount of computations
    • Proportional increase of computations and nodes
  • Node = 1 core
  • Data unit = 32 million records, or 512 megabytes of tick data (about 16 bytes per record)

28
Performance results
29
Performance results
30
Conclusions
  • Data on the cloud definitely matters!
    • Performance improvements of up to 31 times over data outside the cloud
  • The Velocity distributed cache has some scalability challenges
    • Failure on a 50-node cluster with 200 concurrent clients
    • Good news: it is a very young product and MS is actively improving it
  • Compute-data affinity matters too!
    • Significant performance gain of the local cache over the distributed cache
    • The local cache resolved the distributed cache scalability issue by reducing its load

31
Final Remarks
  • Clouds are proving themselves out
    • Early adopters are there already
    • The rest of the real world will join soon
  • There are still significant adoption challenges
    • Technology immaturity
    • Lack of real data, best practices, robust design patterns
    • Fitting of application middleware to cloud platforms is just starting
  • Amazon is the leading commercial cloud provider, but it is not the only game in town
    • Companies are building public, private, dedicated and special-purpose clouds

32
Thank You!
  • Victoria Livschitz, vlivschitz@griddynamics.com