Enabling The Fortune One Million

1
Enabling The Fortune One Million
  • Armando Fox, Stanford University
  • With Randy Katz, Michael Jordan, Dave Patterson,
    Scott Shenker, Ion Stoica
  • IBM P3AD, April 2006

2
Enabling The Fortune One Million
  • Armando Fox, Berkeley RAD Lab
  • With Randy Katz, Michael Jordan, Dave Patterson,
    Scott Shenker, Ion Stoica
  • IBM P3AD, April 2006

3
This talk
  • Untainted by results, proofs, theorems, etc.
  • BUT an attempt to formulate an operational
    vision statement for self-* systems and a
    strategy for attaining it
  • presumed achievable based on bona fide previous
    results

4
Steps vs. Process
[Figure: a pedestal labeled Cap, Dado (the section of a pedestal between cap and base), and Base]
  • Steps: Traditional, Static Handoff Model, N groups
  • Process: Support DADO Evolution, 1 group

5
  • 1995: Pierre Omidyar develops and deploys eBay 1.0
    over a long weekend
  • Has been rewritten twice since then.
  • 2005: HousingMaps.com connects Google Maps and
    Craigslist apartment listings
  • Spawned a whole bunch of map mashups -
    GoogleMapsMania blog
  • 2006: bug in Spell With Flickr mashup results
    in author slapped for driving too much traffic
  • Prototyping getting easier; deployment at scale
    getting harder at least as fast

6
RAD Lab 5-year Mission
  • Provide tools, technology platform to allow a
    single person to Develop, Assess, Deploy, and
    Operate the next-generation IT service
  • Enables The Fortune 1 Million
  • Major partnership with industry
  • Rest of this talk:
  • Early progress on DADO
  • Reflections on upcoming technical challenges and
    opportunities
  • Lab organization, tie-in to course plans, etc.

RAD Lab: Robust, Adaptive, Distributed
systems
7
RAD Lab Challenges Center
  • The Challenges
  • Develop: a new Service using tools that facilitate
    rapid prototyping
  • Assess: Measuring, Testing, and Debugging the new
    Service in a realistic distributed environment
  • Deploy: Scaling up a new, geographically
    distributed Service
  • Operate: a service that could quickly scale to
    millions of users
  • The Vehicle
  • Interdisciplinary Center creates core technical
    competency to demo 10X to 100X
  • Researchers are leaders in machine learning,
    networking, and systems
  • Industrial Participants: leading companies in HW,
    systems SW, and online services

8
Science is a way to create knowledge
  • But science is also about understanding complex
    artifacts
  • What is the science in services science?
  • Going from raw observations of (complex) system
    behavior to actionable interpretations
  • Improving and creating systematic methodologies
    for the DADO steps
  • Creating systematic connections among the DADO
    process stages

Jim Stohrer at IBM University Day, Almaden,
April 2006
9
DADO - Develop: Joy of Middleware
  • Dominant way to deploy services
  • Innovate below abstraction
  • Unmodified/proprietary apps can benefit (e.g.
    from instrumentation)
  • Modern middleware tends to make apps more
    declarative
  • CORBA -> J2EE -> Ruby on Rails
  • Can get things running immediately
  • But usually end up being rewritten
  • NCSA httpd -> Apache, eBay 0.9 -> eBay 2.0,
    Google.stanford.edu -> Google.com
  • Challenge: can we get the best of both worlds?

10
Examples: Understanding the curse of success
11
DADO - Assess example: Packet Annotations
  • Create new ways to collect information over
    distributed network: Annotation Layer
  • Incrementally deployable on existing
    infrastructure
  • iBoxes label packets at annotation layer but do
    not change original packet payloads
  • Expose annotations to application layer

[Figure: protocol stack, top to bottom: Application, Presentation, Session, Annotations, Transport, Network, Link, Phy]
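A minimal sketch of the annotation idea described on this slide: a box on the path adds labels alongside a packet without changing its payload, and the application layer can read those labels. The names `Packet`, `annotate`, and `read_annotations`, the iBox identifiers, and the `queue_delay_ms` key are all hypothetical and chosen only for illustration; the talk does not specify the wire format or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet: the payload is immutable; annotations ride alongside it."""
    payload: bytes
    annotations: tuple = field(default_factory=tuple)


def annotate(packet: Packet, ibox_id: str, key: str, value: str) -> Packet:
    """Return a copy of the packet with one more label; the payload is untouched."""
    label = (ibox_id, key, value)
    return Packet(packet.payload, packet.annotations + (label,))


def read_annotations(packet: Packet, key: str) -> list:
    """What the application layer would see: all labels recorded under `key`."""
    return [(ibox, value) for ibox, k, value in packet.annotations if k == key]


original = Packet(b"GET /index.html")
labeled = annotate(original, "ibox-edge-1", "queue_delay_ms", "12")
labeled = annotate(labeled, "ibox-core-3", "queue_delay_ms", "4")
delays = read_annotations(labeled, "queue_delay_ms")
```

Keeping the packet immutable and returning a new labeled copy mirrors the slide's constraint that original payloads are never modified.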
12
iBox Placement for Observation and Action
iBoxes strategically placed near entry/exit
points within the Enterprise network
13
DADO - Assess: Distributed Debugging
  • Allows inspection of snapshots of distributed
    app state
  • Faithful replay of distributed apps
  • virtual (Lamport) clocks allow consistent replay
  • Has found bugs in Chord/I3 and itself
  • Works with existing toolchains
  • Transparently intercepts libc calls
  • Extends gdb UI

> replay 132.239.6.225
... running
> break update_state()
... 1 set line 75
> advance 10000000
... done
> fix bug for me
user
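To illustrate why virtual (Lamport) clocks allow consistent replay, here is a minimal sketch of the standard Lamport clock rule, with events then sorted by timestamp to obtain one replay order in which every send precedes its receive. This is the textbook technique, not the debugger's actual implementation; the class and function names and the two-node scenario are assumptions made for the example.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sender_time) + 1
        return self.time


def replay_order(log):
    """Sort logged events by (timestamp, node id) to get one consistent order."""
    return sorted(log, key=lambda event: (event[0], event[1]))


a, b = LamportClock(), LamportClock()
log = []
log.append((a.tick(), "A", "local update"))      # A: time 1
sent = a.tick()                                   # A sends at time 2
log.append((sent, "A", "send m1"))
log.append((b.tick(), "B", "local update"))      # B: time 1
log.append((b.receive(sent), "B", "recv m1"))    # B: max(1, 2) + 1 = 3
ordered = replay_order(log)
```

Ties between nodes are broken by node id, which is an arbitrary but deterministic choice; any fixed tie-break gives a repeatable replay.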
14
DADO - Deploy: RAMP
  • How can academics experiment with systems of
    1000 nodes?
  • RAMP (Research Accelerator for Multiple
    Processors) for parallel HW and SW research
  • Single FPGA: 25 CPUs and caches in 2005
  • $100k: 4 FPGAs / board, 4 DIMMs / FPGA, 10-20
    boards, low-cost Storage Server over Ethernet
  • About 1000 CPUs, 256 MB DRAM/CPU, 20 GB disk
    storage/CPU
  • Parts of RAMP-1 prototype already running
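The sizing figures on this slide can be checked with a little arithmetic. The sketch below uses only numbers from the slide, under the assumption that the single-FPGA figure of 25 CPUs applies to each of the 4 FPGAs on a board; the function name and the derived per-DIMM figure are illustrative, not from the talk.

```python
# Figures taken from the slide; applying the single-FPGA CPU count to every
# FPGA on a board is an assumption made for this estimate.
CPUS_PER_FPGA = 25
FPGAS_PER_BOARD = 4
DIMMS_PER_FPGA = 4
DRAM_PER_CPU_MB = 256
DISK_PER_CPU_GB = 20


def system_totals(boards: int) -> dict:
    """Aggregate CPU, DRAM, and disk capacity for a given number of boards."""
    cpus = boards * FPGAS_PER_BOARD * CPUS_PER_FPGA
    return {
        "cpus": cpus,
        "dram_gb": cpus * DRAM_PER_CPU_MB / 1024,
        "disk_tb": cpus * DISK_PER_CPU_GB / 1000,
        "dram_per_dimm_mb": CPUS_PER_FPGA * DRAM_PER_CPU_MB / DIMMS_PER_FPGA,
    }


low, high = system_totals(10), system_totals(20)
```

Under these assumptions, 10 boards give 1000 CPUs and 20 boards give 2000, so the 1000-CPU figure corresponds to the low end of the 10-20 board range.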

15
Using RAMP
  • Current status
  • Smaller-size board, 4 machines
  • 4 MicroBlaze cores, Micro-C/Linux, TCP/IP, NFS,
    Telnet, httpd + CGI; Python, Ruby coming soon
  • Short term plans
  • Instrumentation plane: think OpenView, but we can
    instrument whatever we want
  • Simulate and run simple Web apps on many, many 100 MHz
    CPUs
  • Longer term plans
  • Simulate wide-area networks at scales impractical
    on PlanetLab
  • Understand Datacenter-in-a-Box model

16
DADO - Operate
  • Apply statistical machine learning to find
    patterns in behavior of complex software
  • Example: correlate high-level site health metrics
    with low-level fingerprints associated with bad
    health -> info retrieval
  • Example: sample annotated software features
    (language-level constructs) and correlate feature
    sets with failed runs to help pinpoint bugs
  • Combine SML with visualization so operator sees and
    understands significance of anomalies
  • Promising early results but just the tip of the
    iceberg
  • State of the art for visualization in operator
    tools is very immature
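The second example above (correlating sampled features with failed runs) can be sketched with a very simple score: for each feature, the failure rate among runs where it was observed, minus the overall failure rate. This rate-difference score is a stand-in chosen for illustration, not the method used in the lab's work, and the feature names and runs are made up.

```python
from collections import Counter


def rank_features(runs):
    """Score each feature by P(fail | feature seen) minus the overall failure rate."""
    seen, seen_and_failed = Counter(), Counter()
    for features, failed in runs:
        for f in features:
            seen[f] += 1
            if failed:
                seen_and_failed[f] += 1
    base_rate = sum(1 for _, failed in runs if failed) / len(runs)
    scores = {f: seen_and_failed[f] / seen[f] - base_rate for f in seen}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sampled runs: (features observed during the run, did the run fail?)
runs = [
    ({"cache_miss", "retry"}, True),
    ({"cache_miss"}, False),
    ({"retry", "null_result"}, True),
    ({"cache_miss"}, False),
    ({"null_result"}, True),
    ({"cache_miss", "retry"}, False),
]
ranking = rank_features(runs)
```

Features at the top of `ranking` co-occur with failure more often than chance would suggest, which is where an operator or developer would look first.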

17
Some results so far
18
Signatures - example
  • Metric has value 1 if it is attributed with the
    violation, -1 if it is not attributed, 0 if it is
    not relevant

Attribution
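A minimal sketch of the encoding stated on this slide: each metric contributes 1, -1, or 0 to a signature vector. How relevance and attribution are decided is not described here, so the sketch takes both as given inputs; the function name and metric names are hypothetical.

```python
def make_signature(metric_names, relevant, attributed):
    """Encode one observation period as a vector over a fixed metric ordering.

    1  -> metric is attributed with the violation
    -1 -> metric is relevant but not attributed
    0  -> metric is not relevant
    """
    signature = []
    for name in metric_names:
        if name not in relevant:
            signature.append(0)
        elif name in attributed:
            signature.append(1)
        else:
            signature.append(-1)
    return signature


# Hypothetical metric names, chosen only for illustration.
metrics = ["cpu_util", "disk_io", "net_in", "db_latency"]
sig = make_signature(
    metrics,
    relevant={"cpu_util", "disk_io", "db_latency"},
    attributed={"cpu_util", "db_latency"},
)
```

Using a fixed metric ordering means two signatures can be compared position by position.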
19
DADO - Operate: Open-source-like database of
traces/logs
  • Goal: large trace-like database of failure
    logs and other relevant failure data for
    research use
  • So far:
  • Complete source and sanitized logs of 3 Flickr
    mashup front-ends
  • Working with affiliates to make public a
    sanitized version of data used in our early
    results papers
  • Access to Microsoft desktop crash data collected
    via BOINC (paper submission forthcoming)

20
Reflections: A good time to be using SML
  • Technology supports use of SML
  • Even building blocks of systems are
    sufficiently complex, instrumentable, and have
    large user bases
  • Advances in online algorithms research make good
    fit for long-running systems
  • Moore's Law: nontrivial models can be induced and
    evaluated in soft real time (seconds) for many
    of these systems
  • Domain expertise in systems still needed
  • We will develop a corps of researchers whose
    strength is SML/Systems crossover

21
Reflections: What's new in Services Science?
(or: SOA's becoming real... uh-oh!)
  • Workload challenges
  • AJAX and mashups change workload seen by back-end
    servers
  • Nonlinear dynamics of changing workloads will
    make spike provisioning more challenging (e.g.,
    Flickr mashups)
  • Long tail management challenges
  • For every Amazon or Google, 1000s of smaller
    services
  • This ratio will increase rapidly as ease of
    deploying a meta-service increases
  • Managing 1000s of different services sharing
    resources will be harder than managing one
    mega-service
  • decoupled control loops (Jeff Mogul)
  • Even if each service is well-regulated, can we
    say anything about the meta-services? (like
    mashups)

22
Reflections: An interesting opportunity
  • Difficult-to-scale functionalities increasingly
    being offered as utility services
  • This is mostly why SOA is taking off!
  • Storage: Google Base, OpenDHT, Amazon S3 (e.g., new
    client-side Wiki software that uses S3 and no
    other server!), Salesforce.com
  • Mapping/GIS: Google and Yahoo Maps
  • Build customized searches using search engine
    APIs
  • Future: functions like MapReduce?
  • Indeed, mashups are often not much more than
    front ends of computation + soft state
  • This should be easy to scale!! The rubber meets
    the road here.
  • Experiments to be done soon on RAMP, which I'll
    talk about shortly

23
RAD Lab Organization
  • $2.5M/year: 70% industry, 20% State, 10% Fed.
    govt (NSF)
  • 30 grad students, 10 undergrads, 6 faculty, 2
    staff
  • Founding Companies: Sun, Google, Microsoft
  • Affiliate Members include Verisign, IBM,
    Hewlett-Packard, NTT, Oracle, Nortel
  • Mid project review after 3 years by founding
    partners
  • Benefits to Affiliates and RAD Lab:
  • Prefer founding partner technology in prototypes
  • Designate employees to act as consultants/liaisons
  • Real-life training for next generation of IT
    researchers
  • Research based on real systems data (logs,
    forensics, etc.)

24
Industrial collaboration
  • Intellectual property policy
  • Nonexclusive, royalty-free IP license so partners
    are not sued: BSD license (text available at
    opensource.org)
  • Head start on research results for affiliates
    (6-month embargo)
  • Impact from previous projects
  • RAID, RISC, NOW - multibillion-dollar industries
  • Berkeley regularly ranked in top 3 for systems
    research (#1 this year, tied with MIT)

25
Education/Course plans
  • We're not teaching students to think in terms of
    a hosted-service development model
  • Inheriting, understanding, extending other
    people's long-lived code
  • How you do testing and upgrades for an online
    24x7 service
  • Technologies we used to teach in detail are now
    encapsulated as open source, running code
  • Symmetric multithreaded I/O intensive apps
    (Apache)
  • transactions and concurrency control (MySQL)
  • What's important is understanding these as system
    building blocks
  • What are their interactions?
  • What tradeoffs are involved in composing these in a
    system? What price do I pay (in performance,
    robustness, or whatever) for selecting a given
    behavior?

26
Course plans
  • Year 1: Graduate project courses
  • Improve the RAD Lab platform, infrastructure,
    technologies
  • Year 2: Undergrad courses
  • Develop, assess, deploy, operate new apps on RAD
    Lab hosting service
  • Improve other people's existing services, all in
    a hosted environment
  • Year 3: Joint courses between CS and
    Business/Management
  • Design business model along with app
  • Understand how business concerns affect DADO
    process
  • Consistent with IBM's SSME vision of creating a
    multidisciplinary corps of service scientists

27
Summary
  • Technology bets: SML, visualization, FPGAs
  • Will help us better understand the behaviors of
    these complex distributed systems
  • Will let us run credible experiments at scale
  • Will improve the tools available for operators
  • Eventual goal: Fortune 1 Million
  • 1 person can design, deploy and operate next eBay
    without building an eBay-sized organization
  • Strong ties to industry
  • Integrated with course offerings/curriculum
  • http://radlab.cs.berkeley.edu

28
Acknowledgments
  • RAD Lab sponsors and founders
  • Co-PIs: Patterson, Katz, Stoica, Shenker, Jordan
  • Students whose work was mentioned in this talk:
    Archana Ganapathi, Peter Bodik, Dennis Geels, Wei
    Xu, ...

29
BACKUP SLIDES
30
DADO - Develop: Back-end building block services
  • Servers serve client programs--not people
  • Ideas like user-based anomaly detection don't
    work
  • Workloads: higher volume, different profile
    (e.g., prefetching for Google Maps)
  • Aggregational services multiply workloads (1
    HousingMaps hit = N Craigslist hits + N Google
    Maps hits)
  • Distributed debugging is tough because the other
    sites are not under your control
  • Large sophisticated sites already deal with this
    internally
  • but must now deal with less-predictable workload,
    evil, etc.

31
DADO - Operate: Capturing Operator Actions
  • Systematically capture, index, retrieve operator
    actions during incident response
  • Operators' role largely ignored in most current
    work
  • Goal: try to capture semantics of how operators
    are thinking when they react to a problem
  • Gradually increase trust by suggesting actions
    based on past history
  • Auditable recommendations: let operator explore
    how the recommendation was made
  • Various techniques possible--reinforcement
    learning, expert systems, collaborative
    filtering...

32
Capturing operator actions
  • Monitoring operators:
  • web-based tools: look at web server access logs
  • Unix command line: sudo logs or history
  • stand-alone GUI tools: instrument them?
  • trouble ticket DB: operators involved, start/end,
    type of problem
  • Challenge: extract sufficient semantics to allow
    cross-analysis of sources (timestamps, intent
    of an action, etc.)
  • Similarity metric for failures
  • e.g., compare signatures of failures/problems
  • clickstream analysis and data mining
  • e-commerce sites already do this--for their
    customers
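One way to realize a "similarity metric for failures" is to treat each failure signature as a vector (1, -1, or 0 per metric, as on the Signatures slide) and compare vectors with cosine similarity, then retrieve the closest past incident. Cosine similarity is an assumption chosen for this sketch, not necessarily the metric the project uses; the incident labels and vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(current, past_failures):
    """Return the (label, signature) pair from history closest to the current signature."""
    return max(past_failures, key=lambda item: cosine_similarity(current, item[1]))


# Hypothetical history of labeled failure signatures over the same four metrics.
history = [
    ("db connection pool exhausted", [1, -1, 0, 1]),
    ("disk full on log volume", [-1, 1, 0, -1]),
    ("network partition", [0, -1, 1, 0]),
]
label, _ = most_similar([1, -1, -1, 1], history)
```

A retrieved label could then be paired with the operator actions recorded for that incident, which is the kind of cross-analysis of sources the slide calls for.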

33
RAD Lab Opportunity: New Research Model
  • Chance to Partner with the Top University in
    Computer Systems on the Next Great Thing
  • National Academy of Engineering mentions Berkeley
    in 7 of 19 $1B industries that came from IT
    research
  • NAE mentions Berkeley 7 times, Stanford 5 times,
    MIT 5, CMU 3: Timesharing (SDS 940), Client-Server
    Computing (BSD Unix), Graphics, Entertainment,
    Internet, LANs, Workstations, GUI, VLSI Design
    (Spice) [ECAD $5B?/yr], RISC [$10B?/yr],
    Relational DB (Ingres/Postgres) [RDB $15B?/yr],
    Parallel DB, Data Mining, Parallel Computing,
    RAID [$15B?/yr], Portable Communication (BWRC),
    WWW, Speech Recognition, Broadband
  • Berkeley is one of the top suppliers of systems
    students to industry and academia
  • US News & World Report ranking of CS Systems
    universities: 1 Berkeley, 2 CMU, 2 MIT, 4
    Stanford, 5 Washington
  • For example: Quanta (Taiwan PC laptop clone
    manufacturer) funds MIT CSAIL at $4M/year for 5
    years to reinvent PC, April 2005 (Tparty)
  • RAID project (4 faculty, 20 grads, 10 undergrads)
    helped create $15B industry, but not fundable
    today at DARPA, NSF

34
RAD Lab: Interdisciplinary Center for Reliable,
Adaptive, Distributed Systems
  • Working with different industries on long-range,
    pre-competitive technology
  • Training of dozens of future leaders of IT, plus
    their recruitment
  • Working with researchers with track records of
    successful technology transfer

35
RAD Lab Timeline
  • 2005: Launch RAD Lab
  • 2006: Collect workloads, Internet in a Box
  • 2007: SLT/CT distributed architectures, iBoxes,
    annotative layer, class testing
  • 2008: Development toolkit 1.0, tuple space, class
    testing, Mid Project Review
  • 2009: RAD Lab software suite 1.0, class testing
  • 2010: End of Project Party

36
DADO - Operate
  • Other ideas
  • Fast recovery means can afford false positives,
    enabling automated recovery mechanisms for
    servers via SLT algorithms
  • Microreboot exemplifies Repair as local
    adaptation
  • Safety achieved by state separation
  • Linear Control Theory places constraints on SW
    architectures
  • Will restricting systems to be controllable
    make them easier to operate by humans as well as by
    simple controllers?
  • Will cost-performance still be good enough for
    controllable systems?
  • SLT helpful in diagnosing failed components

37
DADO - Develop: Control-Theory-Friendly Systems
  • Problem: server-like system consisting of stages
    separated by queues
  • Lack of balance across stages results in
    performance hiccups
  • Straightforward application of LTI control theory
    to regulate queue lengths via combination of
    admission control and filtering
  • Insight: build systems to allow the use of simple
    linear controllers
  • Example: Farsite Scalability TR identifies
    Farsite properties that prevent it from being a
    good candidate for CT
  • Could Farsite be architected to avoid those
    properties? At what cost?
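As a rough illustration of regulating queue length through admission control, the sketch below simulates a single queue as a fluid model and applies a proportional controller (with clamping) to the admitted fraction of arrivals. The function name, the gain, the rates, and the target are arbitrary values chosen for the example; a real multi-stage system and the controllers the project has in mind may differ substantially.

```python
def simulate(ticks, arrival_rate, service_rate, target=None, gain=0.5):
    """Fluid model of one queue; returns the queue length after each tick.

    With `target` set, a proportional controller picks the admitted fraction
    so that inflow = service_rate + gain * (target - queue), clamped to [0, 1].
    With `target` None, everything is admitted.
    """
    queue, history = 0.0, []
    for _ in range(ticks):
        if target is None:
            admit = 1.0
        else:
            desired_inflow = service_rate + gain * (target - queue)
            admit = min(1.0, max(0.0, desired_inflow / arrival_rate))
        queue = max(0.0, queue + admit * arrival_rate - service_rate)
        history.append(queue)
    return history


# Arbitrary illustrative numbers: offered load exceeds capacity by 50%.
uncontrolled = simulate(30, arrival_rate=150.0, service_rate=100.0)
controlled = simulate(30, arrival_rate=150.0, service_rate=100.0, target=50.0)
```

Without control the queue grows by 50 requests every tick; with the controller the error to the target halves each tick, so the queue settles near 50. The controller stays this simple only because the modeled stage behaves linearly, which is the slide's point about building systems that permit simple linear controllers.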