Building better tools for operators in Internet services

1
Building better tools for operators in Internet
services
  • Peter Bodík, Armando Fox, Dave Patterson,
  • Jon Ingalls (Amazon.com)

2
Current work in autonomic computing (AC) ignores the role of operators
  • operators understand how the system works; learn from them
  • we need to understand how they work
  • to build better tools and automate their work
  • software developers are operators too
  • 100s-1000s of operators vs. a few specialists
  • this presentation:
  • describes the work of operators/resolvers at
    Amazon.com
  • two new tools to make them more efficient

3
The work of operators
  • a previous study of operators (IBM)
  • surveyed 100 operators, videotaped 200 hours of
    their work
  • in large corporate data centers
  • key themes:
  • lack of good tools for operators
  • collaboration and communication
  • planning and rehearsal
  • situation awareness
  • tool building
  • multitasking and diversions

4
Amazon.com
  • two-pizza teams
  • 50 software teams (each responsible for a few
    services)
  • most of the software developed in-house
  • networking, hardware, monitoring, operators
  • each team has a primary-resolver on-call 24x7
  • the Monitoring team (MT)
  • provides the infrastructure for monitoring SW/HW at
    Amazon and for setting up alarms
  • makes it easy for anybody to instrument their SW/HW
  • MT collects all the data, stores it in a DB, and
    provides visualization tools
  • provides an API for accessing the data
  • other teams build their own visualization tools
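The pipeline on this slide — any team instruments its software, the Monitoring team stores the data in a DB and exposes an API that other teams build visualization tools on — can be sketched roughly as follows. This is a toy model; all class and method names here are hypothetical, not Amazon's actual interfaces:

```python
from collections import defaultdict

class MetricStore:
    """Toy stand-in for the Monitoring team's database: collects
    datapoints from any team and serves them back via an API."""

    def __init__(self):
        # (service, metric) -> list of (timestamp, value)
        self._data = defaultdict(list)

    def put(self, service, metric, ts, value):
        """Called by instrumented SW/HW anywhere in the company."""
        self._data[(service, metric)].append((ts, value))

    def query(self, service, metric, start, end):
        """The API other teams use to build their own visualization tools."""
        return [(t, v) for (t, v) in self._data[(service, metric)]
                if start <= t <= end]

store = MetricStore()
store.put("orders", "latency_ms", 100, 32.0)
store.put("orders", "latency_ms", 160, 45.0)
print(store.query("orders", "latency_ms", 90, 120))  # -> [(100, 32.0)]
```

The point of the design is the separation of concerns described on the slide: teams only call `put`, while dashboards and alarms are built on top of `query`.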

5
Operations
  • operators vs. resolvers
  • 10 operators
  • monitor the whole site
  • don't fix the problems themselves, but page the resolvers
  • 1000 resolvers (10-15 per team)
  • monitor their own service, fix the problems that
    arise
  • sev1 problems:
  • operators notice the problem, perform quick
    troubleshooting, page the corresponding resolvers
  • sev2 problems:
  • go directly to the primary resolver of the affected service

6
Very dynamic environment
  • most of the software written in-house
  • the code is constantly changing
  • in contrast with standard PC software
  • changes in code pushed to production
  • January through November 2005 (in Monitoring
    team)
  • on average 140 code pushes a month
  • changes in documentation
  • documentation in Wiki since August 2005
  • in October and November, more than 700 changes a
    month

7
Sev1 problems
  • problems that affect customers
  • often detected as a decrease of traffic to certain
    URLs
  • how they are solved:
  • operators notice the problem, do initial
    troubleshooting
  • they don't try to solve the problem themselves
  • they engage primary resolvers in multiple teams
  • resolvers have 15 minutes to be at their laptops and
    join a conference call
  • on average 6 people involved (sometimes 20-30)
  • the problem is later assigned to one team
  • sometimes misdiagnosed
  • on average, every sev1 problem is misdiagnosed once

8
Why sev1 problems are hard
  • dependencies between components
  • a failure in one component affects many others
  • many components appear broken, but only one actually is
  • the dependencies are invisible
  • situation awareness
  • operators/resolvers don't see the "big picture"
  • too much information
  • thousands of metrics for each component
  • they want to know the useful metrics, docs, ...
  • thousands of active alarms

9
Maya
  • interactive visualization
  • components, their health, dependencies (logical /
    hardware)
  • zoom in to see datacenters, racks, machines, load
    balancers
  • a wiki dashboard for each component: metrics,
    alarms, notes, ...
  • dependencies
  • hard to detect all of them automatically
  • let people add/remove dependencies
  • health of components / dashboards
  • don't try to find the useful metrics
    automatically; use the knowledge of the operators
  • dashboards built like a wiki
  • anybody can add metrics, notes, links, ...
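A minimal sketch of the wiki-style dependency graph described above: anybody can add or remove edges, and the graph can answer which components merely *appear* broken when one component fails. This is hypothetical illustration code, not Maya's actual implementation:

```python
class DependencyGraph:
    """Wiki-style component graph: anybody can add or remove edges,
    and a failure in one component is traced to everything downstream."""

    def __init__(self):
        self.deps = {}  # component -> set of components it depends on

    def add_dependency(self, user, component, depends_on):
        # `user` is kept for wiki-style accountability (who edited what)
        self.deps.setdefault(component, set()).add(depends_on)

    def remove_dependency(self, user, component, depends_on):
        self.deps.get(component, set()).discard(depends_on)

    def possibly_affected_by(self, failed):
        """Components that transitively depend on `failed` -- these may
        appear broken even though only one component actually is."""
        affected, frontier = set(), {failed}
        while frontier:
            frontier = {c for c, ds in self.deps.items()
                        if ds & frontier and c not in affected}
            affected |= frontier
        return affected

g = DependencyGraph()
g.add_dependency("alice", "website", "orders")
g.add_dependency("bob", "orders", "db")
print(g.possibly_affected_by("db"))  # -> {'orders', 'website'}
```

This illustrates why the slide argues for letting people edit dependencies by hand: the `possibly_affected_by` answer is only as good as the edges people keep current.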

10
(No Transcript)
11
Sev2 problems
  • facts:
  • don't directly affect the customers (but still a
    15-minute SLA)
  • handled by resolvers (not operators)
  • 100x more frequent than sev1 problems
  • some detected manually
  • the rest detected automatically
  • 70-90% of problems detected automatically through
    alarms
  • new features -> new bugs -> cause sev2 problems
  • the bugs are eventually fixed, but resolvers still have
    to deal with the problems in the meantime
  • restart the application, reboot the machine
  • these problems repeat relatively often
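Since 70-90% of sev2 problems are detected automatically through alarms, a simple illustrative alarm rule might fire when a metric breaches a threshold several samples in a row. The rule and all the numbers below are made up for illustration; the talk does not describe Amazon's actual alarm logic:

```python
def check_alarm(values, threshold, min_breaches=3):
    """Fire an alarm if the metric exceeds `threshold` at least
    `min_breaches` consecutive times (a toy alarm rule)."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run >= min_breaches:
            return True
    return False

# hypothetical latency samples (ms) spiking past a 500 ms threshold
print(check_alarm([120, 510, 530, 560, 140], 500))  # -> True
print(check_alarm([120, 510, 140, 560, 140], 500))  # -> False
```

Requiring consecutive breaches rather than a single one is a common way to avoid paging a resolver for a one-sample blip.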

12
Fixing sev2 problems
  • resolvers know how to fix the repeating problems
  • the documentation contains notes for the primary
    resolver
  • how to troubleshoot and fix the most common
    problems
  • it becomes obsolete very quickly and needs to be
    updated very often
  • not everything is in the docs:
  • new types of problems arise
  • need to train new operators
  • resolvers ask colleagues, search through emails
  • the primary sometimes can't resolve the problem
  • but somebody else can
  • or somebody else has resolved it before

13
Monitor the operators, suggest actions
  • create a database of past problems
  • with solutions to each problem
  • for a new problem, suggest actions that would
    help
  • populate the database by monitoring operators
  • monitoring the resolvers:
  • type of the problem
  • sequence of actions from the tools they use
  • web-based tools: access logs
  • command-line: sudo logs, shell history
  • time intervals when they worked on a problem
  • biggest issue: resolvers multitask a lot
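The proposed tool — a database of past problems with the action sequences resolvers took (mined from web access logs, sudo logs, and shell history), used to suggest actions for a new problem — could be sketched as follows. All names are hypothetical:

```python
from collections import Counter, defaultdict

class ActionSuggester:
    """Toy version of the proposed tool: record which actions resolvers
    took for each type of problem, then suggest the most common ones
    when a problem of that type recurs."""

    def __init__(self):
        self.by_problem = defaultdict(Counter)

    def record(self, problem_type, actions):
        """Called after a problem is resolved, with the action sequence
        reconstructed from access logs / sudo logs / shell history."""
        self.by_problem[problem_type].update(actions)

    def suggest(self, problem_type, k=3):
        """For a new problem of a known type, suggest the k actions
        that helped most often in the past."""
        return [a for a, _ in self.by_problem[problem_type].most_common(k)]

s = ActionSuggester()
s.record("queue-backlog", ["view cpu dashboard", "restart app"])
s.record("queue-backlog", ["restart app", "read wiki runbook"])
print(s.suggest("queue-backlog", 1))  # -> ['restart app']
```

Frequency counting is the simplest possible ranking; the slide's "biggest issue" — resolvers multitask, so actions are hard to attribute to one problem — would corrupt these counts unless attribution is handled first.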

14
A prototype
  • trouble ticket database
  • start/end times
  • worklog entries, people working on the problem
  • type of alarm that generated the ticket

(figure: a ticket timeline from "start" to "resolved", padded by 30 minutes
on each side, with the actions of user A and user B plotted along it)
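Attributing logged actions to a ticket via its start/resolved times, padded by the 30-minute windows shown on this slide, might be sketched as follows. This is hypothetical code; the multitasking ambiguity noted on the next slide is exactly what this simple window-based attribution cannot resolve:

```python
PAD = 30 * 60  # 30-minute padding around the ticket, in seconds

def actions_for_ticket(start, resolved, log):
    """Pick out log entries that fall inside the ticket's padded time
    window.  `log` is a list of (timestamp, user, action) tuples, as
    might be reconstructed from access logs and sudo logs."""
    lo, hi = start - PAD, resolved + PAD
    return [(u, a) for (t, u, a) in log if lo <= t <= hi]

log = [(900,   "A", "open dashboard"),
       (5000,  "A", "restart app"),
       (99999, "B", "unrelated work")]
print(actions_for_ticket(1000, 5000, log))
# -> [('A', 'open dashboard'), ('A', 'restart app')]
```

If the same resolver has two overlapping tickets, both windows capture the same actions, which is why the prototype falls back on asking resolvers for feedback.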
15
A prototype (cont'd)
  • types of actions:
  • monitoring tools
  • CPU, memory at hosts for service X
  • documentation (wiki)
  • results:
  • for each type of problem, the most popular
    metrics and docs
  • no quantitative results yet
  • multitasking of resolvers:
  • don't know exactly which actions belong to which
    problem
  • get feedback from resolvers

16
Conclusion
  • Maya
  • useful for sev1 issues
  • like a wiki
  • dependencies
  • metrics, notes, links, ...
  • monitoring operators
  • useful for sev2 problems that repeat
  • monitor how resolvers diagnose and fix problems
  • later suggest useful actions

17
add
  • misdiagnosed problems
  • 60-80%
  • fairly often, since the pages by nature are due to
    performance and availability issues that are
    often outside our direct control
  • and dependencies