Building better tools for operators in Internet services

1
Building better tools for operators in Internet
services
  • Peter Bodík, Armando Fox, Dave Patterson,
  • Jon Ingalls (Amazon.com)

2
Current work in autonomic computing (AC) ignores the role of operators
  • operators understand how the system works; learn from them
  • we need to understand how they work
  • to build better tools and automate their work
  • software developers are operators too
  • 100s-1000s of operators vs. a few specialists
  • this presentation:
  • describes the work of operators/resolvers at
    Amazon.com
  • two new tools to make them more efficient

3
The work of operators
  • a previous study of operators (IBM)
  • surveyed 100 operators, videotaped 200 hours of
    their work
  • in large corporate data centers
  • key themes:
  • lack of good tools for operators
  • collaboration and communication
  • planning and rehearsal
  • situation awareness
  • tool building
  • multitasking and diversions

4
Amazon.com
  • two-pizza teams
  • 50 software teams (each responsible for a few
    services)
  • most of the software developed in-house
  • networking, hardware, monitoring, operators
  • each team has a primary-resolver on-call 24x7
  • the Monitoring team (MT)
  • provides the infrastructure for monitoring SW/HW at
    Amazon and for setting up alarms
  • makes it easy for anybody to instrument their SW/HW
  • MT collects all the data, stores it in a DB, and
    provides visualization tools
  • provides an API for accessing the data
  • other teams build their own visualization tools
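The pipeline on this slide — any team instruments its software, the Monitoring team stores the data in a DB and exposes an API that other teams build visualization tools on — can be sketched roughly as follows. This is a toy model; all class and method names here are hypothetical, not Amazon's actual interfaces:

```python
from collections import defaultdict

class MetricStore:
    """Toy stand-in for the Monitoring team's database: collects
    datapoints from any team and serves them back via an API."""

    def __init__(self):
        # (service, metric) -> list of (timestamp, value)
        self._data = defaultdict(list)

    def put(self, service, metric, ts, value):
        """Called by instrumented SW/HW anywhere in the company."""
        self._data[(service, metric)].append((ts, value))

    def query(self, service, metric, start, end):
        """The API other teams use to build their own visualization tools."""
        return [(t, v) for (t, v) in self._data[(service, metric)]
                if start <= t <= end]

store = MetricStore()
store.put("orders", "latency_ms", 100, 32.0)
store.put("orders", "latency_ms", 160, 45.0)
print(store.query("orders", "latency_ms", 90, 120))  # -> [(100, 32.0)]
```

The point of the design is the separation of concerns described on the slide: teams only call `put`, while dashboards and alarms are built on top of `query`.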

5
Operations
  • operators vs. resolvers
  • 10 operators
  • monitor the whole site
  • don't fix the problems themselves, but page the resolvers
  • 1000 resolvers (10-15 per team)
  • monitor their own service, fix the problems that
    arise
  • sev1 problems:
  • operators notice the problem, perform quick
    troubleshooting, page the corresponding resolvers
  • sev2 problems:
  • go directly to the primary resolver of the affected service

6
Very dynamic environment
  • most of the software written in-house
  • the code is constantly changing
  • in contrast with standard PC software
  • changes in code pushed to production
  • January through November 2005 (in Monitoring
    team)
  • on average 140 code pushes a month
  • changes in documentation
  • documentation in Wiki since August 2005
  • in October and November, more than 700 changes a
    month

7
Sev1 problems
  • problems that affect customers
  • often detected as a decrease of traffic to certain
    URLs
  • how they are solved:
  • operators notice the problem, do initial
    troubleshooting
  • they don't try to solve the problem themselves
  • they engage primary resolvers in multiple teams
  • resolvers have 15 minutes to be at their laptops and
    join a conference call
  • on average 6 people involved (sometimes 20-30)
  • the problem is later assigned to one team
  • sometimes misdiagnosed
  • on average, every sev1 problem is misdiagnosed once

8
Why sev1 problems are hard
  • dependencies between components
  • a failure in one component affects many others
  • many components appear broken, but only one actually is
  • the dependencies are invisible
  • situation awareness
  • operators/resolvers don't see the "big picture"
  • too much information
  • thousands of metrics for each component
  • they want to know the useful metrics, docs, ...
  • thousands of active alarms

9
Maya
  • interactive visualization
  • components, their health, dependencies (logical /
    hardware)
  • zoom in to see datacenters, racks, machines, load
    balancers
  • a wiki dashboard for each component: metrics,
    alarms, notes, ...
  • dependencies
  • hard to detect all of them automatically
  • let people add/remove dependencies
  • health of components / dashboards
  • don't try to find the useful metrics
    automatically; use the knowledge of the operators
  • dashboards built like a wiki
  • anybody can add metrics, notes, links, ...
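A minimal sketch of the wiki-style dependency graph described above: anybody can add or remove edges, and the graph can answer which components merely *appear* broken when one component fails. This is hypothetical illustration code, not Maya's actual implementation:

```python
class DependencyGraph:
    """Wiki-style component graph: anybody can add or remove edges,
    and a failure in one component is traced to everything downstream."""

    def __init__(self):
        self.deps = {}  # component -> set of components it depends on

    def add_dependency(self, user, component, depends_on):
        # `user` is kept for wiki-style accountability (who edited what)
        self.deps.setdefault(component, set()).add(depends_on)

    def remove_dependency(self, user, component, depends_on):
        self.deps.get(component, set()).discard(depends_on)

    def possibly_affected_by(self, failed):
        """Components that transitively depend on `failed` -- these may
        appear broken even though only one component actually is."""
        affected, frontier = set(), {failed}
        while frontier:
            frontier = {c for c, ds in self.deps.items()
                        if ds & frontier and c not in affected}
            affected |= frontier
        return affected

g = DependencyGraph()
g.add_dependency("alice", "website", "orders")
g.add_dependency("bob", "orders", "db")
print(g.possibly_affected_by("db"))  # -> {'orders', 'website'}
```

This illustrates why the slide argues for letting people edit dependencies by hand: the `possibly_affected_by` answer is only as good as the edges people keep current.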

10
(No Transcript)
11
Sev2 problems
  • facts:
  • don't directly affect the customers (but still a
    15-minute SLA)
  • handled by resolvers (not operators)
  • 100x more frequent than sev1 problems
  • some detected manually
  • the rest detected automatically
  • 70-90% of problems detected automatically through
    alarms
  • new features -> new bugs -> cause sev2 problems
  • the bugs are eventually fixed, but resolvers still have
    to deal with the problems in the meantime
  • restart the application, reboot the machine
  • these problems repeat relatively often
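Since 70-90% of sev2 problems are detected automatically through alarms, a simple illustrative alarm rule might fire when a metric breaches a threshold several samples in a row. The rule and all the numbers below are made up for illustration; the talk does not describe Amazon's actual alarm logic:

```python
def check_alarm(values, threshold, min_breaches=3):
    """Fire an alarm if the metric exceeds `threshold` at least
    `min_breaches` consecutive times (a toy alarm rule)."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run >= min_breaches:
            return True
    return False

# hypothetical latency samples (ms) spiking past a 500 ms threshold
print(check_alarm([120, 510, 530, 560, 140], 500))  # -> True
print(check_alarm([120, 510, 140, 560, 140], 500))  # -> False
```

Requiring consecutive breaches rather than a single one is a common way to avoid paging a resolver for a one-sample blip.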

12
Fixing sev2 problems
  • resolvers know how to fix the repeating problems
  • the documentation contains notes for the primary
    resolver
  • how to troubleshoot and fix the most common
    problems
  • it becomes obsolete very quickly and needs to be
    updated very often
  • not everything is in the docs:
  • new types of problems arise
  • need to train new operators
  • resolvers ask colleagues, search through emails
  • the primary sometimes can't resolve the problem
  • but somebody else can
  • or somebody else has resolved it before

13
Monitor the operators, suggest actions
  • create a database of past problems
  • with solutions to each problem
  • for a new problem, suggest actions that would
    help
  • populate the database by monitoring operators
  • monitoring the resolvers:
  • type of the problem
  • sequence of actions from the tools they use
  • web-based tools: access logs
  • command-line: sudo logs, shell history
  • time intervals when they worked on a problem
  • biggest issue: resolvers multitask a lot
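The proposed tool — a database of past problems with the action sequences resolvers took (mined from web access logs, sudo logs, and shell history), used to suggest actions for a new problem — could be sketched as follows. All names are hypothetical:

```python
from collections import Counter, defaultdict

class ActionSuggester:
    """Toy version of the proposed tool: record which actions resolvers
    took for each type of problem, then suggest the most common ones
    when a problem of that type recurs."""

    def __init__(self):
        self.by_problem = defaultdict(Counter)

    def record(self, problem_type, actions):
        """Called after a problem is resolved, with the action sequence
        reconstructed from access logs / sudo logs / shell history."""
        self.by_problem[problem_type].update(actions)

    def suggest(self, problem_type, k=3):
        """For a new problem of a known type, suggest the k actions
        that helped most often in the past."""
        return [a for a, _ in self.by_problem[problem_type].most_common(k)]

s = ActionSuggester()
s.record("queue-backlog", ["view cpu dashboard", "restart app"])
s.record("queue-backlog", ["restart app", "read wiki runbook"])
print(s.suggest("queue-backlog", 1))  # -> ['restart app']
```

Frequency counting is the simplest possible ranking; the slide's "biggest issue" — resolvers multitask, so actions are hard to attribute to one problem — would corrupt these counts unless attribution is handled first.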

14
A prototype
  • trouble ticket database
  • start/end times
  • worklog entries, people working on the problem
  • type of alarm that generated the ticket

(figure: a ticket timeline from "start" to "resolved", padded by 30 minutes
on each side, with the actions of user A and user B plotted along it)
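Attributing logged actions to a ticket via its start/resolved times, padded by the 30-minute windows shown on this slide, might be sketched as follows. This is hypothetical code; the multitasking ambiguity noted on the next slide is exactly what this simple window-based attribution cannot resolve:

```python
PAD = 30 * 60  # 30-minute padding around the ticket, in seconds

def actions_for_ticket(start, resolved, log):
    """Pick out log entries that fall inside the ticket's padded time
    window.  `log` is a list of (timestamp, user, action) tuples, as
    might be reconstructed from access logs and sudo logs."""
    lo, hi = start - PAD, resolved + PAD
    return [(u, a) for (t, u, a) in log if lo <= t <= hi]

log = [(900,   "A", "open dashboard"),
       (5000,  "A", "restart app"),
       (99999, "B", "unrelated work")]
print(actions_for_ticket(1000, 5000, log))
# -> [('A', 'open dashboard'), ('A', 'restart app')]
```

If the same resolver has two overlapping tickets, both windows capture the same actions, which is why the prototype falls back on asking resolvers for feedback.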
15
A prototype (cont'd)
  • types of actions:
  • monitoring tools
  • CPU, memory at hosts for service X
  • documentation (wiki)
  • results:
  • for each type of problem, the most popular
    metrics and docs
  • no quantitative results yet
  • multitasking of resolvers:
  • don't know exactly which actions belong to which
    problem
  • get feedback from resolvers

16
Conclusion
  • Maya
  • useful for sev1 issues
  • like a wiki
  • dependencies
  • metrics, notes, links, ...
  • monitoring operators
  • useful for sev2 problems that repeat
  • monitor how resolvers diagnose and fix problems
  • later suggest useful actions

17
add
  • misdiagnosed problems
  • 60-80%
  • fairly often, since the pages by nature are due to
    performance and availability issues that are
    often outside our direct control
  • and dependencies