Pathdiag: Automatic TCP Diagnosis

1
Pathdiag: Automatic TCP Diagnosis
  • Matt Mathis (PSC)
  • John Heffner (PSC/Rinera)
  • Peter O'Neil (NCAR/Mid-Atlantic Crossroads)
  • Pete Siempsen (NCAR)
  • 30 April 2008
  • http://staff.psc.edu/mathis/papers/PAM20080430.ppt

2
Outline
  • What is the problem?
  • The pathdiag solution
  • Details
  • The bigger problem

3
What is the problem?
Internet2 weekly traffic statistics: about 3 Mb/s!
4
Why is end-to-end performance difficult?
  • By design TCP/IP hides the net from upper layers
  • TCP/IP provides basic reliable data delivery
  • The hourglass between applications and networks
  • This is a good thing, because it allows:
  • Invisible recovery from data loss, etc.
  • Old applications to use new networks
  • New applications to use old networks
  • But then (nearly) all problems have the same symptom:
  • Less than expected performance
  • The details are hidden from nearly everyone

5
TCP tuning is painful debugging
  • All problems reduce performance
  • But the specific symptoms are hidden
  • Any one problem can prevent good performance
  • Completely masking all other problems
  • Trying to fix the weakest link of an invisible chain
  • General tendency is to guess and fix random parts
  • Repairs are sometimes random walks
  • Repair one problem at a time, at best
  • The solution is to instrument TCP

6
The Web100 project
  • Use TCP's ideal diagnostic vantage point
  • Instrument TCP: what is limiting the data rate?
  • RFC 4898: TCP-ESTATS-MIB
  • Standards track
  • Prototypes for Linux (www.Web100.org) and Windows Vista
  • Fix TCP's part of the problem: autotuning
  • Automatically adjusts TCP socket buffers
  • Linux 2.6.17 default maximum window size is 4 MBytes
  • Microsoft Vista default maximum window size is 8 MBytes
  • (Except IE)
  • Web100 is done
  • But still under limited support

7
New insight: symptoms scale with RTT
  • Example flaws:
  • TCP buffer space
  • Packet loss
  • Think: RTT in the denominator converts rounds to elapsed time
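The formulas for the two example flaws are not present in this transcript. A minimal sketch, assuming the usual window-limited bound (rate = window / RTT) and the macroscopic loss model (rate ≈ (MSS / RTT) · C / √p), shows why RTT sits in the denominator in both cases. The function names, the 64 KByte buffer, the 1460-byte MSS, the loss probability, and the constant C = 0.7 are illustrative assumptions, not values from the slides.

```python
import math


def buffer_limited_rate(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on data rate (bytes/s) when one window is sent per round trip."""
    return window_bytes / rtt_s


def loss_limited_rate(mss_bytes: float, rtt_s: float, loss_prob: float,
                      c: float = 0.7) -> float:
    """Macroscopic loss model: rate ~ (MSS / RTT) * C / sqrt(p).

    C is a model constant of order one; 0.7 is an assumed value.
    """
    return (mss_bytes / rtt_s) * c / math.sqrt(loss_prob)


if __name__ == "__main__":
    window_bytes = 64 * 1024   # assumed untuned socket buffer
    mss_bytes = 1460           # assumed Ethernet-sized segments
    loss_prob = 1e-4           # assumed packet loss probability
    for label, rtt_s in (("local, 1 ms RTT", 0.001),
                         ("remote, 100 ms RTT", 0.100)):
        by_buffer = buffer_limited_rate(window_bytes, rtt_s) / 1e6
        by_loss = loss_limited_rate(mss_bytes, rtt_s, loss_prob) / 1e6
        print(f"{label}: buffer limit {by_buffer:.2f} MByte/s, "
              f"loss limit {by_loss:.2f} MByte/s")
```

With the same flaw present, both limits drop by a factor of 100 when the RTT grows from 1 ms to 100 ms, which is the scaling the slide describes.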

8
Symptom scaling breaks diagnostics
  • Local client to server:
  • Flaw has insignificant symptoms
  • All applications work, including all standard diagnostics
  • False pass for all diagnostic tests
  • Remote client to server: all applications fail
  • Leading to faulty implication of other components
  • It seems that the flaws are in the wide area network

9
The confounded problems
  • For nearly all network flaws:
  • The only symptom is reduced performance
  • But the reduction is scaled by RTT
  • Therefore, flaws are undetectable on short paths
  • False pass for even the best conventional diagnostics
  • Leads to faulty inductive reasoning about flaw locations
  • Diagnosis often relies on tomography and complicated inference techniques
  • This is the real end-to-end performance problem

10
Goals
  • We want to automate debugging for the masses
  • But start with the low-hanging fruit
  • Who are the users? Assume:
  • Analytic (e.g. non-network scientists)
  • Not afraid of math or measurements
  • Known data sources
  • Primary data direction is towards the users
  • They have systems and network support
  • Only need to do first-level diagnosis

11
More Goals
  • Automatic:
  • One click in a web browser
  • Diagnose first-level problems:
  • Easily expose all path bottlenecks that limit performance to less than 10 MByte/s
  • Easily expose all end-system/OS problems that limit performance to less than 10 MByte/s
  • Will become moot as autotuning is deployed
  • Empower the users to apply the proper motivation
  • Results need to be accurate, well explained, and common to both users and sys/net admins

12
The pathdiag solution
  • Test a short section of the path
  • Most often the first or last mile
  • Use Web100 to collect detailed TCP statistics
  • Loss, delay, queuing properties, etc.
  • Use models to extrapolate results to the full path
  • Assume that the rest of the path is ideal
  • The user has to specify the end-to-end goal:
  • Data rate and RTT
  • Pass/fail on the basis of the extrapolated performance
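The extrapolation step can be illustrated with a small sketch. It is not pathdiag's actual code: it assumes the macroscopic loss model rate ≈ (MSS / RTT) · C / √p with an assumed constant C = 0.7, and every function name and threshold below is hypothetical. Given a loss rate and maximum window measured on the short section, it checks whether the user's end-to-end target (data rate and RTT) would still be reachable if the rest of the path were ideal.

```python
MATHIS_C = 0.7  # assumed model constant, not taken from the slides


def required_window_bytes(target_rate_bps: float, target_rtt_s: float) -> float:
    """Window (bytes) needed to sustain target_rate_bps (bytes/s) at target_rtt_s."""
    return target_rate_bps * target_rtt_s


def max_loss_for_target(target_rate_bps: float, target_rtt_s: float,
                        mss_bytes: float = 1460) -> float:
    """Invert rate = (MSS / RTT) * C / sqrt(p) to get the largest tolerable p."""
    return (MATHIS_C * mss_bytes / (target_rate_bps * target_rtt_s)) ** 2


def diagnose(measured_loss: float, max_window_bytes: float,
             target_rate_bps: float, target_rtt_s: float,
             mss_bytes: float = 1460):
    """Return ("PASS" | "FAIL", findings) for the extrapolated full path."""
    findings = []
    need_window = required_window_bytes(target_rate_bps, target_rtt_s)
    if max_window_bytes < need_window:
        findings.append(
            f"window too small: have {max_window_bytes:.0f} B, need {need_window:.0f} B")
    loss_budget = max_loss_for_target(target_rate_bps, target_rtt_s, mss_bytes)
    if measured_loss > loss_budget:
        findings.append(
            f"loss too high: measured {measured_loss:.2e}, budget {loss_budget:.2e}")
    return ("PASS" if not findings else "FAIL"), findings


if __name__ == "__main__":
    # Target: 10 MByte/s at 70 ms RTT; the local section showed 1e-5 loss.
    status, findings = diagnose(measured_loss=1e-5,
                                max_window_bytes=4 * 1024 * 1024,
                                target_rate_bps=10e6, target_rtt_s=0.070)
    print(status)
    for finding in findings:
        print(" -", finding)
```

A loss rate that is harmless on the short test path can still exceed the budget once the target RTT is applied, which is how a local test can fail a flaw that conventional local diagnostics would pass.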

13
Deploy as a Diagnostic Server
  • Use pathdiag in a Diagnostic Server (DS)
  • Specify end-to-end target performance
  • From server (S) to client (C) (RTT and data rate)
  • Measure the performance from DS to C
  • Use Web100 in the DS to collect detailed statistics
  • On both the path and client
  • Extrapolate performance assuming an ideal backbone
  • Pass/fail on the basis of extrapolated performance

14
Demo
  • Click here for a live server

15
Pathdiag output
16
Pathdiag output
17
Key NPAD/pathdiag features
  • Results are intended to be self-explanatory
  • Provides a list of specific items to be corrected
  • Failed tests are show stoppers for fast applications
  • Includes explanations and tutorial information
  • Clear differentiation between client and path problems
  • Accurate escalation to network or system admins
  • The reports are public and can be viewed by either
  • Coverage for a majority of OS and last-mile network flaws
  • Coverage is one way: need to reverse client and server
  • Does not test the application: need application tools
  • Does not check routing: need traceroute
  • Does not check for middleboxes (NATs, etc.)
  • Eliminates nearly all(?) false pass results

18
More features
  • Tests become more sensitive as the path gets shorter
  • Conventional diagnostics become less sensitive
  • Depending on models, perhaps too sensitive
  • New problem is false fail (e.g. queue space tests)
  • Flaws no longer completely mask other flaws
  • A single test often detects several flaws
  • E.g. can find both OS and network flaws in the same run
  • They can be repaired concurrently
  • Archived DS results include raw Web100 data (sample)
  • Can reprocess with updated reporting software
  • New reports from old data
  • Critical feedback for the NPAD project
  • We really want to collect interesting failures

19
Under the covers
  • Same base algorithm as Windowed Ping [Mathis, INET94]
  • Aka mping
  • See http://www.psc.edu/mathis/wping/
  • Killer diagnostic in use at PSC in the early 90s
  • Stopped being useful with the advent of fast path routers
  • Use a simple fixed window protocol
  • Scan window size in 1 second steps
  • Pathdiag clamps cwnd to control the TCP window
  • Varies step size: fine steps near interesting features
  • Measure data rate, loss rate, RTT, etc. as window changes
  • Reports reflect key features of the measured data

20
Window Size vs Data Rate
21
Window Size vs Loss Rate
22
Window Size vs RTT
23
Window Size vs Power
24
The Bigger Picture
  • Download and install:
  • http://www.psc.edu/networking/projects/pathdiag/
  • The hardest part is building a Linux kernel
  • Beyond end-of-funding, still under limited support
  • Barriers to adoption
  • User expectations
  • Our language
  • Network administrators

25
Need to recalibrate user expectations
  • Long history of very poor network performance
  • Users do not know what to expect
  • Users have become completely numb
  • Users have no clue about how poorly they are doing
  • Because TCP/IP hides the network all too well
  • We need to re-educate R&E users:
  • Less than 1/2 gigabyte per minute is not high speed
  • Everyone should be able to reach this rate
  • People who can't should know why or be angry

26
Language problems
  • Nobody except network geeks uses bits/second
  • BTW, on the last slide:
  • 1/2 gigabyte/minute is about
  • 10 MByte/s or
  • 80 Mb/s
  • 17-year-old LAN technology (FDDI)
  • Nothing slower should be considered high speed
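The unit conversion on this slide can be checked with a few lines. This sketch assumes a decimal gigabyte (10^9 bytes); the helper name is hypothetical.

```python
def per_minute_to_rates(bytes_per_minute: float):
    """Convert bytes per minute to (MByte/s, Mb/s), using decimal prefixes."""
    bytes_per_s = bytes_per_minute / 60
    return bytes_per_s / 1e6, bytes_per_s * 8 / 1e6


if __name__ == "__main__":
    mbyte_s, mbit_s = per_minute_to_rates(0.5e9)  # 1/2 gigabyte per minute
    print(f"{mbyte_s:.1f} MByte/s, {mbit_s:.1f} Mb/s")  # prints 8.3 MByte/s, 66.7 Mb/s
```

The exact figures are a little under the slide's round numbers, so "about 10 MByte/s or 80 Mb/s" is best read as rounding up to convenient values.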

27
Campus network administrators
  • Generally very underfunded, and know it
  • Can't support all users equally
  • Don't want users to compare results
  • Don't want to enable accurate user complaints
  • Don't want pathdiag
  • Workaround: deploy upstream

28
Closing
  • Satisfied our immediate technical goals
  • The bigger problem still requires a lot more work

29
Backup slides
30
What about impact of the test traffic?
  • Pathdiag server is single threaded
  • Only one test at a time
  • Same load as any well-tuned TCP application
  • Protected by TCP fairness
  • Large flows are generally softer than small flows
  • Large flows are easily disturbed by small flows
  • Note that any short-RTT flow is stiffer than a long-RTT flow

31
NPAD/pathdiag deployment
  • Why should a campus networking organization care?
  • Zero-effort solution to mistuned end systems
  • Accurate reports of real problems
  • You have the same view as the user
  • Saves time when there really is a problem
  • You can document reality for management
  • Suggestion:
  • Require pathdiag reports for all performance problems

32
Download and install
  • User documentation:
  • http://www.psc.edu/networking/projects/pathdiag/
  • Follow the link to Installing a Server
  • Easily customized with a site-specific skin
  • Designed to be easily upgraded with new releases
  • Roughly every 2 months
  • Improving reports through ongoing field experience
  • Drops into existing NDT servers
  • Plans for future integration
  • Enjoy!

33
The Wizard Gap
34
The Wizard Gap Updated
  • Experts have topped out end systems and links
  • 10 Gb/s NIC bottleneck
  • 40 Gb/s link bandwidth (striped)
  • Median I2 bulk rate is 3 Mbit/s
  • See http://netflow.internet2.edu/weekly/
  • Current gap is about 3000:1
  • Closing the first factor of 30 should now be easy
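The gap arithmetic can be reproduced directly from the figures on this slide. The variable names are illustrative, and relating the result to the 10 MByte/s goal from the earlier slides is an interpretation rather than something the slide states.

```python
EXPERT_BPS = 10e9   # 10 Gb/s NIC bottleneck reached by experts
MEDIAN_BPS = 3e6    # median Internet2 bulk rate, 3 Mbit/s

gap = EXPERT_BPS / MEDIAN_BPS          # roughly 3333, i.e. about 3000:1
after_first_30 = MEDIAN_BPS * 30       # rate after closing a factor of 30

if __name__ == "__main__":
    print(f"gap is about {gap:.0f}:1")
    print(f"closing a factor of 30 gives {after_first_30 / 1e6:.0f} Mb/s")
```

Closing the first factor of 30 would lift the median to 90 Mb/s, which is in the neighbourhood of the 10 MByte/s (80 Mb/s) level that the earlier slides use as the floor for high speed.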

35
Pathdiag
  • Initial version aimed at NSF domain scientists
  • People with non-networking analytical background
  • Report designed to:
  • Accurately identify subsystem
  • Provide tutorial
  • Provide good escalation to network or host admin
  • Support the user as the ultimate judge of success
  • Future plan to split reports:
  • Even easier for non-experts
  • Better information for experts

36
Pathdiag
  • One-click automatic performance diagnosis
  • Designed for (non-expert) end users
  • Future version will better support both expert and non-expert users
  • Accurate end-system and last-mile diagnosis
  • Eliminate most false pass results
  • Accurate distinction between host and path flaws
  • Accurate and specific identification of most flaws
  • Basic networking tutorial info
  • Help the end user understand the problem
  • Help train 1st tier support (sysadmin or netadmin)
  • Backup documentation for support escalation
  • Empower the user to get it fixed
  • The same reports for users and admins