Title: Diagnostic Steps
1Diagnostic Steps
- Les Cottrell SLAC
- Presented at the Networks for Non Networkers 2nd International Workshop, 21-22 June 2005, Edinburgh, Scotland
- http://www.slac.stanford.edu/grp/scs/net/talk05/nfnn2-jun05.ppt
Partially funded by DOE/MICS Field Work Proposal
on Internet End-to-end Performance Monitoring
(IEPM), also supported by IUPAP
2Overview
- Goal: provide a practical guide to debugging common problems (Brian covered high performance problems)
- Why is diagnosis difficult yet important?
- Local host
- Ping, Traceroute, PingRoute
- Looking at time series
- Locating bottlenecks
- Correlation of problems with routes
- More tools and problems
- Where is a node
- Who do you tell, what do you say?
- Case studies and More Information
3Why is diagnosis difficult?
- Internet's evolution as a composition of independently developed and deployed protocols, technologies, and core applications
- Diversity, highly unpredictable, hard to find invariants
- Rapid evolution and change, no equilibrium so far
- Findings may be out of date
- Measurement/diagnosis not high on vendors' list of priorities
- Resources/skill focus on more interesting and profitable issues
- Tools lacking or inadequate
- Implementations are flaky and not fully tested with new releases
4Add to that
- Distributed systems are very hard
- "A distributed system is one in which I can't get my work done because a computer I've never heard of has failed." Butler Lampson
- Network is deliberately transparent
- The bottlenecks can be in any of the following components:
- the applications
- the OS
- the disks, NICs, bus, memory, etc. on sender or receiver
- the network switches and routers, and so on
- Problems may not be logical
- Most problems are operator errors, configurations, bugs
- When building distributed systems, we often observe unexpectedly low performance
- the reasons for which are usually not obvious
- Just when you think you've cracked it, in steps security
- Firewall, NAT boxes etc.
- Block pings, traceroute looks like port scan, diagnostic tool ports are blocked
- ISPs worried about providing access to core, making results public, privacy issues
5Sources of problems
- Host errors
- TCP buffers, heavy utilization
- Duplex mismatch (Ethernet)
- Misconfigured router/switches
- Including routing errors, especially for backup paths
- Bad equipment, wiring/fiber problem
- Congestion
6Local Host (also see NDT later)
- Usual Unix tools (uname -a, top, vmstat, iostat ...)
- Is the host overloaded? Do you have a gateway (route) and a name server (nslookup)? Which interface are you using (mii-tool (needs root) gives duplex and speed, a common error source)?
- Net: ifconfig -a (look at errors), netstat -a
- Is server running (if you know port)?
- >telnet localhost 2811
- Trying 127.0.0.1
- 220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
- telnet> quit
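The telnet check can also be scripted. The following is a minimal Python sketch, assuming the service sends a greeting on connect (as the GridFTP server in the example does). `check_port` is a hypothetical helper name, not part of any tool mentioned here.

```python
import socket

def check_port(host, port, timeout=3.0):
    """Return (listening, banner) for a TCP service, like 'telnet host port'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            try:
                # Many servers (FTP, GridFTP, SMTP) greet first; others stay silent.
                banner = conn.recv(256).decode("ascii", errors="replace").strip()
            except socket.timeout:
                banner = ""
            return True, banner
    except OSError:
        # Refused, unreachable, or timed out: nothing usable is answering.
        return False, ""

# Example use (port 2811 is the one from the slide):
#   listening, banner = check_port("localhost", 2811)
```

An empty banner with `listening` true only means the port accepted the connection; it does not show that the application behind it is healthy.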
7Local Host - LISA
- Localhost Information Service Agent: LISA is a Java Web Start application which provides
- Integration with MonALISA
- Complete Monitoring of the System (Load, CPU, Memory, Disk, Disk IO, Paging, Processes, Network Traffic and Connectivity...)
- History and instantaneous
- Filters to trigger actions when predefined conditions are detected
- A user friendly GUI to present the monitoring information
- Optimization modules for distributed applications
- It is a lightweight application that can be easily deployed on any system
- Modules for End to End network measurements (e.g. IPERF)
- See monalisa.caltech.edu/dev_lisa.html
8Ping
- Ping to localhost, ping to gateway, ping to well known host, ping to relevant remote host
- Use IP address to avoid nameserver problems
- Look for connectivity, loss, RTT, jitter, dups
- May need to run for a long time to see some pathologies (e.g. bursty loss due to DSL loss of sync)
- Try flood pings if you suspect rate limiting
- Use synack or sting if ICMP blocked
- www-iepm.slac.stanford.edu/tools/synack/
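When ICMP is blocked, the time taken to complete a TCP connect gives a rough RTT. The following minimal Python sketch illustrates that idea; it is not the synack tool itself. `tcp_probe` and `summarize` are hypothetical names, and the port must be one the remote host answers on.

```python
import socket
import time

def tcp_probe(host, port, count=5, timeout=2.0):
    """Time `count` TCP connects; return (sent, received, list of RTTs in ms).
    Pass an IP address so that name lookups are not included in the timing."""
    rtts = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                rtts.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Counted as lost. A refused connection also lands here, although
            # the reset it received shows the host itself is reachable.
            pass
    return count, len(rtts), rtts

def summarize(sent, received, rtts):
    """Format loss and min/avg/max RTT in the style of ping's summary line."""
    loss = 100.0 * (sent - received) / sent
    if not rtts:
        return f"{sent} sent, {received} received, {loss:.0f}% loss"
    avg = sum(rtts) / len(rtts)
    return (f"{sent} sent, {received} received, {loss:.0f}% loss, "
            f"min/avg/max = {min(rtts):.1f}/{avg:.1f}/{max(rtts):.1f} ms")
```

A call such as `print(summarize(*tcp_probe("134.79.18.163", 80)))` would probe a web server by address. As with ping, a handful of probes will not reveal small or bursty loss.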
9Ping example
Callouts: packet size (-s 64), remote host, repeat count (-c 6), RTT (time), missing seq (icmp_seq=1), summary
- syrup/home ping -c 6 -s 64 thumper.bellcore.com
- PING thumper.bellcore.com (128.96.41.1): 64 data bytes
- 72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
- 72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
- 72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
- 72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
- 72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
- --- thumper.bellcore.com ping statistics ---
- 6 packets transmitted, 5 packets received, 16% packet loss
- round-trip min/avg/max = 482.1/880.5/1447.4 ms
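Spotting a missing sequence number by eye gets tedious on long runs. The following minimal Python sketch parses ping replies for gaps. It assumes the common `icmp_seq=N ... time=X ms` reply format, which varies between ping implementations; `ping_gaps` is a hypothetical helper.

```python
import re

REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")

def ping_gaps(output, sent):
    """Return (missing sequence numbers, average RTT in ms) from ping output."""
    seen = {}
    for line in output.splitlines():
        match = REPLY.search(line)
        if match:
            seen[int(match.group(1))] = float(match.group(2))
    # Assumes sequence numbers start at 0, as in the example on this slide.
    missing = [seq for seq in range(sent) if seq not in seen]
    average = sum(seen.values()) / len(seen) if seen else None
    return missing, average

# The replies from the example on this slide.
SAMPLE = """\
72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
"""

missing, average = ping_gaps(SAMPLE, sent=6)
print(missing, round(average, 1))  # [1] 880.5
```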
103rd party ping (via Looking Glass)
- Find servers
- www.caida.org/analysis/routing/reversetrace/
- Example: http://stats.geant.net/cgi-bin/lg/lg.cgi
- Ok for checking connectivity and RTT but not for losses (unless huge)
Looking Glass Results - ch1.ch.geant.net
Date: Mon May 30 21:28:39 2005 GMT
Query: Ping <IP_Addr FQDN>
Real Query: ping rapid count 5
Argument(s): www.slac.stanford.edu
PING www8.slac.stanford.edu (134.79.18.163): 56 data bytes
!!!!!
--- www8.slac.stanford.edu ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 167.316/172.212/191.222/9.506 ms
11Traceroute
- Traceroute to remote host
- Is the route direct, or over congested commercial nets?
- Reverse traceroute from remote host to you or 3rd party
- www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html
- www.tracert.com/
- www.caida.org/analysis/routing/reversetrace/
CAIDA mouse sensitive map
12Traceroute
Callouts: remote host, max hops (-m 20), probes/hop (-q 1), location, long delay (satellite) at hop 11, no response at hop 13 (lost packet or router ignores)
- UDP/ICMP tool to show route packets take from local to remote host
- 17cottrell@flora06>traceroute -q 1 -m 20 lhr.comsats.net.pk
- traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
- 1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
- 2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
- 3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
- 4 snv-slac.es.net (134.55.208.30) 1.377 ms
- 5 nyc-snv.es.net (134.55.205.22) 75.536 ms
- 6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
- 7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
- 8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
- 9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
- 10 207.45.205.18 (207.45.205.18) 128.648 ms
- 11 210.56.31.94 (210.56.31.94) 762.150 ms
- 12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
- 13 *
- 14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
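The long-delay hop can be picked out mechanically by looking for the largest RTT increase between consecutive responding hops. The following is a minimal Python sketch using the RTTs from this slide; `biggest_jump` is a hypothetical helper.

```python
def biggest_jump(hop_rtts):
    """Return (from_hop, to_hop, increase_ms) for the largest RTT increase
    between consecutive responding hops. Hops are numbered from 1 and
    None marks a hop that did not respond."""
    best = None
    previous = None  # (hop number, rtt) of the last hop that responded
    for hop, rtt in enumerate(hop_rtts, start=1):
        if rtt is None:
            continue
        if previous is not None:
            increase = rtt - previous[1]
            if best is None or increase > best[2]:
                best = (previous[0], hop, increase)
        previous = (hop, rtt)
    return best

# RTTs (ms) for hops 1-14 of the traceroute to lhr.comsats.net.pk on this slide.
RTTS = [0.642, 0.616, 0.716, 1.377, 75.536, 80.629, 154.742, 137.403,
        135.850, 128.648, 762.150, 751.851, None, 827.301]

start, end, increase = biggest_jump(RTTS)
print(f"hop {start} -> hop {end}: +{increase:.1f} ms")  # hop 10 -> hop 11: +633.5 ms
```

With one probe per hop the numbers are noisy, and routers may answer probes slowly without forwarding slowly, so a jump is a hint about where to look rather than proof of a slow link.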
13RTT from California to world
Figure: RTT (ms) from Palo Alto, CA (W. Coast) to the world, plotted against longitude (degrees) and as a frequency distribution; regions marked are W. Coast US, E. Coast US, Brazil/S. America and Europe; 300 ms reference line; 0.3-0.6c. Data from CAIDA Skitter project.
14Traceroute server results
- Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl
Screenshot callouts: related info, security warning, traceroute, enter IP address or name
15Pingroute
- Ping routers along route, e.g. a tool to install that helps
- www.slac.stanford.edu/comp/net/fpingroute.pl
- or www.slac.stanford.edu/comp/net/fpingroute.pl if fping N/A
15cottrell@noric04>fpingroute.pl
fpingroute.pl does a traceroute to the selected host. For each of the hops along the route it then uses fping to ping each node (in parallel) 'count' times. Output includes traceroute information, RTTs, losses for 100 and 'size' byte pings.
Version=0.21, 8/24/04
Usage: fpingroute.pl Opts host
where host is the remote host's IP address or name, e.g. www.slac.stanford.edu
Opts:
-c count default=10
-s size default=1400
-i initial default=1
Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca
16Pingroute example
- May help tell where losses start
- Will need many pings if losses small
Callouts on example output: start of losses? But? Start of sustained losses. Routers may not respond.
17Look at time series
- Look at history plots (PingER, AMP, IEPM-BW, ISPs, own border router etc.): when did the problem start, how big an effect is it?
- Assumes you know proximity of paths for which there are archived active measurements to the path that you are interested in
- Also that relevant measurements exist
- www-iepm.slac.stanford.edu/pinger/
- amp.nlanr.net/
- ISPs' plots
- Abilene: http://stryper.uits.iu.edu/abilene/
- GEANT: http://stats.geant.net/usagemap/usagemap
- RIPE: http://www.ripe.net/projects/ttm/Plots/
- ESnet: http://measurement.es.net/ (OWAMP)
- Collaboration between Internet2/ESnet/Geant to provide access to router measurements holds promise
- Look at traceroute histories (see later)
18Example time series
- Look for change in measured value
- Note time
- Correlate
Plot callout: Italy disconnected
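Looking for a change in the measured value can be partly automated. The following is a minimal Python sketch, assuming evenly spaced samples and a simple before/after mean comparison; it is not the method used by any of the tools named above. `find_step`, `window` and `threshold` are hypothetical names, and the series is synthetic.

```python
from statistics import mean

def find_step(values, window=5, threshold=0.3):
    """Return the index at which the mean of the next `window` samples differs
    most from the mean of the previous `window` samples, provided the relative
    change exceeds `threshold`; otherwise return None."""
    best_index, best_change = None, threshold
    for i in range(window, len(values) - window + 1):
        before = mean(values[i - window:i])
        after = mean(values[i:i + window])
        if before == 0:
            continue  # relative change is undefined
        change = abs(after - before) / before
        if change > best_change:
            best_index, best_change = i, change
    return best_index

# Synthetic hourly throughput (Mbits/s): a drop starts at index 8.
series = [410, 395, 402, 420, 398, 405, 412, 400, 95, 90, 98, 92, 96, 91, 94, 97]
print(find_step(series))  # 8
```

The index gives the time to note; correlating it with route changes or outages still has to be done by hand.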
19Find location of a bottleneck
- Look at hops along the path
- Pingroute (see earlier)
- If possible look at utilizations or active probes launched from there
- Pipechar (son of pathchar, pchar)
- Send packets of varying sizes to each router along path
- Look at RTT as a function of packet size
- From slope deduce bandwidth
- Differentiate to find capacity at each hop
- However pchar is no longer supported, pathchar is very slow, pipechar has uncertain support (ask Brian)
- Packet size variation limited to 1-MTU (1500) bytes, so on fast links timing is difficult, with the result that estimates may not be reliable
- Find pipechar at http://www.dsd.lbl.gov/OldProjects/NCS/
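The slope-and-differentiate steps can be sketched as follows. This is a minimal Python illustration of the variable-packet-size idea, not the code of pathchar, pchar or pipechar. It assumes that the minimum RTT to each hop grows linearly with packet size and that the reply size is fixed. `slope` and `hop_capacities` are hypothetical names, and the data is synthetic.

```python
def slope(points):
    """Least-squares slope of y against x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

def hop_capacities(min_rtts_by_hop):
    """min_rtts_by_hop[n] holds (packet size in bytes, minimum RTT in seconds)
    pairs measured to hop n+1. Returns the estimated capacity of each link in
    Mbits/s, or None where the slope did not increase."""
    capacities = []
    previous = 0.0
    for points in min_rtts_by_hop:
        current = slope(points)      # seconds per byte, cumulative to this hop
        delta = current - previous   # seconds per byte added by this link
        capacities.append(8 / delta / 1e6 if delta > 0 else None)
        previous = current
    return capacities

# Synthetic data: a 100 Mbits/s link followed by a 10 Mbits/s link.
sizes = [100, 500, 1000, 1500]
hop1 = [(s, 0.0005 + s * 8 / 100e6) for s in sizes]
hop2 = [(s, 0.0020 + s * 8 / 100e6 + s * 8 / 10e6) for s in sizes]
print([round(c, 1) for c in hop_capacities([hop1, hop2])])  # [100.0, 10.0]
```

Real measurements need many probes per size, keeping only the minimum RTT to filter out queuing. On fast links the per-byte slope is small compared with timing noise, which is why the estimates become unreliable.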
20Divide & Conquer
- Abilene has hosts at major PoPs running bwctl
- So make measurements from end to middle to identify where performance is lost
- http://e2epi.internet2.edu/pipes/ami/bwctl/
21Correlate with routes (traceanal)
22Visualizing traceroutes
- One compact page per day
- One row per host, one column per hour
- One character per traceroute to indicate pathology or change (usually period (.) = no change)
- Identify unique routes with a number
- Be able to inspect the route associated with a route number
- Provide for analysis of long term route evolutions
Callouts: route at start of day gives idea of route stability; multiple route changes (due to GEANT), later restored to original route; period (.) means no change
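The one-character-per-traceroute encoding can be sketched as follows. This is a minimal Python illustration of the numbering and the period for "no change" only. It does not reproduce the pathology characters, and `summarize_routes` is a hypothetical name rather than part of traceanal.

```python
def summarize_routes(hourly_routes):
    """hourly_routes: one route per hour, each a tuple of hop addresses.
    Returns (row, routes): row has one entry per hour ('.' for no change,
    otherwise the route number) and routes maps route numbers to routes."""
    numbers = {}  # route -> route number, in order of first appearance
    row = []
    previous = None
    for route in hourly_routes:
        number = numbers.setdefault(route, len(numbers))
        # The first sample always shows its number: the route at start of day.
        # Numbers of 10 or more take two characters in this simple version.
        row.append("." if route == previous else str(number))
        previous = route
    routes = {number: route for route, number in numbers.items()}
    return "".join(row), routes

# Illustrative routes, shortened to three hops each.
a = ("192.68.191.83", "137.164.23.41", "137.164.22.37")
b = ("192.68.191.146", "134.55.209.1", "134.55.209.6")
row, routes = summarize_routes([a, a, a, b, b, a, a])
print(row)  # 0..1.0.
```

Keeping the `routes` mapping alongside the row is what allows the route behind any number to be inspected later.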
23Changes in network topology (BGP) can result in
dramatic changes in performance
Figure callouts: snapshot of traceroute summary table (remote host by hour); samples of traceroute trees generated from the table; Los-Nettos (100Mbps)
Notes:
1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went un-noticed for 2 months
4. Next step is to auto detect and notify
Plot: Mbits/s over time, showing dynamic BW capacity (DBC), cross-traffic (XT) and available BW (DBC-XT). Drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-Esnet-LosNettos (100Mbps)-Caltech, then back to original path. Esnet-LosNettos segment in the path (100 Mbits/s). Changes detected by IEPM-Iperf and ABwE. ABwE measurement one/minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am.
24Moving towards application
- See Brian's talk
- Try user application (mem to mem, disk to disk)
- GridFTP, bbcp, bbftp
- Iperf or thrulay (also provides RTT) to test TCP or UDP throughput
- dast.nlanr.net/Projects/Iperf/
- www.internet2.edu/shalunov/thrulay/
- NDT
- What are the interface speeds?
- What is the bottleneck?
- Is there a duplex mismatch?
- Are buffers set right (both ends)?
25NDT example (Rich Carlson)
26Other tools
- Ntop
- Summarizes libpcap (sniffer) information
- Internet2 Detective
- Tests connectivity to I2, bandwidth, multicast, IPv6
- Can run as Java applet
- http://detective.internet2.edu/
- NLANR Internet Advisor
- Ethereal, tcpdump, snoop for masochists
- Passive tools
- Netflow for characterizing network, spotting abnormalities, e.g.
- www.itec.oar.net/abilene-netflow
- www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html
- SNMP based tools
27And then
- Wireless
- Avoid peer-to-peer/ad-hoc connections
- Disable connecting to ad-hoc (set infrastructure only)
- Disable bridging
- How to do it varies by OS (XP, OSX, Linux)
- Ad hoc can still interfere if on same channel
- Tools to locate an access point (e.g. Yellow-Jacket)
- See www2.slac.stanford.edu/comp/net/wireless/Wireless-Meeting-Handout.mht
- NAT boxes may block or not support application
- Private addresses
- 10.0.0.0 - 10.255.255.255: a single class A net
- 172.16.0.0 - 172.31.255.255: 16 contiguous class Bs
- 192.168.0.0 - 192.168.255.255: 256 contiguous class Cs
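A quick way to tell whether an address seen in a traceroute or log sits behind a NAT is to test it against the three private blocks. The following is a minimal Python sketch using the standard `ipaddress` module; the CIDR prefixes are the usual way of writing the ranges listed above, and `is_private` is a hypothetical helper.

```python
import ipaddress

# 10.0.0.0-10.255.255.255, 172.16.0.0-172.31.255.255, 192.168.0.0-192.168.255.255
PRIVATE_BLOCKS = [ipaddress.ip_network(block) for block in
                  ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_private(address):
    """True if an IPv4 address falls in one of the three private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)

for addr in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "192.168.0.1", "210.56.16.10"):
    print(addr, is_private(addr))
```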
28Where is a host?
- Beware: some of the following information is ephemeral; in general use heuristics with Google
- Google Internet country codes for TLDs
- Host may not be in TLD country; developing regions especially often use proxies elsewhere
- Location may be encoded in router name
- ipls=Indianapolis, snv=Sunnyvale
- Name server lookup to find hostname given IP address
- 47cottrell@netflow>nslookup 210.56.16.10
- Server: localhost
- Address: 127.0.0.1
- Name: lhr.comsats.net.pk
- Address: 210.56.16.10
- Use a whois server, e.g.
- www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa)
- www.ripe.net/cgi-bin/whois (Europe)
- www.apnic.net/ (Asia)
- May identify site name, address, contact, etc.; not all domains are in databases (e.g. will not find comsats.net.pk)
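The router-name heuristic can be sketched as a small lookup. This is a minimal Python illustration that knows only the two codes given on this slide. ISPs follow no single naming standard, so any real table would have to be built per network, and `location_hints` is a hypothetical helper.

```python
import re

# Only the two codes given on this slide; a real table would be much longer.
LOCATION_CODES = {"ipls": "Indianapolis", "snv": "Sunnyvale"}

def location_hints(hostname):
    """Return locations whose code appears as a token in a router name."""
    tokens = re.split(r"[^a-z0-9]+", hostname.lower())
    return [LOCATION_CODES[t] for t in tokens if t in LOCATION_CODES]

print(location_hints("snv-slac.es.net"))  # ['Sunnyvale']
print(location_hints("nyc-snv.es.net"))   # ['Sunnyvale']
```

A match says where one end of a link probably terminates, not where the end host is.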
29Where is a host cont.
- Find the Autonomous System (AS) administering the host
- Form giving AS for domain name
- http://www.fixedorbit.com/search.htm
- Gives AS number, name, adjacent ASs, web page for AS
- Given an AS find out more about it
- Use http://bgp.potaroo.net/cidr/, go to bottom and enter AS into form
- Gives ISP name, web page, phone number, email, hours etc.
- Review list of ASs ordered by Upstream AS Adjacency
- www.telstra.net/ops/bgp/bgp-as-upsstm.txt
- Tells what AS is upstream of an ISP
30Where is a host - cont.
- May be able to get latitude & longitude
- http://www.hostip.info/index.html
- http://www.ip2location.com/
- But it is a subscriber service (, but ), however it is probably best for developing regions
- Triangulate pings from landmarks (in development)
- planetlab-01.ipv6.lip6.fr:10000/cbg.php
31Who you gonna tell?
- Local network support people
- Internet Service Provider (ISP), usually done by local networker
- Usually will know immediate one, e.g. trouble@es.net
- Use puck.nether.net/netops/nocs.cgi to find ISP
- Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs
- Well managed sites and ISPs maintain a list of email addresses such as abuse@ or postmaster@, that one can send email to, for example to complain about spam etc.
- This follows an Internet recommendation (RFC 2142).
- Some less helpful sites do not provide such services; for more on these, see RFC-ignorant.org
32What ya gonna tell em?
- Describe problem with details
- What is affected?
- Application, host OS (uname -a), NIC (ifconfig, route)
- How is it affected?
- Non responsiveness, unable to contact remote host
- Slow performance (see Brian's talk), packet loss
- When did it start?
- Send ping output between hosts
- Send traceroute forward & reverse if possible
- Maybe use -I (ICMP option)
- NDT
- Identify when it started
- If complex think about creating web page with details
- Top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)
33Web page examples & Case studies
- http://www.slac.stanford.edu/grp/scs/net/case/html/
- http://e2epi.internet2.edu/case-studies/
34More Information
- Tutorial on monitoring
- www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
- RFC 2151 on Internet tools
- www.freesoft.org/CIE/RFC/Orig/rfc2151.txt
- Network monitoring tools
- www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
- www.caida.org/tools/taxonomy/
- Network Performance Tools: an I2 Cookbook
- e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf
- Network Monitoring sites
- www.slac.stanford.edu/comp/net/wan-mon/netmon.html
35Pathology Encodings
Change but same AS
No change
Probe type
End host not pingable
Change in only 4th octet
Hop does not respond
Stutter
ICMP checksum
Multihomed
! Annotation (!X)
36Navigation
traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
1 rtr-gsr-test (134.79.243.1) 0.102 ms
13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X
rt firstseen lastseen route
0 1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
1 1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
2 1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
3 1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
4 1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
5 1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
6 1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
7 1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
8 1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
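The firstseen and lastseen columns look like Unix timestamps (seconds since 1970-01-01 UTC); that reading is an assumption, since the slide does not say. Under that assumption, the following minimal Python sketch turns them into dates and route lifetimes; `seen` and `lifetime_days` are hypothetical helpers.

```python
from datetime import datetime, timezone

def seen(epoch_seconds):
    """Format a firstseen/lastseen value as a UTC date and time."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

def lifetime_days(firstseen, lastseen):
    """Days between the first and last sighting of a route."""
    return (lastseen - firstseen) / 86400

# Route 0 from the table above.
print(seen(1086844945))                                 # 2004-06-10 05:22:25
print(round(lifetime_days(1086844945, 1089705757), 1))  # 33.1
```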
37History Channel
38AS information