Title: Les Cottrell
1 End-to-end Monitoring in ESnet/HENP: the relevance of ping
- Les Cottrell
- SLAC, Stanford University
- <cottrell_at_slac.stanford.edu>
- Presented at the NLANR organized workshop on Challenges and opportunities for measurements and analysis in a high performance environment, July 1, 1999, SDSC.
- Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM)
2 Outline of talk
- Review relevance of tools, deployment
- Illustrate the type of information that is provided and how it relates to applications, e.g. TCP and VoIP
- Long term trends
- Community comparisons
- Challenges
- continued validity of ping, comparison with Surveyor
- other work and coordination
- Future work
3 Main tool (PingER) currently uses Ping
- Treats Internet as black box
- Provides useful real world measures of network round trip response time, loss, reachability, jitter
- Low cost/lightweight tool
- ping universally available, easy to understand
- no software for clients to install
- no special privileges needed for monitor sites
- resources: 100 bps/link, 600 kBytes/month/link
- Ping mature, well understood, widely available
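As an illustration of how the measures listed on this slide might be derived from one batch of ping results, here is a minimal Python sketch. The function name `summarize_pings`, the use of `None` to mark a lost packet, the "any reply means reachable" rule, and the choice of interquartile range as the jitter measure are all assumptions made for this example, not PingER's actual definitions.

```python
import statistics
from typing import Optional, Sequence


def summarize_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of pings; None marks a probe with no reply."""
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("need at least one probe")
    replies = sorted(r for r in rtts_ms if r is not None)
    summary = {
        "sent": sent,
        "received": len(replies),
        "loss_fraction": (sent - len(replies)) / sent,
        # Assumption: the host counts as reachable if any probe was answered.
        "reachable": bool(replies),
        "median_rtt_ms": None,
        "jitter_iqr_ms": None,
    }
    if replies:
        summary["median_rtt_ms"] = statistics.median(replies)
    if len(replies) >= 2:
        # Assumption: jitter is taken as the interquartile range of the RTTs.
        q1, _, q3 = statistics.quantiles(replies, n=4)
        summary["jitter_iqr_ms"] = q3 - q1
    return summary


# Made-up batch: five probes, one of them lost.
print(summarize_pings([10.0, 12.0, None, 11.0, 13.0]))
```

A median and an interquartile range are used here because they are less sensitive to the occasional very slow reply than a mean and standard deviation would be.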
4 Examples of relevance to applications
- Relates to Web performance: small files dominated by RTT
- BW_TCP < (MSS/RTT)(1/sqrt(loss))
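The bandwidth bound on this slide can be evaluated directly from ping measurements of RTT and loss. The sketch below is a minimal illustration. The function name `tcp_bandwidth_bound`, the choice of units (bytes, seconds, loss as a fraction, result in bits per second) and the example link values are assumptions, not taken from the talk.

```python
import math


def tcp_bandwidth_bound(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Upper bound on TCP throughput in bits/s: (MSS/RTT) * (1/sqrt(loss))."""
    if rtt_s <= 0 or not 0 < loss <= 1:
        raise ValueError("need rtt_s > 0 and 0 < loss <= 1")
    return 8 * mss_bytes / rtt_s / math.sqrt(loss)


# Hypothetical link: 1460-byte MSS, 100 ms RTT, 1% packet loss.
bound = tcp_bandwidth_bound(1460, 0.100, 0.01)
print(f"{bound / 1e6:.2f} Mbit/s")
```

With these made-up inputs the bound comes out at about 1.17 Mbit/s. Halving the RTT doubles the bound, while the loss has to fall by a factor of four to have the same effect.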
5 Scale of Measurements
- 18 Monitoring sites: 7 in US (5 ESnet, 2 vBNS), 2 in Canada, 7 in Europe (ch, de, dk, hu, it, uk(2)), 2 in Asia (jp, tw)
- 1261 monitoring-remote-site pairs
- 379 unique hosts, 272 sites
- 50 beacon sites, 27 countries
- Metrics include response, jitter, loss, reachability
- Data goes back > 4 years
- 1 Million probes of Internet / day
6 XIWT/IPERF
- 2nd instance of PingER tools deployed by XIWT
- at 10 monitoring sites (Bellsouth, CNRI, Digital/Compaq(2), DirecPC, HP, Intel, NIST, SLAC, Westgroup)
- mainly full mesh pinging
- 150 pairs
- different community of interest
- more commercial (70% .com, 20% .org, 10% .edu)
7 Web Interface
http://www.slac.stanford.edu//xorg/iepm/pinger/table.html
8 Effect of STAR-TAP on KEK.jp <-> SLAC
[Figure: Ping RTT in msec. and packet loss, September 1 to December 31 1998]
9 Improvement in RTT
10 Improvement in packet loss
11 Bandwidth improvement from ESnet sites
TCP bandwidth < (1470/RTT)(1/sqrt(loss))
12 ESnet, I2, XIWT, Euro-Labs
13 Calibration of ping
- Sanity checks
- host pings itself, host pings host at same site
- high statistics between a few sites and inside site
- see www.slac.stanford.edu/comp/net/wan-mon/ping-hi-stat.html
- look at subtle behaviors, e.g. RTT distribution tails
- check wire time (sniffer) vs. ping reported times, at client and server
- see www.slac.stanford.edu/comp/net/wan-mon/error.html
- Correlate with Surveyor one-way measures
14 Natural enemies of ping
- Poor choice of remote host (clustered, variable load...) or monitoring host
- Ping program problems and pathologies
- Some implementations have bugs, or are incomplete
- Spurious packets confuse ping programs (< 0.2% effect)
- e.g. program sends 5 packets, sees 10.
- Out of order packets (< 0.02% effect)
- Some sites/hosts block pings
- Other sites limit pings to a certain size
- Rate limiting, e.g. some sites filter out ICMP traffic during high usage or all the time
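Pathologies such as spurious, duplicate and out of order replies can be flagged from the sequence numbers of the replies to one ping run. The following Python sketch shows one possible way to do this. The function name `classify_replies`, the 0-based sequence numbering and the classification rules are assumptions for illustration, not the checks PingER itself performs.

```python
from typing import Iterable


def classify_replies(sent: int, received_seqs: Iterable[int]) -> dict:
    """Count pathologies in one ping run.

    sent: number of probes sent, numbered 0 .. sent-1 (assumed numbering).
    received_seqs: sequence numbers of the replies, in arrival order.
    """
    seen = set()
    duplicates = out_of_order = spurious = 0
    highest = -1
    for seq in received_seqs:
        if not 0 <= seq < sent:
            spurious += 1          # reply to a probe this run never sent
        elif seq in seen:
            duplicates += 1        # same probe answered more than once
        else:
            if seq < highest:
                out_of_order += 1  # arrived after a later-numbered reply
            seen.add(seq)
            highest = max(highest, seq)
    return {
        "lost": sent - len(seen),
        "duplicates": duplicates,
        "out_of_order": out_of_order,
        "spurious": spurious,
    }


# Made-up run: 5 probes sent, reply 1 duplicated, reply 2 late, 4 lost,
# and one stray reply carrying sequence number 7.
print(classify_replies(5, [0, 1, 1, 3, 2, 7]))
```

Keeping these counts separate from the loss figure avoids a run where the program "sends 5 packets, sees 10" being reported as negative loss.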
15 Impact of limiting
16 Ping limiting/blocking
- First noticed in 1996
- protect against ping o'death (OS) and smurf attacks (directed broadcasts)
- Host requirement to implement ping
- but not to execute, and probably blocked at firewall
- First step for cracker scanning a site
- Identified at 2 hosts (i.e. currently a small effect)
- http://www.slac.stanford.edu/comp/net/wan-mon/pathology.html
17 Avoiding
- careful choice of host => beacon sites
- working with remote sites and ISPs
- using TCP echo or UDP echo (security), but crackers will find them and they are often already blocked
- new protocol designed for measurement (IPMP)
- special purpose measurement machines and protocols
18 Surveyor / RIPE
- Dedicated PC running Unix at key sites
- GPS for clock synchronization
- One way delay and loss measurements
- Community is Internet 2 clients
- HEP sites collaborating with Surveyor
- deployed in HENP community (CERN (Geneva), FNAL (Chicago), SLAC (Silicon Valley - SF))
- using PingER analysis tools on Surveyor data
19 Comparing PingER and Surveyor
20 Comparing Surveyor and ping/ER results
- Took Surveyor data between SLAC, FNAL and CERN, Nov-98 thru May-99
- Reformatted into PingER format, allows viewing with PingER tools
- metrics: loss, delay, unreachability, unpredictability
- hourly, daily, monthly ticks
- sorting, exporting to Excel
- Also made some high statistics ping measurements and compared with Surveyor
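One way such a comparison could be set up is to add the one-way delays measured in each direction to get a round-trip estimate, then correlate that series with the ping RTTs for the same time ticks. The Python sketch below illustrates the idea. The function names, the assumption that the two series are already aligned tick by tick, and all the numbers are made up for this example and are not the data or the method used in the talk.

```python
import math
from typing import List, Sequence


def round_trip_from_one_way(a_to_b_ms: Sequence[float],
                            b_to_a_ms: Sequence[float]) -> List[float]:
    """Estimate round-trip delay per tick by adding the two one-way delays."""
    if len(a_to_b_ms) != len(b_to_a_ms):
        raise ValueError("series must cover the same ticks")
    return [fwd + rev for fwd, rev in zip(a_to_b_ms, b_to_a_ms)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Made-up hourly medians in msec for one pair of sites.
one_way_out = [80.0, 82.0, 85.0, 90.0]
one_way_back = [78.0, 79.0, 84.0, 88.0]
ping_rtt = [160.0, 163.0, 171.0, 180.0]

estimate = round_trip_from_one_way(one_way_out, one_way_back)
print(f"correlation = {pearson(estimate, ping_rtt):.3f}")
```

A constant offset between the two series (for example processing time in the responding host) does not change the correlation, so an offset should be checked separately, e.g. by looking at the mean difference.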
21 Surveyor vs. ping RTT
22 Surveyor vs. ping
23 Surveyor vs. Ping Correlation
24 PingER and Surveyor
25 PingER vs Surveyor
26 PingER - Surveyor Complementarity
- Agree well
- Surveyor has one way measurements, PingER only round-trip
- Surveyor has dedicated platforms and strong central management
- experience with PingER shows this has benefits.
- PingER more parsimonious/lightweight (bandwidth, disk space, cpu)
- better for poor connectivity sites, e.g. Russia, China
- but necessarily less accurate, especially at small (hourly) time resolution on low loss links.
- PingER good for looking at long term trends and grouping, where statistics are less of a problem.
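The loss of accuracy at hourly resolution on low loss links follows from simple counting statistics: with few probes per hour, a loss rate of around 1% rests on a handful of lost packets at most. The Python sketch below estimates the uncertainty with the binomial standard error. It assumes packet losses are independent, which real networks often violate, and the probe counts per hour, day and month are hypothetical, not PingER's actual sampling rates.

```python
import math


def loss_standard_error(loss: float, probes: int) -> float:
    """Binomial standard error of a loss fraction measured with `probes` pings.

    Assumes each packet is lost independently with probability `loss`.
    """
    if probes <= 0 or not 0 <= loss <= 1:
        raise ValueError("need probes > 0 and 0 <= loss <= 1")
    return math.sqrt(loss * (1 - loss) / probes)


# Hypothetical probe counts for an hour, a day and a month of pings.
for label, probes in [("hour", 20), ("day", 480), ("month", 14400)]:
    se = loss_standard_error(0.01, probes)
    print(f"{label:>5}: {probes:6d} probes, 1% loss +/- {100 * se:.2f}%")
```

The error shrinks only as the square root of the number of probes, so four times as many probes are needed to halve it. This is why aggregating over longer periods or over groups of sites makes the statistics less of a problem.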
27 Work in progress 1/2
- Random scheduling of pings (in beta at 2 sites)
- Recording more information (in production at XIWT)
- Flexibility in choice of packet sizes, frequencies (tailor to bandwidth between pair)
- Look for ICMP rate limiting signatures
- Install RIPE engine at SLAC
- Correlate AMP data (AMP running at SLAC)
- Gather historical route information for PingER
- Calibrate ping jitter against VoIP jitter
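The slide does not say how the random scheduling of pings is done. One common choice, assumed here, is Poisson sampling: the gaps between probe batches are drawn from an exponential distribution, so that probes do not stay synchronized with periodic events on the network. The function name `poisson_schedule`, its parameters and the 30 minute mean interval are hypothetical.

```python
import random
from typing import List, Optional


def poisson_schedule(mean_interval_s: float, duration_s: float,
                     seed: Optional[int] = None) -> List[float]:
    """Return probe start times (seconds from start) with exponential gaps."""
    if mean_interval_s <= 0 or duration_s <= 0:
        raise ValueError("mean_interval_s and duration_s must be positive")
    rng = random.Random(seed)
    rate = 1.0 / mean_interval_s
    times = []
    t = rng.expovariate(rate)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Hypothetical schedule: one day of probes, 30 minutes apart on average.
schedule = poisson_schedule(mean_interval_s=1800, duration_s=86400, seed=1)
print(len(schedule), "probe batches; first at", round(schedule[0]), "s")
```

Passing a seed makes a schedule reproducible for testing. In production the seed would normally be left unset so that monitoring sites do not share the same schedule.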
28 Work in Progress 2/2
- Calibrate using ping to measure QoS effects
- setting up QoS testbed between SLAC and LBNL
- Other possibilities
- Thinking about extending framework to other apps, e.g. following IPMP work, TCP/UDP echo, http (CERN interested, XIWT also interested, there are also commercial tools, but expensive)
- Generate alerts (HEPNRC)
29 Monitoring Conclusions
- Performance is improving
- ESnet and vBNS/Internet 2 are well configured and provide good service within and between their nets
- Performance within A&R networks is generally good
- Minimize ISPs crossed, peering critical
- Intercontinental performance is poor to bad
- Today need headroom, or managed bandwidth, QoS in future
- End users need monitoring to know what to expect, write SLAs, set baselines, ID problems, plan
30 More Information (extra info follows)
- WAN Monitoring at SLAC has lots of links
- http://www.slac.stanford.edu/comp/net/wan-mon.html
- Tutorial on WAN Monitoring (including methods, RTT, jitter, loss, QoS thresholds etc.)
- http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
- PingER History tables
- http://www.slac.stanford.edu//xorg/iepm/pinger/table.html
- Internet Monitoring in the HEP Community, SLAC-PUB-7961, presented at CHEP98, Chicago, Aug-98
- http://www.slac.stanford.edu/pubs/slacpubs/7000/slac-pub-7961.html