Title: ESNet Update Joint Techs Meeting, July 19, 2004
1 ESnet Update, Joint Techs Meeting, July 19, 2004
- William E. Johnston, ESnet Dept. Head and Senior Scientist
- R. P. Singh, Federal Project Manager
- Michael S. Collins, Stan Kluz, Joseph Burrescia, and James V. Gagliardi, ESnet Leads
- and the ESnet Team
- Lawrence Berkeley National Laboratory
2 ESnet Connects DOE Facilities and Collaborators
[Network map] International connections: CAnet4, CERN, MREN, Netherlands, Russia, StarTap, Taiwan (ASCC), Japan
Hubs, peering points, and peer networks shown: PNWG, SEA HUB, SNV HUB, Chi NAP, NY-NAP, MAE-E, MAE-W, PAIX-E, PAIX-W, Fix-W, Equinix, QWEST ATM, ESnet IP, Abilene (at several points)
42 end user sites
- Office of Science Sponsored (22)
- NNSA Sponsored (12)
- Joint Sponsored (3)
- Other Sponsored (NSF LIGO, NOAA)
- Laboratory Sponsored (6)
Link legend: International (high speed); OC192 (10 Gb/s optical); OC48 (2.5 Gb/s optical); Gigabit Ethernet (1 Gb/s); OC12 ATM (622 Mb/s); OC12; OC3 (155 Mb/s); T3 (45 Mb/s); T1-T3; T1 (1 Mb/s)
ESnet core: Packet over SONET Optical Ring and Hubs; IPv6 backbone and numerous peers
Map symbols: peering points, hubs (e.g. SNV HUB)
3 ESnet Peering (connections to other networks)
[Peering map; legend: University, International, Commercial]
International peers: CAnet4, CERN, MREN, Netherlands, Russia, StarTap, Taiwan (ASCC); Australia, CAnet4, Taiwan (TANet2), Singaren; GEANT (Germany, France, Italy, UK, etc.); SInet (Japan), KEK, Japan, Russia (BINP); KDDI (Japan), France
Hubs and peering points: PNW-GPOP, SEA HUB, Distributed 6TAP (19 peers), CalREN2, NYC HUBS, LBNL, SNV HUB, PAIX-W, MAX GPOP, MAE-W, FIX-W, LANL, CENIC, SDSC, ATL HUB, TECHnet; Abilene appears at several hubs (one label reads "Abilene 7 Universities")
Peer counts shown at individual points: 2, 1, 1, 5, 2, 26, 22, 39, 20, 6, 3
ESnet provides complete access to the Internet by managing the full complement of Global Internet routes (about 150,000) at 10 general/commercial peering points, plus high-speed peerings with Abilene and the international networks.
4 ESnet New Architecture Goal
- MAN rings provide dual site and hub connectivity
- A second backbone ring will multiply connect the
MAN rings to protect against hub failure
5 First Step: SF Bay Area ESnet MAN Ring
- Increased reliability and site connection bandwidth
- Phase 1: connects the primary Office of Science Labs in a MAN ring
- Phase 2: LLNL, SNL, and UC Merced
- Ring should not connect directly into ESnet SNV hub (still working on physical routing for this)
- Have not yet identified both legs of the mini ring
[Figure: SF BA MAN ring topology, phase 1. Labels: Joint Genome Institute, LBNL, NERSC, SLAC, SF Bay Area mini ring, Qwest / ESnet hub, Level 3 hub, NLR / UltraScienceNet, Existing ESnet Core Ring; core links toward Seattle and Chicago, Chicago, El Paso, LA and San Diego]
6 Traffic Growth Continues
[Figure: ESnet Monthly Accepted Traffic, TBytes/Month]
ESnet is currently transporting about 250 terabytes/mo.
Annual growth in the past five years has
increased from 1.7x annually to just over 2.0x
annually.
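As a rough illustration of what these growth factors imply, the sketch below compounds the roughly 250 terabytes/mo. figure forward. It assumes the annual factor stays constant, which the slide does not claim; the function names are made up for this example and the output is not an ESnet forecast.

```python
import math


def projected_monthly_traffic(current_tb, annual_factor, years):
    """Compound the current monthly volume by a constant annual growth factor."""
    return current_tb * annual_factor ** years


def doubling_time_months(annual_factor):
    """Months needed for traffic to double at a constant annual growth factor."""
    return 12 * math.log(2) / math.log(annual_factor)


if __name__ == "__main__":
    # 1.7x and 2.0x are the two annual growth factors quoted on the slide.
    for factor in (1.7, 2.0):
        tb = projected_monthly_traffic(250, factor, 3)
        months = doubling_time_months(factor)
        print(f"{factor}x/year: {tb:.0f} TB/mo after 3 years, "
              f"doubling every {months:.1f} months")
```

At 2.0x annually the volume doubles every 12 months by definition; at 1.7x the doubling time is somewhat longer.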
7 Who Generates Traffic, and Where Does it Go?
ESnet Inter-Sector Traffic Summary, Jan 2003 / Feb 2004 (1.7X overall traffic increase, 1.9X OSC increase) (the international traffic is increasing due to BABAR at SLAC and the LHC tier 1 centers at FNAL and BNL)
[Figure: traffic between DOE sites, ESnet, and its peering points to Commercial, R&E (mostly universities), and International networks (DOE collaborator traffic, inc. data). Flow labels: 72/68, 21/14, 14/12, 25/18, 17/10, 10/13, 53/49, 9/26, 4/6. Legend: traffic coming into ESnet = green; traffic leaving ESnet = blue; traffic between sites; values are % of total ingress or egress traffic]
DOE is a net supplier of data because DOE facilities are used by universities and commercial entities, as well as by DOE researchers
Note that more than 90% of the ESnet traffic is OSC traffic
ESnet Appropriate Use Policy (AUP): all ESnet traffic must originate and/or terminate on an ESnet site (no transit traffic is allowed)
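The "no transit traffic" rule can be read as a simple predicate on a flow's two endpoints. The sketch below is only an illustration of that reading: the prefix list is hypothetical (RFC 5737 documentation ranges, not real ESnet allocations), and the function names are invented for the example.

```python
import ipaddress

# Hypothetical site prefixes for illustration only (RFC 5737 documentation
# ranges); these are NOT real ESnet site allocations.
ESNET_SITE_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_esnet_site(addr):
    """True if the address falls inside one of the (assumed) site prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ESNET_SITE_PREFIXES)


def complies_with_aup(src, dst):
    """A flow is acceptable if it originates and/or terminates on an ESnet site."""
    return is_esnet_site(src) or is_esnet_site(dst)


if __name__ == "__main__":
    print(complies_with_aup("192.0.2.10", "203.0.113.5"))   # site to external
    print(complies_with_aup("203.0.113.5", "203.0.113.9"))  # pure transit
```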
8 ESnet Top 20 Data Flows, 24 hrs., 2004-04-20
A small number of science users account for a significant fraction of all ESnet traffic
[Figure: chart of the top flows, with a 1 terabyte/day reference mark. Flows labeled:]
SLAC (US) → IN2P3 (FR)
Fermilab (US) → CERN
SLAC (US) → INFN Padova (IT)
Fermilab (US) → U. Chicago (US)
U. Toronto (CA) → Fermilab (US)
CEBAF (US) → IN2P3 (FR)
INFN Padova (IT) → SLAC (US)
DFN-WiN (DE) → SLAC (US)
Fermilab (US) → JANET (UK)
SLAC (US) → JANET (UK)
DOE Lab → DOE Lab
Argonne (US) → Level3 (US)
DOE Lab → DOE Lab
Fermilab (US) → INFN Padova (IT)
Argonne → SURFnet (NL)
IN2P3 (FR) → SLAC (US)
9 Top 50 Traffic Flows Monitoring: 24 hr, 2 Intl and 2 Commercial Peering Points
10 flows > 100 GBy/day
More than 50 flows > 10 GBy/day
10 Disaster Recovery and Stability
- Engineers, 24x7 Network Operations Center, generator backed power
- Spectrum (net mgmt system)
- DNS (name to IP address translation)
- Eng database
- Load database
- Config database
- Public and private Web
- E-mail (server and archive)
- PKI cert. repository and revocation lists
- collaboratory authorization service
- Remote Engineer
- partial duplicate infrastructure
[Additional map labels: DNS; Remote Engineer; Remote Engineer, partial duplicate infrastructure]
Duplicate Infrastructure: currently deploying full replication of the NOC databases and servers and Science Services databases in the NYC Qwest carrier hub
- The network must be kept available even if, e.g.,
the West Coast is disabled by a massive
earthquake, etc.
- high physical security for all equipment
- non-interruptible core: the ESnet core operated without interruption through
- the N. Calif. power blackout of 2000
- the 9/11/2001 attacks, and
- the Aug. 2003 NE States power blackout
- Reliable operation of the network involves
- remote Network Operation Centers (3)
- replicated support infrastructure
- generator backed UPS power at all critical
network and infrastructure locations
11 Disaster Recovery and Stability
- Duplicate NOC infrastructure to AoA hub in two phases, complete by end of the year
- 9 servers: dns, www, www-eng and noc5 (eng. databases), radius, aprisma (net monitoring), tts (trouble tickets), pki-ldp (certificates), mail
12 Maintaining Science Mission Critical Infrastructure in the Face of Cyberattack
- A Phased Response to Cyberattack is being implemented to protect the network and the ESnet sites
- The phased response ranges from blocking certain site traffic to a complete isolation of the network, which allows the sites to continue communicating among themselves in the face of the most virulent attacks
- Separates ESnet core routing functionality from external Internet connections by means of a peering router that can have a policy different from the core routers
- Provide a rate limited path to the external Internet that will ensure site-to-site communication during an external denial of service attack
- Provide lifeline connectivity for downloading of patches, exchange of e-mail and viewing web pages (i.e. e-mail, dns, http, https, ssh, etc.) with the external Internet prior to full isolation of the network
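The slide does not say how the rate limited path is implemented. One common way to cap a path at a fixed rate is a token bucket, sketched here as an assumption rather than as ESnet's mechanism. The class name, parameters, and byte-based accounting are all illustrative, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Minimal token-bucket limiter (illustrative; not ESnet's configuration)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s      # sustained rate allowed on the path
        self.capacity = burst_bytes       # largest burst the bucket can absorb
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # timestamp of the previous decision

    def allow(self, packet_bytes, now):
        """Refill for the elapsed time, then admit the packet if tokens suffice."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
    for t, size in [(0.0, 1500), (0.5, 1500), (1.0, 1000)]:
        print(t, size, bucket.allow(size, now=t))
```

Traffic beyond the configured rate is refused, so an external flood cannot consume the capacity that site-to-site traffic depends on.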
13 Phased Response to Cyberattack
- Lab first response: filter incoming traffic at their ESnet gateway router
- ESnet first response: filters to assist a site
- ESnet second response: filter traffic from outside of ESnet
- ESnet third response: shut down the main peering paths and provide only limited bandwidth paths for specific lifeline services
[Figure labels: attack traffic, peering router, ESnet router, gateway router, border router, LBNL, Lab; X marks on links]
- Sapphire/Slammer worm infection created a Gb/s of
traffic on the ESnet core until filters were put
in place (both into and out of sites) to damp it
out.
14 Phased Response to Cyberattack
15 Grid Middleware Services
- ESnet is the natural provider for some science services: services that support the practice of science
- ESnet is trusted, persistent, and has a large (almost comprehensive within DOE) user base
- ESnet has the facilities to provide reliable access and high availability through assured network access to replicated services at geographically diverse locations
- However, service must be scalable in the sense that as its user base grows, ESnet interaction with the users does not grow (otherwise not practical for a small organization like ESnet to operate)
16 Grid Middleware Requirements (DOE Workshop)
- A DOE workshop examined science driven requirements for network and middleware and identified twelve high priority middleware services (see www.es.net/research)
- Some of these services have a central management component and some do not
- Most of the services that have central management fit the criteria for ESnet support. These include, for example:
- Production, federated RADIUS authentication service
- PKI federation services
- Virtual Organization Management services to manage organization membership, member attributes and privileges
- Long-term PKI key and proxy credential management
- End-to-end monitoring for Grid / distributed application debugging and tuning
- Some form of authorization service (e.g. based on RADIUS)
- Knowledge management services that have the characteristics of an ESnet service are also likely to be important (future)
17 Science Services: PKI Support for Grids
- Public Key Infrastructure supports cross-site, cross-organization, and international trust relationships that permit sharing computing and data resources and other Grid services
- The DOEGrids Certification Authority service, which provides X.509 identity certificates to support Grid authentication, is an example of this model
- The service requires a highly trusted provider, and requires a high degree of availability
- Federation: ESnet as service provider is a centralized agent for negotiating trust relationships, e.g. with European CAs
- The service scales by adding site based or Virtual Organization based Registration Agents that interact directly with the users
- See DOEGrids CA (www.doegrids.org)
18 ESnet PKI Project
- DOEGrids Project Milestones
- DOEGrids CA in production June, 2003
- Retirement of initial DOE Science Grid CA (Jan 2004)
- Black rack installation completed for DOE Grids CA (Mar 2004)
- New Registration Authorities
- FNAL (Mar 2004)
- LCG (LHC Computing Grid) catch-all near completion
- NCC-EPA in progress
- Deployment of NERSC myProxy CA
- Grid Integrated RADIUS Authentication Fabric pilot
19 DOEGrids Security
[Figure labels: RAs and certificate applicants, Internet, Firewall, Bro Intrusion Detection, PKI Systems, HSM, Vaulted Root CA, Secure racks, Secure Data Center, Building Security, LBNL Site security]
20 Science Services: Public Key Infrastructure
- The rapidly expanding customer base of this service will soon make it ESnet's largest collaboration service by customer count
Registration Authorities: ANL, LBNL, ORNL, DOESG (DOE Science Grid), ESG (Climate), FNAL, PPDG (HEP), Fusion Grid, iVDGL (NSF-DOE HEP collab.), NERSC, PNNL
21 ESnet PKI Project (2)
- New CA initiatives
- FusionGrid CA
- ESnet SSL Server Certificate CA
- Mozilla browser CA cert distribution
- Script-based enrollment
- Global Grid Forum documents
- Policy Management Authority Charter
- OCSP (Online Certificate Status Protocol) Requirements For Grids
- CA Policy Profiles
22 Grid Integrated RADIUS Authentication Fabric
- RADIUS routing of authentication requests
- Support One-Time Password initiatives
- Gateway Grid and collaborative uses: standard UI and API
- Provide secure federation point with O(n) agreements
- Support multiple vendor / site OTP implementations
- One token per user (SSO-like solution) for OTP
- Collaboration between ESnet, NERSC, and a RADIUS appliance vendor; PNNL and ANL are also involved, others welcome
- White paper/report 01 Sep 2004 to support early implementers, proceed to pilot
- Project pre-proposal: http://www.doegrids.org/CA/Research/GIRAF.pdf
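Two of the points above lend themselves to a small sketch: why a single federation point needs only O(n) agreements, and what routing an authentication request might look like. The realm-based lookup shown here is an assumption (routing on the realm part of `user@realm` is a common RADIUS proxy practice), not a description of the GIRAF design. All realm and server names are hypothetical.

```python
def pairwise_agreements(n_sites):
    """Agreements needed if every site negotiates with every other site."""
    return n_sites * (n_sites - 1) // 2


def federated_agreements(n_sites):
    """Agreements needed if every site negotiates only with the federation point."""
    return n_sites


# Hypothetical realm table; names are placeholders for illustration only.
REALM_TO_SERVER = {
    "site-a.example": "radius.site-a.example",
    "site-b.example": "radius.site-b.example",
}


def route_request(user_id):
    """Return the home RADIUS server for user@realm, or None if unknown."""
    _, sep, realm = user_id.rpartition("@")
    if not sep:
        return None  # no realm present, so nothing to route on
    return REALM_TO_SERVER.get(realm.lower())


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, pairwise_agreements(n), federated_agreements(n))
    print(route_request("alice@site-a.example"))
```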
23 Collaboration Service
- H.323 showing dramatic increase in usage
24 Grid Network Services Requirements (GGF, GHPN)
- Grid High Performance Networking Research Group, Networking Issues of Grid Infrastructures (draft-ggf-ghpn-netissues-3): what networks should provide to Grids
- High performance transport for bulk data transfer (over 1 Gb/s per flow)
- Performance controllability to provide ad hoc quality of service and traffic isolation
- Dynamic Network resource allocation and reservation
- High availability when expensive computing or visualization resources have been reserved
- Security controllability to provide a trusted and efficient communication environment when required
- Multicast to efficiently distribute data to group of resources
- Integrated wireless network and sensor networks in Grid environment
25 Priority Service
- So, practically, what can be done?
- With available tools can provide a small number of provisioned, bandwidth guaranteed, circuits
- secure and end-to-end (system to system)
- various Quality of Service possible, including minimum latency
- a certain amount of route reliability (if redundant paths exist in the network)
- end systems can manage these circuits as single high bandwidth paths or multiple lower bandwidth paths (with application level shapers)
- non-interfering with production traffic, so aggressive protocols may be used
26 Guaranteed Bandwidth as an ESnet Service
- A DOE Network R&D funded project
- Allocation will probably be relatively static and ad hoc
- There will probably be service level agreements among transit networks allowing for a fixed amount of priority traffic, so the resource manager does minimal checking and no authorization
- Will do policing, but only at the full bandwidth of the service agreement (for self protection)
[Figure labels: user system1, site A, user system2, site B, bandwidth broker, authorization, resource manager, policer, Phase 1, Phase 2]
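A minimal sketch of how a resource manager and policer of this kind could interact, under stated assumptions: reservations are admitted only while their sum stays within a fixed service-agreement allowance, and policing is applied only to the aggregate at that full allowance. The class, method names, and Mb/s units are hypothetical; the slide describes the roles, not an implementation.

```python
class ResourceManager:
    """Tracks circuit reservations against a fixed priority-traffic allowance."""

    def __init__(self, sla_mbps):
        self.sla_mbps = sla_mbps   # total priority bandwidth in the agreement
        self.reservations = {}     # circuit id -> reserved Mb/s

    def reserved_mbps(self):
        return sum(self.reservations.values())

    def reserve(self, circuit_id, mbps):
        """Admit a reservation only if it fits in the remaining allowance."""
        if mbps <= 0 or circuit_id in self.reservations:
            return False
        if self.reserved_mbps() + mbps > self.sla_mbps:
            return False
        self.reservations[circuit_id] = mbps
        return True

    def release(self, circuit_id):
        self.reservations.pop(circuit_id, None)


def police(offered_mbps, sla_mbps):
    """Aggregate policer: carry at most the agreed rate and drop the excess."""
    carried = min(offered_mbps, sla_mbps)
    return carried, offered_mbps - carried


if __name__ == "__main__":
    rm = ResourceManager(sla_mbps=1000)
    print(rm.reserve("circuit-a", 600), rm.reserve("circuit-b", 500))
    print(police(offered_mbps=1200, sla_mbps=1000))
```

Policing only the aggregate keeps the check cheap, in line with the slide's point that the resource manager does minimal checking.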
27 Network Monitoring System
- Alarms Data Reduction
- From June 2003 through April 2004 the total number of NMS up/down alarms was 16,342 or 48.8 per day.
- Path based outage reporting automatically isolated 1,448 customer relevant events during this period or an average of 4.3 per day, more than a 10 fold reduction.
- Based on total outage duration in 2004, approximately 63% of all customer relevant events have been categorized as either Planned or Unplanned and one of ESnet, Site, Carrier or Peer
- Gives us a better handle on availability metric
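The per-day rates and the reduction factor quoted above can be checked with a few lines of arithmetic. The only assumption is that the period covers whole calendar months, June 1, 2003 through April 30, 2004.

```python
from datetime import date

# Assumed period boundaries: whole months, June 2003 through April 2004.
start, end = date(2003, 6, 1), date(2004, 4, 30)
days = (end - start).days + 1  # count both endpoints

raw_alarms = 16342        # NMS up/down alarms quoted on the slide
customer_events = 1448    # customer relevant events quoted on the slide

alarms_per_day = raw_alarms / days
events_per_day = customer_events / days
reduction = raw_alarms / customer_events

if __name__ == "__main__":
    print(f"{days} days, {alarms_per_day:.1f} alarms/day, "
          f"{events_per_day:.1f} events/day, {reduction:.1f}x reduction")
```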
28 2004 Availability by Month
Jan. to June, 2004, Corrected for Planned Outages (More from Mike O'Connor)
[Figure: Unavailable Minutes; legend: < 99.9% available, > 99.9% available]
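To relate the chart's "Unavailable Minutes" axis to the 99.9% threshold, the sketch below converts between the two for a given calendar month. The function names are illustrative, and how planned outages are subtracted is not modeled here.

```python
import calendar


def allowed_downtime_minutes(year, month, availability=0.999):
    """Unavailable minutes in a month that still meet the availability target."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60 * (1 - availability)


def availability_from_minutes(year, month, unavailable_minutes):
    """Availability fraction implied by a number of unavailable minutes."""
    days = calendar.monthrange(year, month)[1]
    return 1 - unavailable_minutes / (days * 24 * 60)


if __name__ == "__main__":
    for month in range(1, 7):  # Jan. to June, 2004
        budget = allowed_downtime_minutes(2004, month)
        print(f"2004-{month:02d}: {budget:.1f} minutes at 99.9%")
```

For a 30-day month the 99.9% threshold corresponds to about 43 unavailable minutes.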
29 ESnet Abilene Measurements
- We want to ensure that the ESnet/Abilene cross connects are serving the needs of users in the science community who are accessing DOE facilities and resources from universities or accessing university facilities from DOE labs.
- Measurement sites in place
- More from Joe Metzger
- 3 ESnet Participants
- LBL
- FERMI
- BNL
- 3 Abilene Participants
- SDSC
- NCSU
- OSU
30 OWAMP One-Way Delay Tests Are Highly Sensitive
- NCSU Metro DWDM reroute adds about 350 microseconds
[Figure: one-way delay in ms, axis from 41.5 to 42.0, with the fiber re-route marked]
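For a sense of scale, 350 microseconds of added one-way delay corresponds to tens of kilometers of extra fiber. The sketch below assumes a group index of about 1.468, a typical value for single-mode fiber and not a figure from the slide. It also attributes all of the added delay to propagation, so the result is an order-of-magnitude estimate only.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_GROUP_INDEX = 1.468        # assumed typical value for single-mode fiber


def extra_fiber_km(extra_delay_us, group_index=FIBER_GROUP_INDEX):
    """Extra path length implied by added one-way delay, propagation only."""
    speed_km_per_us = C_VACUUM_KM_PER_S / group_index / 1e6
    return extra_delay_us * speed_km_per_us


if __name__ == "__main__":
    print(f"350 us of added delay is roughly {extra_fiber_km(350):.0f} km of fiber")
```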
31 ESnet Trouble Ticket System
- TTS used to track problem reports for the Network, ECS, DOEGrids, Asset Management, NERSC, and other services.
- Running Remedy ARsystem server and Oracle database on a Sun Ultra workstation.
- Total external tickets: 11750 (1995-2004), approx. 1300/year
- Total internal tickets: 1300 (1999-2004), approx. 250/year
32 Conclusions
- ESnet is an infrastructure that is critical to DOE's science mission and that serves all of DOE
- Focused on the Office of Science Labs
- ESnet is working on providing the DOE mission science networking requirements with several new initiatives and a new architecture
- QoS service is hard but we believe that we have enough experience to do pilot studies
- Middleware services for large numbers of users are hard but they can be provided if careful attention is paid to scaling