Title: EtherRake: Diagnosis and Monitoring in Data Center
1 EtherRake: Diagnosis and Monitoring in Data Center and Enterprise Networks
Lab for Internet and Security Technology (LIST)
Northwestern Univ.
2 General Idea of EtherRake
- Problem statement
- Emerging data center (DC) and enterprise networks consist mainly of large numbers of switches, which need monitoring and diagnosis
3 General Idea of EtherRake
- A centralized structure.
- Collector at each switch
- Collect neighbor information
- Collect port information
- Collect forwarding tables
- Monitor Plane
- Transmit collected information
- Processing Center
- Link the frames
- Construct Logical Topology
- Find the problems
4 Collector at Each Switch
- Take Cisco switches for example
- Port information
- show port status (display interface ethernet0/1 for Huawei)
- Neighbor information
- show cdp neighbors
- Forwarding tables (also known as switch tables)
- show MAC interface mapping
5 Collector at Each Switch
- Port information
- Port number: 2 bytes
- Status: 4 bits
- Total: 3 bytes × 100 = 300 bytes < 0.4 KB per switch
- Neighbor information
- MAC address: 48 bits
- Total: 6 bytes × 100 = 600 bytes < 0.6 KB per switch
- Forwarding tables
- To be decided. We are not using them in our approach now. We can transfer updates only, which means that normally we do not need to transfer anything.
- Total: 1 KB × 1024 (number of switches) = 1 MB in one round.
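The overhead estimate on this slide can be reproduced with a short calculation. This is only a sketch: it assumes the factor of 100 is the number of ports (and hence neighbors) per switch, and all function and constant names are illustrative rather than part of EtherRake.

```python
import math

PORTS_PER_SWITCH = 100   # assumption: the "x 100" factor on the slide
NUM_SWITCHES = 1024      # number of switches, from the slide


def port_record_bytes(port_number_bits=16, status_bits=4):
    """Size of one port record (2-byte port number + 4 status bits),
    rounded up to whole bytes."""
    return math.ceil((port_number_bits + status_bits) / 8)


def per_switch_bytes(ports=PORTS_PER_SWITCH):
    """Port information plus neighbor information for one switch."""
    port_info = port_record_bytes() * ports   # 3 B x 100 = 300 B
    neighbor_info = 6 * ports                 # 48-bit MAC x 100 = 600 B
    return port_info + neighbor_info


def per_round_bytes(switches=NUM_SWITCHES, budget_per_switch=1024):
    """Upper bound for one collection round, budgeting 1 KB per switch."""
    return switches * budget_per_switch


print(per_switch_bytes())   # 900, which fits in the 1 KB per-switch budget
print(per_round_bytes())    # 1048576 bytes, i.e. 1 MB
```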
6 Collector at Each Switch
- Synchronization
- Cristian's algorithm (P is the processing center, and S is a collector)
- P requests the time from S.
- After receiving the request from P, S prepares a response and appends the time T from its own clock.
- P then sets its time to T + RTT/2, where RTT is the round-trip time of the request.
- Multiple measurements can reduce the error.
- Accuracy: the true time lies between (T + min) and (T + RTT - min), where min is the minimum one-way time.
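The synchronization steps can be sketched as follows. This is a minimal illustration, not the EtherRake implementation: `request_time` stands in for whatever message exchange returns S's clock value T, and instead of overwriting P's clock the sketch returns an offset that P can add to its local clock.

```python
import time


def cristian_offset(request_time, local_clock=time.monotonic):
    """One round of Cristian's algorithm.

    request_time: callable that asks S for its clock value T (hypothetical).
    Returns (offset, rtt), where S's time is estimated as local time + offset.
    """
    t0 = local_clock()
    remote_t = request_time()
    t1 = local_clock()
    rtt = t1 - t0
    # At local time t1, S's clock is estimated to read T + RTT/2.
    return remote_t + rtt / 2 - t1, rtt


def best_offset(request_time, rounds=5, local_clock=time.monotonic):
    """Repeat the measurement and keep the sample with the smallest RTT,
    since it has the tightest error bound."""
    samples = [cristian_offset(request_time, local_clock) for _ in range(rounds)]
    return min(samples, key=lambda sample: sample[1])


def error_bound(rtt, min_one_way):
    """The true time lies in [T + min, T + RTT - min], so the midpoint
    T + RTT/2 is off by at most RTT/2 - min."""
    return rtt / 2 - min_one_way
```

Keeping the smallest-RTT sample is one common way to use multiple measurements; averaging the samples is another option.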
7 Monitor Plane
- The monitor plane co-exists with the data plane and the control plane on the same channel. It is used to transfer monitoring data.
8 Monitor Plane
- The monitor plane is used to collect data for monitoring the data plane.
- Switching in the monitor plane uses two methods.
- Normally, the control plane assists monitor plane forwarding.
- Under error, the monitor plane falls back to flooding.
9 Processing Center
- Collect port information, forwarding tables and neighbor information from all the switches.
- Construct the logical topology of switches based on the port and neighbor information.
- Detect loops in the logical topology for STP loop problems.
- Check for any missing/dead switches.
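The processing-center steps can be sketched in a few lines. The report format is an assumption made for illustration: each switch reports, per neighbor, the state of the port facing that neighbor, and a link counts as part of the logical topology only when both ends report `"forwarding"`. Parallel links between the same pair of switches are collapsed into one.

```python
def build_topology(reports):
    """reports: {switch_id: {neighbor_id: port_state}} (hypothetical format).
    Returns the set of links whose two ends are both forwarding."""
    links = set()
    for sw, neighbors in reports.items():
        for nb, state in neighbors.items():
            if nb == sw:
                continue  # ignore self-links
            back = reports.get(nb, {}).get(sw)
            if state == "forwarding" and back == "forwarding":
                links.add(frozenset((sw, nb)))
    return links


def has_loop(links):
    """Union-find cycle check on the undirected logical topology."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for link in links:
        a, b = tuple(link)
        ra, rb = find(a), find(b)
        if ra == rb:
            return True  # a and b were already connected: this link closes a loop
        parent[ra] = rb
    return False


def missing_switches(expected, reports):
    """Switches that are known to exist but sent no report this round."""
    return set(expected) - set(reports)
```

A healthy spanning tree yields `has_loop(...) == False`; a blocked port that wrongly starts forwarding shows up as an extra link that closes a cycle.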
10 Problems to Solve
- STP Error Detection
- End-to-end Error Detection
- Other Hardware/Software Errors of Switches and Their Detection
- TRILL Potential Problems
11 End-to-end Connectivity Monitoring
- Based on the neighbor and port information, check whether all switches and end hosts are on a connected spanning tree (ST).
- End hosts are also neighbors of leaf-node switches.
- The forwarding table also records information about past connectivity.
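The connectivity check can be sketched as a breadth-first search over the active links. The inputs are assumptions for illustration: `links` is a collection of two-element pairs, `all_nodes` lists every switch and end host that should be reachable, and `root` is any node to start from (for example the STP root).

```python
from collections import deque


def unreachable_nodes(links, all_nodes, root):
    """Return the nodes in all_nodes that are not connected to root."""
    adj = {node: set() for node in all_nodes}
    adj.setdefault(root, set())
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return set(adj) - seen
```

An empty result means every switch and end host sits on one connected tree; anything returned is a candidate for an end-to-end connectivity problem.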
12 Other Software Errors of Switches and Their Detection
- One-way link problem: no backward frames.
- From EtherRake's view, the interface in the other direction is dead.
- Deferred frames: the buffer is full, so frames have to be dropped.
- Encode the buffer status (e.g., full) in the status bits.
- Links between switches and routers disabled or not activated.
- Detected by the port status bits or lack of heartbeat.
- Switches down, e.g., unbootable IOS problems.
- Same as above.
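One way the one-way link case could surface in the collected data is as an asymmetric neighbor report: one switch still hears its peer, but the peer no longer hears it. The sketch below assumes a hypothetical report format, a map from each switch to the set of neighbors it currently hears, and only flags pairs where both switches actually reported (a silent switch is the "switch down" case instead).

```python
def one_way_links(neighbor_reports):
    """neighbor_reports: {switch: set of neighbors it currently hears}.

    Returns sorted (sender, receiver) pairs where the receiver hears the
    sender but the sender does not hear the receiver, i.e. the direction
    receiver -> sender looks dead.
    """
    suspects = []
    for receiver, heard in neighbor_reports.items():
        for sender in heard:
            if sender in neighbor_reports and receiver not in neighbor_reports[sender]:
                suspects.append((sender, receiver))
    return sorted(suspects)
```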
13 Limitations on Detecting Other Switch Software Errors
- Some errors have to be detected at the data plane or application plane.
- VLAN problems: hosts in the same VLAN cannot communicate with each other.
14 Hardware Errors of Switches and Their Detection
- Switch port errors.
- Switch module errors.
- Both will be detected by the port status reports.
15 STP Errors (1)
- Count to infinity when removing the root
16 STP Errors (2)
- Forwarding loops
- BPDU-loss-induced forwarding loops. If a blocked port fails to receive BPDUs from its peer bridge for an extended period of time, it may start forwarding data.
17 STP Errors (3)
- Forwarding loops
- MaxAge-induced forwarding loops (MaxAge = 6)
18 STP Errors (4)
- Forwarding Loops
- Count to Infinity Induced Forwarding Loops
- Pollution of Forwarding Tables
19 Previous STP Error Detection
- EtherFuse (SIGCOMM '07)
- Plug a fuse into the Ethernet
- Remaining problems
- Where to plug it in?
- How many do we need?
20 Previous STP Error Detection
- Cisco prevention methods
- Loop Guard: prevents loops induced by BPDU loss.
21 Some Existing Solutions
- Cisco Discovery Protocol (CDP)
- Discovers Cisco devices in the neighborhood
- Monitors the liveness of neighboring nodes
- Limitations
- No detailed status report for diagnosis
- Limited to one hop
- Cisco Unidirectional Link Detection (UDLD)
- Detects the one-way link problem
22 General Monitoring Metrics for Detection
- Connectivity. Based on the frame tree, EtherRake can determine the connectivity of a path.
- Delay. EtherRake can link frames and calculate the time spent at each switch.
- Throughput. EtherRake can calculate throughput from the collected frames.
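The delay and throughput metrics can be sketched as follows. The trace format is an assumption: one `(switch_id, timestamp)` observation per switch for a given frame, in path order, with timestamps already corrected by the collectors' clock offsets. With a single timestamp per switch, each difference covers one hop (switch plus link) rather than the switch alone.

```python
def per_hop_delays(trace):
    """trace: [(switch_id, timestamp_seconds), ...] for one frame, in path order.
    Returns [(from_switch, to_switch, delay_seconds), ...]."""
    return [(a, b, tb - ta) for (a, ta), (b, tb) in zip(trace, trace[1:])]


def throughput_bps(frame_sizes, window_seconds):
    """frame_sizes: sizes in bytes of the frames collected during the window.
    Returns throughput in bits per second."""
    return 8 * sum(frame_sizes) / window_seconds
```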
23 TRILL Potential Problems
- Routing loops
- Caused by inconsistent views of the network topology.
- Mitigated using a hop count.
- Scalability issue
- It is unclear how far TRILL can scale.
24 Backup
25 Detection of STP Errors by EtherRake
- Find STP errors with EtherRake.
- Link collected frames into traces
- Detect frame forwarding loops
- Leverage the switch and ARP table information
- Challenges
- Scalability: optimize the collection of traces
- Ambiguity and accuracy: frame linking
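Linking frames into traces and spotting a forwarding loop can be sketched as below. The observation format is hypothetical: each collector record is `(frame_id, switch_id, timestamp)`, where `frame_id` stands for whatever key identifies the same frame at different switches. Choosing that key is exactly the ambiguity challenge noted above and is not solved here.

```python
from collections import defaultdict


def link_frames(observations):
    """Group observations by frame and order them by timestamp.
    Returns {frame_id: [switch_id, ...]} in the order the frame was seen."""
    traces = defaultdict(list)
    for frame_id, switch_id, _ts in sorted(observations, key=lambda o: o[2]):
        traces[frame_id].append(switch_id)
    return dict(traces)


def find_forwarding_loop(trace):
    """Return the first segment of the trace that revisits a switch, or None."""
    first_seen = {}
    for i, switch_id in enumerate(trace):
        if switch_id in first_seen:
            return trace[first_seen[switch_id]: i + 1]
        first_seen[switch_id] = i
    return None
```

A frame that shows up twice at the same switch is treated as evidence of a loop; a real deployment would also need to rule out legitimate retransmissions.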
26 End-to-end Connectivity Monitoring
- Diagnose a connectivity problem from A to B with EtherRake
- Find the frames that are on the way from A to B.
- Link the frames and find a path.
- Locate the problem.
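The last step, locating the problem, can be illustrated with a small sketch. It assumes that an expected path from A to B is available (for example from the logical topology) and that `seen_at` is the set of nodes where frames of the A-to-B flow were actually observed; both inputs are hypothetical.

```python
def locate_break(expected_path, seen_at):
    """Return the first hop (prev, next) where the frame was seen at prev
    but not at next, or None if it was seen along the whole path."""
    for prev, nxt in zip(expected_path, expected_path[1:]):
        if prev in seen_at and nxt not in seen_at:
            return (prev, nxt)
    return None
```

The returned pair narrows the fault down to one switch-to-switch hop, which can then be checked against the port status reports.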
27 IP Router Errors: OSPF (1)
- Network Convergence Time. The time taken by all
the OSPF routers in the network to go back to
steady state operations after there is a change
in the network state.
28 IP Router Errors: OSPF (2)
- Routing Load on Processors
29 IP Router Errors: OSPF (3)
- Route Flaps. Routing table changes in a router,
usually in response to a network failure or a
recovery.
30 Cisco Solution
- Bidirectional Forwarding Detection (BFD)
- Tries to speed up network convergence, which has three parts.
- Failure detection: the speed with which a device on the network can detect and react to a failure of one of its own components, or the failure of a component in a routing protocol peer.
- Information dissemination: the speed with which the failure detected in the previous stage can be communicated to other devices in the network.
- Repair: the speed with which all devices on the network, having been notified of the failure, can calculate an alternate path through which data can flow.
31 IP Router Errors: DHCP
- DHCP problems
- Configuration problems.
- Inability to acquire or renew a lease.
- How to keep the same IP address on multi-boot machines?
32 EtherFuse (1)
- An Ethernet fuse that is plugged into the network to monitor the status of the network.
33 EtherFuse (2)
- Detection of count to infinity
- Examines the cost to the same root R announced in BPDUs
34
- Detection of forwarding loops.
- Combination of passive sniffing and active probing.
35 Packet View Switching
- Forward packets from the point of view of the packets themselves.
- Each packet keeps a memory of the path it has already traversed and decides which way to go based on that memory.
- The steps are as follows. (Generally speaking, this is a depth-first search from the packet's point of view.)
36 Packet View Switching
- Normally, when a packet arrives at a switch, it chooses the default port, which is the port that the control plane provides.
- If the packet has already tried the default port, it randomly chooses a new port that it has never used.
- If the packet has tried every port at this switch, it goes back through the port it came from.
- The packet is discarded when it arrives back at its origin and finds no other way to go. Otherwise, the packet arrives at its destination, which is the monitor center.
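The forwarding rules above can be simulated with a short sketch. It makes several assumptions for illustration: the network is an adjacency map in which a "port" is identified by the neighbor it leads to, `default_port` is the next hop suggested by the control plane (possibly stale or missing), and the packet's memory is modeled as two dictionaries carried with it.

```python
import random


def packet_view_route(adj, default_port, origin, destination, rng=random):
    """Walk one packet from origin towards destination.
    Returns the list of visited nodes, or None if the packet is discarded."""
    tried = {origin: set()}      # per switch: ports the packet already used
    came_from = {origin: None}   # per switch: port of first arrival
    path = [origin]
    node = origin
    while node != destination:
        untried = [n for n in adj[node]
                   if n not in tried[node] and n != came_from[node]]
        default = default_port.get(node)
        if default in untried:
            nxt = default                 # rule 1: follow the control plane
        elif untried:
            nxt = rng.choice(untried)     # rule 2: random port never used
        elif came_from[node] is not None:
            nxt = came_from[node]         # rule 3: go back where it came from
        else:
            return None                   # rule 4: back at origin, discard
        tried[node].add(nxt)
        if nxt not in tried:              # first visit: remember arrival port
            tried[nxt] = set()
            came_from[nxt] = node
        node = nxt
        path.append(node)
    return path
```

Because each port is used at most once per switch and the arrival port is kept for last, the walk takes at most two moves per link, so the packet either reaches the monitor center or returns to its origin and is dropped.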