Title: Application of AI- and ML-Techniques to Fault-Tolerant Routing
1Application of AI- and ML-Techniques to
Fault-Tolerant Routing
Arjun Rao CS 717 November 16 and 18, 2004
2Papers Covered
- [1] Loh, Peter K.K., Artificial Intelligence
Search Techniques as Fault-Tolerant Routing
Strategies
- [2] Loh, Shaw., A Genetic-Based Fault-Tolerant
Routing Strategy for Multiprocessor Networks
3Papers Covered (cont.)
- [3] Loh, Schröder, Hsu., Fault-Tolerant Routing
on Complete Josephus Cubes (not AI-related but
interesting nevertheless)
- If time permits, also
- [4] Bradley, Tyrrell., Immunotronics: Hardware
Fault Tolerance Inspired by the Immune System
4The Problem of Routing
- Communication between nodes
- Servers
- Microprocessors
- Desire shortest, most efficient paths
- Multiprocessor network topologies, e.g.
hypercubes, Josephus cubes, etc.
- Desire availability of paths
- What to do when links/nodes fail?
- How to remain (close to) optimal?
5Intro to Fault-Tolerant Routing
- Current algorithms adaptive but non-minimal
- Misrouting
- Routing strategies tied to specific topologies
- k-ary n-cubes, meshes, etc. Regular structures
and symmetry
- Constrained by fault number and types
- More general strategies vulnerable to deadlock
and livelock
6Turn Model (Glass, Ni)
- Widest application scope
- k-ary n-cubes, nD-meshes, torus geometries, etc.
- West-First algorithm (on 2D-mesh)
- Messages prevented from turning west again
- Prevents cycles, and hence deadlocks
- Routing along virtual channels in strictly
decreasing or increasing order
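A minimal sketch of the West-First rule on a 2D mesh, assuming the usual minimal variant: a packet makes all of its westward hops first and may then choose adaptively among east, north, and south. The function name `west_first_moves` and the coordinate convention (x grows eastward, y grows northward) are illustrative assumptions, not taken from the slides.

```python
def west_first_moves(cur, dst):
    """Return the productive directions a West-First router may offer.

    cur and dst are (x, y) mesh coordinates. Assumed convention:
    x increases to the east, y increases to the north.
    """
    (x, y), (dx, dy) = cur, dst
    if dx < x:
        # Destination lies to the west: all westward hops come first,
        # because no turn back to the west is allowed later.
        return ["W"]
    moves = []
    if dx > x:
        moves.append("E")
    if dy > y:
        moves.append("N")
    if dy < y:
        moves.append("S")
    return moves  # empty list means the packet has arrived


# Destination to the north-west: only west is offered.
first_hop = west_first_moves((3, 3), (1, 5))
# Destination to the north-east: the router may pick either direction.
adaptive_hop = west_first_moves((1, 3), (3, 5))
```

Because a packet never turns west after leaving the westward phase, the turns needed to close a cycle are never all available, which is the intuition behind the deadlock freedom claimed above.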
7Turn Model and Channel Numbering
8Turn Model (cont.)
- Three examples of routing
- F = FAILURE
- Full adaptation w/o deadlock and livelock
requires more global info → more overhead
9AI Search Techniques
- Arbitrary topology → Search space
- Search space → Search tree(s)
- Adaptive but still non-minimal
- Characteristic recursion impractical on
loosely-coupled, distributed network
10AI Logical Abstraction
- Abstraction
- S: Problem space
- O: Set of objectives
- P: Search paths
- $S = (O, P)$, where $o_i \in O$ and $p_j \in P$; each $p_j$
connects tuple $(o_k, o_l)$, $k \neq l$
- Abstraction used to model
11Multiprocessor Network w/ Generic Topology
- Network
- N: Nodes
- L: Links between nodes
- $G = (N, L)$, where $n_i \in N$ and $l_j \in L$; each $l_j$
connects tuple $(n_k, n_l)$, $k \neq l$
- Objective ↔ Node
- Search path ↔ Link
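A small sketch of the network model G = (N, L) as a data structure, assuming undirected links. The class name `Network` and its methods `add_link` and `neighbors` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class Network:
    """G = (N, L): a set of nodes N and a set of links L.

    Each link joins two distinct nodes (k != l). Links are assumed
    to be undirected in this sketch.
    """

    def __init__(self):
        self.nodes = set()
        self.links = set()            # each link is a frozenset {n_k, n_l}
        self._adj = defaultdict(set)  # node -> directly linked nodes

    def add_link(self, n_k, n_l):
        if n_k == n_l:
            raise ValueError("a link must connect two distinct nodes")
        self.nodes.update((n_k, n_l))
        self.links.add(frozenset((n_k, n_l)))
        self._adj[n_k].add(n_l)
        self._adj[n_l].add(n_k)

    def neighbors(self, n):
        """Nodes reachable over one link: the successor objectives of n."""
        return sorted(self._adj[n])


g = Network()
for a, b in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    g.add_link(a, b)
```

Under the analogy, `neighbors(n)` plays the role of expanding a search-tree node: each neighbor is an objective reachable by following one search path.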
12Abstract Routing Model
- Search $\sigma$
- $\sigma(o_s, o_t): S \times S \to S'$, where $S = (O, P)$ and $S' = (O', P')$
- $o_x, o_y \in O$ and $o_x, o_y \in O'$ → Successful search
- $o_x, o_y \in O$ and $o_x \in O'$, $o_y \notin O'$ → Unsuccessful search
- Routing attempt R
- $R(n_s, n_d): G \times G \to G'$, where $G = (N, L)$ and $G' = (N', L')$
- $n_i, n_j \in N$ and $n_i, n_j \in N'$ → Complete route
- $n_i, n_j \in N$ and $n_i \in N'$, $n_j \notin N'$ → Incomplete route
13Routing Analogy
- AI search equivalent to routing attempt
- Successful search ↔ Route between source and
destination nodes
- Unsuccessful search ↔ Incomplete route to
destination
14Caveats of Analogy
- No specific search algorithm → No routing
strategy
- No optimality constraints
- Nothing about deadlocks/livelocks
- Nothing about fault tolerance!!
15Fault-Tolerant Routing Model
- Model considers two aspects
- Routing system configuration
- Must be generic enough!
- Message propagation protocols and policies
- Following slides introduce what is needed for AI
searches (w/ physical message backtracking)
16FT Routing Model (cont.)
17FT Routing Model (cont.)
- Eager readership of input messages
- Single input buffer to avoid polling
- Multiple output buffers to accommodate different
delivery rates
- Router process
- AI/FT routing strategy implemented here
- Physical message backtracking → Increased message
sizes
- Increased message sizes/overhead → Requires
communications router at each node
18Communications Router
19Communications Router (cont.)
- Communication router constitutes router process
and connections
- Main components: LCM and CP
- ROM: Stores link management and routing software
- RAM: Stores routing table, link status table,
associated link lists
20CR Data Structure Routing Table
21CR Routing Table
- For each node, up to n links
- For each link
- Connected with status OK and node ID of neighbor
- Not connected with status NC and node ID 1
- Link fault represented by timeout
- Status reset to NC
- Processor fault represented by timeouts in
neighbors
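The routing table described above can be sketched as one (status, neighbor ID) entry per link. This is only an illustration: the class name `RoutingTable`, its methods, and the use of `None` as the placeholder ID for an unconnected link are assumptions, not details from the paper.

```python
OK, NC = "OK", "NC"
NO_NODE = None  # assumed placeholder ID for a link that is not connected


class RoutingTable:
    """Per-node table with one (status, neighbor ID) entry per link."""

    def __init__(self, max_links):
        self.entries = [(NC, NO_NODE)] * max_links

    def connect(self, link, neighbor_id):
        self.entries[link] = (OK, neighbor_id)

    def timeout(self, link):
        # A link fault shows up as a timeout; the status is reset to NC.
        self.entries[link] = (NC, NO_NODE)

    def live_neighbors(self):
        return [nid for status, nid in self.entries if status == OK]


table = RoutingTable(max_links=8)
table.connect(0, 5)
table.connect(3, 7)
table.timeout(0)  # link 0 stops responding
```

A processor fault needs no extra mechanism in this sketch: every neighbor of the failed node sees a timeout on its own link and calls `timeout` locally.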
22CR Data Structures Link Status Table, Lists
23Message Packets
- Six fields
- Router Control (4 bits): Type of message,
including NORMAL and BACKTRACK
- Destination Node ID (10 bits): Supports network
of size up to 1024 nodes
- Pending Nodes (20 bytes): Stack of node IDs that
may receive packet but have not yet
- Traversed Nodes (20 bytes): Stack of nodes
traversed, with most recent on top
24Message Packets (cont.)
- Traversed Nodes Index (10 bits): Index to
previous traversed nodes field. Supports
simulation of physical message backtracking
- Data Field (n-bit pointer): Points to
information content of packet
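The six packet fields can be mirrored in a small record type. This is a sketch only: the class name `Packet`, the attribute names, and the use of Python lists for the two stacks are illustrative assumptions, and the fixed bit and byte widths are reduced to a single range check on the destination ID.

```python
from dataclasses import dataclass, field
from typing import Any, List

NORMAL, BACKTRACK = "NORMAL", "BACKTRACK"


@dataclass
class Packet:
    dest: int                                            # Destination Node ID
    control: str = NORMAL                                # Router Control
    pending: List[int] = field(default_factory=list)     # Pending Nodes stack
    traversed: List[int] = field(default_factory=list)   # Traversed Nodes stack
    traversed_index: int = 0                             # Traversed Nodes Index
    data: Any = None                                     # Data Field

    def __post_init__(self):
        # A 10-bit destination field addresses at most 1024 nodes.
        if not 0 <= self.dest < 1024:
            raise ValueError("destination ID must fit in 10 bits")


p = Packet(dest=42, data="payload")
p.traversed.append(0)  # most recently traversed node sits on top
```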
25(Finally) AI Search Strategies
- Brute Force
- Depth-First Search
- Random Climbing
- Heuristic
- Hill Climbing
- Best-First Search
- A*
26AI Search Strategies (cont.)
- In presence of network faults
- Prevent cycles → No deadlocks
- Prevent more than two traversals of nodes/links →
No livelocks and necessary for AI searches
- Adaptations of search algorithms
- Problems
- Recursion? Nope (PMB)
- Overhead? Fixed (Well, mostly)
27Common Beginning
- Extracts header and disassembles it
- IF Destination Node is reached, pass packet to
host processor
- ELSE
- IF Router Control is BACKTRACK
- IF Pending Nodes top node is directly linked
- Route packet to that node
- Set Router Control to NORMAL
- ELSE
- Backtrack packet to previous node in
Traversed Nodes
- Pop current node ID from Pending Nodes
- Push current node ID onto Traversed Nodes
28Depth-First Search
- Travel as far as possible
- Do not consider alternative paths just yet
- If fault or dead-end, backtrack to most recent
possible path
29DFS (cont.)
- Following common beginning
- Look for directly linked successor nodes
- IF they are already traversed, ignore
- ELSE IF they are in Pending Nodes, ignore
- ELSE push them onto Pending Nodes
- Read top node of Pending Nodes
- IF directly linked (no fault), route packet to it
- ELSE Set BACKTRACK and route to last traversed
node
- END
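The DFS steps above can be simulated for a single packet. This is a sketch under stated assumptions, not the paper's router code: faults are static, the NORMAL/BACKTRACK flag is folded into the loop (the `else` branch is the backtracking hop), and a helper stack `path` records the current route so the packet knows which node to retreat to. The names `dfs_route`, `path`, and `hops` are hypothetical.

```python
def dfs_route(adj, faulty_links, src, dst):
    """Return every node the packet visits (backtracking hops included),
    or None if the destination cannot be reached."""

    def live(a, b):
        return b in adj[a] and frozenset((a, b)) not in faulty_links

    pending = []       # Pending Nodes: candidates not yet visited
    traversed = [src]  # Traversed Nodes: most recent on top
    path = [src]       # assumed helper: current route from the source
    hops = [src]
    while path:
        current = path[-1]
        if current == dst:
            return hops
        # Push directly linked successors that are neither traversed
        # nor already pending.
        for n in adj[current]:
            if live(current, n) and n not in traversed and n not in pending:
                pending.append(n)
        if not pending:
            return None
        if live(current, pending[-1]):
            nxt = pending.pop()   # NORMAL: move forward to the top node
            traversed.append(nxt)
            path.append(nxt)
            hops.append(nxt)
        else:
            path.pop()            # BACKTRACK: physically return one hop
            if path:
                hops.append(path[-1])
    return None


adj = {0: [2, 1], 1: [0], 2: [0, 3], 3: [2]}
route = dfs_route(adj, set(), 0, 3)
blocked = dfs_route(adj, {frozenset((2, 3))}, 0, 3)
```

In this example the packet first tries node 1, finds a dead end, travels back to node 0, and only then proceeds through node 2, which shows why routes found this way are adaptive but not minimal.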
30DFS Example
31DFS Example (cont.)
32Random Climbing
- Following the common beginning
- ...
- ELSE
- Select a successor node randomly
- Push unselected successor nodes onto Pending
Nodes
33Hill Climbing
- Heuristic: Estimated remaining distance
- Following common beginning
- ...
- ELSE
- Sort successor nodes according to est. remaining
distance
- Push sorted nodes onto Pending Nodes
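A sketch of the hill-climbing step: only the new successors are sorted, and they are pushed so that the most promising one ends up on top of the stack. The Manhattan-distance estimate on mesh coordinates and the names `manhattan` and `push_sorted_successors` are assumptions made for illustration; the slides do not say which estimate is used.

```python
def manhattan(a, b):
    """Assumed heuristic: hop distance between (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def push_sorted_successors(pending, successors, dest, estimate=manhattan):
    """Push successors onto the Pending Nodes stack, farthest first,
    so the node with the smallest estimate is popped next."""
    ranked = sorted(successors, key=lambda n: estimate(n, dest), reverse=True)
    pending.extend(ranked)
    return pending


pending = [(9, 9)]  # an older candidate left over from an earlier node
push_sorted_successors(pending, [(0, 1), (1, 0), (2, 1)], dest=(3, 1))
```

Note that the older entry `(9, 9)` stays at the bottom untouched: hill climbing ranks only the immediate neighbors of the current node.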
34Best-First Search
- Resumes partial routes not previously considered
- Looks at immediate neighbors, neighbors of
predecessors
- Sorts by est. remaining distance
- Leads to non-minimal routes!
35BFS (cont.)
- ...
- ELSE
- Push (directly linked successor nodes) onto
Pending Nodes
- Sort Pending Nodes according to est. remaining
distance
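A sketch of the best-first step, which differs from hill climbing in that the whole Pending Nodes stack is re-sorted after the new successors are pushed. The function name `best_first_push` and the Manhattan-distance estimate are illustrative assumptions.

```python
def manhattan(a, b):
    """Assumed heuristic: hop distance between (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def best_first_push(pending, successors, dest, estimate=manhattan):
    """Push new successors, then sort the entire stack so the node with
    the smallest estimated remaining distance is on top."""
    for n in successors:
        if n not in pending:
            pending.append(n)
    pending.sort(key=lambda n: estimate(n, dest), reverse=True)
    return pending


# An older candidate (3, 0) is closer to the destination than the only
# new successor (0, 0), so it moves to the top and is tried next.
pending = [(3, 0)]
best_first_push(pending, [(0, 0)], dest=(3, 1))
```

Resuming the older candidate may require the packet to travel back to a predecessor first, which is one way to see why the resulting routes can be non-minimal.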
36A*
- Two heuristics
- Estimated remaining distance, h
- Path length traversed, g
- Partial paths sorted by f = g + h
- When no faults, always finds minimal route
37A* (cont.)
- After current ID processing
- Record path length traversed, g
- ...
- ELSE
- Calculate and store f for new successor nodes
- Push them onto Pending Nodes sorted by f
- ...
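For comparison, here is the textbook centralized form of A* with f = g + h on a small mesh. It is a sketch, not the packet-based variant in the slides: it assumes global knowledge of faulty links, unit link cost, and a Manhattan-distance estimate for h. The helper names `mesh` and `a_star` are hypothetical.

```python
import heapq


def mesh(width, height):
    """Adjacency lists for a width x height 2D mesh (illustrative topology)."""
    adj = {}
    for x in range(width):
        for y in range(height):
            adj[(x, y)] = [(x + dx, y + dy)
                           for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if 0 <= x + dx < width and 0 <= y + dy < height]
    return adj


def a_star(adj, faulty_links, src, dst):
    def h(n):  # estimated remaining distance
        return abs(n[0] - dst[0]) + abs(n[1] - dst[1])

    best_g = {src: 0}                      # shortest known path length per node
    frontier = [(h(src), 0, src, [src])]   # entries are (f, g, node, path)
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == dst:
            return path
        if g > best_g[node]:
            continue  # stale entry: a shorter path to node was found later
        for nxt in adj[node]:
            if frozenset((node, nxt)) in faulty_links:
                continue
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(frontier,
                               (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None


grid = mesh(3, 3)
clear = a_star(grid, set(), (0, 0), (2, 0))
detour = a_star(grid, {frozenset(((0, 0), (1, 0)))}, (0, 0), (2, 0))
```

With no faults the route takes two hops; with the link between (0, 0) and (1, 0) failed, the shortest remaining route takes four. A router that only learns of a fault on arrival would also pay for any hops spent backtracking.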
38Performance Testing
- Simulated 125-node multiprocessor network
- Max 8 links per node (maps to many topologies)
- Faulty links and processors
- Pre-specified or dynamically generated
- Testing
- Messages between every pair of nodes
- 20 trials at 0, 5, 10, 15, 20 faulty links
- 125 × 125 × 20 × 6 = 1,875,000 tests (??)
39Test Results
- As faults increase, heuristic strategies fare
better (esp. > 15)
- A* best search technique but slow
- Hill climbing and BFS do not consider nodes
traversed
- Hill climbing considers only immediate neighbors
40Test Results (cont.)
41Main Point
- Using AI search techniques, we abstract from
routing in networks to searching in trees
(topology-independent, quantity and type of
faults irrelevant)
42Next Paper
- [1] Loh, Peter K.K., Artificial Intelligence
Search Techniques as Fault-Tolerant Routing
Strategies
- [2] Loh, Shaw., A Genetic-Based Fault-Tolerant
Routing Strategy for Multiprocessor Networks
43Our Little Problem
- AI search techniques topology- and fault-type
independent
- but non-minimal routes utilized
- Follow-up work shows how genetic algorithms
(combined with heuristics) can find minimal
routes in presence of network faults
44Genetic Algorithms Overview
- Optimization strategy
- Population of potential solutions evolves over
series of generations
- Each element of population is a chromosome; each
unit of a chromosome is a gene
- Chromosomes undergo crossover and mutation
- Most fit chromosomes selected for next
generation, based upon fitness function
45Abstract Model
- Same as before (including definitions of S and G)
- Pure abstraction suffers from same caveats as
before
- Basic idea: Instead of AI search for adaptive
route, optimize over population of routes to find
best
46Message Packets
47Chromosome
- Route ↔ Chromosome
- Node on route ↔ Gene in chromosome
- Length of route ↔ Size of chromosome
- Chromosome size directly reflects routing
performance!
- Distance traversed basis of fitness
48Population Creation
49Mutation and Crossover
- Mutation: Swap and/or shift
- Normal crossover destroys routes, messes with
source and destination; problem w/ different
lengths
- Use one-point random crossover
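One way to keep crossover from destroying routes is to cut both parents at a node they share, so each child still runs from source to destination even when the parents differ in length. The sketch below assumes that approach; the slides do not say how the crossover point is chosen, and the name `one_point_crossover` is hypothetical.

```python
import random


def one_point_crossover(route_a, route_b, rng=random):
    """Swap the tails of two routes at a randomly chosen shared node.

    Routes are lists of node IDs with the same source and destination.
    Only interior nodes are eligible cut points, so the endpoints are
    preserved. If the parents share no interior node, copies of the
    parents are returned unchanged.
    """
    common = [n for n in route_a[1:-1] if n in route_b[1:-1]]
    if not common:
        return route_a[:], route_b[:]
    cut = rng.choice(common)
    i, j = route_a.index(cut), route_b.index(cut)
    return route_a[:i] + route_b[j:], route_b[:j] + route_a[i:]


parent_a = [0, 1, 4, 7, 9]
parent_b = [0, 2, 4, 8, 6, 9]
child_a, child_b = one_point_crossover(parent_a, parent_b)
```

Here node 4 is the only shared interior node, so the children are `[0, 1, 4, 8, 6, 9]` and `[0, 2, 4, 7, 9]`. A child can still revisit a node when the parents overlap elsewhere, so a real implementation would need a repair or rejection step.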
50Fitness Function
- $F = (D_{max} - D_{route}) / D_{max} + \varepsilon$
- $D_{max}$: Maximum distance between source and
destination
- $D_{route}$: Distance traveled by specific route
- $\varepsilon$: Predefined value to ensure non-zero fitness
- Higher value → More fit
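A short sketch of the fitness computation, assuming it has the form (Dmax - Droute) / Dmax plus a small predefined constant. The parameter name `eps` and its default value of 0.01 are placeholders chosen for illustration, not values from the paper.

```python
def fitness(d_route, d_max, eps=0.01):
    """Fitness of a route: shorter routes score higher.

    d_route: distance traveled by this route
    d_max:   maximum distance between source and destination
    eps:     assumed small constant that keeps the fitness non-zero
             even when d_route equals d_max
    """
    return (d_max - d_route) / d_max + eps


short_route = fitness(4, 10)
long_route = fitness(8, 10)
worst_case = fitness(10, 10)  # only the constant remains
```

The constant matters for selection schemes that weight chromosomes by fitness: without it, a route of maximum length would have zero weight and could never be selected.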
51Selection Scheme
- Roulette Wheel
- Sum of fitness values × random value from [0, 1]
- Select chromosomes with fitness greater than
product
- Tournament Selection
- Most fit chromosomes selected
- Stochastic Remainder
- Probabilities used to select route
- Which scheme has best performance selecting
optimal route?
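Two of the selection schemes can be sketched as follows. These are the common textbook forms and may differ in detail from the paper: the roulette wheel here compares a running total of fitness against the random threshold, and the tournament picks the fittest of k randomly drawn chromosomes. The function names and the parameter `k` are illustrative assumptions.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Pick one chromosome with probability proportional to its fitness."""
    threshold = rng.random() * sum(fitnesses)  # random value x total fitness
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running > threshold:
            return chromosome
    return population[-1]  # guard against floating-point round-off


def tournament_select(population, fitnesses, k=2, rng=random):
    """Draw k distinct chromosomes at random and keep the fittest."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]


rng = random.Random(0)
routes = ["route_a", "route_b", "route_c"]
only_c = roulette_select(routes, [0.0, 0.0, 1.0], rng)
fittest = tournament_select(routes, [0.2, 0.9, 0.5], k=3, rng=rng)
```

Roulette selection keeps some diversity because weak routes retain a small chance of survival, whereas a large tournament size pushes hard toward the current best route.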
52Reroute
53Genetic Hybrid Algorithm