Title: CA-RAM: A High-Performance Memory Substrate for Search-Intensive Applications
1. CA-RAM: A High-Performance Memory Substrate for Search-Intensive Applications
Sangyeun Cho, J. R. Martin, R. Xu, M. H. Hammoud and R. Melhem
Dept. of Computer Science, University of Pittsburgh
2. Search ops in applications
- Search (or lookup) operations represent an important common function
  - Network packet processing
    - For each arriving packet, determine the output port
    - Given packet information, find a matching classification rule
    - Each lookup can incur many memory accesses
  - Speech recognition
    - Searching (e.g., dictionary lookup) takes up 24% of CPU cycles
  - Forthcoming RMS (Recognition, Mining, and Synthesis) apps
3. Search performance and power
- Search performance must match increasing line speeds
  - For OC-768, up to 104M packets must be processed per second
  - Network traffic has doubled every year [McKeown03]
  - Routing tables (200K prefixes in a core router) are growing [RIS]
  - IPv6
- Power and thermal issues are already a critical limiting factor in network processing device design [McKeown03]
- Search in battery-operated devices should be energy-efficient
- Conventional search solutions
  - Software methods (tries, hash tables, ...)
  - Hardware methods (CAM, TCAM, ...)
4. IP lookup using a trie
- Consider an IP address 0 1 0 0 0 1 1 0
- (+) Software approach is flexible
- (-) High memory capacity requirement
- (-) High memory bandwidth requirement
- Not SCALABLE
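To make the trie's cost concrete, here is a minimal software sketch (names and structure are ours, not from the slides) of longest-prefix match over a binary trie. Each address bit costs one node visit, i.e. one memory access, which is the bandwidth problem the slide points out.

```python
# Minimal binary-trie longest-prefix-match sketch (illustrative only).
# One node visit -- hence one memory access -- per address bit.

class TrieNode:
    def __init__(self):
        self.children = [None, None]  # one child per bit value
        self.port = None              # output port if a prefix ends here

def insert(root, prefix_bits, port):
    node = root
    for b in prefix_bits:
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.port = port

def longest_prefix_match(root, addr_bits):
    node, best = root, None
    for b in addr_bits:               # one node visit per address bit
        if node.port is not None:
            best = node.port          # remember the deepest match so far
        if node.children[b] is None:
            break
        node = node.children[b]
    else:
        if node.port is not None:     # full traversal: check the final node
            best = node.port
    return best

root = TrieNode()
insert(root, [0, 1], "port A")        # prefix 01*
insert(root, [0, 1, 0, 0], "port B")  # prefix 0100*
print(longest_prefix_match(root, [0, 1, 0, 0, 0, 1, 1, 0]))  # -> port B
```

For the slide's example address 01000110, the walk touches five nodes to find the longest match, 0100*; a 32-bit IPv4 address can require up to 32 such accesses.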
5. IP lookup using TCAM
- Consider an IP address 0 1 0 0 0 1 1 0
[Figure: TCAM array holding prefix entries (110100, 110101, 110111, 01000, 01100, 01101, 11011, 0100, 0110, 1101, 10, 0); among multiple matches, choose the first]
- (+) High bandwidth, constant-time lookup
- (-) TCAMs are relatively small and expensive
- (-) Power consumption very high
- Not SCALABLE
6. CA-RAM: a hybrid approach
- Can we do better than the existing conventional schemes?
  - CAM-like search performance
  - RAM-like cost and power
- CA-RAM combines hashing w/ hardware parallel matching
- CA-RAM design goals
  - High lookup performance
  - Low power consumption
  - Smaller chip area per stored datum
  - Straightforward system-level integration
7. Talk roadmap
- What is CA-RAM?
- Prototype design
- Case study 1: IP lookup
- Case study 2: Trigram lookup for speech recognition
8. CA-RAM: Content Addressable RAM
[Figure: conventional CAM/TCAM vs. CA-RAM, showing memory cells and match logic]
- Separate match logic and memory
- Match logic for a single row, not every row
- Allows the use of dense RAM technology
- Enables highly reconfigurable match logic
- Keep keys sorted in each row, not in the entire array
9. Very simple, yet efficient
- Use hashing to store keys in a particular row
- To look up, hash the search key and retrieve one row
- Perform matching on the entire row in parallel
- Achieve full content addressability w/o paying the overhead!
[Figure: the search key feeds an index generator that selects one row; match processors compare the row's keys (Keyi1, Keyi2, ..., Keyj1, Keyj2) against the search key in parallel]
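The lookup flow above can be modeled in a few lines of software (a sketch under our naming; the real comparison is done by per-slot match logic, not a loop): the index generator hashes the search key to one row, a single memory access fetches that row, and every key slot in it is compared simultaneously.

```python
# Software model of one CA-RAM lookup (our sketch of the slide's datapath).

NUM_ROWS = 8
BUCKET_SIZE = 4                        # key slots per row

rows = [[] for _ in range(NUM_ROWS)]   # each row holds (key, data) pairs

def index_gen(key):
    """Index generator: hash the search key to one row."""
    return hash(key) % NUM_ROWS

def store(key, data):
    row = rows[index_gen(key)]
    if len(row) >= BUCKET_SIZE:
        raise OverflowError("bucket overflow (handled on a later slide)")
    row.append((key, data))

def lookup(key):
    # One memory access fetches the whole row; this loop stands in for the
    # match processors, which compare all slots at once in hardware.
    for k, d in rows[index_gen(key)]:
        if k == key:
            return d
    return None
```

One row fetch plus one parallel compare per lookup, versus one access per node in the trie approach.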
10. Pipelined CA-RAM operation
- Four pipeline stages: index generation, memory access, key matching, and result forwarding
[Figure: pipelined datapath over the stored keys (Keyi1..Keyi3, Keyj1..Keyj3)]
11. Dealing w/ bucket overflows
- Careful design of the hash function
- Increase bucket size
- Reduce load factor (α) = # of occupied entries / # of total entries
- Use chaining: store overflows in subsequent rows
  - Multiple accesses per lookup
- Use a small overflow CAM, accessed in parallel
  - Similar to popular victim caching
- Use two-level hashing and employ multiple CA-RAM banks
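Of these options, chaining is easy to show in miniature. The sketch below (our illustration, not the hardware design) spills an overflowing entry into the following row and counts the extra row fetches a lookup then needs; it also computes the load factor α defined above.

```python
# Sketch of the chaining overflow option: an entry that does not fit in its
# home row spills into the next row, costing extra memory accesses on lookup.

NUM_ROWS, BUCKET_SIZE = 8, 2
bucket_rows = [[] for _ in range(NUM_ROWS)]

def load_factor():
    """alpha = # of occupied entries / # of total entries."""
    return sum(len(r) for r in bucket_rows) / (NUM_ROWS * BUCKET_SIZE)

def store_chained(key, data):
    r = hash(key) % NUM_ROWS
    while len(bucket_rows[r]) >= BUCKET_SIZE:  # home row full: spill onward
        r = (r + 1) % NUM_ROWS
    bucket_rows[r].append((key, data))

def lookup_chained(key):
    """Return (data, rows fetched); data is None if the key is absent."""
    r = hash(key) % NUM_ROWS
    accesses = 0
    while accesses < NUM_ROWS:                 # each iteration = one row fetch
        accesses += 1
        for k, d in bucket_rows[r]:
            if k == key:
                return d, accesses
        if len(bucket_rows[r]) < BUCKET_SIZE:  # a non-full row ends the chain
            return None, accesses
        r = (r + 1) % NUM_ROWS
    return None, accesses
```

Keeping α low (more total entries than stored keys) makes chains rare, which is exactly the trade-off the design-point slides explore later.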
12. CA-RAM reconfig. opportunities
- Reconfigurable match logic allows
  - Adapting key size to apps
  - Same hardware to support multiple apps or standards
13. Adapting key size
- Adapting key size is straightforward
- Will benefit supporting multiple apps/standards
[Figure: stored keys (Keyi1..Keyi3, Keyj1..Keyj3) of configurable width feeding the match information output]
14. CA-RAM reconfig. opportunities
- Reconfigurable match logic allows
  - Adapting key size to apps
  - Same hardware to support multiple apps or standards
  - Binary and ternary matching
    - Some apps require ternary matching, some don't
15. Supporting binary/ternary matching
- Developed a configurable comparator
- T-matching requires 2 bits / 1 symbol
- Supporting different types of matching in different bit positions is feasible
[Figure: each stored key (Keyi1, Keyi2, Keyj1, Keyj2) is paired with a mask (Maski1, Maskj1); the search key and masks drive the match information output]
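The mask-based comparator can be sketched as follows (our model of the slide's idea; function names are ours). Each stored entry pairs a value with a mask, so ternary storage needs 2 bits per symbol; a mask bit of 1 marks a "don't care" position, and binary matching falls out as the all-zero-mask special case.

```python
# Ternary-match sketch: compare only the positions the mask marks as "care".

def ternary_match(search_key, value, mask, width=8):
    care = ~mask & ((1 << width) - 1)  # positions where bits must be equal
    return (search_key & care) == (value & care)

def binary_match(search_key, value, width=8):
    # Binary matching is the special case mask == 0 (every bit is a care bit)
    return ternary_match(search_key, value, mask=0, width=width)
```

For example, the entry (value=01000000, mask=00001111) behaves like the prefix 0100****: it matches 01000110 but not 11000110.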
16. CA-RAM reconfig. opportunities
- Reconfigurable match logic allows
  - Adapting key size to apps
  - Same hardware to support multiple apps or standards
  - Binary and ternary matching
    - Some apps require ternary matching, some don't
  - Storing data and keys in a CA-RAM module
    - Cuts # of memory accesses for a lookup by half
17. Simultaneous key matching and data access
- Data access follows TCAM lookup
- CA-RAM supports data embedding
  - Cuts memory traffic and latency by half
[Figure: each row stores keys (Keyi1, Keyi2, Keyj1, Keyj2) together with their data (Datai1, Dataj1); one access yields both the match result and the data]
18. CA-RAM reconfig. opportunities
- Reconfigurable match logic allows
  - Adapting key size to apps
  - Same hardware to support multiple apps or standards
  - Binary and ternary matching
    - Some apps require ternary matching, some don't
  - Storing data and keys in a CA-RAM module
    - Cuts # of memory accesses for IP lookup by half
  - Providing range checking capabilities
    - Beneficial for rule-based packet filtering
19. Supporting range checking
- Range checking causes trouble for TCAMs: entries must be expanded
- CA-RAM can support range checking efficiently
[Figure: a stored key (Keyi1, Keyj1) is paired with a range field (Rangei1, Rangej1); the search key is checked against the range to produce match information]
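The expansion cost is easy to quantify with a sketch (our illustration; the slides do not give this code). A prefix-only match engine must cover an arbitrary range with aligned power-of-two blocks, while a range-capable comparator needs one stored entry and a single lo <= key <= hi check.

```python
# Why ranges are trouble for prefix-only entries: one range rule can expand
# into many prefix entries, while a range comparator needs exactly one.

def range_to_prefixes(lo, hi, width):
    """Minimal set of (value, prefix_len) entries covering [lo, hi]."""
    prefixes = []
    while lo <= hi:
        # Largest aligned power-of-two block that starts at lo and fits
        size = (lo & -lo) if lo else (1 << width)
        while size > hi - lo + 1:
            size >>= 1
        prefixes.append((lo, width - size.bit_length() + 1))
        lo += size
    return prefixes

def range_match(key, lo, hi):
    return lo <= key <= hi             # one comparator, one stored entry

# Classic worst case: port range [1, 65534] on a 16-bit field
print(len(range_to_prefixes(1, 65534, 16)))   # -> 30 prefix entries, one rule
```

Thirty entries for a single filtering rule is the kind of blow-up that makes direct range support in the match logic attractive.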
20. CA-RAM-based memory subsystem
21. Prototype implementation
- We implemented a prototype CA-RAM slice design (w/ a degree of reconfigurability) and evaluated its power and area advantages over state-of-the-art TCAMs
- We used a standard-cell (0.16µm) based ASIC design flow

  Step                   | # cells | Area (µm²) | Delay (ns)
  -----------------------|---------|------------|-----------
  Expand search key      |   3,804 |     66,228 |     (0.89)
  Calculate match vector |   5,252 |     10,591 |       0.95
  Decode match vector    |     899 |      1,970 |       1.91
  Extract result         |   6,037 |     21,775 |       1.99
  Total                  |  15,992 |    100,564 |       4.85
22. Area and power: CA-RAM vs. TCAM
[Figure: cell area (µm²) @130nm CMOS, and power (W) for a 4.5Mb device @143MHz]
- CA-RAM area advantage: 4.5x-11x
- CA-RAM power advantage: 4x-14x
23. Performance: CA-RAM vs. (T)CAM
24. Case study 1: IP lookup
25. Problem description
- Given
  - A set of prefixes (each prefix is associated with an output port number)
  - An IP address
- Find a prefix that matches the input IP address and return the output port number associated with it
  - In the presence of multiple matching prefixes, choose the longest
- Procedure
  - Find a good hash function to distribute prefixes
  - Determine CA-RAM organization
26. Data set and hashing method
- An IP core router's table having 186,760 entries
- Bit selection scheme [Zane et al. 03]
  - 98% of prefixes are at least 16 bits long
  - Select hash bits from the first 16 bits (low-order bits)
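Bit selection can be sketched as follows (our reading of the scheme the slide cites from Zane et al. '03; the particular bit positions below are hypothetical, not from the data set): the row index is a fixed subset of bit positions taken from the first 16 bits of each prefix, chosen offline so that the prefixes spread across rows as evenly as possible.

```python
# Bit-selection hashing sketch: extract chosen bit positions as the row index.

def select_bits(value, positions):
    """Concatenate the chosen bit positions of value into a row index."""
    index = 0
    for p in positions:
        index = (index << 1) | ((value >> p) & 1)
    return index

# Hypothetical selection: 11 bit positions -> 2^11 = 2,048 rows
POSITIONS = [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5]

def row_of(prefix16):
    """Map the first 16 bits of a prefix to one of 2,048 CA-RAM rows."""
    return select_bits(prefix16, POSITIONS)
```

Choosing which positions to select is the offline optimization step; the hardware index generator itself reduces to simple wiring.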
27. Shaping CA-RAM
- Consider multiple design points, from 2,048 rows x 32 entries to 4,096 rows x 64 entries
[Figure: Design A (α = 0.47), Design B (α = 0.40), Design C (α = 0.36), Design D (α = 0.36), Design F (α = 0.36), Design E (α = 0.24)]
28. Performance
[Figure: spilled entries and average memory access latency (AMAL) for the designs with α = 0.47, 0.40, 0.36, 0.36, 0.36, and 0.24, under uniform and skewed traffic]
- With a properly chosen α, CA-RAM achieves near-constant AMAL
29. Area and power
[Figure: relative area and power of Design B vs. TCAM]
- CA-RAM is advantageous over TCAM
30. Case study 2: Trigram lookup in speech recognition
31. Problem, data set, and hashing
- Problem
  - Look up a trigram in the trigram database
- Data set
  - A subset of the Sphinx trigram database
  - We picked entries having 13-16 characters
  - Still 5,385,231 entries, or 86MB
- Hashing
  - DJB, an efficient string hash function (used in Sphinx)
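For reference, the DJB string hash (commonly known as djb2) can be written as below; whether Sphinx uses exactly this multiply-by-33 variant rather than the XOR variant is our assumption, not stated on the slide.

```python
# djb2 string hash sketch: start from 5381, fold each character in as
# h = h*33 + c, and reduce the result to a CA-RAM row number.

def djb_hash(s, num_rows):
    h = 5381                                  # djb2's standard seed
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF   # keep the state to 32 bits
    return h % num_rows                       # map the trigram to a row
```

Its appeal here is practical: a couple of shifts and adds per character, yet good spread over the rows for natural-language strings.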
32. Result
33. Data distribution
34. Area comparison
[Figure: relative area of CAM vs. CA-RAM]
35. CA-RAM conclusions
- Compared w/ software methods
  - Fewer memory accesses, so higher lookup performance
- Compared w/ CAM or TCAM
  - Higher density, matching that of DRAM, enabling large lookup tables
  - Competitive performance
  - Low power: a critical advantage for cost-effective system design
- Reconfigurable
  - Can accommodate apps having different key/record sizes, binary vs. ternary searching requirements, range checking, ...
  - Can adopt new standards much more easily, e.g., IPv6
- Two case studies show the efficacy of the CA-RAM approach
  - 35% improvement in area and power, compared with CAM/TCAM