Title: Hardware/Software Interface Co-Synthesis and Latency Insensitive Design
1. Hardware/Software Interface Co-Synthesis and
Latency Insensitive Design
- Marty Nicholes / Ra Roath
- EEC282
2. Presentation
- HW/SW Interface for HW/SW codesign
- Allows simulation of a hw/sw system
- Eases design space exploration effort
- Decreases risk of interface overdesign
- Interconnect Wire Delay Issues
- With Deep Sub-Micron process technologies
- Delay of long wires plays a big role
3. HW/SW Interface Basics
- Connect processor with devices
- Allocate processor ports
- Design glue logic for device access
- Meet timing/throughput constraints
4. 1st Paper
- "Interface Co-Synthesis Techniques for Embedded Systems", Pai Chou, Ross B. Ortega, Gaetano Borriello
- Summary
- Set of techniques for HW/SW interface synthesis
- Generate communication links
- Minimize glue-logic required
- Meet timing constraints
5. IO Types
- Direct
- Simple connection, no glue logic
- Indirect
- Needs glue logic
- When used
- Insufficient IO resources
- Fast device signaling requirements
- Offload work from main processor
6. Design Flow
[Figure: design flow. Inputs: System Behavioral Description, Processor Library, Device Library, Processor List, Device List. Tool: Interface Co-synthesis. Outputs: HW Glue Logic (Verilog), Processor Software, Hardware Netlist.]
7. Design Flow (cont.)
The behavioral description is a high-level, imperative language program written by the user describing the necessary components of the circuit and its functionality. This program has a declarative section and an operational section. The declarative section allocates static storage for data and instantiates peripheral devices. The operational section computes functions and communicates with the peripheral devices via driver calls. This file is a Verilog hybrid, allowing a device listing in the structural part, and a behavioral part describing the device interaction.
Source: The Chinook HW/SW Cosynthesis System, Chou et al.
[Figure labels: System Behavioral Description, processing, CFGhw, CFGsw, processing, HW SEQs, HW Access Routines]
8. Chinook Flow
9. Processor Library Information
- IO Resources
- IO Ports with direction, addressability
- Serial controller (I2C, UART)
- Access routines
- Port expander templates
- Memory bus description
10. Device Library Info
- Ports
- Guarded (can isolate)
- Not guarded
- Interface properties
- Low level access routine info - SEQs
- Processor independent format
- Represents signaling for processor comm.
11. Design Data
- Control Flow Graphs
- Produced from behavioral description
- Default is CFGsw
- Designer or tool can mark for CFGhw
- Output formats
- Hardware connections as a netlist
- Processor software including access routines
- Interface glue logic in Verilog
- Main Algorithm
- Synthesize HW access routines
- Allocate IO resources
- IO ports first
- Generate device drivers
12. IO Port Allocation
- Allocate the N device ports in decreasing order of size
- Guarded can share a port
- Unguarded requires dedicated port
- Not enough ports?
- Make unguarded share
- Forced sharing
- Add latch/tri-state glue logic
- Costs glue logic and a control bit
- Encoding
- Address decoding provides an address for dedicated pins
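The allocation steps above can be sketched in Python. This is a minimal greedy sketch, not the paper's algorithm: the `DevicePort` and `ProcessorPort` classes, the `allocate_ports` function, and the first-fit choice of processor port are all assumptions made for illustration. Port splitting and encoding are left out.

```python
from dataclasses import dataclass, field


@dataclass
class DevicePort:
    name: str
    width: int      # number of pins the device port needs
    guarded: bool   # True if the device can isolate itself from the port


@dataclass
class ProcessorPort:
    name: str
    width: int
    users: list = field(default_factory=list)
    dedicated: bool = False   # True once an unguarded device owns the port


def allocate_ports(device_ports, processor_ports):
    """Return (assignment, forced, unallocated).

    assignment:  device port name -> processor port name
    forced:      unguarded device ports that had to share a port
                 (each would need latch/tri-state glue and a control bit)
    unallocated: device ports that fit nowhere
    """
    assignment, forced, unallocated = {}, [], []
    # Widest device ports first, as on the slide.
    for dev in sorted(device_ports, key=lambda d: d.width, reverse=True):
        wide_enough = [p for p in processor_ports
                       if p.width >= dev.width and not p.dedicated]
        free = [p for p in wide_enough if not p.users]
        if dev.guarded:
            # Guarded ports may share, so take the first port that fits.
            target = wide_enough[0] if wide_enough else None
        elif free:
            # Unguarded ports get a dedicated processor port when possible.
            target = free[0]
            target.dedicated = True
        elif wide_enough:
            # Forced sharing: no free port is left.
            target = wide_enough[0]
            forced.append(dev.name)
        else:
            target = None
        if target is None:
            unallocated.append(dev.name)
        else:
            target.users.append(dev.name)
            assignment[dev.name] = target.name
    return assignment, forced, unallocated
```

Anything left in `unallocated` would be a candidate for memory-mapped IO.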
13. Port Splitting
- Algorithm assumes device port smaller than processor port
- Split guarded ports
- Unguarded ports cannot be split
14. MMIO
- Used if IO port allocation fails
- Requires glue logic
- Algorithm
- Assume all devices can share
- Use forced sharing if needed
- Assign bits on data and address buses
- Allocate address bits for device selection
- One-hot: a single address bit per device
- Binary: n address bits for 2^n devices
- Huffman encoding: variable-length address
- Address fields
- IO prefix used to specify an IO access vs. a memory access
- Device select used for guard control
- Device control for non-guarded devices
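The three device-select encodings trade address bits against decode logic. Below is a small sketch of the bit counts. The helper names are hypothetical, and for the Huffman case the sketch assumes each device carries a weight (for example an expected access frequency); the slides do not say what the weights are.

```python
import heapq
from itertools import count


def one_hot_bits(num_devices):
    """One address bit per device."""
    return num_devices


def binary_bits(num_devices):
    """Smallest n such that 2**n >= num_devices (at least one bit)."""
    return max(1, (num_devices - 1).bit_length())


def huffman_lengths(weights):
    """Map each device to the length of its variable-length select code.

    weights: dict of device name -> weight (an assumption of this sketch).
    Heavier devices end up with shorter codes.
    """
    if len(weights) == 1:
        return {name: 1 for name in weights}
    tie = count()  # tie-breaker so the heap never compares the name lists
    heap = [(w, next(tie), [name]) for name, w in weights.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every code inside them.
        for name in group1 + group2:
            lengths[name] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), group1 + group2))
    return lengths
```

For five devices this gives 5 select bits one-hot and 3 bits binary; the Huffman lengths depend on the weights chosen.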
15. MMIO updated SEQ example
- SEQ updated with MMIO access code.
16. MMIO Example
17. IO Sequencer
- Created from CFGs marked for hw
- Communicates on the processor's behalf
- Generator uses CFGsw and CFGhw
- Outputs
- HW description of sequencer
- SW routines in assembly
- Minimizes pins and hardware
18. IO Sequencer Generation
- Protocol synthesis
- If one SEQ of a device is in hw, then all of its SEQs are in hw
- Limits pins with bandwidth calculation
- W = port width, Pe = minimum time to pass, Se = data size. Make sure W × Pe > Se
- FSM generation
- CFGhw is translated into FSM
- Connections made from FSM to device
- CFGsw is updated to talk to FSM
19. IO Sequencer Template
20. Summary for 1st paper
- These techniques can produce
- Glue logic to interconnect processor and devices
- Device drivers
- Meets the need to assist with design space exploration
- Allows the designer to try hw/sw partitions without designing the interface
- Issues
- Manual marking of CFG for hardware
- Requires extended device/processor libraries
- Great idea that the device access routines are
ISA independent
21. 2nd Paper
- "A Methodology for Correct-by-Construction Latency Insensitive Design", Luca P. Carloni et al.
- (Presented by Ra Roath)
22. Overview
- Latency-Insensitivity Protocol
- Implementation of the protocol
- Channels
- Relay Stations
- Module shells
23. Introduction
- Advent of deep sub-micron process technology
- Generated concerns/predictions of inevitable dominance of wire delay
- Unanimity that long wires will play a significant role in logic synthesis optimization
- How to rectify this? Interconnect optimization techniques
- Interconnect Topology Optimization
- Optimal Buffer Insertion
- Optimal Wire Sizing
- When Delay(wire) > Delay(gates)?
24. Paper's purpose
- Implement a latency insensitive communication protocol
- Given a synchronous design composed of communicating modules → synchronous design that tolerates arbitrary communication latency
- No need to think of the digital system in a completely different way (e.g. asynchronous design)
25. The Methodology
- Given: complete synchronous specification of system and collection of modules
- Communication channels with relay stations
- Encapsulate each module with a shell
- Layout obtained by standard Place & Route tools
- Post-layout optimization: necessary number of relay stations inserted into each critical channel
26. Latency Insensitive vs. Asynchronous
- Delay insensitive circuit operates correctly regardless of delays on gates and wires
- Arbitrary delay is a multiple of the clock period
- A specified synchronous system
- Not asynchronous hand-shaking
- Asynchronous systems require the designer to think about digital systems completely differently
27. Latency Insensitive Protocols
- A protocol that governs the exchange of information in a patient system
- Patient system
- A synchronous system whose functionality depends on the order of events, not on their timing
- On to implementation →
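One way to picture "order, not timing": model a signal as one value per clock cycle, with a void marker on stalled cycles. The sketch below treats two signals as equivalent when they carry the same values in the same order once the voids are dropped. The names `VOID`, `strip_voids`, and `latency_equivalent` are illustrative and not taken from the paper.

```python
VOID = None  # assumed marker for a cycle that carries no data


def strip_voids(signal):
    """Keep only the informative events, in their original order."""
    return [value for value in signal if value is not VOID]


def latency_equivalent(signal_a, signal_b):
    """True if both signals carry the same ordered data, whatever the timing."""
    return strip_voids(signal_a) == strip_voids(signal_b)
```

Under this model `[1, 2, 3]` and `[1, VOID, 2, VOID, VOID, 3]` are equivalent, while `[1, 2, 3]` and `[2, 1, 3]` are not, because the order differs.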
28. Channels
- Channels are point-to-point unidirectional links
- Source/Sink Modules
- Packet Fields
- Payload
- Void
- True Packets
29. Channel Example
30. Channels cont.
- Data transmitted by packets.
- Source
- Puts true packet (void = 0) or void packet (void = 1) on channel.
- Sink
- Decides to store/discard (based on void)
- If stalling, sends a stop flag
- Stop flag tells source that packet cannot be received.
31. Relay Stations
- Packets
- Payload
- Void
- StopOut
- StopIn
- Latches
32. Relay Stations cont.
- At each clock cycle t
- Takes as input packetIn, stopIn
- Outputs packetOut, stopOut
- Decides whether packetOut = packetIn (stalling = 0)
- Or packetOut = packetOut_prev (stalling = 1)
- Internal storage capacity: 2 packets
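The stalling rule above can be exercised with a small cycle-level model. This is a sketch under assumptions, not the paper's circuit: `VOID = None` stands for a void packet, `main` and `aux` are hypothetical names for the two storage slots, and stopOut is derived from stored state only. Because of that last choice the upstream side reacts to a stall one cycle late, which is why the second slot is needed.

```python
VOID = None  # assumed stand-in for a void packet


class RelayStation:
    def __init__(self):
        self.main = VOID  # packet currently driven on packetOut
        self.aux = VOID   # second slot, used only while stalled

    def step(self, packet_in, stop_in):
        """Advance one clock cycle; return (packet_out, stop_out)."""
        packet_out = self.main
        stop_out = self.aux is not VOID  # both slots used: upstream must hold
        if not stop_in:
            # Downstream took packetOut; move the buffered packet forward.
            self.main, self.aux = self.aux, VOID
        # When stop_in is set, main is kept: packetOut = packetOut_prev.
        if packet_in is not VOID and not stop_out:
            if self.main is VOID:
                self.main = packet_in
            else:
                self.aux = packet_in
        return packet_out, stop_out


def simulate(payloads, stall_cycles, cycles):
    """Drive one relay station; the sink stalls on the given cycle numbers."""
    station = RelayStation()
    pending = list(payloads)
    received = []
    for t in range(cycles):
        packet_in = pending[0] if pending else VOID
        stop_in = t in stall_cycles
        packet_out, stop_out = station.step(packet_in, stop_in)
        if pending and not stop_out:
            pending.pop(0)  # the station accepted the packet
        if packet_out is not VOID and not stop_in:
            received.append(packet_out)
    return received
```

In this model a sink that stalls for a few cycles still receives every payload once, in order; only the arrival times change.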
33. Shells
- A shell is a wrapper that encapsulates module M
- Interfaces with channels so that M becomes a patient process
- To do so, make M stallable
- Guarantee input synchronization: internal computation fired only if all inputs have arrived
- Output propagation: send true packets
34. Shells and Modules
- CX: Channels
- MX: Modules
- SX: Shells
- → refer to diagram
35. Shells cont.
- Shells
- Get incoming packets from input channels, filter void packets
- After all input values are received, pass them to M and fire computation
- Get results of M
- If no stop flag is received, send result
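The shell behavior listed above can be sketched as a wrapper around a plain function standing in for M. The `Shell` class and its fields are hypothetical, `VOID = None` is an assumed void marker, and the input queues are unbounded for brevity; a real shell would have bounded buffers and would send stop flags upstream.

```python
from collections import deque

VOID = None  # assumed stand-in for a void packet


class Shell:
    def __init__(self, module, num_inputs):
        self.module = module  # plain function standing in for M
        self.queues = [deque() for _ in range(num_inputs)]
        self.pending = VOID   # result of M waiting to be sent

    def step(self, packets_in, stop_in):
        """One clock cycle: returns the packet placed on the output channel."""
        # Filter void packets; keep true packets per input channel.
        for queue, packet in zip(self.queues, packets_in):
            if packet is not VOID:
                queue.append(packet)
        # Input synchronization: fire M only when every input has a value
        # and the previous result has already been sent.
        if self.pending is VOID and all(self.queues):
            args = [queue.popleft() for queue in self.queues]
            self.pending = self.module(*args)
        # Output propagation: send the result unless a stop flag is raised.
        if self.pending is not VOID and not stop_in:
            result, self.pending = self.pending, VOID
            return result
        return VOID
```

Wrapping a two-input adder this way yields void packets until both operands have arrived and the output is not stalled, then the sum.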
36. Back to Wire Segmentation
- Why is this related to wire interconnects?
- Every wire with latency greater than the clock period can be segmented
- Use relay stations to buffer the wire
- Pipelining a wire
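As a rough illustration of the segmentation rule, the sketch below counts the relay stations for one wire. It assumes the wire is cut into equal segments, each taking at most one clock period, and it ignores the timing overhead of the relay stations themselves. The function name and the picosecond units are choices made for this example.

```python
def relay_stations_needed(wire_delay_ps, clock_period_ps):
    """Relay stations needed so that every wire segment fits in one clock period."""
    if clock_period_ps <= 0:
        raise ValueError("clock period must be positive")
    # Integer ceiling division avoids floating-point rounding.
    segments = max(1, -(-wire_delay_ps // clock_period_ps))
    return segments - 1
```

A 2500 ps wire under a 1000 ps clock is cut into three segments and needs two relay stations; a wire no slower than the clock needs none.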
37. Procedure
- Start with collection of synchronous modules
- Synthesize layout
- Segment every wire with latency greater than the clock period, and add relay stations
- Build shell around each module to obtain patient processes
- Patient processes interact with relay stations
38. Conclusions of 2nd paper
- Interconnect Delay
- Alleviated by segmenting interconnect wire
- Add relay stations to segmented wire
- Add shells to modules to interact with relay stations and other shelled modules
- Questions?
39. The End.