Title: Lecture 19: Case Study of SoC Design
ECE 412 Microcomputer Laboratory
Outline
- Web server example
- MP3 example
Example: Embedded web server application
- Basic web server capable of responding to simple HTTP requests
  - Simple CGI requests for dynamic HTML
- Read a timer peripheral before, during, and after servicing an HTTP request to log throughput calculations, which are then displayed on a dynamically generated web page
- A simple read-only file system was implemented using flash memory to store static web pages and JPEG images
Throughput calculations
- Transmission throughput
  - Reflects the latency from starting to send the first TCP packet containing the HTTP response until the file is completely sent
  - Could theoretically reach a maximum of 10 Mbps
  - The raw network speed that the CPU and TCP/IP stack are capable of sustaining
- HTTP server throughput
  - Takes into account all delay between the incoming HTTP connection request and file-send completion
  - Includes the transmission latency above
  - Also measures the time the HTTP server took to open a TCP connection to the host
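As a concrete illustration, both figures can be computed from timer-peripheral snapshots taken at three points in the request; this is a minimal sketch (the function name and the microsecond timebase are assumptions, not the lecture's actual code):

```c
#include <stdint.h>

/* Throughput in bits per second from a byte count and an elapsed time
 * in microseconds (the difference of two timer-peripheral snapshots).
 * A 64-bit intermediate avoids overflow for large transfers. */
static uint32_t throughput_bps(uint32_t bytes, uint32_t elapsed_us)
{
    if (elapsed_us == 0)
        return 0;
    return (uint32_t)(((uint64_t)bytes * 8u * 1000000u) / elapsed_us);
}

/* Snapshot the timer at connection accept (t_accept), at the first TCP
 * packet of the response (t_first), and at send completion (t_done):
 *   transmission throughput = throughput_bps(len, t_done - t_first)
 *   HTTP server throughput  = throughput_bps(len, t_done - t_accept)
 */
```

Note that a 1250-byte (10,000-bit) packet sent in 1 ms corresponds exactly to the 10 Mbps theoretical maximum above.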
Baseline system
- The web server was put to the test, serving JPEG images of varying sizes across the LAN to a host PC
- During each transfer, several snapshots of the timer peripheral were taken
Baseline system dataflow

[Block diagram: Nios CPU (instruction master, data master) connected over the Avalon Bus to FLASH, the Ethernet MAC, UART/IO/Timer peripherals, and SRAM; data flows from SRAM through the CPU to the Ethernet MAC]

The Nios CPU's data master port is used to read data memory (SRAM) and write to the Ethernet MAC. This would occur for each packet transmitted in the baseline system.
Performance optimization
- Use a DMA controller to transfer data from incoming packets into memory without intervention by the microprocessor
- Use a custom peripheral to do the checksum calculation
- The combination of the two
- Optimize the slave-arbitration priority for the memories to provide maximum data throughput
Dataflow enhancement with DMA

[Block diagram: Nios CPU (instruction master, data master) and DMA controller (read master, write master) on the Avalon Bus; an arbitrator mediates access to the shared SRAM; data flows between the Ethernet MAC and SRAM through the DMA controller, alongside FLASH and UART/IO/Timer peripherals]

- A DMA controller transfers packets between the Ethernet MAC and data memory
- The CPU has higher priority for any conflicts with the DMA
- During a DMA transfer, the CPU is free to access other peripherals
- For access to the shared SRAM, arbitration is performed
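Driving such a transfer from software amounts to programming a few memory-mapped registers; the sketch below is illustrative only (the register names, layout, and bit positions are assumptions, not the actual Nios DMA peripheral's map):

```c
#include <stdint.h>

/* Hypothetical register layout for a simple Avalon DMA controller.
 * Field names and bit positions are illustrative; the real
 * peripheral's register map may differ. */
typedef struct {
    volatile uint32_t status;    /* bit 0: DONE                        */
    volatile uint32_t readaddr;  /* source address (e.g. Ethernet MAC) */
    volatile uint32_t writeaddr; /* destination address (e.g. SRAM)    */
    volatile uint32_t length;    /* bytes to transfer                  */
    volatile uint32_t control;   /* GO bit starts the transfer         */
} dma_regs;

#define DMA_CTRL_GO   (1u << 3)
#define DMA_STAT_DONE (1u << 0)

/* Kick off a packet copy; the CPU is then free to service other
 * peripherals until the DONE bit is set (or an interrupt fires). */
static void dma_start(dma_regs *dma, uint32_t src, uint32_t dst, uint32_t len)
{
    dma->status    = 0;           /* clear stale completion status */
    dma->readaddr  = src;
    dma->writeaddr = dst;
    dma->length    = len;
    dma->control   = DMA_CTRL_GO; /* start; the DMA's masters now
                                     arbitrate for SRAM with the CPU */
}
```

Because the CPU is given the higher arbitration priority, its occasional SRAM accesses stall the DMA briefly rather than the other way around.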
Performance improvement
- Transmission throughput is doubled compared to the baseline
- Overall HTTP server throughput is about 2.5X that of the baseline
- 36% increase in logic resource usage (3600 logic elements)
TCP checksum
- Checksum calculation can be regarded as a necessary evil in dataflow-sensitive applications
- For a 1300-byte payload, it takes 33,000 clock cycles
  - At a 33 MHz clock speed, that is about 1 ms of latency for each maximum-size packet
- In the benchmark, the largest file (60 KB) breaks down into 46 maximum-sized packets
  - 46 ms out of the 156 ms transmission latency in the baseline
- The inner loop of the TCP/IP stack's checksum performs a 16-bit ones' complement checksum calculation
- Adding up data repeatedly is a simple task for hardware, so a Verilog implementation can be designed
- The checksum peripheral operates by
  - Reading the payload contents directly out of data memory
  - Performing the checksum calculation
  - Storing the result in a CPU-addressable register
- It now takes 386 clock cycles
  - A speedup of about 90X over the software version
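The software inner loop that the peripheral replaces is the standard 16-bit ones' complement Internet checksum (RFC 1071); a minimal C version might look like this (the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit ones' complement sum over a payload: the classic Internet
 * checksum used by TCP.  This is the loop the hardware peripheral
 * replaces. */
static uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Sum 16-bit big-endian words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    /* An odd trailing byte is padded with zero. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back in (ones' complement addition). */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The loop body is one load, one add, and a carry fold per word, which is why a simple hardware adder reading SRAM directly can do the same work in a few hundred cycles.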
Checksum peripheral

[Block diagram: Nios CPU (instruction master, data master) and checksum peripheral (read master) on the Avalon Bus; an arbitrator mediates access to the shared SRAM; data flows from SRAM into the checksum peripheral, alongside FLASH and UART/IO/Timer peripherals]

- Again, for access to the shared SRAM, arbitration is performed
Performance boost
- Transmission latency decreased by 44 ms
- Average transmission throughput increased by 40% and average HTTP throughput by 25% over the baseline
- Resource usage: 22% increase over the baseline (3250 logic elements)
Putting it all together
Embedded uP systems in Xilinx FPGA
- Traditional embedded microprocessor system, as implemented on a platform FPGA
- Co-processor architecture with multiple hardware accelerators
  1. Start by developing for the first architecture
  2. Automatically generate the second architecture under the control of the user
Profiling results
DCT32 and IMDCT36 perform the discrete cosine transform and inverse discrete cosine transform, respectively. The other functions are multiply-accumulate functions of various sizes. These functions account for over 90% of the total application execution time on the host.
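The multiply-accumulate kernels named above are, at heart, fixed-point dot products; a minimal sketch of their shape (names and widths are illustrative, not the decoder's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/* The shape of a multiply-accumulate kernel like those dominating the
 * MP3 decoder's profile: a fixed-point dot product, accumulated in
 * 64 bits to avoid overflow. */
static int64_t mac(const int32_t *coef, const int32_t *sample, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)coef[i] * sample[i];
    return acc;
}
```

A loop this regular, with no data-dependent control flow, is exactly the kind of function a C-to-hardware compiler can turn into a pipelined accelerator.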
Design automation
- Implement co-processor accelerators to meet performance requirements
- Use the tagging facilities in the Xilinx design environment to mark the functions for hardware acceleration
- Compile for the target
  - The tool chain creates an implementation that includes a MicroBlaze processor and the same interfaces as before
  - Augmented with three hardware accelerators that implement the multiplications, DCT, and inverse DCT
- The creation of the hardware accelerator blocks is done automatically
  - Uses an advanced C-to-hardware compiler optimized for platform FPGAs
  - Stitches the accelerators into the new co-processing architecture
  - Handles the movement of the appropriate data to and from the accelerators
New architecture
Final results
Enables the MP3 application to run in real time at a system clock rate of 67.5 MHz.
A simple summary
- Platform-based design involves hardware/software codesign
- The right design decisions can provide a significant amount of performance improvement
- A careful tradeoff is needed between performance, resource usage, cost, and design time
- Platform FPGAs are a convenient, low-cost platform for such a task
Overview of the Rest of the Semester
- This is the last formal lecture
  - If we haven't covered it already, we can't really expect you to use it on your projects
- Final project proposal is this Thursday
  - Two teams; each team has 20-25 minutes
  - Proposal presentations can be sent to me through email before class or brought in on flash memory
- Initial report due on 4/20
  - Three pages (four at most)
  - May contain introduction, background, motivation, impact, block diagram, and workload partition among team members
  - Goal: give us enough information that we can provide feedback about project complexity and suggestions
- From now on, I'll have office hours during class meeting times to discuss final-project-related issues
- Final project presentation: 5/12 (sometime between 1 and 4 pm?)
- Final project report/demo: due 5/15, with a no-cost extension to 5/18
- Details: refer to Lecture 14