Lecture 19: Case Study of SoC Design
1
ECE 412 Microcomputer Laboratory
  • Lecture 19: Case Study of SoC Design

2
Outline
  • Web server example
  • MP3 example

3
Example Embedded web server application
  • Basic web server capable of responding to simple
    HTTP requests
  • Simple CGI requests for dynamic HTML
  • Read a timer peripheral before, during, and after
    servicing an HTTP request to log throughput
    calculations, which are then displayed on a
    dynamically generated web page
  • Simple read only file system was implemented
    using flash memory to store static web pages and
    JPEG images

4
Throughput calculations
  • Transmission throughput
  • Reflects the latency from the start of sending
    the first TCP packet containing the HTTP response
    until the file is completely sent
  • Could theoretically reach a maximum of 10 Mbps
  • The raw network speed that the CPU and TCP/IP
    stack are capable of sustaining
  • HTTP server throughput
  • Takes into account all delay between the incoming
    HTTP connection request and file-send completion
  • Includes the transmission latency above
  • Also measures the time the HTTP server took to
    open a TCP connection to the host
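As a sketch, both throughput figures can be computed from timer snapshots like this. The timer frequency and snapshot pairing are assumptions for illustration, not taken from the actual system:

```c
#include <stdint.h>

/* Assumed timer frequency; the actual Avalon timer clock may differ. */
#define TICKS_PER_SEC 33000000u  /* 33 MHz */

/* Throughput in bits per second for `bytes` sent between two timer
 * snapshots.  Transmission throughput uses the first-TCP-packet and
 * send-complete snapshots; HTTP server throughput instead uses the
 * connection-request and send-complete snapshots. */
double throughput_bps(uint32_t start_ticks, uint32_t end_ticks,
                      uint32_t bytes)
{
    double seconds = (double)(end_ticks - start_ticks) / TICKS_PER_SEC;
    return (bytes * 8.0) / seconds;
}
```

With the baseline's 60 KB file taking roughly 156 ms to transmit, this works out to about 3 Mbps, comfortably below the theoretical 10 Mbps ceiling.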

5
Baseline system
  • The web server was put to the test, serving JPEG
    images of varying sizes across the LAN to a host
    PC
  • During each transfer, several snapshots of the
    timer peripheral were taken

6
Baseline system dataflow
[Diagram: Nios CPU (instruction and data masters) connected via the Avalon bus to FLASH, SRAM, Ethernet MAC, and UART/IO/timer peripherals]
The Nios CPU's data master port is used to read
data memory (SRAM) and write to the Ethernet MAC.
This occurs for each packet transmitted in the
baseline system.
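The per-packet CPU copy described above can be sketched as follows; the word-based FIFO interface and pointer names are hypothetical:

```c
#include <stdint.h>

/* Baseline dataflow: the CPU's data master reads each word of the
 * packet from SRAM and writes it to the Ethernet MAC's TX register,
 * one bus transaction per word.  The register layout is hypothetical. */
void cpu_tx_packet(volatile uint32_t *mac_tx_fifo,
                   const uint32_t *buf, int words)
{
    for (int i = 0; i < words; i++)
        *mac_tx_fifo = buf[i];  /* CPU is busy for the whole transfer */
}
```

Because the CPU itself shuttles every word, it can do no other useful work while a packet is in flight, which is exactly what the DMA optimization below removes.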
7
Performance optimization
  • Using a DMA to transfer data from incoming
    packets into memory without the intervention of
    the microprocessor
  • The use of a custom peripheral to do the checksum
    calculation
  • The combination of the two
  • Optimization of the slave-arbitration priority
    for the memories to provide maximum data
    throughput

8
Dataflow enhancement with DMA
[Diagram: Nios CPU and DMA controller (read/write masters) sharing the Avalon bus through an arbitrator, with FLASH, SRAM, Ethernet MAC, and UART/IO/timer peripherals; data flows between the Ethernet MAC and SRAM through the DMA]
  • A DMA controller transfers packets between the
    Ethernet MAC and data memory
  • The CPU gets higher priority on any conflicts
    with the DMA
  • During a DMA transfer, the CPU is free to access
    other peripherals
  • For access to the shared SRAM, arbitration is
    performed
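A minimal sketch of driving such a DMA transfer, assuming a hypothetical register map; the real Avalon DMA controller's register layout and bit positions will differ:

```c
#include <stdint.h>

/* Hypothetical DMA controller register map. */
typedef struct {
    volatile uint32_t status;    /* bit 0: DONE (assumed) */
    volatile uint32_t readaddr;  /* source address */
    volatile uint32_t writeaddr; /* destination address */
    volatile uint32_t length;    /* transfer length in bytes */
    volatile uint32_t control;   /* bit 3: GO (assumed) */
} dma_regs_t;

#define DMA_CTRL_GO   (1u << 3)  /* hypothetical bit position */
#define DMA_STAT_DONE (1u << 0)  /* hypothetical bit position */

/* Move one packet between the Ethernet MAC and SRAM without the CPU
 * touching the payload; while the transfer runs, the CPU is free to
 * access other peripherals. */
void dma_transfer(dma_regs_t *dma, uint32_t src, uint32_t dst,
                  uint32_t len)
{
    dma->readaddr  = src;
    dma->writeaddr = dst;
    dma->length    = len;
    dma->control   = DMA_CTRL_GO;

    while (!(dma->status & DMA_STAT_DONE))
        ;  /* in practice, do other work or wait for the DMA interrupt */
}
```

In a real driver the busy-wait would be replaced by the DMA-done interrupt, so the CPU overlaps useful work with the transfer instead of polling.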

9
Performance improvement
Transmission throughput is doubled compared to the
baseline. The entire HTTP server throughput is
about 2.5X that of the baseline. Logic resource
usage increases by 36% (3,600 logic elements).
10
TCP checksum
  • Checksum calculations can be regarded as a
    necessary evil in dataflow-sensitive applications
  • For a 1300-byte payload, the calculation takes
    33,000 clock cycles
  • At a 33 MHz clock speed, that is 1 ms of latency
    for each maximum-size packet
  • In the benchmark, the largest file (60 KB) breaks
    down into 46 maximum-sized packets
  • 46 ms out of 156 ms transmission latency in the
    baseline
  • The inner loop of the TCP/IP stack checksum
    performs a 16-bit one's-complement checksum
    calculation
  • Adding up data repeatedly is a simple task for
    hardware
  • A Verilog implementation can be designed
  • The checksum peripheral operates by
  • Reading the payload contents directly out of data
    memory
  • Performing the checksum calculation
  • Storing the result in a CPU-addressable register
  • It now takes 386 clock cycles
  • A speedup of 90X over the software version
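The software inner loop being replaced is the standard Internet checksum (RFC 1071); a C version might look like this:

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit one's-complement sum over the payload (the Internet checksum
 * of RFC 1071).  This is the computation the custom peripheral
 * replaces in hardware. */
uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Sum the payload as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                    /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back in (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;      /* one's complement of the sum */
}
```

The hardware peripheral performs this same summation while streaming payload words directly out of SRAM, which is how it reduces the 33,000 cycles to 386.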

11
Checksum peripheral
[Diagram: Nios CPU and checksum peripheral (read master) sharing the Avalon bus through an arbitrator, with FLASH, SRAM, and UART/IO/timer peripherals; the peripheral reads payload data directly from SRAM]
  • Again, for access to the shared SRAM, arbitration
    is performed

12
Performance boost
Transmission latency decreased by 44 ms. Average
transmission throughput increased by 40% and
average HTTP throughput by 25% over the baseline.
Resource usage is a 22% increase over the baseline
(3,250 logic elements).
13
Putting it all together
14
Embedded uP systems in Xilinx FPGA
A traditional embedded microprocessor system as
implemented on a platform FPGA, versus a
co-processor architecture with multiple hardware
accelerators:
1. Start by developing for the first architecture
2. Automatically generate the second architecture
under the control of the user
15
Profiling results
DCT32 and IMDCT36 perform the discrete cosine
transform and inverse discrete cosine transform,
respectively. The other functions are
multiply-accumulate functions of various sizes.
These functions comprise over 90% of the total
application execution time on the host.
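As an illustration, the multiply-accumulate hot spots have the shape of this generic fixed-point loop; the function name and types are illustrative, not taken from the actual MP3 source:

```c
#include <stdint.h>

/* Generic n-tap multiply-accumulate: the kind of inner loop that
 * profiling flags for hardware acceleration in the MP3 decoder.
 * A 64-bit accumulator avoids overflow of the 32x32 products. */
int64_t mac_n(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];  /* one multiply-accumulate per tap */
    return acc;
}
```

Loops of this shape map naturally onto a pipelined hardware MAC unit, which is why the profiler's top functions are good acceleration candidates.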
16
Design automation
  • Implement co-processor accelerators to meet
    performance requirements.
  • Use the tagging facilities in the Xilinx design
    environment to mark functions for hardware
    acceleration
  • Compile for the target
  • The tool chain will create an implementation that
    includes a MicroBlaze processor and the same
    interfaces as before
  • Augmented with three hardware accelerators that
    implement the multiplications, the DCT, and the
    inverse DCT
  • The creation of the hardware accelerator blocks
    is done automatically
  • Using an advanced C-to-hardware compiler
    optimized for platform FPGAs
  • The stitching of the accelerators into the new
    co-processing architecture.
  • Handling the movement of the appropriate data to
    and from the accelerators.

17
New architecture
18
Final results
Enables the MP3 application to run in real time
at a system clock rate of 67.5 MHz.
19
A simple summary
  • Platform-based design involves hardware/software
    codesign
  • The right design decisions can provide a
    significant amount of performance improvement
  • A careful tradeoff is needed between performance,
    resource usage, cost, and design time
  • Platform FPGAs are a convenient, low-cost
    platform for such a task

20
Overview of the Rest of the Semester
  • This is the last formal lecture
  • If we haven't covered it already, we can't really
    expect you to use it on your projects
  • Final project proposal is this Thursday
  • Two teams; each team has 20-25 minutes
  • Proposal presentations can be sent to me by email
    before class or brought in on a flash drive
  • Initial report due on 4/20
  • Three pages (four at most)
  • May contain introduction, background, motivation,
    impact, block diagram, and workload partition
    among team members
  • Goal: give us enough information that we can
    provide feedback about project complexity and
    suggestions
  • From now on, I'll have office hours during class
    meeting times to discuss final-project-related
    issues
  • Final project presentation: 5/12 (sometime
    between 1 and 4 pm?)
  • Final project report/demo due 5/15, with a
    no-cost extension to 5/18
  • For details, refer to Lecture 14