Lecture 19: Case Study of SoC Design
1
ECE 412 Microcomputer Laboratory
  • Lecture 19: Case Study of SoC Design

2
Outline
  • Web server example
  • MP3 example

3
Example Embedded web server application
  • Basic web server capable of responding to simple
    HTTP requests
  • Simple CGI requests for dynamic HTML
  • Read a timer peripheral before, during, and after
    servicing an HTTP request to log throughput
    calculations, which are then displayed on a
    dynamically generated web page
  • Simple read only file system was implemented
    using flash memory to store static web pages and
    JPEG images

4
Throughput calculations
  • Transmission throughput
  • Reflects the latency from the start of sending
    the first TCP packet containing the HTTP response
    until the file is completely sent
  • Could theoretically reach a maximum of 10 Mbps
  • The raw network speed that the CPU and TCP/IP
    stack are capable of sustaining
  • HTTP server throughput
  • Takes into account all delay between the incoming
    HTTP connection request and file-send completion
  • Includes the transmission latency above
  • Also measures the time the HTTP server took to
    open a TCP connection to the host
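As a sketch, both throughput figures can be computed from timer snapshots like this. The timer frequency and snapshot pairing are assumptions for illustration, not taken from the actual system:

```c
#include <stdint.h>

/* Assumed timer frequency; the actual Avalon timer clock may differ. */
#define TICKS_PER_SEC 33000000u  /* 33 MHz */

/* Throughput in bits per second for `bytes` sent between two timer
 * snapshots.  Transmission throughput uses the first-TCP-packet and
 * send-complete snapshots; HTTP server throughput instead uses the
 * connection-request and send-complete snapshots. */
double throughput_bps(uint32_t start_ticks, uint32_t end_ticks,
                      uint32_t bytes)
{
    double seconds = (double)(end_ticks - start_ticks) / TICKS_PER_SEC;
    return (bytes * 8.0) / seconds;
}
```

With the baseline's 60 KB file taking roughly 156 ms to transmit, this works out to about 3 Mbps, comfortably below the theoretical 10 Mbps ceiling.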

5
Baseline system
  • The web server was put to the test, serving JPEG
    images of varying sizes across the LAN to a host
    PC
  • During each transfer, several snapshots of the
    timer peripheral were taken

6
Baseline system dataflow
[Diagram: Nios CPU (instruction and data masters) connected via the Avalon bus to FLASH, SRAM, Ethernet MAC, and UART/IO/timer peripherals]
The Nios CPU's data master port is used to read
data memory (SRAM) and write to the Ethernet MAC.
This occurs for each packet transmitted in the
baseline system.
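The per-packet CPU copy described above can be sketched as follows; the word-based FIFO interface and pointer names are hypothetical:

```c
#include <stdint.h>

/* Baseline dataflow: the CPU's data master reads each word of the
 * packet from SRAM and writes it to the Ethernet MAC's TX register,
 * one bus transaction per word.  The register layout is hypothetical. */
void cpu_tx_packet(volatile uint32_t *mac_tx_fifo,
                   const uint32_t *buf, int words)
{
    for (int i = 0; i < words; i++)
        *mac_tx_fifo = buf[i];  /* CPU is busy for the whole transfer */
}
```

Because the CPU itself shuttles every word, it can do no other useful work while a packet is in flight, which is exactly what the DMA optimization below removes.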
7
Performance optimization
  • Using a DMA to transfer data from incoming
    packets into memory without the intervention of
    the microprocessor
  • The use of a custom peripheral to do the checksum
    calculation
  • The combination of the two
  • Optimization of the slave-arbitration priority
    for the memories to provide maximum data
    throughput

8
Dataflow enhancement with DMA
[Diagram: Nios CPU and DMA controller (read/write masters) sharing the Avalon bus through an arbitrator, with FLASH, SRAM, Ethernet MAC, and UART/IO/timer peripherals; data flows between the Ethernet MAC and SRAM through the DMA]
  • A DMA controller transfers packets between the
    Ethernet MAC and data memory
  • The CPU gets higher priority on any conflicts
    with the DMA
  • During a DMA transfer, the CPU is free to access
    other peripherals
  • For access to the shared SRAM, arbitration is
    performed
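A minimal sketch of driving such a DMA transfer, assuming a hypothetical register map; the real Avalon DMA controller's register layout and bit positions will differ:

```c
#include <stdint.h>

/* Hypothetical DMA controller register map. */
typedef struct {
    volatile uint32_t status;    /* bit 0: DONE (assumed) */
    volatile uint32_t readaddr;  /* source address */
    volatile uint32_t writeaddr; /* destination address */
    volatile uint32_t length;    /* transfer length in bytes */
    volatile uint32_t control;   /* bit 3: GO (assumed) */
} dma_regs_t;

#define DMA_CTRL_GO   (1u << 3)  /* hypothetical bit position */
#define DMA_STAT_DONE (1u << 0)  /* hypothetical bit position */

/* Move one packet between the Ethernet MAC and SRAM without the CPU
 * touching the payload; while the transfer runs, the CPU is free to
 * access other peripherals. */
void dma_transfer(dma_regs_t *dma, uint32_t src, uint32_t dst,
                  uint32_t len)
{
    dma->readaddr  = src;
    dma->writeaddr = dst;
    dma->length    = len;
    dma->control   = DMA_CTRL_GO;

    while (!(dma->status & DMA_STAT_DONE))
        ;  /* in practice, do other work or wait for the DMA interrupt */
}
```

In a real driver the busy-wait would be replaced by the DMA-done interrupt, so the CPU overlaps useful work with the transfer instead of polling.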

9
Performance improvement
Transmission throughput is doubled compared to the
baseline. The entire HTTP server throughput is
about 2.5X that of the baseline. Logic resource
usage increases by 36% (3,600 logic elements).
10
TCP checksum
  • Checksum calculations can be regarded as a
    necessary evil in dataflow-sensitive applications
  • For a 1300-byte payload, the calculation takes
    33,000 clock cycles
  • At a 33 MHz clock speed, that is 1 ms of latency
    for each maximum-size packet
  • In the benchmark, the largest file (60 KB) breaks
    down into 46 maximum-sized packets
  • 46 ms out of 156 ms transmission latency in the
    baseline
  • The inner loop of the TCP/IP stack checksum
    performs a 16-bit one's-complement checksum
    calculation
  • Adding up data repeatedly is a simple task for
    hardware
  • A Verilog implementation can be designed
  • The checksum peripheral operates by
  • Reading the payload contents directly out of data
    memory
  • Performing the checksum calculation
  • Storing the result in a CPU-addressable register
  • It now takes 386 clock cycles
  • A speedup of 90X over the software version
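The software inner loop being replaced is the standard Internet checksum (RFC 1071); a C version might look like this:

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit one's-complement sum over the payload (the Internet checksum
 * of RFC 1071).  This is the computation the custom peripheral
 * replaces in hardware. */
uint16_t ip_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Sum the payload as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                    /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back in (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;      /* one's complement of the sum */
}
```

The hardware peripheral performs this same summation while streaming payload words directly out of SRAM, which is how it reduces the 33,000 cycles to 386.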

11
Checksum peripheral
[Diagram: Nios CPU and checksum peripheral (read master) sharing the Avalon bus through an arbitrator, with FLASH, SRAM, and UART/IO/timer peripherals; the peripheral reads payload data directly from SRAM]
  • Again, for access to the shared SRAM, arbitration
    is performed

12
Performance boost
Transmission latency decreased by 44 ms. Average
transmission throughput increased by 40% and
average HTTP throughput by 25% over the baseline.
Resource usage is a 22% increase over the baseline
(3,250 logic elements).
13
Putting it all together
14
Embedded uP systems in Xilinx FPGA
A traditional embedded microprocessor system as
implemented on a platform FPGA, versus a
co-processor architecture with multiple hardware
accelerators:
1. Start by developing for the first architecture
2. Automatically generate the second architecture
under the control of the user
15
Profiling results
DCT32 and IMDCT36 perform the discrete cosine
transform and inverse discrete cosine transform,
respectively. The other functions are
multiply-accumulate functions of various sizes.
These functions comprise over 90% of the total
application execution time on the host.
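As an illustration, the multiply-accumulate hot spots have the shape of this generic fixed-point loop; the function name and types are illustrative, not taken from the actual MP3 source:

```c
#include <stdint.h>

/* Generic n-tap multiply-accumulate: the kind of inner loop that
 * profiling flags for hardware acceleration in the MP3 decoder.
 * A 64-bit accumulator avoids overflow of the 32x32 products. */
int64_t mac_n(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];  /* one multiply-accumulate per tap */
    return acc;
}
```

Loops of this shape map naturally onto a pipelined hardware MAC unit, which is why the profiler's top functions are good acceleration candidates.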
16
Design automation
  • Implement co-processor accelerators to meet
    performance requirements.
  • Use the tagging facilities in the Xilinx design
    environment to mark functions for hardware
    acceleration
  • Compile for the target
  • The tool chain will create an implementation that
    includes a MicroBlaze processor and the same
    interfaces as before
  • Augmented with three hardware accelerators that
    implement the multiplications, the DCT, and the
    inverse DCT
  • The creation of the hardware accelerator blocks
    is done automatically
  • Using an advanced C-to-hardware compiler
    optimized for platform FPGAs
  • The stitching of the accelerators into the new
    co-processing architecture.
  • Handling the movement of the appropriate data to
    and from the accelerators.

17
New architecture
18
Final results
Enables the MP3 application to run in real time
at a system clock rate of 67.5 MHz.
19
A simple summary
  • Platform-based design involves hardware/software
    codesign
  • The right design decisions can provide a
    significant amount of performance improvement
  • A careful tradeoff is needed between performance,
    resource usage, cost, and design time
  • Platform FPGAs are a convenient, low-cost
    platform for such a task

20
Overview of the Rest of the Semester
  • This is the last formal lecture
  • If we haven't covered it already, we can't really
    expect you to use it on your projects
  • Final project proposal is this Thursday
  • Two teams; each team has 20-25 minutes
  • Proposal presentations can be sent to me by email
    before class or brought in on a flash drive
  • Initial report due on 4/20
  • Three pages (four at most)
  • May contain introduction, background, motivation,
    impact, block diagram, and workload partition
    among team members
  • Goal: give us enough information that we can
    provide feedback about project complexity and
    suggestions
  • From now on, I'll have office hours during class
    meeting times to discuss final-project-related
    issues
  • Final project presentation: 5/12 (sometime
    between 1 and 4 pm?)
  • Final project report/demo due 5/15, with a
    no-cost extension to 5/18
  • For details, refer to Lecture 14