1
Lab 2: Parallel processing using NIOS II processors
  • CEG 4131 Computer Architecture III
  • Miodrag Bolic

2
Overview
  • You will learn how to:
    • Design multiprocessing systems that use shared memories
    • Partition a sequential program so that it can be implemented on multiple processors
    • Synchronize a multiprocessing system
  • Time: 3 weeks
  • Points: 115 (there is an optional task)

3
Overview
  • Part 1:
    • Design a multiprocessing system by following the steps from the tutorial. Run and debug the program that comes with the tutorial.
  • Part 2:
    • Use the same hardware designed in Part 1.
    • Develop a program for parallel matrix multiplication and run it on the multiprocessing system.
    • Compute the speedup of the program when it runs on a single processor and on the multiprocessing system.

4
Part 1
  • Copy the project C:\altera\kits\nios2\examples\vhdl\niosII_stratix_1s10\standard to your home directory.
  • Go through the steps of the tutorial "Creating Multiprocessor Nios II System Tutorial". You can download the tutorial from tt_nios2_multiprocessor_tutorial.pdf and the program from http://www.altera.com/literature/tt/hello_world_multi.c
  • Modification: on page 30 of the tutorial, choose the NIOS II/s core for CPU3 instead of NIOS II/e. All three cores have to be NIOS II/s. Change the instruction cache size for all three of them to 4 kBytes.
  • Before generating and compiling on page 36 of the tutorial, do the following:
    • Add a performance counter in the same way as in Lab 1. Connect the performance_counter only to the data master of CPU1.
    • Add an on-chip memory block and configure it as shown on the next page. Connect the s1 port to cpu1/data_master and cpu2/data_master. Connect the s2 port to cpu3/data_master.
  • Continue with the tutorial.

5
On-chip memory configuration
6
Task 1 - Demonstration and Questions
  • Show the TA that the program is working (20 points).
  • Questions:
    • Describe the program in detail.
    • Why do we need the mutex? (see the usage sketch below)
    • If processor 1 gets the mutex for the memory message_buffer_ram, can processor 2 write to this memory before processor 1 releases the mutex?
    • Can processor 1 store two messages in the buffer?
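
For reference when answering the mutex questions, here is a minimal sketch of how one processor typically claims the hardware mutex through the Altera HAL before writing into message_buffer_ram. The device name "/dev/message_buffer_mutex", the buffer pointer and the lock value are assumptions; the real names come from the component names in your SOPC Builder system (see system.h).

    /* Minimal sketch, assuming the mutex component is named message_buffer_mutex
       in SOPC Builder; check system.h for the actual device name. */
    #include <altera_avalon_mutex.h>

    void write_message_safely(volatile char *msg_buf, const char *msg)
    {
        /* Open the hardware mutex shared by all three CPUs. */
        alt_mutex_dev *mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex");

        /* Spin until this CPU owns the mutex; the second argument is the value
           stored in the mutex register while it is held (any non-zero value). */
        altera_avalon_mutex_lock(mutex, 1);

        /* Critical section: only the owner writes to message_buffer_ram. */
        while (*msg != '\0') {
            *msg_buf++ = *msg++;
        }
        *msg_buf = '\0';

        /* Release the mutex so another processor can claim it. */
        altera_avalon_mutex_unlock(mutex);
    }

Without the lock/unlock pair, two processors could interleave their writes to the same buffer; this is why every access to message_buffer_ram in the tutorial program is guarded by the mutex.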

7
Part 2
  • In this part, the same hardware configuration will be used.
  • You will design a program for parallel matrix multiplication.
  • Problem:
    • There is an input/output module which receives and stores data in matrices M1 and M2. We will simulate this module using the shared_memory module that we added in the first part of the lab. Our program multiplies these two matrices and stores the result C in the same module (memory).

8
Sequential solution
  • Program the Altera chip using the same configuration from Part 1.
  • Modify the matrix_performance.c file so that matrices M1, M2 and C are transferred to the shared_memory. Do this step before activating the performance counter. Change the number of iterations in the matrix multiplication from 100 to 1000 (a measurement skeleton is sketched below).
  • Change the C/C++ options in your project and in the syslib project from Debug to Release.
  • Run the code and present the performance counter results and the matrix C obtained in iteration 1000.
  • Demonstration: show the result to the TA.
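
A possible measurement skeleton for the sequential run, assuming the standard performance-counter macros from the Altera HAL. The symbols PERFORMANCE_COUNTER_0_BASE and SHARED_ONCHIP_BASE, the matrix offsets inside the shared memory, and the example input data are all assumptions; use the names generated in your own system.h.

    /* Sketch only: base-address macros, offsets and input data are assumptions. */
    #include <stdio.h>
    #include "system.h"
    #include "altera_avalon_performance_counter.h"

    #define N 10

    int main(void)
    {
        /* M1, M2 and C are placed in the shared on-chip memory added in Part 1. */
        volatile int (*M1)[N] = (volatile int (*)[N]) SHARED_ONCHIP_BASE;           /* assumed name   */
        volatile int (*M2)[N] = (volatile int (*)[N])(SHARED_ONCHIP_BASE + 0x400);  /* assumed offset */
        volatile int (*C)[N]  = (volatile int (*)[N])(SHARED_ONCHIP_BASE + 0x800);  /* assumed offset */
        int i, j, k, iter;

        /* Transfer the input matrices before activating the performance counter. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                M1[i][j] = i + j;   /* example data */
                M2[i][j] = i - j;
            }

        PERF_RESET(PERFORMANCE_COUNTER_0_BASE);
        PERF_START_MEASURING(PERFORMANCE_COUNTER_0_BASE);

        /* 1000 iterations of the 10x10 multiplication. */
        for (iter = 0; iter < 1000; iter++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++) {
                    C[i][j] = 0;
                    for (k = 0; k < N; k++)
                        C[i][j] += M1[i][k] * M2[k][j];
                }

        PERF_STOP_MEASURING(PERFORMANCE_COUNTER_0_BASE);

        /* Print the clock-cycle count and the matrix C from the last iteration. */
        printf("Total cycles: %lu\n",
               (unsigned long) perf_get_total_time((void *) PERFORMANCE_COUNTER_0_BASE));
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++)
                printf("%d ", C[i][j]);
            printf("\n");
        }
        return 0;
    }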

9
Parallel solution
  • CPU 1 will be used for synchronization and for I/O operations, while CPUs 2 and 3 are used for the multiplication. CPUs 2 and 3 work in single-program multiple-data (SPMD) mode: they start the iterations at the same time and execute the same code, but on different data. After they finish the multiplication, they signal to CPU 1. The program will repeat the multiplication of the matrices 1000 times.

10
Parallel matrix multiplication
  • CPU1 transfers M1 and M2 to the shared_memory.
  • Algorithm:
    • The sequential program is shown below. In the parallel implementation, CPU 2 will execute the i loop from 0 to 4, and CPU 3 will execute the i loop from 5 to 9. CPUs 2 and 3 will perform their operations at the same time (a per-core sketch follows the loop below).
  • for (i = 0; i <= 9; i++)
  •   for (j = 0; j <= 9; j++) {
  •     C[i][j] = 0;
  •     for (k = 0; k <= 9; k++)
  •       C[i][j] += M1[i][k] * M2[k][j];
  •   }
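
A minimal sketch of the per-core version of this loop, assuming each CPU's project defines its own row range (ROW_START/ROW_END set to 0/4 on CPU 2 and 5/9 on CPU 3) and that M1, M2 and C point into the shared on-chip memory; the function name and pointer types are illustrative, not taken from the lab files.

    #define N 10
    /* Assumed per-project settings: 0 and 4 on CPU 2, 5 and 9 on CPU 3. */
    #define ROW_START 0
    #define ROW_END   4

    /* M1, M2 and C point into the shared on-chip memory, as in the sequential code. */
    void multiply_my_rows(volatile int (*M1)[N], volatile int (*M2)[N],
                          volatile int (*C)[N])
    {
        int i, j, k;
        for (i = ROW_START; i <= ROW_END; i++)      /* only this core's rows */
            for (j = 0; j <= 9; j++) {
                C[i][j] = 0;
                for (k = 0; k <= 9; k++)
                    C[i][j] += M1[i][k] * M2[k][j];
            }
    }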

11
Synchronization
  • Variables status_start and status_done will be shared variables used for synchronization. All three processors will access these variables using the mutex. They will be stored in the message_buffer_ram memory.
  • It is extremely important that both CPU2 and CPU3 start the matrix multiplication at the same time. This will not happen automatically, since they are booted from the same memory, so CPU1 has to ensure that both CPU2 and CPU3 start at the same time. The shared variable status_start will be used for that. CPU1 has to set this variable to 1, and CPU2 and CPU3 have to increment it before they start the matrix multiplication. When status_start is 3, CPU2 and CPU3 will start the matrix multiplication and CPU1 will initiate the time measurement using the performance_counter.
  • At the beginning, CPU 1 will set status_done to 1. After CPU 2 and CPU 3 finish 1000 iterations of the 10x10 matrix multiplication, they each increment status_done. CPU 1 periodically reads the variable status_done, and when it is 3, the program is over. CPU1 then stops the performance counter and prints the performance counter result and the matrix C from the 1000th iteration on the terminal. (A code sketch of this handshake follows.)
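
A sketch of this start/done handshake, assuming the shared counters live at fixed offsets in message_buffer_ram, that the mutex device is called "/dev/message_buffer_mutex", and that MESSAGE_BUFFER_RAM_BASE comes from system.h; all of these names and offsets depend on your SOPC Builder configuration.

    /* Sketch of the status_start / status_done handshake; names and offsets
       are assumptions derived from the SOPC Builder component names. */
    #include <altera_avalon_mutex.h>
    #include "system.h"

    #define STATUS_START ((volatile int *)(MESSAGE_BUFFER_RAM_BASE + 0x00))
    #define STATUS_DONE  ((volatile int *)(MESSAGE_BUFFER_RAM_BASE + 0x04))

    static alt_mutex_dev *mutex;

    /* Mutex-guarded increment of a shared counter. */
    void shared_increment(volatile int *var)
    {
        altera_avalon_mutex_lock(mutex, 1);
        (*var)++;
        altera_avalon_mutex_unlock(mutex);
    }

    /* CPU1: release the workers, then wait for both to finish 1000 iterations. */
    void cpu1_run(void)
    {
        mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex"); /* assumed name */

        *STATUS_DONE  = 1;
        *STATUS_START = 1;            /* each worker increments this once        */
        while (*STATUS_START != 3)    /* both workers are ready                  */
            ;
        /* ... start the performance counter here ...                            */

        while (*STATUS_DONE != 3)     /* both workers incremented status_done    */
            ;
        /* ... stop the performance counter, print the result and matrix C ...   */
    }

    /* CPU2 and CPU3 run the same code (SPMD). */
    void worker_run(void)
    {
        mutex = altera_avalon_mutex_open("/dev/message_buffer_mutex"); /* assumed name */

        while (*STATUS_START == 0)    /* wait until CPU1 sets status_start to 1  */
            ;
        shared_increment(STATUS_START);
        while (*STATUS_START != 3)    /* start only when both workers checked in */
            ;

        /* ... 1000 iterations of this core's half of the multiplication ...     */

        shared_increment(STATUS_DONE);
    }

The NIOS II/s cores configured in Part 1 have an instruction cache but no data cache, so the volatile accesses above read and write the shared memory directly.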

12
Task 2 - Questions
  • What is the speedup when comparing the sequential and parallel implementations? Comment on the speedup result.
  • Why can we design a program for matrix multiplication without using mutexes (except for synchronization)?

13
Demonstration (40 points)
  • Send the matrix C of the 1000th iteration of the matrix multiplication algorithm to the terminal through the JTAG UART. Also send the number of clock cycles from the performance counter.
  • Show this result to the TA. Explain to the TA how your parallel matrix multiplication program works and how you achieved synchronization. You will lose 10 points if the speedup is less than 1.

14
Optional part - Synchronization
  • If our program emulated a real system, CPU1 would have to synchronize CPU2 and CPU3 after every single iteration of the 10x10 matrix multiplication, not only after 1000 of them. So, in a real program, after each 10x10 matrix multiplication CPU1 would perform some operations on the computed matrix C and initialize a new iteration of the 10x10 matrix multiplication once matrices M1 and M2 are ready.
  • In this part of the lab, you will use the iteration_done variable to notify CPU1 that one iteration of the 10x10 matrix multiplication is done. An additional shared variable is needed to start the next iteration; let's call it start_next_iteration.
  • The program works as follows. At the beginning, CPU1 sets start_next_iteration. After a 10x10 multiplication iteration starts, CPU2 and CPU3 reset this variable. After CPU2 and CPU3 are done with the execution of their part of the 10x10 matrix multiplication, they increment iteration_done and wait for start_next_iteration to be set. CPU1 checks whether iteration_done is equal to 3 and, if it is, sets start_next_iteration. The new iteration of the 10x10 matrix multiplication can then start (a sketch of this handshake follows).
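
A sketch of the per-iteration handshake, building on the earlier synchronization sketch; the ITERATION_DONE and START_NEXT_ITERATION offsets in message_buffer_ram and the helper prototypes are assumptions. Note that start_next_iteration is used here as a round counter incremented by CPU1 rather than a flag that the workers reset, which avoids the case where one worker clears the flag before the other has seen it; if you keep the reset-by-workers scheme from the slide, you need an extra mechanism to make that reset safe.

    /* Per-iteration handshake sketch; offsets and helpers are assumptions. */
    #include "system.h"

    #define ITERATION_DONE       ((volatile int *)(MESSAGE_BUFFER_RAM_BASE + 0x08))
    #define START_NEXT_ITERATION ((volatile int *)(MESSAGE_BUFFER_RAM_BASE + 0x0C))
    #define N 10

    void shared_increment(volatile int *var);     /* mutex-guarded, as sketched earlier */
    void multiply_my_rows(volatile int (*M1)[N],
                          volatile int (*M2)[N],
                          volatile int (*C)[N]);  /* this core's half of the product    */

    /* CPU2 / CPU3: one pass of this loop per 10x10 multiplication. */
    void worker_loop(volatile int (*M1)[N], volatile int (*M2)[N], volatile int (*C)[N])
    {
        int round;
        for (round = 1; round <= 1000; round++) {
            while (*START_NEXT_ITERATION < round)   /* wait until CPU1 starts this round */
                ;
            multiply_my_rows(M1, M2, C);
            shared_increment(ITERATION_DONE);       /* reaches 3 when both cores finish  */
        }
    }

    /* CPU1: release one round, wait for both workers, then use matrix C.
       Assumes both shared counters were zeroed during the status_start barrier,
       before the workers entered worker_loop(). */
    void cpu1_loop(void)
    {
        int round;
        for (round = 1; round <= 1000; round++) {
            *ITERATION_DONE = 1;                    /* re-arm before releasing the round */
            (*START_NEXT_ITERATION)++;              /* only CPU1 writes this counter     */
            while (*ITERATION_DONE != 3)            /* 1 + one increment per worker      */
                ;
            /* ... send the sum of the elements of C to the terminal, reload M1/M2 ...  */
        }
    }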

15
Optional part - Demonstration and Questions
  • Question:
    • What is the speedup of this program?
  • Demonstration (10 optional points):
    • Send the sum of the elements of matrix C for each iteration of the 10x10 matrix multiplication algorithm to the terminal through the JTAG UART. Also send the number of clock cycles from the performance counter.
    • Show this result to the TA. Explain to them how you achieved synchronization.

16
What to submit
  • The report contains the following (30 points):
    • Title page
    • Description of your system with a picture of the SOPC Builder system components
    • Detailed description of your solution of the algorithm for parallel matrix multiplication and of the synchronization
    • Answers to the questions from Tasks 1-2
    • Conclusions
    • Page 17 of this document signed by the TA
  • Soft copies of the report and of the source code of the programs for sequential and parallel multiplication with basic comments (.c files), and the Quartus II files .sof and .ptf (10 points).
  • Optional: description of the synchronization method and the speedup for the optional part as a part of the report. Soft copy of the algorithm for matrix multiplication. (5 points)

17
Lab 2 Signature page
  • Student name:
  • Student name:

                       Demonstrated (TA's signature)   Performance_counter result (time)   Points
  Part 1                                                                                    ____/20
  Part 2 - sequential
  Part 2 - parallel                                                                         ____/40
  Part 2 - optional                                                                         ____/10
  Total                                                                                     ____