CS 162 Computer Architecture Lecture 2: Introduction

Slides: 21

Provided by: davep173

Learn more at: http://www.cs.ucr.edu

1
CS 162 Computer Architecture Lecture 2
Introduction and Pipelining

Instructor: L.N. Bhuyan
www.cs.ucr.edu/bhuyan/cs162

2
Review of Last Class

MIPS Datapath
Introduction to Pipelining
Introduction to Instruction Level Parallelism
(ILP)
Introduction to VLIW

3
What Is Multiprocessing?

Parallelism at the instruction level is limited
because of data dependencies => speedup is
limited!
Program-level parallelism is abundant, for
example loop-level parallelism in a loop like
Do I = 1000. How about employing multiple
processors to execute the loops? => Parallel
processing, or multiprocessing
With a billion transistors on a chip, we can put a
few CPUs on one chip => Chip multiprocessor
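
The loop-level idea above can be sketched as splitting independent iterations across workers. This is only an illustration under stated assumptions: the loop body `body`, the helper names, and the worker count are hypothetical, and Python threads are used just to show the partitioning (in CPython they do not speed up CPU-bound work; real multiprocessors run the chunks on separate CPUs).

```python
from concurrent.futures import ThreadPoolExecutor


def body(i):
    # Hypothetical loop body: each iteration is independent of the others.
    return i * i


def chunks(n, workers):
    # Split iterations 0..n-1 into contiguous, nearly equal ranges.
    base, extra = divmod(n, workers)
    start = 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        yield range(start, start + size)
        start += size


def run_chunk(r):
    # Work done by one "processor": its share of the loop iterations.
    return [body(i) for i in r]


def parallel_loop(n=1000, workers=4):
    # Hand one chunk to each worker, then stitch results back in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(run_chunk, chunks(n, workers)))
    return [x for part in parts for x in part]
```

The partitioning only works because no iteration depends on another; a loop with a data dependence between iterations would hit the same limit the slide describes for ILP.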

4
Memory Latency Problem

Even if we increase CPU power, memory is the real
bottleneck. Techniques to alleviate the memory
latency problem:
Memory hierarchy: program locality, cache
memory, multilevel caches, pages and context switching
Prefetching: get the instruction/data before the
CPU needs it. Good for instructions because of sequential
locality, so all modern processors use prefetch
buffers for instructions. What to do with data?
Multithreading: can the CPU jump to another
program when accessing memory? It's like
multiprogramming!

5
Hardware Multithreading

We need a hardware multithreading technique
because switching between threads in software is
very time-consuming (why?). Software switching
suits long I/O waits (e.g., multitasking) but is
not suitable for hiding main memory accesses.
Provide multiple PCs and register sets on the CPU
so that thread switching can occur without having
to store the register contents in main memory
(on the stack, as is done for context switching).
Several threads reside in the CPU simultaneously,
and execution switches between the threads on
a main memory access.
How about both multiprocessors and multithreading
on a chip? => Network processor
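
A toy model may help show why several hardware contexts hide memory latency. The sketch below is not from the slides: it assumes each thread computes for `compute` cycles, then waits `latency` cycles on memory, and that switching contexts costs zero cycles. All names and default values are hypothetical.

```python
def utilization(threads, compute=10, latency=30, cycles=10_000):
    # ready_at[t]: first cycle at which thread t may run again.
    ready_at = [0] * threads
    # remaining[t]: compute cycles left before thread t's next memory access.
    remaining = [compute] * threads
    busy = 0
    current = 0
    for cycle in range(cycles):
        if ready_at[current] > cycle:
            # Current thread is stalled on memory: switch to a ready thread.
            ready = [t for t in range(threads) if ready_at[t] <= cycle]
            if not ready:
                continue  # every thread is waiting on memory: idle slot
            current = ready[0]
        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # The thread issues a memory access and stalls for `latency` cycles.
            ready_at[current] = cycle + 1 + latency
            remaining[current] = compute
    return busy / cycles
```

With these assumed numbers, one context keeps the pipeline busy a quarter of the time, two contexts half the time, and four contexts all the time, which is the switch-on-memory-access behavior described above.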

6
Architectural Comparisons (cont.)

[Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1 through Thread 5, idle slot]

7
Intel IXP1200 Network Processor

Initial component of the Intel Exchange
Architecture (IXA)
Each microengine is a 5-stage pipeline: no ILP,
4-way multithreaded
7-core multiprocessing: 6 microengines and a
StrongARM core
166 MHz fundamental clock rate
Intel claims 2.5 Mpps IP routing for 64-byte
packets
Already the most widely used NPU
Or, more accurately, the most widely admitted use
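
As a rough back-of-the-envelope check of the figures above, the sketch below converts the claimed packet rate into a bit rate and a per-packet cycle budget. It assumes all six microengines run at the 166 MHz clock and counts only the 64 packet bytes (no framing overhead); the variable names are illustrative.

```python
CLOCK_HZ = 166e6          # fundamental clock rate from the slide
PACKETS_PER_SEC = 2.5e6   # Intel's claimed IP routing rate (2.5 Mpps)
PACKET_BYTES = 64
MICROENGINES = 6

# Bit rate implied by the claimed packet rate.
line_rate_gbps = PACKETS_PER_SEC * PACKET_BYTES * 8 / 1e9

# Clock cycles that elapse between packet arrivals.
cycles_per_packet = CLOCK_HZ / PACKETS_PER_SEC

# Aggregate microengine cycles available per packet (assumes perfect sharing).
engine_cycles_per_packet = cycles_per_packet * MICROENGINES

print(f"{line_rate_gbps:.2f} Gbit/s, "
      f"{cycles_per_packet:.1f} cycles/packet, "
      f"{engine_cycles_per_packet:.1f} microengine cycles/packet")
```

Under these assumptions the claim works out to about 1.28 Gbit/s, with roughly 66 clock cycles between packets and about 400 microengine cycles of total work per packet.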

8
IXP1200 Chip Layout

StrongARM processing core
Microengines introduce new ISA
I/O
PCI
SDRAM
SRAM
IX PCI-like packet bus
On-chip FIFOs
16 entries, 64 B each

9
IXP1200 Microengine

4 hardware contexts
Single issue processor
Explicit optional context switch on SRAM access
Registers
All are single ported
Separate GPR
1536 registers total
32-bit ALU
Can access GPR or XFER registers
Standard 5-stage pipe
4 KB SRAM instruction store (not a cache!)

10
Intel IXP2400 Microengine (New)

XScale core replaces StrongARM
1.4 GHz target in 0.13-micron
Nearest neighbor routes added between
microengines
Hardware to accelerate CRC operations and random
number generation
16-entry CAM

11
MIPS Pipeline

Chapter 6, CS 161 Text

12
Review: Single-cycle Datapath for MIPS

[Figure: single-cycle MIPS datapath; surviving labels: Stage 5, Instruction Memory (Imem), Data Memory (Dmem)]

Use datapath figure to represent pipeline

13
Stages of Execution in Pipelined MIPS

5-stage instruction pipeline
1) I-fetch: Fetch instruction, increment PC
2) Decode: Decode instruction, read registers
3) Execute: Mem-reference: calculate address;
R-format: perform ALU operation
4) Memory: Load: read data from data memory;
Store: write data to data memory
5) Write Back: Write data to register

14
Pipelined Execution Representation
[Figure: four instructions in program flow, each passing through IFtch, Dcd, Exec, Mem, WB, with each instruction starting one cycle after the previous one; horizontal axis: time]

To simplify the pipeline, every instruction takes the
same number of steps, called stages
One clock cycle per stage
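
The staggered picture can be regenerated as text. The sketch below is a minimal illustration, assuming an ideal pipeline with no stalls; the function name and column width are arbitrary choices, and the stage names are the abbreviations used on this slide.

```python
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_diagram(n_instructions):
    # One row per instruction; each row is shifted right by one cycle
    # relative to the previous one (one clock cycle per stage).
    width = max(len(s) for s in STAGES) + 1
    rows = []
    for i in range(n_instructions):
        cells = [""] * i + STAGES
        rows.append("".join(c.ljust(width) for c in cells).rstrip())
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(4):
        print(row)
```

Each printed column corresponds to one clock cycle, so reading down a column shows which stage each in-flight instruction occupies in that cycle.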

15
Datapath Timing: Single-cycle vs. Pipelined

Assume the following delays for major functional
units:
2 ns for a memory access or ALU operation
1 ns for a register file read or write
Total datapath delay for single-cycle, by instruction type (table below)
In a pipelined machine, each stage takes the length of the
longest delay = 2 ns; 5 stages = 10 ns

Insn Type   Insn Fetch   Reg Read   ALU Oper   Data Access   Reg Write   Total Time
beq         2 ns         1 ns       2 ns       -             -           5 ns
R-format    2 ns         1 ns       2 ns       -             1 ns        6 ns
sw          2 ns         1 ns       2 ns       2 ns          -           7 ns
lw          2 ns         1 ns       2 ns       2 ns          1 ns        8 ns

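
The per-instruction totals follow directly from the unit delays. The sketch below recomputes them; the dictionary names are illustrative, and the mapping of instruction types to functional units is taken from the table on this slide.

```python
# Unit delays in ns, as assumed on the slide.
DELAY_NS = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

# Functional units each instruction type uses, in order.
USES = {
    "beq": ["fetch", "reg_read", "alu"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "sw": ["fetch", "reg_read", "alu", "mem"],
    "lw": ["fetch", "reg_read", "alu", "mem", "reg_write"],
}

# Total datapath delay per instruction type.
totals = {insn: sum(DELAY_NS[u] for u in units) for insn, units in USES.items()}

# A single-cycle clock must fit the slowest instruction.
single_cycle_clock_ns = max(totals.values())

# A pipelined clock must fit the slowest stage; latency is 5 such stages.
pipelined_clock_ns = max(DELAY_NS.values())
pipelined_latency_ns = 5 * pipelined_clock_ns

print(totals, single_cycle_clock_ns, pipelined_clock_ns, pipelined_latency_ns)
```

Note that the pipelined latency of one instruction (10 ns) is longer than the single-cycle clock (8 ns); the gain comes from issuing a new instruction every 2 ns.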
16
Pipelining Lessons

Pipelining doesn't help the latency (execution time)
of a single task; it helps the throughput of the entire
workload
Multiple tasks operate simultaneously using
different resources
Potential speedup = number of pipe stages
Time to fill the pipeline and time to drain it
reduce speedup
Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipe stages also reduce
speedup
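
These lessons can be made concrete with the usual fill-time model: n instructions on a k-stage pipeline take k + n - 1 cycles. The sketch below assumes no stalls; the function and parameter names are hypothetical, and `unpipelined_ns` defaults to perfectly balanced stages.

```python
def pipeline_speedup(n, stages=5, stage_ns=2.0, unpipelined_ns=None):
    # Time per instruction without pipelining; by default assume the
    # stages are perfectly balanced, so it equals stages * stage_ns.
    if unpipelined_ns is None:
        unpipelined_ns = stages * stage_ns
    t_unpipelined = n * unpipelined_ns
    # The first instruction needs `stages` cycles to fill the pipeline;
    # each later instruction completes one cycle after the previous one.
    t_pipelined = (stages + n - 1) * stage_ns
    return t_unpipelined / t_pipelined


if __name__ == "__main__":
    for n in (1, 4, 1000):
        print(n, round(pipeline_speedup(n), 2))
    # Unbalanced stages: the unpipelined instruction takes 8 ns, not 10 ns.
    print(round(pipeline_speedup(1000, unpipelined_ns=8.0), 2))
```

For a single instruction the speedup is 1 (no latency benefit), for long runs it approaches the number of stages, and with an 8 ns unpipelined instruction and a 2 ns slowest stage it stays below 4.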

17
Single Cycle Datapath (From Ch 5)
[Figure: single-cycle MIPS datapath with PC, Imem, Regs, ALU, Dmem, adders, shift left 2 (<< 2), and muxes; control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, ALUcon, MemRead, MemWrite, MemToReg; instruction fields 31-0, 25-21, 20-16, 15-11, 15-0]

18
Required Changes to Datapath

Introduce registers to separate the 5 stages by
putting IF/ID, ID/EX, EX/MEM, and MEM/WB
registers in the datapath.
The next PC value is computed in the 3rd step, but we
need to bring in the next instruction in the next cycle:
move the PCSrc mux to the 1st stage. The PC is
incremented unless there is a new branch address.
The branch address is computed in the 3rd stage. With
pipelining, the PC value has changed by then! We must carry
the PC value along with the instruction. Width of the IF/ID
register = (IR) + (PC) = 64 bits.

19
Changes to Datapath (Contd.)

For a lw instruction, we need the write register address at
stage 5. But the IR is now occupied by another
instruction! So we must carry the IR destination field
along as we move through the stages. See the connection in
the figure.
Length of ID/EX register = (Reg1: 32) + (Reg2: 32) + (offset: 32)
+ (PC: 32) + (destination register: 5)
= 133 bits
Assignment: What are the lengths of the EX/MEM and
MEM/WB registers?
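
The width bookkeeping can be written down as a small table of fields. The sketch below covers only the two registers whose contents the slides spell out, assuming 32 bits each for the IR and PC (implied by the 64-bit IF/ID total); the dictionary and function names are illustrative.

```python
# Fields carried by each pipeline register, with widths in bits.
IF_ID = {"instruction (IR)": 32, "PC": 32}
ID_EX = {
    "Reg1": 32,
    "Reg2": 32,
    "offset": 32,
    "PC": 32,
    "destination register": 5,
}


def width(fields):
    # A pipeline register is as wide as the sum of the fields it carries.
    return sum(fields.values())


print("IF/ID:", width(IF_ID), "bits")
print("ID/EX:", width(ID_EX), "bits")
```

The same `width` helper can be applied to a field list for EX/MEM and MEM/WB once you have decided which values each later stage still needs.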

20
Pipelined Datapath (with Pipeline Regs)(6.2)
[Figure: pipelined MIPS datapath with stages Fetch, Decode, Execute, Memory, Write Back separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; components include PC, Imem, Regs, sign extend (16 to 32), shift left 2, adders, ALU, Dmem, and muxes; width labels: 69 bits, 64 bits, 133 bits, 102 bits]
