Title: CS 162 Computer Architecture Lecture 2: Introduction
1CS 162 Computer Architecture Lecture 2
Introduction Pipelining
- Instructor L.N. Bhuyan
- www.cs.ucr.edu/bhuyan/cs162
2Review of Last Class
- MIPS Datapath
- Introduction to Pipelining
- Introduction to Instruction Level Parallelism (ILP)
- Introduction to VLIW
3What is Multiprocessing
- Parallelism at the instruction level is limited because of data dependencies => speedup is limited!
- Program-level parallelism is abundantly available, e.g. loop-level parallelism in a DO loop with 1000 iterations. How about employing multiple processors to execute the loops? => Parallel processing, or multiprocessing
- With a billion transistors on a chip, we can put a few CPUs on one chip => chip multiprocessor
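As a rough illustration of loop-level parallelism, the sketch below splits a 1000-iteration loop into contiguous chunks, one per processor, and combines the partial results. The names `split_iterations`, `loop_body`, and `parallel_loop`, the choice of 4 processors, and the squared-index loop body are all hypothetical. Python threads stand in for processors only to show the decomposition; they do not demonstrate real hardware speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_iterations(num_iterations, num_processors):
    """Return contiguous (start, stop) ranges, one per processor."""
    base, extra = divmod(num_iterations, num_processors)
    chunks = []
    start = 0
    for p in range(num_processors):
        # The first `extra` processors take one additional iteration.
        size = base + (1 if p < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def loop_body(i):
    """Stand-in for independent per-iteration work (assumed, not from the slides)."""
    return i * i


def run_chunk(bounds):
    """Execute one processor's share of the loop and return its partial sum."""
    start, stop = bounds
    return sum(loop_body(i) for i in range(start, stop))


def parallel_loop(num_iterations=1000, num_processors=4):
    """Run the loop chunk by chunk on a pool of workers and combine the results."""
    chunks = split_iterations(num_iterations, num_processors)
    with ThreadPoolExecutor(max_workers=num_processors) as pool:
        partial_sums = list(pool.map(run_chunk, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    print(split_iterations(1000, 4))
    print(parallel_loop())
```

The decomposition only works this simply because the iterations are independent; a loop with data dependencies between iterations would need synchronization or a different split.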
4Memory Latency Problem
- Even if we increase CPU power, memory is the real bottleneck. Techniques to alleviate the memory latency problem:
- Memory hierarchy: program locality, cache memory, multilevel, pages and context switching
- Prefetching: get the instruction/data before the CPU needs it. Good for instructions because of sequential locality, so all modern processors use prefetch buffers for instructions. What to do with data?
- Multithreading: can the CPU jump to another program when accessing memory? It's like multiprogramming!
5Hardware Multithreading
- We need to develop a hardware multithreading technique because switching between threads in software is very time-consuming (why?), so it is not suitable for main memory access (as opposed to I/O access). Ex: multitasking
- Provide multiple PCs and register sets on the CPU so that thread switching can occur without having to store the register contents in main memory (on the stack, as is done for context switching).
- Several threads reside in the CPU simultaneously, and execution switches between the threads on main memory access.
- How about both multiprocessors and multithreading on a chip? => Network processor
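To make the benefit of switching on memory access concrete, here is a minimal cycle-by-cycle sketch. It assumes every thread alternates between a fixed compute burst and a fixed-latency memory access, and that switching between hardware contexts costs zero cycles. The function name `utilization` and the numbers (10 compute cycles, 30 cycles of memory latency) are illustrative assumptions, not figures from the lecture.

```python
def utilization(num_threads, compute_cycles, memory_latency, total_cycles=10_000):
    """Fraction of cycles the CPU does useful work under switch-on-memory-access.

    Assumed model: each thread runs `compute_cycles` cycles, then waits
    `memory_latency` cycles for memory; context switches are free because
    each thread has its own PC and register set.
    """
    ready_at = [0] * num_threads            # cycle at which each thread may run again
    remaining = [compute_cycles] * num_threads
    current = 0
    busy = 0
    for cycle in range(total_cycles):
        if ready_at[current] > cycle:
            # Current thread is waiting on memory: look for another ready context.
            for offset in range(1, num_threads):
                candidate = (current + offset) % num_threads
                if ready_at[candidate] <= cycle:
                    current = candidate
                    break
            else:
                continue                    # every context is stalled: idle cycle
        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # Burst finished: issue the memory access and stall this thread.
            ready_at[current] = cycle + 1 + memory_latency
            remaining[current] = compute_cycles
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "thread(s):", utilization(n, compute_cycles=10, memory_latency=30))
```

Under these assumptions one thread keeps the CPU busy a quarter of the time, and four contexts are just enough to hide the memory latency completely. A software context switch that costs more cycles than the memory access itself would erase that gain, which is the motivation for doing the switch in hardware.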
6Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1 through Thread 5, idle slot]
7Intel IXP1200 Network Processor
- Initial component of the Intel Exchange Architecture (IXA)
- Each microengine is a 5-stage pipeline: no ILP, 4-way multithreaded
- 7-core multiprocessing: 6 microengines and a StrongARM core
- 166 MHz fundamental clock rate
- Intel claims 2.5 Mpps IP routing for 64-byte packets
- Already the most widely used NPU
- Or, more accurately, the most widely admitted use
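A quick back-of-the-envelope check of what the claimed numbers imply, as a sketch only. It assumes the microengines run at the 166 MHz fundamental clock, that packets are spread evenly over the 6 microengines, and it ignores any link-level framing overhead. The function names are hypothetical.

```python
CLOCK_HZ = 166e6             # fundamental clock rate from the slide
PACKETS_PER_SECOND = 2.5e6   # claimed IP routing rate (2.5 Mpps)
PACKET_BYTES = 64            # packet size for the claimed rate
NUM_MICROENGINES = 6


def payload_rate_bps(pps, packet_bytes):
    """Bits per second carried at the given packet rate (no framing overhead)."""
    return pps * packet_bytes * 8


def cycles_per_packet(clock_hz, pps):
    """Clock cycles that elapse between consecutive packet arrivals."""
    return clock_hz / pps


def cycles_per_packet_per_engine(clock_hz, pps, engines):
    """Cycle budget per packet if packets are spread evenly over the engines (assumed)."""
    return cycles_per_packet(clock_hz, pps) * engines


if __name__ == "__main__":
    print(payload_rate_bps(PACKETS_PER_SECOND, PACKET_BYTES) / 1e9, "Gbit/s")
    print(cycles_per_packet(CLOCK_HZ, PACKETS_PER_SECOND), "cycles between packets")
    print(cycles_per_packet_per_engine(CLOCK_HZ, PACKETS_PER_SECOND, NUM_MICROENGINES),
          "cycles per packet per microengine")
```

With these assumptions the claim corresponds to roughly 1.28 Gbit/s of packet data and a budget of a few hundred cycles per packet on each microengine, which is why hiding memory latency with multithreading matters so much on this chip.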
8IXP1200 Chip Layout
- StrongARM processing core
- Microengines introduce new ISA
- I/O
- PCI
- SDRAM
- SRAM
- IX: PCI-like packet bus
- On-chip FIFOs: 16 entries, 64 B each
9IXP1200 Microengine
- 4 hardware contexts
- Single issue processor
- Explicit optional context switch on SRAM access
- Registers
- All are single ported
- Separate GPR
- 1536 registers total
- 32-bit ALU
- Can access GPR or XFER registers
- Standard 5 stage pipe
- 4 KB SRAM instruction store (not a cache!)
10Intel IXP2400 Microengine (New)
- XScale core replaces StrongARM
- 1.4 GHz target in 0.13-micron
- Nearest-neighbor routes added between microengines
- Hardware to accelerate CRC operations and random number generation
- 16-entry CAM
11- MIPS Pipeline
- Chapter 6 CS 161 Text
12Review Single-cycle Datapath for MIPS
[Figure: single-cycle MIPS datapath; surviving labels: Stage 5, Instruction Memory (Imem), Data Memory (Dmem)]
- Use datapath figure to represent pipeline
13Stages of Execution in Pipelined MIPS
- 5 stage instruction pipeline
- 1) I-fetch: fetch instruction, increment PC
- 2) Decode: decode instruction, read registers
- 3) Execute: Mem-reference: calculate address; R-format: perform ALU operation
- 4) Memory: Load: read data from data memory; Store: write data to data memory
- 5) Write Back: write data to register
14Pipelined Execution Representation
[Figure: four instructions in program flow, each passing through IFtch, Dcd, Exec, Mem, WB along the time axis, with each instruction starting one cycle after the previous one]
- To simplify the pipeline, every instruction takes the same number of steps, called stages
- One clock cycle per stage
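The staircase pattern of pipelined execution can be generated mechanically: instruction i occupies stage s in cycle i + s. The sketch below builds that table for an ideal pipeline with no stalls or hazards; the function name `pipeline_diagram` is hypothetical and the stage names follow the slide.

```python
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c.

    Assumes an ideal pipeline: one clock cycle per stage, one new instruction
    issued every cycle, no stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name   # instruction i reaches stage s in cycle i + s
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in pipeline_diagram(4):
        print(" ".join(cell.ljust(5) for cell in row))
```

For four instructions the table spans 8 cycles, whereas running them one after another through all five stages would take 20.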
15Datapath Timing Single-cycle vs. Pipelined
- Assume the following delays for major functional units:
- 2 ns for a memory access or ALU operation
- 1 ns for register file read or write
- Total datapath delay for single-cycle: see the table; the clock must fit the slowest instruction (lw, 8 ns)
- In a pipelined machine, each stage = length of longest delay = 2 ns; 5 stages = 10 ns

Insn    Insn   Reg    ALU    Data    Reg    Total
Type    Fetch  Read   Oper   Access  Write  Time
beq     2ns    1ns    2ns                   5ns
R-form  2ns    1ns    2ns            1ns    6ns
sw      2ns    1ns    2ns    2ns            7ns
lw      2ns    1ns    2ns    2ns     1ns    8ns
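The per-instruction totals follow directly from which functional units each instruction type uses. The sketch below recomputes them from the delays stated on this slide; the dictionary and function names are hypothetical.

```python
# Functional-unit delays in ns, as assumed on the slide.
UNIT_DELAY_NS = {
    "insn_fetch": 2,
    "reg_read": 1,
    "alu_oper": 2,
    "data_access": 2,
    "reg_write": 1,
}

# Units each instruction type passes through in the single-cycle datapath.
UNITS_USED = {
    "beq": ["insn_fetch", "reg_read", "alu_oper"],
    "R-format": ["insn_fetch", "reg_read", "alu_oper", "reg_write"],
    "sw": ["insn_fetch", "reg_read", "alu_oper", "data_access"],
    "lw": ["insn_fetch", "reg_read", "alu_oper", "data_access", "reg_write"],
}


def instruction_time_ns(kind):
    """Total datapath delay for one instruction type."""
    return sum(UNIT_DELAY_NS[unit] for unit in UNITS_USED[kind])


def single_cycle_clock_ns():
    """The single-cycle clock period must fit the slowest instruction."""
    return max(instruction_time_ns(kind) for kind in UNITS_USED)


def pipelined_clock_ns():
    """The pipelined clock period is set by the slowest stage."""
    return max(UNIT_DELAY_NS.values())


def pipelined_latency_ns(num_stages=5):
    """Time for one instruction to traverse all pipeline stages."""
    return num_stages * pipelined_clock_ns()


if __name__ == "__main__":
    for kind in UNITS_USED:
        print(kind, instruction_time_ns(kind), "ns")
    print("single-cycle clock:", single_cycle_clock_ns(), "ns")
    print("pipelined clock:", pipelined_clock_ns(), "ns")
    print("pipelined latency:", pipelined_latency_ns(), "ns")
```

Note that the pipelined latency of a single instruction (10 ns) is longer than even the slowest single-cycle instruction (8 ns); the gain comes from throughput, not latency.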
16Pipelining Lessons
- Pipelining doesn't help latency (execution time) of a single task; it helps throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Time to fill the pipeline and time to drain it reduce speedup
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages also reduce speedup
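These lessons can be checked numerically with the figures from the timing slide (8 ns single-cycle clock, 2 ns pipelined clock, 5 stages). The sketch assumes an ideal pipeline with no stalls, so n instructions take (stages + n - 1) cycles; the function names are hypothetical.

```python
SINGLE_CYCLE_CLOCK_NS = 8   # slowest instruction (lw) on the timing slide
PIPELINE_CLOCK_NS = 2       # slowest stage on the timing slide
NUM_STAGES = 5


def single_cycle_time_ns(num_instructions):
    """Each instruction takes one long clock cycle."""
    return num_instructions * SINGLE_CYCLE_CLOCK_NS


def pipelined_time_ns(num_instructions):
    """Fill the pipeline once, then complete one instruction per cycle (no stalls assumed)."""
    return (NUM_STAGES + num_instructions - 1) * PIPELINE_CLOCK_NS


def speedup(num_instructions):
    """Single-cycle time divided by pipelined time for the same workload."""
    return single_cycle_time_ns(num_instructions) / pipelined_time_ns(num_instructions)


if __name__ == "__main__":
    for n in (1, 3, 10, 1000):
        print(n, "instructions: speedup", round(speedup(n), 3))
```

For a single instruction the pipeline is slower (speedup 0.8), which is the fill/drain cost. For long runs the speedup approaches 8 / 2 = 4 rather than the potential 5, because the unbalanced stages force every stage to take as long as the slowest one.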
17Single Cycle Datapath (From Ch 5)
[Figure: single-cycle MIPS datapath with PC, Imem, Regs, ALU, Dmem, adders (PC + 4 and branch target with shift left 2), and muxes; control signals PCSrc, RegDst, ALUSrc, ALUOp, ALUcon, MemRead, MemWrite, MemToReg, RegWrite; instruction fields [31:0], [25:21], [20:16], [15:11], [15:0]]
18Required Changes to Datapath
- Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
- The next PC value is computed in the 3rd step, but we need to bring in the next instruction in the next cycle => move the PCSrc mux to the 1st stage. The PC is incremented unless there is a new branch address.
- The branch address is computed in the 3rd stage. With pipelining, the PC value has changed by then! Must carry the PC value along with the instruction. Width of IF/ID register = (IR) + (PC) = 64 bits.
19Changes to Datapath Contd.
- For the lw instruction, we need the write register address at stage 5. But the IR is now occupied by another instruction! So we must carry the IR destination field along as we move through the stages. See the connection in the figure.
- Length of ID/EX register = (Reg1: 32) + (Reg2: 32) + (offset: 32) + (PC: 32) + (destination register: 5) = 133 bits
- Assignment: What are the lengths of the EX/MEM and MEM/WB registers?
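A pipeline register's width is just the sum of the fields it has to carry to later stages. The sketch below tallies the two registers worked out on the slides, assuming 32-bit IR and PC values; the helper name `register_width` and the dictionary layout are hypothetical. The same tally can be used for the assignment once the fields for EX/MEM and MEM/WB are listed.

```python
def register_width(fields):
    """Total width in bits of a pipeline register given its field widths."""
    return sum(fields.values())


# Fields carried by each pipeline register, in bits (per the slides).
IF_ID = {"IR": 32, "PC": 32}
ID_EX = {
    "Reg1": 32,                  # first register operand read in decode
    "Reg2": 32,                  # second register operand read in decode
    "offset": 32,                # sign-extended 16-bit immediate
    "PC": 32,                    # carried along for the branch address
    "destination_register": 5,   # needed at write back
}

if __name__ == "__main__":
    print("IF/ID:", register_width(IF_ID), "bits")
    print("ID/EX:", register_width(ID_EX), "bits")
```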
20Pipelined Datapath (with Pipeline Regs)(6.2)
[Figure: pipelined MIPS datapath across the Fetch, Decode, Execute, Memory, and Write Back stages, with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between them; components: PC, Imem, Regs, sign extend (16 to 32), shift left 2, adders, ALU, Dmem, muxes; pipeline register width labels: 64 bits, 133 bits, 102 bits, 69 bits]