Title: SUN ULTRASPARC-III ARCHITECTURE
1. SUN ULTRASPARC-III ARCHITECTURE
- CMPE 511 PRESENTATION
- Prepared by Balkir Kayaalti
2. Introduction
- SPARC stands for Scalable Processor ARChitecture.
- It is an open processor architecture (i.e., member companies of the SPARC community can freely produce the processor).
- Sun UltraSPARC (SPARC V9) is a robust RISC architecture with:
  - 64-bit integer addresses and data
  - Superscalar implementations
  - Extremely fast trap handling and context switching
- This presentation looks in detail at the Sun Microsystems UltraSPARC III (SPARC V9) architecture.
3. Major architectural units
- The processor's microarchitecture has six major functional units that operate relatively independently:
  - Instruction issue unit (IIU)
  - Floating point unit (FPU)
  - Integer execution unit (IEU)
  - Data cache unit (DCU)
  - External memory unit (EMU)
  - System interface unit (SIU)
- The units communicate requests and results among themselves through well-defined interface protocols, as the next figure shows.
4. Communication paths between architectural units
5. Instruction issue unit
- This unit feeds the execution pipelines with instructions.
- It independently predicts the control flow through a program and fetches the predicted path from the memory system.
- Fetched instructions are staged in a queue before being forwarded to the two execution units (integer and floating point).
- This unit includes:
  - A 32-Kbyte, four-way associative instruction cache
  - The instruction address translation buffer
  - A 16K-entry branch predictor
6. UltraSPARC-III pipeline and physical data
- Pipeline feature: Parameter
- Instruction issue: 4 integer, 2 floating point, 2 graphics
- Level-one (L1) caches:
  - Data: 64-Kbyte, 4-way
  - Instructions: 32-Kbyte, 4-way
  - Prefetch: 2-Kbyte, 4-way
  - Write: 2-Kbyte, 4-way
- Level-two (L2) cache:
  - Unified (data and instructions)
  - 4- and 8-Mbyte, 1-way
  - On-chip tags, off-chip data
7. Pipeline
8. Pipeline blocks
- Stage: Function
- A: Generate instruction fetch addresses; generate pre-decoded instruction bits on
- P: Fetch first cycle of instructions from cache; access first cycle of branch prediction
- F: Fetch second cycle of instructions from cache; access second cycle of branch prediction; translate virtual-to-physical address
- B: Calculate branch target addresses; decode first cycle of instructions
- I: Decode second cycle of instructions; enqueue instructions into the queue
- J: Steer instructions to execution units
- R: Read integer register file operands; check operand dependencies
- E: Execute integer arithmetic, logical, and shift instructions; first cycle of data cache access; read, and check dependency of, floating-point register file
9. Pipeline blocks (2)
- Stage: Function
- C: Access second cycle of data cache, and forward load data for word and doubleword loads; execute first cycle of floating-point instructions
- M: Load data alignment for half-word and byte loads; execute second cycle of floating-point instructions
- W: Write speculative integer register file; execute third cycle of floating-point instructions
- X: Extend integer pipeline for precise floating-point traps; execute fourth cycle of floating-point instructions
- T: Report traps
- D: Write architectural register file
10. Pipeline
- Instruction issue unit: stages A-J
- Execution unit: stages R-D
- Data cache: E, C, M, and W stages of the pipe, in parallel with the integer execution unit stages
- Floating point unit: a side pipeline parallel to the E through D stages of the integer pipeline
11. Pipeline
12. Instruction issue unit (cont.)
- To increase performance, a high level of instruction-level parallelism is desired.
- UltraSPARC is a static speculation machine.
  - Dynamic speculation machines require very high fetch bandwidths to fill an instruction window and find instruction-level parallelism.
  - In a static speculation machine, the compiler can make the speculated path sequential, resulting in fewer requirements on the instruction fetch unit.
13. Instruction issue unit
- Stage A: Address lines enter the instruction cache. All fetch address generation and selection occurs.
- Stages P, F: Instruction cache access, branch prediction, and instruction address translation access.
14.
- By the time the instructions are available from the cache in the B stage, we also have the physical address from the translator and a prediction for any branch that was fetched.
- The processor uses all this information in the B stage to determine whether to follow a sequential or taken-branch path.
15. Branch prediction
- The processor also determines whether the instruction cache access was a hit or a miss. If the processor predicts a taken branch in the B stage, it sends the branch target address back to the A stage to redirect the fetch stream.
- Waiting until the B stage to redirect the fetch stream lets us use a large, accurate branch predictor.
- The branch predictor uses a gshare algorithm with 16K 2-bit saturating up/down counters.
- The predictor is pipelined because it is large.
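The gshare scheme named on this slide can be illustrated with a small software model. This is a minimal sketch, not the processor's actual logic: the class name `GsharePredictor`, the 12-bit history length, the initial counter value, and the way the program counter is hashed are all assumptions chosen for illustration. Only the 16K table of 2-bit saturating up/down counters comes from the slide.

```python
class GsharePredictor:
    """Toy gshare model: (PC xor global history) indexes a table of 2-bit counters."""

    def __init__(self, entries=16 * 1024, history_bits=12):
        # entries must be a power of two so that a bit mask can select the index.
        assert entries & (entries - 1) == 0
        self.mask = entries - 1
        self.history_mask = (1 << history_bits) - 1  # history length is hypothetical
        self.history = 0
        self.counters = [1] * entries  # 0..3; start at 1 (weakly not taken)

    def _index(self, pc):
        # Drop the two low PC bits (4-byte instructions), then XOR with the history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        # Shift the branch outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

In this model a branch that is always taken is predicted taken once the history register has filled with ones and the counter it selects has been incremented twice.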
16. Instruction buffer (queue)
- There are two instruction queues: the instruction queue and the miss queue.
- The 20-entry instruction queue decouples the fetch unit from the execution units, allowing each to proceed at its own rate.
- When a branch is taken, two cycles must pass before the queue is filled with the right instructions; in the meantime, instructions in the miss queue can be used immediately.
17. Integer execute unit
- The execution pipelines can support the concurrent launch of up to six instructions, which can consist of:
  - Two integer operations (A0/A1 pipelines)
  - Two FP operations (FP pipelines)
  - One memory operation, load/store (MS pipeline)
  - One special-purpose memory operation (prefetch cache load only)
  - One control transfer instruction, CTI (BR pipeline)
- However, only four instructions per cycle (IPC) can be executed in a sustained manner.
18. Working and Architectural Register File (WARF)
- Physically it is one block, but logically it can be seen as two separate register files (the working register file and the architectural register file).
- SPARC architectures use register files and windowing techniques.
- The 8 global registers, g0 through g7, can be accessed at any time.
- Global register g0 is always 0.
- At any time, an instruction can access the 8 globals and a 24-register window into the registers. A register window comprises the 8 in and 8 local registers of a particular register set, together with the 8 in registers of an adjacent register set, which are addressable from the current window as out registers.
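The window overlap described above can be made concrete with a small mapping from a logical register number to a slot in a flat register array. This is only a sketch under stated assumptions: the function name `physical_index`, the flat layout, the choice of 8 windows and 4 global sets, and the direction of adjacency (the outs of window `cwp` alias the ins of window `cwp + 1`) are illustrative, not the chip's actual addressing.

```python
NWINDOWS = 8        # assumed number of register windows
GLOBAL_SETS = 4     # assumed number of global register sets
GLOBALS_TOTAL = GLOBAL_SETS * 8


def physical_index(cwp, reg, gset=0):
    """Map logical register r0..r31 in window `cwp` to a flat physical index.

    Logical numbering follows SPARC convention:
    r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins.
    Each window owns 16 slots: 8 ins followed by 8 locals (hypothetical layout).
    """
    if not 0 <= reg < 32:
        raise ValueError("register number must be in 0..31")
    if reg < 8:                       # globals: selected by the active global set
        return gset * 8 + reg
    if reg < 16:                      # outs: the ins of the adjacent window
        window = (cwp + 1) % NWINDOWS
        return GLOBALS_TOTAL + window * 16 + (reg - 8)
    if reg < 24:                      # locals: private to this window
        return GLOBALS_TOTAL + cwp * 16 + 8 + (reg - 16)
    return GLOBALS_TOTAL + cwp * 16 + (reg - 24)   # ins


def total_physical_registers():
    """Count the distinct slots reachable over all windows and global sets."""
    slots = {
        physical_index(cwp, reg, gset)
        for cwp in range(NWINDOWS)
        for reg in range(32)
        for gset in range(GLOBAL_SETS)
    }
    return len(slots)
```

With these assumed parameters, `physical_index(0, 8)` and `physical_index(1, 24)` name the same slot (o0 of one window is i0 of the next), and the total comes to 8 x 8 locals + 8 x 8 shared in/out + 4 x 8 globals = 160 registers.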
19. Register windows
20. WARF
- The WRF consists of 32 64-bit registers with 3 write ports and 7 read ports, plus a 1984-bit write port (32 × 64 = 2048, minus 64) to transfer data from the architectural register file.
- The ARF has 160 entries (8 register windows in total):
  - 8 × 8 = 64 local registers across the windows
  - 8 × 8 = 64 registers for the 16 shared IN/OUT registers
  - 4 × 8 = 32 registers for 4 sets of 8 global registers
- The WRF holds a single window, which is updated as results are computed.
21.
- The processor accesses the WRF in the pipeline's R stage and supplies integer operands to the execution units.
- Most integer operations complete in one cycle, so the result can be written immediately at the C stage.
- If an exceptional event occurs, the results already written must be undone, so the original integer register values are restored by a broadside copy of all integer registers from the appropriate ARF window.
- The architectural register file is written at the end of the pipeline, where all exceptions have been resolved.
- The ARF fills 16 WRF entries after a window change.
- On an exception, the 31 nonzero registers of the WRF must be updated.
22. On-chip memory system
Cache diagram used in the architecture
23. On-chip memory system
- Level-one (L1) caches:
  - Data: 64-Kbyte, 4-way
  - Instructions: 32-Kbyte, 4-way
  - Prefetch: 2-Kbyte, 4-way
  - Write: 2-Kbyte, 4-way
- Level-two (L2) cache:
  - Unified (data and instructions)
  - 4- and 8-Mbyte, 1-way
  - On-chip tags, off-chip data
- Average latency = L1 hit time + L1 miss rate × L1 miss time + L2 miss rate × L2 miss time
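The average-latency expression on this slide can be read as the L1 hit time plus two miss penalties, each weighted by its miss rate. The helper below is a minimal sketch of that reading. The function name and every number in the usage note are hypothetical, and the miss rates are assumed to be counted per memory reference.

```python
def average_latency(l1_hit_time, l1_miss_rate, l1_miss_time,
                    l2_miss_rate, l2_miss_time):
    """Average memory latency in cycles.

    Computes: L1 hit time + L1 miss rate * L1 miss time
                          + L2 miss rate * L2 miss time.
    Both miss rates are taken per memory reference (an assumption of this sketch).
    """
    return (l1_hit_time
            + l1_miss_rate * l1_miss_time
            + l2_miss_rate * l2_miss_time)
```

For example, with purely illustrative values of a 2-cycle L1 hit, a 5% L1 miss rate costing 12 cycles, and a 1% L2 miss rate costing 100 cycles, `average_latency(2, 0.05, 12, 0.01, 100)` comes to about 3.6 cycles. The formula makes clear why reducing either miss rate or either miss time lowers the average.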
24. Prefetch cache
- Performance is greatly increased by using a prefetch cache in parallel with the L1 data cache.
- By issuing up to eight in-flight prefetches to main memory, the prefetch cache enables a program to utilize 100% of the available main memory bandwidth without incurring a slowdown due to main memory latency.
25. Prefetch cache
- The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of 64 bytes, using four-way associativity with an LRU replacement policy.
- A multiport SRAM design lets us achieve very high throughput.
- Data can be streamed through the prefetch cache in a manner similar to stream buffers.
- On every cycle, each of two independent read ports supplies 8 bytes of data to the pipeline, while a third write port fills the cache with 16 bytes.
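The geometry on this slide implies a small amount of arithmetic: 32 entries of 64 bytes give 2 Kbytes, and four ways leave 8 sets. The sketch below works that out and splits an address into tag, set, and offset. The constant and function names are hypothetical, and indexing the set with the low bits of the line number is an assumption; the slide does not say how the cache is indexed.

```python
LINE_BYTES = 64            # bytes per entry (from the slide)
ENTRIES = 32               # total entries (from the slide)
WAYS = 4                   # associativity (from the slide)
SETS = ENTRIES // WAYS     # derived: 8 sets
CAPACITY = ENTRIES * LINE_BYTES   # derived: 2048 bytes

READ_BYTES_PER_CYCLE = 2 * 8      # two read ports, 8 bytes each
WRITE_BYTES_PER_CYCLE = 16        # one write port, 16 bytes


def split_address(addr):
    """Return (tag, set_index, byte_offset) for a byte address.

    Assumes the set is selected by the low bits of the line number.
    """
    offset = addr % LINE_BYTES
    line = addr // LINE_BYTES
    return line // SETS, line % SETS, offset
```

Under these assumptions the read ports together drain 16 bytes per cycle, the same rate at which the write port fills the cache, which is consistent with streaming data through it.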
26. Prefetch cache
- Some earlier processors, such as the UltraSPARC II, use prefetch instructions.
- An autonomous stride prefetch engine tracks the program counters of load instructions and detects when a load instruction is striding through memory.
- When the prefetch engine detects a striding load, it issues a hardware prefetch independent of any software prefetch.
- This allows the prefetch cache to be effective even on code that does not include prefetch instructions.
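The stride detection described above can be sketched as a table keyed by the load's program counter. This is a simplified software model, not the hardware design: the class name `StridePrefetcher`, the unbounded dictionary, and the rule of confirming a stride twice before prefetching are assumptions made for illustration.

```python
class StridePrefetcher:
    """Toy per-PC stride detector that suggests the next address to prefetch."""

    def __init__(self, confirm=2):
        self.confirm = confirm   # matching strides needed before prefetching
        self.table = {}          # pc -> (last_addr, last_stride, confidence)

    def observe(self, pc, addr):
        """Record a load and return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: nothing to compare against yet.
            self.table[pc] = (addr, 0, 0)
            return None
        last_addr, last_stride, confidence = entry
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            confidence += 1      # same nonzero stride seen again
        else:
            confidence = 0       # pattern broken, start over
        self.table[pc] = (addr, stride, confidence)
        if confidence >= self.confirm:
            return addr + stride  # predicted next access
        return None
```

A load at one program counter touching addresses 0, 64, 128, 192 yields no prefetch for the first three accesses and a prefetch of address 256 on the fourth. An access that breaks the pattern resets the confidence.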
27. Write cache
- Write caching is an excellent way to reduce the bandwidth consumed by store traffic.
- A write cache is used in the UltraSPARC-III to reduce the store traffic bandwidth to the off-chip L2 data cache.
- Its size is 2 Kbytes, and it is four-way associative.
- An advantage of using it is that, being the sole source of on-chip dirty data, the write cache easily handles both multiprocessor and on-chip cache consistency.
- Error recovery also becomes easier with the write cache, since the write cache keeps all other on-chip caches clean and simply invalidates them when an error is detected.
28. Write caching
- A byte-validate policy is used on the write cache. Rather than reading the data from the L2 cache for the bytes within the line that are not being overwritten, we just keep an individual valid bit for each byte. Not performing the read-on-allocate saves considerable L2 cache bandwidth by postponing a read-modify-write until the write cache evicts a line. Frequently, by eviction time the entire line has been written, so the write cache can eliminate the read.
- The write cache is included in the L2 data cache, and write-cache data can supersede read data from the L2 data cache. We handle this with a byte-merging multiplexer on the incoming L2 cache data bus that can choose either write-cache data or L2 cache data for each byte.
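The byte-validate policy and the byte-merging multiplexer can be modeled in a few lines. This is a minimal sketch with hypothetical names (`WriteCacheLine`, `store`, `merge`) and an assumed 64-byte line; it shows the idea of per-byte valid bits, not the actual circuit.

```python
class WriteCacheLine:
    """Toy write-cache line with one valid bit per byte."""

    def __init__(self, line_bytes=64):   # line size is an assumption
        self.data = bytearray(line_bytes)
        self.valid = [False] * line_bytes

    def store(self, offset, payload):
        """Write bytes into the line without first reading the line from L2."""
        for i, value in enumerate(payload):
            self.data[offset + i] = value
            self.valid[offset + i] = True

    def fully_written(self):
        """True when every byte is valid, so eviction needs no L2 read."""
        return all(self.valid)

    def merge(self, l2_line):
        """Byte-merge: take write-cache data where valid, L2 data elsewhere."""
        return bytes(
            w if v else l
            for w, v, l in zip(self.data, self.valid, l2_line)
        )
```

For instance, after storing two bytes at offset 2 of an 8-byte line, merging with L2 data returns the L2 bytes everywhere except those two positions. Once every byte has been stored, `fully_written()` reports that the read-modify-write can be skipped at eviction.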
29. Floating point unit
- This unit contains the data paths and control logic to execute floating-point and partitioned fixed-point data type instructions.
- Three data paths concurrently execute floating-point or graphics instructions, one per cycle from each of the following classes:
  - Divide/multiply (single or double precision, or partitioned)
  - Add/subtract/compare (single or double precision, or partitioned)
  - An independent division data path, which lets a non-pipelined divide proceed concurrently with the fully pipelined multiply and add paths
- To meet the cycle time, latency cycles must be added to the floating-point operations.
- By using advanced circuit techniques in the floating-point add and multiply units, one added latency cycle is enough.
30. External memory interface
- External memory consists of a large off-chip L2 cache and an off-chip main memory built from synchronous DRAMs.
- L2 cache size: 4 or 8 Mbytes
- Latency: 12 clock cycles to supply a 32-byte line to the L1 cache
- Tags for the L2 cache are placed on chip to detect an L2 miss early.
  - (The L2 cache controller accesses the on-chip tags in parallel with the start of the off-chip SRAM access and provides a way-select signal to a late-select address pin on the off-chip SRAMs.)
31.
- The L2 caches are wave-pipelined and operate at 600 MHz.
- The main memory DRAM controller is on chip, which reduces memory latency and scales the memory bandwidth with the number of processors.
- The memory controller supports up to 4 Gbytes of SDRAM organized as four independent banks.
32. Trap stage in the pipeline
- In this architecture, the classical stall signal (which freezes the state of the pipeline) is eliminated for performance reasons.
- Instead, a trap stage is placed at the end of the pipeline to restore state when an unexpected event occurs.
- Such an event is handled like a trap: the instructions that are in the pipeline are refetched from stage A.
33. Conclusion
- The Sun Microsystems UltraSPARC is one of the advanced RISC microprocessors. It finds many applications in desktops, network systems, and scientific computing machines.
- The internal architecture of the UltraSPARC-III was presented.
- Various parts of the processor were examined, including instruction issue, execution, and on-chip and external memory.