Title: A Study on Hyper-Threading
1. A Study on Hyper-Threading
- Vimal Reddy
- Ambarish Sule
- Aravindh Anantaraman
2. Microarchitectural trends
- Higher degrees of instruction-level parallelism
- Different generations:
  - I. Serial processors: fetch and execute each instruction back to back
  - II. Pipelined processors: overlap different phases of instruction processing for higher throughput
  - III. Superscalar processors: overlap different phases of instruction processing, and issue and execute multiple instructions in parallel for IPC > 1
  - IV. ???
3. Superscalar limits
- Limitations of the superscalar approach:
  - The amount of ILP in most programs is limited
  - The nature of ILP in programs can be bursty
  - Bottom line: resources can be utilized better
4. Simultaneous Multithreading
- Finds parallelism at the thread level
- Executes multiple instructions from multiple threads each cycle
- No significant increase in chip area over a superscalar processor
5. SMT pipeline changes
- Multiple PCs
- Replicate architectural state at fetch: thread selection, replicated RAS, thread ids in the BTB
- Replicate architectural state at rename: multiple rename map tables, multiple architectural map tables, multiple active lists
- Selective squash
- Per-thread disambiguation in the load/store units
[Diagram: superscalar pipeline (fetch unit, instruction cache, decode, register renaming, integer and FP queues, integer and FP registers, integer load/store units, data cache) annotated with the changes above; from ECE 721 notes, Prof. Eric Rotenberg, NCSU]
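To make the replicated state concrete, here is a minimal C sketch of per-thread PCs with a simple thread-selection policy at fetch. It is purely illustrative and not taken from the course notes; round-robin stands in for the smarter fetch policies (such as ICOUNT [4]) used in real SMT designs.

```c
/* Minimal sketch: replicated per-thread PCs and a trivial round-robin
 * thread-selection policy at fetch. Illustrative only. */
#define NUM_THREADS 2

unsigned long pc[NUM_THREADS];        /* replicated architectural PCs */

/* Pick which thread fetches this cycle (real designs use policies
 * such as ICOUNT rather than plain round-robin). */
int select_fetch_thread(unsigned long cycle) {
    return (int)(cycle % NUM_THREADS);
}

unsigned long next_fetch_pc(unsigned long cycle) {
    return pc[select_fetch_thread(cycle)];
}
```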
6. Hyper-Threading
- Brings the goodness of Simultaneous Multi-Threading (SMT) to the Intel Architecture
- Motivation (same as that for SMT):
  - High processor utilization
  - Better throughput (by exploiting thread-level parallelism - TLP)
  - Power efficient due to smaller processor cores compared to CMP
7. Hyper-Threading (contd.)
- 2 logical processors (2 threads in SMT terminology)
- Shared instruction trace cache and L1 D-cache
- 2 PCs and 2 register renamers
- Other resources partitioned equally between the 2 threads
- Recombines shared resources when single-threaded (no degradation of single-thread performance); see the sketch below
[Figure: Intel NetBurst microarchitecture pipeline with Hyper-Threading Technology]
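A minimal sketch of the partition/recombine behavior described above, assuming a hypothetical 64-entry queue; the size and function name are illustrative, not actual NetBurst parameters.

```c
/* Sketch of equal partitioning vs. recombining of a partitioned resource
 * (e.g. a queue). QUEUE_ENTRIES is an arbitrary illustrative size. */
#define QUEUE_ENTRIES 64

int entries_per_thread(int active_logical_processors) {
    if (active_logical_processors <= 1)
        return QUEUE_ENTRIES;      /* recombined: a single thread gets everything */
    return QUEUE_ENTRIES / 2;      /* partitioned equally between the 2 threads   */
}
```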
8. Project Goal
- Measure performance of micro-benchmarks (kernels) on the Pentium 4. Form workloads that utilize different processor resources and study their behavior.
9. Pentium 4 Functional Units
- 3 integer ALU units (2 double speed)
- 1 unit for floating-point computation
- Separate address generator units for loads and stores
10. Micro-benchmarks
- Created 3 types of kernels:
- Floating-point-intensive kernel (flt)
  - Performs FP add, subtract, multiply, and divide operations a large number of times
  - Targets the single FP unit
- Integer-intensive kernel (int)
  - Performs integer add, subtract, and shift a large number of times
  - Targets the integer units (2 double speed and 1 slow)
- Memory-intensive kernel (mem, mem_s)
  - Dynamically allocates a linked list larger than the L1 D-cache and traverses it
  - Targets the shared data cache and the memory hierarchy as such
11. Micro-benchmarks (contd.)
- Integer kernel, floating-point kernel, and memory-intensive kernel (listings shown on the slide; an illustrative sketch follows)
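The original slide showed the kernel listings themselves; since those are not reproduced here, the C sketch below reconstructs each kernel from the descriptions on the previous slide. Loop counts, node layout, and list length are arbitrary placeholders (the Northwood L1 D-cache is 8 KB, so the list only needs to exceed that).

```c
/* Illustrative reconstruction of the three kernels from their descriptions;
 * ITERS and LIST_NODES are arbitrary placeholders, not the project's values. */
#include <stdlib.h>

#define ITERS      100000000L
#define LIST_NODES (1 << 16)    /* enough nodes to exceed the 8 KB L1 D-cache */

/* flt: dependent FP add, subtract, multiply, divide; targets the single FP unit */
double flt_kernel(void) {
    double a = 1.0;
    for (long i = 0; i < ITERS; i++) {
        a = a + 3.0;
        a = a - 0.5;
        a = a * 1.0001;
        a = a / 1.0002;
    }
    return a;
}

/* int: integer add, subtract, shift; targets the integer ALUs */
long int_kernel(void) {
    long a = 1;
    for (long i = 0; i < ITERS; i++) {
        a = a + 3;
        a = a - 1;
        a = a << 1;
        a = a >> 1;
    }
    return a;
}

/* mem: build a linked list larger than the L1 D-cache, then traverse it */
struct node { struct node *next; long pad[7]; };

long mem_kernel(void) {
    struct node *head = NULL;
    long sum = 0;
    for (long i = 0; i < LIST_NODES; i++) {       /* allocate the list */
        struct node *p = malloc(sizeof *p);
        p->pad[0] = i;
        p->next = head;
        head = p;
    }
    for (int rep = 0; rep < 1000; rep++)          /* repeated pointer-chasing traversal */
        for (struct node *p = head; p != NULL; p = p->next)
            sum += p->pad[0];
    return sum;
}
```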
12. Workbench
- Machine: Pentium 4 Northwood, 2.53-2.66 GHz, with Hyper-Threading
- Operating system: Linux 2.4.18-SMP kernel; the OS views each thread as a processor
- BIOS setting to turn HT on/off
- PERL script to fork processes at the same time (a rough C equivalent is sketched below)
- top (Linux utility) to monitor processes (processor and memory utilization)
- time utility to get timing statistics for each program
- Ran each experiment 10 times and took the average execution time
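The actual launcher was a PERL script combined with the time utility; as a rough, hypothetical C equivalent, the sketch below forks the kernels of one workload at the same time and reports the combined wall-clock time. The program paths are placeholders.

```c
/* Rough C stand-in for the PERL launcher: fork both kernels of a workload
 * simultaneously, wait for them, and report wall-clock time.
 * The benchmark paths are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    const char *progs[] = { "./flt", "./int" };   /* one workload, e.g. int+flt */
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < 2; i++) {
        pid_t pid = fork();
        if (pid == 0) {                           /* child: run one kernel */
            execl(progs[i], progs[i], (char *)NULL);
            perror("execl");
            _exit(1);
        }
    }
    while (wait(NULL) > 0)                        /* parent: wait for both kernels */
        ;
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("workload wall-clock time: %.2f s\n", secs);
    return 0;
}
```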
13. Methodology
- Run different workload combinations:
  - flt+flt: 2 floating-point kernels
  - mem_s+mem_s: 2 small memory-intensive kernels
  - int+flt: 1 integer and 1 floating-point kernel, and so on...
- Run in 3 modes:
  - 1. back-to-back: run each program individually
  - 2. HT off: no Hyper-Threading, but OS context switching
  - 3. HT on: Hyper-Threading on and OS context switching
- Find contending workloads: compete for resources and degrade performance (increased execution time with HT on)
- Find complementary workloads: utilize idle resources and increase performance (decreased execution time with HT on)
14. Experiments: Single-thread performance
- Hyper-Threading does not degrade single-thread performance
15. Experiments (contd.)
- Contention for the single FP unit increases execution time
- Contention for the data cache can lead to thrashing
16. Experiments (contd.)
- Integer workloads perform well: the 3 integer units (2 double speed) are well utilized
- Workloads with complementary resource requirements perform well (int+flt, mem+int)
- The OS plays an important role when the number of programs > the number of hardware contexts available
17. Experiments (contd.)
18. Experiments (contd.)
- Execution time with the 3-kernel workload is less than that for 2!
- Scheduling is important!
  - int+flt+flt: the int kernel gets 100% of one thread; the two flt kernels split the other 50/50
  - flt+flt+int: one flt kernel gets 100% of one thread; the int and flt kernels split the other 50/50. Has higher execution time!
19. Project Goal
- Model Hyper-Threading on a simulator. Vary key parameters and study first-order effects.
20. Simulator details
- Execution-driven, cycle-accurate simulator based on the SimpleScalar toolset
- Extended the simulator to model SMT and Hyper-Threading (see the sketch below):
  - Resource sharing by tagging entries with a thread id (I-cache, D-cache)
  - Resource replication through multiple instantiation (PC, map tables, branch history, RAS)
  - Resource partitioning by keeping separate instances but imposing a global limit on entries (active list, load/store buffers, issue queues)
- Stop simulation after completion of all threads
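A minimal C sketch of the three resource-management strategies listed above. The structure and field names are invented for exposition and are not the actual simulator code.

```c
/* Illustrative sketch of the three strategies: sharing via thread-id tags,
 * replication of architectural state, and partitioning under a global limit.
 * Names and sizes are invented, not taken from the simulator. */
#define NUM_THREADS 2

/* 1. Sharing: cache entries live in one shared array, tagged with the
 *    owning thread id. */
struct cache_entry {
    unsigned long tag;
    int           valid;
    int           thread_id;
};

/* 2. Replication: one full copy of architectural state per thread. */
struct thread_state {
    unsigned long pc;
    int           rename_map[64];   /* per-thread rename map table      */
    unsigned long ras[16];          /* per-thread return address stack  */
    unsigned long branch_history;   /* per-thread branch history        */
};
struct thread_state threads[NUM_THREADS];

/* 3. Partitioning: separate per-thread occupancy counts, but one global
 *    limit on total entries (active list, load/store buffers, issue queues). */
struct partitioned_queue {
    int occupied[NUM_THREADS];
    int global_limit;
};

/* A thread may insert only while the global limit is not exhausted;
 * a per-thread cap (as on slide 7) could also be enforced here. */
int pq_can_insert(const struct partitioned_queue *q, int tid) {
    (void)tid;
    return (q->occupied[0] + q->occupied[1]) < q->global_limit;
}
```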
21. Simulator details
22. Simulator SMT/HT validation
23. Experiment: Modeling L1 data cache interference
24. Experiment: Modeling issue queue partitioning
25. Experiment: Modeling total issue queue size with partitioning
26. Experiment: Varying load/store buffer sizes (Pentium 4: 48 load, 24 store)
27. Experiment: Comparison of fetch policies
28. References
- [1] Prof. Eric Rotenberg, Course Notes, ECE 792E Advanced Microarchitecture, Fall 2002, NC State University.
- [2] Deborah T. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Vol. 6, Issue 1, 1st Qtr 2002.
- [3] Vimal Reddy, Ambarish Sule, Aravindh Anantaraman, "Hyperthreading on the Pentium 4," ECE 792E Project, Fall 2002. http://www.tinker.ncsu.edu/ericro/ece721/student_projects/avananta.pdf
- [4] D. M. Tullsen et al., "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," 23rd Annual ISCA, pp. 191-202, May 1996.
29. Questions