Title: Microprocessor system architectures IA32 advanced features and rests
1Microprocessor system architectures IA32
advanced features and rests
2Multiple-processor management
- Mechanisms
- Support for atomic operations on system memory
- Serializing instructions
- APIC
- L2 and L3 caches
- Hyper-threading
- Aims
- Maintain system memory coherence
- Maintain cache coherence
- Predictable ordering of writes to memory
- Distribute interrupt handling among processors
- Increase system performance by exploiting
multi-threaded OSs and applications
3Locked atomic operations
- Three independent mechanisms
- Guaranteed atomic operations
- Bus locking using the LOCK# signal or the LOCK
instruction prefix
- Cache coherency protocols ensuring cache
coherency for atomic operations on cached data
(cache lock) (Pentium Pro)
4Guaranteed atomic operations
- i486
- R/W a byte
- R/W a word (2B) aligned on a word
- R/W a dword (4B) aligned on a dword
- Pentium
- R/W a qword (8B) aligned on a qword
- R/W a word from/to uncached memory that fits
within a 32-bit bus
- Pentium Pro
- Unaligned word, dword, qword R/W from/to cached
memory within a cache line
5Bus locking
- Automatic locking
- XCHG with memory
- Setting B (busy) flag of a TSS descriptor
- Updating descriptors (e.g. A flag)
- Updating page tables
- Interrupt acknowledgement
- Software controlled locking (prefix LOCK)
- Automatically assumed for XCHG
- BTS, BTC, BTR
- XADD, CMPXCHG, CMPXCHG8B
- INC, DEC, NOT, NEG, ADD, ADC, SUB, SBB, AND, OR,
XOR
- Otherwise #UD exception (invalid opcode)
- Memory access can be unaligned
- Pentium Pro serializes locked operations
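A minimal sketch of how software-controlled locking looks from C. It uses C11 `<stdatomic.h>`. On IA-32, compilers typically translate these calls into LOCK-prefixed XADD and CMPXCHG, but that mapping is an assumption about the compiler, not a guarantee. The names `counter`, `counter_add` and `counter_cas` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared counter updated only through atomic read-modify-write operations. */
_Atomic uint32_t counter = 0;

/* Atomically add delta and return the previous value (cf. LOCK XADD). */
uint32_t counter_add(uint32_t delta)
{
    return atomic_fetch_add(&counter, delta);
}

/* Atomically replace `expected` with `desired` (cf. LOCK CMPXCHG).
 * Returns true when the exchange took place. */
bool counter_cas(uint32_t expected, uint32_t desired)
{
    return atomic_compare_exchange_strong(&counter, &expected, desired);
}
```

Calling `counter_add(5)` on a fresh counter returns the old value 0 and leaves 5 in memory. A following `counter_cas(5, 9)` succeeds, while `counter_cas(5, 1)` then fails because the counter no longer holds 5.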
6Self-modifying code
- Option 1
- Write modified code using data segment
- Jump to new code or an intermediate location
- Execute the new code
- Option 2
- Write modified code using data segment
- Execute a serializing instruction
- Execute the new code
- Required for Pentium Pro
- Performance penalty
- Cross-modifying code
- One CPU modifies the code and a second one
executes it
- Synchronize CPUs and execute a serializing
instruction
7Memory ordering
- Program-ordering
- Alias strong-ordering
- R/W issued on the bus in the order they occur in
the instruction stream under all circumstances
- i386
- Processor-ordering
- Alias speculative-ordering or weak-ordering
- Allows increased instruction execution speed,
while maintaining memory coherency
- The exact behavior depends on the CPU model
- Pentium Pro
- Pentium and i486
- They use processor-ordering
- In most cases they behave as program-ordered
- R miss goes ahead of W, when all buffered W are
cache hits
- I/O always in the order of instruction stream
(strong-ordering)
8Processor-ordering I.
- Single-processor and WB memory
- R can be carried out speculatively and in any
order
- R can pass buffered W, but the CPU is
self-consistent
- W to memory are always carried out in program
order, excluding instructions CLFLUSH, MOVNTI,
MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD
- W can be buffered
- W are not speculative; they are performed only
for really executed (retired) instructions
- Data from buffered W can be passed to waiting R
within the CPU
- R/W cannot pass I/O, locked or serializing
instructions
- R cannot pass LFENCE and MFENCE
- W cannot pass SFENCE and MFENCE
- Multiple CPUs
- Individual CPUs behave as single-processor
- Writes by a single CPU are observed in the same
order by all CPUs
- Writes from the individual CPUs on the bus are
NOT ordered with respect to each other
9Processor-ordering II.
10Fast string operation
- Fast string
- Pentium Pro
- MOVS or STOS
- CPU works with cache lines
- Reads are not performed during cache line writes
- Interrupts only on the cache line border
- Conditions
- EDI and ESI aligned to 8B (PIII), EDI aligned to
8B (P4)
- Ascending order (DF = 0)
- Initial counter ECX >= 64
- Source and target must not overlap by less than
one cache line (64B for P4, 32B other)
- Memory type WC or WB
11Strengthening or weakening memory ordering
- Strengthening
- I/O instructions, locked instructions, the LOCK
prefix and serializing instructions
- SFENCE (PIII), LFENCE and MFENCE (P4)
- SFENCE: all W finished before this instruction
- LFENCE: all R finished before this instruction
- MFENCE: all R and W finished before this
instruction
- PAT (Page Attribute Table) strengthens ordering
for pages (PIII)
- Weakening or strengthening
- MTRR (Memory Type Range Registers) weaken or
strengthen ordering for physical memory regions
(Pentium Pro)
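The fence bullets above can be illustrated with portable C11 fences. This is a sketch by analogy only. A release fence plays the role SFENCE has for weakly ordered stores, and an acquire fence plays the role of LFENCE. On WB memory an x86 compiler may emit no fence instruction at all for them. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                /* ordinary data being handed over */
static atomic_bool ready = false;  /* flag that publishes the data    */

/* Producer: the release fence keeps the payload store ahead of the
 * flag store, so a reader that sees the flag also sees the payload. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: returns false while nothing is published. Otherwise the
 * acquire fence keeps the payload load behind the flag load. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

A full barrier in the sense of MFENCE would correspond to `atomic_thread_fence(memory_order_seq_cst)`.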
12Serializing instructions
- CPU finishes all flags, registers and memory
changes
- CPU drains all buffered W
- Pentium
- Privileged instructions
- MOV CRx, MOV DRx, WRMSR, INVD, INVLPG, WBINVD,
LGDT, LIDT, LTR
- Non-privileged instructions
- CPUID, IRET, RSM
- Non-privileged for memory ordering
- LFENCE, SFENCE, MFENCE
13Propagation of page table entry changes
- TLB shootdown
- Simple method
- Send IPI to all CPUs
- Stop all CPUs excluding one (spin-lock)
- Active CPU makes the changes (updates the page
tables in memory) and resumes all CPUs
- All CPUs invalidate their TLB (selectively or
all entries)
- All CPUs return from IPI
- Complicated and faster methods can be developed
- Different TLB mappings are not used on different
CPUs during the update
- The OS must be prepared for a situation where
CPUs use stale mapping during the update
14MPS 1.4
- Multiprocessor Specification
- Controlled booting of multiple CPUs without a
dedicated HW
- HW can initiate a boot without a dedicated signal
or a predefined boot CPU
- All IA-32 CPUs have the same boot protocol
(including HT)
- Different mechanisms for different CPU models (P4
vs. older Xeon vs. newer Xeon)
- BSP: Bootstrap Processor
- AP: Application Processor
15Detecting hyper-threading or multi-core
- Hardware Multi-Threading feature flag
- CPUID.1:EDX[28] = 1
- Logical processors per Package
- CPUID.1:EBX[23:16]
- Cores per Package
- Only when CPUID works with EAX = 4, otherwise it
has 1 core
- CPUID.(EAX=4,ECX=0):EAX[31:26] + 1
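The bit fields listed above can be decoded with a few helpers. This sketch takes register values as plain arguments, so it stays portable. Obtaining them is left to the caller, for example through a compiler's CPUID intrinsic. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of a 32-bit register value. */
uint32_t bits(uint32_t reg, unsigned hi, unsigned lo)
{
    assert(hi < 32 && lo <= hi);
    uint32_t width = hi - lo + 1;
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (reg >> lo) & mask;
}

/* CPUID.1:EDX[28]: hardware multi-threading feature flag. */
bool has_htt(uint32_t leaf1_edx)
{
    return bits(leaf1_edx, 28, 28) == 1;
}

/* CPUID.1:EBX[23:16]: logical processors per package. */
uint32_t logical_per_package(uint32_t leaf1_ebx)
{
    return bits(leaf1_ebx, 23, 16);
}

/* CPUID.(EAX=4,ECX=0):EAX[31:26] + 1, or 1 when leaf 4 is unsupported.
 * max_leaf is the highest basic leaf, as reported by CPUID.0:EAX. */
uint32_t cores_per_package(uint32_t max_leaf, uint32_t leaf4_eax)
{
    return (max_leaf >= 4) ? bits(leaf4_eax, 31, 26) + 1 : 1;
}
```

Dividing `logical_per_package` by `cores_per_package` gives a rough count of logical processors per core, provided the package reports both values consistently.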
16Hyper-threading I
- One core is able to execute 2 or more instruction
streams
- Some parts of a core are private for each logical
processor, some parts are shared among logical
processors
17Hyper-threading II
- Private state of a logical processor
- General purpose registers EAX-ESP (RAX-RSP,
R8-R15)
- Segment registers CS-SS
- EFLAGS and EIP (RIP)
- x87 (ST0-ST7), MMX (MM0-MM7), SSE
(XMM0-XMM7/XMM15) and their control and status
registers
- Control registers CRx, GDTR, IDTR, LDTR, IA32_EFER
- Debug registers DRx
- Time stamp
- Most of MSRs (including PAT)
- Local APIC
- Instruction TLB
- Shared state
- MTRR
- Data TLB
- Cache, the bus
- Some MSRs
18Multi-Core
19Programming MT-capable CPUs I
- Requires support from OS
- Using PAUSE instruction in spin-lock
- Encoded as REP NOP
- Older IA-32 CPUs interpret PAUSE as NOP
- Older AMD CPUs do NOT understand it
- Using HLT
- Idle logical processor must use HLT and must not
actively wait
- Using MONITOR/MWAIT
- SSE3, check CPUID.1:ECX[3] = 1, available only
for CPL0
- MONITOR sets up a memory range monitored for W
- MWAIT places the processor in an optimized state
until a W to the monitored range occurs
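A minimal sketch of a spin-lock that executes PAUSE while waiting, as recommended above. It assumes GCC or Clang, whose `__builtin_ia32_pause()` emits PAUSE on x86. On other targets the hypothetical `cpu_relax()` macro expands to nothing. The type `spinlock_t` and the three functions are illustrative names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* PAUSE on x86 (GCC/Clang builtin), a no-op elsewhere. */
#if defined(__i386__) || defined(__x86_64__)
#define cpu_relax() __builtin_ia32_pause()
#else
#define cpu_relax() ((void)0)
#endif

typedef struct { atomic_flag locked; } spinlock_t;

/* Try to take the lock once; returns true on success. */
bool spin_trylock(spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Spin until the lock is free, hinting the core on every failed try. */
void spin_lock(spinlock_t *l)
{
    while (!spin_trylock(l))
        cpu_relax();
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A lock is created with `spinlock_t l = { ATOMIC_FLAG_INIT };`. The PAUSE hint matters most when the other logical processor of the same core holds the lock, because the spinning thread otherwise competes for shared execution resources.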
20Programming MT-capable CPUs II
- Scheduling
- Dispatch tasks to logical processors 0 for all
cores, then to logical processors 1, etc.
- Use thread affinity
- Do not measure the speed of a CPU by an active
loop
- Each lock or semaphore should be placed alone in
an aligned 128B block of memory
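The 128B placement rule can be expressed with C11 alignment specifiers. In this sketch the block size is taken from the slide, and `padded_lock_t` and `locks` are hypothetical names. Aligning the member also rounds the structure size up to 128 bytes, so neighbouring locks never share a block.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_BLOCK 128  /* block size recommended on the slide */

/* One lock per aligned block; the alignment pads sizeof to 128. */
typedef struct {
    alignas(LOCK_BLOCK) atomic_int value;
} padded_lock_t;

static_assert(sizeof(padded_lock_t) == LOCK_BLOCK, "one lock per block");
static_assert(alignof(padded_lock_t) == LOCK_BLOCK, "block-aligned");

/* Array elements are spaced exactly one block apart. */
padded_lock_t locks[4];
```

The cost is memory. Four locks occupy 512 bytes instead of 16, which is usually acceptable for a small number of heavily contended locks.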
21APIC (Advanced Programmable Interrupt Controller)
- Local APIC
- Internal in CPUs
- Receives interrupts from the CPU's interrupt pins,
from internal sources and from an external I/O
APIC
- Sends and receives IPI (InterProcessor Interrupt)
- I/O APIC
- Part of a chipset
- Receives external interrupts and relays them to a
local APIC
- Possibility of IPI distribution among CPUs
- xAPIC
- Newer architecture
- EXtended APIC
- P4 and Xeons
22APIC xAPIC
- xAPIC system (P4 and Xeon)
23APIC traditional APIC
- APIC system (Pentium and Pentium Pro)
24Local APIC structure
25Internal cache
- Cache structure of P4 and Xeon
26Characteristics of caches
27Cache terminology
- Caches use the MESI protocol for maintaining coherency
- Cache line fill
- An operand is read from cacheable memory
- The entire cache line is read
- Cache hit
- An operand is in a cache
- An access uses a value from a cache
- Cache miss
- An operand is not in a cache
- Write hit
- If a valid cache line exists, CPU can write into
the cache
- If a write misses a cache, cache line fill occurs
- Snooping
- CPU checks memory accesses on the bus with its
cache lines
28MESI
- Each cache line has 2 status bits
- Transparent for programs
- Instruction L1 has only SI
- Transition by snooping
- CPU detects W to the line with M
- Cancel transaction
- Write the line directly to the other CPU and, in
parallel, to the memory
- Moving to the I state
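The snooping transition above is one case of the textbook MESI state machine. The sketch below models only the snoop side, that is, how a line changes when another bus agent reads or writes the same address. It is a simplification of the generic protocol, not a description of any particular CPU, and `mesi_t` and `mesi_snoop` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Four states, which is why two status bits per line are enough. */
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;

/* New state of a local line after a snooped remote access.
 * *writeback is set when the local dirty line must be supplied first. */
mesi_t mesi_snoop(mesi_t state, bool remote_write, bool *writeback)
{
    *writeback = (state == MESI_M);   /* only Modified lines are dirty */
    if (state == MESI_I)
        return MESI_I;                /* nothing cached, nothing to do */
    return remote_write ? MESI_I      /* remote W invalidates the line */
                        : MESI_S;     /* remote R downgrades to Shared */
}
```

For the case on the slide, `mesi_snoop(MESI_M, true, &wb)` returns `MESI_I` with `wb` set. The modified line is written out and the local copy is invalidated.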
29Cache control
- CR0.CD
- 0: caching enabled for the whole of system
memory, can be restricted for regions or pages
- 1: caching disabled for Pentium, for other
restricted
- CR0.NW
- 0: WB enabled, can be restricted
- 1: WB disabled
- PCD and PWT in the page tables and directories
- Disable caching/WB for pages or page directories
- PCD and PWT in the CR3
- Disable caching/WB for page directories
- G in the page tables (Pentium Pro)
- Does not flush TLB entry during implicit flushing
(task switch, mov cr3,eax)
- CR4.PGE (Pentium Pro)
- Enables G in page tables
- MTRR (Pentium Pro)
- Memory types for regions of physical memory
- PAT (PIII)
- Memory types for pages
30Store buffers
- IA-32 temporarily stores each W to memory in a
store buffer
- CPU continues without waiting for the memory or a
cache
- Transparent for software
- Draining store buffers
- An interrupt or an exception
- Serializing instruction (Pentium Pro)
- I/O operation
- LOCK operation
- BINIT operation (Pentium Pro) (machine check)
- SFENCE instruction (PIII)
- MFENCE instruction (P4)
31Memory types: an overview
- Pentium has UC, WT, WB
- Control using NW, CD
- UC- from PIII with PAT
32Memory types I
- Strong uncacheable (UC)
- The system memory is not cached
- All R/W have strong-ordering, no speculation
- Useful for memory-mapped I/O
- Greatly reduces system performance
- Uncacheable (UC-)
- Like UC, can be overridden to WC using MTRR
- Only PIII using PAT
- Write Combining (WC)
- The system memory is not cached
- No coherency protocol
- Speculative R enabled, W ordering is NOT ensured
- W delayed and combined in WC buffers
- Useful for video frame buffers
33Memory types II
- Write Through (WT)
- R/W from/to the system memory cached
- R comes from a cache on cache hit; cache line
fills on cache miss; speculative R
- W writes to a cache and the main memory on cache
hit; does not write to the cache on cache miss
- WC enabled
- Useful for video frame buffers or devices without
snooping
- Write Back (WB)
- R/W from/to the system memory cached
- R comes from a cache on cache hit; cache line
fills on cache miss; speculative R
- W writes only to the cache on cache hit (the line
is written back to the main memory later); cache
line fill on cache miss
- Cache coherency protocol
- Write Protected (WP)
- R comes from a cache on cache hit; cache line
fills on cache miss; speculative R
- W directly propagated on the system bus
34MTRR (Memory Type Range Registers)
- Assigning memory types to the physical memory
regions
- Checking MTRR presence using CPUID
- MSR R/O register IA32_MTRRCAP
- Support for fixed ranges
- Number of variable ranges (Pentium Pro)
- Support for WC type
- Default type
- MSR IA32_MTRR_DEF_TYPE defines memory type for
physical memory not covered by fixed and variable
ranges
- Fixed ranges
- 8 ranges of 64K size in the lowest 512K
(00000000-0007FFFF)
- 16 ranges of 16K size in the next 256K
(00080000-000BFFFF)
- 64 ranges of 4K size in the next 256K
(000C0000-000FFFFF)
- Variable ranges
- Address & PHYSMASKn = PHYSBASEn & PHYSMASKn
- When a variable range overlaps with a fixed
range, the fixed range wins
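The fixed-range layout and the variable-range match rule above can be sketched as two small functions. For simplicity the sketch assumes that base and mask are given as full physical addresses with the low 12 bits cleared, whereas the real MSRs store them as page-frame fields. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index (0..87) of the fixed range covering addr, or -1 above 1 MB.
 * 0..7 are the 64K ranges, 8..23 the 16K ranges, 24..87 the 4K ranges. */
int mtrr_fixed_index(uint64_t addr)
{
    if (addr < 0x80000)
        return (int)(addr >> 16);
    if (addr < 0xC0000)
        return 8 + (int)((addr - 0x80000) >> 14);
    if (addr < 0x100000)
        return 24 + (int)((addr - 0xC0000) >> 12);
    return -1;
}

/* Variable range n covers addr when
 * (addr & PHYSMASKn) == (PHYSBASEn & PHYSMASKn). */
bool mtrr_var_match(uint64_t addr, uint64_t phys_base, uint64_t phys_mask)
{
    return (addr & phys_mask) == (phys_base & phys_mask);
}
```

A lookup would first call `mtrr_fixed_index` and consult the variable ranges only when it returns -1, which mirrors the rule that a fixed range wins over an overlapping variable range.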
35PAT (Page Attribute Table)
- Assigning memory type to the ranges of linear
address space
- Checking PAT presence using CPUID
- MSR IA32_CR_PAT defines 8 types
- The type for a page is selected from IA32_CR_PAT
by an index created from PAT(4), PCD(2), PWT(1)
bits in page tables
- It is always switched on
- The initial setting after RESET is backward
compatible with PCD and PWT: 2 x (WB, WT, UC-,
UC)
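The index calculation and the reset contents described above can be written out as follows. The type encodings and the reset value are the ones documented for the PAT MSR. The helper names `pat_index` and `pat_type` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Memory-type encodings used in the PAT MSR entries. */
enum { PAT_UC = 0x00, PAT_WC = 0x01, PAT_WT = 0x04,
       PAT_WP = 0x05, PAT_WB = 0x06, PAT_UC_MINUS = 0x07 };

/* Reset value: entries 0..7 = WB, WT, UC-, UC, WB, WT, UC-, UC. */
#define PAT_RESET 0x0007040600070406ULL

/* Index into the table: PAT has weight 4, PCD weight 2, PWT weight 1. */
unsigned pat_index(bool pat, bool pcd, bool pwt)
{
    return (pat ? 4u : 0u) | (pcd ? 2u : 0u) | (pwt ? 1u : 0u);
}

/* Memory type selected for a page; each entry occupies one byte. */
unsigned pat_type(uint64_t pat_msr, bool pat, bool pcd, bool pwt)
{
    return (unsigned)(pat_msr >> (8 * pat_index(pat, pcd, pwt))) & 0x07u;
}
```

With the reset value, a page with PCD = 0 and PWT = 0 resolves to WB and one with PCD = 1 and PWT = 1 resolves to UC, independently of the PAT bit. This is what makes the default backward compatible.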
36Memory types restrictions
- If CR0.CD = 1, then caching is disabled
- If CR0.CD = 0, then caching is restricted using
PAT (or PCD and PWT) and MTRR
- The most restrictive type is always selected
- WT wins over WB
- WC wins over WT and WB
37Reset
- Sets a CPU to a well-known state
- CPU in the real mode
- Internal caches, TLB and BTB invalidated
- CPU model dependent behavior
- Pentium Pro
- All CPUs start initialization protocol, one of
them is chosen as BSP and continues in an OS
initialization, all other APs halt and wait for
a startup IPI (SIPI)
- i486 and Pentium
- HW knows which CPU is BSP, other APs halt and
wait for SIPI
- INIT
- Like RESET
- Internal caches, MSR, MTRR, x87, SSE do not
change
- Move to the real mode
38CPU state after RESET, INIT and power-up
39Microcode update
- Pentium Pro has an interface for uploading a
microcode block with patches to the CPU
- Microcode block is supplied by Intel directly to
the BIOS vendors
- Microcode block has a header with CPU model
specification
- Checking CPU model in the microcode header
against the current CPU
- Microcode must be uploaded before L2 is enabled;
many other constraints apply (e.g. segment limit
exceeding)
40Virtual machine extensions (VMX)
- Two classes of software
- Virtual machine monitor (VMM)
- Acts like a host
- Full control of HW
- Presents abstract HW to guests
- Guest software
- Guest software environment with OS and
applications
41Virtual-machine control data structure (VMCS) I
- VMX non-root operation and VMX transitions
controlled by a VMCS
- Access through the VMCS pointer (one per logical
CPU)
- Reading and changing the pointer using VMPTRST
and VMPTRLD instructions
- VMCS configuration using VMREAD, VMWRITE, VMCLEAR
instructions
- VMM could use a different VMCS for each virtual
CPU
- Each logical CPU associates a physical memory
region (one 4KB frame) with each VMCS
42Virtual-machine control data structure (VMCS) II
- VMCS state
- Inactive
- After VMCLEAR
- Active
- Memory region after VMPTRLD
- Maintains CPU state
- Current
- VMPTRLD loads current VMCS
- VMLAUNCH, VMPTRST, VMREAD, VMRESUME and VMWRITE
operate with current VMCS
43Virtual-machine control data structure (VMCS)
III
- VMCS data
- Guest-state area
- CPU state is saved on VM exits and loaded from
there on VM entries
- Host-state area
- CPU state is loaded on VM exits
- VM-execution control fields
- VM-exit control fields
- VM-entry control fields
- VM-exit information fields
44Guest-state area
- Registers
- CR0, CR3, CR4
- RSP, RIP, RFLAGS
- CS, DS, ES, FS, GS, SS, LDTR, TR
- Selector and part of internal cache
- GDTR, IDTR
- MSRs
- IA32_DEBUGCTL, IA32_SYSENTER_CS,
IA32_SYSENTER_ESP, IA32_SYSENTER_EIP
- Activity state
- Active, HLT, shutdown, wait-for-SIPI
- Interruptibility state
- Blocking by STI, MOV SS, NMI, SMI
- Pending debug exceptions
- VMCS link pointer
45Host-state area
- Registers
- CR0, CR3, CR4
- RSP, RIP
- CS, DS, ES, FS, GS, SS, TR
- Base address for FS, GS, TR, GDTR, IDTR
- MSRs
- IA32_SYSENTER_CS, IA32_SYSENTER_ESP,
IA32_SYSENTER_EIP
46VM-execution control fields
- Pin-based VM-execution controls
- VM-exits on external interrupt or NMI
- CPU-based VM-execution controls
- Instructions and events causing VM-exits
- Exception bitmap
- I/O-bitmap addresses
- Guest/host masks and read shadows for CR0 and CR4
- CR3 target controls
- 4 target addresses + counter
- CR8 access control
- MSR bitmap address
47VM-exit control fields
- VM-exit controls
- Basic operation of VM-exit
- VM-exit controls for MSRs
- List of MSRs stored and loaded on VM-exit
48VM-entry control fields
- VM-entry controls
- Basic operation on VM-entry
- VM-entry controls for MSRs
- List of MSRs to be loaded on VM-entry
- Event injection
- Executed before the first guest-mode
instruction
- Interrupts, exceptions including error-code
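Event injection is requested through a 32-bit VM-entry interruption-information value. The sketch below only assembles that value. Writing it into the VMCS with VMWRITE is outside its scope, and the helper name `make_injection` and the enum names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected interruption types (bits 10:8 of the field). */
enum { EVT_EXT_INTR = 0, EVT_NMI = 2, EVT_HW_EXCEPTION = 3,
       EVT_SW_INTR = 4 };

/* Layout: bits 7:0 vector, bits 10:8 type, bit 11 deliver error code,
 * bit 31 valid. All other bits are left zero. */
uint32_t make_injection(uint8_t vector, unsigned type, bool error_code)
{
    assert(type <= 7);
    return (uint32_t)vector
         | ((uint32_t)type << 8)
         | (error_code ? (1u << 11) : 0u)
         | (1u << 31);
}
```

Injecting a page fault (vector 14, a hardware exception that carries an error code) gives `make_injection(14, EVT_HW_EXCEPTION, true)`, which evaluates to 0x80000B0E.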
49VM-exit information fields
- Basic VM-exit information
- Exit reason, exit qualification
- Vectored events
- Interrupts, exceptions
- VM-exits during event delivery
- VM-exits due to instruction execution
- Instruction address, length, detailed information
50VMXON region
- Physical memory region (4KB frame) for VMX
operation
- Operand of VMXON instruction
51Using VMCS
- VMCLEAR should be executed before VM-entry
- VMLAUNCH should be used for the first VM-entry
using VMCS after VMCLEAR
- VMRESUME should be used for any subsequent
VM-entry
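The VMCLEAR / VMLAUNCH / VMRESUME rules above amount to a two-state machine per VMCS. The following sketch is a pure software model of that rule with hypothetical names. It executes no VMX instructions and ignores every other reason a VM-entry can fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Launch state tracked for one VMCS in this model. */
typedef enum { VMCS_CLEAR, VMCS_LAUNCHED } launch_state_t;
typedef struct { launch_state_t launch; } vmcs_model_t;

/* VMCLEAR puts the VMCS into the clear state. */
void model_vmclear(vmcs_model_t *v)
{
    v->launch = VMCS_CLEAR;
}

/* VMLAUNCH is accepted only on a clear VMCS and marks it launched. */
bool model_vmlaunch(vmcs_model_t *v)
{
    if (v->launch != VMCS_CLEAR)
        return false;
    v->launch = VMCS_LAUNCHED;
    return true;
}

/* VMRESUME is accepted only on a VMCS that was already launched. */
bool model_vmresume(const vmcs_model_t *v)
{
    return v->launch == VMCS_LAUNCHED;
}
```

A VMM built on this model would call `model_vmclear` once, `model_vmlaunch` for the first entry and `model_vmresume` for every later entry, and would clear the VMCS again before moving it to another logical CPU.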
52VMX non-root operation
- Instructions that cause VM-exit
- Unconditionally: CPUID, INVD, MOV from CR3, all
VMX instructions
- Conditionally: CLTS, HLT, IN/OUT, INVLPG, LMSW,
MONITOR, MOV CR8, MOV to CR0, MOV to CR3, MOV to
CR4, MOV DR, MWAIT, PAUSE, RDMSR, RDPMC, RDTSC,
RSM, WRMSR
- Other causes
- Exceptions, interrupts, INIT signals, start-up
IPI, task switches, system-management interrupts