Transcript and Presenter's Notes

Title: Windows


1
Part 2: Windows (Windows DDK MVP)
2
Agenda
  • Introduction
  • Microprocessor Level
  • Cache
  • Instruction Pipeline
  • Algorithms and Data Structures
  • Platform Level
  • Memory Access
  • Thread Synchronization
  • Windows Loader and DLLs
  • Multiprocessor
  • SMP
  • NUMA
  • Question and Answer

3
From Microprocessor To Memory
  (Diagram: the memory hierarchy from the CPU through the L1 and L2 caches to system memory and disk)
4
Microprocessor Level
5
Memory Access Cost
  • Memory access cost increases with distance from the processor

6
Pentium cache
  • Cache Options in the x86 family

7
Cache Practice
  • Cache optimization: locality
  (Diagram: access patterns A and B compared)
8
Cache Practice
  • Cache optimization: the prefetch instruction (Pentium III and later)
  • Loads memory into the cache before it is needed
  • Issues when bus bandwidth is available
  • Works best when data is loaded far enough ahead of time (more than about
    100 clocks); see the sketch after this list
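A minimal sketch of software prefetching using the SSE _mm_prefetch intrinsic; the array walk, the prefetch distance, and the function name are illustrative assumptions, not code from the slides.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    #define PREFETCH_DISTANCE 16   /* illustrative: tuned so data arrives 100+ clocks early */

    long sum_array(const int *data, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            /* Start loading a cache line we will need many iterations from now. */
            if (i + PREFETCH_DISTANCE < n)
                _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE], _MM_HINT_T0);
            sum += data[i];
        }
        return sum;
    }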

9
Cache Practice
  • Cache optimization: alignment
  • Aligning data is one of the easiest ways to gain a performance
    improvement
  • Unaligned data hurts the processor in two ways: a single access can take
    extra memory operations, and a value can straddle two cache lines
  • Pack frequently accessed read data into common cache lines; a sketch
    follows this list
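A minimal sketch of explicit alignment with the MSVC __declspec(align) extension; the structure, its fields, and the 64-byte line size are illustrative assumptions.

    /* Align the structure to an assumed 64-byte cache line so that the
       frequently read fields packed below always land in a single line. */
    __declspec(align(64)) struct HotReadData {
        unsigned    id;      /* read together on the hot path ...            */
        unsigned    flags;   /* ... keeping them adjacent keeps them in line */
        const void *table;
    };

C11 _Alignas(64) or GCC __attribute__((aligned(64))) achieve the same effect on other toolchains.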

10
Instruction Pipeline
  • Mispredicted branches
  • A mispredicted branch loses all the speculated instructions, so the
    pipeline must be restarted
  • Reduce the number of branches; see the sketch after this list
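As one way to reduce branches, here is a hedged sketch of replacing a data-dependent branch with branch-free arithmetic; the counting example is an illustrative assumption.

    /* Branchy version: one hard-to-predict branch per element. */
    int count_negatives_branchy(const int *a, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] < 0)       /* mispredictions here restart the pipeline */
                count++;
        }
        return count;
    }

    /* Branch-free version: the comparison result (0 or 1) is added directly,
       which compilers typically lower to a setcc/cmov with no branch. */
    int count_negatives_branchless(const int *a, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++)
            count += (a[i] < 0);
        return count;
    }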

11
Cache Practice
  • Algorithms and data structures, chosen with the cache in mind
  • Algorithm
  • A well-chosen algorithm is a lot better than trying to tune a bad one
  • Don't use a list where a tree is needed
  • Don't use a list when an array is sufficient
  • Data layout (sketched after this list)
  • Pack frequently accessed read data into common cache lines
  • Avoid placing hot but independent read and write data in the same cache
    line
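A minimal sketch of the data-layout rule: read-mostly fields are grouped together, while a hot write field gets its own cache-line-sized slot. The structures and the 64-byte line size are illustrative assumptions.

    /* Read-mostly fields grouped so they can safely share cache lines. */
    struct ConnInfo {
        unsigned id;
        unsigned flags;
        void    *handler;
    };

    /* The frequently written counter is padded to an assumed 64-byte line,
       so writes to it never invalidate the read-mostly data above. */
    struct ConnStats {
        volatile long packet_count;
        char          pad[64 - sizeof(long)];
    };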

12
Platform Level
13
System Calls and Kernel Functions
  • Consider reducing the number of kernel function calls
  • Get infrequently changing system information at startup and cache it;
    see the sketch after this list
  • Don't repeatedly query the registry
  • Consider cheaper alternatives to a general/generic expensive call
  • TransmitFile instead of ReadFile plus a socket send
  • Some system calls are very expensive, such as:
  • File creation and open
  • Socket creation
  • WaitForSingleObject, ReleaseSemaphore, SetEvent, SetThreadPriority
  • VirtualAlloc and MapViewOfFile
  • Reuse handles
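A minimal sketch of caching infrequently changing system information at startup; the helper name is an illustrative assumption, while GetSystemInfo is the real Win32 call.

    #include <windows.h>

    /* The page size never changes at run time, so query it once and reuse it. */
    static DWORD get_page_size(void)
    {
        static DWORD page_size = 0;    /* cached after the first call */
        if (page_size == 0) {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            page_size = si.dwPageSize; /* benign race: every thread writes the same value */
        }
        return page_size;
    }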

14
Memory
15
Memory
  • Minimize VirtualAlloc/VirtualFree
  • Reservations have 64 KB granularity, commits have page granularity, and
    zeroing pages has a cost
  • The application shouldn't page out
  • Use the stack area
  • Use lookaside lists for frequently allocated pool packets of the same
    type and size; see the sketch after this list
  • Use nonpaged pool
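A hedged kernel-mode sketch of a nonpaged lookaside list for fixed-size packets, using the WDM ExInitializeNPagedLookasideList family; the packet layout and pool tag are illustrative assumptions.

    #include <wdm.h>

    typedef struct _MY_PACKET {
        LIST_ENTRY Link;
        ULONG      Length;
        UCHAR      Data[128];
    } MY_PACKET;

    static NPAGED_LOOKASIDE_LIST g_PacketLookaside;

    VOID InitPacketAllocator(VOID)
    {
        /* Same-size, same-type allocations are served from a small cache
           instead of hitting the general pool allocator every time. */
        ExInitializeNPagedLookasideList(&g_PacketLookaside,
                                        NULL, NULL,        /* default allocate/free routines */
                                        0,                 /* flags */
                                        sizeof(MY_PACKET),
                                        'tkPM',            /* illustrative pool tag */
                                        0);                /* default depth */
    }

    MY_PACKET *AllocPacket(VOID)
    {
        return (MY_PACKET *)ExAllocateFromNPagedLookasideList(&g_PacketLookaside);
    }

    VOID FreePacket(MY_PACKET *Packet)
    {
        ExFreeToNPagedLookasideList(&g_PacketLookaside, Packet);
    }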

16
Memory Background
17
Memory Page Fault Handler
18
Memory
  • Page table misses can be among the worst performance killers
  • Prefetch opcodes do nothing to prevent them
  • The best tactic is not to move your code and data accesses around much
  • Split search keys from their data and pack the keys closer together so
    they share cache lines; a sketch follows this list
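A minimal sketch of splitting search keys from their payloads into parallel arrays; the sizes and names are illustrative assumptions.

    #define N_ITEMS 1024

    /* Keys are packed densely: a linear scan touches few cache lines and
       few page-table entries. The large payloads sit in a parallel array
       that is touched only after a key matches. */
    static unsigned g_keys[N_ITEMS];

    static struct Payload {
        char blob[256];
    } g_payloads[N_ITEMS];

    struct Payload *lookup(unsigned key)
    {
        for (int i = 0; i < N_ITEMS; i++)
            if (g_keys[i] == key)
                return &g_payloads[i];
        return 0;
    }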

19
Memory: Sharing Data
  • Data copying disrupts cache locality
  • Avoid copying if possible
  • Pass pointers
  • Just map the data instead; see the sketch after this list
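A minimal sketch of mapping a file instead of copying it through read buffers; CreateFileW, CreateFileMappingW, and MapViewOfFile are the real Win32 calls, while the helper shape and the trimmed error handling are illustrative assumptions.

    #include <windows.h>

    /* Map the file into the address space: readers consume the pages in
       place, with no intermediate copy buffer to disrupt cache locality. */
    const char *map_file_readonly(const wchar_t *path, HANDLE *file, HANDLE *mapping)
    {
        *file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ,
                            NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (*file == INVALID_HANDLE_VALUE)
            return NULL;

        *mapping = CreateFileMappingW(*file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (*mapping == NULL)
            return NULL;

        return (const char *)MapViewOfFile(*mapping, FILE_MAP_READ, 0, 0, 0);
    }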

20
PE File
21
Sharing DLL
22
DLL Guidelines
  • Rebase to minimize the page table pages used
  • Bind to decrease load-time costs
  • Arrange data (read vs. write) to minimize copy-on-write pages
  • Statically initialize globals if possible
  • Avoid setting a global variable to the value it is already initialized
    to
  • Minimize the number of DLLs
  • One large DLL is preferable to many small ones (demand paging, unused
    page fragments, seek latencies)
  • Statically link code that is not shared
  • Avoid thread callbacks by using DisableThreadLibraryCalls; see the
    sketch after this list
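A minimal sketch of the DisableThreadLibraryCalls pattern in DllMain; this is the standard Win32 usage, shown here as an illustration.

    #include <windows.h>

    BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
    {
        if (fdwReason == DLL_PROCESS_ATTACH) {
            /* Tell the loader not to call DllMain for every thread
               create/exit; this DLL does not need those callbacks. */
            DisableThreadLibraryCalls(hinstDLL);
        }
        return TRUE;
    }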

23
Thread
24
Thread Scheduling
  • Schedules threads, not processes (one ready list per priority; the top
    16 priorities are real-time)
  • Quantum is charged on interval-timer ticks
  • Fully preemptive (except for anti-starvation)
  • Takes the first thread from the highest-priority ready list
  • The ideal-CPU hint guides the search
  • Ideal CPUs are assigned round-robin to the threads in a process
  • The hard-affinity API overrules the ideal-CPU hint; see the sketch after
    this list
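A minimal sketch contrasting the soft ideal-processor hint with hard affinity; the CPU numbers are illustrative assumptions, while both APIs are real Win32 calls.

    #include <windows.h>

    void set_processor_preferences(HANDLE thread)
    {
        /* Soft hint: the scheduler prefers CPU 1 but may still run the
           thread elsewhere when CPU 1 is busy. */
        SetThreadIdealProcessor(thread, 1);

        /* Hard affinity: the thread may now run only on CPUs 0 and 1.
           This mask overrules the ideal-processor hint. */
        SetThreadAffinityMask(thread, 0x3);
    }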

25
Thread Status
26
Thread: Dispatcher Objects vs. Spinlocks
27
Question
28
Thread Synchronization
  • Partition hot shared data by CPU, by thread, or by natural data boundary
  • Partitioned data structures may not need locking
  • Keep hold times low to reduce contention and wait time
  • Consider designing for concurrent access
  • Reader/writer locks (e.g., push locks) are useful when reads predominate
  • Blocking is expensive: it costs spinning or context swaps
  • Context switches disrupt cache locality
  • Limited spinning is usually better than blocking
  • Interlocked operations are the least expensive; see the sketch after
    this list
  • InterlockedIncrement/Decrement
  • InterlockedExchangeAdd
  • InterlockedCompareExchange
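A minimal sketch of a lock-free update built on InterlockedCompareExchange; the capped-counter logic is an illustrative assumption.

    #include <windows.h>

    /* Increment a shared counter up to a cap without taking a lock:
       retry the compare-and-swap until our update wins or the cap is hit. */
    LONG increment_capped(volatile LONG *counter, LONG cap)
    {
        for (;;) {
            LONG old = *counter;
            if (old >= cap)
                return old;                  /* already at the cap */
            if (InterlockedCompareExchange(counter, old + 1, old) == old)
                return old + 1;              /* our CAS won the race */
            /* Another thread changed the value first; retry. */
        }
    }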

29
Multiprocessor (SMP Systems)
  • Don't make processors share cache lines for heavily used shared write
    data
  • Items that are frequently updated by different processors should be on
    separate cache lines; see the sketch after this list
  • Sharing cache lines for read-only or read-mostly data is good
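A minimal sketch of keeping per-processor write data on separate cache lines; the 64-byte line size and the CPU count are illustrative assumptions.

    #define CACHE_LINE 64   /* assumed line size */
    #define MAX_CPUS   32   /* illustrative bound */

    /* One counter per CPU, each padded to its own line: updates made by
       different processors never fight over the same cache line. */
    struct PerCpuCounter {
        volatile long value;
        char          pad[CACHE_LINE - sizeof(long)];
    };

    static struct PerCpuCounter g_counters[MAX_CPUS];

    long total(void)   /* the rare read-side pass sums across all the lines */
    {
        long sum = 0;
        for (int i = 0; i < MAX_CPUS; i++)
            sum += g_counters[i].value;
        return sum;
    }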

30
Non-Uniform Memory Access (NUMA)
  • Remote memory may have 1.5 to 3.0 times the latency of local memory; a
    hedged allocation sketch follows
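A hedged sketch of asking for node-local memory; GetNumaProcessorNode and VirtualAllocExNuma are real Win32 calls (the latter requires Windows Vista or later, so it may postdate this presentation), and the helper shape is an illustrative assumption.

    #include <windows.h>

    /* Prefer memory on the NUMA node that hosts the given processor, so a
       thread pinned there avoids the 1.5x-3.0x remote-memory latency. */
    void *alloc_node_local(SIZE_T bytes, UCHAR processor)
    {
        UCHAR node = 0;
        GetNumaProcessorNode(processor, &node);
        return VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                  node);
    }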

31
Multiprocessor (SMP Systems)
  • The OS reschedules a thread on its previous processor when possible
  • This improves the cache hit rate if the thread is activated again before
    its data is completely flushed out of the processor cache

32
Question and Answer (www.andyjung.com)