Transcript and Presenter's Notes

Title: Windows


1
Part 2: Windows (Windows DDK MVP)
2
Agenda
  • Introduction
  • Microprocessor Level
  • Cache
  • Instruction Pipeline
  • Algorithms and Data Structures
  • Platform Level
  • Memory Access
  • Thread Synchronization
  • Windows Loader and DLLs
  • Multiprocessor
  • SMP
  • NUMA
  • Question and Answer

3
From Microprocessor To Memory
  (Diagram: the memory hierarchy from the CPU through the L1 and L2 caches to system memory and disk)
4
Microprocessor Level
5
Memory Access Cost
  • Memory access cost increases with distance from the processor

6
Pentium cache
  • Cache Options in the x86 family

7
Cache Practice
  • Cache optimization: locality
  (Diagram: access patterns A and B compared)
8
Cache Practice
  • Cache optimization: the prefetch instruction (Pentium III and later)
  • Loads memory into the cache before it is needed
  • Issues when bus bandwidth is available
  • Works best when data is loaded far enough ahead of time (more than about
    100 clocks); see the sketch after this list
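A minimal sketch of software prefetching using the SSE _mm_prefetch intrinsic; the array walk, the prefetch distance, and the function name are illustrative assumptions, not code from the slides.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    #define PREFETCH_DISTANCE 16   /* illustrative: tuned so data arrives 100+ clocks early */

    long sum_array(const int *data, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            /* Start loading a cache line we will need many iterations from now. */
            if (i + PREFETCH_DISTANCE < n)
                _mm_prefetch((const char *)&data[i + PREFETCH_DISTANCE], _MM_HINT_T0);
            sum += data[i];
        }
        return sum;
    }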

9
Cache Practice
  • Cache optimization: alignment
  • Aligning data is one of the easiest ways to gain a performance
    improvement
  • Unaligned data hurts the processor in two ways: a single access can take
    extra memory operations, and a value can straddle two cache lines
  • Pack frequently accessed read data into common cache lines; a sketch
    follows this list
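A minimal sketch of explicit alignment with the MSVC __declspec(align) extension; the structure, its fields, and the 64-byte line size are illustrative assumptions.

    /* Align the structure to an assumed 64-byte cache line so that the
       frequently read fields packed below always land in a single line. */
    __declspec(align(64)) struct HotReadData {
        unsigned    id;      /* read together on the hot path ...            */
        unsigned    flags;   /* ... keeping them adjacent keeps them in line */
        const void *table;
    };

C11 _Alignas(64) or GCC __attribute__((aligned(64))) achieve the same effect on other toolchains.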

10
Instruction Pipeline
  • Mispredicted branches
  • A mispredicted branch loses all the speculated instructions, so the
    pipeline must be restarted
  • Reduce the number of branches; see the sketch after this list
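As one way to reduce branches, here is a hedged sketch of replacing a data-dependent branch with branch-free arithmetic; the counting example is an illustrative assumption.

    /* Branchy version: one hard-to-predict branch per element. */
    int count_negatives_branchy(const int *a, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] < 0)       /* mispredictions here restart the pipeline */
                count++;
        }
        return count;
    }

    /* Branch-free version: the comparison result (0 or 1) is added directly,
       which compilers typically lower to a setcc/cmov with no branch. */
    int count_negatives_branchless(const int *a, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++)
            count += (a[i] < 0);
        return count;
    }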

11
Cache Practice
  • Algorithms and data structures, chosen with the cache in mind
  • Algorithm
  • A well-chosen algorithm is a lot better than trying to tune a bad one
  • Don't use a list where a tree is needed
  • Don't use a list when an array is sufficient
  • Data layout (sketched after this list)
  • Pack frequently accessed read data into common cache lines
  • Avoid placing hot but independent read and write data in the same cache
    line
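A minimal sketch of the data-layout rule: read-mostly fields are grouped together, while a hot write field gets its own cache-line-sized slot. The structures and the 64-byte line size are illustrative assumptions.

    /* Read-mostly fields grouped so they can safely share cache lines. */
    struct ConnInfo {
        unsigned id;
        unsigned flags;
        void    *handler;
    };

    /* The frequently written counter is padded to an assumed 64-byte line,
       so writes to it never invalidate the read-mostly data above. */
    struct ConnStats {
        volatile long packet_count;
        char          pad[64 - sizeof(long)];
    };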

12
Platform Level
13
System Calls and Kernel Functions
  • Consider reducing the number of kernel function calls
  • Get infrequently changing system information at startup and cache it;
    see the sketch after this list
  • Don't repeatedly query the registry
  • Consider cheaper alternatives to a general/generic expensive call
  • TransmitFile instead of ReadFile plus a socket send
  • Some system calls are very expensive, such as:
  • File creation and open
  • Socket creation
  • WaitForSingleObject, ReleaseSemaphore, SetEvent, SetThreadPriority
  • VirtualAlloc and MapViewOfFile
  • Reuse handles
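A minimal sketch of caching infrequently changing system information at startup; the helper name is an illustrative assumption, while GetSystemInfo is the real Win32 call.

    #include <windows.h>

    /* The page size never changes at run time, so query it once and reuse it. */
    static DWORD get_page_size(void)
    {
        static DWORD page_size = 0;    /* cached after the first call */
        if (page_size == 0) {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            page_size = si.dwPageSize; /* benign race: every thread writes the same value */
        }
        return page_size;
    }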

14
Memory
15
Memory
  • Minimize VirtualAlloc/VirtualFree
  • Reservations have 64 KB granularity, commits have page granularity, and
    zeroing pages has a cost
  • The application shouldn't page out
  • Use the stack area
  • Use lookaside lists for frequently allocated pool packets of the same
    type and size; see the sketch after this list
  • Use nonpaged pool
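A hedged kernel-mode sketch of a nonpaged lookaside list for fixed-size packets, using the WDM ExInitializeNPagedLookasideList family; the packet layout and pool tag are illustrative assumptions.

    #include <wdm.h>

    typedef struct _MY_PACKET {
        LIST_ENTRY Link;
        ULONG      Length;
        UCHAR      Data[128];
    } MY_PACKET;

    static NPAGED_LOOKASIDE_LIST g_PacketLookaside;

    VOID InitPacketAllocator(VOID)
    {
        /* Same-size, same-type allocations are served from a small cache
           instead of hitting the general pool allocator every time. */
        ExInitializeNPagedLookasideList(&g_PacketLookaside,
                                        NULL, NULL,        /* default allocate/free routines */
                                        0,                 /* flags */
                                        sizeof(MY_PACKET),
                                        'tkPM',            /* illustrative pool tag */
                                        0);                /* default depth */
    }

    MY_PACKET *AllocPacket(VOID)
    {
        return (MY_PACKET *)ExAllocateFromNPagedLookasideList(&g_PacketLookaside);
    }

    VOID FreePacket(MY_PACKET *Packet)
    {
        ExFreeToNPagedLookasideList(&g_PacketLookaside, Packet);
    }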

16
Memory Background
17
Memory Page Fault Handler
18
Memory
  • Page table misses can be among the worst performance killers
  • Prefetch opcodes do nothing to prevent them
  • The best tactic is not to move your code and data accesses around much
  • Split search keys from their data and pack the keys closer together so
    they share cache lines; a sketch follows this list
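A minimal sketch of splitting search keys from their payloads into parallel arrays; the sizes and names are illustrative assumptions.

    #define N_ITEMS 1024

    /* Keys are packed densely: a linear scan touches few cache lines and
       few page-table entries. The large payloads sit in a parallel array
       that is touched only after a key matches. */
    static unsigned g_keys[N_ITEMS];

    static struct Payload {
        char blob[256];
    } g_payloads[N_ITEMS];

    struct Payload *lookup(unsigned key)
    {
        for (int i = 0; i < N_ITEMS; i++)
            if (g_keys[i] == key)
                return &g_payloads[i];
        return 0;
    }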

19
Memory: Sharing Data
  • Data copying disrupts cache locality
  • Avoid copying if possible
  • Pass pointers
  • Just map the data instead; see the sketch after this list
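A minimal sketch of mapping a file instead of copying it through read buffers; CreateFileW, CreateFileMappingW, and MapViewOfFile are the real Win32 calls, while the helper shape and the trimmed error handling are illustrative assumptions.

    #include <windows.h>

    /* Map the file into the address space: readers consume the pages in
       place, with no intermediate copy buffer to disrupt cache locality. */
    const char *map_file_readonly(const wchar_t *path, HANDLE *file, HANDLE *mapping)
    {
        *file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ,
                            NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (*file == INVALID_HANDLE_VALUE)
            return NULL;

        *mapping = CreateFileMappingW(*file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (*mapping == NULL)
            return NULL;

        return (const char *)MapViewOfFile(*mapping, FILE_MAP_READ, 0, 0, 0);
    }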

20
PE File
21
Sharing DLL
22
DLL Guidelines
  • Rebase to minimize the page table pages used
  • Bind to decrease load-time costs
  • Arrange data (read vs. write) to minimize copy-on-write pages
  • Statically initialize globals if possible
  • Avoid setting a global variable to the value it is already initialized
    to
  • Minimize the number of DLLs
  • One large DLL is preferable to many small ones (demand paging, unused
    page fragments, seek latencies)
  • Statically link code that is not shared
  • Avoid thread callbacks by using DisableThreadLibraryCalls; see the
    sketch after this list
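A minimal sketch of the DisableThreadLibraryCalls pattern in DllMain; this is the standard Win32 usage, shown here as an illustration.

    #include <windows.h>

    BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
    {
        if (fdwReason == DLL_PROCESS_ATTACH) {
            /* Tell the loader not to call DllMain for every thread
               create/exit; this DLL does not need those callbacks. */
            DisableThreadLibraryCalls(hinstDLL);
        }
        return TRUE;
    }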

23
Thread
24
Thread Scheduling
  • Schedules threads, not processes (one ready list per priority; the top
    16 priorities are real-time)
  • Quantum is charged on interval-timer ticks
  • Fully preemptive (except for anti-starvation)
  • Takes the first thread from the highest-priority ready list
  • The ideal-CPU hint guides the search
  • Ideal CPUs are assigned round-robin to the threads in a process
  • The hard-affinity API overrules the ideal-CPU hint; see the sketch after
    this list
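A minimal sketch contrasting the soft ideal-processor hint with hard affinity; the CPU numbers are illustrative assumptions, while both APIs are real Win32 calls.

    #include <windows.h>

    void set_processor_preferences(HANDLE thread)
    {
        /* Soft hint: the scheduler prefers CPU 1 but may still run the
           thread elsewhere when CPU 1 is busy. */
        SetThreadIdealProcessor(thread, 1);

        /* Hard affinity: the thread may now run only on CPUs 0 and 1.
           This mask overrules the ideal-processor hint. */
        SetThreadAffinityMask(thread, 0x3);
    }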

25
Thread Status
26
Thread: Dispatcher Objects vs. Spinlocks
27
Question
28
Thread Synchronization
  • Partition hot shared data by CPU, by thread, or by natural data boundary
  • Partitioned data structures may not need locking
  • Keep hold times low to reduce contention and wait time
  • Consider designing for concurrent access
  • Reader/writer locks (e.g., push locks) are useful when reads predominate
  • Blocking is expensive: it costs spinning or context swaps
  • Context switches disrupt cache locality
  • Limited spinning is usually better than blocking
  • Interlocked operations are the least expensive; see the sketch after
    this list
  • InterlockedIncrement/Decrement
  • InterlockedExchangeAdd
  • InterlockedCompareExchange
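A minimal sketch of a lock-free update built on InterlockedCompareExchange; the capped-counter logic is an illustrative assumption.

    #include <windows.h>

    /* Increment a shared counter up to a cap without taking a lock:
       retry the compare-and-swap until our update wins or the cap is hit. */
    LONG increment_capped(volatile LONG *counter, LONG cap)
    {
        for (;;) {
            LONG old = *counter;
            if (old >= cap)
                return old;                  /* already at the cap */
            if (InterlockedCompareExchange(counter, old + 1, old) == old)
                return old + 1;              /* our CAS won the race */
            /* Another thread changed the value first; retry. */
        }
    }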

29
Multiprocessor (SMP Systems)
  • Don't make processors share cache lines for heavily used shared write
    data
  • Items that are frequently updated by different processors should be on
    separate cache lines; see the sketch after this list
  • Sharing cache lines for read-only or read-mostly data is good
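A minimal sketch of keeping per-processor write data on separate cache lines; the 64-byte line size and the CPU count are illustrative assumptions.

    #define CACHE_LINE 64   /* assumed line size */
    #define MAX_CPUS   32   /* illustrative bound */

    /* One counter per CPU, each padded to its own line: updates made by
       different processors never fight over the same cache line. */
    struct PerCpuCounter {
        volatile long value;
        char          pad[CACHE_LINE - sizeof(long)];
    };

    static struct PerCpuCounter g_counters[MAX_CPUS];

    long total(void)   /* the rare read-side pass sums across all the lines */
    {
        long sum = 0;
        for (int i = 0; i < MAX_CPUS; i++)
            sum += g_counters[i].value;
        return sum;
    }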

30
Non-Uniform Memory Access (NUMA)
  • Remote memory may have 1.5 to 3.0 times the latency of local memory; a
    hedged allocation sketch follows
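A hedged sketch of asking for node-local memory; GetNumaProcessorNode and VirtualAllocExNuma are real Win32 calls (the latter requires Windows Vista or later, so it may postdate this presentation), and the helper shape is an illustrative assumption.

    #include <windows.h>

    /* Prefer memory on the NUMA node that hosts the given processor, so a
       thread pinned there avoids the 1.5x-3.0x remote-memory latency. */
    void *alloc_node_local(SIZE_T bytes, UCHAR processor)
    {
        UCHAR node = 0;
        GetNumaProcessorNode(processor, &node);
        return VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                  node);
    }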

31
Multiprocessor (SMP Systems)
  • The OS reschedules a thread on its previous processor when possible
  • This improves the cache hit rate if the thread is activated again before
    its data is completely flushed out of the processor cache

32
Question and Answer (www.andyjung.com)