Title: CS 2200 Lecture 27 Review
1. CS 2200 Lecture 27: Review
- (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, and Michael Niemier)
2. Logistics: Grade Breakdown
- Course highlights
- Test 1 and Test 2: 2/5/04 and 3/25/04
- Final exam TBD, please consult OSCAR
- Grade breakdown
- Homeworks: 10%
- Projects: 30%
- Test 1: 20%
- Test 2: 20%
- Final Exam: 20%
- (FYI: the grade breakdown is essentially the same for both sections)
- (Tests, HWs, projects, etc. will also be the same)
- Course website
- http://www.cc.gatech.edu/classes/AY2004/cs2200_spring
3. Logistics: Grade Cutoffs
- Grade cutoffs are calculated as follows
- (sigma is the standard deviation)
- A: average + sigma
- B: average (i.e. above average is a B)
- C: average - sigma
- D: average - 2 sigma
4. Instruction Set Architectures
- [Figure: layering diagram. High-level languages (C, Fortran, Ada, Basic, etc.) go through a compiler to assembly language, which an assembler turns into an executable; Java compiles to byte code, which an interpreter runs. One ISA can be realized by many hardware implementations (HW Implementation 1 through N).]
5. Pros and cons for each ISA type
6. ISAs: What about memory addresses?
- Usually instruction sets are byte addressed
- Provide access for bytes (8 bits), half-words (16 bits), words (32 bits), double words (64 bits)
- Two different ordering types: big endian and little endian
- Little endian: puts the byte with address xx...x00 at the least significant position in the word
- Big endian: puts the byte with address xx...x00 at the most significant position in the word
- [Figure: a 32-bit word with bit positions 31, 23, 15, 7, 0 marked for each ordering]
7. ISAs: Endianness
- No, we're not making this up.
- At word address 100 (assume a 4-byte word):
- long a = 0x11223344
- Big-endian (MSB at word address) layout:
  address:   100 101 102 103
  contents:   11  22  33  44
- Little-endian (LSB at word address) layout:
  address:   103 102 101 100
  contents:   11  22  33  44
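A quick way to see this on a real machine (a minimal sketch, not from the original slides; the sample output assumes a little-endian host such as x86):

  #include <stdio.h>

  int main(void) {
      unsigned int a = 0x11223344;           /* the word from the slide */
      unsigned char *p = (unsigned char *)&a;

      /* Print the byte stored at each successive address. A
         little-endian machine prints 44 33 22 11; a big-endian
         machine prints 11 22 33 44. */
      for (int i = 0; i < 4; i++)
          printf("byte %d: %02x\n", i, p[i]);
      return 0;
  }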
8. ISAs: Procedures and Assembly Language
- Software conventions (Why?)
- Reserve some number of registers for parameters, return values, and the return address
- e.g. LC2200
- 5 for params, 1 for return values, one for the return address
- JALR <proc-addr in reg>, ra  (ra is the return address)
- JALR ra, zero  (Where does this go?)
- What if we have more params or return values?
- Common use: stack/memory
- Registers used in procedures
- Temporary registers
- Caller does not expect the value to be preserved upon return
- LC2200: a0 to a4
- Saved registers
- Caller does expect the value to be preserved on return
- LC2200: s0 to s3 (simplifies the amount of state to be saved)
9. ISAs: Example: LC2200 Registers
- Recall the LC2200 register table [figure not reproduced here]
10. ISAs: Caller/Callee Mechanics
- Who does what, when?
  foo() { int temp = 3; bar(42); ... }
  bar(int a) { ...; return(temp + a); }
- 1. caller at call time
- 2. callee at entry
- 3. callee at exit
- 4. caller after return
11. ISAs: Caller/Callee Conventions
- Convention: do most of the work at callee entry/exit
- Caller at call time
- put arguments in a0..a4
- save any caller-save temporaries
- jalr ..., ra
- Callee at entry (most of the work)
- allocate all stack space
- save ra, s0..s3 if necessary
- Callee at exit (most of the work)
- restore ra, s0..s3 if used
- deallocate all stack space
- put return value in v0
- Caller after return
- retrieve return value from v0
- restore any caller-save temporaries
12. ISAs: Instruction Formatting
- Human readable
- add s0, s1, a2
- Machine readable (fields shown as hex digits)
- 0 | 9 | 5 | 00000 | b  (base 16)
- 0000 1001 0101 0000 0000 0000 0000 1011
13. Metrics: So, how do we compare?
- Best to stick with execution time! (more later)
- If we say X is faster than Y, we mean the execution time is lower on X than on Y.
- Alternatively: "X is n times faster than Y" means
- n = Execution time_Y / Execution time_X
- Since performance = 1 / execution time, this is the same as
- n = Performance_X / Performance_Y
- e.g. if X effectively runs at 50 MHz and Y at 200 MHz: Performance_X / Performance_Y = 50/200 = 1/4, therefore X is 4 times slower than Y
14. Metrics: Amdahl's Law
- Quantifies performance gain
- Amdahl's Law defined
- The performance improvement to be gained from using some faster mode of execution is limited by the amount of time the enhancement is actually used.
- Amdahl's Law defines speedup:
- Speedup = Perf. for entire task using enhancement when possible / Perf. for entire task without using enhancement
- or
- Speedup = Execution time for entire task without enhancement / Execution time for entire task using enhancement when possible
15. Metrics: Amdahl's Law and Speedup
- Speedup tells us how much faster the machine will run with an enhancement
- 2 things to consider
- 1st
- Fraction of the computation time in the original machine that can use the enhancement
- i.e. if a program executes in 30 seconds and 15 seconds of exec. uses the enhancement, fraction = 1/2 (always <= 1)
- 2nd
- Improvement gained by the enhancement (i.e. how much faster the enhanced portion runs)
- i.e. if the enhanced task takes 3.5 seconds and the original task took 7, we say the speedup is 2 (always > 1)
16. Metrics: Amdahl's Law Equations

  Execution time_new = Execution time_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

  Speedup_overall = Execution time_old / Execution time_new
                  = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

- Use the first equation and solve for speedup
- Please, please, please, don't just try to memorize these equations and plug numbers into them. It's always important to think about the problem too!
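As a sanity check, the law is easy to encode and evaluate (a minimal sketch, not part of the original slides; names are ours, and the two calls preview the example on the next slide):

  #include <stdio.h>

  /* Overall speedup per Amdahl's Law: f is the fraction of time the
     enhancement applies, s is the speedup of the enhanced portion. */
  static double amdahl(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main(void) {
      printf("FP multiply 15x faster: %.2f\n", amdahl(0.1, 15.0)); /* 1.10 */
      printf("All FP 3x faster:       %.2f\n", amdahl(0.3, 3.0));  /* 1.25 */
      return 0;
  }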
17. Metrics: Amdahl's Law Example
- A certain machine has a
- Floating point multiply that runs too slow
- It adversely affects benchmark performance.
- One option
- Re-design the FP multiply hardware to make it run 15 times faster than it currently does.
- However, the manager thinks
- Re-designing all of the FP hardware to make each FP instruction run 3 times faster is the way to go.
- FP multiplies account for 10% of execution time.
- FP instructions as a whole account for 30% of execution time.
- Which improvement is better?
18. Metrics: Amdahl's Law Example (cont.)
- The speedup gained by improving the multiply instruction is
- 1 / ((1 - 0.1) + (0.1 / 15)) = 1.10
- The speedup gained by improving all of the floating point instructions is
- 1 / ((1 - 0.3) + (0.3 / 3)) = 1.25
- Believe it or not, the manager is right!
- Improving all of the FP instructions, despite the lesser per-instruction improvement, is the better way to go
19. More CPU metrics
- Instruction count also figures into the mix
- Can affect throughput, execution time, etc.
- Interested in
- instruction path length and instruction count (IC)
- Using this information and the total number of clock cycles for a program, we can determine the clock cycles per instruction (CPI)
20. Metrics: The Bigger Picture
- Recall: CPU time = Instruction count x CPI x Clock cycle time
- We can see CPU performance is dependent on
- Clock rate, CPI, and instruction count
- CPU time is directly proportional to all 3
- Therefore an x% improvement in any one variable leads to an x% improvement in CPU performance
- But, everything usually affects everything
- [Figure: hardware technology and organization determine clock cycle time; organization and the ISA determine CPI; the ISA and compiler technology determine instruction count]
21. Dataflow: the single-bus datapath
- Compute y = a + bx + cx^2
- Ex. A=2, B=4, C=6, x=2
- Part 1: C <- x, D <- x
- Part 2: D <- x^2, C <- 6
- Part 3: A <- Cx^2
- Part 4: C <- 4, D <- x
- Part 5: B <- Bx
- Part 6: B <- Bx + Cx^2 (or RegA + RegB)
- Part 7: A <- A
- Part 8: Y <- RegA + RegB
22. Dataflow: A Single Bus
- Any (and all) functional units can access the bus
- [Figure: several functional units attached to one shared bus]
- (Remember, we need to assert the right control signals, at the right time, in the right order)
23. The beginnings of a generic dataflow
- Abstract / Simplified View
- Two types of signals: data and control
- Clocking strategy
- All storage elements are clocked by the same clock edge.
- [Figure: the PC supplies the instruction address to instruction memory; the instruction's Ra/Rb fields index the register file; the register outputs feed the ALU, whose result supplies the data memory address and flows back to the register file via Rw]
24. Dataflow: Ex. MIPS Instruction Formats
- R-type format (e.g. ADD, SUB, OR, etc.)
- I-type format (e.g. ADDI, LW, SW, BEQ)
- J-type (e.g. JUMP)
- (Remember, opcodes/function codes are used to generate control signals)
25. Single-cycle MIPS dataflow
26. A pipelined dataflow
- Need to carry control signals too
- Pipeline registers separate the stages: IF/ID, ID/EX, EX/MEM, MEM/WB
- [Figure: the PC and a +4 adder feed instruction memory; IR fields (IR 6..10, IR 11..15) index the register file; a branch comparator decides "branch taken"; the ALU, sign extend (16 -> 32), data memory, and muxes route values between stages; MEM/WB.IR selects the register to write]
- Data must be stored from one stage to the next in pipeline registers/latches. They hold temporary values between clocks and the info needed for execution.
27. Dataflow: Execution Sequence Summary (MIPS)
- IR <- Memory[PC]
- PC <- PC + 4
- A <- Reg[IR(25:21)]
- B <- Reg[IR(20:16)]
- ALUOut <- PC + (SignEx(IR(15:0)) << 2)
- Instructions take 3 cycles (branches), 3 cycles (jumps), 4 cycles (R-type), or 5 cycles (loads)
28. Dataflow: FSM (MIPS Machine)
- Tells us what values are needed and during what step
- State 0 (start), instruction fetch: MemRead; ALUSrcA = 0; IorD = 0; IRWrite; ALUSrcB = 01; ALUOp = 00; PCWrite; PCSource = 00
- State 1, instruction decode / register fetch: ALUSrcA = 0; ALUSrcB = 11; ALUOp = 00
- State 2, memory address computation: ALUSrcA = 1; ALUSrcB = 10; ALUOp = 00
- State 3, memory access (load): MemRead; IorD = 1
- State 4, memory read completion: RegDst = 0; RegWrite; MemToReg = 1
- State 5, memory access (store): MemWrite; IorD = 1
- State 6, execution: ALUSrcA = 1; ALUSrcB = 00; ALUOp = 10
- State 7, R-type completion: RegDst = 1; RegWrite; MemToReg = 0
- State 8, branch completion: ALUSrcA = 1; ALUSrcB = 00; ALUOp = 01; PCWriteCond; PCSource = 01
- State 9, jump completion: PCWrite; PCSource = 10
29. Interrupts
- [Figure: processor connected to Device 1 and Device 2 via the address bus and data bus, plus Int and Inta lines]
- If the processor decides to handle the interrupt, it asserts the Inta (interrupt acknowledge) line
30. Example: Device Interrupt (Say, arrival of a network message)
- Interrupted program:
  add  r1,r2,r3
  subi r4,r1,4
  slli r4,r4,2
  Hiccup(!)            <- external interrupt arrives here
  lw   r2,0(r4)
  lw   r3,4(r4)
  add  r2,r2,r3
  sw   8(r4),r2
- Interrupt handler:
  Save registers       (callee save)
  lw   r1,20(r0)
  lw   r2,0(r1)
  addi r3,r0,5
  sw   0(r1),r3        (code to handle int.)
  Restore registers    (callee restore)
  Clear current Int    (reset bit)
  RETI                 (return from interrupt)
31. Program Execution w/Protection (and w/I/O)
- [Figure: PC (mem. addr.) plotted against time. Execution stays in user space during a loop, crosses into kernel space on a system call, and into I/O (kernel) space on an interrupt]
32. Pipelining Lessons
- Multiple tasks operating simultaneously
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Also, need time to fill and drain the pipeline.
- [Figure: laundry analogy, tasks overlapping in task order from 6 PM to 9 PM]
33. (Pipelining) More on throughput
- All pipe stages are connected, so everything must move from one to another at the same time
- How fast this can occur is a function of the time it takes for the slowest stage to finish
- Example: If laundry takes 30 min. to wash but 40 min. to dry, it's going to sit in the washer for 10 min., idle
- In a processor, this is the machine cycle time (usually 1 clock)
- If each pipe stage is perfectly balanced time-wise:
- Time per instruction = time per instruction unpipelined / number of pipe stages
- Therefore, speedup from pipelining = number of pipe stages
- But of course nothing's perfect!
34. Speedup Equation for Pipelining
- Speedup from pipelining = pipeline depth / (1 + pipeline stall cycles per instruction), ignoring overhead and assuming balanced stages
- For a simple RISC pipeline, CPI = 1. With microcode, unpipelined CPI = pipeline depth
- Single-cycle HW would have a slow clock
35. The hazards of pipelining
- Pipeline hazards prevent the next instruction from executing during its designated clock cycle
- There are 3 classes of hazards
- Structural hazards
- Arise from resource conflicts when the HW cannot support all possible combinations of instructions
- Data hazards
- Occur when a given instruction depends on data from an instruction ahead of it in the pipeline
- Control hazards
- Result from branch-type and other instructions that change the flow of the program (i.e. the PC)
36. (Pipelining) Stalls and performance
- Stalls impede the progress of a pipeline and cause deviation from 1 instruction executing each clock cycle
- Recall that pipelining can be viewed to
- Decrease the CPI or the clock cycle time for an instruction
- Let's see what effect stalls have on CPI
- CPI pipelined
- = Ideal CPI + pipeline stall cycles per instruction
- = 1 + pipeline stall cycles per instruction
- Ignoring overhead and assuming stages are balanced
37. Process States
- New, Ready, Running, Waiting, Terminated
- [Figure: state-transition diagram among these five states]
- A longer example using these states later on in lecture
38. Process Control Block
- Pointer
- Process State
- Process Number
- Program Counter
- Registers
- Scheduling Info
- Memory Limits
- I/O Status Info
- Accounting Info
- Another question for the class: Why do we need each one of these elements?
39. Process Scheduling
- [Figure: queueing diagram. Processes wait in the ready queue for the CPU; from the CPU they leave on an I/O request (into an I/O queue, then I/O), when the time slice expires, when they fork a child (child executes), or to wait for an interrupt, re-entering the ready queue when the interrupt occurs]
40. (Process) Scheduling Algorithms
- First-Come, First-Served
- Shortest-Job-First
- Priority
- Round-Robin
- Multilevel Queue
- Multilevel Feedback Queue
41. Average Memory Access Time
- AMAT = HitTime + (1 - h) x MissPenalty
- Hit time: basic time of every access.
- Hit rate (h): fraction of accesses that hit
- Miss penalty: extra time to fetch a block from a lower level, including time to replace the block in the CPU
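For instance (a small sketch of the formula, not from the slides; the numbers are made up):

  #include <stdio.h>

  /* AMAT = hit_time + miss_rate * miss_penalty, with miss_rate = 1 - h */
  static double amat(double hit_time, double h, double miss_penalty) {
      return hit_time + (1.0 - h) * miss_penalty;
  }

  int main(void) {
      /* e.g. 1-cycle hit, 95% hit rate, 40-cycle miss penalty */
      printf("AMAT = %.2f cycles\n", amat(1.0, 0.95, 40.0));  /* 3.00 */
      return 0;
  }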
42. The Full Memory Hierarchy: always reuse a good idea
- Levels run from the upper level (smallest, fastest, costliest) down to the lower level (largest, slowest, cheapest); each level stages data for the one above it, with its own transfer unit:
- CPU registers: 100s of bytes, <10s ns; staged by the program/compiler; transfer unit: instruction operands (1-8 bytes)
- Cache: K bytes, 10-100 ns, 1-0.1 cents/bit; staged by the cache controller; transfer unit: blocks (8-128 bytes)
- Main memory: M bytes, 200-500 ns, 10^-4 to 10^-5 cents/bit; staged by the OS; transfer unit: pages (4K-16K bytes)
- Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; staged by the user/operator; transfer unit: files (Mbytes)
- Tape: "infinite" capacity, sec-min access time, 10^-8 cents/bit
43. (Memory) Caches: where we put data
- [Figure: an 8-block cache and a memory with blocks 1, 2, 3, ... Where can memory block 12 go?]
- Fully associative: block 12 can go anywhere
- Direct mapped: block 12 can go only into block 4 (12 mod 8)
- Set associative (sets 0-3): block 12 can go anywhere in set 0 (12 mod 4)
44. Ex. the Alpha 21064 data and instruction cache (Memory)
- [Figure: the CPU address splits into a block address (tag <21> + index <8>) and a block offset <5>. (1) The index selects one of 256 blocks; (2) the stored tag <21>, qualified by the valid bit <1>, is compared against the address tag; (3) on a match the 256-bit data is returned through a 4:1 mux; (4) on a miss the block comes from lower-level memory]
45. Ex. Cache Math (Memory)
- First, the address coming into the cache is divided into two fields
- a 29-bit block address and a 5-bit block offset
- The block address is further divided into
- an address tag and a cache index
- The cache index selects the tag to be tested to see if the desired block is in the cache
- Size of the index depends on cache size, block size, and set associativity
- So, the index is 8 bits wide and the tag is 29 - 8 = 21 bits
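A sketch of the field extraction (ours, not the slides'; the constants match the 8-bit index and 5-bit offset above, giving a 34-bit address):

  #include <stdio.h>
  #include <stdint.h>

  #define OFFSET_BITS 5   /* 32-byte blocks           */
  #define INDEX_BITS  8   /* 256 blocks in the cache  */

  int main(void) {
      uint64_t addr = 0x123456789ULL;  /* a sample 34-bit address */
      uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
      uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
      uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* 21 bits */

      printf("tag=0x%llx index=0x%llx offset=0x%llx\n",
             (unsigned long long)tag, (unsigned long long)index,
             (unsigned long long)offset);
      return 0;
  }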
46. Memory access equations
- Using what we defined on the previous slide, we can say
- Memory stall clock cycles =
- Reads x Read miss rate x Read miss penalty
- + Writes x Write miss rate x Write miss penalty
- Often, reads and writes are combined/averaged
- Memory stall cycles =
- Memory accesses x Miss rate x Miss penalty (approximation)
- It's also possible to factor in instruction count to get a complete formula
47. A clear explanation of pages vs. segments! (Virtual Memory)
- Paging (much like a cache!)
- Virtual address = page number | offset, both fixed length
- The page number maps to a frame; since all pages are the same size, the physical address is just the frame concatenated with the offset
- The offset takes you to a specific word
- Segmentation (segments are no longer all the same size)
- Virtual address = segment | offset; now the 2 fields of the virtual address have variable length
- One specifies the segment, the other the offset within the segment
- The problem: the segment length can vary, so you can't just concatenate
48. Address Translation in a Paging System (Virtual Memory)
- [Figure only]
49. Translation in a Segmentation System (Virtual Memory)
- [Figure] Note: the offset could use all 32 bits, so you can't just concatenate; the segment base must be added
50. A Memory Hierarchy Flow Chart
- Virtual address -> TLB access
- TLB miss? Try to read the page table (TLB miss stall). Page fault? Replace the page from disk. Otherwise, set the entry in the TLB and retry
- TLB hit? For a write, do the cache/buffer memory write; for a read, try to read from the cache
- Cache hit? Deliver the data to the CPU. Cache miss? Cache miss stall
51. A disk, pictorially (I/O)
- When accessing data we read or write to a sector
- All sectors are the same size; outer tracks are just less dense
- To read or write, a moveable arm with a read/write head moves over each surface
- Cylinder: all tracks under the arms at a given point on all surfaces
- To read or write
- The disk controller moves the arm over the proper track: a seek
- The time to move is called the seek time
- When the sector is found, data is transferred
52. Disk Terminology (I/O)
- Cylinder: track 'x' on all platters/surfaces
53. Disk Device Terminology (I/O)
- Several platters, with information recorded magnetically on both surfaces (usually)
- Bits recorded in tracks, which in turn are divided into sectors (e.g., 512 bytes)
- Actuator moves the head (at the end of the arm, 1 per surface) over the track (seek), selects the surface, waits for the sector to rotate under the head, then reads or writes
- Cylinder: all tracks under the heads
54. Ex. average disk access time (I/O)
- What is the average time to read or write a 512-byte sector for a typical disk?
- The average seek time is given to be 9 ms
- The transfer rate is 4 MB per second
- The disk rotates at 7200 RPM
- The controller overhead is 1 ms
- The disk is currently idle before any requests are made (so there is no queuing delay)
- Average disk access time =
- average seek time + average rotational delay + transfer time + controller overhead (a worked sketch of the arithmetic follows below)
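Plugging in the given numbers (our sketch; not part of the original slides):

  #include <stdio.h>

  int main(void) {
      double seek = 9.0;                            /* ms, given             */
      double rotation = 0.5 * (60000.0 / 7200.0);   /* half a revolution, ms */
      double transfer = 512.0 / 4e6 * 1000.0;       /* 512 B at 4 MB/s, ms   */
      double controller = 1.0;                      /* ms, given             */

      /* 9 + 4.17 + 0.13 + 1 = about 14.3 ms */
      printf("average access time = %.2f ms\n",
             seek + rotation + transfer + controller);
      return 0;
  }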
55. Allocation Strategies (I/O)
- Fixed contiguous regions
- Contiguous regions with overflow areas
- Linked allocation
- File Allocation Table (FAT): MS-DOS
- Indexed allocation
- Multilevel indexed allocation
- Hybrid (BSD Unix)
56. Disk Scheduling Algorithms (I/O)
- First come, first served
- Shortest seek time first
- SCAN (elevator algorithm)
- C-SCAN
- LOOK
- C-LOOK
- LOOK is the same as SCAN, but it reverses direction if there are no more requests in the scan direction; this leads to better performance than SCAN
57. Speedup (Parallel Processing): metric for performance on latency-sensitive applications
- Speedup = Time(1) / Time(P) for P processors
- note: must use the best sequential algorithm for Time(1) -- the parallel algorithm may be different.
- [Figure: speedup vs. number of processors (1, 2, 4, 8, 16, 32, 64). Linear speedup is the ideal; typical curves roll off with some number of processors; occasionally you see superlinear speedup... why?]
58. Speedup Challenge (Parallel Processing)
- To get the full benefit of parallelism, you need to be able to parallelize the entire program!
- Amdahl's Law
- Time_after = (Time_affected / Improvement) + Time_unaffected
- Example: we want 100 times speedup with 100 processors
- This requires Time_unaffected = 0!!!
59. Shared-Memory Hardware (1)
- Hardware and programming model don't have to match, but this is the mental model for shared-memory programming
- Memory is centralized with uniform memory access time (UMA) and bus interconnect, I/O
- Examples: Dell Workstation 530, Sun Enterprise, SGI Challenge
- Typical latencies:
- 1 cycle to local cache
- 20 cycles to remote cache
- 100 cycles to memory
60. Shared-Memory Hardware (2)
- Variation: memory is not centralized. Called non-uniform memory access time (NUMA)
- Shared memory accesses are converted into a messaging protocol (usually by hardware)
- Examples: DASH/Alewife/FLASH (academic), SGI Origin, Compaq GS320, Sequent (IBM) NUMA-Q
61. Multiprocessor Cache Coherency
- Means that values in cache and memory are consistent, or that we know they are different and can act accordingly
- Considered to be a good thing.
- Becomes more difficult with multiple processors and multiple caches!
- Popular technique: snooping!
- Write-invalidate
- Write-update
62. Cache coherence protocols (Multiprocessors)
- Directory based
- Whether or not some physical memory location is shared is recorded in 1 central location
- Called the directory
- Snooping
- Every cache with entries from the centralized main memory also has that particular block's sharing status
- No centralized state is kept
- Caches are connected to the shared memory bus
- Whenever there is bus traffic, the caches check (or "snoop") to see if they have the block being transferred on the bus
63. What is a Thread?
- Basic unit of CPU utilization
- A lightweight process (LWP)
- Consists of
- Program Counter
- Register Set
- Stack Space
- Shares with peer threads
- Code
- Data
- OS Resources
- Open files
- Signals
64. Threads
- Recall from board: code, data, and files are shared; no process context switching
- Can be context switched more easily
- Registers and PC
- Not memory management
- Can run on different processors concurrently in an SMP
- Share the CPU in a uniprocessor
- May (will) require concurrency control programming, like mutex locks.
- This is why we talked about critical sections, etc. first
65. Classic CS Problem: Producer/Consumer (Threads)
- Producer:
  if (!full)
      add item to buffer
      empty = FALSE
      if (buffer_is_full)
          full = TRUE
- Consumer:
  if (!empty)
      remove item from buffer
      full = FALSE
      if (buffer_is_empty)
          empty = TRUE
66. Example: Producer Threads Program
  while (forever) {
      // produce item
      pthread_mutex_lock(&padlock);
      while (full)
          pthread_cond_wait(&non_full, &padlock);
      // add item to buffer
      buffercount++;
      if (buffercount == BUFFERSIZE)
          full = TRUE;
      empty = FALSE;
      pthread_mutex_unlock(&padlock);
      pthread_cond_signal(&non_empty);
  }
67. Example: Consumer Threads Program
  while (forever) {
      pthread_mutex_lock(&padlock);
      while (empty)
          pthread_cond_wait(&non_empty, &padlock);
      // remove item from buffer
      buffercount--;
      full = FALSE;
      if (buffercount == 0)
          empty = TRUE;
      pthread_mutex_unlock(&padlock);
      pthread_cond_signal(&non_full);
      // consume item
  }
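The two fragments above, assembled into a self-contained program (our sketch; the variable names follow the slides, the rest is glue we added):

  #include <pthread.h>
  #include <stdio.h>

  #define BUFFERSIZE 4
  #define ITEMS      10

  static int buffercount = 0, full = 0, empty = 1;
  static pthread_mutex_t padlock  = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t non_full  = PTHREAD_COND_INITIALIZER;
  static pthread_cond_t non_empty = PTHREAD_COND_INITIALIZER;

  static void *producer(void *arg) {
      for (int i = 0; i < ITEMS; i++) {
          pthread_mutex_lock(&padlock);
          while (full)
              pthread_cond_wait(&non_full, &padlock);
          buffercount++;                        /* add item to buffer */
          if (buffercount == BUFFERSIZE) full = 1;
          empty = 0;
          pthread_mutex_unlock(&padlock);
          pthread_cond_signal(&non_empty);
      }
      return NULL;
  }

  static void *consumer(void *arg) {
      for (int i = 0; i < ITEMS; i++) {
          pthread_mutex_lock(&padlock);
          while (empty)
              pthread_cond_wait(&non_empty, &padlock);
          buffercount--;                        /* remove item from buffer */
          full = 0;
          if (buffercount == 0) empty = 1;
          pthread_mutex_unlock(&padlock);
          pthread_cond_signal(&non_full);
          printf("consumed item %d\n", i);      /* consume item */
      }
      return NULL;
  }

  int main(void) {
      pthread_t p, c;
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }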
68. Things to know? (Threads)
- 1. The reason threads are around?
- 2. Benefits of increased concurrency?
- 3. Why do we need software-controlled "locks" (mutexes) on shared data?
- 4. How can we avoid potential deadlocks/race conditions?
- 5. What is meant by producer/consumer thread synchronization/communication using pthreads?
- 6. Why use a "while" loop around a pthread_cond_wait() call?
- 7. Why should we minimize lock scope (minimize the extent of code within a lock/unlock block)?
- 8. Do you have any control over thread scheduling?
69. Locks and condition variables (Threads)
- A semaphore really serves two purposes
- Mutual exclusion: protect shared data
- Always a binary semaphore
- Synchronization: temporally coordinate events
- One thread waits for something; the other thread signals when it's available
- Idea
- Provide this functionality in two separate constructs
- Locks: provide mutual exclusion
- Condition variables: provide synchronization
- Like semaphores, locks and condition variables are language-independent, and are available in many programming environments
70. Locks and condition variables (Threads)
- Locks
- Provide mutually exclusive access to shared data
- A lock can be locked or unlocked
- (Sometimes called busy or free)
- Can be implemented
- Trivially by binary semaphores
- (create a private lock semaphore, use P and V)
- By lower-level constructs, much like semaphores
are implemented
71. Locks and condition variables (Threads)
- Example conventions
- Before accessing shared data, call Acquire() on a specific lock
- Complain (via ASSERT) if a thread tries to acquire a lock it already has
- After accessing shared data, call Release() on the same lock
- Example:
  Thread A              Thread B
  milk->Acquire()       milk->Acquire()
  if (noMilk)           if (noMilk)
      buy milk              buy milk
  milk->Release()       milk->Release()
72. Locks and condition variables (Threads)
- Consider the following code:
  Queue::Add()              Queue::Remove()
      lock->Acquire()           lock->Acquire()
      add item                  if item on queue, remove item
      lock->Release()           lock->Release()
                                return item
- Queue::Remove will only return an item if there's already one in the queue
73. Locks and condition variables (Threads)
- If the queue is empty, it might be more desirable for Queue::Remove to wait until there is something to remove
- Can't just go to sleep
- If it sleeps while holding the lock, no other thread can access the shared queue, add an item to it, and wake up the sleeping thread
- Solution
- Condition variables will let a thread sleep inside a critical section
- By releasing the lock while the thread sleeps
74. Locks and condition variables (Threads)
- Condition variables coordinate events
- Example (generic) syntax
- Condition(name): create a new instance of class Condition (a condition variable) with the specified name
- After creating a new condition, the programmer must call Lock(name) to create a lock that will be associated with that condition variable
- Condition::Wait(conditionLock): release the lock and wait (sleep); when the thread wakes up, immediately try to re-acquire the lock; return when it has the lock
- Condition::Signal(conditionLock): if threads are waiting on the lock, wake up one of those threads and put it on the ready list
- Otherwise, do nothing
75. Locks and condition variables (Threads)
- Condition::Broadcast(conditionLock): if threads are waiting on the lock, wake up all of those threads and put them on the ready list; otherwise do nothing
- IMPORTANT
- A thread must hold the lock before calling Wait, Signal, or Broadcast
- Can be implemented
- Carefully, by higher-level constructs (create and queue threads, sleep and wake up threads as appropriate)
- Carefully, by binary semaphores (create and queue semaphores as appropriate, use P and V to synchronize)
- Carefully, by lower-level constructs, much like semaphores are implemented
76. Locks and condition variables (Threads)
- Associated with a data structure are both a lock and a condition variable
- Before the program performs an operation on the data structure, it acquires the lock
- If it needs to wait until another operation puts the data structure into an appropriate state, it uses the condition variable to wait
- (see next slide for example)
77. Locks and condition variables (Threads)
- Unbounded-buffer producer/consumer:
  Lock lk; int avail = 0;
  Condition c;

  /* producer */            /* consumer */
  while (1) {               while (1) {
      lk->Acquire();            lk->Acquire();
      produce next item         if (avail == 0)
      avail++;                      c->Wait(lk);
      c->Signal(lk);            consume next item
      lk->Release();            avail--;
  }                             lk->Release();
                            }
78. Locks and condition variables (Threads)
- Semaphores and condition variables are pretty similar; perhaps we can build condition variables out of semaphores
- Does this work?
  Condition::Wait()  { sema->P(); }
  Condition::Signal() { sema->V(); }
- NO! We're going to use these condition operations inside a lock. What happens if we use semaphores inside a lock?
79. Locks and condition variables (Threads)
- How about this?
  Condition::Wait() {
      lock->Release();
      sema->P();
      lock->Acquire();
  }
  Condition::Signal() { sema->V(); }
- How do semaphores and condition variables differ with respect to keeping track of history?
80. Locks and condition variables (Threads)
- Semaphores have a value; CVs do not!
- On a semaphore signal (a V), the value of the semaphore is always incremented, even if no one is waiting
- Later on, if a thread does a semaphore wait (a P), the value of the semaphore is decremented and the thread continues
- On a condition variable signal, if no one is waiting, the signal has no effect
- Later on, if a thread does a condition variable wait, it waits (it always waits!)
- It doesn't matter how many signals have been made beforehand
81. (Networks) Performance parameters
- Bandwidth
- Maximum rate at which the interconnection network can propagate data once a message is in the network
- Usually headers and overhead bits are included in the calculation
- Units are usually megabits/second, not megabytes
- Sometimes see "throughput"
- Network bandwidth delivered to an application
- Time of flight
- Time for the 1st bit of the message to arrive at the receiver
- Includes delays of repeaters/switches; equals length / (mu x speed of light), where the factor mu is a property of the transmission material
- Transmission time
- Time required for the message to pass through the network
- = size of message / bandwidth
82. (Networks) More performance parameters
- Transport latency
- = time of flight + transmission time
- Time the message spends in the interconnection network
- But not the overhead of pulling it out of or pushing it into the network
- Sender overhead
- Time for the processor to inject a message into the interconnection network, including both HW and SW components
- Receiver overhead
- Time for the processor to pull a message out of the interconnection network, including both HW and SW components
- So, the total latency of a message is:
- Total latency = sender overhead + time of flight + message size / bandwidth + receiver overhead
83. Metrics, graphically [figure only]
84. (Networks) An example
- Consider a network with the following parameters
- Network has a bandwidth of 10 Mbit/sec
- We're assuming no contention for this bandwidth
- Sending overhead = 230 microseconds, receiving overhead = 270 microseconds
- We want to send a message of 1000 bytes
- This includes the header
- It will be sent as 1 message (no need to split it up)
- What's the total latency to send the message to a machine
- 100 m apart (assume no repeater delay for this)
- 1000 km apart (not realistic to assume no repeater delay)
85. (Networks) An example, continued
- We'll use the facts that
- The speed of light is 299,792.5 km/s
- Our mu value is 0.5
- This means we have a good fiber optic or coaxial cable (more later)
- Let's use: total latency = sender overhead + time of flight + message size / bandwidth + receiver overhead
- If the machines are 100 m apart ...
- If the machines are 1000 km apart ... (a worked sketch follows below)
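Working the numbers (our sketch, not from the slides; it assumes the overheads above are in microseconds):

  #include <stdio.h>

  int main(void) {
      double c = 299792.5e3;             /* speed of light, m/s        */
      double mu = 0.5;                   /* propagation factor (fiber) */
      double bw = 10e6 / 8.0;            /* 10 Mbit/s in bytes/s       */
      double overhead = 230e-6 + 270e-6; /* send + receive, seconds    */
      double msg = 1000.0;               /* bytes                      */
      double dist[2] = { 100.0, 1000e3 };/* 100 m and 1000 km          */

      for (int i = 0; i < 2; i++) {
          double tof = dist[i] / (mu * c);   /* time of flight */
          double total = overhead + tof + msg / bw;
          /* prints ~1.301 ms for 100 m, ~7.971 ms for 1000 km */
          printf("%.0f m: total latency = %.3f ms\n", dist[i], total * 1e3);
      }
      return 0;
  }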
86. (Networks) Some more odds and ends
- Note from the example (with regard to the longer distance)
- Time of flight dominates the total latency
- Repeater delays would factor significantly into the equation
- Message transmission failure rates rise significantly
- It's possible to send other messages without responses to previous ones
- If you have control of the network
- Can help increase network use by overlapping overheads and transport latencies
- Can simplify the total latency equation to
- Total latency = overhead + (message size / bandwidth)
- Leads to
- Effective bandwidth = message size / total latency
87. (Networks) Switched vs. shared
- Shared media (Ethernet): all nodes hang off one shared medium
- Switched media (ATM): each node connects to a switch, which connects the nodes pairwise
- (a.k.a. data switching interchanges, multistage interconnection networks, interface message processors)
88. (Networks) Connection-Based vs. Connectionless
- Telephone: an operator sets up the connection between the caller and the receiver
- Once the connection is established, the conversation can continue for hours
- Share transmission lines over long distances by using switches to multiplex several conversations on the same lines
- Time division multiplexing: divide the bandwidth of the transmission line into a fixed number of slots, with each slot assigned to a conversation
- Problem: lines are busy based on the number of conversations, not the amount of information sent
- Advantage: reserved bandwidth
- (see board for example)
89. (Networks) Connection-Based vs. Connectionless
- Connectionless: every package of information must have an address => packets
- Each package is routed to its destination by looking at its address
- Analogy: the postal system (sending a letter)
- Also called statistical multiplexing
- Note: split-phase buses are sending packets
90. (Networks) Store and Forward vs. Cut-Through
- Store-and-forward policy: each switch waits for the full packet to arrive in the switch before sending it to the next switch (good for WANs)
- Cut-through routing or wormhole routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
- In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (needs to buffer only the piece of the packet that is sent between switches).
- Cut-through routing lets the tail continue when the head is blocked, accordioning the whole message into a single switch. (Requires a buffer large enough to hold the largest packet.)
91. (Networks) Broadband vs. Baseband
- A baseband network has a single channel that is used for communication between stations. Ethernet specifications which use BASE in the name refer to baseband networks.
- BASE refers to baseband signaling: only Ethernet signals are carried on the medium
- A broadband network is much like cable television, where different services communicate across different frequencies on the same cable.
- Broadband communications would allow an Ethernet network to share the same physical cable as voice or video services. 10BROAD36 is an example of broadband networking.
92. (Networks) Ethernet
- The various Ethernet specifications include a maximum distance
- What do we do if we want to go further?
- Repeater
- Hardware device used to extend a LAN
- Amplifies all signals on one segment of a LAN and transmits them to another
- Passes on whatever it receives (GIGO)
- Knows nothing of packets or addresses
- Any limit?
93. (Networks) Bridges
- We want to improve performance over that provided by a simple repeater
- Add functionality (i.e. more hardware)
- A bridge can detect if a frame is valid and then (and only then) pass it to the next segment
- A bridge does not forward interference or other problems
- Computers connected over a bridged LAN don't know that they are communicating over a bridge
94. (Networks) Network Interface Card
- NIC
- Sits on the host station
- Allows a host to connect to a hub or a bridge
- Hub: merely extends multiple segments into a single LAN; does not help with performance, since only 1 message can transmit at a time
- If connected to a hub, the NIC has to use half-duplex communication (i.e. it can only send or receive at a time)
- If connected to a bridge, the NIC (if it is smart) can use either half- or full-duplex mode
- Bridges learn the Media Access Control (MAC) address and the speed of the NIC they are talking to.
95. (Networks) Routers
- Routers
- Devices that connect LANs to WANs or WANs to WANs
- Resolve incompatible addresses (generally slower than bridges)
- Divide interconnection networks into smaller subnets, which simplifies manageability and security
- Work much like bridges
- Pay attention to the upper network layer protocols (OSI layer 3) rather than physical layer (OSI layer 1) protocols.
- (This will make sense later)
- Will decide whether to forward a packet by looking at the protocol-level addresses (for instance, TCP/IP addresses) rather than the MAC address.
- (This will make sense later)
96. (Protocols) Recall
- A protocol is the set of rules used to describe all of the hardware and (mostly) software operations used to send messages from Processor A to Processor B
- Common practice is to attach headers/trailers to the actual payload, forming a packet or frame.
97. (Protocols) Layering Advantages
- Layering allows functionally partitioning the responsibilities (similar to having procedures for modularity in writing programs)
- Allows easily integrating (plug and play) new modules at a particular layer without any changes to the other layers
- Rigidity is only at the level of the interfaces between the layers, not in the implementation of these interfaces
- By specifying the interfaces judiciously, inefficiencies can be avoided
98. (Protocols) ISO Model Examples
- 7 Application: user program
- 6 Presentation
- 5 Session: sockets open/close/read/write interface
- 4 Transport (kernel software): TCP, reliable infinite-length stream
- 3 Network (kernel software): IP, unreliable datagrams anywhere in the world
- 2 Data Link (hardware): Ethernet, unreliable datagrams on the local segment
- 1 Physical (hardware): 10baseT Ethernet spec, twisted pair w/RJ45s
99. (Protocols) Layering Summary
- Key to protocol families is that communication occurs logically at the same level of the protocol (called peer-to-peer),
- but is implemented via services at the next lower level
- Encapsulation: carry higher-level information within a lower-level envelope
- Fragmentation: break a packet into multiple smaller packets and reassemble
- Danger is that each level increases latency if implemented as a hierarchy (e.g., multiple checksums)
100. Techniques Protocols Use
- Sequencing for Out-of-Order Delivery
- Sequencing to Eliminate Duplicate Packets
- Retransmitting Lost Packets
- Avoiding Replay Caused by Excessive Delay
- Flow Control to Prevent Data Overrun
- Mechanism to Avoid Network Congestion
- Name Resolution (external to protocol really)
101. Internetworking (Protocols)
- Different networking solutions exist
- Why? No single networking technology is best for all needs
- Universal service
- A system where arbitrary pairs of computers can communicate
- Increases productivity
- Networks, by themselves, are incompatible with universal service
- Solution: internetworking, or an "internet". Literally: communicating between networks of the same and/or different types
102. Encapsulate universal packets in (any) local network frame format
- A frame (frame header + frame data) is used to communicate within 1 network
- To send a message from 1 network to another (or within the same one), the universal packet rides inside the frame data; we want a uniform standard.
103. Physical Network Connection (Protocols)
- [Figure: individual networks joined by a router]
- A router facilitates communication between networks
- Each cloud represents arbitrary network technology: LAN, WAN, Ethernet, token ring, ATM, etc.
104. Layered Model (Protocols)
- TCP/IP Model
- 5 Application
- 4 Transport
- 3 Internet
- 2 Network Interface
- 1 Physical
105. (Protocols) Layer upon layer upon layer...
- Layer 1: Physical
- Basic network hardware (same as ISO model Layer 1)
- Layer 2: Network Interface
- How to organize data into frames and how to transmit over the network (similar to ISO model Layer 2)
- Layer 3: Internet
- Specifies the format of packets sent across the internet as well as the forwarding mechanisms used by routers
- Layer 4: Transport
- Like ISO Layer 4, specifies how to ensure reliable transfer
- Layer 5: Application
- Corresponds to ISO Layers 6 and 7. Each Layer 5 protocol specifies how one application uses an internet
106. IP Addressing
- Each host in the internet must have a unique address
- Users, application programs, and software operating in the higher layers of the protocol stack use these addresses
- In the IP protocol, each host is assigned a unique 32-bit address. Any packet destined for a host on the internet will contain the destination IP address.
107. IP Address Hierarchy
- Addresses are broken into a prefix and a suffix for routing efficiency
- The prefix is uniquely assigned to an individual network.
- The suffix is uniquely assigned to a host within a given network
- [Figure: two networks; hosts on Network 1 and Network 2 share their network's prefix but carry distinct suffixes]
108. Five Classes of IP Address
- [Figure: the primary address classes and their prefix layouts]
109. Computing the Class (IP)
- [Figure: the class is determined from the leading bits of the first octet]
110. Classes and Dotted Decimal (IP)
- Range of values of the first octet for each class:
- Class A: 0 through 127
- Class B: 128 through 191
- Class C: 192 through 223
- Class D: 224 through 239
- Class E: 240 through 255
- Does this mean there are 64 Class B networks?
- Does this mean there are 32 Class C networks?
- (on the board; a small classifier sketch follows below)
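A small classifier over the first octet (our sketch, mirroring the table above):

  #include <stdio.h>

  /* Return the class letter implied by the first octet of a
     dotted-decimal IPv4 address, per the ranges above. */
  static char ip_class(unsigned first_octet) {
      if (first_octet <= 127) return 'A';
      if (first_octet <= 191) return 'B';
      if (first_octet <= 223) return 'C';
      if (first_octet <= 239) return 'D';
      return 'E';
  }

  int main(void) {
      /* e.g. 131.108.99.5 from the later example is class B */
      printf("131.x.x.x is class %c\n", ip_class(131));
      printf("223.x.x.x is class %c\n", ip_class(223));
      return 0;
  }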
111. Division of the Address Space (IP)

  Address Class | Bits in Prefix | Max Networks | Bits in Suffix | Max Hosts per Network
  A             | 7              | 128          | 24             | 16777216
  B             | 14             | 16384        | 16             | 65536
  C             | 21             | 2097152      | 8              | 256

- (on the board)
112. Special IP Addresses
- Network Address
- Directed Broadcast Address
- Limited Broadcast Address
- This Computer Address
- Loopback Address
- Berkeley Broadcast Address Form
113. Routers and IP Addressing
- Each host has an address
- Each router has two (or more) addresses!
- Why?
- A router has connections to multiple physical networks
- Each IP address contains a prefix that specifies a physical network
- An IP address does not really identify a specific computer, but rather a connection between a computer and a network.
- A computer with multiple network connections (e.g. a router) must be assigned an IP address for each connection
114. Example (IP)
- [Figure: a router joins three networks: Ethernet 131.108.0.0, Token Ring 223.240.129.0, and WAN 78.0.0.0. The router's interfaces have addresses 131.108.99.5, 223.240.129.17, and 78.0.0.17; a host on the token ring is 223.240.129.2. Note: each interface address carries its network's prefix!]
115. How to Resolve Addresses (IP)
- Table lookup
- Store bindings/mappings in a table which software can search
- Closed-form computation
- Protocol addresses are chosen to allow computation of the hardware address from the protocol address using basic boolean and arithmetic operations
- Message exchange
- Computers exchange messages across a network to resolve addresses. One computer sends a message requesting a translation and another computer replies
- (more detail about items 1-3 on an earlier slide)
116. Use of Default Routes (IP)
- Simplified even more
- [Figure: four nodes in a line, each with a (Dest, Next Hop) routing table. Directly reachable destinations show "-" as the next hop; other destinations name the link to forward over, e.g. (2,3) for the link from node 2 to node 3. A default route collapses most of these entries into one.]
117. Error Reporting (ICMP)
- TCP/IP includes a protocol used by IP to send messages when problems are detected: the Internet Control Message Protocol
- IP uses ICMP to signal problems
- ICMP uses IP to send its messages
- When IP detects an error (e.g. a corrupt packet) it sends an ICMP packet
118. ICMP Message Transport [figure only]
119. Services Provided by TCP
- Connection Orientation
- Point-To-Point Communication
- Complete Reliability
- Full Duplex Communication
- Stream Interface
- Reliable Connection Startup
- Graceful Connection Shutdown
120. End to End Services (TCP)
- TCP provides a connection from one application on a computer to an application on a remote computer
- The connection is virtual: provided by software passing messages
- TCP messages are encapsulated in IP datagrams
- Upon arrival, IP passes the TCP message on to the TCP layer.
- TCP exists at both ends of the connection but not at intermediate points (routers).
121. Network Nodes
- [Figure: on each node, the application layer sits on an RPC layer, which sits on the network layer; Remote Procedure Call spans the two nodes]
122. Remote Procedure Call
- Client can run other processes while waiting for the server to service the request
- Natural interface... just like a procedure call
- Implementation issues?
123. Remote Procedure Call: client / messages / server
- 1. User calls the kernel to send an RPC message to procedure X
- 2. Kernel sends a message to the nameserver to find the port number ("To: server; From: client; Port: nameserver; Re: address for RPC X")
- 3. Nameserver receives the message, looks up the answer
- 4. Nameserver replies to the client with port P ("To: client; From: server; Port: kernel; Re: RPC X is at port P")
- 5. Kernel places port P in the user's RPC message and sends the RPC ("From: client; To: server; Port: P; <contents>")
- 6. Daemon listening on port P receives the message
- 7. Daemon processes the request and sends the output ("From: RPC port P; To: client; Port: kernel; <output>")
- 8. Kernel receives the reply, passes it to the user
124. RPC
- User
- No different from any program making use of procedure calls
- User stub
- Marshalls the arguments
- Identifies the target procedure
- Hands over to the RPC runtime to deliver to the callee
- Server stub
- Unmarshalls the arguments
- Makes a normal procedure call in the server using the arguments
- User and server code are part of the distributed app
- Stubs are generated automatically
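To make "marshalling" concrete, here is a toy sketch of what a generated client stub might do (entirely ours; rpc_send and the procedure id are hypothetical, and real RPC runtimes add headers, network byte order, and error handling):

  #include <stdint.h>
  #include <stddef.h>

  /* Toy wire format: procedure id followed by two int32 arguments. */
  struct rpc_msg {
      uint32_t proc_id;
      int32_t  args[2];
  };

  /* Assumed to be provided by the RPC runtime (hypothetical). */
  extern void rpc_send(const void *buf, size_t len);

  /* Client stub for a remote add(a, b): marshall the arguments,
     identify the target procedure, hand off to the runtime. */
  void add_stub(int32_t a, int32_t b) {
      struct rpc_msg m;
      m.proc_id = 7;           /* hypothetical id for "add" */
      m.args[0] = a;
      m.args[1] = b;
      rpc_send(&m, sizeof m);  /* runtime delivers to the callee */
  }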
125. RPC
- Five pieces of program are involved: user, user stub, RPC runtime, server stub, server
- [Figure: on the caller (client) node, the user makes the call and the user stub packs the arguments; the runtime transmits, waits, and receives; on the callee (server) node, the server stub unpacks the arguments, the server executes and returns, the results are packed and transmitted back, unpacked by the user stub, and returned to the user]