Title: Chapter 8: Part II
1. Chapter 8, Part II
- Storage, Network and Other Peripherals
2. Performance Analysis: Sync. vs. Async.
- Synchronous bus: clock cycle time = 50 ns; each bus transaction takes one clock cycle
- Asynchronous bus: 40 ns per handshake
- Data portion: 32 bits
- Question: Find the bandwidth of each bus when performing one-word reads from a 200 ns memory.
3. Sync. vs. Async. Buses (I)
- For the synchronous bus:
- Send the address to memory: 50 ns
- Read the memory: 200 ns
- Send the data to the device: 50 ns
- Total time: 300 ns; bandwidth = 4 bytes / 300 ns = 13.3 MB/s
4. Sync. vs. Async. Buses (II)
- For the asynchronous bus:
- Step 1: 40 ns
- Steps 2, 3, 4: max(3 x 40 ns, 200 ns) = 200 ns
- Steps 5, 6, 7: 3 x 40 ns = 120 ns
- Total time: 360 ns; maximum bandwidth = 4 bytes / 360 ns = 11.1 MB/s
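The two bandwidth figures above can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic on the slides, not a bus model; the function names are illustrative, and 1 MB is taken as 10^6 bytes.

```python
def sync_read_time_ns(clock_ns, memory_ns):
    """One-word read on the synchronous bus:
    one cycle for the address, the memory access, one cycle for the data."""
    return clock_ns + memory_ns + clock_ns


def async_read_time_ns(handshake_ns, memory_ns):
    """One-word read on the asynchronous bus using the 7-step handshake."""
    step_1 = handshake_ns
    # The memory access overlaps handshake steps 2-4, so the slower one wins.
    steps_2_to_4 = max(3 * handshake_ns, memory_ns)
    steps_5_to_7 = 3 * handshake_ns
    return step_1 + steps_2_to_4 + steps_5_to_7


def bandwidth_mb_per_s(num_bytes, time_ns):
    """Bytes per nanosecond times 1000 gives MB/s (1 MB = 10**6 bytes)."""
    return num_bytes / time_ns * 1000


sync_ns = sync_read_time_ns(clock_ns=50, memory_ns=200)
async_ns = async_read_time_ns(handshake_ns=40, memory_ns=200)
print(f"synchronous:  {sync_ns} ns, {bandwidth_mb_per_s(4, sync_ns):.1f} MB/s")
print(f"asynchronous: {async_ns} ns, {bandwidth_mb_per_s(4, async_ns):.1f} MB/s")
```

Changing `memory_ns` shows how the comparison shifts: with a memory faster than 120 ns, the asynchronous bus is limited by its three handshakes in steps 2 to 4 rather than by the memory.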
5. Increasing Bus Bandwidth
- Data bus width
- Separate versus multiplexed address and data lines
- Block transfers
6. Performance Analysis of Two Bus Schemes
- Given a system with:
- a memory and bus system supporting block access of 4 to 16 words
- a 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle to send an address to memory
- two clock cycles needed between each bus operation
- a memory access time of 200 ns for the first 4 words, with each additional set of 4 words requiring 20 ns
7. Question
- Find the sustained bandwidth and latency for a read of 256 words for transfers using 4-word blocks and 16-word blocks.
- Find the effective number of bus transactions per second for each case.
8. 4-Word Block Transfer
- 1 clock cycle to send address to memory
- 200 ns / (5 ns/cycle) = 40 cycles to read memory
- 2 cycles to send data from memory (four 32-bit words over the 64-bit bus)
- 2 idle cycles
- Total: 45 cycles
- 256 words require 256/4 = 64 transactions, so 45 x 64 = 2880 cycles
9. 4-Word Block Transfer
- Latency: 2880 cycles x 5 ns/cycle = 14400 ns
- Number of bus transactions: 64 x (1 s / 14400 ns) = 4.44M transactions/s
- Bandwidth: (256 x 4 bytes) x (1 / 14400 ns) = 71.11 MB/s
10. 16-Word Block Transfer
- 1 clock cycle to send address to memory
- 40 cycles to read first 4 words from memory
- 2 cycles to send data, during which the read of the next 4 words is started.
- 2 idle cycles between transfers, during which the read of the next 4 words is completed.
- Need to repeat the last two steps 3 more times to read a total of 16 words.
11. 16-Word Block Transfer
- Total cycles required: 1 + 40 + 4 x (2 + 2) = 57 cycles
- 256/16 = 16 transactions are required
- Total number of cycles required for 256 words: 16 x 57 = 912 cycles; latency = 912 x 5 ns = 4560 ns
- Number of bus transactions: 16 x (1 s / 4560 ns) = 3.51M transactions/s
- Bandwidth: (256 x 4 bytes) x (1 / 4560 ns) = 224.56 MB/s
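Both block sizes follow the same cycle accounting, so the 4-word and 16-word results can be checked with one function. The sketch below is only the slide arithmetic in Python; the names are illustrative, words are taken as 32 bits, and it assumes (as the slides do) that each additional 20 ns memory read is fully hidden behind the 2 transfer cycles and 2 idle cycles of the previous 4 words.

```python
CLOCK_NS = 5        # 200 MHz bus clock -> 5 ns per cycle
WORD_BYTES = 4      # 32-bit words
BUS_BYTES = 8       # 64-bit bus moves 8 bytes per cycle
TOTAL_WORDS = 256   # size of the read being analysed


def cycles_per_transaction(block_words):
    """Cycles for one bus transaction that reads block_words words."""
    address_cycles = 1
    first_read_cycles = 200 // CLOCK_NS              # 40 cycles for the first 4 words
    groups = block_words // 4                        # number of 4-word groups
    send_cycles = (4 * WORD_BYTES) // BUS_BYTES      # 2 cycles per group
    idle_cycles = 2
    # Later 20 ns reads (4 cycles) overlap the send + idle cycles of the
    # previous group, so they add no cycles of their own.
    return address_cycles + first_read_cycles + groups * (send_cycles + idle_cycles)


def analyse(block_words):
    """Return latency, transaction rate and bandwidth for the 256-word read."""
    transactions = TOTAL_WORDS // block_words
    total_cycles = transactions * cycles_per_transaction(block_words)
    latency_ns = total_cycles * CLOCK_NS
    seconds = latency_ns * 1e-9
    return {
        "transactions": transactions,
        "total_cycles": total_cycles,
        "latency_ns": latency_ns,
        "transactions_per_s": transactions / seconds,
        "bandwidth_mb_per_s": TOTAL_WORDS * WORD_BYTES / seconds / 1e6,
    }


for block in (4, 16):
    print(block, analyse(block))
```

The 16-word case needs more cycles per transaction (57 versus 45) but a quarter as many transactions, which is where the roughly threefold bandwidth gain comes from.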
12. Bus Arbitration
- Daisy chain arbitration (not very fair)
- Centralized arbitration (requires an arbiter), e.g., PCI
- Self selection, e.g., NuBus used in Macintosh
- Collision detection, e.g., Ethernet
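To see why daisy chain arbitration is described as not very fair, here is a toy Python sketch. It assumes a simplified model in which the grant enters at device 0 and travels down the chain until it reaches a device that is requesting the bus; real daisy chains involve request, grant and release signals that are not modelled here. The function names are illustrative.

```python
from typing import List, Optional


def daisy_chain_grant(requests: List[bool]) -> Optional[int]:
    """Return the position of the device that receives the grant, or None.

    The grant is passed from device 0 down the chain and is kept by the
    first device that is requesting the bus.
    """
    for position, wants_bus in enumerate(requests):
        if wants_bus:
            return position
    return None


def count_wins(request_rounds: List[List[bool]]) -> List[int]:
    """Count how often each device wins over a series of arbitration rounds."""
    wins = [0] * len(request_rounds[0])
    for requests in request_rounds:
        winner = daisy_chain_grant(requests)
        if winner is not None:
            wins[winner] += 1
    return wins


# Three devices that all request the bus in every one of five rounds.
print(count_wins([[True, True, True]] * 5))
```

In this model the device nearest the start of the chain wins every round in which it requests the bus, so devices further down can be starved.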
13. Bus Standards
- PCI (a general-purpose backplane bus)
- SCSI (Small Computer System Interface)
- IEEE 1394 (Firewire)
- USB 2.0
Characteristic        Firewire (1394)            USB 2.0
Bus width (signals)   4                          2
Clocking              asynchronous               asynchronous
Peak bandwidth        50 MB/s (Firewire 400),    0.2 MB/s (low speed),
                      100 MB/s (Firewire 800)    1.5 MB/s (full speed),
                                                 60 MB/s (high speed)
Hot pluggable         Yes                        Yes
Max. no. of devices   63                         127
Max. bus length       4.5 m                      5 m
14. Interfacing I/O Devices
- How is a user I/O request transformed into a device command and communicated to the device?
- How is data actually transferred to or from a memory location?
- What is the role of the operating system?
15. Role of the OS
- The OS plays a major role in handling I/O, in that:
- The I/O system is shared by multiple programs using the processor
- I/O systems often use interrupts (which cause a transfer to supervisor mode)
- Low-level control of I/O is complex
16. Communications between OS and I/O Devices
- The OS must be able to give commands to I/O devices.
- The I/O device must be able to notify the OS when an operation has completed or an error has occurred.
- Data must be transferred between memory and an I/O device.
17. Giving Commands to I/O
- To give a command, the processor must be able to address the device and to supply command words
- Memory-mapped I/O: portions of the address space are assigned to I/O devices
- Special I/O instructions: dedicated I/O instructions in the processor
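The idea behind memory-mapped I/O can be illustrated with a toy model in Python: ordinary loads and stores are routed either to memory or to device registers, depending only on the address. Everything here is hypothetical (the address `DEVICE_BASE`, the register layout, the class names); it is a sketch of the address-decoding idea, not of any real device or bus.

```python
DEVICE_BASE = 0xFFFF0000         # hypothetical start of the I/O address region
COMMAND_REG = DEVICE_BASE        # hypothetical command register
STATUS_REG = DEVICE_BASE + 4     # hypothetical status register


class ToyDevice:
    """A made-up device with one command register and one status register."""

    def __init__(self):
        self.command = 0
        self.status = 0          # 0 = idle, 1 = done

    def write_register(self, address, value):
        if address == COMMAND_REG:
            self.command = value
            self.status = 1      # toy behaviour: the command completes at once

    def read_register(self, address):
        if address == STATUS_REG:
            return self.status
        return self.command


class ToyBus:
    """Routes loads and stores to memory or to the device by address."""

    def __init__(self, device):
        self.memory = {}
        self.device = device

    def store(self, address, value):
        if address >= DEVICE_BASE:
            self.device.write_register(address, value)   # I/O region
        else:
            self.memory[address] = value                 # ordinary memory

    def load(self, address):
        if address >= DEVICE_BASE:
            return self.device.read_register(address)
        return self.memory.get(address, 0)


bus = ToyBus(ToyDevice())
bus.store(0x1000, 7)             # an ordinary memory write
bus.store(COMMAND_REG, 0x2A)     # the same store operation gives a device command
print(bus.load(0x1000), bus.load(STATUS_REG))
```

With special I/O instructions, by contrast, the processor would use separate instructions for device accesses instead of reusing its normal loads and stores.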
18. Communicating with the Processor
19. Polling
- Polling: the processor periodically checks the status of the I/O device.
- Overhead of polling in an I/O system:
- Example 1: mouse
- Example 2: floppy disk
- Example 3: hard disk
20. Mouse
- Assume the number of clock cycles for a polling operation, including transferring to the polling routine, accessing the device, and restarting the user program, is 400, with a 500 MHz clock.
- The mouse must be polled 30 times a second to ensure that no user movement is missed.
- Fraction of CPU time: 30 x 400 / (500 x 10^6) = 2.4 x 10^-5, or about 0.002%
21. Floppy Disk
- The floppy disk transfers data to the processor in 16-bit units and has a data rate of 50 KB/s.
- Polling rate: (50 KB/s) / (2 bytes/poll) = 25K polls/s
- Fraction of CPU time: 25K x 400 / (500 x 10^6) = 2%
22. Hard Disk
- Transfers data in 4-word blocks
- Transfer rate: 4 MB/s
- Polling rate: (4 MB/s) / (4 x 4 bytes/poll) = 250K polls/s
- Fraction of CPU time: 250K x 400 / (500 x 10^6) = 20%
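The mouse, floppy disk and hard disk figures all use the same formula: polls per second times cycles per poll, divided by the clock rate. A minimal Python sketch of that arithmetic follows; the names are illustrative, and KB and MB are taken as 10^3 and 10^6 bytes, which is what the slide numbers imply.

```python
CLOCK_HZ = 500e6          # 500 MHz processor clock
CYCLES_PER_POLL = 400     # cost of one polling operation


def polling_fraction(polls_per_second):
    """Fraction of CPU time spent polling at the given rate."""
    return polls_per_second * CYCLES_PER_POLL / CLOCK_HZ


def polls_needed(bytes_per_second, bytes_per_poll):
    """Polling rate needed so that no data is missed."""
    return bytes_per_second / bytes_per_poll


devices = {
    "mouse": 30,                               # polled 30 times per second
    "floppy disk": polls_needed(50e3, 2),      # 50 KB/s in 16-bit units
    "hard disk": polls_needed(4e6, 4 * 4),     # 4 MB/s in 4-word blocks
}

for name, rate in devices.items():
    print(f"{name}: {rate:.0f} polls/s, {polling_fraction(rate) * 100:.4f}% of CPU time")
```

The overhead scales linearly with the device data rate, which is why polling is negligible for the mouse but costs a fifth of the processor for the hard disk.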
23. Overhead of Polling
- Polling can be done only when the device is active, which reduces the overhead.
- However, the overhead is still significant, which motivates another design called interrupt-driven I/O.
24. Overhead of Interrupt-Driven I/O
- Assume the overhead for each transfer, including the interrupt, is 500 cycles.
- Cycles per second for the disk: 250K x 500 = 125 x 10^6 cycles
- Fraction of processor consumed: 125 x 10^6 / (500 x 10^6) = 25%
- Assuming the disk is transferring data 5% of the time, the fraction of CPU time on average is 25% x 5% = 1.25%
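The interrupt-driven figures can be checked the same way. The sketch below is just the slide arithmetic with illustrative names; the `active_fraction` parameter models the assumption that interrupts cost CPU time only while the disk is actually transferring.

```python
def interrupt_cpu_fraction(transfers_per_second, cycles_per_transfer,
                           clock_hz, active_fraction=1.0):
    """Average fraction of CPU time spent on interrupt-driven transfers.

    active_fraction is the share of time the device is transferring data;
    no interrupt overhead is paid while it is idle.
    """
    busy_fraction = transfers_per_second * cycles_per_transfer / clock_hz
    return busy_fraction * active_fraction


# Hard disk from the polling example: 250K transfers/s, 500 cycles each.
while_transferring = interrupt_cpu_fraction(250e3, 500, 500e6)
on_average = interrupt_cpu_fraction(250e3, 500, 500e6, active_fraction=0.05)
print(f"while transferring: {while_transferring:.2%}")
print(f"average at 5% activity: {on_average:.2%}")
```

Per transfer, the interrupt path is more expensive than a poll (500 versus 400 cycles); the saving comes entirely from not paying anything while the disk is idle.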
25. Direct Memory Access (DMA)
- If the disk is transferring data most of the time, the overhead for interrupt-driven I/O is still high.
- For a high-bandwidth device, let the device controller transfer data directly to or from memory without involving the processor; this is known as direct memory access.
- An interrupt is used to signal the completion of an I/O transfer or an error.
- Note: How does it affect cache design?
26. Overhead of I/O Using DMA
- Assume the initial setup of a DMA transfer takes 1000 cycles, handling of the interrupt at DMA completion takes 500 cycles, and the average transfer from disk is 8 KB
- Each DMA transfer takes 8 KB / (4 MB/s) = 2 x 10^-3 s
- If the disk is constantly transferring data, it requires (1000 + 500) / (2 x 10^-3 s) = 750 x 10^3 cycles/s
- Fraction of CPU time: 750 x 10^3 / (500 x 10^6) = 0.15%
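The DMA overhead follows from the per-transfer cost spread over the time each transfer takes. A short Python sketch of the calculation, with illustrative names and 8 KB taken as 8 x 10^3 bytes so that the transfer time matches the 2 x 10^-3 s on the slide:

```python
def dma_cpu_fraction(setup_cycles, interrupt_cycles, transfer_bytes,
                     disk_bytes_per_s, clock_hz):
    """Fraction of CPU time used by DMA when the disk transfers constantly."""
    transfer_time_s = transfer_bytes / disk_bytes_per_s         # time per DMA transfer
    cycles_per_second = (setup_cycles + interrupt_cycles) / transfer_time_s
    return cycles_per_second / clock_hz


fraction = dma_cpu_fraction(
    setup_cycles=1000,
    interrupt_cycles=500,
    transfer_bytes=8e3,       # 8 KB, assumed to mean 8 x 10^3 bytes
    disk_bytes_per_s=4e6,     # 4 MB/s
    clock_hz=500e6,           # 500 MHz
)
print(f"CPU time used by DMA: {fraction:.2%}")
```

Compared with the 25% for interrupt-driven I/O at the same data rate, the processor is involved once per 8 KB transfer instead of once per 16 bytes.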
27. I/O System Design
- Latency constraints: ensuring the latency to complete an I/O operation is bounded.
- Bandwidth constraints
- Performance analysis techniques: queuing theory, simulation analysis
28. I/O System Design: Example
- CPU: 3 BIPS (billion instructions per second), with an average of 100,000 instructions in the OS per I/O operation
- Backplane bus transfer rate: 1000 MB/s
- SCSI Ultra320 controllers with a transfer rate of 320 MB/s, each accommodating up to 7 disks
- Disk bandwidth: 75 MB/s; seek + rotational latency = 6 ms
- Workload: 64-KB reads; the user program needs 200,000 instructions per I/O
29. Example
- Find:
- the maximum sustainable I/O rate
- the number of disks and SCSI controllers required.
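The slides pose this question without showing the working, so the following Python sketch is one possible way to approach it, not the deck's own solution. It assumes that the sustainable I/O rate is set by whichever of the CPU and the backplane bus saturates first, that every read pays the full 6 ms seek + rotational latency, and that KB and MB mean 10^3 and 10^6 bytes. All names are illustrative.

```python
import math

CPU_INSTR_PER_S = 3e9                    # 3 BIPS
INSTR_PER_IO = 100_000 + 200_000         # OS + user instructions per I/O
BUS_BYTES_PER_S = 1000e6                 # backplane bus
IO_BYTES = 64e3                          # 64-KB reads (assumed 64 x 10^3 bytes)
DISK_BYTES_PER_S = 75e6
DISK_LATENCY_S = 6e-3                    # seek + rotational latency
CONTROLLER_BYTES_PER_S = 320e6
DISKS_PER_CONTROLLER = 7


def design():
    # Step 1: the slower of CPU and bus bounds the sustainable I/O rate.
    cpu_limit = CPU_INSTR_PER_S / INSTR_PER_IO
    bus_limit = BUS_BYTES_PER_S / IO_BYTES
    max_io_rate = min(cpu_limit, bus_limit)

    # Step 2: enough disks to deliver that many I/Os per second.
    time_per_io_s = DISK_LATENCY_S + IO_BYTES / DISK_BYTES_PER_S
    ios_per_disk = 1 / time_per_io_s
    disks = math.ceil(max_io_rate / ios_per_disk)

    # Step 3: check that 7 disks do not saturate one controller, then
    # size the controllers by the 7-disk limit.
    full_controller_load = DISKS_PER_CONTROLLER * IO_BYTES / time_per_io_s
    controller_ok = full_controller_load <= CONTROLLER_BYTES_PER_S
    controllers = math.ceil(disks / DISKS_PER_CONTROLLER)

    return {
        "cpu_limit": cpu_limit,
        "bus_limit": bus_limit,
        "max_io_rate": max_io_rate,
        "disks": disks,
        "controller_ok": controller_ok,
        "controllers": controllers,
    }


print(design())
```

Under these assumptions the CPU is the bottleneck, and the controller count is set by the 7-disk limit rather than by controller bandwidth. If the latency or transfer-size assumptions change, the disk count changes with them.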
30. Real Stuff: Buses and Networks of P4
31. Intel P4 I/O Chip Sets
32. A Digital Camera
33. SoC (System on a Chip)