Title: Optical Microprocessor Chips
1 Optical Microprocessor Chips
2 Understanding an Optical Processor
- 1. Optical surfaces can be connected with strong bonds using specific types of optical solutions; the rigidity is 100 times higher than that of super glue.
- 2. Once the surfaces are glued together, the surface is treated with water via a flow pump in order to verify the connections.
- 3. Then the substrate is placed on the opaline hetero photonic crystal, and the layers are transferred into the layer transfer bin.
3 Why Optical Connections?
- Optical transmission in and between microchips has been considered superior to traditional electrical transmission technology.
- The reason is that electrical signaling can be slower and has a power cost that rises with the distance traveled, whereas optical transmission can be faster and has a roughly fixed power cost to connect to any point within the system.
4 Optical Microprocessors
- Studies are being done on how optical signals inside the processor can be used to send information around at much higher speed.
- Sun Microsystems, which won an $8.1 million, five-year contract from the military, believes it has the know-how necessary to make this type of chip, assembled from highly sophisticated optical components.
5 Optical Microprocessors (continued)
- The idea is to use electrical chip-to-chip input/output technology to construct arrays of low-cost chips into a single virtual macrochip.
- This assemblage would perform as one very large chip and would eliminate the conventional soldered-chip interconnections.
- Long connections across the macrochip would leverage the low latency, high bandwidth, and low power of optics fabricated in silicon.
6 Advantages of Optical Transmission over Electrical Ones
- Lower material cost
- Lower cost of transmitters and receivers
- Capability to carry electrical power as well as
signals (in specially-designed cables)
7 Disadvantages of Optical Transmission over Electrical Ones
- Optical fibers are more difficult and expensive to splice.
- At higher optical powers, optical fibers and other optics are susceptible to "fiber fuse," in which slightly too much light meeting an imperfection can destroy several meters of fiber per second.
- Installing a fiber fuse detection circuit at the transmitter can break the circuit and halt the failure to minimize damage.
8 Final Notes
- IBM has built an optical switch for multi-core chips.
- The new nanotech switch, which is 100 times smaller than a human hair, is designed to enable researchers to build future chips that will have greater performance but use less energy.
- They are also working on a project to shrink supercomputers down to the size of a laptop by replacing the electrical wiring that now connects multiple cores inside a microprocessor with pulses of light.
- The company said the laptop supercomputers should be ready by 2020, calling the work a "breakthrough" in chip design.
9 Final Notes
- They also said optical communication between the cores would dramatically cut a processor's energy needs and the heat it emits.
- The new chips would require only the energy needed to power a light bulb, while today's supercomputers need enough energy to power hundreds of homes, the company noted.
- http://www.linuxworld.com.au/index.php/id19819404
- http://www.photonics.com/Content/ReadArticle.aspx?ArticleID=33505
10 Multiprocessors
11 Multiprocessor motivation
- Many scientific applications take too long to run on a single processor.
- These are parallel applications consisting of loops that operate on independent data; they need a multiprocessor machine with each loop iteration running on a different processor, operating on independent data (see the sketch after this slide).
- Many multi-user environments require more compute power than is available from a single-processor machine (airline reservation systems, inventory systems, file servers). These consist largely of parallel transactions that operate on independent data.
- Multiprocessor machines should not be confused with multi-core processors, although some functionality is similar.
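To make the loop-level parallelism above concrete, here is a minimal sketch (an illustrative example, not from the original slides) that splits a loop over independent data across POSIX threads, one chunk of iterations per thread; the array names and thread count are assumptions.

/* Minimal sketch: splitting a data-parallel loop across POSIX threads,
 * one chunk of independent iterations per thread. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double a[N], b[N], c[N];

struct range { int lo, hi; };

static void *worker(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        c[i] = a[i] + b[i];          /* each iteration touches independent data */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * (N / NTHREADS);
        r[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &r[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("c[0] = %f\n", c[0]);
    return 0;
}

Because no iteration reads data written by another, the threads never need to coordinate inside the loop, which is exactly why such applications scale well on multiprocessors.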
12 Multiprocessor performance
- They assure high throughput for independent tasks.
- Alternatively, a single program can run on several processors (parallel processing), but programming is more difficult and the code is not portable.
- The processors can be on a single bus or can be connected over a LAN (up to 256 processors).
- Which is better?
13 Multiprocessor examples
- Multiprocessors can be found in low-end PCs such as dual-processor Xeons or Macs.
14 Multiprocessor history
15 Multiprocessor history
Sun Fire x4150 1U server
16 Multiprocessor history
Sun Fire x4150 1U server
4 cores per CPU
16 x 4GB = 64GB DRAM
17 I/O System Design Example
- Given a Sun Fire x4150 system with:
- Workload: 64KB disk reads
- Each I/O op requires 200,000 user-code instructions and 100,000 OS instructions
- Each CPU: 10^9 instructions/sec
- FSB: 10.6 GB/sec peak
- DRAM: DDR2-667, 5.336 GB/sec
- PCI-E x8 bus: 8 x 250MB/sec = 2GB/sec
- Disks: 15,000 rpm, 2.9ms avg. seek time, 112MB/sec transfer rate
- What I/O rate can be sustained?
- For random reads, and for sequential reads
18 Design Example (cont.)
- I/O rate for CPUs
- Per core: 10^9 / (100,000 + 200,000) = 3,333 ops/sec
- 8 cores: 26,667 ops/sec
- Random reads, I/O rate for disks
- Assume actual seek time is average/4
- Time/op = seek + rotational latency + transfer = 2.9ms/4 + 4ms/2 + 64KB/(112MB/s) ≈ 3.3ms
- 303 ops/sec per disk, 2424 ops/sec for 8 disks
- Sequential reads
- 112MB/s / 64KB = 1750 ops/sec per disk
- 14,000 ops/sec for 8 disks
19 Design Example (cont.)
- PCI-E I/O rate
- 2GB/sec / 64KB = 31,250 ops/sec
- DRAM I/O rate
- 5.336 GB/sec / 64KB = 83,375 ops/sec
- FSB I/O rate
- Assume we can sustain half the peak rate
- 5.3 GB/sec / 64KB = 81,540 ops/sec per FSB
- 163,080 ops/sec for 2 FSBs
- Weakest link: disks
- 2424 ops/sec random, 14,000 ops/sec sequential
- Other components have ample headroom to accommodate these rates (a short program reproducing this arithmetic follows this slide)
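As a cross-check of the arithmetic above, the short program below (an illustrative sketch; the unit conventions, such as 64 KB = 64,000 bytes and GB = 10^9 bytes, are assumptions) recomputes each component's sustainable I/O rate and confirms that the disks are the weakest link. The results agree with the slides' figures up to rounding.

/* Back-of-the-envelope check of the Sun Fire x4150 I/O example. */
#include <stdio.h>

int main(void)
{
    double io_size      = 64e3;          /* 64 KB per read                     */
    double instr_per_io = 100e3 + 200e3; /* OS + user instructions per I/O op  */
    double cpu_rate     = 1e9;           /* instructions/sec per core          */
    int    cores        = 8, disks = 8;

    double cpu_ops = cores * cpu_rate / instr_per_io;

    /* Random read: seek (avg/4) + half-rotation latency + transfer time */
    double seek      = 2.9e-3 / 4.0;
    double latency   = (60.0 / 15000.0) / 2.0;    /* 15,000 rpm -> 2 ms        */
    double transfer  = io_size / 112e6;
    double disk_rand = disks / (seek + latency + transfer);
    double disk_seq  = disks * 112e6 / io_size;

    double pcie = 2e9 / io_size;
    double dram = 5.336e9 / io_size;
    double fsb  = 2 * (10.6e9 / 2.0) / io_size;   /* two FSBs at half of peak  */

    printf("CPU %.0f, disk rand %.0f, disk seq %.0f, PCI-E %.0f, DRAM %.0f, FSB %.0f ops/sec\n",
           cpu_ops, disk_rand, disk_seq, pcie, dram, fsb);
    /* The disks are the bottleneck: ~2,400 ops/sec random, 14,000 sequential. */
    return 0;
}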
20 Questions
- How do parallel processors share data?
- Single address space: communication through lw and sw (loads and stores), which needs synchronization
- Uniform memory access (UMA): all memory takes the same time to access, vs.
- Non-uniform memory access (NUMA), which scales to larger sizes, up to 256 processors
- Private memory: communication through message passing, up to 256 processors
- How do parallel processors coordinate? Synchronization (locks, semaphores), synchronization built into send/receive primitives, or operating-system protocols (a small lock sketch follows this slide)
- How are they implemented? Connected by a single bus, or connected by a network
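As a small illustration of lock-based coordination in a single address space (an assumed example using C11 atomics and POSIX threads, not taken from the slides), the sketch below protects a shared counter with a spinlock built from an atomic test-and-set.

/* Two threads increment a shared counter under a spinlock. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long shared_counter = 0;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                   /* spin until the lock is free */
        shared_counter++;                       /* critical section            */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", shared_counter);  /* 200000 with the lock held */
    return 0;
}

Without the lock, the two increments could interleave and updates would be lost; the lock is what makes the shared-memory communication safe.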
21 Multiprocessors on a single bus
- Up to 32 processors can share a single bus.
- Each processor has its own cache, but all share the same memory space.
- Each cache can hold copies of the same data; caching reduces latency and reduces bus traffic.
- They communicate through shared memory and have UMA.
- A single copy of the OS is used.
- But it is difficult to scale to a large number of processors.
22 Shared memory multiprocessors
- The major design issue is cache coherency: ensuring that stores to cached data are seen by other processors.
- Coherent reading: if a cache misses, another cache can supply the data.
- Coherent writing: when one processor writes data into a shared block, all other copies of that block located in other caches must either be invalidated or updated (depending on the protocol).
- Synchronization: the coordination among processors accessing shared data.
- Memory consistency: the definition of when a processor must observe a write from another processor (see the sketch after this slide).
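The memory-consistency point can be illustrated with a small sketch (an assumed example using C11 atomics, not from the slides): the producer writes the data and then sets a flag with release ordering, so a consumer that observes the flag with acquire ordering is guaranteed to also observe the write to the data.

/* Release/acquire ordering: the consumer may read 'data' only after it
 * observes the producer's write to 'ready'. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int data = 0;
static atomic_int ready = 0;

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                                            /* write shared data */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                 /* wait for the flag */
    printf("data = %d\n", data);                          /* guaranteed 42     */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}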
23 Cache coherency problem
- Two write-back caches becoming incoherent:
(1) CPU 0 reads block A
(2) CPU 1 reads block A
(3) CPU 0 writes block A
24 Snooping
- Cache controllers need to monitor, or "snoop" on, the bus to see whether their cache has a copy of the block being written to by another CPU; the duplicated tag used for this is called a snoop tag.
- Each cache has a duplicate copy of the address tag bits and a second read port for snooping the bus.
25 Two Snooping Protocols
- Write-invalidate protocol
- The processor obtains exclusive access to the data before writing to a shared block.
- Before the write, the CPU sends an invalidation signal to all other caches, so they will miss on the next read.
- The most common protocol; it reduces bus traffic, which allows more processors on the bus.
- Write-update (write-broadcast) protocol
- The processor continuously sends updated copies of its writes to all other caches.
- Has the advantage of reduced latency.
- Used very infrequently: it has high bandwidth requirements due to large bus traffic.
26 Write-Invalidate Protocol
- Simultaneous writes: the bus arbiter decides which processor is allowed to proceed.
- The first CPU to obtain the bus invalidates the line in the cache of the other one.
- Then the second CPU does the same to the first.
- A write does not complete until bus access is obtained.
- How do we locate the data on a cache miss?
- In write-through caches: in memory.
- In write-back caches it is trickier, so we will deal with this in more detail (MESI protocol).
- A write with no interleaved activity by other CPUs is very efficient (no bus activity). A simplified state-machine sketch follows this slide.
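The sketch below (an illustrative simplification in the MSI style, not the full MESI protocol referred to above) shows how one cache block's state changes under local CPU accesses and snooped bus events in a write-invalidate scheme.

/* Simplified write-invalidate (MSI-style) state machine for one cache block. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } BlockState;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_INVALIDATE } Event;

static BlockState next_state(BlockState s, Event e)
{
    switch (e) {
    case CPU_READ:       /* a read miss fetches the block; a hit keeps its state */
        return (s == INVALID) ? SHARED : s;
    case CPU_WRITE:      /* the writer broadcasts an invalidate, then owns the block */
        return MODIFIED;
    case BUS_READ:       /* another CPU reads: a MODIFIED copy is written back and shared */
        return (s == MODIFIED) ? SHARED : s;
    case BUS_INVALIDATE: /* another CPU writes: our copy becomes stale */
        return INVALID;
    }
    return s;
}

int main(void)
{
    BlockState s = INVALID;
    s = next_state(s, CPU_READ);        /* this CPU reads block A    -> SHARED  */
    s = next_state(s, BUS_READ);        /* another CPU reads block A -> SHARED  */
    s = next_state(s, BUS_INVALIDATE);  /* another CPU writes block A-> INVALID */
    printf("final state: %d (0 = INVALID)\n", s);
    return 0;
}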
27 Cache coherence problem revisited
(1) CPU 0 reads block A
(2) CPU 1 reads block A
(3) CPU 0 sends an invalidate for block A
(4) CPU 0 writes block A
28 Multiprocessors on a network
- A single-bus multiprocessor architecture has limits on the number of processors due to limited bus and memory bandwidth.
- The solution is to have more than one bus, i.e., a network.
- The network can connect to memory that is physically distributed.
29 Multiprocessors on a network
- The network can also connect above the memory (Sun E10000).
- Shared memory machines connected together over a network operate as distributed shared memory (a DSM machine).
30 Distributed memory
- Distributed memory is the opposite of centralized memory.
- It can have a single address space (called shared memory), or each processor can have its own memory address space (called private memory).
- In the case of shared memory, communication is done through loads and stores.
- In the case of private memory, communication is done through message passing (send and receive), used to access another processor's memory.
31 Shared Memory
- Non-uniform memory access (NUMA) shared memory multiprocessors
- All memory can be addressed by all processors, but access to a processor's own local memory is faster than access to another processor's remote memory.
- Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses.
- Uses the commodity hardware of a distributed memory multiprocessor, but all processors have the illusion of shared memory.
- The operating system handles accesses to remote memory transparently on behalf of the application.
- This relieves the application developer of the burden of memory management across the network.
32 Characteristics of multiprocessor computers

Name             Number of processors   Memory size   Communication BW/link   Topology
Cray T3E         2048                   524 GB        1200 MB/sec             3-D torus
HP/Convex        64                     65 GB         980 MB/sec              8-way crossbar
SGI Origin       128                    131 GB        800 MB/sec              ring
SUN Enterprise   64                     65 GB         1600 MB/sec             16-way crossbar
33 Cache coherency for single address space
- Since there are multiple buses, snooping will not work; we need an alternative: the use of directories.
- The directory keeps the state of every block in memory, including the sharing status of that block (a sketch of a directory entry follows this slide).
- The directory sends explicit messages over the network to every processor whose cache has that data.
- There are two levels of coherence. At the cache level, the original data is in memory and replicated in the caches that need it.
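A directory entry might look like the following sketch (an assumed data-structure example, not from the slides): one state field per memory block plus a bit vector of sharers, which is exactly what is needed to send targeted invalidation messages on a write.

/* Sketch of a directory entry for one memory block. */
#include <stdint.h>
#include <stdio.h>

#define MAX_PROCS 64

typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;                /* bit i set => processor i caches the block */
} DirEntry;

/* On a write by 'writer', send invalidations to every other sharer. */
static void handle_write(DirEntry *e, int writer)
{
    for (int p = 0; p < MAX_PROCS; p++)
        if (((e->sharers >> p) & 1) && p != writer)
            printf("send invalidate to processor %d\n", p);
    e->sharers = 1ULL << writer;     /* the writer is now the only holder */
    e->state = EXCLUSIVE_DIRTY;
}

int main(void)
{
    DirEntry block = { SHARED_CLEAN, (1ULL << 0) | (1ULL << 3) };
    handle_write(&block, 0);         /* processor 0 writes; processor 3 is invalidated */
    return 0;
}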
34 Cache coherency for single address space
- The second level of coherence is at the memory level.
- It requires more hardware, and the OS takes care of moving data at the page level.
- Miss penalties are very large, since data needs to be brought over the network.
- However, by moving pages, the miss rate is reduced (due to co-located data).
35 Message Passing
- For machines with private memories (each CPU has its own memory and cache).
- Message passing over a network is used in clusters (discussed next).
- Good for parallel programming techniques.
- Uses MPI (Message Passing Interface).
- Visible to the programmer.
- Example: sum 100,000 numbers on a network-connected multiprocessor with 100 processors using multiple private memories (see the sketch after this slide).
- Steps:
- Distribute 100 subsets for partial sums
- Compute the partial sums on each processor
- Split the CPUs in half; one side sends, the other side receives and adds; repeat until one sum remains
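The steps above map naturally onto MPI send/receive calls. The sketch below is an illustrative example: it assumes a power-of-two number of processes (the slide's 100-processor case needs one extra step for odd-sized halves), and the summed values are stand-ins for real data. Each process computes a local partial sum, then the set of processes is halved repeatedly, with the upper half sending and the lower half receiving and adding.

/* Reduction by repeated halving, in the spirit of the slide's example. */
#include <mpi.h>
#include <stdio.h>

#define N 100000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Steps 1 and 2: each process sums its own subset of the data. */
    long long local = 0;
    for (int i = rank; i < N; i += nprocs)
        local += i + 1;                       /* stand-in for the real data */

    /* Step 3: split the processes in half; the upper half sends its partial
     * sum, the lower half receives and adds; repeat until rank 0 has the total. */
    for (int half = nprocs / 2; half >= 1; half /= 2) {
        if (rank >= half && rank < 2 * half) {
            MPI_Send(&local, 1, MPI_LONG_LONG, rank - half, 0, MPI_COMM_WORLD);
        } else if (rank < half) {
            long long partial;
            MPI_Recv(&partial, 1, MPI_LONG_LONG, rank + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += partial;
        }
    }
    if (rank == 0)
        printf("total = %lld\n", local);      /* sum of 1..100000 */
    MPI_Finalize();
    return 0;
}

In practice this whole reduction is usually expressed with a single MPI_Reduce call, which implements the same tree-shaped combining internally.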
36 Clusters
- Connect several (or several hundred) off-the-shelf computers over a network.
- Strengths: cheaper, available all the time, expandable.
- Can achieve very good performance; most of the time it is good enough.
- Since each machine has its own copy of the OS, it is much easier to isolate a machine in case of failure.
- Weaknesses compared to bus-based multiprocessors:
- System administration costs are higher since there are n independent machines.
- The bus is slower (an I/O bus is slower than a backplane bus).
- Smaller memory.
- Applications where cost/performance is important use hybrid clusters of multiprocessors.
37 Characteristics of clusters vs. multiprocessors

Multiprocessor   Number of processors   Memory size   Communication BW/link   Topology
Cray T3E         2048                   524 GB        1200 MB/sec             3-D torus
HP/Convex        64                     65 GB         980 MB/sec              8-way crossbar
SGI Origin       128                    131 GB        800 MB/sec              ring
SUN Enterprise   64                     65 GB         1600 MB/sec             16-way crossbar

Cluster                  Number of processors   Memory size   Communication BW/link   Node type and number
Tandem NonStop           4096                   1,048 GB      40 MB/sec               16-way SMP, 256
IBM RS6000 SP2           512                    1,048 GB      150 MB/sec              16-way node, 32
IBM RS6000 R40           16                     4 GB          12 MB/sec               8-way SMP, 2
SUN Enterprise Cluster   60                     61 GB         100 MB/sec              30-way SMP, 2