Group Talk - PowerPoint PPT Presentation

About This Presentation

Title:

Group Talk

Slides: 57

Provided by: brej

Learn more at: https://brej.org

Transcript and Presenter's Notes

Title: Group Talk

1
Group Talk

Charlie Brej
APT Group
University of Manchester

2
Part 1: The Future According to Me


3
Razor Blades
Scheme 1: Name, Number, Plus/Extreme/Ultra/Turbo/?X
Trac II Plus, Core Quad Extreme, Athlon 64 FX, GeForce 8800 Ultra
Scheme 2: Company, Name, Fusion/Quattro/Mach
Gillette Fusion, AMD Fusion, Ford Fusion
Schick Quattro, NVIDIA Quadro, Audi Quattro
Gillette Mach, ATI Mach, Ford Mustang Mach 1
Maybe more soon
[Timeline labels: 1901, 1971, 1998, 2004, 2005]
4
Razor Blade History
5
Prediction: 2007 Jan-Sept
15 Blade Apple iShave
6
Why did this not happen?

Because you don't need more than five blades on your razor
Unless we grow larger faces
Which hasn't happened before, so we won't need them for some time
We don't need more than four processors
Unless we invent an automagic parallelism extractor
Which we haven't since the 60s, so we won't need them for some time
People will still demand faster single-thread performance

7
Real Future

Moore's law will continue
Transistor count doubles every 18 months
Moving into the 3rd dimension
Intelligent transistors placed per person will remain constant
Not copy-paste
Verification becomes problematic
Designs become very complicated

8
Productivity
Managers 40
Grunt Coder 80
Can we make it pink?
Sales 0
Hero Coder 100
Maintainers 60
Marketing -20
Admin 20
How about Intel Terrano
9
Brej's Law

Person years per design doubles every 18 months
Most transistors are copy-paste
Verification becomes much more complex
Hero coders become more rare
People get stupider
Marketing becomes more important

10
Brej's Law

1985: 5 person-years (ARM)
1997: 2560 person-years (Pentium II, about right)
2007: 81920 person-years
Intel has 94,000 employees
AMD has 16,000
A new design every 7 years
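
The figures on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it simply shows that 2560 and 81920 are 5 person-years doubled 9 and 14 times, while a strict 18-month doubling from 1985 gives 8 doublings by 1997 (hence "about right").

```python
BASE_YEAR = 1985
BASE_PERSON_YEARS = 5  # the ARM figure from the slide


def person_years(doublings):
    """Design effort after a whole number of doublings."""
    return BASE_PERSON_YEARS * 2 ** doublings


def strict_person_years(year, period_years=1.5):
    """Design effort if effort doubles exactly every 18 months from 1985."""
    return BASE_PERSON_YEARS * 2 ** ((year - BASE_YEAR) / period_years)


print(person_years(9))             # the slide's 1997 figure
print(person_years(14))            # the slide's 2007 figure
print(strict_person_years(1997))   # strict 18-month doubling, for comparison
print(round(strict_person_years(2007)))
```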

11
Brej's Law

2028: Entire population of the USA is employed by Intel
2031: Entire population of China is employed by AMD
2034: Entire world population working on creating Pentium 12
2090: Project to build Pentium 15 starts but hits a snag as the universe finishes before the project does
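
One way to reproduce dates of this kind is sketched below. It rests on assumptions that are not in the slides: "a new design every 7 years" is read as team size = person-years / 7, the 2007 figure of 81920 person-years is taken as the starting point, and the population figures are rough 2007 values chosen for illustration.

```python
import math

BASE_YEAR = 2007
BASE_PERSON_YEARS = 81920   # the 2007 figure from the previous slide
DESIGN_YEARS = 7            # "a new design every 7 years" (interpretation)
DOUBLING_PERIOD = 1.5       # years per doubling

# Rough 2007 populations; assumed for illustration, not taken from the slides.
POPULATION = {"USA": 300e6, "China": 1.3e9, "World": 6.6e9}


def year_team_reaches(headcount):
    """Year in which one design needs `headcount` people for DESIGN_YEARS."""
    needed_person_years = headcount * DESIGN_YEARS
    doublings = math.log2(needed_person_years / BASE_PERSON_YEARS)
    return BASE_YEAR + DOUBLING_PERIOD * doublings


for name, population in POPULATION.items():
    print(name, round(year_team_reaches(population), 1))
```

Under these assumptions the three dates land within a couple of years of the 2028, 2031 and 2034 quoted on the slide.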

"The most powerful force in the universe is compound interest"
Albert Einstein

"And we didn't have any fancy Sony Playstation video games
We had the Atari 2600!
There were no multiple levels or screens.
It was just ONE screen, forever, and you could never win.
The game just kept getting harder and faster until you died.
Just like LIFE!"
Ernest Cline

13
Back to the Future

Transistors will be free
Mostly consumed in memory
Diminishing returns
Single thread grinds to a halt
Increase performance by 1%, get 100% more money
Fewer designs
Very expensive and long lead up times
Extend rather than redesign

14
Part 2: Wagging Logic, a Non Throughput-Bound Design Methodology


15
Introduction

Async performance
Asynchronous logic is slow
Wagging Logic
Example circuits
Red Star
Design
Results
Conclusions

16
Data propagation
[Figure: pipeline of Logic blocks and C-elements, with time labels 0 to 12 comparing Latency and Cycle Time]
17
Control propagation
[Figure: pipeline of Logic blocks and C-elements, with time labels 0 to 12 comparing Latency and Cycle Time]
18
Control propagation
[Figure: pipeline of Logic blocks and C-elements, with time labels 0 to 12 comparing Latency and Cycle Time]
19
And then it gets worse

Latency is at least six times lower than the cycle time
Assumes all data arrives at the same time
Assumes all acknowledgements arrive at the same time
The actual ratio is somewhere between 10 and 100

20
What can we do

Use two-phase signalling
Halve the control delay
Lose all average case advantages
Fine grain pipelining
Need to add 10 latches per stage
Adds latency
Faster completion
Anti-tokens, Early-drop latches
Careful timing analysis

21
Wagging Latches

Alternate latch read/write
Capacity of two latches
Depth of one latch
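
The three points above can be illustrated with a small behavioural model. This is a sketch, not the circuit from the talk: the class and method names are invented here, and the handshaking is reduced to a full/empty flag per latch. Writes and reads each alternate between the latches, so two items can be held at once (capacity of two) while every item passes through exactly one latch (depth of one).

```python
class WaggingLatches:
    """Behavioural model of latches that are written and read in rotation."""

    def __init__(self, ways=2):
        self.slots = [None] * ways
        self.full = [False] * ways
        self.write_way = 0  # latch that takes the next write
        self.read_way = 0   # latch that supplies the next read

    def write(self, value):
        """Store a value in the next latch; return False if it is still full."""
        if self.full[self.write_way]:
            return False
        self.slots[self.write_way] = value
        self.full[self.write_way] = True
        self.write_way = (self.write_way + 1) % len(self.slots)
        return True

    def read(self):
        """Take the oldest value, or return None if the next latch is empty."""
        if not self.full[self.read_way]:
            return None
        value = self.slots[self.read_way]
        self.full[self.read_way] = False
        self.read_way = (self.read_way + 1) % len(self.slots)
        return value


latches = WaggingLatches()
print(latches.write("a"), latches.write("b"), latches.write("c"))
print(latches.read(), latches.read(), latches.read())
```

Order is preserved because the reader visits the latches in the same rotation as the writer.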

22
Wagging Logic

Apply same method to the logic
Rotate logic allowing one to set while others
reset

[Figure: three copies of the logic, one in Set and two in Reset]
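
A toy timing model shows why rotating the logic helps. Everything in it is an assumption made for illustration: operations form a dependent chain (each needs the previous result), a slice takes `set_time` to produce a result and then `reset_time` before it can be reused, and the values 1 and 5 are chosen only to echo a cycle time six times the latency.

```python
def total_time(num_ops, num_slices, set_time, reset_time):
    """Time to finish a chain of dependent operations issued round-robin."""
    slice_free_at = [0] * num_slices  # when each slice has finished resetting
    prev_done = 0                     # when the previous result became valid
    for op in range(num_ops):
        s = op % num_slices
        # Wait for both the previous result and this slice's reset.
        start = max(prev_done, slice_free_at[s])
        prev_done = start + set_time
        slice_free_at[s] = prev_done + reset_time
    return prev_done


for slices in (1, 3, 6):
    print(slices, total_time(num_ops=10, num_slices=slices,
                             set_time=1, reset_time=5))
```

With one slice every operation pays for set plus reset. Once there are enough slices to hide the reset behind the other slices' set phases, the cost per operation approaches the set time alone.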
23
Single Channel Mixer
24
LCM Channels Mixer
25
Direct Connection Mixer
26
32bit Incrementer Example
[Figure: incrementer split into Slice 0, Slice 1 and Slice 2; block labels: Reg, HB, 1]
27
32bit Incrementer
Optimal design: 3288 operations, 3.04 GDs per operation
Original design: 77 operations, 130 GDs per operation
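
The two results can be compared directly. The sketch below only does arithmetic on the numbers from the slide; reading "GDs" as gate delays is an assumption, as is the inference that both runs cover the same simulated window.

```python
designs = {
    "optimal": {"operations": 3288, "gd_per_op": 3.04},
    "original": {"operations": 77, "gd_per_op": 130},
}


def total_gds(design):
    """Total GDs covered by a run: operations times GDs per operation."""
    return design["operations"] * design["gd_per_op"]


speedup = designs["original"]["gd_per_op"] / designs["optimal"]["gd_per_op"]

for name, design in designs.items():
    print(name, round(total_gds(design)))
print("speedup", round(speedup, 1))
```

Both totals come out close to 10,000 GDs, which suggests the two designs were run for the same length of time, and the per-operation cost differs by a factor of a little over 40.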
28
32bit Accumulator Example

Load or Accumulate

29
32bit Accumulator Example
[Figure: Load and Accumulate operations shown for each slice]
30
32bit Accumulator Example
31
Transistors are Free

What is expensive?
Design effort
Time to market
Yield
What we want
Simple
Copy-Paste
Redundancy

32
Redundancy
[Figure: six identical slices]
33
Arrangement
[Figure: alternative physical arrangements of Slice 0 to Slice 5]
34
Teaching Monkeys

Dynamic extraction of parallelism
Implicit data dependency tracking
No locking
No polling
No handshakes
Average case performance

35
Red Star

MIPS ISA
32bit RISC
Fast and simple development
Use synchronous design methodology
Complicated features without complicated design effort
OOO execution, banked caching

36
Red Star
37
Register Bank
38
ADD R1, R1, 1
1401 operations, 7.14 GDs per operation
39
Branch Logic
PC
1

Additional unnecessary stages to extend the branch shadow
40
Overlapping Instructions
[Figure: eight overlapping instructions, each passing through Fetch, Decode, Execute, Memory, WriteBack and Dummy stages, with the Branch Shadow marked]
41
Nine Instruction Loop
42
Caching 4 Instruction Loop
Program: 0: Instruction, 1: Instruction, 2: Instruction, 3: Branch 0
[Figure: RAM feeding Slice 0 to Slice 3, each with its own cache, with the instruction addresses seen by each slice]
43
Caching 3 Instruction Loop
Program: 0: Instruction, 1: Instruction, 2: Branch 0
[Figure: RAM feeding Slice 0 to Slice 3, each with its own cache, with the instruction addresses seen by each slice]
44
Caching Delayed Branch
If (PC % WagLevel != Slice)
Execute a NOP
Don't increment the PC
Program: 0: Instruction, 1: Instruction, 2: Branch 0
[Figure: RAM feeding Slice 0 to Slice 3, each with its own cache; the slice whose turn does not match the PC executes a NOP]
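
The rule on this slide can be sketched in a few lines, reading it as `PC % WagLevel != Slice` (an interpretation of the slide text, not a confirmed detail of Red Star). The function and parameter names are hypothetical, the program is a loop whose last instruction branches back to address 0, and branch delay slots are ignored.

```python
def run_loop(program_len, wag_level, turns):
    """Trace which slice executes which PC when slices take turns in rotation."""
    pc = 0
    trace = []
    for turn in range(turns):
        slice_id = turn % wag_level
        if pc % wag_level != slice_id:
            # Not this slice's address: execute a NOP, leave the PC alone.
            trace.append((slice_id, "NOP"))
            continue
        trace.append((slice_id, pc))
        # The last instruction is a branch back to address 0.
        pc = 0 if pc == program_len - 1 else pc + 1
    return trace


for slice_id, action in run_loop(program_len=3, wag_level=4, turns=8):
    print("slice", slice_id, "->", action)
```

In this model each slice only ever fetches addresses with one residue modulo the number of slices, so a three-instruction loop on four slices keeps hitting the same cached instruction in each slice at the cost of one NOP per iteration.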
45
Caching