1
(No Transcript)
2
high-end computing technology: where is it heading?
greg astfalk
woon-yung chung (woon-yung_chung_at_hp.com)
3
prologue
  • this is not a talk about hewlett-packard's
    product offering(s)
  • the context is hpc (high performance computing)
  • somewhat biased to scientific computing
  • also applies to commercial computing

4
backdrop
  • end-users of hpc systems have needs and wants
    from hpc systems
  • the computer industry delivers the hpc systems
  • there exists a gap between the two wrt
  • programming
  • processors
  • architectures
  • interconnects/storage
  • in this talk we (weakly) quantify the gaps in
    these 4 areas

5
end-users' programming wants
  • end-users of hpc machines would ideally like to
    think and code sequentially
  • have a compiler and run-time system that produces
    portable and (nearly) optimal parallel code
  • regardless of processor count
  • regardless of architecture type
  • yes, i am being a bit facetious but the idea
    remains true

6
parallelism methodologies
  • there exist 5 methodologies to achieve
    parallelism (a minimal sketch follows this slide)
  • automatic parallelization via compilers
  • explicit threading
  • pthreads
  • message-passing
  • mpi
  • pragma/directive
  • openmp
  • explicitly parallel languages
  • upc, et al.

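a minimal sketch, assuming a c compiler with openmp support, of the pragma/directive style listed above; the loop, array size, and compile flag are illustrative only. the message-passing (mpi) version of the same reduction would instead distribute slices of the array across processes and combine the partial sums with MPI_Reduce.

/* illustrative only: a sequential reduction parallelized with a single
   openmp directive; compile with an openmp-capable compiler, e.g.
   cc -fopenmp sum.c (the flag name varies by compiler) */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        x[i] = 1.0;

    /* the directive asks the compiler/runtime to split the loop across
       threads and combine the per-thread partial sums */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
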
7
parallel programming
  • parallel programming is a cerebral effort
  • if lots of neurons plus mpi constitutes
    prime-time then parallel programming has
    arrived
  • no major technologies on the horizon to change
    this status quo

8
discontinuities
  • the ease of parallel programming has not
    progressed at the same rate that parallel systems
    have become available
  • performance gains require compiler optimization
    or pbo
  • most parallelism requires hand-coding
  • in the real world many users don't use any
    compiler optimizations

9
parallel efficiency
  • mindful that the bounds on parallel efficiency
    are, in general, far apart
  • 50% efficiency on 32 processors is good
  • 10% efficiency on O(100) processors is excellent
  • >2% efficiency on O(1000) processors is heroic
  • a little communication can knee over the
    efficiency vs. processor count curve (a sketch
    follows this slide)

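for reference, the efficiency numbers above follow from efficiency(p) = t(1) / (p x t(p)). the sketch below uses a toy model, a fixed communication cost added to perfectly divisible compute time, with entirely hypothetical timings, just to show how a little communication knees over the curve.

/* parallel efficiency: speedup(p) = t(1)/t(p), efficiency(p) = speedup(p)/p.
   the model t(p) = t_compute/p + t_comm is a toy; the 100 s compute and
   1 s communication inputs are hypothetical. */
#include <stdio.h>

static double efficiency(double t1, double tp, int p)
{
    return (t1 / tp) / p;
}

int main(void)
{
    const double t_compute = 100.0;  /* hypothetical: 100 s of divisible work */
    const double t_comm = 1.0;       /* hypothetical: 1 s of communication */
    int p;

    for (p = 1; p <= 1024; p *= 2) {
        double tp = t_compute / p + t_comm;
        printf("p = %4d  efficiency = %.2f\n", p, efficiency(t_compute, tp, p));
    }
    return 0;
}
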
10
apps with sufficient parallelism
  • few existing applications can utilize O(1000), or
    even O(100), processors with any reasonable
    degree of efficiency
  • to date have generally required heroic effort
  • new, or nearly completely new, algorithms (i.e.,
    data and control decompositions) are necessary
  • such large-scale parallelism will have arrived
    when msc/nastran and oracle exist on such systems
    and utilize the processors

11
latency tolerant algorithms
  • latency tolerance will be an increasingly
    important theme for the future
  • hardware will not solve this problem
  • more on this point later
  • developing algorithms that have significant
    latency tolerance will be necessary
  • this means thinking outside the box about the
    algorithms
  • simple modifications to existing algorithms
    generally won't suffice (a minimal mpi overlap
    sketch follows this slide)

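a minimal mpi sketch of one common form of latency tolerance, overlapping communication with computation via non-blocking operations; the ring-neighbor pattern, buffer size, and arithmetic are illustrative only.

/* latency hiding by overlap: post a non-blocking exchange, do the work
   that depends only on local data, then wait for the messages.
   build/run with e.g.: mpicc overlap.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define N 4096

int main(int argc, char **argv)
{
    int rank, size, left, right, i;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;          /* illustrative ring neighbors */
    left  = (rank - 1 + size) % size;

    for (i = 0; i < N; i++)
        sendbuf[i] = rank + 1.0;

    /* post the exchange first ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... then compute with local data while the messages are in flight */
    for (i = 0; i < N; i++)
        local += sendbuf[i] * sendbuf[i];

    /* block only when the incoming data is actually needed */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    for (i = 0; i < N; i++)
        local += recvbuf[i];

    printf("rank %d: result %f\n", rank, local);
    MPI_Finalize();
    return 0;
}
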
12
operating systems
  • development environments will move to nt
  • heavy-lifting will remain with unix
  • four unixes to survive (alphabetically)
  • aix 5l
  • hp-ux
  • linux
  • solaris
  • linux will be important at the lower-end but will
    not significantly encroach on the high-end

13
end-users' proc/arch wants
  • all things being equal, high-end users would
    likely want a classic cray vector supercomputer
  • no caches
  • multiple pipes to memory
  • single word access
  • hardware support for gather/scatter
  • etc.
  • it is true however that for some applications
    contemporary risc processors perform better

14
processors
  • the processor of choice is now, and will be for
    some time to come, the risc processor
  • risc processors have caches
  • caches are good
  • caches are bad
  • if your code fits in cache, you aren't
    supercomputing!

15
risc processor performance
  • a rule of thumb is that a risc processor, any
    risc processor, gets on average, on a sustained
    basis,
  • 10% of its peak performance
  • the 3σ on this is large (a worked example follows
    this slide)
  • achieved performance varies with
  • architecture
  • application
  • algorithm
  • coding
  • dataset size
  • anything else you can think of

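a worked illustration of the rule of thumb, with hypothetical numbers: a risc processor clocked at 500 mhz that can retire 2 floating-point operations per cycle has a peak of 1 gflop/s, so the rule of thumb predicts on the order of 100 mflop/s sustained; the large 3σ means real applications can land anywhere from a few percent of peak to several times that 10% figure.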
16
semiconductor processes
  • semiconductor processes change every 2-3 years
  • assuming that technology scaling applies to
    subsequent generations, then per generation
  • frequency increase of 40%
  • transistor density increase of 100%
  • energy per transition decrease of 60% (a
    compounding example follows this slide)

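taking the per-generation figures above at face value, two generations (roughly 4-6 years) compound to about 1.4 x 1.4 ≈ 2x in frequency, 2 x 2 = 4x in transistor density, and 0.4 x 0.4 = 0.16x in energy per transition (about a 6x reduction); these are scaling-rule projections, not measurements.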
17
semiconductor processes
(No Transcript)
18
what to do with gates
  • it is not a simple question of what the best use
    of the gates is
  • larger caches
  • multiple cores
  • specialized functional units
  • etc.
  • the impact of soft errors with decreasing design
    rule size will be an important topic
  • what happens if an alpha particle flips a bit in
    a register?

19
processor futures
  • you can expect, for the short term,
    moore's-law-like gains in processors' peak
    performance
  • doubling of performance every 18-24 months
  • does not necessarily apply to application
    performance
  • moore's law will not last forever
  • 4-5 more turns (maybe?)

20
processor evolution
[chart: processor performance vs. time, from CISC (~0.3 instructions/cycle)
to RISC (1 instruction/cycle) to superscalar RISC (2 instructions/cycle) to
next-generation IA-64/EPIC; process shrinks from 1 micron to .5, .35, .25,
.18, and .13 micron; 20-30% increase per year due to advances in underlying
semiconductor technology. source: hp, european analysts briefing, london,
september 5, 2000]
21
customer spending ($m)
[chart: customer spending, 0 to 40,000 $m; source: idc, february 2000]
  • technology disruptions
  • risc crossed over cisc in 1996
  • itanium will cross over risc in 2004

22
present high-end architectures
  • today's high-end architecture is one of
  • smp
  • ccnuma
  • cluster of smp nodes
  • cluster of ccnuma nodes
  • japanese vector system
  • all of these architectures work
  • efficiency varies with application type

23
architectural issues
  • of the choices available, the smp is preferred;
    however
  • smp processor count is limited
  • cost of scalability is prohibitive
  • ccnuma addresses these limitations but induces
    its own
  • disparate latencies
  • better, but still limited, scalability
  • ras limitations
  • clusters too have pros and cons
  • huge latencies
  • low cost
  • etc.

24
physics
  • limitations imposed by physics have led us to
    architectures that have a deep memory hierarchy
  • the algorithmist and programmer must deal with,
    and exploit, the hierarchy to achieve good
    performance
  • this is part of the cerebral effort of parallel
    programming we mentioned earlier

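picking up the point above about exploiting the hierarchy, a minimal sketch of blocking (tiling) a loop nest so each small block stays resident in cache while it is reused; the matrix size and block size are illustrative and would be tuned to the actual cache sizes.

/* cache blocking (tiling) of a matrix transpose; the point is the loop
   structure: finish one BS x BS block, which fits in cache, before moving
   on.  N and BS are illustrative values, not recommendations. */
#include <stdio.h>

#define N  1024
#define BS 64

static double a[N][N], b[N][N];

int main(void)
{
    int ii, jj, i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            b[i][j] = (double)(i + j);

    /* blocked traversal: the strided accesses of the transpose are confined
       to one cache-resident block at a time */
    for (ii = 0; ii < N; ii += BS)
        for (jj = 0; jj < N; jj += BS)
            for (i = ii; i < ii + BS; i++)
                for (j = jj; j < jj + BS; j++)
                    a[j][i] = b[i][j];

    printf("a[1][2] = %.1f\n", a[1][2]);
    return 0;
}
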
25
memory hierarchy
  • typical latencies for today's technology

26
balanced system ratios
  • an ideal high-end system should be balanced wrt
    its performance metrics (a worked example follows
    this list)
  • for each peak flop/second
  • 0.5-1 byte of physical memory
  • 10-100 bytes of disk capacity
  • 4-16 bytes/sec of cache bandwidth
  • 1-3 bytes/sec of memory bandwidth
  • 0.1-1 bit/sec of interconnect bandwidth
  • 0.02-0.2 bytes/sec of disk bandwidth

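a worked example, with entirely hypothetical numbers: a 16-processor smp with 1.5 gflop/s peak per processor has 24 gflop/s aggregate peak, so the ratios above call for roughly 12-24 gbyte of physical memory, 0.24-2.4 tbyte of disk capacity, 96-384 gbyte/sec of cache bandwidth, 24-72 gbyte/sec of memory bandwidth, 2.4-24 gbit/sec of interconnect bandwidth, and 0.48-4.8 gbyte/sec of disk bandwidth.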
27
balanced system
  • applying the balanced system ratios to an unnamed
    contemporary 16-processor smp

28
storage
  • data volumes are growing at an extremely rapid
    pace
  • disk capacity sold doubled from 1997 to 1998
  • storage is an increasingly large percentage of
    the total server sale
  • disk technology is advancing too slowly
  • per generation (of 1-1.5 years)
  • access time decreases 10%
  • spindle bandwidth increases 30%
  • capacity increases 50% (a worked example follows
    this slide)

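a worked consequence of the per-generation figures above, starting from a hypothetical disk: 50 gbyte streamed at 25 mbyte/sec takes about 2,000 seconds to read end to end; one generation later capacity is 75 gbyte but spindle bandwidth only 32.5 mbyte/sec, so the full scan takes about 2,300 seconds. the time to touch all the data grows roughly 15% per generation (1.5/1.3 ≈ 1.15).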
29
networks
  • only the standards will be widely deployed
  • gigabit ethernet
  • gigabyte ethernet
  • fibre channel (2x and 10x later)
  • sio
  • atm
  • dwdm backbones
  • the last mile problem remains with us
  • inter-system interconnect for clustering will not
    keep pace with the demands (for latency and
    bandwidth)

30
vendors constraints
  • rule #1: be profitable, to return value to the
    shareholders
  • you don't control the market size
  • you can only spend 10% of your revenue on r&d
  • don't fab your own silicon (hopefully)
  • you must be more than just a technical
    computing company
  • to not do this is to fail to meet rule #1 (see
    above)

31
market sizes
  • according to the industry analysts the technical
    market is, depending on where you draw the
    cut-line, $4-5 billion annually
  • the bulk of the market is small-ish systems (data
    from forest baskett at sgi)

32
a perspective
  • commercial computing is not an enemy
  • without the commercial market's revenue, our
    ability to build hpc-like systems would be
    limited
  • the commercial market benefits from the
    technology innovation in the hpc market
  • is performance left on the table in designing a
    system to serve both the commercial and technical
    markets?
  • yes

33
why?
  • lack of a cold war
  • performance of hpc systems has been marginalized
  • in the mid-'70s, how many applications ran faster
    on a vax 11/780 than on the cray-1?
  • none
  • how many applications today run faster on a
    pentium than the cray t90?
  • some
  • current demand for hpc systems is elastic

34
future prognostication
  • computing in the future will be all about data
    and moving data
  • the growth in data volumes is incredible
  • richer media types (e.g., video) mean more data
  • distributed collaborations imply moving data
  • e-whatever requires large, rapid data movement
  • more flops → more data

35
data movement
  • the scope of data movement encompasses
  • register to functional unit
  • cache to register
  • cache to cache
  • memory to cache
  • disk to memory
  • tape to disk
  • system to system
  • pda to client to server
  • continent to continent
  • all of these are going to be important

36
epilogue
  • for hpc in the future
  • it is going to be risc processors
  • smp and ccnuma architectures
  • smp processor count relatively constant
  • technology trends are reasonably predictable
  • mpi, pthreads and openmp for parallelism
  • latency management will be crucial
  • it will be all about data

37
epilogue (cont'd)
  • for the computer industry in the future
  • trending toward e-everything
  • e-commerce
  • apps-on-tap
  • brokered services
  • remote data
  • virtual data centers
  • visualization
  • nt for development
  • vectors are dying
  • for hpc vendors in the future
  • there will be fewer ?

38
conclusion
  • hpc users will need to yield more to what the
    industry can provide rather than vice-versa
  • vendors' rule #1 is a cruel master