The apeNEXT software environment: application development and execution

1
The apeNEXT software environment: application development and execution

Alessandro Lonardo
I.N.F.N. Roma - APE group
alessandro.lonardo_at_roma1.infn.it
http://apegate.roma1.infn.it/APE

2
Index

Machine Architecture
Software Areas
Programming Model
Languages
Example Applications
Development Tools
Execution Environment

3
apeNEXT Architecture: The Network

3D mesh of computing nodes
Vertices are processors
Each processor hosts its local memory
Each processor supports 64-bit complex, vector2, double and integer types
Edges are 3D torus network channels
6 bidirectional channels per processor
The basic communication primitive is a first-neighbour send-recv
Processors synchronize on communications (a send starts when the recv is issued)
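
As an illustration of the torus topology described above, here is a minimal C sketch that computes the six first neighbours of a node with wrap-around at the mesh faces. The type `node_t` and the helpers `torus_wrap` and `torus_neighbours` are hypothetical names introduced for this example; they are not part of the apeNEXT software.

```c
#include <assert.h>

/* Coordinates of a node on a 3D torus (hypothetical helper type). */
typedef struct { int x, y, z; } node_t;

/* Wrap a coordinate into [0, size), so that stepping off one face
   of the mesh re-enters from the opposite face. */
static int torus_wrap(int c, int size)
{
    assert(size > 0);
    return ((c % size) + size) % size;
}

/* Fill out[0..5] with the six first neighbours of n, in the order
   +x, -x, +y, -y, +z, -z, for a mesh of nx * ny * nz nodes. */
static void torus_neighbours(node_t n, int nx, int ny, int nz, node_t out[6])
{
    for (int k = 0; k < 6; k++)
        out[k] = n;
    out[0].x = torus_wrap(n.x + 1, nx);
    out[1].x = torus_wrap(n.x - 1, nx);
    out[2].y = torus_wrap(n.y + 1, ny);
    out[3].y = torus_wrap(n.y - 1, ny);
    out[4].z = torus_wrap(n.z + 1, nz);
    out[5].z = torus_wrap(n.z - 1, nz);
}
```

For example, on a 4x2x2 mesh the +x neighbour of node (3, 0, 1) is node (0, 0, 1).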

4
apeNEXT Architecture: The JT Processor

5
apeNEXT Architecture: Very Long Instruction Word

6
apeNEXT Architecture: The JT FILU

FILU is the floating-point, integer and logical unit
MAC operation: A*B+C
Fully pipelined (1 result per cycle)
12 cycles latency
Synthesizes to 200 MHz
4 multipliers
4 adders
1.6 GFlops on complex MAC
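
The 1.6 GFlops figure follows from the operation count of a complex multiply-accumulate. The sketch below is plain C, not JT microcode, and uses the hypothetical names `cplx_t`, `complex_mac` and `peak_gflops`. It writes the operation out in real arithmetic and assumes one complex MAC completes per cycle, as the "fully pipelined" item states.

```c
#include <assert.h>

typedef struct { double re, im; } cplx_t;   /* hypothetical type */

/* d = a * b + c written out in real arithmetic:
   4 multiplications and 4 additions/subtractions. */
static cplx_t complex_mac(cplx_t a, cplx_t b, cplx_t c)
{
    cplx_t d;
    d.re = a.re * b.re - a.im * b.im + c.re;   /* 2 mul, 2 add */
    d.im = a.re * b.im + a.im * b.re + c.im;   /* 2 mul, 2 add */
    return d;
}

/* Peak rate in GFlops when one complex MAC completes per cycle. */
static double peak_gflops(double clock_mhz)
{
    const int flops_per_mac = 4 + 4;
    assert(clock_mhz > 0.0);
    return flops_per_mac * clock_mhz / 1000.0;
}
```

With a 200 MHz clock this gives 8 x 200 MHz = 1.6 GFlops, matching the 4 multipliers and 4 adders listed above.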

7
apeNEXT Software Areas

Architecture design, development and validation: simulators, non-regression tools, ...
Application development: compilation chain, libraries, profiler
Execution environment: operating system, batch system, ...
Applications
System administration

8
apeNEXT SW development team

5 people on average
Members at all the collaboration sites:
INFN Roma, Ferrara
DESY Zeuthen
Univ. Bielefeld
INRIA (France)

9
apeNEXT Programming Model

Single Program Multiple Data (SPMD): each node executes the same program, but on its own data.
Synchronization barriers at global condition evaluations, with an explicit statement, or at I/O operations.
Node-to-node synchronization at remote communications.
Nodes are connected by a 3D network; each node can efficiently transfer data with its first, second and third neighbours.
=> Well suited for homogeneous problems with short-range interactions.

10
apeNEXT Programming Model: Data Decomposition (1)

Application: a discretized D-dimensional lattice domain is decomposed onto a 3-dimensional processor mesh (possibly D != 3)
=> Each node holds a subset of the lattice sites in its own memory
In other cases no decomposition is done: independent simulations run in parallel without communications, just to gain better statistics (FARM).
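
A minimal C sketch of such a decomposition along one axis, under the assumption that the lattice extent is a multiple of the number of nodes along that axis. The names `site_pos_t`, `decompose` and `local_index` are hypothetical, and the x-fastest linearization is just one possible layout.

```c
#include <assert.h>

/* Position of a global lattice coordinate after block decomposition
   along one axis (hypothetical helper type). */
typedef struct { int node; int local; } site_pos_t;

/* Split a global coordinate g in [0, global_size) over n_nodes nodes.
   Assumes global_size is a multiple of n_nodes, so every node holds
   the same number of sites along this axis. */
static site_pos_t decompose(int g, int global_size, int n_nodes)
{
    assert(n_nodes > 0 && global_size % n_nodes == 0);
    assert(g >= 0 && g < global_size);
    int local_size = global_size / n_nodes;
    site_pos_t p = { g / local_size, g % local_size };
    return p;
}

/* Linearized index of local site (x, y, z) in a node-local volume
   that is lx sites wide and ly sites deep, with x running fastest. */
static int local_index(int x, int y, int z, int lx, int ly)
{
    return x + lx * (y + ly * z);
}
```

Applying `decompose` independently to each lattice axis gives the owning node and the local coordinates of a site; `local_index` then turns the local coordinates into the kind of linearized index used by the kernel examples.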

11
apeNEXT Programming Model: Data Decomposition (2)

[Figure: grid of nodes 00, 01, 10, 11 in the x-y plane]
For each lattice site, and on each node in parallel, the program performs an evolutionary step. Short-range interactions => first-neighbour inter-node communication.

12
apeNEXT Programming Model: Programming Languages (1)

TAO: dedicated parallel language
Fortran-like base syntax.
Dynamic language: the (experienced) programmer can freely extend the syntax with new statements, data types and operators
=> libraries configure the language for specific application domains (LQCD, Spin Glass, ...)
Allows writing highly efficient code by exposing the features of the hardware architecture (registers, prefetch queues, cache)

13
apeNEXT Programming Model: Programming Languages (2)

C99 language
Few extensions to the standard language.
Eases the porting of applications and standard libraries.
Allows writing highly efficient code by exposing the features of the hardware architecture (registers, prefetch queues, cache)

14
apeNEXT Programming Model: Parallel Language Constructs (1)

Few parallel language constructs (the same in C99 and TAO)
Conditional execution on a subset of nodes, based on node-local conditions (where)
Boolean operators that promote local conditions to global ones, for use in flow control statements (any, all, none).
Communications between nodes in the 3D mesh are expressed as variable assignments; directions are specified by means of magic constants in the source address (X_PLUS, X_MINUS, ..., Z_MINUS).

15
apeNEXT Programming Model: Parallel Language Constructs (2)

where statement - conditional execution on a mesh subset.
where (x > y)
  max_xy = x
  min_xy = y
elsewhere
  max_xy = y
  min_xy = x
endwhere
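
The effect of this where/elsewhere block can be emulated on a single CPU by looping over the nodes. In the sketch below, element k of each array stands for the variable held by node k; the function name `where_max_min` is hypothetical, and the loop only mimics the outcome, not how apeNEXT actually executes the construct.

```c
#include <assert.h>

/* Emulate where (x > y) ... elsewhere ... endwhere over n_nodes nodes:
   element k of each array stands for the variable held by node k. */
static void where_max_min(const double x[], const double y[],
                          double max_xy[], double min_xy[], int n_nodes)
{
    assert(n_nodes >= 0);
    for (int k = 0; k < n_nodes; k++) {
        if (x[k] > y[k]) {          /* nodes where the condition holds */
            max_xy[k] = x[k];
            min_xy[k] = y[k];
        } else {                    /* elsewhere: all remaining nodes */
            max_xy[k] = y[k];
            min_xy[k] = x[k];
        }
    }
}
```

Each node ends up with its own local maximum and minimum, even though every node ran the same program text.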

16
apeNEXT Programming Model: Parallel Language Constructs (3)

Inter-node communications
integer i
real u[1024]
register real rd1, rd2, rd3, rloc
...
rd1 = u[i+Z_PLUS] !! loads u[i] from node (x, y, z+1) into rd1
rd2 = u[i+Y_PLUS+Z_PLUS]
rloc = u[i+X_PLUS+X_MINUS]
rd3 = u[i+X_PLUS+Y_MINUS+Z_PLUS]

17
apeNEXT Programming Model: Parallel Language Constructs (4)

any()/all()/none() boolean operators
!! evaluation of mesh size along X
!! with systolic algorithm
sum_ix = 1
sum_r[0] = node_abs_x
sum_r[0] = sum_r[X_PLUS] !! internode communication
while (any(sum_r[0] != node_abs_x))
  sum_ix = sum_ix + 1
  sum_r[0] = sum_r[X_PLUS]
endwhile
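
To see why the loop above ends with sum_ix equal to the mesh size along X, the systolic shift can be emulated on a single CPU. This is a sketch with hypothetical names (`systolic_ring_size`, `MAX_NODES`): array element i plays the role of the node with node_abs_x = i, and the array bound is arbitrary.

```c
#include <assert.h>

#define MAX_NODES 64   /* arbitrary bound for this illustration */

/* Emulate, on a single CPU, the systolic evaluation of the mesh
   size along X: every node repeatedly replaces its value with the
   value held by its X_PLUS neighbour until its own coordinate
   comes back. n_nodes is the ring size, which the nodes of the
   real machine would not know in advance. */
static int systolic_ring_size(int n_nodes)
{
    int r[MAX_NODES], next[MAX_NODES];
    int count, mismatch;

    assert(n_nodes >= 1 && n_nodes <= MAX_NODES);

    for (int i = 0; i < n_nodes; i++)      /* sum_r[0] = node_abs_x */
        r[i] = i;

    count = 0;
    do {
        /* all nodes read from their X_PLUS neighbour simultaneously */
        for (int i = 0; i < n_nodes; i++)
            next[i] = r[(i + 1) % n_nodes];
        for (int i = 0; i < n_nodes; i++)
            r[i] = next[i];
        count++;

        /* any(): true if at least one node still sees a foreign value */
        mismatch = 0;
        for (int i = 0; i < n_nodes; i++)
            if (r[i] != i)
                mismatch = 1;
    } while (mismatch);

    return count;
}
```

After k shifts node i holds the coordinate (i + k) mod N, so the values return home exactly when k equals the ring size N, which is the count the function returns.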

18
Example 2D application kernel: C function

T datain[LVOL], dataout[LVOL];   // LVOL is the node-local volume
// precalculated neighbourhood tables
int neighp[LVOL][2], neighm[LVOL][2];
...
void kernel_fun()
{
    register T res, d, dp0, dp1, dm0, dm1;
    int i;
    for (i = 0; i < LVOL; i++) {             // i is a linearized index
        d   = datain[i];                     // always local access
        dp0 = datain[neighp[i][0]];          // local or remote access
        dp1 = datain[neighp[i][1]];
        dm0 = datain[neighm[i][0]];
        dm1 = datain[neighm[i][1]];
        res = calc(d, dp0, dp1, dm0, dm1);   // big inline
        dataout[i] = res;
    }
}
19
Example 2D application kernel: domain decomposition

[Figure: domain decomposition of the datain array over the x-y node grid. Labels: datain local domain of node (x, y), stored as datain[LVOL]; boundary of local domain.]

20
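Example 2D application kernel: First Neighbour Systolic Communication

[Figure: nodes 00, 01, 10, 11 in the x-y plane]
dm0 = datain[neighp[i][0]]
neighp[i][0] = local displacement + X_MINUS (neighbour site on the remote node)
neighp[i][0] = local displacement (neighbour site on the local node)
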
21
Example Monte Carlo Pi Calculation

Estimate Pi by throwing darts at a unit square
Calculate the fraction that falls inside the unit circle
Area of square = r^2 = 1
Area of circle quadrant = ¼ π r^2 = π/4
Randomly throw darts at (x, y) positions
If x^2 + y^2 < 1, then the point is inside the circle
Compute ratio = points inside / points total
π ≈ 4 × ratio
Replicate the calculation on N nodes in parallel to gain better statistics

22
Example Monte Carlo Pi Calculation: C + OpenMP Code

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include "omp.h"

/* Throw one dart; return 1 if it lands inside the unit circle.
   Declared static because a plain C99 "inline" definition may fail to link. */
static inline int hit(void)
{
    double x = (double) rand() / (double) RAND_MAX;
    double y = (double) rand() / (double) RAND_MAX;
    if ((x*x + y*y) < 1.0) return(1); else return(0);
}

#define FIRST_SEED 3374

int main(int argc, char *argv[])
{
    int i, hits = 0, trials = 0;
    int seeds_index = 0;
    const int max_threads = omp_get_max_threads();
    unsigned int seeds[max_threads];
    double pi;

    printf("MAX_THREADS = %d\n", max_threads);
    if (argc != 2) trials = 1000000; else trials = atoi(argv[1]);

    srand(FIRST_SEED);
    for (i = 0; i < max_threads; i++) {   /* decorrelate the seeds */
        seeds[i] = rand();
        printf("seed[%d]=%u\n", i, seeds[i]);
    }
    #pragma omp parallel private(i, seeds_index) shared(seeds, hits, trials)
    {
        seeds_index = omp_get_thread_num();
        srand(seeds[seeds_index]);        /* one seed per thread */
        #pragma omp for reduction(+:hits)
        for (i = 0; i < trials; i++)
            hits += hit();
    }
    pi = 4.0 * (double)hits / (double)trials;
    printf("PI estimated to %.10g\n", pi);
    return 0;
}

23
Example Monte Carlo Pi Calculation: apeNEXT C Code

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <sysvars.h>
#include <topology.h>

#define FIRST_SEED 3374   /* not shown on this slide; same value as in the OpenMP version */

inline int hit()
{
    double x = (double) rand() / (double) RAND_MAX;
    double y = (double) rand() / (double) RAND_MAX;
    if ((x*x + y*y) < 1.0) return(1); else return(0);
}

int main(int argc, char *argv[])
{
    int i, hits = 0, trials = 0;
    int seeds_index = 0;
    /* one "thread" per node of the machine partition */
    const int max_threads = _mem_i[machine_size_x_p] *
                            _mem_i[machine_size_y_p] *
                            _mem_i[machine_size_z_p];
    const int node_index = _mem_i[node_abs_id_p];
    unsigned int seeds[max_threads];
    double pi;

    printf("MAX_THREADS = %d\n", max_threads);
    if (argc != 2) trials = 1000000; else trials = atoi(argv[1]);

    srand(FIRST_SEED);
    for (i = 0; i < max_threads; i++) {
        seeds[i] = rand();
        printf("seed[%d]=%d\n", i, seeds[i]);
    }
    srand(seeds[node_index]);        /* each node picks its own seed */
    for (i = 0; i < trials; i++)
        hits += hit();

    hits = global_sum(hits);         /* sum the hits over all nodes */
    trials *= max_threads;
    pi = 4.0 * (double)hits / (double)trials;
    printf("PI estimated to %.10g\n", pi);
    return 0;
}

24
Example Monte Carlo Pi Calculation: Results - Intel P4 Dual Core

lonardo@marlin> env OMP_NUM_THREADS=16 ./monte_pi-gcc.o
MAX_THREADS = 16
seed[0]=1396293760
seed[1]=1488115307
seed[2]=1303873515
seed[3]=37393359
seed[4]=824846176
seed[5]=1138759395
seed[6]=1184683763
seed[7]=1884735975
seed[8]=443160774
seed[9]=326610858
seed[10]=878347714
seed[11]=501308535
seed[12]=1066424433
seed[13]=1420631951
seed[14]=391631339
seed[15]=1730610200
PI estimated to 3.14108

25
Example Monte Carlo Pi Calculation: Results - apeNEXT Board (16 Nodes)

lonardo@ant> nrun -hib -board 033 -minit0 monte-api.mem
MAX_THREADS = 16
seed[0]=6556077425992558173
seed[1]=4923530068770806084
seed[2]=4637196908100545377
seed[3]=6221712952809700854
seed[4]=279065984179923185
seed[5]=7751953660738243840
seed[6]=7614450982016732205
seed[7]=1120288809807653798
seed[8]=4640801604175907269
seed[9]=4885633457180056444
seed[10]=905770433927994553
seed[11]=1598073754810041858
seed[12]=7232028785291230425
seed[13]=6726612558212505416
seed[14]=3567338195430110971
seed[15]=5194800804163472670
PI estimated to 3.13989775

26
apeNEXT Compilation Chain

rtc: TAO compiler
Retargetable TAO Compiler: produces an intermediate pseudo-assembly file, which is further translated into assembly for APEmille or apeNEXT.
Based on the Zz dynamic parser.
Relies on a separate module for assembly code optimizations.
Stable, production-quality compiler

27
apeNEXT Compilation Chain

nlcc: C compiler
Port of the lcc 4.2 compiler to the apeNEXT architecture.
Few optimizations.
C99 plus apeNEXT syntax extensions
Low bug-report rate.

28
apeNEXT Compilation Chain

ngcc: C compiler
Port of the GNU C compiler (GCC) to the apeNEXT architecture
Based on gcc version 4.1
Optimization passes are performed on the compiler's internal representations of the code (tree-SSA, RTL)
Source language: C99 and GNU extensions to C99, plus apeNEXT extensions for parallel programming
Possibility to integrate front ends for other source languages (C++, Fortran, TAO)
Target language: apeNEXT user-level assembly (SASM)

29
apeNEXT Compilation Chain

ngcc status
Single-node C compiler: DONE
Vector data types and arithmetic: ALMOST DONE
Exploitation of native complex types and arithmetic: TO DO
Remote memory accesses implementation: DONE
Prefetch instructions: ALMOST DONE
Cache handling: TO DO
where(), any(), all(), none() constructs: TO DO
libc adaptation: JUST STARTED
Work in progress

30
apeNEXT Compilation Chain

mpp: macro-assembler
Translates a user-friendly assembly into a micro-assembly representation:
macro expansion
label analysis
emission of masm instructions for cache handling

31
apeNEXT Compilation Chain

sofan: micro-assembly optimizer
Based on the SALTO (INRIA) optimization toolkit
Transforms the micro-assembly code to perform a series of optimizations, such as:
mul-add fusion
dead code removal
copy propagation
address generation optimization
instruction pre-scheduling

32
apeNEXT Compilation Chain

shaker: microcode scheduler
Generates optimized microcode that exploits the pipelined Very Long Instruction Word processor architecture:
scheduling
register renaming
register allocation
microcode compression
Optional generation of an executable for the functional simulator

33
apeNEXT Compilation Chain: shaker microcode scheduler

Generation of microcode patterns: texec = tmax
Shake-up phase: try to schedule each pattern as early as possible, respecting
dependencies between instructions
device occupation at each cycle
After shake up: texec = tsu

[Figure: occupation of DEVICES versus CYCLES for the numbered patterns 0-5, before and after shake up; the schedule length shrinks from tmax to tsu.]

34
apeNEXT Compilation Chain: shaker microcode scheduler

Shake-down phase: try to schedule each pattern as late as possible, respecting
dependencies between instructions
device occupation at each cycle
After shake down: texec = tsu - tsd
Typically tmax / texec is about 10 in compute-intensive code sections

[Figure: occupation of DEVICES versus CYCLES for the numbered patterns 0-5, before and after shake down; labels tsd and tsu.]

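
The shake-up idea can be sketched as a greedy "as early as possible" placement under the two constraints listed on the slides: dependencies between instructions and device occupation at each cycle. The code below is a toy model and not the actual shaker algorithm. The `pattern_t` layout, the one-cycle device occupation, the at-most-two dependencies and the array bounds are all assumptions made for this illustration.

```c
#include <assert.h>

#define MAX_CYCLES  256
#define MAX_DEVICES 8

/* One microcode pattern in this toy model (hypothetical layout):
   it occupies one device for one cycle, and its result is available
   'latency' cycles after issue. dep0/dep1 index earlier patterns,
   or are -1 when unused. */
typedef struct {
    int device;
    int latency;
    int dep0, dep1;
} pattern_t;

/* Greedy "as early as possible" placement, in program order.
   Writes the issue cycle of each pattern to start[] and returns the
   total execution time in cycles. */
static int shake_up(const pattern_t *p, int n, int start[])
{
    unsigned char busy[MAX_DEVICES][MAX_CYCLES] = {{0}};
    int t_exec = 0;

    for (int i = 0; i < n; i++) {
        int deps[2] = { p[i].dep0, p[i].dep1 };
        int t = 0;

        assert(p[i].device >= 0 && p[i].device < MAX_DEVICES);

        /* dependencies: wait until every operand has been produced */
        for (int k = 0; k < 2; k++) {
            int d = deps[k];
            if (d >= 0) {
                assert(d < i);
                int ready = start[d] + p[d].latency;
                if (ready > t)
                    t = ready;
            }
        }

        /* device occupation: take the first free slot from t onwards */
        while (t < MAX_CYCLES && busy[p[i].device][t])
            t++;
        assert(t < MAX_CYCLES);

        busy[p[i].device][t] = 1;
        start[i] = t;
        if (t + p[i].latency > t_exec)
            t_exec = t + p[i].latency;
    }
    return t_exec;
}
```

A shake-down pass would be the mirror image: walk the patterns in reverse order and push each one as late as its consumers and the device occupation allow.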
35
apeNEXT Compilation Chain

sf: functional simulator
Micro-assembly instruction-level simulator.
Supports single-node and multi-node simulations (1x1x1, 2x2x2, 4x2x2).
Fast simulation (multithreaded)
Not cycle accurate.
Bit-exact arithmetic (microcode scheduling may give differences).

36
apeNEXT Execution Environment: OS distributed architecture (1)

7thLink:
Program loading
I/O operations
1 channel per unit
200 MB/s per channel

I2C: bootstrap, exception handling, debugging (1.5 MB/s)

37
apeNEXT Execution Environment: OS distributed architecture (2)

Master
  Resides on the front-end Linux PC
  User interface (shell commands)
  Partitioning
  Dispatches I/O requests to the slaves
Slave
  Resides on the blade PCs
  Handles communication with apeNEXT over the I2C and 7thLink PCI boards
Tiny kernel of routines embedded in the apeNEXT program
  Loader
  I/O (routing of data to and from the interface node)
  System services (time counters, etc.)

38
apeNEXT Execution Environment

Programs can be loaded and executed on a machine partition:
node (1x1x1)
board: 16 nodes (4x2x2)
unit: 4 boards (4x2x8)
crate: 4 units (4x8x8)
rack: 2 crates (8x8x8)
The partition is reserved until the program execution finishes (no multitasking!)
Single process
No virtual memory
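
The node counts follow directly from the partition geometries. A small C sketch tabulates them and multiplies out the dimensions; the table `partitions` and the function `partition_nodes` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Partition geometries as listed on the slide (hypothetical table). */
typedef struct { const char *name; int nx, ny, nz; } partition_t;

static const partition_t partitions[] = {
    { "node",  1, 1, 1 },
    { "board", 4, 2, 2 },
    { "unit",  4, 2, 8 },
    { "crate", 4, 8, 8 },
    { "rack",  8, 8, 8 },
};

/* Number of nodes in the named partition, or -1 if the name is unknown. */
static int partition_nodes(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof partitions / sizeof partitions[0]; i++)
        if (strcmp(partitions[i].name, name) == 0)
            return partitions[i].nx * partitions[i].ny * partitions[i].nz;
    return -1;
}
```

This gives 16 nodes per board, 64 per unit, 256 per crate and 512 per rack, consistent with the "4 boards", "4 units" and "2 crates" multipliers.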

39
Batch system

Torque/OpenPBS
Currently FIFO scheduling; a user-group quota-based scheduler is being implemented.
Queues: rack, crate, unit

40
Batch System: Job Submission

nsub: wrapper of the qsub command

Usage: nsub [OPTIONS] script
Submits an apeNEXT job, where OPTIONS are:
  -a date_time    declares the time after which the job is eligible for execution; the format is CCYYMMDDhhmm.SS
  -c conf         chooses among available apeNEXT configurations: board, unit, unit010-3, crate, crate01, rack (default: crate)
  -m host_name    requests a particular host
  -g group_name   overrides user group
  -o logfile      overrides logfile name
  -V              dumps version information
  -v              be verbose
  -h              shows this help

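
A date_time argument in the CCYYMMDDhhmm.SS layout can be produced with the standard C `strftime` function. The sketch below assumes that the layout maps onto the conversion string `%Y%m%d%H%M.%S`; the helper name `format_job_time` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Format *t as CCYYMMDDhhmm.SS into buf. Returns the number of
   characters written (15 on success), or 0 if buf is too small. */
static size_t format_job_time(const struct tm *t, char *buf, size_t len)
{
    assert(t != NULL && buf != NULL);
    return strftime(buf, len, "%Y%m%d%H%M.%S", t);
}
```

For a `struct tm` describing 4 June 2007 at 13:30:00 (an arbitrary example date), the buffer would hold `200706041330.00`, which could then be passed after `-a`.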
41
Batch System: Job submission example

lonardo@theboss> nsub -c crate 7h_test.sh
15942.theboss.ape
lonardo@theboss> qstat -an1

theboss.ape:
Job ID             Username  Queue  Jobname     SessID  NDS  TSK  Req'd Memory  Req'd Time  S  Elap Time
-----------------  --------  -----  ----------  ------  ---  ---  ------------  ----------  -  ---------
15813.theboss.ape  orifici   crate  stoc2       1826    1    --   --            24:00       R  21:52      rack10/1
15860.theboss.ape  simula    crate  run_tdilu   29170   1    --   --            24:00       R  14:34      rack4/1
15877.theboss.ape  zeidlew   crate  mu.056.cjo  5925    1    --   --            24:00       R  11:38      rack8/0
15880.theboss.ape  delia     crate  run.sh      6386    1    --   --            24:00       R  10:59      rack8/1
15896.theboss.ape  delia     unit   rum0175.sh  32291   1    --   --            24:00       R  08:47      rack7/5
15900.theboss.ape  frezzott  rack   RUN_Rack5.  18099   1    --   --            24:00       R  07:56      rack5/0
15906.theboss.ape  delia     unit   run0175.sh  1072    1    --   --            24:00       R  06:49      rack7/0
15918.theboss.ape  frezzott  rack   RUN_Rack2.  15890   1    --   --            24:00       R  04:30      rack2/0
15926.theboss.ape  simula    crate  run1_tdilu  4409    1    --   --            24:00       R  02:47      rack4/0
15927.theboss.ape  delia     unit   run0200.sh  3596    1    --   --            24:00       R  02:47      rack7/4
15928.theboss.ape  lacagnin  crate  run.5.7.sh  4772    1    --   --            24:00       R  02:36      rack1/0
15930.theboss.ape  lacagnin  crate  run.5.6.sh  2787    1    --   --            24:00       R  02:34      rack9/0
15932.theboss.ape  cosmai    crate  b5.450_n0.  2994    1    --   --            24:00       R  02:02      rack9/1
15933.theboss.ape  delia     unit   rum0200.sh  4216    1    --   --            24:00       R  01:53      rack7/2
15934.theboss.ape  delia     unit   run0225.sh  4552    1    --   --            24:00       R  01:24      rack7/1
15935.theboss.ape  devitiis  crate  theboss.sh  28504   1    --   --            24:00       R  01:22      rack3/0
15936.theboss.ape  orifici   crate  stoc        28592   1    --   --            24:00       R  00:57      rack3/1
15937.theboss.ape  devitiis  crate  theboss.sh  18065   1    --   --            24:00       R  00:57      rack6/0
15939.theboss.ape  cosmai    crate  b5.450_n1.  5845    1    --   --            24:00       R  00:54      rack1/1
15940.theboss.ape  devitiis  crate  theboss.sh  18472   1    --   --            24:00       R  00:22      rack6/1
15941.theboss.ape  delia     unit   rum0225.sh  5290    1    --   --            24:00       R  00:14      rack7/3
15942.theboss.ape  lonardo   crate  7h_test.sh  11226   1    --   --            24:00       R  --         rack10/0
