Title: The apeNEXT Software Environment: Application Development and Execution
1. The apeNEXT Software Environment: Application Development and Execution
- Alessandro Lonardo
- INFN Roma - APE group
- alessandro.lonardo_at_roma1.infn.it
- http://apegate.roma1.infn.it/APE
2. Index
- Machine Architecture
- Software Areas
- Programming Model
- Languages
- Example Applications
- Development Tools
- Execution Environment
3. apeNEXT Architecture: The Network
- 3D mesh of computing nodes
- Vertices are processors
- Each processor hosts its own local memory
- Each processor supports 64-bit complex, vector2, double and integer types
- Edges are 3D torus network channels
- 6 bidirectional channels per processor
- Basic communication primitive is first-neighbour send-recv
- Processors synchronize on communications (send starts when recv is issued)
4. apeNEXT Architecture: The JT Processor
5. apeNEXT Architecture: Very Long Instruction Word
6. apeNEXT Architecture: The JT FILU
- FILU is the FP, Integer and Logical Unit
- MAC operation: A x B + C
- fully pipelined (1 result per cycle)
- 12 cycles latency
- synthesizes to 200 MHz
- 4 multipliers
- 4 adders
- 1.6 GFlops on complex MAC
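The peak figure can be checked by hand. A minimal worked derivation, assuming the MAC operands are complex numbers whose real and imaginary parts are handled by the 4 multipliers and 4 adders, and that one MAC completes per cycle at 200 MHz:

```latex
(a_r + i a_i)(b_r + i b_i) + (c_r + i c_i)
  = (a_r b_r - a_i b_i + c_r) + i\,(a_r b_i + a_i b_r + c_i)
\quad\Rightarrow\quad 4\ \text{mul} + 4\ \text{add} = 8\ \text{flop}

8\ \text{flop/cycle} \times 200 \times 10^{6}\ \text{cycles/s}
  = 1.6 \times 10^{9}\ \text{flop/s}
```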
7. apeNEXT Software: Areas
- Architecture design, development and validation: simulators, no regression tools, ...
- Application development: compilation chain, libraries, profiler
- Execution environment: operating system, batch system, ...
- Applications.
- System administration.
8. apeNEXT SW development team
- On average 5 people
- People at all the collaboration sites
- INFN Roma, Ferrara
- DESY Zeuthen
- Univ. Bielefeld
- INRIA (France)
9. apeNEXT Programming Model
- Single Program Multiple Data: each node executes the same program, but on its own data.
- Synchronization barriers at global condition evaluations, with an explicit statement, or at I/O operations.
- Node-to-node synchronization at remote communications.
- Nodes are connected by a 3D network; each node can efficiently transfer data with its first, second and third neighbour.
- => Well suited for homogeneous problems with short-range interactions.
10. apeNEXT Programming Model: Data Decomposition (1)
- Application discretized on a D-dim lattice: the domain is decomposed onto a 3-dim processor mesh (possibly D != 3)
- => Each node has a subset of the lattice sites in its own memory
- In other cases no decomposition is done: the simulation is replicated in parallel, without communications, just to improve the statistics (FARM).
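The decomposition along one axis can be sketched as follows. This is a minimal illustration, assuming a block decomposition in which the lattice extent is an exact multiple of the number of nodes; `site_loc` and `locate_site` are hypothetical names, not part of any apeNEXT library.

```c
#include <assert.h>

/* Hypothetical helper type: owner node and node-local index of a site. */
typedef struct {
    int node;   /* coordinate of the owning node along this axis */
    int local;  /* site coordinate inside that node's local domain */
} site_loc;

/* Block decomposition along one axis.
 * global       : site coordinate in [0, lattice_size)
 * lattice_size : global lattice extent along this axis
 * n_nodes      : number of nodes along this axis
 * Assumes lattice_size is a multiple of n_nodes. */
static site_loc locate_site(int global, int lattice_size, int n_nodes)
{
    assert(n_nodes > 0 && lattice_size % n_nodes == 0);
    assert(global >= 0 && global < lattice_size);

    int local_size = lattice_size / n_nodes;  /* sites per node */
    site_loc loc = { global / local_size, global % local_size };
    return loc;
}
```

Applying the same mapping independently along each of the three mesh directions gives the owner node and local index of a lattice site; any lattice dimension beyond the third would stay entirely local to the node.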
11. apeNEXT Programming Model: Data Decomposition (2)
[Figure: lattice distributed over a 2x2 node mesh along x and y; nodes labelled 00, 01, 10, 11]
For each lattice site, and on each node in parallel, the program performs an evolutionary step. Short-range interactions => first-neighbour inter-node communication.
12. apeNEXT Programming Model: Programming Languages (1)
- TAO: dedicated parallel language
- Fortran-like base syntax.
- Dynamic language: the (experienced) programmer can freely extend the syntax with new statements, data types and operators
- => Libraries configure the language for specific application domains (LQCD, Spin Glass, ...)
- Allows writing high-efficiency code by exposing the features of the hardware architecture (registers, prefetch queues, cache)
13. apeNEXT Programming Model: Programming Languages (2)
- C99 language
- Few extensions to the standard language.
- Eases the porting of applications and standard libraries.
- Allows writing high-efficiency code by exposing the features of the hardware architecture (registers, prefetch queues, cache)
14. apeNEXT Programming Model: Parallel Language Constructs (1)
- Few parallel language constructs (the same in C99 and TAO)
- Conditional execution on a subset of nodes, based on local-to-node conditions (where)
- Boolean operators that promote local conditions to global ones, for use in flow control statements (any, all, none).
- Communications between nodes in the 3D mesh are expressed as variable assignments; directions are specified by means of magic constants in the source address (X_PLUS, X_MINUS, ..., Z_MINUS).
15. apeNEXT Programming Model: Parallel Language Constructs (2)
- where statement: conditional execution on a mesh subset.
    where (x > y)
      max_xy = x
      min_xy = y
    elsewhere
      max_xy = y
      min_xy = x
    endwhere
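The effect of where/elsewhere can be emulated on a conventional machine by looping over simulated nodes. This is only a sketch of the semantics, assuming each array element stands for the private copy of a variable on one node; `where_max_min` is a hypothetical name.

```c
#include <assert.h>

/* Emulates `where (x > y) ... elsewhere ... endwhere` on n_nodes
 * simulated nodes. Element k of each array is node k's private copy. */
static void where_max_min(const int *x, const int *y,
                          int *max_xy, int *min_xy, int n_nodes)
{
    assert(n_nodes >= 0);
    for (int node = 0; node < n_nodes; node++) {
        int active = x[node] > y[node];  /* local-to-node condition */
        if (active) {                    /* nodes in the `where` subset */
            max_xy[node] = x[node];
            min_xy[node] = y[node];
        } else {                         /* nodes in the `elsewhere` subset */
            max_xy[node] = y[node];
            min_xy[node] = x[node];
        }
    }
}
```

Every node runs the same statement sequence; only the subset on which each branch takes effect differs from node to node.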
16. apeNEXT Programming Model: Parallel Language Constructs (3)
- Inter-node communications
    integer i
    real u[1024]
    register real rd1, rd2, rd3, rloc
    ...
    rd1 = u[i+Z_PLUS]
    rd2 = u[i+Y_PLUS+Z_PLUS]
    rloc = u[i+X_PLUS+X_MINUS]
    rd3 = u[i+X_PLUS+Y_MINUS+Z_PLUS]
The first assignment loads u[i] from node (x, y, z+1) into rd1.
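How a combination of direction constants selects a source node can be sketched with explicit torus arithmetic. The code below is an illustration under stated assumptions: the mesh wraps around in every direction (3D torus), opposite directions cancel (as the name rloc suggests for X_PLUS combined with X_MINUS), and `coord3`, `wrap` and `torus_neighbour` are hypothetical names rather than apeNEXT system definitions.

```c
#include <assert.h>

/* Hypothetical type for illustration: a node position or a displacement. */
typedef struct { int x, y, z; } coord3;

/* Wrap a coordinate onto a ring of `size` nodes (handles negative values). */
static int wrap(int v, int size)
{
    return ((v % size) + size) % size;
}

/* Node reached from `node` by the displacement `step` on a torus of
 * extent `mesh`. A step of {0, 0, +1} plays the role of Z_PLUS. */
static coord3 torus_neighbour(coord3 node, coord3 step, coord3 mesh)
{
    assert(mesh.x > 0 && mesh.y > 0 && mesh.z > 0);
    coord3 r = { wrap(node.x + step.x, mesh.x),
                 wrap(node.y + step.y, mesh.y),
                 wrap(node.z + step.z, mesh.z) };
    return r;
}
```

With this picture, a zero net displacement returns the node itself, which corresponds to a purely local access.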
17. apeNEXT Programming Model: Parallel Language Constructs (4)
- any()/all()/none() boolean operators
    !! evaluation of mesh size along X
    !! with systolic algorithm
    sum_ix = 1
    sum_r[0] = node_abs_x
    sum_r[0] = sum_r[X_PLUS]    !! internode communication
    while (any(sum_r[0] != node_abs_x))
      sum_ix = sum_ix + 1
      sum_r[0] = sum_r[X_PLUS]
    endwhile
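The systolic loop above can be simulated on a single machine to see why it terminates with the mesh size. This is a sketch, assuming a ring of nodes along X in which every node simultaneously reads its X_PLUS neighbour's value; `MAX_NODES`, `shift_from_x_plus` and `systolic_mesh_size` are hypothetical names.

```c
#include <assert.h>

#define MAX_NODES 64  /* arbitrary limit for this simulation */

/* One systolic step: every node takes the value held by its X_PLUS
 * neighbour (the last node wraps around to the first). */
static void shift_from_x_plus(int *v, int n)
{
    int first = v[0];
    for (int x = 0; x < n - 1; x++)
        v[x] = v[x + 1];
    v[n - 1] = first;
}

/* Simulates the algorithm on a ring of n_nodes nodes and returns the
 * value of sum_ix, which is the same on every node at the end. */
static int systolic_mesh_size(int n_nodes)
{
    assert(n_nodes > 0 && n_nodes <= MAX_NODES);
    int sum_r[MAX_NODES];
    int sum_ix = 1;

    for (int x = 0; x < n_nodes; x++)
        sum_r[x] = x;                 /* sum_r[0] = node_abs_x */
    shift_from_x_plus(sum_r, n_nodes);

    for (;;) {
        int any = 0;                  /* any(sum_r[0] != node_abs_x) */
        for (int x = 0; x < n_nodes; x++)
            any |= (sum_r[x] != x);
        if (!any)
            break;
        sum_ix = sum_ix + 1;
        shift_from_x_plus(sum_r, n_nodes);
    }
    return sum_ix;
}
```

Each node's coordinate travels once around the ring; the global any() turns false exactly when all values are back home, after as many shifts as there are nodes.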
18. Example: 2D application kernel, C function
    T datain[LVOL], dataout[LVOL];          // LVOL is the node local volume
    // precalculated neighbourhood tables
    int neighp[LVOL][2], neighm[LVOL][2];
    ...
    void kernel_fun() {
      register T res, d, dp0, dp1, dm0, dm1;
      int i;
      for (i = 0; i < LVOL; i++) {            // i is a linearized index
        d   = datain[i];                      // always local access
        dp0 = datain[neighp[i][0]];           // local or remote access
        dp1 = datain[neighp[i][1]];
        dm0 = datain[neighm[i][0]];
        dm1 = datain[neighm[i][1]];
        res = calc(d, dp0, dp1, dm0, dm1);    // big inline
        dataout[i] = res;
      }
    }
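The neighbourhood tables used by the kernel can be precalculated once before the main loop. The sketch below is a single-node simplification: it assumes an LX x LY local domain with periodic wrap-around inside the node, so every entry is a plain local index. On the real machine a boundary entry would instead encode a remote access; that encoding is not shown here, and `LX`, `LY` and `build_neighbour_tables` are hypothetical names.

```c
#include <assert.h>

#define LX 4              /* local extent along x (assumed) */
#define LY 4              /* local extent along y (assumed) */
#define LVOL (LX * LY)    /* node local volume */

static int neighp[LVOL][2];  /* forward neighbours: [i][0] in x, [i][1] in y */
static int neighm[LVOL][2];  /* backward neighbours */

/* Fill the tables for the linearized index i = x + LX * y. */
static void build_neighbour_tables(void)
{
    assert(LX > 0 && LY > 0);
    for (int y = 0; y < LY; y++) {
        for (int x = 0; x < LX; x++) {
            int i = x + LX * y;
            neighp[i][0] = (x + 1) % LX + LX * y;
            neighp[i][1] = x + LX * ((y + 1) % LY);
            neighm[i][0] = (x + LX - 1) % LX + LX * y;
            neighm[i][1] = x + LX * ((y + LY - 1) % LY);
        }
    }
}
```

Keeping the geometry in tables lets the kernel loop stay a flat sweep over i, with no coordinate arithmetic in the hot path.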
19. Example: 2D application kernel, domain decomposition
[Figure: domain decomposition of the datain array over the x-y node mesh. Node (x, y) holds its local domain datain[LVOL]; the boundary of the local domain is marked.]
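20. Example: 2D application kernel, First-Neighbour Systolic Communication
[Figure: 2x2 node mesh along x and y; nodes labelled 00, 10, 01, 11]
dm0 = datain[neighp[i][0]], where the table entry is either
- neighp[i][0] = local displacement + X_MINUS, or
- neighp[i][0] = local displacement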
21. Example: Monte Carlo Pi Calculation
- Estimate Pi by throwing darts at a unit square
- Calculate the fraction that falls inside the unit circle
- Area of square: r^2 = 1
- Area of circle quadrant: (1/4) π r^2 = π/4
- Randomly throw darts at (x, y) positions
- If x^2 + y^2 < 1, then the point is inside the circle
- Compute ratio = points inside / points total
- π = 4 × ratio
- Replicate the calculation on N nodes in parallel to have better statistics
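Why replication helps can be made quantitative. A short derivation, assuming the darts are independent and uniformly distributed, so that the number of hits is binomial with success probability p = π/4; N_tot denotes the total number of darts over all nodes:

```latex
\hat{\pi} = 4\,\frac{N_\text{in}}{N_\text{tot}}, \qquad
\sigma_{\hat{\pi}} = 4\sqrt{\frac{p\,(1-p)}{N_\text{tot}}}
  = \sqrt{\frac{\pi\,(4-\pi)}{N_\text{tot}}}
  \approx \frac{1.64}{\sqrt{N_\text{tot}}}, \qquad p = \frac{\pi}{4}
```

Running the same number of trials on N nodes multiplies N_tot by N and so shrinks the statistical error by a factor of √N, provided the per-node random streams are uncorrelated.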
22. Example: Monte Carlo Pi Calculation, C + OpenMP Code
    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include "omp.h"

    /* one dart: returns 1 if it lands inside the circle quadrant */
    static inline int hit() {
      double x = (double) rand() / (double) RAND_MAX;
      double y = (double) rand() / (double) RAND_MAX;
      if ((x*x + y*y) < 1.0) return(1); else return(0);
    }

    #define FIRST_SEED 3374

    int main(int argc, char *argv[]) {
      int i, hits = 0, trials = 0;
      int seeds_index = 0;
      const int max_threads = omp_get_max_threads();
      unsigned int seeds[max_threads];
      double pi;

      printf("MAX_THREADS %d\n", max_threads);
      if (argc != 2) trials = 1000000;
      else trials = atoi(argv[1]);

      srand(FIRST_SEED);
      for (i = 0; i < max_threads; i++) {   /* decorrelate the seeds */
        seeds[i] = rand();
        printf("seed[%d]=%d\n", i, seeds[i]);
      }

      /* note: the C standard does not require rand()/srand() to be thread-safe */
      #pragma omp parallel private(i, seeds_index) shared(seeds, hits, trials)
      {
        seeds_index = omp_get_thread_num();
        srand(seeds[seeds_index]);
        #pragma omp for reduction(+:hits)
        for (i = 0; i < trials; i++) hits += hit();
      }

      pi = 4.0*(double)hits/(double)trials;
      printf("PI estimated to %.10g\n", pi);
      return 0;
    }
23. Example: Monte Carlo Pi Calculation, apeNEXT C Code
    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <sysvars.h>
    #include <topology.h>

    #define FIRST_SEED 3374   /* same value as in the OpenMP version */

    inline int hit() {
      double x = (double) rand() / (double) RAND_MAX;
      double y = (double) rand() / (double) RAND_MAX;
      if ((x*x + y*y) < 1.0) return(1); else return(0);
    }

    int main(int argc, char *argv[]) {
      int i, hits = 0, trials = 0;
      int seeds_index = 0;
      /* one "thread" per node of the partition */
      const int max_threads = _mem_i[machine_size_x_p] *
                              _mem_i[machine_size_y_p] *
                              _mem_i[machine_size_z_p];
      const int node_index = _mem_i[node_abs_id_p];
      unsigned int seeds[max_threads];
      double pi;

      printf("MAX_THREADS %d\n", max_threads);
      if (argc != 2) trials = 1000000;
      else trials = atoi(argv[1]);

      srand(FIRST_SEED);
      for (i = 0; i < max_threads; i++) {
        seeds[i] = rand();
        printf("seed[%d]=%d\n", i, seeds[i]);
      }

      srand(seeds[node_index]);          /* each node uses its own seed */
      for (i = 0; i < trials; i++) hits += hit();

      hits = global_sum(hits);           /* sum of the hits over all nodes */
      trials *= max_threads;
      pi = 4.0*(double)hits/(double)trials;
      printf("PI estimated to %.10g\n", pi);
      return 0;
    }
24. Example: Monte Carlo Pi Calculation, Results - Intel P4 Dual Core
    lonardo@marlin> env OMP_NUM_THREADS=16 ./monte_pi-gcc.o
    MAX_THREADS 16
    seed[0]=1396293760
    seed[1]=1488115307
    seed[2]=1303873515
    seed[3]=37393359
    seed[4]=824846176
    seed[5]=1138759395
    seed[6]=1184683763
    seed[7]=1884735975
    seed[8]=443160774
    seed[9]=326610858
    seed[10]=878347714
    seed[11]=501308535
    seed[12]=1066424433
    seed[13]=1420631951
    seed[14]=391631339
    seed[15]=1730610200
    PI estimated to 3.14108
25. Example: Monte Carlo Pi Calculation, Results - apeNEXT Board (16 Nodes)
    lonardo@ant> nrun -hib -board 033 -minit0 monte-api.mem
    MAX_THREADS 16
    seed[0]=6556077425992558173
    seed[1]=4923530068770806084
    seed[2]=4637196908100545377
    seed[3]=6221712952809700854
    seed[4]=279065984179923185
    seed[5]=7751953660738243840
    seed[6]=7614450982016732205
    seed[7]=1120288809807653798
    seed[8]=4640801604175907269
    seed[9]=4885633457180056444
    seed[10]=905770433927994553
    seed[11]=1598073754810041858
    seed[12]=7232028785291230425
    seed[13]=6726612558212505416
    seed[14]=3567338195430110971
    seed[15]=5194800804163472670
    PI estimated to 3.13989775
26. apeNEXT Compilation Chain
- rtc: TAO compiler
- Retargetable TAO Compiler: produces an intermediate pseudo-assembly file, which is further translated into assembly for APEmille or apeNEXT.
- Based on the Zz dynamic parser.
- Relies on a separate module for assembly code optimizations.
- Stable, production-quality compiler
27. apeNEXT Compilation Chain
- nlcc: C compiler
- Port of the lcc 4.2 compiler to the apeNEXT architecture.
- Few optimizations.
- C99 plus apeNEXT syntax extensions
- Low bug-report rate.
28. apeNEXT Compilation Chain
- ngcc: C compiler
- Port of the GNU C compiler (GCC) to the apeNEXT architecture
- Based on gcc version 4.1
- Optimization passes performed on the compiler's internal representation of the code (tree-SSA, RTL)
- Source language: C99 and GNU extensions to C99, plus apeNEXT extensions for parallel programming
- Possibility to integrate front ends for other source languages (C++, Fortran, TAO)
- Target language: apeNEXT user-level assembly (SASM)
29. apeNEXT Compilation Chain
- ngcc status
- Single node C compiler: DONE
- Vector data types and arithmetic: ALMOST DONE
- Exploitation of native complex types and arithmetic: TO DO
- Remote memory accesses implementation: DONE
- Prefetch instructions: ALMOST DONE
- Cache handling: TO DO
- where(), any(), all(), none() constructs: TO DO
- libc adaptation: JUST STARTED
- Work in progress
30. apeNEXT Compilation Chain
- mpp: macro-assembler
- Translates a user-friendly assembly into a micro-assembly representation
- Macro expansion.
- Label analysis.
- Emission of masm instructions for cache handling.
31. apeNEXT Compilation Chain
- sofan: micro-assembly optimizer
- Based on the SALTO (INRIA) optimization toolkit
- Transforms the micro-assembly code in order to perform a series of optimizations, such as:
- Mul-add fusion
- Dead code removal
- Copy propagation
- Address generation optimization
- Instruction pre-scheduling
32. apeNEXT Compilation Chain
- shaker: microcode scheduler
- Generation of optimized microcode to exploit the pipelined Very Long Instruction Word processor architecture:
- Scheduling
- Register renaming
- Register allocation
- Microcode compression
- Optional generation of an executable for the functional simulator
33. apeNEXT Compilation Chain: shaker microcode scheduler
- Generation of microcode patterns: texec = tmax
- Shake-up phase: try to schedule each pattern as early as possible, respecting
- dependencies between instructions
- device occupation at each cycle
- => texec = tsu
[Figure: shake-up phase. Microcode patterns 0 to 5 laid out on a DEVICES x CYCLES grid, before and after shake up; the schedule length shrinks from tmax to tsu.]
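The shake-up idea, placing each pattern at the earliest cycle allowed by its dependencies and by device occupation, can be sketched as a greedy as-soon-as-possible scheduler. This is a deliberately simplified model and not the shaker implementation: it assumes every pattern occupies a single device for a single cycle and depends on at most one earlier pattern. The names `pattern`, `shake_up`, `MAX_DEVICES` and `MAX_CYCLES` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEVICES 8   /* arbitrary limits for this sketch */
#define MAX_CYCLES  64

typedef struct {
    int device;  /* device (functional unit) the pattern occupies */
    int dep;     /* index of an earlier pattern it depends on, or -1 */
} pattern;

/* Schedules n patterns, given in program order, as early as possible.
 * cycle[i] receives the cycle assigned to pattern i.
 * Returns the schedule length (t_su in the slides' notation). */
static int shake_up(const pattern *p, int n, int *cycle)
{
    unsigned char busy[MAX_DEVICES][MAX_CYCLES];
    int t_su = 0;

    memset(busy, 0, sizeof busy);
    for (int i = 0; i < n; i++) {
        assert(p[i].device >= 0 && p[i].device < MAX_DEVICES);
        assert(p[i].dep < i);

        /* earliest cycle allowed by the dependency */
        int t = (p[i].dep >= 0) ? cycle[p[i].dep] + 1 : 0;

        /* then the first cycle at which the device is free */
        while (t < MAX_CYCLES && busy[p[i].device][t])
            t++;
        assert(t < MAX_CYCLES);

        busy[p[i].device][t] = 1;
        cycle[i] = t;
        if (t + 1 > t_su)
            t_su = t + 1;
    }
    return t_su;
}
```

For five patterns issued one per cycle (tmax = 5), where the third depends on the first and the fifth on the third, this sketch packs the schedule into 3 cycles. A shake-down pass would apply the same rule in the opposite direction, pushing each pattern as late as possible.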
34. apeNEXT Compilation Chain: shaker microcode scheduler
- Shake-down phase: try to schedule each pattern as late as possible, respecting
- dependencies between instructions
- device occupation at each cycle
- => texec = tsu - tsd
- Typically tmax / texec ~ 10 in compute-intensive code sections
[Figure: shake-down phase. The patterns on the DEVICES x CYCLES grid are pushed towards the end of the schedule, freeing the first tsd cycles out of tsu.]
35. apeNEXT Compilation Chain
- sf: functional simulator
- Micro-assembly instruction-level simulator.
- Support for single-node and multi-node simulations (1x1x1, 2x2x2, 4x2x2).
- Fast simulation (multithreaded)
- Not cycle accurate.
- Bit-exact arithmetic (microcode scheduling may give differences).
36. apeNEXT Execution Environment: OS distributed architecture (1)
- 7thLink
- Program loading
- I/O operations
- 1 channel per unit
- 200 MB/s per channel
- I2C: bootstrap, exception handling, debugging (1.5 MB/s)
37. apeNEXT Execution Environment: OS distributed architecture (2)
- Master
  - resides on the front-end Linux PC
  - user interface (shell commands)
  - partitioning
  - dispatches I/O requests to the slaves
- Slave
  - resides on the blade PCs
  - handles communication with apeNEXT on I2C and 7thLink
  - PCI boards
- Tiny kernel of routines embedded in the apeNEXT program
  - loader
  - I/O (routing of data to and from the interface node)
  - system services (time counters, etc.)
38. apeNEXT Execution Environment
- Programs can be loaded and executed on a machine partition:
- node (1x1x1)
- board: 16 nodes (4x2x2)
- unit: 4 boards (4x2x8)
- crate: 4 units (4x8x8)
- rack: 2 crates (8x8x8)
- The partition is reserved until program execution finishes (no multitasking!)
- Single process
- No virtual memory
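The partition sizes listed above are mutually consistent, which a few lines of C can confirm. The table below only restates the mesh shapes from the slide; the type and function names (`partition`, `partitions`, `partition_nodes`) are hypothetical.

```c
#include <assert.h>

/* Hypothetical descriptor of a machine partition: name and mesh shape. */
typedef struct {
    const char *name;
    int nx, ny, nz;
} partition;

/* Mesh shapes as listed on the slide. */
static const partition partitions[] = {
    { "node",  1, 1, 1 },
    { "board", 4, 2, 2 },
    { "unit",  4, 2, 8 },
    { "crate", 4, 8, 8 },
    { "rack",  8, 8, 8 },
};

/* Number of nodes in a partition. */
static int partition_nodes(const partition *p)
{
    assert(p->nx > 0 && p->ny > 0 && p->nz > 0);
    return p->nx * p->ny * p->nz;
}
```

The products give 16, 64, 256 and 512 nodes for board, unit, crate and rack, matching the "4 boards", "4 units" and "2 crates" ratios.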
39. Batch system
- Torque/OpenPBS
- Today: FIFO scheduling; a user-group quota-based scheduler is being implemented.
- Queues:
- rack
- crate
- unit
40. Batch System: Job Submission
- nsub: wrapper of the qsub command
    Usage: nsub [OPTIONS] script
    Submits an apeNEXT job, where OPTIONS are:
      -a date_time    Declares the time after which the job is eligible for
                      execution; the format is [[[[CC]YY]MM]DD]hhmm[.SS]
      -c conf         chooses among available apeNEXT configurations:
                      conf = board, unit, unit010-3, crate, crate01, rack
                      (default: crate)
      -m host_name    requests a particular host
      -g group_name   overrides user group
      -o logfile      overrides logfile name
      -V              dumps version information
      -v              be verbose
      -h              shows this help
41. Batch System: Job submission example
    lonardo@theboss> nsub -c crate 7h_test.sh
    15942.theboss.ape
    lonardo@theboss> qstat -an1

    theboss.ape:
                                                                        Req'd Req'd    Elap
    Job ID               Username Queue    Jobname    SessID   NDS TSK Memory Time  S Time
    -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
    15813.theboss.ape    orifici  crate    stoc2        1826     1  --     -- 24:00 R 21:52 rack10/1
    15860.theboss.ape    simula   crate    run_tdilu   29170     1  --     -- 24:00 R 14:34 rack4/1
    15877.theboss.ape    zeidlew  crate    mu.056.cjo   5925     1  --     -- 24:00 R 11:38 rack8/0
    15880.theboss.ape    delia    crate    run.sh       6386     1  --     -- 24:00 R 10:59 rack8/1
    15896.theboss.ape    delia    unit     rum0175.sh  32291     1  --     -- 24:00 R 08:47 rack7/5
    15900.theboss.ape    frezzott rack     RUN_Rack5.  18099     1  --     -- 24:00 R 07:56 rack5/0
    15906.theboss.ape    delia    unit     run0175.sh   1072     1  --     -- 24:00 R 06:49 rack7/0
    15918.theboss.ape    frezzott rack     RUN_Rack2.  15890     1  --     -- 24:00 R 04:30 rack2/0
    15926.theboss.ape    simula   crate    run1_tdilu   4409     1  --     -- 24:00 R 02:47 rack4/0
    15927.theboss.ape    delia    unit     run0200.sh   3596     1  --     -- 24:00 R 02:47 rack7/4
    15928.theboss.ape    lacagnin crate    run.5.7.sh   4772     1  --     -- 24:00 R 02:36 rack1/0
    15930.theboss.ape    lacagnin crate    run.5.6.sh   2787     1  --     -- 24:00 R 02:34 rack9/0
    15932.theboss.ape    cosmai   crate    b5.450_n0.   2994     1  --     -- 24:00 R 02:02 rack9/1
    15933.theboss.ape    delia    unit     rum0200.sh   4216     1  --     -- 24:00 R 01:53 rack7/2
    15934.theboss.ape    delia    unit     run0225.sh   4552     1  --     -- 24:00 R 01:24 rack7/1
    15935.theboss.ape    devitiis crate    theboss.sh  28504     1  --     -- 24:00 R 01:22 rack3/0
    15936.theboss.ape    orifici  crate    stoc        28592     1  --     -- 24:00 R 00:57 rack3/1
    15937.theboss.ape    devitiis crate    theboss.sh  18065     1  --     -- 24:00 R 00:57 rack6/0
    15939.theboss.ape    cosmai   crate    b5.450_n1.   5845     1  --     -- 24:00 R 00:54 rack1/1
    15940.theboss.ape    devitiis crate    theboss.sh  18472     1  --     -- 24:00 R 00:22 rack6/1
    15941.theboss.ape    delia    unit     rum0225.sh   5290     1  --     -- 24:00 R 00:14 rack7/3
    15942.theboss.ape    lonardo  crate    7h_test.sh  11226     1  --     -- 24:00 R    -- rack10/0