The apeNEXT software environment: application development and execution

1
The apeNEXT software environment: application
development and execution
  • Alessandro Lonardo
  • INFN Roma, APE group
  • alessandro.lonardo_at_roma1.infn.it
  • http://apegate.roma1.infn.it/APE

2
Index
  • Machine Architecture
  • Software Areas
  • Programming Model
  • Languages
  • Example Applications
  • Development Tools
  • Execution Environment

3
apeNEXT Architecture: The Network
  • 3D mesh of computing nodes
  • Vertices are processors
  • Each processor hosts its local memory
  • Each processor supports 64-bit complex, vector2,
    double and integer types
  • Edges are 3D torus network channels
  • 6 bidirectional channels per processor
  • Basic communication primitive is first-neighbour
    send-recv
  • Processors synchronize on communications (send
    starts when recv is issued)
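
To make the torus topology concrete, here is a minimal C sketch of how the coordinates of a first neighbour wrap around on an nx x ny x nz torus. The type `node_coord`, the `direction` enum and the function `torus_neighbour` are hypothetical names chosen for illustration; they are not part of the apeNEXT toolchain, where this addressing is done by the hardware.

```c
#include <assert.h>

/* Hypothetical helper types for illustration only. */
typedef struct { int x, y, z; } node_coord;
typedef enum { X_PLUS, X_MINUS, Y_PLUS, Y_MINUS, Z_PLUS, Z_MINUS } direction;

/* Wrap c into [0, size): stepping off one face re-enters from the opposite face. */
static int wrap(int c, int size) {
    return ((c % size) + size) % size;
}

/* Coordinates of the first neighbour of node n along direction d. */
node_coord torus_neighbour(node_coord n, direction d, int nx, int ny, int nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    switch (d) {
    case X_PLUS:  n.x = wrap(n.x + 1, nx); break;
    case X_MINUS: n.x = wrap(n.x - 1, nx); break;
    case Y_PLUS:  n.y = wrap(n.y + 1, ny); break;
    case Y_MINUS: n.y = wrap(n.y - 1, ny); break;
    case Z_PLUS:  n.z = wrap(n.z + 1, nz); break;
    case Z_MINUS: n.z = wrap(n.z - 1, nz); break;
    }
    return n;
}
```

Each node has exactly six such neighbours, one per bidirectional channel.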

4
apeNEXT Architecture: The JT Processor
5
apeNEXT Architecture: Very Long Instruction Word
6
apeNEXT Architecture: The JT FILU
  • FILU is the floating-point, integer and logical unit
  • MAC operation: A × B + C
  • fully pipelined (1 result per cycle)
  • 12 cycles latency
  • synthesizes to 200 MHz
  • 4 multipliers
  • 4 adders
  • 1.6 GFlops on complex MAC
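
The 1.6 GFlops figure follows from the numbers on this slide: a complex multiply-accumulate needs 4 real multiplications and 4 real additions, and the pipelined unit completes one such operation per cycle at 200 MHz. A small sketch of that arithmetic, with `peak_mflops` as a hypothetical helper name:

```c
#include <assert.h>

/* Peak rate in MFlops: floating-point operations per cycle times clock in MHz. */
long peak_mflops(int multipliers, int adders, long clock_mhz) {
    assert(multipliers >= 0 && adders >= 0 && clock_mhz > 0);
    return (long)(multipliers + adders) * clock_mhz;
}
```

With 4 multipliers, 4 adders and a 200 MHz clock this gives 1600 MFlops, i.e. 1.6 GFlops.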

7
apeNEXT Software Areas
  • Architecture design, development and validation:
    simulators, no regression tools, ...
  • Application development: compilation chain,
    libraries, profiler
  • Execution environment: operating system, batch
    system, ...
  • Applications.
  • System administration.

8
apeNEXT SW development team
  • About 5 people on average
  • People at all the collaboration sites
  • INFN Roma, Ferrara
  • DESY Zeuthen
  • Univ. Bielefeld
  • INRIA (France)

9
apeNEXT Programming Model
  • Single Program Multiple Data (SPMD): each node
    executes the same program, but on its own data.
  • Synchronization barriers at global condition
    evaluations, with an explicit statement, or at I/O
    operations.
  • Node-to-node synchronization at remote
    communications.
  • Nodes are connected by a 3D network; each node
    can efficiently transfer data with its first,
    second and third neighbours.
  • => Well suited for homogeneous problems with
    short-range interactions.

10
apeNEXT Programming Model: Data Decomposition (1)
  • Application: a discretized D-dimensional lattice
    domain decomposed onto a 3-dimensional processor
    mesh (possibly D ≠ 3)
  • => Each node holds a subset of the lattice sites
    in its own memory
  • In other cases no decomposition is done: the
    simulation is replicated in parallel without
    communications, just to gather better statistics
    (FARM).
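
As an illustration of the decomposition, the following C sketch maps a global lattice site to the node that owns it and to a linearized index in that node's memory. It assumes D = 3 and a lattice that splits evenly over the mesh; `site_owner` and `locate_site` are hypothetical names, and the x-fastest linearization is an arbitrary choice.

```c
#include <assert.h>

typedef struct {
    int node[3];   /* coordinates of the owning node in the processor mesh */
    int local;     /* linearized index of the site in that node's memory */
} site_owner;

/* site: global coordinates; lattice: global extents; mesh: nodes per direction. */
site_owner locate_site(const int site[3], const int lattice[3], const int mesh[3]) {
    site_owner o;
    int local[3], ext[3];
    for (int d = 0; d < 3; d++) {
        assert(mesh[d] > 0 && lattice[d] % mesh[d] == 0); /* assumption: even split */
        assert(site[d] >= 0 && site[d] < lattice[d]);
        ext[d] = lattice[d] / mesh[d];     /* local extent along direction d */
        o.node[d] = site[d] / ext[d];      /* which node holds the site */
        local[d] = site[d] % ext[d];       /* position inside the local block */
    }
    /* x runs fastest in the local linearization */
    o.local = (local[2] * ext[1] + local[1]) * ext[0] + local[0];
    return o;
}
```

For an 8x4x4 lattice on a 4x2x2 board, every node holds a 2x2x2 block of 8 sites.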

11
apeNEXT Programming Model: Data Decomposition (2)
[Figure: 2x2 grid of nodes labelled 00, 01, 10, 11, with x and y axes]
For each lattice site, and on each node in
parallel, the program performs an evolutionary
step. Short-range interactions => first-neighbour
inter-node communication.
12
apeNEXT Programming Model: Programming Languages (1)
  • TAO: dedicated parallel language
  • Fortran-like base syntax.
  • Dynamic language: the (experienced) programmer
    can freely extend the syntax with new statements,
    data types and operators
  • => Libraries configure the language for specific
    application domains (LQCD, Spin Glass, ...)
  • Allows writing highly efficient code by
    exposing the features of the hardware
    architecture (registers, prefetch queues, cache)
13
apeNEXT Programming Model: Programming Languages (2)
  • C99 language
  • Few extensions to the standard language.
  • Eases the porting of applications and standard
    libraries.
  • Allows writing highly efficient code by
    exposing the features of the hardware
    architecture (registers, prefetch queues, cache)

14
apeNEXT Programming Model: Parallel Language Constructs (1)
  • Few parallel language constructs (the same in C99
    and TAO)
  • Conditional execution on a subset of nodes, based
    on node-local conditions (where)
  • Boolean operators that promote local conditions to
    global ones for use in flow control statements
    (any, all, none).
  • Communications between nodes in the 3D mesh are
    expressed as variable assignments; directions are
    specified by means of magic constants in the
    source address (X_PLUS, X_MINUS, ..., Z_MINUS).
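
The promotion of local conditions to a global one can be pictured as a reduction over all nodes. The sketch below emulates it on a single host, with one flag per node held in an array; `any_node`, `all_node` and `none_node` are hypothetical stand-ins for the language-level any, all and none, which on the real machine are evaluated across the mesh.

```c
#include <assert.h>
#include <stdbool.h>

/* True if the local condition holds on at least one node. */
bool any_node(const bool *cond, int nodes) {
    assert(cond != 0 && nodes > 0);
    for (int n = 0; n < nodes; n++)
        if (cond[n]) return true;
    return false;
}

/* True if the local condition holds on every node. */
bool all_node(const bool *cond, int nodes) {
    assert(cond != 0 && nodes > 0);
    for (int n = 0; n < nodes; n++)
        if (!cond[n]) return false;
    return true;
}

/* True if the local condition holds on no node. */
bool none_node(const bool *cond, int nodes) {
    return !any_node(cond, nodes);
}
```

Because every node obtains the same global result, all nodes take the same branch of the flow control statement, which keeps the SPMD program in step.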

15
apeNEXT Programming Model: Parallel Language Constructs (2)
  • where statement: conditional execution on a subset
    of the mesh.

      where (x > y)
        max_xy = x
        min_xy = y
      elsewhere
        max_xy = y
        min_xy = x
      endwhere

16
apeNEXT Programming Model: Parallel Language Constructs (3)
  • Inter-node communications

      integer i
      real u[1024]
      register real rd1, rd2, rd3, rloc
      ...
      rd1  = u[i+Z_PLUS]
      rd2  = u[i+Y_PLUS+Z_PLUS]
      rloc = u[i+X_PLUS+X_MINUS]
      rd3  = u[i+X_PLUS+Y_MINUS+Z_PLUS]

The first assignment loads u[i] from node (x, y, z+1) into rd1.
17
apeNEXT Programming Model: Parallel Language Constructs (4)
  • any()/all()/none() boolean operators

      !! evaluation of mesh size along X
      !! with systolic algorithm
      sum_ix = 1
      sum_r[0] = node_abs_x
      sum_r[0] = sum_r[X_PLUS]   !! internode communication
      while (any(sum_r[0] != node_abs_x))
        sum_ix = sum_ix + 1
        sum_r[0] = sum_r[X_PLUS]
      endwhile
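
The systolic loop above can be checked on a single host by simulating the ring of nodes along X. In this hedged C sketch the array `sum_r` holds one value per node, `shift_from_x_plus` plays the role of the remote read from X_PLUS, and `any_differs` plays the role of any(); all three helper names and the `MAX_NODES` bound are assumptions made for the simulation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64   /* arbitrary bound for this simulation */

/* Every node x replaces its value with the one held by node x+1 (ring). */
static void shift_from_x_plus(int *val, int n) {
    int first = val[0];
    for (int x = 0; x + 1 < n; x++) val[x] = val[x + 1];
    val[n - 1] = first;
}

/* Emulates any(sum_r != node_abs_x): true if some node holds a foreign value. */
static bool any_differs(const int *val, int n) {
    for (int x = 0; x < n; x++)
        if (val[x] != x) return true;
    return false;
}

/* Returns the mesh size along X as counted by the systolic loop. */
int systolic_mesh_size_x(int n) {
    assert(n >= 1 && n <= MAX_NODES);
    int sum_r[MAX_NODES];
    for (int x = 0; x < n; x++) sum_r[x] = x;   /* sum_r = node_abs_x */
    int sum_ix = 1;
    shift_from_x_plus(sum_r, n);                /* first communication */
    while (any_differs(sum_r, n)) {
        sum_ix++;
        shift_from_x_plus(sum_r, n);
    }
    return sum_ix;
}
```

After k shifts node x holds the coordinate of node (x + k) mod n, so every node sees its own coordinate again exactly when k equals the mesh size.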

18
Example 2D application kernel: C function

      T datain[LVOL], dataout[LVOL];   // LVOL is the node-local volume
      // precalculated neighbourhood tables
      int neighp[LVOL][2], neighm[LVOL][2];
      ...
      void kernel_fun() {
        register T res, d, dp0, dp1, dm0, dm1;
        int i;
        for (i = 0; i < LVOL; i++) {         // i is a linearized index
          d   = datain[i];                   // always local access
          dp0 = datain[neighp[i][0]];        // local or remote access
          dp1 = datain[neighp[i][1]];
          dm0 = datain[neighm[i][0]];
          dm1 = datain[neighm[i][1]];
          res = calc(d, dp0, dp1, dm0, dm1); // big inline
          dataout[i] = res;
        }
      }
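
The slide does not show how the neighbourhood tables are filled. As a rough single-node sketch, the tables for a periodic LX x LY local lattice could be precalculated as below; the extents `LX` and `LY`, the row-major linearization and the function `build_neighbour_tables` are assumptions. On a multi-node partition the boundary entries would instead have to encode a remote access, which this sketch does not attempt.

```c
#include <assert.h>

#define LX 4                 /* assumed local extent along x */
#define LY 4                 /* assumed local extent along y */
#define LVOL (LX * LY)

int neighp[LVOL][2], neighm[LVOL][2];

/* Fill the +/- neighbour tables; column 0 is the x direction, column 1 is y. */
void build_neighbour_tables(void) {
    for (int y = 0; y < LY; y++) {
        for (int x = 0; x < LX; x++) {
            int i = y * LX + x;                        /* linearized index */
            neighp[i][0] = y * LX + (x + 1) % LX;
            neighm[i][0] = y * LX + (x + LX - 1) % LX;
            neighp[i][1] = ((y + 1) % LY) * LX + x;
            neighm[i][1] = ((y + LY - 1) % LY) * LX + x;
        }
    }
    /* Sanity check: stepping forward and then backward returns to the start. */
    for (int i = 0; i < LVOL; i++) {
        assert(neighm[neighp[i][0]][0] == i);
        assert(neighm[neighp[i][1]][1] == i);
    }
}
```

Precomputing the tables once keeps index arithmetic out of the inner loop of the kernel.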

19
Example 2D application kernel: domain decomposition
[Figure: domain decomposition of the datain array over the x-y node mesh; the local domain of node (x, y) is datain[LVOL], and the boundary of the local domain is highlighted]
20
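Example 2D application kernel: First Neighbour
Systolic Communication
[Figure: 2x2 grid of nodes labelled 00, 10, 01, 11, with x and y axes. For dm0 = datain[neighp[i][0]], the table entry neighp[i][0] is either a local displacement plus X_MINUS, or a plain local displacement]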
21
Example: Monte Carlo Pi Calculation
  • Estimate π by throwing darts at a unit square
  • Calculate the fraction that falls inside the unit
    circle
  • Area of square = r² = 1
  • Area of circle quadrant = ¼ π r² = π/4
  • Randomly throw darts at (x, y) positions
  • If x² + y² < 1, then the point is inside the circle
  • Compute ratio = points inside / points total
  • π = 4 × ratio
  • Replicate the calculation on N nodes in parallel
    to get better statistics
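
How much the statistics improve can be estimated from the binomial nature of the experiment: each dart hits with probability p = π/4, so the estimate 4 × ratio over N darts has variance 16 p (1 - p) / N. The sketch below turns this into the number of darts needed for a target standard error. The function names are hypothetical, and a reference value of π is used only to evaluate p for illustration.

```c
#include <assert.h>

#define PI_REF 3.14159265358979323846   /* reference value, used only to evaluate p */

/* Variance contributed by a single dart to the estimate 4 * hit. */
double single_throw_variance(void) {
    double p = PI_REF / 4.0;             /* probability that a dart hits */
    return 16.0 * p * (1.0 - p);
}

/* Number of darts needed for the estimate to reach standard error eps. */
double trials_for_std_error(double eps) {
    assert(eps > 0.0);
    return single_throw_variance() / (eps * eps);
}
```

The required number of darts grows as 1/eps², so spreading them over 16 independent nodes cuts the per-node work by a factor of 16, provided the random streams on different nodes are uncorrelated.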

22
Example: Monte Carlo Pi Calculation, C + OpenMP Code

      #include <stdio.h>
      #include <math.h>
      #include <stdlib.h>
      #include "omp.h"

      /* static: internal linkage, so C99 compilers emit a definition */
      static inline int hit() {
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        if ((x*x + y*y) < 1.0) return(1);
        else return(0);
      }

      #define FIRST_SEED 3374

      int main(int argc, char *argv[]) {
        int i, hits = 0, trials = 0;
        int seeds_index = 0;
        const int max_threads = omp_get_max_threads();
        unsigned int seeds[max_threads];
        double pi;

        printf("MAX_THREADS %d\n", max_threads);
        if (argc != 2) trials = 1000000;
        else trials = atoi(argv[1]);

        srand(FIRST_SEED);
        for (i = 0; i < max_threads; i++) {
          /* decorrelate the seeds */
          seeds[i] = rand();
          printf("seed[%d]=%d\n", i, seeds[i]);
        }

        /* Note: rand()/srand() share one generator state across threads in
           common C libraries; rand_r() with a per-thread seed would keep
           the streams independent. */
        #pragma omp parallel private(i, seeds_index) shared(seeds, hits, trials)
        {
          seeds_index = omp_get_thread_num();
          srand(seeds[seeds_index]);
          #pragma omp for reduction(+:hits)
          for (i = 0; i < trials; i++)
            hits += hit();
        }
        pi = 4.0*(double)hits/(double)trials;
        printf("PI estimated to %.10g\n", pi);
        return 0;
      }
23
Example: Monte Carlo Pi Calculation, apeNEXT C Code

      #include <stdio.h>
      #include <math.h>
      #include <stdlib.h>
      #include <sysvars.h>
      #include <topology.h>

      inline int hit() {
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        if ((x*x + y*y) < 1.0) return(1);
        else return(0);
      }

      int main(int argc, char *argv[]) {
        int i, hits = 0, trials = 0;
        int seeds_index = 0;
        const int max_threads = _mem_i[machine_size_x_p] *
                                _mem_i[machine_size_y_p] *
                                _mem_i[machine_size_z_p];
        const int node_index = _mem_i[node_abs_id_p];
        unsigned int seeds[max_threads];
        double pi;

        printf("MAX_THREADS %d\n", max_threads);
        if (argc != 2) trials = 1000000;
        else trials = atoi(argv[1]);

        srand(FIRST_SEED);
        for (i = 0; i < max_threads; i++) {
          seeds[i] = rand();
          printf("seed[%d]=%d\n", i, seeds[i]);
        }
        srand(seeds[node_index]);

        for (i = 0; i < trials; i++)
          hits += hit();

        hits = global_sum(hits);
        trials *= max_threads;
        pi = 4.0*(double)hits/(double)trials;
        printf("PI estimated to %.10g\n", pi);
        return 0;
      }
24
Example: Monte Carlo Pi Calculation, Results -
Intel P4 Dual Core

      lonardo@marlin> env OMP_NUM_THREADS=16 ./monte_pi-gcc.o
      MAX_THREADS 16
      seed[0]=1396293760
      seed[1]=1488115307
      seed[2]=1303873515
      seed[3]=37393359
      seed[4]=824846176
      seed[5]=1138759395
      seed[6]=1184683763
      seed[7]=1884735975
      seed[8]=443160774
      seed[9]=326610858
      seed[10]=878347714
      seed[11]=501308535
      seed[12]=1066424433
      seed[13]=1420631951
      seed[14]=391631339
      seed[15]=1730610200
      PI estimated to 3.14108

25
Example: Monte Carlo Pi Calculation, Results -
apeNEXT Board (16 Nodes)

      lonardo@ant> nrun -hib -board 033 -minit0 monte-api.mem
      MAX_THREADS 16
      seed[0]=6556077425992558173
      seed[1]=4923530068770806084
      seed[2]=4637196908100545377
      seed[3]=6221712952809700854
      seed[4]=279065984179923185
      seed[5]=7751953660738243840
      seed[6]=7614450982016732205
      seed[7]=1120288809807653798
      seed[8]=4640801604175907269
      seed[9]=4885633457180056444
      seed[10]=905770433927994553
      seed[11]=1598073754810041858
      seed[12]=7232028785291230425
      seed[13]=6726612558212505416
      seed[14]=3567338195430110971
      seed[15]=5194800804163472670
      PI estimated to 3.13989775

26
apeNEXT Compilation Chain
  • rtc: TAO compiler
  • The Retargetable TAO Compiler produces an
    intermediate pseudo-assembly file, which is
    further translated into assembly for APEmille or
    apeNEXT.
  • Based on the Zz dynamic parser.
  • Relies on a separate module for assembly code
    optimizations.
  • Stable, production-quality compiler

27
apeNEXT Compilation Chain
  • nlcc: C compiler
  • Port of the lcc 4.2 compiler to the apeNEXT
    architecture.
  • Few optimizations.
  • C99 plus apeNEXT syntax extensions
  • Low bug-report rate.

28
apeNEXT Compilation Chain
  • ngcc: C compiler
  • Port of the GNU C compiler (GCC) to the apeNEXT
    architecture
  • Based on gcc version 4.1
  • Optimization passes are performed on the compiler's
    internal representations of the code (tree-SSA, RTL)
  • Source language: C99 and GNU extensions to C99,
    plus apeNEXT extensions for parallel programming
  • Possibility to integrate front ends for other
    source languages (C++, Fortran, TAO)
  • Target language: apeNEXT user-level assembly
    (SASM)

29
apeNEXT Compilation Chain
  • ngcc status
  • Single-node C compiler: DONE
  • Vector data types and arithmetic: ALMOST DONE
  • Exploitation of native complex types and
    arithmetic: TO DO
  • Remote memory access implementation: DONE
  • Prefetch instructions: ALMOST DONE
  • Cache handling: TO DO
  • where(), any(), all(), none() constructs: TO DO
  • libc adaptation: JUST STARTED
  • Work in progress

30
apeNEXT Compilation Chain
  • mpp: macro-assembler
  • Translates a user-friendly assembly into a
    micro-assembly representation
  • Macro expansion.
  • Label analysis.
  • Emission of masm instructions for cache handling.

31
apeNEXT Compilation Chain
  • sofan: micro-assembly optimizer
  • Based on the SALTO (INRIA) optimization toolkit
  • Transforms the micro-assembly code in order to
    perform a series of optimizations, such as:
  • Mul-add fusion
  • Dead code removal
  • Copy propagation
  • Address generation optimization
  • Instruction pre-scheduling
32
apeNEXT Compilation Chain
  • shaker: microcode scheduler
  • Generation of optimized microcode to exploit the
    pipelined Very Long Instruction Word processor
    architecture
  • Scheduling
  • Register renaming
  • Register allocation
  • Microcode compression
  • Optional generation of an executable for the
    functional simulator

33
apeNEXT Compilation Chain: shaker microcode
scheduler
  • Generation of microcode patterns, t_exec = t_max
  • Shake up phase: try to schedule each pattern as
    early as possible, respecting
  • dependencies between instructions
  • device occupation at each cycle
  • t_exec = t_su

[Figure: microcode patterns 0 to 5 laid out on a DEVICES x CYCLES grid before and after the shake up phase; the schedule length shrinks from t_max to t_su]
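
The idea of the shake up phase, moving each operation as early as dependencies and device occupation allow, can be sketched with a simple greedy scheduler. This is a simplified stand-in, not shaker's algorithm: the `micro_op` structure, the fixed `NUM_DEVICES` and `MAX_CYCLES` bounds, and the one-cycle device occupation per operation are all assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CYCLES 256   /* arbitrary bound for this sketch */
#define NUM_DEVICES 4    /* arbitrary number of devices */

typedef struct {
    int device;    /* device that issues the operation */
    int latency;   /* cycles until its result is available */
    int dep[2];    /* indices of earlier operations it depends on, -1 if none */
} micro_op;

/* Place each operation at the earliest legal cycle; returns the schedule length. */
int schedule_asap(const micro_op *ops, int n, int *issue) {
    bool busy[NUM_DEVICES][MAX_CYCLES] = {{false}};
    int length = 0;
    for (int i = 0; i < n; i++) {
        int dev = ops[i].device;
        assert(dev >= 0 && dev < NUM_DEVICES);
        int t = 0;
        for (int k = 0; k < 2; k++) {           /* respect dependencies */
            int d = ops[i].dep[k];
            if (d < 0) continue;
            assert(d < i);
            int ready = issue[d] + ops[d].latency;
            if (ready > t) t = ready;
        }
        while (t < MAX_CYCLES && busy[dev][t])  /* respect device occupation */
            t++;
        assert(t < MAX_CYCLES);
        busy[dev][t] = true;
        issue[i] = t;
        if (t + ops[i].latency > length) length = t + ops[i].latency;
    }
    return length;
}
```

An operation with no pending dependencies can slide ahead of earlier program-order operations on the same device, which is what shortens the schedule relative to the unscheduled length.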
34
apeNEXT Compilation Chain: shaker microcode
scheduler
  • Shake down phase: try to schedule each pattern as
    late as possible, respecting
  • dependencies between instructions
  • device occupation at each cycle
  • t_exec = t_su - t_sd
  • Typically t_max / t_exec is around 10 in
    computing-intensive code sections

[Figure: microcode patterns on the DEVICES x CYCLES grid before and after the shake down phase, with t_sd and t_su marked]
35
apeNEXT Compilation Chain
  • sf: functional simulator
  • Micro-assembly instruction-level simulator.
  • Support for single-node and multi-node simulations
    (1x1x1, 2x2x2, 4x2x2).
  • Fast simulation (multithreaded).
  • Not cycle accurate.
  • Bit-exact arithmetic (microcode scheduling may
    give differences).

36
apeNEXT Execution Environment: OS distributed
architecture (1)
  • 7thLink
  • Program loading
  • I/O operations
  • 1 channel per unit
  • 200 MB/s per channel

I2C: bootstrap, exception handling, debugging
(1.5 MB/s)
37
apeNEXT Execution Environment: OS distributed
architecture (2)
  • Master
      • Resides on the front-end Linux PC
      • User interface (shell commands)
      • Partitioning
      • Dispatches I/O requests to the slaves
  • Slave
      • Resides on the blade PCs
      • Handles communication with apeNEXT on I2C and
        7thLink
      • PCI boards
  • Tiny kernel of routines embedded in the apeNEXT
    program
      • Loader
      • I/O (routing of data to and from the interface
        node)
      • System services (time counters, etc.)

38
apeNEXT Execution Environment
  • Programs can be loaded and executed on a machine
    partition:
  • node (1x1x1)
  • board: 16 nodes (4x2x2)
  • unit: 4 boards (4x2x8)
  • crate: 4 units (4x8x8)
  • rack: 2 crates (8x8x8)
  • The partition is reserved until program execution
    finishes (no multitasking!)
  • Single process
  • No virtual memory
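
The partition hierarchy above is easy to tabulate. The following C sketch stores the mesh dimensions listed on the slide and computes the node count of each partition; the `partition` structure and `partition_nodes` function are illustrative names, not part of the apeNEXT operating system.

```c
#include <assert.h>
#include <string.h>

typedef struct { const char *name; int nx, ny, nz; } partition;

/* Mesh dimensions as listed on the slide. */
static const partition partitions[] = {
    {"node",  1, 1, 1},
    {"board", 4, 2, 2},
    {"unit",  4, 2, 8},
    {"crate", 4, 8, 8},
    {"rack",  8, 8, 8},
};

/* Number of nodes in the named partition, or -1 if the name is unknown. */
int partition_nodes(const char *name) {
    assert(name != 0);
    for (size_t i = 0; i < sizeof partitions / sizeof partitions[0]; i++)
        if (strcmp(partitions[i].name, name) == 0)
            return partitions[i].nx * partitions[i].ny * partitions[i].nz;
    return -1;
}
```

The products reproduce the hierarchy: 16 nodes per board, 64 per unit, 256 per crate and 512 per rack.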

39
Batch system
  • Torque/OpenPBS
  • Currently FIFO scheduling; a user-group
    quota-based scheduler is being implemented.
  • Queues
      • rack
      • crate
      • unit

40
Batch System: Job Submission
  • nsub: wrapper of the qsub command

      Usage: nsub [OPTIONS] script
      Submits an apeNEXT job, where OPTIONS are:
        -a date_time    Declares the time after which the job is
                        eligible for execution;
                        the format is [[[[CC]YY]MM]DD]hhmm[.SS]
        -c conf         chooses among available apeNEXT configurations;
                        conf: board, unit, unit010-3, crate, crate01,
                        rack (default: crate)
        -m host_name    requests a particular host
        -g group_name   overrides user group
        -o logfile      overrides logfile name
        -V              dumps version information
        -v              be verbose
        -h              shows this help
41
Batch System: Job submission example

lonardo@theboss> nsub -c crate 7h_test.sh
15942.theboss.ape
lonardo@theboss> qstat -an1

theboss.ape:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
15813.theboss.ape    orifici  crate    stoc2        1826     1  --     -- 24:00 R 21:52 rack10/1
15860.theboss.ape    simula   crate    run_tdilu   29170     1  --     -- 24:00 R 14:34 rack4/1
15877.theboss.ape    zeidlew  crate    mu.056.cjo   5925     1  --     -- 24:00 R 11:38 rack8/0
15880.theboss.ape    delia    crate    run.sh       6386     1  --     -- 24:00 R 10:59 rack8/1
15896.theboss.ape    delia    unit     rum0175.sh  32291     1  --     -- 24:00 R 08:47 rack7/5
15900.theboss.ape    frezzott rack     RUN_Rack5.  18099     1  --     -- 24:00 R 07:56 rack5/0
15906.theboss.ape    delia    unit     run0175.sh   1072     1  --     -- 24:00 R 06:49 rack7/0
15918.theboss.ape    frezzott rack     RUN_Rack2.  15890     1  --     -- 24:00 R 04:30 rack2/0
15926.theboss.ape    simula   crate    run1_tdilu   4409     1  --     -- 24:00 R 02:47 rack4/0
15927.theboss.ape    delia    unit     run0200.sh   3596     1  --     -- 24:00 R 02:47 rack7/4
15928.theboss.ape    lacagnin crate    run.5.7.sh   4772     1  --     -- 24:00 R 02:36 rack1/0
15930.theboss.ape    lacagnin crate    run.5.6.sh   2787     1  --     -- 24:00 R 02:34 rack9/0
15932.theboss.ape    cosmai   crate    b5.450_n0.   2994     1  --     -- 24:00 R 02:02 rack9/1
15933.theboss.ape    delia    unit     rum0200.sh   4216     1  --     -- 24:00 R 01:53 rack7/2
15934.theboss.ape    delia    unit     run0225.sh   4552     1  --     -- 24:00 R 01:24 rack7/1
15935.theboss.ape    devitiis crate    theboss.sh  28504     1  --     -- 24:00 R 01:22 rack3/0
15936.theboss.ape    orifici  crate    stoc        28592     1  --     -- 24:00 R 00:57 rack3/1
15937.theboss.ape    devitiis crate    theboss.sh  18065     1  --     -- 24:00 R 00:57 rack6/0
15939.theboss.ape    cosmai   crate    b5.450_n1.   5845     1  --     -- 24:00 R 00:54 rack1/1
15940.theboss.ape    devitiis crate    theboss.sh  18472     1  --     -- 24:00 R 00:22 rack6/1
15941.theboss.ape    delia    unit     rum0225.sh   5290     1  --     -- 24:00 R 00:14 rack7/3
15942.theboss.ape    lonardo  crate    7h_test.sh  11226     1  --     -- 24:00 R    -- rack10/0