Title: Super Scaling PROOF to very large clusters
1 Super Scaling PROOF to very large clusters
- Maarten Ballintijn, Kris Gulbrandsen,
- Gunther Roland / MIT
- Rene Brun, Fons Rademakers / CERN
- Philippe Canal / FNAL
- CHEP 2004
2 Outline
- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans
3 Outline
- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans
4 PROOF Parallel ROOT Facility
- Interactive analysis of very large sets of ROOT data files on a cluster of computers
- Employ the inherent parallelism in event data
- The main design goals are
- Transparency, scalability, adaptability
- On the GRID, extended from a local cluster to a wide-area virtual cluster or a cluster of clusters
- Collaboration between the ROOT group at CERN and the MIT Heavy Ion Group
5 PROOF, continued
- Multi-tier architecture
- Optimize for data locality
- WAN ready and GRID compatible
[Diagram: multi-tier architecture connecting the User to the Slave nodes]
6 PROOF - Architecture
- Data Access Strategies
- Local data first; also rootd, rfio, SAN/NAS
- Transparency
- Input objects copied from the client
- Output objects merged and returned to the client (see the sketch below)
- Scalability and Adaptability
- Vary packet size (specific workload, slave performance, dynamic load)
- Heterogeneous servers
- Migrate to multi-site configurations
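To make the transparency bullets concrete, here is a minimal client-side sketch. It assumes the input/output-list calls of the TProof interface (AddInput(), GetOutputList()) and illustrative names (the cut string, the selector EventTree_Proc.C from the benchmark slides, and a TDSet dset as built later); it is not code from the talk.

  // Minimal sketch, assuming the TProof input/output-list interface.
  TVirtualProof *proof = gROOT->Proof("master");

  // Input objects are copied from the client to every slave; here a
  // hypothetical named cut string travels in the input list.
  proof->AddInput(new TNamed("CutExpr", "fNtrack > 10"));

  // Run the query; dset is assumed to be a TDSet (see make_tdset below).
  dset->Process("EventTree_Proc.C");

  // Output objects are merged on the way back and land in the output list.
  TH1F *h = (TH1F *) proof->GetOutputList()->FindObject("hNtrack");
  if (h) h->Draw();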
7 Outline
- PROOF Overview
- Benchmark Package
- Dataset generation
- Benchmark TSelector
- Statistics and Event Trace
- Benchmark results
- Other developments
- Future plans
8 Dataset generation
- Use the ROOT Event example class
- Script for creating PAR file is provided
- Generate data on all nodes with slaves
- Slaves generate data files in parallel
- Specify location, size and number of files
$ make_event_par.sh
$ root
root [0] gROOT->Proof()
root [1] .X make_event_trees.C("/tmp/data", 100000, 4)
root [2] .L make_tdset.C
root [3] TDSet *d = make_tdset()
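For reference, a sketch of what make_tdset.C could look like; the tree name "EventTree", the host names, and the file paths are assumptions based on the benchmark setup, not the actual script.

  // make_tdset.C (sketch): build a TDSet over the generated files,
  // one file per slave as in the benchmark dataset.
  TDSet *make_tdset()
  {
     // Process TTree objects named "EventTree" (name assumed).
     TDSet *d = new TDSet("TTree", "EventTree");
     d->Add("root://node1//tmp/data/event_0.root");
     d->Add("root://node2//tmp/data/event_1.root");
     d->Add("root://node3//tmp/data/event_2.root");
     d->Add("root://node4//tmp/data/event_3.root");
     return d;
  }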
9 Benchmark TSelector
- Three selectors are used
- EventTree_NoProc.C: empty Process() function, reads no data
- EventTree_Proc.C: reads all the data and fills a histogram (actually only 35% was read in this test)
- EventTree_ProcOpt.C: reads a fraction of the data (20%) and fills a histogram (see the sketch below)
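The optimized variant can be illustrated with the standard TTree branch-status mechanism; a sketch, assuming the selector keeps the usual fChain, fHist, and fNtrack members from the ROOT Event example (the actual benchmark code may select different branches; a full selector skeleton appears with the control-flow slide near the end).

  // Sketch: enable only the branches Process() needs (selection assumed).
  void EventTree_ProcOpt::Init(TTree *tree)
  {
     fChain = tree;
     fChain->SetBranchStatus("*", 0);            // disable everything
     fChain->SetBranchStatus("fNtrack", 1);      // re-enable a subset
     fChain->SetBranchStatus("fTemperature", 1);
     fChain->SetBranchAddress("fNtrack", &fNtrack);
  }

  Bool_t EventTree_ProcOpt::Process(Long64_t entry)
  {
     fChain->GetEntry(entry);   // reads only the enabled branches
     fHist->Fill(fNtrack);      // fill the benchmark histogram
     return kTRUE;
  }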
10 Statistics and Event Trace
- Global histograms to monitor the master
- Number of packets, number of events, processing time, get-packet latency per slave
- Can be viewed using the standard feedback mechanism
- Trace Tree, a detailed log of events during the query
- Master only, or Master and Slaves
- Detailed list of recorded events follows
- Implemented using standard ROOT classes and PROOF facilities (sketched below)
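Since the trace is described as built from standard ROOT classes, the idea can be sketched as follows; the record layout, branch names, and helper functions are invented for illustration, not the actual PROOF code.

  #include "TTree.h"
  #include "TTimeStamp.h"

  // Invented trace record -- illustrative only.
  struct TraceEvent {
     Double_t fTime;    // wall-clock timestamp, seconds
     Int_t    fType;    // kind of event: packet, file open, file read, ...
     Int_t    fSlave;   // ordinal of the recording slave (-1 for the master)
  };

  TraceEvent gEvt;
  TTree gTrace("PROOF_Trace", "query event trace");

  void TraceInit()
  {
     // One branch described by a leaf list, a standard ROOT technique.
     gTrace.Branch("event", &gEvt, "fTime/D:fType/I:fSlave/I");
  }

  void TraceRecord(Int_t type, Int_t slave)
  {
     gEvt.fTime  = TTimeStamp().AsDouble();
     gEvt.fType  = type;
     gEvt.fSlave = slave;
     gTrace.Fill();
  }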
11 Events recorded in Trace
- Each event contains a timestamp and the recording slave or master
- Begin and End of Query
- Begin and End of File
- Packet details and processing time
- File Open statistics (slaves)
- File Read statistics (slaves)
- Easy to add new events
12 Outline
- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans
13 Benchmark Results
- CDF cluster at Fermilab
- 160 nodes, initial tests
- Pharm, Phobos private cluster, 24 nodes
- 6 x 730 MHz P3 dual
- 6 x 930 MHz P3 dual
- 12 x 1.8 GHz P4 dual
- Dataset
- 1 file per slave, 60000 events, 100 MB
14 Results on Pharm
15 Results on Pharm, continued
16 Local and remote File open
[Plot: file open time distributions, local vs. remote]
17 Slave I/O Performance
18 Benchmark Results
- Phobos-RCF, central facility at BNL, 370 nodes total
- 75 x 3.05 GHz P4 dual, IDE
- 99 x 2.4 GHz P4 dual, IDE
- 18 x 1.4 GHz P3 dual, IDE
- Dataset
- 1 file per slave, 60000 events, 100 MB
19 PHOBOS RCF LAN Layout
20 Results on Phobos-RCF
21 Looking at the problem
22 Processing time distributions
23 Processing time, detailed
24 Request packet from Master
25 Benchmark Conclusions
- The benchmark and measurement facility has proven to be a very useful tool
- Don't use NFS-based home directories
- LAN topology is important
- LAN speed is important
- More testing is required to pinpoint the sporadic long latency
26 Outline
- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans
27 Other developments
- Packetizer fixes and new dev version
- PROOF Parallel startup
- TDrawFeedback
- TParameter utility class
- TCondor improvements
- Authentication improvements
- Long64_t introduction
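Two of these can be sketched together, assuming the TParameter and TDrawFeedback interfaces of ROOT at the time (constructor signatures assumed) and reusing the benchmark names from earlier slides:

  #include "TParameter.h"
  #include "TDrawFeedback.h"

  TVirtualProof *proof = gROOT->Proof("master");

  // TParameter<> wraps a simple named value so it can be shipped to the
  // slaves in the input list like any other TObject.
  proof->AddInput(new TParameter<Long64_t>("MaxEvents", 100000));

  // TDrawFeedback updates the registered feedback objects (e.g. the
  // monitoring histograms of the statistics slide) while the query runs.
  TDrawFeedback fb(proof);

  dset->Process("EventTree_Proc.C");   // dset: the benchmark TDSet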
28 Outline
- PROOF Overview
- Benchmark Package
- Benchmark results
- Other developments
- Future plans
29 Future plans
- Understand and solve the LAN latency problem
- In prototype stage
- TProofDraw()
- Multi-level master configuration
- Documentation
- HowTo
- Benchmarking
- PEAC PROOF Grid scheduler
30 The End
31 Parallel Script Execution
[Diagram: a local PC running root, connected to a remote PROOF cluster; each node (node1-node4) holds ana.C and .root data files]

proof.conf:
  slave node1
  slave node2
  slave node3
  slave node4

Local execution:
  root [0] .x ana.C
  root [0] tree->Process("ana.C")

Parallel execution on the PROOF cluster:
  root [1] gROOT->Proof("remote")
  root [2] dset->Process("ana.C")
32 Simplified message flow
33 TSelector control flow
- Begin() -- on the client, before the query starts
- Send Input Objects -- client to slaves
- SlaveBegin() -- on each slave
- Process() ... Process() -- on each slave, once per event
- SlaveTerminate() -- on each slave, after its last packet
- Return Output Objects -- slaves to client, merged
- Terminate() -- on the client (see the skeleton below)
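The same flow as a selector skeleton, sketched against the standard TSelector interface (histogram and branch names are illustrative; the Init() that wires the branch addresses is sketched on the Benchmark TSelector slide):

  #include "TSelector.h"
  #include "TTree.h"
  #include "TH1F.h"

  class EventTree_Proc : public TSelector {
  public:
     // Client, before the query starts.
     void Begin(TTree *) { }

     // Each slave: create the output objects and register them for merging.
     void SlaveBegin(TTree *) {
        fHist = new TH1F("hNtrack", "tracks per event", 100, 0, 1000);
        fOutput->Add(fHist);
     }

     // Each slave, once per event of its packets.
     Bool_t Process(Long64_t entry) {
        fChain->GetEntry(entry);
        fHist->Fill(fNtrack);
        return kTRUE;
     }

     // Each slave, after its last packet.
     void SlaveTerminate() { }

     // Client: the merged output objects have been returned into fOutput.
     void Terminate() {
        TH1F *h = (TH1F *) fOutput->FindObject("hNtrack");
        if (h) h->Draw();
     }

  private:
     TTree *fChain;   //! current tree being processed
     TH1F  *fHist;    //! filled on the slaves, merged by PROOF
     Int_t  fNtrack;  //  branch buffer for the track count
  };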
34 PEAC System Overview
35 Active Files during Query
36 Pharm Slave I/O
38 Active Files during Query