Title: BlueLink
1 BlueLink SX-6
2
- Phil Tannenbaum once said,
- You can optimise your code until it is perfect, and we can make a small error in judgement and it will be all for nought!
3
- Corollaries, sort of:
- Just because the SX-6 can do more than 12 GBps to SX-GFS does not guarantee BlueLink will!
- The SX-6 can demonstrate 30% efficiency on real applications. BlueLink is based on MOM4, a scalable parallel code, but you know this.
4 Today's Topics
- Intro
- I/O Performance Issues
- System Scheduling Issues
- SX-6 vs TX7 Utilisation
5 Intro
6 To Do Pre-Production
- BlueLink is not a stellar performer on the SX-6
- Run on a Surface/Family Scheduled Queue
- Do a VAMPIR analysis
- Minimise messaging and I/O elapsed time
- Not always the same as maximising performance
- Pack out everything possible to minimise elapsed time
7 I/O Performance (Using a Production Mindset)
8 Local File System I/O
SX-GFS File System I/O
9 Local Disk vs GFS
- File System Performance
- GFS File Systems are 2- and 4-way Striped
- (Maybe All Will become 4-way)
- Local File Systems are 1-way
- GFS Should Provide from 80-350% of Local Disk Performance
- If It Does Not, We Need to Understand Why
- So Far there has Always Been a Reason (User Creativity is high on the list)
10 Local Disk vs GFS (2)
- GFS Locking is like NFS, not Local Disk
- NFS Locking means No Locking At All
- You Must Be Careful Moving/Deleting GFS Files
- Thought is Needed, While You Have Job(s) Running, for
- Moving GFS Files
- Removing GFS Files
- Assuming GFS Files are for 1 Job (Naming)
- Housekeeping Jobs Are Known Culprits (see the sketch after this slide)
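A minimal housekeeping sketch of the kind of care this slide asks for, assuming work files belonging to live jobs carry a recognisable name prefix. The `run-` prefix, the path and the two-day age limit are illustrative only, not site policy:

```sh
#!/bin/sh
# Hypothetical housekeeping pass over a GFS work area.
# Files named "run-*" are assumed (by a local naming convention, not
# anything stated in this talk) to belong to jobs that may still be
# executing, so they are skipped; everything else older than two days
# is removed.
WORKDIR=/gfs/bluelink/work      # illustrative path

find "$WORKDIR" -type f ! -name 'run-*' -mtime +2 -exec rm -f {} \;
```

The point is simply that the housekeeping rule must know the naming convention of running jobs, because GFS will not protect an open file the way a local file system would.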
11 Additional Issues
- The I/O Subsystem works in Chunks of
- Allocation Sizes on Disk
- Buffer Sizes in Cache
- Runtime Buffer Sizes
- GFS I/O under 64 KB goes by NFS, not GFS (see the timing sketch after this slide)
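A rough way to see the small-transfer penalty for yourself; a sketch only, with an illustrative directory and the 64 KB threshold taken from the slide above:

```sh
#!/bin/sh
# Write 1 GB to a GFS file system twice: once in 32 KB records (below the
# 64 KB threshold, so the slower NFS path is used) and once in 4 MB
# records. The two elapsed times should differ markedly.
GFS_DIR=/gfs/bluelink/scratch      # illustrative GFS directory

time dd if=/dev/zero of=$GFS_DIR/small.dat bs=32k   count=32768
time dd if=/dev/zero of=$GFS_DIR/large.dat bs=4096k count=256
rm -f $GFS_DIR/small.dat $GFS_DIR/large.dat
```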
12 Additional Issues (2)
- All that Being Said,
- File systems and disks can be bottlenecks
- Best Disk Performance: 160 MBps (2-way)
- Best Disk Performance: 250 MBps (4-way)
- Both are limits you will probably not achieve
- Consider 42 processors and 42 tasks
- Writing to 1 File
- 160 MBps is the maximum for 1 or 42 tasks
- Writing to 42 files on 1 Disk
- 160 MBps is the maximum for all 42 tasks
13 Additional Issues (3)
- Consider 42 processors and 42 tasks
- Writing to 42 Files on 42 Disks (see the sketch after this slide)
- 160 MBps is the maximum for each task
- Theoretical aggregate is now 6 GBps
- BUT
- BlueLink is not alone
- Inter-job contention will occur
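One way to get the 42-disks case in practice is to map each task's output file onto a different file system. A minimal sketch, where the volume names and the MPIRANK variable (how the task rank reaches the script) are assumptions for illustration:

```sh
#!/bin/sh
# Pick an output file system for this task so that 42 writers do not all
# land on one disk. Volume names and MPIRANK are illustrative only.
FS_LIST="/gfs/vol01 /gfs/vol02 /gfs/vol03 /gfs/vol04"
TASK=${MPIRANK:-0}

set -- $FS_LIST                     # positional parameters = volume list
idx=`expr $TASK % $# + 1`           # rotate tasks over the volumes
eval FS=\$$idx
OUTFILE=$FS/bluelink_out.$TASK
echo "task $TASK writes to $OUTFILE"
```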
14 Additional Issues (4)
- For production (not research)
- Consider MMF file use
- Consider Copies of Read-only files to Distribute I/O load across channels
- Consider F_SETBUF Techniques (a sketch follows this slide)
- Almost memory speed while in buffer
- Buffer Refills and CLOSE can be Costly
- Temp Files
- OPEN with large F_SETBUF set
- REWIND and TRUNCATE prior to CLOSE to avoid flushing to disk when not necessary
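A job-script sketch of the temp-file advice above. F_SETBUF is the runtime buffer setting named on the slide; the per-unit form shown here, the KB size unit, the unit number and the executable name are all assumptions to be checked against the local SX Fortran runtime documentation:

```sh
#!/bin/sh
# Give the Fortran runtime a large I/O buffer for unit 90, the unit the
# model (hypothetically) uses for its temp file, so writes stay in memory
# and only spill to disk if the buffer overflows.
F_SETBUF90=262144          # assumed per-unit form, assumed KB units: ~256 MB
export F_SETBUF90

./bluelink.exe             # hypothetical executable name
```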
15 F_SETBUF / NC_BLOCKSZ
- If it is too big, you read far more than you need
- 1 GB F_SETBUF (or NC_BLOCKSZ)
- A READ Moves 1 GB into the Runtime Buffer
- If you only use 10 MB, You did an Extra 990 MB of I/O
- The Disk is VERY BUSY
- You Run Slowly
- If it is too small
- Repeated Buffer Filling Occurs per FORTRAN READ
- Really Bad on GFS!
- You Run Slowly (a sizing sketch follows this slide)
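A sizing sketch for the trade-off above: make the buffer large enough to hold what a READ actually consumes, and no more. The variable names come from the slide; the values, and the assumed units (KB for F_SETBUF, bytes for NC_BLOCKSZ), are illustrative only:

```sh
#!/bin/sh
# Roughly 16 MB buffers: enough to cover a (hypothetical) ~10 MB record in
# one fill, far smaller than the 1 GB worst case described on the slide.
F_SETBUF=16384             # assumed to be KB; confirm against the runtime manual
NC_BLOCKSZ=16777216        # assumed to be bytes; confirm against the library manual
export F_SETBUF NC_BLOCKSZ

./bluelink.exe             # hypothetical executable name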
16-20 (No Transcript: figure-only slides)
21 Scheduling
- The SX-6 is a Cluster, not an SMP
- Some Jobs Use Huge Resources
- They cannot be easily reassigned without huge cost
- Making decisions about auto-resource reassignment is non-trivial
- Cluster Norms for Scheduling
- Make Input Data Available on System
- Assign Exclusive Processors and Memory
- Execute Program
- Move Output Data from System
- Static Scheduling, No Overcommitment
- Elapsed Time is the Only Performance Metric
- The SX-6 is Better than That, BUT...
22 Scheduling (2)
- There is a delay in starting new jobs (hysteresis)
- Queuing System Assesses Nodes
- Allows for checkpointed job restart-in-progress
- Allows for new job initialisation and spin-up delays
23 Scheduling (3)
- Ideal Production Configuration
- Push Files to the SX-6 from the Data Source
- Start Job on the SX-6 and Let it Complete
- Single Script, Preferably 1 Long Executable
- Acquires Processors and Helps Scheduling
- Pull Files off the SX-6 from the Data Repository
- If Files Need to Go Concurrently with Execution:
- QSUB (see the sketch after this slide)
- Do Not background
- Do Not Do_TX7
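A sketch of the "QSUB, do not background" point: the running model job writes a tiny transfer script and hands it to NQS, then carries on computing. The queue name, hosts and paths are placeholders, not site configuration:

```sh
#!/bin/sh
# Inside the running SX-6 job: queue the file transfer as its own NQS
# request instead of running it in the background with the model's CPUs.
cat > /gfs/bluelink/run/send_fields.sh <<'EOF'
#!/bin/sh
rcp /gfs/bluelink/out/fields.nc gale:/data/bluelink/fields.nc
EOF

qsub -q transfer /gfs/bluelink/run/send_fields.sh   # hypothetical transfer queue
# The model continues immediately; the copy proceeds under NQS control.
```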
24 The TX7
- Do_TX7
- Unreliable (rcp stalls or failures, as an example)
- The SX-6 Job Waits, Using Processors and Memory
- This is a Reality on ALL Cluster Systems: the Processors and Memory are ALL YOURS
- NQS/ERS Can Overcommit, But Other Issues Arise
- Provides Tight Coupling
- Mimics the SX-5, Where Your Job Controlled Files
- Developed to Assist Conversion from SX-5 to SX-6
25 The TX7 (2)
- Background Tasks: Why Not?
- Background tasks are tightly coupled
- Synchronisation is easy
- The scheduling part of this talk should show why not!
26 The TX7 (3)
- submit (NQS) to rtds, TX7, gale, cherax, etc.
- Reliable; Failures are Notified
- The SX-6 Job becomes Asynchronous
- Releases Resources ASAP
- Does Not Provide Tight Coupling
- Push Files to the SX-6 from gale, etc.
- Submit the SX-6 Job
- Pull Files off the SX-6 from gale, etc.
- This is the World of Cluster Computing (see the sketch after this slide)
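A sketch of the push / submit / pull pattern as it might look when driven from gale. All hosts, paths, queue names and file names are placeholders; the slide's `submit` utility is not reproduced here, plain rcp/rsh/qsub stand in for it:

```sh
#!/bin/sh
# 1. Push input data onto the SX-6 GFS before the job needs it.
rcp restart.nc sx6:/gfs/bluelink/run/restart.nc

# 2. Submit the model job; it holds CPUs only while it actually runs.
rsh sx6 qsub -q production /gfs/bluelink/run/bluelink_job.sh

# 3. Afterwards (for example from a dependent job), pull the output back.
rcp sx6:/gfs/bluelink/run/ocean_daily.nc ./ocean_daily.nc
```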
27 Scheduling (4)
- The SX-6 will increasingly become Very Busy
- CPUs will Become Scarce Resources
- BlueLink will use 42 CPUs, for example
- EPS Multinode GASP and LAPS (around 100 CPUs total)
- Etc.
- Job Migration for non-RTO Work Will Be Common
- One Key Trade-off: Maximum Local-disk I/O Performance on a Quiet System vs Best Turnaround on a Full One
- Local File Migration Is Not Generally Viable
- Ethernet transfer (yes!) of 100 GB is 30 minutes
- Alternatives Under Investigation
- Development PSR to NEC
- Local Development to Move Local Files Through the IXS
28 Scheduling (5)
- Today RTO is pre-emptive
- RTO Jobs Start Immediately
- Tomorrow it Might Be Prioritised
- Constrained to a Subset of Nodes
- Queued in Order, Executed in Order
- SUPER-UX issue:
- I/O is not prioritised
- The TX7 only gives preference to an SX-6
- RTO and Research Compete Equally for I/O
29 What Does this Have to do with Surface Scheduling? What Does Gang Scheduling Mean for My Job? What Does Family Scheduling Mean for My Job?
30 Surface Scheduling
- Overcommitment
- Highest Resource Utilisation is the Goal
- Uses the CPU for other jobs during I/O and other waiting
- Requires Fast and Prompt System Response
- And Extra Jobs Waiting to Use the CPU
- Gang Scheduling is Mandatory when Overcommitting
- No Overcommitment
- Fastest Possible Completion Time is the Goal
- NWP production is a Realtime problem; it is not a supercomputer problem
31 Gang Scheduling is Processor Scheduling
- Developed by Cray Research for the X-MP
- Needed when parallel and less parallel jobs had to coexist on few processors
- Virtually Eliminated CPU Spin Waiting
- Enables Maximum Use of an Expensive Resource
- The CPUs were many US$100,000s each
- Goal: Highest Resource Utilisation
- Not Always Achievable
32 Pre-Gang Scheduling: CPU Time Slices (a CPU is assigned to a Job)
- [Figure: time slices 1-8 for Job1-Job5 plus idle CPUs, showing 8-way synchronization points and CPU spin waiting]
- 57 CPU Slices (in the example) are Spent Spinning
33 Gang Scheduling: CPU Time Slices (a CPU is assigned to a Job)
- [Figure: time slices 1-8 for Job1-Job5 plus idle CPUs]
- 37 CPU Slices Available for Small Jobs, 0 Spinning
34 Family Scheduling is Processor Scheduling
- Not Gang Scheduling
- Works When the System Surface Schedules
- Surface Scheduling is not a Computer Science Term and is Often called Static Scheduling
- Surface Scheduling: No Overcommitment of CPUs
- 1 task/process per CPU
- Compute, Wait on I/O, Wait on Messages
- Job_Cost = Resident_Time x Number_CPUs x $_per_sec (so a job holding 42 CPUs for an hour costs 42 CPU-hours whether those CPUs compute or spin)
- Tasks/Processes Waiting on I/O or Messages Will Spin Wait
- No Attempt to Recover Cycles for Other Work
- Result: Shortest Possible Elapsed Run Time
35 Gang Scheduling vs Family Scheduling: Elapsed Time
- [Figure: elapsed time for Job1-Job5 and idle time, using GS vs using FS]
36 Family Scheduling: Elapsed Time
- [Figure: elapsed time for Job1-Job5, with messaging/synchronization points and idle time]
37 Cluster Optimisation
- Load Balance Across Processors is Desired
- Elapsed Time is the Longest Running Thread
- Minimise I/O Time
- Your program is Spinning During I/O
- Minimise Message Passing Time
- Your program is Spinning During MPI
- Vectorise and Optimise
- Highest Vectorisation does not mean Best Performance in All Cases, But it Most Often Does
38 Final Thoughts
39
- Appreciate System Scheduling Issues
- Use Well Thought Out I/O
- A floppy disk only goes so fast, as does a Fibre RAID
- Copies of Data Are Often Good for Parallel Reads
- Consider the SX-6 and TX7 Relationship
- Submit is Better than Do_TX7 Over the Long Term
- Push Data to the GFS Before the Job
- Pull Data from the GFS After the Job
- Use NQS
40 Questions?