Title: BlueLink
1 BlueLink SX-6
2
- Phil Tannenbaum once said,
- You can optimise your code until it is perfect, and we can make a small error in judgement and it will be all for nought!
3
- Corollaries, sort of:
- Just because the SX-6 can do more than 12 GBps to SX-GFS does not guarantee BlueLink will!
- The SX-6 can demonstrate 30% efficiency on real applications. BlueLink is based on MOM4, a scalable parallel code, but you know this.
4 Today's Topics
- Intro
- I/O Performance Issues
- System Scheduling Issues
- SX-6 vs TX7 Utilisation
5 Intro
6 To Do Pre-Production
- BlueLink is not a stellar performer on the SX-6
- Run on a Surface/Family Scheduled Queue
- Do a VAMPIR analysis
- Minimise messaging and I/O elapsed time
- Not always the same as maximising performance
- Pack out everything possible to minimise elapsed time
7 I/O Performance (Using a Production Mindset)
8 Local File System I/O
SX-GFS File System I/O
9 Local Disk vs GFS
- File System Performance
- GFS File Systems are 2- and 4-way Striped
- (Maybe All Will become 4-way)
- Local File Systems are 1-way
- GFS Should Provide from 80-350% of Local Disk Performance
- If It Does Not, We Need to Understand Why
- So Far there has Always Been a Reason (User Creativity is high on the list)
10 Local Disk vs GFS (2)
- GFS Locking is like NFS, not Local Disk
- NFS Locking means No Locking At All
- You Must Be Careful Moving/Deleting GFS Files
- Thought is Needed, While You Have Job(s) Running, for
- Moving GFS Files
- Removing GFS Files
- Assuming GFS Files are for 1 Job (Naming)
- Housekeeping Jobs Are Known Culprits (see the sketch after this slide)
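A minimal housekeeping sketch of the kind of care this slide asks for, assuming work files belonging to live jobs carry a recognisable name prefix. The `run-` prefix, the path and the two-day age limit are illustrative only, not site policy:

```sh
#!/bin/sh
# Hypothetical housekeeping pass over a GFS work area.
# Files named "run-*" are assumed (by a local naming convention, not
# anything stated in this talk) to belong to jobs that may still be
# executing, so they are skipped; everything else older than two days
# is removed.
WORKDIR=/gfs/bluelink/work      # illustrative path

find "$WORKDIR" -type f ! -name 'run-*' -mtime +2 -exec rm -f {} \;
```

The point is simply that the housekeeping rule must know the naming convention of running jobs, because GFS will not protect an open file the way a local file system would.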
11 Additional Issues
- The I/O Subsystem works in Chunks of
- Allocation Sizes on Disk
- Buffer Sizes in Cache
- Runtime Buffer Sizes
- GFS I/O under 64 KB goes by NFS, not GFS (see the timing sketch after this slide)
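A rough way to see the small-transfer penalty for yourself; a sketch only, with an illustrative directory and the 64 KB threshold taken from the slide above:

```sh
#!/bin/sh
# Write 1 GB to a GFS file system twice: once in 32 KB records (below the
# 64 KB threshold, so the slower NFS path is used) and once in 4 MB
# records. The two elapsed times should differ markedly.
GFS_DIR=/gfs/bluelink/scratch      # illustrative GFS directory

time dd if=/dev/zero of=$GFS_DIR/small.dat bs=32k   count=32768
time dd if=/dev/zero of=$GFS_DIR/large.dat bs=4096k count=256
rm -f $GFS_DIR/small.dat $GFS_DIR/large.dat
```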
12 Additional Issues (2)
- All that Being Said,
- File systems and disks can be bottlenecks
- Best Disk Performance: 160 MBps (2-way)
- Best Disk Performance: 250 MBps (4-way)
- Both are limits you will probably not achieve
- Consider 42 processors and 42 tasks
- Writing to 1 File
- 160 MBps is the maximum for 1 or 42 tasks
- Writing to 42 files on 1 Disk
- 160 MBps is the maximum for all 42 tasks
13 Additional Issues (3)
- Consider 42 processors and 42 tasks
- Writing to 42 Files on 42 Disks (see the sketch after this slide)
- 160 MBps is the maximum for each task
- Theoretical aggregate is now 6 GBps
- BUT
- BlueLink is not alone
- Inter-job contention will occur
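One way to get the 42-disks case in practice is to map each task's output file onto a different file system. A minimal sketch, where the volume names and the MPIRANK variable (how the task rank reaches the script) are assumptions for illustration:

```sh
#!/bin/sh
# Pick an output file system for this task so that 42 writers do not all
# land on one disk. Volume names and MPIRANK are illustrative only.
FS_LIST="/gfs/vol01 /gfs/vol02 /gfs/vol03 /gfs/vol04"
TASK=${MPIRANK:-0}

set -- $FS_LIST                     # positional parameters = volume list
idx=`expr $TASK % $# + 1`           # rotate tasks over the volumes
eval FS=\$$idx
OUTFILE=$FS/bluelink_out.$TASK
echo "task $TASK writes to $OUTFILE"
```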
14 Additional Issues (4)
- For production (not research)
- Consider MMF file use
- Consider Copies of Read-only files to Distribute I/O load across channels
- Consider F_SETBUF Techniques (a sketch follows this slide)
- Almost memory speed while in buffer
- Buffer Refills and CLOSE can be Costly
- Temp Files
- OPEN with large F_SETBUF set
- REWIND and TRUNCATE prior to CLOSE to avoid flushing to disk when not necessary
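A job-script sketch of the temp-file advice above. F_SETBUF is the runtime buffer setting named on the slide; the per-unit form shown here, the KB size unit, the unit number and the executable name are all assumptions to be checked against the local SX Fortran runtime documentation:

```sh
#!/bin/sh
# Give the Fortran runtime a large I/O buffer for unit 90, the unit the
# model (hypothetically) uses for its temp file, so writes stay in memory
# and only spill to disk if the buffer overflows.
F_SETBUF90=262144          # assumed per-unit form, assumed KB units: ~256 MB
export F_SETBUF90

./bluelink.exe             # hypothetical executable name
```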
15 F_SETBUF / NC_BLOCKSZ
- If it is too big, you read far more than you need
- 1 GB F_SETBUF (or NC_BLOCKSZ)
- A READ Moves 1 GB into the Runtime Buffer
- If you only use 10 MB, You did an Extra 990 MB of I/O
- The Disk is VERY BUSY
- You Run Slowly
- If it is too small
- Repeated Buffer Filling Occurs per FORTRAN READ
- Really Bad on GFS!
- You Run Slowly (a sizing sketch follows this slide)
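A sizing sketch for the trade-off above: make the buffer large enough to hold what a READ actually consumes, and no more. The variable names come from the slide; the values, and the assumed units (KB for F_SETBUF, bytes for NC_BLOCKSZ), are illustrative only:

```sh
#!/bin/sh
# Roughly 16 MB buffers: enough to cover a (hypothetical) ~10 MB record in
# one fill, far smaller than the 1 GB worst case described on the slide.
F_SETBUF=16384             # assumed to be KB; confirm against the runtime manual
NC_BLOCKSZ=16777216        # assumed to be bytes; confirm against the library manual
export F_SETBUF NC_BLOCKSZ

./bluelink.exe             # hypothetical executable name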
16-20 (No Transcript: figure-only slides)
21 Scheduling
- The SX-6 is a Cluster, not an SMP
- Some Jobs Use Huge Resources
- They cannot be easily reassigned without huge cost
- Making decisions about auto-resource reassignment is non-trivial
- Cluster Norms for Scheduling
- Make Input Data Available on System
- Assign Exclusive Processors and Memory
- Execute Program
- Move Output Data from System
- Static Scheduling, No Overcommitment
- Elapsed Time is the Only Performance Metric
- The SX-6 is Better than That, BUT...
22 Scheduling (2)
- There is a delay in starting new jobs (hysteresis)
- Queuing System Assesses Nodes
- Allows for checkpointed job restart-in-progress
- Allows for new job initialisation and spin-up delays
23 Scheduling (3)
- Ideal Production Configuration
- Push Files to the SX-6 from the Data Source
- Start Job on the SX-6 and Let it Complete
- Single Script, Preferably 1 Long Executable
- Acquires Processors and Helps Scheduling
- Pull Files off the SX-6 from the Data Repository
- If Files Need to Go Concurrently with Execution:
- QSUB (see the sketch after this slide)
- Do Not background
- Do Not Do_TX7
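A sketch of the "QSUB, do not background" point: the running model job writes a tiny transfer script and hands it to NQS, then carries on computing. The queue name, hosts and paths are placeholders, not site configuration:

```sh
#!/bin/sh
# Inside the running SX-6 job: queue the file transfer as its own NQS
# request instead of running it in the background with the model's CPUs.
cat > /gfs/bluelink/run/send_fields.sh <<'EOF'
#!/bin/sh
rcp /gfs/bluelink/out/fields.nc gale:/data/bluelink/fields.nc
EOF

qsub -q transfer /gfs/bluelink/run/send_fields.sh   # hypothetical transfer queue
# The model continues immediately; the copy proceeds under NQS control.
```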
24 The TX7
- Do_TX7
- Unreliable (rcp stalls or failures, as an example)
- The SX-6 Job Waits, Using Processors and Memory
- This is a Reality on ALL Cluster Systems: the Processors and Memory are ALL YOURS
- NQS/ERS Can Overcommit, But Other Issues Arise
- Provides Tight Coupling
- Mimics the SX-5, Where Your Job Controlled Files
- Developed to Assist Conversion from SX-5 to SX-6
25 The TX7 (2)
- Background Tasks: Why Not?
- Background tasks are tightly coupled
- Synchronisation is easy
- The scheduling part of this talk should show why not!
26 The TX7 (3)
- submit (NQS) to rtds, TX7, gale, cherax, etc.
- Reliable; Failures are Notified
- The SX-6 Job becomes Asynchronous
- Releases Resources ASAP
- Does Not Provide Tight Coupling
- Push Files to the SX-6 from gale, etc.
- Submit the SX-6 Job
- Pull Files off the SX-6 from gale, etc.
- This is the World of Cluster Computing (see the sketch after this slide)
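A sketch of the push / submit / pull pattern as it might look when driven from gale. All hosts, paths, queue names and file names are placeholders; the slide's `submit` utility is not reproduced here, plain rcp/rsh/qsub stand in for it:

```sh
#!/bin/sh
# 1. Push input data onto the SX-6 GFS before the job needs it.
rcp restart.nc sx6:/gfs/bluelink/run/restart.nc

# 2. Submit the model job; it holds CPUs only while it actually runs.
rsh sx6 qsub -q production /gfs/bluelink/run/bluelink_job.sh

# 3. Afterwards (for example from a dependent job), pull the output back.
rcp sx6:/gfs/bluelink/run/ocean_daily.nc ./ocean_daily.nc
```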
27 Scheduling (4)
- The SX-6 will increasingly become Very Busy
- CPUs will Become Scarce Resources
- BlueLink will use 42 CPUs, for example
- EPS Multinode GASP and LAPS (around 100 CPUs total)
- Etc.
- Job Migration for non-RTO Work Will Be Common
- One Key Trade-off: Maximum Local-disk I/O Performance on a Quiet System vs Best Turnaround on a Full One
- Local File Migration Is Not Generally Viable
- Ethernet transfer (yes!) of 100 GB is 30 minutes
- Alternatives Under Investigation
- Development PSR to NEC
- Local Development to Move Local Files Through the IXS
28 Scheduling (5)
- Today RTO is pre-emptive
- RTO Jobs Start Immediately
- Tomorrow it Might Be Prioritised
- Constrained to a Subset of Nodes
- Queued in Order, Executed in Order
- SUPER-UX issue:
- I/O is not prioritised
- The TX7 only gives preference to an SX-6
- RTO and Research Compete Equally for I/O
29 What Does this Have to do with Surface Scheduling? What Does Gang Scheduling Mean for My Job? What Does Family Scheduling Mean for My Job?
30 Surface Scheduling
- Overcommitment
- Highest Resource Utilisation is the Goal
- Uses the CPU for other jobs during I/O and other waiting
- Requires Fast and Prompt System Response
- And Extra Jobs Waiting to Use the CPU
- Gang Scheduling is Mandatory when Overcommitting
- No Overcommitment
- Fastest Possible Completion Time is the Goal
- NWP production is a Realtime problem; it is not a supercomputer problem
31 Gang Scheduling is Processor Scheduling
- Developed by Cray Research for the X-MP
- Needed when parallel and less parallel jobs had to coexist on few processors
- Virtually Eliminated CPU Spin Waiting
- Enables Maximum Use of an Expensive Resource
- The CPUs were many US$100,000s each
- Goal: Highest Resource Utilisation
- Not Always Achievable
32 Pre-Gang Scheduling: CPU Time Slices (a CPU is assigned to a Job)
- [Figure: time slices 1-8 for Job1-Job5 plus idle CPUs, showing 8-way synchronization points and CPU spin waiting]
- 57 CPU Slices (in the example) are Spent Spinning
33 Gang Scheduling: CPU Time Slices (a CPU is assigned to a Job)
- [Figure: time slices 1-8 for Job1-Job5 plus idle CPUs]
- 37 CPU Slices Available for Small Jobs, 0 Spinning
34 Family Scheduling is Processor Scheduling
- Not Gang Scheduling
- Works When the System Surface Schedules
- Surface Scheduling is not a Computer Science Term and is Often called Static Scheduling
- Surface Scheduling: No Overcommitment of CPUs
- 1 task/process per CPU
- Compute, Wait on I/O, Wait on Messages
- Job_Cost = Resident_Time x Number_CPUs x $_per_sec (so a job holding 42 CPUs for an hour costs 42 CPU-hours whether those CPUs compute or spin)
- Tasks/Processes Waiting on I/O or Messages Will Spin Wait
- No Attempt to Recover Cycles for Other Work
- Result: Shortest Possible Elapsed Run Time
35 Gang Scheduling vs Family Scheduling: Elapsed Time
- [Figure: elapsed time for Job1-Job5 and idle time, using GS vs using FS]
36 Family Scheduling: Elapsed Time
- [Figure: elapsed time for Job1-Job5, with messaging/synchronization points and idle time]
37 Cluster Optimisation
- Load Balance Across Processors is Desired
- Elapsed Time is the Longest Running Thread
- Minimise I/O Time
- Your program is Spinning During I/O
- Minimise Message Passing Time
- Your program is Spinning During MPI
- Vectorise and Optimise
- Highest Vectorisation does not mean Best Performance in All Cases, But it Most Often Does
38 Final Thoughts
39
- Appreciate System Scheduling Issues
- Use Well Thought Out I/O
- A floppy disk only goes so fast, as does a Fibre RAID
- Copies of Data Are Often Good for Parallel Reads
- Consider the SX-6 and TX7 Relationship
- Submit is Better than Do_TX7 Over the Long Term
- Push Data to the GFS Before the Job
- Pull Data from the GFS After the Job
- Use NQS
40 Questions?