BlueLink - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
BlueLink SX-6
2
  • Phil Tannenbaum once said,
  • "You can optimise your code until it is perfect,
    and we can make a small error in judgement and it
    will be all for nought!"

3
  • Corollaries, sort of,
  • Just because the SX-6 can do more than 12 GBps
    to SX-GFS does not guarantee BlueLink will!
  • The SX-6 can demonstrate 30% efficiency on real
    applications. BlueLink is based on MOM4, a
    scalable parallel code, but you know this.

4
Today's Topics
  • Intro
  • I/O Performance Issues
  • System Scheduling Issues
  • SX6 vs TX7 Utilisation

5
Intro
6
To Do Pre-Production
  • BlueLink is not a stellar performer on SX-6
  • Run on a Surface/Family Scheduled Queue
  • Do a VAMPIR analysis
  • Minimise messaging and I/O elapsed time
  • Not always the same as maximising performance
  • Pack out everything possible to minimise elapsed
    time

7
I/O Performance(Using a Production Mindset)
8
Local File System I/O
SX-GFS File System I/O
9
Local Disk Vs GFS
  • File System Performance
  • GFS File Systems are 2 and 4-way Striped
  • (Maybe All Will become 4-way)
  • Local File Systems are 1-way
  • GFS Should Provide from 80-350% of Local Disk
    Performance
  • If It Does Not, We Need to Understand Why. So Far
    there has Always Been a Reason (User Creativity
    is high on the list)

10
Local Disk vs GFS (2)
  • GFS Locking is like NFS, not Local Disk
  • NFS Locking means No Locking At All
  • You Must Be Careful Moving/Deleting GFS Files
  • Thought is Needed for
  • Moving GFS Files
  • Removing GFS Files
  • Assuming GFS Files are for 1 Job (Naming) While
    You Have Job(s) Running
  • Housekeeping Jobs Are Known Culprits

11
Additional Issues
  • The I/O Subsystem works in Chunks of
  • Allocation Sizes on Disk
  • Buffer Sizes in Cache
  • Runtime Buffer Sizes
  • GFS I/O < 64 KB goes by NFS, not GFS

12
Additional Issues (2)
  • All that Being Said,
  • File systems and disks can be bottlenecks
  • Best Disk Performance 160 MBps (2-way)
  • Best Disk Performance 250 MBps (4-way)
  • Both are limits you will probably not achieve
  • Consider 42 processors and 42 tasks
  • Writing to 1 File
  • 160 MBps is the maximum for 1 or 42 tasks
  • Writing to 42 files on 1 Disk
  • 160 MBps is the maximum for all 42 tasks

13
Additional Issues (3)
  • Consider 42 processors and 42 tasks
  • Writing to 42 Files on 42 Disks
    160 MBps is the maximum for each task
  • Theoretical aggregate is now 6 GBps
  • BUT
  • BlueLink is not alone
  • Inter-job contention will occur

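The arithmetic behind the two cases on these slides can be sketched as follows; the 160 MBps per-disk figure and the 42-task counts come from the slides, while the variable names are illustrative.

```python
# Aggregate I/O bandwidth model for the cases on the last two slides.
PER_DISK_MBPS = 160  # best-case 2-way striped disk, from the slides
TASKS = 42

# Case 1: 42 tasks writing to 1 file (or 42 files) on 1 disk:
# every task shares the single disk's bandwidth.
shared_per_task = PER_DISK_MBPS / TASKS
print(f"1 disk, 42 tasks: {shared_per_task:.1f} MBps each, "
      f"{PER_DISK_MBPS} MBps aggregate")

# Case 2: 42 tasks writing to 42 files on 42 disks:
# each task gets a whole disk, so the theoretical aggregate scales.
aggregate_mbps = PER_DISK_MBPS * TASKS
print(f"42 disks, 42 tasks: {aggregate_mbps} MBps theoretical aggregate "
      f"(~{aggregate_mbps / 1024:.1f} GBps)")
```

In practice, inter-job contention means neither limit is reached, as the slide notes.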
14
Additional Issues (4)
  • For production (not research)
  • Consider MMF file use
  • Consider Copies of Read-only files to Distribute
    I/O load across channels
  • Consider F_SETBUF Techniques
  • Almost memory speed while in buffer
  • Buffer Refills and CLOSE can be Costly
  • Temp Files
  • OPEN with large F_SETBUF set
  • REWIND and TRUNCATE prior to CLOSE to avoid
    flushing to disk when not necessary

15
F_SETBUF / NC_BLOCKSZ
  • If it is too big you read far more than you need
  • 1 GB F_SETBUF (or NC_BLOCKSZ)
  • A READ Moves 1 GB into the Runtime Buffer
  • If you only use 10 MB, You did an Extra 990 MB of I/O
  • The Disk is VERY BUSY
  • You Run Slowly
  • If it is too small
  • Repeated Buffer Filling Occurs per FORTRAN READ
  • Really Bad on GFS!
  • You Run Slowly

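The buffer-sizing trade-off above can be put in rough numbers. The figures mirror the slide's example (1 GB buffer, 10 MB used); the two helper functions are illustrative, not part of any NEC runtime API.

```python
# Rough model of the F_SETBUF / NC_BLOCKSZ trade-off described above.
# Using 1 GB = 1000 MB for round numbers, as the slide does.

def extra_io_mb(buffer_mb, useful_mb):
    """Extra data moved when the runtime buffer is bigger than needed:
    a READ fills the whole buffer even if only part of it is used."""
    return max(buffer_mb - useful_mb, 0)

def refills_needed(file_mb, buffer_mb):
    """Buffer refills to read a file sequentially: a buffer that is
    too small forces repeated refills, each a trip to the file system."""
    return -(-file_mb // buffer_mb)  # ceiling division

# Too big: 1 GB buffer, only 10 MB actually used -> 990 MB of extra I/O.
print(extra_io_mb(1000, 10), "MB of extra I/O")

# Too small: a 1 GB file through a 1 MB buffer -> 1000 refills,
# really bad when each refill goes to GFS.
print(refills_needed(1000, 1), "buffer refills")
```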
16-20
(No Transcript)
21
Scheduling
  • The SX6 is a Cluster not an SMP
  • Some Jobs Use Huge Resources
  • They cannot be easily reassigned without huge
    cost
  • Making decisions about auto-resource reassignment
    is non-trivial
  • Cluster Norms for Scheduling
  • Make Input Data Available on System
  • Assign Exclusive Processors and Memory
  • Execute Program
  • Move Output Data from System
  • Static Scheduling, No Overcommitment
  • Elapsed Time is the Only Performance Metric
  • The SX6 is Better than That, BUT.

22
Scheduling (2)
  • There is a delay in starting new jobs
    (hysteresis)
  • Queuing System Assesses Nodes
  • Allows for checkpointed job restart-in-progress
  • Allows for new job initialisation and spin-up
    delays

23
Scheduling (3)
  • Ideal Production Configuration
  • Push Files to the SX-6 from Data Source
  • Start Job on SX-6 and Let it Complete
  • Single Script, Preferably 1 Long Executable
  • Acquires Processors and Helps Scheduling
  • Pull Files from the SX-6 from Data Repository
  • If Files Need to Go Concurrently with Execution,
  • QSUB
  • Do Not background
  • Do Not DO_TX7

24
The TX7
  • Do_TX7
  • Unreliable (rcp stalls or failures, as an
    example)
  • SX6 Job Waits Using Processors and Memory
  • This is a Reality on ALL Cluster Systems: the
    Processors and Memory are ALL YOURS
  • NQS/ERS Can Overcommit, But Other Issues Arise
  • Provides Tight Coupling
  • Mimics SX5 Where Your Job Controlled Files
  • Developed to Assist Conversion from SX5 to SX6

25
The TX7 (2)
  • Background Tasks Why Not?
  • Background tasks are tightly coupled
  • Synchronisation is easy
  • The scheduling part of this talk should show
    why not!

26
The TX7 (3)
  • submit (NQS) to rtds, TX7, gale, cherax, etc
  • Reliable, Failures are Notified
  • SX6 Job becomes Asynchronous
  • Releases Resources ASAP
  • Does Not Provide Tight Coupling
  • Push Files to SX6 from gale, etc
  • Submit SX6 Job
  • Pull Files off SX6 from gale, etc
  • This is the World of Cluster Computing

27
Scheduling (4)
  • The SX6 will increasingly become Very Busy
  • CPUs will Become Scarce Resources
  • BlueLink will use 42 CPU, for example
  • EPS Multinode GASP and LAPS ( 100 CPU total)
  • Etc
  • Job Migration for non-RTO Will Be Common
  • One Key Trade-off is Maximum Local-disk I/O
    Performance on a Quiet System, for Best
    Turnaround on a Full One
  • Local File Migration Is Not Generally Viable
  • Ethernet transfer (yes!) of 100 GB is 30 minutes
  • Alternatives Under Investigation
  • Development PSR to NEC
  • Local Development to Move Local Files Through the
    IXS

28
Scheduling (5)
  • Today RTO is pre-emptive
  • RTO Jobs Start Immediately
  • Tomorrow it Might Be Prioritised
  • Constrained to Subset of Nodes
  • Queued in Order, Executed in Order
  • SUPER-UX issue:
  • I/O is not prioritised
  • TX7 only gives preference to an SX-6
  • Equal Competition for RTO and Research for I/O

29
What Does this Have to do with Surface
Scheduling? What Does Gang Scheduling Mean for
My Job? What Does Family Scheduling Mean for My
Job?
30
Surface Scheduling
  • Overcommitment
  • Highest Resource Utilisation is the Goal
  • Uses CPU for other jobs during I/O and Other
    waiting
  • Requires Fast and Prompt System Response
  • And Extra Jobs Waiting to Use the CPU
  • Gang Scheduling is Mandatory when Overcommitting
  • No Overcommitment
  • Fastest Possible Completion Time is the Goal
  • NWP production is a Realtime problem,
    it is not a supercomputer problem

31
Gang Scheduling is Processor Scheduling
  • Developed by Cray Research for the X-MP
  • Needed When parallel and less-parallel jobs had
    to coexist on few processors
  • Virtually Eliminated CPU Spin Waiting
  • Enables Maximum Use of an Expensive Resource
  • The CPUs were many US$100,000s each
  • Goal: Highest Resource Utilisation
  • Not Always Achievable

32
Pre-Gang Scheduling: CPU Time Slices (each CPU is
assigned to a Job)
[Diagram: 8 CPUs over time slices 1-8 running
Job1-Job5 and Idle, with 8-way Synchronization
Points and CPU Spin Waiting marked]
57 CPU Slices (in example) are Spent Spinning
33
Gang Scheduling: CPU Time Slices (each CPU is
assigned to a Job)
[Diagram: the same 8 CPUs and Job1-Job5, now gang
scheduled]
37 CPU Slices Available for Small Jobs, 0
Spinning
34
Family Scheduling is Processor Scheduling
  • Not Gang Scheduling
  • Works When System Surface Schedules
  • Surface Scheduling is not a Computer Science
    Term and is Often called Static Scheduling
  • Surface Scheduling = No Overcommitment of CPUs
  • 1 task/process per CPU
  • Compute, Wait I/O, Wait Message
  • Job_Cost = Resident_Time x Number_CPUs x
    $_per_sec
  • Tasks/Processes
  • Waiting on I/O or Messages Will Spin Wait
  • No Attempt to Recover Cycles for Other Work
  • Result: Shortest Possible Elapsed Run Time
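The cost relation on this slide (the "$" appears to have been lost in transcription) says a surface-scheduled job pays for all of its CPUs for its whole residence time, spinning or not. A minimal sketch, with a made-up charge rate:

```python
# Job_Cost = Resident_Time x Number_CPUs x $_per_sec, per the slide.
# Under surface scheduling, CPUs spin-waiting on I/O or messages are
# still held and still charged.

def job_cost(resident_time_s, num_cpus, dollars_per_cpu_sec):
    """Cost of a job that holds its CPUs for its whole residence time."""
    return resident_time_s * num_cpus * dollars_per_cpu_sec

RATE = 0.01  # illustrative $/CPU-second, not a real charge rate

# A 42-CPU BlueLink run resident for 1 hour pays for all 42 CPUs,
# including any time they spend spinning on I/O or MPI waits.
cost = job_cost(3600, 42, RATE)
print(f"${cost:.2f}")
```

This is why minimising I/O and messaging elapsed time, not just raw compute performance, matters for the production configuration.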
35
Gang Scheduling vs Family Scheduling: Elapsed Time
[Diagram: elapsed times of Job1-Job5 and Idle time,
Using GS vs Using FS]
36
Family Scheduling: Elapsed Time
[Diagram: Job1-Job5 with Messaging/Synchronization
Points and Idle time marked]
37
Cluster Optimisation
  • Load Balance Across Processors is Desired
  • Elapsed Time is that of the Longest-Running Thread
  • Minimise I/O Time
  • Your program is Spinning During I/O
  • Minimise Message Passing Time
  • Your program is Spinning During MPI
  • Vectorise and Optimise
  • Highest Vectorisation ≠ Best Performance in All
    Cases, But it is Most Often True

38
Final Thoughts
39
  • Appreciate System Scheduling Issues
  • Use Well Thought Out I/O
  • A floppy disk only goes so fast, as does a Fibre
    RAID
  • Copies of Data Are Often Good for Parallel Reads
  • Consider SX-6 and TX7 Relationship
  • Submit is Better than Do_TX7 Over the Long Term
  • Push Data to the GFS Before the Job
  • Pull Data from the GFS After the Job
  • Use NQS

40
Questions?