Title: HTC in Research
1. HTC in Research and Education
2. Claims for benefits provided by Distributed Processing Systems
- High Availability and Reliability
- High System Performance
- Ease of Modular and Incremental Growth
- Automatic Load and Resource Sharing
- Good Response to Temporary Overloads
- Easy Expansion in Capacity and/or Function
"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978
3. Democratization of Computing: You do not need to be a super-person to do super-computing
4. Searching for small RNA candidates in a kingdom: 45 CPU days
[Workflow diagram: NCBI FTP sequence files (.ffn, .fna, .ptt, .gbk) are parsed (IGRExtract3, FFN_parse) and fed through RNAMotif, FindTerm, TransTerm, and BLAST; sRNAPredict combines conservation, terminator, and known sRNA/riboswitch evidence into candidate loci; sRNA_Annotate adds homology, paralogy, synteny, secondary-structure conservation (QRNA), and TFBS (Patser) evidence to yield annotated candidate sRNA-encoding genes]
5. Education and Training
- Computer Science: develop and implement novel HTC technologies (horizontal)
- Domain Sciences: develop and implement end-to-end HTC capabilities that are fully integrated in the scientific discovery process (vertical)
- Experimental methods: develop and implement a curriculum that harnesses HTC capabilities to teach how to use modeling and numerical data to answer scientific questions
- System Management: develop and implement a curriculum that uses HTC resources to teach how to build, deploy, maintain, and operate distributed systems
6- "As we look to hire new graduates, both at the
undergraduate and graduate levels, we find that
in most cases people are coming in with a good,
solid core computer science traditional education
... but not a great, broad-based education in all
the kinds of computing that near and dear to our
business." - Ron BrachmanVice President of Worldwide Research
Operations, Yahoo
7. - Yahoo! Inc., a leading global Internet company, today announced that it will be the first in the industry to launch an open source program aimed at advancing the research and development of systems software for distributed computing. Yahoo's program is intended to leverage its leadership in Hadoop, an open source distributed computing sub-project of the Apache Software Foundation, to enable researchers to modify and evaluate the systems software running on a 4,000-processor supercomputer provided by Yahoo. Unlike other companies and traditional supercomputing centers, which focus on providing users with computers for running applications and for coursework, Yahoo's program focuses on pushing the boundaries of large-scale systems software research.
8. 1986-2006: Celebrating 20 years since we first installed Condor in our CS department
9. Integrating Linux Technology with Condor (Kim van der Riet, Principal Software Engineer)
10. What will Red Hat be doing?
- Red Hat will be investing in the Condor project locally in Madison, WI, in addition to driving work required in upstream and related projects. This work will include:
- Engineering on Condor features and infrastructure
- Should result in tighter integration with related technologies
- Tighter kernel integration
- Information transfer between the Condor team and Red Hat engineers working on things like Messaging, Virtualization, etc.
- Creating and packaging Condor components for Linux distributions
- Support for Condor packaged in RH distributions
- All work goes back to upstream communities, so this partnership will benefit all.
- Shameless plug: if you want to be involved, Red Hat is hiring...
11. High Throughput Computing on Blue Gene
- IBM Rochester: Amanda Peters, Tom Budnik
- With contributions from:
- IBM Rochester: Mike Mundy, Greg Stewart, Pat McCarthy
- IBM Watson Research: Alan King, Jim Sexton
- UW-Madison Condor: Greg Thain, Miron Livny, Todd Tannenbaum
12. Condor and IBM Blue Gene Collaboration
- Both IBM and Condor teams engaged in adapting code to bring Condor and Blue Gene technologies together
- Initial Collaboration (Blue Gene/L)
- Prototype/research Condor running HTC workloads on Blue Gene/L
- Condor-developed dispatcher/launcher running HTC jobs
- Prototype work for Condor being performed on Rochester On-Demand Center Blue Gene system
- Mid-term Collaboration (Blue Gene/L)
- Condor supports HPC workloads along with HTC workloads on Blue Gene/L
- Long-term Collaboration (Next Generation Blue Gene)
- I/O Node exploitation with Condor
- Partner in design of HTC services for Next Generation Blue Gene
- Standardized launcher, boot/allocation services, job submission/tracking via database, etc.
- Study ways to automatically switch between HTC/HPC workloads on a partition
- Data persistence (persisting data in memory across executables)
- Data affinity scheduling
- Petascale environment issues
13. The Grid Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.
The grid promises to fundamentally change the way
we think about and use computing. This
infrastructure will connect multiple regional and
national computational grids, creating a
universal source of pervasive and dependable
computing power that supports dramatically new
classes of applications. The Grid provides a
clear vision of what computational grids are, why
we need them, who will use them, and how they
will be programmed.
14. - "We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in production mode continuously even in the face of component failures."
Miron Livny and Rajesh Raman, "High Throughput Resource Management", in The Grid Blueprint for a New Computing Infrastructure.
15. (No transcript)
16. CERN '92
17. The search for SUSY (Super-Symmetry)
- Sanjay Padhi is a UW Chancellor Fellow who is working in the group of Prof. Sau Lan Wu located at CERN (Geneva)
- Using Condor technologies he established a grid access point in his office at CERN
- Through this access point he managed to harness, in 3 months (12/05-2/06), more than 500 CPU years from the LHC Computing Grid (LCG), the Open Science Grid (OSG), the Grid Laboratory of Wisconsin (GLOW), and local group-owned desktop resources.
18. High Throughput Computing
- We first introduced the distinction between High
Performance Computing (HPC) and High Throughput
Computing (HTC) in a seminar at the NASA Goddard
Flight Center in July of 1996 and a month later
at the European Laboratory for Particle Physics
(CERN). In June of 1997 HPCWire published an
interview on High Throughput Computing.
19. Why HTC?
- For many experimental scientists, scientific
progress and quality of research are strongly
linked to computing throughput. In other words,
they are less concerned about instantaneous
computing power. Instead, what matters to them is
the amount of computing they can harness over a
month or a year --- they measure computing power
in units of scenarios per day, wind patterns per
week, instruction sets per month, or crystal
configurations per year.
20. High Throughput Computing is a 24-7-365 activity
FLOPY ≠ (60 × 60 × 24 × 7 × 52) FLOPS
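The arithmetic behind the slide: 60 × 60 × 24 × 7 × 52 = 31,449,600 seconds in a 52-week year, so a machine running at its peak rate nonstop would deliver that many times its FLOPS rating per year. The inequality says real yearly throughput falls well short of that bound. A minimal Python sketch of the comparison (the availability and efficiency figures are purely illustrative assumptions, not numbers from the talk):

```python
# Illustrative sketch: FLOPY (floating point operations per year) is not just
# peak FLOPS scaled by the seconds in a year, because machines are neither
# available nor efficient 100% of the time.

SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52  # 31,449,600 seconds in a 52-week year

def peak_bound_flopy(peak_flops: float) -> float:
    """Upper bound: the machine runs at peak speed, nonstop, all year."""
    return peak_flops * SECONDS_PER_YEAR

def sustained_flopy(peak_flops: float, availability: float, efficiency: float) -> float:
    """Hypothetical sustained yearly total: down-time and per-job overheads
    shrink the throughput that HTC users actually care about."""
    return peak_flops * SECONDS_PER_YEAR * availability * efficiency

if __name__ == "__main__":
    peak = 1e9  # assume a 1 GFLOPS workstation (illustrative)
    print(f"peak-based bound   : {peak_bound_flopy(peak):.3e} FLOP/year")
    print(f"sustained (example): {sustained_flopy(peak, 0.70, 0.50):.3e} FLOP/year")
```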
21. High Throughput Computing (EPFL '97)
- Miron Livny
- Computer Sciences
- University of Wisconsin-Madison
- miron@cs.wisc.edu
22. Customers of HTC
- Most HTC applications follow the Master-Worker paradigm, where a group of workers executes a loosely coupled heap of tasks controlled by one or more masters (a minimal sketch of the pattern follows below).
- Job Level: tens to thousands of independent jobs
- Task Level: a parallel application (PVM, MPI-2) that consists of a small group of master processes and tens to hundreds of worker processes.
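As a concrete illustration of the Master-Worker pattern, here is a minimal, self-contained Python sketch (thread-based, with made-up names; this is not Condor's API): one master fills a queue with independent tasks, a small pool of workers drains it, and results come back on a second queue.

```python
# Minimal master-worker sketch (illustrative only): a master produces a heap
# of independent tasks, a pool of workers consumes them and returns results.
import queue
import threading

def worker(tasks: "queue.Queue", results: "queue.Queue") -> None:
    while True:
        task = tasks.get()
        if task is None:                     # sentinel: no more work
            tasks.task_done()
            break
        results.put((task, task * task))     # stand-in for a real computation
        tasks.task_done()

def master(num_workers: int = 4, num_tasks: int = 20) -> list:
    tasks, results = queue.Queue(), queue.Queue()
    pool = [threading.Thread(target=worker, args=(tasks, results))
            for _ in range(num_workers)]
    for t in pool:
        t.start()
    for i in range(num_tasks):               # loosely coupled heap of tasks
        tasks.put(i)
    for _ in pool:                            # one sentinel per worker
        tasks.put(None)
    tasks.join()
    for t in pool:
        t.join()
    return [results.get() for _ in range(num_tasks)]

if __name__ == "__main__":
    print(sorted(master()))
```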
23. The Challenge
- Turn large collections of existing distributively owned computing resources into effective High Throughput Computing environments
- Minimize Wait while Idle
24. Obstacles to HTC
- Ownership Distribution (Sociology)
- Size and Uncertainties (Robustness)
- Technology Evolution (Portability)
- Physical Distribution (Technology)
25. Sociology
- Make owners (and system administrators) happy.
- Give owners full control over:
- when and by whom private resources are used for HTC
- the impact of HTC on private Quality of Service
- membership and information on HTC-related activities
- No changes to existing software, and make it easy to install, configure, monitor, and maintain.
Happy owners → more resources → higher throughput
(A toy sketch of such an owner policy follows below.)
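To make the "full control" idea concrete, here is a toy Python sketch of the kind of owner-specified policy an HTC system can honor before borrowing a private machine. The attribute names and thresholds are hypothetical; they model the idea, not Condor's actual policy language or defaults.

```python
# Toy sketch of an owner-controlled usage policy (hypothetical names and
# thresholds): the owner decides when, and by whom, the machine may be used.
from dataclasses import dataclass

@dataclass
class MachineState:
    keyboard_idle_s: float   # seconds since the owner last touched the machine
    load_avg: float          # current CPU load from the owner's own work

@dataclass
class OwnerPolicy:
    min_idle_s: float = 15 * 60        # only borrow after 15 minutes of idleness
    max_load: float = 0.3              # and only if the owner's load is low
    allowed_users: tuple = ("htc",)    # who may use the private resource

    def may_start_htc_job(self, state: MachineState, user: str) -> bool:
        return (user in self.allowed_users
                and state.keyboard_idle_s >= self.min_idle_s
                and state.load_avg <= self.max_load)

if __name__ == "__main__":
    policy = OwnerPolicy()
    print(policy.may_start_htc_job(MachineState(20 * 60, 0.1), "htc"))  # True
    print(policy.may_start_htc_job(MachineState(30.0, 0.1), "htc"))     # False: owner active
```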
26. Sociology
- Owners look for a verifiable contract with the HTC environment that spells out the rules of engagement.
- System administrators do not like weird distributed applications that have the potential of interfering with the happiness of their interactive users.
27. Robustness
- To be effective, an HTC environment must run as a 24-7-365 operation.
- Customers count on it.
- Debugging and fault isolation may be a very time-consuming process.
- In a large distributed system, everything that might go wrong will go wrong.
Robust system → less down time → higher throughput
28. Portability
- To be effective, the HTC software must run on and support the latest and greatest hardware and software.
- Owners select hardware and software according to their needs and tradeoffs.
- Customers expect it to be there.
- Application developers expect only a few (if any) changes to their applications.
Portability → more platforms → higher throughput
29. Technology
- An HTC environment is a large, dynamic, and evolving Distributed System
- Autonomous and heterogeneous resources
- Remote file access
- Authentication
- Local and wide-area networking
30. Robust and Portable Mechanisms Hold the Key to High Throughput Computing
Policies play only a secondary role in HTC
31. Leads to a bottom-up approach to building and operating distributed systems
32. My jobs should run ...
- on my laptop if it is not connected to the network
- on my group resources if my certificate expired
- ... on my campus resources if the meta-scheduler is down
- on my national resources if the trans-Atlantic link was cut by a submarine
(A toy sketch of this graceful-fallback idea follows below.)
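A toy Python sketch of that graceful fallback: whatever resource tiers remain reachable are still used, so jobs keep running even when only the laptop is left. All tier names and availability checks here are hypothetical.

```python
# Toy sketch (hypothetical names/checks): keep running on whatever resource
# tiers are still usable, falling back all the way to the local laptop.
from typing import Callable, List, Tuple

def usable_tiers(tiers: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Return the names of every tier whose availability check passes."""
    return [name for name, is_available in tiers if is_available()]

if __name__ == "__main__":
    tiers = [
        ("laptop",   lambda: True),    # always there, even off the network
        ("group",    lambda: False),   # e.g. certificate expired
        ("campus",   lambda: False),   # e.g. meta-scheduler down
        ("national", lambda: False),   # e.g. trans-Atlantic link cut
    ]
    print(usable_tiers(tiers))  # -> ['laptop']: jobs still run locally
```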
33. The Open Science Grid (OSG)
- Miron Livny, OSG PI and Facility Coordinator
- Computer Sciences Department
- University of Wisconsin-Madison
Supported by the Department of Energy Office of Science SciDAC-2 program from the High Energy Physics, Nuclear Physics, and Advanced Software and Computing Research programs, and by the National Science Foundation Math and Physical Sciences, Office of CyberInfrastructure, and Office of International Science and Engineering Directorates.
34. The Evolution of the OSG
[Timeline, 1999-2009: PPDG (DOE), GriPhyN (NSF), iVDGL (NSF), and the DOE Science Grid (DOE) converge through Trillium and Grid3 (DOE+NSF) into the OSG, alongside LIGO preparation and operation, LHC construction, preparation, and operations, the European Grid / Worldwide LHC Computing Grid, and campus and regional grids]
35. The Open Science Grid vision
- Transform processing- and data-intensive science through a cross-domain, self-managed, national distributed cyber-infrastructure that brings together campus and community infrastructure and facilitates the needs of Virtual Organizations (VOs) at all scales.
36. D0 Data Re-Processing
[Plots: Total Events and OSG CPU-Hours/Week]
- 12 sites contributed up to 1,000 jobs/day
- 2M CPU hours, 286M events, 286K jobs on OSG
- 48 TB input data, 22 TB output data
37. The Three Cornerstones
- National
- Campus
- Community
These need to be harmonized into a well-integrated whole.
38. OSG challenges
- Develop the organizational and management structure of a consortium that drives such a Cyber Infrastructure
- Develop the organizational and management structure for the project that builds, operates, and evolves such Cyber Infrastructure
- Maintain and evolve a software stack capable of offering powerful and dependable capabilities that meet the science objectives of the NSF and DOE scientific communities
- Operate and evolve a dependable and well-managed distributed facility
39. 6,400 CPUs available: Campus Condor pool backfills idle nodes in PBS clusters
- Provided 5.5 million CPU-hours in 2006, all from idle nodes in clusters
- Use on TeraGrid: 2.4 million hours in 2006 spent building a database of hypothetical zeolite structures
- 2007: 5.5 million hours allocated to TG
http://www.cs.wisc.edu/condor/PCW2007/presentations/cheeseman_Purdue_Condor_Week_2007.ppt
40. Clemson Campus Condor Pool
- Machines in 27 different locations on campus
- 1,700 job slots
- >1.8M hours served in 6 months
- Users from Industrial and Chemical Engineering, and Economics
- Fast ramp-up of usage
- Accessible to the OSG through a gateway
41. Grid Laboratory of Wisconsin
2003 initiative funded by NSF (MRI)/UW at $1.5M. Second phase funded in 2007 by NSF (MRI)/UW at $1.5M.
Six initial GLOW sites:
- Computational Genomics, Chemistry
- AMANDA, IceCube, Physics/Space Science
- High Energy Physics/CMS, Physics
- Materials by Design, Chemical Engineering
- Radiation Therapy, Medical Physics
- Computer Science
Diverse users with different deadlines and usage patterns.
42. GLOW Usage 4/04-11/08: Over 35M CPU hours served!
43. The next 20 years
- We all came to this meeting because we believe in the value of HTC and are aware of the challenges we face in offering researchers and educators dependable HTC capabilities.
- We all agree that HTC is not just about technologies but is also very much about people: users, developers, administrators, accountants, operators, policy makers, ...