Title: Planning on the Grid
1Planning on the Grid
- With slides contributed by Ewa Deelman and Yolanda Gil
2Thinking about applications of planning
- You've seen "Planning as X"
- X = SAT, CSP, ILP, ...
- Now: "Y as Planning"
- Y = Grid/Web services composition, ...
3Problem-solving on Grids
- Users pool access to distributed resources (computers, instruments, data, ...)
- Applications are often composed of separate components run at several locations
- Grid middleware tools allow for job scheduling and resource discovery, e.g. the Globus Toolkit
4The Computational Grid
- Emerging computational and networking infrastructure
- bring together compute resources, data storage systems, instruments, human resources
- Enable entirely new approaches to applications and problem solving
- remote resources the rule, not the exception
- can solve ever bigger problems
- Wide-area distributed computing
- national and international
- Facilitate collaborative environments
- Sharing of data which can be expensive to produce (experimentation/simulation)
5Example: LIGO Experiment (Laser Interferometer Gravitational-Wave Observatory)
- Aims to detect gravitational waves predicted by the theory of relativity.
- Can be used to detect
- binary pulsars
- mergers of black holes
- starquakes in neutron stars
- Two installations, in Louisiana (Livingston) and Washington State (Hanford)
- Other projects: Virgo (Italy), GEO (Germany), Tama (Japan)
- Instruments are designed to measure the effect of gravitational waves on test masses suspended in vacuum.
- Data collected during experiments is a collection of time series (multi-channel)
- Analysis is performed in the time and Fourier domains
6LIGO's Pulsar Search (Laser Interferometer Gravitational-wave Observatory)
[Diagram: pulsar search pipeline. Labels: long time frames (30 minutes), short time frames, single frame, extract channel, Short Fourier Transform, transpose, time-frequency image, extract frequency range, construct image, find candidate, store, event DB]
7Motivation: Using Today's Grid
- Users have high-level requirements, naturally stated in terms of the application domain
- Ex: Obtain frequency spectrum for signal S in instrument I and timeframe T
- Users have to turn these requirements into executable job workflows in detailed scripts
- Users must figure out which code generates desired products, which files contain it, physical location of the files, hosts that support execution given code requirements, availability of hosts, access policies, etc.
- Users must query Grid middleware: metadata catalog, replica locator, resource descriptor and monitoring, etc.
- Users must oversee execution
8Problems with today's Grid
- Usability: users must be proficient in grid computing
- Complexity: many interrelated choices and dead ends
- Solution cost: any-cost solutions are already hard
- Global cost optimization: necessary when there is contention
- Reliability of execution: job resubmission upon failure
9Planning for workflow generation and maintenance
- Outline
- Formalization as a planning problem
- Integration with the grid middleware
- Case study: planning for workflows in LIGO
- The grid as a test bed for planning and scheduling research
11Desiderata for workflow generator
- Allow users to refer to data requirements by descriptions, not file names
- Intuitive, requires far less input
- Seek high-quality workflows according to a variable metric
- Model a variety of constraints declaratively
- Data dependencies, resource constraints, user access rights, ...
12Planning for workflow generation and maintenance
- Outline
- Formalization as a planning problem
- Integration with the grid middleware
- Case study: planning for workflows in LIGO
- The grid as a test bed for planning and scheduling research
13Planning for workflow generation
- Application components as operators
- Desired data as goals
- World state includes available hosts, existing data products, network bandwidths, ...
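A minimal sketch of this formalization, in Python rather than the planner's own operator language: components become operators with preconditions and add-effects, existing data products form the initial state, and the requested product is the goal. The operator names and the breadth-first search are illustrative assumptions, not the system described in these slides; hosts and bandwidths are left out for brevity.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    preconds: frozenset  # facts that must hold before the component runs
    adds: frozenset      # facts (data products) the component creates


def plan(initial, goals, operators):
    """Breadth-first forward search; returns a shortest list of operator names, or None."""
    start = frozenset(initial)
    goals = frozenset(goals)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goals <= state:
            return steps
        for op in operators:
            # Apply an operator only if it is applicable and adds something new.
            if op.preconds <= state and not op.adds <= state:
                nxt = state | op.adds
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [op.name]))
    return None


# Hypothetical three-step pipeline, loosely modeled on the pulsar search.
OPS = [
    Operator("create-sft", frozenset({"raw-frames"}), frozenset({"sft"})),
    Operator("extract-band", frozenset({"sft"}), frozenset({"ssft"})),
    Operator("pulsar-search", frozenset({"ssft"}), frozenset({"pulsar-result"})),
]

full_plan = plan({"raw-frames"}, {"pulsar-result"}, OPS)
reuse_plan = plan({"raw-frames", "ssft"}, {"pulsar-result"}, OPS)
print(full_plan)
print(reuse_plan)
```

Because existing data products are part of the world state, the second call finds that the extracted band is already available and plans only the final step.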
14Existing tools for building workflows: abstract workflow generation
- Chimera
- Input-output transforms for files, in Virtual Data Language
DV third1->pulsar(
  a=@{input:"H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd"},
  b=@{output:"H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.ilwd"},
  t1="714384000", t2="714384255", format="ilwd",
  channel="LSC-AS-Q", fcenter="50.5", fband="0.004", instrument="H2",
  ra="3.123643", de="2.56234", fderv1="0.0", fderv2="0.0",
  fderv3="0.0", fderv4="0.0", fderv5="0.0")
15Planning operator
(operator pulsar-search
  (preconds
    (
     ( 7143800)
     ( LSC-AS-Q)
     ( 0.5)
     ( 50)
     ( 20)
    )
    (and
      (created H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd))

  (effects
    ()
    ( (add
        (created H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.ilwd))
    )
  ))
16Operator with metadata parameters
(operator pulsar-search
  (preconds
    (
     ( Number)
     ( Channel)
     ( Number)
     ( Number)
     ( Number)
     ( File-Handle)
     ;; These two are parameters for the frequency-extract.
     ( (and Number (get-low-freq-from-center-and-band )))
     ( (and Number (get-high-freq-from-center-and-band )))
    )
    (and
      (forall ((
        (and File-Group-Handle
          (gen-sub-sft-range-for-pulsar-search

  (effects
    ()
    (
     (add (created ))
     (add (pulsar ))
    )
  ))
17Operator with host identified
(operator pulsar-search
  (preconds
    (( (or Condor-pool Mpi))
     ( Number)
     ( Channel)
     ( Number)
     ( Number)
     ( Number)
     ( File-Handle)
     ;; These two are parameters for the frequency-extract.
     ( (and Number (get-low-freq-from-center-and-band )))
     ( (and Number (get-high-freq-from-center-and-band )))
     ( (and Number
         (estimate-pulsar-search-run-time )))
    )

  (effects
    ()
    (
     (add (created ))
     (add (at ))
     (add (pulsar ))
    )
  ))
18Planning for workflow generation
- Application components as operators
- Parameters include the host, so a plan is a concrete workflow
- Desired data (in descriptive form) as goals
- World state includes available hosts, existing data products, network bandwidths, ...
19Operator descriptions
- Represent applying a given component at a particular location with fixed parameters, inputs and outputs.
- Preconditions combine
- Data dependencies: derive input requirements from outputs
- Task constraints: e.g. component must be run on an MPI machine
20Plan quality
- Objective function may include
- Performance: expected runtime, variance
- Reliability: probability of failure, expected number of retries
- Computational cost: use of expensive resources, conformance to policies
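One way to combine these terms is a weighted sum over per-plan estimates. The sketch below is an assumption for illustration: the field names, the weights, and the independence assumption behind the retry formula are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    runtime: float           # expected runtime
    runtime_variance: float  # variance of the runtime estimate
    failure_prob: float      # chance that one submission of the workflow fails
    resource_cost: float     # charge for expensive resources (arbitrary units)


def expected_retries(failure_prob):
    """Mean number of resubmissions before success, assuming independent attempts."""
    return failure_prob / (1.0 - failure_prob)


def plan_cost(p, w_time=1.0, w_var=0.0, w_retry=0.0, w_cost=0.0):
    """Weighted objective; lower is better. The weights encode the chosen metric."""
    return (w_time * p.runtime
            + w_var * p.runtime_variance
            + w_retry * expected_retries(p.failure_prob)
            + w_cost * p.resource_cost)


# Hypothetical candidates: one fast but flaky, one slower but reliable.
fast = PlanEstimate(runtime=100.0, runtime_variance=25.0, failure_prob=0.5, resource_cost=10.0)
steady = PlanEstimate(runtime=150.0, runtime_variance=4.0, failure_prob=0.0, resource_cost=2.0)

by_runtime = min([fast, steady], key=plan_cost)
by_reliability = min([fast, steady], key=lambda p: plan_cost(p, w_retry=100.0))
```

Changing the weights changes which plan wins, which is what a "variable metric" amounts to here: runtime alone favors `fast`, while a heavy retry penalty favors `steady`.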
21Using local heuristics and global metrics
- Need local heuristics since the search space is intractable
- e.g. prefer host for program with high-bandwidth connection to where the output is required
- Need to test a global metric (e.g. overall runtime) since local heuristics can lead to a globally poor solution
- Create as many plans as possible, return best
- Search control to eliminate redundant solutions
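The gap between a local heuristic and a global metric can be seen on a toy two-task workflow. The hosts, timings and the "fastest host per task" heuristic below are hypothetical and simpler than the bandwidth heuristic named above; exhaustive enumeration is used only because the example is tiny.

```python
from itertools import product

# Hypothetical numbers: run time of each task on each host, and the time
# needed to move the intermediate file between hosts.
RUN_TIME = {
    "extract": {"hostA": 10, "hostB": 12},
    "search": {"hostA": 50, "hostB": 45},
}
TRANSFER_TIME = {
    ("hostA", "hostA"): 0, ("hostB", "hostB"): 0,
    ("hostA", "hostB"): 20, ("hostB", "hostA"): 20,
}
TASKS = ["extract", "search"]  # executed in this order


def total_runtime(assignment):
    """Global metric: run times plus transfers between consecutive tasks."""
    hosts = [assignment[t] for t in TASKS]
    run = sum(RUN_TIME[t][h] for t, h in zip(TASKS, hosts))
    move = sum(TRANSFER_TIME[(a, b)] for a, b in zip(hosts, hosts[1:]))
    return run + move


def greedy_assignment():
    """Local heuristic: each task independently picks its own fastest host."""
    return {t: min(RUN_TIME[t], key=RUN_TIME[t].get) for t in TASKS}


def best_assignment():
    """Enumerate every assignment and keep the best under the global metric."""
    hosts = sorted(RUN_TIME[TASKS[0]])
    candidates = (dict(zip(TASKS, combo))
                  for combo in product(hosts, repeat=len(TASKS)))
    return min(candidates, key=total_runtime)


greedy = greedy_assignment()
best = best_assignment()
```

Here the greedy choice splits the tasks across hosts and pays for the transfer, while the global search keeps both tasks on one host. On realistic problems the enumeration would be replaced by heuristic search that returns the best of the plans it manages to generate.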
22Example search heuristics
(control-rule only-transfer-from-loc-with-greatest-bandwidth
  (if (and (current-ops (transfer-file))
           (current-goal (at ))
           (true-in-state (at ))
           (true-in-state (at ))
           (higher-bandwidth )))
  (then reject bindings (( . ))))
(control-rule prefer-mpi-to-condor-for-pulsar-search
  (if (and (current-ops (pulsar-search))
           (type-of Mpi)
           (type-of Condor-pool)))
  (then prefer bindings (( . ))
                        (( . ))))
23Planning for workflow generation and maintenance
- Outline
- Formalization as a planning problem
- Integration with the grid middleware
- The grid as a test bed for planning and scheduling research
25Generating the planning problem
- Currently, static file representation for available hosts, bandwidths
- Query grid services prior to planning to find which relevant files exist
- Future versions will make dynamic queries
- Goal is translated from user request; plan is translated into DAG format suitable for grid scheduler.
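A rough sketch of the last step, turning an ordered plan into a DAG description by following data dependencies. The `JOB` and `PARENT ... CHILD` lines follow HTCondor DAGMan syntax; that the scheduler used here consumes exactly this format is an assumption, and the job names, file names and `.sub` submit files are hypothetical.

```python
def plan_to_dag(steps):
    """Convert plan steps into DAG description lines.

    steps: list of (job_name, input_files, output_files) in plan order.
    Inputs that no step produces are assumed to exist on the grid already.
    """
    producer = {}  # file name -> job that creates it
    lines = []
    edges = []
    for name, inputs, outputs in steps:
        lines.append(f"JOB {name} {name}.sub")  # submit file name is hypothetical
        for f in inputs:
            if f in producer:
                edge = (producer[f], name)
                if edge not in edges:  # one edge per job pair, even with many shared files
                    edges.append(edge)
        for f in outputs:
            producer[f] = name
    lines += [f"PARENT {p} CHILD {c}" for p, c in edges]
    return lines


example_steps = [
    ("sft1", [], ["a.sft"]),
    ("sft2", [], ["b.sft"]),
    ("search", ["a.sft", "b.sft", "old.sft"], ["out.ilwd"]),
]
dag_lines = plan_to_dag(example_steps)
print("\n".join(dag_lines))
```

The file `old.sft` yields no edge because no step in the plan creates it, mirroring the reuse of data products found by querying grid services before planning.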
26LIGO's Pulsar Search at SC02
- Used LIGO's data collected during the first scientific run of the instrument
- Targeted a set of 1000 locations: known pulsars or random locations
- Results of the analysis published to the LIGO Scientific Collaboration
- Performed using LDAS and compute and storage resources at Caltech, University of Southern California, University of Wisconsin Milwaukee.
27Summary: benefits of planning
- Automating workflow composition
- Just being addressed in Grid middleware
- Reasoning with explicit descriptions of data
- More intuitive for users
- Far fewer inputs required than at file level
- Better workflows by searching many plans
28Planning for workflow generation and maintenance
- Outline
- Existing Grid tools for workflow generation
- Formalization as a planning problem
- Integration with the grid middleware
- The grid as a test bed for planning and scheduling research
29Many areas of planning research relevant for grid
- Planning for a dynamic environment: plan monitoring and repair, planning under uncertainty
- Scheduling: resource reasoning, temporal reasoning
- Plan quality: learning, acquiring preferences, local search planning
- Planning for information gathering: integrating access to grid services with workflow creation
- Domain modeling: handling multiple ontologies, acquiring metadata descriptions, acquiring operators
30Fault-tolerant planning for a dynamic environment
- Grid resources become unavailable; queue lengths and network bandwidth change
- Exploring plan repair strategies, balance of work done off-line and on-line
- Modeling failures, keeping statistics for creating plans more likely to succeed, conditional plans, ...
31Fault-tolerant straw men
- Current version: build fully detailed plan offline, resource allocation is fixed
- Ignores world dynamics
- Build abstract plan (without specifying hosts) offline, use a matchmaker online
- Matchmaker makes local decisions only
32Global reasoning is needed for resource allocation
33Approaches for fault-tolerant planning in dynamic domains
- RAX (Jonsson et al.): general framework. As implemented
- offline: builds complete plan
- online: adjusts temporal intervals
- Combining planning and scheduling
- offline: build several abstract plans
- online: reason about critical path to instantiate each plan
- MDP/POMDP approaches
- Open area ...
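The slides do not say how the critical path would be computed. One standard option is a longest-path calculation over the task DAG, sketched below with hypothetical task names and run-time estimates.

```python
from functools import lru_cache


def critical_path(duration, parents):
    """Return (length, path) of the longest dependency chain in a task DAG.

    duration: task -> estimated run time
    parents:  task -> list of prerequisite tasks (absent means none)
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # Earliest finish time of `task` and the chain of tasks that determines it.
        best_len, best_path = 0.0, ()
        for p in parents.get(task, ()):
            length, path = finish(p)
            if length > best_len:
                best_len, best_path = length, path
        return best_len + duration[task], best_path + (task,)

    return max((finish(t) for t in duration), key=lambda r: r[0])


# Hypothetical abstract plan: two transforms feed an extraction, then a search.
DURATION = {"sft1": 10, "sft2": 30, "extract": 5, "search": 40}
PARENTS = {"extract": ["sft1", "sft2"], "search": ["extract"]}

length, path = critical_path(DURATION, PARENTS)
```

Tasks off the critical path (here `sft1`) have slack, so an online step could place them on slower or busier hosts without lengthening the overall workflow.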
34Challenge: understanding when different approaches are more important
- Hypotheses
- Uneven task distribution, in terms of computational and data expense and resource constraints, will indicate global planning
- Time-dependency, e.g. need to re-plan during execution, will indicate local planning
- Interesting project: use experiments in synthetic and real domains to test hypotheses and uncover new insights
35Empirical tests with synthetic LIGO problems
- Example: Problem requires 100 files on one machine. Vary the number that exist.
36Domain modeling
[Diagram: current system. Captions: "Knowledge from several sources must be used"; "KBs combined in one location". Labels: Info from Grid services (RLS, MCS, etc.), task requirements, existing data in files, state info (files, resources), user policies, available resources, Comp. selector, resource selector, monolithic planner, concrete tasks, resource queues, network bandwidth, exec. monitor, Grid task schedulers]
37Where does knowledge used by our planners come from?
[Diagram: an operator skeleton, (Operator (preconditions ..) (effects ..)), surrounded by its knowledge sources: task resource requirements, user policies and preferences, resource policies, data dependencies (VDL)]
Each knowledge component is used for other purposes beyond planning
38Automatically generated operators for several application domains
[Diagram: an operator skeleton, (Operator (preconditions ..) (effects ..)), with sources: task resource requirements, policies, data dependencies (VDL)]
Domains: Digital sky survey, LIGO, GEO, Galaxy morphology, Tomography
Investigating patterns of data descriptions for more efficient planning
39- Question: if operators are gathered from distributed services, can we still guarantee soundness and completeness?
- Under what kinds of conditions?
40Representing appropriate information units with metadata
- E.g. Have 60,000 files, want to allocate 60 tasks each dealing with 1,000 files.
- Previously, application components specified in terms of specific files
DV run59000->extractSFTData(
  input=@{input:"nSFT.59000"}, ..., @{input:"nSFT.59999"},
  output=@{output:"eSFT.59000"}, ..., @{output:"eSFT.59999"},
  t1="714384000", t2="714384063", freq=1008, band=4, instrument="H2")   (1000 files)
- 59 similar clauses
DV final->computeFStatistic(
  input=@{input:"eSFT.00000"}, ..., @{input:"eSFT.59999"}, ...)   (60000 files)
41Metadata representation
- Replace with two clauses, two input predicates
- A predicate now represents a range of files
- Simpler to model, greater generality, more efficient for reasoner
(operator run-extractSFTData-range
  (preconds
    (( Number)
     ( (and Number ( 0)))
     ( (and Number
         (gen-smaller-number 1000 ))))
    (and (range "eSFT" 2 1 )
         (range "nSFT" 2 1 999)))
  (effects ()
    ((add (range "eSFT" 2 )))))
42Requires library operators for ranges
- E.g. if a range of files exists, then so does any subrange
- Questions: what are the required operators? Similar to spatial calculus RCC-8?
(operator subranges-exist
  (preconds
    (( Number)
     ( Object)
     ( (and Number ( 0)))
     ( (and Number (gen-known-enclosing-begins 2 1 )))
     (
      (and Number (gen-known-enclosing-number-of-files 2 1
      ))))
    (created-range 2 1 ))
  (effects ()
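The range idea can be illustrated outside the planner. In the sketch below a range is a (prefix, begin, count) triple; this encoding, the helper names and the single-enclosing-range test are assumptions made for illustration and do not reproduce the operators above.

```python
from typing import NamedTuple


class FileRange(NamedTuple):
    prefix: str  # e.g. "nSFT"
    begin: int   # index of the first file
    count: int   # number of consecutive files

    @property
    def end(self):
        return self.begin + self.count  # exclusive upper bound


def split(rng, chunk):
    """Cut one range into consecutive sub-ranges of at most `chunk` files."""
    return [FileRange(rng.prefix, b, min(chunk, rng.end - b))
            for b in range(rng.begin, rng.end, chunk)]


def exists(wanted, known):
    """A range exists if a single known range with the same prefix encloses it."""
    return any(k.prefix == wanted.prefix
               and k.begin <= wanted.begin
               and wanted.end <= k.end
               for k in known)


# 60,000 input files become 60 task inputs of 1,000 files each.
tasks = split(FileRange("nSFT", 0, 60000), 1000)
known = [FileRange("eSFT", 0, 60000)]
```

One predicate per range stands in for a thousand per-file facts. The `exists` check covers only the "any subrange of an existing range exists" rule; a wanted range that is covered by two adjacent known ranges would need a further merge rule, which is the kind of library operator the slide asks about.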
43Conclusions
- Implemented system takes data description requests from LIGO users, composes a workflow and executes it on the Grid
- Planning and scheduling technologies can make a large contribution to Grid infrastructure
- Many interesting challenges for planning and scheduling research from Grid applications
- http://www.isi.edu/ikcap/cognitive-grids
- http://www.isi.edu/deelman/pegasus.htm
44Koehler and Srivastava
- Different approaches to specifying workflows by hand
45WSDL service specification (no workflow specified)
[WSDL listing. Recoverable values: namespace http://schemas.xmlsoap.org/wsdl/; names "OrderEvent", "TripRquest", "FlightRequest", "HotelRequest", "BookingFailure"; "pt1" with message "TripRequest"; "pt2" with message "HotelRequest"; operation "CIToFS"; ...; "pt9" with message "BookingFailure"]
46BPEL4WS
[BPEL4WS listing. Recoverable values: "pt1" operation "CToCI" container "OrderEvent"; "HotelService" portType "pt2" operation "CIToHS" inputContainer "HotelRequest"; "pt3" operation "CIToFS" inputContainer "FlightRequest"]
47Golog
48Back-up slides
49What is Needed
- We need alternative foundations that offer
- expressive representations
- flexible reasoners
- Many Artificial Intelligence (AI) techniques are relevant
- Planning to achieve given requirements
- Searching through problem spaces of related choices
- Using and combining heuristics
- Expressive knowledge representation languages
- Reasoners that can incorporate rules, definitions, axioms, etc.
- Schedulers and resource allocation techniques
50Existing tools for building workflows: abstract workflow generation
- Chimera
- Input-output transforms at level of actual files, in Virtual Data Language
DV first1->createSFT(
  b=@{output:"H2_SFT_LSC-AS-Q_714384000_64.gwf"},
  t1="714384000", t2="714384063", format="frame",
  channel="H2:LSC-AS-Q", instrument="H2")
DV first2->createSFT(
  b=@{output:"H2_SFT_LSC-AS-Q_714384064_64.gwf"},
  t1="714384064", t2="714384127", format="frame",
  channel="H2:LSC-AS-Q", instrument="H2")
DV third1->pulsar(
  a=@{input:"H2_sSFT_LSC-AS-Q_714384000_256_50_1.ilwd"},
  b=@{output:"H2_pulsar_LSC-AS-Q_714384000_256_50.5_0.004_3.123643_2.56234.ilwd"},
  t1="714384000", t2="714384255", format="ilwd",
  channel="LSC-AS-Q", fcenter="50.5", fband="0.004", instrument="H2",
  ra="3.123643", de="2.56234", fderv1="0.0", fderv2="0.0",
  fderv3="0.0", fderv4="0.0", fderv5="0.0")
52Existing tools 2: concrete planner
- Assigns specific hosts and data locations for tasks
- Makes random selection of resources and data
- Provided a feasible solution
- Reused existing data products
[Diagram labels: INPUT, OUTPUT]
53Sample Pulsar Search Results to Date
- SC 2002 run
- Over 58 pulsar searches
- Total of 330 tasks, 469 data transfers, 330 output files produced
- The total runtime was 112435.
- To date
- 185 pulsar searches
- Total of 975 tasks, 1365 data transfers, 975 output files
- Total runtime: 964947