Title: EPP Grid Activities
1. EPP Grid Activities
- AusEHEP Wollongong
- Nov 2004
2. Grid Anatomy
- What are the essential components?
- CPU Resources & Middleware (software common interface)
- Data Resources & Middleware
  - replica catalogues unifying many data sources
- Authentication Mechanism
  - Certificates (Globus GSI), Certificate Authorities (see the proxy sketch after this slide)
- Virtual Organisation & Information Services
  - Grid consists of VOs: users & resources participating in a VO
  - Who is a part of what research/effort/group
  - Authorisation for resource use
- Job Scheduling, Dispatch, and Information Services
- Collaborative Information Sharing Services
  - Documentation & Discussion (web, wiki, ...)
  - Meetings & Conferences (video conf., AccessGrid)
  - Code & Software (CVS, CMT, PacMan)
  - Data & Information (Meta Data systems)
2nd Generation Grid: Globus
3rd Generation Grid
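As a concrete illustration of the authentication mechanism above: under Globus GSI a user holds a CA-signed certificate and creates a short-lived proxy before touching any resource. A minimal sketch using standard Globus Toolkit commands (certificate locations vary per site):

    # one-off: request a certificate from a recognised CA
    grid-cert-request                 # writes userkey.pem plus a request for the CA to sign
    # per-session: create a time-limited proxy credential
    grid-proxy-init -hours 12         # prompts for the private-key passphrase
    grid-proxy-info                   # check subject and remaining lifetime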
3. 2nd Generation
- Accessible resources for Belle/ATLAS
- We have access to around 120 CPUs (over 2 GHz)
  - APAC, AC3, VPAC, ARC
  - currently 50 are Grid accessible
  - Continuing to encourage HPC facilities to install middleware
- We have access to the ANUSF petabyte storage facility
  - Will request 100 TB for Belle data.
- SRB (Storage Resource Broker)
  - Replica catalogue federating KEK/Belle, ANUSF, Melbourne EPP data storage
  - Used to participate in Belle's 4×10^9 event MC production during 2004
4. 2nd Generation
- SRB (Storage Resource Broker)
  - Globally accessible virtual file system
- Domains of storage resources
  - e.g. the ANUSF domain contains the ANU petabyte storage facility and disk on Roberts in Melbourne
- Federations of Domains
  - e.g. ANUSF and KEK are federated

    Scd /anusf/home/ljw563.anusf
    Sls -l
    Sget datafile.mdst
    Scd /bcs20zone/home/srb.KEK-B
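Scommands like these assume an initialised SRB session (Sinit reads the user's MDAS environment file). A minimal sketch of ~/.srb/.MdasEnv; the hostname and resource name are placeholder assumptions:

    # ~/.srb/.MdasEnv -- illustrative values only
    srbUser 'ljw563'
    srbHost 'srb.anu.edu.au'          # placeholder hostname
    srbPort '5544'
    mdasDomainName 'anusf'
    mdasCollectionName '/anusf/home/ljw563.anusf'
    defaultResource 'anu-store'       # placeholder resource name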
5. Grid Anatomy
- What are the essential components?
- CPU Resources & Middleware
- Data Resources & Middleware
  - replica catalogues unifying many data sources
- Authentication Mechanism
  - Globus GSI, Certificate Authorities
- Virtual Organisation & Information Services
  - Grid consists of VOs: users & resources participating in a VO
  - Who is a part of what research/effort/group
  - Authorisation for resource use
- Job Scheduling, Dispatch, and Information Services
- Collaborative Information Sharing Services
  - Documentation & Discussion (web, wiki, ...)
  - Meetings & Conferences (AccessGrid)
  - Code & Software (CVS, CMT, PacMan)
  - Data & Information (Meta Data systems)
3rd Generation Grid
6. 3rd Generation Solutions
- NorduGrid → ARC (Advanced Resource Connector)
  - Nordic countries plus others like Australia
  - We've used this for ATLAS DC2 (a submission sketch follows this slide)
  - Globus 2.4 based middleware
  - Stable, patched, and redesigned collection of existing middleware (Globus, EDG)
- Grid3 Middleware → VDT
  - US based coordination between iVDGL, GriPhyN, PPDG
  - Globus 2.4 based middleware
- LHC Computing Grid (LCG) ← EDG → EGEE
  - Multiple Tiers: CERN T0, Japan/Taiwan T1, Australia T2?
  - Regional Operations Centre in Taiwan
  - Substantial recent development; needs to be looked at once again!
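Since ARC is what we used for ATLAS DC2, here is a minimal sketch of submitting through the standard ARC user interface; the job description contents and cluster name are illustrative assumptions:

    # job.xrsl -- minimal xRSL job description (contents illustrative)
    &(executable="run.csh")
     (inputFiles=("run.csh" ""))
     (stdout="out.log")(stderr="err.log")
     (cpuTime="120")
     (jobName="dc2-test")

    ngsub -c cluster.example.edu.au job.xrsl    # submit (placeholder cluster name)
    ngstat -a                                   # poll status of all jobs
    ngget <jobid>                               # fetch outputs when finished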
7. 3rd Generation Solutions
- Still a lot of development going on.
  - data-aware job scheduling is still developing
  - VO systems are starting to emerge
  - meta-data infrastructure is basic
- Deployment is still a difficult task.
  - prescribed system/OS only
8. Grid Anatomy
- What are the essential components?
- CPU Resources & Middleware
- Data Resources & Middleware
  - replica catalogues unifying many data sources
- Authentication Mechanism
  - Globus GSI, Certificate Authorities
- Virtual Organisation & Information Services
  - Grid consists of VOs: users & resources participating in a VO
  - Who is a part of what research/effort/group
  - Authorisation for resource use
- Job Scheduling, Dispatch, and Information Services
- Collaborative Information Sharing Services
  - Documentation & Discussion (web, wiki, ...)
  - Meetings & Conferences (AccessGrid)
  - Code & Software (CVS, CMT, PacMan)
  - Data & Information (Meta Data systems)
VOs
9. Virtual Organisation Systems
- Now there are 3 systems available
  - EDG/NorduGrid LDAP based VO (a grid-mapfile sketch follows this slide)
  - VOMS (VO Membership Service) from LCG
  - CAS (Community Authorisation Service) from Globus
- In 2003 we modified NorduGrid VO software for use with the Belle Demo Testbed, SC2003 HPC Challenge (world's largest testbed)
  - More useful for rapid Grid deployment than the above systems.
  - Accommodates Resource Owners' security policies
    - resource organisations are part of the community
    - their internal security policies are frequently ignored/by-passed
  - Takes CAs into account
    - certificate authorities are a part of the community
    - a VO should be able to list the CAs it trusts to sign certificates
  - Compatible with existing Globus
  - Might be of use/interest to the Australian Grid community?
  - GridMgr (Grid Manager)
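To make the LDAP-based VO model concrete: each resource periodically pulls the VO's member list and regenerates its local grid-mapfile, mapping certificate subjects to local accounts. A sketch using the EDG tool; the VO's LDAP URL and the local account are placeholder assumptions:

    # /etc/edg-mkgridmap.conf -- map a VO member list to a local pool account
    group ldap://grid-vo.example.org/ou=belle,dc=epp,dc=org .belle

    # run from cron on each resource to refresh the mapfile
    edg-mkgridmap --output /etc/grid-security/grid-mapfile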
10. Virtual Organisation Systems
- How do VOs manage internal priorities?
  - This problem has not yet become apparent!
  - It has been left up to local resource settings.
  - For non-VO resources, changes would require allocation or configuration renegotiation.
- CAS is the only VO middleware to address this
  - done by VOs specifying policies allowing/denying access to resources
  - local resource priorities are not taken into account
  - difficult to predict the effect
- VO managed job queue
  - centrally managed VO priorities, independent of locally managed resource priorities
  - resource job consumers pull jobs from the queue (sketched after this slide)
  - the VO decides and can change which jobs are run first
  - results of prototype testing of a fair-share system could be used
    - users/groups are allocated a target fraction of all resources
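A sketch of that pull model: each resource runs a consumer that asks the VO queue for the next job it should run. The vo-queue-* commands are invented names, purely to illustrate the flow:

    #!/bin/sh
    # hypothetical consumer loop run on each resource (vo-queue-* are invented)
    while true; do
      job=`vo-queue-next --vo belle`     # VO-side priority picks the job
      if [ -n "$job" ]; then
        vo-queue-run "$job"              # hand it to the local batch system
      else
        sleep 60                         # queue empty: poll again later
      fi
    done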
11. Grid Anatomy
- What are the essential components?
- CPU Resources & Middleware
- Data Resources & Middleware
  - replica catalogues unifying many data sources
- Authentication Mechanism
  - Globus GSI, Certificate Authorities
- Virtual Organisation & Information Services
  - Grid consists of VOs: users & resources participating in a VO
  - Who is a part of what research/effort/group
  - Authorisation for resource use
- Job Scheduling, Dispatch, and Information Services
- Collaborative Information Sharing Services
  - Documentation & Discussion (web, wiki, ...)
  - Meetings & Conferences (AccessGrid)
  - Code & Software (CVS, CMT, PacMan)
  - Data & Information (Meta Data systems)
Job Scheduling
12. Data Grid Scheduling
- Task → Job1, Job2, ...
- Job1 → input replica 1, input replica 2, ...
- Job1 + input → CPU resource 1, ...
- How do you determine what/where is best?
13. Data Grid Scheduling
- What's the problem?
- Try to schedule wisely
  - free resources, close to input data, fewer failures
- Some resources are inappropriate
  - need to parse and check job requirements and resource info (RSL - Resource Specification Language)
- Job failure is common
  - error reporting is minimal
  - need multiple retries for each operation
  - need to try other resources in case of resource failure
  - eventually we stop and mark a job as BAD
- What about firewalls?
  - some resources have CPUs which cannot access data
- Schedulers
  - Nimrod/G (parameter sweep, not Data Grid)
  - GridBus Scheduler (in 2003 and 2004 we aided them towards SRB)
  - GQSched (prototype developed in 2002, used in 2003 demos)
14. Data Grid Scheduling
- GQSched (Grid Quick Scheduler)
  - Idea is based around the Nimrod model (user driven parameter sweep dispatcher)
  - Addition of sweeps over data files and collections
  - Built in 2002 as a demonstration to computer scientists of simple data grid scheduling
- Simple tool familiar to Physicists
  - Shell script, environment parameters
- Data Grid enabled
  - Seamless access to data catalogues and Grid storage systems
  - Protocols: GSIFTP, GASS (and also non-Grid protocols: HTTP, HTTPS, FTP)
  - Catalogues: GTK2 Replica Catalog, SRB (currently testing)
- Scheduling based on metrics for CPU Resource / Data Resource combinations (a ranking sketch follows this slide)
  - previous failures of job on resource
  - nearness of physical file locations (replicas)
  - resource availability
- Extra features
  - Pre- and post-processing for preparation/collation of data and job status checks
  - Creation and clean-up of a unique job execution area
  - Private-network friendly staging of files for specific resources (3-stage jobs)
  - Automatic retry and resubmission of jobs
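GQSched's actual metric isn't spelled out here, so the following is only an illustrative ranking of (CPU, replica) candidate pairs by failure history, replica nearness, and availability; the input fields and weights are invented:

    # candidates.txt columns: cpu replica failures distance load  (hypothetical)
    # lower score = better; the weights are arbitrary, for illustration only
    awk '{ print $3 * 10 + $4 + $5, $1, $2 }' candidates.txt | sort -n | head -1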
15. Grid Scheduling
- gqsched myresources myscript.csh

    #!/bin/csh -f
    Param MYFILE GridFile srb:/anusf/home/ljw563.anusf/proc1/*.mdst
    StageIn $MYFILE
    StageIn recon.conf event.conf
    StageIn particle.conf
    echo "Processing Job $JOBID on $MYFILE on host `hostname`"
    basfexec -v b20020424_1007 << EOF
    path create main
    module register user_ana
    path add_module main user_ana
    initialize
    histogram define somehisto.hbook
    process_event $FILE 1000 $EVENTSKIP
    terminate
    EOF
    echo "Finished JobID $JOBID."
    StageOut output.mdst srb:/anusf/home/ljw563.anusf/procout1/
    StageOut myana.hbook myana.$JOBID.hbook
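Here myresources presumably lists the candidate execution resources, and the Param ... GridFile line sweeps over every SRB file matching the pattern, so GQSched dispatches one instance of the script per .mdst file; the StageIn/StageOut directives are what let it stage files for private-network resources and retry failed jobs automatically.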
16. Grid Anatomy
- What are the essential components?
- CPU Resources & Middleware
- Data Resources & Middleware
  - replica catalogues unifying many data sources
- Authentication Mechanism
  - Globus GSI, Certificate Authorities
- Virtual Organisation & Information Services
  - Grid consists of VOs: users & resources participating in a VO
  - Who is a part of what research/effort/group
  - Authorisation for resource use
- Job Scheduling, Dispatch, and Information Services
- Collaborative Information Sharing Services
  - Documentation & Discussion (web, wiki, ...)
  - Meetings & Conferences (AccessGrid)
  - Code & Software (CVS, CMT, PacMan)
  - Data & Information (Meta Data systems)
Meta-Data
17. Meta-Data System
- Advanced Meta-Data Repository
  - Advanced: above and beyond file/collection-oriented meta-data
- Data oriented queries
  - List the files resulting from task X.
  - Retrieve the list of all simulation data of event type X.
  - How can file X be regenerated? (if lost or expired)
- Other queries we can imagine
  - What is the status of job X?
  - What analyses similar to X have been undertaken?
  - What tools are being used for X analysis?
  - Who else is doing analysis X or using tool Y?
  - What are the typical parameters used for tool X? And for analysis Y?
  - Search for data skims (filtered sets) that are supersets of my analysis criteria.
18. Meta-Data System
- XML
  - some great advantages
    - natural tree structure
    - strict schema, data can be validated
    - powerful query language (XPath)
    - format is very portable
    - information readily transformable (XSLT)
  - some real disadvantages
    - XML databases are still developing, not scalable
    - XML DBs are based on lots of documents of the same type
    - would need to break the tree into domains, and querying becomes difficult
- LDAP
  - a compromise
    - natural tree structure
    - loose but well defined schema
    - reasonable query features, not as good as XML (a query sketch follows this slide)
    - very scalable (easily distributed and mirrored)
    - information can be converted to XML with little effort if necessary
    - structure/schema is easily accessible and describes itself!
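To illustrate the LDAP query style being proposed; the server, base DN, and attribute names are placeholder assumptions rather than an existing schema:

    # find all simulation data entries of a given event type
    ldapsearch -x -H ldap://mdserver.example.org \
        -b "ou=data,dc=epp,dc=org" "(eventType=X)" fileName replicaLocation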
19. Meta-Data System
- Components
  - Navigation, Search, Management of MD
  - Task/Job/Application generated MD
  - Merging and Uploading MD
(diagram: user-supplied and software-generated meta-data feeding an LDAP server)
20. Meta-Data System
- Navigation and Creation via Web
- Search is coming
21. How to use it all together?
- Getting set up
  - Certificate from a recognised CA (VPAC)
  - Accounts on each CPU/storage resource
    - ANUSF storage, VPAC, ARC (UniMelb), APAC
  - Install required software on resources (e.g. BASF)
  - Your certificate in the VO system
- Running jobs (an end-to-end sketch follows this slide)
  - Find SRB input files, set up an output collection
  - Convert your scripts to GQSched scripts
  - Run GQSched to execute jobs
- Meta Data
  - Find/Create a context for your tasks (what you are currently doing)
  - Submit this with your job, or store it with the output
  - Merge context & output meta-data, then upload
  - NOT COMPLETE: need auto-generated MD from BASF/jobs
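Pulling the steps above together, a session might look like the sketch below; the SRB paths are illustrative and the meta-data upload command is an invented placeholder:

    grid-proxy-init                              # authenticate (GSI proxy)
    Sinit                                        # start an SRB session
    Sls /anusf/home/ljw563.anusf/proc1           # locate input files
    Smkdir /anusf/home/ljw563.anusf/procout1     # set up the output collection
    gqsched myresources myscript.csh             # sweep jobs over the inputs
    md-upload --context myanalysis out.ldif      # hypothetical meta-data upload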