Title: Dan Gunter, Brian L. Tierney, Keith Beattie: LBNL
1 CEDPS Troubleshooting Effort Status Report, April
- Dan Gunter, Brian L. Tierney, Keith Beattie LBNL
- Laura Pearlman, ISI
Center for Enabling Distributed Petascale Science
http://www.cedps-scidac.org
- Updated Architecture for OSG
- New Grid Passport Idea
- Log Database schema work
- OGSA-DAI use
- NERSC deployment plans
- Plans for the next 6 months
3 Use Case: Troubleshooting
- Allow GOC personnel to:
- find log messages for jobs from VO=Atlas running at site=FNAL
- find log messages related to service=condor, user=Joe, site=Indiana
- find log messages for user=Joe
- find log messages with status=error
- find all logs where job manager status=killed (i.e., jobs that were killed for running too long)
- find log messages with start events that have no matching end event
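The queries above can be sketched against flat log lines. This is a minimal illustration that assumes logs are written as space-separated name=value pairs; the field names (event, guid, site, VO, user, status) and the sample lines are made up for the example and are not taken from real CEDPS logs.

```python
import shlex

def parse_line(line):
    """Split a 'name=value name2="quoted value"' log line into a dict."""
    return dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)

def find(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

def unmatched_starts(records):
    """Return '.start' events with no '.end' event sharing name and guid."""
    ended = {(r["event"][:-len(".end")], r.get("guid"))
             for r in records if r.get("event", "").endswith(".end")}
    return [r for r in records
            if r.get("event", "").endswith(".start")
            and (r["event"][:-len(".start")], r.get("guid")) not in ended]

# Hypothetical sample data, for illustration only.
SAMPLE = """\
ts=2008-04-03T23:07:46Z event=job.start guid=a1 site=FNAL VO=Atlas user=Joe
ts=2008-04-03T23:07:55Z event=job.end guid=a1 site=FNAL VO=Atlas user=Joe status=ok
ts=2008-04-03T23:09:45Z event=job.start guid=b2 site=Indiana VO=Atlas user=Joe service=condor
ts=2008-04-03T23:10:02Z event=job.error guid=c3 site=Indiana VO=CMS user=Ann status=error
"""
records = [parse_line(line) for line in SAMPLE.splitlines()]

atlas_at_fnal = find(records, VO="Atlas", site="FNAL")
errors = find(records, status="error")
dangling = unmatched_starts(records)
```

Each use-case bullet maps to one call: equality filters cover the VO, site, user, and status queries, and the start-without-end query needs the extra pairing step.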
4 Use Case 2: Monitoring / Performance Analysis
- Allow GOC personnel to answer questions such as:
- what sites had connection attempts for a given user DN
- what data files were accessed most often
- which user moved the largest amount of data
- which log messages show a time between start/end events of more than 3X the baseline
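Two of these questions reduce to small aggregations. The sketch below is illustrative only: the slides do not define the baseline, so the median duration is used as an assumption, and the function names and sample numbers are hypothetical.

```python
from collections import Counter
from statistics import median

def slow_events(durations, factor=3.0):
    """Return ids whose duration exceeds factor x the baseline.

    The baseline is taken to be the median duration (an assumption).
    """
    baseline = median(durations.values())
    return sorted(k for k, d in durations.items() if d > factor * baseline)

def top_mover(transfers):
    """Return (user, total_bytes) for the user who moved the most data."""
    totals = Counter()
    for user, nbytes in transfers:
        totals[user] += nbytes
    return totals.most_common(1)[0]

# Hypothetical sample data: seconds between start and end events per job.
durations = {"a": 7.0, "b": 6.5, "c": 6.9, "d": 30.0, "e": 7.2}
slow = slow_events(durations)
biggest = top_mover([("joe", 10), ("ann", 25), ("joe", 20)])
```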
5 Use Case 3: User Debugging / Provenance
- Allow a user to query for their own logs
- find log messages for all my jobs
- find log messages with status=error
- Allow a user to determine all hosts/services that their job used
- find log messages related to Job X
6 Previously suggested architecture
7 Problems with this approach
- Main idea of old architecture
- all grid logs are sent to a central collector
- After talking to several sites, the following has become clear
- some sites are quite worried about sensitive data in the log files
- some sites want to be able to control exactly who gets access to what log data
- Proposed Solution
- most logs stay local to the site; only a minimal subset is sent to the central collector
- sites deploy a new service that provides X.509-authenticated access to logs
8 Updated Architecture
9 New Ideas
- Key Points
- Minimal logging sent to central collector by default
- e.g., resource name, job ID, start time, end time, DN, VO
- enough information to locate log files of interest
- basically the same info currently collected by Gratia
- site can send more if they choose to
- Site admins have control over access to log database
- hopefully sites will allow users to see their own logs
- Site admins see exactly what data is being sent to the central collector
- data is sent to the central collector using SSL
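The "minimal subset by default, more if the site chooses" rule can be pictured as a whitelist projection applied before anything leaves the site. This is only a sketch: the key names and the sample record are assumptions chosen to mirror the fields listed above, not an actual CEDPS record format.

```python
# Default fields forwarded to the central collector (names are assumed).
CENTRAL_FIELDS = ("resource", "job_id", "start_time", "end_time", "dn", "vo")

def central_summary(record, extra_fields=()):
    """Project a full site log record onto the fields a site agrees to send.

    extra_fields lets a site opt in to sending more than the default.
    """
    allowed = CENTRAL_FIELDS + tuple(extra_fields)
    return {k: record[k] for k in allowed if k in record}

# Hypothetical full record kept in the site-local archive.
record = {
    "resource": "ce1.example.org", "job_id": "1234",
    "start_time": 100, "end_time": 160,
    "dn": "/DC=org/CN=Joe", "vo": "Atlas",
    "cmdline": "run --input private.dat", "status": "ok",
}
default_view = central_summary(record)
extended_view = central_summary(record, extra_fields=("status",))
```

Because the projection is an explicit list, a site admin can read off exactly what is being sent.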
10 New Functionality
- Deployment of site log archives will provide OSG with the following new functionality
- OSG security staff can easily query site archives to see what DNs have been used
- Users can query site archives for their own logs
- GOC staff can query site archives to aid
11 Grid Passport Stamps
- We are working with Miron to define the concept of a Grid passport stamp
- Every middleware component that comes in contact with a Grid workflow adds an entry and exit stamp to the passport
- Condor sandbox makes this doable
- This provides the following
- Tells the user exactly which components and hosts were used by their workflow
- provides workflow provenance
- For workflows that fail, provides a mechanism to determine where the failure occurred
- passport stamps must also be collected along the
12 Grid Passport Stamps
- Stamps are generated and recorded by an Issuer
- Issuer attributes: name, address, and GUID
- timestamp
- local identity (if one exists)
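One way to picture the stamp and passport described on these two slides is as plain data records. This is a sketch, not the design under discussion: the class names, the "entry"/"exit" kind field, and the open_issuers helper are all assumptions made for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Issuer:
    name: str
    address: str
    guid: str

@dataclass(frozen=True)
class Stamp:
    issuer: Issuer
    kind: str                            # "entry" or "exit"
    timestamp: float
    local_identity: Optional[str] = None  # only if one exists

@dataclass
class Passport:
    workflow_id: str
    stamps: List[Stamp] = field(default_factory=list)

    def stamp(self, issuer, kind, local_identity=None, timestamp=None):
        """Append a stamp; the timestamp defaults to the current time."""
        ts = time.time() if timestamp is None else timestamp
        self.stamps.append(Stamp(issuer, kind, ts, local_identity))

    def open_issuers(self):
        """Names of issuers with an entry stamp but no exit stamp."""
        exited = {s.issuer.guid for s in self.stamps if s.kind == "exit"}
        return [s.issuer.name for s in self.stamps
                if s.kind == "entry" and s.issuer.guid not in exited]

# Hypothetical workflow: the job manager exits cleanly, the worker never does.
gram = Issuer("GRAM", "gk.example.org", str(uuid.uuid4()))
condor = Issuer("Condor", "wn.example.org", str(uuid.uuid4()))
passport = Passport("wf-1")
passport.stamp(gram, "entry", timestamp=1.0)
passport.stamp(condor, "entry", local_identity="joe", timestamp=2.0)
passport.stamp(gram, "exit", timestamp=3.0)
```

An issuer left in open_issuers() after a failed workflow is a candidate for where the failure occurred.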
13 Open Issues
- What about workflows that spawn sub-workflows?
- need to clone the passport, and reassemble at the end of the job
- probably many more issues
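The clone-and-reassemble idea could look roughly like the following. This is a speculative sketch of one possible approach, not a settled design: a passport is modeled as a list of (timestamp, issuer, kind) tuples, and reassembly is assumed to mean a time-ordered union with shared stamps kept once.

```python
import copy

def clone(passport):
    """Give a sub-workflow its own copy of the parent's stamps so far."""
    return copy.deepcopy(passport)

def reassemble(parent, children):
    """Merge parent and child passports into one time-ordered stamp list.

    Stamps inherited by several children are kept only once.
    """
    seen, merged = set(), []
    all_stamps = [s for p in [parent, *children] for s in p]
    for stamp in sorted(all_stamps, key=lambda s: s[0]):
        if stamp not in seen:
            seen.add(stamp)
            merged.append(stamp)
    return merged

# Hypothetical workflow that fans out to two sub-workflows.
parent = [(1.0, "GRAM", "entry")]
child_a = clone(parent)
child_a.append((2.0, "Condor@a", "entry"))
child_b = clone(parent)
child_b.append((2.5, "Condor@b", "entry"))
parent.append((3.0, "GRAM", "exit"))
merged = reassemble(parent, [child_a, child_b])
```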
14 CEDPS-TS DB Schema
15 DB Schema Requirements
- Deal with semi-structured data
- timestamp, event name, and level are the only guaranteed fields
- Load data at relatively high rates
- detailed logs from Globus, Condor, etc.
- Perform many different types of queries
- new troubleshooting scenarios
- site admins, users, etc.
16 DB Schema Approach
- Put only required info in the main table (avoid NULLs): time, event type, severity (level)
- Make tables for the event type and the attribute names so they can be referred to by an integer
- better for speed, since indexing by integers is fast (0.1 sec for 0.5M records), but adds an extra lookup to map the name to the id
- Add some special tables for DN, identifiers, large text values (less important)
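A schema along these lines might look like the sketch below. It uses SQLite through Python's sqlite3 module so it runs anywhere, whereas the project itself targets MySQL. The table names event, event_type, ident, and dn and the columns et_id, e_id, relationship, and value follow the query shown later in this deck; the severity column and the attr_name and attr tables are assumed names for illustration.

```python
import sqlite3

SCHEMA = """
-- Lookup tables: names are stored once and referenced by integer id.
CREATE TABLE event_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE attr_name  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);

-- Main table: only the guaranteed fields, so no NULLs are needed.
CREATE TABLE event (
    id       INTEGER PRIMARY KEY,
    time     REAL    NOT NULL,
    et_id    INTEGER NOT NULL REFERENCES event_type(id),
    severity INTEGER NOT NULL
);
CREATE INDEX event_et_id ON event(et_id);

-- Optional, semi-structured data lives in side tables keyed by event id.
CREATE TABLE attr  (e_id INTEGER NOT NULL REFERENCES event(id),
                    an_id INTEGER NOT NULL REFERENCES attr_name(id),
                    value TEXT NOT NULL);
CREATE TABLE ident (e_id INTEGER NOT NULL REFERENCES event(id),
                    relationship TEXT NOT NULL, value TEXT NOT NULL);
CREATE TABLE dn    (e_id INTEGER NOT NULL REFERENCES event(id),
                    value TEXT NOT NULL);
"""

def make_db():
    """Create an in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db

db = make_db()
name = "org.globus.execution.job.create.start"
et_id = db.execute("INSERT INTO event_type(name) VALUES (?)", (name,)).lastrowid
db.execute("INSERT INTO event(time, et_id, severity) VALUES (?, ?, ?)",
           (1207264066.0, et_id, 6))
# Querying by name costs the extra lookup join mentioned above.
count = db.execute(
    "SELECT COUNT(*) FROM event JOIN event_type ON event_type.id = event.et_id "
    "WHERE event_type.name = ?", (name,)).fetchone()[0]
```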
17 Schema Loading Perf.
- For efficiency, the loader program keeps the mappings from attribute names and event type names to their integer ids in memory (loaded at startup), so only INSERT statements are needed
- For MySQL, the extended multiple-insert syntax is used for further speedup
- impressively fast on my Mac to localhost: 7200 records/sec. Faster than parsing!
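The two loader techniques, an in-memory name-to-id cache and multi-row INSERT statements, can be sketched as follows. This is not the project's loader: it uses SQLite via sqlite3 instead of MySQL (both accept the multi-row VALUES form), and the Loader class, its method names, and the batch size are assumptions.

```python
import sqlite3

class Loader:
    """Caches name -> id maps so the hot path issues only INSERT statements."""

    def __init__(self, db):
        self.db = db
        # Load the existing mapping once at startup.
        self.event_types = dict(db.execute("SELECT name, id FROM event_type"))

    def type_id(self, name):
        # A name not seen before costs one extra INSERT, never a SELECT.
        if name not in self.event_types:
            cur = self.db.execute(
                "INSERT INTO event_type(name) VALUES (?)", (name,))
            self.event_types[name] = cur.lastrowid
        return self.event_types[name]

    def load(self, events, batch=100):
        """Insert (time, event_name, level) tuples in multi-row batches."""
        rows = [(t, self.type_id(name), level) for t, name, level in events]
        for i in range(0, len(rows), batch):
            chunk = rows[i:i + batch]
            # Multi-row form: INSERT ... VALUES (?,?,?),(?,?,?),...
            sql = ("INSERT INTO event(time, et_id, severity) VALUES "
                   + ",".join(["(?,?,?)"] * len(chunk)))
            self.db.execute(sql, [v for row in chunk for v in row])
        self.db.commit()

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE event_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE event (id INTEGER PRIMARY KEY, time REAL NOT NULL,
                    et_id INTEGER NOT NULL, severity INTEGER NOT NULL);
""")
loader = Loader(db)
names = ["job.start", "job.end"]
loader.load([(float(i), names[i % 2], 6) for i in range(250)])
```

Batching trades one statement per record for one statement per chunk, which is where the speedup comes from.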
18 Fun Query on GT-4.2 Logs
select DATE_ADD('1970-01-01', interval start_time second) as 'start time',
       (end_time - start_time) as 'duration', dn.value as 'user DN', jobResource from
(
  select e.time as 'start_time', e_done.time as 'end_time', e.guid, e.jobResource from
  (
    select e1.time, e1.id, e1.value as 'guid', ident.value as 'jobResource' from
    (
      select event.id, event.time, ident.value as 'value' from
        event join event_type on event_type.id = event.et_id
        left join ident on ident.e_id = event.id
      where event_type.name = 'org.globus.execution.job.create.start' and
        ident.relationship = 'guid'
    ) as e1
    join
    (
      select event.id, event.time, ident.value as 'value' from
        event join event_type on event_type.id = event.et_id
        left join ident on ident.e_id = event.id
      where event_type.name = 'org.globus.execution.job.create.end' and
  join ident on ident.value = e.jobResource
  join event as e_done on e_done.id = ident.e_id
  join event_type on event_type.id = e_done.et_id
  where event_type.name = 'org.globus.execution.job.terminate.end' and
    ident.relationship = 'jobResource'
) as e
join ident on ident.value = e.jobResource
join ident as id2 on id2.e_id = ident.e_id
join ident as id3 on id3.value = id2.value
left join dn on dn.e_id = id3.e_id
left join event on event.id = dn.e_id
left join event_type on event_type.id = event.et_id
where ident.relationship = 'jobResource' and id2.relationship = 'guid'
  and id3.relationship = 'guid' and
  event_type.name = 'org.globus.security.authn.transport.end'
group by jobResource order by
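The core move in the query, pairing a job's start event with its end event through a shared identifier, can be shown in a much-reduced form. This sketch is not the query above: it runs on SQLite through sqlite3 rather than MySQL, omits the guid and DN joins, and assumes for simplicity that the create.start event itself carries the jobResource identifier. The rows are made-up sample data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE event_type (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE event (id INTEGER PRIMARY KEY, time REAL, et_id INTEGER);
CREATE TABLE ident (e_id INTEGER, relationship TEXT, value TEXT);
INSERT INTO event_type VALUES
    (1, 'org.globus.execution.job.create.start'),
    (2, 'org.globus.execution.job.terminate.end');
-- job-A starts and terminates; job-B starts but never terminates.
INSERT INTO event VALUES (10, 100.0, 1), (11, 108.5, 2), (12, 200.0, 1);
INSERT INTO ident VALUES
    (10, 'jobResource', 'job-A'),
    (11, 'jobResource', 'job-A'),
    (12, 'jobResource', 'job-B');
""")

PAIR_SQL = """
SELECT i1.value AS jobResource,
       e1.time  AS start_time,
       e2.time - e1.time AS duration
FROM event e1
JOIN event_type t1 ON t1.id = e1.et_id
JOIN ident i1 ON i1.e_id = e1.id AND i1.relationship = 'jobResource'
JOIN ident i2 ON i2.value = i1.value AND i2.relationship = 'jobResource'
JOIN event e2 ON e2.id = i2.e_id
JOIN event_type t2 ON t2.id = e2.et_id
WHERE t1.name = 'org.globus.execution.job.create.start'
  AND t2.name = 'org.globus.execution.job.terminate.end'
ORDER BY e1.time
"""
rows = db.execute(PAIR_SQL).fetchall()
```

Jobs with no terminate.end event drop out of the inner join, so only completed jobs get a duration.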
19 Fun Query Results
Duration, user, and jobResource.id of all jobs (in a given time period).
Despite the messy query, it runs very quickly (0.03s w/ >250K
--------------------------------------------------------------------------------------------------------------------------------------
start time           duration           user                                                      jobResource
--------------------------------------------------------------------------------------------------------------------------------------
2008-04-03 23:07:46  8.74900007247925   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     c56196d0-01d2-11dd-874c-80235764bb7a
2008-04-03 23:09:45  10.710000038147    /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     0c6a4040-01d3-11dd-b78e-8fb0dd1ed847
2008-04-03 23:10:39  7.16599988937378   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     2ca067e0-01d3-11dd-b78e-8fb0dd1ed847
2008-04-03 23:10:51  6.73299980163574   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     3392f860-01d3-11dd-b78e-8fb0dd1ed847
2008-04-03 23:11:43  6.90599989891052   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     53002ab0-01d3-11dd-b78e-8fb0dd1ed847
2008-04-03 23:12:02  6.91799998283386   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     5debbac0-01d3-11dd-b78e-8fb0dd1ed847
2008-04-03 23:14:37  7.9760000705719    /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     baa9cb80-01d3-11dd-b78e-8fb0dd1ed847
2008-04-04 00:10:31  6.84200000762939   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     89b74d60-01db-11dd-b78f-8fb0dd1ed847
2008-04-04 00:15:22  6.05399990081787   /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017     374e8ab0-01dc-11dd-b78f-8fb0dd1ed847
2008-04-04 00:22:53  6.45799994468689   /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  44094000-01dd-11dd-b78f-8fb0dd1ed847
2008-04-04 00:32:59  10.5170001983643   /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  acdbb3f0-01de-11dd-a9fc-d8b7a3370b2a
2008-04-04 00:36:14  5.73799967765808   /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  21698080-01df-11dd-a9fc-d8b7a3370b2a
2008-04-04 00:54:42  6.32100009918213   /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  b5ce04b0-01e1-11dd-a9fc-d8b7a3370b2a
2008-04-04 01:14:39  6.54200005531311   /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  7f87f250-01e4-11dd-a9fc-d8b7a3370b2a
2008-04-04 01:19:11  0.796000003814697  /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  216ea1e0-01e5-11dd-a9fc-d8b7a3370b2a
2008-04-04 01:20:15  0.749000072479248  /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921  477d3770-01e5-11dd-a9fc-d8b7a3370b2a
21 OGSA-DAI Integration Goals
- Provide access to log data over the grid
- Support flexible authorization policies
- E.g., site admins can see local site data, VO admins can see data related to their VO, and users can see data related to jobs running under their DN
- Facilitate queries across multiple sites
- Possibly add support for joins with other existing databases
- E.g., GRAM audit db
22 OGSA-DAI Useful Features
- Can use Globus authz framework to specify authorization of OGSA-DAI resources
- View resources (planned feature) represent database views
- Can use DN in a view definition to represent the caller's DN
- This will enable us to enforce policies like "users can only see their own records"
- Resource Groups fan queries out to multiple OGSA-DAI resources
- Resource groups of OGSA-DAI views will allow us to define resources representing things like "all your own records"
23 OGSA-DAI Proposed Deployment
Res. group 1
Res. group 2
- View1: select where DN = DN
- Shows records with the same DN as the user making the OGSA-DAI call
- Res. Group 1 then contains all records at all sites with the same DN as the user making the
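The combination of a per-caller view and a resource group can be imitated in a few lines to show the intended behaviour. This sketch does not use the OGSA-DAI API at all: each site is stood in for by an in-memory SQLite database, the log table and its columns are assumed, and the function names are hypothetical.

```python
import sqlite3

def make_site(rows):
    """Create a stand-in site archive holding (dn, event) log rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE log (dn TEXT, event TEXT)")
    db.executemany("INSERT INTO log VALUES (?, ?)", rows)
    return db

def view_own_records(db, caller_dn):
    """Per-site view: only rows whose DN equals the caller's DN."""
    return db.execute(
        "SELECT dn, event FROM log WHERE dn = ? ORDER BY event",
        (caller_dn,)).fetchall()

def resource_group(sites, caller_dn):
    """Fan the same view out to every site and tag rows with the site name."""
    results = []
    for name in sorted(sites):
        results += [(name, *row) for row in view_own_records(sites[name], caller_dn)]
    return results

# Hypothetical sites and records.
sites = {
    "siteA": make_site([("/CN=Joe", "job.start"), ("/CN=Ann", "job.start")]),
    "siteB": make_site([("/CN=Joe", "job.end")]),
}
joes_records = resource_group(sites, "/CN=Joe")
```

The caller's DN is bound as a query parameter rather than supplied by the user, which is what makes the "own records only" policy enforceable.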
24 NERSC Collaboration
- Attending weekly NERSC production grid meetings
- Deploying CEDPS syslog-ng-based log collection configuration at NERSC
- Helping NERSC with troubleshooting
25 Globus Best Practice Logs
- still a number of issues with the logs
- http://www.cedps.net/index.php/GT4.2_Logging
26 Focus for the next 6 months
- log parser / DB loader deployment at NERSC
- write additional data parsers
- Log database scalability testing
- OGSA-DAI integration
- Improve interface to the database
27 More Information
- General Information
- http://www.cedps.net/index.php/Troubleshooting