Title: First Steps with Grid Computing
2First Steps with Grid Computing: Oracle Application Server 10g
Session ID 40187
- Venkata Ravipati, Product Manager, Oracle Corporation
- Sastry Malladi, CMTS, Oracle Corporation
- Jamie Shiers, IT Division, CERN, Jamie.Shiers@cern.ch
3Agenda
- Introduction to Grid Computing
- OracleAS 10g Features
- CERN Case Study
- OracleAS 10g Roadmap
- Q & A
Introduction to Grid Computing
4IT Challenges
- Enterprise IT is highly fragmented, leading to poor utilization, excess capacity, and systems inflexibility
- Adding capacity is complex and labor-intensive
- Systems are fragmented into inflexible islands
- Expensive server capacity sits underutilized
- Installing, configuring, and managing application infrastructure is slow and expensive
- Poorly integrated applications with redundant functionality increase costs and limit business responsiveness
5Grid Computing Solves IT Problems
IT Problem -> Grid Solution
- High cost of adding capacity -> Pool modular, low-cost hardware components
- Islands of inflexible systems -> Virtualize system resources
- Underutilized server capacity -> Dynamically allocate workloads and information
- Hard to configure and manage -> Unify management and automate provisioning
- Poorly integrated applications with redundant functions -> Compose applications from reusable services
6What is Grid Computing?
- Grid computing is a hardware and software infrastructure that enables
- Transparent resource sharing across an enterprise: divisions, data centers, ...
- Resource categories
- Computers
- Storage
- Databases
- Application Servers
- Applications
- Coordinates resources that are not subject to centralized control
- Using standard, open, general-purpose protocols and interfaces
- To deliver nontrivial qualities of service
7Enterprise Grid Infrastructure Must Be Comprehensive
- Middleware
- Management
- Database
- Storage
8Agenda
- Introduction to Grid Computing
- OracleAS 10g Features
- CERN Case Study
- OracleAS 10g Roadmap
- Q & A
OracleAS 10g Features
9Introducing Oracle 10g
- Complete, integrated grid infrastructure
10Oracle Application Server 10g
Workload Management
11Workload Management
IT Problem
- Adding and allocating computing capacity is expensive and too slow to adapt to changing business requirements
Oracle 10g Solution
- Virtualize servers as modular HW resources
- Virtualize software as reusable run-time services
- Manage workloads automatically based on pre-defined policies
12Virtualized Hardware Resources
Add Capacity Quickly and Economically
13Virtualized Middleware Services
Accounting Application
Group Collections of Resources and Runtime Services into Logical Applications
14Policy-based Workload Management
Workload Manager components (a minimal sketch follows below):
- Dispatcher / Scheduler: distributes workloads based on application-specific policies
- Policy Manager: stores application-specific policies
- Resource Manager: manages resource availability and status
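To make the division of labour concrete, here is a minimal, self-contained Java sketch of the three roles above. The class names (WorkloadDispatcher, ResourceManager, the Policy interface) and the least-loaded policy are hypothetical illustrations, not Oracle AS 10g APIs.

```java
import java.util.ArrayList;
import java.util.List;

class Node {
    final String host;
    double load; // current utilization, 0.0 - 1.0
    Node(String host, double load) { this.host = host; this.load = load; }
}

// "Policy Manager": holds an application-specific placement rule.
interface Policy {
    Node choose(List<Node> candidates);
}

// "Resource Manager": tracks which nodes are currently available.
class ResourceManager {
    private final List<Node> nodes = new ArrayList<>();
    void register(Node n) { nodes.add(n); }
    List<Node> available() { return nodes; }
}

// "Dispatcher / Scheduler": routes each unit of work using the policy.
public class WorkloadDispatcher {
    private final ResourceManager resources;
    private final Policy policy;
    WorkloadDispatcher(ResourceManager r, Policy p) { resources = r; policy = p; }

    String dispatch(String request) {
        Node target = policy.choose(resources.available());
        return "routing '" + request + "' to " + target.host;
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.register(new Node("app1.example.com", 0.7));
        rm.register(new Node("app2.example.com", 0.2));
        // Example policy: send work to the least-loaded node.
        Policy leastLoaded = nodes -> nodes.stream()
                .min((a, b) -> Double.compare(a.load, b.load)).orElseThrow();
        System.out.println(new WorkloadDispatcher(rm, leastLoaded).dispatch("GET /portal"));
    }
}
```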
15Middleware Services
- HTTP servers
- Web caches
- J2EE servers
- EJB processes
- Portal services
- Wireless services
- Web services
- Integration services
- Directory services
- Authentication services
- Authorization services
- Enterprise Reporting services
- Query Analysis services
16Metrics-based Workload Reallocation
- Employee Portal: Portal
- Accounting: Discoverer, Reports
- Web Store: HTTP Server, J2EE Server
Unexpected demand! -> shift more capacity to Web Store (see the sketch below)
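The reallocation on this slide can be pictured as a simple threshold rule: when a metric for one logical application exceeds its target, a node is moved over from a donor application. The sketch below is purely illustrative; the application names, thresholds, and rebalance method are assumptions, not Oracle's workload manager.

```java
import java.util.HashMap;
import java.util.Map;

public class ReallocationSketch {
    // Nodes currently assigned to each logical application (hypothetical).
    static Map<String, Integer> capacity = new HashMap<>(Map.of(
            "EmployeePortal", 4, "Accounting", 4, "WebStore", 4));

    // Move one node from a donor application when demand exceeds a threshold.
    static void rebalance(String app, double requestsPerSec, double threshold, String donor) {
        if (requestsPerSec > threshold && capacity.get(donor) > 1) {
            capacity.merge(donor, -1, Integer::sum);
            capacity.merge(app, 1, Integer::sum);
            System.out.println("Shifted one node from " + donor + " to " + app + ": " + capacity);
        }
    }

    public static void main(String[] args) {
        rebalance("WebStore", 950.0, 500.0, "Accounting"); // unexpected demand spike
    }
}
```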
17Scheduled Workload Reallocation
Diagram: capacity allocated to General Ledger and Order Entry is reallocated between the start of the quarter and the end of the quarter
18Policy-based Edge Caching
- Virtualized pools of storage enable sharing and transfer of data between nodes
- Adaptive caching policies flexibly accommodate changing demand (see the sketch below)
Diagram: Client, Virtual HTTP Server, Grid Caches
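As a rough illustration of an adaptive caching policy, the sketch below shrinks a cached page's time-to-live as demand for it grows, so hot content is refreshed more often. The TTL rule and class names are hypothetical, not Oracle Web Cache configuration.

```java
import java.util.concurrent.ConcurrentHashMap;

public class AdaptiveTtlCache {
    record Entry(String body, long expiresAt) {}
    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();

    // Hypothetical policy: base TTL of 60s, halved per 100 hits, floor of 5s.
    long ttlMillis(String url) {
        int h = hits.getOrDefault(url, 0);
        return Math.max(5_000, 60_000 >> Math.min(4, h / 100));
    }

    String get(String url, java.util.function.Supplier<String> origin) {
        hits.merge(url, 1, Integer::sum);
        Entry e = cache.get(url);
        if (e == null || System.currentTimeMillis() > e.expiresAt()) {
            // Miss or expired: fetch from the origin server and re-cache.
            e = new Entry(origin.get(), System.currentTimeMillis() + ttlMillis(url));
            cache.put(url, e);
        }
        return e.body();
    }

    public static void main(String[] args) {
        AdaptiveTtlCache cache = new AdaptiveTtlCache();
        System.out.println(cache.get("/products", () -> "rendered page"));
    }
}
```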
19Oracle Application Server 10g
20Software Provisioning
IT Problem
- Installing, configuring, upgrading, and patching systems is labor-intensive and too slow to adapt to changing business requirements
Oracle 10g Solution
- Manage virtualized HW and SW resources as one system
- Automate installation, configuration, upgrading, and patching processes
21Software Provisioning
- Grid Control Repository (GCR) with centralized inventories for installation and configuration
- Provision servers
- Provision software
- Provision users
22Automated Deployment
- Install and configure a single server node
- Register configuration to the Repository
- Automatically deploy to nodes as they are added to the grid
Grid Control Repository
23Software Cloning
- Automated provisioning based on a master node
- Archive and replicate specific configurations
- e.g. Payroll configuration optimized for Fridays at 4:00 pm
- Context-specific adjustments (see the sketch below)
- e.g. IP address, host name, web listener
Steps:
- Select software and instances to clone
- Update configuration inventory in GCR
- Clone to selected targets
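The context-specific adjustments step can be thought of as rewriting host-specific keys in the master node's configuration for each clone target. The sketch below is illustrative only; the property keys and method are assumptions, not the actual Oracle AS cloning tool.

```java
import java.util.HashMap;
import java.util.Map;

public class CloneConfigSketch {
    // Rewrite host-specific settings in a cloned configuration.
    static Map<String, String> adjustForTarget(Map<String, String> masterConfig,
                                               String targetHost, String targetIp, int listenerPort) {
        Map<String, String> clone = new HashMap<>(masterConfig);
        clone.put("host.name", targetHost);
        clone.put("host.ip", targetIp);
        clone.put("web.listener.port", Integer.toString(listenerPort));
        return clone;
    }

    public static void main(String[] args) {
        // Hypothetical master-node configuration to be cloned.
        Map<String, String> master = Map.of(
                "host.name", "master.example.com",
                "host.ip", "10.0.0.1",
                "web.listener.port", "7777",
                "app.name", "payroll");
        System.out.println(adjustForTarget(master, "clone01.example.com", "10.0.0.21", 7777));
    }
}
```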
24Patch and Update Management
- Real-time discovery of new patches
- Automated staging and application of patches
- Rolling application upgrades
- Patch history tracking
25Oracle Application Server 10g
26User Provisioning
IT Problem
- It takes too long to register new users
- Users have too many accounts, passwords, and privileges to manage
- Developers re-implement authentication for each new application
Oracle 10g Solution
- Centralized identity management
- Shared authentication service
27Single Sign-on Across the Grid
- Consolidate accounts
- Simplify management
- Facilitate re-use
Diagram: Accounting, Sales Portal, and Support Portal all authenticate against a shared Directory
28 User Provisioning
- Create users once
- Centrally manage roles, privileges, preferences
- Support single password for all applications
- Delegate administration
- Locally administered departments, LOBs, etc.
- User self-service
- Interoperate with existing security infrastructure
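A shared authentication service boils down to every application validating the same user/password pair against one central directory rather than a private account store. The sketch below uses standard JNDI to attempt an LDAP bind; the directory host and DN layout are assumptions for illustration, not Oracle Internet Directory settings.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.InitialDirContext;

public class CentralAuthSketch {
    static boolean authenticate(String uid, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://directory.example.com:389"); // hypothetical host
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, "uid=" + uid + ",ou=people,dc=example,dc=com");
        env.put(Context.SECURITY_CREDENTIALS, password);
        try {
            new InitialDirContext(env).close(); // a successful bind means the password is valid
            return true;
        } catch (NamingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(authenticate("jsmith", "secret") ? "logged in" : "rejected");
    }
}
```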
29Oracle Application Server 10g
30Application Availability
IT Problem
- Ensuring required levels of availability is too expensive
Oracle 10g Solution
- Modular components provide inexpensive redundancy
- Coordinated response to system failures ensures application availability
31Application Availability
- Transparent Application Failover (TAF)
- Automatic session migration
- Fast-Start Fault Recovery
- Automatic failure detection and recovery
- Multi-tier Failover Notification (FaN)
- Speeds end-to-end application failover time
- From 15 minutes to <15 seconds
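The idea behind failover notification is that a surviving tier is told about a failure immediately instead of discovering it through a TCP timeout. The sketch below shows a connection pool that retargets as soon as a (hypothetical) notification callback fires; none of the names are Oracle's TAF or FaN APIs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FailoverSketch {
    private final Deque<String> nodes = new ArrayDeque<>();
    private volatile String active;

    FailoverSketch(String... hosts) {
        for (String h : hosts) nodes.add(h);
        active = nodes.peek();
    }

    // Called by the (hypothetical) notification service when a node fails.
    synchronized void onNodeDown(String host) {
        nodes.remove(host);
        if (host.equals(active)) {
            active = nodes.peek(); // fail over immediately, no timeout wait
            System.out.println("Failing over to " + active);
        }
    }

    String connect() { return "connected to " + active; }

    public static void main(String[] args) {
        FailoverSketch pool = new FailoverSketch("app1", "app2", "app3");
        System.out.println(pool.connect());
        pool.onNodeDown("app1");   // notification arrives in seconds, not minutes
        System.out.println(pool.connect());
    }
}
```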
32Transparent Application Failover
- Employee Portal: Portal
- Accounting: Discoverer, Reports
- Web Store: HTTP Server, J2EE Server
Resource failure! -> fail over the service to additional nodes
33Fast-Start Fault Recovery
- Employee Portal: Portal
- Accounting: Discoverer, Reports
- Web Store: HTTP Server, J2EE Server
Nodes recovered -> reinstate automatically
34Multi-tier Failover Notification (FaN)
- Overcomes TCP/IP timeout delays associated with cross-tier application failovers
RAC Failover / AS Detection / Total Downtime:
- Without FaN: < 8 secs / 15 mins / > 15 mins
- With FaN: < 8 secs / < 4 secs / < 12 secs
35Oracle Application Server 10g
36Application Monitoring
IT Problem
- Insufficient performance data to plan, tune, and manage systems effectively
Oracle 10g Solution
- Software pre-instrumented to provide status and fine-grained performance data
- Centralized console analyzes and summarizes Grid performance
37Application Monitoring
- Monitor virtual application resources
- e.g. J2EE containers, HTTP servers, Web caches, firewalls, routers, software components, etc.
- Root cause diagnostics
- Track real-time and historic performance metrics
- Application availability, business transactions, end-user performance
- Notifications and alerts
- Administer service level agreements (SLAs)
38Repository-based Management
- Centralized repository-based management provides a unified view of the entire infrastructure
- Manage all your end-to-end application infrastructure from any device
Grid Control Repository
39Performance Monitoring
- Capture real-time and historical performance data
- Analyze and tune workload policies
- Answer questions like
- How much time is being spent in just the JDBC part of this application? (see the timing sketch below)
- What was the average response time over the past 3, 6, and 9 months?
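To illustrate the first question, the time spent in just the JDBC part of an application can be measured by wrapping query execution with a timer, as in the sketch below. This is illustrative only; Oracle AS 10g ships its own instrumentation, and the connection string here is hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcTimingSketch {
    static long jdbcNanos = 0; // accumulated time spent inside JDBC calls

    static ResultSet timedQuery(Statement st, String sql) throws SQLException {
        long t0 = System.nanoTime();
        try {
            return st.executeQuery(sql);
        } finally {
            jdbcNanos += System.nanoTime() - t0;
        }
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string, user, and password for illustration.
        try (Connection c = DriverManager.getConnection(
                     "jdbc:oracle:thin:@db.example.com:1521:ORCL", "app", "secret");
             Statement st = c.createStatement();
             ResultSet rs = timedQuery(st, "SELECT COUNT(*) FROM orders")) {
            while (rs.next()) { /* consume results */ }
        }
        System.out.printf("JDBC time: %.1f ms%n", jdbcNanos / 1e6);
    }
}
```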
40Policy-based Alerts
- User-specified targets, metrics, and thresholds
- e.g. CPU utilization, user response times, etc.
- Flexible notification methods
- e.g. Phone, e-mail, fax, SMS, etc.
- Self-correction via pre-defined responses
- e.g. Execute a script to shut down low-priority jobs (see the sketch below)
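A policy-based alert reduces to: read a metric, compare it to a user-defined threshold, notify, and optionally run a pre-defined corrective action. The sketch below illustrates that loop with hypothetical names; it is not the Enterprise Manager alerting API.

```java
import java.util.function.DoubleSupplier;

public class AlertPolicySketch {
    record Policy(String metric, double threshold, Runnable correction) {}

    static void evaluate(Policy p, DoubleSupplier metricSource) {
        double value = metricSource.getAsDouble();
        if (value > p.threshold()) {
            // Notification step: here simply printed, could be e-mail, SMS, etc.
            System.out.println("ALERT: " + p.metric() + " = " + value + " exceeds " + p.threshold());
            p.correction().run(); // pre-defined self-correcting response
        }
    }

    public static void main(String[] args) {
        Policy cpu = new Policy("cpu.utilization", 90.0,
                () -> System.out.println("shutting down low-priority jobs"));
        evaluate(cpu, () -> 95.2); // simulated metric reading
    }
}
```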
41Agenda
- Introduction to Grid Computing
- OracleAS 10g Features
- CERN Case Study
- OracleAS 10g Roadmap
- Q & A
42LHC Computing Grid Project
- Oracle-based Production Services for LCG 1
43Goals
- To offer production-quality services for LCG 1 to meet the requirements of forthcoming (and current!) data challenges
- e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC04
- To provide distribution kits, scripts and documentation to assist other sites in offering production services
- To leverage the many years' experience in running such services at CERN and other institutes
- Monitoring, backup and recovery, tuning, capacity planning, ...
- To understand the experiments' requirements for how these services should be established and extended, and to clarify current limitations
- Not targeting small- and medium-scale DB apps that need to be run and administered locally (to the user)
44What Services?
- POOL file catalogue using EDG-RLS (also non-POOL!)
- LRC and RLI services, plus client APIs
- For GUID <-> PFN mappings (see the sketch below)
- and EDG-RMC
- For file-level meta-data; POOL currently stores
- filetype (e.g. ROOT file), fully registered, job status
- Expect also 10 items from CMS DC04; others?
- plus (the service behind) EDG Replica Manager client tools
- Need to provide robustness, recovery, scalability, performance, ...
- File catalogue is a critical component of the Grid!
- Job scheduling, data access, ...
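At its core, the LRC holds a mapping from a file's GUID to the physical file names (PFNs) of its replicas. The sketch below shows that mapping in miniature; the class, the GUID, and the PFNs are illustrative, not the EDG-RLS client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicaCatalogSketch {
    // GUID -> list of physical file names (replicas) at this site.
    private final Map<String, List<String>> guidToPfns = new HashMap<>();

    void addReplica(String guid, String pfn) {
        guidToPfns.computeIfAbsent(guid, k -> new ArrayList<>()).add(pfn);
    }

    List<String> lookup(String guid) {
        return guidToPfns.getOrDefault(guid, List.of());
    }

    public static void main(String[] args) {
        ReplicaCatalogSketch lrc = new ReplicaCatalogSketch();
        String guid = "c2a7d4e0-0000-0000-0000-000000000001"; // hypothetical GUID
        lrc.addReplica(guid, "srm://castor.cern.ch/data/run123/file.root");
        lrc.addReplica(guid, "srm://se.rl.ac.uk/lcg/run123/file.root");
        System.out.println(lrc.lookup(guid));
    }
}
```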
45The Supported Configuration
- All participating sites should run
- A Local Replica Catalogue (LRC)
- Contains GUID <-> PFN mapping for all local files
- A Replica Location Index (RLI) <-- independent of EDG deadlines
- Allows files at other sites to be found
- All LRCs are configured to publish to all remote RLIs
- Scalability beyond O(10) sites??
- Hierarchical and other configurations may come later
- A Replica Metadata Catalogue (RMC)
- Not proposing a single, central RMC
- Jobs should use the local RMC
- Short-term: handle synchronisation across RMCs
- In principle possible today on the POOL side (to be tested)
- Long-term: middleware re-engineering?
46Component Overview
Diagram: each site (CERN, CNAF, RAL, IN2P3) runs a Storage Element, a Local Replica Catalog, and a Replica Location Index
47Where should these services be run?
- At sites that can provide supported h/w and O/S configurations (next slide)
- At sites with an existing Oracle support team
- We do not yet know whether we can make Oracle-based services easy enough to set up (surely?) and run (should be, for canned apps?) where existing Oracle experience is not available
- Will learn a lot from the current roll-out
- Pros: can benefit from scripts / doc / tools etc.
- Other sites simply re-extract a catalog subset from the nearest Tier 1 in case of problems?
- Need to understand use-cases and service level
48Requirements for Deployment
- A farm node running Red Hat Enterprise Linux and Oracle9iAS
- Runs Java middleware for LRC, RLI etc.
- One per VO
- A disk server running Red Hat Enterprise Linux and Oracle9i
- Data volume for LCG 1 small (10^5 - 10^6 entries, each < 1 KB)
- Query / lookup rate low (1 every 3 seconds)
- Projection to 2008: 100 - 1000 Hz, 10^9 entries
- Shared between all VOs at a given site
- Site responsible for acquiring and installing h/w and RHEL
- $349 for basic edition: http://www.redhat.com/software/rhel/es/
49What if?
- DB server dies
- No access to catalog until a new server is configured and the DB restored
- Hot standby or clustered solution offers protection against the most common cases
- Regular dump of the full catalog into an alternate format, e.g. POOL XML?
- Application server dies
- Stateless, hence relatively simple to move to a new host
- Could share with another VO
- Handled automatically with application server clusters
- Data corrupted
- Restore or switch to an alternate catalog
- Software problems
- Hardest to predict and protect against
- Could cause running jobs to fail and drain batch queues!
- Very careful testing, including by experiments, before a move to a new version of the middleware (weeks, including a smallish production run?)
- Need to foresee all possible problems, establish a recovery plan and test!
What happens during the period when the catalog is unavailable?
50Backup Recovery, Monitoring
- Backend DB included in the standard backup scheme
- Daily full, hourly incrementals; the archive log allows point-in-time recovery
- Need additional logging, plus agreement with experiments to understand the point in time to recover to, and testing!
- Monitoring both at box level (FIO) and at DB/AS/middleware level
- Need to ensure problems (inevitable, even if undesirable) are handled gracefully
- Recovery tested regularly, by several members of the team
- Need to understand expectations
- Catalog entries guaranteed for ever?
- Granularity of recovery?
51Recommended Usage - Now
- POOL jobs: recommend extracting a catalog sub-set prior to the job and post-cataloging new entries as a separate step
- Non-POOL jobs, e.g. EDG-RM client: as a minimum, test the RC and implement a simple retry; provide enough output in the job log for manual recovery if necessary (a retry sketch follows below)
- Perpetual retry inappropriate if e.g. a configuration error
- In all cases, need to foresee hiccoughs in the service, e.g. 1 hour, particularly during the ramp-up phase
- Please provide us with examples of your usage so that we can ensure adequate coverage by the test suite!
- Strict naming convention essential for any non-trivial catalogue maintenance
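For the "simple retry" recommendation, a bounded retry loop that logs each failed attempt is usually enough: it rides out short service hiccoughs but still fails (with a useful job log) on e.g. a configuration error. The sketch below is an assumption-laden illustration, not the EDG-RM client.

```java
import java.util.concurrent.Callable;

public class CatalogRetrySketch {
    static <T> T withRetry(Callable<T> op, int maxAttempts, long waitMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Log enough detail that an operator can recover by hand later.
                System.err.println("catalog attempt " + attempt + "/" + maxAttempts + " failed: " + e);
                Thread.sleep(waitMillis);
            }
        }
        throw last; // bounded: do not retry forever on e.g. a configuration error
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical catalogue lookup standing in for the real client call.
        String pfn = withRetry(() -> "srm://castor.cern.ch/data/file.root", 3, 2_000);
        System.out.println("resolved replica: " + pfn);
    }
}
```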
52Status
- RLS/RLI/RMC services deployed at CERN for each experiment and DTEAM
- RLSTEST service also available, but should not be used for production!
- Distribution mechanism, including kits, scripts and documentation, available and well debugged
- Only 1 outside site deployed so far (Taiwan); others in the pipeline
- FZK, RAL, FNAL, IN2P3, NIKHEF
- We need help to define the list and priorities!
- Actual installation rather fast (max a few hours)
- Lead time can be long
- Assigning resources etc.: a few weeks!
- Plan is (still) to target first the sites with Oracle experience, to make scripts and documentation as clear and smooth as possible
- Then see if it makes sense to go further
53Registration for Access to Oracle Kits
- Well-known method of account registration in a dedicated group (OR)
- Names will be added to a mailing list to announce e.g. new releases of Oracle s/w, patch sets etc.
- Foreseeing a much more gentle roll-out than for previous packages
- Initially just DBAs supporting canned apps
- RLS backend, later potentially the conditions DB if appropriate
- For simple, moderate-scale DB apps, consider use of the central Sun cluster, already used by all LHC experiments
- Distribution kits, scripts etc. in AFS
- /afs/cern.ch/project/oracle/export/
- Documentation also via the Web
- http://cern.ch/db/RLS/
54Links
- http://cern.ch/wwwdb/grid-data-management.html
- High-level overview of the various components; pointers to presentations on use-cases etc.
- http://cern.ch/wwwdb/RLS/
- Detailed installation and configuration instructions
- http://pool.cern.ch/talksandpubl.html
- File catalog use-cases, DB requirements, many other talks
55Future Possibilities
- Investigating resilience against h/w failure using Application Server and Database clusters
- AS clusters also facilitate the move of machines, addition of resources, optimal use of resources etc.
- DB clusters (RAC) can be combined with stand-by databases and other techniques for even greater robustness
- (Greatly?) simplified deployment, monitoring and recovery can be expected with Oracle 10g
56Summary
- Addressing production-quality DB services for LCG 1
- Clearly work in progress, but basic elements in place at CERN; deployment just starting outside
- Based on experience and knowledge of Oracle products, offering distribution kits, documentation and other tools to those sites that are interested
- Need more input on the requirements and priorities of the experiments regarding production plans
57Q & A