Title: Central Services Databases
1. Central Services -- Databases & Farms
- Run II Computing Review
- September, 2005
2. Databases
- Current Resources
- Support Levels
- Effort
- Hardware & software
- Monitoring tools
- Completed and Current Projects
- CDF Replication
- SAM Schema
- Freeware Support
3. Support Levels & Effort
- Database system administration support for
- CDF offline
- CDF online
- D0 offline
- 24x7 support for production databases
- primary & secondary on call
- 5 machines & databases
- 9x5 support for development/integration databases -- 6 machines
- 11 databases -- 4 dev, 4 int, 1 cdfval, 2 testbeds
- SAM support, plus other application support
- Effort
- 3.5 DBAs (Trumbo, Stanfield, Kumar, Bonham)
- 2 Sysadmins (Mihalek, Kovich, Kastner)
4. Current Resources
5. Current Resources
6. Monitoring and Data Modeling Tools
- Monitoring Tools
- dbatool/toolman
- Monitors space usage, users, SQL, and temp space; snipes inactive sessions; auto-starts the listener; IA; estimates table/index stats (a sketch of session sniping follows this slide)
- OEM (Oracle Enterprise Manager) DB monitoring tool
- http://www-cdserver.fnal.gov/cd_public/css/dsg/db_stats/data/db_stats.html
- Ganglia
- http://fcdfmon2.fnal.gov/
- Data Modeling Tool
- Oracle Designer is used for data modeling and
initial space estimates for applications.
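To illustrate the session-sniping function mentioned for dbatool/toolman, here is a minimal Python sketch against Oracle's v$session view. The cx_Oracle module, connect string, idle threshold, and required privileges are assumptions for illustration; the slides do not show dbatool's actual implementation.

    # Hypothetical sketch of session sniping: kill Oracle sessions that have
    # been idle longer than a threshold. Credentials and limit are placeholders.
    import cx_Oracle

    IDLE_LIMIT_SEC = 4 * 3600  # assumed threshold: 4 hours idle

    conn = cx_Oracle.connect("system/password@cdfofprd")  # placeholder
    cur = conn.cursor()
    cur.execute("""SELECT sid, serial#, username
                     FROM v$session
                    WHERE status = 'INACTIVE'
                      AND type = 'USER'
                      AND last_call_et > :lim""", lim=IDLE_LIMIT_SEC)
    for sid, serial, user in cur.fetchall():
        print("sniping idle session %d,%d (%s)" % (sid, serial, user))
        # ALTER SYSTEM KILL SESSION takes no bind variables, so build the string
        conn.cursor().execute("ALTER SYSTEM KILL SESSION '%d,%d'" % (sid, serial))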
7. Uptimes
- CDF Online
- 100%
- CDF Offline
- 99.44% (1776 min of downtime since 11 Nov 2004)
- CDF Offline Replica
- 100%
- D0 Offline Production
- 99.85% uptime (420 min of downtime since 15 Nov 2004; see the arithmetic note below)
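The percentages above follow from downtime minutes divided by the length of the measurement window. The slide gives start dates but not the window's end, so the check below treats the window length as an assumed parameter, chosen to reproduce the CDF Offline figure.

    # Relation between downtime minutes and uptime percent; the ~220-day
    # window is an assumption, since the slide does not state the end date.
    def uptime_pct(downtime_min, window_days):
        window_min = window_days * 24 * 60
        return 100.0 * (1 - float(downtime_min) / window_min)

    print("%.2f%%" % uptime_pct(1776, 220))  # -> 99.44%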
8. Completed Projects
- CDF Online hardware replacement; implementation of Bzora1
- Replacement of basic replication with Streams
- SAM schema support for CDF and D0
- Implementation of dCache/Enstore for database backups to tape
- Introduction of SAN technology for backups
- Tested complete database recovery of the d0ofprd1 database
- New license agreement with Oracle Corporation
- Translated the SAM schema to Postgres
- Replacement of the CDF replica machine
- D0 Luminosity DB deployed on Oracle 9i and 10g versions (on a machine loaned from CDF)
9. Growth of D0 Offline DBs
10. Growth of CDF Online DB
11. Growth of CDF Offline DB
12. CDF Streams Replication
[Diagram: CDF Streams replication topology. Labels recoverable from the figure: On-line 9.2.0.6 (cdfonprd), serving on-line users; Off-line 9.2.0.6 (cdfofprd), serving 47 apps, CAF and others, and farm users; Replica1 9.2.0.6 (cdfstrm1), for replication distribution at FNAL; Replica2 9.2.0.4 (cdfrep01), serving four apps and remote users; Streams and potential Streams links among the databases and out to remote sites. A configuration sketch follows.]
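For background on what setting up a Streams route involves (the slides do not show the experiments' actual scripts), the sketch below drives Oracle's DBMS_STREAMS_ADM package from Python. The schema name, queue name, Streams name, and credentials are all placeholders.

    # Hedged sketch: create capture rules for one schema with
    # DBMS_STREAMS_ADM. All names and credentials are placeholders;
    # this is not the production CDF configuration.
    import cx_Oracle

    src = cx_Oracle.connect("strmadmin/password@cdfonprd")  # placeholder
    cur = src.cursor()
    cur.execute("""
        BEGIN
          DBMS_STREAMS_ADM.ADD_SCHEMA_RULES(
            schema_name  => 'CDF_APP',                  -- placeholder schema
            streams_type => 'capture',
            streams_name => 'capture_cdf',
            queue_name   => 'strmadmin.streams_queue',
            include_dml  => TRUE,
            include_ddl  => FALSE);
        END;""")
    src.commit()

A matching propagation and an apply process on the replica complete the route; the same package provides ADD_SCHEMA_PROPAGATION_RULES and apply-side variants.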
13. RMAN Backup on SAN
- Inexpensive, large disk array can accommodate growing RMAN backups
- Fast, reliable backup and recovery
- 24x7 and 8x5 support tiers available
- Can serve various O/S platforms
- Briefing on the database backup/recovery standardization on June 16 discussed the SAN testing in more detail
- http://www-css.fnal.gov/dsg/internal/briefings_and_projects/briefings/standardizing_database_backups.ppt
- Multiplexing of archives to local disk and SAN (sketch below)
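A minimal sketch of the kind of RMAN run this implies, driven from Python: back the database and its archived logs up onto a SAN mount point. The paths and the OS-authenticated "target /" login are assumptions, not the group's actual scripts.

    # Hypothetical wrapper around an RMAN backup to a SAN mount point.
    # Paths and authentication are placeholders for illustration.
    import subprocess

    RMAN_SCRIPT = """
    RUN {
      # backup pieces land on the SAN (placeholder path)
      BACKUP DATABASE FORMAT '/san/rman/%d_%U';
      # sweep archived logs to the SAN as well
      BACKUP ARCHIVELOG ALL FORMAT '/san/rman/arch_%d_%U';
    }
    """

    subprocess.run(["rman", "target", "/"], input=RMAN_SCRIPT,
                   text=True, check=True)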
14. RMAN to SAN Experience
- d0ofdev1 has been doing RMAN backups to SAN since Nov. 04
- Two 1 TB SAN mount points available
- Keep 2 alternating days of RMAN backups on SAN; once a week to local backup disk
- RMAN validation to determine backup file integrity (sketch below)
- One validation failure since Nov. 04
- Recoveries from SAN were all successful
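RMAN can perform the integrity check itself by reading a backup back without restoring anything; a hedged way to script it, with the same placeholder authentication as above:

    # Sketch: validate the latest backup without restoring it. RESTORE ...
    # VALIDATE reads the backup pieces and checks for corruption only.
    import subprocess

    VALIDATE_SCRIPT = """
    RESTORE DATABASE VALIDATE;
    RESTORE ARCHIVELOG ALL VALIDATE;
    """

    subprocess.run(["rman", "target", "/"], input=VALIDATE_SCRIPT,
                   text=True, check=True)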
15. SAM Schema
- Production Deployments
- Autodestination subsystem of the SAM schema
- Indexes on param values deployed in production
- Data types correction cut
- Indexes for volumes
- Works in Progress
- Request subsystem of the SAM schema
- Cut in Mini-SAM
- Upgrade of Mini-SAM as the SAM schema evolved
- This lets individual developers have a copy of SAM metadata and seed data available for a server-software rewrite if needed
- Mini-SAM in Postgres (a type-mapping sketch follows this slide)
- Initiative to move towards freeware databases for SAM; proof of product is not complete and requires testing with a dbserver from the SAM development team
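Translating an Oracle schema to Postgres, as was done for the SAM schema and Mini-SAM, is largely a matter of mapping column types. The helper below is a hedged sketch covering common cases only; the function and the mapping table are invented for illustration.

    # Illustrative Oracle -> Postgres type mapping; common cases only,
    # not the full SAM schema translation.
    ORACLE_TO_PG = {
        "NUMBER":   "NUMERIC",
        "VARCHAR2": "VARCHAR",
        "DATE":     "TIMESTAMP",  # Oracle DATE carries a time of day
        "CLOB":     "TEXT",
        "RAW":      "BYTEA",
    }

    def translate_column(name, ora_type, length=None):
        # Render one column definition in Postgres syntax (hypothetical helper)
        pg_type = ORACLE_TO_PG.get(ora_type.upper(), ora_type)
        if length and pg_type in ("VARCHAR", "NUMERIC"):
            pg_type = "%s(%s)" % (pg_type, length)
        return "%s %s" % (name, pg_type)

    # translate_column("FILE_NAME", "VARCHAR2", 500) -> "FILE_NAME VARCHAR(500)"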
16. Oracle Contract
- Negotiated a new contract w/ Oracle
- Explicitly distinguishes between scientific and business use, w/ different arrangements
- Scientific use provides for 2400 term user licenses, renewed annually
- Covers the entire user base
- Can increase or decrease as needed (including decrease to zero)
- Negotiated an 87% discount (better than DOE)
- Annual cost for scientific use increased from $114k to $290k
- But far less than feared
- Discounts apply for five years
17. Freeware Support
- MySQL/Postgres prototype
- Proof of product with CDF data
- Population mechanism is on demand; it does not support updates (see the sketch after this slide)
- CDF successfully tested it with CDF code (Karlsruhe)
- Providing consulting for freeware databases
- Actively maintaining new versions of MySQL and Postgres in KITS and working towards a more robust environment
- Actively maintaining documentation for MySQL and Postgres in our freeware area
- http://www-css.fnal.gov/dsg/external/freeware
- Actively assisting users with questions, upgrades, testing, etc. for freeware products
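A hedged reading of the on-demand population mechanism: treat the freeware database as a read-through cache in front of Oracle, filling rows the first time they are asked for and never propagating updates. The module names (MySQLdb, cx_Oracle), table, and column names are placeholders, not the prototype's actual schema.

    # Sketch of on-demand population: look in MySQL first, fall back to
    # Oracle and cache the row. Updates are deliberately unsupported,
    # matching the prototype above. Names are placeholders.
    import MySQLdb
    import cx_Oracle

    mysql = MySQLdb.connect(db="cdf_cache")            # placeholder
    oracle = cx_Oracle.connect("reader/pw@cdfofprd")   # placeholder

    def get_file_record(file_id):
        cur = mysql.cursor()
        cur.execute("SELECT file_name, size FROM files WHERE file_id = %s",
                    (file_id,))
        row = cur.fetchone()
        if row is None:
            ocur = oracle.cursor()
            ocur.execute("SELECT file_name, file_size FROM data_files "
                         "WHERE file_id = :id", id=file_id)
            row = ocur.fetchone()
            if row is not None:  # populate the cache on first access
                cur.execute("INSERT INTO files (file_id, file_name, size) "
                            "VALUES (%s, %s, %s)", (file_id,) + row)
                mysql.commit()
        return row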
18. Introduction to CSS Run II Farms Activities
- Personnel
- General system administration of farms
- System design of new I/O and concatenation systems
- Deployment of new batch systems
- Evaluation and benchmarking of new hardware types
- Commissioning and troubleshooting of new compute servers
- Fermigrid work for interoperability of farms
19. Personnel
- 3 FTEs involved in day-to-day system administration (Tader, Van Conant, Syu)
- Two team members in a supporting role (Timm, Greaney) for planning and installation
20. New CDF Farms Software
- Production farms made the transition to a totally new mode of processing
- SAM used for data delivery
- SAM used for concatenation bookkeeping
- Condor with CAF scripts as the batch system (a generic submission sketch follows this slide)
- Production farms were the first large-scale demo of this setup, late 2004; now known as the Phase I farm
- Phase II production farm commissioned in June of 2005
- Condor successfully deployed as the batch system; its success will lead us to use it on other farms as well
- No use of FBSNG/DFARM anymore
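For readers unfamiliar with Condor, a generic example of the submission step is sketched below; the executable name and file layout are placeholders, and the actual CAF scripts wrap considerably more than this.

    # Generic Condor job submission of the kind the CAF scripts wrap.
    # Executable, arguments, and log paths are placeholders.
    import subprocess

    SUBMIT_FILE = """
    universe   = vanilla
    executable = reco_job.sh
    arguments  = $(Process)
    output     = reco_$(Process).out
    error      = reco_$(Process).err
    log        = reco.log
    queue 10
    """

    with open("reco.submit", "w") as f:
        f.write(SUBMIT_FILE)

    subprocess.run(["condor_submit", "reco.submit"], check=True)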
21. Monitoring
- All farms are monitored with Ganglia
- http://fnpca.fnal.gov/ganglia/
- ngop is used for detecting error conditions and generating alerts into the helpdesk system (a generic probe sketch follows this slide)
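As a generic illustration of the error-condition-to-alert pattern (not ngop's actual agent code, which the slides do not show), a minimal probe might look like this; the threshold, paths, and mail addresses are assumptions.

    # Minimal sketch of a monitoring probe that raises an alert when a
    # condition trips; threshold and mail addresses are placeholders.
    import os
    import smtplib
    from email.mime.text import MIMEText

    DISK_USE_LIMIT = 0.95  # assumed alert threshold

    def disk_usage(path="/data"):
        st = os.statvfs(path)
        return 1.0 - float(st.f_bavail) / st.f_blocks

    use = disk_usage()
    if use > DISK_USE_LIMIT:
        msg = MIMEText("Disk usage on /data at %.0f%%" % (100 * use))
        msg["Subject"] = "farm alert: disk nearly full"
        msg["From"] = "probe@example.fnal.gov"     # placeholder
        msg["To"] = "helpdesk@example.fnal.gov"    # placeholder
        smtplib.SMTP("localhost").send_message(msg)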
22. New CDF Farms Hardware
- Phase I farm
- 27 worker nodes, all off-warranty, used for SAM testing
- Phase II production farm
- 174 worker nodes, for production reconstruction
- Six IDE RAID servers
- Used for concatenation on local disk
- Can be moved between the two farms as needed
- Already have throughput of >2 TB/day, and can still increase (see the rate arithmetic after this slide)
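For scale, 2 TB/day is a sustained rate of roughly 23 MB/s; the quick check below assumes decimal units (1 TB = 10^6 MB).

    # 2 TB/day as a sustained transfer rate, assuming decimal units.
    tb_per_day = 2
    mb_per_s = tb_per_day * 1e6 / 86400  # 86400 seconds per day
    print("%.1f MB/s" % mb_per_s)        # -> 23.1 MB/s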
23. CDF Condor Phase I
24. D0 Farms Design
- 448 worker nodes in the D0 farms currently
- Goal is to migrate functions off of the SGI head node d0bbin
- Batch system has already been moved
- Will use 5 Linux worker nodes, SAN-attached disk, and Global File System for concatenation
- New NFS head node coming soon; it will also use SAN-attached disk
- Master of the SAM station will move to another node
25. D0 Farm Utilization
26. Opteron Hardware Evaluation
- CSS led a multi-department taskforce to evaluate AMD Opteron technology
- Found Opterons of similar cost to their Intel counterparts to be significantly better on D0 code
- Opteron hardware will be purchased in the FY2005 acquisition of worker nodes
- Significant improvement in performance per dollar; also more efficient in electrical power
27. FY2004 Farms Acquisition
- First major deployment in the new Grid Computing Center
- Farm team staff were contacts for installers
- Supervised burn-in for 160 CDF & 280 D0 nodes
- Includes CDF CAF and D0 CAB nodes as well
- Led troubleshooting effort for several months
- Physically present in GCC with vendor staff
- Resonant vibration between system drives, cases, and fans resulted in intermittent disk errors
28. FERMIGRID
- CSS personnel are part of the Fermigrid gateway team
- Goals for Fermigrid
- All Fermilab clusters can interoperate
- Unified interface to the Open Science Grid
- General Purpose Farms are the first Fermigrid compute element
- Other farms we manage will follow shortly
- http://osg-cat.grid.iu.edu/
29. Open Science Grid Gatekeepers
- GP Farms
- Existing OSG gatekeeper being used by CDF to run Condor glide-ins (GlideCAF)
- D0 Farms
- Machine has been ordered; we will install
- D0 CAB
- Run2-sys personnel are attending our training sessions and will learn to install; machine has been ordered
- CDF Farms
- Some new nodes will be available to OSG and Condor glide-ins; we will install the gatekeeper
30. SAMGrid Gatekeepers
- J. Snow's experience running D0 jobs on CMS shows it is better to have a separate SAMGrid gatekeeper
- SAMGrid testbed has been part of the GP Farms since early 2004
- We manage the gatekeepers
- Used mostly for testing, not production
- D0 Farms SAMGrid gatekeeper soon to be replaced with a faster node