Status of the BaBar Databases

1
Status of the BaBar Databases

Jacek Becla
BaBar Database Group

2
BaBar Is in Production

Run 1: May 1999 to Oct 2000
24.2 fb^-1 (1.3 per month)
Run 2: Feb 2001 to July 2002
up to 12.6 fb^-1 now (2.5 per month)
Expected 100 fb^-1 by July 2002
already well over design luminosity
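
The per-month figures above can be cross-checked with simple arithmetic. The sketch below assumes Run 1 spans 18 calendar months (May 1999 through Oct 2000); that month count and the helper name are assumptions of this example, not numbers from the slide.

```python
# Cross-check of the average integrated luminosity per month quoted above.
# ASSUMPTION: Run 1 covers 18 calendar months (May 1999 through Oct 2000).

def avg_per_month(total_fb_inv, months):
    """Average integrated luminosity (fb^-1) delivered per month."""
    return total_fb_inv / months

run1_rate = avg_per_month(24.2, 18)
print(round(run1_rate, 1))  # 1.3

# Run 2: months needed to collect 12.6 fb^-1 at the quoted 2.5 fb^-1 per month.
run2_months_so_far = 12.6 / 2.5
print(round(run2_months_so_far, 1))  # 5.0
```

At the quoted rate, the Run 2 total corresponds to roughly five months of running since Feb 2001.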

3
Prognosis

4
Changes

4 -> 21 streams
>5 times more files, locks
no data duplication (streams not self-contained)
Smaller files
2 -> 0.5, 10 -> 2 GB
Using Objy 6.1, read-only dbs
Clustering hint server and cond OID server
Migrating production to Linux (now)
Introducing multi-fds (now)
Cannot afford a large test-bed anymore

5
OPR

In general keeps up with data
150 pb^-1 per day
faster than at the end of Run 1
in spite of 5x load
will have to deal with 300 pb^-1 soon

6
Current OPR Configuration

Hardware
6 4-CPU data servers, lock server, jnl server,
catalog server, clustering hint server,
conditions OID server
220 clients
Software
Objy 6.1, Solaris 7
about to migrate to Linux

7
OPR Short Term Future

Use multi-fds
2 event store fds, 1 conditions
6 + 6 data servers
new federation approx. every week
Migrate clients to Linux
2.2x faster CPU, more memory
Use faster machine for lock servers
now Sun Netra T1, 440 MHz
planned Sun Blade 1000, 750 MHz UltraSPARC-3
Discussions about storing all digis in Objy, and
reprocessing from Objy, not xtc

8
REPRO

Hardware configuration similar to OPR
Occasionally up to 3 repro farms
over 300 pb^-1 on a good day
150 + 150 + 200 nodes
condition merging nightmare

9
REPRO Near Future

Use multi-fds
2 event store fds, 1 conditions
5 + 5 data servers
new federation every other week
same slow lock servers
Move to Linux
Run in Italy. Timescale: mid 2002

10
Robustness

Db creation (weak point) removed
precreation in background by CHS, automatic
recovery, new C api in 6.0
AMS crash
¾ of the farm continues, unless it is a default
AMS (used by CHS)
CHS: new central point of failure
entirely in our hands, very stable so far
One event store fd down (e.g. lock server crash)
the second should finish processing current run
Cleanup server worked on

13
Analysis

200 CPUs (Sun Netra T1 like)
17 servers, 24 TB disk cache
On demand staging turned off
Read only dbs
starting to see effect now
Disk space always a problem
micro 5.4 KB/event (aod, col, tag, evt, evshdr)
mini 4.7 KB/event (esd)

14
Analysis (cont.)

Veritas File System reconfiguration
direct I/O instead of buffered I/O
more than doubles effective data rate
Lock server memory leak
grows up to 600 MB in a week
switching every week
Kanga (ROOT based) will become deprecated
recent computing model: enhance Objy, deprecate
Kanga (freeze by mid 2002, produce files until
late 2002)

15
AMS

Known (but not fixed) problem
file used immediately after being closed
crashes AMS (in 6.1 kills the client)
Ported to Linux
no performance figures yet
New feature - compression
Redesigning front end part
got ok from Objy

16
A Word on Conditions

Using OID server to find time interval
only in REPRO so far, about to put in OPR
Staircase problem
incorrect design
purging every 2 weeks, 15 min per rolling
calibration (35 in total), run in parallel
Finalize problem
based on genealogy object (all objects named),
result of iteration in unpredictable order. Just
slow
Condition merging problem
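
The idea of asking a server which condition object is valid at a given time can be sketched as a lookup over sorted validity intervals. This is only an illustration: the class name, the half-open [begin, end) convention and the assumption of non-overlapping intervals are all hypothetical and do not describe the actual OID server.

```python
from bisect import bisect_right

class IntervalIndex:
    """Hypothetical sketch: map a time to the OID whose validity interval contains it.

    ASSUMPTIONS: intervals are half-open [begin, end) and do not overlap.
    """

    def __init__(self, intervals):
        # intervals: iterable of (begin, end, oid) tuples
        self._intervals = sorted(intervals)
        self._begins = [begin for begin, _, _ in self._intervals]

    def find(self, t):
        # Rightmost interval whose begin is <= t, found by binary search.
        i = bisect_right(self._begins, t) - 1
        if i < 0:
            return None
        begin, end, oid = self._intervals[i]
        return oid if begin <= t < end else None

index = IntervalIndex([(0, 10, "oid-A"), (10, 20, "oid-B")])
print(index.find(5))   # oid-A
print(index.find(10))  # oid-B
print(index.find(20))  # None
```

Binary search keeps each lookup logarithmic in the number of intervals, which is the property a central lookup service would want.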

17
Conditions (cont.)

Index problem
occasionally index inconsistent (does not return
all objects in given range). Solution: rebuild.
Happens once every 2 months. Not reported yet.
Index scaling
range query (the way we use it) does not scale
response time linear (100 K entries -> 0.5 sec)
Will extend OID server
now read only access
Will redesign and re-implement conditions
and address all the problems, timescale end of '01
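
The two index issues above (a range query that misses objects, and linear response time) can be illustrated with a small sketch. It compares a brute-force scan against a binary-search range query on a sorted list and rebuilds the index when the two disagree. The function names and the use of a plain sorted list as the "index" are assumptions of this example, not the Objectivity index implementation.

```python
from bisect import bisect_left

def range_scan(keys, lo, hi):
    """Reference answer: examine every key, cost linear in len(keys)."""
    return sorted(k for k in keys if lo <= k < hi)

def range_indexed(sorted_keys, lo, hi):
    """Range query [lo, hi) on a sorted index using two binary searches."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_left(sorted_keys, hi)]

def check_and_rebuild(keys, index, lo, hi):
    """Return (index, rebuilt); the index is rebuilt if it misses objects in [lo, hi)."""
    if range_indexed(index, lo, hi) == range_scan(keys, lo, hi):
        return index, False
    return sorted(keys), True

keys = [5, 1, 9, 3, 7]
stale_index = [1, 3, 7, 9]          # hypothetical inconsistent index: 5 is missing
index, rebuilt = check_and_rebuild(keys, stale_index, 3, 8)
print(rebuilt)                      # True
print(range_indexed(index, 3, 8))   # [3, 5, 7]
```

The consistency check itself costs a full scan, so in practice it would be run occasionally rather than on every query.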

18
Data Distribution

Micro-level data mirrored at IN2P3
Run 2: mirror raw as well
Current tools do not scale with increased data
volume
a lot of manual work
Will try using data grid based tools soon

19
Operations

2 DBAs, 3rd coming soon
Many manual tasks slowly being automated

20
Some Numbers

Total size of data 300 TB
files 128K
users in analysis 220
10 active production federations
this includes 5 analysis fds
Cond dbs 12 GB
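
From the totals above one can estimate the average file size. The sketch assumes decimal units (1 TB = 10^12 bytes, 128K = 128,000 files); the slide does not say which convention is meant, so the result is only indicative.

```python
# Average file size implied by the totals above.
# ASSUMPTION: decimal units (1 TB = 1e12 bytes, 1 GB = 1e9 bytes, 128K = 128_000).
TOTAL_BYTES = 300e12
NUM_FILES = 128_000

avg_file_gb = TOTAL_BYTES / NUM_FILES / 1e9
print(round(avg_file_gb, 2))  # 2.34
```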

21
Tuning, Performance, Scalability

22
4 -> 20 Streams Was Non-trivial

4 streams: 100 nodes 60 Hz, 200 nodes 115 Hz
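
Reading the figures above as event rates for two farm sizes (an interpretation, since the original slide was a plot), the scaling efficiency when doubling the number of nodes can be computed as follows. The function name is hypothetical.

```python
def scaling_efficiency(nodes_a, rate_a, nodes_b, rate_b):
    """Fraction of ideal linear speed-up achieved when going from farm a to farm b."""
    ideal_rate_b = rate_a * nodes_b / nodes_a
    return rate_b / ideal_rate_b

# 100 nodes at 60 Hz versus 200 nodes at 115 Hz (4 streams)
eff = scaling_efficiency(100, 60.0, 200, 115.0)
print(f"{eff:.3f}")  # 0.958
```

Under this reading, doubling the farm delivers about 96% of the ideal rate.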

23
Clustering Hint Server

CORBA based, multithreaded
Precreates in background dbs and conts,
distributes oid to clients
Many other features
containers reused
full integration with HPSS (precreated files
pinned in cache, full dbs immediately migrated)
file disparsification
file transfer to tape 1 MB -> 15-25 MB now
db creation locally, pre-sizing
no container extensions on the client side
round robin load balancing
automatic recovery, and so on
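
A minimal sketch of the precreation and round-robin ideas listed above. Everything here is hypothetical: the class and method names, the pool size, and the fact that refilling is done synchronously (the slide describes precreation happening in the background in a multithreaded CORBA server).

```python
from collections import deque
from itertools import cycle

class HintServerSketch:
    """Hypothetical illustration: hand out precreated databases, spread round robin."""

    def __init__(self, data_servers, pool_size=2):
        self._servers = cycle(data_servers)   # round-robin over data servers
        self._pool = deque()                  # precreated (db_id, server) pairs
        self._pool_size = pool_size
        self._next_id = 0
        self.refill()

    def _precreate(self):
        # Stand-in for creating and pre-sizing a database on the next server.
        db = (self._next_id, next(self._servers))
        self._next_id += 1
        return db

    def refill(self):
        # In a real service this would run in a background thread.
        while len(self._pool) < self._pool_size:
            self._pool.append(self._precreate())

    def allocate(self):
        # Clients receive an already existing database, so they never create one.
        if not self._pool:
            self.refill()
        return self._pool.popleft()

chs = HintServerSketch(["srv1", "srv2", "srv3"])
print([chs.allocate() for _ in range(4)])
# [(0, 'srv1'), (1, 'srv2'), (2, 'srv3'), (3, 'srv1')]
```

Keeping creation on the server side is what removes database creation from the clients' critical path.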

24
Others

commitAndHold
significant reduction in lock traffic
Initial transaction for condition
one instead of 50 transactions
Cache authorization
rather than check on every event
Tune client file descriptor limit
Hit 8K limit on AMS site. Reduced client fd
limit 196 -> 32. AMS response improved, AMS CPU
usage decreased
Increase trans granularity
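
The "cache authorization" item above can be illustrated with a small memoizing wrapper: the expensive check runs once per (user, resource) pair instead of once per event. The class name, the `check` callable and the `calls` counter are hypothetical and only serve the example.

```python
class CachedAuthorizer:
    """Hypothetical sketch: remember authorization results instead of rechecking."""

    def __init__(self, check):
        self._check = check      # expensive check, e.g. a round trip to a server
        self._cache = {}
        self.calls = 0           # number of times the expensive check actually ran

    def is_authorized(self, user, resource):
        key = (user, resource)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._check(user, resource)
        return self._cache[key]

auth = CachedAuthorizer(lambda user, resource: user == "alice")
for _ in range(1000):            # one lookup per event
    auth.is_authorized("alice", "evtstore")
print(auth.calls)                # 1
```

A cache like this trades freshness for speed, so revoked permissions would only take effect once the cache is cleared.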

25
Bottlenecks