Title: Persistency at LHC
1. Persistency at LHC
- Vincenzo Innocente
- CERN/EP/CMC
2. Sources and Contributions
- Presentations at the last RD45 workshop
- Presentations at the Architecture Working Group
- Experiments' Web pages
- Focus on
  - LHC experiments' prototypes
  - New-generation experiments' (BaBar, STAR, Run II) experience and plans
4. Persistency: what for?
- A process saves its state to be later re-used by
  - the same process
  - a different process running the same executable
  - a different process running a different executable
- Ideal persistency: a Core Dump!
[Diagram: Processes 1-3 exchanging state through volatile memory and permanent storage]
5. Use Cases
- Extended (in space and time) virtual memory
  - proprietary format optimized for computational and storage performance of a single application
- Import/Export in a heterogeneous environment
  - standard application-independent format
  - conversion to/from internal application format
- Management of different versions (identification, query mechanism) and of concurrency (locking)
  - proprietary internal mechanism
  - rely on the file system or a DBMS
7. Object Persistency
- Objects are atomic entities
  - have a state (data members, including relationships)
  - provide services (methods)
- Persistent objects survive process boundaries
  - when retrieved, they
    - have the same state
    - provide the same services
  - as when they were stored
[Diagram: Event objects moving between volatile memory and permanent storage]
8. Object Persistency
- Persistency
  - Objects retain their state between two program contexts
  - Storage entity is a complete object
    - State of all data members
    - Object class
- OO Language Support
  - Abstraction
  - Inheritance
  - Polymorphism
  - Parameterised Types (Templates)
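The idea above — an object's complete state outliving its process and coming back with the same services — can be sketched in a few lines of Python (an illustration added to this transcript, not part of the original slides; `Event` and its members are invented for the example):

```python
import pickle

class Event:
    """A toy event whose state must survive process boundaries."""
    def __init__(self, run, tracks):
        self.run = run          # plain data member
        self.tracks = tracks    # relationship to other data

    def n_tracks(self):         # a service (method)
        return len(self.tracks)

# Store: the complete object state goes to permanent storage.
blob = pickle.dumps(Event(run=42, tracks=[1.2, 3.4]))

# Retrieve (possibly in a different process): same state, same services.
ev = pickle.loads(blob)
```

Note that only the data travel through the byte stream; the method `n_tracks` works again because the class code is available on the reading side — exactly the data/code split the next slides discuss.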
9. OO Language Binding
- Without it
  - User had to deal with copying between program and I/O representations of the same data
  - User had to traverse the in-memory structure
  - User had to write and maintain specialised code for I/O of each new class/structure type
- Tight Language Binding
  - ODBMSs allow persistent objects to be used directly as variables of the OO language
  - C++, Java and Smalltalk (heterogeneity)
  - I/O on demand: no explicit store/retrieve calls
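The contrast between hand-written per-class I/O and a tight language binding can be sketched as follows (illustrative Python added here, not from the slides; the track fields are made up):

```python
import struct
import pickle

# Without a language binding: the user hand-writes specialised I/O code
# for each structure, copying every field between memory and a flat record.
def write_track(pt, eta):
    return struct.pack("dd", pt, eta)       # per-structure packing code

def read_track(buf):
    pt, eta = struct.unpack("dd", buf)      # per-structure unpacking code
    return {"pt": pt, "eta": eta}

# With a tight binding: any object graph is stored and retrieved as-is,
# with no per-class I/O code to write or maintain.
track = {"pt": 12.5, "eta": -0.8, "hits": [1, 2, 3]}
restored = pickle.loads(pickle.dumps(track))
```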
10. Problems with Naïve OP
- Storing services (methods ready to run) is non-trivial
  - persistency services store data only
  - configuration management takes care of code
  - frameworks can use dynamic loading to match data and code
- Clean and performant object design is difficult
  - Different (partial) representations of the state of an object may be required to cope with computational, storage and I/O efficiencies (and code development efficiency)
- Object design and implementation evolve, persistent objects stay the same
  - Old persistent objects need to be converted
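The conversion of old persistent objects can be sketched as a version-keyed converter chain applied at retrieval time (a hypothetical Python sketch added to this transcript; the `momentum`-to-components change is an invented example of schema evolution):

```python
# Stored objects carry the schema version they were written with;
# registered converters upgrade old shapes, step by step, on retrieval.
CONVERTERS = {}

def converter(from_version):
    def register(fn):
        CONVERTERS[from_version] = fn
        return fn
    return register

@converter(1)
def v1_to_v2(state):
    # v2 split 'momentum' into components; recover them from the old field.
    state["px"], state["py"], state["pz"] = state.pop("momentum")
    state["_version"] = 2
    return state

def load(state):
    # Apply converters until the state reaches the current schema.
    while state["_version"] in CONVERTERS:
        state = CONVERTERS[state["_version"]](state)
    return state

old = {"_version": 1, "momentum": (1.0, 2.0, 3.0)}
new = load(old)
```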
11. More Problems with Naïve OP
- Object granularity does not match raw I/O granularity (which in turn is device dependent)
  - small objects should be physically clusterized according to users' access patterns
- Object logical relationships do not necessarily reflect access patterns (the old rows-vs-columns dilemma)
- How objects become persistent
  - At construction time (user can control clustering)
  - By reachability: an object becomes persistent when attached to an already persistent object (clustering control difficult)
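Persistence by reachability can be sketched as a closure over object references: committing one root object drags in everything attached to it (illustrative Python; `Obj`, `commit` and the event/track/hit graph are invented):

```python
# Persistence by reachability: committing a root object implicitly
# persists every object reachable from it through references.
class Obj:
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)

def commit(root):
    store, todo = {}, [root]
    while todo:
        o = todo.pop()
        if id(o) in store:
            continue              # already persistent, do not re-store
        store[id(o)] = o
        todo.extend(o.refs)       # attached objects become persistent too
    return store

hit = Obj("hit")
track = Obj("track", [hit])
event = Obj("event", [track])
db = commit(event)                # persists event, track and hit
```

Note how the user never said where `hit` should go — which is exactly why clustering control is difficult in this mode.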
12. Physical Model and Logical Model
- Physical model may be changed to optimise performance
- Existing applications continue to work
13. Realistic Object Persistency
[Diagram: objects mapped onto pages within a file, with conversion from/to a computationally optimal format (compression?) on one side and conversion from/to a machine-dependent format (new shape) on the other]
14. Components of a POM
- Storage manager
  - manages the physical structure on disk
- Transaction/concurrency manager
  - client transactions, journaling, locking mechanisms
  - (or rely on OS and file system protections)
- RTTI system
  - identifies the concrete type of the object to retrieve/store
- Converters
  - from storage format to user format and vice versa
  - machine dependencies, schema evolution, user hooks
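The interplay of the RTTI system and the converters can be sketched as a registry keyed by concrete class name: the store records which class each object is, and retrieval rebuilds the right type (a hypothetical Python sketch, not any real POM's API; `Cluster` is invented):

```python
import json

# The RTTI component: a registry mapping stored class names to classes.
REGISTRY = {}

def persistent(cls):
    REGISTRY[cls.__name__] = cls
    return cls

@persistent
class Cluster:
    def __init__(self, energy):
        self.energy = energy

def store(obj):
    # Record the concrete type alongside the data members.
    return json.dumps({"_class": type(obj).__name__, **vars(obj)})

def retrieve(blob):
    state = json.loads(blob)
    cls = REGISTRY[state.pop("_class")]   # concrete-type lookup (RTTI)
    obj = cls.__new__(cls)                # converter: storage -> user format
    obj.__dict__.update(state)
    return obj

c = retrieve(store(Cluster(energy=7.5)))
```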
15. Components of a POM
- Application Cache manager
- dynamic memory management with garbage collection
- Tools and (G)UI
- naming, indexing, query mechanisms
- interactive browsing and query
- development tools
- administration tools
16. Objectivity/DB
- ODBMS close to the ODMG standard (library, not framework)
- Storage Manager based on a fixed physical hierarchy
  - slot - page - container - database (file) - federation
- Lock-server and journals to manage transactions
- Proprietary parsing of an extension of C++ (ooddlx)
- Objects are converted when opened
  - schema-evolution effects automatic or user defined
- Basic naming, indexing and query mechanisms
- Crude browsing and administration tools
  - but Objy is integrated with some third-party frameworks
17. ROOT
- Application Framework with embedded I/O
- Storage Manager based on
  - logical hierarchy: TBasket - branch - tree
  - physical: logical records in files
- No transactions, no concurrency management
- Proprietary parsing of a C++ subset (CINT)
- Objects are converted when retrieved (Streamer)
  - Automatically or by the user (schema evolution only by the user)
- No naming, indexing or query mechanisms
  - but CINT scripting
- Powerful interactive environment
18. (Wrapped O)RDBMS
- Powerful, reliable and efficient storage managers with full concurrency and transaction management
- SQL query mechanisms with transparent (hidden) indexing and naming
- User-friendly, fully integrated browsers and tools
  - (for relational tables)
- Poor object integration
  - (developers should be both OO and ER experts at the same time)
20. HEP Data
- Environmental data
  - Detector and Accelerator status
  - Calibrations, Alignments
- Event-Collection Meta-Data
  - (luminosity, selection criteria, ...)
- Event Data, User Data
21. Environmental Data
[Diagram: Geometry (Versions A-C), Alignment (Versions A-C) and Calibration (Versions A-B) parameters laid out along a time axis; a snapshot selects the environmental data items valid for the currently processed event]
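The snapshot mechanism — picking, for each data item, the version valid at the event time — can be sketched as an interval-of-validity lookup (illustrative Python added here; `Folder` and the timestamps are invented):

```python
import bisect

# A conditions "folder": versions with start-of-validity times; a lookup
# picks the version whose validity interval covers the event time.
class Folder:
    def __init__(self):
        self._since, self._payloads = [], []

    def add(self, since, payload):
        # Versions are appended in increasing time order.
        self._since.append(since)
        self._payloads.append(payload)

    def lookup(self, t):
        # Last version whose start time is <= t.
        i = bisect.bisect_right(self._since, t) - 1
        return self._payloads[i]

alignment = Folder()
alignment.add(0, "Version A")
alignment.add(100, "Version B")
alignment.add(250, "Version C")
```

A full snapshot for an event would simply apply this lookup to every folder (geometry, alignment, calibration) at the event's time.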
22. Event Structure Placement (BaBar)
[Diagram: the Event Header (Hdr, with Tag and Evs) points to per-component headers — Sim, Raw, Emc, Trk, Pid, Beta — whose data are placed in separate databases by tier: Sim, Raw, Rec (Emc/Trk/Pid data), Esd (Trk/Pid/Beta data), Aod]
23. BaBar Event Structure
- Decoupling of placement and navigation
- Hierarchical Placement Regions
  - Sim (Simulated Data): 100 kBytes/event
  - Tru (Simulated Truth Data): 40 kBytes/event
  - Raw (Raw Data): 30 kBytes/event
  - Rec (Reconstructed Data): 100 kBytes/event
  - Esd (Event Summary Data): 20 kBytes/event
  - Aod (Analysis Object Data): 2 kBytes/event
  - Tag (Event Selection Tag): 200 Bytes/event
- Navigation Trees
  - Minimize size of navigation headers
  - Allow for expansion of data without schema evolution
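The navigation-tree idea — a small header that fetches component tiers only on demand — can be sketched as follows (illustrative Python, not BaBar code; the tier contents are invented):

```python
# Stand-in for the separate per-tier databases of the placement diagram.
TIERS = {
    "Tag": {"ev1": {"nTracks": 4}},
    "Aod": {"ev1": {"p4": [1, 2, 3, 4]}},
    "Raw": {"ev1": {"adc": list(range(8))}},
}

class EventHeader:
    """Lightweight navigation header: holds only a key, not the data."""
    def __init__(self, key):
        self.key = key
        self._cache = {}

    def component(self, tier):
        if tier not in self._cache:                # I/O only on demand
            self._cache[tier] = TIERS[tier][self.key]
        return self._cache[tier]

hdr = EventHeader("ev1")
```

A Tag-only selection pass touches 200 bytes per event; the 30-kByte Raw tier is read only when a component is actually asked for.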
24. ROOT Physical Clustering
25. ODBMS-MSS Integration
- SLAC-Objy Plan
- Extensible AMS
  - Allows use of any type of filesystem via an oofs layer
- Generic Authentication Protocol
  - Allows proper client identification
- Opaque Information Protocol
  - Allows passing of hints to improve filesystem performance
- Defer Request Protocol
  - Accommodates hierarchical filesystems
- Redirection Protocol
  - Accommodates terabyte filesystems
  - Provides for dynamic load balancing
26. Dynamic Load Balancing: Hierarchical Secure AMS
[Diagram: a client dynamically selects among AMS servers fronting Redwood tape drives and HPSS]
27. One Technology for All?
- Event catalogues
  - Update (add and remove) items of a catalogue
  - Searchable: SQL or equivalent
- Event data
  - Write once, read many (WORM)
  - Often on tertiary (sequential) storage
  - Bulk data used by the entire collaboration (Raw, Rec, ...)
  - User-extracted data (N-tuples)
28. One Technology for All?
- Detector data
  - Updates of data items
  - Versioning of data items
  - Version configuration
- Statistical data
  - Understandable by interactive tools
- A single coherent solution (non-optimal for all purposes)?
- or an ad-hoc optimal product for each given type?
29. LHCb Event Persistency
[Diagram: Algorithms, driven by the AppManager, access the Transient Event Store through the Event Data Service; a Persistency Service and OutputStreams dispatch to per-technology conversion services — SicbCnvSvc with Converters for Sicb/Zebra data files, RootCnvSvc with Converters for Root I/O data files]
30. LHCb Generic Persistent Model
[Diagram: a 12-byte OID (<number>) is resolved via a lookup table to a technology-specific Converter (steps 1-4)]
31. LHCb Link Tables
- One link table per storage technology per DB
- Link to an Objy object
  - no link table
  - 8 bytes are enough to hold the ooRef directly
- Link to a ROOT object
  - Link table entry must contain all navigation info
    - File name
    - Tree/Branch name
- Link to a ZEBRA (SICB) object
  - Link table contains file name and ZEBRA bank name
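The link-table resolution can be sketched as follows (illustrative Python; the table entries and the `(link, number)` OID layout are simplified from the slides):

```python
# One link table per storage technology per DB; each entry carries the
# navigation info that the OID itself cannot hold (illustrative entries).
LINK_TABLE = [
    {"tech": "ROOT", "file": "run1.root", "branch": "RecTracks"},
    {"tech": "ZEBRA", "file": "run1.sicb", "bank": "TKRH"},
]

def resolve(oid):
    # The OID splits into a link-table index and an object number.
    link_id, number = oid
    entry = LINK_TABLE[link_id]
    return dict(entry, number=number)

loc = resolve((0, 17))   # 18th object on the RecTracks branch
```

For Objectivity the indirection is unnecessary: the 8-byte ooRef fits in the OID itself, so no table entry is needed.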
32. Hybrid Event Store in STAR
- Adoption of ROOT I/O for the event store leaves Objectivity with one role left to cover: the true database functions of the event store
  - Navigation among event collections, runs/events, event components
  - Data locality (now translates basically to file lookup)
  - Management of dynamic, asynchronous updating of the event store from one end of the processing chain to the other
    - From initiation of an event collection (run) in online, through addition of components in reconstruction, analysis, and their iterations
- But with the concerns and weight of Objectivity, it is overkill for this role.
- So we went shopping
  - looking to leverage the world around us, as always
  - eyeing particularly the rising wave of Internet-driven tools and open software
  - and came up with MySQL in May.
33. Requirements: STAR 8/99 View
34. RHIC Data Management: Factors For Evaluation
- Changes in the STAR view from '97 to now are shown
[Table: Objy vs Root+MySQL ratings for each factor — Cost; Performance and capability as data access solution; Quality of technical support; Ease of use, quality of documentation; Ease of integration with analysis; Ease of maintenance, risk; Commonality among experiments; Extent and leverage of outside usage; Affordable/manageable outside RCF; Quality of data distribution mechanisms; Integrity of replica copies; Availability of browser tools; Flexibility in controlling permanent storage location; Level of relevant standards compliance, e.g. ODMG; Java access; Partitioning DB and resources among groups]
35. Experiments: Status and Plans
36. CMS
- Use Objy in production
  - Test Beam DAQ
  - Monte Carlo (GEANT3) reconstruction
- Objectivity fully integrated in the Application Framework (CARF)
  - CARF manages transactions, physical clustering and the whole persistent object structure and its relations with the transient structure
  - users access persistent objects through C++ pointers
    - CARF takes care of pinning
  - leaf inheritance from ooObj often used
37. CMS
- Limited use of Objectivity extensions
  - associations, indexes, maps, query predicates, etc.
  - object copy, move, versions
- Schema evolution routinely used
  - No complex object conversion attempted so far
- Multi-federation environment to decouple
  - production
  - analysis
  - development
38. CMS Production Federations
[Diagram: an Online FD (Online Boot: empty user DBs, system DBs, last run-data DB) and an Offline FD (Offline Boot), with a Clone FD holding a copy of the empty user DBs, system DBs and all run-data DBs (Run1..RunN, Conf, Us1, Us2, RunCat)]
39. CMS User Federations
[Diagram: a User FD (User Boot) derived from the Offline FD (empty user DBs, system DBs, all run-data DBs): it populates user DBs, links system DBs, and copies or links run-data DBs (Run1..RunN, Conf, Us1, Us2, RunCat)]
40. ATLAS
- Used Objectivity in several test-bed applications
  - HCAL test-beam
  - ATLFAST
  - 1 TB Milestone (HPSS used as MSS)
- Plan to use Objectivity in future test-beams and Monte Carlo reconstruction
- The application framework will provide a database-independent interface
41. ALICE
- Simulation and reconstruction framework fully integrated in ROOT
- Used in test beams
  - (actually a real Heavy Ion experiment)
- Mock-up Data Challenge: 7 TB in seven days
  - Monte Carlo simulation and reconstruction
  - Use HPSS and/or CASTOR for file management
42. ALICE DC II
[Diagram: data flow from the NA57 data source (9 PowerPC/AIX nodes in the Computer Centre, 5 MB/s) and the ALICE DAQ data source (DATE LDC/GDC) through LDCs and switches on an Intel/Linux PC cluster of 10-15 nodes (GB ethernet) into the GDC Event Builder, piped to the ROOT Objectifier (Intel/Linux PC, PowerPC/AIX, Sun), then at 10 MB/s over GB ethernet to HPSS / CASTOR / ??]
43. LHCb
- Do not want to be limited to one persistency technology
  - Speed, when you need speed
  - Functionality, when you need functionality
  - Ease migration to upcoming (superior) technologies
- Independence
  - Well-defined interface to persistency technologies
  - Interface: an abstract, technology-independent API
  - Example: ODBC for relational DBs
44. LHCb
- The LHCb application framework (GAUDI) is independent of the persistency technology
- Manages its own application caches (data services), specialized in
  - event data, detector data, statistical data
- Provides abstract interfaces for user-provided converters
45. BaBar
- Taking data since May
- Uses Objectivity for all kinds of data
  - many home-made tools to manage the database
- Complete decoupling between transient objects (seen by the end user) and their persistent representations
- No schema evolution (explicit renaming of classes)
- Starting to use multiple federations to decouple running environments
46. STAR
- Hybrid solution
  - ROOT for event files
  - MySQL for event catalog and environmental data
    - MySQL under test for event tags as well
  - HPSS (through the Grand Challenge) for tertiary storage management
47. (no transcript)
48. Fermilab Run II (CDF, DØ)
- Sequential access model based on Run I experience
  - focus on efficient data access from hierarchical storage
  - clustering optimized to the largest-volume data access pattern
- Use
  - ROOT (CDF) and EVpack (modified DSPACK, DØ) for event files (MSQL and Oracle8 evaluated by DØ)
    - just I/O back-ends to EDM and DØOM
  - SAM for event catalog and file management
  - Oracle8 as supporting database
49. Data Organization
[Diagram: event information tiers with metadata, warm cache and physical clustering; user and physics-group (derived) data — from the Oct 1997 review, Lee Lueking]
50. Data Access
[Diagram: data flow from tape and disk storage (mass storage) through "Freight Train" and "Pick Event" pipelines, metadata and Thumbnail tiers, to consumers: groups of users and single users with user files — Lee Lueking, October 1997]
51. Season IV: aggregate bandwidths, summed from spreadsheet
52. (Non-technical) Risk Analysis
53. Toward the 2001 Milestone
- "If the ODBMS industry flourishes it is very likely that by 2005 CMS will be able to obtain products, embodying thousands of man-years of work, that are well matched to its worldwide data management and access needs. The cost of such products to CMS will be equivalent to at most a few man-years. We believe that the ODBMS industry and the corresponding market are likely to flourish. However, if this is not the case, a decision will have to be made in approximately the year 2000 to devote some tens of man-years of effort to the development of a less satisfactory data management system for the LHC experiments."
- (CMS Computing Technical Proposal, section 3.2, page 22)
54. Commercial vs GPL
- Commercial
  - Robust, tested, maintained, well documented (is stable)
  - Response to upgrade requests is slow
    - they cannot jeopardize deployed applications
    - priority given to short-term profit
  - Difficult to understand internal details (no source)
    - but in principle documentation should be enough
  - Can go out of business
- GPL
  - Good enough for physicists
    - requires internal certification
  - Response to upgrade requests even too fast
    - old users usually ready to jump on new features
    - priority given to challenging requests...
  - Open source
    - often you need it
  - Author could get bored
55. ODBMS
- Objectivity seems to satisfy HEP technical requirements
- Needs upgrades for
  - VLDB support
  - Mass storage interface
  - remote access and data distribution
- More than a DBMS, it is a DB access layer
  - requires integration with (or interfacing to) application frameworks and administration tools
- It is the only real ODBMS survivor on the market
  - how long will it last?
56. ROOT
- A physics analysis framework with I/O support
  - Classified also as a rapid-prototyping tool (B. Meyer)
- Not sufficient for the management of large data volumes (a major LHC requirement)
  - an external DBMS is required to manage meta-data
- Limited experience so far (as a POM in production)
- Many motivated users, actively supported by the authors
- Requires major architectural changes to make it modular
  - for those who do not want to use it as a framework
57. Yet Another POM
- Prototype required to understand problems and estimate effort
- Usable as a test-bed before asking a commercial partner for upgrades
- Usable as a light POM?
  - no transactions, no journaling, no schema, just data
  - (Objectivity can be used in this mode, just make sure to write-protect input files!!!)
58. Personal Comments
- Event Data
  - object modeling and direct navigation: OK
  - DBMS tools (query processor, smart associations, indexes, names, versions): more a burden than a help
- Event Catalog, Environmental data, Detector description
  - fit better standard (O)DBMS practices and tools
- Statistical data
  - simple I/O is not enough; need direct relations with the event catalog and event data
- Relational models do not suit HEP applications
59. Personal Comments
- My personal LEP experience brought me to the conclusion that a multitude of persistency solutions are difficult to manage and integrate properly.
  - In particular, a file-based event store (with filenames encoding metadata) does not scale.
- My current (limited) experience tends to convince me more and more that a coherent approach to persistency is the only solution for LHC, given the resource constraints we have.
60. Personal Comments
- Applications need to be independent of the underlying technologies
- Migration to a new technology should imply a finite effort
  - Market survey: 0.5 PY
  - Learning: 1 year
  - Implementation: 1 PY
  - User migration: 0?
  - (P stands for Person, not Peta!)
61. Questions
- One size fits all?
  - one coherent solution
  - or several tools optimized for each problem?
    - how does integration go (transaction synchronization, for instance)?
- How integrated should a POM/DBMS be in the application framework?
- Is hierarchical storage incompatible with transparent object navigation?
  - Optimization of distributed resources needs preemptive localization of the data to be accessed
62. Questions
- Do HEP applications require 4GL query processors?
  - Is a (multiple) OO language binding not enough?
- Is Objectivity a possible choice?
  - from the technical, political and managerial sides
- Does ROOT + RDBMS scale to LHC data volumes?
  - RD45 was initiated with the idea that a file-based event store + metadata catalog (LEP-like) would not be sufficient...
- What should be the objective of an alternative POM prototype?
63. Questions