Slide 1: Software Frameworks for CMS Data Analysis
- Vincenzo Innocente
- CERN/EP
Slide 2: Data Analysis Micro-Process
- Physics analysis is to a large degree an iterative process of
  - Reducing data samples to more interesting subsets
  - Distilling the sample into information at a higher abstraction level
    - By summarising lower-level information
    - By calculating statistical entities from the samples (sketched below)
- A large part of the work can be done on very high-level entities in an interactive analysis and presentation tool
  - Hence the focus on tools that work on simple summary information (DSTs, N-tuples, tag databases, ...)
  - Additional tools for detector and event visualisation
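A minimal sketch of this micro-process in Python, with purely illustrative event records and cuts (nothing here is CMS code):

```python
# Iteratively reduce a sample to an interesting subset, then distill it
# into higher-level statistical entities. All names are illustrative.
import statistics

events = [{"pt": 3.2}, {"pt": 25.1}, {"pt": 47.8}, {"pt": 1.1}]

subset = [e for e in events if e["pt"] > 20.0]         # reduce the sample
summary = {                                            # distill to a summary
    "n": len(subset),
    "mean_pt": statistics.mean(e["pt"] for e in subset),
}
print(summary)   # -> n=2, mean_pt ≈ 36.45
```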
Slide 3: CMS Data Analysis Model
[Diagram: Quasi-online Reconstruction, Online Monitoring, Detector Control (environmental data), the Event Filter Object Formatter, Data Quality / Calibrations / Group Analysis, Simulation and on-demand User Analysis all store reconstructed objects and calibrations in, and request parts of events from, a Persistent Object Store Manager backed by a Database Management System; the chain ends in a physics paper.]
Slide 4: Offline Architecture: New Requirements
- Bigger experiment, higher rate, more data
- Larger and more dispersed user community performing non-trivial queries against a large event store
- New IT technologies to make the best use of
- Increased demand for both flexibility and coherence
  - ability to plug in new algorithms
  - ability to run the same algorithms in multiple environments
  - guarantees of quality and reproducibility
  - high-performance user-friendliness
Slide 5: Analysis Environments
- Real-Time Event Filtering and Monitoring
  - Data-driven pipeline
  - High reliability
- Pre-emptive Simulation, Reconstruction and Event Classification
  - Massively parallel, batch-sequential processing
  - Excellent error-recovery and rollback mechanisms
  - Excellent scheduling and bookkeeping systems
- Interactive Statistical Analysis
  - Rapid application development environment
  - Excellent visualization and browsing tools
  - Human-readable navigation
Slide 6: Migration
- Today's Nobel prize becomes tomorrow's trigger
  - (and the day after's background)
- Boundaries between running environments are fuzzy
  - Physics analysis algorithms should migrate up to the online to make the trigger more selective
  - Robust batch systems should be made available for physics analysis of large data samples
  - The results of offline calibrations should be fed back to the online to make the trigger more efficient
Slide 7: Coherent Analysis Environment
[Diagram: a coherent environment connecting Network Services, Visualization Tools, Reconstruction, Simulation, Batch Services, Analysis Tools and Persistency Services.]
Slide 8: The Challenge
- Beyond the interactive analysis tool (user point of view)
  - Data analysis and presentation: N-tuples, histograms, fitting, plotting, ...
- A great range of other activities with fuzzy boundaries (developer point of view)
  - Batch
  - Interactive: from point-and-click, to Emacs-like power tools, to scripting
  - Setting up configuration management tools, application frameworks and reconstruction packages
  - Data store operations: replicating entire data stores; copying runs, events and event parts between stores; not just copying but also doing something more complicated (filtering, reconstruction, analysis, ...)
  - Browsing data stores down to object detail level
  - 2D and 3D visualisation
  - Moving code across final analysis, reconstruction and triggers
- Today this involves (too) many tools
Slide 9: Analysis and Reconstruction Framework
[Diagram: physics modules (Reconstruction Algorithms, Data Monitoring, Event Filter, Physics Analysis) plug into specific frameworks built on a Generic Application Framework, which manages Calibration Objects, Event Objects and Configuration Objects and rests, through adapters and extensions, on a Utility Toolkit.]
Slide 10: Why Frameworks?
- Physicists concentrate on the development of reconstruction and analysis algorithms, delivered as plug-in modules (a minimal sketch of the pattern follows this list)
- The framework
  - orchestrates instances of these modules
  - hides system-related complexities
  - allows code to be shared for common or related tasks
- Changes in the physics reconstruction and analysis logic affect only the plug-ins
- Changes in system services, or migration to new IT technologies, affect only the framework
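A minimal sketch of the plug-in pattern described above; the Module/Framework/TrackCounter names are hypothetical, not the CMS (COBRA) API:

```python
# The framework owns the event loop and scheduling; physicists only
# write modules conforming to a small interface.

class Module:
    """Base interface every physics plug-in implements."""
    def process(self, event):
        raise NotImplementedError

class Framework:
    """Orchestrates plug-in instances and hides system details."""
    def __init__(self):
        self._modules = []

    def register(self, module):
        self._modules.append(module)

    def run(self, events):
        for event in events:            # event loop owned by the framework
            for module in self._modules:
                module.process(event)

class TrackCounter(Module):
    """Example physics plug-in: knows nothing about I/O or scheduling."""
    def __init__(self):
        self.n_tracks = 0
    def process(self, event):
        self.n_tracks += len(event.get("tracks", []))

fw = Framework()
counter = TrackCounter()
fw.register(counter)
fw.run([{"tracks": [1, 2, 3]}, {"tracks": [4]}])
print(counter.n_tracks)   # -> 4
```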
Slide 11: Questions
- What is the role of an experiment-specific framework?
- How does it integrate with more generic frameworks?
- How can the user have a coherent and consistent view of the analysis process?
- How can new tools (new frameworks) be integrated without disrupting the existing architecture?
Slide 12: Difficult Balance
"The most profoundly elegant framework will never be reused unless the cost of understanding it and then reusing its abstractions is lower than the programmer's perceived cost of writing them from scratch." (G. Booch, 1994)
- Flexibility (many abstractions)
  - Wide range of applications
  - Great potential for extension and migration
  - Difficult to understand and to use
- Rigidity (few abstractions, many concrete classes)
  - Easy to use
  - Limited range of applications
  - Difficult to migrate or extend
Slide 13: Incoherent Solution
- The experiment kernel deals with just one problem: event processing
- External tools are kept as they are
- Communication goes through I/O converters
- Persistency is just one (or more) of the external tools
- Users see a different environment for each part of the problem domain
Slide 14: Coherent, Monolithic Solution
- The framework kernel is expanded to cover the whole problem domain
- Users see "The Framework"
- New tools must be incorporated into the framework
- Imported classes must be modified to derive from framework base classes to keep coherency
- Persistency is implemented by the framework
- Example: MS
Slide 15: Coherent, Non-invasive Solution
- Users see a standard environment that also acts as integration glue
- The experiment kernel is composed of a hierarchy of application frameworks, reusable in various parts of the problem domain
- External frameworks are integrated directly, if they conform to the standard environment, or through wrappers, if not (see the adapter sketch below)
- Persistency is encapsulated by one of the kernel application frameworks
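A hedged sketch of the wrapper route, with invented names: the external tool keeps its own interface, and a thin adapter makes it conform to the standard environment:

```python
class ExternalHistogrammer:
    """Stand-in for a third-party tool with its own interface."""
    def add_entry(self, value, weight):
        print(f"filled {value} (w={weight})")

class HistogramAdapter:
    """Wrapper exposing the environment's expected fill() interface."""
    def __init__(self, external):
        self._external = external
    def fill(self, value, weight=1.0):
        self._external.add_entry(value, weight)   # delegate, don't modify

h = HistogramAdapter(ExternalHistogrammer())
h.fill(3.14)   # the rest of the kernel sees only the standard interface
```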
Which Glue?
Slide 16: Python
- Python is an interpreted, object-oriented language introduced at the beginning of the 1990s
- It spread quickly, particularly among scientific communities in search of a rapid application development tool able to integrate efficiently already existing, highly optimized scientific software (example: http://sources.redhat.com/gsl)
- Python provides
  - Scripting functionality, as in Perl or Tcl
  - Runtime dynamic loading (a loading sketch follows this list)
  - A standard OO library for system-level support
  - Simple mechanisms for interfacing to C objects
  - A large body of open-source modules covering a wide spectrum of application domains, scientific ones in particular
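As a small illustration of the runtime-dynamic-loading bullet, a sketch using importlib from the standard library; the stdlib modules here merely stand in for analysis plug-ins:

```python
# Modules are named in a configuration string and loaded only when the
# application runs, so new algorithms can be plugged in without relinking.
import importlib

plugin_names = ["json", "csv"]   # stand-ins for analysis plug-in modules
plugins = {name: importlib.import_module(name) for name in plugin_names}
print(plugins["json"].dumps({"run": 42}))   # loaded module is a live object
```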
Slide 17: Python as a Glue
- Integration in Python is non-intrusive
  - Only the class interface is exported to Python: encapsulation is preserved
  - The original (C++) representation is respected: no translation, no conversion
  - Additional Python-specific extensions do not impact the original design and functionality
- Binding with Python happens at runtime (see the binding sketch below)
  - Batch applications need not be Python-aware
  - Interactive applications can be extended (actually, constructed) and modified at runtime
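A hedged sketch of runtime binding to compiled code; the slides do not name their binding layer, so ctypes from the standard library stands in here, with libm's cos() playing the exported framework function (Unix-like systems assumed):

```python
import ctypes, ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))   # load at runtime
libm.cos.restype = ctypes.c_double                  # declare the interface
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))   # 1.0 -- the C object is used as-is, no conversion
```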
Slide 18: Examples (personal experience)
- Exporting the interface of an application framework such as Objectivity/DB took a few hours
- The CERN/IT physics analysis environment (ANAPHE) provides a complete Python binding (Lizard) which does not affect the core C++ library
- Seamless integration of the CMS framework kernel (COBRA) and the CERN/IT ANAPHE library through their (independent) Python interfaces
- Direct application of other Python modules (regular expressions, string/list manipulation, numerics, etc.) to ANAPHE or COBRA objects (illustrated below)
- Zero effort in downloading, installing and using GSL with Lizard
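An illustrative interactive-session sketch of the "direct application of Python modules" point; the track labels are invented stand-ins, not real COBRA/ANAPHE objects:

```python
# Ordinary stdlib tools applied directly to framework data, no converters.
import re
import statistics

track_labels = ["mu+_pt12.3", "pi-_pt0.8", "mu-_pt45.1"]    # stand-in data
muons = [t for t in track_labels if re.match(r"mu", t)]     # stdlib regex
pts = [float(re.search(r"pt([\d.]+)", t).group(1)) for t in muons]
print(muons, statistics.mean(pts))   # -> ['mu+_pt12.3', 'mu-_pt45.1'] 28.7
```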
Slide 19: [Screenshots]
- Emacs used to edit a CMS C++ plug-in to create and fill histograms
- OpenInventor-based display of a selected event
- Lizard Qt plotter
- ANAPHE histogram extended with pointers to CMS events
- Python shell with Lizard and CMS modules
Slide 20: Example of Today's Data Analysis
[Diagram: an interactive user analysis in Python (Lizard) asks to visualize one event, triggering on-demand reconstruction and visualization; user data and parts of events are requested from, and selected events, reconstructed objects and user data stored in, a Persistent Object Store Manager backed by a Database Management System, which also serves offline reconstruction and analysis and simulation.]
Slide 21: Key Components
- A unique context for persistent objects
  - Today limited in COBRA to a single Objectivity federation
- A unique persistent object identifier
  - Used in communication among threads and processes
- Ability to process a single event with no other a-priori knowledge
  - Navigation from the event to its environment (conditions)
- On-demand reconstruction (implicit invocation; sketched below)
- Python used as glue
  - Replacing ksh, csh, tcl, perl, kuip, sigma (you name it)
  - Plug-and-play environment
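A minimal sketch of on-demand reconstruction via implicit invocation; Event, tracks and the stand-in algorithm are hypothetical names, not the COBRA API:

```python
# Asking for event.tracks implicitly invokes reconstruction the first
# time, then caches the result, so clients never schedule it themselves.
class Event:
    def __init__(self, raw_hits):
        self.raw_hits = raw_hits
        self._tracks = None

    @property
    def tracks(self):
        if self._tracks is None:                 # implicit invocation
            self._tracks = self._reconstruct()
        return self._tracks

    def _reconstruct(self):
        print("reconstructing...")
        return [h * 2 for h in self.raw_hits]    # stand-in algorithm

e = Event([1, 2, 3])
print(e.tracks)   # triggers reconstruction
print(e.tracks)   # cached; no second reconstruction
```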
Slide 22: HEP Data
- Event-collection meta-data
- Environmental data
  - Detector and accelerator status
  - Calibrations, alignments
  - (luminosity, selection criteria, ...)
- Event data, user data

Navigation is essential for effective physics analysis. Complexity requires coherent access mechanisms (sketched below).
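A hedged sketch (all names invented) of navigating from an event to the environmental data valid for its run through one coherent access mechanism:

```python
# Stand-in for the conditions store: payloads keyed by run-number ranges.
conditions_db = {
    (1, 100): {"calib": 0.98, "status": "stable beams"},
    (101, 200): {"calib": 1.02, "status": "cosmics"},
}

def conditions_for(run):
    """Navigate from an event's run number to its conditions payload."""
    for (lo, hi), payload in conditions_db.items():
        if lo <= run <= hi:
            return payload
    raise KeyError(run)

event = {"run": 142, "hits": [0.5, 1.5]}
print(conditions_for(event["run"]))   # event -> environment navigation
```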
Slide 23: Conclusions (Challenges)
- Today's HEP experiments
  - Bigger, higher rate, more data, last longer
  - Larger and more dispersed user community
- IT
  - Ubiquitous
  - Develops fast
  - Becomes obsolete even faster
- Traditional HEP analysis software architectures
  - Monolithic
  - Incoherent
Slide 24: Conclusions (Solutions)
- A hierarchy of non-intrusive, loosely coupled frameworks
  - Easier maintenance, evolution, migration
- A standard framework acting as glue
  - Easier integration
  - Coherent user view
- A powerful, flexible persistency mechanism
  - Uniform
  - Transparent data access