1
GRID Workshop Marseille
Prof. Dr. Volker Lindenstruth
Kirchhoff Institute for Physics, University Heidelberg, Germany
Phone: +49 6221 54 4303, Fax: +49 6221 54 4045
Email: ti@kip.uni-heidelberg.de, WWW: www.ti.uni-hd.de
  • Introduction
  • ALICE L3 Trigger Processor
  • Vision of LHC farm computing

2
The Kirchhoff Institute
  • Founded October 1999 (www.kip.uni-heidelberg.de)
  • 180 full-time employees, 12 professors (11 CS)
  • 4 chairs (3x experimental physics, 1x computer science)
  • In-house computer center/computer division, electronics division, mechanical design division, ASIC laboratory incl. clean room
  • Various computer pools, hardware lab course (Hardware-Praktikum)
  • Chair of Computer Science
  • Founded July 1998
  • Projects
  • ALICE Transition Radiation Trigger Processor
  • ALICE L3 Trigger Processor
  • HEP Data Grid
  • LHCb L1 Vertex Trigger Processor

3
TPC Readout
  • 19000 analog channels/sector
  • 512 time bins per analog channel
  • 10-bit value per time bin
  • 128 channels per front-end board
  • 32 front-end boards per readout controller or
    optical link
  • 4096 channels per optical link
  • de-randomizing buffer for 8 full events in
    front-end
  • zero suppression in front-end
  • zero-suppressed raw event: 375 kB per event and link
  • readout rate up to 200 Hz, i.e. 75 MB/s per link and node (see the rate check below)

[Figure: TPC readout chain, 2x18 sectors, from the cave to the counting house; six receivers per sector feed the receiver processors and the L3 network]
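A quick consistency check of these numbers (a minimal Python sketch; the 36 sectors and six links per sector are taken from the figure above, all other figures from the bullets):

```python
# Back-of-the-envelope check of the TPC readout rates quoted above.
EVENT_SIZE_BYTES = 375e3      # zero-suppressed event size per optical link
READOUT_RATE_HZ = 200         # maximum readout rate
SECTORS = 36                  # 2 x 18 TPC sectors
LINKS_PER_SECTOR = 6          # six receivers per sector

per_link = EVENT_SIZE_BYTES * READOUT_RATE_HZ        # bytes/s per link and node
aggregate = per_link * SECTORS * LINKS_PER_SECTOR    # bytes/s into the whole farm

print(f"per link:  {per_link / 1e6:.0f} MB/s")       # 75 MB/s, as quoted
print(f"aggregate: {aggregate / 1e9:.1f} GB/s")      # ~16 GB/s, the order of the
                                                     # aggregate rate quoted later
```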
4
PCI Receiver Card Architecture
The device can be mounted in any computer, allowing that computer to also be used as a GRID node when off-line (a software sketch of the push readout follows below).
[Block diagram: optical receiver feeds an FPGA with a pointer FIFO and multi-event buffer; data and pointers are push-read over 64-bit/66 MHz PCI through the PCI host bridge into host memory]
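A minimal software analogue of the push readout sketched above (illustrative only; on the real card the FPGA and PCI hardware do this): event data land in a multi-event buffer while a pointer FIFO tells the host where to find them, so the host never polls the optical link.

```python
from collections import deque

class MultiEventBuffer:
    """Toy model of the push readout: the receiver side writes an event into
    one of the buffer slots and queues a (slot, length) descriptor in a FIFO;
    the host side only consumes descriptors and reads the data."""

    def __init__(self, slots: int = 8):          # 8 full events, as in the front-end
        self.slots = [None] * slots
        self.fifo = deque()                      # pointer FIFO
        self.next_slot = 0

    def push_event(self, data: bytes) -> None:   # "receiver card" side
        slot = self.next_slot
        self.slots[slot] = data
        self.fifo.append((slot, len(data)))      # push the pointer, not the data
        self.next_slot = (slot + 1) % len(self.slots)

    def pop_event(self):                         # host side
        if not self.fifo:
            return None                          # no event pending
        slot, length = self.fifo.popleft()
        data, self.slots[slot] = self.slots[slot], None
        return data[:length]
```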
5
ALICE L3 Farm Architecture
[Figure: ALICE trigger and data-flow diagram: trigger detectors (micro channel plate, zero-degree calorimeter, muon trigger chambers, transition radiation detector), the inner tracking system, particle identification, photon spectrometer, muon tracking chambers and the 36 TPC sectors are read out via front-end electronics (FEE) and DDLs; L0/L1/L2 trigger decisions and detector busy are distributed to the detectors, and the data flow through the L3/DAQ/processor farm switch fabric (with EDM) into the L3 on-line/off-line farm]
  • On-Line Requirements
  • 700 nodes (@ 200 SI each)
  • 20 GB/s aggregate input data rate
  • High-throughput architecture
  • Trigger processor → high reliability
  • Analysis farm when ALICE is off-line

[Figure: switch fabric connecting LDC, L3CPU, GDC and PDS nodes]
LDC: Local Data Concentrator, software running on a standard CPU
L3CPU: L3 Trigger Processor, generic commercial off-the-shelf CPU
GDC: Global Data Concentrator, generic commercial off-the-shelf CPU
NIC: Network Interface Card
6
LHC farm computing
  • Low Cost → Modular Architecture (use COTS mass-market computers and networks)
  • LHC farm: low maintenance
  • Monitoring services (all levels)
  • Configuration management
  • Software Management (automated)
  • Fault tolerance, robustness (not necessarily on the job level)
  • This is NOT enterprise computing: to what extent can those paradigms be applied?
  • Need Software Interfaces for
  • Data transfer
  • (Data) file access (do we have a data model?)
  • What analysis software/libraries will be supported?
  • How will job control be supplied?
  • We need Reference Installations and Recommended Platforms (Hard-/Software)
  • GRID KIT?

7
Performance Monitoring
  • Hierarchical performance measurements (regional
    center and GRID wide)
  • Requirement to measure message latency, retry rates, packet loss rates, processor/network utilization
  • Develop software interface to middleware to communicate status (sketched below)
  • Develop prototype/testbench (Ethernet/ATOLL) implementing the interface
  • Visualization of performance data based on
    software interfaces
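One way such a status interface could look; a sketch only, with class and method names (NodeStatus, publish) that are assumptions rather than part of any existing middleware:

```python
from dataclasses import dataclass, asdict

@dataclass
class NodeStatus:
    node_id: str
    timestamp: float
    msg_latency_ms: float     # measured message latency
    retry_rate: float         # retries per second
    packet_loss: float        # fraction of packets lost
    cpu_utilization: float    # 0.0 .. 1.0
    net_utilization: float    # 0.0 .. 1.0

def report_status(middleware, status: NodeStatus) -> None:
    """Hand one measurement record to the middleware, which aggregates it
    hierarchically (node -> regional centre -> GRID-wide monitoring)."""
    middleware.publish("performance", asdict(status))
```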

8
Fault Tolerance
  • Failure modelling (see the sketch after this list)
  • failure types
  • failure classes
  • MTBF for each failure
  • failure-specific recovery strategies
  • fault tolerance levels in the GRID
  • Consequences for cluster topology (how many redundant network links between which nodes, ...)
  • Interface to middleware to communicate failures and receive reconfiguration
  • Fault-tolerant prototype/testbench (Ethernet/ATOLL) to demonstrate failure tolerance for
  • unplugging of network cables
  • resetting nodes
  • node power failure
  • processor crashes
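A sketch of how such a failure model could be encoded; the classes, MTBF figures and recovery functions below are illustrative placeholders, not measured values:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FailureClass:
    name: str
    mtbf_hours: float                    # mean time between failures (placeholder values)
    recovery: Callable[[str], None]      # failure-specific recovery strategy

def reroute_traffic(node: str) -> None: ...   # use a redundant network link
def reboot_node(node: str) -> None: ...       # remote reset via slow control
def migrate_jobs(node: str) -> None: ...      # reschedule work on other nodes

FAILURE_MODEL = [
    FailureClass("network cable unplugged", mtbf_hours=5000.0, recovery=reroute_traffic),
    FailureClass("node reset / crash",      mtbf_hours=2000.0, recovery=reboot_node),
    FailureClass("node power failure",      mtbf_hours=8000.0, recovery=migrate_jobs),
]

def handle_failure(failure_name: str, node: str) -> None:
    """Look up the failure class and apply its recovery strategy."""
    for fc in FAILURE_MODEL:
        if fc.name == failure_name:
            fc.recovery(node)
            return
    raise ValueError(f"unmodelled failure: {failure_name}")
```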

9
Cluster Reconfiguration
  • Generation of a unique GRID node ID (unique within the whole grid)
  • Topology-independent automatic configuration upon startup (node joining cluster / cluster booting)
  • Definition of an asynchronous error reporting mechanism to middleware
  • Definition of an interface to transfer reconfiguration information from the middleware to the reconfiguring node → node reconfiguration
  • Develop prototype/testbench (Ethernet/ATOLL) implementing the interface (a sketch follows below)
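A minimal sketch of the grid-wide unique node ID and the reporting/reconfiguration interface above; the middleware object and its publish method are assumptions for illustration:

```python
import socket
import uuid

def generate_grid_node_id() -> str:
    """Grid-wide unique node ID, independent of the cluster topology:
    host name plus a random UUID."""
    return f"{socket.gethostname()}-{uuid.uuid4()}"

class NodeAgent:
    def __init__(self, middleware):
        self.node_id = generate_grid_node_id()
        self.middleware = middleware
        self.config = {}

    def join_cluster(self) -> None:
        # Topology-independent automatic configuration on startup: announce
        # the node and let the middleware assign its role/configuration.
        self.middleware.publish("join", {"node": self.node_id})

    def report_error(self, error: str) -> None:
        # Asynchronous error reporting to the middleware.
        self.middleware.publish("error", {"node": self.node_id, "error": error})

    def on_reconfigure(self, config: dict) -> None:
        # Reconfiguration information pushed back from the middleware
        # -> node reconfiguration.
        self.config = config
```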

10
Distributed Caching
  • Monitor state (health) of the cluster/grid
  • Cope with incomplete/inconsistent state
    information
  • (Re-)distribute data/files in the cluster/grid in a hierarchical, state-dependent fashion (sketched after this list)
  • Devise or adopt suitable consistency semantics
    and mechanisms
  • Attempt to utilize same mechanisms on both levels
    (cluster and grid), however with different
    trade-offs
  • Reconcile with performance-aware caching
  • Integrate into WAN data management middleware
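A sketch of the state-dependent, hierarchical lookup this implies (two levels, cluster then grid, with the WAN data management layer as origin; all interfaces below are assumptions):

```python
class HierarchicalCache:
    """Look up a file in the local cluster cache first, then in the grid-level
    replicas, finally at the origin (e.g. the WAN data management service).
    On a miss the file is replicated downwards, but only into levels whose
    state is currently reported healthy."""

    def __init__(self, cluster_cache, grid_cache, origin, healthy):
        self.cluster = cluster_cache   # dict-like: filename -> data
        self.grid = grid_cache         # dict-like: filename -> data
        self.origin = origin           # object with a fetch(filename) method
        self.healthy = healthy         # callable: level name -> bool

    def get(self, filename):
        if self.healthy("cluster") and filename in self.cluster:
            return self.cluster[filename]
        if self.healthy("grid") and filename in self.grid:
            data = self.grid[filename]
        else:
            data = self.origin.fetch(filename)   # last resort: WAN transfer
            if self.healthy("grid"):
                self.grid[filename] = data       # replicate at grid level
        if self.healthy("cluster"):
            self.cluster[filename] = data        # replicate locally
        return data
```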

11
GRID Kit
  • All additional hardware/software (drivers, etc.) to turn a COTS computer into a GRID node
  • Development of a slow control GRID standard, integrated into the middleware
  • e.g. a node running hot notifies the middleware, which reschedules applications and powers the node down for cooling (see the watchdog sketch below)
  • Support for slow control functionality
  • remote resetting of nodes
  • remote health monitoring (temperature, power, etc.)
  • remote interruption of nodes
  • remote power-cycling of nodes
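The "node running hot" example could look like this in software; a sketch under the assumption that the GRID kit exposes a temperature readout and a remote power-down call:

```python
import time

TEMP_LIMIT_C = 70.0   # illustrative threshold

def thermal_watchdog(node, middleware, slow_control, read_temperature, period_s=10):
    """Monitor one node's temperature. If it runs hot, notify the middleware
    (which reschedules the node's applications elsewhere) and power the node
    down for cooling via the slow control interface."""
    while True:
        temp = read_temperature(node)
        if temp > TEMP_LIMIT_C:
            middleware.publish("slow_control", {"node": node, "temp_c": temp})
            slow_control.power_down(node)    # remote power-cycle capability of the kit
            return
        time.sleep(period_s)
```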

12
Resources KIP Group
  • Chair of Technical Computer Science (www.ti.uni-hd.de)
  • Approved and funded ALICE L3 project
  • operates one month per year as trigger processor, the rest of the time as a general-purpose compute center
  • scale comparable to a GRID tier-1 regional center, however located at CERN close to the detector
  • fully funded
  • >2 FTEs working on the project today
  • Various small-scale test beds already existing
  • Plan to build an L3/GRID prototype at >50 node scale within the next years

13
L3 Fabric using ATOLL
14
Proposed Contribution to the GRID: Distributed Caching
  • Prof. Hermann Hellwagner
  • Institute of Information Technology
  • University of Klagenfurt, Austria
  • in conjunction with the Univ. Heidelberg / Mannheim proposal

15
Outline
General aspect to be investigated: Distributed Caching for Cluster/Grid Fault Tolerance
  • Fault Tolerance (FT): provide connectivity and access to data/files in case of failures (of nodes, regional centres, comm. links, network partitioning, ...)
  • Distributed and Hierarchical Caching: to be achieved by data/file replication; devise and implement distributed and hierarchical caching schemes
  • Cluster → Grid: develop and test caching schemes on a local basis (cluster), then generalise and apply the schemes to global scale (grid)
  • Team up with HD/MA cluster FT and (re)configuration research
  • Cooperate with the WAN Data Management work package
16
Tasks in Year 1
  • Basic design for FT-aware caching in cluster
  • Study requirements
  • Develop simple fault model
  • Design state monitoring mechanisms for cluster
  • Design basic distributed and hierarchical cache
    topology
  • Devise basic data replication/distribution mechanisms
  • Adopt basic consistency scheme (from other WPs)
  • Implement and test FT caching prototype for
    cluster
  • Static FT-aware cache design and prototype for
    cluster

17
Tasks in Years 2 and 3
  • Year 2: Refine design and prototype for cluster
  • Refine fault model, monitoring mechanisms, cache topology, data replication/distribution mechanisms, consistency scheme
  • Develop and test strategies for these issues
  • Devise redistribution mechanisms for data/files
  • Refine and thoroughly test the FT caching SW in the cluster
  • Dynamic FT-aware cache design and prototype for cluster
  • Year 3: Generalise and apply results to grid

18
Assets and Required Resources
Assets to contribute to the project:
  • Computer Science competence (e.g. distributed cluster computing)
  • Two scientists who will contribute (full CS professor and a Ph.D. student working on distributed video caching)
  • Gigabit cluster testbed (two Extreme Gbps network switches)
  • Experience in EU projects, cooperation with CERN and RAL
Resources for which funding will be requested:
  • Ph.D. position
  • Extension to the cluster (switch, workstations)
  • Access to a GRID testbed