1
GRID Workshop Marseille
Prof. Dr. Volker Lindenstruth
Kirchhoff Institute for Physics, University Heidelberg, Germany
Phone: +49 6221 54 4303, Fax: +49 6221 54 4045
Email: ti@kip.uni-heidelberg.de, WWW: www.ti.uni-hd.de
  • Introduction
  • ALICE L3 Trigger Processor
  • Vision of LHC farm computing

2
The Kirchhoff Institute
  • Founded October 1999 (www.kip.uni-heidelberg.de)
  • 180 full-time employees, 12 professors (11 CS)
  • 4 chairs (3x experimental physics, 1x computer science)
  • In-house computer center/computer division, electronics division, mechanical design division, ASIC laboratory incl. clean room
  • Various computer pools, hardware lab course (Hardware-Praktikum)
  • Chair of Computer Science
  • Founded July 1998
  • Projects
  • ALICE Transition Radiation Trigger Processor
  • ALICE L3 Trigger Processor
  • HEP Data Grid
  • LHCb L1 Vertex Trigger Processor

3
TPC Readout
  • 19000 analog channels/sector
  • 512 time bins per analog channel
  • 10-bit value per time bin
  • 128 channels per front-end board
  • 32 front-end boards per readout controller or
    optical link
  • 4096 channels per optical link
  • de-randomizing buffer for 8 full events in
    front-end
  • zero suppression in front-end
  • zero-suppressed raw event: 375 kB per event and link
  • readout rate up to 200 Hz, i.e. 75 MB/s per link and node (see the rate check below)

[Figure: TPC readout chain, 2x18 sectors, from the cave to the counting house; six receivers per sector feed the receiver processors and the L3 network]
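A quick consistency check of these numbers (a minimal Python sketch; the 36 sectors and six links per sector are taken from the figure above, all other figures from the bullets):

```python
# Back-of-the-envelope check of the TPC readout rates quoted above.
EVENT_SIZE_BYTES = 375e3      # zero-suppressed event size per optical link
READOUT_RATE_HZ = 200         # maximum readout rate
SECTORS = 36                  # 2 x 18 TPC sectors
LINKS_PER_SECTOR = 6          # six receivers per sector

per_link = EVENT_SIZE_BYTES * READOUT_RATE_HZ        # bytes/s per link and node
aggregate = per_link * SECTORS * LINKS_PER_SECTOR    # bytes/s into the whole farm

print(f"per link:  {per_link / 1e6:.0f} MB/s")       # 75 MB/s, as quoted
print(f"aggregate: {aggregate / 1e9:.1f} GB/s")      # ~16 GB/s, the order of the
                                                     # aggregate rate quoted later
```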
4
PCI Receiver Card Architecture
The device can be mounted in any computer, allowing that computer to also be used as a GRID node when off-line (a software sketch of the push readout follows below).
[Block diagram: optical receiver feeds an FPGA with a pointer FIFO and multi-event buffer; data and pointers are push-read over 64-bit/66 MHz PCI through the PCI host bridge into host memory]
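A minimal software analogue of the push readout sketched above (illustrative only; on the real card the FPGA and PCI hardware do this): event data land in a multi-event buffer while a pointer FIFO tells the host where to find them, so the host never polls the optical link.

```python
from collections import deque

class MultiEventBuffer:
    """Toy model of the push readout: the receiver side writes an event into
    one of the buffer slots and queues a (slot, length) descriptor in a FIFO;
    the host side only consumes descriptors and reads the data."""

    def __init__(self, slots: int = 8):          # 8 full events, as in the front-end
        self.slots = [None] * slots
        self.fifo = deque()                      # pointer FIFO
        self.next_slot = 0

    def push_event(self, data: bytes) -> None:   # "receiver card" side
        slot = self.next_slot
        self.slots[slot] = data
        self.fifo.append((slot, len(data)))      # push the pointer, not the data
        self.next_slot = (slot + 1) % len(self.slots)

    def pop_event(self):                         # host side
        if not self.fifo:
            return None                          # no event pending
        slot, length = self.fifo.popleft()
        data, self.slots[slot] = self.slots[slot], None
        return data[:length]
```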
5
ALICE L3 Farm Architecture
[Figure: ALICE trigger and data-flow diagram: trigger detectors (micro channel plate, zero-degree calorimeter, muon trigger chambers, transition radiation detector), the inner tracking system, particle identification, photon spectrometer, muon tracking chambers and the 36 TPC sectors are read out via front-end electronics (FEE) and DDLs; L0/L1/L2 trigger decisions and detector busy are distributed to the detectors, and the data flow through the L3/DAQ/processor farm switch fabric (with EDM) into the L3 on-line/off-line farm]
  • On-Line Requirements
  • 700 nodes (@ 200 SI each)
  • 20 GB/s aggregate input data rate
  • High-throughput architecture
  • Trigger processor → high reliability
  • Analysis farm when ALICE is off-line

[Figure: switch fabric connecting LDC, L3CPU, GDC and PDS nodes]
LDC: Local Data Concentrator, software running on a standard CPU
L3CPU: L3 Trigger Processor, generic commercial off-the-shelf CPU
GDC: Global Data Concentrator, generic commercial off-the-shelf CPU
NIC: Network Interface Card
6
LHC farm computing
  • Low Cost → Modular Architecture (use COTS mass-market computers and networks)
  • LHC farm: low maintenance
  • Monitoring services (all levels)
  • Configuration management
  • Software Management (automated)
  • Fault tolerance, robustness (not necessarily on the job level)
  • This is NOT enterprise computing: to what extent can those paradigms be applied?
  • Need Software Interfaces for
  • Data transfer
  • (Data) file access (do we have a data model?)
  • What analysis software/libraries will be supported?
  • How will job control be supplied?
  • We need Reference Installations and Recommended Platforms (Hard-/Software)
  • GRID KIT?

7
Performance Monitoring
  • Hierarchical performance measurements (regional
    center and GRID wide)
  • Requirement to measure message latency, retry rates, packet loss rates, processor/network utilization
  • Develop software interface to middleware to communicate status (sketched below)
  • Develop prototype/testbench (Ethernet/ATOLL) implementing the interface
  • Visualization of performance data based on
    software interfaces
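One way such a status interface could look; a sketch only, with class and method names (NodeStatus, publish) that are assumptions rather than part of any existing middleware:

```python
from dataclasses import dataclass, asdict

@dataclass
class NodeStatus:
    node_id: str
    timestamp: float
    msg_latency_ms: float     # measured message latency
    retry_rate: float         # retries per second
    packet_loss: float        # fraction of packets lost
    cpu_utilization: float    # 0.0 .. 1.0
    net_utilization: float    # 0.0 .. 1.0

def report_status(middleware, status: NodeStatus) -> None:
    """Hand one measurement record to the middleware, which aggregates it
    hierarchically (node -> regional centre -> GRID-wide monitoring)."""
    middleware.publish("performance", asdict(status))
```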

8
Fault Tolerance
  • Failure modelling (see the sketch after this list)
  • failure types
  • failure classes
  • MTBF for each failure
  • failure-specific recovery strategies
  • fault tolerance levels in the GRID
  • Consequences for cluster topology (how many redundant network links between which nodes, ...)
  • Interface to middleware to communicate failures and receive reconfiguration
  • Fault-tolerant prototype/testbench (Ethernet/ATOLL) to demonstrate failure tolerance for
  • unplugging of network cables
  • resetting nodes
  • node power failure
  • processor crashes
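A sketch of how such a failure model could be encoded; the classes, MTBF figures and recovery functions below are illustrative placeholders, not measured values:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FailureClass:
    name: str
    mtbf_hours: float                    # mean time between failures (placeholder values)
    recovery: Callable[[str], None]      # failure-specific recovery strategy

def reroute_traffic(node: str) -> None: ...   # use a redundant network link
def reboot_node(node: str) -> None: ...       # remote reset via slow control
def migrate_jobs(node: str) -> None: ...      # reschedule work on other nodes

FAILURE_MODEL = [
    FailureClass("network cable unplugged", mtbf_hours=5000.0, recovery=reroute_traffic),
    FailureClass("node reset / crash",      mtbf_hours=2000.0, recovery=reboot_node),
    FailureClass("node power failure",      mtbf_hours=8000.0, recovery=migrate_jobs),
]

def handle_failure(failure_name: str, node: str) -> None:
    """Look up the failure class and apply its recovery strategy."""
    for fc in FAILURE_MODEL:
        if fc.name == failure_name:
            fc.recovery(node)
            return
    raise ValueError(f"unmodelled failure: {failure_name}")
```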

9
Cluster Reconfiguration
  • Generation of a unique GRID node ID (unique within the whole grid)
  • Topology-independent automatic configuration upon startup (node joining cluster / cluster booting)
  • Definition of an asynchronous error reporting mechanism to middleware
  • Definition of an interface to transfer reconfiguration information from the middleware to the reconfiguring node → node reconfiguration
  • Develop prototype/testbench (Ethernet/ATOLL) implementing the interface (a sketch follows below)
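A minimal sketch of the grid-wide unique node ID and the reporting/reconfiguration interface above; the middleware object and its publish method are assumptions for illustration:

```python
import socket
import uuid

def generate_grid_node_id() -> str:
    """Grid-wide unique node ID, independent of the cluster topology:
    host name plus a random UUID."""
    return f"{socket.gethostname()}-{uuid.uuid4()}"

class NodeAgent:
    def __init__(self, middleware):
        self.node_id = generate_grid_node_id()
        self.middleware = middleware
        self.config = {}

    def join_cluster(self) -> None:
        # Topology-independent automatic configuration on startup: announce
        # the node and let the middleware assign its role/configuration.
        self.middleware.publish("join", {"node": self.node_id})

    def report_error(self, error: str) -> None:
        # Asynchronous error reporting to the middleware.
        self.middleware.publish("error", {"node": self.node_id, "error": error})

    def on_reconfigure(self, config: dict) -> None:
        # Reconfiguration information pushed back from the middleware
        # -> node reconfiguration.
        self.config = config
```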

10
Distributed Caching
  • Monitor state (health) of the cluster/grid
  • Cope with incomplete/inconsistent state
    information
  • (Re-)distribute data/files in the cluster/grid in a hierarchical, state-dependent fashion (sketched after this list)
  • Devise or adopt suitable consistency semantics
    and mechanisms
  • Attempt to utilize same mechanisms on both levels
    (cluster and grid), however with different
    trade-offs
  • Reconcile with performance-aware caching
  • Integrate into WAN data management middleware
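A sketch of the state-dependent, hierarchical lookup this implies (two levels, cluster then grid, with the WAN data management layer as origin; all interfaces below are assumptions):

```python
class HierarchicalCache:
    """Look up a file in the local cluster cache first, then in the grid-level
    replicas, finally at the origin (e.g. the WAN data management service).
    On a miss the file is replicated downwards, but only into levels whose
    state is currently reported healthy."""

    def __init__(self, cluster_cache, grid_cache, origin, healthy):
        self.cluster = cluster_cache   # dict-like: filename -> data
        self.grid = grid_cache         # dict-like: filename -> data
        self.origin = origin           # object with a fetch(filename) method
        self.healthy = healthy         # callable: level name -> bool

    def get(self, filename):
        if self.healthy("cluster") and filename in self.cluster:
            return self.cluster[filename]
        if self.healthy("grid") and filename in self.grid:
            data = self.grid[filename]
        else:
            data = self.origin.fetch(filename)   # last resort: WAN transfer
            if self.healthy("grid"):
                self.grid[filename] = data       # replicate at grid level
        if self.healthy("cluster"):
            self.cluster[filename] = data        # replicate locally
        return data
```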

11
GRID Kit
  • All additional hardware/software (drivers, etc.) to turn a COTS computer into a GRID node
  • Development of a slow control GRID standard, integrated into the middleware
  • e.g. a node running hot notifies the middleware, which reschedules applications and powers the node down for cooling (see the watchdog sketch below)
  • Support for slow control functionality
  • remote resetting of nodes
  • remote health monitoring (temperature, power, etc.)
  • remote interruption of nodes
  • remote power-cycling of nodes
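The "node running hot" example could look like this in software; a sketch under the assumption that the GRID kit exposes a temperature readout and a remote power-down call:

```python
import time

TEMP_LIMIT_C = 70.0   # illustrative threshold

def thermal_watchdog(node, middleware, slow_control, read_temperature, period_s=10):
    """Monitor one node's temperature. If it runs hot, notify the middleware
    (which reschedules the node's applications elsewhere) and power the node
    down for cooling via the slow control interface."""
    while True:
        temp = read_temperature(node)
        if temp > TEMP_LIMIT_C:
            middleware.publish("slow_control", {"node": node, "temp_c": temp})
            slow_control.power_down(node)    # remote power-cycle capability of the kit
            return
        time.sleep(period_s)
```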

12
Resources KIP Group
  • Chair of Technical Computer Science (www.ti.uni-hd.de)
  • Approved and funded ALICE L3 project
  • operates one month per year as trigger processor, the rest of the time as a general-purpose compute center
  • scale comparable to a GRID tier-1 regional center, however located at CERN close to the detector
  • fully funded
  • >2 FTEs working on the project today
  • Various small-scale test beds already existing
  • Plan to build an L3/GRID prototype at >50 node scale within the next years

13
L3 Fabric using ATOLL
14
Proposed Contribution to the GRID: Distributed Caching
  • Prof. Hermann Hellwagner
  • Institute of Information Technology
  • University of Klagenfurt, Austria
  • in conjunction with the Univ. Heidelberg / Mannheim proposal

15
Outline
General aspect to be investigated: Distributed Caching for Cluster/Grid Fault Tolerance
  • Fault Tolerance (FT): provide connectivity and access to data/files in case of failures (of nodes, regional centres, comm. links, network partitioning, ...)
  • Distributed and Hierarchical Caching: to be achieved by data/file replication; devise and implement distributed and hierarchical caching schemes
  • Cluster → Grid: develop and test caching schemes on a local basis (cluster), then generalise and apply the schemes to global scale (grid)
  • Team up with HD/MA cluster FT and (re)configuration research
  • Cooperate with the WAN Data Management work package
16
Tasks in Year 1
  • Basic design for FT-aware caching in cluster
  • Study requirements
  • Develop simple fault model
  • Design state monitoring mechanisms for cluster
  • Design basic distributed and hierarchical cache
    topology
  • Devise basic data replication/distribution mechanisms
  • Adopt basic consistency scheme (from other WPs)
  • Implement and test FT caching prototype for
    cluster
  • Static FT-aware cache design and prototype for
    cluster

17
Tasks in Years 2 and 3
  • Year 2: Refine design and prototype for cluster
  • Refine fault model, monitoring mechanisms, cache topology, data replication/distribution mechanisms, consistency scheme
  • Develop and test strategies for these issues
  • Devise redistribution mechanisms for data/files
  • Refine and thoroughly test the FT caching SW in the cluster
  • Dynamic FT-aware cache design and prototype for cluster
  • Year 3: Generalise and apply results to grid

18
Assets and Required Resources
Assets to contribute to the project:
  • Computer Science competence (e.g. distributed cluster computing)
  • Two scientists who will contribute (full CS professor and a Ph.D. student working on distributed video caching)
  • Gigabit cluster testbed (two Extreme Gbps network switches)
  • Experience in EU projects, cooperation with CERN and RAL
Resources for which funding will be requested:
  • Ph.D. position
  • Extension to the cluster (switch, workstations)
  • Access to a GRID testbed