Transcript and Presenter's Notes

Title: Building Large Scale Fabrics - A Summary


1
Building Large Scale Fabrics - A Summary
Marcel Kunze, FZK

2
Observation
  • Everybody seems to need unprecedented amounts of
    CPU, disk and network bandwidth
  • Trend towards PC-based computing fabrics and
    commodity hardware:
  • LCG (CERN), L. Robertson
  • CDF (Fermilab), M. Neubauer
  • D0 (Fermilab), I. Terekhov
  • Belle (KEK), P. Krokovny
  • Hera-B (DESY), J. Hernandez
  • LIGO, P. Shawhan
  • Virgo, D. Busculic
  • AMS, A. Klimentov
  • Considerable savings in cost w.r.t. RISC-based
    farms: "Not enough bang for the buck" (M. Neubauer)

3
AMS02 Benchmarks
Figure: execution time of the AMS standard job compared to CPU clock 1)
1) V. Choutko, A. Klimentov, AMS note 2001-11-01
4
Fabrics and Networks: Commodity Equipment
Needed for LHC at CERN in 2006:
  • Storage: raw recording rate of 0.1 - 1 GB/s,
    accumulating at 5-8 PetaBytes/year, with
    10 PetaBytes of disk
  • Processing: 200,000 of today's (2001) fastest PCs
  • Networks: 5-10 Gbps between main Grid nodes
  • Distributed computing effort to avoid congestion:
    1/3 at CERN, 2/3 elsewhere
(A back-of-envelope check of the storage rate follows below.)
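The quoted accumulation can be sanity-checked with simple arithmetic. The sketch below assumes roughly 1e7 seconds of effective data taking per year (a common accelerator rule of thumb, not a number from the slide):

```python
# Back-of-envelope check of the recording rates quoted above.
# Assumption (not from the slide): ~1e7 s of effective data taking per year.

SECONDS_PER_YEAR = 1.0e7          # effective live time, assumed
RATES_GB_PER_S = (0.1, 1.0)       # raw recording rate range from the slide

for rate in RATES_GB_PER_S:
    petabytes = rate * 1e9 * SECONDS_PER_YEAR / 1e15
    print(f"{rate:.1f} GB/s -> about {petabytes:.0f} PB/year")

# Prints 1 PB/year and 10 PB/year, bracketing the 5-8 PB/year quoted above.
```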
5
PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
6
PC Cluster 6: 3U blade servers, LP Pentium III 700 MHz, 40 CPUs (40 nodes)
7
Disk Storage
8
IDE Performance
9
Basic Questions
  • Compute farms contain several 1000s of computing
    elements
  • Storage farms contain 1000s of disk drives
  • How to build scalable systems?
  • How to build reliable systems?
  • How to operate and maintain large fabrics?
  • How to recover from errors?
  • EDG deals with the issue (P. Kunszt)
  • IBM deals with the issue (N. Zheleznykh)
  • Project Eliza: self-healing clusters
  • Several ideas and tools are already on the market

10
Storage Scalability
  • Difficult to scale up to systems of 1000s of
    components and keep a single system image
    (NFS automounter, symbolic links, etc.)
  • (M. Neubauer: the CAF ROOTD approach does not need
    this and allows direct worldwide access to
    distributed files without mounts; see the sketch
    below)
  • Scalability in size and throughput by means of
    storage virtualisation
  • Allows setting up non-TCP/IP-based systems to
    handle multi-GB/s
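As a small illustration of the ROOTD point above: ROOT can open a file served by rootd directly over the network via a root:// URL, so the worker node needs no NFS mount of the data area. This is only a minimal PyROOT sketch; the server name and file path are placeholders.

```python
# Minimal PyROOT sketch: open a remote file served by rootd via its root:// URL
# instead of relying on an NFS automounter.  Host and path are placeholders.
import ROOT

f = ROOT.TFile.Open("root://dataserver.example.org//data/run01/events.root")
if f and not f.IsZombie():
    f.ls()        # list the objects stored in the remote file
    f.Close()
```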

11
Virtualisation of Storage
(Diagram: data servers mount virtual storage as a SCSI device; an input
load-balancing switch feeds shared data access (Oracle, PROOF) over a
Storage Area Network (FCAL, InfiniBand); 200 MB/s sustained; scalability.)
12
Storage Elements (M. Gasthuber)
  • PNFS (Perfectly Normal FileSystem)
  • Stores metadata with the data
  • 8 hierarchies of file tags (example below)
  • Migration of data (hierarchical storage systems)
  • dCache
  • Development of DESY and Fermilab
  • ACLs, Kerberos, ROOT-aware
  • Web monitoring
  • Cached as well as direct tape access
  • Fail-safe
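A small illustration of the PNFS file-tag idea: in an NFS-mounted /pnfs tree, per-directory metadata is exposed through "magic" file names such as ".(tags)()" and ".(tag)(NAME)". The sketch below assumes such a mount; the directory path is a placeholder.

```python
# Sketch of reading PNFS per-directory metadata ("file tags") through the
# magic file names exposed in an NFS-mounted /pnfs directory.
# The directory path is a placeholder for a real PNFS namespace entry.
import os

pnfs_dir = "/pnfs/example.org/data/experiment"        # placeholder path

# ".(tags)()" lists the tag files defined for this directory,
# one entry per line, e.g. ".(tag)(OSMTemplate)".
with open(os.path.join(pnfs_dir, ".(tags)()")) as listing:
    for line in listing:
        tag_file = line.strip()
        with open(os.path.join(pnfs_dir, tag_file)) as tag:
            print(tag_file, "->", tag.read().strip())
```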

13
Necessary admin tools (A. Manabe)
  • System (SW) installation/update:
    Dolly (image cloning)
  • Configuration:
    Arusha (http://ark.sourceforge.net),
    LCFGng (http://www.lcfg.org)
  • Status monitoring / system health check:
    CPU/memory/disk/network utilization -
    Ganglia 1), Palantir 2)
  • (Sub-)system service sanity check:
    PIKT 3), Pica 4), cfengine
  • Command execution:
    WANI (web-based remote command executor)
  1) http://ganglia.sourceforge.net
  2) http://www.netsonde.com
  3) http://pikt.org
  4) http://pica.sourceforge.net/wtf.html

14
(Screenshot: WANI is implemented on the Webmin GUI - start page with command
input and node selection.)
15
(Screenshot: command execution result - results from 200 nodes on one page,
listed by host name; a minimal sketch of this fan-out follows below.)
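WANI itself is a Webmin-based web GUI; the snippet below is only a rough stand-in for the same idea, fanning one command out to many nodes over ssh and printing one result line per host. The host names and the ssh transport are assumptions, not part of WANI.

```python
# Rough sketch of a WANI-style "run one command on many nodes" tool,
# implemented here with plain ssh and a thread pool.  Node names are
# placeholders; the real WANI is a Webmin-based web GUI, not this script.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:03d}" for i in range(1, 201)]   # placeholder host names
COMMAND = "uptime"

def run(node: str) -> str:
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, COMMAND],
        capture_output=True, text=True, timeout=30,
    )
    status = "OK" if result.returncode == 0 else f"FAIL({result.returncode})"
    return f"{node:10s} {status:8s} {result.stdout.strip()}"

# Fan the command out and collect one line of output per host,
# roughly the "results from 200 nodes on one page" view above.
with ThreadPoolExecutor(max_workers=32) as pool:
    for line in pool.map(run, NODES):
        print(line)
```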
17
CPU Scalability
  • The current tools scale up to O(1000) CPUs
    (in the previous example, 10,000 CPUs would
    require checking 50 pages)
  • Autonomous operation is required
  • Intelligent, self-healing clusters (sketch below)
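As a toy sketch of what "self-healing" could mean at the node level (this is not Project Eliza or any specific tool named in the talk): a local watchdog that restarts a failed daemon instead of waiting for an operator to scan monitoring pages. The service name and restart command are placeholders.

```python
# Illustrative node-local watchdog: check a daemon and restart it if it died.
# The service name and restart hook are placeholders, not a real tool's API.
import subprocess
import time

SERVICE_CHECK   = ["pgrep", "-x", "rootd"]          # is the daemon running?
SERVICE_RESTART = ["/etc/init.d/rootd", "start"]    # placeholder restart hook
CHECK_INTERVAL  = 60                                # seconds between checks

while True:
    if subprocess.run(SERVICE_CHECK, capture_output=True).returncode != 0:
        print("rootd not running, attempting restart")
        subprocess.run(SERVICE_RESTART)
    time.sleep(CHECK_INTERVAL)
```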

18
Resource Scheduling
  • Problem: how to access local resources from the
    Grid?
  • Local batch queues vs. global batch queues
    (toy illustration below)
  • Extension of Dynamite (University of Amsterdam) to
    work with Globus: Dynamite-G (I. Shoshmina)
  • Open question: how do we deal with interactive
    applications on the Grid?
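A toy illustration of the local-versus-global queue question (this is not Dynamite-G and not the Globus API): a small gateway that takes an abstract job request and maps it onto a local batch queue. The queue names and the qsub submission command are placeholders standing in for whatever the local scheduler provides.

```python
# Toy gateway: map an abstract (grid-level) job request onto a local batch
# queue.  Queue names, limits and the "qsub" command are placeholders.
import subprocess

LOCAL_QUEUES = {"short": 3600, "long": 86400}       # name -> wallclock limit (s)

def submit(script: str, wallclock: int) -> None:
    # pick the cheapest local queue whose limit still covers the request
    queue = min((q for q, limit in LOCAL_QUEUES.items() if limit >= wallclock),
                key=LOCAL_QUEUES.get)
    subprocess.run(["qsub", "-q", queue, script], check=True)

submit("analysis_job.sh", wallclock=7200)           # lands in the "long" queue
```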

19
Conclusions
  • A lot of tools already exist
  • A lot of work still needs to be done in the fabric
    area to get reliable, scalable systems