Title: SimMillennium and Beyond: From Computer Systems, Computational Science and Engineering in the Large to Petabyte Stores

Slide 1: SimMillennium and Beyond
From Computer Systems, Computational Science and Engineering in the Large to petabyte stores
- David Culler
- NSF Site Visit
- March 5, 2003
Slide 2: SimMillennium Project Goals
- Vision: to work, think, and study in a computationally rich environment with deep information stores and powerful services
- Enable major advances in Computational Science and Engineering
  - Simulation, Modeling, and Information Processing becoming ubiquitous
- Explore novel design techniques for large, complex systems
  - Fundamental Computer Science problems ahead are problems of scale
  - Organized in concert with University structure -> computational economy
- Develop fundamentally better ways of assimilating and interacting with large volumes of information and with each other
- Explore emerging technologies
  - networking, OS, devices
Slide 3: Research Infrastructure We Built
- Cluster of Clusters (CLUMPS) distributed over multiple departments
  - gigabit Ethernet within and between clusters
  - Myrinet high-speed interconnect
- Vineyard Cluster System Architecture
  - Rootstock: remote cluster installation tools
  - Ganglia: remote cluster monitoring
  - GEXEC: remote execution; GM (Myricom) messaging; MPI
  - PCP: parallel file tools
  - a collection of port daemons and tools to make it all hang together
- Gigabit to desktop, ImmersaDesk, ...
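To give a flavor of the Ganglia monitoring layer listed above: a gmond daemon publishes the cluster's state as one XML dump over TCP (port 8649 is the default, and `load_one` is a standard metric). The glue code below is a minimal sketch, not the project's tooling; the host name and helper names are illustrative.

```python
import socket
import xml.etree.ElementTree as ET

def parse_loads(xml_bytes):
    """Extract each host's one-minute load average from a gmond XML dump."""
    root = ET.fromstring(xml_bytes)
    return {h.get("NAME"): float(m.get("VAL"))
            for h in root.iter("HOST")
            for m in h.iter("METRIC")
            if m.get("NAME") == "load_one"}

def ganglia_snapshot(host="localhost", port=8649):
    """Connect to gmond, read the full XML dump, and return host loads."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as s:
        while True:  # gmond writes the whole dump, then closes the connection
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return parse_loads(b"".join(chunks))
```

A monitoring page like the one on slide 6 only needs to poll one gmond per cluster, since every daemon gossips and caches the state of its peers.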
Slide 5: Cluster Counts
- Millennium Central Cluster
  - 99 Dell 2300/6400/6450 Xeon duals/quads: 336 processors
  - total 238 GB memory, 2 TB disk
  - Myrinet 2000 and 1000 Mb fiber Ethernet
- Millennium Campus Clusters (Astro, Math, CE, EE, Physics, Bio)
  - 176 proc, 34 GB mem, 1.2 TB local disk
  - total 512 proc, 292 GB mem, 3.2 TB scratch
- NPACI ROCKS Cluster
  - 8 proc, 2 GB mem, 36 GB disk
- OceanStore/ROC cluster
- PlanetLab Cluster
  - 6 proc at 1.32 GHz, 3 GB mem, 180 GB disk
- CITRIS Cluster 1, 3/2002 deployment (Intel donation)
  - 4 Dell Precision 730 Itanium duals: 8 processors
  - total 8 GB memory, 128 GB disk
  - Myrinet 2000 and 1000 Mb copper Ethernet (SimMillennium)
- CITRIS Cluster 2 deployment (Intel donation)
  - 128 Dell McKinley-class duals: 256 processors
  - 16x2 installed
Slide 6: Cluster Top Users, 2/2003
http://ganglia.millennium.berkeley.edu
- 800 users total on the central cluster
- 84 major users for 2/2003; average 62% total CPU utilization
- ROC: middle-tier storage layer testing/performance (bling, ach, fox@stanford)
- Computer Vision Group: image recognition, boundary detection and segmentation, data mining (aberg, lwalk, dmartin, ryanw, xren); 2 hours on the cluster vs. 2 weeks on local resources
- Computational Biology Lab: large-scale biological sequence database searches in parallel (brenner@compbio)
- Tempest: TCAD tools for Next Generation Lithography (yunfei)
- Internet services: performance characteristics of multithreaded servers (jrvb, jcondit)
- Sensor networks: power reduction (vwen)
- Economic modeling (stanton@haas)
- Machine learning: information retrieval, text processing (blei)
- Analyzing trends in BGP routing tables (sagarwal, mccaesar)
- Graphics: optical simulation and high-quality rendering (adamb, csh)
- Digital Library Project: image retrieval by image content (loretta)
- Bottleneck analysis of fine-grain parallelism (bfields)
- SPUR: earthquake simulation (jspark@ce)
- Titanium: compiler and runtime system design for high-performance parallel programming languages (bonachea)
- AMANDA: neutrino detection from polar ice core samples (amanda)
Slide 7: Impact
- Numerous groups doing research they could not have done without it
  - Malik: photorealistic rendering, physics simulation, ...
  - Yelick: Titanium, heart modeling, ...
  - Wilensky: Digital Library, image segmentation
  - Brewer, Culler: Ninja Internet service architecture, ...
  - Price: AMANDA, ...
  - Kubiatowicz: OceanStore; Katz: Sahara; Hellerstein: PIER
- First eScience portals
  - Tempest EUV lithography and Sugar MEMS simulation services
- safe.millennium.berkeley.edu on Sept. 11
  - built within hours, scaled to a million hits per day
- CS267: core of the MS of computational science X
- Cluster tools widely adopted
  - NPACI ROCKS
  - Ganglia: the most downloaded cluster tool, in all the distributions (OSCAR); open-source development team
Slide 8: Computational Economy
- Developed economics-based resource allocation
  - decentralized design
  - interactive and batch
- Advanced the state of the art
  - controlled experiments with priced and unpriced clusters
  - analysis of utility gain relative to traditional resource-allocation algorithms
- Picked up in several other areas
  - index pricing of internet bandwidth
  - iceberg pricing in the telco/internet merge
  - core to internet design for planetary-scale services
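The slide does not spell out the pricing mechanism, so as one concrete flavor of economics-based allocation, here is a proportional-share sketch: users spend from budgets and receive cluster share in proportion to their bids. The scheme and all names are illustrative, not the project's actual design.

```python
def allocate(bids, capacity):
    """Split cluster capacity among users in proportion to their bids."""
    total = sum(bids.values())
    if total == 0:
        return {user: 0.0 for user in bids}
    return {user: capacity * bid / total for user, bid in bids.items()}

def clearing_price(bids, capacity):
    """Effective price per CPU unit: total spending over total capacity."""
    return sum(bids.values()) / capacity
```

For example, `allocate({"interactive": 30, "batch": 10}, 100)` gives the interactive workload 75 CPUs. Raising any bid raises the clearing price for everyone, which is the feedback that distinguishes a priced cluster from the unpriced one in the controlled experiments above.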
Slide 9: Emergence of Planetary-Scale Services
- In the past year Millennium became THE simulation engine for P2P
  - OceanStore, I3, Sahara, BGP alternatives, PIER
- Ganglia was the technical enabler for PlanetLab
  - > 100 machines at > 50 sites in > 8 countries
  - THE testbed for internet-scale systems research
Slide 10: Fundamental Bottleneck: Storage
- Current storage hierarchy
  - based on the NPACI reference configuration
  - 3 TB local /scratch and /net/MMxx/scratch, 4-day deletion
  - 0.5 TB global NFS /work, 9-day deletion
    - inadequate BW and capacity
  - 4 TB /home and /project
    - uniform naming through automount
    - doesn't scale to cluster access
- -> augment capacity, BW, and metadata BW
- we've been tracking cluster storage options since xFS on NOW and Tertiary Disk in 1995
Slide 11: Another Cluster: a Storage Cluster
[Diagram: the Millennium and CITRIS clusters connect through a scalable GigE core and a Myrinet SAN to massive storage clusters]
- designed for higher reliability
- avoids competition from ongoing computation
- local disks remain heavily used as scratch
Slide 12: Initial Cluster Design with 3.5 TB Distributed File Store
[Diagram: 2 frontend nodes and 128 dual Itanium 2 compute nodes (1 TFlop, 1.6 TB memory) are linked by Myrinet 2000 and Gigabit Ethernet through Foundry 8000/1500 switches to the campus core, and to 4 storage controllers and 2 metaservers backed by 3.5 TB of Fibre Channel storage]
Slide 13: Initial 3.5 TB Cluster Data Store
[Diagram: 2 metaservers and 4 storage controllers, each controller fronting 864 GB of storage; a BlueArc si8300 with 24 36 GB 15K-rpm disks and growth room; interconnected by Fibre Channel, gigabit Ethernet, and Myrinet]
Slide 14: Lustre: A High-Performance, Scalable, Distributed File System for Clusters and Shared-Data Environments
- Progress since xFS
  - TruCluster, GPFS, PVFS, ...
  - need production quality
  - NAS is finally here
- History: CMU, Seagate, Los Alamos, Sandia, Tri-Labs
- Distributed filesystem replacing NFS
- Object-based file storage
  - an object, like an inode, represents a file
- Open-source development managed by Cluster File Systems, Inc.
- Gaining wide acceptance for production high-performance computing
  - PNNL and LLNL
  - Los Alamos and Sandia Labs
  - HP support as part of its Linux cluster effort
  - Intel Enterprise Architecture Lab
Slide 15: Lustre Key Advantages
- Open protocols and standards: Portals API, XML, LDAP
- Runs on commodity PC hardware, with 3rd-party OSTs
  - such as BlueArc
- Uses commodity filesystems on the OSTs
  - such as ext3, JFS, ReiserFS, and XFS
- Scalable and efficient design split:
  - (qty 2) metadata servers storing file system metadata
  - (up to 100) object storage targets storing files
  - to support up to 2000 clients
- Flexible model for adding new storage to an existing Lustre file system
- Metadata server failover
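To illustrate the metadata/object split above: once a client has a file's layout from a metadata server, striping the file's bytes across object storage targets is simple arithmetic. The sketch below shows RAID-0-style round-robin striping; the stripe size and OST count are assumed here, while in Lustre the layout is recorded per file by the metadata servers.

```python
def locate(offset, stripe_size=1 << 20, n_osts=4):
    """Map a logical byte offset to (OST index, offset in that OST's object)."""
    stripe = offset // stripe_size       # which fixed-size stripe holds the byte
    ost = stripe % n_osts                # stripes round-robin across OSTs
    within = offset % stripe_size        # position inside the stripe
    obj_offset = (stripe // n_osts) * stripe_size + within
    return ost, obj_offset
```

With 1 MB stripes over 4 OSTs, the first megabyte lands on OST 0, the next on OST 1, and so on, so a large sequential read fans out across all four targets in parallel; this is why file I/O scales with the number of OSTs while the metadata servers stay out of the data path.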
Slide 16: Lustre Functionality
[Diagram: clients go to the metaservers (metadata servers) for recovery, file status, file creation, directory metadata, and concurrency, and to the storage controllers (object storage targets) for system and parallel file I/O and file locking]
Slide 17: Growth Plan
- based on a conservative 50% per year density growth
- expect capacity to roughly double each year
  - y03: 3.5 TB, 4 SS, 2 MS
  - y04: 8 TB, 6 SS, 3 MS
  - y05: 14 TB, 8 SS, 3 MS
  - y06: 23 TB, 8 SS, 3 MS
  - y07: 35 TB, 8 SS, 3 MS
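As a back-of-envelope check, the planned figures are consistent with 50% annual density growth combined with the growth in storage servers; the per-server baseline below is derived from the y03 row, and everything else is arithmetic.

```python
# (year, planned TB, storage servers) from the growth plan above
plan = [("y03", 3.5, 4), ("y04", 8, 6), ("y05", 14, 8),
        ("y06", 23, 8), ("y07", 35, 8)]
per_server_y03 = 3.5 / 4          # TB per storage server at the y03 baseline

for n, (year, planned_tb, servers) in enumerate(plan):
    # density compounds 50%/year; capacity scales with server count too
    projected = per_server_y03 * 1.5 ** n * servers
    print(f"{year}: {projected:5.1f} TB projected vs {planned_tb} TB planned")
```

The projections land within roughly 13% of every planned figure, with the growth in early years coming mostly from added servers and the later years from density alone.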
Slide 18: Example Projects
- Cluster monitoring trace
  - ¼ TB per year for 300 nodes
- ROC failure data
  - ¼ TB per year; much higher if we get industrial feeds
- Digital Library
- Video
  - 100 GB/hour uncompressed
- Vision
  - 100 GB per experiment
- PlanetLab
  - internet-wide instrumentation and logging

"We will look back and say: we are doing research today that we could not have done without this."
Slide 19: End of the Tape Era
Slide 20: Emergence of the Sensor Net Era
- 100s of research groups and companies using the Berkeley Mote / TinyOS platform
  - dozens of projects on campus
- billions of networked devices connected to the physical world, constantly streaming data
- -> start building the storage and processing infrastructure for this new class of system today!
Slide 21: Environment Monitoring Experience
- Canonical patch-net architecture
- live and historical readings: www.greatduckisland.net
- 43 nodes, 7/13-11/18
  - above and below ground
- light, temperature, relative humidity, and occupancy data at 1-minute resolution
- > 1 million measurements
  - best nodes: 90,000
- 3 major maintenance events
- node design and packaging in a harsh environment
  - -20 to 100 degrees, rain, wind
- power management and its interplay with sensors and the environment
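The deployment numbers above hang together on a quick consistency check; the only assumption here is that the 7/13-11/18 window falls in 2002, the field season before this talk.

```python
from datetime import date

days = (date(2002, 11, 18) - date(2002, 7, 13)).days   # 128 days deployed
slots = days * 24 * 60            # one-minute sample slots per node: 184320
print(slots)
print(90_000 / slots)             # best nodes captured roughly half their slots
print(1_000_000 / 43)             # ~23k readings per node on average
```

So the best nodes sampled for about half the possible minutes, and the fleet-wide average of ~23,000 readings per node reflects the node losses and maintenance events noted above.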
Slide 22: Sample Results
[Figures: node lifetime and utility; effective communication phase; packet loss correlation]