Title: les robertson cernit0899 1
1The Data Storage Challenge for LHC
- CERN School of Computing
- Stare Jablonki - September 1999
- Les Robertson
- CERN - IT Division
- les.robertson_at_cern.ch
2Part I - The technology
- today's workhorses
- magnetic hard disk
- magneto-optics
- magnetic tape systems
- optical disks
- exotic storage technologies
- holography
- atomic force microscopy
- robotics for handling mass storage
3disk storage
- state of the art
- technology limits
- the super-paramagnetic problem
- heads
- access performance and caches
- magneto optics: OAW, Terastor
Units: very small sizes are expressed in micrometres, denoted µm; almost everything else is in
- inches - in
- square inches - in²
- feet - 1 foot = 12 inches
- Gigabit = 10^9 bits - Gb
- Gigabit per square inch - Gb/in²
- Gigabyte = 10^9 bytes - GB
4disk storage - state of the art
- platters
- sputtered magnetic and protective layers
- protective layer has textured landing area for the head - to avoid stiction on take-off
- head flies at around 50 nanometres
- current product - 3-4 Gb/in²
- lab demonstrations - > 20 Gb/in²
5super-paramagnetic limit
- bit size - decreases in proportion as the areal density increases (width × length)
- 1 Gb/in²: 3.5 µm × 0.18 µm
- 10 Gb/in²: 1 µm × 0.06 µm
- 40 Gb/in²: 0.5 µm × 0.03 µm
- 80 Gb/in²: 0.4 µm × 0.02 µm
- fewer particles in a bit
- smaller separation between bits
- increased tendency for domains spontaneously to change polarisation
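The bit dimensions quoted on this slide can be cross-checked against the areal density with a few lines of arithmetic. The sketch below is only a consistency check: the function name and the table layout are illustrative, and the only physical constant used is 1 inch = 25,400 µm.

```python
MICRONS_PER_INCH = 25_400.0

def bit_area_um2(gbit_per_in2: float) -> float:
    """Area available to one bit, in square micrometres, at a given areal density."""
    bits_per_in2 = gbit_per_in2 * 1e9
    return MICRONS_PER_INCH ** 2 / bits_per_in2

# (width, length) in micrometres, as quoted on the slide
SLIDE_BIT_CELLS = {1: (3.5, 0.18), 10: (1.0, 0.06), 40: (0.5, 0.03), 80: (0.4, 0.02)}

for density, (width, length) in SLIDE_BIT_CELLS.items():
    print(f"{density:>3} Gb/in2: slide {width * length:.4f} um2, "
          f"computed {bit_area_um2(density):.4f} um2")
```

The quoted width × length products agree with the computed cell area to within a few percent, which is what one would expect from rounded slide figures.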
6super-paramagnetic limit
- super-paramagnetic limit
- point where the fluctuations in thermodynamic energy at operating temperatures have a moderate probability of causing magnetic state changes
- in current disks, the magnetic energy barrier is about 40 times the thermodynamic range
- it is expected that new materials and recording techniques will push the barrier to at least 100 Gb/in²
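To see why a barrier of about 40 times the thermal energy matters, one can use the textbook Néel-Arrhenius relation, in which the mean time before a spontaneous flip grows exponentially with the ratio of barrier to thermal energy. This is a rough sketch, not taken from the slides: the attempt time of 1 ns is an assumed order-of-magnitude value, and the result describes a single isolated grain rather than a real recorded bit.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def relaxation_time_s(barrier_over_kt: float, attempt_time_s: float = 1e-9) -> float:
    """Neel-Arrhenius estimate: tau = tau0 * exp(E_barrier / kT).

    attempt_time_s (tau0) is an assumed value, not a figure from the slides.
    """
    return attempt_time_s * math.exp(barrier_over_kt)

for ratio in (25, 40, 60):
    years = relaxation_time_s(ratio) / SECONDS_PER_YEAR
    print(f"barrier = {ratio} kT -> about {years:.3g} years")
```

The point of the sketch is the steepness of the exponential: a modest reduction in the ratio, as bits shrink, takes the estimated lifetime from years to seconds.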
7heads
- inductive read heads
- signal current varies as rate of flux change
- MR read heads
- NiFe conductor -- resistance changes with flux strength
- independent of velocity
- signal strength proportional to sense current
- increased sensitivity in high density, high bandwidth recording
- a transverse bias field is applied to discriminate between positive and negative recording polarisations
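[Figure: inductive read head and magneto-resistive read head. For the MR head: $\Delta R \propto H$, $\Delta V \propto I\,\Delta R$; the sense current is labelled $I_{sense}$.]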
8inductive write head, MR read head
picture IBM Research - Almaden
9Giant Magneto-Resistive Effect - the Spin Valve
- Giant Magneto-Resistive
- Multi-layer head
- magneto-resistive layer (NiFe)
- conducting layer (e.g. Ag, Cu)
- pinned layer (e.g. Co) - fixed magnetic orientation
- exchange layer - ferro-magnetic material which maintains the pinned layer orientation
- GMR exploits the different behaviour of conduction electrons with spin parallel to or opposed to the magnetic orientation of the MR and pinned layers - hence the term Spin Valve
[Figure: GMR layers - exchange layer (magnetised), pinned layer (Co), conducting layer (Cu), MR layer (NiFe), with the sense current]
10Spin Valve
picture IBM Research - Almaden
11merged head
12Seagate Cheetah 36
- 36 GB capacity
- half height, 3.5"
- 12 platters, 24 heads
- 5.7 ms average seek
- 10,000 rpm - 2.99 ms latency
- 1 MB cache
- 18-28 MB/sec internal transfer rate
photo - Seagate Technology, Inc.
13Data transfer speed
- Data transfer speed increases with
- the linear density (the square root of the areal density - i.e. about 26% per year)
- the rotation speed - which has only increased by about 50% in the past 5-6 years
- The actual data transfer speed is faster on outer tracks than on inner tracks - so be careful when reading specifications to discriminate between average and maximum transfer speed.
[Chart: projected data transfer speed, normalised to 1999 = 1; assumes recent evolution is maintained - 60% per year increase in areal density, rotational speed increasing 50% in 5 years]
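The growth assumptions behind this projection can be written out as a short calculation. This is a sketch under the slide's stated assumptions (areal density up 60% per year, rotation speed up 50% in 5 years); spreading the rotational gain evenly over the five years is an additional simplification, and the function names are illustrative.

```python
import math

def yearly_transfer_growth(areal_growth: float = 0.60, rpm_growth_5y: float = 0.50) -> float:
    """Yearly multiplier for transfer speed = linear-density gain x rotation-speed gain."""
    linear_density_gain = math.sqrt(1.0 + areal_growth)   # about 1.26, i.e. 26% per year
    rotation_gain = (1.0 + rpm_growth_5y) ** (1.0 / 5.0)   # assumed even compounding
    return linear_density_gain * rotation_gain

def relative_speed(year: int, base_year: int = 1999) -> float:
    """Transfer speed relative to the base year (base year = 1)."""
    return yearly_transfer_growth() ** (year - base_year)

for year in (1999, 2002, 2005):
    print(year, round(relative_speed(year), 2))
```

Under these assumptions transfer speed grows by roughly 37% per year, much more slowly than capacity, which is why the ratio of bandwidth to stored data keeps falling.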
14The importance of the cache
- Access time depends on
- the seek time - which has hardly improved by 50% in ten years
- the latency - half a turn of the platter
- Without a cache, this would lead to very unimpressive performance for small transfer sizes
- The cache helps to get back to the nominal data transfer rate - no more than that!
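The effect of seek time and latency on small transfers is easy to quantify. The sketch below uses the Cheetah 36 figures quoted earlier (5.7 ms average seek, 10,000 rpm, 18 MB/sec at the low end of the internal transfer range) and assumes every transfer pays one full seek plus half a rotation, with no cache; the function names are illustrative.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average latency = half a turn of the platter."""
    return 0.5 * 60_000.0 / rpm

def effective_rate_mb_s(transfer_kb: float, seek_ms: float = 5.7,
                        rpm: float = 10_000, media_rate_mb_s: float = 18.0) -> float:
    """Throughput seen by the application for one uncached random transfer."""
    access_s = (seek_ms + rotational_latency_ms(rpm)) / 1000.0
    size_mb = transfer_kb / 1000.0          # decimal units, as in the slides
    transfer_s = size_mb / media_rate_mb_s
    return size_mb / (access_s + transfer_s)

for size_kb in (8, 64, 1_000, 100_000):
    print(f"{size_kb:>7} KB -> {effective_rate_mb_s(size_kb):6.2f} MB/sec")
```

An 8 KB random read sees well under 1 MB/sec, while a 100 MB sequential transfer approaches the nominal media rate. A cache can hide the access time for suitable access patterns, but it cannot exceed the media rate.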
15Future possibilities
- continuing developments of GMR - with the formidable research capability of IBM
- current interest in the use of rare-earth/transition metal composites, evolved for MO recording
- low Curie point
- stable magnetisation at normal operating temperatures
- stable magnetic domains demonstrated at a density of 250 Gb/in²
- Longer term --
- holography
- atomic force microscopy
- ...
16Optically Assisted Winchester (OAW)
- Developed by a Seagate subsidiary - Quinta
- Magnetic layer uses a composition of rare earth transition metals
- Write
- laser heats material beyond Curie point
- induction coil changes magnetic orientation
- magnetisation stable at normal temperatures
- Read
- rotation of polarisation of reflected light (Kerr effect)
- Technology
- laser delivery fibres
- micro mirror (head of a pin)
- micro-optics
- Potential 100 Gb/in² ?
- limited by the resolution of the optics
17The Solid Immersion Lens - Near Field Recording - Terastor Corporation
- Solid Immersion Lens
- laser is focussed internally in a material with a very high refractive index
- with a red laser can get the spot diameter down to 0.2 µm (the bit width for 160 Gb/in²)
[Equation: spot diameter in terms of λ, n and na] where λ is the wavelength, n the refractive index and na the numerical aperture
18SIL NFR
- Near field recording
- principle of the scanning near-field optical microscope
- the oscillating dipoles of the radiating surface produce an evanescent field which decays in about one wavelength
- ... but activate other dipoles within this range
19Developments in Magneto Optics
- The recorded area of the disk cannot be narrower than the spot (or at least the high temperature area of the spot)
- But when recording, spots can be overlapped to increase linear density
- This is not possible on conventional MO disk, which has a thick transparent substrate over the recording layer, which requires a high field coil, with a high inductance and so low modulation frequency
- Surface recording reduces the separation of the head and recording layer, making crescent recording possible, and also enabling the use of high numerical aperture lenses - producing smaller spots
- But it is a challenge for the designer of removable media
[Figure label: disk rotation]
20Magnetic Super Resolution - MSR
- Easy to see how the crescents are recorded, but how are they read back?
- Three layers
- 1) recording layer
- 2) intermediate masking layer - temperature-sensitive magnetic orientation
- low temperature - parallel to plane
- intermediate temperature - perpendicular
- high temperature - loses orientation
- couples the recording layer to the read-out layer only at intermediate temperatures
- 3) read-out layer - magnetised (erased) during read-out
21magnetic tape
- why use magnetic tapes?
- basics
- linear
- helical scan
- state of the art drive - the StorageTek 9840
- current trends
22Why use magnetic tape?
- Why use a sequential access medium with a history of relatively poor reliability?
- historically the answer has been --
- cost - 10-100 times cheaper per Byte than disk
- volumetric storage density
- removable, transportable medium
- backup
- archive
- data exchange
- robotic storage - automated access to enormous amounts of data
- but there is considerable competition from
- hard disks - cost, storage density
- optical storage - archive longevity, data exchange
23Volumetric Storage Density
[Chart: Storage Capacity and Density - native cartridge capacity (GB) and volumetric density (TB/m³) by device type: IBM 3590, STK 9840, STK Redwood, DVD-RAM (2-side), Quantum DLT 8000, LTO Ultrium (future), Seagate Cheetah (3.5" disk)]
Assumes shelf storage of
-- raw tape, DVD cartridge
-- disk without enclosure, power supply, fan
-- no compression on tape
24basic characteristics
- medium
- flexible substrate - 10 µm thick polyethylene PET/PEN
- recording layer - 0.1-0.2 µm
- Metal Particle
- Metal Evaporated
- stored in cartridge (1 reel) or cassette (2 reel)
- tape extracted and loaded on drive
- recording technology - spin-off from magnetic disk developments
- MR, GMR heads
- track following servo systems
- media
25sequential access
- basically a sequential medium
- no delete/update
- new data written at end
- open - and read from start of file
- usually a directory at the beginning of the tape
- so open(file) can use servo information for a fast skip to the start of the data
26logical data format
- The tape is organised logically as a set of files, separated by labels and tape marks.
- In early drives, the drive could seek rapidly to the next tape mark, which was recorded with a very special pattern
- Modern drives use a directory and information on servo tracks to seek to the logical tape mark
[Figure: logical tape layout - volume labels, then for each file its file labels and file data separated by tape marks, ending with end of volume]
27physical data format
- The data is recorded in blocks, each with a cyclic redundancy check (CRC) to detect errors
- The logical block is recorded in a series of physical blocks, spread across the parallel recording channels
- each channel corresponds to a set of physical head elements
- Substantial recording capacity is reserved for error correction data
- The 4-channel DLT format is shown - newer tape systems have even more complex patterns to support recovery from more severe tape damage
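The role of the per-block CRC can be illustrated with a minimal sketch. It uses the CRC-32 from Python's standard `zlib` module purely as a stand-in; the actual polynomial and block layout used by DLT or other drives are not given in the slides, and `write_block` / `read_block` are hypothetical helper names. A CRC only detects damage; repairing it is the job of the separate error correction data.

```python
import zlib

def write_block(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (illustrative block layout)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")

def read_block(block: bytes) -> bytes:
    """Return the payload, or raise if the stored CRC does not match."""
    payload, stored = block[:-4], int.from_bytes(block[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: block is damaged")
    return payload

block = write_block(b"event data")
print(read_block(block))

damaged = bytes([block[0] ^ 0x01]) + block[1:]   # flip one bit
try:
    read_block(damaged)
except ValueError as err:
    print(err)
```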
28linear recording
- linear recording
- tape passes over fixed head
- multiple track read write
- serpentine dual-directional recording
- head unit
- dual-directional
- low head-medium contact pressure
- multi-channel head array
29linear recording
- media issues
- tape roughness, head contact, surface wear, dust
- tape path complexity, tension -> distortion
- lateral expansion/contraction with environmental changes
- reel sag in long term storage
[Figure: head array over the tape - tape has expanded laterally since it was recorded]
30helical scan
- developed for entertainment business
- high end market in broadcasting
- mass market in domestic VCR
- tape moves slowly past rapidly spinning head on
scanner
31helical scan
- head wear problems due to tape contact pressure
- helical path controlled using tape edge - requires very accurate slitting in manufacture
- edge damage, tape warp cause track curving
- linear tapes reserve a guard band at the edges
- historically helical scan has had a higher track density than linear
- 2800 tracks per inch helical
- 700-800 tracks per inch linear
- but linear tape is improving track density with MR heads, track following technology
32data compression
- an advantage of sequential access over random access disks is that the device can implement data compression
- digital Lempel-Ziv 1 algorithm
- replaces variable length phrases with code words
- enhanced LZ 1 algorithm (e.g. StorageTek 9840) can give up to four times compression on commercial data, 2 times on pre-compressed physics data
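The idea of replacing variable length phrases with code words can be shown with a toy LZ77-style coder (LZ1 is the family usually called LZ77). This is a teaching sketch only, not the enhanced algorithm implemented in the 9840: the window size, maximum match length and token format are arbitrary illustrative choices.

```python
def lz77_encode(data: bytes, window: int = 255, max_len: int = 15):
    """Encode data as (offset, length, next_byte) tokens.

    offset/length point back at an earlier phrase; next_byte is the literal
    that follows it. offset = 0 and length = 0 mean "no earlier phrase found".
    """
    tokens = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # keep one byte in reserve so that next_byte always exists
            while (length < max_len and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens

def lz77_decode(tokens) -> bytes:
    """Rebuild the original bytes by copying earlier phrases and literals."""
    out = bytearray()
    for offset, length, next_byte in tokens:
        for _ in range(length):
            out.append(out[-offset])   # byte-by-byte so overlapping copies work
        out.append(next_byte)
    return bytes(out)

sample = b"abcabcabcabcabcabc"
tokens = lz77_encode(sample)
print(len(sample), "bytes ->", len(tokens), "tokens")
```

Repetitive data collapses into a handful of tokens, whereas data that has already been compressed contains few repeated phrases, which is consistent with the lower ratio quoted for pre-compressed physics data.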
33the recording channel
write channel
read channel
channel complexity can increase with improved
ASIC technology
349840 Mechanism
[Figure labels: Head, Coupling, Reel Motor, Operator Panel]
Head: 23 patents pending, 1 patent issued
Mechanism: 10 patents pending, 1 patent issued
359840
- 1/2" tape in IBM 3480 form factor
- MP on PEN medium
- 288 tracks
- 16 parallel heads (× 18 stripes)
- 2 metres/second past head
- 10 MB/sec data rate
- cassette (2 reel) with tape unloaded at mid point
- tape path entirely in cassette
- 4 sec load
- 900 feet of tape ( 274 metres )
- 8 sec average search
- 16 sec max rewind
- 20 Gbytes user data (uncompressed)
- LZ-1 enhanced compression
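The 9840 figures can be tied together with some simple arithmetic. The sketch below assumes that all 288 tracks are written in serpentine passes of 16 tracks each and that the tape runs at a constant 2 metres/second; both are simplifications, and the function names are illustrative.

```python
FEET_TO_METRES = 0.3048

def serpentine_passes(tracks: int = 288, heads: int = 16) -> int:
    """Number of end-to-end passes needed to cover every track."""
    return tracks // heads

def time_by_tape_motion_s(length_feet: float = 900.0, speed_m_s: float = 2.0,
                          passes: int = 18) -> float:
    """Time to move the full tape length past the head once per pass."""
    return passes * length_feet * FEET_TO_METRES / speed_m_s

def time_by_data_rate_s(capacity_gb: float = 20.0, rate_mb_s: float = 10.0) -> float:
    """Time to stream the full user capacity at the quoted data rate."""
    return capacity_gb * 1000.0 / rate_mb_s

print("passes:", serpentine_passes())
print("by tape motion: %.0f s" % time_by_tape_motion_s())
print("by data rate:   %.0f s" % time_by_data_rate_s())
```

The two estimates (roughly 41 and 33 minutes) agree only approximately; the slides do not give enough detail to reconcile them, but both show that reading a whole cartridge takes a large fraction of an hour.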
36Cartridge
Cartridge: 6 patents pending, 1 patent issued
38current trends
- Many new drives
- Several aggressive road maps
- Major application is backup
- Expect strong competition at the low end from optical
[Chart label: scheduled for 2000]
39Optical Recording
- The historical advantage of optical over magnetic technology was the potential recording density
- Red laser -- spot size 0.4 µm diameter - 5 Gbits/inch²
- Many high end products - but never gave real competition to magnetic products
- performance, cost
- niche market for write-once applications
- magnetic disk has now reached or exceeded optical recording densities
- BUT for the first time we see real competition from low-end mass market products - CD-R, DVD-R and DVD-RAM
40Write Once - CD-R DVD-R
- preformed polycarbonate substrate
- wobbled groove to guide and clock laser
- photo/heat sensitive dye layer
- cyanine
- reflection layer
- gold
- laser spot heats dye, changes its structure which in turn deforms the substrate
- read-out laser is absorbed/scattered by the deformation
41DVD-R
- laser system
- λ = 640 nm, numerical aperture 0.6, refractive index 0.8
- spot diameter 0.4 µm
- capacity per side 4.7 GB
- 1.3 MB/sec record/read speed
- Prices (Panasonic)
- 5.4K for the drive
- 35 double sided media ( 3.90 / GB)
- (a CD-R 640 MB disk costs about 1 in quantity)
42Erasable DVD-RAM
- phase change recording layer - TeGeSb
- heated by laser spot
- high power write - fast melt-cool cycle leaves amorphous spot with low reflectivity
- lower power erase - slower melt-cool cycle leaves crystalline spot with high reflectivity
- read-out - low power laser
- land groove recording
43DVD-RAM
- capacity 2.6 GB per side
- single layer only, unlike DVD-ROM
- 4.7 GB per side in version 2 due in 2000
- record and read-back performance - 1.3 MB/sec
- access time 210 ms
- 1999 prices
- drive 640
- double sided disk (5.2 GB) 35 (6.70 per GB)
- With high volume
- could we expect media costs to come down to 1-2 per disk (like CD-R today)?
- giving 0.2 per GB
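The per-GB figures on this slide follow directly from price and capacity. The sketch below simply redoes that division; prices are in whatever currency unit the slide used (the symbol is not preserved in the transcript), and pairing the higher future price with the 9.4 GB version 2 disk is an illustrative assumption.

```python
def cost_per_gb(price: float, capacity_gb: float) -> float:
    """Media cost per GB, in the same currency unit as price."""
    return price / capacity_gb

# 1999: double-sided 5.2 GB disk at 35
print(round(cost_per_gb(35, 5.2), 2))

# High-volume guess: 1-2 per disk, for a 5.2 GB or a 9.4 GB (version 2) disk
print(round(cost_per_gb(1, 5.2), 2), round(cost_per_gb(2, 9.4), 2))
```

Both ends of the guessed range come out at about 0.2 per GB, matching the slide.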
44exotic storage technologies
- holography
- atomic force microscopy
- Keele Ultra High Density Memory
45holographic storage
graphic Byte Magazine
46atomic force microscopy
- atomic force microscopy applied to data storage by IBM
- sharp tip mounted on a micro-mechanical cantilever made from silicon nitride
- heat and pressure applied as it is passed over plastic substrate
- read-out - the cantilever and tip are scanned over the surface
- 45 GB/in² demonstrated
- 300 GB/in² theoretically possible
pictures - IBM Research Almaden
47Keele Ultra High Density Memory
?
- Basic research done at Keele University, by emeritus professor Ted Williams (inventor of an NMR scanner in late 70s/early 80s)
- The Keele Ultra High Density Memory uses magneto optical alloys to store 2.3 TeraBytes of user memory on a device the size of a credit card, but 8.5 cm thick, for less than 50!
- Uses optical techniques to store and retrieve data in 3D storage
- Multi-layer (3) recording
- Could put 100 Gbytes in a wristwatch
- All information on the technology controlled by a venture capital company - which says that licensing negotiations are under way with a large company - products can be expected in under 2 years
?
48Robotics - no problem - but prices are best at the top!
- NSM jukebox, 620 DVDs - 65 per 9.4 GB slot - 7/GB
- 20 per 50 GB slot - 0.4/GB
49Part II - LHC requirements and solutions
- summary of the requirements of the LHC experiments
- strawman LHC computing farm
- cost factors - an attempt to estimate the costs of storage in 2005
- conclusions
50LHC storage requirements
- summary of the storage requirements of the LHC experiments
- but this is just part of the computing fabric - which also includes processing and networking
51Data Recording and Offline Computing Facilities at CERN - for LHC experiments
- For each LHC experiment capacity at CERN is needed for
- Data Recording
- First-pass reconstruction
- Some re-processing
- Basic Analysis (pass-1, pass-2) - ESD → AOD, TAG
- Support for a few analysis groups (ATLAS, CMS: 4 groups, 100/1600 physicists)
- Good external networking
- Current assumption is that this would be complemented with a few large regional centres together providing about as much computing capacity as at CERN
raw data → ESD
52Capacity Estimates
- Estimate uses figures from CMS in mid-98 - ATLAS would be similar
- Raw data is recorded at 100 MB/sec
53PetaByte
- 10^15 Bytes
- 1,000 TeraBytes
- 20,000 Redwood tapes
- 30,000 Cheetah 36 disks
- 100,000 dual-sided DVD-RAM disks
- 1,500,000 sets of the Encyclopaedia Britannica
(w/o photos)
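The media counts above can be reproduced from unit capacities. In this sketch the 36 GB figure is the Cheetah 36 capacity from the earlier slide, 9.4 GB assumes dual-sided version 2 DVD-RAM (2 × 4.7 GB), and 50 GB per Redwood tape is inferred from the slide's own count rather than stated there; the function name is illustrative.

```python
import math

PETABYTE_GB = 1_000_000   # 10^15 bytes expressed in units of 10^9 bytes

def units_per_petabyte(unit_capacity_gb: float) -> int:
    """Number of media units needed to hold one PetaByte."""
    return math.ceil(PETABYTE_GB / unit_capacity_gb)

print("Cheetah 36 disks:      ", units_per_petabyte(36))
print("dual-sided DVD-RAM v2: ", units_per_petabyte(9.4))
print("50 GB tapes (inferred):", units_per_petabyte(50))
```

The results (about 28,000 disks, about 106,000 DVD-RAM disks, 20,000 tapes) match the rounded figures on the slide.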
54disk capacity v. data rate
CERN physics 1999 12 MB/sec-per-TB
CMS 2006 74 MB/sec-per-TB
55ALICE
- ALICE requires a much higher data recording rate than ATLAS or CMS - 1 GB/sec - during the 1-2 month ions run
- Total raw data 1 PByte per year
- Tape data rates may remain modestly in the 15-20 MB/sec range
- Requiring a nominal 50-70 drives - in practice 100-150 drives and some good storage management software
- This problem will be addressed by Fabrizio in his talks
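The nominal drive count follows from dividing the recording rate by the per-drive rate. The sketch below redoes that division and also derives how much live recording time 1 PByte at 1 GB/sec implies; the function names are illustrative, and the gap between the nominal and practical drive counts (mounts, rewinds, failures, read-back) is not modelled.

```python
import math

def nominal_drives(recording_rate_mb_s: float, drive_rate_mb_s: float) -> int:
    """Drives needed if every drive streams continuously at full speed."""
    return math.ceil(recording_rate_mb_s / drive_rate_mb_s)

def recording_days(total_pb: float = 1.0, rate_gb_s: float = 1.0) -> float:
    """Days of continuous recording needed to accumulate total_pb."""
    seconds = total_pb * 1_000_000 / rate_gb_s
    return seconds / 86_400

print(nominal_drives(1000, 20), "to", nominal_drives(1000, 15), "drives")
print(round(recording_days(), 1), "days of continuous recording for 1 PByte")
```

This gives 50 to 67 drives, in line with the nominal 50-70 quoted, and shows that 1 PByte corresponds to under two weeks of continuous 1 GB/sec recording within the 1-2 month run.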
56CMS Offline Farm at CERN circa 2006
[Diagram: processors (5600 processors, 1400 boxes, 160 clusters, 40 sub-farms; 0.5 M SPECint95), disks (0.5 PByte disk; 5400 disks, 340 arrays ...) and tapes (100 drives), linked by a farm network and a storage network, with LAN-WAN routers and a 0.8 Gbps DAQ input. Link bandwidths labelled on the diagram: 0.8, 1.5, 5, 6, 8, 12, 24, 250 and 960 Gbps.]
lmr for Monarc study - april 1999
57Is there a problem?
- Because HEP computing has the property of event independence
- we can process any number of events in parallel
- and so we can use real commodity components (well, maybe not for tertiary storage)
- nothing special - just lots of them
- The technology is looking good
- but there are two small problems which come from the scale
- -- Cost
- -- Management
- Fabrizio will talk about the storage management issues,
- but note that the management problem applies across the board - processors, network, storage, workflow, WAN
58Cost evolution
- cost factors
- development costs
- production costs ?
- technology
- market volume
- marketing costs
- distribution costs
- price factors
- production costs
- profit
- competition
- the best technology often does not win
59Share of Hard Disk Market - Units shipped in 1998
- 1998 - 145M disks sold - total revenue 30 Bn
- 110M in PCs (IDE)
- 30M SCSI/FCAL - mostly storage systems, which generated 13Bn revenues
60prices paid by CERN compared with 35% evolution since 1990 - simple disk arrays (JBOD)
61How much should we budget for hard disk?
- So we are reasonably happy that LHC can use inexpensive disk, and that the prices will continue to decrease steadily
- To minimise data loss and other operational problems associated with failing disks, we will use RAID. Today RAID systems come with a substantial price penalty, but we can expect that in 2005-06 we shall only have to pay for the redundant disk capacity.
- Bottom line - At an estimated 4-8/GByte the 500TB needed by CMS will cost 2-4M
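The bottom line is a single multiplication, sketched here so that other capacities or prices can be tried. The currency unit is whatever the slide used (the symbol is not preserved in the transcript); the `redundancy_overhead` parameter is a hypothetical addition for exploring the cost of redundant RAID capacity and defaults to zero so that the slide's figures are reproduced.

```python
def disk_budget(capacity_tb: float, price_per_gb: float,
                redundancy_overhead: float = 0.0) -> float:
    """Cost of capacity_tb of usable disk at price_per_gb.

    redundancy_overhead is the extra fraction of raw capacity bought for
    redundancy (hypothetical parameter, 0.0 = none).
    """
    raw_gb = capacity_tb * 1000 * (1.0 + redundancy_overhead)
    return raw_gb * price_per_gb

for price in (4, 8):
    print(f"{price}/GByte -> {disk_budget(500, price) / 1e6:.0f}M")
```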
62tape price evolution
- Estimating the cost of magnetic tape is not
nearly so easy.
63 Total revenues 5Bn
- 0.5" linear devices - DLT, 3590, 9840, 3570
- 0.5" helical - Redwood
- 19mm helical - AMPEX, Sony D1
- 8mm helical - EXABYTE, Sony AIT
- 4mm helical - DAT
64DVD?
- As we saw earlier, DVD-R and DVD-RAM have the potential to provide a very convenient way of archiving modest amounts of data - 5-10 GBytes - at a modest data rate (1.4 MB/sec).
- DVD is a random access device - offering a significantly different functionality from sequential access tape.
- The cost today for a DVD-RAM disk is a few per GB, rather similar to the cost of 8mm, 4mm tape.
- With a little improvement in the cost of the DVD-RAM drive - DVD-RAM could destroy the market for low-end tape (home, small office backup and archive)
65Data Centre Tapes
- But we are concerned with data centre tapes - 0.5" linear
- Where performance, capacity, robotics, ... are important factors
- But so is overall cost - which today, for ATLAS or CMS would be dominated by the media cost!
- ALICE is a bit different
66Can we estimate how tape costs will evolve?
- NO - we cannot estimate - only guess for media - which dominates the overall cost
- Cost of high quality drives will not change much
- Cost of a robot slot will not change (but we may see competitive pricing for DLT format robots)
[Chart, log scale: CHF per GB of data and CHF per foot of tape for Fe, Cr and MP media; single supplier v. multiple suppliers; future trend marked "?"]
67guesstimate for Magnetic Tape
- Maybe the recording density increases by a factor of 4
- So the cost of the media will fall to CHF 0.5 per GB
- And a cartridge will hold 100 GB
- The 5-year cost then works out at CHF 1/GB
- Two problems for tapes
- raw disk may be only 3 times more expensive
- as we guessed earlier, DVD-RAM might be substantially cheaper (if there are suitably priced robotics!)
68time to change the balance?
- the classic model
- use the disk as a cache of the active data, which is kept on tape
- may not be the right one for LHC
- we should consider using much more disk for all of the really active data
- and using tape or something cheaper to archive the rest
69conclusion (i)
- disks OK
- merging of magnetic and magneto-optical techniques will ensure that the technology can evolve smoothly well into the LHC time-frame
- unlikely to be displaced as the standard for secondary storage
- DVD - too slow, too small
- holography - waiting for a material breakthrough
- the rest are not on the LHC time-scale
- robots OK
70conclusion (ii)
- ---- BUT tertiary storage is a problem
- tape - reliability, cost, market - all questionable
- DVD - may be a solution if a healthy market develops
- very likely to eliminate tape for low-end PC applications
- could well compete on price and reliability for data centre applications
- but likely to remain low capacity, low performance
- removable magnetic or magneto-optic disk may compete strongly with tape - but are not likely to be cheaper
71conclusion (iii)
- which may just give us the opportunity we need to change the analysis model
- active data on disk
- exchange data using random access DVDs
- and use tape as the last resort - like the rest of the industry!
- but how do you select the active raw data?