Title: STACS
1. STACS: Storage Access Coordination of Tertiary Storage for High Energy Physics Applications
Arie Shoshani, Alex Sim, John Wu, Luis Bernardo, Henrik Nordberg, Doron Rotem
Scientific Data Management Group, Computing Science Directorate
Lawrence Berkeley National Laboratory
(no longer at LBNL)
2. Outline
- Short High Energy Physics overview (of the data handling problem)
- Description of the Storage Access Coordination System
  - file tracking
- The Query Estimator (QE)
  - details of the bit-sliced index
- The Query Monitor (QM)
  - coordination of file bundles
- The Cache Manager (CM)
  - tertiary storage queuing and tape coordination
  - transfer time for query estimation
3. Optimizing Storage Management for High Energy Physics Applications
[Table: data volumes for planned HENP experiments]
STAR: Solenoidal Tracker At RHIC
RHIC: Relativistic Heavy Ion Collider
4. Particle Detection Systems
[Photos: the PHENIX detector at RHIC; the STAR detector at RHIC]
5. Result of a Particle Collision (an event)
6. Typical Scientific Exploration Process
- Generate large amounts of raw data
  - large simulations
  - collect from experiments
- Post-process the data
  - analyze data (find particles produced, tracks)
  - generate summary data, e.g. momentum, no. of pions, transverse energy
  - number of properties is large (50-100)
- Analyze the data
  - use summary data as a guide
  - extract subsets from the large dataset
  - need to access events based on partial-property specifications (range queries)
    - e.g. ((0.1 < AVpT < 0.2) AND (10 < Np < 20)) OR (N > 6000)
  - apply analysis code
7. Size of Data and Access Patterns
- STAR experiment
  - 10^8 events over 3 years
  - 1-10 MB of reconstructed data per event
  - events organized into 0.1-1 GB files
  - 10^15 bytes total size
  - 10^6 files, 30,000 tapes (30 GB tapes)
- Access patterns
  - subsets of events are selected for analysis by region in high-dimensional property space
  - 10,000-50,000 out of a total of 10^8
  - data is randomly scattered all over the tapes
- Goal: optimize access from tape systems
8. Example of Event Property Values
(I = integer property, R = real property)
I  event          1
I  N(1)        9965    I  N(2)        1192    I  N(3)       1704
I  Npip(1)     2443    I  Npip(2)      551    I  Npip(3)     426
I  Npim(1)     2480    I  Npim(2)      541    I  Npim(3)     382
I  Nkp(1)       229    I  Nkp(2)        30    I  Nkp(3)       50
I  Nkm(1)       209    I  Nkm(2)        23    I  Nkm(3)       32
I  Np(1)        255    I  Np(2)         34    I  Np(3)        24
I  Npbar(1)      94    I  Npbar(2)      12    I  Npbar(3)     24
I  NSEC(1)    15607    I  NSEC(2)     1342
I  NSECpip(1)   638    I  NSECpip(2)   191
I  NSECpim(1)   728    I  NSECpim(2)   206
I  NSECkp(1)      3    I  NSECkp(2)      0
I  NSECkm(1)      0    I  NSECkm(2)      0
I  NSECp(1)     524    I  NSECp(2)     244
I  NSECpbar(1)   41    I  NSECpbar(2)    8
R  AVpT(1)     0.325951    R  AVpT(2)     0.402098
R  AVpTpip(1)  0.300771    R  AVpTpip(2)  0.379093
R  AVpTpim(1)  0.298997    R  AVpTpim(2)  0.375859
R  AVpTkp(1)   0.421875    R  AVpTkp(2)   0.564385
R  AVpTkm(1)   0.435554    R  AVpTkm(2)   0.663398
R  AVpTp(1)    0.651253    R  AVpTp(2)    0.777526
R  AVpTpbar(1) 0.399824    R  AVpTpbar(2) 0.690237
I  NHIGHpT(1)   205    I  NHIGHpT(2)     7    I  NHIGHpT(3)    1
I  NHIGHpT(4)     0    I  NHIGHpT(5)     0
54 properties, as many as 10^8 events
9. Opportunities for Optimization
- Prevent / eliminate unwanted queries => query estimation (fast estimation index)
- Read only the events qualified for a query from a file (avoid reading irrelevant events) => exact index over all properties
- Share files brought into cache by multiple queries => look ahead for the files needed, plus cache management
- Read files from the same tape when possible => coordinate file access from tape
10. The Storage Access Coordination System (STACS)
[Architecture diagram. Components and labeled links: the users' application sends query estimation / execution requests to the Query Estimator (QE), which uses the bit-sliced index; the Query Monitor (QM), with its Caching Policy Module, sends file caching requests to the Cache Manager (CM); the CM consults the File Catalog (FC) and performs file caching and file purging on the disk cache; the application accesses cached files via open, read, close.]
11. A Typical SQL-like Query
SELECT * FROM star_dataset
WHERE 500 < total_tracks < 1000 AND energy < 3
- The index will generate the set of files, with the qualifying events in each, that the query needs:
  F6: {E4, E17, E44}, F13: {E6, E8, E32}, ..., F1036: {E503, E3112}
- The files can be returned to the application in any order
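To make the result shape concrete, here is a minimal sketch (illustrative Python, not the STACS API) of the file-to-qualifying-events map the index produces, using the IDs from this slide's example:

    # Illustrative only: the Query Estimator's answer maps each file to
    # the events in it that satisfy the query.
    query_result = {
        "F6":    ["E4", "E17", "E44"],
        "F13":   ["E6", "E8", "E32"],
        # ...
        "F1036": ["E503", "E3112"],
    }
    # The files may be handed to the application in any order.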
12. File Tracking (1)
[Chart: file caching activity over time for a query]
13. File Tracking (2)
[Chart: file caching activity over time, continued]
14. File Tracking
[Chart of file caching over time for three queries. Annotations: query 1 start; query 2 start; query 3 start; files shared by all 3 queries]
15. Typical Processing Flow
[Sequence diagram of a query through STACS and the local disk. The surviving step labels, in order: 1 new query / quick estimate; 2 execute / full estimate; 3 execute; 4 request (whichFileToCache); 5-6 (unlabeled); 7 stage; 8 file info; 9 file caching request; 10 file caching; 11 staged; 12 retrieve; 13 (fileIDToCache); 14 release; 15 purge; 16 purge; 17 purged; 18 done.]
16. The Storage Access Coordination System (STACS)
[Architecture diagram repeated from slide 10; the Query Estimator and its bit-sliced index are discussed next.]
17. Bit-Sliced Index (used by the Query Estimator)
- Index size
  - property space: 10^8 events x 100 properties x 4 bytes = 40 GB
- Index requirements
  - range queries: (10 < Np < 20) AND (0.1 < AVpT < 0.2)
  - the number of properties involved is small (3-5)
- Problem
  - how to organize the property-space index
18. Indexing over All Properties
- Multi-dimensional index methods
  - partition the MD space (KD-trees, n-QUAD-trees, ...)
  - for high dimensionality, either the fanout or the tree depth is too large
    - e.g. symmetric n-QUAD-trees require a fanout of 2^100
  - non-symmetric solutions are order dependent
19. Partitioning Property Space
- One possible solution
  - partition the property space into subsets, e.g. 7 dimensions at a time
- Performance
  - good for non-partial range queries (full hypercube)
  - bad if only a few of the dimensions in each partition are involved in the query
- S. Berchtold, C. Böhm, H.-P. Kriegel, "The Pyramid-Technique: Towards Breaking the Curse of Dimensionality", SIGMOD 1998
  - best for non-skewed (random) data
  - best for full hypercube queries
  - for partial range queries (e.g. 3 dimensions out of 100), close to a sequential scan
20. Bit-Sliced Index
- Solution: take advantage of the fact that the index only needs to be append-only
  - partition each property into bins (e.g. for 0 < Np < 300, use 20 equal-size bins)
  - for each bin, generate a bit vector
  - compress each bit vector (run-length encoding)
21. Run-Length Compression
Uncompressed: 0000000000001111000000000...0000001000000001111111100000000...000000
Compressed: 12, 4, 1000, 1, 8, 1000
Store very short sequences as-is.
Advantage: AND, OR, and COUNT operations can be performed on the compressed data.
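As an illustration of operating on the compressed form directly, here is a small Python sketch (not the STACS implementation; it stores runs as (bit, length) pairs and assumes equal-length, non-empty vectors):

    from itertools import groupby

    def rle(bits):
        """Run-length encode a 0/1 sequence as (bit, run_length) pairs."""
        return [(v, len(list(g))) for v, g in groupby(bits)]

    def rle_and(a, b):
        """AND two compressed vectors of equal length without decompressing."""
        out, i, j = [], 0, 0
        (va, la), (vb, lb) = a[0], b[0]
        while True:
            n = min(la, lb)                    # overlap of the current runs
            v = va & vb
            if out and out[-1][0] == v:
                out[-1] = (v, out[-1][1] + n)  # merge with the previous run
            else:
                out.append((v, n))
            la, lb = la - n, lb - n
            if la == 0:
                i += 1
                if i == len(a):
                    break
                va, la = a[i]
            if lb == 0:
                j += 1
                if j == len(b):
                    break
                vb, lb = b[j]
        return out

    def count(runs):
        """COUNT = total length of the 1-runs."""
        return sum(n for v, n in runs if v)

Here rle_and(rle(x), rle(y)) yields the compressed AND of two bin bitmaps, and count tallies qualifying events without ever materializing the full vectors.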
22. Bit-Sliced Index: Advantages
- Space for the index is very small - it can fit in memory
- Need only touch the properties involved in queries (vertical partitioning)
- Need only touch the bins involved (min-max)
=> Query estimation is in memory only!
23. Inner Bins vs. Edge Bins
[Figure: a 2-D query region Range(x) x Range(y); the bins cut by the boundary of each range are the edge bins.]
24. Vertical Partitions and the Bit-Sliced Index
[Diagram: the vertical partitions of the event properties live on disk (20-40 GB); the bit-sliced index lives in memory (50-100 MB). Range conditions select bins: events in the inner bins (bin1, bin2, ...) qualify directly, events in the two edge bins are checked against the vertical partitions, and the output is the event list - the file list and the events that qualify in each file.]
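A hedged sketch of how a partial-range condition might be answered from this structure, assuming equal-width bins and the RLE encoding of the earlier sketch (all names illustrative): inner bins qualify wholly from the in-memory index, and only the two edge bins force reads of exact values from the on-disk vertical partition.

    def events_of(runs):
        """Decode an RLE bitmap (as in the earlier sketch) into event positions."""
        pos, out = 0, []
        for v, n in runs:
            if v:
                out.extend(range(pos, pos + n))
            pos += n
        return out

    def eval_range(bitmaps, values, lo_q, hi_q, lo, hi, nbins):
        """Events with lo_q <= value < hi_q; `values` stands in for the
        on-disk vertical partition, and lo <= lo_q <= hi_q <= hi."""
        width = (hi - lo) / nbins
        first = int((lo_q - lo) / width)                  # low-side edge bin
        last = min(int((hi_q - lo) / width), nbins - 1)   # high-side edge bin
        result = set()
        for b in range(first, last + 1):
            evts = events_of(bitmaps[b])
            if b in (first, last):                        # edge bin: exact check
                result |= {e for e in evts if lo_q <= values[e] < hi_q}
            else:                                         # inner bin: all qualify
                result |= set(evts)
        return result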
25. Experimental Results on the Index
- Simulated dataset (Hijing)
  - 10 million events
  - 70 properties
- Property space
  - BSI: 2.5 GB
  - Oracle: 3.5 GB
- Index size
  - BSI: 280 MB (4 MB/property)
  - Oracle: 7 GB (100 MB/property)
- Index creation time
  - BSI: 3 hours (2.5 min/property)
  - Oracle: 47 hours (40 min/property)
26. Experimental Results on the Index
- Run a count query (preliminary)
- BSI
  - 1 property: 14-70 sec (depending on the size of the range)
  - 2 properties: 90 sec (both about half the range)
  - => linear in the number of bins touched
- Oracle
  - 1 property: comparable (counts only)
  - 2 properties: > 2 hours!
    - uses one index, loops on the table
  - => need to tune Oracle
    - run ANALYZE on the indexes, choose a policy
    - bitmap index - did not help
  - => after tuning: 12 min
27. The Storage Access Coordination System (STACS)
[Architecture diagram repeated from slide 10; the Query Monitor and file bundles are discussed next.]
28. File Bundles: Multiple Event Components
29. A Typical SQL-like Query for Multiple Components
SELECT Vertices, Raw FROM star_dataset
WHERE 500 < total_tracks < 1000 AND energy < 3
- The index will generate the set of bundles that the query needs:
  {F7, F16}: {E4, E17, E44}, {F13, F16}: {E6, E8, E32}, ...
- The bundles can be returned to the application in any order
- A bundle is the set of files that need to be in cache at the same time
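A minimal illustration (not the STACS API) of the bundle shape, pairing each bundle's file set with the events it serves, using the IDs above:

    # Illustrative only: each bundle is the set of files (one per requested
    # component) that must be co-resident in cache, with its events.
    bundles = [
        ({"F7", "F16"},  ["E4", "E17", "E44"]),
        ({"F13", "F16"}, ["E6", "E8", "E32"]),
        # ...
    ]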
30. File Weight Policy for Managing File Bundles
- File weight (per bundle): 1 if the file appears in the bundle, 0 otherwise
- Initial file weight: SUM over all bundles of all queries
- Example
  - query 1: file FK appears in 5 bundles
  - query 2: file FK appears in 3 bundles
  - then IFW(FK) = 8
31. File Weight Policy for Managing File Bundles (cont'd)
- Dynamic file weight: the weight of each file in a bundle that was processed is decremented by 1
- Dynamic bundle weight: the sum of the dynamic weights of the files in the bundle
32. How File Weights Are Used for Caching and Purging
- Bundle caching policy (sketched in code after this list)
  - for each query, in turn, cache the bundle with the most files in cache
  - in case of a tie, select the bundle with the highest weight
  - ensures that a bundle that includes files needed by other bundles/queries has priority
- File purging policy
  - no file purging occurs until space is needed
  - purge the file not in use with the smallest weight
  - ensures that files needed by other bundles stay in cache
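The bookkeeping behind these policies can be sketched as follows (illustrative Python; it assumes, per the previous slide, that a bundle's weight is the sum of its files' dynamic weights):

    from collections import Counter

    def initial_file_weights(queries):
        """queries: per query, a list of bundles, each a set of file IDs."""
        w = Counter()
        for query_bundles in queries:
            for bundle in query_bundles:
                for f in bundle:
                    w[f] += 1          # +1 per bundle the file appears in
        return w

    def bundle_processed(bundle, w):
        """Decrement the dynamic weight of each file in a processed bundle."""
        for f in bundle:
            w[f] -= 1

    def pick_bundle_to_cache(bundles, in_cache, w):
        """Most files already in cache; ties broken by bundle weight."""
        return max(bundles,
                   key=lambda b: (len(b & in_cache), sum(w[f] for f in b)))

    def pick_purge_victim(cached, in_use, w):
        """Purge the file not in use with the smallest weight, if any."""
        idle = cached - in_use
        return min(idle, key=lambda f: w[f]) if idle else None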
33. Other Policies
- Pre-fetching policy
  - queries can request pre-fetching of bundles, subject to a limit
  - currently, the limit is set to two bundles
  - multiple pre-fetching is useful for parallel processing
- Query service policy
  - queries are serviced in round-robin fashion
  - queries that have all their bundles cached and are still processing are skipped
34. Managing the Queues
[Diagram of the Query Monitor's internal structures: the query queue feeds the bundle set and the file set, which track files being processed and files in cache.]
35. File Tracking of Bundles
[Chart of bundle caching over time. Annotations: a bundle (3 files) is formed, then passed to the query; a bundle shared by two queries; a bundle found in cache; query 1 starts here; query 2 starts here.]
36. Summary
- The key to managing the bundle caching and purging policies is weight assignment
  - caching: based on bundle weight
  - purging: based on file weight
- Other file weight policies are possible
  - e.g. based on bundle size
  - e.g. based on tape sharing
- Proving which policy is best is a hard problem
  - testing in a real system: expensive, needs a standalone system
  - simulation: too many parameters in the query profile can vary (processing time, inter-arrival time, number of drives, size of cache, etc.)
  - modeling with a system of queues: hard to model the policies
  - we are working on the last two methods
37. The Storage Access Coordination System (STACS)
[Architecture diagram repeated from slide 10; the Cache Manager and file transfer queuing are discussed next.]
38. Queuing File Transfers
- The number of PFTPs to HPSS is limited
  - the limit is set by a parameter, NoPFTP
  - the parameter can be changed dynamically
- The CM is multi-threaded
  - it issues and monitors multiple PFTPs in parallel
- All requests beyond the PFTP limit are queued
- The File Catalog provides, for each file:
  - HPSS path/file_name
  - disk cache path/file_name
  - file size
  - tape ID
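A minimal sketch of the concurrency limit, assuming a hypothetical stage_file wrapper around PFTP; a bounded semaphore plays the role of the NoPFTP parameter, and dynamic adjustment of the limit is omitted:

    import threading

    def stage_file(hpss_path, cache_path):
        # Stand-in for the real PFTP transfer (illustrative stub).
        print("staging", hpss_path, "->", cache_path)

    class TransferPool:
        """Issue at most no_pftp concurrent transfers; callers beyond the
        limit block, which plays the role of the request queue."""
        def __init__(self, no_pftp):
            self.slots = threading.BoundedSemaphore(no_pftp)

        def transfer(self, hpss_path, cache_path):
            with self.slots:            # blocks once the PFTP limit is reached
                stage_file(hpss_path, cache_path)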
39-40. File Queue Management
- Goals
  - minimize tape mounts
  - still respect the order of requests
  - do not postpone unpopular tapes forever
- File clustering parameter (FCP)
  - if the file at the top of the queue is on Tape_i and FCP > 1 (e.g. 5), then up to 4 more files from Tape_i are selected to be transferred next
  - then, go back to the file at the top of the queue
  - the parameter can be set dynamically
[Queue diagram: requests F1(Ti), F2(Ti), F3(Ti), F4(Ti) for the same tape take positions 1-4 in the order of file service; the remaining request is served fifth.]
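The clustering discipline can be sketched as follows (illustrative Python; Request and next_batch are assumed names, and FCP here counts the head file, so FCP = 5 serves the head plus up to 4 more files from the same tape):

    from collections import deque, namedtuple

    Request = namedtuple("Request", "file tape")

    def next_batch(queue, fcp):
        """Serve the head request plus up to fcp-1 later requests for the
        same tape, leaving all other requests in their original order."""
        if not queue:
            return []
        head = queue.popleft()
        batch, remaining = [head], deque()
        for req in queue:
            if req.tape == head.tape and len(batch) < fcp:
                batch.append(req)       # cluster same-tape files with the head
            else:
                remaining.append(req)   # everything else keeps queue order
        queue.clear()
        queue.extend(remaining)
        return batch

With FCP = 5 and the queue from the figure, F1(Ti) through F4(Ti) come out together ahead of the unrelated request, which is served on the next call.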
41. File Caching Order for Different File Clustering Parameters
[Two charts: file caching order with File Clustering Parameter = 1 vs. File Clustering Parameter = 10]
42. Transfer Rate (Tr) Estimates
- Tr is needed to estimate the total time of a query
- Tr is the average over recent file transfers, from the time the PFTP request is made to the time the transfer completes. This includes:
  - mount time, seek time, read to the HPSS RAID, transfer to the local cache over the network
- For a dynamic network speed estimate
  - check the total bytes for all files being transferred over small intervals (e.g. 15 sec)
  - calculate a moving average over n intervals (e.g. 10 intervals)
- Using this, the actual time spent in HPSS can be estimated
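A sketch of the moving-average rate estimate described above (illustrative Python; the interval length and window size are the slide's example values):

    from collections import deque

    class RateEstimator:
        """Moving-average network rate: sample the bytes moved in each
        interval (e.g. 15 s) and average the last n samples (e.g. 10)."""
        def __init__(self, n_intervals=10, interval_sec=15.0):
            self.samples = deque(maxlen=n_intervals)
            self.interval = interval_sec

        def record(self, bytes_this_interval):
            self.samples.append(bytes_this_interval)

        def rate(self):
            """Bytes per second over the current window (0.0 if no samples)."""
            if not self.samples:
                return 0.0
            return sum(self.samples) / (len(self.samples) * self.interval)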
43. Dynamic Display of Various Measurements
[Screenshot of the monitoring display]
44. Query Estimate
- Given the transfer rate Tr
- Given a query for which
  - X files are in cache
  - Y files are in the queue
  - Z files are not scheduled yet
- Let s(file_set) be the total byte size of all files in file_set
- If Z = 0, then
  - QuEst = s(Y)/Tr
- If Z != 0, then
  - QuEst = (s(Y) + q·s(Z))/Tr, where q is the number of active queries
[Queue diagram: the queued files F1(Y), ..., F4(Y) form the set labeled T, so s(T) = s(Y)]
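The estimate translates directly into code; a minimal sketch (illustrative names; sizes in bytes, Tr in bytes per second):

    def quest(queued_sizes, unscheduled_sizes, tr, active_queries):
        """Estimated query time from the formula above."""
        if not unscheduled_sizes:                  # Z = 0: only queued files
            return sum(queued_sizes) / tr
        # Z != 0: unscheduled files compete with the q active queries
        return (sum(queued_sizes)
                + active_queries * sum(unscheduled_sizes)) / tr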
45. Reason for the q·s(Z) Term
[Two charts of estimated vs. actual query time:
- 20 queries of length 20 minutes launched 20 minutes apart: estimate pretty close
- 20 queries of length 20 minutes launched 5 minutes apart: estimate bad - requests accumulate in the queue]
46. Error Handling
- 5 generic errors
  - file not found
    - return error to the caller
  - PFTP limit reached
  - can't login
    - re-queue the request, try later (1-2 min)
  - HPSS error (I/O, device busy)
    - remove the partial file from cache, re-queue
    - try n times (e.g. 3), then return error transfer_failed
  - HPSS down
    - re-queue the request, retry repeatedly until successful
    - respond to File_status requests with HPSS_down
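A hedged sketch of this generic policy (illustrative Python; the request/queue objects and error names are assumptions, not the actual HRM API):

    RETRY_SOON = {"pftp_limit_reached", "cant_login"}  # re-queue, try in 1-2 min
    MAX_IO_RETRIES = 3                                 # for HPSS I/O errors

    def handle_error(request, error, queue):
        if error == "file_not_found":
            request.reply("error: file not found")      # report to the caller
        elif error in RETRY_SOON:
            queue.requeue(request, delay=90)            # try again in ~1.5 min
        elif error == "hpss_io_error":
            request.cache.remove_partial(request.file)  # drop the partial file
            request.retries += 1
            if request.retries >= MAX_IO_RETRIES:
                request.reply("error: transfer_failed")
            else:
                queue.requeue(request, delay=90)
        elif error == "hpss_down":
            queue.requeue(request, delay=120)           # retry until HPSS is back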
47. Summary
- The HPSS Hierarchical Resource Manager (HRM)
  - insulates applications from transient HPSS and network errors
  - limits concurrent PFTPs to HPSS
  - manages the queue to minimize tape mounts
  - provides file/query time estimates
  - handles errors in a generic way
- The same API can be used for any MSS, such as UniTree, Enstore, etc.
48. Web Pointers
- http://gizmo.lbl.gov/stacs
- http://gizmo.lbl.gov/arie/download.papers.html -- to download papers
- http://gizmo.lbl.gov/stacs/stacs.slides/index.htm -- a STACS presentation