Title: SCAN-Lite: Enterprise-wide analysis on the cheap
SCAN-Lite: Enterprise-wide analysis on the cheap
- Craig Soules, Kimberly Keeton, Brad Morrey
Enterprise information management
- Search
- Clustering
- Provenance
- Classification
- IT Trending
- Virus scanning
(Diagram: clients feeding analysis results into a central metadata server)
Enterprise information management
(Diagram: the same clients and metadata server, with duplicate files on multiple machines)
Data is duplicated across machines! Duplicate analysis is wasted work
Issues
- Analysis programs conflict on clients
  - Contend for system resources (memory, disk)
- Clients repeat work
  - Duplicate files on multiple clients
- Client foreground workloads are impacted
  - Work exceeds available idle time on busy clients
Approaches
- Reduce resource contention
(Diagram: analysis programs coordinated on a single client)
Approaches
- Leverage duplication to balance client load
- Delay analysis to identify all duplicates
(Diagram: a global scheduler assigning work across clients)
Solutions
- Local scheduler
  - Coordinates analyses to reduce resource contention
  - Up to 60% improvement
- Global scheduler
  - Identifies duplicates to remove work
  - Balances load
  - 40% reduction in impact to foreground tasks
Local scheduling
- Traditionally, analyses are separate programs
  - Scheduling left to the operating system
  - Potentially at different times
- Each program identifies files to scan
- Each program opens and reads file data
(Diagram: each program independently reading the same files from disk)
Unified local scheduling
- Each analysis routine is a separate thread
- Control thread manages shared tasks
  - Identify files to scan, and open/read file data
- Shared memory buffer distributes file data (see the sketch below)
(Diagram: control thread reading from disk into a shared memory buffer consumed by the analysis threads)
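A minimal Python sketch of this structure, assuming hypothetical analysis callbacks ("routine0", "routine1" stand in for SCAN-Lite's real analysis routines). Bounded queues approximate the fixed shared memory buffer; the real system fans data out through shared memory rather than per-thread queues.

```python
import hashlib
import queue
import threading
from pathlib import Path

NUM_ROUTINES = 2    # stand-in for SCAN-Lite's seven analyses
CHUNK = 1 << 20     # 1 MiB read unit

def control_thread(paths, fanout):
    """Identify files, read each one from disk exactly once, and
    fan the data out to every analysis thread."""
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                for q in fanout:
                    q.put((path, chunk))
    for q in fanout:
        q.put(None)  # sentinel: no more data

def analysis_thread(name, q):
    """Consume the shared stream; hashing stands in for a real analysis."""
    digests = {}
    while (item := q.get()) is not None:
        path, chunk = item
        digests.setdefault(path, hashlib.sha1()).update(chunk)
    for path, h in digests.items():
        print(f"{name}: {path.name} -> {h.hexdigest()}")

if __name__ == "__main__":
    paths = list(Path(".").glob("*.txt"))  # example scan set
    queues = [queue.Queue(maxsize=8) for _ in range(NUM_ROUTINES)]
    threads = [threading.Thread(target=analysis_thread, args=(f"routine{i}", q))
               for i, q in enumerate(queues)]
    for t in threads:
        t.start()
    control_thread(paths, queues)  # producer
    for t in threads:
        t.join()
```

The bounded queues give backpressure: the control thread stalls when an analysis routine falls behind, just as a fixed-size shared buffer would.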
Local scheduling performance
- Ran a fitness test using 7 analysis routines
  - 42 data sets, each containing files of a fixed size
  - Ran both approaches over each data set
  - Calculated per-file elapsed scan time
  - Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
- Seven-at-once
  - Run each analysis routine separately at the same time
- Unified
  - SCAN-Lite's unified local scheduling approach
Elapsed time vs. CPU time
- Original fitness test used CPU time
  - Gave less variable performance curves for modeling
- Disk contention shows up in elapsed time
  - CPU time is multiplexed
  - Elapsed time is not
(Graph annotation: "This is very bad")
Local scheduling results
(Graph: per-file scan time, seven-at-once vs. unified)
17-60% improvement
Seven-at-once benefits from deep disk queues, but this hurts foreground apps
Small random I/Os have worse interaction than larger ones
Global scheduler
- Two goals
  - Reduce additional work from duplicate files
  - Utilize duplication to schedule work to the best client
- Two-phase scanning (phase one sketched below)
  - Phase one: identify duplicate files using content hashing
  - Phase two: analyze one copy at the appropriate client
  - Delaying between phase one and two provides opportunity for additional duplication and deletion
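A sketch of phase one, assuming a hypothetical `uploaded_hashes` list of (client, path, digest) triples gathered by the server; SHA-1 is the hash the trade-offs slide names.

```python
import hashlib
from collections import defaultdict

def content_hash(path, chunk_size=1 << 20):
    """Phase one on a client: hash file content in chunks,
    without reading the whole file into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def group_duplicates(uploaded_hashes):
    """On the server: group (client, path) pairs by uploaded hash.
    Each group needs only one phase-two analysis, at whichever
    client the scheduler picks."""
    groups = defaultdict(list)
    for client, path, digest in uploaded_hashes:
        groups[digest].append((client, path))
    return groups
```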
Traditional scanning
(Diagram: every client scans all of its files and sends results to the server)
Phase one: Duplicate detection
(Diagram: clients upload content hashes to the server, which identifies duplicates)
Phase two: Scheduling
(Diagram: the server assigns each unique file to one client for analysis)
When to schedule
- Clients upload hashes each scheduling period
- The freshness specifies a deadline by which new data must be analyzed (see the sketch below)
(Timeline: scheduling periods along a time axis. Annotations: "Schedule before this period"; "Scheduling here gives one option"; "Scheduling here gives three options")
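As a small illustration of the deadline arithmetic (units and the helper name are assumptions, e.g. days):

```python
import math

def last_schedulable_period(created, freshness, period_len):
    """Latest scheduling period that still meets freshness: the data
    must be analyzed by created + freshness, so deferring up to that
    period maximizes the duplicates (options) known to the scheduler."""
    return math.floor((created + freshness) / period_len)

# Data created on day 0, freshness of 3 days, 1-day periods:
# it can be deferred through period 3, gaining placement options.
assert last_schedulable_period(0, 3, 1) == 3
```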
How to schedule
- Scheduling is a bin packing problem
  - Files are balls, clients are bins
  - Size of bins is available idle time
  - Color of balls/bins equates to location of duplicates
  - Size of balls is time required for analysis
How to schedule
- We use a greedy heuristic for scheduling (toy sketch below)
  - Consider idle time and machine priorities
- See paper for details
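A toy version of such a greedy, duplicate-aware packing. The selection rules loosely follow the later "Idle time, priorities, and worst-fit" slide (enough idle time, then highest priority, then worst-fit), but the details are simplified relative to the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Client:
    name: str
    idle_time: float       # bin size: available idle time
    priority: int = 0      # higher = preferred machine
    scheduled: float = 0.0

    @property
    def remaining(self) -> float:
        return self.idle_time - self.scheduled

@dataclass
class FileTask:
    cost: float            # ball size: estimated analysis time
    holders: List[Client]  # ball color: clients holding a duplicate

def greedy_schedule(tasks: List[FileTask]) -> Dict[int, str]:
    """Largest-task-first greedy packing: among the clients that hold
    a copy, prefer those with enough idle time, then the highest
    priority, then the most remaining idle time (worst-fit)."""
    assignment = {}
    for task in sorted(tasks, key=lambda t: -t.cost):
        fits = [c for c in task.holders if c.remaining >= task.cost]
        candidates = fits or task.holders   # fall back to all holders
        top = max(c.priority for c in candidates)
        candidates = [c for c in candidates if c.priority == top]
        chosen = max(candidates, key=lambda c: c.remaining)
        chosen.scheduled += task.cost
        assignment[id(task)] = chosen.name
    return assignment
```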
Work ahead
- Start by scheduling all work that meets freshness
- Schedule additional work on still-idle machines
  - Any remaining idle time can be used for additional work
  - We refer to this as work ahead (sketch below)
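Continuing the sketch above, work ahead can be a second greedy pass that only consumes idle time that is actually left over (an illustration, not the paper's algorithm):

```python
def work_ahead(extra_tasks, assignment):
    """Second pass: pack not-yet-due tasks into leftover idle time,
    never exceeding it, so work ahead cannot add client impact.
    Reuses the Client/FileTask sketch above."""
    for task in sorted(extra_tasks, key=lambda t: -t.cost):
        fits = [c for c in task.holders if c.remaining >= task.cost]
        if fits:
            chosen = max(fits, key=lambda c: c.remaining)
            chosen.scheduled += task.cost
            assignment[id(task)] = chosen.name
    return assignment
```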
Two-phase scanning: Trade-offs
(Diagram: phase one hashes every copy on every client; phase two analyzes only one copy)
Two-phase scanning: Trade-offs
- If the cost of hashing exceeds the additional work from duplicates, then one-phase scanning is better
- Analysis of hashing costs using SHA-1 indicates that 3% data duplication is the minimum (worked example below)
- Do we see that in practice?
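The break-even point falls out of a simple cost model. The throughput numbers below are made up for illustration; the paper's cost analysis is the authority.

```python
def breakeven_duplication(hash_cost, analysis_cost):
    """With duplicated fraction d of bytes, one-phase scanning costs
    a*B while two-phase costs h*B + a*(1-d)*B, so two-phase wins
    when d > h/a."""
    return hash_cost / analysis_cost

# Made-up example: SHA-1 at ~300 MB/s vs. combined analyses at ~10 MB/s
# puts the break-even point near the slide's 3% figure.
print(f"{breakeven_duplication(1 / 300, 1 / 10):.1%}")  # -> 3.3%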
Duplication in enterprise data
- Examined two data sources
  - 100 user home directories from a central server
  - 12 user productivity machines
- In both datasets, saw 10% duplication
  - Even more with system files, email servers, sharepoints, etc.
- This is sufficient duplication for work reduction
(Chart annotation: "4/7 duplication")
Global scheduling policies
- Traditional
  - One-phase scanning, scan all copies
- Rand
  - Two-phase scanning, random scheduling
- BestPlace
  - Two-phase scanning, greedy scheduling
- BestPlaceTime
  - Two-phase scanning, greedy scheduling with work ahead
- Opt
  - Unreplicated data only, delayed work ahead
Metrics
- Total Work
  - Total elapsed time spent on analysis and hashing
- Client Impact
  - Time spent that exceeded client idle time
- Metrics calculated for each day
- Summed over the entire simulation period (sketch below)
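A direct transcription of these definitions (function and parameter names are assumptions):

```python
def daily_metrics(task_times, idle_time):
    """One client, one day. Total Work is elapsed analysis + hashing
    time; Client Impact is whatever portion exceeded the idle time."""
    total_work = sum(task_times)
    impact = max(0.0, total_work - idle_time)
    return total_work, impact

def simulation_metrics(days):
    """Sum both metrics over every (task_times, idle_time) pair,
    i.e. over each client-day of the simulation."""
    work = impact = 0.0
    for task_times, idle_time in days:
        w, i = daily_metrics(task_times, idle_time)
        work, impact = work + w, impact + i
    return work, impact
```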
Experimental setup
- Implemented a simulator to test a variety of machine configurations and scheduling policies
- Config: 50 high-priority blades, 50 low-priority laptops
  - Blades were modeled after a dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
  - Laptops were modeled after a 2GHz Pentium M, 1.5GB RAM, 60GB SATA
- Simulated 30 days
  - Daily creation rates and layouts from traced workloads
  - Freshness of 3 days, scheduling period of 1 day
Total work
(Graph: total work under each scheduling policy)
Prefers faster blade machines over laptops, increasing their total work to reduce client impact
Doing work ahead of the freshness delay means analyzing files that would have been deleted
Removes duplicate work, reducing the total work done
Client impact
(Graph: client impact under each scheduling policy)
By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time
Choosing the best place helps hit the idle time targets, reducing average client impact
Less work means less impact
Theoretical OPT only 8% better than BestPlaceTime
Summary
- Reducing local scanning interference is critical
  - 17-60% improvement from reduced contention
- Two-phase scanning reduces analysis overheads
  - Reduces total work to near single-copy costs
  - Reduced client impact by up to 40% on our workload
Future work
- This is an initial system for reducing analysis costs
  - Many improvements remain!
- Vary freshness delays
  - Different applications may have different requirements
- Provide freshness and scan priorities to clients
  - Could prioritize scan order to not exceed client idle times
- Try more workloads
  - May need better bin packing algorithms
Summary
- Ever-increasing number of analyses in the enterprise
  - Search, provenance, trending, clustering, classification, etc.
- Local scheduling to reduce resource contention on clients
  - Up to 60% performance improvement
- Two-phase scanning to reduce work and balance load
  - Delay analysis work to identify duplicate work
- Global scheduling to balance load
  - Reduced client impact by up to 40% on our workload
Getting a handle on enterprise data
- Unstructured information growing at XX per year
- Increasing number of needs for metadata
  - eDiscovery
  - Worker productivity and search
  - IT trending and historical analysis
- Lots of different analyses to perform
  - Term vectors, fingerprints, feature vectors, usage statistics, etc.
- Data is spread across file servers, web servers, email servers, laptops, desktops, backups, etc.
Where to perform analysis?
- On backups?
  - Not all data is backed up, encrypted, utilized
- On idle servers?
  - Requires data migration strategies, may break privacy
- On end nodes?
  - May interrupt foreground workloads, frustrate users
- All solutions desire minimizing work and balancing load to reduce required resources
The problems
- Most analysis tools run in isolation
  - Tools compete for resources locally, create interference
- Replicated data creates replicated work
  - Tools produce the same results in multiple locations
- Machines have different characteristics
  - Creation rates, performance, idle time, etc.
- Goal: perform analysis at the best time and place
Best place and time?
(Diagram: candidate machines A, B, C, and D for analyzing a given file)
Solution: Improve scheduling
- Local scheduler to coordinate analysis tasks
  - Single resource controller to prevent competition
- Global scheduler to single-instance analysis
  - Centralize decision of when and where to analyze
Local scheduling
- Prefetch thread reads data from disk once
- Analysis routines run in separate parallel threads
- Shared memory buffer distributes data to routines
(Diagram: prefetch thread filling a producer/consumer buffer that feeds the analysis threads)
Traditional: One-phase scanning
(Diagram: the client scans its files and sends metadata to the server's metadata store)
SCAN-Lite: Two-phase scanning
(Diagram: the client first sends content hashes; the server schedules a single scan and stores the resulting metadata)
Global scheduling
- Time broken into scheduling periods based on some freshness delay (max time until data scan)
- Starting each scheduling period, the global scheduler picks which client will scan which data
- First, schedule data that has met its freshness delay
  - Idle time, priorities, worst-fit, and ordering
- Second, schedule any possible additional data
  - Work-ahead
Idle time, priorities, and worst-fit
- For a given piece of data
  - Choose the set of machines that have available idle time
  - If none, then choose all machines
  - From that, choose the machines with the highest priority
  - From that, choose the machine with the most idle time
  - If none, choose the machine with the least client impact
Ordering
(Diagram: machines grouped by priority class, P1 and P2, with their available idle time)
- Assign each piece of data a number based on the number of machines at each priority class (sketch below)
- Order all data by its ordering number
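One way to encode such an ordering number; the packing scheme below is a guess, see the paper for the exact construction.

```python
from collections import Counter

def ordering_number(holder_priorities, base=100):
    """Count how many machines hold a copy in each priority class and
    pack the counts into a single integer, so data with the fewest
    placement options sorts, and gets scheduled, first."""
    counts = Counter(holder_priorities)
    return sum(n * base ** prio for prio, n in counts.items())

# A file held by two P1 machines and one P2 machine:
print(ordering_number([1, 1, 2]))  # -> 10200
# Most-constrained data first:
# data.sort(key=lambda d: ordering_number(d.holder_priorities))
```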
Work ahead
- Once all data that has met its freshness delay has been scheduled, assign additional data to any machines with available idle time
How to schedule
- First, schedule any work that will meet its freshness deadline during this scheduling period
- Second, schedule any additional work that will fit within the remaining idle time of clients
Local scheduling results
Local performance improvements
- What happens when one or more analysis routines try to improve performance?
  - For example, using direct I/O to reduce memory footprint, and thus impact on client workloads (sketch below)
- Seven Direct
  - Analysis programs implement direct I/O
- Unified Direct
  - SCAN-Lite implements direct I/O
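A Linux-only sketch of what "implements direct I/O" could look like; not SCAN-Lite's actual code. O_DIRECT requires page-aligned buffers and block-multiple request sizes, and an anonymous mmap provides the alignment.

```python
import mmap
import os

def read_direct(path, chunk_size=1 << 20):
    """Read a file with O_DIRECT, bypassing the page cache so scans
    don't evict the client's foreground working set. Linux-specific;
    error handling and the unaligned tail of a file (which may need
    a buffered-I/O fallback) are omitted for brevity."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    f = os.fdopen(fd, "rb", buffering=0)
    buf = mmap.mmap(-1, chunk_size)  # anonymous mmap => page-aligned
    try:
        while n := f.readinto(buf):
            yield bytes(buf[:n])
    finally:
        buf.close()
        f.close()
```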
Local scheduling with direct I/O