Title: SCAN-Lite: Enterprise-wide analysis on the cheap
SCAN-Lite: Enterprise-wide analysis on the cheap
- Craig Soules, Kimberly Keeton, Brad Morrey
Enterprise information management
- Search
- Clustering
- Provenance
- Classification
- IT Trending
- Virus scanning
(Diagram: clients feeding analysis results into a central metadata server)
Enterprise information management
(Diagram: the same clients and metadata server, with duplicate files on multiple machines)
Data is duplicated across machines! Duplicate analysis is wasted work
Issues
- Analysis programs conflict on clients
  - Contend for system resources (memory, disk)
- Clients repeat work
  - Duplicate files on multiple clients
- Client foreground workloads are impacted
  - Work exceeds available idle time on busy clients
Approaches
- Reduce resource contention
(Diagram: analysis programs coordinated on a single client)
Approaches
- Leverage duplication to balance client load
- Delay analysis to identify all duplicates
(Diagram: a global scheduler assigning work across clients)
Solutions
- Local scheduler
  - Coordinates analyses to reduce resource contention
  - Up to 60% improvement
- Global scheduler
  - Identifies duplicates to remove work
  - Balances load
  - 40% reduction in impact to foreground tasks
Local scheduling
- Traditionally, analyses are separate programs
  - Scheduling left to the operating system
  - Potentially at different times
- Each program identifies files to scan
- Each program opens and reads file data
(Diagram: each program independently reading the same files from disk)
Unified local scheduling
- Each analysis routine is a separate thread
- Control thread manages shared tasks
  - Identify files to scan, and open/read file data
- Shared memory buffer distributes file data (see the sketch below)
(Diagram: control thread reading from disk into a shared memory buffer consumed by the analysis threads)
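A minimal Python sketch of this structure, assuming hypothetical analysis callbacks ("routine0", "routine1" stand in for SCAN-Lite's real analysis routines). Bounded queues approximate the fixed shared memory buffer; the real system fans data out through shared memory rather than per-thread queues.

```python
import hashlib
import queue
import threading
from pathlib import Path

NUM_ROUTINES = 2    # stand-in for SCAN-Lite's seven analyses
CHUNK = 1 << 20     # 1 MiB read unit

def control_thread(paths, fanout):
    """Identify files, read each one from disk exactly once, and
    fan the data out to every analysis thread."""
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                for q in fanout:
                    q.put((path, chunk))
    for q in fanout:
        q.put(None)  # sentinel: no more data

def analysis_thread(name, q):
    """Consume the shared stream; hashing stands in for a real analysis."""
    digests = {}
    while (item := q.get()) is not None:
        path, chunk = item
        digests.setdefault(path, hashlib.sha1()).update(chunk)
    for path, h in digests.items():
        print(f"{name}: {path.name} -> {h.hexdigest()}")

if __name__ == "__main__":
    paths = list(Path(".").glob("*.txt"))  # example scan set
    queues = [queue.Queue(maxsize=8) for _ in range(NUM_ROUTINES)]
    threads = [threading.Thread(target=analysis_thread, args=(f"routine{i}", q))
               for i, q in enumerate(queues)]
    for t in threads:
        t.start()
    control_thread(paths, queues)  # producer
    for t in threads:
        t.join()
```

The bounded queues give backpressure: the control thread stalls when an analysis routine falls behind, just as a fixed-size shared buffer would.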
Local scheduling performance
- Ran a fitness test using 7 analysis routines
  - 42 data sets, each containing files of a fixed size
  - Ran both approaches over each data set
  - Calculated per-file elapsed scan time
  - Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
- Seven-at-once
  - Run each analysis routine separately at the same time
- Unified
  - SCAN-Lite's unified local scheduling approach
Elapsed time vs. CPU time
- Original fitness test used CPU time
  - Gave less variable performance curves for modeling
- Disk contention shows up in elapsed time
  - CPU time is multiplexed
  - Elapsed time is not
(Graph annotation: "This is very bad")
Local scheduling results
(Graph: per-file scan time, seven-at-once vs. unified)
17-60% improvement
Seven-at-once benefits from deep disk queues, but this hurts foreground apps
Small random I/Os have worse interaction than larger ones
Global scheduler
- Two goals
  - Reduce additional work from duplicate files
  - Utilize duplication to schedule work to the best client
- Two-phase scanning (phase one sketched below)
  - Phase one: identify duplicate files using content hashing
  - Phase two: analyze one copy at the appropriate client
  - Delaying between phase one and two provides opportunity for additional duplication and deletion
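A sketch of phase one, assuming a hypothetical `uploaded_hashes` list of (client, path, digest) triples gathered by the server; SHA-1 is the hash the trade-offs slide names.

```python
import hashlib
from collections import defaultdict

def content_hash(path, chunk_size=1 << 20):
    """Phase one on a client: hash file content in chunks,
    without reading the whole file into memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def group_duplicates(uploaded_hashes):
    """On the server: group (client, path) pairs by uploaded hash.
    Each group needs only one phase-two analysis, at whichever
    client the scheduler picks."""
    groups = defaultdict(list)
    for client, path, digest in uploaded_hashes:
        groups[digest].append((client, path))
    return groups
```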
Traditional scanning
(Diagram: every client scans all of its files and sends results to the server)
Phase one: Duplicate detection
(Diagram: clients upload content hashes to the server, which identifies duplicates)
Phase two: Scheduling
(Diagram: the server assigns each unique file to one client for analysis)
When to schedule
- Clients upload hashes each scheduling period
- The freshness specifies a deadline by which new data must be analyzed (see the sketch below)
(Timeline: scheduling periods along a time axis. Annotations: "Schedule before this period"; "Scheduling here gives one option"; "Scheduling here gives three options")
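As a small illustration of the deadline arithmetic (units and the helper name are assumptions, e.g. days):

```python
import math

def last_schedulable_period(created, freshness, period_len):
    """Latest scheduling period that still meets freshness: the data
    must be analyzed by created + freshness, so deferring up to that
    period maximizes the duplicates (options) known to the scheduler."""
    return math.floor((created + freshness) / period_len)

# Data created on day 0, freshness of 3 days, 1-day periods:
# it can be deferred through period 3, gaining placement options.
assert last_schedulable_period(0, 3, 1) == 3
```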
How to schedule
- Scheduling is a bin packing problem
  - Files are balls, clients are bins
  - Size of bins is available idle time
  - Color of balls/bins equates to location of duplicates
  - Size of balls is time required for analysis
How to schedule
- We use a greedy heuristic for scheduling (toy sketch below)
  - Consider idle time and machine priorities
- See paper for details
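A toy version of such a greedy, duplicate-aware packing. The selection rules loosely follow the later "Idle time, priorities, and worst-fit" slide (enough idle time, then highest priority, then worst-fit), but the details are simplified relative to the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Client:
    name: str
    idle_time: float       # bin size: available idle time
    priority: int = 0      # higher = preferred machine
    scheduled: float = 0.0

    @property
    def remaining(self) -> float:
        return self.idle_time - self.scheduled

@dataclass
class FileTask:
    cost: float            # ball size: estimated analysis time
    holders: List[Client]  # ball color: clients holding a duplicate

def greedy_schedule(tasks: List[FileTask]) -> Dict[int, str]:
    """Largest-task-first greedy packing: among the clients that hold
    a copy, prefer those with enough idle time, then the highest
    priority, then the most remaining idle time (worst-fit)."""
    assignment = {}
    for task in sorted(tasks, key=lambda t: -t.cost):
        fits = [c for c in task.holders if c.remaining >= task.cost]
        candidates = fits or task.holders   # fall back to all holders
        top = max(c.priority for c in candidates)
        candidates = [c for c in candidates if c.priority == top]
        chosen = max(candidates, key=lambda c: c.remaining)
        chosen.scheduled += task.cost
        assignment[id(task)] = chosen.name
    return assignment
```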
Work ahead
- Start by scheduling all work that meets freshness
- Schedule additional work on still-idle machines
  - Any remaining idle time can be used for additional work
  - We refer to this as work ahead (sketch below)
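Continuing the sketch above, work ahead can be a second greedy pass that only consumes idle time that is actually left over (an illustration, not the paper's algorithm):

```python
def work_ahead(extra_tasks, assignment):
    """Second pass: pack not-yet-due tasks into leftover idle time,
    never exceeding it, so work ahead cannot add client impact.
    Reuses the Client/FileTask sketch above."""
    for task in sorted(extra_tasks, key=lambda t: -t.cost):
        fits = [c for c in task.holders if c.remaining >= task.cost]
        if fits:
            chosen = max(fits, key=lambda c: c.remaining)
            chosen.scheduled += task.cost
            assignment[id(task)] = chosen.name
    return assignment
```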
Two-phase scanning: Trade-offs
(Diagram: phase one hashes every copy on every client; phase two analyzes only one copy)
Two-phase scanning: Trade-offs
- If the cost of hashing exceeds the additional work from duplicates, then one-phase scanning is better
- Analysis of hashing costs using SHA-1 indicates that 3% data duplication is the minimum (worked example below)
- Do we see that in practice?
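The break-even point falls out of a simple cost model. The throughput numbers below are made up for illustration; the paper's cost analysis is the authority.

```python
def breakeven_duplication(hash_cost, analysis_cost):
    """With duplicated fraction d of bytes, one-phase scanning costs
    a*B while two-phase costs h*B + a*(1-d)*B, so two-phase wins
    when d > h/a."""
    return hash_cost / analysis_cost

# Made-up example: SHA-1 at ~300 MB/s vs. combined analyses at ~10 MB/s
# puts the break-even point near the slide's 3% figure.
print(f"{breakeven_duplication(1 / 300, 1 / 10):.1%}")  # -> 3.3%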
Duplication in enterprise data
- Examined two data sources
  - 100 user home directories from a central server
  - 12 user productivity machines
- In both datasets, saw 10% duplication
  - Even more with system files, email servers, sharepoints, etc.
- This is sufficient duplication for work reduction
(Chart annotation: "4/7 duplication")
Global scheduling policies
- Traditional
  - One-phase scanning, scan all copies
- Rand
  - Two-phase scanning, random scheduling
- BestPlace
  - Two-phase scanning, greedy scheduling
- BestPlaceTime
  - Two-phase scanning, greedy scheduling with work ahead
- Opt
  - Unreplicated data only, delayed work ahead
Metrics
- Total Work
  - Total elapsed time spent on analysis and hashing
- Client Impact
  - Time spent that exceeded client idle time
- Metrics calculated for each day
- Summed over the entire simulation period (sketch below)
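A direct transcription of these definitions (function and parameter names are assumptions):

```python
def daily_metrics(task_times, idle_time):
    """One client, one day. Total Work is elapsed analysis + hashing
    time; Client Impact is whatever portion exceeded the idle time."""
    total_work = sum(task_times)
    impact = max(0.0, total_work - idle_time)
    return total_work, impact

def simulation_metrics(days):
    """Sum both metrics over every (task_times, idle_time) pair,
    i.e. over each client-day of the simulation."""
    work = impact = 0.0
    for task_times, idle_time in days:
        w, i = daily_metrics(task_times, idle_time)
        work, impact = work + w, impact + i
    return work, impact
```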
Experimental setup
- Implemented a simulator to test a variety of machine configurations and scheduling policies
- Config: 50 high-priority blades, 50 low-priority laptops
  - Blades were modeled after a dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
  - Laptops were modeled after a 2GHz Pentium M, 1.5GB RAM, 60GB SATA
- Simulated 30 days
  - Daily creation rates and layouts from traced workloads
  - Freshness of 3 days, scheduling period of 1 day
Total work
(Graph: total work under each scheduling policy)
Prefers faster blade machines over laptops, increasing their total work to reduce client impact
Doing work ahead of the freshness delay means analyzing files that would have been deleted
Removes duplicate work, reducing the total work done
Client impact
(Graph: client impact under each scheduling policy)
By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time
Choosing the best place helps hit the idle time targets, reducing average client impact
Less work means less impact
Theoretical OPT only 8% better than BestPlaceTime
Summary
- Reducing local scanning interference is critical
  - 17-60% improvement from reduced contention
- Two-phase scanning reduces analysis overheads
  - Reduces total work to near single-copy costs
  - Reduced client impact by up to 40% on our workload
Future work
- This is an initial system for reducing analysis costs
  - Many improvements remain!
- Vary freshness delays
  - Different applications may have different requirements
- Provide freshness and scan priorities to clients
  - Could prioritize scan order to not exceed client idle times
- Try more workloads
  - May need better bin packing algorithms
Summary
- Ever-increasing number of analyses in the enterprise
  - Search, provenance, trending, clustering, classification, etc.
- Local scheduling to reduce resource contention on clients
  - Up to 60% performance improvement
- Two-phase scanning to reduce work and balance load
  - Delay analysis work to identify duplicate work
- Global scheduling to balance load
  - Reduced client impact by up to 40% on our workload
Getting a handle on enterprise data
- Unstructured information growing at XX per year
- Increasing number of needs for metadata
  - eDiscovery
  - Worker productivity and search
  - IT trending and historical analysis
- Lots of different analyses to perform
  - Term vectors, fingerprints, feature vectors, usage statistics, etc.
- Data is spread across file servers, web servers, email servers, laptops, desktops, backups, etc.
Where to perform analysis?
- On backups?
  - Not all data is backed up, encrypted, utilized
- On idle servers?
  - Requires data migration strategies, may break privacy
- On end nodes?
  - May interrupt foreground workloads, frustrate users
- All solutions desire minimizing work and balancing load to reduce required resources
The problems
- Most analysis tools run in isolation
  - Tools compete for resources locally, create interference
- Replicated data creates replicated work
  - Tools produce the same results in multiple locations
- Machines have different characteristics
  - Creation rates, performance, idle time, etc.
- Goal: perform analysis at the best time and place
Best place and time?
(Diagram: candidate machines A, B, C, and D for analyzing a given file)
Solution: Improve scheduling
- Local scheduler to coordinate analysis tasks
  - Single resource controller to prevent competition
- Global scheduler to single-instance analysis
  - Centralize decision of when and where to analyze
Local scheduling
- Prefetch thread reads data from disk once
- Analysis routines run in separate parallel threads
- Shared memory buffer distributes data to routines
(Diagram: prefetch thread filling a producer/consumer buffer that feeds the analysis threads)
Traditional: One-phase scanning
(Diagram: the client scans its files and sends metadata to the server's metadata store)
SCAN-Lite: Two-phase scanning
(Diagram: the client first sends content hashes; the server schedules a single scan and stores the resulting metadata)
Global scheduling
- Time broken into scheduling periods based on some freshness delay (max time until data scan)
- Starting each scheduling period, the global scheduler picks which client will scan which data
- First, schedule data that has met its freshness delay
  - Idle time, priorities, worst-fit, and ordering
- Second, schedule any possible additional data
  - Work-ahead
Idle time, priorities, and worst-fit
- For a given piece of data
  - Choose the set of machines that have available idle time
  - If none, then choose all machines
  - From that, choose the machines with the highest priority
  - From that, choose the machine with the most idle time
  - If none, choose the machine with the least client impact
Ordering
(Diagram: machines grouped by priority class, P1 and P2, with their available idle time)
- Assign each piece of data a number based on the number of machines at each priority class (sketch below)
- Order all data by its ordering number
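One way to encode such an ordering number; the packing scheme below is a guess, see the paper for the exact construction.

```python
from collections import Counter

def ordering_number(holder_priorities, base=100):
    """Count how many machines hold a copy in each priority class and
    pack the counts into a single integer, so data with the fewest
    placement options sorts, and gets scheduled, first."""
    counts = Counter(holder_priorities)
    return sum(n * base ** prio for prio, n in counts.items())

# A file held by two P1 machines and one P2 machine:
print(ordering_number([1, 1, 2]))  # -> 10200
# Most-constrained data first:
# data.sort(key=lambda d: ordering_number(d.holder_priorities))
```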
Work ahead
- Once all data that has met its freshness delay has been scheduled, assign additional data to any machines with available idle time
How to schedule
- First, schedule any work that will meet its freshness deadline during this scheduling period
- Second, schedule any additional work that will fit within the remaining idle time of clients
Local scheduling results
Local performance improvements
- What happens when one or more analysis routines try to improve performance?
  - For example, using direct I/O to reduce memory footprint, and thus impact on client workloads (sketch below)
- Seven Direct
  - Analysis programs implement direct I/O
- Unified Direct
  - SCAN-Lite implements direct I/O
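A Linux-only sketch of what "implements direct I/O" could look like; not SCAN-Lite's actual code. O_DIRECT requires page-aligned buffers and block-multiple request sizes, and an anonymous mmap provides the alignment.

```python
import mmap
import os

def read_direct(path, chunk_size=1 << 20):
    """Read a file with O_DIRECT, bypassing the page cache so scans
    don't evict the client's foreground working set. Linux-specific;
    error handling and the unaligned tail of a file (which may need
    a buffered-I/O fallback) are omitted for brevity."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    f = os.fdopen(fd, "rb", buffering=0)
    buf = mmap.mmap(-1, chunk_size)  # anonymous mmap => page-aligned
    try:
        while n := f.readinto(buf):
            yield bytes(buf[:n])
    finally:
        buf.close()
        f.close()
```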
Local scheduling with direct I/O