File Classification in self-* storage systems

Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer


1
File Classification in self-* storage systems
  • Michael Mesnier, Eno Thereska, Gregory R. Ganger,
    Daniel Ellard, Margo Seltzer

2
Introduction
  • Self-* infrastructure needs information about:
    • Users
    • Applications
    • Policies
  • This information is not readily provided, and the system
    cannot depend on users or applications to supply it
  • So it must be learned

3
Self-* storage systems
  • A sub-problem of the self-* infrastructure
  • Key: get hints from what creators associate with
    their files:
    • File size
    • File names
    • Lifetimes
  • Once intentions are determined, decisions can be made
  • Result: better file organization and performance

4
Classifying Files
  • Current: rule-of-thumb policy selection
    • Generic, not optimized
  • Better: distinguish classes of files
    • Finer-grained policies
    • Ideally assigned at file creation
  • Classes must therefore be determined at creation
  • Self-* systems must learn this association from
    1) traces or 2) a running file system

5
So, how?
  • Create a model that classifies files based on (some)
    attributes:
    • Name
    • Owner
    • Permissions
  • Irrelevant attributes must be filtered out
  • The classifier must learn rules to do so
  • Based on a training set
  • Then inference happens
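
The attributes listed on this slide are all available when a file is created. As a minimal sketch of how they could be turned into classifier input, the hypothetical helper below (`creation_attributes`, together with its field names, is an assumption for illustration and not part of the presented system) splits a path into name components and normalizes the permission bits:

```python
from pathlib import PurePosixPath

def creation_attributes(path, uid, mode):
    """Collect attributes that are known at file-creation time.

    `path`, `uid` and `mode` mirror what a create request typically carries;
    the returned field names are illustrative only.
    """
    p = PurePosixPath(path)
    return {
        "basename": p.stem,                              # name without extension
        "extension": p.suffix.lstrip(".") or "(none)",   # last name component
        "directory": str(p.parent),
        "owner": uid,
        "permissions": format(mode & 0o777, "03o"),      # e.g. "644"
    }

attrs = creation_attributes("/home/alice/build/main.o", uid=1001, mode=0o100644)
print(attrs)
```

Which of these fields actually matter is left to the learner; the point of the slide is that irrelevant ones should be filtered out by the classifier rather than by hand.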

6
The right model
  • The model must be:
    • Scalable
    • Dynamic
    • Cost-sensitive (mis-prediction cost)
    • Interpretable (human)
  • Model selected: decision trees

7
ABLE
  • Attribute-Based Learning Environment
  • 1. Obtain traces
  • 2. Build a decision tree
  • 3. Make predictions
  • The tree is built top down, until all attributes are used
  • The sample is split until the files in each leaf have
    similar attributes
  • Once the tree is built, querying begins
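
The three steps above can be sketched with a small top-down decision tree. This is a generic entropy-based (ID3-style) learner written for illustration, not ABLE's actual implementation; the trace records, attribute names and the "size" classes are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, attributes, target):
    """Split top down until a leaf is pure or all attributes are used."""
    labels = [s[target] for s in samples]
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)

    def remainder(attr):
        # Weighted entropy left after splitting on attr (lower is better).
        groups = {}
        for s in samples:
            groups.setdefault(s[attr], []).append(s[target])
        return sum(len(g) / len(samples) * entropy(g) for g in groups.values())

    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    node = {"attr": best, "default": majority(labels), "children": {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        node["children"][value] = build_tree(subset, rest, target)
    return node

def predict(tree, sample):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(sample[tree["attr"]], tree["default"])
    return tree

# Step 1: a hypothetical trace (attributes at creation plus an observed class).
trace = [
    {"ext": "o",   "owner": "alice", "size": "small"},
    {"ext": "o",   "owner": "bob",   "size": "small"},
    {"ext": "iso", "owner": "alice", "size": "large"},
    {"ext": "iso", "owner": "bob",   "size": "large"},
    {"ext": "log", "owner": "alice", "size": "small"},
    {"ext": "log", "owner": "bob",   "size": "large"},
]
tree = build_tree(trace, ["ext", "owner"], "size")    # Step 2
print(predict(tree, {"ext": "o", "owner": "carol"}))  # Step 3
```

In this toy trace the extension is the most informative attribute, so it becomes the root; the owner is consulted only for `log` files, where the extension alone does not separate the classes.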

8
Tests
  • Based on several systems, to make sure the approach is
    workload-independent:
    • DEAS03
    • EECS03
    • CAMPUS
    • LAB
  • The control, the MODE algorithm, places all files in
    a single cluster
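
A control of this kind can be sketched in a few lines, reading "a single cluster" as always predicting the most common class seen in training (that reading is an assumption). The labels and test samples below are invented for illustration.

```python
from collections import Counter

def mode_baseline(train_labels):
    """Return a predictor that ignores attributes and always answers the mode."""
    mode = Counter(train_labels).most_common(1)[0][0]
    return lambda _attributes: mode

def accuracy(predict, samples):
    """Fraction of (attributes, label) pairs predicted correctly."""
    return sum(predict(attrs) == label for attrs, label in samples) / len(samples)

train_labels = ["short", "short", "short", "long"]
test_samples = [
    ({"ext": "tmp"}, "short"),
    ({"ext": "log"}, "long"),
    ({"ext": "tmp"}, "short"),
    ({"ext": "o"},   "short"),
]

baseline = mode_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_samples)
print(baseline_accuracy)
```

An attribute-based classifier is only worth its complexity if it beats this number on the same test samples.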

9
Results
  • Prediction results are quite good
  • 90-100% accuracy claimed
  • Clustering of files by attributes is clear
  • Prediction: a model's ruleset will converge over
    time

10
Benefits of incremental learning
  • Dynamically refines model as samples become
    available
  • Generally better than one-shot learners
  • Sometimes one-shot performs poorly
  • Rulesets of incremental learners are smaller

11
On accuracy
  • More attributes: greater chance of over-fitting
    • More rules -> smaller compression ratios
    • Loses compression benefits
  • Predictive models can make false predictions
    • These can impact performance
    • E.g., data that should be in RAM is placed on disk
      instead
  • Solution: cost functions
    • Penalize errors
    • Create a biased tree
    • System goals will need to be translated into cost
      functions
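
One simple way to make predictions cost-sensitive is to choose, at each leaf, the class with the lowest expected mis-prediction cost instead of the majority class. The sketch below applies the cost at prediction time, which is a simplification of biasing the tree during construction; the class names and cost values are assumptions chosen to mirror the RAM-versus-disk example.

```python
def min_cost_prediction(class_counts, cost):
    """Pick the class with the lowest expected cost at a leaf.

    class_counts: {actual_class: number of training files in the leaf}
    cost: {(actual_class, predicted_class): penalty}, 0 for correct predictions
    """
    total = sum(class_counts.values())

    def expected_cost(predicted):
        return sum(count / total * cost[(actual, predicted)]
                   for actual, count in class_counts.items())

    return min(class_counts, key=expected_cost)

leaf = {"cold": 7, "hot": 3}   # most files in this leaf are rarely accessed

# Symmetric costs: every error is equally bad, so the majority class wins.
symmetric = {("cold", "cold"): 0, ("hot", "hot"): 0,
             ("cold", "hot"): 1, ("hot", "cold"): 1}

# Biased costs: putting a hot file on disk is ten times worse than
# keeping a cold file in RAM.
biased = {("cold", "cold"): 0, ("hot", "hot"): 0,
          ("cold", "hot"): 1, ("hot", "cold"): 10}

print(min_cost_prediction(leaf, symmetric))
print(min_cost_prediction(leaf, biased))
```

With the biased costs the expected penalty of predicting "cold" is 0.3 × 10 = 3.0, against 0.7 × 1 = 0.7 for "hot", so the leaf flips its answer even though "cold" is the majority. Choosing those penalties is where system goals have to be translated into numbers.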

12
Conclusion
  • These trees provide prediction accuracies in the
    90% range
  • Adaptable via incremental learning
  • Continued work: integration into self-* infrastructure

13
Questions?