Exploiting Clustering Techniques for Web Session Inference

1
Exploiting Clustering Techniques for Web Session
Inference
  • A. Bianco, G. Mardente, M. Mellia, M. Munafò, L.
    Muscariello
  • (Politecnico di Torino)

2
Outline
  • Web Session Model
  • Clustering techniques
  • The proposed algorithm
  • Performance of the algorithm
  • Session statistics

3
Web session definition
  • A single web client generates a succession of
    TCP flows and think times

[Figure: timeline of TCP flows separated by think times Toff]
  • A session is defined here as the set of TCP
    flows arriving close enough to one another
  • For example, a threshold can be used to
    discriminate between think times and
    inter-arrivals of TCP flows
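
The threshold idea in the last bullet can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `split_sessions`, the sample arrival times, and the 10-second threshold are all assumptions made for the example.

```python
def split_sessions(arrivals, threshold):
    """Group flow arrival times into sessions.

    A gap larger than `threshold` is treated as a think time (Toff)
    and opens a new session; smaller gaps stay inside the session.
    """
    sessions = []
    for t in sorted(arrivals):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)   # still inside the current session
        else:
            sessions.append([t])     # first flow, or gap exceeded the threshold
    return sessions


# Hypothetical arrival times in seconds.
flows = [0.0, 0.4, 1.1, 60.0, 60.3, 200.0]
print(split_sessions(flows, threshold=10.0))
# [[0.0, 0.4, 1.1], [60.0, 60.3], [200.0]]
```

The result depends entirely on the chosen threshold, which is the weakness the next slide raises.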

4
Algorithms
  • A threshold-based approach needs a priori
    knowledge of the source
  • An adaptive algorithm should be able to track
    traffic variations
  • Such an algorithm is expected to be less
    sensitive to traffic characteristics
  • Clustering is the chosen approach

5
Proposed algorithm
  • Three steps
  • K-means is run on all samples to obtain a
    first clustering; K is chosen very large
  • Hierarchical clustering is run only on the
    representatives of each cluster; K is reduced
  • K-means is run on all samples again
  • To test the algorithm we need traffic whose
    sessions are known a priori, so it is
    artificially generated

6
First step: K-means
  • K is chosen large enough but significantly
    smaller than the number of samples
  • The K farthest flows determine the first
    partition
  • K-means is run for 1000 iterations on all
    samples
  • Each cluster is then represented using a subset
    of samples, one or two in our algorithm
  • The mean value (Centroid method)
  • The g-th and (100-g)-th percentiles (Single
    linkage method if g = 0)
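
A minimal sketch of this first step on one-dimensional samples (flow arrival times) is given below. It is an interpretation under stated assumptions, not the authors' implementation: the greedy farthest-point seeding is one possible reading of "the K farthest flows", the percentiles use a simple nearest-rank rule, and all function names are hypothetical.

```python
def farthest_seeds(xs, k):
    """Greedy seeding (assumed reading of 'K farthest flows'): each new
    seed is the sample farthest from the seeds chosen so far."""
    seeds = [min(xs)]
    while len(seeds) < k:
        seeds.append(max(xs, key=lambda x: min(abs(x - s) for s in seeds)))
    return seeds


def kmeans_1d(xs, centers, iterations=1000):
    """Lloyd's K-means on scalars; stops early once the centers stop moving."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return [sorted(c) for c in clusters if c]   # drop empty clusters


def representatives(cluster, g=None):
    """Centroid if g is None; otherwise the g-th and (100-g)-th
    percentiles of a sorted cluster (nearest-rank rule, an assumption)."""
    if g is None:
        return [sum(cluster) / len(cluster)]
    n = len(cluster)
    low = cluster[round(g / 100 * (n - 1))]
    high = cluster[round((100 - g) / 100 * (n - 1))]
    return [low, high]


samples = [0.0, 0.5, 1.0, 50.0, 50.5, 100.0]
clusters = kmeans_1d(samples, farthest_seeds(samples, 3))
print([representatives(c, g=0) for c in clusters])
```

With g = 0 the two representatives are the minimum and maximum of each cluster, which is why distances between representatives then behave like single linkage.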

[Figure: a cluster with its g-th percentile and (100-g)-th percentile representatives]

7
Second step: a hierarchical method
  • A hierarchical method is applied only to the
    representatives
  • This method merges clusters until a quality
    function determines that the optimal number of
    clusters Nc has been found
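
The merging loop can be sketched as follows for clusters that lie on a time axis. The slides do not define the quality function, so this example substitutes a stand-in rule (stop before the largest jump in merge distance); that rule, the interval representation, and the function names are assumptions for illustration only.

```python
def agglomerate(intervals):
    """Merge adjacent 1-D clusters, closest pair first.

    `intervals` holds one (low, high) representative pair per cluster.
    Returns every intermediate partition together with the distance of
    the merge that would come next (None for the final single cluster).
    """
    parts = sorted(intervals)
    levels = []
    while len(parts) > 1:
        gaps = [parts[i + 1][0] - parts[i][1] for i in range(len(parts) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        levels.append((list(parts), gaps[i]))
        parts[i:i + 2] = [(parts[i][0], parts[i + 1][1])]   # merge the pair
    levels.append((list(parts), None))
    return levels


def pick_partition(levels):
    """Stand-in quality rule (NOT the slides' function): keep the
    partition reached just before the largest jump in merge distance."""
    dists = [d for _, d in levels[:-1]]
    jumps = [dists[i + 1] - dists[i] for i in range(len(dists) - 1)]
    if not jumps:
        return levels[0][0]          # too few clusters to compare merges
    best = max(range(len(jumps)), key=jumps.__getitem__)
    return levels[best + 1][0]


# Hypothetical representatives from the first step.
reps = [(0.0, 1.0), (1.5, 2.0), (50.0, 51.0), (51.2, 52.0), (100.0, 100.0)]
print(pick_partition(agglomerate(reps)))
# [(0.0, 2.0), (50.0, 52.0), (100.0, 100.0)]
```

Because only the representatives are compared, the cost of this step depends on K rather than on the number of samples.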

8
Gamma function: typical behaviour

9
Third step: K-means
  • K-means is run on all samples
  • This last step is not critical, but it rearranges
    the positions of samples within clusters, that is,
    of flows within sessions
  • It is not CPU-intensive, so running it is not
    a burden

10
Performance evaluation
  • Artificial traffic is generated according to an
    ON/OFF process
  • During ON periods a succession of flows is
    generated using i.i.d. inter-arrivals
  • In this model, inference means deciding whether
    an inter-arrival is an OFF period or an
    inter-arrival between flows within an ON period
  • Every time the algorithm does not guess
    correctly, an error is counted
  • Suppose all variables are exponentially
    distributed
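
The evaluation setup can be sketched as below, using a plain threshold as the classifier under test. This is a simplified illustration: the fixed number of flows per session, the mean values, and the function names are assumptions, whereas the slides only state that all variables are exponentially distributed.

```python
import random


def generate_trace(n_sessions, flows_per_session, mean_tarr, mean_toff, seed=0):
    """Return (arrivals, labels). labels[i] is True when the gap between
    arrival i and arrival i+1 is an OFF period (think time)."""
    rng = random.Random(seed)
    t, arrivals, labels = 0.0, [0.0], []
    for s in range(n_sessions):
        for _ in range(flows_per_session - 1):
            t += rng.expovariate(1 / mean_tarr)   # gap inside an ON period
            arrivals.append(t)
            labels.append(False)
        if s < n_sessions - 1:
            t += rng.expovariate(1 / mean_toff)   # OFF period between sessions
            arrivals.append(t)
            labels.append(True)
    return arrivals, labels


def count_errors(arrivals, labels, threshold):
    """Count gaps whose guessed type (OFF if gap > threshold) is wrong."""
    guesses = [b - a > threshold for a, b in zip(arrivals, arrivals[1:])]
    return sum(g != truth for g, truth in zip(guesses, labels))


arrivals, labels = generate_trace(100, 5, mean_tarr=1.0, mean_toff=60.0)
for thr in (1.0, 5.0, 20.0):
    print(thr, count_errors(arrivals, labels, thr) / len(labels))
```

The same labels can score any classifier, for example the sessions produced by the clustering steps, by comparing its session boundaries with the True entries.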

11
First step sensitivity (1/2)
  • If the initial number of clusters is chosen large
    enough the method is less error prone
  • The algorithm is much more sensitive to the value
    of the idle period

12
First step sensitivity (2/2)
  • Performance is sensitive to the choice of the
    percentile g
  • When clusters are represented through flows at
    the border of the session the method is less
    sensitive to traffic, i.e. g1
  • This is because the cluster has a long,
    narrow shape, which those representatives
    model well

13
Comparison with threshold-based algorithms:
exponential case
  • Threshold-based algorithms work well if traffic
    characteristics are known
  • But they are very sensitive to the threshold
    value
  • If sessions are already well clustered, because
    idle periods are large enough compared to flow
    inter-arrivals, our algorithm performs very well

14
Comparison with threshold-based algorithms:
Pareto case
  • Threshold-based algorithms work well if traffic
    characteristics are known
  • But they are very sensitive to the threshold
    value
  • If sessions are already well clustered, because
    idle periods are large enough compared to flow
    inter-arrivals, our algorithm performs very well

15
Some statistics on aggregated sessions
  • The session sizes are heavy tailed (broadly)
  • Usually each session is made of a few TCP flows
  • The definition of flow termination is not that
    important

16
Some statistics on aggregated sessions
  • Similar results concerning server to client and
    client to server data
  • Similar distribution law, asymmetries in volume
    only

17
Flows and sessions inter-arrivals
  • The method infers sessions that are similar even
    when considering very different traces
  • Tarr and Toff are well identified

18
Conclusions
  • Clustering techniques can easily be used to
    infer web sessions
  • The proposed algorithm is a mix of known
    clustering approaches
  • It is able to deal with huge amounts of data
  • Sessions seem to be very well recognized