KDD Cup - PowerPoint PPT Presentation

Slides: 14
Provided by: clif8

1
KDD Cup '99 Classifier Learning: Predictive
Model for Intrusion Detection
  • Charles Elkan
  • 1999 Conference on Knowledge Discovery and Data
    Mining
  • Presented by Chris Clifton

2
KDD Cup Overview
  • Held annually in conjunction with the Knowledge
    Discovery and Data Mining Conference (now
    ACM-sponsored)
  • Challenge problem(s) released well before the
    conference
  • Goal is to give the best solution to the problem
  • Relatively informal contest
  • Gives a standard test for comparing techniques
  • Winner announced at the KDD conference
  • Winner receives considerable recognition

3
Classifier Learning for Intrusion Detection
  • One of two KDD '99 challenge problems
  • Other was a knowledge discovery problem
  • Goal is to learn a classifier that labels TCP/IP
    connections as intrusion or normal
  • Data: Collection of features describing a TCP
    connection
  • Class: Non-attack or type of attack
  • Scoring: Cost per test sample
  • Wrong answers penalized according to the type
    of error

4
Data: TCP connection information
  • Dataset developed for 1998 DARPA Intrusion
    Detection Evaluation Program
  • Nine weeks of raw TCP dump data from simulated
    USAF LAN
  • Simulated attacks to give positive examples
  • Processed into 5 million training connections,
    2 million test
  • Some attributes derived from raw data
  • Twenty-four attack types in training data,
    grouped into four classes
  • DOS: denial-of-service, e.g., SYN flood
  • R2L: unauthorized access from a remote machine,
    e.g., guessing a password
  • U2R: unauthorized access to local superuser
    (root) privileges, e.g., various "buffer
    overflow" attacks
  • probing: surveillance and other probing, e.g.,
    port scanning
  • Test set includes fourteen attack types not found
    in training set
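
The four-class grouping can be sketched as a lookup from a raw attack label to its class. This is only an illustration: the label strings (neptune, guess_passwd, buffer_overflow, portsweep) are assumed from the publicly distributed KDD Cup '99 files and do not appear in the slides, the table covers only the four examples named above, and the "unknown" fallback for attack types missing from the training data is a hypothetical choice.

```python
# Illustrative mapping for the four example attacks named on the slide.
# ASSUMPTION: label strings follow the public KDD Cup '99 data files;
# they are not given in the slides themselves.
ATTACK_CLASS = {
    "neptune": "DOS",          # SYN flood
    "guess_passwd": "R2L",     # password guessing
    "buffer_overflow": "U2R",  # buffer overflow
    "portsweep": "probing",    # port scanning
}


def to_class(label: str) -> str:
    """Map a raw connection label to normal, one of the four classes, or unknown."""
    label = label.strip().rstrip(".")  # assumed: raw labels may end with a period
    if label == "normal":
        return "normal"
    # Hypothetical fallback: attack types not seen in training have no entry.
    return ATTACK_CLASS.get(label, "unknown")


if __name__ == "__main__":
    for raw in ["normal.", "neptune.", "some_new_attack."]:
        print(raw, "->", to_class(raw))
```

The fallback matters because the test set contains attack types that never occur in the training data, so a fixed lookup built from training labels cannot cover them.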

5
Basic features of individual TCP connections
feature name    description                                                   type
duration        length (number of seconds) of the connection                  continuous
protocol_type   type of the protocol, e.g. tcp, udp, etc.                     discrete
service         network service on the destination, e.g., http, telnet, etc.  discrete
src_bytes       number of data bytes from source to destination               continuous
dst_bytes       number of data bytes from destination to source               continuous
flag            normal or error status of the connection                      discrete
land            1 if connection is from/to the same host/port; 0 otherwise    discrete
wrong_fragment  number of "wrong" fragments                                   continuous
urgent          number of urgent packets                                      continuous

6
Content features within a connection suggested by
domain knowledge
feature name        description                                            type
hot                 number of "hot" indicators                             continuous
num_failed_logins   number of failed login attempts                        continuous
logged_in           1 if successfully logged in; 0 otherwise               discrete
num_compromised     number of "compromised" conditions                     continuous
root_shell          1 if root shell is obtained; 0 otherwise               discrete
su_attempted        1 if "su root" command attempted; 0 otherwise          discrete
num_root            number of "root" accesses                              continuous
num_file_creations  number of file creation operations                     continuous
num_shells          number of shell prompts                                continuous
num_access_files    number of operations on access control files           continuous
num_outbound_cmds   number of outbound commands in an ftp session          continuous
is_hot_login        1 if the login belongs to the "hot" list; 0 otherwise  discrete
is_guest_login      1 if the login is a "guest" login; 0 otherwise         discrete

7
Traffic features computed using a two-second time
window
feature name        description                                                                                  type
count               number of connections to the same host as the current connection in the past two seconds     continuous
Note: The following features refer to these same-host connections.
serror_rate         % of connections that have "SYN" errors                                                      continuous
rerror_rate         % of connections that have "REJ" errors                                                      continuous
same_srv_rate       % of connections to the same service                                                         continuous
diff_srv_rate       % of connections to different services                                                       continuous
srv_count           number of connections to the same service as the current connection in the past two seconds  continuous
Note: The following features refer to these same-service connections.
srv_serror_rate     % of connections that have "SYN" errors                                                      continuous
srv_rerror_rate     % of connections that have "REJ" errors                                                      continuous
srv_diff_host_rate  % of connections to different hosts                                                          continuous
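
A minimal sketch of how some of these window features could be derived from a connection log. The Conn record and its field names are hypothetical, rates are returned as fractions between 0 and 1, and the current connection itself is assumed not to be counted; the slides do not specify any of these details.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Conn:
    """Hypothetical minimal connection record (not the contest's file format)."""
    time: float     # start time in seconds
    dst_host: str   # destination host
    service: str    # destination service, e.g. "http"


def traffic_features(history: List[Conn], current: Conn,
                     window: float = 2.0) -> Dict[str, float]:
    """Compute a subset of the two-second-window features for `current`."""
    # Earlier connections that started within the past `window` seconds.
    recent = [c for c in history if 0.0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]

    count = len(same_host)
    srv_count = len(same_srv)

    def rate(hits: int, total: int) -> float:
        # ASSUMPTION: a rate over an empty window is reported as 0.0.
        return hits / total if total else 0.0

    same_service_hits = sum(c.service == current.service for c in same_host)
    diff_host_hits = sum(c.dst_host != current.dst_host for c in same_srv)
    return {
        "count": count,
        "same_srv_rate": rate(same_service_hits, count),
        "diff_srv_rate": rate(count - same_service_hits, count),
        "srv_count": srv_count,
        "srv_diff_host_rate": rate(diff_host_hits, srv_count),
    }


if __name__ == "__main__":
    log = [Conn(0.0, "a", "http"), Conn(9.0, "a", "http"),
           Conn(9.5, "a", "ftp"), Conn(9.8, "b", "http")]
    print(traffic_features(log, Conn(10.0, "a", "http")))
```

The error-rate features (serror_rate, rerror_rate and their srv_ variants) would follow the same pattern, given a per-connection flag field.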

8
Scoring
  • Each prediction gets a score
  • Row is correct answer
  • Column is prediction made
  • Score is average over all predictions

        normal  probe  DOS  U2R  R2L
normal       0      1    2    2    2
probe        1      0    2    2    2
DOS          2      1    0    2    2
U2R          3      2    2    0    2
R2L          4      2    2    2    0
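
The scoring rule can be sketched as a table lookup followed by an average. The cost values are copied from the matrix above (row is the correct answer, column is the prediction); the function name and the list-based input format are assumptions made for illustration.

```python
from typing import List

CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]

# COST[actual][predicted], copied from the slide's cost matrix.
COST = [
    [0, 1, 2, 2, 2],  # actual normal
    [1, 0, 2, 2, 2],  # actual probe
    [2, 1, 0, 2, 2],  # actual DOS
    [3, 2, 2, 0, 2],  # actual U2R
    [4, 2, 2, 2, 0],  # actual R2L
]
INDEX = {name: i for i, name in enumerate(CLASSES)}


def average_cost(actual: List[str], predicted: List[str]) -> float:
    """Average cost per test sample; lower is better."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    total = sum(COST[INDEX[a]][INDEX[p]] for a, p in zip(actual, predicted))
    return total / len(actual)


if __name__ == "__main__":
    truth = ["normal", "DOS", "R2L", "U2R"]
    guess = ["normal", "probe", "normal", "U2R"]
    # Per-sample costs are 0, 1, 4, 0, so the average is 5 / 4.
    print(average_cost(truth, guess))
```

Note the asymmetry: labeling an R2L attack as normal costs 4, while labeling normal traffic as an attack costs at most 2.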

9
Results
  • Twenty-four entries, scores: 0.2331 0.2356 0.2367
    0.2411 0.2414 0.2443 0.2474 0.2479 0.2523 0.2530
    0.2531 0.2545 0.2552 0.2575 0.2588 0.2644 0.2684
    0.2952 0.3344 0.3767 0.3854 0.3899 0.5053 0.9414
  • 1-Nearest Neighbor scored 0.2523

10
Winning Method: Bagged Boosting
  • Submitted by Bernhard Pfahringer, ML Group,
    Austrian Research Institute for AI
  • 50 samples drawn from the original set of
    roughly 5 million examples
  • Contrary to standard bagging, the sampling was
    slightly biased
  • Each sample kept all of the examples of the two
    smallest classes, U2R and R2L
  • plus 4000 PROBE, 80000 NORMAL, and 400000 DOS
    examples
  • Duplicate entries in the original data set were
    removed
  • Ten C5 decision trees induced from each sample
  • Used both C5's error-cost and boosting options
  • Final predictions computed from the 50 single
    predictions (one per training sample) by
    minimizing conditional risk
  • Minimizes the sum of error-costs times
    class-probabilities
  • Training took approximately 1 day on a
    two-processor 200 MHz Sparc
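
Two of the steps above, the biased sampling and the minimum-conditional-risk decision, can be sketched as follows. This is not the winning entry's code: sampling without replacement, averaging the per-model class probabilities, and all function names are assumptions, and C5 itself is not reproduced. The per-class sizes and the cost matrix are taken from the slides.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]

# COST[actual][predicted], from the contest's scoring matrix.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]

# Per-class sample sizes from the slide; None means "keep every example".
SAMPLE_SIZES: Dict[str, Optional[int]] = {
    "U2R": None, "R2L": None, "probe": 4000, "normal": 80000, "DOS": 400000,
}


def biased_sample(by_class: Dict[str, list],
                  sizes: Dict[str, Optional[int]],
                  rng: random.Random) -> list:
    """Draw one training sample with a fixed quota per class."""
    sample = []
    for cls, examples in by_class.items():
        quota = sizes.get(cls)
        if quota is None or quota >= len(examples):
            sample.extend(examples)  # rare classes are kept whole
        else:
            # ASSUMPTION: drawn without replacement; the slides do not say.
            sample.extend(rng.sample(examples, quota))
    return sample


def min_risk_class(model_probs: Sequence[Sequence[float]]) -> Tuple[str, List[float]]:
    """Pick the class with the lowest expected cost.

    model_probs holds one probability vector over CLASSES per model.
    ASSUMPTION: the models are combined by averaging their probabilities.
    """
    if not model_probs:
        raise ValueError("need at least one model prediction")
    n = len(CLASSES)
    avg = [sum(p[i] for p in model_probs) / len(model_probs) for i in range(n)]
    # Risk of predicting k = sum over true classes i of P(i) * COST[i][k].
    risks = [sum(avg[i] * COST[i][k] for i in range(n)) for k in range(n)]
    best = min(range(n), key=risks.__getitem__)
    return CLASSES[best], risks


if __name__ == "__main__":
    label, risks = min_risk_class([[0.6, 0.0, 0.0, 0.0, 0.4]])
    print(label, risks)
```

With 60% probability on normal and 40% on R2L, the decision is still R2L, because missing an R2L attack costs 4 while a false alarm costs 2. This is how the error costs shift predictions toward the rare classes.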

11
Confusion Matrix (Breakdown of score)

12
Analysis of winning entry
  • Result comparable to 1-NN except on rare
    classes
  • Winner's training sample was biased toward rare
    classes
  • Does this give us a general principle?
  • Misses badly for some attack categories
  • True for 1-NN as well
  • Problem with feature set?

13
Second and Third Places (Differences probably not
statistically significant)
  • Itzhak Levin, LLSoft, Inc.: Kernel Miner
  • Link broken?
  • Vladimir Miheev, Alexei Vopilov, and Ivan
    Shabalin, MP13, Moscow, Russia
  • Verbal rules constructed by an expert
  • First echelon of voting decision trees
  • Second echelon of voting decision trees
  • Steps applied sequentially
  • Branch to the next step occurs whenever the
    current one has failed to recognize the
    connection
  • Trees constructed using their own (previously
    developed) tree learning algorithm
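
The sequential structure of the third-place entry can be sketched as a cascade in which each step either returns a label or passes the connection on. Everything here is hypothetical: the step functions, the majority-vote threshold used to decide that a voting step "failed to recognize" a connection, and the default label when every step fails are placeholders, since the slides give none of these details.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Connection = Dict[str, object]
Step = Callable[[Connection], Optional[str]]


def voting_step(trees: List[Callable[[Connection], str]],
                min_agreement: float) -> Step:
    """Build a step that answers only when enough trees agree (hypothetical rule)."""
    def step(conn: Connection) -> Optional[str]:
        votes = Counter(tree(conn) for tree in trees)
        label, n = votes.most_common(1)[0]
        return label if n / len(trees) >= min_agreement else None
    return step


def classify_in_steps(conn: Connection, steps: List[Step],
                      default: str = "normal") -> str:
    """Try each step in order; move on whenever a step returns None."""
    for step in steps:
        label = step(conn)
        if label is not None:
            return label
    # ASSUMPTION: the fallback when no step recognizes the connection.
    return default


if __name__ == "__main__":
    # Placeholder steps standing in for expert rules and two tree echelons.
    expert_rules: Step = lambda c: None
    first = voting_step([lambda c: "probe", lambda c: "DOS"], 0.75)
    second = voting_step([lambda c: "probe", lambda c: "probe"], 0.75)
    print(classify_in_steps({}, [expert_rules, first, second]))
```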