Title: MULTICLASS SUPPORT VECTOR MACHINES: A COMPARATIVE STUDY OF KERNELS Author: Comp1 Last modified by: NSharma Created Date: 12/8/2006 10:42:33 AM – PowerPoint PPT presentation

Slides: 53
Transcript and Presenter's Notes

1
 
  • Data Mining Techniques for Malware Detection
  • R. K. Agrawal
  • School of Computer and Systems Sciences
  • Jawaharlal Nehru University
  • New Delhi-110067

2
Outline
  • Data Mining
  • Classification
  • Clustering
  • Association Rules
  • Experimental Results
  • Conclusion and Future Work

3
Motivation: "Necessity is the Mother of Invention"
  • Data explosion problem
  • Automated data collection tools lead to
    tremendous amounts of data stored in databases
    and other information repositories
  • We are drowning in data, but starving for
    knowledge!
  • Solution: data mining
  • Extraction of interesting knowledge (rules,
    regularities, patterns, constraints) from data
    in large databases

4
Commercial Viewpoint
  • Lots of data is being collected and warehoused
  • Web data, e-commerce
  • purchases at department/grocery stores
  • Bank/Credit Card transactions
  • Computers have become cheaper and more powerful
  • Competitive Pressure is Strong
  • Provide better, customized services for an edge
    (e.g. in Customer Relationship Management)

5
Scientific Viewpoint
  • Data collected and stored at enormous speeds
    (GB/hour)
  • remote sensors on a satellite
  • Network related Log files
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of
    data
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
  • in classifying and segmenting data
  • in Hypothesis Formation

6
What Is Data Mining?
  • Data mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    information or patterns from data in large
    databases
  • Alternative names
  • Knowledge discovery (mining) in databases (KDD),
    knowledge extraction, data/pattern analysis, data
    archeology, business intelligence, etc.

7
Data Mining Tasks
  • Prediction Tasks
  • Use some variables to predict unknown or future
    values of other variables
  • Description Tasks
  • Find human-interpretable patterns that describe
    the data.
  • Common data mining tasks
  • Classification (predictive)
  • Clustering (descriptive)
  • Association Rule Discovery (descriptive)
  • Sequential Pattern Discovery (descriptive)
  • Regression (predictive)
  • Deviation Detection (predictive)

8
Classification: Definition
  • Given a collection of records (training set)
  • Each record contains a set of attributes; one of
    the attributes is the class label.
  • Find a model for the class attribute as a function
    of the values of the other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.

9
Classification: A Two-Step Process
  • Model construction: describing a set of
    predetermined classes
  • Each tuple/sample is assumed to belong to a
    predefined class, as determined by the class
    label attribute
  • The set of tuples used for model construction is
    the training set
  • The model is represented as classification rules,
    decision trees, or mathematical formulae
  • Model usage: for classifying future or unknown
    objects
  • Estimate the accuracy of the model
  • The known label of each test sample is compared with
    the classified result from the model
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model
  • If the accuracy is acceptable, use the model to
    classify data tuples whose class labels are not
    known

10
Process (1): Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
11
Process (2): Using the Model in Prediction
(Jolly Professor, 5)
Tenured?
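
To make the two steps concrete, here is a minimal sketch that applies the rule from the model-construction slide to an unseen record. It assumes the record "(Jolly Professor, 5)" means rank "Professor" with 5 years; the function name `tenured` is made up for the example.

```python
def tenured(rank: str, years: int) -> str:
    """Apply the learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Assumed reading of the unseen record: rank = Professor, years = 5.
print(tenured("Professor", 5))        # rank matches, so the rule fires
print(tenured("Assistant Prof", 3))   # neither condition holds
```

The first record is classified as tenured because of its rank alone; the years condition is not needed.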
12
Classification: Application
  • Malware Detection
  • Goal: predict whether a given binary is malware
    or not.
  • Approach
  • Use both kinds of binaries (normal and malware)
  • Learn a model for the class of the binaries.
  • Use this model to detect malware by observing a
    binary.

13
Clustering: Definition
  • Given a set of data points, each having a set of
    attributes, and a similarity measure among them,
    find clusters such that
  • Data points in one cluster are more similar to
    one another.
  • Data points in separate clusters are less similar
    to one another.
  • Similarity Measures
  • Euclidean Distance if attributes are continuous.
  • Other Problem-specific Measures.

14
Illustrating Clustering
  • Euclidean Distance Based Clustering in 3-D space.

Intracluster distances are minimized
Intercluster distances are maximized
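
As a rough illustration of the two criteria above, the following sketch computes the average Euclidean distance within and between two clusters of 3-D points. The points and their grouping are made up for the example, and no clustering algorithm is run; the sketch only measures a grouping that is already given.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical 3-D points, already grouped into two clusters.
cluster_a = [(1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.8, 1.1, 0.9)]
cluster_b = [(5.0, 5.0, 5.0), (5.1, 4.8, 5.2), (4.9, 5.2, 4.7)]

def mean_intracluster(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_intercluster(c1, c2):
    """Average distance over all pairs with one point taken from each cluster."""
    pairs = list(product(c1, c2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra = max(mean_intracluster(cluster_a), mean_intracluster(cluster_b))
inter = mean_intercluster(cluster_a, cluster_b)
print(f"largest mean intracluster distance: {intra:.2f}")
print(f"mean intercluster distance:         {inter:.2f}")
```

A good clustering in the sense of the slide is one where the first number is small relative to the second.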
15
Clustering: Application
  • Binaries Segmentation
  • Goal: subdivide a given set of binaries into
    distinct subsets of binaries

16
Association Rule Discovery: Definition
  • Given a set of records, each of which contains some
    number of items from a given collection
  • Produce dependency rules which will predict the
    occurrence of an item based on occurrences of
    other items.

Rules discovered: {Bread} --> {Milk}
{Diaper} --> {Beer}
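
One common way to score such a rule is its confidence: of the records that contain the left-hand side, the fraction that also contain the right-hand side. The slides do not name a scoring measure, so treat the measure, the transactions, and the function names below as assumptions chosen for illustration.

```python
# Hypothetical market-basket records; each record is a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the records containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

for lhs, rhs in [({"Bread"}, {"Milk"}), ({"Diaper"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs), f"confidence={confidence(lhs, rhs):.2f}")
```

In this toy data both rules hold in three of the four records that contain the left-hand item.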
17
The Sad Truth About Diapers and Beer
  • So, don't be surprised if you find six-packs
    stacked next to diapers!

18
Association Rule Discovery: Application
  • Malware Rules
  • Goal: to identify activities that happen
    together in a given malware.

19
Sequential Pattern Discovery: Definition
  • Given a set of objects, with each object
    associated with its own timeline of events, find
    rules that predict strong sequential dependencies
    among different events
  • In telecommunications alarm logs:
  • (Inverter_Problem Excessive_Line_Current)
    (Rectifier_Alarm) --> (Fire_Alarm)
  • In point-of-sale transaction sequences:
  • Computer bookstore:
    (Intro_To_Visual_C) (C_Primer) -->
    (Perl_for_dummies)
  • Athletic apparel store:
    (Shoes) (Racket, Racketball) -->
    (Sports_Jacket)

20
Classification Example
[Figure: training examples plotted by height and weight, separated by a linear classifier]
21
Classification Techniques
  • Decision Trees
  • Naïve Bayes
  • Support Vector Machines
  • Neural Networks
  • Parzen Window
  • K-nearest neighbor

22
Issues: Data Preparation
  • Data cleaning
  • Preprocess data in order to reduce noise and
    handle missing values
  • Relevance analysis (feature selection)
  • Remove the irrelevant or redundant attributes
  • Data transformation
  • Generalize and/or normalize data

23
Issues: Evaluating Classification Methods
  • Accuracy
  • classifier accuracy: predicting class label
  • predictor accuracy: guessing value of predicted
    attributes
  • Speed
  • time to construct the model (training time)
  • time to use the model (classification/prediction
    time)
  • Robustness: handling noise and missing values
  • Scalability: efficiency in disk-resident
    databases
  • Interpretability
  • understanding and insight provided by the model
  • Other measures, e.g., goodness of rules, such as
    decision tree size or compactness of
    classification rules

24
Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3
25
A Decision Tree for buys_computer
26
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning; majority voting is employed for
    classifying the leaf
  • There are no samples left

27
Attribute Selection Measure: Information Gain
(ID3/C4.5)
  • Select the attribute with the highest information
    gain
  • Let $p_i$ be the probability that an arbitrary tuple
    in $D$ belongs to class $C_i$, estimated by
    $|C_{i,D}| / |D|$
  • Expected information (entropy) needed to classify
    a tuple in $D$, over $m$ classes:
    $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$
  • Information needed (after using $A$ to split $D$ into
    $v$ partitions) to classify $D$:
    $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$
  • Information gained by branching on attribute $A$:
    $Gain(A) = Info(D) - Info_A(D)$
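
The three quantities above can be sketched in a few lines, assuming the standard definitions: entropy of the class distribution, size-weighted entropy after the split, and their difference. The class counts below are assumptions for illustration. Only the (2 yes, 3 no) partition of 5 out of 14 samples is taken from the worked example in the slides; the other two partitions are invented so the counts add up to 14.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as per-class tuple counts."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)

def info_after_split(partitions):
    """Size-weighted entropy of the partitions D_1..D_v produced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(partitions):
    """Information gained by branching: Info(D) - Info_A(D)."""
    class_counts = [sum(col) for col in zip(*partitions)]  # per-class totals over D
    return info(class_counts) - info_after_split(partitions)

# Each partition is [yes_count, no_count]; values are assumed, see lead-in.
partitions = [[2, 3], [4, 0], [3, 2]]
print(f"Info of the (2, 3) partition: {info([2, 3]):.3f}")
print(f"Gain of the split:            {gain(partitions):.3f}")
```

A pure partition such as [4, 0] contributes zero entropy, which is why splits that isolate one class score a high gain.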

28
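Attribute Selection: Information Gain
  • Class P: buys_computer = "yes"
  • Class N: buys_computer = "no"
  • The term $\frac{5}{14} I(2,3)$ means age <= 30 has 5 out of 14
    samples, with 2 yeses and 3 nos. Hence
    $Gain(age) = Info(D) - Info_{age}(D)$
  • Similarly, the gain is computed for each of the
    remaining attributes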

29
Computing Information Gain for Continuous-Valued
Attributes
  • Let attribute A be a continuous-valued attribute
  • Must determine the best split point for A
  • Sort the values of A in increasing order
  • Typically, the midpoint between each pair of
    adjacent values is considered as a possible split
    point
  • $(a_i + a_{i+1})/2$ is the midpoint between the values of
    $a_i$ and $a_{i+1}$
  • The point with the minimum expected information
    requirement for A is selected as the split-point
    for A
  • Split
  • D1 is the set of tuples in D satisfying A <=
    split-point, and D2 is the set of tuples in D
    satisfying A > split-point
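
The procedure above can be sketched as follows: sort the values, try each midpoint between adjacent distinct values, and keep the one with the smallest expected information requirement. The attribute values, labels, and function names are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(c / n * log2(n / c) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, expected_info) minimizing the weighted entropy of D1, D2."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_info = None, float("inf")
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # identical adjacent values give no new midpoint
        point = (a + b) / 2
        d1 = [lab for v, lab in pairs if v <= point]  # tuples with A <= split-point
        d2 = [lab for v, lab in pairs if v > point]   # tuples with A > split-point
        expected = len(d1) / n * info(d1) + len(d2) / n * info(d2)
        if expected < best_info:
            best_point, best_info = point, expected
    return best_point, best_info

# Hypothetical continuous attribute and class labels.
ages = [23, 25, 31, 35, 42, 47]
buys = ["no", "no", "yes", "yes", "yes", "yes"]
print(best_split_point(ages, buys))
```

This sketch recomputes both partitions for every candidate, which is quadratic in the number of tuples; that is fine for illustration but not how a production tree learner would do it.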

30
Linear Classifiers
$f(x, w, b) = \mathrm{sign}(w \cdot x + b)$
[Figure: points labeled +1 and -1 with several candidate separating lines]
Any of these would be fine... but which is best?
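
A minimal sketch of the decision rule sign(w · x + b), assuming a 2-D input. The weight vector, bias, and test points are invented, and sending a score of exactly zero to +1 is a convention picked for this example.

```python
def predict(x, w, b):
    """Return +1 or -1 according to sign(w . x + b); a score of exactly 0 maps to +1 here."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical 2-D classifier whose boundary is the line x1 + x2 = 3.
w, b = (1.0, 1.0), -3.0
print(predict((3.0, 2.0), w, b))  # point on the positive side of the line
print(predict((0.5, 1.0), w, b))  # point on the negative side of the line
```

Many different (w, b) pairs classify the same training set correctly, which is the question the slide raises and the SVM slides answer.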
31
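Support Vector Machine
M = Margin Width
[Figure: maximum-margin separating hyperplane. The planes $w \cdot x + b = +1$, $w \cdot x + b = 0$ and $w \cdot x + b = -1$ separate the "Predict Class = +1" zone from the "Predict Class = -1" zone; $x^{+}$ and $x^{-}$ are points on the two margin planes.]
Support Vectors are those datapoints that the
margin pushes up against
  • What we know
  • $w \cdot x^{+} + b = +1$
  • $w \cdot x^{-} + b = -1$
  • $w \cdot (x^{+} - x^{-}) = 2$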

32
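Linear SVM Mathematically
  • Goal 1) Correctly classify all training data
  • $w \cdot x_i + b \geq 1$ if $y_i = +1$
  • $w \cdot x_i + b \leq -1$ if $y_i = -1$
  • $y_i (w \cdot x_i + b) \geq 1$ for all $i$
  • 2) Maximize the margin $M = \frac{2}{\|w\|}$
  • same as minimize $\frac{1}{2} w^T w$
  • We can formulate a Quadratic Optimization Problem
    and solve for w and b
  • Minimize $\frac{1}{2} w^T w$
  • subject to $y_i (w \cdot x_i + b) \geq 1$ for all $i$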
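
The sketch below checks the two goals for a candidate (w, b) on a toy data set. It assumes the usual identity M = 2/||w||, which follows from w · (x+ - x-) = 2 by projecting x+ - x- onto the unit normal w/||w||. It does not solve the quadratic program; the data and candidate solutions are made up.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Margin width M = 2 / ||w|| (assumed identity, see lead-in)."""
    return 2.0 / sqrt(dot(w, w))

def is_feasible(w, b, X, y, tol=1e-9):
    """True if y_i (w . x_i + b) >= 1 holds for every training point."""
    return all(yi * (dot(w, xi) + b) >= 1 - tol for xi, yi in zip(X, y))

# Hypothetical linearly separable training set.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]

for w, b in [((1.0, 1.0), 0.0), ((0.5, 0.5), 0.0), ((0.25, 0.25), 0.0)]:
    print(w, "feasible:", is_feasible(w, b, X, y), f"margin: {margin_width(w):.3f}")
```

Among the feasible candidates, the one with the smaller norm has the wider margin, which is why maximizing the margin is posed as minimizing w^T w under the constraints.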

33
Linear SVM. Cont.
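  • Requiring the derivatives with respect to w, b to
    vanish yields
    $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$
  • KKT conditions yield
    $\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$ for all $i$
  • Where $\alpha_i \geq 0$ are the Lagrange multipliers of
    the constraints $y_i (w \cdot x_i + b) \geq 1$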

36
Linear SVM. Cont.
  • The resulting separating function is
    $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i \, (x_i \cdot x) + b \right)$
  • Notes
  • The points with $\alpha_i = 0$ do not affect the solution.
  • The points with $\alpha_i \neq 0$ are called support vectors.
  • The equality conditions hold true only for the
    Support Vectors.
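
The first two notes can be checked on a toy problem. The sketch assumes the usual dual form of the separating function, sign(sum_i alpha_i y_i (x_i · x) + b), and uses multipliers worked out by hand for four made-up points. The values are illustrative, not the output of a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decide(x, alphas, X, y, b):
    """Separating function in dual form: sign(sum_i alpha_i * y_i * (x_i . x) + b)."""
    score = sum(a * yi * dot(xi, x) for a, xi, yi in zip(alphas, X, y)) + b
    return 1 if score >= 0 else -1

# Hypothetical training set; the inner points (1, 1) and (-1, -1) sit on the margin.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -2.0)]
y = [1, 1, -1, -1]
alphas = [0.25, 0.0, 0.25, 0.0]  # hand-worked multipliers (assumed), zero off the margin
b = 0.0

# Keep only the support vectors (alpha != 0) and compare predictions.
sv = [(a, xi, yi) for a, xi, yi in zip(alphas, X, y) if a != 0]
sv_alphas, sv_X, sv_y = (list(t) for t in zip(*sv))

for x in [(3.0, 0.5), (-0.2, -0.1), (0.4, -1.0)]:
    print(x, decide(x, alphas, X, y, b), decide(x, sv_alphas, sv_X, sv_y, b))
```

Both columns agree for every query point, because terms with a zero multiplier contribute nothing to the sum.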

37
Non-separable case
  • The modifications yield the following problem

38
Non Linear SVM
  • Note that the training data appears in the
    solution only in inner products.
  • If we pre-map the data into a higher-dimensional,
    sparser space, we can get more separability and a
    richer family of separating functions.
  • The pre-mapping might make the problem
    computationally infeasible.
  • We want to avoid pre-mapping and still have the
    same separation ability.
  • Suppose we have a simple function that operates
    on two training points and implements an inner
    product of their pre-mappings, then we achieve
    better separation with no added cost.

39
Non-linear SVMs Feature spaces
  • General idea: the original feature space can
    always be mapped to some higher-dimensional
    feature space where the training set is separable

$\Phi: x \to \varphi(x)$
40
The Kernel Trick
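  • The linear classifier relies on the inner product
    between vectors: $K(x_i, x_j) = x_i^T x_j$
  • If every datapoint is mapped into a
    high-dimensional space via some transformation
    $\Phi: x \to \varphi(x)$, the inner product becomes
  • $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
  • A kernel function is a function that is
    equivalent to an inner product in some feature
    space.
  • Example
  • 2-dimensional vectors $x = [x_1 \; x_2]$; let
    $K(x_i, x_j) = (1 + x_i^T x_j)^2$
  • Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
  • $K(x_i, x_j) = (1 + x_i^T x_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
  • $= [1 \;\; x_{i1}^2 \;\; \sqrt{2} x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2} x_{i1} \;\; \sqrt{2} x_{i2}]^T \, [1 \;\; x_{j1}^2 \;\; \sqrt{2} x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2} x_{j1} \;\; \sqrt{2} x_{j2}]$
  • $= \varphi(x_i)^T \varphi(x_j)$, where $\varphi(x) = [1 \;\; x_1^2 \;\; \sqrt{2} x_1 x_2 \;\; x_2^2 \;\; \sqrt{2} x_1 \;\; \sqrt{2} x_2]$
  • Thus, a kernel function implicitly maps data to a
    high-dimensional space (without the need to
    compute each $\varphi(x)$ explicitly).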
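
The identity in the example can be checked numerically. This sketch evaluates the kernel (1 + xi^T xj)^2 directly in 2-D and compares it with the inner product of the explicit 6-D feature vectors [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]. The two input vectors are arbitrary.

```python
from math import isclose, sqrt

def kernel(xi, xj):
    """K(xi, xj) = (1 + xi^T xj)^2 for 2-dimensional vectors."""
    return (1 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2

def phi(x):
    """Explicit 6-dimensional feature map matching the kernel above."""
    x1, x2 = x
    return (1.0, x1 ** 2, sqrt(2) * x1 * x2, x2 ** 2, sqrt(2) * x1, sqrt(2) * x2)

xi, xj = (1.0, 2.0), (3.0, -1.0)  # arbitrary test vectors
implicit = kernel(xi, xj)                                  # 2-D arithmetic only
explicit = sum(a * b for a, b in zip(phi(xi), phi(xj)))    # inner product in 6-D
print(implicit, explicit, isclose(implicit, explicit))
```

The kernel needs one 2-D inner product, while the explicit route builds two 6-D vectors first; that gap grows quickly with the input dimension and the polynomial degree.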

41
Mercer Kernels
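  • A Mercer kernel is a function
    $k: X \times X \to \mathbb{R}$
  • for which there exists a function
    $\varphi: X \to H$
  • such that
    $k(x, y) = \varphi(x) \cdot \varphi(y)$
  • A function k(.,.) is a Mercer kernel if
  • for any function g(.), such that
    $\int g(x)^2 \, dx < \infty$
  • the following holds true
    $\iint k(x, y) \, g(x) \, g(y) \, dx \, dy \geq 0$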

42
Commonly used Mercer Kernels
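  • Homogeneous Polynomial Kernels:
    $k(x, y) = (x \cdot y)^d$
  • Non-homogeneous Polynomial Kernels:
    $k(x, y) = (x \cdot y + 1)^d$
  • Radial Basis Function (RBF) Kernels:
    $k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$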
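
A small sketch of the three kernel families, assuming their usual textbook forms: (x · y)^d, (x · y + 1)^d, and exp(-||x - y||^2 / (2 sigma^2)). The parameter names `d` and `sigma` and their default values are choices made for this example.

```python
from math import exp

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_homogeneous(x, y, d=2):
    """Homogeneous polynomial kernel (x . y)^d."""
    return dot(x, y) ** d

def poly_nonhomogeneous(x, y, d=2):
    """Non-homogeneous polynomial kernel (x . y + 1)^d."""
    return (dot(x, y) + 1) ** d

def rbf(x, y, sigma=1.0):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-sq_dist / (2 * sigma ** 2))

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_homogeneous(x, y), poly_nonhomogeneous(x, y), rbf(x, y))
```

All three take two points from the original space and return a single number, so any of them can stand in for the inner product in the separating function.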

43
Solution of non-linear SVM
  • The problem
  • The separating function
    $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i \, K(x_i, x) + b \right)$

44
Multi-Class SVM
  • Approaches
  • One against One (K(K-1)/2 binary
    classifiers required)
  • Outputs of the classifiers are aggregated to
    make the final decision.
  • One against All (K binary classifiers required)
  • It trains K binary classifiers, each of which
    separates one class from the other (K-1) classes.
    Given a data point X, the binary classifier with
    the largest output determines the class of X.
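
The two aggregation schemes can be sketched without training anything, by plugging in made-up classifier outputs. Majority voting is used for one-against-one here as one common choice; the slides only say the outputs are "aggregated". The class names, scores, and pairwise decisions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def one_against_all(outputs):
    """outputs maps each class to the real-valued output of its binary classifier."""
    return max(outputs, key=outputs.get)

def one_against_one(classes, pairwise_winner):
    """pairwise_winner(a, b) returns whichever of a, b the (a vs b) classifier picks.

    Aggregation is by majority vote (an assumption); ties go to the class
    that received its first vote earliest.
    """
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

classes = ["benign", "virus", "worm"]                       # hypothetical classes
outputs = {"benign": -0.4, "virus": 1.3, "worm": 0.2}       # hypothetical scores
decisions = {                                               # hypothetical pairwise results
    ("benign", "virus"): "virus",
    ("benign", "worm"): "worm",
    ("virus", "worm"): "virus",
}

print(one_against_all(outputs))
print(one_against_one(classes, lambda a, b: decisions[(a, b)]))
print(len(list(combinations(classes, 2))))  # K(K-1)/2 pairwise classifiers
```

With K = 3 both schemes need three classifiers; the counts diverge as K grows, since K(K-1)/2 grows quadratically.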

45
Why Is SVM Effective on High Dimensional Data?
  • The complexity of the trained classifier is
    characterized by the number of support vectors rather
    than the dimensionality of the data
  • The support vectors are the essential or critical
    training examples; they lie closest to the
    decision boundary (MMH)
  • If all other training examples are removed and
    the training is repeated, the same separating
    hyperplane would be found
  • The number of support vectors found can be used
    to compute an (upper) bound on the expected error
    rate of the SVM classifier, which is independent
    of the data dimensionality
  • Thus, an SVM with a small number of support
    vectors can have good generalization, even when
    the dimensionality of the data is high

46
Experiments
  • Source of data: preprocessed data in terms of API
    calls, taken from data collected from C-DAC
    Mohali.
  • Description of data

            Sample space   Training set   Testing set
Benign      534            50             484
Malicious   168            50             118
Total       702            100            602
47
Classifier Accuracy Measures
              Predicted C1      Predicted C2
Actual C1     True positive     False negative
Actual C2     False positive    True negative
  • Performance measures
  • sensitivity = t-pos/pos (true positive
    recognition rate)
  • specificity = t-neg/neg (true negative
    recognition rate)
  • accuracy = sensitivity × pos/(pos + neg) +
    specificity × neg/(pos + neg)
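
The three measures can be computed directly from the four cells of the confusion matrix. The counts below are invented for illustration and are not the experimental results reported in this deck.

```python
def measures(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    pos, neg = tp + fn, fp + tn          # actual positives and actual negatives
    sensitivity = tp / pos               # true positive recognition rate
    specificity = tn / neg               # true negative recognition rate
    accuracy = (sensitivity * pos / (pos + neg)
                + specificity * neg / (pos + neg))
    return sensitivity, specificity, accuracy

# Hypothetical counts: 50 actual positives, 50 actual negatives.
sens, spec, acc = measures(tp=40, fn=10, fp=5, tn=45)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

The weighted form of accuracy reduces to (t-pos + t-neg) / (pos + neg), so accuracy alone can hide a poor sensitivity when the positive class is rare.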

48
Experimental Results

Classifier   Sensitivity              Specificity
             k=5     k=6     k=7      k=5     k=6     k=7
C4.5         70.86   71.23   69.68    68.62   69.96   61.05
SVM          75.26   76.79   75.18    73.54   78.34   74.46


49
Observations
  • The SVM classifier performs better than C4.5 on
    both sensitivity and specificity for every value
    of k tested.
  • The performance depends on the size of the
    feature set.
  • SVM requires fewer training samples than C4.5.
    Hence, SVM is a better choice, as collecting
    malicious samples is difficult.

50
Conclusion and Future Work
  • SVM is a better classification technique than
    C4.5 for the detection of malware.
  • Better feature representations need to be
    constructed for better generalization
  • How to extend it to the multi-class malware problem

51
References
  • C. J. C. Burges. A Tutorial on Support Vector
    Machines for Pattern Recognition. Data Mining and
    Knowledge Discovery, 2(2):121-168, 1998.
  • J. R. Quinlan. C4.5: Programs for Machine
    Learning. Morgan Kaufmann, 1993.
  • P. Tan, M. Steinbach, and V. Kumar. Introduction
    to Data Mining. Addison Wesley, 2005.
  • I. H. Witten and E. Frank. Data Mining: Practical
    Machine Learning Tools and Techniques, 2nd ed.
    Morgan Kaufmann, 2005.
  • Han and Kamber. Data Mining Concepts
  • B. Zhang, J. Yin, J. Hao, D. Zhang, and S. Wang.
    Using Support Vector Machine to Detect Unknown
    Computer Viruses. Int. Journal of Computational
    Intelligence Research, vol. 2, no. 1, pp.
    100-104, 2006.
  • G. Szappanos. Are There Any Polymorphic Macro
    Viruses at All (and What to Do with Them). In
    Proceedings of the 12th International Virus
    Bulletin Conference, 2001.
  • S. Forrest, S. A. Hofmeyr, and A. Somayaji.
    Computer Immunology. Communications of the ACM,
    10, pp. 88-96, 1997.
  • W. Lee and X. Dong. Information-Theoretic Measures
    for Anomaly Detection. In R. Needham and M. Abadi
    (eds.), Proceedings of the 2001 IEEE Symposium on
    Security and Privacy, Oakland, CA. IEEE Computer
    Society Press, pp. 130-143, 2001.
  • LIBSVM. http://www.csie.ntu.edu.tw/~cjlin/.

52
References (4)
  • S. M. Weiss and C. A. Kulikowski. Computer
    Systems that Learn: Classification and
    Prediction Methods from Statistics, Neural Nets,
    Machine Learning, and Expert Systems. Morgan
    Kaufmann, 1991.
  • S. M. Weiss and N. Indurkhya. Predictive Data
    Mining. Morgan Kaufmann, 1997.
  • X. Yin and J. Han. CPAR: Classification Based on
    Predictive Association Rules. SDM'03.
  • H. Yu, J. Yang, and J. Han. Classifying Large
    Data Sets Using SVM with Hierarchical Clusters.
    KDD'03.