An Intrusion Detection System Based on Teiresias - PowerPoint PPT Presentation

1
An Intrusion Detection System Based on Teiresias
  • Andreas Wespi, Marc Dacier and Herve Debar
  • IBM Research, Zurich, Switzerland
  • Presented By Hanping Feng

2
References
  • [1] A. Wespi, M. Dacier and H. Debar. Intrusion
    Detection Using Variable-Length Audit Trail
    Patterns. RAID 2000, LNCS 1907, pp. 110-129, 2000.
  • [2] A. Wespi, M. Dacier and H. Debar. An
    Intrusion-Detection System Based on the Teiresias
    Pattern-Discovery Algorithm. EICAR 1999.
  • [3] I. Rigoutsos and A. Floratos.
    Combinatorial Pattern Discovery in Biological
    Sequences. Bioinformatics, 14(1):55-67, 1998.
  • [4] I. Rigoutsos, A. Floratos et al. The
    Emergence of Pattern Discovery Techniques in
    Computational Biology. Metabolic Engineering,
    vol. 2, pp. 159-177, 2000.
  • [5] A. Floratos and I. Rigoutsos. On the Time
    Complexity of the Teiresias Algorithm. IBM
    Research Report, 1998.
  • [6] http://cbcsrv.watson.ibm.com/Tspd.html, IBM.

3
Outline
  • Intrusion-Detection based on Teiresias
  • Introduction
  • Architecture
  • Training System
  • Detection System
  • Results
  • Conclusion
  • Teiresias

4
Introduction
  • Forrest's fixed-length pattern method
  • Why variable-length patterns?
  • It is hard to choose the pattern length for
    fixed-length pattern methods
  • Variable-length patterns are more natural:
    functional parts of a program generate
    variable-length sequences of events
  • Very long subsequences occur repeatedly

5
Architecture

6
Architecture
  • Modules
  • Filtering: sort the events by process ID while
    keeping the chronological event order
  • Translation: translate the events into an
    internal format, basically one character per
    event
  • Aggregation: aggregate consecutive occurrences of
    the same character
  • Reduction: remove duplicate sequences
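
The four modules above can be sketched as one small pipeline. This is only an illustration under stated assumptions: the `(pid, event_name)` input format, the `preprocess` name and the choice to collapse a run of identical characters into a single character are all hypothetical, not taken from the original system.

```python
from itertools import groupby
import string

# Hypothetical symbol pool: supports up to 62 distinct event names.
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits

def preprocess(events):
    """events: iterable of (pid, event_name) in chronological order."""
    # Filtering: group events by process ID, keeping chronological order.
    by_pid = {}
    for pid, name in events:
        by_pid.setdefault(pid, []).append(name)

    codes = {}      # Translation table: event name -> one character
    sequences = []  # Reduced list of per-process sequences
    for names in by_pid.values():
        # Translation: one character per event.
        chars = []
        for name in names:
            if name not in codes:
                codes[name] = SYMBOLS[len(codes)]
            chars.append(codes[name])
        # Aggregation (assumed here): collapse runs of the same character.
        seq = "".join(ch for ch, _ in groupby(chars))
        # Reduction: drop duplicate sequences.
        if seq not in sequences:
            sequences.append(seq)
    return sequences, codes

events = [(1, "open"), (2, "open"), (1, "read"), (1, "read"),
          (2, "read"), (1, "close"), (2, "close"),
          (3, "open"), (3, "write")]
sequences, codes = preprocess(events)
print(sequences)
```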

7
Training System
  • Maximal Pattern Generation
  • All maximal variable-length patterns contained in
    the set of training sequences are determined
  • Pattern Reduction
  • To obtain the minimum pattern set that still
    covers all training sequences

8
Maximal Pattern Generation
  • Teiresias is used here
  • It finds all maximal patterns within the
    input sequences that satisfy some
    user-determined requirements
  • Here the requirements are
  • A minimum length of two
  • At least two occurrences

9
Pattern Reduction
  • Step 1
  • The function bCover(p, s) (boundary coverage)
    returns the number of characters covered at the
    beginning and end of a sequence s by a pattern
    p. Example: bCover(AB, ABCDEABAB) = 6
  • The pattern with the highest boundary coverage is
    added to the reduced pattern set.
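
A minimal sketch of boundary coverage, assuming patterns are plain strings without don't-care characters and that repeated occurrences at either boundary all count (which is what the slide's example suggests). The name `b_cover` is a Python-style stand-in for bCover.

```python
def b_cover(p, s):
    """Number of characters of s covered by repetitions of p
    at the beginning and at the end of s (p must be non-empty)."""
    n = len(p)
    start = 0
    # Consume occurrences of p at the beginning of s.
    while s.startswith(p, start):
        start += n
    end = len(s)
    # Consume occurrences of p at the end, never crossing `start`.
    while s.endswith(p, start, end):
        end -= n
    return start + (len(s) - end)

print(b_cover("AB", "ABCDEABAB"))
```

Selecting the pattern for the reduced set is then a matter of summing `b_cover` over all training sequences and taking the maximum; how ties are broken is not stated on the slide.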

10
Pattern Reduction
  • Step 2
  • All the subsequences at the beginning and end of
    the training sequences that match the pattern
    determined in Step 1 are removed:
    ABABCDAB → CD
  • Training sequences are not reduced to
    sequences that are too short, so a reduction
    such as ABD → D is not applied

11
Pattern Reduction
  • Step 3
  • Remove occurrences of the pattern p that are not
    adjacent to the boundary
  • This splits the original sequence into
    new, shorter sequences:
    CDABABEFABGH → CD, EF, GH
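
Steps 2 and 3 can be sketched as two string operations. The sketch assumes dot-free patterns, and it assumes that a boundary removal is skipped whenever it would leave a non-empty remainder shorter than `min_len`; the function names and that exact rule are illustrative guesses, not the authors' code.

```python
def strip_boundaries(p, s, min_len=2):
    """Step 2: remove occurrences of p at both ends of s, unless the
    remainder would be non-empty and shorter than min_len."""
    def allowed(remaining):
        return remaining == 0 or remaining >= min_len

    start, end = 0, len(s)
    while s.startswith(p, start, end) and allowed(end - start - len(p)):
        start += len(p)
    while s.endswith(p, start, end) and allowed(end - start - len(p)):
        end -= len(p)
    return s[start:end]

def split_interior(p, s):
    """Step 3: remove interior occurrences of p, splitting s
    into the shorter sequences that remain."""
    return [piece for piece in s.split(p) if piece]

print(strip_boundaries("AB", "ABABCDAB"))
print(strip_boundaries("AB", "ABD"))
print(split_interior("AB", "CDABABEFABGH"))
```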

12
Pattern Reduction
  • Step 4
  • No further transformation can be applied to
    sequences whose length is less than twice the
    minimal pattern length
  • Any sequence that cannot be reduced further is
    added to the reduced pattern set
  • Repeat these steps until all input sequences are
    covered
  • Importance of the reduction
  • For their training data: 23,302 events; 167,187
    patterns; 554 maximal patterns; 71 covering
    patterns

13
Detection System
  • Pattern matching
  • Exactly one pattern matches at a given position:
    jump to right after the last event of the pattern
  • Several patterns match: a look-ahead algorithm
    determines whether a sequence of n patterns can
    be found that matches the continuation of the
    sequence; the one with the longest match is
    chosen and skipped
  • No pattern matches: the event is marked as
    unmatched and skipped
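
The three cases above can be sketched as follows. This is a simplified reading, assuming dot-free, non-empty patterns and interpreting "longest match" as the longest stretch coverable by up to `lookahead` back-to-back patterns; `match_events` and `covered_len` are hypothetical names.

```python
def covered_len(seq, pos, patterns, depth):
    """Longest stretch starting at pos that can be covered by
    at most `depth` patterns placed back to back."""
    if depth == 0:
        return 0
    best = 0
    for p in patterns:
        if seq.startswith(p, pos):
            rest = covered_len(seq, pos + len(p), patterns, depth - 1)
            best = max(best, len(p) + rest)
    return best

def match_events(seq, patterns, lookahead=2):
    """Return one boolean per event: True if covered by a pattern."""
    matched = [False] * len(seq)
    pos = 0
    while pos < len(seq):
        candidates = [p for p in patterns if seq.startswith(p, pos)]
        if not candidates:
            pos += 1  # No pattern matches: event stays unmatched.
            continue
        # One or several candidates: pick the one whose continuation
        # covers the longest stretch, then jump past it.
        best = max(candidates, key=lambda p: len(p) + covered_len(
            seq, pos + len(p), patterns, lookahead - 1))
        for i in range(pos, pos + len(best)):
            matched[i] = True
        pos += len(best)
    return matched

patterns = ["AB", "ABC", "CD"]
print(match_events("ABCDX", patterns, lookahead=2))
print(match_events("ABCDX", patterns, lookahead=1))
```

With a look-ahead of 2 the matcher prefers AB followed by CD (four events covered) over the locally longer ABC, which would leave D unmatched.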

14
Detection System
  • Metric
  • The maximal length of consecutive unmatched
    events
  • The pattern-matching algorithm returns the g
    groups of consecutive uncovered events and their
    lengths $l_i$, $i = 1, \dots, g$; the metric is
    $T = \max_{1 \le i \le g} l_i$
  • A threshold is set, and any sequence with T above
    the threshold raises an alarm
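
A small sketch of the metric, assuming the matcher's result is available as a list of booleans (True for a covered event). The function names and the threshold value are illustrative only.

```python
from itertools import groupby

def unmatched_group_lengths(matched):
    """Lengths l_1..l_g of the groups of consecutive uncovered events."""
    return [sum(1 for _ in run)
            for is_matched, run in groupby(matched) if not is_matched]

def anomaly_score(matched):
    """T = max(l_i); 0 when every event is covered."""
    return max(unmatched_group_lengths(matched), default=0)

def raises_alarm(matched, threshold):
    return anomaly_score(matched) > threshold

matched = [True, False, False, True, False, False, False, True]
print(unmatched_group_lengths(matched), anomaly_score(matched))
```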

15
Results
  • Training
  • 487 tests; 68 unique sequences; 23,302 events

16
Results
  • Testing

17
Conclusion
  • They claim that their variable-length pattern
    method works better than Forrest's fixed-length
    method, but this is not obvious from their
    results
  • However, I personally believe variable-length
    patterns are a good idea.

18
Teiresias Algorithm for Pattern Discovery
  • Isidore Rigoutsos
  • Computational Biology Center, IBM
  • Aris Floratos
  • New York University

19
Outline
  • Background
  • Teiresias Algorithm
  • Properties
  • Terminology
  • Algorithm
  • Examples
  • Performance
  • Applications
  • References

20
Background
  • Algorithms for discovering sequence similarity
  • Originated in computational biology
  • Some new applications in other areas
  • Proposed Algorithms
  • Global string alignment
  • Local Similarity Discovery

21
Background
  • Global string alignment
  • Dynamic programming on edit operations, such as
    mutation, insertion and deletion, with associated
    costs
  • Can reveal only global similarities
  • Cannot discover distantly related similarities or
    deal with domain swapping
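
As a reminder of what dynamic programming on edit operations looks like, here is a standard edit-distance sketch. The unit costs are an assumption for illustration; real alignment tools use substitution matrices and gap penalties.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Minimal total cost of mutations (substitutions), insertions
    and deletions that turn string a into string b."""
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match/mutation
                prev[j] + indel_cost,                         # deletion
                cur[j - 1] + indel_cost,                      # insertion
            ))
        prev = cur
    return prev[-1]

print(edit_distance("ABCD", "ABD"))
```

Because the score is computed over the whole of both strings, a short shared region inside otherwise unrelated sequences barely affects it, which is the limitation the slide points out.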

22
Background
  • Local Similarity Discovery
  • Focuses on the discovery of regional patterns
    shared by the input sequences
  • Enumerating the entire solution space
  • Computationally expensive, but results are
    complete
  • Heuristic methods
  • Sacrifice the completeness of results to save
    computational effort
  • Patterns are required to be as specific as
    possible

23
Teiresias Algorithm
  • Who is Teiresias?
  • A famous blind prophet in ancient Greek myth
  • Properties of the Teiresias algorithm
  • Maximal
  • Output-sensitive
  • Able to process large data sets and large
    alphabets
  • Allows user-specified structural restrictions
    to be imposed on output patterns

24
Terminology
  • Alphabet $\Sigma$: the set of permissible symbols
  • Pattern $P \in \Sigma(\Sigma \cup \{.\})^{*}\Sigma$
  • Any string that begins and ends with a symbol
    and contains an arbitrary combination of symbols
    and don't-care characters, represented by "."
  • Language G(P)
  • The collection of strings obtained from
    pattern P by substituting each don't-care
    character with an arbitrary symbol from $\Sigma$
  • Subpattern

25
Examples for Terminology
  • Alphabet $\Sigma$ = {A, B, C, D}
  • Sequence s1 = AABDABBD
  • Patterns P1 = A.BD, P2 = AB
  • Language
  • G(P1) = {AABD, ABBD, ACBD, ADBD}
  • Subpatterns of P1
  • A.B, BD, B, D, A
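
The language of a pattern can be enumerated directly from the definition. The sketch below assumes single-character symbols and uses hypothetical helper names `language` and `matches`.

```python
from itertools import product

def language(pattern, alphabet):
    """All strings obtained by replacing each '.' in the pattern
    with a symbol from the alphabet."""
    slots = [alphabet if ch == "." else ch for ch in pattern]
    return {"".join(choice) for choice in product(*slots)}

def matches(sequence, pattern, alphabet):
    """A sequence matches a pattern if it contains a substring
    that belongs to the pattern's language."""
    return any(word in sequence for word in language(pattern, alphabet))

print(sorted(language("A.BD", "ABCD")))
print(matches("AABDABBD", "A.BD", "ABCD"))
```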

26
Terminology
  • P is an <L,W> pattern ($L \le W$)
  • if every subpattern of P with length $\ge W$
    contains at least L symbols
  • L/W reflects the density of symbols in a
    pattern
  • Match
  • Given a pattern P, a string is said to match P if
    it contains at least one substring that belongs
    to G(P)
  • Offset list
  • Given a pattern P and a set of sequences
    $S = \{s_i\}$, the offset list of P w.r.t. S is
  • $L_S(P) = \{(i, j) \mid \text{sequence } s_i \text{ matches } P \text{ at offset } j\}$

27
Examples for Terminology
  • <L,W> pattern
  • L = 3, W = 3: ABCDB, DCA
  • L = 3, W = 4: A.BC.D
  • Subpatterns with length 4: A.BC, BC.D
  • Offset list
  • S = {ABCD, ACDABDAB}
  • P = AB
  • $L_S(P)$ = {(1,1), (2,4), (2,7)}
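
Both definitions are easy to check mechanically. The sketch below follows the slide's wording literally (a subpattern is taken to be a substring that starts and ends with a symbol) and uses 1-based indices to match the example; the function names are illustrative.

```python
def is_lw_pattern(p, L, W):
    """True if every subpattern of p with length >= W
    contains at least L symbols (non-dot characters)."""
    for i in range(len(p)):
        if p[i] == ".":
            continue  # A subpattern must start with a symbol.
        for j in range(i + W, len(p) + 1):
            if p[j - 1] == ".":
                continue  # ... and end with a symbol.
            if sum(ch != "." for ch in p[i:j]) < L:
                return False
    return True

def offset_list(p, sequences):
    """1-based (sequence index, offset) pairs where p matches."""
    return [(i, j + 1)
            for i, s in enumerate(sequences, 1)
            for j in range(len(s) - len(p) + 1)
            if all(pc == "." or pc == sc
                   for pc, sc in zip(p, s[j:j + len(p)]))]

print(is_lw_pattern("A.BC.D", 3, 4))
print(offset_list("AB", ["ABCD", "ACDABDAB"]))
```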

28
Terminology
  • More Specific
  • Q is said to be more specific than P if Q can be
    obtained from P by
  • Changing don't-care characters to symbols
  • Appending an arbitrary combination of symbols
    and/or don't-care characters to the left/right of
    P
  • Number of occurrences of Q $\le$ number of
    occurrences of P
  • Maximal
  • A pattern P is called maximal w.r.t. S if there
    exists no pattern Q that is more specific than P
    and has the same number of occurrences as P
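
The "more specific" relation can be tested by sliding P along Q. This sketch assumes both arguments are valid patterns (starting and ending with a symbol); `is_more_specific` is a hypothetical name.

```python
def is_more_specific(q, p):
    """True if q can be obtained from p by turning some dots into
    symbols and/or appending characters on the left or right."""
    if q == p or len(q) < len(p):
        return False
    for shift in range(len(q) - len(p) + 1):
        window = q[shift:shift + len(p)]
        # Every symbol of p must be kept; dots of p may become anything.
        if all(pc == "." or pc == qc for pc, qc in zip(p, window)):
            return True
    return False

print(is_more_specific("AABD", "A.BD"))
print(is_more_specific("A..D", "A.BD"))
```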

29
Algorithm
  • Problem Formulation
  • Given a set S of input sequences and parameters
    L, W, K, find all maximal <L,W> patterns that have
    support at least K
  • Two main stages of Teiresias Algorithm
  • Scanning stage
  • Convolution stage

30
Algorithm
  • Scanning Stage
  • Get a complete collection of elementary patterns
    with the required support
  • Elementary pattern: an <L,W> pattern with exactly
    L symbols
  • Basically a guided enumeration
  • Example 1
  • S = {ABCD, CDABCAAB}
  • L = 2, W = 2, K = 2
  • Elementary patterns found:

31
Algorithm
  • AB: ABCD, CDABCAAB
  • CD: ABCD, CDABCAAB
  • BC: ABCD, CDABCAAB
  • Example 2
  • S = {ABCDAC, CDABADDCAB, DBC}
  • L = 2, W = 3, K = 3
  • Example elementary patterns
  • AB: ABCDAC, CDABDDCAB, DBC
  • D.C: ABCDAC, CDABDDCAB, DBC
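
A brute-force stand-in for the scanning stage, not the guided enumeration itself. It assumes that an elementary pattern has exactly L symbols and length at most W, that L is at least 2, and that support means the total number of occurrences across all sequences (the reading under which the examples work out). Example 2 uses the set S as stated in its definition.

```python
from itertools import combinations, product

def occurrences(p, sequences):
    """Total number of offsets at which p matches, over all sequences."""
    return sum(
        all(pc == "." or pc == sc for pc, sc in zip(p, s[j:j + len(p)]))
        for s in sequences
        for j in range(len(s) - len(p) + 1)
    )

def elementary_patterns(sequences, L, W, K):
    """Patterns with exactly L symbols, length <= W, support >= K."""
    alphabet = sorted(set("".join(sequences)))
    found = {}
    for length in range(L, W + 1):
        # First and last positions hold symbols; choose the other L-2.
        for inner in combinations(range(1, length - 1), L - 2):
            positions = (0, *inner, length - 1)
            for symbols in product(alphabet, repeat=L):
                chars = ["."] * length
                for pos, sym in zip(positions, symbols):
                    chars[pos] = sym
                p = "".join(chars)
                support = occurrences(p, sequences)
                if support >= K:
                    found[p] = support
    return found

print(elementary_patterns(["ABCD", "CDABCAAB"], 2, 2, 2))
example2 = elementary_patterns(["ABCDAC", "CDABADDCAB", "DBC"], 2, 3, 3)
print(example2["AB"], example2["D.C"])
```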

32
Algorithm
  • Convolution Phase
  • Extend the elementary patterns to find all
    maximal patterns
  • Definitions
  • prefix(P): the uniquely defined prefix subpattern
    of P that contains exactly (L-1) symbols
  • suffix(P): the uniquely defined suffix subpattern
    of P that contains exactly (L-1) symbols

33
Algorithm
  • Convolution
  • An operation between two patterns P and Q with at
    least L symbols each
  • $P \oplus Q = PQ'$ if suffix(P) = prefix(Q)
  • $P \oplus Q = \emptyset$ otherwise
  • where $Q'$ is the remaining part of Q after
    removing prefix(Q)
  • Examples for L = 3
  • ABC $\oplus$ CED = $\emptyset$, F.A.T $\oplus$ A.TSE = F.A.TSE
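
The convolution operation follows directly from the prefix and suffix definitions. In this sketch `None` stands for the empty result, L is assumed to be at least 2, and both patterns are assumed to contain at least L symbols, as the slide requires.

```python
def prefix(p, L):
    """Shortest prefix of p that contains exactly L-1 symbols."""
    count = 0
    for i, ch in enumerate(p):
        if ch != ".":
            count += 1
            if count == L - 1:
                return p[:i + 1]
    raise ValueError("pattern has fewer than L-1 symbols")

def suffix(p, L):
    """Shortest suffix of p that contains exactly L-1 symbols."""
    return prefix(p[::-1], L)[::-1]

def convolve(p, q, L):
    """Glue q onto p when suffix(p) equals prefix(q), else None."""
    head = prefix(q, L)
    if suffix(p, L) != head:
        return None
    return p + q[len(head):]

print(convolve("F.A.T", "A.TSE", 3))
print(convolve("ABC", "CED", 3))
```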

34
Algorithm
  • At the convolution stage, we need to generate all
    patterns and be able to identify and discard
    non-maximal patterns before further convolution
  • Partial ordering
  • Prefix-wise ordering
  • Align P and Q at their leftmost positions and
    examine them from left to right until reaching a
    position where one of the two patterns has a
    symbol and the other has a don't-care character
  • If the symbol comes from P, then P is said to be
    prefix-wise less than Q
  • Suffix-wise ordering is defined similarly
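
A sketch of the two orderings as comparison functions. Two points are assumptions, since the slide does not spell them out: when no distinguishing position is found, neither pattern is treated as less than the other, and the suffix-wise version is taken to be the mirror image (align on the right, scan right to left).

```python
def prefix_wise_less(p, q):
    """True if, at the first position where exactly one of the two
    patterns has a dot, the symbol belongs to p."""
    for pc, qc in zip(p, q):
        if (pc == ".") != (qc == "."):
            return pc != "."
    return False

def suffix_wise_less(p, q):
    """Mirror image of prefix_wise_less."""
    return prefix_wise_less(p[::-1], q[::-1])

print(prefix_wise_less("AB.C", "A.BC"))
print(suffix_wise_less("AB.C", "A.BC"))
```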

35
Algorithm
  • Convolution Phase
  • A stack is used
  • The stack is initialized with the prefix-wise
    smallest elementary pattern P with support at
    least K
  • The algorithm always works with the pattern T
    at the top of the stack
  • T is extended to the left by convolving it with
    all the elementary patterns Q that are
    convolvable with T, in suffix-wise order. The
    resulting pattern is discarded if it does not have
    enough support or is non-maximal. Otherwise it
    becomes the new stack top and the procedure
    starts over.

36
Algorithm
  • Convolution Phase
  • After the pattern T at the top of the stack can
    no longer be extended to the left, a similar
    process is performed to extend T to the right
  • When extension in both directions has been
    completed, the current top is popped and placed
    in the output

37
Algorithm
  • The procedure can be shown to
  • Terminate
  • Produce all maximal <L,W> patterns without
    reporting non-maximal patterns
  • Any maximal pattern will always be generated
  • It will be generated before any non-maximal
    pattern that is its subpattern is reported

38
Examples
  • S = {ABCDEAFDE, BCFDEABCD, BCEADEFDE}
  • L = 2, W = 2, K = 2
  • Maximal patterns
  • ABCD: ABCDEAFDE, BCFDEABCD, BCEADEFDE
  • DEA: ABCDEAFDE, BCFDEABCD, BCEADEFDE
  • FDE: ABCDEAFDE, BCFDEABCD, BCEADEFDE
  • BC: ABCDEAFDE, BCFDEABCD, BCEADEFDE
  • DE: ABCDEAFDE, BCFDEABCD, BCEADEFDE
  • EA: ABCDEAFDE, BCFDEABCD, BCEADEFDE
  • See [6] for the online Teiresias server.
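
The example can be cross-checked with a brute-force search. This is deliberately not the Teiresias scan-and-convolve procedure: it enumerates every dot-free substring, counts occurrences, and keeps those with no longer superstring of equal support. It assumes support is the total number of occurrences and that patterns contain no don't-care characters.

```python
def count_occurrences(p, sequences):
    """Total number of (possibly overlapping) occurrences of p."""
    return sum(s.startswith(p, j)
               for s in sequences
               for j in range(len(s) - len(p) + 1))

def maximal_patterns(sequences, min_len=2, K=2):
    """Dot-free patterns with support >= K that have no more
    specific pattern with the same number of occurrences."""
    candidates = {s[i:j]
                  for s in sequences
                  for i in range(len(s))
                  for j in range(i + min_len, len(s) + 1)}
    frequent = {}
    for p in candidates:
        support = count_occurrences(p, sequences)
        if support >= K:
            frequent[p] = support
    return sorted(
        p for p, n in frequent.items()
        if not any(q != p and p in q and m == n
                   for q, m in frequent.items())
    )

S = ["ABCDEAFDE", "BCFDEABCD", "BCEADEFDE"]
print(maximal_patterns(S))
```

The output contains the same six patterns listed on the slide, which supports reading K as an occurrence count rather than a count of sequences.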

39
Performance
  • The single most important factor which affects
    the performance of the algorithm is the amount of
    similarity between the input sequences
  • Time complexity [5]: pseudo-linear in the output
    size

40
Applications
  • Computational Biology
  • Find patterns in DNA/protein sequences for
    understanding and interpretation of biological
    similarities
  • Computer security [1]
  • Use the Teiresias algorithm to find
    variable-length patterns in training data
    consisting of system call traces of commands
    under normal execution.