Event correlation and data mining for event logs

1
Event correlation and data mining for event logs
  • Risto Vaarandi
  • SEB Eesti Ühispank
  • risto.vaarandi@seb.ee

2
Outline
  • Event logging and event log monitoring
  • Event correlation concept and existing
    solutions
  • Simple Event Correlator (SEC)
  • Frequent itemset mining for event logs
  • Data clustering for event logs
  • Discussion

3
Event logging
  • Event - a change in the system state (e.g., a
    disk failure). When a system component
    (application, network device, etc.) encounters an
    event, it can emit an event message that
    describes the event.
  • Event logging - a procedure of storing event
    messages to a local or remote (usually flat-file)
    event log.
  • Event logs play an important role in modern IT
    systems:
  • many system components like applications,
    servers, and network devices have built-in
    support for event logging (with the BSD syslog
    protocol being a widely accepted standard),
  • since in most cases event messages are appended
    to event logs in real-time, event logs are an
    excellent source of information for monitoring
    the system (a number of tools like Swatch and
    Logsurfer have been developed for log
    monitoring),
  • information that is stored to event logs can be
    useful for analysis at a later time, e.g., for
    audit procedures.

4
Centralized logging infrastructure
(Diagram: centralized logging infrastructure.)
Applications, servers, and network devices use
the syslog protocol for logging their events to
the central log server that runs a syslog server.
Log monitoring takes place on the central log
server and alerts are sent to the monitoring
console.
5
Event log monitoring
  • Dec 18 08:47:26 myhost daemon.info sshd[15172]:
    log: Connection from 10.2.211.19 port 1304
  • Dec 18 08:47:39 myhost daemon.info sshd[15172]:
    log: Password authentication for alex accepted.
  • Dec 18 08:50:09 myhost kern.crit vmunix:
    /var/tmp: file system full
  • Dec 18 08:50:10 myhost mail.debug imapd[15399]:
    imap service init from 10.2.213.2
  • Dec 18 08:50:10 myhost mail.info imapd[15399]:
    Login user=risto host=risto2 [10.2.213.2]
  • Commonly used log monitoring tools: Swatch,
    Logsurfer, etc. (see http://www.loganalysis.org
    for more information).
  • Current log monitoring practice: match logged
    lines in real time with regular expressions and
    perform an action (e.g., send an alert) when a
    matching line is observed (a minimal sketch of
    this matching loop is shown after this list).
  • Open issues: in order to write rules (regexp →
    action) for log monitoring tools, one must have a
    good knowledge about the IT system and its log
    messages; existing tools don't support event
    correlation well.
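The following is a minimal Perl sketch of the regexp → action monitoring
style described above. It is an illustration only, not an excerpt from
Swatch or Logsurfer; the log file name and the pattern are made up.

  #!/usr/bin/perl
  # A minimal sketch of regexp -> action log monitoring: follow a log file
  # and react when a line matches a pattern (hypothetical file and pattern).
  use strict;
  use warnings;

  open(my $log, "<", "/var/log/messages") or die "cannot open log file: $!";
  seek($log, 0, 2);             # start from the end of the file, like tail -f

  while (1) {
      while (my $line = <$log>) {
          if ($line =~ /file system full/) {
              print "ALERT: $line";   # a real tool would e.g. send an e-mail here
          }
      }
      sleep(1);                 # wait for new lines to be appended
      seek($log, 0, 1);         # clear the EOF flag so reading can resume
  }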

6
Event correlation
  • Event correlation - a conceptual interpretation
    procedure where new meaning is assigned to a set
    of events that happen within a predefined time
    interval [Jakobson and Weissman, 1995]. During
    the event correlation process, new events might
    be inserted into the event stream and original
    events might be removed.
  • Examples:
  • if 10 "login failure" events occur for a user
    within 5 minutes, generate a "security attack"
    event,
  • if both "device internal temperature too high"
    and "device not responding" events have been
    observed within 5 seconds, replace them with the
    event "device down due to overheating".
  • A number of approaches have been proposed for
    event correlation (rule-based, codebook based,
    neural network based etc. methods), and a number
    of event correlation products are available on
    the market (HP ECS, SMARTS, NerveCenter,
    RuleCore, LOGEC, etc.)

7
Event correlation approaches
  • Rule-based (HP ECS, IMPACT, RuleCore, etc.) -
    events are correlated according to rules
    (condition → action) that are specified by the
    human analyst.
  • Codebook based (SMARTS) - if a set of events
    e1,...,ek must be interpreted as event A, then
    e1,...,ek are stored to the codebook as a
    bit-vector pointing to A. In order to correlate a
    set of events, look for the most closely matching
    vector in the codebook, and report the
    interpretation that corresponds to the vector.
  • Graph based - find all dependencies between
    system components (network devices, hosts,
    services, etc.) and construct a graph with each
    node representing a system component and each
    edge a dependency between two components. When a
    set of fault events occurs, use the graph for
    finding the possible root cause(s) of the fault
    events (e.g., 10 "HTTP server not responding"
    events were caused by the failure of a single
    network link).
  • Neural network based - a neural net is trained
    for the detection of anomalies in the event
    stream, root cause(s) of fault events, etc.

8
Motivation for developing SEC
  • Existing event correlation products have the
    following drawbacks:
  • Complex design and resource requirements
  • they are mostly heavyweight solutions that are
    difficult to deploy and maintain, and that
    require extensive user training
  • they are not very suitable for application on
    nodes with limited computing resources (e.g.,
    nodes in sensor and ad hoc networks)
  • many products are based on the client-server
    model which is inconvenient for fully distributed
    event correlation,
  • Platform and domain dependence
  • they are usually distributed in a binary form for
    a limited number of OS platforms
  • some products are designed for one system
    management platform only
  • some products have been designed for network
    fault management and their application in other
    domains (including event log monitoring) is
    cumbersome,
  • The issue of pricing - they are quite expensive
    (at the present time, SEC is the only freely
    available event correlation engine).
  • Summary: there is a need for lightweight,
    platform-independent, and open source event
    correlation solutions, since heavyweight
    proprietary systems are infeasible for many tasks
    and environments.

9
Key features of SEC
  • Uses the rule-based approach for event correlation -
    this approach was chosen because of its
    naturalness of knowledge representation and the
    transparency of the event correlation process to
    the end user,
  • Written in Perl, thus open-source and
    cross-platform,
  • Licensed under the terms of GNU GPL,
  • Easy to install and configure - no need for
    compiling and linking the source, no dependencies
    on other software, and configuration files can be
    edited with any text editor like vi,
  • Small in size and doesn't consume much system
    resources (CPU time and memory),
  • Reads input from log files, named pipes, and
    standard input (an arbitrary number of input
    sources can be specified); employs regular
    expression patterns, Perl subroutines, substrings,
    and truth values for matching input events,
  • Can be used as a standalone event log monitoring
    solution, but also integrated with other
    applications through file/named pipe interface.

10
SEC configuration
  • SEC event correlation rules are stored in regular
    text files.
  • Rules from one configuration file are applied
    sequentially, in the same order as they are given
    in the file; rule sets from different
    configuration files are applied virtually in
    parallel.
  • Most rules have the following components:
  • event matching pattern
  • optional Boolean expression of contexts
  • event correlation key
  • correlation information (e.g., event counting
    threshold and window)
  • list of actions
  • With appropriate patterns, context expressions,
    and action lists, several rules can be combined
    into one event correlation scheme.
  • When an event matches a rule, the event correlation
    key is calculated, and if there exists an event
    correlation operation with the same key, the event
    is correlated by that operation. If there is no
    such operation and the rule specifies a correlation
    of events over time, the rule starts a new
    operation with the calculated key.

11
SEC rule types
  • Single - execute an action list when a matching
    event is observed.
  • SingleWithScript - like Single, but also use an
    external script for matching.
  • SingleWithSuppress - like Single, but ignore
    following matching events for t seconds.
  • Pair - execute an action list on event A and
    ignore following instances of A until event B
    arrives. On the arrival of B execute another
    action list.
  • PairWithWindow - after observing event A, wait
    for t seconds for event B to arrive. If B does
    not arrive on time, execute an action list,
    otherwise execute another action list.
  • SingleWithThreshold - count matching input events
    during t seconds and if a given threshold is
    exceeded, execute an action list.
  • SingleWith2Thresholds - like SingleWithThreshold,
    but with a second round of counting with a
    falling threshold.
  • Suppress - suppress matching input events.
  • Calendar - execute an action list at specific
    times.

12
SEC pattern types
  • Regular expressions (Perl dialect), Perl
    subroutines (precompiled), substrings, and truth
    values are supported as patterns. Examples:
  • Sample matching input line:
  • Dec 18 08:50:09 myhost kern.crit vmunix:
    /var/tmp: file system full
  • Note that the $1 variable will be set to the
    file system name (/var/tmp):
  • pattern=(\S+): [fF]ile system full
  • Sample matching input line:
  • Oct 13 10:34:51 sensor1 auth.alert snort:
    Portscan detected from
  • 213.31.250.205 Talker(fixed: 1 sliding: 1)
    Scanner(fixed: 15 sliding: 15)
  • $1 = scanner IP address, $2 = scanner hostname,
    $3 = portscan description:
  • pattern=sub { if ($_[0] =~ /Portscan detected
    from (\S+) (.+)/) { \
  • my($ip, $text) = ($1, $2); my($name) = $ip;
    my($rr); \
  • my($res) = Net::DNS::Resolver->new; my($query) =
    $res->search($ip, "PTR"); \
  • if ($query) { foreach $rr ($query->answer) { \
  • if ($rr->type eq "PTR") { $name = $rr->ptrdname;
    last; } } } \
  • return ($ip, $name, $text); } else { return 0; } }

13
SEC contexts and actions
  • SEC context - a logical entity that can be
    created or deleted from a rule (or internally by
    SEC for tagging input). At creation, the context
    lifetime can be set to a certain finite value
    (e.g., 20 seconds).
  • The presence or absence of a context can decide
    whether a rule is applicable or not (e.g., if a
    rule definition has "A OR B" specified for its
    context expression, and neither context A nor B
    exists, the rule will not be applied).
  • A context can act as an event store: events can
    be associated with a context, and all the
    collected events supplied for external processing
    at a later time (e.g., collect events from a
    suspicious FTP session, and mail them to the
    security administrator at the end of the session;
    a rough rule sketch for this is given after this
    list).
  • SEC actions invoke external programs (shellcmd,
    spawn, etc.), generate synthetic events (event),
    reset event correlation operations (reset),
    perform context operations (create, report,
    etc.), set user-defined variables (assign, etc.),
    write to files or FIFOs (write), execute Perl
    programs or precompiled subroutines (eval, call).
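As a rough illustration of a context acting as an event store, the SEC rule
sketch below collects the messages of an FTP session into a context and
mails them when the session ends. The ftpd message formats and the mail
recipient are hypothetical; this is a sketch, not a rule set from the
presentation.

# collect the messages of an FTP session; mail them at session end
type=Single
ptype=RegExp
pattern=ftpd\[(\d+)\]: connection from (\S+)
desc=FTP session $1 opened from $2
action=create FTP_SESSION_$1 3600; add FTP_SESSION_$1 $0

type=Single
ptype=RegExp
pattern=ftpd\[(\d+)\]: session closed
context=FTP_SESSION_$1
desc=FTP session $1 closed
action=add FTP_SESSION_$1 $0; report FTP_SESSION_$1 /bin/mail -s 'FTP session' root; delete FTP_SESSION_$1

type=Single
ptype=RegExp
pattern=ftpd\[(\d+)\]: .+
context=FTP_SESSION_$1
desc=FTP session $1 event
action=add FTP_SESSION_$1 $0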

14
A sample SEC rule
  • This example assumes that SEC has been started
    with the --intcontexts option:
  • type=SingleWithSuppress
  • ptype=RegExp
  • pattern=(\S+)\s+kern\.crit\s+vmunix: (\S+): [fF]ile system full
  • context=_FILE_EVENT_/logs/srv1.messages ||
    _FILE_EVENT_/logs/srv2.messages
  • desc=$1:$2 file system full
  • action=pipe 'File system $2 at host $1 full' mail
    -s 'FS full' root
  • window=900
  • When a "file system full" message is logged to
    either /logs/srv1.messages or /logs/srv2.messages,
    SEC will send an e-mail alert to the local root
    user, and ignore repeated "file system full"
    messages for the same host and the same file
    system for 900 seconds.
  • The desc parameter defines the event correlation
    key and the scope of event correlation: if we
    replaced the key "$1:$2 file system full" with
    just "file system full", we would get just one
    alert (at 12:30) for the following messages:
  • Oct 13 12:30:00 srv1 kern.crit vmunix: /tmp:
    file system full
  • Oct 13 12:35:00 srv2 kern.crit vmunix: /home:
    file system full

15
A sample SEC rule base for Cisco router/switch
syslog messages
If a router interface is in down state for less
than 15 seconds, generate the event "<router>
INTERFACE <interface> SHORT OUTAGE", otherwise
generate the event "<router> INTERFACE <interface>
DOWN".

type=PairWithWindow
ptype=RegExp
pattern=(\S+) \d+: %LINK-3-UPDOWN: Interface (.+), changed state to down
desc=$1 INTERFACE $2 DOWN
action=event %s
ptype2=RegExp
pattern2=($1) \d+: %LINK-3-UPDOWN: Interface ($2), changed state to up
desc2=$1 INTERFACE $2 SHORT OUTAGE
action2=event %s
window=15

If the "<router> INTERFACE <interface> DOWN" event
is received from the previous rule, send a
notification and wait for an "interface up" event
for the next 24 hours.

type=Pair
ptype=RegExp
pattern=(\S+) INTERFACE (\S+) DOWN
desc=$1 interface $2 is down
action=shellcmd notify.sh "%s"
ptype2=RegExp
pattern2=($1) \d+: %LINK-3-UPDOWN: Interface ($2), changed state to up
desc2=$1 interface $2 is up
action2=shellcmd notify.sh "%s"
window=86400

If ten "short outage" events from the first rule
have been observed in a window of 6 hours, send a
notification.

type=SingleWithThreshold
ptype=RegExp
pattern=(\S+) INTERFACE (\S+) SHORT OUTAGE
desc=Interface $2 at node $1 is unstable
action=shellcmd notify.sh "%s"
window=21600
thresh=10
16
The work of the sample rule base
(Diagram: how the rule base processes "Interface A@B down" and
"Interface A@B up" events arriving at different time intervals.)
17
SEC application experience
  • Employed by many companies and other institutions
    around the world: major North-American telecom
    companies with very large networks, universities,
    financial institutions, a major European car
    manufacturer, a major US manufacturer of medical
    equipment, etc.
  • Has been employed for event log monitoring, as an
    event correlator for major network and system
    management platforms (HP OpenView, CiscoWorks,
    BMC Patrol, Nagios), event correlator for Snort
    IDS and other security systems, event correlator
    in medical systems and weather forecast systems,
    etc.
  • Has been used on Linux, Solaris, HP-UX, AIX,
    FreeBSD, Tru64 UNIX, Mac OS X, and Windows (with
    CygWin Perl).
  • Part of the Debian and Gentoo Linux distributions.
  • As of October 2005, the SEC mailing list had more
    than 260 subscribers and contained more than
    1,000 posts.
  • Papers, book chapters, etc.: Hardening Linux,
    Chapter 5 (Apress, 2005), the March 2005 issue of
    Sys Admin Magazine, USENIX LISA'04 (both
    technical sessions and tutorials), IEEE IPOM'02.

18
Data mining for event logs
  • Data mining for event logs has been identified as
    an important system management task - detected
    knowledge can be used for building rules for
    event correlation systems or event log monitoring
    tools, improving the design of web sites, etc.
  • Recently proposed mining algorithms are mostly
    based on the Apriori algorithm for mining
    frequent itemsets, and are designed for mining
    frequent patterns of event types. The event log
    is viewed as a sequence E1,...,En, where
    Ei = (ti, ei), ti is the occurrence time of Ei,
    ei is the type of Ei, and if i < j then ti ≤ tj.
    A frequent pattern can be defined in several
    ways, with the most common definitions being
    window- and slice-based.
  • Shortcomings of existing mining approaches:
  • Apriori is known to be inefficient for mining
    longer patterns,
  • infrequent events remain undetected but are often
    interesting (e.g., fault events are normally
    infrequent but highly interesting),
  • focused on mining event type patterns from
    preprocessed event logs, ignoring patterns of
    other sorts (in particular, line patterns from
    raw event logs help one to find event types or
    write rules for log monitoring tools).

19
Frequent itemset mining problem
  • In this talk, an efficient frequent itemset
    mining algorithm will be presented that can be
    employed for mining both line and event type
    patterns from event logs.
  • Let I = {i1,...,in} be a set of items. If X ⊆ I,
    X is called an itemset, and if |X| = k, X is
    called a k-itemset.
  • A transaction is a tuple (tid, X), where tid is a
    transaction identifier and X is an itemset. A
    transaction database D is a set of transactions
    (with each transaction having a unique id).
  • The cover of an itemset X is the set of
    identifiers of transactions that contain X:
    cover(X) = {tid | (tid, Y) ∈ D, X ⊆ Y}.
  • The support of an itemset X is the number of
    elements in its cover: supp(X) = |cover(X)|.
  • The frequent itemset mining problem - given the
    transaction database D and the support threshold
    s, find the set of frequent itemsets
    {X ⊆ I | supp(X) ≥ s} and the supports of the
    frequent itemsets (a small worked example of
    cover() and supp() is given after this list).
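The Perl sketch below is a toy worked example of cover() and supp() on a
three-transaction database; it only illustrates the definitions above and
is not part of the presented mining algorithm.

  # A toy illustration of cover(X) and supp(X) on a small transaction database.
  use strict;
  use warnings;

  my %D = (                       # transaction database: tid => itemset
      1 => [qw(a b c d e)],
      2 => [qw(a b c)],
      3 => [qw(b c d)],
  );

  sub cover {                     # cover(X) = { tid | (tid, Y) in D, X subset of Y }
      my @X = @_;
      my @tids;
      for my $tid (sort { $a <=> $b } keys %D) {
          my %items = map { $_ => 1 } @{ $D{$tid} };
          push @tids, $tid unless grep { !$items{$_} } @X;
      }
      return @tids;
  }

  sub supp { return scalar cover(@_); }       # supp(X) = |cover(X)|

  print "cover({b,c}) = {", join(",", cover(qw(b c))), "}\n";   # {1,2,3}
  print "supp({b,c}) = ", supp(qw(b c)), "\n";                  # 3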

20
Event logs as transaction databases
  • For mining frequent event type patterns, the
    algorithm uses a slice-based approach:
  • the event log L = {E1,...,En} will be divided
    into non-overlapping slices S1,...,Sm:
    L = S1 ∪ ... ∪ Sm, Si ∩ Sj = ∅ for i ≠ j.
  • Slices are formed according to some criterion
    that is specified by the end user, e.g., each
    slice contains events from one session in the
    Internet bank.
  • All possible event types act as the set of items
    I, and each slice can be considered a transaction
    with its tid between 1 and m. For the slice
    Si = {E1,...,Ek}, where Ej = (tj, ej) is an event
    from Si (tj - occurrence time of Ej, ej - type of
    Ej), the transaction corresponding to Si is
    (i, {e1,...,ek}).
  • For mining line patterns from raw event logs, the
    algorithm
  • views lines and patterns as sets of (word,
    position) pairs, e.g., the pattern "Router *
    interface * down" corresponds to the set
    {(Router, 1), (interface, 3), (down, 5)}
    (positions 2 and 4 are wildcards),
  • all distinct (word, position) pairs from the
    event log form the set of items I, and each event
    log line forms a transaction (if there are n
    lines, we use line numbers from 1 to n as tids).
    A sketch of this representation is given after
    this list.
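The sketch below (toy Perl code, not LogHound itself; the sample lines are
made up) shows how raw log lines can be represented as transactions of
(word, position) items, with line numbers as transaction identifiers.

  # Represent raw log lines as transactions of (word, position) items.
  use strict;
  use warnings;

  my @lines = (
      "Router myrouter1 interface eth0 down",
      "Router myrouter2 interface eth1 down",
  );

  my %transaction;                # tid => list of "(word,position)" items
  my %count;                      # item => number of lines containing it
  for my $tid (1 .. scalar @lines) {
      my @words = split ' ', $lines[$tid - 1];
      my @items = map { "($words[$_]," . ($_ + 1) . ")" } 0 .. $#words;
      $transaction{$tid} = \@items;
      $count{$_}++ for @items;
  }

  # items present in both lines correspond to the pattern "Router * interface * down"
  for my $item (sort keys %count) {
      print "$item occurs in $count{$item} line(s)\n";
  }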

21
Frequent itemset mining algorithms
  • For the sake of efficient implementation, most
    algorithms order the items according to certain
    criteria. In the rest of this presentation, we
    assume that the set of items I is an ordered set
    and that if X = {x1,...,xn} is an itemset, then
    x1 < ... < xn.
  • Apriori algorithm (works in a breadth-first
    manner): discovered frequent k-itemsets are used
    to form candidate (k+1)-itemsets, and frequent
    (k+1)-itemsets are found from the set of
    candidates by making a pass over the database and
    counting the support of the candidates.
  • Candidate generation is based on the following
    observation: each subset of a frequent itemset
    must also be frequent. For each pair of frequent
    k-itemsets {x1,...,xk} and {x1,...,xk-1,yk},
    generate a candidate by joining them:
    {x1,...,xk,yk}. According to the observation, all
    frequent (k+1)-itemsets are included in the set
    of candidates. A sketch of this join step is
    shown below.
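The following Perl sketch illustrates the Apriori join step for candidate
generation; it is an illustration under the ordering assumption above, not
code from the presented algorithm, and it omits subset-based pruning.

  # Apriori-style candidate generation: join frequent k-itemsets that share
  # their first k-1 items.
  use strict;
  use warnings;

  sub gen_candidates {
      my @freq = @_;              # frequent k-itemsets, each a sorted array ref
      my @cand;
      for my $i (0 .. $#freq) {
          for my $j ($i + 1 .. $#freq) {
              my @a = @{ $freq[$i] };
              my @b = @{ $freq[$j] };
              my $same_prefix = 1;
              for my $p (0 .. $#a - 1) {
                  if ($a[$p] ne $b[$p]) { $same_prefix = 0; last; }
              }
              # join {x1,...,xk} and {x1,...,xk-1,yk} into {x1,...,xk,yk}
              push @cand, [ @a, $b[-1] ] if $same_prefix;
          }
      }
      return @cand;
  }

  # frequent 2-itemsets {a,b}, {a,c}, {b,c} yield the candidate 3-itemset {a,b,c}
  print join(",", @$_), "\n" for gen_candidates([qw(a b)], [qw(a c)], [qw(b c)]);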

22
Apriori itemset trie
Transaction db: (1, abcde) (2, abc) (3, bcd) (4, abc) (5, ab);
support threshold 2; item ordering a < b < c < d < e.
(Diagram: the Apriori itemset trie built for this database, with node
counters showing itemset supports; a candidate with insufficient support
is removed.)
Each edge in the trie is labeled with the name of a certain item. Apriori
builds the itemset trie layer by layer - when the node layer at depth k is
complete, each node at depth k represents a frequent k-itemset: the path to
that node identifies the items in the set, and the counter in that node the
support of the set. Candidate generation: for each node N at depth k, create
candidate child nodes by inspecting the sibling nodes of N.
23
Breadth-first vs. depth-first
  • Shortcomings of Apriori: when the database
    contains larger frequent itemsets (e.g.,
    containing 30-40 items), the trie will become
    very large (there will be 2^k - 1 nodes in the
    trie for each frequent k-itemset). As a result,
    the runtime and memory cost of the algorithm will
    be prohibitive.
  • Eclat and FP-growth algorithms: the algorithms
    first load the transaction database into main
    memory. At each step they will search for
    frequent itemsets {p1,...,pk-1,x} with a certain
    prefix P = {p1,...,pk-1}, where P is a previously
    detected frequent itemset. The in-memory
    representation of the database allows the
    algorithm to search only transactions that
    contain P. After the search, the prefix for the
    next step will be chosen from the detected
    frequent itemsets (or found by backtracking).
  • Drawback: the transaction database must fit into
    main memory, but this is not always the case
    (also for many event log data sets!).
  • Proposed solution: use the breadth-first approach
    and the itemset trie data structure, with special
    techniques for speeding up the mining process and
    reducing its memory consumption.

24
Properties of event log data
  • For raw event logs, the transaction database can
    contain a large number of items (e.g., values
    exceeding 10 million are common). However, only a
    few items are relatively frequent (e.g., occur at
    least once per 1,000 transactions), and most
    items appear only a few times in the data set.
  • Frequent itemsets may contain many items (e.g.,
    for the support threshold 0.1 we have found that
    values of 20-30 are quite common). This means
    that Apriori is not always adequate for
    processing event log data.
  • There are often strong correlations between
    frequent items in transactions, e.g., item A
    never appears without item B. In the case of raw
    event logs, this is often caused by the log
    message formatting, e.g.,
  • syslog(LOG_INFO|LOG_DAEMON, "Connection from %s",
    ipaddress);

25
Mining frequent items
The mining of frequent items (1-itemsets) is the
first step of any breadth-first algorithm.
Problem: the number of items in the data set can
be very large, and the memory cost of item
counting (a database pass with item counters kept
in memory) can be very high. Solution: make an
extra pass over the database before the counting,
and hash each item to an integer value 0..m-1
(m << the number of items). Each time the value i
is calculated by the hashing function, increment
the i-th counter in the summary vector. Then make
another database pass and count only those items
for which the counter values are not below the
support threshold. Example (illustrated by a
diagram on the original slide): if the support
threshold is 10, only items D, E, and F can
possibly be frequent! Since one of the properties
of event log data is that the majority of items
are very infrequent, most counters are likely to
remain below the support threshold. Because the
vector itself requires only a small amount of
memory, the overall memory savings are usually
substantial. A sketch of the technique is given
below.
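A small Perl sketch of the summary vector technique follows. The data, the
threshold, and the hash function are toy values chosen for illustration;
the real implementation works over on-disk transaction data.

  # Summary vector technique: hash items to m counters during an extra pass,
  # then count exactly only items whose counter reaches the support threshold.
  use strict;
  use warnings;

  my @transactions = ( [qw(A D E)], [qw(B D E F)], [qw(C D F)], [qw(D E F)] );
  my $threshold    = 3;           # support threshold
  my $m            = 8;           # summary vector size, m << number of items

  sub bucket {                    # toy hash function mapping an item to 0..m-1
      my ($item) = @_;
      my $h = 0;
      $h = ($h * 31 + ord($_)) % $m for split //, $item;
      return $h;
  }

  my @summary = (0) x $m;         # pass 1: build the summary vector
  for my $t (@transactions) { $summary[ bucket($_) ]++ for @$t; }

  my %count;                      # pass 2: count only potentially frequent items
  for my $t (@transactions) {
      for my $item (@$t) {
          $count{$item}++ if $summary[ bucket($item) ] >= $threshold;
      }
  }
  my @frequent = sort grep { $count{$_} >= $threshold } keys %count;
  print "frequent items: @frequent\n";    # D, E, and F have support >= 3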
26
Transaction cache
  • Motivation: keep the most frequently used
    transaction data in memory for speeding up the
    work of the algorithm.
  • Observation: if F is the set of frequent items
    and (tid, X) is a transaction, we only need to
    consider the items X ∩ F (the frequent items of
    the transaction) during the mining process.
  • Transaction cache: use the summary vector
    technique for detecting which sets X ∩ F have a
    chance to correspond to C or more transactions,
    and load them into main memory (identical sets
    are stored as a single record with an occurrence
    counter); write the rest of the sets to disk. The
    value of C is given by the user.
  • Result: the cache is guaranteed to contain the
    most frequently used transaction data
    ({Y ⊆ I | |{(tid, X) ∈ D : X ∩ F = Y}| ≥ C}), and
    the amount of data stored to the cache is
    controlled by the user. There will be no
    dependency on the amount of main memory like with
    depth-first algorithms, while the cache hit ratio
    is likely to be quite high.

27
Reducing the size of the itemset trie
  • Motivation: with a smaller trie, less memory/CPU
    time is consumed.
  • Observation: when there are many strong
    correlations between frequent items in
    transactions, many parts of the Apriori itemset
    trie contain redundant information.
  • Let F = {f1,...,fn} be the set of all frequent
    items. We call the set
    dep(fi) = {fj | fi ≠ fj, cover(fi) ⊆ cover(fj)}
    the dependency set of fi, and say that fi has m
    dependencies if |dep(fi)| = m.
  • Dependency prefix of the item fi:
    pr(fi) = {fj | fj ∈ dep(fi), fj < fi}.
  • Dependency prefix of the itemset {fi1,...,fik}:
    pr({fi1,...,fik}) = pr(fi1) ∪ ... ∪ pr(fik).
  • The trie reduction technique: if pr(X) ⊄ X (i.e.,
    X does not contain its dependency prefix), don't
    create a node for X in the trie. In order to
    maximize the efficiency of the technique, reorder
    the frequent items so that if fi < fj, then
    |dep(fi)| ≤ |dep(fj)|.
  • Note that we can't generate candidates in a
    separate step like Apriori, since some nodes
    needed for that step could be missing from the
    trie due to the trie reduction technique. Thus,
    we generate candidates on-the-fly during support
    counting. A sketch of computing dependency sets
    and prefixes is given below.
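The Perl sketch below (toy code, not the LogHound implementation) computes
dependency sets, the dependency-ascending item order, and dependency
prefixes for the frequent items of the example database used on the next
slide.

  # Compute dep(), the item order, and pr() for the frequent items of the
  # example database (1,abcde) (2,abc) (3,bcd) (4,abc) (5,ab), threshold 2.
  use strict;
  use warnings;

  my %cover = (                   # covers of the frequent items
      a => { 1=>1, 2=>1, 4=>1, 5=>1 },
      b => { 1=>1, 2=>1, 3=>1, 4=>1, 5=>1 },
      c => { 1=>1, 2=>1, 3=>1, 4=>1 },
      d => { 1=>1, 3=>1 },
  );

  my %dep;            # dep(fi) = { fj | fi != fj, cover(fi) subset of cover(fj) }
  for my $fi (keys %cover) {
      $dep{$fi} = [];
      for my $fj (keys %cover) {
          next if $fi eq $fj;
          my $subset = !grep { !$cover{$fj}{$_} } keys %{ $cover{$fi} };
          push @{ $dep{$fi} }, $fj if $subset;
      }
  }

  # order items so that fi < fj implies |dep(fi)| <= |dep(fj)|
  my @order = sort { scalar @{ $dep{$a} } <=> scalar @{ $dep{$b} } } keys %cover;
  my %rank  = map { $order[$_] => $_ } 0 .. $#order;

  for my $fi (@order) {           # pr(fi) = { fj | fj in dep(fi), fj < fi }
      my @pr = grep { $rank{$_} < $rank{$fi} } @{ $dep{$fi} };
      printf "%s: dep={%s} pr={%s}\n", $fi,
             join(",", sort @{ $dep{$fi} }), join(",", sort @pr);
  }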

28
Building the reduced trie
Transaction db: (1, abcde) (2, abc) (3, bcd) (4, abc) (5, ab);
support threshold 2; item ordering b < c < a < d.
F = {a, b, c, d}, dep(d) = {b, c}, dep(c) = dep(a) = {b}, dep(b) = ∅.
This yields the ordering b < c < a < d, and the dependency prefixes are
pr(b) = ∅, pr(c) = pr(a) = {b}, pr(d) = {b, c}. Trie building: when the
depth of the trie is k, make a data pass and count all (k+1)-itemsets of
frequent items that begin with frequent k-itemsets present in the trie and
that contain their dependency prefixes. Candidate nodes are generated
during the counting.
29
Deriving all itemsets from the trie and further
optimizations
  • It can be shown that the non-root nodes of the
    trie represent all frequent itemsets that contain
    their dependency prefixes, and that all frequent
    itemsets can be derived from the non-root nodes:
    if a node represents an itemset X, we can derive
    the frequent itemsets {X \ Y | Y ⊆ pr(X)} from X,
    with all such itemsets having a support equal to
    supp(X).
  • Observation if the trie reduction technique was
    not applied at node N for reducing the number of
    its child nodes, and M is a child of N, then the
    sibling nodes of M contain all necessary nodes
    for the candidate generation in Apriori fashion.
  • If we augment the algorithm with such an
    optimization, the algorithm becomes a
    generalization of Apriori
  • if at node N the algorithm discovers that the
    trie reduction technique is no longer effective,
    it switches to Apriori for the subtrie that
    starts from N,
  • if there are no dependencies between frequent
    items (i.e., frequent items are weakly
    correlated), the algorithm switches to Apriori at
    the root node, i.e., it behaves like Apriori from
    the start.

30
The summary of the algorithm
  1. Make a pass over the database, detect frequent
    items, and order them in lexicographic order (if
    the number of items is very large, the summary
    vector technique can be used for filtering out
    irrelevant items). If no frequent items were
    found, terminate.
  2. Make a pass over the database, in order to
    calculate dependency sets for frequent items and
    to build the transaction summary vector.
  3. Reorder frequent items in dependency ascending
    order and find their dependency prefixes.
  4. Make a pass over the database, in order to create
    the cache tree and the out-of-cache file.
  5. Create the root node of the itemset trie and
    attach nodes for frequent items with empty
    dependency prefixes to the root node. If all
    frequent items have empty dependency prefixes,
    set the APR-flag in the root node.
  6. Let k = 1.
  7. Check all nodes in the trie at depth k. If the
    parent of a node N has the APR-flag set, generate
    candidate child nodes for the node N in Apriori
    fashion (node counters are set to zero), and set
    the APR-flag in the node N.
  8. Build the next layer of nodes in the trie using
    the trie reduction technique with the following
    additional condition if the APR-flag is set in
    a node at depth k, don't attach any additional
    candidate nodes to that node.
  9. Remove the candidate nodes (nodes at depth k+1)
    with counter values below the support threshold.
    If all candidate nodes were removed, output
    frequent itemsets and terminate.
  10. Find the nodes at depth k for which the trie
    reduction technique was not applied during step 8
    for reducing the number of their child nodes, and
    set the APR-flag in these nodes. Then let
    k = k + 1 and go to step 7.

31
LogHound and examples of detected patterns
  • The frequent itemset mining algorithm for event
    logs has been implemented in a tool called
    LogHound (written in C, distributed under the
    terms of GNU GPL).
  • Includes several features for preprocessing raw
    event logs (support for regular expression
    filters, line conversion templates, etc.)
  • Sample frequent line patterns detected with
    LogHound
  • Dec 18 myhost connect from
  • Dec 18 myhost log Connection from port
  • Dec 18 myhost fatal Did not receive ident
    string.
  • Dec 18 myhost log authentication for
    accepted.
  • Dec 18 myhost fatal Connection closed by
    remote host.
  • Sample frequent event type pattern detected with
    LogHound (the CodeRed worm footprint from the
    Snort IDS log)
  • WEB-IIS CodeRed v2 root.exe access
  • WEB-IIS cmd.exe access
  • HTTP Double Decoding Attack
  • WEB-IIS unicode directory traversal attempt

32
Data clustering for event logs
  • The data clustering problem: divide a set of data
    points into groups (clusters), where points
    within each cluster are similar to each other.
    Points that do not fit well into any of the
    detected clusters are considered to form a
    special cluster of outliers.
  • The data clustering algorithm presented in this
    talk has been designed for dividing event log
    lines into clusters, so that
  • each regular cluster corresponds to a certain
    frequently occurring line pattern (e.g.,
    Interface down),
  • the cluster of outliers contains rare lines
    (which possibly represent fault conditions or
    unexpected behavior of the system).
  • Traditional clustering methods assume that data
    points belong to the space R^n, and similarity
    between data points is measured in terms of the
    distance between points; many algorithms use a
    variant of the Lp norm as a distance function:
  • Lp(x, y) = (Σi=1..n |xi - yi|^p)^(1/p)
  • (L1 - Manhattan distance, L2 - Euclidean
    distance)

33
Traditional clustering methods
  • Weaknesses of traditional data clustering
    methods:
  • they are unable to handle non-numerical
    (categorical) data,
  • they don't work well in high-dimensional data
    spaces (n > 10),
  • they are unable to detect clusters in subspaces
    of the original data space.
  • When we view event log lines as data points with
    categorical attributes, where the m-th word of
    the line is the value of the m-th attribute,
    e.g.,
  • (Password, authentication, for, john, accepted),
  • then all the weaknesses listed above are also
    relevant for event log data sets.

34
Recent clustering algorithms that address the
problems
  • CLIQUE and MAFIA employ an Apriori-like algorithm
    for detecting clusters. Instead of measuring the
    distance between individual points, the
    algorithms identify dense regions in the data
    space and form clusters from these regions.
    First, clusters in 1-dimensional subspaces are
    detected. After clusters C1,...,Cm in
    (k-1)-dimensional subspaces have been detected,
    cluster candidates for k-dimensional subspaces
    are formed from C1,...,Cm. The algorithm then
    checks which candidates are actual clusters, etc.
    Unfortunately, Apriori's performance deteriorates
    as k increases (for detecting a cluster in a
    k-dimensional space, 2^k - 2 of its superclusters
    must be produced first).
  • CACTUS makes a pass over the data and builds a
    data summary, and then makes another pass over
    the data to find clusters using the summary.
    Although fast, CACTUS generates clusters with
    stretched shapes, which is undesirable for log
    file data clustering.
  • PROCLUS uses the K-medoid method to partition the
    data space into K clusters. However, it is not
    obvious what the right value for K is.

35
The algorithm features and definitions
  • The algorithm views every event log line as a
    data point with categorical attributes, where the
    m-th word of the line is the value for the m-th
    attribute.
  • Makes only a few passes over the data (like the
    CACTUS algorithm).
  • Uses a density-based approach for clustering
    (like the CLIQUE and MAFIA algorithms):
    identifies dense regions in the data space and
    forms clusters from them.
  • Region S - a subset of the data space, where
    certain attributes i1,...,ik of all points that
    belong to the region S have identical values
    v1,...,vk: ∀x ∈ S, xi1 = v1, ..., xik = vk. (Note
    that each region corresponds to a line pattern,
    e.g., "Password authentication for * accepted".)
  • 1-region - a region with one fixed attribute i1
    (i.e., k = 1).
  • Dense region - a region that contains at least N
    points, where N is the support threshold value
    given by the user.

36
The algorithm basic steps
  1. Make a pass over the data, and identify all dense
    1-regions (note that dense 1-regions correspond
    to frequent words in the data set).
  2. Make another pass over the data, and generate
    cluster candidates. For every line that contains
    dense 1-regions, create a cluster candidate by
    combining the fixed attributes of these regions.
    (For example, if the line is "Password
    authentication for john accepted", and the words
    Password, authentication, for, and accepted are
    frequent, then the candidate is "Password
    authentication for * accepted".) If the candidate
    is not present in the candidate table, it will be
    put there with a support value of 1; otherwise
    its support value will be incremented.
  3. Optional step: for each candidate C, find all
    candidates representing more specific patterns,
    and add their support values to the support value
    of C.
  4. Find which candidates in the candidate table
    have support values equal to or greater than the
    support threshold, and output them as clusters.
  5. Detect outliers during a separate data pass. (A
    small sketch of steps 1, 2, and 4 is given
    below.)
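The toy Perl sketch below illustrates steps 1, 2, and 4 of the algorithm;
it is not SLCT itself, and the example lines and threshold are made up.

  # Cluster candidate generation from dense 1-regions (frequent words at
  # fixed positions); candidates reaching the threshold become clusters.
  use strict;
  use warnings;

  my @lines = (
      "Password authentication for john accepted",
      "Password authentication for alex accepted",
      "Password authentication for risto accepted",
      "Connection from 10.1.1.1 closed",
  );
  my $threshold = 2;

  my %region;                     # step 1: count (word, position) occurrences
  for my $line (@lines) {
      my @w = split ' ', $line;
      $region{"$w[$_] $_"}++ for 0 .. $#w;
  }

  my %candidate;                  # step 2: build candidates from frequent words
  for my $line (@lines) {
      my @w = split ' ', $line;
      my @pat = map { $region{"$w[$_] $_"} >= $threshold ? $w[$_] : "*" } 0 .. $#w;
      next unless grep { $_ ne "*" } @pat;   # skip lines with no dense 1-regions
      $candidate{ join(" ", @pat) }++;
  }

  for my $c (sort keys %candidate) {         # step 4: report clusters
      print "support $candidate{$c}: $c\n" if $candidate{$c} >= $threshold;
  }
  # prints: support 3: Password authentication for * accepted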

37
SLCT Simple Logfile Clustering Tool
  • The event log clustering algorithm has been
    implemented in a tool called SLCT (written in C,
    distributed under the terms of GNU GPL).
  • Supports regular expression filters, so that
    particular lines (or a particular cluster) can be
    inspected more closely.
  • Supports line conversion templates (i.e., before
    inspection, certain parts of a line are assembled
    into a new line), e.g., with the filter
    'sshd\[[0-9]+\]: (.+)' and the template '$1', the
    line "sshd[2781]: connect from 10.1.1.1" will be
    converted to "connect from 10.1.1.1".
  • A custom word delimiter can be specified (the
    default is whitespace, but a custom regular
    expression can be given).
  • SLCT can refine the variable parts of patterns by
    looking for constant heads and tails, e.g., the
    pattern "connect from *" is converted to "connect
    from 192.168.*".
  • After the first run, SLCT can be applied
    iteratively to the file of outliers, until the
    file is relatively small and can be inspected
    manually for unknown faults and anomalies.

38
Examples of patterns detected with SLCT
  • Sample clusters detected with SLCT
  • sshd connect from 1
  • sshd log Connection from 1 port
  • sshd log authentication for accepted.
  • sshd log Closing connection to 1
  • Sample outliers detected with SLCT
  • sendmail NOQUEUE SYSERR() can not
    chdir(/var/spool/mqueue/) Permission denied
  • sendmail SYSERR(root) collect I/O
    error on connection from , from=<>
  • sendmail SYSERR(root) putbody write
    error Input/output error
  • login FAILED LOGIN 1 FROM (null) FOR root,
    Authentication failure
  • sshd Failed password for root from
    port ssh2
  • imapd Unable to load certificate from ,
    host
  • imapd Fatal disk error user host
    mbx Disk quota exceeded

39
References
  • SEC home page: http://simple-evcorr.sourceforge.net
  • LogHound home page: http://kodu.neti.ee/~risto/loghound/
  • SLCT home page: http://kodu.neti.ee/~risto/slct/
  • My publications: see http://kodu.neti.ee/~risto