XRules: An Effective Structural Classifier for XML Data

1
XRules: An Effective Structural Classifier for XML Data
Mohammed J. Zaki, Rensselaer Polytechnic Institute
Charu C. Aggarwal, IBM T.J. Watson Research Center, charu@us.ibm.com
2
Motivation
  • Many prior classification systems treat text
    files as arbitrary signals or vectors.
  • XML implies a meaningful structure on text
    documents.
  • XRULES seeks to take advantage of the structural
    information.

3
XML
  • eXtensible Markup Language
  • Structure is not enforced, but usually followed.
  • DTD
  • Database
  • Document tree
  • Fragments vs Documents

4
XRULES Approach
  • The appearance and/or frequency of a structural
    pattern in an XML document implies the document's
    class.
  • Example: letters vs. journal articles.
  • Example:
  • Applied Spectroscopy: numerous and large tables of data
  • IEEE Micro: small tables
  • Arabian: very few but large tables

5
STRUCTURAL RULES CONCEPTS
  • Model XML documents as ordered, labeled, rooted
    trees, i.e., child order matters, and each node
    has a label.
  • Do not distinguish between attributes and
    elements (tags) of an XML document; both are
    mapped to the label set.

6
Attributes vs Elements <From W3C.org>
  • Attribute
  • <person sex="female">
  • <firstname>Anna</firstname>
  • <lastname>Smith</lastname>
  • </person>
  • Element
  • <person>
  • <sex>female</sex>
  • <firstname>Anna</firstname>
  • <lastname>Smith</lastname>
  • </person>
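
As a rough illustration of mapping attributes and elements into one label set, the sketch below parses both forms with Python's xml.etree.ElementTree. The function name to_labeled_tree is hypothetical, and ignoring attribute values and element text is an assumption made here; the slides do not say how values are handled.

```python
import xml.etree.ElementTree as ET


def to_labeled_tree(element):
    """Return (label, children) for an element.

    Assumption: attribute names become child nodes placed before the
    element children, and attribute values and text are ignored.
    """
    attribute_nodes = [(name, []) for name in element.attrib]
    element_nodes = [to_labeled_tree(child) for child in element]
    return (element.tag, attribute_nodes + element_nodes)


attribute_form = (
    '<person sex="female">'
    "<firstname>Anna</firstname><lastname>Smith</lastname>"
    "</person>"
)
element_form = (
    "<person><sex>female</sex>"
    "<firstname>Anna</firstname><lastname>Smith</lastname>"
    "</person>"
)

tree_a = to_labeled_tree(ET.fromstring(attribute_form))
tree_b = to_labeled_tree(ET.fromstring(element_form))
print(tree_a == tree_b)
```

Under these assumptions both documents reduce to the same ordered, labeled, rooted tree.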

7
XMINER
  • XMINER is a tool to detect structural patterns in
    data.
  • A reasonable stipulation is that the data is
    ordered, labeled, and rooted; this is a
    convention of XML and is easily validated.
  • Embedded subtrees.
  • Uses a prefix scheme for subtree comparisons
  • Compute pattern frequency based on joins on scope
    lists
  • Enumerate subtrees using DFS

8
XMINER: Algorithm
  • Start with leaf nodes
  • Combine elements pair-wise and find frequent
    patterns
  • Then repeatedly generate larger subtrees by
    combining frequent subtrees at the current level
  • Continue until all frequent subtrees are
    enumerated

9
Structural Rules Concepts
  • XML as Trees
  • Subtrees
  • Isomorphic: same shape
  • Induced: isomorphic with the same labels
  • Embedded: induced, but with possible interlopers
    (intervening nodes)

10
Embedded Subtrees BAC
(Figure: example trees with nodes labeled A, B, and C)
11
Embedded Subtrees
  • Important to consider embedded subtrees because
    of the nature of XML and its structural rules
  • Example:
  • person male firstname lastname
  • is similar to
  • person firstname lastname
  • is similar to
  • person name firstname lastname
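
A brute-force sketch of the induced vs embedded distinction, assuming ordered trees written as (label, children) pairs. The tree shapes below are one reading of the example on this slide, and the function names are hypothetical. This is a plain backtracking check, not the XMINER procedure.

```python
def flatten(tree):
    """Preorder list of (label, parent_index, last_descendant_index)."""
    nodes = []

    def visit(node, parent):
        label, children = node
        index = len(nodes)
        nodes.append(None)
        for child in children:
            visit(child, index)
        nodes[index] = (label, parent, len(nodes) - 1)

    visit(tree, None)
    return nodes


def contains(tree, pattern, embedded=True):
    """True if pattern occurs in tree as an ordered subtree.

    embedded=True lets a pattern edge skip intervening nodes;
    embedded=False requires direct parent-child edges (induced).
    """
    t_nodes = flatten(tree)
    p_nodes = flatten(pattern)

    def previous_sibling(i):
        parent = p_nodes[i][1]
        if parent is None:
            return None
        for j in range(i - 1, parent, -1):
            if p_nodes[j][1] == parent:
                return j
        return None

    def extend(i, mapping):
        if i == len(p_nodes):
            return True
        label, parent, _ = p_nodes[i]
        for t, (t_label, t_parent, _) in enumerate(t_nodes):
            if t_label != label:
                continue
            if parent is not None:
                anchor = mapping[parent]
                if embedded:
                    inside = anchor < t <= t_nodes[anchor][2]
                else:
                    inside = t_parent == anchor
                if not inside:
                    continue
                sibling = previous_sibling(i)
                # Keep left-to-right order: start after the sibling's subtree.
                if sibling is not None and t <= t_nodes[mapping[sibling]][2]:
                    continue
            if extend(i + 1, mapping + [t]):
                return True
        return False

    return extend(0, [])


pattern = ("person", [("firstname", []), ("lastname", [])])
with_extra = ("person", [("male", []), ("firstname", []), ("lastname", [])])
nested = ("person", [("name", [("firstname", []), ("lastname", [])])])
swapped = ("person", [("lastname", []), ("firstname", [])])
```

Here the pattern is found in nested only when embedded matching is allowed, because the name node sits between person and its descendants.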

12
XMINER: Support
  • Weighted support: how often a subtree appears in
    a document.
  • Relative support: the ratio of a subtree to all
    subtrees in a document.
  • Frequent: a subtree is frequent if its ratio is
    higher than a user-specified value.

13
XMINER: Cost-based Classification
  • Can change weighting to favor particular
    classifications
  • Example: a more likely class is weighted more
    heavily
  • Example: costly misclassifications are weighted
    less, so they are less likely to occur as false
    positives.

14
XMINER: Rule Support
  • A rule relates a frequent structure to a class
  • Global support: the joint probability of a
    subtree and a class, i.e., the percentage of the
    trees in the database that contain a subtree T
    and have class label ci
15
XMINER: Rule Strength/Support
  • Three methods for evaluating rules
  • Confidence: the ratio of trees containing a
    subtree T and labeled as ci to all of the trees
    in the data that contain T
  • Likelihood: the ratio of how often T is in a tree
    of class ci to how often it is in a tree that is
    not of class ci
  • Weighted confidence: a combination of the
    previous two
  • Which to use is data set dependent
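
The measures above can be worked through on simple counts. The sketch below is an illustration, not the authors' code: the function name and arguments are hypothetical, "how often" is read as the fraction of trees inside or outside the class that contain T, and the weighted-confidence formula (in-class rate divided by the sum of both rates) is one formulation assumed here, since the slide only calls it a combination of the other two.

```python
def rule_metrics(in_class_with_t, class_size, other_with_t, other_size):
    """Metrics for a rule T => c, computed from tree counts.

    in_class_with_t: trees of class c that contain T
    class_size:      all trees of class c
    other_with_t:    trees not of class c that contain T
    other_size:      all trees not of class c
    """
    total = class_size + other_size
    with_t = in_class_with_t + other_with_t
    rate_in = in_class_with_t / class_size
    rate_out = other_with_t / other_size

    support = in_class_with_t / total  # joint probability of T and c
    confidence = in_class_with_t / with_t if with_t else 0.0
    # Assumption: the ratio is reported as infinite when T never
    # occurs outside the class.
    likelihood = rate_in / rate_out if rate_out else float("inf")
    rate_sum = rate_in + rate_out
    weighted_confidence = rate_in / rate_sum if rate_sum else 0.0
    return {
        "support": support,
        "confidence": confidence,
        "likelihood": likelihood,
        "weighted_confidence": weighted_confidence,
    }


# 100 trees: 40 in class c (30 contain T), 60 outside (15 contain T).
metrics = rule_metrics(30, 40, 15, 60)
```

In this example confidence is 30/45 while weighted confidence is 0.75: the second figure does not shrink just because class c is the smaller class.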

16
Rule Metrics Characteristics
  • Confidence measures strength across the entire
    database (global)
  • Likelihood measures the local tendency of the
    pattern/class association vs the
    pattern/negative-class association
  • Weighted confidence works as confidence but
    without a global bias

17
XRULES
  • Structural rule-based classification
  • Two phases: training and testing
  • Training uses a database with all classes
    represented
  • Goal is to learn a rule set with strength and
    support characteristics above some threshold

18
XRULES: Training Phase
  • Mine frequent structural rules (using XMINER)
  • Order the resultant rules and prune unpredictive
    rules
  • Determine a 'default' class for cases that the
    rules do not classify; the choice is situational
    and could be the majority class or cost dependent.
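
A minimal sketch of the ordering, pruning, and default-class steps. Everything here is an assumption for illustration: the Rule fields, the sort key (strength first, then support), the use of simple thresholds as the pruning step, and the majority class as the default are all stand-ins that the slides leave open.

```python
from collections import Counter
from typing import NamedTuple


class Rule(NamedTuple):
    pattern: str      # placeholder for the mined subtree
    label: str        # predicted class
    strength: float   # e.g. confidence or weighted confidence
    support: float


def build_rule_list(rules, min_strength, min_support):
    """Drop weak rules, then rank the rest from strongest to weakest."""
    kept = [
        rule for rule in rules
        if rule.strength >= min_strength and rule.support >= min_support
    ]
    return sorted(kept, key=lambda rule: (-rule.strength, -rule.support))


def default_class(training_labels):
    """Majority class, used when no rule matches a document."""
    return Counter(training_labels).most_common(1)[0][0]


mined = [
    Rule("t1", "edu", 0.60, 0.20),
    Rule("t2", "other", 0.90, 0.05),
    Rule("t3", "edu", 0.90, 0.10),
    Rule("t4", "other", 0.40, 0.30),
]
ranked = build_rule_list(mined, min_strength=0.5, min_support=0.05)
fallback = default_class(["edu", "other", "other"])
```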

19
XRULES: Testing Phase
  • Rule retrieval: using XMINER, find the set of all
    rules that match the test data sample. Then
    use one of several methods of combining the
    statistics of the matching rule set to choose
    the classification.
  • Average strength: compute rule strengths for
    each class across all matches (subtrees); the
    highest average wins (strongest matches overall)
  • Best: since rules are ranked, pick the first
    (highest) match (strongest match)
  • Best K: a combination of the first two, but
    average across the best 'K' rules, not all
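
The three combination methods can be sketched on a list of matching rules given as (class, strength) pairs. This is a simplified illustration: the function name is hypothetical, and taking the top K rules per class for the "Best K" method is an assumption, since the slide does not say whether K applies per class or overall.

```python
def classify(matching_rules, method="average", k=3, default=None):
    """Pick a class from the (label, strength) pairs that matched a document."""
    if not matching_rules:
        return default  # fall back to the default class
    ranked = sorted(matching_rules, key=lambda rule: rule[1], reverse=True)
    if method == "best":
        return ranked[0][0]

    strengths = {}
    for label, strength in ranked:
        strengths.setdefault(label, []).append(strength)
    if method == "best_k":
        # Lists are already in descending order, so slicing keeps the top k.
        strengths = {label: values[:k] for label, values in strengths.items()}
    elif method != "average":
        raise ValueError(f"unknown method: {method}")

    return max(
        strengths,
        key=lambda label: sum(strengths[label]) / len(strengths[label]),
    )


matches = [
    ("edu", 0.9), ("edu", 0.5), ("edu", 0.4),
    ("other", 0.8), ("other", 0.7),
]
```

On this input the single strongest rule votes for edu, while averaging over all matches favors other, which shows why the choice of method can change the outcome.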

20
Results
  • Compared to:
  • IR classifier (IRC)
  • CBA
  • Tested on real and synthetic data sets of
    browsing records (XML web browsing traces/logs)
  • Two classes: educational browsers (.edu) and
    other (not .edu)

21
Results
  • Browse traces converted to trees by only reading
    forward link-follows, eliminating
    cycles/back-links
  • IRC comparison used only the text of the XML
    files (treated all tags as text)
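
One way to picture the trace-to-tree conversion is sketched below. It is an assumption-laden illustration rather than the procedure used in the experiments: the function name, the (source, target) click format, and the page names are all hypothetical, and a click is kept only if it leads to a page not seen before.

```python
def trace_to_tree(start_page, clicks):
    """Build a (label, children) tree from an ordered list of link follows.

    Clicks that return to an already visited page (cycles, back-links)
    are dropped, so every page appears at most once.
    """
    children = {start_page: []}
    for source, target in clicks:
        if source in children and target not in children:
            children[source].append(target)
            children[target] = []

    def build(page):
        return (page, [build(child) for child in children[page]])

    return build(start_page)


clicks = [
    ("home", "courses"),
    ("courses", "cs101"),
    ("cs101", "home"),      # back-link, dropped
    ("home", "people"),
    ("people", "courses"),  # leads to a visited page, dropped
]
session_tree = trace_to_tree("home", clicks)
```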

22
(No Transcript)
23
Results
  • In this application the XRULES system was
    superior in most cases
  • In synthetic tests with only structure, IRC did
    not work at all, and CBA was only as good as
    random
  • Overall XRULES was always better than CBA, and
    equal to or better than IRC in all cases.
  • Costs were comparable for all methods