OLAP on - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: OLAP on


1
OLAP on
Sequence Data
Published in SIGMOD 2008 Vancouver, Canada.
Authors
Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun
Kit Chui and David W. Cheung
Presenter
Chun Kit Chui (Kit), The University of Hong Kong
ckchui_at_cs.hku.hk
2
OLAP on
Sequence Data
Problem Motivation
Sequence Data Cube and Cuboids
New OLAP operations
System architecture
Experimental evaluations
Future works
3
OLAP on
Sequence Data
  • Many kinds of real-life data exhibit logical
    ordering among their data items and are thus
    sequential in nature.

Web server access logs
Stock market data
U.S. OIL FUND ETF
MEXCO ENERGY CORP
4
Web server access logs (Web retailer selling
sportswear products)
The product dimension is associated with a
concept hierarchy in which the finest level of
abstraction is product ID, followed by product
type, and brand.
5
From the access logs we can trace back the
browsing sequences of all members.
Web server access logs
Browsing Sequence
6
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Manager
Sequence Data
  • Many kinds of real-life data exhibit logical
    ordering among their data items and are thus
    sequential in nature.

Web server access logs
Browsing Sequence
7
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Manager
Sequence Data
Browsing Sequence
The query is referring to a particular kind of
pattern in the browsing sequences. The
comparison shopping semantics can be expressed by
the pattern template <X, Y, X>.
8
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Manager
Sequence Data
<Nike Shoes, Adidas Shoes, Nike Shoes> is one of
the instantiations of the pattern template. Since
the browsing sequence of member 688 contains the
pattern, the sequence contributes 1 count to the
cell.
Browsing Sequence
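As a rough illustration of how a pattern template such as <X, Y, X> can be instantiated against one browsing sequence, the sketch below slides a window over the sequence and keeps the windows in which every repeated symbol binds to the same value. It assumes substring (contiguous) matching and allows different symbols to bind to the same value. The function name `instantiations` and the sample sequence are made up for this example and do not come from the slides.

```python
def instantiations(sequence, template):
    """Return the instantiated patterns of `template` that occur as
    contiguous windows of `sequence` (substring semantics, an assumption)."""
    m = len(template)
    found = set()
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        binding = {}
        # A symbol that appears twice (here X) must bind to the same value.
        if all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window)):
            found.add(tuple(window))
    return found


# Hypothetical browsing sequence of one member (illustrative data only).
browsing = ["Nike Shoes", "Adidas Shoes", "Nike Shoes", "Adidas T-shirts"]
cells = instantiations(browsing, ("X", "Y", "X"))
print(cells)
```

Each returned tuple names one cell of the result table, and the member's sequence would add 1 to the count of each such cell.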
9
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Manager
Sequence Data
The aggregated number of members is counted and a
tabulated view of the sequence data should be
returned.
Browsing Sequence
10
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Sequence OLAP system
Query
  • Support pattern based grouping and aggregation.

Manager
The aggregated number of members is counted and a
tabulated view of the sequence data should be
returned.
Result
11
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
So many members did comparison shopping between
Nike shoes and Adidas shoes. I would like to
further investigate whether those members browse
one more product and, if so, which product.
Sequence OLAP system
Follow up Query
  • Support pattern based grouping and aggregation.

Manager
  • Obtain query results in real time (OLAP feature).

Result
The new query can be expressed by appending a
pattern symbol Z to form a new pattern template
<X,Y,X,Z>. The result shows the statistics of
one more browsing step after the comparison
shopping between Nike Shoes and Adidas Shoes.
12
The manager finds out that the Adidas T-shirts page
is the most popular page for the members who did
comparison shopping between the Nike shoes and
Adidas shoes pages.
13
The comparison shopping patterns displayed at the
product type abstraction level are too detailed.
I would like to view some higher-level
statistics.
Sequence OLAP system
Query
  • Support pattern based grouping and aggregation.

Manager
  • Obtain query results in real time (OLAP feature).
  • Provide OLAP operations to ease sequence
    analysis.

Result
A simple roll up operation on the pattern
template transforms the summary statistics to the
brand abstraction level.
Product type abstraction level
brand abstraction level
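The roll-up from product type to brand can be pictured with a small sketch. It assumes that the roll-up is evaluated by mapping every event through the concept hierarchy and matching the <X, Y, X> template again on the mapped sequences. The hierarchy `BRAND_OF`, the helper `count_xyx`, and the three sequences are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy: product type -> brand.
BRAND_OF = {
    "Nike Shoes": "Nike",
    "Nike T-shirts": "Nike",
    "Adidas Shoes": "Adidas",
}


def count_xyx(sequences):
    """Count, per <X, Y, X> cell, the sequences containing that instantiation."""
    counts = Counter()
    for seq in sequences:
        # A set, so each sequence adds at most 1 to any cell.
        cells = {tuple(seq[i:i + 3])
                 for i in range(len(seq) - 2) if seq[i] == seq[i + 2]}
        counts.update(cells)
    return counts


# Made-up browsing sequences at the product type level.
by_type = [
    ["Nike Shoes", "Adidas Shoes", "Nike Shoes"],
    ["Nike T-shirts", "Adidas Shoes", "Nike T-shirts"],
    ["Nike Shoes", "Adidas Shoes", "Nike T-shirts"],
]
by_brand = [[BRAND_OF[p] for p in seq] for seq in by_type]

type_level = count_xyx(by_type)
brand_level = count_xyx(by_brand)
print(type_level)
print(brand_level)
```

In this toy data the third sequence matches only at the brand level, so the brand-level count (3) is larger than the sum of the two product-type cells (2). This is one reason the sketch re-matches on the sequences instead of adding up finer cells.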
14
Research Objective
  • To design and implement an OLAP system that is
    able to
  • support pattern based grouping and aggregation.
  • obtain query results in real-time.
  • Especially optimized for interactive/iterative
    queries.
  • provide OLAP operations to ease explorative
    analysis of sequence data.

15
RFID Logs
  • Radio-frequency identification (RFID) is an
    automatic identification method, relying on
    storing and remotely retrieving data using
    devices called RFID tags.
  • The smart card system in public transits
  • Octopus card in Hong Kong, Orca card in Seattle
    (2009), etc.
  • Electronic money
  • Travel histories of passengers are logged in a
    database.
  • Generates massive amounts of sequence data.

16
RFID Logs
Event Database
  • Radio-frequency identification (RFID) is an
    automatic identification method, relying on
    storing and remotely retrieving data using
    devices called RFID tags.
  • The smart card system in public transits
  • Octopus card in Hong Kong, Orca card in Seattle
    (2009), etc.
  • Electronic money
  • Payment can be done easily by waving the card
    over the card reader.
  • Travel histories of passengers are logged in a
    database.
  • Generates massive amounts of sequence data.

17
Event Database
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2008 Quarter 4.
18
Event Database
Round trip statistics (Stations level)
Result
Query
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2008 Quarter 4.
19
Sequence Data Cuboid
A logical view of sequence data at a particular
degree of summarization.
20
Preliminary
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2008 Quarter 4.
  • Sequence Cuboid (S-Cuboid)
  • a logical view of sequence data at a particular
    degree of summarization.
  • sequences can be characterized by
  • attribute values of the events in the sequence
    (e.g. time, spending, product type)
  • the subsequence/substring patterns they possess
    (e.g. <X,Y,X>, <X,Y,Y,X>)

Sequence OLAP
An S-Cuboid
21
Phase 1. Sequence Formation
Event Database
Event Selection
An event selection step to select a set of
relevant records and attributes.
22
Phase 1. Sequence Formation
Event Database
Event Selection
A sequence formation step to form sequences from
the event dataset.
Sequence Formation
Sequences can be formed per day and for each
individual user. By doing this, we obtain a number
of daily travel sequences for each user. E.g. S1
is Kit's trip on Monday.
User Individual, Time Day
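A minimal sketch of the sequence formation step, assuming each event is a (card id, timestamp, location, action) record: events are clustered per user and per day, then ordered by time inside each cluster. The record layout, the function name `form_sequences`, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records: (card_id, timestamp, location, action).
events = [
    ("kit", "2008-10-06 18:05", "Central", "in"),
    ("kit", "2008-10-06 08:00", "Shatin", "in"),
    ("kit", "2008-10-06 08:40", "Central", "out"),
    ("kit", "2008-10-06 18:45", "Shatin", "out"),
    ("kit", "2008-10-07 09:00", "Shatin", "in"),
]


def form_sequences(events):
    """Cluster events by (user, day) and order each cluster by time."""
    clusters = defaultdict(list)
    for card_id, stamp, location, action in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        clusters[(card_id, when.date())].append((when, location, action))
    return {
        key: [(loc, act) for _, loc, act in sorted(cluster)]
        for key, cluster in clusters.items()
    }


sequences = form_sequences(events)
for key, seq in sequences.items():
    print(key, seq)
```

Forming sequences at the year level instead would only change the clustering key, for example `when.year` in place of `when.date()`.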
23
Phase 1. Sequence Formation
Event Database
Event Selection
Sequences can also be formed according to time
dimension at the abstraction level of year and
per individual user.
Sequence Formation
User Individual, Time Day
User Individual, Time Year
24
Phase 2. S-Cuboid construction
User Individual, Time Day
Monday
25
Phase 2. S-Cuboid construction
A sequence grouping step to group the sequences
that share the same dimension values into a
sequence group. E.g. travel sequences are grouped
according to their fare groups.
Sequence Grouping
User Individual, Time Day
Monday
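The sequence grouping step can be sketched as a plain group-by on a global dimension. The dictionary layout, the `fare_group` values, and the helper name `group_by_dimension` are hypothetical; only the idea of grouping sequences that share a dimension value comes from the slide.

```python
from collections import defaultdict

# Hypothetical daily travel sequences, each tagged with a fare group.
sequences = [
    {"id": "S1", "fare_group": "Adult",
     "stations": ["Shatin", "Central", "Central", "Shatin"]},
    {"id": "S2", "fare_group": "Student",
     "stations": ["Shatin", "Central"]},
    {"id": "S3", "fare_group": "Adult",
     "stations": ["Mongkok", "Central", "Central", "Mongkok"]},
]


def group_by_dimension(sequences, dimension):
    """Put sequences with the same value of `dimension` into one group."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq[dimension]].append(seq["id"])
    return dict(groups)


groups = group_by_dimension(sequences, "fare_group")
print(groups)
```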
26
Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Pattern Grouping
Sequence Grouping
The pattern grouping step further groups the
sequences according to the patterns they
possess.
User Individual, Time Day
27
Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Each cell represents an instantiated pattern, e.g.
<Shatin, Central, Central, Shatin>. We assign a
sequence to a cell if the sequence contains the
instantiated pattern.
Pattern Grouping
The pattern grouping step further groups the
sequences according to the patterns they
possess.
S1
S3
Central
Shatin
28
Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Pattern Grouping
Aggregated Value
Finally, an aggregation function is applied to
the sequences in each cuboid cell.
Count 2
S1
Central
S3
Shatin
29
Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Pattern Grouping
Aggregated Value
Count 2
S1
Central
S3
4D S-Cuboid
Shatin
31
Sequence Cuboid query language
This query specifies the construction of the
S-Cuboid that answers the round trip query in the
running example.
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.
4D S-Cuboid
32
Sequence Cuboid query language
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.
4D S-Cuboid
33
Sequence Cuboid query language
4D S-Cuboid
The predicates further increase the expressive
power of pattern matching in the query
language. What exactly is a round-trip pattern?
34
Sequence Cuboid query language
Sequence Formation
Sequence Grouping
Global dimensions
Pattern template
Pattern dimensions
Pattern Grouping
E.g. Kit: <Shatin, Central, Central, Shatin,
Shatin, Central, Central, Shatin>
4D S-Cuboid
The cell restriction defines how to deal with
situations in which a data sequence contains
multiple occurrences of a cell's pattern. E.g. a
sequence contributes 1 count whenever we can find
at least one match of the pattern in the sequence.
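The effect of the cell restriction can be seen on Kit's sequence above. The sketch below contrasts counting every occurrence with counting each sequence once per cell. It uses plain substring matching without the in/out predicates of the query language, so it also reports a window that those predicates might exclude. The helper name `matches` is an assumption of this example.

```python
from collections import Counter


def matches(sequence, template):
    """Yield one instantiated pattern per matching window (substring semantics)."""
    m = len(template)
    for i in range(len(sequence) - m + 1):
        window = tuple(sequence[i:i + m])
        binding = {}
        # Repeated symbols must bind to the same value.
        if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
            yield window


kit = ["Shatin", "Central", "Central", "Shatin",
       "Shatin", "Central", "Central", "Shatin"]
template = ("X", "Y", "Y", "X")

# Without a restriction, every occurrence is counted.
all_occurrences = Counter(matches(kit, template))
# With the restriction from the slide, a sequence adds at most 1 per cell.
once_per_sequence = Counter(set(matches(kit, template)))

print(all_occurrences)
print(once_per_sequence)
```

Here the pattern <Shatin, Central, Central, Shatin> occurs twice in the sequence but contributes only 1 count under the restriction.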
35
Sequence Cuboid query language
Any change to the cuboid specification
transforms the S-Cuboid into another. E.g. changing
the pattern template to (X,Y,Y,X,Z) generates
another S-Cuboid.
36
Properties of S-Cuboids
  • Exponential number of S-cuboids
  • The length of the pattern template is infinite
  • Pattern Template (X,Y,Y,X,A,B,...)
  • Non-summarizable

Recall that changing the pattern template
essentially changes the cuboid specification and
thus generates a new cuboid.
37
Properties of S-Cuboids
In traditional OLAP systems, data are
summarizable, i.e. summaries at a finer abstraction
level can be used to construct the summary at a
higher abstraction level.
38
Properties of S-Cuboids
Sequence Database
S-Cuboid (Finer aggregates)
  • Infinite number of S-cuboids
  • The number of pattern dimensions is infinite
  • Pattern Template (X,Y,Y,X,A,B,...)
  • Non-summarizable

The S-Cuboid with pattern template <X,Y,Z>
Traditional OLAP: sales for the 7 days (Mon to Sun)
can be aggregated into sales for the whole week.
Summarizable!
39
Properties of S-Cuboids
Can we compute the S-Cuboid with pattern <X,Y>
(coarser summary) from the S-Cuboid with pattern
<X,Y,Z> (finer summary) without looking at the
sequence database?
Sequence Database
S-Cuboid (Finer aggregates)
S-Cuboid (Coarser aggregates)
40
Properties of S-Cuboids
The problem is that we don't know whether the
counts in these two patterns are generated from
the same sequence or from two different sequences.
41
Properties of S-Cuboids
Traditional OLAP (daily sales to whole-week sales):
Summarizable!
S-Cuboids (finer aggregates to coarser aggregates):
Non-Summarizable!
42
Properties of S-Cuboids
  • Infinite number of S-cuboids
  • The number of pattern dimensions is infinite
  • Pattern Template (X,Y,Y,X,A,B,...)
  • Non-summarizable
  • Coarser aggregates cannot be computed solely from
    the corresponding finer aggregates.

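A small constructed example may make the non-summarizable property concrete. The two toy databases below produce identical counts for the finer template <X,Y,Z> but different counts for the coarser template <X,Y>, so the coarser cuboid cannot be derived from the finer one alone. The sketch assumes substring matching, COUNT aggregation with each sequence counted once per cell, and templates whose symbols are all distinct. The function `cuboid` and both databases are made up for illustration.

```python
from collections import Counter


def cuboid(db, m):
    """COUNT cuboid for a substring template with m distinct symbols:
    each sequence adds 1 to every length-m window it contains."""
    counts = Counter()
    for seq in db:
        counts.update({tuple(seq[i:i + m]) for i in range(len(seq) - m + 1)})
    return counts


# Two hypothetical sequence databases over the symbols a, b, c, d.
db_two_sequences = [list("abca"), list("cabd")]
db_one_sequence = [list("abcabd")]

same_finer = cuboid(db_two_sequences, 3) == cuboid(db_one_sequence, 3)
same_coarser = cuboid(db_two_sequences, 2) == cuboid(db_one_sequence, 2)
print(same_finer, same_coarser)
print(cuboid(db_two_sequences, 2)[("a", "b")], cuboid(db_one_sequence, 2)[("a", "b")])
```

The pair (a, b) is counted twice when its two occurrences come from different sequences and once when they come from the same sequence, which the <X,Y,Z> counts cannot distinguish.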
43
Properties of S-Cuboids
  • Exponential number of S-cuboids
  • The length of the pattern template is infinite
  • Pattern Template (X,Y,Y,X,A,B,...)
  • Full materialization is impossible!
  • Non-summarizable
  • Coarser aggregates cannot be computed solely from
    the corresponding finer aggregates.
  • Partial materialization is infeasible!

44
Properties of S-Cuboids
  • Research direction
  • Precompute some other auxiliary data structures
    so that queries can be computed online using the
    pre-built data structures

45
S-OLAP Specific
Operations
Assist explorative analysis of the sequence data
46
S-OLAP specific operations
  • Navigate between cuboids with ease
  • Traditional OLAP operations for Global Dimensions
  • SLICE, DICE, ROLL-UP, DRILL-DOWN, etc.
  • New S-OLAP operations for Pattern Dimensions /
    Pattern Template
  • APPEND(X): (X,Y,Y) → (X,Y,Y,X)
  • DE-TAIL: (X,Y,Y,X) → (X,Y,Y)
  • PREPEND(Z): (X,Y,Y,X) → (Z,X,Y,Y,X)
  • DE-HEAD: (Q,Y,Y,X) → (Y,Y,X)
  • PATTERN-ROLL-UP(X): (X,Y,Y,X) → (X,Y,Y,X), with X
    moved to a coarser abstraction level
  • PATTERN-DRILL-DOWN(X): (X,Y,Y,X) → (x,Y,Y,x), with
    x at a finer abstraction level

Coarser abstraction level
Finer abstraction level
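The four template-editing operations can be sketched as pure functions on a tuple of pattern symbols. This only models how the template changes; recomputing the S-Cuboid for the new template is a separate step. The Python function names are chosen for this example.

```python
def append(template, symbol):
    """APPEND: add a pattern symbol at the end of the template."""
    return template + (symbol,)


def de_tail(template):
    """DE-TAIL: drop the last pattern symbol."""
    return template[:-1]


def prepend(template, symbol):
    """PREPEND: add a pattern symbol at the front of the template."""
    return (symbol,) + template


def de_head(template):
    """DE-HEAD: drop the first pattern symbol."""
    return template[1:]


round_trip = ("X", "Y", "Y", "X")
print(append(("X", "Y", "Y"), "X"))
print(de_tail(round_trip))
print(prepend(round_trip, "Z"))
print(de_head(("Q", "Y", "Y", "X")))
```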
47
Tell me the summary statistics of the single-trip
travel patterns of passengers among different
Rail Lines, please.
Sequence OLAP
CUBOID by SUBSTRING(X,Y) WITH
X as location at Rail Lines,
Y as location at Rail Lines
LEFT-MAXIMALITY (x1, y1) WITH
x1.action = "in" AND
y1.action = "out"
48
S-Cuboid 1 (10 × 10 cells)
Sequence OLAP
49
S-Cuboid 1 (10 × 10 cells)
Sequence OLAP
More detailed statistics of passengers traveling
from the Tsuen Wan Line to each of the Island
Line stations, please.
50
S-Cuboid 1 (10 × 10 cells)
Sequence OLAP
S-Cuboid 2 (1 × 14 cells)
Instead of specifying the S-Cuboid construction
query, a SLICE plus a P-DRILL-DOWN(Y) is done.
51
S-Cuboid 1 (10 × 10 cells)
Sequence OLAP
S-Cuboid 2 (1 × 14 cells)
S-Cuboid 3 (1 × 14 × 14 cells)
52
S-Cuboid 1 (10 × 10 cells)
Sequence OLAP
S-Cuboid 2 (1 × 14 cells)
S-Cuboid 3 (1 × 14 × 14 cells)
The S-OLAP operations not only assist the
exploratory analysis of the sequence data, they
also hide all the technical details of
specifying the S-Cuboid query from the business
users.
53
System Architecture
54
System Architecture
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
55
System Architecture
The job of the Sequence Query Engine is to
compose sets of event sequences out of the event
dataset (Phase 1 in S-Cuboid construction).
Sequence Query Engine
Event Dataset
Sequence Cache
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
56
System Architecture
The job of the Sequence Query Engine is to
compose sets of event sequences out of the event
dataset (Phase 1 in S-Cuboid construction).
Queries
Sequence Query Engine
User Interface
Event Dataset
Sequence Cache
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
57
System Architecture
Given an S-Cuboid query, the SOLAP Engine
consults a Cuboid Repository to see if such an
S-cuboid has been previously computed and stored.
Queries
Sequence Query Engine
Sequence OLAP Engine
User Interface
Event Dataset
Sequence Cache
Results
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
58
System Architecture
The SOLAP Engine computes the S-cuboid with the
help of certain Auxiliary Data Structures.
Given an S-Cuboid query, the SOLAP Engine
consults a Cuboid Repository to see if such an
S-cuboid has been previously computed and stored.
Queries
Sequence Query Engine
Sequence OLAP Engine
User Interface
Event Dataset
Sequence Cache
Results
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
60
Auxiliary Data Structures
Counter based approach
Inverted indices approach
61
Counter-Based approach
  • Counter-Based approach
  • Each cell in an S-cuboid is associated with a
    counter.
  • To determine the counters' values, the entire set
    of sequences is scanned.
  • For each sequence s, we determine the cells whose
    associated patterns are contained in s and
    increment each such counter by 1.
  • Basic and simple
  • But processing iterative queries requires
    counting from scratch.
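
The counter-based approach described above can be sketched as a single scan over a sequence group. This is a simplified illustration that assumes substring matching and one count per sequence per cell. The function name `counter_based` and the three sequences are hypothetical.

```python
from collections import Counter


def counter_based(sequences, template):
    """Scan every sequence once and increment the counter of each cell
    whose instantiated pattern the sequence contains."""
    m = len(template)
    counters = Counter()
    for seq in sequences:
        cells = set()
        for i in range(len(seq) - m + 1):
            window = tuple(seq[i:i + m])
            binding = {}
            # Repeated symbols must bind to the same value.
            if all(binding.setdefault(s, v) == v
                   for s, v in zip(template, window)):
                cells.add(window)
        counters.update(cells)  # each such counter goes up by 1
    return counters


# Hypothetical sequence group.
group = [
    ["Shatin", "Central", "Central", "Shatin"],
    ["Shatin", "Central"],
    ["Mongkok", "Shatin", "Central", "Central", "Shatin"],
]
counts = counter_based(group, ("X", "Y", "Y", "X"))
print(counts)
```

A follow-up query such as `counter_based(group, ("X", "Y", "Y", "X", "Z"))` scans the whole group again, which is the drawback the last bullet points out.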

62
S-OLAP query evaluation
  • Inverted-Index Approach
  • Based on the fragment cube (X. Li, J. Han, and H.
    Gonzalez. VLDB 2004) concept.
  • A set of inverted indices are created by
    pre-processing the data offline.
  • Algorithm BuildIndex (see paper)
  • During query processing, the relevant inverted
    indices are joined based on the matching pattern,
    in real-time.
  • Algorithm QueryIndices (see paper)
  • A by-product of answering a query is the creation
    of new inverted indices.
  • Newly built indices are useful to the processing
    of iterative S-OLAP operations (see paper for
    algorithms)
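
The idea behind the inverted-index approach can be sketched as follows. This is not the BuildIndex or QueryIndices algorithm from the paper. It is a simplified illustration in which an index maps each length-2 substring to the ids of the sequences containing it, and an APPEND on a sliced cell revisits only the sequences in that cell's list. The function names and the sample sequences are assumptions.

```python
from collections import defaultdict


def build_index(sequences, m):
    """Offline step: map each length-m substring to the ids of the
    sequences that contain it."""
    index = defaultdict(set)
    for sid, seq in sequences.items():
        for i in range(len(seq) - m + 1):
            index[tuple(seq[i:i + m])].add(sid)
    return index


def append_symbol(pattern, sids, sequences):
    """Online step for APPEND: visit only the sequences listed for
    `pattern` and build the lists for the one-symbol-longer patterns."""
    m = len(pattern)
    longer = defaultdict(set)
    for sid in sorted(sids):
        seq = sequences[sid]
        for i in range(len(seq) - m):
            if tuple(seq[i:i + m]) == pattern:
                longer[pattern + (seq[i + m],)].add(sid)
    return longer


# Hypothetical click sequences keyed by sequence id.
sequences = {
    1: ["Catalog", "Legwear", "Checkout"],
    2: ["Catalog", "Legwear", "Catalog"],
    3: ["Home", "Catalog"],
    4: ["Catalog", "Legwear", "Checkout"],
}
index2 = build_index(sequences, 2)
pair = ("Catalog", "Legwear")
index3 = append_symbol(pair, index2[pair], sequences)
counts = {pattern: len(sids) for pattern, sids in index3.items()}
print(counts)
```

Sequence 3 is never visited during the APPEND step, and the lists in `index3` are new indices that a later operation could reuse.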

63
Experiments
  • A prototype S-OLAP system was implemented using
    C++.
  • Real Data
  • Passenger traveling history.
  • KDD Cup 2000
  • Clickstream data from a web retailer selling
    legwear and legcare products.
  • 50,524 sequences.
  • KDD Cup 2000 Question 1
  • Look for page-click patterns
  • We answer this question in an exploratory way via
    three iterative queries.

64
Experiments
The corresponding pattern template to capture the
2-step navigation semantics is <X,Y>.
Cuboid Qa (44 × 44 cells)
Qa: Look for the statistics of all 2-step
navigations at the page category level.
  • KDD Cup 2000 Question 1
  • Look for page-click patterns
  • We answer this question in an exploratory way via
    three iterative queries

65
Experiments
Cuboid Qa (44 × 44 cells)
Qa: Look for the statistics of all 2-step
navigations at the page category level.
1. SLICE
2. P-DRILL-DOWN
Qb: Since many visitors browse from the product
catalog to a legwear product page, what exactly
are the products they browse?
Cuboid Qb (1 × 279 cells)
The most popular product that visitors browse
from the catalog page is product 34839
(DKNY skin legwear collection product).
66
Experiments
Cuboid Qa (44 × 44 cells)
Cuboid Qb (1 × 279 cells)
Qc: APPEND(Z)
Cuboid Qc (1 × 279 × 279 cells)
The runtime of II is higher than that of CB in Qa
because we include the index precomputation time
in Qa.
67
Experiments
For the iterative queries, II takes advantage
of processing only the sequences that possess the
pattern <Product catalog, Legwear Product>.
68
Experiments on synthetic data
  • Study the scalability of Counter-Based approach
    (CB) and Inverted-Index approach (II) under a
    series of APPEND operations
  • QA1: SUBSTRING(X,Y) → SLICE and APPEND → QA2: (X,Y,Z)
    → SLICE and APPEND → QA3: (X,Y,Z,A) → SLICE and
    APPEND → QA4: (X,Y,Z,A,B) → SLICE and APPEND → QA5:
    (X,Y,Z,A,B,C)

69
Experiments on synthetic data
Cumulative runtime
Both CB and II scale linearly w.r.t. number of
sequences. II outperformed CB in all datasets in
this experiment.
II precomputation time less than 4 secs in all
cases
70
Experiments on synthetic data
Cumulative runtime
Both CB and II scale linearly w.r.t. number of
sequences. II outperformed CB in all datasets in
this experiment.
Cumulative sequence scanned
II precomputation time less than 4 secs in all
cases
CB scans the entire dataset once for each
iterative query. For QA1, II does not need to
scan any data sequences because the query can be
answered by the inverted indices directly.
71
Experiments on synthetic data
  • Vary
  • Average sequence length (L)
  • Data distribution (Skew factor)
  • Domain of the events (I)
  • P-ROLL-UP operation
  • P-DRILL-DOWN operation
  • <X,Y,Y,X> pattern templates
  • Substring / Subsequence pattern templates
  • (See technical report)

72
Conclusion
  • We propose a new online analytical processing
    system for sequence data analysis (The S-OLAP
    system).
  • The proposed system is motivated by real-life
    problems.
  • Page click analysis
  • RFID log analysis
  • etc
  • We defined basic concepts
  • S-Cuboid, S-Cube
  • Identified two properties of S-Cube
  • Infinite number of S-Cuboid
  • Non-summarizable
  • Illustrated the usability of the proposed S-OLAP
    system through a prototype system that works on
    real data.

73
The End
Thank you!
74
Synthetic dataset generator
  • Synthetic sequence databases are synthesized in
    the following manner
  • The generated sequence database has D sequences.
  • Each sequence s in a dataset is generated
    independently
  • The sequence length l, with mean L, is first
    determined by a random variable following a
    Poisson distribution.
  • Then, we repeatedly add events to the sequence
    until the target length l is reached.
  • The first event symbol is randomly selected
    according to a pre-determined distribution
    following Zipf's law with parameters I and T
  • I is the number of possible symbols, and
  • T is the skew factor
  • Subsequent events are generated one after the
    other using a Markov chain of degree 1.
  • The conditional probabilities are pre-determined
    and are skewed according to Zipf's law.
  • All the generated sequences form a single
    sequence group, which serves as the input
    data to the algorithms.
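
The generator described above can be sketched with the Python standard library. Several details are assumptions of this sketch and are not stated on the slide: Poisson sampling uses Knuth's multiplication method, Zipf weights are taken as 1 / rank^T, and the conditional distribution after symbol s is the same Zipf weight vector rotated by s positions. The function names are made up.

```python
import math
import random


def poisson(rng, mean):
    """Sample a Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def zipf_weights(num_symbols, skew):
    """Unnormalised Zipf weights: weight of rank r is 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_symbols + 1)]


def generate(D, L, I, T, seed=0):
    """Generate D sequences with mean length L over I symbols, skew T."""
    rng = random.Random(seed)
    symbols = list(range(I))
    first = zipf_weights(I, T)
    # Assumption: the transition row for symbol s is the Zipf weight
    # vector rotated by s, so each predecessor favours another successor.
    rows = [first[I - s:] + first[:I - s] for s in symbols]
    database = []
    for _ in range(D):
        length = poisson(rng, L)
        seq = []
        while len(seq) < length:
            weights = rows[seq[-1]] if seq else first
            seq.append(rng.choices(symbols, weights=weights)[0])
        database.append(seq)
    return database


db = generate(D=100, L=5, I=10, T=0.9, seed=1)
print(len(db), db[:3])
```

Because the length is Poisson distributed, an occasional empty sequence is possible; a real experiment might filter those out.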

75
Related Work
  • Sequence Databases
  • PREDATOR (Seshadri, Livny, and Ramakrishnan
    SIGMOD 94, VLDB 96)
  • DEVise (Ramakrishnan et al. SSDBM 98)
  • TS-SQL (Sadri et al. PODS 01)
  • OLAP
  • Data-cube operator (Gray et al. 95),
    iceberg-cube, star-schema, etc.
  • OLAP on unconventional data
  • RFID-cube (Gonzalez, Han, and Li VLDB 06)
  • Stream-cube (Chen et al. VLDB 02)
  • XML-cube (Wiwatwattana et al. ICDE 07)