Title: OLAP on
1OLAP on
Sequence Data
Published in SIGMOD 2008 Vancouver, Canada.
Presenter
Chun Kit Chui (Kit),The University of Hong
Kongckchui_at_cs.hku.hk
Supervisor
Eric Lo,Hong Kong Polytechnic UniversityBen
Kao,The University of Hong Kong
2OLAP on
Sequence Data
Problem Motivation
Sequence Data Cube and Cuboids
New OLAP operations
System architecture
Experimental evaluations
3OLAP on
Sequence Data
- Many kinds of real-life data exhibit logical
ordering among their data items and are thus
sequential in nature.
Web server access logs
Stock market data
U.S. OIL FUND ETF
MEXCO ENERGY CORP
4Web server access logs (Web retailor selling
sports wear products)
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
The product dimension is associated with a
concept hierarchy in which the finest level of
abstraction is product ID, followed by product
type, and brand.
Sequence Data
- Many kinds of real-life data exhibit logical
ordering among their data items and are thus
sequential in nature.
Web server access logs
Stock market data
U.S. OIL FUND ETF
MEXCO ENERGY CORP
5Web server access logs (Web retailor selling
sports wear products)
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
The product dimension is associated with a
concept hierarchy in which the finest level of
abstraction is product ID, followed by product
type, and brand.
Sequence Data
- Many kinds of real-life data exhibit logical
ordering among their data items and are thus
sequential in nature.
From the access logs we can trace back the
browsing sequences of all members.
Web server access logs
Browsing Sequence
6Web server access logs (Web retailor selling
sports wear products)
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
Manager
Sequence Data
- Many kinds of real-life data exhibit logical
ordering among their data items and are thus
sequential in nature.
Web server access logs
Browsing Sequence
7Web server access logs (Web retailor selling
sports wear products)
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
Manager
Sequence Data
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt ?
lt Nike Shoes, Puma Shoes, Nike Shoes gt 5,432
lt Nike Shoes, Nike Shoes, Nike Shoes gt 13,200
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt 1,020
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt 4,331
Browsing Sequence
The query is referring to a particular kind of
pattern in the browsing sequences. The
comparison shopping semantics can be expressed by
the pattern template lt X, Y, X gt.
8Web server access logs (Web retailor selling
sports wear products)
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
Manager
Sequence Data
ltNike shoes, Adidas Shoes, Nike Shoesgt is one of
the instantiations of the pattern template. Since
the browsing sequence of member 688 contains/
posses the pattern, the sequence contributes to 1
count in the cell.
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt 1
lt Nike Shoes, Puma Shoes, Nike Shoes gt ?
lt Nike Shoes, Nike Shoes, Nike Shoes gt ?
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt ?
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt ?
Browsing Sequence
9Web server access logs (Web retailor selling
sports wear products)
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
Manager
Sequence Data
The aggregated number of members is counted and a
tabulated view of the sequence data should be
returned.
ltNike shoes, Adidas Shoes, Nike Shoesgt is one of
the instantiations of the pattern template. Since
the browsing sequence of member 688 contains/
posses the pattern, the sequence contributes to 1
count in the cell.
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt 200,000
lt Nike Shoes, Puma Shoes, Nike Shoes gt 5,432
lt Nike Shoes, Nike Shoes, Nike Shoes gt 13,200
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt 1,020
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt 4,331
Browsing Sequence
10Web server access logs (Web retailor selling
sports wear products)
I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
Time member- ID URL Product Product type Brand
2008-1-01 0001 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0002 688 /product.html?pid13250 13250 Adidas shoes Adidas
2008-1-01 0010 14230 /product.html?pid324 324 Puma shoes Puma
2008-1-01 0245 688 /product.html?pid12800 12800 Nike shoes Nike
2008-1-01 0349 688 /product.html?pid329 329 Adidas T-shirts Adidas
2008-1-01 0345 14230 /checkout.xhtml Nil Nil Nil
Sequence OLAP system
Query
- Support pattern based grouping and aggregation.
Manager
The aggregated number of members is counted and a
tabulated view of the sequence data should be
returned.
Result
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt 200,000
lt Nike Shoes, Puma Shoes, Nike Shoes gt 5,432
lt Nike Shoes, Nike Shoes, Nike Shoes gt 13,200
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt 1,020
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt 4,331
11I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
There are so many members did comparison shopping
between Nike shoes and Addidas shoes, I would
like to further investigate whether those members
would browse one more product and if so what is
the product.
Sequence OLAP system
Follow up Query
- Support pattern based grouping and aggregation.
Manager
- Obtain query results in real time (OLAP feature).
lt X, Y, X, Z gt XNike Shoes, YAdidas Shoes, ZAny Members
lt Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes gt 15,000
lt Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts gt 180,000
lt Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes gt 9,000
Result
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt 200,000
lt Nike Shoes, Puma Shoes, Nike Shoes gt 5,432
lt Nike Shoes, Nike Shoes, Nike Shoes gt 13,200
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt 1,020
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt 4,331
The new query can be expressed by appending a
pattern symbol Z to form a new pattern template
ltX,Y,X,Zgt. The result shows the statistics of
one more browsing step after the comparison
shopping between Nike Shoes and Adidas Shoes
12I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
There are so many members did comparison shopping
between Nike shoes and Addidas shoes, I would
like to further investigate whether those members
would browse one more product and if so what is
the product.
Sequence OLAP system
Follow up Query
- Support pattern based grouping and aggregation.
Manager
- Obtain query results in real time (OLAP feature).
This manager find out that Adidas T-shirts page
is the most popular page for the members who did
comparison shopping between Nike shoes and Adidas
shoes pages.
lt X, Y, X, Z gt XNike Shoes, YAdidas Shoes, ZAny Members
lt Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes gt 15,000
lt Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts gt 180,000
lt Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes gt 9,000
Result
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt 200,000
lt Nike Shoes, Puma Shoes, Nike Shoes gt 5,432
lt Nike Shoes, Nike Shoes, Nike Shoes gt 13,200
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt 1,020
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt 4,331
The new query can be expressed by appending a
pattern symbol Z to form a new pattern template
ltX,Y,X,Zgt. The result shows the statistics of
one more browsing step after the comparison
shopping between Nike Shoes and Adidas Shoes
13I would like to know the number of members that
did comparison shopping and their distributions
over all product web page to product web page
pairs within 2008 Quarter 1.
There are so many members did comparison shopping
between Nike shoes and Addidas shoes, I would
like to further investigate whether those members
would browse one more product and if so what is
the product.
The comparison shopping patterns displayed in the
product type abstraction level is too detailed,
I would like to view some higher level
statistics.
Sequence OLAP system
Query
- Support pattern based grouping and aggregation.
Manager
- Obtain query results in real time (OLAP feature).
- Provide OLAP operations to ease sequence
analysis.
lt X, Y, X, Z gt XNike Shoes, YAdidas Shoes, ZAny Members
lt Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes gt 15,000
lt Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts gt 180,000
lt Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes gt 9,000
Result
A simple roll up operation on the pattern
template transforms the summary statistics to the
brand abstraction level.
Product type abstraction level
lt X, Y, X gt Members
lt Nike Shoes, Adidas Shoes, Nike Shoes gt 200,000
lt Nike Shoes, Puma Shoes, Nike Shoes gt 5,432
lt Nike Shoes, Nike Shoes, Nike Shoes gt 13,200
lt Adidas Shoes, Nike Shoes, Adidas Shoes gt 1,020
lt Adidas Shoes, Puma Shoes, Adidas Shoes gt 4,331
brand abstraction level
lt X, Y, X gt Members
lt Nike, Adidas, Nikegt 3,150,000
lt Nike, Puma, Nike gt 2,180,000
lt Nike, Nike, Nike gt 19,000,000
14Research Objective
lt X, Y gt Members
lt Nike, Adidasgt 1,315,000
lt Nike, Puma gt 6,480,000
lt Nike, Nikegt 3,189,000
lt X, Y, X gt Members
lt Nike, Adidas, Nikegt 315,000
lt Nike, Puma, Nike gt 2,180,000
lt Nike, Nike, Nike gt 189,000
- To design and implement an OLAP system that is
able to - support pattern based grouping and aggregation.
- obtain query results in real-time.
- Especially optimized for interactive/iterative
queries. - provide OLAP operations to ease explorative
analysis of sequence data.
15RFID Logs
- Radio-frequency identification (RFID) is an
automatic identification method, relying on
storing and remotely retrieving data using
devices called RFID tags. - The smart card system in public transits
- Octopus card Hong Kong, Orca card in Seattle
(2009)etc - Electronic money
- Travel history of passengers are logged in a
database. - Generate massive amount of sequence data.
16RFID Logs
Event Database
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0223 Eric Chinatown Machine 10 Add value 100
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
- Radio-frequency identification (RFID) is an
automatic identification method, relying on
storing and remotely retrieving data using
devices called RFID tags. - The smart card system in public transits
- Octopus card Hong Kong, Orca card in Seattle
(2009)etc - Electronic money
- Payment can be done easily by waving the card
over the card reader. - Travel history of passengers are logged in a
database. - Generate massive amount of sequence data .
17Event Database
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0223 Eric Chinatown Machine 10 Add value 100
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2008 Quarter 4.
18Event Database
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0223 Eric Chinatown Machine 10 Add value 100
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
Round trip statistics
lt X, Y, Y, X gt Users
lt Seattle Airport, Chinatown, Chinatown, Seattle Airport gt 12,032
lt Seattle Airport, University Street, University Street, Seattle Airport gt 982
lt Chinatown, University Street, University Street, Chinatown gt 822
lt Chinatown, Seattle Airport, Seattle Airport, Chinatowngt 1,020
Result
Query
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2008 Quarter 4.
19Sequence Data Cuboid
A logical view of sequence data at a particular
degree of summarization.
20Preliminary
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2008 Quarter 4.
- Sequence Cuboid (S-Cuboid)
- a logical view of sequence data at a particular
degree of summarization. - sequences can be characterized by
- attributes values (e.g.
time) - the subsequence/ substring patterns they possess.
(e.g. ltX,Y,Xgt , ltX,Y,Y,Xgt)
Sequence OLAP
An S-Cuboid
lt X, Y, Y, X gt Users
lt Airport, Chinatown, Chinatown, Airport gt 2
lt Airport, Westlake, Westlake, Airport gt 9
21Phase 1. Sequence Formation
Event Database
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0223 Eric Chinatown Machine 10 Add value 100
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
Event Selection
An event selection step to select a set of a
relevant records and attributes.
22Phase 1. Sequence Formation
Event Database
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0223 Eric Chinatown Machine 10 Add value 100
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
Event Selection
A sequence formation step to form sequences from
the event dataset.
Sequence Formation
Sequences can be formed per day and for each
individual user. By doing this, we have a number
of daily travel sequences of each user. E.g. S1
is Erics trip on Monday
User Individual, Time Day
Seq ID Sequence of events
S1 lt e1, e2, e102, e180gt
S2 lt e3, e7, e8, e12 , e19, e232 , e234, e235 gt
S3 lt e4, e5, e9, e13 , e14, e290 , e292, e352 gt
23Phase 1. Sequence Formation
Event Database
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric University Street out -5
2008-6-14 0223 Eric University Street Machine 10 Add value 100
2008-6-14 0225 Eric University Street in 0
2008-6-14 1849 Eric Seattle Airport out -5
Time Card-ID Location Action Amount
2008-6-09 0001 Eric Seattle Airport in 0
2008-6-09 0225 Eric Chinatown out -5
2008-6-14 0225 Eric Chinatown in 0
2008-6-14 1849 Eric Seattle Airport out -5
Event Selection
Sequences can also be formed according to time
dimension at the abstraction level of year and
per individual user.
Sequence Formation
User Individual, Time Day
User Individual, Time Year
Seq ID Sequence of events
S1 lt e1, e2, e102, e180gt
S2 lt e3, e7, e8, e12 , e19, e232 , e234, e235 gt
S3 lt e4, e5, e9, e13 , e14, e290 , e292, e352 gt
Seq ID Sequence of events
S1 lt e1, e2 , e102, e180 , e1002, e1800 , e1801 , gt
S2 lt e3, e7, e8, e12 , e19, e232 , e234, e235 , e2134, e2135 gt
S3 lt e4, e5, e9, e13 , e14, e290 , e292 , e352 , e3252,gt
24Phase 2. S-Cuboid construction
User Individual, Time Day
Seq ID Sequence of events
S1 lt e1, e2, e102, e180gt
S2 lt e3, e7, e8, e12 , e19, e232 , e234, e235 gt
S3 lt e4, e5, e9, e13 , e14, e290 , e292, e352 gt
Monday
25Phase 2. S-Cuboid construction
A sequence grouping step to group the sequences
that share the same dimensions values into a
sequence group. E.g. travel sequences are grouped
according to their fair groups.
Sequence Grouping
User Individual, Time Day
Seq ID Sequence of events
S1 lt e1, e2, e102, e180gt
S2 lt e3, e7, e8, e12 , e19, e232 , e234, e235 gt
S3 lt e4, e5, e9, e13 , e14, e290 , e292, e352 gt
Monday
26Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Pattern Grouping
Sequence Grouping
The pattern grouping step further groups the
sequences according to the patterns they
possess.
User Individual, Time Day
Seq ID Sequence of events
S1 lt e1, e2, e102, e180gt
S2 lt e3, e7, e8, e12 , e19, e232 , e234, e235 gt
S3 lt e4, e5, e9, e13 , e14, e290 , e292, e352 gt
27Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Each cell represents an instantiated pattern E.g.
ltSeattle Airport, Chinatown, Chinatown, Seattle
Airportgt We assign sequences to a cell if that
sequence contains the instantiated pattern.
Pattern Grouping
Event Time Card-ID Location Action Amount
e1 2008-6-09 0001 Eric Seattle Airport in 0
e2 2008-6-09 0225 Eric Chinatown out -5
e102 2008-6-09 2225 Eric Chinatown in 0
e180 2008-6-09 2349 Eric Seattle Airport out -5
The pattern grouping step further groups the
sequences according to the patterns they
possess.
S1
Chinatown
S3
Seattle Airport
28Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Each cell represents an instantiated pattern E.g.
ltSeattle Airport, Chinatown, Chinatown, Seattle
Airportgt We assign sequences to a cell if that
sequence contains the instantiated pattern.
Pattern Grouping
Aggregated Value
Finally, an aggregation function is applied to
the sequences in each cuboid cell.
Count 2
S1
Chinatown
S3
Seattle Airport
29Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Pattern Grouping
Aggregated Value
Count 2
S1
Chinatown
S3
4D S-Cuboid
lt X, Y, Y, X gt Users
lt Seattle Airport, Chinatown, Chinatown, Seattle Airport gt 2
lt Seattle Airport, University Street, University Street, Seattle Airport gt 9
Seattle Airport
4D S-Cuboid
30Phase 2. S-Cuboid construction
Pattern X,Y,Y,X
Pattern Grouping
Aggregated Value
Count 2
S1
Chinatown
S3
4D S-Cuboid
lt X, Y, Y, X gt Users
lt Seattle Airport, Chinatown, Chinatown, Seattle Airport gt 2
lt Seattle Airport, University Street, University Street, Seattle Airport gt 9
Seattle Airport
4D S-Cuboid
31Sequence Cuboid query language
This query specifies the construction of the
S-Cuboid that answer the round trip query in the
running example.
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.
4D S-Cuboid
lt X, Y, Y, X gt Users
lt Seattle Airport, Chinatown, Chinatown, Seattle Airport gt 2
lt Seattle Airport, University Street, University Street, Seattle Airport gt 9
32Sequence Cuboid query language
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.
4D S-Cuboid
lt X, Y, Y, X gt Users
lt Seattle Airport, Chinatown, Chinatown, Seattle Airport gt 2
lt Seattle Airport, University Street, University Street, Seattle Airport gt 9
33Sequence Cuboid query language
Any changes to the cuboid specification
transforms the S-Cuboid to another. E.g. chaning
the pattern template to (X,Y,Y,X,Z) generates
another S-Cuboid.
Sequence Formation
Sequence Grouping
Global dimensions
Pattern template
Pattern dimensions
Pattern Grouping
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.
4D S-Cuboid
lt X, Y, Y, X gt Users
lt Seattle Airport, Chinatown, Chinatown, Seattle Airport gt 2
lt Seattle Airport, University Street, University Street, Seattle Airport gt 9
34Properties of S-Cuboids
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
Notice that modifying the pattern template
essentially changes the cuboid specification and
thus generates a new cuboid.
35Properties of S-Cuboids
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
In traditional OLAP systems, data are
summarizable. i.e. Summaries in finer abstraction
level can be used to construct the summary in
higher abstraction level.
1 1 1 1 1 1 1
36Properties of S-Cuboids
Sequence Database
S-Cuboid (Finer aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown gt
Kit lt Airport, Chinatown, Chinatown, Airport gt
The S-Cuboid with pattern template ltX,Y,Zgt
Traditional OLAP
1 1 1 1 1 1 1
Sales
Wed
Tue
Sat
Sun
Thur
Fri
Mon
7
Sales
Summarizable!
Whole week
37Properties of S-Cuboids
Can we compute the S-Cuboid with pattern ltX,Ygt
(coarser summary) from the S-Cuboid with pattern
ltX,Y,Zgt (finer summary) without looking at the
sequence database?
Sequence Database
S-Cuboid (Finer aggregates)
S-Cuboid (Coarser aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown gt
Kit lt Airport, Chinatown, Chinatown, Airport gt
lt X, Y gt Count
lt Airport, Chinatowngt ?
The S-Cuboid with pattern template ltX,Y,Zgt
Traditional OLAP
1 1 1 1 1 1 1
Sales
Wed
Tue
Sat
Sun
Thur
Fri
Mon
7
Sales
Summarizable!
Whole week
38Properties of S-Cuboids
Can we compute the S-Cuboid with pattern ltX,Ygt
(coarser summary) from the S-Cuboid with pattern
ltX,Y,Zgt (finer summary) without looking at the
sequence database?
Sequence Database
S-Cuboid (Finer aggregates)
S-Cuboid (Coarser aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown gt
Kit lt Airport, Chinatown, Chinatown, Airport gt
lt X, Y gt Count
lt Airport, Chinatowngt 2
S-Cuboid (Coarser aggregates)
Sequence Database
S-Cuboid (Finer aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown, Chinatown gt
Kit lt Airport, Westlake gt
lt X, Y gt Count
lt Airport, Chinatowngt 1
The problem is that we dont know if the counts
in these two patterns are generated from the same
sequence, or two different sequences.
Traditional OLAP
1 1 1 1 1 1 1
Sales
Wed
Tue
Sat
Sun
Thur
Fri
Mon
7
Sales
Summarizable!
Whole week
39Properties of S-Cuboids
Can we compute the S-Cuboid with pattern ltX,Ygt
(coarser summary) from the S-Cuboid with pattern
ltX,Y,Zgt (finer summary) without looking at the
sequence database?
Sequence Database
S-Cuboid (Finer aggregates)
S-Cuboid (Coarser aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown gt
Kit lt Airport, Chinatown, Chinatown, Airport gt
lt X, Y gt Count
lt Airport, Chinatowngt 2
S-Cuboid (Coarser aggregates)
Sequence Database
S-Cuboid (Finer aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown, Chinatown gt
Kit lt Airport, Westlake gt
lt X, Y gt Count
lt Airport, Chinatowngt 1
The problem is that we dont know if the counts
in these two patterns are generated from the same
sequence, or two different sequences.
Traditional OLAP
1 1 1 1 1 1 1
Sales
Wed
Tue
Sat
Sun
Thur
Fri
Mon
7
Sales
Summarizable!
Non-Summarizable!
Whole week
40Properties of S-Cuboids
Can we compute the S-Cuboid with pattern ltX,Ygt
(coarser summary) from the S-Cuboid with pattern
ltX,Y,Zgt (finer summary) without looking at the
sequence database?
Sequence Database
S-Cuboid (Finer aggregates)
S-Cuboid (Coarser aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Non-summarizable
- Coarser aggregates cannot be computed solely from
the corresponding finer aggregates.
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown gt
Kit lt Airport, Chinatown, Chinatown, Airport gt
lt X, Y gt Count
lt Airport, Chinatowngt 2
S-Cuboid (Coarser aggregates)
Sequence Database
S-Cuboid (Finer aggregates)
lt X, Y, Z gt Count
lt Airport, Chinatown, Airport gt 1
lt Airport, Chinatown, Chinatown gt 1
Seq ID Sequence of events
Eric lt Airport, Chinatown, Airport, Chinatown, Chinatown gt
Kit lt Airport, Westlake gt
lt X, Y gt Count
lt Airport, Chinatowngt 1
The problem is that we dont know if the counts
in these two patterns are generated from the same
sequence, or two different sequences.
41Properties of S-Cuboids
- Infinite number of S-cuboids
- The number of pattern dimensions is infinite
- Pattern Template (X,Y,Y,X,A,B,)
- Full materialization is impossible!
- Non-summarizable
- Coarser aggregates cannot be computed solely from
the corresponding finer aggregates. - Partial materialization is infeasible!
42Properties of S-Cuboids
- Research direction
- Precompute some other auxiliary data structures
so that queries can be computed online using the
pre-built data structures
43S-OLAP Specific
Operations
Assist explorative analysis of the sequence data
44S-OLAP specific operations
- Navigate between cuboids with ease
- Traditional OLAP operations for Global Dimensions
- SLICE, DICE, ROLL-UP, DRILL-DOWN, etc.
- New S-OLAP operations for Pattern Dimensions
- APPEND(X) (X,Y,Y) ? (X,Y,Y,X)
- DE-TAIL (X,Y,Y,X) ? (X,Y,Y)
- PREPEND(Z) (X,Y,Y,X) ? (Z,X,X,Y,Y)
- DE-HEAD (Q,Y,Y,X) ? (Y,Y,X)
- PATTERN-ROLL-UP(X) (X,Y,Y,X) ? (X,Y,Y,X)
- PATTERN-DRILL-DOWN(X) (X,Y,Y,X) ? (x,Y,Y,x)
Coarser abstraction level
Finer abstraction level
45Tell me the summary statistics of the single trip
travel patterns of passengers between different
location sections, please ?.
Sequence OLAP
46S-Cuboid 1 (44 cells)
lt X, Y gt , X and Y at section level Passenger
lt Airport section, Downtown sectiongt 120,000
lt Airport section, SODO section gt 8,000
Sequence OLAP
47S-Cuboid 1 (44 cells)
lt X, Y gt , X and Y at section level Passenger
lt Airport section, Downtown sectiongt 120,000
lt Airport section, SODO section gt 8,000
Sequence OLAP
More detailed statistics of passengers traveling
from the Airport section to each Downtown
station, please ?.
48S-Cuboid 1 (44 cells)
lt X, Y gt , X and Y at section level Passenger
lt Airport section, Downtown sectiongt 120,000
lt Airport section, SODO section gt 8,000
Sequence OLAP
S-Cuboid 2 (16 cells)
lt X, Y gt , X at section level, Y at station level XAirport section, YDowntown section Passenger
lt Airport station, Chinatowngt 100,000
lt Airport branch, Pioneer Square gt 8,300
lt Airport branch, University Street gt 4,030
lt Airport branch, Westlake gt 2,430
Instead of specifying the S-Cuboid construction
query, a SLICE plus a P-DRILL-DOWN(Y) is done.
49S-Cuboid 1 (44 cells)
lt X, Y gt , X and Y at section level Passenger
lt Airport section, Downtown sectiongt 120,000
lt Airport section, SODO section gt 8,000
Sequence OLAP
S-Cuboid 2 (16 cells)
lt X, Y gt , X at section level, Y at station level XAirport section, YDowntown section Passenger
lt Airport station, Chinatowngt 100,000
lt Airport branch, Pioneer Square gt 8,300
lt Airport branch, University Street gt 4,030
lt Airport branch, Westlake gt 2,430
S-Cuboid 3 (166 cells)
lt X, Y ,Ygt , X at section level, Y at station level XAirport section, YDowntown section Passenger
lt Airport section, Chinatown, Chinatown gt 90,000
lt Airport section, Pioneer Square, Pioneer Square gt 8,300
lt Airport section, University Street, University Street gt 4,030
lt Airport section, Westlake, Westlake gt 2,430
50S-Cuboid 1 (44 cells)
lt X, Y gt , X and Y at section level Passenger
lt Airport section, Downtown sectiongt 120,000
lt Airport section, SODO section gt 8,000
Sequence OLAP
S-Cuboid 2 (16 cells)
lt X, Y gt , X at section level, Y at station level XAirport section, YDowntown section Passenger
lt Airport station, Chinatowngt 100,000
lt Airport branch, Pioneer Square gt 8,300
lt Airport branch, University Street gt 4,030
lt Airport branch, Westlake gt 2,430
S-Cuboid 3 (166 cells)
lt X, Y ,Ygt , X at section level, Y at station level XAirport section, YDowntown section Passenger
lt Airport section, Chinatown, Chinatown gt 90,000
lt Airport section, Pioneer Square, Pioneer Square gt 8,300
lt Airport section, University Street, University Street gt 4,030
lt Airport section, Westlake, Westlake gt 2,430
The S-OLAP operations not only assists the
exploratory analysis of the sequence data, it
also hides all the technical details of
specifying the S-Cuboid query from the business
users.
51System Architecture
Skip
52System Architecture
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
53System Architecture
The job of the Sequence Query Engine is to
compose sets of event sequences out of the event
dataset (Phase 1 in S-Cuboid construction).
Sequence Query Engine
Event Dataset
Sequence Cache
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
54System Architecture
The job of the Sequence Query Engine is to
compose sets of event sequences out of the event
dataset (Phase 1 in S-Cuboid construction).
Queries
Sequence Query Engine
User Interface
Event Dataset
Sequence Cache
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
55System Architecture
Given an S-Cuboid query, the SOLAP Engine
consults a Cuboid Repository to see if such an
S-cuboid has been previously computed and stored.
Queries
Sequence Query Engine
Sequence OLAP Engine
User Interface
Event Dataset
Sequence Cache
Results
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
56System Architecture
The SOLAP Engine computes the S-cuboid with the
help of certain Auxiliary Data Structures.
Given an S-Cuboid query, the SOLAP Engine
consults a Cuboid Repository to see if such an
S-cuboid has been previously computed and stored.
Queries
Sequence Query Engine
Sequence OLAP Engine
User Interface
Event Dataset
Sequence Cache
Results
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
57System Architecture
The SOLAP Engine computes the S-cuboid with the
help of certain Auxiliary Data Structures.
Given an S-Cuboid query, the SOLAP Engine
consults a Cuboid Repository to see if such an
S-cuboid has been previously computed and stored.
Queries
Sequence Query Engine
Sequence OLAP Engine
User Interface
Event Dataset
Sequence Cache
Results
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
58Auxiliary Data Structures
Counter based approach
Inverted indices approach
59Counter-Based approach
- Counter-Based approach
- Each cell in an S-cuboid is associated with a
counter. - To determine the counters values, the entire set
of sequences is scanned. - For each sequence s, we determine the cells whose
associated patterns are contained in s and
increment each of such counters by 1. - Basic and simple
- But processing iterative queries requires
Counting from scratch.
60S-OLAP query evaluation
- Inverted-Index Approach
- Based on the fragment cube (X. Li, J. Han, and H.
Gonzalez. VLDB 2004) concept. - A set of inverted indices are created by
pre-processing the data offline. - Algorithm BuildIndex (see paper)
- During query processing, the relevant inverted
indices are joined based on the matching pattern,
in real-time. - Algorithm QueryIndices (see paper)
- By-products of answering a query is the creation
of new inverted indices. - Newly built indices are useful to the processing
of iterative S-OLAP operations (see paper for
algorithms)
61Experiments
- A prototype S-OLAP system was implemented using
C. - Real Data
- Passenger traveling history.
- KDD Cup 2000
- Clickstream data from a web retailer selling
legwear and legcare products. - 50,524 sequences.
- KDD Cup 2000 Question 1
- Look for page-click patterns
- We answer this question in an exploratory way via
three iterative queries.
62Experiments
The corresponding pattern template to capture the
2 steps navigation semantics is ltX,Ygt.
Cuboid Qa (4444 cells)
lt X, Ygt X,Y at page category level User sessions
lt Main page, Product Cataloggt 6,524
lt Product Catalog, Legwear Product gt 2,201
lt Main page, Promotion ad gt 852
lt Product Catalog, Legcare Product gt 150
Qa Look for the statistics of all 2- steps
navigations in the page category level.
- KDD Cup 2000 Question 1
- Look for page-click patterns
- We answer this question in an exploratory way via
three iterative queries
63Experiments
2. P-DRILL-DOWN
Cuboid Qa (4444 cells)
lt X, Ygt X,Y at page category level User sessions
lt Main page, Product Cataloggt 6,524
lt Product Catalog, Legwear Product gt 2,201
lt Main page, Promotion ad gt 852
lt Product Catalog, Legcare Product gt 150
Qa Look for the statistics of all 2- steps
navigations in the page category level.
1.SLICE
Qb Since there are many visitors browse from the
product catalog to a legwear product page. What
exactly are the products they browse?
Cuboid Qb (1279 cells)
lt X, Y gt (sliced) X at page category level Y at page level User sessions
lt Product Catalog, Nullgt 181
lt Product Catalog, PID - 34839 gt 172
lt Product Catalog, PID - 34897 gt 163
The most popular product that visitors
browse from the catalog page is the product 34839
(DKNY skin legwear collection product)
64Experiments
Cuboid Qa (4444 cells)
lt X, Ygt X,Y at page category level User sessions
lt Main page, Product Cataloggt 6,524
lt Product Catalog, Legwear Product gt 2,201
lt Main page, Promotion ad gt 852
lt Product Catalog, Legcare Product gt 150
Qa Look for the statistics of all 2- steps
navigations in the page category level.
Qb Since there are many visitors browse from the
product catalog to a legwear product page. What
exactly are the products they browse?
Qc APPEND(Z)
Cuboid Qb (1279 cells)
The runtime of II is higher than CB in Qa because
we include the indices precomputation time in Qa.
lt X, Y gt (sliced) X at page category level Y at page level User sessions
lt Product Catalog, Nullgt 181
lt Product Catalog, PID - 34839 gt 172
lt Product Catalog, PID - 34897 gt 163
Cuboid Qc (1279279 cells)
lt X, Y, Z gt (sliced) X at page category level Y, Z at page level User sessions
lt Product Catalog, PID - 34839, PID - 34839 gt 17
lt Product Catalog, PID - 34839, PID - 34897 gt 14
65Experiments
Cuboid Qa (4444 cells)
lt X, Ygt X,Y at page category level User sessions
lt Main page, Product Cataloggt 6,524
lt Product Catalog, Legwear Product gt 2,201
lt Main page, Promotion ad gt 852
lt Product Catalog, Legcare Product gt 150
Qa Look for the statistics of all 2- steps
navigations in the page category level.
Qb Since there are many visitors browse from the
product catalog to a legwear product page. What
exactly are the products they browse?
Qc APPEND(Z)
Cuboid Qb (1279 cells)
The runtime of II is higher than CB in Qa because
we include the indices precomputation time in Qa.
For the iterative queries, II takes the advantage
of processing only the sequences that possess the
pattern lt Product catalog, Legwear Productgt.
lt X, Y gt (sliced) X at page category level Y at page level User sessions
lt Product Catalog, Nullgt 181
lt Product Catalog, PID - 34839 gt 172
lt Product Catalog, PID - 34897 gt 163
Cuboid Qc (1279279 cells)
lt X, Y, Z gt (sliced) X at page category level Y, Z at page level User sessions
lt Product Catalog, PID - 34839, PID - 34839 gt 17
lt Product Catalog, PID - 34839, PID - 34897 gt 14
66Experiments on synthetic data
- Study the scalability of Counter-Based approach
(CB) and Inverted-Index approach (II) under a
series of APPEND operations - QA1 (X,Y)? SLICE APPEND ? QA2 (X,Y,Z) ? SLICE
APPEND ? QA3 (X,Y,Z,A) ? SLICE APPEND ? QA4
(X,Y,Z,A,B) ? SLICE APPEND ? QA5 (X,Y,Z,A,B,C)
67Experiments on synthetic data
Cumulative runtime
Both CB and II scale linearly w.r.t. number of
sequences. II outperformed CB in all datasets in
this experiment.
II precomputation time less than 4 secs in all
cases
68Experiments on synthetic data
Cumulative runtime
Both CB and II scale linearly w.r.t. number of
sequences. II outperformed CB in all datasets in
this experiment.
Cumulative sequence scanned
II precomputation time less than 4 secs in all
cases
CB scans the entire dataset once on each
iterative query. For Qa1, II does not need to
scan any data sequences because the query can be
answered by inverted indices directly.
69Experiments on synthetic data
- Vary
- Average sequence length (L)
- Data distribution (Skew factor)
- Domain of the events (I)
- P-ROLL-UP operation
- P-DRILL-DOWN operation
- ltX,Y,Y,Xgt pattern templates
- Substring / Subsequence pattern templates
- (See technical report)
70Conclusion
- We propose a new online analytical processing
system for sequence data analysis (The S-OLAP
system). - The proposed system is motivated by real-life
problems. - Page click analysis
- RFID log analysis
- etc
- We defined basic concepts
- S-Cuboid, S-Cube
- Identified two properties of S-Cube
- Infinite number of S-Cuboid
- Non-summarizable
- Illustrated the usability of the proposed S-OLAP
system through a prototype system that works on
real data.
71Ongoing/ Future works
- Performance
- Auxiliary data structures
- How to efficiently compute the S-Cuboids
- How to support the proposed S-OLAP operations
such that a sequence of S-OLAP queries can be
efficiently evaluated? - Streaming model
- Review on the entire OLAP research history
- Iceberg cube
- Approximation cube
- Incremental update
- Security and privacy
- and much moreany suggestions? ?
72The End
Thanks you!
Microsoft, Anton, Eric, You?
73Related Work
- Sequence Databases
- PREDATOR (Seshadri, Livny, and Ramakrishnan
SIGMOD 94, VLDB 96) - DEVise (Ramakrishnan et al. SSDBM 98)
- TS-SQL (Sadri et al. PODS 01)
- OLAP
- Data-cube operator (Gray et al. 95),
iceberg-cube, star-schema, , etc. - OLAP on unconventional data
- RFID-cube (Gonzalez, Han, and Li VLDB 06)
- Stream-cube (Chen et al. VLDB 02)
- XML-cube (Wiwatwattana el al. ICDE 07)
74Sequence Data Cube
Infinite number of S-Cuboids
Non - summarizable
75Sequence Data Cube
- Given
- A set of global dimensions
- A set of pattern dimensions
- A set of concept hierarchies that is associated
with the dimensions - We can define an S-cuboid for each of the
possible subsets of the given dimensions and
abstraction levels. - S-Cuboid with ltX,Y,Z,A,Bgt, ltX,Y,Z,Agt, ltX,Y,Zgt,
ltX,Ygt , ltXgt - The set of S-cuboids forms a lattice
- ? Sequence Data Cube (S-cube)
76Preliminary
- Event
- a tuple inside a fact table
- Dimension (associated with concept hierarchy)
- Time time ? day ? week
- Location station ? branch
- Card-id individual ? fare-group
(student/regular/senior) - Measure
- Count, Amount
- If there is a logical ordering among a set of
events, the events can form a sequence
Event Time Card-ID Location Action Amount
e1 2008-6-09 0001 688 Seattle Airport in 0
e2 2008-6-09 0225 688 Chinatown out -5
e101 2008-6-09 2223 688 Chinatown Machine 10 Add value 100
e102 2008-6-09 2225 688 Chinatown in 0
e180 2008-6-09 2349 688 Seattle Airport out -5
77Preliminary
- Event
- a tuple inside a fact table
- Dimension (associated with concept hierarchy)
- Time time ? day ? week
- Location station ? branch
- Card-id individual ? fare-group
(student/regular/senior) - Measure
- Count, Amount
- If there is a logical ordering among a set of
events, the events can form a sequence
Event Time Card-ID Location Action Amount
e1 2008-6-09 0001 688 Seattle Airport in 0
e2 2008-6-09 0225 688 Chinatown out -5
e101 2008-6-09 2223 688 Chinatown Machine 10 Add value 100
e102 2008-6-09 2225 688 Chinatown in 0
e180 2008-6-09 2349 688 Seattle Airport out -5
78Preliminary
An event dataset
- Event
- a tuple inside a fact table
- Dimension (associated with concept hierarchy)
- Time time ? day ? week
- Location station ? branch
- Card-id individual ? fare-group
(student/regular/senior) - Measure
- Count, Amount
- If there is a logical ordering among a set of
events, the events can form a sequence
Event Time Card-ID Location Action Amount
e1 2008-6-09 0001 688 Seattle Airport in 0
e2 2008-6-09 0225 688 Chinatown out -5
e101 2008-6-09 2223 688 Chinatown Machine 10 Add value 100
e102 2008-6-09 2225 688 Chinatown in 0
e180 2008-6-09 2349 688 Seattle Airport out -5