Title: WHOWEDA : Warehouse of Web Data
1- WHOWEDA Warehouse of Web Data
- Sanjay Kumar Madria
- Department of Computer Science
- Purdue University, West Lafayette, IN 47907
- skm_at_cs.purdue.edu
2WHOWEDA -Key Objectives
- Design a suitable data model to represent web information
- Development of web algebra and query language
- Maintenance of Web data
- Development of knowledge discovery and web mining tools
- Web warehouse
3WHOWEDA - What?
- WareHouse Of Web Data
- Subject - oriented
- Integrated
- Temporal
- Granularity - Lower, higher
- Some summary
- Not updatable
- Alternative information sources
4Web Warehouse?
- Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making
- A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses
5WHOWEDA! www.cais.ntu.edu.sg:8000/whoweda
- A WareHouse Of WEb DAta
- Web Information Coupling Model (WICM)
- Web Objects
- Web Schema
- Web Information Coupling Algebra
- Web Information Maintenance
- Web Mining and Knowledge discovery
6(Figure: WHOWEDA architecture. Components shown: User, WWW, Warehouse Concept Mart, Web Querying and Analysis Component, Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Web Warehouse, Web Marts)
7(Figure: detailed architecture. Components shown: User, WWW, Web Query Display, Warehouse Concept Mart, Web Warehouse, Global Web Manipulation, Global Web Coupling, Pre-processing, Global Ranking, Local Web Manipulation, Local Web Coupling, Local Ranking, Data Visualization, Schema Tightness, Schema Search, Schema Match, Web Select, Web Project, Web Join, Web Union, Web Intersection)
8Web Objects
- Node - url, title, format, size, date, text
- Link - source-url, target-url, label, link-type
- Web tuple
- Web table
- Web schema
- Web database
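The web objects listed above can be pictured as plain data structures. The following is a minimal Python sketch, not the WHOWEDA implementation: the attribute names come from the slide, while the class names, the default values, the `issues.html` URL and the "local" link type value are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    # Attributes taken from the slide: url, title, format, size, date, text.
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    # Attributes taken from the slide: source-url, target-url, label, link-type.
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""


@dataclass
class WebTuple:
    # Maps node and link variable names to the instances bound to them.
    nodes: Dict[str, Node] = field(default_factory=dict)
    links: Dict[str, Link] = field(default_factory=dict)


@dataclass
class WebTable:
    name: str
    tuples: List[WebTuple] = field(default_factory=list)


home = Node(url="http://www.panacea.org/", title="Panacea")
issues = Node(url="http://www.panacea.org/issues.html", title="Issues")  # hypothetical page
e = Link(home.url, issues.url, label="Issues", link_type="local")
diseases = WebTable("Diseases", [WebTuple({"x": home, "y": issues}, {"e": e})])
```

A web database would then simply be a collection of such web tables together with their schemas.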
9Web Schema
- Metadata in the warehouse
- Structural summary of web table
- Information Coupling using a Query graph
- Query graph -> Web schema
- Directed graph represented by an ordered 4-tuple:
- Set of node variables
- Set of link variables
- Connectivities
- Predicates
10(No Transcript)
11(No Transcript)
12url contains headlines
13(No Transcript)
14Schema- example
- Node variables: Xn = {x, y, z, w}
- Link variables: Xl = {e, f, g, h}
- Connectivities: C = x<e>y and x<fg->z and x<fh->w
- The symbol represents an unbound node variable or link variable, that is, a variable not restricted by any predicate.
- - represents one unbound link
- - represents more than one unbound link
15 - Predicates
- P = { x.url = "http://www.mediacity.com.sg/i-square",
- y.url CONTAINS "headlines",
- e.target_url CONTAINS "article",
- f.target_url CONTAINS "newshub/specials",
- g.label CONTAINS "Local News",
- z.url CONTAINS "local",
- h.label CONTAINS "World News",
- w.url CONTAINS "world" }
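To make the 4-tuple concrete, here is a hedged Python sketch of a web schema holding node variables, link variables, connectivities and predicates. The names `WebSchema`, `equals`, `contains` and `predicates_hold` are hypothetical, the sample page URLs under i-square are invented, and the sketch only evaluates predicates; it does not check connectivities.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# A binding maps each variable name to the attributes of the node or link bound to it.
Binding = Dict[str, Dict[str, str]]
Predicate = Callable[[Binding], bool]


def equals(var: str, attr: str, value: str) -> Predicate:
    return lambda b: b.get(var, {}).get(attr) == value


def contains(var: str, attr: str, keyword: str) -> Predicate:
    return lambda b: keyword in b.get(var, {}).get(attr, "")


@dataclass
class WebSchema:
    node_vars: Set[str]
    link_vars: Set[str]
    connectivities: List[Tuple[str, str, str]]  # (source node, link path, target node)
    predicates: List[Predicate]

    def predicates_hold(self, binding: Binding) -> bool:
        # Only the predicate part of the schema is checked in this sketch.
        return all(p(binding) for p in self.predicates)


schema = WebSchema(
    node_vars={"x", "y", "z", "w"},
    link_vars={"e", "f", "g", "h"},
    connectivities=[("x", "e", "y"), ("x", "fg-", "z"), ("x", "fh-", "w")],
    predicates=[
        equals("x", "url", "http://www.mediacity.com.sg/i-square"),
        contains("y", "url", "headlines"),
        contains("g", "label", "Local News"),
        contains("w", "url", "world"),
    ],
)

binding = {
    "x": {"url": "http://www.mediacity.com.sg/i-square"},
    "y": {"url": "http://www.mediacity.com.sg/i-square/headlines.html"},  # invented
    "g": {"label": "Local News"},
    "w": {"url": "http://www.mediacity.com.sg/i-square/world/index.html"},  # invented
}
print(schema.predicates_hold(binding))
```

Representing predicates as small functions keeps the schema declarative while letting a candidate web tuple be tested in one call.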
16Query Graph - Example 1
- Query graph - same as schema except that it has one more parameter to control the results returned.
- Informally, it is a directed connected graph consisting of nodes, links and keywords imposed on them.
- Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/
- Web table Diseases
17(Figure: query graph for Example 1, rooted at http://www.panacea.org/. Variables: x, y, z, w, p, q, e, f, g. Labels: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment)
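18(Figure: a web tuple matching the query graph, rooted at http://www.panacea.org/, for the disease AIDS with evaluation Elisa Test. Instances: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment)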
19Example 2
- Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/
- Web table Drugs
20(No Transcript)
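21(Figure: a web tuple in Drugs, rooted at http://www.panacea.org/, for the drug Indavir and the disease AIDS. Instances: a0, b1, c1, d1, k1, r1, s1. Labels: Issues, List of Diseases, Drug list, Use, Uses of Indavir, Side effects, Side effects of Indavir)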
22Query Language
- Starting from the CS dept. home page at NTU, find all documents that are linked through paths of length less than two containing only local links, and that have "database" in their text.
23- COUPLE WEBTABLE W FROM WWW
- SUCH THAT NODE I, J IN WWW and LINK e, f, g IN WWW AND I<ef,g>J
- WHERE I.url EQUALS "http://www.ntu.edu.sg" AND J.text CONTAINS "database" AND f.link-type EQUALS "local" AND g.link-type EQUALS "local"
24Web Algebra
- Formal foundation of data representation and manipulation in a web warehouse
- Web operators
- Information access operator
- Information manipulation operators
- Web schema operators
- Data visualization operators
25Information access operator
26Information Manipulation
- Web select
- Web project
- Local web coupling
- Web join
- Web Cartesian product
- Web union
- Web intersect
27Web Select
- Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities
- Input is a select schema
- Output is a web table satisfying the select schema
28 - Select the W1 tuples that contain world news about Indonesia since May 1, 1998.
- σ_Ms(W1), where
- Ms = <Xsn, Xsl, Cs, Ps>,
- Xsn = {x, w}, Xsl = {},
- Cs = {},
- Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
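The selection in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: web tuples are modelled as dictionaries from variable names to attribute dictionaries, `web_select` is a hypothetical helper name, and the three sample tuples are invented.

```python
from datetime import date


def web_select(table, predicates):
    # Keep the web tuples for which every predicate of the select schema holds.
    return [t for t in table if all(p(t) for p in predicates)]


# Invented sample content for web table W1.
w1 = [
    {"x": {"date": date(1998, 5, 10)}, "w": {"text": "Indonesia unrest continues"}},
    {"x": {"date": date(1998, 4, 20)}, "w": {"text": "Indonesia talks"}},
    {"x": {"date": date(1998, 5, 12)}, "w": {"text": "Markets in Europe"}},
]

# Ps from the slide: x.date > 1 May 1998 and w.text CONTAINS "Indonesia".
ps = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]

selected = web_select(w1, ps)
```

Only the first sample tuple satisfies both predicates, so `selected` holds a single web tuple.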
29- Xn = {x, y, z, w}, Xl = {e, f, g, h}
- C = x<e>y and x<fg->z and x<fh->w
- P = { x.url = "http://www.mediacity.com.sg/i-square", x.date > "1May1998",
- e.target_url CONTAINS "article",
- f.target_url CONTAINS "newshub/specials",
- g.label CONTAINS "Local News",
- z.url CONTAINS "local",
- h.label CONTAINS "World News",
- w.url CONTAINS "world",
- w.text CONTAINS "Indonesia" }
30Web Information Coupling System
- A database system to couple related web information
- Global Web Coupling and Local Web Coupling
31Global Coupling - Information Access
- To integrate data from the Web
- To create historical data
- To couple related information from the WWW satisfying a query graph
- Operator to create web tables
- From web with no schema to web table with web schema
32Why local web coupling?
- Directly querying the WWW to gather this information is an expensive and repetitive affair
- Web documents containing similar information can reside in different web tables in a web warehouse
- A mechanism to gather this similar information by additional manipulation of the materialized web tables
33Local Web Couple operator
- Two web tuples can be coupled if there exists at least one pair of nodes, one from each tuple, which contain similar information.
34Local Web Couple operator
- The web couple operator is basically a web Cartesian product followed by a web select
- We denote web couple by the symbol
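The decomposition stated above (Cartesian product, then select) can be sketched as follows. This is a minimal illustration, assuming web tuples are dictionaries keyed by variable name and that the two tables use disjoint variable names; the function names, the sample tuples and the coupling condition `theta` are all hypothetical.

```python
from itertools import product


def web_cartesian_product(table1, table2):
    # Concatenate every web tuple of table1 with every web tuple of table2.
    return [{**t1, **t2} for t1, t2 in product(table1, table2)]


def web_select(table, condition):
    return [t for t in table if condition(t)]


def web_couple(table1, table2, condition):
    # Web couple = web Cartesian product followed by web select.
    return web_select(web_cartesian_product(table1, table2), condition)


# Invented sample tuples.
w1 = [
    {"x": {"date": "1998-05-10"}, "w": {"text": "Indonesia: world news"}},
    {"x": {"date": "1998-05-11"}, "w": {"text": "European markets"}},
]
w2 = [
    {"s": {"date": "1998-05-10"}, "t": {"text": "Regional report on Indonesia"}},
    {"s": {"date": "1998-05-11"}, "t": {"text": "Indonesia update"}},
]


def theta(t):
    # Sample coupling condition: same date and both texts mention Indonesia.
    return (t["x"]["date"] == t["s"]["date"]
            and "Indonesia" in t["w"]["text"]
            and "Indonesia" in t["t"]["text"])


coupled = web_couple(w1, w2, theta)
```

The product yields four concatenated tuples, of which only one satisfies the condition, which shows why the select step matters for keeping the coupled table small.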
35Web Coupling
36Example 1
- Produce a list of diseases and their symptoms starting from the web site at http://www.panacea.org/
- Web table Diseases
37(Figure: Web Schema or Query Graph of Diseases, rooted at http://www.panacea.org/. Variables: x, y, z, e. Labels: Issues, List of Diseases, symptoms)
38Web table Diseases
39Example 2
- Produce a list of drugs, and their side effects starting from the web site at http://www.panacea.org/
- Web table Drugs
40(Figure: Web Schema or Query Graph of Drugs, rooted at http://www.panacea.org/. Variables: a, b, c, d, r. Labels: Issues, List of Diseases, Drug list, Side effects)
41Web table Drugs
42(Figure: four web tuples rooted at http://www.panacea.org/. From Diseases: Symptoms of AIDS (x0, y0, z0, e0) and Symptoms of Cancer (x0, y1, z1, e1). From Drugs: Side effects of Ritonavir for AIDS (a0, b1, c2, d2, r2) and Side effects of Beta Carotene for Heart Disorder (a0, b4, c4, d4, r4). Caption: Symptoms, Side effects)
43- M2 = <Xn, Xl, C, P> for W2
- Xn = {s, t, u}, Xl = {k, l, m, n},
- C = s<kl>t and s<mn>u,
- P = { s.url = "http://www.asia1.com.sg/straitstimes/",
- k.label = "REGION",
- l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea.html",
- m.label = "WORLD",
- n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld.html" }
44- W1 is coupled with W2 under the condition θ, where
- θ = (x.date = s.date) AND (w.text CONTAINS "Indonesia") AND (t.text CONTAINS "Indonesia")
- Schema of the coupled table is
45- Xn = {x, y, z, w, s, t, u}, Xl = {e, f, g, h, k, l, m, n}
- C = x<e>y and x<fg->z and x<fh->w and s<kl>t and s<mn>u
- P = { x.url = "http://www.mediacity.com.sg/i-square",
- e.target_url CONTAINS "article",
- f.target_url CONTAINS "newshub/specials",
- g.label CONTAINS "Local News",
- z.url CONTAINS "local",
- h.label CONTAINS "World News",
- w.url CONTAINS "world",
- s.url = "http://www.asia1.com.sg/straitstimes/",
46- k.label = "REGION",
- l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea.html",
- m.label = "WORLD",
- n.target_url = "http://www.asia1.com.sg/straitstimes/pages/world.html",
- x.date = s.date,
- w.text CONTAINS "Indonesia",
- t.text CONTAINS "Indonesia" }
47Local Web Coupling
- Initiated explicitly by the user
- User provides the pair of node variables and the keyword set based on which coupling is to be performed
- Coupling nodes in each pair of web tuples in the input web tables must satisfy one of the coupling conditions
48Types of web coupling
- System driven web coupling: the system decides the coupling nodes. If at least one pair of coupling nodes cannot be identified, then the web tables cannot be coupled.
- User driven web coupling: the user decides the coupling nodes.
- Coupling is performed only on the user-specified node variable(s).
49Attribute driven web coupling
- Attribute driven web coupling: the user specifies the coupling attributes, and coupling is performed only on those user-specified coupling attribute(s).
- COUPLE TABLE3
- FROM TABLE1 AND TABLE2
- ON ATTRIBUTE TEXT
- AT SCHEMA/TUPLE (optional)
50Value Driven web coupling
- Value driven web coupling: the user specifies the values of the attributes of the nodes on which coupling should be performed.
- COUPLE TABLE3
- FROM TABLE1 AND TABLE2
- ON VALUE "Software Agents"
- AT SCHEMA/TUPLE (optional)
51Schema level web coupling
- We inspect the schemas to decide whether the two web tables can be coupled.
- If coupling conditions cannot be identified, then the two web tables cannot be coupled.
- We do not inspect the web tuples in the web table.
- Number of web tuples coupled will be n × m.
52Tuple level web coupling
- We inspect the web tuples of the two input web tables to identify nodes with similar information.
- The number of web tuples in the coupled web table is < n × m
53Why two levels?
- A schema does not capture all the information of the web documents in a web table, so it is not always possible to identify a coupling condition by inspecting the schemas.
- It is possible to find coupling nodes which are not defined in the schemas.
54Why two levels?
- Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (that cannot be identified from their schemas) at the expense of additional processing.
55Conditions for web coupling
- The coupling nodes are and
56Conditions for web coupling
- The coupling nodes are and
57Conditions for web coupling
- The coupling nodes are and
58Conditions for web coupling
- The coupling nodes are and
59Conditions for web coupling
- The coupling nodes are and
60Conditions for web coupling
- The coupling nodes are and
61Conditions for web coupling
- The coupling nodes are and
- For example computer.html
62Conditions for web coupling
- The coupling nodes are and
63Conditions for web coupling
- URLs with the same directory name, such as /computer/, may contain similar information.
- Paths with /cgi-bin/ are not considered.
- Include all conditions for web join.
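The directory-name condition above can be sketched with the Python standard library. This is one possible reading, stated as an assumption: two URLs are treated as candidates when any directory component of their paths coincides, and any path containing /cgi-bin/ is excluded. The helper names and the example hosts are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def directory_names(url):
    path = urlparse(url).path
    # Paths with /cgi-bin/ are not considered.
    if "/cgi-bin/" in path:
        return set()
    # Drop the file name and keep the directory components.
    return {part for part in posixpath.dirname(path).split("/") if part}


def may_couple_by_directory(url_a, url_b):
    # True when the two URLs share at least one directory name.
    return bool(directory_names(url_a) & directory_names(url_b))


same = may_couple_by_directory(
    "http://a.example/computer/ai.html",
    "http://b.example/docs/computer/index.html",
)
skipped = may_couple_by_directory(
    "http://a.example/cgi-bin/computer/query",
    "http://b.example/computer/index.html",
)
```

Here `same` is true because both paths contain the directory `computer`, while `skipped` is false because the first URL goes through /cgi-bin/.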
64Construction of coupled schema (schema level)
- When at least one pair of coupling nodes is identical (same url).
- When none of the pairs are identical.
65Case 1
- If at least one pair of coupling nodes is identical, we construct the coupled schema as discussed in the web join paper (DEXA98).
66Case 2
67Join Processing in Web Databases
68Web Join
- Concatenate tuples based on identical nodes or documents
- Inputs are two web tables and their schemas
- Output is a joined table
- Types
- Pi-web join, theta-web join, outer joins, web composition, semi web join
69Web Join
- Used for combining related data from various web tables
- Mechanism to detect changes
- Mechanism to find an alternative web document in case of a Document Not Found error
70Web Join Operator
- Information manipulation operator
- Manipulate information residing in a web database to derive additional information
- Harness useful, composite information from two web tables
- Capitalize on the reuse of retrieved data from the WWW in order to reduce execution time of queries
71Joinable Nodes
- Node variables participating in the web join process
- Expressed as a pair
- The nodes in each pair should have identical URLs
72Web Join
- Combine two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes
- Joinable nodes are identified from the schemas of the two web tables
- URLs of the joinable nodes are identical
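A web join along these lines can be sketched as a nested loop over the two tables. This is an illustrative sketch only: web tuples are assumed to be dictionaries with disjoint variable names across the two tables, `web_join` is a hypothetical helper, and the page URLs below the site root are invented.

```python
def web_join(table1, table2, joinable_pairs):
    # joinable_pairs lists (variable in table1, variable in table2) pairs
    # identified from the schemas; tuples are concatenated when at least
    # one pair is bound to nodes with identical URLs.
    joined = []
    for t1 in table1:
        for t2 in table2:
            if any(t1[a]["url"] == t2[b]["url"] for a, b in joinable_pairs):
                joined.append({**t1, **t2})
    return joined


diseases = [
    {"x": {"url": "http://www.panacea.org/"},
     "z": {"url": "http://www.panacea.org/aids.html"}},      # invented URL
    {"x": {"url": "http://www.panacea.org/"},
     "z": {"url": "http://www.panacea.org/cancer.html"}},    # invented URL
]
drugs = [
    {"a": {"url": "http://www.panacea.org/"},
     "b": {"url": "http://www.panacea.org/aids.html"},
     "d": {"url": "http://www.panacea.org/indavir-side-effects.html"}},  # invented URL
]

joined = web_join(diseases, drugs, [("z", "b")])
```

With the pair (z, b) as the only joinable nodes, just the AIDS tuple of Diseases is concatenated with the Drugs tuple.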
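73(Figure: web schemas of Diseases and Drugs, both rooted at http://www.panacea.org/. Diseases variables: x, y, z, w, p, q, e, f, g; labels: Issues, List of Diseases, Symptoms list, Symptoms, Evaluation, Treatment list, Treatment. Drugs variables: b, c, d, k, r, s; labels: Issues, Drug list, Side effects, Use, Uses)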
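74(Figure: matching web tuples for AIDS, rooted at http://www.panacea.org/. Diseases instances: x0, y1, z1, w1, p2, q1, e1, f1, g1; labels: AIDS, Symptoms of AIDS, Evaluation, Elisa Test, AIDS treatment. Drugs instances: b1, c1, d1, k1, r1, s1; labels: Indavir, Side effects of Indavir, Uses of Indavir)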
75Pi-Web Join
76Example 1
- Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/
- Web table Diseases
77(Figure: Query Graph (Web Schema) for Example 1, rooted at http://www.panacea.org/; variable shown: z)
78(Figure: A web tuple in Diseases, rooted at http://www.panacea.org/. Instances: x0, z1, p2, q1. Labels: List of Diseases, AIDS, Symptoms list, Evaluation, Elisa Test, Treatment list)
79Example 2
- Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/
- Web table Drugs
80Query Graph (Web Schema) of Drugs
81A web tuple in Drugs
82Web Project
- Eliminate nodes from web tuples which are irrelevant
- Based on project conditions
- Set of node variables
- Start node variable and end-node variable
- Node variable and depth of links
- Used to isolate data of interest in a web table, allowing subsequent web queries to run over a smaller, more structured web table
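The simplest project condition, a set of node variables to retain, can be sketched as below. This is a minimal illustration under assumptions: web tuples are dictionaries keyed by variable name, `web_project` is a hypothetical helper, and the page URLs are invented.

```python
def web_project(table, keep_vars):
    # Retain only the listed node variables in every web tuple.
    return [{v: t[v] for v in keep_vars if v in t} for t in table]


diseases = [
    {"x": {"url": "http://www.panacea.org/"},
     "z": {"url": "http://www.panacea.org/aids.html"},            # invented URL
     "q": {"url": "http://www.panacea.org/aids-treatment-1.html"}},
    {"x": {"url": "http://www.panacea.org/"},
     "z": {"url": "http://www.panacea.org/aids.html"},
     "q": {"url": "http://www.panacea.org/aids-treatment-2.html"}},
]

projected = web_project(diseases, ["x", "z"])
```

After dropping `q`, the two projected tuples are identical, which illustrates how a web project can leave duplicate web tuples in its result.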
83(Figure: A web project on Diseases, rooted at http://www.panacea.org/. Instances: x0, z1, p2. Labels: List of Diseases, AIDS, Symptoms list, Evaluation)
84(Figure: Joined schema, rooted at http://www.panacea.org/. Variables: x, z, p, q, b, d, k. Labels: Disease List, symptoms, evaluation, treatment, Drug list, Side effects, Uses)
85(Figure: Joined Tuple, rooted at http://www.panacea.org/. Instances: x0, z1, p2, q1, b1, d1, k1. Labels: List of Diseases, AIDS, Symptoms list, Evaluation, Elisa Test, Treatment list, Drug list, Indavir, Side effects, Side effects of Indavir, Use, Uses of Indavir)
86Motivation of Pi-web Join
- Quite often the web join operation couples irrelevant nodes
- In a complex web query with several web join operations, the size of the resultant web table can become very large, with many contaminated nodes
- Pi-web join resolves the above limitation by eliminating contaminated nodes
- Reduces the size of the joined web table
87Pi-web Join
- Web join followed by web project
- The projection conditions are specified by the user; the conditions are similar to those of web project
- We do not eliminate the joinable nodes
- By retaining the joinable nodes we preserve the correlation between the information captured from the two web tables
- Pi-web join may result in a web bag
88Example 3
- Produce a list of diseases with their symptoms and side-effects starting from the web site at http://www.panacea.org/
89Procedure
- Perform web join on Diseases and Drugs
- Project node variables b, k, q, p, node variables
between a and q, node variables between b and k,
node variables between b and d
90(Figure: Pi-joined schema, rooted at http://www.panacea.org/. Variables: x, z, d. Labels: Disease List, symptoms, Side effects)
91(Figure: Pi-joined Tuple, rooted at http://www.panacea.org/. Instances: x0, z1, d1. Labels: List of Diseases, AIDS, Symptoms list, Side effects of Indavir)
92Benefits of Pi-web Join
- Minimize the amount of data transmitted over the network in distributed web join processing
- Reduction in storage cost associated with a joined web table
- Reduces cognitive overhead associated with locating relevant nodes
- Improve completeness of schema by removing unbound nodes and links
93Web Bags
- Existence of identical web tuples.
- Created due to web project operation.
- Structure based mining
- Used for discovering
- Visible nodes
- Luminous nodes
- Luminous paths
94Definitions
- Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D
- Luminosity: the reverse of visibility, the number of other distinct documents that are linked from D
- Luminous paths: a set of inter-linked nodes which occurs a number of times in a web table
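The two counting definitions can be illustrated with a short Python sketch. It assumes the links of a web table are available as (source URL, target URL) pairs and uses plain counts of distinct documents; the function name and the sample link structure are invented for illustration, and this is not the exact measure from the cited papers.

```python
from collections import defaultdict


def visibility_and_luminosity(links):
    linked_from = defaultdict(set)  # D -> distinct documents with a link to D
    links_to = defaultdict(set)     # D -> distinct documents linked from D
    for source, target in links:
        if source != target:  # ignore self-links
            linked_from[target].add(source)
            links_to[source].add(target)
    docs = set(linked_from) | set(links_to)
    visibility = {d: len(linked_from[d]) for d in docs}
    luminosity = {d: len(links_to[d]) for d in docs}
    return visibility, luminosity


# Invented link structure over URLs that appear in the examples.
links = [
    ("http://www.panacea.org/", "http://www.cancer.org/desc.html"),
    ("http://www.disease.com/cancer/skin.htm", "http://www.cancer.org/desc.html"),
    ("http://www.panacea.org/", "http://www.disease.com/cancer/skin.htm"),
    ("http://www.panacea.org/", "http://www.cancer.org/desc.html"),  # repeated link
]

visibility, luminosity = visibility_and_luminosity(links)
```

Using sets means a repeated link between the same two documents is counted once, matching the "different web documents" wording of the definition.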
95Inter-site Support
- Quantify the inter-site connectivity of a node in a web table
- Let x be a node and hx denote the host name of node x. Let H be a bag of host names of all nodes in W that have a direct link to/from x. Let Ch be the number of times hx appears in H. Then we define the inter-site support I as
- I = 1 - Ch / |H|
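As a worked example of the inter-site support formula, the sketch below computes I = 1 - Ch / |H| from URLs using the standard library. The function name is hypothetical, the neighbour list is invented, and returning 0.0 for an empty bag H is an assumption, since the slide does not cover that case.

```python
from urllib.parse import urlparse


def inter_site_support(node_url, neighbour_urls):
    hx = urlparse(node_url).hostname
    # H: bag of host names of the nodes with a direct link to/from x.
    hosts = [urlparse(u).hostname for u in neighbour_urls]
    if not hosts:
        return 0.0  # assumed convention for an isolated node
    ch = hosts.count(hx)  # Ch: number of times hx appears in H
    return 1 - ch / len(hosts)


support = inter_site_support(
    "http://www.cancer.org/desc.html",
    [
        "http://www.panacea.org/",
        "http://www.cancer.org/index.html",  # same host as x (invented page)
        "http://www.disease.com/cancer/skin.htm",
        "http://www.jhu.edu/medical/research/cancer.htm",
    ],
)
```

With one of the four neighbours on the same host, Ch = 1 and |H| = 4, so the support is 1 - 1/4 = 0.75; a value near 1 means the node is linked mostly across sites.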
96Steps to find visible nodes
- Input: web table W, node variable x, visibility threshold v
- Output: set of visible nodes and inter-site support for each node
- Create a web table from W where each web tuple contains distinct instances of node x and the preceding node which is linked to x (use project and create distinct tuples if node x has more than 1 incoming edge)
- Eliminate the nodes linked to x in each tuple of the web table using web project
97- Check if the collection of web tuples of node x thus created is a web bag by comparing their URLs
- Create multiplets for each collection of identical nodes
- For each multiplet calculate the node visibility (using the mathematical formula defined, see FODO-98)
- Determine the multiplets with node visibility greater than the threshold
- Create the visible node set and calculate the inter-site support
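The steps above can be sketched as a small pipeline. This is a simplified illustration, not the FODO-98 procedure: node visibility is assumed here to be the number of distinct preceding documents in a multiplet, the inter-site support step is omitted, and the function name, variable names and sample tuples are hypothetical.

```python
from collections import defaultdict


def find_visible_nodes(table, x, pred, threshold):
    # Project each tuple onto (preceding node, x) and keep distinct pairs.
    pairs = {(t[pred]["url"], t[x]["url"]) for t in table}
    # Group instances of x with identical URLs into multiplets.
    multiplets = defaultdict(set)
    for pred_url, x_url in pairs:
        multiplets[x_url].add(pred_url)
    # Assumed visibility: number of distinct preceding documents.
    return {url: len(preds) for url, preds in multiplets.items()
            if len(preds) > threshold}


# Invented web table: y is the node preceding z in each tuple.
table = [
    {"y": {"url": "http://a.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://b.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://c.example/"}, "z": {"url": "http://www.cancer.org/desc.html"}},
    {"y": {"url": "http://a.example/"}, "z": {"url": "http://www.disease.com/cancer/skin.htm"}},
]

visible = find_visible_nodes(table, x="z", pred="y", threshold=2)
```

Only the node reached from three distinct documents exceeds the threshold of 2, so it is the single entry in `visible`.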
98Steps to find luminous nodes
- Input: web table W, node variable x, luminosity threshold l
- Output: set of luminous nodes with inter-site support
- Steps are similar to those of visible node discovery
- We consider the nodes linked from x in place of nodes linked to x
99Steps to find luminous paths
- Input - web table W, nodes x and y
- Output - threshold value for luminous path
- Project nodes between x and y and check for a web bag; otherwise go to the next slide
- Create the collection of multiplets
- Compute path luminosity for each multiplet using the formula
- If the path luminosity value of a multiplet is greater than or equal to the threshold, then a path in the multiplet is a luminous path
100Steps to find luminous paths
- Otherwise, we create a collection of linear web tuples from the above collection of web tuples
- This is to identify whether there exists a subset of inter-linked nodes between x and y that are luminous paths
- We repeat the procedure to compute path luminosity for this set of inter-linked nodes
101Web Schema
(Figure: web schema rooted at http://www.panacea.org/. Variables: x, y, z, e, f. Labels: Diseases, Cancer)
102(Figure: Web Table with five web tuples rooted at http://www.panacea.org/, each with instances x0, y0, e0, f0 and labels Diseases, Cancer. The z instances are z1, z1, z2, z1, z4. URL shown: http://www.cancer.org/desc.html)
103Projected schema
104Cancer
Web Table after eliminating x and y
105Projected schema
(Figure: projected schema rooted at http://www.panacea.org/. Variables: x, y, z, e. Labels: Diseases, Cancer)
106(Figure: Web Bag. Tuples rooted at http://www.panacea.org/ with labels Diseases, Cancer and instances x0, y0, z4. URLs shown: http://www.cancer.org/desc.html (three times), http://www.disease.com/cancer/skin.htm, http://www.jhu.edu/medical/research/cancer.htm)
107After removal of identical tuples
(Figure: node http://www.cancer.org/desc.html)
108(Figure: Cancer nodes. z1 instances with http://www.cancer.org/desc.html; other URLs shown: http://www.disease.com/cancer/skin.htm, http://www.jhu.edu/medical/research/cancer.htm)
109(Figure: node http://www.cancer.org/desc.html)
110Visible Nodes
(Figure: Cancer nodes and their URLs: z1 http://www.cancer.org/desc.html, z2 http://www.disease.com/cancer/skin.htm, z1 http://www.cancer.org/desc.html, z4 http://www.jhu.edu/medical/research/cancer.htm)
111Luminous Paths
112Change Management
- Detect web deltas - w.r.t. user query
- Changes in inter-linked web documents - insert path, delete path, update path
- Representing changes
- Web algebraic operators - Web Join, web outer join
- Querying Changes
113Mining in Web Warehouse
- Web structure mining: involves mining the structures of web documents and their links.
- Web content mining: describes the automatic search of information resources available on-line.
- Web usage mining: includes the data from server access logs, user registration or profiles, user sessions or transactions, etc.
114(No Transcript)
115- From the results returned, find the most visible pages. Assume Z1 is the most visible page with the given threshold.
- This gives estimates about different restaurants selling pizzas.
- A lower threshold gives the set (Z1, Z2) as visible pages, which sell both pizza and pasta.
- Generalize rules such as: out of the 66% of restaurants that offer pizza to their customers, 33% also offer pasta.
116 Application - Luminosity
- Association rules such as: of the X% of all companies that make a product A, Y% also make a set of products B and C.
- Example: certain companies (33%), if they make a product A, also make products B and C.
- The company C makes only the product A.
- That is, of the 66% of companies that make a product A, 33% also make products B and C.
117(No Transcript)
118More Operators . . .
- Web schema operators
- Schema tightness operator, Schema match operator, Schema search operator
- Data visualization operators
- Ranking operators (Global and Local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort
119Partitioning of web tables
- Partitioning web tables
- restructured easily
- indexed easily
- monitored easily
- reorganized easily
- By
- time
- schema tree structure
- keywords
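Partitioning by time, the first criterion listed above, can be sketched as a grouping step. The example below is an assumption-laden illustration: web tuples are dictionaries, the partition key is the (year, month) of one node's `date` attribute, and the helper name and sample data are invented.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(table, node_var):
    # Group web tuples by the (year, month) of the chosen node's date.
    partitions = defaultdict(list)
    for t in table:
        d = t[node_var]["date"]
        partitions[(d.year, d.month)].append(t)
    return dict(partitions)


# Invented sample web table.
table = [
    {"x": {"url": "http://www.panacea.org/", "date": date(1998, 1, 5)}},
    {"x": {"url": "http://www.panacea.org/", "date": date(1998, 1, 20)}},
    {"x": {"url": "http://www.panacea.org/", "date": date(1998, 2, 3)}},
]

monthly = partition_by_month(table, "x")
```

Each partition can then be indexed, monitored or reorganized on its own, for example one web table per month.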
120Warehouse Concept Mart (WCMart)
- Subject oriented
- Concept generation.
- Manually -> Autonomous.
- Used for
- Ranking tuples
- Global web coupling
- Content based mining
121Web Data Refinement
- Improve web schema - schema tightness operator
- Partition web tables based on content and
structure
123(Figure: components shown: WWW, Warehouse Concept Mart, Global Web Coupling, Webtable (Jan), Webtable (Feb), Webtable (Mar), Webtable (Apr))
124(Figure: Webtable (Jan), Webtable (Feb), Webtable (Mar), Webtable (Apr) at lower-level granularity; Web Information Manipulation Operators; summarized data at higher-level granularity)
125What type of information can be summarized?
- Structural
- Content-based
- time-variant analysis
- snapshot analysis
- compare one period with another
- trend analysis
126Structural Summarization
- Most volatile documents
- Sites which change frequently
- Rate of change over time
- A pointer to directly access documents which change rapidly
- Most visible nodes, luminous nodes, luminous paths
- Change with time
- Decrease or increase: analyze the reason
127Content Summarization
- What can be aggregated in a web page?
- Number of links with identical labels
- Number of keywords
- Changes in content with time
- Comparing the changes
- Open question
- XML will improve the ability of analysis of web
data
128Summary
- Current status
- Mechanism for accessing and manipulating web information in WHOWEDA
- Implementing various web operators and query language
- Future research
- What types of information can be summarized?
- What types of knowledge can be mined?
- Refine web warehouse architecture
- www.cais.ntu.edu.sg:8000/whoweda