Pi-Web Join in a Web Warehouse

S. S. Bhowmick, S. K. Madria, W.-K. Ng, E.-P. Lim, Nanyang Technological University, Singapore

1
Pi-Web Join in a Web Warehouse
  • S. S. Bhowmick, S. K. Madria, W.-K. Ng, E.-P. Lim
  • Nanyang Technological University
  • Singapore

2
Presentation Overview
  • Search engines and web query systems
  • Research objectives
  • Current research
  • WHOWEDA
  • Pi-Web Join
  • Benefits of Pi-Web Join
  • Summary

3
If you build it, they will come
  • WWW is chaotic
  • Increasingly difficult to locate information.
  • Related data are scattered in piecemeal fashion
  • Data, data everywhere... but how to find it?

4
Limitations of Search Engines
  • Do not exploit hyperlinks
  • No efficient document management
  • Query results cannot be further manipulated

5
Current Web Research
  • Web query systems: W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog
  • Semistructured data: LOREL, UnQL, WebOQL
  • Website management system: STRUDEL

6
Context of this research
  • Build a web warehouse
  • Web data access
  • Historical web data
  • Information over time
  • Web data manipulation
  • Efficient visualization of web information
  • Maintenance of web data
  • Web data mining
  • Overcome existing limitations

7
WHOWEDA - What?
  • WareHouse Of Web Data
  • Subject-oriented
  • Integrated
  • Temporal
  • Granularity: lower, higher
  • Some summary
  • Not updatable
  • Alternative information sources

8
Web Information Coupling System
  • A system to couple and manipulate related web
    information
  • Web data model
  • Web objects
  • Web algebra

9
Web Objects
  • Node: url, title, format, size, date, text
  • Link: source-url, target-url, label, link-type
  • Web tuple: set of nodes and links
  • Web table: collection of web tuples
  • Web schema
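
The web objects listed above can be pictured as a small data model. The following is only a minimal sketch: the field names follow the slide, while the default values, the `link_type` value, and the `/diseases` URL path are assumptions made for illustration and are not taken from WHOWEDA.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    url: str
    title: str = ""
    format: str = "html"   # assumed default
    size: int = 0
    date: str = ""         # last modification date
    text: str = ""


@dataclass(frozen=True)
class Link:
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = "global"   # hypothetical value


@dataclass(frozen=True)
class WebTuple:
    nodes: frozenset   # set of Node
    links: frozenset   # set of Link


@dataclass
class WebTable:
    name: str
    # A list rather than a set, so identical web tuples can coexist.
    tuples: list = field(default_factory=list)


# Hypothetical example data.
home = Node(url="http://www.panacea.org/", title="Panacea")
diseases = Node(url="http://www.panacea.org/diseases", title="List of Diseases")
link = Link(home.url, diseases.url, label="Disease List")
web_tuple = WebTuple(frozenset({home, diseases}), frozenset({link}))
table = WebTable("Diseases", [web_tuple])
```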

10
Web Schema
  • Metadata in the warehouse
  • Structural summary of web table
  • Coupling of related information begins with a
    query graph
  • Query graph -> Web schema
  • Ordered 4-tuple:
  • Set of node variables
  • Set of link variables
  • Connectivities
  • Predicates
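
As a rough illustration of the ordered 4-tuple above, a web schema could be held in a record with one field per component. This is a sketch under assumptions: the class name `WebSchema`, the triple encoding of connectivities, the link variable `e`, and the use of Python callables as predicates are all hypothetical choices, not the notation used in WHOWEDA.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass
class WebSchema:
    node_vars: Set[str]
    link_vars: Set[str]
    # Each connectivity is (source node variable, link variable, target node variable).
    connectivities: Set[Tuple[str, str, str]]
    # Predicates keyed by the node or link variable they constrain.
    predicates: Dict[str, Callable[[dict], bool]]


def is_well_formed(schema: WebSchema) -> bool:
    """Check that connectivities and predicates only mention declared variables."""
    for src, link, dst in schema.connectivities:
        if src not in schema.node_vars or dst not in schema.node_vars:
            return False
        if link not in schema.link_vars:
            return False
    declared = schema.node_vars | schema.link_vars
    return all(var in declared for var in schema.predicates)


# Hypothetical schema: start at the panacea.org home page (x) and follow
# a link (e) whose label mentions diseases to reach node z.
diseases_schema = WebSchema(
    node_vars={"x", "z"},
    link_vars={"e"},
    connectivities={("x", "e", "z")},
    predicates={
        "x": lambda node: node["url"] == "http://www.panacea.org/",
        "e": lambda link: "Disease" in link["label"],
    },
)

broken_schema = WebSchema({"x"}, {"e"}, {("x", "e", "z")}, {})
```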

11
Example 1
  • Produce a list of diseases with their symptoms,
    evaluation procedures and treatment starting from
    the web site at http://www.panacea.org/
  • Web table: Diseases

12
Query Graph (Web Schema) for Example 1
  • [Figure: query graph; labels: http://www.panacea.org/, z]

13
A web tuple in Diseases
  • [Figure: web tuple; labels: http://www.panacea.org/, x0, z1, q1, p2,
    List of Diseases, AIDS, Symptoms list, Treatment list, Evaluation,
    Elisa Test]

14
Example 2
  • Produce a list of drugs with their uses and side
    effects starting from the web site at
    http://www.panacea.org/
  • Web table: Drugs

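15
Query Graph (Web Schema) of Drugs
  • [Figure: query graph for the Drugs web table]

16
A web tuple in Drugs
  • [Figure: example web tuple from the Drugs web table]
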
17
Web Algebra
  • Formal foundation of data representation and
    manipulation in a web warehouse
  • Web operators
  • Information access operator
  • Information manipulation operators
  • Web schema operators
  • Data visualization operators

18
Global Coupling - Information Access
  • To couple related information from the Web (ER
    98)
  • Match portions of the web that satisfy the web
    schema
  • Input is a query graph
  • Output is a web table
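
To make the input/output contract above concrete, here is a deliberately simplified sketch of global coupling. It assumes the query graph is a single chain of link-label predicates and replaces the live Web with a small in-memory dictionary; the function name `global_coupling`, the `WEB` dictionary, and every URL path below the home page are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the Web: url -> list of (link label, target url).
WEB: Dict[str, List[Tuple[str, str]]] = {
    "http://www.panacea.org/": [
        ("Disease List", "http://www.panacea.org/diseases"),
        ("Drug list", "http://www.panacea.org/drugs"),
    ],
    "http://www.panacea.org/diseases": [("AIDS", "http://www.panacea.org/aids")],
    "http://www.panacea.org/drugs": [],
    "http://www.panacea.org/aids": [],
}


def global_coupling(start_url: str,
                    label_predicates: List[Callable[[str], bool]]) -> List[tuple]:
    """Return a web table: one tuple of URLs for every path from start_url
    whose i-th link label satisfies label_predicates[i]."""
    table: List[tuple] = []

    def extend(url: str, depth: int, path: List[str]) -> None:
        if depth == len(label_predicates):
            table.append(tuple(path))
            return
        for label, target in WEB.get(url, []):
            if label_predicates[depth](label):
                extend(target, depth + 1, path + [target])

    extend(start_url, 0, [start_url])
    return table


# Chain query: follow a link mentioning diseases, then any link.
diseases_table = global_coupling(
    "http://www.panacea.org/",
    [lambda label: "Disease" in label, lambda label: True],
)
```

The recursion depth is bounded by the number of predicates, so cycles in the link structure cannot cause an endless traversal in this sketch.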

19
Web Project
  • Eliminate nodes from web tuples which are
    irrelevant
  • Based on project conditions
  • Set of node variables
  • Start node variable and end-node variable
  • Node variable and depth of links
  • Used to isolate data of interest in a web table,
    allowing subsequent web queries to run over a
    smaller, more structured web table
20
Web Project
  • May create duplicate web tuples (web bag)
  • Duplicate web tuples are not removed
    automatically
  • To justify knowledge discovery (FODO 98)

21
A web project on Diseases
  • [Figure: projected web tuple; labels: http://www.panacea.org/, x0, z1,
    p2, List of Diseases, AIDS, Symptoms list, Evaluation]

22
Web Join Operator
  • Information manipulation operator (DEXA 98)
  • Manipulate information residing in a web
    warehouse to derive additional information
  • Harness useful, composite information from two
    web tables
  • Capitalize on the reuse of retrieved data from
    the WWW in order to reduce execution time of
    queries

23
Joinable Nodes
  • Nodes participating in the web join process
  • Expressed as a pair
  • Each node in the pair should have identical
    contents

24
Web Join
  • Combine two web tables by concatenating a web
    tuple of one web table with a web tuple of the other
    web table whenever there exist joinable nodes
  • Joinable nodes may be identified from the schemas
    of the two web tables
  • URLs of the joinable nodes are identical
    (assuming that the last modification dates are the
    same)
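
The join rule above can be sketched as a nested loop over the two web tables. This is an illustration under assumptions, not the operator as implemented in WHOWEDA: joinable nodes are detected per tuple pair rather than from the schemas, node variable names in the two tables are assumed to be disjoint, and the names `Node`, `joinable_pairs`, `web_join`, the dates, and the URL paths are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Node(NamedTuple):
    url: str
    date: str   # last modification date


WebTuple = Dict[str, Node]   # node variable -> node (assumed representation)


def joinable_pairs(t1: WebTuple, t2: WebTuple) -> List[Tuple[str, str]]:
    """Pairs of variables whose nodes have the same URL and modification date."""
    return [(v1, v2) for v1, n1 in t1.items() for v2, n2 in t2.items() if n1 == n2]


def web_join(table1: List[WebTuple], table2: List[WebTuple]) -> List[WebTuple]:
    joined = []
    for t1 in table1:
        for t2 in table2:
            pairs = joinable_pairs(t1, t2)
            if not pairs:
                continue
            shared = {v2 for _, v2 in pairs}
            merged = dict(t1)
            # The joinable node of t2 is already present under t1's variable.
            merged.update({v: n for v, n in t2.items() if v not in shared})
            joined.append(merged)
    return joined


ROOT = "http://www.panacea.org/"
diseases = [{"x": Node(ROOT, "1998-01-01"),
             "z": Node(ROOT + "aids", "1998-01-01"),
             "q": Node(ROOT + "aids/treatment", "1998-01-01")}]
drugs = [{"a": Node(ROOT, "1998-01-01"),
          "b": Node(ROOT + "drugs", "1998-01-01"),
          "d": Node(ROOT + "drugs/side-effects", "1998-01-01")}]
stale_drugs = [{"a": Node(ROOT, "1998-02-01"),
                "b": Node(ROOT + "drugs", "1998-02-01")}]

joined = web_join(diseases, drugs)
```

In this example the home page is the only joinable node, so the joined tuple carries it once, under the variable `x`.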

25
Joined schema
  • [Figure: joined schema; labels: http://www.panacea.org/, x, z, p, q, b,
    d, k, Disease List, symptoms, evaluation, treatment, Drug list,
    Side effects, Uses]

26
Joined Tuple
  • [Figure: joined web tuple; labels: http://www.panacea.org/, x0, z1, q1,
    p2, b1, d1, k1, List of Diseases, AIDS, Symptoms list, Treatment list,
    Evaluation, Elisa Test, Drug list, Indavir, Side effects,
    Side effects of Indavir, Use, Uses of Indavir]

27
Motivation of Pi-web Join
  • Quite often, the web join operation couples
    irrelevant nodes
  • In a complex web query with several web join
    operations, the resultant web table can become
    very large and contain many contaminated nodes
  • Pi-web join addresses this limitation by
    eliminating contaminated nodes
  • It reduces the size of the joined web table

28
Pi-web Join
  • Web join followed by web project
  • The projection conditions are specified by the
    user; they are similar to those of web project
  • We do not eliminate the joinable nodes
  • By retaining the joinable nodes we preserve the
    correlation between the information captured from
    two web tables
  • Pi-web join may result in a web bag
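
The definition above (web join followed by web project, with joinable nodes always retained) can be sketched in a few lines. The sketch assumes web tuples are dictionaries from node variable to URL, ignores modification dates, and assumes disjoint variable names in the two tables; `pi_web_join` and the URL paths are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List

WebTuple = Dict[str, str]   # node variable -> URL (assumed representation)


def pi_web_join(table1: List[WebTuple], table2: List[WebTuple],
                keep_vars: Iterable[str]) -> List[WebTuple]:
    """Web join followed by web project; joinable nodes are never eliminated."""
    result = []
    for t1 in table1:
        for t2 in table2:
            # Map each joinable variable of t1 to its partner in t2.
            joinable = {v1: v2 for v1, u1 in t1.items()
                        for v2, u2 in t2.items() if u1 == u2}
            if not joinable:
                continue
            merged = dict(t1)
            merged.update({v: u for v, u in t2.items()
                           if v not in joinable.values()})
            retained = set(keep_vars) | set(joinable)
            result.append({v: u for v, u in merged.items() if v in retained})
    return result   # a list, so the outcome may be a web bag


ROOT = "http://www.panacea.org/"
diseases = [{"x": ROOT, "z": ROOT + "aids", "p": ROOT + "aids/evaluation",
             "q": ROOT + "aids/treatment"}]
drugs = [{"a": ROOT, "b": ROOT + "drugs", "d": ROOT + "drugs/side-effects",
          "k": ROOT + "drugs/uses"}]

pi_joined = pi_web_join(diseases, drugs, keep_vars={"z", "d"})
```

Here `x` is not listed in `keep_vars` but survives because it is the joinable node, which keeps the disease and drug information connected.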

29
Example 3
  • Produce a list of diseases with their symptoms
    and side effects starting from the web site at
    http://www.panacea.org/

30
Procedure
  • Perform web join on Diseases and Drugs
  • Project node variables b, k, q, p, node variables
    between a and q, node variables between b and k,
    node variables between b and d

31
Pi-joined schema
  • [Figure: pi-joined schema; labels: http://www.panacea.org/, x, z, d,
    Disease List, symptoms, Side effects]

32
Pi-joined Tuple
  • [Figure: pi-joined web tuple; labels: http://www.panacea.org/, x0, z1,
    d1, List of Diseases, AIDS, Symptoms list, Side effects of Indavir]

33
Benefits of Pi-web Join
  • Minimizes the amount of data transmitted over the
    network in distributed web join processing
  • Reduces the storage cost associated with a
    joined web table
  • Reduces the cognitive overhead associated with
    locating relevant nodes
  • Improves the completeness of the schema by removing
    unbound nodes and links

34
Summary
  • Motivation
  • Introduced WHOWEDA
  • Web project and web join
  • Pi-Web Join
  • For more information: www.cais.ntu.edu.sg:8000/whoweda

35
Web Bags
  • Existence of identical web tuples
  • Created by the web project operation
  • Multiplets: each collection of identical web
    tuples
  • Structure-based knowledge discovery
  • Used for discovering (FODO 98): visible nodes,
    luminous nodes, luminous paths
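
As a small illustration of multiplets, identical web tuples in a bag can be grouped and counted. The sketch assumes web tuples are dictionaries from node variable to URL; the function name `multiplets` and the example URLs are hypothetical, and the knowledge-discovery measures named on the slide are not implemented here.

```python
from collections import Counter
from typing import Dict, List

WebTuple = Dict[str, str]   # node variable -> URL (assumed representation)


def multiplets(bag: List[WebTuple]) -> Dict[frozenset, int]:
    """Group identical web tuples and return those occurring more than once,
    keyed by their (variable, URL) pairs, with the number of copies."""
    counts = Counter(frozenset(t.items()) for t in bag)
    return {key: n for key, n in counts.items() if n > 1}


ROOT = "http://www.panacea.org/"
bag = [
    {"x": ROOT, "z": ROOT + "aids"},
    {"x": ROOT, "z": ROOT + "aids"},
    {"x": ROOT, "z": ROOT + "malaria"},
]

found = multiplets(bag)
```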