Title: Web Site Management Based on Declarative Specifications
1 Web Site Management Based on Declarative Specifications
- Alon Levy
- University of Washington
- Joint work with
- Strudel: Dana Florescu (INRIA), Mary Fernandez, Dan Suciu (AT&T), Khaled Yagoub (INRIA)
- Tiramisu: Corin Anderson and Dan Weld (UW)
2 Problem: Building Web Sites
- Building Web sites involves three tasks:
- Selecting and managing the site's content
- Organizing the site's structure (pages and links)
- Designing the graphical presentation of pages.
- In current tools, these tasks are (mostly) interdependent.
- Strudel's key ideas:
- Separate the three tasks.
- Manage content and structure declaratively.
3 Content Management and Graphical Presentation
- Content may be derived from multiple sources:
- Databases: relational, object-oriented
- Semi-structured sources (XML, Word, Excel, BibTeX).
- Classical data integration problem!
- (see TSIMMIS, Garlic, Information Manifold, Tukwila)
- Graphical presentation:
- Need to integrate with tools that create animations, images, Java applets.
- Create sets of similar HTML pages using templates.
4 Web-Site Structure
- The structure includes
- Set of pages and contents of each page, and
- Links between the pages.
5 Current Practice
- Current tools separate only content management from presentation:
- Content managed by database
- Embed queries in HTML templates
- Simple tools to view and modify structure at the extensional level.
- WYSIWYG tools for managing presentation.
- But they still cannot
- explicitly manage the site's global structure, or
- flexibly choose the content-management system.
- As a result it's hard to
- modify the structure of a web-site,
- build multiple versions for different classes of users,
- enforce integrity constraints.
6 Talk Outline
- Problem definition
- Strudel architecture
- Advantages of declarative specifications
- Specifying and verifying integrity constraints.
- Automatic generation of run-time plans for managing data-intensive web sites.
- Tiramisu
- Separating the design tool from the implementation.
- Using a collection of tools to build a site.
7 Strudel Evolution
Strudel (Nov. 96), AT&T
Strudel AT&T Release
Strudel-R (INRIA)
http://www.research.att.com/sw/tools/strudel
Tiramisu (Sept. 98) (U. Washington)
8 Strudel Architecture and System
9 Strudel
- Features
- Integrates content from multiple sources.
- High-level declarative language for managing the site's structure (StruQL).
- Advantages
- Derives multiple sites from the same data.
- Supports easy restructuring and modification.
- Provides a platform for
- Enforcing integrity constraints
- Designing policies for efficient run-time management of sites.
10 Strudel Architecture
11 Data Model
- Strudel is based on a semi-structured data model:
- labeled directed graphs,
- nodes in the graph represent objects,
- labels on arcs represent attribute names,
- named collections.
- Why semi-structured data?
- raw data is often semi-structured (and I don't mean that it's embedded in HTML)
- convenient for data integration (a la TSIMMIS)
- web-sites are ultimately graphs.
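A minimal Python sketch of this data model, for illustration only. The class name `Graph` and its methods are hypothetical and are not part of Strudel; the sample values come from the example articles shown later in the talk.

```python
from collections import defaultdict


class Graph:
    """Labeled directed graph with named collections (illustrative sketch)."""

    def __init__(self):
        self.edges = defaultdict(list)        # node -> [(label, target), ...]
        self.collections = defaultdict(set)   # collection name -> set of nodes

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def add_to_collection(self, name, node):
        self.collections[name].add(node)

    def targets(self, source, label):
        """All values reachable from `source` over one arc labeled `label`."""
        return [t for (l, t) in self.edges[source] if l == label]


g = Graph()
g.add_to_collection("Articles", "article1")
g.add_to_collection("Articles", "article2")
g.add_edge("article1", "Date", "8/1/97")
g.add_edge("article1", "Category", "USA News")
g.add_edge("article1", "Images", "im1.gif")   # an attribute may repeat
g.add_edge("article1", "Images", "im.gif")
g.add_edge("article1", "RelatedArticle", "article2")
g.add_edge("article2", "Date", "8/2/97")
g.add_edge("article2", "Video", "vid1.avi")   # article1 has no Video arc at all
```

Nothing forces two objects in the same collection to carry the same attributes, which is the sense in which the model is semi-structured.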
12 The StruQL Query Language
- A StruQL query is a function from a set of input graphs to an output graph.
- A StruQL expression contains two parts:
- A query component, and
- A restructuring component.
- Formally:
- INPUT: graph names
- WHERE: conjunction of regular path expression atoms
- CREATE: name the nodes in the output graph using Skolem functions
- LINK: specify the links in the resulting graph.
- StruQL evolved into XML-QL (see WWW8 Conference).
13 Example Raw Data
- Article 1
- Date: 8/1/97
- Title: Clinton announces new
- Priority: Headline
- Category: USA News
- Images: im1.gif, im.gif
- Text: President Clinton announced
- Related article: article2
- Article 2
- Date: 8/2/97
- Title: FDA approves new cure for
- Priority: Top Story
- Category: Health
- Video: vid1.avi
- Text: The Federal Drug Administration
14 CNN Web-site Query (part 1)
// Input graph of articles
INPUT CNN-ARTICLES
// Create web page for each article
WHERE Articles(a),
      a -> l -> t,    // note arc variable l
      l in {"Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite"},
      a -> "Category" -> c
CREATE ArticlePage(a)
LINK ArticlePage(a) -> l -> t
{ WHERE a -> "RelatedArticle" -> r
  LINK ArticlePage(a) -> "RelatedArticle" -> ArticlePage(r) }
15 CNN Site Schema
[Figure: site schema graph. Nodes: RootPage(), RootPageEntry(a), CategoryEntry(c), CategoryPage(c), ArticlePage(a), Data(t). Edge conditions: a -> "Priority" -> "headline"; a -> "Category" -> c; a -> l -> t with l in {"Title", "Top-image"}; a -> l -> t with l in {"Title", "Abstract", ...}.]
16 CNN Web-site Query (part 2)
CREATE RootPage
WHERE a -> "Priority" -> "headline",
      l in {"Title", "Date", "Topimage"}
CREATE RootEntry(a)
LINK RootPage -> "HeadlineStory" -> RootEntry(a),
     // Link each headline story to its title, date, top image and full article
     RootEntry(a) -> "FullStory" -> ArticlePage(a),
     RootEntry(a) -> l -> t
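To make the WHERE / CREATE / LINK reading concrete, here is a small Python sketch of what the article-page part of the CNN query (part 1) computes. It is a hand-written illustration, not the Strudel evaluator: the triple encoding, `build_site`, and `article_page` are assumptions made for this example, and the input values are taken from the example raw data.

```python
# Input graph as (source, label, target) triples.
EDGES = [
    ("article1", "Title", "Clinton announces new"),
    ("article1", "Date", "8/1/97"),
    ("article1", "Category", "USA News"),
    ("article1", "RelatedArticle", "article2"),
    ("article2", "Title", "FDA approves new cure for"),
    ("article2", "Date", "8/2/97"),
    ("article2", "Category", "Health"),
    ("article2", "Video", "vid1.avi"),
]
ARTICLES = {"article1", "article2"}   # the named collection Articles
COPIED = {"Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite"}


def article_page(a):
    """Skolem function: the same argument always names the same output node."""
    return ("ArticlePage", a)


def build_site(edges, articles):
    has_category = {s for (s, l, _) in edges if l == "Category"}
    # Outer WHERE: Articles(a), a -> l -> t, l in {...}, a -> "Category" -> c
    outer = [(a, l, t) for (a, l, t) in edges
             if a in articles and a in has_category and l in COPIED]
    nodes = {article_page(a) for (a, _, _) in outer}           # CREATE ArticlePage(a)
    links = {(article_page(a), l, t) for (a, l, t) in outer}   # LINK ArticlePage(a) -> l -> t
    # The nested block inherits the outer bindings and adds a -> "RelatedArticle" -> r.
    bound = {a for (a, _, _) in outer}
    for (a, l, r) in edges:
        if a in bound and l == "RelatedArticle":
            links.add((article_page(a), "RelatedArticle", article_page(r)))
    return nodes, links


nodes, links = build_site(EDGES, ARTICLES)
```

Because `article_page` is deterministic, the link created in the nested block points at the same node that the outer block created for article2; the Video arc is not copied because its label is not in the listed set.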
17 HTML Templates
EMBED , EMBED related-article ORDER=descend KEY=date @a LINK=title
18 CNN Sports Query
INPUT CNN
WHERE TopCategory(c), c -> "CategoryName" -> cn, cn = "Sports",
      c -> "SubTopic" -> top,
      Articles(a), a -> l -> t,
      l in {"Title", "Abstract", "Date", "Text", "Image", "Topimage", "RelatedSite"},
      a -> "Category" -> c, c = top
CREATE ArticlePage(a)
LINK ArticlePage(a) -> l -> t
19 StruQL Details
- Regular path expressions are constructed by a grammar:
- R _
- Atoms in the WHERE clause are of the form X -> R -> Y or C(X).
- The LINK clause includes atoms of the form
- LINK f(X) -> "new link" -> g(X) or
- LINK f(X) -> L -> g(X)
- Queries can be nested, inheriting the WHERE clauses of their outer blocks.
- Note the separation between the querying part and the restructuring part!
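A rough Python sketch of how an atom of the form X -> R -> Y could be evaluated over a graph of triples. The grammar for R is not reproduced on the slide, so the tuple encoding below (single label, any label, sequence, alternation, star) is an assumption chosen for illustration, and `eval_path` is a hypothetical helper rather than StruQL's actual evaluator.

```python
def step(edges, nodes, pred):
    """Nodes reachable from `nodes` over one arc whose label satisfies `pred`."""
    return {t for (s, l, t) in edges if s in nodes and pred(l)}


def eval_path(edges, nodes, expr):
    """Return the set of nodes reachable from `nodes` along paths matching `expr`.

    expr is one of (encoding assumed for this sketch):
      ("label", name) | ("any",) | ("seq", e1, e2) | ("alt", e1, e2) | ("star", e)
    """
    kind = expr[0]
    if kind == "label":
        return step(edges, nodes, lambda l: l == expr[1])
    if kind == "any":
        return step(edges, nodes, lambda l: True)
    if kind == "seq":
        return eval_path(edges, eval_path(edges, nodes, expr[1]), expr[2])
    if kind == "alt":
        return eval_path(edges, nodes, expr[1]) | eval_path(edges, nodes, expr[2])
    if kind == "star":
        reached = set(nodes)              # zero repetitions
        frontier = set(nodes)
        while frontier:                   # fixpoint: stop when nothing new is reached
            frontier = eval_path(edges, frontier, expr[1]) - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown expression kind: {kind}")


SITE = [
    ("root", "Category", "sports"),
    ("sports", "Article", "a1"),
    ("a1", "RelatedArticle", "a2"),
    ("a2", "RelatedArticle", "a1"),       # a cycle; the star case still terminates
]
everything = eval_path(SITE, {"root"}, ("star", ("any",)))
related = eval_path(SITE, {"a1"}, ("star", ("label", "RelatedArticle")))
```

Given a binding for X, the bindings for Y are the members of the returned set.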
20 More on StruQL
- Bare-bones language for semi-structured data: includes the essential features.
- More expressive than Lorel or UnQL (e.g., can reverse graphs).
- Conceptually and in practice, the separation between query component and restructuring component is important.
- Containment is decidable for StruQL-WHERE (Florescu, Levy & Suciu, PODS-98).
21 Advantages of Declarative Specifications
22 Enforcing Integrity Constraints
- We often want to verify some constraints on site structure:
- all articles from the last two days are reachable from the root
- all paths to confidential data must go through an authentication node.
- Good site design principles are summarized as integrity constraints (Lohse, CACM, 98).
- When site specs are long, constraints are hard to enforce.
- Want to verify constraints intensionally.
23 Intensional IC Verification
- Formally, we want to check whether
- $S(D) \models IC$
- for any instance D of the underlying data, where
- S is the site specification (e.g., StruQL query)
- IC is a formula describing the constraint, e.g.
- $\forall a.\ \mathit{Article}(a) \wedge \mathit{date}(a) \geq \mathit{today} - 2 \Rightarrow \mathit{Root} \rightarrow^{*} \mathit{ArticlePage}(a)$
- Results
- Sound and complete algorithms for verification of a class of integrity constraints (path constraints).
- Algorithms will also propose corrections when ICs are violated.
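For contrast, the sketch below checks the reachability constraint on one concrete, already generated site graph. This is the extensional check only: it says nothing about other instances D, which is exactly what the verification algorithms described on this slide add by reasoning over the specification. The names `reachable` and `violations`, the triple encoding, and the sample dates are assumptions made for this example.

```python
from collections import deque
from datetime import date, timedelta


def reachable(links, root):
    """Breadth-first search over (source, label, target) links, ignoring labels."""
    succ = {}
    for (s, _, t) in links:
        succ.setdefault(s, []).append(t)
    seen, queue = {root}, deque([root])
    while queue:
        for t in succ.get(queue.popleft(), []):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def violations(links, article_dates, today, root="Root"):
    """Articles dated within the last two days whose page is not reachable from root."""
    pages = reachable(links, root)
    cutoff = today - timedelta(days=2)
    return sorted(a for a, d in article_dates.items()
                  if d >= cutoff and ("ArticlePage", a) not in pages)


today = date(1997, 8, 3)
dates = {
    "article1": date(1997, 8, 1),
    "article2": date(1997, 8, 2),
    "old": date(1997, 7, 1),          # too old for the constraint to apply
}
links = [
    ("Root", "HeadlineStory", ("RootEntry", "article1")),
    (("RootEntry", "article1"), "FullStory", ("ArticlePage", "article1")),
]
broken = violations(links, dates, today)
```

Here `broken` lists article2: it is recent, but no path from the root leads to its page.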
24 Run-time Management of Sites
- When do we compute web pages?
- Static approach: completely precompute the site
- Doesn't work for large sites, forms; hard to update.
- Dynamic approach: compute pages on request
- Users may wait, a lot of repeated computation, structure of the site is not exploited.
- Current tools use one of the extremes, or specify policy per collection of pages.
- The specification is implicit in code.
- Our goal: use the site specification to automatically find an optimal strategy.
25 Possible Run-time Optimizations
- View materialization
- Function caching
- when web sites represent hierarchically structured data, successive queries in the site differ only in their projected attributes.
- Simplification under preconditions
- previous queries on the path may have already verified some conditions for the current query.
- Lookahead computation
- often it is possible with little cost to compute the data necessary for subsequent pages.
26 Problem Statement
- Given
- site specification
- knowledge about browsing patterns
- cost function
- Produce
- Operational plan
- operational schema: a set of queries to compute on a given page request.
- Results (in the Strudel-R framework)
- Performance study of the optimizations.
- Algorithm for generating operational plans.
- Identification of many open problems.
27 Strudel Experience -- Tiramisu
28 Experiences with Strudel (except for the lousy GUI)
- Integrating data from multiple sources when building a Web site is a prime concern. Sources are semi-structured!
- Declarative specification of site structure is very important because
- site creation is a highly iterative process
- site owners often need redesign after experience from deployment
- we often generate multiple versions of sites from the same data.
- Design of web-sites is done in a top-down fashion.
- Strudel can't be the all-encompassing web-site management tool.
29 Tiramisu: the Second Generation
- Strudel and its siblings (Araneus, YAT, WebOQL, WIRM) force the design and implementation of the site to be done in the same tool.
- Furthermore, there will always be tools that are specialized for specific tasks.
- Tiramisu
- Separate design phase from implementation.
- Allow the implementation to be done by a set of cooperating tools.
30 Tiramisu Architecture
[Figure: architecture diagram. Components shown: three data sources, each behind a wrapper; a mediator; an E/R style diagram of the site (site schema); an implementation manager; tools (ASP, FrontPage, Strudel); the resulting web site.]
31 Screenshot of a TERD
32 Conclusions
- Web-site management is an important area for database research.
- First-generation systems (Strudel, Araneus, YAT, WebOQL) offer important advantages:
- Easy modification, creation of multiple versions
- enforcing constraints, run-time management
- Second generation (Tiramisu)
- Emphasize design phase of site
- Implement with a collection of cooperating tools.