Title: Issues in Monitoring Web Data
- Serge Abiteboul
- INRIA and Xyleme
- Serge.Abiteboul_at_inria.fr
Organization
- Introduction
- What is there to monitor?
- Why monitor?
- Some applications of web monitoring
- Web archiving
- An experience: the archiving of the French web
- Page importance and change frequency
- Creation of a warehouse using web resources
- An experience: the Xyleme project
- Monitoring in Xyleme
- Queries and monitoring
- Conclusion
1. Introduction
The Web Today
- Billions of pages, millions of servers
- Query: keywords to retrieve URLs
- Imprecise: query results are useless for further processing
- Applications based on ad-hoc wrapping
- Expensive, incomplete, short-lived, not adapted to the Web's constant changes
- Poor quality
- Cannot be trusted (spamming, rumors)
- Often stale
- Our vision of it is often out-of-date
- Importance of monitoring
The HTML Web Structure
[Figure. Source: IBM, AltaVista, Compaq]
HTML: Percentage Covered by Crawlers
[Figure. Source: searchenginewatch.com]
So much for the world's knowledge
- Most of the web is not reached by crawlers (hidden web)
- Some of the public HTML pages are never read
- Most of what is on the web is junk anyway
- Our knowledge of it may be stale
- Do not junk the technology; improve it!
What is there to monitor?
- Documents: HTML, but also doc, pdf, ps
- Many data exchange formats, such as ASN.1 and BibTeX
- The new official data exchange format: XML
- Hidden web: database queries behind forms or scripts
- Multimedia data (ignored here)
- Public vs. private (Intranet, or Internet with password)
- Static vs. dynamic
What is changing?
- XML is coming
- Universal data exchange format
- Marriage of the document and database worlds
- Standard query language: XQuery
- Growing quickly on Intranets and very slowly on the public web (less than 1%)
- Web services are coming
- Format for exporting services
- Format for encapsulating queries
- More semantics to be expected
- RDF for data
- WSDL/UDDI for services
What is not changing fast, or even getting worse
- Massive quantity of data, most of it junk
- Lots of stale data
- Very primitive HTML query mechanisms (keywords)
- No real change control mechanism coming soon
- Compare database queries (fresh data) with web search engines (possibly stale)
- Compare database triggers (based on push) with web notification services (most of the time based on pull/refresh)
The need to monitor the web
- The web changes all the time
- Users are often as interested in changes as in the data: new products, new press articles, new prices
- Discover new resources
- Keep our vision of the web up-to-date
- Be aware of changes that may be of interest or have an impact on our business
Analogy: databases
- Databases
- Query: instantaneous vision of the data
- Trigger: alert/notification of some changes of interest
- Web
- Query: needs monitoring to give correct answers
- Monitoring to support alerts/notifications of changes of interest
Web vs. database monitoring
- Quantity of data: larger on the web
- Knowledge of data
- Structure and semantics are known in databases
- Reliability and availability
- High in databases, null on the web
- Data granularity
- Tuple vs. page in HTML, or element in XML
- Change control
- Databases: support from data sources/triggers
- Web: no support; pull only, in general
2. Some applications of web monitoring
Comparative shopping
- Unique entry point to many catalogs
- Data integration problem
- Main issue: wrapping of web catalogs
- Semi-automatic, so limited to a few sites
- Simpler, and closer to automatic, with XML
- Alternatives
- Mediation when data change very fast, e.g., prices and availability of plane tickets
- Warehousing otherwise → need to monitor changes
Web surveillance
- Applications
- Anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisitions of chemical products
- Business intelligence, e.g., discovering potential customers, partners, competitors
- Find the data (crawl the web)
- Monitor the changes
- New pages, deleted pages, changes within a page
- Classify information and extract data of interest
- Data mining, text understanding, knowledge representation and extraction, linguistics: very much AI
Copy tracking
- Example: a press agency wants to check that people are not publishing copies of its wires without paying (a sketch of the detection step follows)
[Figure: copy-detection pipeline. A flow of candidate documents, obtained from search-engine queries or a specific crawl, is pre-filtered, sliced, and run through a filter for detection]
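A minimal sketch of the slicing-and-detection step, using overlapping word windows (shingles) as slices; the window size and overlap threshold are assumptions for illustration, not the agency's actual pipeline:

def shingles(text, k=8):
    # slice a document into overlapping k-word windows
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def looks_like_copy(wire, candidate, threshold=0.5):
    # flag a candidate page whose slices overlap heavily with the wire
    a, b = shingles(wire), shingles(candidate)
    return len(a & b) / max(1, len(a)) >= threshold

wire = "the minister announced a new plan for the national archives today"
page = "breaking: the minister announced a new plan for the national archives"
print(looks_like_copy(wire, page))   # True: most slices are shared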
Web archiving
- We will discuss an experience in archiving the French web
Creation of a data warehouse with resources found on the web
- We will discuss some work in the Xyleme project on the construction of XML warehouses
3. Web archiving
- An experience towards archiving the French web, with the Bibliothèque Nationale de France
Dépôt légal (legal deposit)
- Books have been archived since 1537, by decision of King François I
- The Web is an important and valuable source of information that should also be archived
- What is different?
- Number of content providers: 148,000 sites vs. 5,000 publishers
- Quantity of information: millions of pages, plus video/audio
- Quality of information: lots of junk
- Relationship with publishers: freedom of publication vs. the traditional push model
- Updates and changes occur continuously
- The perimeter is unclear: what is the French web?
Goal and Scope
- Provide future generations with a representative archive of the cultural production
- Provide material for cultural, political, and sociological studies
- The mission is to archive a wide range of material, because nobody knows what will be of interest for future research
- In traditional publication, publishers filter content; there is no such filter on the web
Similar Projects
- The Internet Archive (www.archive.org)
- The Wayback Machine
- Largest collection of versions of web pages
- Human-selection-based approach
- Select a few hundred sites and choose an archiving periodicity
- Australia and Canada
- The Nordic experience
- Use robot crawlers to archive a significant part of the surface web
- Sweden, Finland, Norway
- Problems encountered
- Lack of updates of archived pages between two snapshots
- The hidden Web
Orientation of our experiment
- Goals
- Cover a large portion of the French web
- Automatic content gathering is necessary
- Adapt robots to provide a continuous archiving facility
- Have frequent versions of the sites, at least for the most important ones
- Issues
- The notion of important sites
- Building a coherent Web archive
- Discovering and managing important sources of the deep Web
First issue: the perimeter
- The perimeter of the French Web: contents edited in France
- Many criteria may be used
- The French language; but many French sites use English (e.g., INRIA), and many French-speaking sites are from other French-speaking countries or regions (e.g., Quebec)
- Domain name or resource locators: .fr sites; but many French sites are also in .com or .org
- Address of the site: physical location of the web servers, or address of the owner
- Other criteria than the perimeter
- Little interest in commercial sites
- Possible interest in foreign sites that discuss French issues
- Purely automatic selection does not work → involve librarians
Second issue: site vs. page archiving
- The Web
- Physical granularity: HTML pages
- The problem is inconsistent data and links
- Read page P; one week later, the pages pointed to by P may not exist anymore
- Logical granularity?
- Snapshot view of a web site
- What is a site?
- INRIA is www.inria.fr, www-rocq.inria.fr, ...
- www.multimania.com is the provider of many sites
- There are technical issues (rapid firing, ...)
Importance of data
What is page importance?
- The Louvre homepage is more important than an unknown person's homepage
- Important pages are pointed to by
- other important pages
- many unimportant pages
- This leads to Google's definition of PageRank
- Based on the link structure of the web
- Used with remarkable success by Google for ranking results
- Useful, but not sufficient for web archiving
Page Importance
- Importance
- Link matrix L
- In short, page importance is the fixpoint X of the equation L·X = X (see the sketch below)
- Storing the link matrix and computing page importance uses lots of resources
- We developed a new, efficient technique to compute the fixpoint
- without having to store the link matrix
- The technique adapts automatically to changes
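As an illustration, here is a minimal sketch of the classical fixpoint computation by power iteration (not the matrix-free, adaptive algorithm developed in this work); the damping factor and the toy graph are assumptions:

import numpy as np

def page_importance(links, n, iters=50, damping=0.85):
    # power iteration for the fixpoint X = L.X; `links` maps a page id
    # to the list of page ids it points to
    x = np.full(n, 1.0 / n)                      # uniform initial importance
    for _ in range(iters):
        nxt = np.zeros(n)
        for page, outs in links.items():
            for q in outs:                       # spread importance over out-links
                nxt[q] += x[page] / len(outs)
        x = (1 - damping) / n + damping * nxt    # damping handles sinks
    return x

# toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0
print(page_importance({0: [1, 2], 1: [2], 2: [0]}, 3))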
Site vs. pages
- Limitations of page importance
- Google page importance works well when links carry strong semantics
- More and more web pages are automatically generated, and most of their links have little semantics
- A further limitation
- Refresh at the page level presents drawbacks
- So we also use the link topology between sites, and not only between pages
Experiments
- Crawl
- We used between 2 and 8 PCs running Xyleme crawlers for 2 months
- Discovery and refresh based on page importance
- Discovery
- We looked at more than 1.5 billion (of the most interesting) web pages
- We discovered more than 15 million .fr pages, about 1.5% of the web
- We discovered 150,000 .fr sites
- Refresh
- Important pages were refreshed more often
- The change rate of pages is also taken into account
- Analysis of the relevance of site importance for librarians
- Comparison with rankings by librarians
- Strong correlation with their rankings
Issues and ongoing work: other criteria for importance
- Take into account indications by archivists
- They know best -- a man-machine-interface issue
- Use classification and clustering techniques to refine the notion of site
- Frequent use of infrequent words
- Find pages dedicated to specific topics
- Text weight
- Find pages with text content (vs. raw data pages)
- Others
4. Creation of a Warehouse from Web data
Xyleme in short
- The Xyleme project
- Initiated at INRIA
- Joint work with researchers from the Orsay, Mannheim and CNAM-Paris universities
- The Xyleme company (www.xyleme.com)
- Started in 2000
- About 30 people
- Mission: deliver a new generation of content technologies to unlock the potential of XML
- Here we focus on the Xyleme project
Goal of the Xyleme project
- Focus is on XML data (but HTML is also handled)
- Semantics
- Understand tags, partition the Web into semantic domains, provide a simple view of each domain
- Dynamicity
- Find and monitor relevant data on the web
- Control relevant changes in Web data
- XML storage, indexing and queries
- Manage millions of XML documents efficiently and process millions of simultaneous queries
Corporate information environment with Xyleme
[Figure: the Xyleme server (XML repository and query engine) crawls and interprets web data with systematic updating; the corporate information system publishes to it and issues searches and queries]
XML in short
- Data exchange format
- eXtensible Markup Language (child of SGML)
- Promoted by the W3C and major industry players
- XML document: ordered labeled tree
- Other essential gadgets: Unicode, namespaces, attributes, pointers, typing (XML Schema)
XML magic in short
- Presentation is given elsewhere (style sheet)
- Semantics and structure are provided by labels
- So it is easy to extract information
- Universal format understood by more and more software (e.g., exported by most databases, read by more and more editors)
- More and more tools available
It is easy to extract information
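A minimal sketch of label-based extraction on a hypothetical product catalog (the tag names are assumptions): because the labels carry the semantics, pulling out fields takes a line or two.

import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalogue>"
    "<product><name>Camera</name><price>299</price></product>"
    "<product><name>Flash</name><price>49</price></product>"
    "</catalogue>")

# extract (name, price) pairs: the labels tell us where the data is
for p in catalog.findall("product"):
    print(p.findtext("name"), p.findtext("price"))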
4.1 Xyleme: Functionality and architecture
The goal of the Xyleme project: a dynamic XML data warehouse
- Many research issues
- Query processor
- Semantic classification
- Data monitoring
- Native storage
- XML document versioning
- Automatic or user-driven XML acquisition
- Graphical user interface through the Web
Functional Architecture
[Figure: functional architecture, with the query processor on top of the repository and index manager]
Architecture
[Figure: physical architecture of the system, connected to the Internet]
Prototype: main choices
- Network of Linux PCs
- C on the server side
- CORBA for communications between PCs
- HTTP/SOAP for external communications
- Exception made for query processing
Scaling
- Parallelism based on
- Partitioning
- XML documents
- URL table
- Indexes (semantic partitioning)
- Memory replication
- Autonomous machines (PCs)
- Caches are used for data flow
4.2 Xyleme: Data Acquisition
Data Acquisition
- The Xyleme crawler visits the HTML/XML web
- Management of metadata on pages
- Sophisticated strategy to optimize network bandwidth, based on (see the sketch after this list)
- importance ranking of pages
- change frequency and age of pages
- publications (owners) and subscriptions (users)
- Each crawler visits about 4 million pages per day
- Each indexer may index 1 million pages per day
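A minimal sketch of such a bandwidth allocation, scoring pages by importance, change rate and staleness; the field names and the multiplicative score are assumptions, and the actual strategy is more sophisticated:

import heapq, time

def refresh_priority(page):
    # important pages that change often and have not been read
    # for a while should be fetched first
    age = time.time() - page["last_crawl"]
    return page["importance"] * page["change_rate"] * age

def next_batch(pages, k):
    # pick the k most urgent pages for the crawler
    return heapq.nlargest(k, pages, key=refresh_priority)

pages = [
    {"url": "http://a.example", "importance": 0.9, "change_rate": 0.5, "last_crawl": 0},
    {"url": "http://b.example", "importance": 0.1, "change_rate": 0.9, "last_crawl": 0},
]
print([p["url"] for p in next_batch(pages, 1)])   # the important page wins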
4.3 Xyleme: Change Control
Change Management
- The Web changes all the time
- Data acquisition
- automatic and via publication
- Monitoring
- subscriptions
- continuous queries
- versions
Subscription
- Users can subscribe to certain events, e.g.,
- changes in all pages of a certain DTD or of a certain semantic domain
- insertion of a new product in a particular catalog, or in all catalogs with a particular DTD
- They may request to be notified
- at the time the event is detected by Xyleme
- regularly, e.g., once a week
Continuous Queries
- Queries asked regularly, or when some events are detected (see the sketch below)
- "send me each Monday the list of movies in Pariscope"
- "send me each Monday the list of new movies in Pariscope"
- "each time you detect that a new member has been added to the Stanford DB group, send me their list of publications from their homepage"
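A minimal sketch of one evaluation step of the "new movies" style of continuous query, assuming the query returns a set of items; only the results not present in the previous answer are notified:

def new_results(run_query, previous):
    # re-run the query and report only results not seen last time
    current = set(run_query())
    return current - previous, current

# toy weekly listings (hypothetical Pariscope-like data)
weeks = [{"Amelie", "Vertigo"}, {"Amelie", "Vertigo", "Ran"}]
seen = set()
for week in weeks:
    fresh, seen = new_results(lambda: week, seen)
    print("notify:", fresh)    # the second week notifies only {'Ran'}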
Versions and Deltas
- Store snapshots of documents
- For some documents, store the changes (deltas)
- storage: last version + sequence of deltas
- complete deltas: reconstruct old versions (see the sketch below)
- partial deltas: allow changes to be sent to the user, and allow refresh
- Deltas are XML documents
- so changes can be queried like standard data
- Temporal queries
- "list the products that have been introduced in this catalog since January 1st, 2002"
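A minimal sketch of reconstructing an old version from the last version plus complete (backward) deltas; here a delta is just a list of (element_id, old_value) pairs, a simplistic stand-in for the real XML deltas:

def reconstruct(last_version, deltas, k):
    # rebuild the version k steps in the past by applying backward
    # deltas in reverse chronological order
    doc = dict(last_version)
    for delta in deltas[:k]:                # deltas[0] undoes the latest change
        for element_id, old_value in delta:
            if old_value is None:
                doc.pop(element_id, None)   # element did not exist before
            else:
                doc[element_id] = old_value
    return doc

latest = {"e1": "price=49", "e2": "name=Flash"}
deltas = [[("e1", "price=45")], [("e2", None)]]
print(reconstruct(latest, deltas, 1))       # one version back: e1 was price=45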
The Information Factory
[Figure: loaders fetch documents from the Web; change detection produces documents and deltas that are stored in the repository over time; the subscription processor evaluates continuous queries and sends notifications; version queries are answered with results from the repository]
Results
- Very efficient XML diff algorithm
- computes the difference between consecutive versions
- Representation of deltas based on an original naming scheme for XML elements
- an element is assigned a unique identifier for its entire life
- compact way of representing these IDs
- Efficient versioning mechanism
Results
- Sophisticated monitoring algorithm
- Detection of simple patterns (conjunctions) at the document level
- Detection of changes between consecutive versions of the same document
- Scales to dozens of crawlers loading millions of documents per day for a single monitor
Issues: languages for monitoring
- In the spirit of temporal languages for relational databases
- But
- the data model is richer (trees vs. tables)
- the context is richer: versions, continuous queries, monitoring of data streams
4.4 Xyleme: Semantic Data Integration
Data Integration
- One application domain -- several schemas
- heterogeneous vocabulary and structure
- Xyleme semantic integration:
- gives the illusion that the system maintains a homogeneous database for the domain
- abstracts a set of DTDs into a hierarchy of pertinent terms for a particular domain (business, culture, tourism, biology, ...)
Technology in short
- Cluster DTDs into application domains
- For an application domain, semi-automatically:
- Organize tags into a hierarchy of concepts, using thesauri such as WordNet and other linguistic tools (a toy sketch follows)
- This provides the abstract DTD for the particular domain
- Generate mappings between the concrete DTDs and the abstract one
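Here is the toy sketch referred to above: tags are normalized through a small thesaurus (an assumed stand-in for WordNet and the linguistic tools) so that concrete paths group under a common abstract path:

# assumed tag-to-concept table
THESAURUS = {"cost": "price", "camera": "product", "info": "description"}

def map_to_abstract(concrete_paths):
    # group concrete DTD paths under their abstract counterpart
    mappings = {}
    for path in concrete_paths:
        abstract = "/".join(THESAURUS.get(t, t) for t in path.split("/"))
        mappings.setdefault(abstract, []).append(path)
    return mappings

print(map_to_abstract(["product/cost", "camera/price", "camera/info"]))
# {'product/price': ['product/cost', 'camera/price'],
#  'product/description': ['camera/info']}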
4.5 Xyleme: Query Processing
Xyleme Query Language
- A mix of OQL and XQL; it will use the W3C standard when there is one.

Select product/name, product/price
From   doc in catalogue, product in doc/product
Where  product//components contains "flash"
   and product/description contains "camera"
Principle of Querying
- A query on the abstract DTD is rewritten, using the mappings between the concrete and abstract DTDs, into a union of concrete queries (possibly with joins), e.g. (sketched below):

catalogue/product/price → d1//camera/price | d2/product/cost
catalogue/product/description → d1//camera/description | d2/product/info, ref → d2/description
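A minimal sketch of this rewriting, with the mapping table hard-coded from the example above (the join through ref is omitted for simplicity):

# assumed mapping tables, one per concrete DTD
MAPPINGS = {
    "d1": {"catalogue/product/price": "d1//camera/price",
           "catalogue/product/description": "d1//camera/description"},
    "d2": {"catalogue/product/price": "d2/product/cost",
           "catalogue/product/description": "d2/product/info"},
}

def rewrite(abstract_paths):
    # translate an abstract query into one concrete query per source;
    # the final answer is the union of their results
    return {src: [m[p] for p in abstract_paths]
            for src, m in MAPPINGS.items()
            if all(p in m for p in abstract_paths)}

print(rewrite(["catalogue/product/price", "catalogue/product/description"]))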
Query Processing
- Partial translation, from abstract to concrete, to identify the machines with relevant data
- Algebraic rewriting; a linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
- Decomposition into local physical subplans, and installation
- Execution of the plans
- If needed, relaxation
Query processing
- Essential use of a smart index combining full-text and structure
4.6 Xyleme: Repository
Storage System
- The Xyleme store
- efficient storage of trees in variable-length records within fixed-length pages
- balancing of tree branches in case of overflow
- minimizes the number of I/Os for direct access and scanning
- good compromise between compaction and access time
Tree Balancing in Xyleme Store
[Figure: when a node acquires more children and its record overflows, the subtree is split across records (Record 1 through Record 4)]
5. Conclusion
Web monitoring
- A very challenging problem
- Complexity due to the volume of data and the number of users
- Complexity due to heterogeneity
- Complexity due to the lack of cooperation from data sources
- Many issues to investigate
New directions
- Active web sites
- Friendly sites willing to cooperate
- Web services provide the infrastructure
- Support for triggers
- Mobile data
- Web sites on mobile devices
- Issues of availability (device unplugged)
- Issues of synchronization
- Geography-dependent queries