Title: WEB MINING AND APPLICATIONS
1 WEB MINING AND APPLICATIONS
- Pallavi Tripathi 105956127
- Vaishali Kshatriya 105951122
- Mehru Anand 106113525
- Minnie Virk 106113516
2 REFERENCES
- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Presentation Slides of Prof. Anita Wasilewska
- http://www.cs.rpi.edu/youssefi/research/VWM/
- http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
- http://www.galeas.de/webimining.html
- http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps
3 CITATIONS
- Amir H. Youssefi, David J. Duke, Mohammed J. Zaki, Ephraim P. Glinert, "Visual Web Mining," 13th International World Wide Web Conference (poster proceedings), New York, NY, May 2004.
- Amir H. Youssefi, David Duke, Ephraim P. Glinert, and Mohammed J. Zaki, "Toward Visual Web Mining," 3rd International Workshop on Visual Data Mining (with ICDM'03), Melbourne, FL, November 2003.
4
- With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge.
http://www.galeas.de/webimining.html
5 WHAT IS WEB MINING?
- Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
6 AREAS OF CLASSIFICATION
- WEB CONTENT MINING is the process of extracting knowledge from the content of documents or their descriptions.
- WEB STRUCTURE MINING is the process of inferring knowledge from the World Wide Web organization and the links between references and referents on the Web.
- WEB USAGE MINING, also known as WEB LOG MINING, is the process of extracting interesting patterns from web access logs.
- In addition to these three web mining types, there are other helpful approaches for web knowledge discovery, such as information visualization, which helps us to understand the complex relationships and structures behind many search results.
http://www.galeas.de/webimining.html
7 TOPICS COVERED
- In today's presentation we will cover the following algorithms, related to various aspects of Web Mining:
- SPADE Algorithm and its applications in Visual Web Mining
- Sentiment Classification
- Community Trawling Algorithm
8 VISUAL WEB MINING
- The application of information visualization techniques to the results of Web Mining, in order to amplify the perception of extracted patterns and to visually explore new ones in the web domain.
- The application domain is Web Usage Mining and Web Content Mining.
http://www.cs.rpi.edu/youssefi/research/VWM/
9 APPROACH USED
- Make personalized results for targeted web surfers
- Use data mining algorithms to extract new insights and measures
- Employ a database server and a relational query language as a means to submit specific queries against the data
- Utilize visualization to obtain an overall picture
http://www.cs.rpi.edu/youssefi/research/VWM/
10 SPADE OVERVIEW
- Proposed by Mohammed J. Zaki
- Sequential PAttern Discovery using Equivalence classes
- An Apriori-based algorithm for the fast discovery of frequent sequences
- Needs only three database scans to extract sequential patterns
- Given: a database of customer transactions, each with the following characteristics: a sequence-id or customer-id, a transaction-time, and the item involved in the transaction
- The aim is to obtain typical behaviors from the user's viewpoint
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
11 DEFINITIONS
- Item: can be considered as the object bought by a customer, the page requested by the user of a website, etc.
- Itemset: the set of items that are grouped by timestamp.
- Data Sequence: the sequence of itemsets associated with a customer.
- Sequential Mining: discovering frequent sequences over time of attribute sets in large databases.
- Frequent Sequential Pattern: a sequence whose statistical significance in the database is above a user-specified threshold.
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
12 SPADE ALGORITHM
- The first scan finds the frequent items
- The second scan aims at finding the frequent sequences of length 2
- The last scan associates with each frequent sequence of length 2 a table of the corresponding sequence ids and itemset ids in the database
- Based on this in-memory representation, the support of a candidate sequence of length k is obtained by join operations on the tables of the frequent sequences of length (k-1) able to generate this candidate
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
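The id-list join described above can be sketched as follows. This is a minimal illustration, not Zaki's implementation: the items, timestamps, and helper names are made up, and only the simple sequence join (Y strictly after X) is shown.

```python
# An id-list maps an item to the (sequence_id, timestamp) pairs where it occurs.
id_list = {
    "A": [(1, 10), (1, 30), (2, 20)],
    "B": [(1, 20), (2, 30), (3, 15)],
}

def temporal_join(list_x, list_y):
    """Id-list for the 2-sequence X -> Y: keep (sid, t_y) pairs where
    Y occurs after X within the same sequence."""
    result = []
    for sid_x, t_x in list_x:
        for sid_y, t_y in list_y:
            if sid_x == sid_y and t_y > t_x and (sid_y, t_y) not in result:
                result.append((sid_y, t_y))
    return result

def support(joined):
    """Support = number of distinct sequences containing the pattern."""
    return len({sid for sid, _ in joined})

ab = temporal_join(id_list["A"], id_list["B"])
# A -> B occurs in sequence 1 (A@10 before B@20) and sequence 2 (A@20 before B@30)
print(ab, support(ab))
```

Longer candidates are counted the same way: join the id-lists of their length-(k-1) generators and count the surviving distinct sequence ids.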
13 Data Sequences of 4 Customers
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
14 AN EXAMPLE
- With a minimum support of 50%, a sequential pattern is considered frequent if it occurs in the data sequences of at least 2 customers (2/4).
- In this case a maximal sequential pattern mining process will find three patterns:
- S1: (Camera, DVD) (DVD-R, DVD-Rec)
- S2: (DVD-R, DVD-Rec) (Videosoft)
- S3: (Memory Card) (USB)
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
15 Determining Support
(Figure: suffix join on id-lists vs. the original id-list database)
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
16 ADVANTAGES
- Uses simple join operations on id tables
- No complicated hash-tree structures
- No overhead of generating and searching subsequences
- Cuts down on I/O operations by limiting itself to three database scans
http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps
17
- The Visual Web Mining framework provides a prototype implementation for applying information visualization techniques to these results.
http://www.cs.rpi.edu/youssefi/research/VWM/
18 SYSTEM ARCHITECTURE
http://www.cs.rpi.edu/youssefi/research/VWM
19
- A robot (webbot) is used to retrieve the pages of the website
- Web server log files are downloaded and processed
- The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming, and integrating data, loading it into a database, and later generating graphs in XGML
http://www.cs.rpi.edu/youssefi/research/VWM
20
- We extract user sessions from the web logs; each session groups the requests roughly related to a specific user
- The user sessions are converted into a format suitable for sequence mining
- The outputs are frequent contiguous sequences with a given minimum support
- These are imported into a database
- Different queries are executed against this data
http://www.cs.rpi.edu/youssefi/research/VWM
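The session-extraction step above can be sketched as follows. This is an illustrative reconstruction under common assumptions, not the framework's actual code: the 30-minute inactivity timeout and the (user, timestamp, url) log layout are assumptions.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # assumed inactivity timeout: 30 minutes, in seconds

def sessionize(log_entries):
    """log_entries: iterable of (user_id, timestamp_seconds, url).
    Returns {user_id: [session, ...]}; each session is a list of urls."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(log_entries, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, url))
    sessions = defaultdict(list)
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > TIMEOUT:  # long gap -> start a new session
                sessions[user].append(current)
                current = []
            current.append(url)
        sessions[user].append(current)
    return dict(sessions)

log = [
    ("u1", 0, "/home"), ("u1", 100, "/products"),
    ("u1", 4000, "/home"),  # gap > 30 min: new session
    ("u2", 50, "/about"),
]
print(sessionize(log))
```

Each resulting session is then an ordered page sequence, which is exactly the input format a sequence miner such as SPADE expects.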
21 APPLICATIONS
- Designing different visualization diagrams and exploring frequent patterns of user access on a website
- Classification of web pages into two classes, "hot" and "cold," attracting high and low numbers of visitors
- A webmaster can make exploratory changes to the website structure and analyze the resulting change in real-world user access patterns
http://www.cs.rpi.edu/youssefi/research/VWM/
22 Sentiment Classification
- Vaishali Kshatriya
- 105951122
23 References
- "The Sentimental Factor: Improving Review Classification via Human-Provided Information." Philip Beineke, Shivakumar Vaithyanathan and Trevor Hastie
- "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (July 2002)
- http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
- http://www.cse.iitb.ac.in/cs621/seminar/SentimentDetection.ppt
- Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
24 Sentiment Classification
- The task of labeling a review document according to the polarity of its prevailing opinion.
25 Online Shopping
26 Topical vs. Sentiment Classification
- Topical Classification
- Classifying documents into various subjects, for example Mathematics, Sports, etc.
- Compares individual words (unigrams) across subject areas (bag-of-words approach). Example: score, referee, football -> Sports
- Sentiment Classification
- Classifying documents according to the overall sentiment: positive vs. negative, e.g. like vs. dislike, recommended vs. not recommended
- More difficult than traditional topical classification; may need more linguistic processing, e.g. "you will be disappointed" and "it is not satisfactory"
http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
27 Challenges
- Dependence on the context of the document: "unpredictable plot" vs. "unpredictable performance"
- Negations have to be captured
- "The movie was not that bad."
- "The pictures taken by the cell phone are not of the best quality."
- Subtle expressions
- "How can someone sit through the entire movie?"
http://www.cse.iitb.ac.in/cs621/seminar/SentimentDetection.ppt
28 Unsupervised Review Classification (Turney, ACL-02)
- Input: a written review
- Output: a classification (i.e., positive or negative)
- Step 1: use a part-of-speech tagger to identify phrases
- Step 2: estimate the semantic orientation of the extracted phrases
- Step 3: assign the given review to a class (either recommended or not recommended)
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
29 Step 1: Extract the Phrases
- A part-of-speech tagger is applied to the review
- Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table, where JJ = adjective and NN = noun
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
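The tag-pattern extraction can be sketched as below. The pattern table in the original slide is an image; the sketch shows only two representative patterns (adjective+noun, adverb+adjective) out of the five rows in Turney's table, and assumes the text is already POS-tagged. The example sentence and tags are illustrative.

```python
# Representative two-word tag patterns (subset of Turney's table).
PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("RB", "JJ")}

def extract_phrases(tagged):
    """tagged: list of (word, tag) pairs. Returns two-word candidate phrases
    whose consecutive tags match one of the patterns."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in PATTERNS:
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("very", "RB"), ("sharp", "JJ"), ("pictures", "NNS"),
          ("online", "JJ"), ("experience", "NN")]
print(extract_phrases(tagged))
```

Turney's full table additionally constrains the tag of the third word for some rows; that refinement is omitted here for brevity.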
30 Step 2: Estimate the Semantic Orientation
- Uses PMI-IR (Pointwise Mutual Information and Information Retrieval)
- The PMI between two words, word1 and word2, is defined as PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )
- The Semantic Orientation (SO) of a phrase is calculated as SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
- SO is positive when the phrase is more strongly associated with "excellent" and negative when it is more strongly associated with "poor"
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
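Estimating SO from search-engine hit counts can be sketched as follows. The hit counts below are made-up illustrative numbers, not real AltaVista results; the 0.01 smoothing term follows Turney's paper.

```python
import math

def so(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor):
    """SO(phrase) ~ log2( hits(phrase NEAR 'excellent') * hits('poor')
                        / (hits(phrase NEAR 'poor') * hits('excellent')) )."""
    eps = 0.01  # smoothing to avoid division by zero for zero-hit phrases
    return math.log2(
        ((hits_near_excellent + eps) * (hits_poor + eps))
        / ((hits_near_poor + eps) * (hits_excellent + eps))
    )

# A phrase co-occurring far more often with "excellent" than with "poor"
# gets a positive orientation (illustrative counts).
score = so(hits_near_excellent=800, hits_near_poor=50,
           hits_excellent=10000, hits_poor=10000)
print(score > 0)
```

The ratio form follows from expanding the two PMI terms and cancelling the shared hits(phrase) factor, so only four queries per phrase are needed.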
31 Step 2 (contd.)
- PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents)
- The experiment uses AltaVista
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
32 Step 3: Assign a Class
- Calculate the average SO of the phrases, and classify the review as recommended if the average is positive and not recommended if the average is negative.
(Example in the paper: reviews of a bank)
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
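Step 3 reduces to a one-line decision, sketched here with made-up SO scores:

```python
def classify(phrase_scores):
    """Label a review from the SO scores of its extracted phrases."""
    avg = sum(phrase_scores) / len(phrase_scores)
    return "recommended" if avg > 0 else "not recommended"

print(classify([2.3, -0.5, 1.1]))   # positive average
print(classify([-1.8, -0.2, 0.4]))  # negative average
```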
33 Drawbacks
- Sentiment classification is useful, but it does not find what the reviewer liked or disliked
- A negative sentiment on an object does not imply that the user disliked everything about the product
- Similarly, a positive sentiment does not imply that the user liked everything about the product
- The solution is to go to the sentence and feature level
http://www.cs.uic.edu/liub/EITC-06.ppt
34 Feature-Based Opinion Mining and Summarization (Hu and Liu 04)
- Interested in what reviewers liked and disliked
- Since the number of reviews of an object can be large, the goal is to produce a simple summary of the reviews
- The summary can be easily visualized and compared
http://www.cs.uic.edu/liub/EITC-06.ppt
35 Three Main Tasks
- Step 1: identify and extract the object features that have been commented on in each review
- Step 2: determine whether the opinion on each feature is positive, negative or neutral
- Step 3: group synonyms of features
- Produce a feature-based summary!
http://www.cs.uic.edu/liub/EITC-06.ppt
36 Online Shopping
37 Summary
- Classification of reviews as good or bad: sentiment classification
- Unsupervised review classification extracts phrases from the review, estimates their semantic orientation, and assigns a class to the review
- The solution to the shortcomings of sentiment classification is feature-based opinion extraction
38 Discovering Web Communities
- Mehru Anand (106113525)
39 References
- "Inferring Web Communities from Link Topology" (1998), David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext.
- "Trawling the Web for Emerging Cyber-Communities" (1999), Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks.
- "Finding Related Pages in the World Wide Web" (1999), Jeffrey Dean, Monika R. Henzinger, WWW8 / Computer Networks.
- "A System for Collaborative Web Resource Categorization and Ranking," Maxim Lifantsev.
- "Web Mining: A Bird's Eye View," Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO, madrias_at_umr.edu
40 Introduction
- Introduction to the cyber-community
- Methods to measure the similarity of web pages on the web graph
- Methods to extract meaningful communities through the link structure
41 What is a Cyber-Community?
- A community on the web is a group of web pages sharing a common interest
- E.g., a group of web pages talking about pop music
- E.g., a group of web pages interested in data mining
- Main properties:
- Pages in the same community should be similar to each other in content
- Pages in one community should differ from pages in another community
- Similar to a cluster
42 Two Different Types of Communities
- Explicitly-defined communities
- Well-known ones, such as the resources listed by Yahoo! (e.g., the directory hierarchy Arts -> Music -> Classic / Pop, Arts -> Painting)
- Implicitly-defined communities
- Communities unexpected by, or invisible to, most users
- E.g., the group of web pages interested in a particular singer
44 Two Different Types of Communities
- The explicit communities are easy to identify
- E.g., Yahoo!, InfoSeek, the Clever system
- In order to extract the implicit communities, we need to analyze the web graph objectively
- In research, people are more interested in the implicit communities
45 Similarity of Web Pages
- Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes.
- Method I
- Page A is related to page B if there is a hyperlink from A to B, or from B to A
- Not so good: consider the home pages of IBM and Microsoft, which are clearly related but do not link to each other
(Figure: pages A and B connected by a direct hyperlink)
46 Similarity of Web Pages
- Method II (from bibliometrics)
- Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
- Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B
(Figures: co-citation — pages pointing to both A and B; bibliographic coupling — A and B pointing to common pages)
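Both measures are simple set operations on the link graph. Below is a minimal sketch on a toy graph; the page names are illustrative.

```python
# links[p] = set of pages that p links to (outgoing edges)
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "A":  {"X", "Y"},
    "B":  {"X", "Z"},
}

def co_citation(a, b, links):
    """Number of pages that cite (link to) both a and b."""
    return sum(1 for out in links.values() if a in out and b in out)

def bibliographic_coupling(a, b, links):
    """Number of pages cited by both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))

print(co_citation("A", "B", links))             # P1 and P2 cite both A and B
print(bibliographic_coupling("A", "B", links))  # A and B both cite X
```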
47 Methods of Clustering
- Clustering methods based on co-citation analysis
- Methods derived from HITS (Kleinberg)
- Methods using the co-citation matrix
- All of them can discover meaningful communities
- But these methods are very expensive on the whole World Wide Web, with its billions of web pages
48Trawling the Web for emerging cyber-communitiesPr
oceeding of the eighth international conference
on World Wide Web Toronto, Canada Pages 1481 -
1493 Year of Publication 1999 ISSN1389-1286
Ravi Kumar, Prabhakar Raghavan, Sridhar
Rajagopalan, Andrew Tomkins
49 A Cheaper Method
- The method is from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins, IBM Almaden Research Center
- They call their method community trawling (CT)
- They ran it on a graph of 200 million pages, and it worked very well
50 Basic Idea of CT
- Definition of communities: dense directed bipartite subgraphs
- Bipartite graph: nodes are partitioned into two sets, F and C
- Every directed edge in the graph is directed from a node u in F to a node v in C
- The graph is dense if many of the possible edges between F and C are present
(Figure: bipartite graph with fan set F and center set C)
51 Basic Idea of CT
- Bipartite cores
- A complete bipartite subgraph with at least i nodes from F and at least j nodes from C
- i and j are tunable parameters
- This is an (i, j) bipartite core
- Every community has such a core for some i and j
(Figure: an (i=3, j=3) bipartite core)
52 Basic Idea of CT
- A bipartite core is the identity of a community
- To extract all the communities is to enumerate all the bipartite cores on the web
- The authors invented an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning: elimination-generation pruning.
53
- Complete bipartite graph: there is an edge between each node in F and each node in C
- (i, j)-core: a complete bipartite graph with at least i nodes in F and j nodes in C
- An (i, j)-core is a good signature for finding online communities
- Trawling = finding cores: find all (i, j)-cores in the Web graph; in particular, find fans (or hubs) and centers (authorities) in the graph
- Challenge: the Web is huge. How do we find cores efficiently?
54 Main Idea: Pruning
- Step 1: using out-degrees
- Rule: each fan must point to at least 6 different websites
- Pruning results: 12% of all pages (about 24M pages) are potential fans
- Retain only the links, and ignore page contents
55 Step 2: Eliminate Mirrored Pages
- Many pages are mirrors (exactly the same page)
- They can produce many spurious fans
- Use a "shingling" method to identify and eliminate duplicates
- Results:
- 60% of the 24M potential-fan pages are removed
- The number of potential centers is 30 times the number of potential fans
56 Step 3: Iterative Pruning
- To find (i, j)-cores:
- Remove all pages whose number of out-links is < i
- Remove all pages whose number of in-links is < j
- Do this iteratively
- Step 4: inclusion-exclusion pruning
- Idea: in each step, we
- either include a community
- or exclude a page from further contention
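The iterative-pruning step can be sketched on a toy edge set as below: repeatedly drop fans with out-degree below i and centers with in-degree below j until nothing changes. The graph and (i, j) = (2, 2) are illustrative.

```python
def iterative_prune(edges, i, j):
    """edges: set of (fan, center) pairs. Returns the surviving edges."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        out_deg, in_deg = {}, {}
        for f, c in edges:
            out_deg[f] = out_deg.get(f, 0) + 1
            in_deg[c] = in_deg.get(c, 0) + 1
        # keep only edges whose fan and center still meet the thresholds
        kept = {(f, c) for f, c in edges
                if out_deg[f] >= i and in_deg[c] >= j}
        if kept != edges:
            edges, changed = kept, True
    return edges

# f1, f2 both point to c1 and c2 (a (2,2)-core); f3 -> c3 gets pruned away.
edges = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c3")}
print(iterative_prune(edges, 2, 2))
```

Deleting one low-degree node can push a neighbor below the threshold, which is why the pruning must iterate to a fixed point rather than make a single pass.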
57
- Check a page x with out-degree j: x is a fan of an (i, j)-core if there are i-1 other fans pointing to all the forward neighbors of x
- This step can be checked easily using the index on fans and centers
- Result: for (3,3)-cores, 5M pages remained
- Final step: since the graph is now much smaller, we can afford to enumerate the remaining cores
58
- Step 5: using in-degrees of pages
- Delete highly referenced pages, e.g., Yahoo, AltaVista
- Reason: they are referenced for many reasons, and are not likely to be forming an emerging community
- Formally: remove all pages with more than k in-links (k = 50, for instance)
- Results:
- 60M pages pointing to 20M pages
- 2M potential fans
59 Weaknesses of CT
- The bipartite graph cannot suit all kinds of communities
- The density of the community is hard to adjust
60 Experiments on CT
- 200 million web pages
- An IBM PC with an Intel 300 MHz Pentium II processor and 512 MB of memory, running Linux
- i from 3 to 10 and j from 3 to 20
- 200k potential communities were discovered
- 29% of them cannot be found in Yahoo!
61 Summary
- Conclusion: the methods to discover communities from the web depend on how we define the communities through the link structure
- Future work:
- How to relate the contents to the link structure
62 Mining Topic-Specific Concepts and Definitions on the Web
- Minnie Virk
- May 2003, Proceedings of the 12th International Conference on World Wide Web, ACM Press
- Bing Liu, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053
- Chee Wee Chin, Hwee Tou Ng, National University of Singapore, 3 Science Drive 2, Singapore
63 References
- Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules," VLDB-94, 1994.
- Anderson, C. and Horvitz, E. "Web Montage: A Dynamic Personalized Start Page," WWW-02, 2002.
- Brin, S. and Page, L. "The Anatomy of a Large-Scale Hypertextual Web Search Engine," WWW7, 1998.
- "Web Mining: A Bird's Eye View," Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO, madrias_at_umr.edu
64 Introduction
- When one wants to learn about a topic, one reads a book or a survey paper
- One can also read the research papers about the topic
- Neither of these is very practical
- Learning from the Web is convenient, intuitive, and diverse
65 Purpose of the Paper
- The paper's task is mining topic-specific knowledge on the Web
- The goal is to help people systematically learn in-depth knowledge of a topic from the Web
66 Learning about a New Topic
- One needs to find definitions and descriptions of the topic
- One also needs to know the sub-topics and salient concepts of the topic
- Thus, one wants the knowledge as presented in a traditional book
- The task of this paper can be summarized as "compiling a book from the Web"
67 Proposed Technique
- First, identify the sub-topics or salient concepts of the specific topic
- Then, find and organize the informative pages containing definitions and descriptions of the topic and its sub-topics
68 Why are the Current Search Techniques Not Sufficient?
- For definitions and descriptions of the topic:
- Existing search engines rank web pages based on keyword matching and hyperlink structures, which is not very useful for measuring the informative value of a page
- For sub-topics and salient concepts of the topic:
- A single web page is unlikely to contain information about all the key concepts or sub-topics of the topic. Thus, sub-topics need to be discovered from multiple web pages. Current search engine systems do not perform this task.
69 Related Work
- Web information extraction (wrappers)
- Web query languages
- User preference approaches
- Question answering in information retrieval
- Question answering is closely related to this paper. The objective of a question-answering system is to provide direct answers to questions submitted by the user. In this paper's task, many of the questions are about definitions of terms.
70 The Algorithm
- WebLearn(T)
- 1) Submit T to a search engine, which returns a set of relevant pages
- 2) The system mines the sub-topics or salient concepts of T using a set S of top-ranking pages from the search engine
- 3) The system then discovers the informative pages containing definitions of the topic and its sub-topics (salient concepts) from S
- 4) The user views the concepts and informative pages. If s/he still wants to know more about the sub-topics, then:
- for each user-interested sub-topic Ti of T do
- WebLearn(Ti)
71 Sub-Topic or Salient Concept Discovery
- Observation:
- Sub-topics or salient concepts of a topic are important word phrases, usually emphasized with HTML tags (e.g., <h1>...<h4>, <b>)
- However, this is not sufficient. Data mining techniques help to find the frequently occurring word phrases.
72 Sub-Topic Discovery
- After obtaining a set of relevant top-ranking pages (using Google), sub-topic discovery consists of the following 5 steps:
- 1) Filter out the noisy documents that rarely contain sub-topics or salient concepts. The resulting set of documents is the source for sub-topic discovery.
73 Sub-Topic Discovery
- 2) Identify important phrases in each page (discover phrases emphasized by HTML markup tags)
- Rules to determine whether a markup segment can safely be ignored:
- Contains a salutation title (Mr., Dr., Professor)
- Contains a URL or an email address
- Contains terms related to a publication (conference, proceedings, journal)
- Contains an image between the markup tags
- Too lengthy (the paper uses 15 words as the upper limit)
74 Sub-Topic Discovery
- Also in this step, preprocessing techniques such as stopword removal and word stemming are applied in order to extract quality text segments
- Stopword removal: eliminating words that occur too frequently and carry little informational meaning
- Word stemming: finding the root form of a word by removing its suffix
75 Sub-Topic Discovery
- 3) Mine frequently occurring phrases
- Each piece of text extracted in step 2 is stored in a dataset called a transaction set
- Then, an association rule miner based on the Apriori algorithm is run to find the frequent itemsets. In this context, an itemset is a set of words that occur together, and an itemset is frequent if it appears in more than two documents.
- Only the first step of the Apriori algorithm is needed, and only frequent itemsets with three words or fewer are found (this restriction can be relaxed)
76 Sub-Topic Discovery
- 4) Eliminate itemsets that are unlikely to be sub-topics, and determine the sequence of words in each sub-topic (postprocessing)
- Heuristic: if an itemset never appears alone as an important phrase in any page, it is unlikely to be a main sub-topic and is removed
77 Sub-Topic Discovery
- 5) Rank the remaining itemsets. The remaining itemsets are regarded as the sub-topics or salient concepts of the search topic and are ranked by the number of pages in which they occur.
78 Definition Finding
- This step tries to identify the pages that include definitions of the search topic and of the sub-topics discovered in the previous step
- Preprocessing steps:
- Text that will not be displayed by browsers (e.g., <script>...</script>, <!-- comments -->) is ignored
- Word stemming is applied
- Stopwords and punctuation are kept, as they serve as clues for identifying definitions
- HTML tags within a paragraph are removed
79 Definition Finding
- After that, the following patterns are applied to identify definitions
[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. "Mining Topic-Specific Concepts and Definitions on the Web"
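The pattern table on this slide is an image in the original deck, so the regexes below are an illustrative reconstruction of typical definition patterns ("X is a ...", "X is defined as ...", "the definition of X"), not the paper's exact list.

```python
import re

def definition_patterns(concept):
    """Hypothetical definition-style patterns for a given concept string."""
    c = re.escape(concept)
    return [
        re.compile(rf"\b{c}\s+(?:is|are)\s+(?:a|an|the)\b", re.IGNORECASE),
        re.compile(rf"\b{c}\s+(?:is|are)\s+(?:defined|known)\s+as\b", re.IGNORECASE),
        re.compile(rf"\bthe\s+definition\s+of\s+{c}\b", re.IGNORECASE),
    ]

def contains_definition(text, concept):
    """True if any definition pattern for the concept matches the text."""
    return any(p.search(text) for p in definition_patterns(concept))

print(contains_definition("Data mining is the extraction of patterns.", "data mining"))
print(contains_definition("Data mining tools are popular.", "data mining"))
```

In the paper, such patterns are applied to stemmed paragraphs with stopwords and punctuation retained, since those function words are exactly what the patterns match on.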
80 Definition Finding
- Besides the above patterns, the paper also relies on HTML structuring and hyperlink structures:
- 1) If a page contains only one header, or one large emphasized text segment at the beginning of the entire document, then the document contains a definition of the concept in the header
- 2) Definitions at the second level of the hyperlink structure are also discovered. All the patterns and methods described above are applied to these second-level documents.
81 Definition Finding
- Observation: sometimes no informative page is found for a particular sub-topic, when the pages for the main topic are very general and do not contain detailed information about the sub-topics
- In such cases, the sub-topic can be submitted to the search engine and sub-sub-topics may be found recursively
82 Conclusions
- The proposed techniques aim at helping Web users learn an unfamiliar topic in depth and systematically
- This is an efficient system for discovering and organizing knowledge on the Web, in a way similar to a traditional book, to assist learning