Title: WEB MINING AND APPLICATIONS
1 WEB MINING AND APPLICATIONS
- Pallavi Tripathi 105956127
- Vaishali Kshatriya 105951122
- Mehru Anand 106113525
- Minnie Virk 106113516
2 REFERENCES
- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
- Presentation Slides of Prof. Anita Wasilewska
- http://www.cs.rpi.edu/youssefi/research/VWM/
- http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
- http://www.galeas.de/webimining.html
- http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps
3 CITATIONS
- Amir H. Youssefi, David J. Duke, Mohammed J. Zaki, Ephraim P. Glinert, "Visual Web Mining," 13th International World Wide Web Conference (poster proceedings), New York, NY, May 2004.
- Amir H. Youssefi, David Duke, Ephraim P. Glinert, and Mohammed J. Zaki, "Toward Visual Web Mining," 3rd International Workshop on Visual Data Mining (with ICDM'03), Melbourne, FL, November 2003.
4
- With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge.
http://www.galeas.de/webimining.html
5 WHAT IS WEB MINING?
- Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
6 AREAS OF CLASSIFICATION
- WEB CONTENT MINING is the process of extracting knowledge from the content of documents or their descriptions.
- WEB STRUCTURE MINING is the process of inferring knowledge from the World Wide Web organization and the links between references and referents on the Web.
- WEB USAGE MINING, also known as WEB LOG MINING, is the process of extracting interesting patterns from web access logs.
- In addition to these three web mining types, there are other helpful approaches for web knowledge discovery, such as information visualization, which helps us to understand the complex relationships and structures behind many search results.
http://www.galeas.de/webimining.html
7 TOPICS COVERED
- In today's presentation we will cover the following algorithms, related to various aspects of Web Mining:
- SPADE Algorithm and its applications in Visual Web Mining
- Sentiment Classification
- Community Trawling Algorithm
8 VISUAL WEB MINING
- The application of information visualization techniques to the results of Web Mining, in order to amplify the perception of extracted patterns and to visually explore new ones in the web domain.
- The application domain is Web Usage Mining and Web Content Mining.
http://www.cs.rpi.edu/youssefi/research/VWM/
9 APPROACH USED
- Make personalized results for targeted web surfers
- Use data mining algorithms to extract new insights and measures
- Employ a database server and a relational query language as a means to submit specific queries against the data
- Utilize visualization to obtain an overall picture
http://www.cs.rpi.edu/youssefi/research/VWM/
10 SPADE OVERVIEW
- Proposed by Mohammed J. Zaki
- Sequential PAttern Discovery using Equivalence classes
- An Apriori-based algorithm for the fast discovery of frequent sequences
- Needs only three database scans to extract sequential patterns
- Given: a database of customer transactions, each with the following characteristics: a sequence-id or customer-id, a transaction-time, and the item involved in the transaction
- The aim is to obtain typical behaviors from the user's viewpoint
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
11 DEFINITIONS
- Item: can be considered as the object bought by a customer, the page requested by the user of a website, etc.
- Itemset: the set of items that are grouped by timestamp.
- Data Sequence: the sequence of itemsets associated with a customer.
- Sequential Mining: discovering frequent sequences over time of attribute sets in large databases.
- Frequent Sequential Pattern: a sequence whose statistical significance in the database is above a user-specified threshold.
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
12 SPADE ALGORITHM
- The first scan finds the frequent items
- The second scan aims at finding the frequent sequences of length 2
- The last scan associates with each frequent sequence of length 2 a table of the corresponding sequence ids and itemset ids in the database
- Based on this in-memory representation, the support of a candidate sequence of length k is obtained by join operations on the tables of the frequent sequences of length (k-1) able to generate this candidate
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
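The id-list join described above can be sketched as follows. This is a minimal illustration, not Zaki's implementation: the items, timestamps, and helper names are made up, and only the simple sequence join (Y strictly after X) is shown.

```python
# An id-list maps an item to the (sequence_id, timestamp) pairs where it occurs.
id_list = {
    "A": [(1, 10), (1, 30), (2, 20)],
    "B": [(1, 20), (2, 30), (3, 15)],
}

def temporal_join(list_x, list_y):
    """Id-list for the 2-sequence X -> Y: keep (sid, t_y) pairs where
    Y occurs after X within the same sequence."""
    result = []
    for sid_x, t_x in list_x:
        for sid_y, t_y in list_y:
            if sid_x == sid_y and t_y > t_x and (sid_y, t_y) not in result:
                result.append((sid_y, t_y))
    return result

def support(joined):
    """Support = number of distinct sequences containing the pattern."""
    return len({sid for sid, _ in joined})

ab = temporal_join(id_list["A"], id_list["B"])
# A -> B occurs in sequence 1 (A@10 before B@20) and sequence 2 (A@20 before B@30)
print(ab, support(ab))
```

Longer candidates are counted the same way: join the id-lists of their length-(k-1) generators and count the surviving distinct sequence ids.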
13 Data Sequences of 4 Customers
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
14 AN EXAMPLE
- With a minimum support of 50%, a sequential pattern is considered frequent if it occurs in the data sequences of at least 2 customers (2/4).
- In this case a maximal sequential pattern mining process will find three patterns:
- S1: (Camera, DVD) (DVD-R, DVD-Rec)
- S2: (DVD-R, DVD-Rec) (Videosoft)
- S3: (Memory Card) (USB)
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
15 Determining Support
(Figure: suffix join on id-lists vs. the original id-list database)
http://www-sop.inria.fr/axis/personnel/Florent.Masseglia/International_Book_Encyclopedia_2005.pdf
16 ADVANTAGES
- Uses simple join operations on id tables
- No complicated hash-tree structures
- No overhead of generating and searching subsequences
- Cuts down on I/O operations by limiting itself to three database scans
http://www.cs.helsinki.fi/u/gionis/seminar_papers/zaki00spade.ps
17
- The Visual Web Mining framework provides a prototype implementation for applying information visualization techniques to these results.
http://www.cs.rpi.edu/youssefi/research/VWM/
18 SYSTEM ARCHITECTURE
http://www.cs.rpi.edu/youssefi/research/VWM
19
- A robot (webbot) is used to retrieve the pages of the website
- Web server log files are downloaded and processed
- The Integration Engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming, and integrating data, loading it into a database, and later generating graphs in XGML
http://www.cs.rpi.edu/youssefi/research/VWM
20
- We extract user sessions from the web logs; each session groups the requests roughly related to a specific user
- The user sessions are converted into a format suitable for sequence mining
- The outputs are frequent contiguous sequences with a given minimum support
- These are imported into a database
- Different queries are executed against this data
http://www.cs.rpi.edu/youssefi/research/VWM
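The session-extraction step above can be sketched as follows. This is an illustrative reconstruction under common assumptions, not the framework's actual code: the 30-minute inactivity timeout and the (user, timestamp, url) log layout are assumptions.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # assumed inactivity timeout: 30 minutes, in seconds

def sessionize(log_entries):
    """log_entries: iterable of (user_id, timestamp_seconds, url).
    Returns {user_id: [session, ...]}; each session is a list of urls."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(log_entries, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, url))
    sessions = defaultdict(list)
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > TIMEOUT:  # long gap -> start a new session
                sessions[user].append(current)
                current = []
            current.append(url)
        sessions[user].append(current)
    return dict(sessions)

log = [
    ("u1", 0, "/home"), ("u1", 100, "/products"),
    ("u1", 4000, "/home"),  # gap > 30 min: new session
    ("u2", 50, "/about"),
]
print(sessionize(log))
```

Each resulting session is then an ordered page sequence, which is exactly the input format a sequence miner such as SPADE expects.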
21 APPLICATIONS
- Designing different visualization diagrams and exploring frequent patterns of user access on a website
- Classification of web pages into two classes, "hot" and "cold," attracting high and low numbers of visitors
- A webmaster can make exploratory changes to the website structure and analyze the resulting change in real-world user access patterns
http://www.cs.rpi.edu/youssefi/research/VWM/
22 Sentiment Classification
- Vaishali Kshatriya
- 105951122
23 References
- "The Sentimental Factor: Improving Review Classification via Human-Provided Information." Philip Beineke, Shivakumar Vaithyanathan and Trevor Hastie
- "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (July 2002)
- http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
- http://www.cse.iitb.ac.in/cs621/seminar/SentimentDetection.ppt
- Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
24 Sentiment Classification
- The task of labeling a review document according to the polarity of its prevailing opinion.
25 Online Shopping
26 Topical vs. Sentiment Classification
- Topical Classification
- Classifying documents into various subjects, for example Mathematics, Sports, etc.
- Compares individual words (unigrams) across subject areas (bag-of-words approach). Example: score, referee, football -> Sports
- Sentiment Classification
- Classifying documents according to the overall sentiment: positive vs. negative, e.g. like vs. dislike, recommended vs. not recommended
- More difficult than traditional topical classification; may need more linguistic processing, e.g. "you will be disappointed" and "it is not satisfactory"
http://wing.comp.nus.edu.sg/chime/050427/SentimentClassification3_files/frame.htm
27 Challenges
- Dependence on the context of the document: "unpredictable plot" vs. "unpredictable performance"
- Negations have to be captured
- "The movie was not that bad."
- "The pictures taken by the cell phone are not of the best quality."
- Subtle expressions
- "How can someone sit through the entire movie?"
http://www.cse.iitb.ac.in/cs621/seminar/SentimentDetection.ppt
28 Unsupervised Review Classification (Turney, ACL-02)
- Input: a written review
- Output: a classification (i.e., positive or negative)
- Step 1: use a part-of-speech tagger to identify phrases
- Step 2: estimate the semantic orientation of the extracted phrases
- Step 3: assign the given review to a class (either recommended or not recommended)
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
29 Step 1: Extract the Phrases
- A part-of-speech tagger is applied to the review
- Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table, where JJ = adjective and NN = noun
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
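The tag-pattern extraction can be sketched as below. The pattern table in the original slide is an image; the sketch shows only two representative patterns (adjective+noun, adverb+adjective) out of the five rows in Turney's table, and assumes the text is already POS-tagged. The example sentence and tags are illustrative.

```python
# Representative two-word tag patterns (subset of Turney's table).
PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("RB", "JJ")}

def extract_phrases(tagged):
    """tagged: list of (word, tag) pairs. Returns two-word candidate phrases
    whose consecutive tags match one of the patterns."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in PATTERNS:
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("very", "RB"), ("sharp", "JJ"), ("pictures", "NNS"),
          ("online", "JJ"), ("experience", "NN")]
print(extract_phrases(tagged))
```

Turney's full table additionally constrains the tag of the third word for some rows; that refinement is omitted here for brevity.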
30 Step 2: Estimate the Semantic Orientation
- Uses PMI-IR (Pointwise Mutual Information and Information Retrieval)
- The PMI between two words, word1 and word2, is defined as PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )
- The Semantic Orientation (SO) of a phrase is calculated as SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
- SO is positive when the phrase is more strongly associated with "excellent" and negative when it is more strongly associated with "poor"
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
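Estimating SO from search-engine hit counts can be sketched as follows. The hit counts below are made-up illustrative numbers, not real AltaVista results; the 0.01 smoothing term follows Turney's paper.

```python
import math

def so(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor):
    """SO(phrase) ~ log2( hits(phrase NEAR 'excellent') * hits('poor')
                        / (hits(phrase NEAR 'poor') * hits('excellent')) )."""
    eps = 0.01  # smoothing to avoid division by zero for zero-hit phrases
    return math.log2(
        ((hits_near_excellent + eps) * (hits_poor + eps))
        / ((hits_near_poor + eps) * (hits_excellent + eps))
    )

# A phrase co-occurring far more often with "excellent" than with "poor"
# gets a positive orientation (illustrative counts).
score = so(hits_near_excellent=800, hits_near_poor=50,
           hits_excellent=10000, hits_poor=10000)
print(score > 0)
```

The ratio form follows from expanding the two PMI terms and cancelling the shared hits(phrase) factor, so only four queries per phrase are needed.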
31 Step 2 (contd.)
- PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents)
- The experiment uses AltaVista
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
32 Step 3: Assign a Class
- Calculate the average SO of the phrases, and classify the review as recommended if the average is positive and not recommended if the average is negative.
(Example in the paper: reviews of a bank)
Citation: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews." Turney (02)
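Step 3 reduces to a one-line decision, sketched here with made-up SO scores:

```python
def classify(phrase_scores):
    """Label a review from the SO scores of its extracted phrases."""
    avg = sum(phrase_scores) / len(phrase_scores)
    return "recommended" if avg > 0 else "not recommended"

print(classify([2.3, -0.5, 1.1]))   # positive average
print(classify([-1.8, -0.2, 0.4]))  # negative average
```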
33 Drawbacks
- Sentiment classification is useful, but it does not find what the reviewer liked or disliked
- A negative sentiment on an object does not imply that the user disliked everything about the product
- Similarly, a positive sentiment does not imply that the user liked everything about the product
- The solution is to go to the sentence and feature level
http://www.cs.uic.edu/liub/EITC-06.ppt
34 Feature-Based Opinion Mining and Summarization (Hu and Liu 04)
- Interested in what reviewers liked and disliked
- Since the number of reviews of an object can be large, the goal is to produce a simple summary of the reviews
- The summary can be easily visualized and compared
http://www.cs.uic.edu/liub/EITC-06.ppt
35 Three Main Tasks
- Step 1: identify and extract the object features that have been commented on in each review
- Step 2: determine whether the opinion on each feature is positive, negative or neutral
- Step 3: group synonyms of features
- Produce a feature-based summary!
http://www.cs.uic.edu/liub/EITC-06.ppt
36 Online Shopping
37 Summary
- Classification of reviews as good or bad: sentiment classification
- Unsupervised review classification extracts phrases from the review, estimates their semantic orientation, and assigns a class to the review
- The solution to the shortcomings of sentiment classification is feature-based opinion extraction
38 Discovering Web Communities
- Mehru Anand (106113525)
39 References
- "Inferring Web Communities from Link Topology" (1998), David Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on Hypertext.
- "Trawling the Web for Emerging Cyber-Communities" (1999), Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / Computer Networks.
- "Finding Related Pages in the World Wide Web" (1999), Jeffrey Dean, Monika R. Henzinger, WWW8 / Computer Networks.
- "A System for Collaborative Web Resource Categorization and Ranking," Maxim Lifantsev.
- "Web Mining: A Bird's Eye View," Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO, madrias_at_umr.edu
40 Introduction
- Introduction to the cyber-community
- Methods to measure the similarity of web pages on the web graph
- Methods to extract meaningful communities through the link structure
41 What is a Cyber-Community?
- A community on the web is a group of web pages sharing a common interest
- E.g., a group of web pages talking about pop music
- E.g., a group of web pages interested in data mining
- Main properties:
- Pages in the same community should be similar to each other in content
- Pages in one community should differ from pages in another community
- Similar to a cluster
42 Two Different Types of Communities
- Explicitly-defined communities
- Well-known ones, such as the resources listed by Yahoo! (e.g., the directory hierarchy Arts -> Music -> Classic / Pop, Arts -> Painting)
- Implicitly-defined communities
- Communities unexpected by, or invisible to, most users
- E.g., the group of web pages interested in a particular singer
44 Two Different Types of Communities
- The explicit communities are easy to identify
- E.g., Yahoo!, InfoSeek, the Clever system
- In order to extract the implicit communities, we need to analyze the web graph objectively
- In research, people are more interested in the implicit communities
45 Similarity of Web Pages
- Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes.
- Method I
- Page A is related to page B if there is a hyperlink from A to B, or from B to A
- Not so good: consider the home pages of IBM and Microsoft, which are clearly related but do not link to each other
(Figure: pages A and B connected by a direct hyperlink)
46 Similarity of Web Pages
- Method II (from bibliometrics)
- Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
- Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B
(Figures: co-citation — pages pointing to both A and B; bibliographic coupling — A and B pointing to common pages)
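Both measures are simple set operations on the link graph. Below is a minimal sketch on a toy graph; the page names are illustrative.

```python
# links[p] = set of pages that p links to (outgoing edges)
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B", "C"},
    "A":  {"X", "Y"},
    "B":  {"X", "Z"},
}

def co_citation(a, b, links):
    """Number of pages that cite (link to) both a and b."""
    return sum(1 for out in links.values() if a in out and b in out)

def bibliographic_coupling(a, b, links):
    """Number of pages cited by both a and b."""
    return len(links.get(a, set()) & links.get(b, set()))

print(co_citation("A", "B", links))             # P1 and P2 cite both A and B
print(bibliographic_coupling("A", "B", links))  # A and B both cite X
```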
47 Methods of Clustering
- Clustering methods based on co-citation analysis
- Methods derived from HITS (Kleinberg)
- Methods using the co-citation matrix
- All of them can discover meaningful communities
- But these methods are very expensive on the whole World Wide Web, with its billions of web pages
48Trawling the Web for emerging cyber-communitiesPr
oceeding of the eighth international conference
on World Wide Web Toronto, Canada Pages 1481 -
1493 Year of Publication 1999 ISSN1389-1286
Ravi Kumar, Prabhakar Raghavan, Sridhar
Rajagopalan, Andrew Tomkins
49 A Cheaper Method
- The method is from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins, IBM Almaden Research Center
- They call their method community trawling (CT)
- They ran it on a graph of 200 million pages, and it worked very well
50 Basic Idea of CT
- Definition of communities: dense directed bipartite subgraphs
- Bipartite graph: nodes are partitioned into two sets, F and C
- Every directed edge in the graph is directed from a node u in F to a node v in C
- The graph is dense if many of the possible edges between F and C are present
(Figure: bipartite graph with fan set F and center set C)
51 Basic Idea of CT
- Bipartite cores
- A complete bipartite subgraph with at least i nodes from F and at least j nodes from C
- i and j are tunable parameters
- This is an (i, j) bipartite core
- Every community has such a core for some i and j
(Figure: an (i=3, j=3) bipartite core)
52 Basic Idea of CT
- A bipartite core is the identity of a community
- To extract all the communities is to enumerate all the bipartite cores on the web
- The authors invented an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning: elimination-generation pruning.
53
- Complete bipartite graph: there is an edge between each node in F and each node in C
- (i, j)-core: a complete bipartite graph with at least i nodes in F and j nodes in C
- An (i, j)-core is a good signature for finding online communities
- Trawling = finding cores: find all (i, j)-cores in the Web graph; in particular, find fans (or hubs) and centers (authorities) in the graph
- Challenge: the Web is huge. How do we find cores efficiently?
54 Main Idea: Pruning
- Step 1: using out-degrees
- Rule: each fan must point to at least 6 different websites
- Pruning results: 12% of all pages (about 24M pages) are potential fans
- Retain only the links, and ignore page contents
55 Step 2: Eliminate Mirrored Pages
- Many pages are mirrors (exactly the same page)
- They can produce many spurious fans
- Use a "shingling" method to identify and eliminate duplicates
- Results:
- 60% of the 24M potential-fan pages are removed
- The number of potential centers is 30 times the number of potential fans
56 Step 3: Iterative Pruning
- To find (i, j)-cores:
- Remove all pages whose number of out-links is < i
- Remove all pages whose number of in-links is < j
- Do this iteratively
- Step 4: inclusion-exclusion pruning
- Idea: in each step, we
- either include a community
- or exclude a page from further contention
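The iterative-pruning step can be sketched on a toy edge set as below: repeatedly drop fans with out-degree below i and centers with in-degree below j until nothing changes. The graph and (i, j) = (2, 2) are illustrative.

```python
def iterative_prune(edges, i, j):
    """edges: set of (fan, center) pairs. Returns the surviving edges."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        out_deg, in_deg = {}, {}
        for f, c in edges:
            out_deg[f] = out_deg.get(f, 0) + 1
            in_deg[c] = in_deg.get(c, 0) + 1
        # keep only edges whose fan and center still meet the thresholds
        kept = {(f, c) for f, c in edges
                if out_deg[f] >= i and in_deg[c] >= j}
        if kept != edges:
            edges, changed = kept, True
    return edges

# f1, f2 both point to c1 and c2 (a (2,2)-core); f3 -> c3 gets pruned away.
edges = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c3")}
print(iterative_prune(edges, 2, 2))
```

Deleting one low-degree node can push a neighbor below the threshold, which is why the pruning must iterate to a fixed point rather than make a single pass.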
57
- Check a page x with out-degree j: x is a fan of an (i, j)-core if there are i-1 other fans pointing to all the forward neighbors of x
- This step can be checked easily using the index on fans and centers
- Result: for (3,3)-cores, 5M pages remained
- Final step: since the graph is now much smaller, we can afford to enumerate the remaining cores
58
- Step 5: using in-degrees of pages
- Delete highly referenced pages, e.g., Yahoo, AltaVista
- Reason: they are referenced for many reasons, and are not likely to be forming an emerging community
- Formally: remove all pages with more than k in-links (k = 50, for instance)
- Results:
- 60M pages pointing to 20M pages
- 2M potential fans
59 Weaknesses of CT
- The bipartite graph cannot suit all kinds of communities
- The density of the community is hard to adjust
60 Experiments on CT
- 200 million web pages
- An IBM PC with an Intel 300 MHz Pentium II processor and 512 MB of memory, running Linux
- i from 3 to 10 and j from 3 to 20
- 200k potential communities were discovered
- 29% of them cannot be found in Yahoo!
61 Summary
- Conclusion: the methods to discover communities from the web depend on how we define the communities through the link structure
- Future work:
- How to relate the contents to the link structure
62 Mining Topic-Specific Concepts and Definitions on the Web
- Minnie Virk
- May 2003, Proceedings of the 12th International Conference on World Wide Web, ACM Press
- Bing Liu, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-7053
- Chee Wee Chin, Hwee Tou Ng, National University of Singapore, 3 Science Drive 2, Singapore
63 References
- Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules," VLDB-94, 1994.
- Anderson, C. and Horvitz, E. "Web Montage: A Dynamic Personalized Start Page," WWW-02, 2002.
- Brin, S. and Page, L. "The Anatomy of a Large-Scale Hypertextual Web Search Engine," WWW7, 1998.
- "Web Mining: A Bird's Eye View," Sanjay Kumar Madria, Department of Computer Science, University of Missouri-Rolla, MO, madrias_at_umr.edu
64 Introduction
- When one wants to learn about a topic, one reads a book or a survey paper
- One can also read the research papers about the topic
- Neither of these is very practical
- Learning from the Web is convenient, intuitive, and diverse
65 Purpose of the Paper
- The paper's task is mining topic-specific knowledge on the Web
- The goal is to help people systematically learn in-depth knowledge of a topic from the Web
66 Learning about a New Topic
- One needs to find definitions and descriptions of the topic
- One also needs to know the sub-topics and salient concepts of the topic
- Thus, one wants the knowledge as presented in a traditional book
- The task of this paper can be summarized as "compiling a book from the Web"
67 Proposed Technique
- First, identify the sub-topics or salient concepts of the specific topic
- Then, find and organize the informative pages containing definitions and descriptions of the topic and its sub-topics
68 Why are the Current Search Techniques Not Sufficient?
- For definitions and descriptions of the topic:
- Existing search engines rank web pages based on keyword matching and hyperlink structures, which is not very useful for measuring the informative value of a page
- For sub-topics and salient concepts of the topic:
- A single web page is unlikely to contain information about all the key concepts or sub-topics of the topic. Thus, sub-topics need to be discovered from multiple web pages. Current search engine systems do not perform this task.
69 Related Work
- Web information extraction (wrappers)
- Web query languages
- User preference approaches
- Question answering in information retrieval
- Question answering is closely related to this paper. The objective of a question-answering system is to provide direct answers to questions submitted by the user. In this paper's task, many of the questions are about definitions of terms.
70 The Algorithm
- WebLearn(T)
- 1) Submit T to a search engine, which returns a set of relevant pages
- 2) The system mines the sub-topics or salient concepts of T using a set S of top-ranking pages from the search engine
- 3) The system then discovers the informative pages containing definitions of the topic and its sub-topics (salient concepts) from S
- 4) The user views the concepts and informative pages. If s/he still wants to know more about the sub-topics, then:
- for each user-interested sub-topic Ti of T do
- WebLearn(Ti)
71 Sub-Topic or Salient Concept Discovery
- Observation:
- Sub-topics or salient concepts of a topic are important word phrases, usually emphasized with HTML tags (e.g., <h1>...<h4>, <b>)
- However, this is not sufficient. Data mining techniques help to find the frequently occurring word phrases.
72 Sub-Topic Discovery
- After obtaining a set of relevant top-ranking pages (using Google), sub-topic discovery consists of the following 5 steps:
- 1) Filter out the noisy documents that rarely contain sub-topics or salient concepts. The resulting set of documents is the source for sub-topic discovery.
73 Sub-Topic Discovery
- 2) Identify important phrases in each page (discover phrases emphasized by HTML markup tags)
- Rules to determine whether a markup segment can safely be ignored:
- Contains a salutation title (Mr., Dr., Professor)
- Contains a URL or an email address
- Contains terms related to a publication (conference, proceedings, journal)
- Contains an image between the markup tags
- Too lengthy (the paper uses 15 words as the upper limit)
74 Sub-Topic Discovery
- Also in this step, preprocessing techniques such as stopword removal and word stemming are applied in order to extract quality text segments
- Stopword removal: eliminating words that occur too frequently and carry little informational meaning
- Word stemming: finding the root form of a word by removing its suffix
75 Sub-Topic Discovery
- 3) Mine frequently occurring phrases
- Each piece of text extracted in step 2 is stored in a dataset called a transaction set
- Then, an association rule miner based on the Apriori algorithm is run to find the frequent itemsets. In this context, an itemset is a set of words that occur together, and an itemset is frequent if it appears in more than two documents.
- Only the first step of the Apriori algorithm is needed, and only frequent itemsets with three words or fewer are found (this restriction can be relaxed)
76 Sub-Topic Discovery
- 4) Eliminate itemsets that are unlikely to be sub-topics, and determine the sequence of words in each sub-topic (postprocessing)
- Heuristic: if an itemset never appears alone as an important phrase in any page, it is unlikely to be a main sub-topic and is removed
77 Sub-Topic Discovery
- 5) Rank the remaining itemsets. The remaining itemsets are regarded as the sub-topics or salient concepts of the search topic and are ranked by the number of pages in which they occur.
78 Definition Finding
- This step tries to identify the pages that include definitions of the search topic and of the sub-topics discovered in the previous step
- Preprocessing steps:
- Text that will not be displayed by browsers (e.g., <script>...</script>, <!-- comments -->) is ignored
- Word stemming is applied
- Stopwords and punctuation are kept, as they serve as clues for identifying definitions
- HTML tags within a paragraph are removed
79 Definition Finding
- After that, the following patterns are applied to identify definitions
[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. "Mining Topic-Specific Concepts and Definitions on the Web"
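The pattern table on this slide is an image in the original deck, so the regexes below are an illustrative reconstruction of typical definition patterns ("X is a ...", "X is defined as ...", "the definition of X"), not the paper's exact list.

```python
import re

def definition_patterns(concept):
    """Hypothetical definition-style patterns for a given concept string."""
    c = re.escape(concept)
    return [
        re.compile(rf"\b{c}\s+(?:is|are)\s+(?:a|an|the)\b", re.IGNORECASE),
        re.compile(rf"\b{c}\s+(?:is|are)\s+(?:defined|known)\s+as\b", re.IGNORECASE),
        re.compile(rf"\bthe\s+definition\s+of\s+{c}\b", re.IGNORECASE),
    ]

def contains_definition(text, concept):
    """True if any definition pattern for the concept matches the text."""
    return any(p.search(text) for p in definition_patterns(concept))

print(contains_definition("Data mining is the extraction of patterns.", "data mining"))
print(contains_definition("Data mining tools are popular.", "data mining"))
```

In the paper, such patterns are applied to stemmed paragraphs with stopwords and punctuation retained, since those function words are exactly what the patterns match on.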
80 Definition Finding
- Besides the above patterns, the paper also relies on HTML structuring and hyperlink structures:
- 1) If a page contains only one header, or one large emphasized text segment at the beginning of the entire document, then the document contains a definition of the concept in the header
- 2) Definitions at the second level of the hyperlink structure are also discovered. All the patterns and methods described above are applied to these second-level documents.
81 Definition Finding
- Observation: sometimes no informative page is found for a particular sub-topic, when the pages for the main topic are very general and do not contain detailed information about the sub-topics
- In such cases, the sub-topic can be submitted to the search engine and sub-sub-topics may be found recursively
82 Conclusions
- The proposed techniques aim at helping Web users learn an unfamiliar topic in depth and systematically
- This is an efficient system for discovering and organizing knowledge on the Web, in a way similar to a traditional book, to assist learning