WEB STRUCTURE MINING - PowerPoint PPT Presentation

Slides: 23
Provided by: Bij78

Transcript and Presenter's Notes

Title: WEB STRUCTURE MINING


1
WEB STRUCTURE MINING
  • SUBMITTED BY
  • BLESSY JOHN
  • R7A
  • ROLL NO: 18

2
INTRODUCTION
  • Web mining is the application of data mining
    techniques to web data, as used in search engines.
  • Data mining: the process of discovering useful
    knowledge from data sources
  • Web mining automatically discovers and extracts
    information from Web documents.
  • Web structure mining discovers useful data from
    hyperlinks.

3
WEB MINING
  • Extraction of useful patterns from WWW resources
  • The WWW is a widely distributed, global information
    service centre that constitutes a rich source for
    data mining
  • Employs techniques from data mining,
    information retrieval, etc.

4
NEED FOR WEB MINING
  • Aims at finding and extracting relevant
    information that is hidden in web-related data.
  • The challenge is to bring back the semantics of
    hypertext documents
  • To turn web data into web knowledge

5
CLASSIFICATION

6
WEB STRUCTURE MINING
  • Generates a structural summary of the web site
    and its web pages
  • Uses graph theory to analyse the node and connection
    structure of a web site
  • Analyses the link structure of the web; its
    purpose is to identify the more preferable
    documents
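
A minimal sketch of the graph view described above, assuming a small hypothetical site (the page names and the `degree_summary` helper are illustrative, not from the slides). Pages are nodes, hyperlinks are directed edges, and the in-degree and out-degree of each page give a very simple structural summary.

```python
from collections import defaultdict

# Hypothetical site: each page maps to the pages it links to.
site_links = {
    "index.html": ["about.html", "products.html"],
    "about.html": ["index.html"],
    "products.html": ["index.html", "about.html"],
}

def degree_summary(links):
    """Return {page: (in_degree, out_degree)} for a directed link graph."""
    in_degree = defaultdict(int)
    for source, targets in links.items():
        for target in targets:
            in_degree[target] += 1
    # Include pages that are only ever link targets.
    pages = set(links) | set(in_degree)
    return {p: (in_degree[p], len(links.get(p, []))) for p in pages}

summary = degree_summary(site_links)
for page in sorted(summary):
    print(page, summary[page])
```

A page with a high in-degree is linked to by many others, which is the raw signal that the link-analysis algorithms later in the deck refine.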

7
WEB STRUCTURE MINING (cont.)
  • Discovers the nature of the hierarchy of
    hyperlinks in the website and its structure
  • A hyperlink indicates the author's endorsement of
    the other web page
  • Retrieves information about the relevance and
    the quality of the web page.

8
Page Layout and Link Analysis for Web Images

9
WEB BASICS
  • The web is a huge collection of documents linked
    together by references.
  • References from one document to another are based
    on hypertext and embedded in HTML
  • HTML describes how the document should be displayed
    in the browser window
  • Each web document has a web address, called a URL,
    that identifies it uniquely.
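
As a rough illustration of how the references embedded in HTML can be recovered, the sketch below uses Python's standard `html.parser` and `urllib.parse.urljoin` to collect the `href` targets of anchor tags and resolve them to absolute URLs. The `LinkExtractor` class, the `extract_links` helper, and the example URLs are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = ('<p>See <a href="/docs/intro.html">intro</a> and '
        '<a href="http://example.org/">another site</a>.</p>')
links = extract_links(page, "http://example.com/index.html")
print(links)
```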

10
WEB CRAWLERS
  • Collect all web documents by browsing the Web
    systematically and exhaustively
  • The region of the web to be crawled can be
    specified by using the URL structure.
  • Used by a search engine to provide local access
    to the most recent versions of possibly all web
    pages
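
The crawling idea above can be sketched as a breadth-first traversal that only follows links inside a chosen URL prefix. To keep the example self-contained, it walks a small in-memory dictionary (`FAKE_WEB`, a made-up stand-in for real HTTP fetching); the `crawl` function and its parameters are likewise hypothetical.

```python
from collections import deque

# Hypothetical link structure standing in for real pages on the web.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://other.org/": ["http://example.com/b"],
}

def crawl(seed, url_prefix, get_links):
    """Visit every page reachable from seed whose URL starts with url_prefix."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # The URL structure restricts the region being crawled.
            if link.startswith(url_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("http://example.com/", "http://example.com/",
              lambda u: FAKE_WEB.get(u, []))
print(order)
```

A real crawler would replace the lambda with a function that downloads the page and extracts its links, and would also need politeness rules that are outside the scope of this sketch.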

11
INDEXING AND KEYWORD SEARCH
  • There are two types of data: structured and
    unstructured
  • Structured data have keys associated with each
    data item that reflect its content
  • The keyword search approach gives content-based
    access to unstructured data without considering
    its meaning
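
One common way to support keyword search is an inverted index, which maps each word to the documents containing it. The slides do not name this structure, so treat the sketch below as one possible illustration; the `build_index` and `search` helpers and the sample documents are assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (AND search)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    1: "web structure mining",
    2: "web content mining",
    3: "graph structure",
}
index = build_index(docs)
print(search(index, "web mining"))
print(search(index, "Web Structure"))
```

The match is purely on word occurrence, which is what "without considering the meaning" amounts to in practice.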

12
DOCUMENT REPRESENTATION
  • To facilitate the process of matching keywords
    and documents, some preprocessing steps are taken
    first
  • Documents are tokenized
  • Characters are converted to upper or lower case
  • Words reduced to canonical form
  • Stopwords are usually removed
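
The preprocessing steps listed above can be sketched as a small pipeline. The stopword list and the `naive_stem` suffix-stripping rule are deliberately simplistic assumptions for illustration; real systems typically use a proper stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def naive_stem(word):
    """Very rough reduction to a canonical form by stripping suffixes."""
    if word.endswith("ing") and len(word) >= 7:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) >= 4:
        return word[:-1]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # tokenize
    tokens = [t.lower() for t in tokens]              # convert case
    tokens = [naive_stem(t) for t in tokens]          # canonical form
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

terms = preprocess("The crawlers are crawling the Links of pages")
print(terms)
```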

13
ALGORITHMS
  • There are two main algorithms used in web
    structure mining
  • 1. HITS (Hypertext-Induced Topic Search)
  • 2. PageRank algorithm

14
HITS (Hypertext-Induced Topic Search)
  • Link analysis algorithm
  • Rates web pages
  • Developed by Jon Kleinberg
  • Determines two values for a page
  • Authority: estimates the value of the content of
    the page
  • Hub: estimates the value of its links to other
    pages

15
Hubs and Authorities
  • Hub pages point to interesting links to
    authorities (relevant pages)
  • Authorities are the targets of hub pages

16
Hubs and Authorities (cont.)
  • Authority and hub values are defined in terms of
    one another in a mutual recursion
  • It is executed at query time, with the associated
    hit on performance
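
The mutual recursion can be made concrete with a small iterative sketch: each page's authority score is the sum of the hub scores of pages linking to it, and each page's hub score is the sum of the authority scores of the pages it links to. The normalisation step, the iteration count, and the tiny example graph are assumptions chosen for illustration.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Because the scores depend on the pages retrieved for a query, this loop has to run per query, which is the performance cost mentioned above.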

17
PageRank
  • Link analysis algorithm
  • Assigns a numerical weight to each element of
    a hyperlinked set of documents
  • The PageRank of an element E is denoted by PR(E)
  • Relies on the uniquely democratic nature of the web
  • A link from page A to page B is a vote, by page A,
    for page B

18
PageRank (cont.)
  • Here, if A is itself important, its vote helps to
    make B important
  • PageRank is also a probability distribution: it
    represents the probability that a click on a link
    arrives at any particular page
  • A PageRank of 0.5 -> a 50% chance that a person
    clicking on a link will be directed to the
    document with the 0.5 PageRank
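
A minimal power-iteration sketch of the voting idea follows. It assumes the commonly used damping factor of 0.85 and spreads the rank of pages without outgoing links evenly over all pages; neither detail appears in the slides, and the `pagerank` function and the example graph are illustrative only.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return {page: rank}; the ranks form a probability distribution."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for source in pages:
            targets = links.get(source, [])
            for target in targets:
                # Each link is a vote; a page splits its rank over its links.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

The ranks sum to 1, so each value can be read as the probability of landing on that page, in line with the 0.5 example above.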

19
APPLICATIONS
  • Information retrieval in social networks.
  • Finding the relevance of each web page
  • Measuring the completeness of web sites
  • Used in search engines to find relevant
    information

20
CONCLUSION
  • Search engines use web structure mining to find
    information.
  • We can create new knowledge out of the available
    information
  • Web content mining can be added to it to enhance
    the performance of search engines.

21
  • Thank You!

22
  • Questions?