6.5 Text and Web page pre-processing

Slides: 21

Provided by: mach53

1
6.5 Text and Web page pre-processing
2
Tasks before retrieval

Traditional text documents:
stopword removal
stemming
handling of digits, hyphens, punctuation, and letter case
Web pages, additionally:
HTML tag removal
identification of main content blocks

3
stopword removal

Stopwords are frequently occurring, insignificant words
that help construct sentences but do not represent any content:
articles and determiners: the, this, that, etc.
prepositions: as, at, by, for, with, to, etc.
conjunctions: and, so, then, or, etc.
pronouns and wh-words: who, when, where, that, it, etc.
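
A minimal sketch of stopword removal, assuming whitespace tokenization and a tiny stopword list taken only from the examples above. The names `STOPWORDS` and `remove_stopwords` are hypothetical, and real systems use much longer lists.

```python
# Illustrative stopword list built from the slide's examples; real lists are longer.
STOPWORDS = {
    "the", "this", "that",                   # articles and determiners
    "as", "at", "by", "for", "with", "to",   # prepositions
    "and", "so", "then", "or",               # conjunctions
    "who", "when", "where", "it",            # pronouns and wh-words
}


def remove_stopwords(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop tokens in STOPWORDS."""
    return [token for token in text.lower().split() if token not in STOPWORDS]


print(remove_stopwords("The student went to the library with this book"))
# ['student', 'went', 'library', 'book']
```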

4
stemming

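stem: the portion of a word that is left after removing
its prefixes or suffixes
a word has various syntactic forms:
nouns: plural form (book → books, country → countries)
verbs: gerund form (do → doing),
past tense and past participle (do → did, do → done)
→ these variations cause low recall for a retrieval system, because a relevant
document may contain a variant of the query word rather than the exact word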

5
stemming (cont.)

stemming: suffix removal (stripping)
ex: computer, compute, computing → comput
walks, walked, walking, walker → walk
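
The suffix-stripping idea can be sketched as below. This is a toy rule set chosen only so that it reproduces the slide's examples; it is not Porter's algorithm, and `SUFFIXES`, `naive_stem`, and `min_stem_len` are hypothetical names.

```python
# Toy suffix list, tried in order; NOT Porter's rules, just enough for the examples.
SUFFIXES = ("ing", "ed", "er", "s", "e")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, as long as the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for group in (["computer", "compute", "computing"],
              ["walks", "walked", "walking", "walker"]):
    print(group, "->", {naive_stem(w) for w in group})
```

Each group collapses to a single stem (`comput` and `walk`). Rules this crude also conflate unrelated words, which is the usual cost of aggressive stemming.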

6
stemming (cont.)

stemming algorithms:
Brute-force algorithms
Suffix-stripping algorithms
Lemmatisation algorithms
Stochastic algorithms
Hybrid approaches
Matching algorithms
ex: Martin Porter's stemming algorithm (a suffix-stripping algorithm)

7
stemming (cont.)

advantages:
increases recall
reduces the size of the indexing structure

disadvantage:
can hurt precision: irrelevant documents may be considered relevant
ex: "cope" is reduced to "cop", so it matches documents about "cop"

8
Other pre-processing Tasks for text

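Digits:
numbers and terms containing digits are removed, except specific types,
e.g. dates and other pre-specified types expressed with regular expressions
Hyphens:
removed either by replacing each hyphen with a space or by deleting it without leaving a space
→ either way, some relevant pages will not be found
query terms should be pre-processed the same way as the pages
Punctuation marks
Case of letters: usually all letters are converted to a single case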
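
These tasks could be combined roughly as follows. This is a sketch under stated assumptions: dates are taken to be in `YYYY-MM-DD` form, only ASCII letters are kept, and preserved dates are appended at the end of the token list (acceptable for a bag-of-words index). `DATE_PATTERN` and `normalize` are hypothetical names.

```python
import re

# Assumed pre-specified type to keep: ISO-style dates such as 2008-05-12.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def normalize(text: str, hyphen_replacement: str = " ") -> list[str]:
    """Lowercase, handle hyphens, drop digits and punctuation, but keep dates."""
    dates = DATE_PATTERN.findall(text)       # protect dates before stripping digits
    text = DATE_PATTERN.sub(" ", text)
    text = text.lower()                      # case of letters
    text = text.replace("-", hyphen_replacement)  # " " or "" are the two options
    text = re.sub(r"[^a-z\s]", " ", text)    # remaining digits and punctuation
    return text.split() + dates


sample = "State-of-the-art results, published 2008-05-12, cost 300 dollars!"
print(normalize(sample))
print(normalize(sample, hyphen_replacement=""))
```

Whichever hyphen option is chosen, the same function should be applied to query terms so that queries and pages agree.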

9
web page pre-processing

Identifying different text fields
allows the retrieval system to treat terms in
different fields differently
ex: the title (a concise description of the page) and
emphasized terms (e.g. in header and bold tags) may be given higher weights
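
A minimal sketch of field identification using Python's standard `html.parser`. The choice of which tags count as emphasized (`EMPHASIS_TAGS`) and the class name `FieldExtractor` are assumptions for illustration. The sketch expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Assumed set of tags whose text counts as "emphasized"; adjust as needed.
EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong", "em"}


class FieldExtractor(HTMLParser):
    """Sort the text of a page into title, emphasized, and body fields."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.fields = {"title": [], "emphasized": [], "body": []}

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self.open_tags:
            self.fields["title"].append(text)
        elif EMPHASIS_TAGS.intersection(self.open_tags):
            self.fields["emphasized"].append(text)
        else:
            self.fields["body"].append(text)


page = ("<html><head><title>Web Mining</title></head><body>"
        "<h1>Pre-processing</h1><p>Remove <b>stopwords</b> first.</p>"
        "</body></html>")
extractor = FieldExtractor()
extractor.feed(page)
extractor.close()
print(extractor.fields)
```

A retrieval system could then weight terms from `title` and `emphasized` differently from those in `body`.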

10
web page pre-processing (cont.)

Identifying anchor text
the anchor text associated with a hyperlink often gives
a more accurate description of the information on the page it points to
anchor text pointing to an external page is especially valuable:
it is a summary description of the page written by others, and thus more trustworthy
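
Anchor text can be collected with the standard `html.parser` module, as in this sketch. `AnchorTextExtractor` is a hypothetical name and the URL is a placeholder. Nested anchors and relative-URL resolution are not handled.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> elements of a page."""

    def __init__(self):
        super().__init__()
        self.current_href = None   # href of the <a> being read, if any
        self.current_text = []     # text pieces seen inside that <a>
        self.anchors = []          # resulting list of (href, anchor text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_href = href
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self.current_text).split())
            self.anchors.append((self.current_href, text))
            self.current_href = None


html = ('<p>See the <a href="http://example.com/stemming">Porter '
        '<b>stemming</b> algorithm</a> page.</p>')
parser = AnchorTextExtractor()
parser.feed(html)
parser.close()
print(parser.anchors)
# [('http://example.com/stemming', 'Porter stemming algorithm')]
```

The extracted text would typically be indexed as a description of the target page rather than of the page containing the link.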

11
web page pre-processing (cont.)

Removing HTML tags
dealt with similarly to punctuation
in a typical commercial page, information is presented in
many blocks; simply removing HTML tags may cause
problems by joining text that should not be joined
(fig. 6.6)
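
One way to avoid joining unrelated text is to insert a break wherever a block-level tag starts or ends, as in this sketch. The `BLOCK_TAGS` set is an assumed, incomplete list, and `html_to_blocks` is a hypothetical helper. Script and style contents are not filtered out here.

```python
from html.parser import HTMLParser

# Assumed (incomplete) set of tags that separate blocks of text.
BLOCK_TAGS = {"p", "div", "td", "tr", "li", "br", "h1", "h2", "h3", "table", "ul"}


class BlockAwareTextExtractor(HTMLParser):
    """Strip tags, but emit a line break at every block-level tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _break_if_block(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        self._break_if_block(tag)

    def handle_endtag(self, tag):
        self._break_if_block(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_blocks(html: str) -> list[str]:
    """Return the text of each block as a separate, whitespace-normalized string."""
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    return [" ".join(line.split()) for line in text.split("\n") if line.strip()]


page = ("<table><tr><td>Cameras</td><td>Laptops</td></tr></table>"
        "<p>Free <b>shipping</b> today</p>")
print(html_to_blocks(page))
# ['Cameras', 'Laptops', 'Free shipping today']
```

Deleting the tags naively would instead produce `CamerasLaptopsFree shipping today`, merging two table cells and a paragraph into one run of text.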

13
web page pre-processing (cont.)

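Identifying main content blocks
a commercial page contains a large amount of
information that is not part of its main content (e.g. banner ads, navigation bars, copyright notices)
using the whole page can lead to poor results for search and mining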

14
Identifying main content blocks

two techniques for finding such blocks in Web pages:
partitioning based on visual cues
tree matching

15
Partitioning based on visual cues

visual or rendering information can be obtained
from the Web browser
ex: IE provides an API that can output the X and Y
coordinates of elements in the page; a machine learning model can be built
based on location and appearance features (a number
of training examples need to be manually labeled)

16
Tree matching

based on the observation that most commercial Web
pages are generated using some fixed
templates. Since HTML has a nested structure,
it is easy to build a tag tree.
goal: find the hidden templates
once a template is found, we can identify
which blocks are likely to be main content blocks
(those that differ substantially across different pages of the
same template)
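
As a rough illustration of the first step, the sketch below builds a tag tree with the standard `html.parser` module and compares two pages by a structural signature. This exact-match comparison is a deliberate simplification and not a real tree matching algorithm. `VOID_TAGS`, `build_tag_tree`, and `tag_signature` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed subset of HTML void elements (tags that never have a closing tag).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagTreeBuilder(HTMLParser):
    """Build a nested tree of tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]  # path from the root to the current open node

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def build_tag_tree(html: str) -> dict:
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_signature(node: dict) -> tuple:
    """Nested tuple of tag names; identical structure gives identical signatures."""
    return (node["tag"], tuple(tag_signature(child) for child in node["children"]))


page_a = "<div><h1>Camera</h1><p>Price: 10</p></div>"
page_b = "<div><h1>Laptop</h1><p>Price: 20</p></div>"
same_template = tag_signature(build_tag_tree(page_a)) == tag_signature(build_tag_tree(page_b))
print(same_template)  # True: same structure, different text
```

Here the two pages share a structure while their text differs. A practical system would need an approximate tree matching method that tolerates small structural differences between pages of one template.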

17
Duplicate Detection

duplication (replication): copying a page
mirroring: copying an entire site
benefits:
improves the efficiency of browsing and
file downloading worldwide, given limited
bandwidth across different geographic regions and
poor or unpredictable network performance
→ some duplicate pages, however, are the result of plagiarism

18
Some methods for finding duplicates

MD5 algorithm
computes an aggregated number (hash value) for each page
→ only useful for detecting exact duplicates
exact duplicates are rare; even different mirror sites may have:
different URLs
different Web masters
different contact information
different advertisements to suit local needs
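
The limitation can be seen with Python's standard `hashlib` module. In this sketch the page texts are made up for illustration, and `md5_fingerprint` is a hypothetical helper name.

```python
import hashlib


def md5_fingerprint(page_text: str) -> str:
    """Return the MD5 digest of a page's text as a 32-character hex string."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()


original = "Welcome to our store. Contact: webmaster@example.com"
exact_copy = "Welcome to our store. Contact: webmaster@example.com"
mirror = "Welcome to our store. Contact: admin@mirror.example.com"

print(md5_fingerprint(original) == md5_fingerprint(exact_copy))  # True
print(md5_fingerprint(original) == md5_fingerprint(mirror))      # False
```

Any change to the text, however small, produces a different digest, so near-duplicates such as mirrors with local contact details are missed.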

19
Efficient duplicate detection technique

based on n-grams (shingles):
sequences of consecutive words of a fixed window size n
ex: "John went to school with his brother."
with 3-grams:
John went to, went to school,
to school with, school with his,
with his brother
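
A short sketch of word-level shingling that reproduces the example above. It assumes whitespace tokenization with surrounding punctuation stripped from each word, and `word_ngrams` is a hypothetical name.

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """Return all sequences of n consecutive words, in order of appearance."""
    words = [w.strip(".,;:!?") for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for gram in word_ngrams("John went to school with his brother.", 3):
    print(gram)
```

A sentence of 7 words yields 7 - 3 + 1 = 5 shingles for n = 3, matching the five listed on the slide.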

20
n-grams (shingles) (cont.)

Sn(d): the set of distinct n-grams in
document d
each n-gram can be coded with a number (e.g. an MD5 hash)
d1 and d2 are represented by Sn(d1) and Sn(d2)
the Jaccard coefficient computes the similarity:
$$\mathrm{sim}(d_1, d_2) = \frac{|S_n(d_1) \cap S_n(d_2)|}{|S_n(d_1) \cup S_n(d_2)|}$$
the window size n and the similarity threshold (above which two pages are
treated as likely duplicates) can be chosen through experiments.
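
Putting the pieces together, the following sketch hashes each shingle with MD5 and compares two documents with the Jaccard coefficient. The sample documents, the lowercasing step, and the convention that two empty sets have similarity 1.0 are assumptions. `shingle_set` and `jaccard` are hypothetical names.

```python
import hashlib


def shingle_set(text: str, n: int) -> set[str]:
    """Set of MD5-coded word n-grams of a document (lowercased, whitespace-split)."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.md5(gram.encode("utf-8")).hexdigest() for gram in grams}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty shingle sets count as identical
    return len(a & b) / len(a | b)


d1 = "john went to school with his brother"
d2 = "john went to school with his sister"
similarity = jaccard(shingle_set(d1, 3), shingle_set(d2, 3))
print(round(similarity, 3))  # 4 shared shingles out of 6 distinct ones
```

Each document has five 3-grams and they share four, so the union has six and the similarity is 4/6. Whether that counts as a duplicate depends on the threshold chosen experimentally.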
