6.5 Text and Web page pre-processing

Provided by: mach53

1
6.5 Text and Web page pre-processing
2
Tasks before retrieval
  • Traditional text documents:
  • stopword removal
  • stemming
  • handling of digits, hyphens, punctuation, and
    case of letters
  • Additional tasks for Web pages:
  • HTML tag removal
  • identification of main content blocks

3
stopword removal
  • Stopwords:
  • frequently occurring
  • insignificant words
  • help construct sentences
  • do not represent any content
  • articles: the, this, that, etc.
  • prepositions: as, at, by, for, with, to, etc.
  • conjunctions: and, so, then, or, etc.
  • pronouns: who, when, where, that, it, etc.
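
The filtering step above can be sketched in a few lines of Python. This is only a minimal illustration: the `STOPWORDS` set is assumed to contain just the examples listed on this slide (real systems use much longer lists), and the function name `remove_stopwords` is hypothetical.

```python
# Assumed stopword list: only the examples named on the slide.
STOPWORDS = {
    "the", "this", "that",                  # articles
    "as", "at", "by", "for", "with", "to",  # prepositions
    "and", "so", "then", "or",              # conjunctions
    "who", "when", "where", "it",           # pronouns
}

def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

filtered = remove_stopwords(
    "The page was found by the crawler and then indexed with care"
)
print(filtered)
```

A set is used so that each membership test is a constant-time lookup, which matters when every token of every document is checked.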

4
stemming
  • stem:
  • the portion of a word that is left after removing
    its prefixes or suffixes
  • nouns: plural form (book → books, country →
    countries)
  • verbs: gerund form (do → doing)
  • past tense (do → did, do → done)
  • → such variations cause low recall for a retrieval
    system, which stemming partially addresses

5
stemming (cont.)
  • stemming:
  • suffix removal
  • stripping
  • ex: computer, compute, computing
  • → comput
  • walks, walked, walking, walker
  • → walk
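
As a rough illustration of suffix stripping, the toy function below removes one suffix from a short, assumed list. It is not Porter's algorithm: the `SUFFIXES` tuple and the `min_stem` cutoff are hypothetical choices made for this sketch, and it would leave "compute" untouched, whereas a real stemmer reduces it to "comput".

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [strip_suffix(w) for w in ["walks", "walked", "walking", "walker"]]
print(stems)
```

The minimum stem length is a simple guard against stripping short words such as "red" or "is" down to almost nothing.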

6
stemming (cont.)

  • stemming algorithms:
  • Brute Force Algorithms
  • Suffix Stripping Algorithms
  • Lemmatisation Algorithms
  • Stochastic Algorithms
  • Hybrid Approaches
  • Matching Algorithms
  • Martin Porter's stemming algorithm

7
stemming (cont.)
  • advantages:
  • increases recall
  • reduces the size of the indexing structure
  • disadvantage:
  • irrelevant documents may be considered relevant
  • ex: cope is reduced to cop

8
Other pre-processing Tasks for text
  • Digits
  • removed, except for specific types, e.g. dates and
    pre-specified types expressed with regular expressions
  • Hyphens
  • replaced with a space, or removed without leaving a space
  • → either way, some relevant pages will not be found
  • query terms need the same pre-processing as the documents
  • Punctuation Marks
  • Case of Letters
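
A minimal sketch of these steps is shown below. It makes several assumptions that are not on the slide: dates are taken to look like `YYYY-MM-DD`, letters are folded to lower case, and the names `DATE_PATTERN` and `normalize` are hypothetical. The `hyphen_replacement` parameter switches between the two hyphen policies.

```python
import re

# Assumed date format kept as a single token: YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def normalize(text, hyphen_replacement=" "):
    """Lowercase, keep dates, handle hyphens, drop punctuation and bare digits."""
    tokens = []
    for raw in text.lower().split():
        raw = raw.strip(".,;:!?\"'()")            # trim surrounding punctuation
        if DATE_PATTERN.fullmatch(raw):
            tokens.append(raw)                    # pre-specified type: keep as is
            continue
        raw = raw.replace("-", hyphen_replacement)
        raw = re.sub(r"[^\w\s]", "", raw)         # remove remaining punctuation
        tokens.extend(t for t in raw.split() if not t.isdigit())
    return tokens

sentence = "State-of-the-art results, published 2009-05-17, cost 300 dollars."
spaced = normalize(sentence)
joined = normalize(sentence, hyphen_replacement="")
print(spaced)
print(joined)
```

Applying the same function to query terms keeps the query and the index consistent, whichever hyphen policy is chosen.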

9
web page pre-processing
  • Identifying different text fields
  • allows the retrieval system to treat terms in
    different fields differently
  • ex: the title is a concise description of the page;
    emphasized terms appear in e.g. header and bold tags

10
web page pre-processing (cont.)
  • Identifying anchor text
  • anchor text associated with a hyperlink often gives
    a more accurate description of the page it points to
  • anchor text pointing to an external page is especially
    valuable: it is a summary description of the page written
    by someone other than its owner, and thus more trustworthy
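
One way to separate text fields and anchor text is sketched below with Python's standard `html.parser` module. The class name `FieldExtractor`, the three field names, and the set of tags treated as emphasis are assumptions made for this illustration, not something the slides prescribe.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Sorts page text into title / emphasis / body and records anchor text."""
    EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong"}  # assumed emphasis tags

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "emphasis": [], "body": []}
        self.anchors = []          # (href, anchor text) pairs
        self._stack = []           # currently open tags
        self._href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, " ".join(self._anchor_text)))
            self._href = None
        if tag in self._stack:
            while self._stack.pop() != tag:   # also discards unclosed tags
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._href is not None:
            self._anchor_text.append(text)
        if "title" in self._stack:
            self.fields["title"].append(text)
        elif self.EMPHASIS_TAGS & set(self._stack):
            self.fields["emphasis"].append(text)
        else:
            self.fields["body"].append(text)

page = ('<html><head><title>Cheap Laptops</title></head><body>'
        '<h1>Laptop deals</h1><p>See <a href="http://example.com/reviews">'
        'independent laptop reviews</a> today.</p></body></html>')
extractor = FieldExtractor()
extractor.feed(page)
extractor.close()
print(extractor.fields)
print(extractor.anchors)
```

A retrieval system could then weight terms from the title and emphasis fields more heavily than body terms, and index each anchor text under the page its link points to.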

11
web page pre-processing (cont.)
  • removing HTML tags
  • dealt with similarly to punctuation
  • In a typical commercial page, information is presented
    in many blocks; removing HTML tags may cause problems
    by joining text that should not be joined.
    (fig. 6.6)
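
The joining problem can be reproduced with a small example. The sketch below compares a naive regular-expression strip with a parser that inserts a space at block boundaries; the `BLOCK_TAGS` set and the names `TextExtractor` and `strip_tags` are assumptions made for this illustration.

```python
import re
from html.parser import HTMLParser

# Assumed set of tags that separate visual blocks.
BLOCK_TAGS = {"p", "div", "td", "tr", "li", "br", "h1", "h2", "h3", "table", "ul"}

class TextExtractor(HTMLParser):
    """Collects page text, adding a space wherever a block tag opens or closes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

html = "<table><tr><td>Price</td><td>Cameras</td></tr></table>"
naive = re.sub(r"<[^>]+>", "", html)   # text from adjacent cells is fused
clean = strip_tags(html)               # cells stay separate words
print(naive)
print(clean)
```

Inline tags such as bold are deliberately left out of `BLOCK_TAGS`, so a word that is only partly emphasized is not split in two.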

12
(No Transcript)
13
web page pre-processing (cont.)
  • Identifying main content blocks
  • a commercial page contains a large amount of
    information that is not part of its main content
  • this leads to poor results for search and mining

14
Identifying main content blocks
  • two techniques for finding such blocks in Web pages:
  • Partitioning based on visual cues
  • Tree matching

15
Partitioning based on visual cues
  • visual or rendering information can be obtained
    from the Web browser
  • ex: IE provides an API that can output X and Y
    coordinates; a machine learning model can be built
    based on location and appearance features (a number
    of training examples need to be manually labeled)

16
Tree matching
  • based on the observation that most commercial Web
    pages are generated using some fixed
    templates. Since HTML has a nested structure,
    it is easy to build a tag tree.
  • find the hidden templates
  • Once a template is found, we can identify
    which blocks are likely to be main content blocks
    (these differ considerably across different pages of
    the same template)
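
The idea that main content is the part that varies between pages of one template can be illustrated with a much-simplified stand-in for tree matching. The sketch below does not match trees; it only compares the text found at identical tag paths in two pages. The path labels (tag plus `id`), the class `PathText`, and the function `varying_paths` are all assumptions made for this example.

```python
from html.parser import HTMLParser

class PathText(HTMLParser):
    """Maps each tag path, e.g. 'html/body/div#main', to the text found there."""
    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, label) pairs for the open elements
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        label = f"{tag}#{element_id}" if element_id else tag
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(label for _, label in self.stack)
            self.text_by_path.setdefault(path, []).append(text)

def text_by_path(html):
    parser = PathText()
    parser.feed(html)
    parser.close()
    return parser.text_by_path

def varying_paths(page_a, page_b):
    """Paths present in both pages whose text differs: candidate main content."""
    a, b = text_by_path(page_a), text_by_path(page_b)
    return sorted(path for path in a.keys() & b.keys() if a[path] != b[path])

TEMPLATE = ('<html><body><div id="nav">Home | Products</div>'
            '<div id="main">{}</div><div id="footer">Copyright</div></body></html>')
candidates = varying_paths(TEMPLATE.format("Camera X review"),
                           TEMPLATE.format("Laptop Y review"))
print(candidates)
```

Navigation and footer text is identical on both pages, so only the block whose content changes from page to page is reported.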

17
Duplicate Detection
  • duplication (replication): copying a page
  • mirroring: copying an entire site
  • improves efficiency of browsing
  • and of file downloading worldwide, given limited
    bandwidth across different geographic regions
  • and poor or unpredictable network performance
  • → some duplicate pages are plagiarism

18
Some methods for finding duplicates
  • MD5 algorithm
  • computing an aggregated number
  • → only useful for detecting exact duplicates
  • even different mirror sites may have:
  • different URLs
  • different Web masters
  • different contact information
  • different advertisements to suit local needs
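
The limitation can be seen directly with Python's standard `hashlib` module. In this sketch the page strings and the function name `page_fingerprint` are made up for illustration.

```python
import hashlib

def page_fingerprint(content):
    """Return the MD5 digest of the whole page as a hex string."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

original = "<html><body>Welcome to our site. Contact: admin@example.com</body></html>"
exact_copy = "<html><body>Welcome to our site. Contact: admin@example.com</body></html>"
mirror = "<html><body>Welcome to our site. Contact: mirror-admin@example.org</body></html>"

same = page_fingerprint(original) == page_fingerprint(exact_copy)
near = page_fingerprint(original) == page_fingerprint(mirror)
print(same, near)
```

Changing only the contact address gives a completely different digest, so a mirror with local contact details is not recognized as a duplicate.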

19
Efficient duplicate detection technique
  • based on n-grams (shingles)
  • an n-gram is a sequence of consecutive words of a
    fixed window size n
  • ex: John went to school with his brother.
  • with 3-grams:
  • John went to
  • went to school
  • to school with
  • school with his
  • with his brother
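
The slide's example can be generated with a sliding window over the word list. This is a minimal sketch: splitting words with the regular expression `\w+` (which drops the final period) and the function name `shingles` are assumptions.

```python
import re

def shingles(text, n=3):
    """Return every run of n consecutive words, in document order."""
    words = re.findall(r"\w+", text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

grams = shingles("John went to school with his brother.", n=3)
for gram in grams:
    print(gram)
```

A document with w words yields w - n + 1 shingles, so the seven-word sentence gives five 3-grams.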

20
n-grams (shingles) (cont.)
  • $S_n(d)$: the set of distinct n-grams in
    document d
  • each n-gram is coded with a number (e.g. an MD5 hash)
  • documents $d_1$ and $d_2$ are represented by $S_n(d_1)$ and $S_n(d_2)$
  • the Jaccard coefficient computes their similarity:
  • $\mathrm{sim}(d_1, d_2) = \dfrac{|S_n(d_1) \cap S_n(d_2)|}{|S_n(d_1) \cup S_n(d_2)|}$
  • the window size n and the similarity threshold can be
    chosen through experiments.
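
Putting the pieces together, the sketch below builds the hashed shingle sets and computes their Jaccard coefficient. The lowercasing, the `\w+` word split, the example sentences, and the names `shingle_set` and `jaccard` are assumptions made for illustration; the window size and any threshold would be tuned experimentally, as the slide notes.

```python
import hashlib
import re

def shingle_set(text, n=3):
    """Set of distinct n-grams of the text, each coded as an MD5-derived integer."""
    words = re.findall(r"\w+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) for g in grams}

def jaccard(a, b):
    """Size of the intersection over size of the union; 1.0 if both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = "John went to school with his brother."
d2 = "John went to school with his sister."
similarity = jaccard(shingle_set(d1), shingle_set(d2))
print(round(similarity, 3))
```

The two sentences share four of their five 3-grams, so the intersection has 4 elements and the union has 6, giving a similarity of 4/6.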