1
SamudraManthan Popular terms
  • Dinesh Bhirud
  • Prasad Kulkarni
  • Varada Kolhatkar

2
Architecture
MANAGER
WORKER PROCESSORS
3
Data Distribution
Handshake protocol
4
Data Distribution (contd)
  • Manager (processor 0) reads an article and passes
    it on to the other processors (workers) in
    round-robin fashion
  • Before sending a new article to the same worker,
    the manager waits until that worker is ready to
    receive more data
  • The worker processes the article and builds its
    data structures before receiving a new article
  • Sends and receives are synchronous
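The round-robin rotation in the handshake above can be sketched as a tiny pure function. This is an illustration, not the authors' code: `next_worker` is a hypothetical helper that models only the rank rotation; in the real protocol the manager also waits for the worker's ready signal (W→M) via synchronous MPI sends/receives before dispatching the next article.

```c
/* Hypothetical sketch of the manager's round-robin dispatch order.
   Workers are ranks 1..P-1; rank 0 is the manager and is skipped.
   The ready-signal / data-message handshake itself (synchronous
   MPI send/recv) is omitted here. */
static int next_worker(int prev, int nprocs) {
    int w = prev + 1;
    if (w >= nprocs) w = 1;   /* wrap around past the last worker */
    return w;
}
```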

5
Suffix Array, LCP Vector And Equivalence Classes
  • Suffix array is a sorted array of suffixes
  • LCP vector keeps track of repeating terms in
    suffixes
  • We use suffix arrays and LCP vector to partition
    articles into classes
  • Each class represents a group of Ngrams
  • Classes represent all Ngrams in the article and
    no Ngram is represented more than once

6
Example

A ROSE IS A ROSE

S  Suffix              LCP  Trivial Classes  Non-trivial Classes
0  A ROSE              0    <0,0>            <0,1>
1  A ROSE IS A ROSE    2    <1,1>
2  IS A ROSE           0    <2,2>
3  ROSE                0    <3,3>            <3,4>
4  ROSE IS A ROSE      1    <4,4>
5  (empty suffix)      0
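The suffix array and LCP vector for this example can be reproduced with a short word-level sketch. This is a naive illustration under assumed helper names (`build`, `suf_cmp`, `lcp_words`), using a comparison sort rather than whatever construction the real system uses, and omitting the empty sentinel suffix in the last row.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N 5
static const char *words[N] = {"A", "ROSE", "IS", "A", "ROSE"};
static int sa[N];   /* sorted suffix start positions */
static int lcp[N];  /* LCP (in words) with the previous suffix */

/* Compare the word-level suffixes starting at indices a and b. */
static int suf_cmp(const void *pa, const void *pb) {
    int a = *(const int *)pa, b = *(const int *)pb;
    while (a < N && b < N) {
        int c = strcmp(words[a], words[b]);
        if (c) return c;
        a++; b++;
    }
    return (a < N) - (b < N);   /* the shorter suffix sorts first */
}

/* Longest common prefix, in words, of the suffixes at a and b. */
static int lcp_words(int a, int b) {
    int k = 0;
    while (a + k < N && b + k < N &&
           strcmp(words[a + k], words[b + k]) == 0)
        k++;
    return k;
}

static void build(void) {
    for (int i = 0; i < N; i++) sa[i] = i;
    qsort(sa, N, sizeof sa[0], suf_cmp);
    lcp[0] = 0;                     /* first suffix has no predecessor */
    for (int i = 1; i < N; i++) lcp[i] = lcp_words(sa[i - 1], sa[i]);
}
```

Running `build()` on "A ROSE IS A ROSE" yields the sorted order and LCP column shown in the table above.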
7
Advantages of Suffix Arrays
  • Time Complexity
  • There can be at most 2N-1 classes, where N is the
    number of words in an article
  • Ngrams of all/any sizes can be identified with
    their tfs in linear time
  • These data structures enable us to represent all
    and any sized Ngrams without actually storing
    them

8
Intra-Processor Reduction Problem
  • Suffix array data structure gives us article
    level unique Ngrams with term frequencies
  • A processor processes multiple articles
  • Need to identify unique Ngrams across articles
  • Need to have a unique identifier for each word

9
Dictionary: Our Savior
  • Dictionary is a sorted list of all unique words
    in the Gigaword corpus
  • Dictionary ids form a unified basis for
    intra/inter process reduction

10
Intra-Processor Reduction
  • Used a hash table to store unique Ngrams with tf
    and df
  • Hashing function
  • Simple mod hashing function
  • H(t) = Σ t(i) mod HASH_SIZE, where t(i) is the
    dictionary id of word i in Ngram t
  • Hash data structure
  • struct ngramstore {
        int *word_id;
        int cnt;
        int doc_freq;
        struct ngramstore *chain;
    };
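The chained hash above can be sketched as follows. This is a minimal illustration, not the project's code: the pointer fields are assumed (the transcript dropped the `*`s), `HASH_SIZE` is an arbitrary illustrative value, and df handling is simplified to one bump per insert call rather than once per document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 101   /* illustrative; the slides give no value */

struct ngramstore {
    int *word_id;              /* dictionary ids of the Ngram's words */
    int cnt;                   /* term frequency (tf) */
    int doc_freq;              /* document frequency (df) */
    struct ngramstore *chain;  /* next entry in the same bucket */
};

static struct ngramstore *table[HASH_SIZE];

/* H(t) = sum of the words' dictionary ids, mod HASH_SIZE. */
static unsigned hash_ngram(const int *ids, int n) {
    unsigned h = 0;
    for (int i = 0; i < n; i++) h += (unsigned)ids[i];
    return h % HASH_SIZE;
}

/* Insert (or update) an n-word Ngram with tf occurrences.
   Simplification: df is bumped on every call, not once per document.
   The table holds a single Ngram size at a time, as the slides note. */
static struct ngramstore *upsert(const int *ids, int n, int tf) {
    unsigned h = hash_ngram(ids, n);
    for (struct ngramstore *e = table[h]; e; e = e->chain)
        if (memcmp(e->word_id, ids, n * sizeof *ids) == 0) {
            e->cnt += tf;
            e->doc_freq += 1;
            return e;
        }
    struct ngramstore *e = malloc(sizeof *e);
    e->word_id = malloc(n * sizeof *ids);
    memcpy(e->word_id, ids, n * sizeof *ids);
    e->cnt = tf;
    e->doc_freq = 1;
    e->chain = table[h];       /* chain into the bucket head */
    table[h] = e;
    return e;
}
```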

11
Steps
  • Inter-Process Reduction: Binomial Tree
  • i varies from 0 to log(n) - 1
  • Send → Recv: diff = 2^i
  • For any iteration, a process is a receiver if
    (id mod 2^(i+1)) == 0, else a sender
  • max_recv = (reductions - 1) /
    (int)pow((double)2, i+1)
  • Processors enter the next iteration by calling
    MPI_Barrier()

Iteration 0 (diff 1): 1 → 0, 3 → 2, 5 → 4, 7 → 6
Iteration 1 (diff 2): 2 → 0, 6 → 4
Iteration 2 (diff 4): 4 → 0
(Binomial-tree diagram over processors 0-7)
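The sender/receiver selection per iteration can be expressed as a small pure function, checkable against the 8-process pattern above. `partner` is a hypothetical helper name; the real code wraps this logic around MPI sends and receives.

```c
/* At iteration i (diff = 2^i), rank r is a receiver if
   r mod 2^(i+1) == 0, a sender if r mod 2^(i+1) == 2^i, and idle
   otherwise (its data was already merged in an earlier iteration).
   Returns the partner rank, or -1 if r does nothing this iteration. */
static int partner(int r, int i, int n) {
    int diff = 1 << i;          /* 2^i   */
    int step = diff << 1;       /* 2^(i+1) */
    if (r % step == 0)          /* receiver */
        return (r + diff < n) ? r + diff : -1;
    if (r % step == diff)       /* sender */
        return r - diff;
    return -1;                  /* idle */
}
```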
12
Inter-Process Reduction using Hashing
  • Reusing our hash technique and code from
    intra-process reduction
  • All processes use binomial tree collection
    pattern to reduce unique Ngrams
  • After log n steps process 0 has the final hash
    with all unique Ngrams

13
Scaling up to GigaWord?
  • Goal: reduce per-processor memory requirement
  • Cut off on term frequency: Ngrams with low tf are
    not going to score high
  • Observation: 66% of all trigrams have term
    frequency 1 in 1.2 GB of data
  • Unnecessary to carry such Ngrams
  • Solution: eliminate Ngrams with very low term
    frequency
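The term-frequency cut-off can be sketched as a compacting sweep. This is illustrative only: the cutoff value is hypothetical (the slides say only "very low"), and in the real system the sweep would walk the hash buckets rather than a flat array.

```c
/* Keep only entries with tf above the cutoff, compacting in place.
   Returns the number of surviving Ngrams; a parallel array of Ngram
   ids would be compacted the same way. Cutoff value is hypothetical. */
static int prune_low_tf(int *tf, int n, int cutoff) {
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (tf[i] > cutoff)
            tf[kept++] = tf[i];
    return kept;
}
```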

14
Pruning - Stoplist: Motivation
  • Similarly, Ngrams with high df are not going to
    score high
  • Memory hotspot: this elimination can be done only
    after intra-process collection
  • That defeats the goal of per-processor memory
    reduction
  • Need for an adaptive elimination

15
Pruning - Stoplist
  • Ngrams such as "IN THE FIRST" scored high using
    TFIDF measure
  • Eliminate such Ngrams to extract really
    interesting terms
  • Stoplist is a list of commonly occurring words
    such as the, a, to, from, is, first
  • Stoplist is based on our dictionary
  • Still evolving and currently contains 160 words
  • Eliminate Ngrams containing all words from the
    stoplist
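The elimination rule above (drop an Ngram only when every one of its words is in the stoplist) can be sketched as follows. The stoplist ids here are hypothetical placeholders; the real stoplist is derived from the dictionary, so keeping it sorted lets membership checks use binary search.

```c
#include <stdlib.h>

/* Hypothetical stoplist of dictionary ids, kept sorted for bsearch.
   The real list is dictionary-based and has ~160 entries. */
static const int stoplist[] = {3, 8, 12, 40, 41};
#define STOP_N (sizeof stoplist / sizeof stoplist[0])

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static int is_stopword(int id) {
    return bsearch(&id, stoplist, STOP_N, sizeof stoplist[0],
                   cmp_int) != NULL;
}

/* Eliminate an Ngram only if ALL of its words are stopwords;
   "IN THE FIRST" dies, but "IN THE BEGINNING WAS" survives. */
static int all_stopwords(const int *ids, int n) {
    for (int i = 0; i < n; i++)
        if (!is_stopword(ids[i])) return 0;
    return 1;
}
```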

16
Interesting 3-grams on GigaWord
17
Performance Analysis - Speedup
18
Space Complexity
  • Memory requirement increases for higher order
    Ngrams
  • Why?
  • Suppose there are n unique Ngrams in each article
    and m such articles
  • For higher order Ngrams, the number of unique
    Ngrams increases
  • We store each unique Ngram in our hash data
    structure
  • In the worst case, all Ngrams across articles are
    unique, so we have to store mn unique Ngrams per
    processor

19
Current Limitations
  • Static Dictionary
  • M through N Interesting Ngrams
  • Our hash data structure is designed to handle a
    single sized Ngram at a time
  • We provide M through N functionality by
    repetitively building all data structures
  • Not a scalable approach

20
Thanks!