1 SamudraManthan: Popular Terms
- Dinesh Bhirud
- Prasad Kulkarni
- Varada Kolhatkar
2 Architecture
[Architecture diagram: a Manager process coordinating multiple Worker processors]
3 Data Distribution
[Diagram: handshake protocol between the manager and the workers]
4 Data Distribution (contd.)
- The manager (processor 0) reads an article and passes it on to the other processors (workers) in a round-robin fashion.
- Before sending a new article to the same worker, the manager waits till that worker is ready to receive more data.
- A worker processes the article and creates its data structures before receiving a new article.
- Sends and receives are synchronous; a minimal sketch of the handshake follows this list.
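This protocol maps naturally onto blocking MPI calls. Below is a minimal sketch, assuming hypothetical helpers read_article() and process_article() and an arbitrary MAX_ARTICLE buffer size (none of these names are from the source); the synchronous MPI_Ssend is what makes the manager wait until a worker has actually posted its receive.

  /* Minimal manager/worker handshake sketch (nprocs must be >= 2). */
  #include <mpi.h>

  #define MAX_ARTICLE 65536
  #define TAG_ARTICLE 1

  extern int read_article(char *buf, int max);   /* hypothetical corpus reader */
  extern void process_article(const char *buf);  /* hypothetical worker logic  */

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      char buf[MAX_ARTICLE];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      if (rank == 0) {                       /* manager */
          int len, next = 1;
          while ((len = read_article(buf, MAX_ARTICLE)) > 0) {
              /* MPI_Ssend blocks until the worker posts its receive,
               * giving the "wait until the worker is ready" handshake. */
              MPI_Ssend(buf, len, MPI_CHAR, next, TAG_ARTICLE, MPI_COMM_WORLD);
              next = next % (nprocs - 1) + 1;          /* round robin over 1..nprocs-1 */
          }
          for (int w = 1; w < nprocs; w++)             /* empty message = no more work */
              MPI_Ssend(buf, 0, MPI_CHAR, w, TAG_ARTICLE, MPI_COMM_WORLD);
      } else {                               /* worker */
          MPI_Status st;
          int len;
          for (;;) {
              MPI_Recv(buf, MAX_ARTICLE, MPI_CHAR, 0, TAG_ARTICLE,
                       MPI_COMM_WORLD, &st);
              MPI_Get_count(&st, MPI_CHAR, &len);
              if (len == 0) break;           /* done */
              process_article(buf);          /* build suffix array etc. */
          }
      }
      MPI_Finalize();
      return 0;
  }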
5 Suffix Array, LCP Vector and Equivalence Classes
- A suffix array is a sorted array of the suffixes of an article.
- The LCP (longest common prefix) vector keeps track of repeating terms in the suffixes.
- We use the suffix array and LCP vector to partition an article's Ngrams into equivalence classes.
- Each class represents a group of Ngrams.
- Together the classes represent all Ngrams in the article, and no Ngram is represented more than once.
6 Example
A ROSE IS A ROSE

  S  Suffix             LCP  Trivial Classes  Non-trivial Classes
  0  A ROSE             0    <0,0>            <0,1>
  1  A ROSE IS A ROSE   2    <1,1>
  2  IS A ROSE          0    <2,2>
  3  ROSE               0    <3,3>            <3,4>
  4  ROSE IS A ROSE     1    <4,4>
  5  (empty suffix)     0
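For illustration, here is a minimal word-level construction of the suffix array and LCP vector for this example. This qsort-based sketch is our own illustration, not the project's actual construction code; it prints the sorted suffixes and LCP values shown in the table above.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static const char *words[] = { "A", "ROSE", "IS", "A", "ROSE" };
  static const int N = 5;

  /* Compare two suffixes word by word; a shorter suffix sorts first. */
  static int suf_cmp(const void *a, const void *b)
  {
      int i = *(const int *)a, j = *(const int *)b;
      while (i < N && j < N) {
          int c = strcmp(words[i], words[j]);
          if (c) return c;
          i++; j++;
      }
      return (i < N) - (j < N);
  }

  int main(void)
  {
      int sa[5], lcp[5];

      for (int k = 0; k < N; k++) sa[k] = k;
      qsort(sa, N, sizeof sa[0], suf_cmp);   /* sa = suffix array */

      /* lcp[k] = number of leading words shared with the previous suffix */
      lcp[0] = 0;
      for (int k = 1; k < N; k++) {
          int i = sa[k - 1], j = sa[k], l = 0;
          while (i + l < N && j + l < N && !strcmp(words[i + l], words[j + l]))
              l++;
          lcp[k] = l;
      }

      for (int k = 0; k < N; k++) {
          printf("%d lcp=%d :", k, lcp[k]);
          for (int w = sa[k]; w < N; w++) printf(" %s", words[w]);
          printf("\n");
      }
      return 0;
  }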
7 Advantages of Suffix Arrays
- Time complexity:
  - There can be at most 2N-1 classes, where N is the number of words in an article (N trivial classes plus at most N-1 non-trivial ones).
  - Ngrams of all/any sizes can be identified, along with their term frequencies (tf), in linear time.
- These data structures let us represent Ngrams of any and all sizes without actually storing them.
8 Intra-Processor Reduction Problem
- The suffix array data structure gives us article-level unique Ngrams with term frequencies.
- A processor processes multiple articles.
- We need to identify unique Ngrams across articles.
- We need a unique identifier for each word.
9 Dictionary: Our Savior
- The dictionary is a sorted list of all unique words in the Gigaword corpus.
- Dictionary ids form a unified basis for intra- and inter-process reduction; a lookup sketch follows this list.
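A minimal sketch of the word-to-id mapping, assuming the dictionary is held in memory as a sorted array of strings (dict and dict_size are hypothetical names, not from the source); the id is simply the word's index in the sorted list, so lookup is a binary search.

  #include <stdlib.h>
  #include <string.h>

  extern char **dict;      /* sorted list of unique corpus words */
  extern int dict_size;

  static int word_cmp(const void *key, const void *elem)
  {
      return strcmp((const char *)key, *(const char *const *)elem);
  }

  /* Returns the word's dictionary id (its index), or -1 if absent. */
  int dict_id(const char *word)
  {
      char **hit = bsearch(word, dict, dict_size, sizeof *dict, word_cmp);
      return hit ? (int)(hit - dict) : -1;
  }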
10 Intra-Processor Reduction
- We use a hash table to store unique Ngrams with their tf and df.
- Hashing function:
  - A simple mod hashing function:
  - H(t) = (sum over i of t(i)) mod HASH_SIZE, where t(i) is the dictionary id of word i in Ngram t.
- Hash data structure:

  struct ngramstore {
      int word_id;               /* dictionary id of the word */
      int cnt;                   /* term frequency (tf) */
      int doc_freq;              /* document frequency (df) */
      struct ngramstore *chain;  /* next entry in the bucket (chaining) */
  };
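A minimal insert routine built on this struct, assuming the summed-id reading of H(t) above; the HASH_SIZE value is arbitrary, and the single-word key comparison is a placeholder (a full implementation would compare all N dictionary ids of the Ngram, and would bump doc_freq only once per article).

  #include <stdlib.h>

  #define HASH_SIZE 1000003

  static struct ngramstore *table[HASH_SIZE];

  /* H(t) = sum of the Ngram's dictionary ids, mod HASH_SIZE */
  static unsigned hash_ngram(const int *ids, int n)
  {
      unsigned long h = 0;
      for (int i = 0; i < n; i++) h += (unsigned long)ids[i];
      return (unsigned)(h % HASH_SIZE);
  }

  void insert_ngram(const int *ids, int n)
  {
      unsigned h = hash_ngram(ids, n);
      for (struct ngramstore *p = table[h]; p; p = p->chain)
          if (p->word_id == ids[0]) {   /* simplified: real key is all N ids */
              p->cnt++;
              return;
          }
      struct ngramstore *e = malloc(sizeof *e);
      e->word_id = ids[0];
      e->cnt = 1;
      e->doc_freq = 1;                  /* per-article df bookkeeping elided */
      e->chain = table[h];              /* push onto the bucket's chain */
      table[h] = e;
  }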
11 Steps
- Inter-process reduction over a binomial tree:
  - i varies from 0 to log2(n) - 1.
  - Sender and receiver ranks differ by 2^i.
  - In any iteration, a process is a receiver if (id % 2^(i+1)) == 0, else a sender.
  - max_recv = (reductions - 1) / (int)pow((double)2, i + 1)
  - Processors enter the next iteration by calling MPI_Barrier().
Reduction steps for n = 8:
  i = 0: 1 -> 0, 3 -> 2, 5 -> 4, 7 -> 6
  i = 1: 2 -> 0, 6 -> 4
  i = 2: 4 -> 0
[Binomial tree diagram over processors 0-7]
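A skeleton of this schedule, assuming n is a power of two; send_hash() and merge_hash() are hypothetical stand-ins for shipping one process's hash table to its partner and merging it in.

  #include <mpi.h>

  extern void send_hash(int dest);    /* hypothetical: ship local hash   */
  extern void merge_hash(int src);    /* hypothetical: receive and merge */

  void binomial_reduce(int id, int n)
  {
      /* diff = 2^i for i = 0 .. log2(n) - 1 */
      for (int diff = 1; diff < n; diff <<= 1) {
          if (id % (2 * diff) == 0)         /* receiver this round   */
              merge_hash(id + diff);
          else if (id % diff == 0)          /* sender, then drops out */
              send_hash(id - diff);
          MPI_Barrier(MPI_COMM_WORLD);      /* next iteration in lockstep */
      }
      /* after log2(n) rounds, process 0 holds the merged hash */
  }

For n = 8 this reproduces the schedule above: odd ranks send in round 0, ranks 2 and 6 in round 1, and rank 4 in round 2.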
12 Inter-Process Reduction Using Hashing
- We reuse our hashing technique and code from the intra-process reduction.
- All processes use the binomial tree collection pattern to reduce unique Ngrams.
- After log n steps, process 0 has the final hash with all unique Ngrams.
13 Scaling up to GigaWord?
- Goal: reduce the per-processor memory requirement.
- Cut off by term frequency:
  - Ngrams with low tf are not going to score high.
  - Observation: 66% of all trigrams have a term frequency of 1 in 1.2 GB of data.
  - It is unnecessary to carry such Ngrams.
- Solution: eliminate Ngrams with very low term frequency, as sketched below.
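A sketch of the cutoff pass over the chained hash table from the intra-processor reduction sketch (the struct, table, and HASH_SIZE are repeated here so the fragment stands alone); the threshold is a parameter, matching the observation that tf = 1 trigrams dominate.

  #include <stdlib.h>

  #define HASH_SIZE 1000003

  struct ngramstore {
      int word_id;
      int cnt;
      int doc_freq;
      struct ngramstore *chain;
  };

  extern struct ngramstore *table[HASH_SIZE];

  /* Remove every Ngram whose tf is below the cutoff, bucket by bucket. */
  void prune_low_tf(int cutoff)
  {
      for (int h = 0; h < HASH_SIZE; h++) {
          struct ngramstore **pp = &table[h];
          while (*pp) {
              if ((*pp)->cnt < cutoff) {
                  struct ngramstore *dead = *pp;
                  *pp = dead->chain;        /* unlink from the chain */
                  free(dead);
              } else {
                  pp = &(*pp)->chain;
              }
          }
      }
  }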
14 Pruning: Stoplist Motivation
- Similarly, Ngrams with high df are not going to score high.
- Memory hotspot:
  - This elimination can be done only after intra-process collection.
  - That defeats the goal of per-processor memory reduction.
- Hence the need for an adaptive elimination.
15 Pruning: Stoplist
- Ngrams such as "IN THE FIRST" scored high under the TF-IDF measure.
- We eliminate such Ngrams to extract the really interesting terms.
- The stoplist is a list of commonly occurring words such as the, a, to, from, is, first.
- The stoplist is based on our dictionary.
- It is still evolving and currently contains 160 words.
- We eliminate Ngrams consisting entirely of words from the stoplist; a sketch of the test follows this list.
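A minimal sketch of the elimination test, assuming a hypothetical is_stopword() lookup keyed on dictionary ids (since the stoplist is dictionary-based): an Ngram is dropped only when every word in it is a stopword.

  extern int is_stopword(int dict_id);   /* hypothetical stoplist lookup */

  /* Returns 1 if the Ngram should be eliminated. */
  int all_stopwords(const int *ids, int n)
  {
      for (int i = 0; i < n; i++)
          if (!is_stopword(ids[i]))
              return 0;      /* at least one content word: keep it */
      return 1;              /* every word is in the stoplist: drop */
  }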
16 Interesting 3-grams on GigaWord
17 Performance Analysis: Speedup
18 Space Complexity
- The memory requirement increases for higher-order Ngrams. Why?
  - Suppose there are n unique Ngrams in each article and m such articles.
  - For higher-order Ngrams, the number of unique Ngrams increases.
  - We store each unique Ngram in our hash data structure.
  - In the worst case all Ngrams across articles are unique, so we have to store m*n unique Ngrams per processor.
19 Current Limitations
- Static dictionary.
- M-through-N interesting Ngrams:
  - Our hash data structure is designed to handle a single Ngram size at a time.
  - We provide M-through-N functionality by repeatedly rebuilding all data structures.
  - This is not a scalable approach.
20 Thanks!