1
SamudraManthan Popular terms
  • Dinesh Bhirud
  • Prasad Kulkarni
  • Varada Kolhatkar

2
Architecture
MANAGER
WORKER PROCESSORS
3
Data Distribution
Handshake protocol
4
Data Distribution (contd)
  • Manager (processor 0) reads an article and passes
    it on to the other processors (workers) in
    round-robin fashion
  • Before sending a new article to the same worker,
    the manager waits until that worker is ready to
    receive more data
  • The worker processes the article and builds its
    data structures before receiving a new article
  • Sends and receives are synchronous
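The round-robin rotation in the handshake above can be sketched as a tiny pure function. This is an illustration, not the authors' code: `next_worker` is a hypothetical helper that models only the rank rotation; in the real protocol the manager also waits for the worker's ready signal (W→M) via synchronous MPI sends/receives before dispatching the next article.

```c
/* Hypothetical sketch of the manager's round-robin dispatch order.
   Workers are ranks 1..P-1; rank 0 is the manager and is skipped.
   The ready-signal / data-message handshake itself (synchronous
   MPI send/recv) is omitted here. */
static int next_worker(int prev, int nprocs) {
    int w = prev + 1;
    if (w >= nprocs) w = 1;   /* wrap around past the last worker */
    return w;
}
```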

5
Suffix Array, LCP Vector And Equivalence Classes
  • Suffix array is a sorted array of suffixes
  • LCP vector keeps track of repeating terms in
    suffixes
  • We use suffix arrays and LCP vector to partition
    articles into classes
  • Each class represents a group of Ngrams
  • Classes represent all Ngrams in the article and
    no Ngram is represented more than once

6
Example

A ROSE IS A ROSE

S  Suffix              LCP  Trivial Classes  Non-trivial Classes
0  A ROSE              0    <0,0>            <0,1>
1  A ROSE IS A ROSE    2    <1,1>
2  IS A ROSE           0    <2,2>
3  ROSE                0    <3,3>            <3,4>
4  ROSE IS A ROSE      1    <4,4>
5  (empty suffix)      0
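The suffix array and LCP vector for this example can be reproduced with a short word-level sketch. This is a naive illustration under assumed helper names (`build`, `suf_cmp`, `lcp_words`), using a comparison sort rather than whatever construction the real system uses, and omitting the empty sentinel suffix in the last row.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N 5
static const char *words[N] = {"A", "ROSE", "IS", "A", "ROSE"};
static int sa[N];   /* sorted suffix start positions */
static int lcp[N];  /* LCP (in words) with the previous suffix */

/* Compare the word-level suffixes starting at indices a and b. */
static int suf_cmp(const void *pa, const void *pb) {
    int a = *(const int *)pa, b = *(const int *)pb;
    while (a < N && b < N) {
        int c = strcmp(words[a], words[b]);
        if (c) return c;
        a++; b++;
    }
    return (a < N) - (b < N);   /* the shorter suffix sorts first */
}

/* Longest common prefix, in words, of the suffixes at a and b. */
static int lcp_words(int a, int b) {
    int k = 0;
    while (a + k < N && b + k < N &&
           strcmp(words[a + k], words[b + k]) == 0)
        k++;
    return k;
}

static void build(void) {
    for (int i = 0; i < N; i++) sa[i] = i;
    qsort(sa, N, sizeof sa[0], suf_cmp);
    lcp[0] = 0;                     /* first suffix has no predecessor */
    for (int i = 1; i < N; i++) lcp[i] = lcp_words(sa[i - 1], sa[i]);
}
```

Running `build()` on "A ROSE IS A ROSE" yields the sorted order and LCP column shown in the table above.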
7
Advantages of Suffix Arrays
  • Time Complexity
  • There can be at most 2N-1 classes, where N is the
    number of words in an article
  • Ngrams of all/any sizes can be identified with
    their tfs in linear time
  • These data structures enable us to represent all
    and any sized Ngrams without actually storing
    them

8
Intra-Processor Reduction Problem
  • Suffix array data structure gives us article
    level unique Ngrams with term frequencies
  • A processor processes multiple articles
  • Need to identify unique Ngrams across articles
  • Need to have a unique identifier for each word

9
Dictionary: Our Savior
  • Dictionary is a sorted list of all unique words
    in the Gigaword corpus
  • Dictionary ids form a unified basis for
    intra/inter process reduction

10
Intra-Processor Reduction
  • Used a hash table to store unique Ngrams with tf
    and df
  • Hashing function
  • Simple mod hashing function
  • H(t) = Σ t(i) mod HASH_SIZE, where t(i) is the
    dictionary id of word i in Ngram t
  • Hash data structure
  • struct ngramstore {
        int *word_id;
        int cnt;
        int doc_freq;
        struct ngramstore *chain;
    };
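The chained hash above can be sketched as follows. This is a minimal illustration, not the project's code: the pointer fields are assumed (the transcript dropped the `*`s), `HASH_SIZE` is an arbitrary illustrative value, and df handling is simplified to one bump per insert call rather than once per document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 101   /* illustrative; the slides give no value */

struct ngramstore {
    int *word_id;              /* dictionary ids of the Ngram's words */
    int cnt;                   /* term frequency (tf) */
    int doc_freq;              /* document frequency (df) */
    struct ngramstore *chain;  /* next entry in the same bucket */
};

static struct ngramstore *table[HASH_SIZE];

/* H(t) = sum of the words' dictionary ids, mod HASH_SIZE. */
static unsigned hash_ngram(const int *ids, int n) {
    unsigned h = 0;
    for (int i = 0; i < n; i++) h += (unsigned)ids[i];
    return h % HASH_SIZE;
}

/* Insert (or update) an n-word Ngram with tf occurrences.
   Simplification: df is bumped on every call, not once per document.
   The table holds a single Ngram size at a time, as the slides note. */
static struct ngramstore *upsert(const int *ids, int n, int tf) {
    unsigned h = hash_ngram(ids, n);
    for (struct ngramstore *e = table[h]; e; e = e->chain)
        if (memcmp(e->word_id, ids, n * sizeof *ids) == 0) {
            e->cnt += tf;
            e->doc_freq += 1;
            return e;
        }
    struct ngramstore *e = malloc(sizeof *e);
    e->word_id = malloc(n * sizeof *ids);
    memcpy(e->word_id, ids, n * sizeof *ids);
    e->cnt = tf;
    e->doc_freq = 1;
    e->chain = table[h];       /* chain into the bucket head */
    table[h] = e;
    return e;
}
```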

11
Steps
  • Inter-Process Reduction: Binomial Tree
  • i varies from 0 to log(n) - 1
  • Send → Recv: diff = 2^i
  • For any iteration, a process is a receiver if
    (id mod 2^(i+1)) == 0, else a sender
  • max_recv = (reductions - 1) /
    (int)pow((double)2, i+1)
  • Processors enter the next iteration by calling
    MPI_Barrier()

Iteration 0 (diff 1): 1 → 0, 3 → 2, 5 → 4, 7 → 6
Iteration 1 (diff 2): 2 → 0, 6 → 4
Iteration 2 (diff 4): 4 → 0
(Binomial-tree diagram over processors 0-7)
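The sender/receiver selection per iteration can be expressed as a small pure function, checkable against the 8-process pattern above. `partner` is a hypothetical helper name; the real code wraps this logic around MPI sends and receives.

```c
/* At iteration i (diff = 2^i), rank r is a receiver if
   r mod 2^(i+1) == 0, a sender if r mod 2^(i+1) == 2^i, and idle
   otherwise (its data was already merged in an earlier iteration).
   Returns the partner rank, or -1 if r does nothing this iteration. */
static int partner(int r, int i, int n) {
    int diff = 1 << i;          /* 2^i   */
    int step = diff << 1;       /* 2^(i+1) */
    if (r % step == 0)          /* receiver */
        return (r + diff < n) ? r + diff : -1;
    if (r % step == diff)       /* sender */
        return r - diff;
    return -1;                  /* idle */
}
```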
12
Inter-Process Reduction using Hashing
  • Reusing our hash technique and code from
    intra-process reduction
  • All processes use binomial tree collection
    pattern to reduce unique Ngrams
  • After log n steps process 0 has the final hash
    with all unique Ngrams

13
Scaling up to GigaWord?
  • Goal: reduce per-processor memory requirement
  • Cut off on term frequency: Ngrams with low tf are
    not going to score high
  • Observation: 66% of all trigrams have term
    frequency 1 in 1.2 GB of data
  • Unnecessary to carry such Ngrams
  • Solution: eliminate Ngrams with very low term
    frequency
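The term-frequency cut-off can be sketched as a compacting sweep. This is illustrative only: the cutoff value is hypothetical (the slides say only "very low"), and in the real system the sweep would walk the hash buckets rather than a flat array.

```c
/* Keep only entries with tf above the cutoff, compacting in place.
   Returns the number of surviving Ngrams; a parallel array of Ngram
   ids would be compacted the same way. Cutoff value is hypothetical. */
static int prune_low_tf(int *tf, int n, int cutoff) {
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (tf[i] > cutoff)
            tf[kept++] = tf[i];
    return kept;
}
```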

14
Pruning - Stoplist: Motivation
  • Similarly, Ngrams with high df are not going to
    score high
  • Memory hotspot: this elimination can be done only
    after intra-process collection
  • That defeats the goal of per-processor memory
    reduction
  • Need for an adaptive elimination

15
Pruning - Stoplist
  • Ngrams such as "IN THE FIRST" scored high using
    TFIDF measure
  • Eliminate such Ngrams to extract really
    interesting terms
  • Stoplist is a list of commonly occurring words
    such as the, a, to, from, is, first
  • Stoplist is based on our dictionary
  • Still evolving and currently contains 160 words
  • Eliminate Ngrams containing all words from the
    stoplist
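The elimination rule above (drop an Ngram only when every one of its words is in the stoplist) can be sketched as follows. The stoplist ids here are hypothetical placeholders; the real stoplist is derived from the dictionary, so keeping it sorted lets membership checks use binary search.

```c
#include <stdlib.h>

/* Hypothetical stoplist of dictionary ids, kept sorted for bsearch.
   The real list is dictionary-based and has ~160 entries. */
static const int stoplist[] = {3, 8, 12, 40, 41};
#define STOP_N (sizeof stoplist / sizeof stoplist[0])

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

static int is_stopword(int id) {
    return bsearch(&id, stoplist, STOP_N, sizeof stoplist[0],
                   cmp_int) != NULL;
}

/* Eliminate an Ngram only if ALL of its words are stopwords;
   "IN THE FIRST" dies, but "IN THE BEGINNING WAS" survives. */
static int all_stopwords(const int *ids, int n) {
    for (int i = 0; i < n; i++)
        if (!is_stopword(ids[i])) return 0;
    return 1;
}
```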

16
Interesting 3-grams on GigaWord
17
Performance Analysis - Speedup
18
Space Complexity
  • Memory requirement increases for higher order
    Ngrams
  • Why?
  • Suppose there are n unique Ngrams in each article
    and m such articles
  • For higher order Ngrams, the number of unique
    Ngrams increases
  • We store each unique Ngram in our hash data
    structure
  • In the worst case, all Ngrams across articles are
    unique, so we have to store mn unique Ngrams per
    processor

19
Current Limitations
  • Static Dictionary
  • M through N Interesting Ngrams
  • Our hash data structure is designed to handle a
    single sized Ngram at a time
  • We provide M through N functionality by
    repetitively building all data structures
  • Not a scalable approach

20
Thanks!