Distributed Iterative Training

Presentation by Kevin Gimpel, Shay Cohen, Severin Hacker, and Noah A. Smith. Slides: 22.

Transcript and Presenter's Notes


1
Distributed Iterative Training
Kevin Gimpel Shay Cohen Severin Hacker
Noah A. Smith
2
Outline
  • The Problem
  • Distributed Architecture
  • Experiments and Hadoop Issues

3
Iterative Training
  • Many problems in NLP and machine learning require
    iterating over large training sets many times
  • Training log-linear models (logistic regression,
    conditional random fields)
  • Unsupervised or semi-supervised learning with EM
    (word alignment in MT, grammar induction)
  • Minimum Error-Rate Training in MT
  • Online learning (MIRA, perceptron, stochastic
    gradient descent)
  • All of the above except online learning can be
    easily parallelized:
  • Compute statistics on sections of the data
    independently
  • Aggregate them
  • Update parameters using statistics of full set of
    data
  • Repeat until a stopping criterion is met
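
The loop above can be sketched in a few lines. This is only an illustration, not the authors' code: the 1-D least-squares model, the shard layout, the step size, and every function name are assumptions chosen to keep the example small.

```python
# Hypothetical sketch of the compute / aggregate / update / repeat loop.
# The model (1-D least squares, y ~ w * x) is an assumption for illustration.

def shard_statistics(shard, w):
    """Gradient of the squared error on one section of the data."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def train(shards, w=0.0, step=0.01, tol=1e-9, max_iters=1000):
    n = sum(len(s) for s in shards)
    for _ in range(max_iters):
        # Each section is processed independently (this part parallelizes).
        partial = [shard_statistics(s, w) for s in shards]
        # Aggregate the per-section statistics.
        grad = sum(partial) / n
        # Update parameters using statistics of the full data set.
        new_w = w - step * grad
        # Stopping criterion: the parameter has stopped moving.
        if abs(new_w - w) < tol:
            return new_w
        w = new_w
    return w

# Toy data generated from y = 2x, split into three sections.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
w_hat = train(shards)
print(w_hat)
```

Only the list comprehension over `shards` would need to move to separate machines; the aggregation and update stay on a single coordinator.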

4
Dependency Grammar Induction
  • Given sentences of natural language text, infer
    (dependency) parse trees
  • State-of-the-art results obtained using only a
    few thousand sentences of length at most 10
    tokens (Smith and Eisner, 2006)
  • This talk: scaling up to more and longer
    sentences using Hadoop!

5
Dependency Grammar Induction
  • Training:
  • Input is a set of sentences (actually, POS tag
    sequences) and a grammar with initial parameter
    values
  • Run an iterative optimization algorithm (EM,
    LBFGS, etc.) that changes the parameter values on
    each iteration
  • Output is a learned set of parameter values
  • Testing:
  • Use grammar with learned parameters to parse a
    small set of test sentences
  • Evaluate by computing the percentage of predicted
    edges that match a human annotator's edges
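
The evaluation step can be made concrete with a small sketch. The representation is an assumption: each parse is a list of head indices, one per token, with 0 standing for the root. The function name is hypothetical.

```python
# Hypothetical sketch of the edge-matching evaluation.
# Assumption: a parse is a list of head indices, one per token (0 = root).

def attachment_accuracy(predicted, gold):
    """Percentage of predicted edges that match the annotated edges."""
    correct = total = 0
    for pred_heads, gold_heads in zip(predicted, gold):
        if len(pred_heads) != len(gold_heads):
            raise ValueError("parses must cover the same tokens")
        correct += sum(p == g for p, g in zip(pred_heads, gold_heads))
        total += len(gold_heads)
    return 100.0 * correct / total

# Two toy sentences; the second has one wrong head out of four.
predicted = [[2, 0, 2], [0, 1, 2, 3]]
gold = [[2, 0, 2], [0, 1, 1, 3]]
accuracy = attachment_accuracy(predicted, gold)
print(f"{accuracy:.1f}% of edges match")
```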

6
Outline
  • The Problem
  • Distributed Architecture
  • Experiments and Hadoop Issues

7
MapReduce for Grammar Induction
  • MapReduce was designed for:
  • Large amounts of data distributed across many
    disks
  • Simple data processing
  • We have:
  • (Relatively) small amounts of data
  • Expensive processing and high memory requirements

8
MapReduce for Grammar Induction
  • Algorithms require 50-100 iterations for
    convergence
  • Each iteration requires a full sweep over all
    training data
  • Computational bottleneck is computing expected
    counts for EM on each iteration (gradient for
    LBFGS)
  • Our approach: run one MapReduce job for each
    iteration
  • Map: compute expected counts (gradient)
  • Reduce: aggregate
  • Offline: renormalize (EM) or modify parameter
    values (LBFGS)
  • Note: renormalization could be done in reduce
    tasks for EM with correct partition functions,
    but using LBFGS in multiple reduce tasks is
    trickier
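
One way to picture the map / reduce / offline split for EM is the sketch below. It is not the grammar induction model from the talk: the model is a toy two-component mixture of unigram distributions with a uniform prior, and all names and data are assumptions. Only the division of work mirrors the bullets above.

```python
from collections import defaultdict
from math import prod

def map_task(sentences, theta):
    """Map: expected counts for one section of the data (E-step)."""
    counts = defaultdict(float)
    for sent in sentences:
        # Posterior over the two components (uniform prior assumed).
        joint = [prod(theta[c][x] for x in sent) for c in (0, 1)]
        z = sum(joint)
        for c in (0, 1):
            for x in sent:
                counts[(c, x)] += joint[c] / z
    return counts

def reduce_task(partials):
    """Reduce: sum expected counts across map outputs."""
    total = defaultdict(float)
    for part in partials:
        for key, value in part.items():
            total[key] += value
    return total

def renormalize(counts):
    """Offline step (M-step): normalize counts within each component."""
    norm = defaultdict(float)
    for (c, _), value in counts.items():
        norm[c] += value
    theta = {0: {}, 1: {}}
    for (c, x), value in counts.items():
        theta[c][x] = value / norm[c]
    return theta

# Toy data: two sections, each holding two "sentences" of symbols.
splits = [[list("aaab"), list("aaaa")], [list("bbba"), list("bbbb")]]
theta = {0: {"a": 0.6, "b": 0.4}, 1: {"a": 0.4, "b": 0.6}}
for _ in range(50):  # one "job" per iteration
    theta = renormalize(reduce_task(map_task(s, theta) for s in splits))
print(theta)
```

Here `renormalize` runs outside the map and reduce functions, matching the "offline" step; for a gradient-based optimizer the same slot would hold the parameter update instead.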

9
MapReduce Implementation
Server
  1. Normalize expected counts to get new parameter
    values
  2. Start new MapReduce job, placing new parameter
    values on distributed cache

Distributed Cache
Map: compute expected counts
Reduce: aggregate expected counts
10
Running Experiments
  • We use streaming for all experiments with 2 C
    programs: server and map (reduce is a simple
    summer)
  • > cd /home/kgimpel/grammar_induction
  • > hod allocate -d /home/kgimpel/grammar_induction
    -n 25
  • > ./dep_induction_server \
      input_file=/user/kgimpel/data/train20-20parts \
      aux_file=aux.train20 output_file=model.train20 \
      hod_config=/home/kgimpel/grammar_induction \
      num_reduce_tasks=5 1> stdout 2> stderr
  • dep_induction_server runs a MapReduce job on each
    iteration

Input split into pieces for map tasks (dataset
too small for default Hadoop splitter)
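
A "simple summer" reducer for streaming might look like the sketch below. The authors' programs were written in C; this Python version, the parameter keys `p0` and `p1`, and the function names are assumptions. It relies only on the streaming convention that the reducer receives tab-separated key/value lines on standard input, sorted by key.

```python
import sys
from itertools import groupby

def sum_by_key(lines):
    """Sum values of consecutive lines sharing a key ("key<TAB>value")."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(float(value) for _, value in group)

def main():
    # Streaming feeds the reducer its input on stdin, sorted by key.
    for key, total in sum_by_key(sys.stdin):
        print(f"{key}\t{total}")

# Small in-memory check; a real reducer script would call main() instead.
example = ["p0\t0.5\n", "p0\t1.25\n", "p1\t2.0\n"]
result = list(sum_by_key(example))
print(result)
```

Because the input arrives grouped by key, the reducer never holds more than one running total in memory.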
11
Outline
  • The Problem
  • Distributed Architecture
  • Experiments and Hadoop Issues

12
Speed-up with Hadoop
  • 38,576 sentences
  • 40 words / sent.
  • 40 nodes
  • 5 reduce tasks
  • Average iteration time reduced from 2039 s to
    115 s
  • Total time reduced from 3400 minutes to 200
    minutes
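
The figures on this slide imply the following speed-up. The efficiency line assumes the 2039 s baseline was measured on a single node, which the slide does not state.

```python
# Numbers taken from the slide; variable names are illustrative.
serial_iter_s, hadoop_iter_s = 2039.0, 115.0
serial_total_min, hadoop_total_min = 3400.0, 200.0
nodes = 40

iter_speedup = serial_iter_s / hadoop_iter_s          # about 17.7x
total_speedup = serial_total_min / hadoop_total_min   # 17x
# Assumption: the baseline ran on one node, so ideal speed-up would be 40x.
efficiency = iter_speedup / nodes

print(f"per-iteration speed-up: {iter_speedup:.1f}x")
print(f"total speed-up:         {total_speedup:.1f}x")
print(f"parallel efficiency:    {efficiency:.0%}")
```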

13
Hadoop Issues
  1. Overhead of running a single MapReduce job
  2. Stragglers in the map phase

14
Typical Iteration (40 nodes, 38,576 sentences)
  • 23:17:05 map 0% reduce 0%
  • 23:17:12 map 3% reduce 0%
  • 23:17:13 map 26% reduce 0%
  • 23:17:14 map 49% reduce 0%
  • 23:17:15 map 66% reduce 0%
  • 23:17:16 map 72% reduce 0%
  • 23:17:17 map 97% reduce 0%
  • 23:17:18 map 100% reduce 0%
  • 23:18:00 map 100% reduce 1%
  • 23:18:15 map 100% reduce 2%
  • 23:18:18 map 100% reduce 4%
  • 23:18:20 map 100% reduce 15%
  • 23:18:27 map 100% reduce 17%
  • 23:18:28 map 100% reduce 18%
  • 23:18:30 map 100% reduce 23%
  • 23:18:32 map 100% reduce 100%

Consistent 40-second delay between map and reduce
phases
  • 115 s per iteration total
  • 40 s per iteration of overhead
  • When we're running 100 iterations per experiment,
    40 seconds per iteration really adds up!

About a third of execution time (40 s of 115 s) is overhead!
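
The delay between the phases can be read straight off the progress log. The sketch below assumes the six-digit stamps are HH:MM:SS times and the numbers after "map" and "reduce" are percentages; the function names are hypothetical, and only four of the logged lines are reproduced.

```python
from datetime import datetime

# A few lines from the typical iteration, stamps read as HH:MM:SS.
LOG = """\
23:17:05 map 0% reduce 0%
23:17:18 map 100% reduce 0%
23:18:00 map 100% reduce 1%
23:18:32 map 100% reduce 100%
"""

def parse(line):
    """Return (time, map percent, reduce percent) for one progress line."""
    stamp, _, map_pct, _, reduce_pct = line.split()
    t = datetime.strptime(stamp.replace(":", ""), "%H%M%S")
    return t, int(map_pct.rstrip("%")), int(reduce_pct.rstrip("%"))

def phase_gap_seconds(lines):
    """Seconds between map reaching 100% and the first reduce progress."""
    rows = [parse(line) for line in lines if line.strip()]
    map_done = next(t for t, m, _ in rows if m == 100)
    reduce_start = next(t for t, _, r in rows if r > 0)
    return (reduce_start - map_done).total_seconds()

gap = phase_gap_seconds(LOG.splitlines())
print(f"gap between map and reduce phases: {gap:.0f} s")
```

The parser strips colons and percent signs before converting, so it also accepts the bare form `231705 map 0 reduce 0`.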
15
Typical Iteration (40 nodes, 38,576 sentences)

Why does reduce take so long?
  • 5 reduce tasks used
  • Reduce phase is simply aggregation of values for
    2600 parameters

16
Histogram of Iteration Times
Mean 115 s
17
Histogram of Iteration Times
Mean 115 s
What's going on here?
20
Typical Iteration
  • 23:17:05 map 0% reduce 0%
  • 23:17:12 map 3% reduce 0%
  • 23:17:13 map 26% reduce 0%
  • 23:17:14 map 49% reduce 0%
  • 23:17:15 map 66% reduce 0%
  • 23:17:16 map 72% reduce 0%
  • 23:17:17 map 97% reduce 0%
  • 23:17:18 map 100% reduce 0%
  • 23:18:00 map 100% reduce 1%
  • 23:18:15 map 100% reduce 2%
  • 23:18:18 map 100% reduce 4%
  • 23:18:20 map 100% reduce 15%
  • 23:18:27 map 100% reduce 17%
  • 23:18:28 map 100% reduce 18%
  • 23:18:30 map 100% reduce 23%
  • 23:18:32 map 100% reduce 100%

Slow Iteration
  • 23:20:27 map 0% reduce 0%
  • 23:20:34 map 5% reduce 0%
  • 23:20:35 map 20% reduce 0%
  • 23:20:36 map 41% reduce 0%
  • 23:20:37 map 56% reduce 0%
  • 23:20:38 map 74% reduce 0%
  • 23:20:39 map 95% reduce 0%
  • 23:20:40 map 97% reduce 0%
  • 23:21:32 map 97% reduce 1%
  • 23:21:37 map 97% reduce 2%
  • 23:21:42 map 97% reduce 12%
  • 23:21:43 map 97% reduce 15%
  • 23:21:47 map 97% reduce 19%
  • 23:21:50 map 97% reduce 21%
  • 23:21:52 map 97% reduce 26%
  • 23:21:57 map 97% reduce 31%
  • 23:21:58 map 97% reduce 32%
  • 23:23:46 map 100% reduce 32%
  • 23:24:54 map 100% reduce 46%
  • 23:24:55 map 100% reduce 86%
  • 23:24:56 map 100% reduce 100%

3 minutes waiting for last map tasks to
complete
Suggestions? (Doesn't Hadoop replicate map tasks
to avoid this?)
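
The straggler effect can be quantified by timing how long the map phase sits between 97% and 100%. The sketch below uses three progress points from each log, with stamps read as HH:MM:SS; the function name and the 97% threshold are illustrative choices.

```python
from datetime import datetime

def tail_seconds(progress, pct=97):
    """Seconds between map first reaching `pct` and map reaching 100."""
    clock = lambda s: datetime.strptime(s, "%H:%M:%S")
    reached = next(clock(t) for t, m in progress if m >= pct)
    done = next(clock(t) for t, m in progress if m == 100)
    return (done - reached).total_seconds()

# (time, map percent) pairs taken from the two logs.
typical = [("23:17:16", 72), ("23:17:17", 97), ("23:17:18", 100)]
slow = [("23:20:39", 95), ("23:20:40", 97), ("23:23:46", 100)]

typical_tail = tail_seconds(typical)
slow_tail = tail_seconds(slow)
print(f"typical: {typical_tail:.0f} s, slow: {slow_tail:.0f} s")
```

In the slow iteration the last 3% of map progress takes just over three minutes, against one second in the typical iteration.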
21
Questions?