Title: Use of CUDA for Continuous Space Language Model
1. Use of CUDA for Continuous Space Language Model
- Elizabeth A. Thompson, Ph.D. (a)
- Timothy R. Anderson, Ph.D. (b)
(a) Purdue University Fort Wayne, Fort Wayne, IN, USA 46805
(b) Air Force Research Lab, Wright-Patterson Air Force Base, Dayton, OH, USA 45433
2. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
3. Continuous-Space Language Models (CSLM)
- This work was based on the article "Continuous-Space Language Models for Statistical Machine Translation" by Holger Schwenk of the University of Le Mans, France, published in the Prague Bulletin of Mathematical Linguistics, January 2010, and his corresponding open source implementation.
4. CSLM (Cont'd)
5. CSLM (Cont'd)
- The CSLM consists of a 3-layer neural network: projection layer, hidden layer, output layer.
- Input: a 3-word sequence.
- Output: the probability of every word in the vocabulary being the 4th word in the sequence.
6. Training of the CSLM
- The neural network must be trained through a process of adaptive learning.
- It is trained using a series of 63,070 4-grams, for example:
  - Prague Stock Market falls
  - Stock Market falls to
  - Market falls to minus
  - falls to minus by
- In each 4-gram, the first 3 words are the input and the 4th word is the target word.
7. Training of the CSLM (Cont'd)
- Text file vocab.txt contains the list of vocabulary terms.
- Each of the 14,024 terms in vocab.txt is assigned a numerical index, which is used for training the neural network (a code sketch follows below):

  Index   Term
  0       gt
  1       -
  ...
  619     abandon
  ...
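As a purely illustrative sketch (not part of Schwenk's released code), the vocabulary index and the 4-gram training examples of the last two slides could be built roughly as follows; the corpus file name train.txt and all variable names are our assumptions.

```cpp
#include <array>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: build the term-to-index table from vocab.txt and
// slice a tokenized corpus into 4-gram training examples (3-word context
// plus target word), as described on the previous slides.
int main() {
    std::unordered_map<std::string, int> vocab;   // term -> numerical index
    std::ifstream vf("vocab.txt");
    std::string term;
    for (int idx = 0; vf >> term; ++idx)
        vocab.emplace(term, idx);                 // e.g. "abandon" -> 619

    std::vector<int> corpus;                      // training text as word indices
    std::ifstream tf("train.txt");                // hypothetical corpus file
    while (tf >> term)
        if (vocab.count(term)) corpus.push_back(vocab[term]);  // skip unknown words

    std::vector<std::array<int, 3>> contexts;     // network inputs
    std::vector<int> targets;                     // 4th word of each 4-gram
    for (std::size_t i = 0; i + 3 < corpus.size(); ++i) {
        contexts.push_back({corpus[i], corpus[i + 1], corpus[i + 2]});
        targets.push_back(corpus[i + 3]);
    }
    return 0;
}
```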
8. Training the Neural Network
- In the training stage, values are propagated in
the forward direction through the neural network
in order to assign weighting values to the input
data, and then errors are propagated in the
reverse direction to improve these weighting
factors.
9. Projection Layer
- The projection layer maps each of the 3 input words to a unique 256-length sequence.
- Initially, these are generated as uniformly distributed random values, but their values change as the neural network is trained.
- For each input word, the corresponding 256-length sequence is the output of the projection layer.
10. Projection Layer
- The projection layer consists of a lookup table (a lookup sketch follows below):

  0      -0.100000   0.009774  ...
  1      -0.099803   0.001762  ...
  2      -0.091674  -0.081308  ...
  ...
  14023  -0.079890  -0.067392  ...
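A minimal sketch of this lookup, assuming the table is stored row-major as 14024 rows of 256 floats (function and variable names are ours): each of the 3 input word indices selects one row, and the 3 rows are concatenated into the 768-value projection-layer output.

```cpp
#include <vector>

// Hypothetical sketch of the projection-layer lookup for one 3-word input.
// 'table' holds the 14024 x 256 lookup table of the previous slide, row-major;
// the three selected rows are concatenated into a 768-value output vector.
std::vector<float> project(const std::vector<float>& table,   // 14024*256 values
                           int w0, int w1, int w2) {
    const int dim = 256;
    std::vector<float> out(3 * dim);
    const int words[3] = {w0, w1, w2};
    for (int k = 0; k < 3; ++k)
        for (int j = 0; j < dim; ++j)
            out[k * dim + j] = table[words[k] * dim + j];   // copy one table row
    return out;
}
```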
11. Hidden Layer
- For the forward pass, the output of the projection layer is fed as input to the hidden layer (a reference sketch follows below):
  - 192x768 weight matrix
  - 192x128 bias matrix
  - 768x128 output of projection layer
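In equation form, the hidden layer computes H = tanh(W·P + B) for a batch of 128 examples (the batch size implied by the matrix sizes above); the tanh activation is assumed from Schwenk's CSLM description. A naive CPU reference, for illustration only:

```cpp
#include <cmath>
#include <vector>

// Hypothetical CPU reference for the hidden-layer forward pass:
// H = tanh(W * P + B), with W 192x768, P 768x128 (projection output for a
// batch of 128 examples), B and H 192x128. All matrices are row-major.
void hidden_forward(const std::vector<float>& W,   // 192 x 768 weights
                    const std::vector<float>& P,   // 768 x 128 projection output
                    const std::vector<float>& B,   // 192 x 128 biases
                    std::vector<float>& H) {       // 192 x 128 hidden output
    const int rows = 192, inner = 768, cols = 128;
    H.assign(rows * cols, 0.0f);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j) {
            float acc = B[i * cols + j];
            for (int k = 0; k < inner; ++k)
                acc += W[i * inner + k] * P[k * cols + j];
            H[i * cols + j] = std::tanh(acc);      // hidden-layer activation
        }
}
```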
12. Output Layer
- For the forward pass, the output of the hidden layer is fed as input to the output layer.
- After applying these weights and biases, a softmax normalization is applied (written out after this list):
  - 14024x192 weight matrix
  - 192x128 output of hidden layer
  - 14024x128 bias matrix
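For reference, in our notation (V the 14024x192 weight matrix, K the 14024x128 bias matrix, H the hidden-layer output), each example column of the output layer computes:

```latex
O = V H + K, \qquad
P(w_j \mid w_1 w_2 w_3) = \frac{e^{O_j}}{\sum_{k=1}^{14024} e^{O_k}}
```

so that the 14,024 normalized values for each example sum to 1 and can be read directly as word probabilities.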
13. Backward Pass for Training
- The error of the output compared to the target value is propagated backward through the network.
- Weights and biases in the output layer and then the hidden layer are updated (a sketch of the output-layer update follows below).
- Finally, the projection layer table is updated to reflect the results of the forward pass.
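One common way to write the output-layer update (our formulation, not necessarily the exact form in Schwenk's code): with softmax outputs O, one-hot targets T, learning rate λ, and hidden output H, the error is E = O − T and the weights are updated as V ← V − λ·E·Hᵀ. That update maps onto a single cublasSgemm call with α = −λ and β = 1, assuming the cuBLAS v2 API and column-major storage:

```cpp
#include <cublas_v2.h>

// Hypothetical sketch of the output-layer weight update on the GPU,
// assuming the cuBLAS v2 API and column-major storage (cuBLAS convention).
// d_V: 14024x192 output-layer weights, d_E: 14024x128 output error
// (softmax output minus one-hot targets), d_H: 192x128 hidden-layer output.
void update_output_weights(cublasHandle_t handle,
                           float* d_V, const float* d_E, const float* d_H,
                           float learning_rate) {
    const int m = 14024, n = 192, k = 128;
    const float alpha = -learning_rate;  // scale the gradient by -lambda
    const float beta  = 1.0f;            // accumulate into the existing weights
    // V <- V - lambda * E * H^T  (one GEMM, entirely on the GPU)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                m, n, k,
                &alpha, d_E, m,
                        d_H, n,
                &beta,  d_V, m);
}
```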
14. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
15. CUDA for CSLM
- The GPU is specialized for compute-intensive, highly parallel computation.
- All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor.
- However, there is an overhead associated with using the GPU.
16. GPU Overhead
- To use the GPU, memory must be allocated on the host CPU as well as on the GPU.
- Variables to be used in the computation must be transferred to the GPU.
- The computation is then performed on the GPU.
- The results must be transferred back to the host CPU (these steps are sketched below).
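A minimal sketch of those four steps for a single array, using the standard CUDA runtime calls (the buffer size is an arbitrary example):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Minimal sketch of the overhead steps listed above for a single array:
// allocate on the device, copy host -> device, run the computation,
// then copy the result device -> host and free the device memory.
int main() {
    const int n = 768 * 128;                       // e.g. one projection-layer batch
    std::vector<float> h_data(n, 0.0f);            // host-side buffer

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));        // 1. allocate on the GPU
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);            // 2. upload inputs to the GPU

    // 3. ... kernel or CUBLAS calls operating on d_data go here ...

    cudaMemcpy(h_data.data(), d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);            // 4. download results to the CPU
    cudaFree(d_data);
    return 0;
}
```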
17. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
18. CUDA Architecture
- Diagram: a GPU is made up of streaming multiprocessors, each containing multiple processors (cores).
19. CUDA Architecture (Cont'd)
- The CUDA programmer defines functions, called kernels.
- A kernel is executed as a grid of thread blocks (a minimal example follows below).
- The number of threads per block and threads per multiprocessor depend on the compute capability of the CUDA device.
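For illustration only (not from the CSLM code), a trivial kernel and its launch configuration; the element-wise bias addition is our example, and the block size of 256 threads is an arbitrary choice:

```cpp
#include <cuda_runtime.h>

// Minimal illustrative kernel: each thread adds one element of the bias
// matrix to the corresponding output element.
__global__ void add_bias(float* out, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] += bias[i];
}

// Launch the kernel as a grid of thread blocks sized to cover n elements.
void launch_add_bias(float* d_out, const float* d_bias, int n) {
    int threadsPerBlock = 256;                                   // threads per block
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_bias<<<blocksPerGrid, threadsPerBlock>>>(d_out, d_bias, n);
}
```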
20. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
21. Implementation of CSLM Using CUDA
- The CSLM algorithm is highly computationally intensive and a good candidate for implementation with CUDA.
- The matrix multiplications in the hidden and output layers, in both the forward and backward passes, are highly parallel.
22. CUBLAS Routines for CSLM
- CUBLAS is a CUDA implementation of BLAS (Basic Linear Algebra Subprograms), which provides matrix multiplication operations.
- It handles all overhead issues regarding the programming of threads; it does not require the programmer to define kernels, grids, or thread blocks.
23. CUBLAS Implementation of CSLM
- The matrix operations were replaced with the CUBLAS function cublasSgemm(), which performs the operation C = α·op(A)·op(B) + β·C.
- A, B, and C are matrices containing single-precision values (floats).
- α and β are scalars.
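A hedged sketch of how the hidden-layer product of slide 11 could map onto cublasSgemm, assuming the cuBLAS v2 API and column-major storage (the cuBLAS convention); preloading the bias matrix into the output and using β = 1 folds the bias addition into the same call. Function and pointer names are ours:

```cpp
#include <cublas_v2.h>

// Hypothetical sketch: hidden-layer product W * P + B with one cublasSgemm
// call (cuBLAS v2 API, column-major storage). d_W: 192x768 weights,
// d_P: 768x128 projection-layer output, d_H: 192x128 matrix preloaded with
// the biases; after the call d_H holds W * P + B (activation applied later).
void hidden_gemm(cublasHandle_t handle,
                 const float* d_W, const float* d_P, float* d_H) {
    const int m = 192, n = 128, k = 768;
    const float alpha = 1.0f;   // scale of the product term
    const float beta  = 1.0f;   // keep the biases already stored in d_H
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_W, m,
                        d_P, k,
                &beta,  d_H, m);
}
```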
24. CUBLAS Implementation of CSLM (Cont'd)
- NVIDIA Performance Primitives Library (NPP):
  - nppsExp_32f_I performs an exponential operation in-place on single-precision values.
  - nppsMulC_32f_I performs in-place multiplication of a single-precision matrix by a constant.
- These functions were used to implement the softmax normalization operations (one possible assembly is sketched below).
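One way the softmax of a single output column (14,024 scores for one example) could be assembled from these routines plus a cuBLAS sum; this is our sketch, not necessarily the exact sequence in the CSLM code, and it omits the usual subtract-the-maximum step for numerical stability:

```cpp
#include <cublas_v2.h>
#include <npps.h>

// Hypothetical softmax for one column of the output layer.
// d_col points to 14024 single-precision scores on the GPU.
void softmax_column(cublasHandle_t handle, float* d_col, int n) {
    nppsExp_32f_I(d_col, n);                 // in-place exponential of every score

    float sum = 0.0f;                        // values are positive after exp,
    cublasSasum(handle, n, d_col, 1, &sum);  // so the absolute-value sum is the sum

    nppsMulC_32f_I(1.0f / sum, d_col, n);    // scale in place so the column sums to 1
}
```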
25. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
26. CUBLAS CSLM on Various Platforms

CUDA device | Compute capability | Number of MPs | Number of CUDA cores | Max threads per block | Max threads per MP | CPU platform | CPU operating system | Execution time per epoch (min)
Quadro FX 380 LP | 1.2 | 2 | 16 | 512 | 1024 | HP Z200 SFF workstation, 4 Intel Core i3-530 processors, 2.93 GHz | Fedora, kernel 2.6.33.3-85.fc13.x86_64 | 3
Quadro FX 2700M | 1.1 | 6 | 48 | 512 | 768 | Intel Core 2 Duo T9600, 2.8 GHz | Scientific Linux 6.0 | 2.5
Quadro FX 5800 | 1.3 | 30 | 240 | 512 | 1024 | HP Z800 workstation, 12 Intel Xeon X5660 processors, 2.8 GHz | CentOS Linux, kernel 2.6.32-71.29.1.el6.x86_64 | 1.33
27. Comparison of Revised CUDA Version Using Quadro FX 5800 vs. Original Schwenk Algorithm Using MKL

Algorithm | Time per epoch (sec)
Original Schwenk using MKL | 36
CUDA version | 26
28. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
29. Conclusions
- A framework has been provided to introduce CUDA to the CSLM, and a time savings over the traditional CPU approach has been demonstrated.
- The CUBLAS and NPP libraries provide a good starting point for the use of GPUs.
- For best performance, avoid redundant uploading and downloading of interim results (an illustration follows below).
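As an illustration of that last point (our code, reusing the hidden-layer GEMM and softmax sketches from earlier slides), the whole forward pass can run on device pointers so that only the final probabilities are copied back to the host:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Prototypes of the earlier sketches (hidden-layer GEMM, per-column softmax).
void hidden_gemm(cublasHandle_t handle, const float* d_W, const float* d_P, float* d_H);
void softmax_column(cublasHandle_t handle, float* d_col, int n);

// Hypothetical forward pass that keeps all interim results on the GPU.
// d_H feeds the output-layer GEMM directly (the hidden activation is omitted
// here for brevity); d_O is preloaded with the output-layer biases; only the
// final probabilities are downloaded to the host.
void forward_pass(cublasHandle_t handle,
                  const float* d_W, const float* d_P, float* d_H,  // hidden layer
                  const float* d_V, float* d_O,                    // output layer
                  float* h_probs) {
    const int vocab = 14024, hidden = 192, batch = 128;
    const float alpha = 1.0f, beta = 1.0f;

    hidden_gemm(handle, d_W, d_P, d_H);            // interim result stays on the GPU

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,  // output layer: V * H + biases
                vocab, batch, hidden,
                &alpha, d_V, vocab, d_H, hidden, &beta, d_O, vocab);

    for (int j = 0; j < batch; ++j)                // softmax over each example column
        softmax_column(handle, d_O + j * vocab, vocab);

    cudaMemcpy(h_probs, d_O, (size_t)vocab * batch * sizeof(float),
               cudaMemcpyDeviceToHost);            // single download at the end
}
```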
30. Conclusions (Cont'd)
- GPUs provide a substantial performance benefit at relatively low cost, making high-performance computing accessible to the average user.
- The availability of GPUs on laptops may make them more appealing and practical than a supercomputer in some applications.
31. Questions?