Title: Use of CUDA for Continuous Space Language Model
1. Use of CUDA for Continuous Space Language Model
- Elizabeth A. Thompson, Ph.D. (a)
- Timothy R. Anderson, Ph.D. (b)
(a) Purdue University Fort Wayne, Fort Wayne, IN, USA 46805
(b) Air Force Research Lab, Wright-Patterson Air Force Base, Dayton, OH, USA 45433
2. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
3. Continuous-Space Language Models (CSLM)
- This work was based on the article "Continuous-Space Language Models for Statistical Machine Translation" by Holger Schwenk of the University of Le Mans, France, published in the Prague Bulletin of Mathematical Linguistics, January 2010, and his corresponding open source implementation.
4. CSLM (Cont'd)
5. CSLM (Cont'd)
- The CSLM consists of a 3-layer neural network: projection layer, hidden layer, output layer.
- Input: a 3-word sequence.
- Output: the probability of every word in the vocabulary being the 4th word in the sequence.
6. Training of the CSLM
- The neural network must be trained through a process of adaptive learning.
- It is trained using a series of 63,070 4-grams, for example:
  - Prague Stock Market falls
  - Stock Market falls to
  - Market falls to minus
  - falls to minus by
- In each 4-gram, the first 3 words are the input and the 4th word is the target word.
7. Training of the CSLM (Cont'd)
- Text file vocab.txt contains the list of vocabulary terms.
- Each of the 14,024 terms in vocab.txt is assigned a numerical index, which is used for training the neural network (a code sketch follows below):

  Index   Term
  0       gt
  1       -
  ...
  619     abandon
  ...
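As a purely illustrative sketch (not part of Schwenk's released code), the vocabulary index and the 4-gram training examples of the last two slides could be built roughly as follows; the corpus file name train.txt and all variable names are our assumptions.

```cpp
#include <array>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: build the term-to-index table from vocab.txt and
// slice a tokenized corpus into 4-gram training examples (3-word context
// plus target word), as described on the previous slides.
int main() {
    std::unordered_map<std::string, int> vocab;   // term -> numerical index
    std::ifstream vf("vocab.txt");
    std::string term;
    for (int idx = 0; vf >> term; ++idx)
        vocab.emplace(term, idx);                 // e.g. "abandon" -> 619

    std::vector<int> corpus;                      // training text as word indices
    std::ifstream tf("train.txt");                // hypothetical corpus file
    while (tf >> term)
        if (vocab.count(term)) corpus.push_back(vocab[term]);  // skip unknown words

    std::vector<std::array<int, 3>> contexts;     // network inputs
    std::vector<int> targets;                     // 4th word of each 4-gram
    for (std::size_t i = 0; i + 3 < corpus.size(); ++i) {
        contexts.push_back({corpus[i], corpus[i + 1], corpus[i + 2]});
        targets.push_back(corpus[i + 3]);
    }
    return 0;
}
```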
8. Training the Neural Network
- In the training stage, values are propagated in
the forward direction through the neural network
in order to assign weighting values to the input
data, and then errors are propagated in the
reverse direction to improve these weighting
factors.
9. Projection Layer
- The projection layer maps each of the 3 input words to a unique 256-length sequence.
- Initially, these are generated as uniformly distributed random values, but their values change as the neural network is trained.
- For each input word, the corresponding 256-length sequence is the output of the projection layer.
10. Projection Layer
- The projection layer consists of a lookup table (a lookup sketch follows below):

  0      -0.100000   0.009774  ...
  1      -0.099803   0.001762  ...
  2      -0.091674  -0.081308  ...
  ...
  14023  -0.079890  -0.067392  ...
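A minimal sketch of this lookup, assuming the table is stored row-major as 14024 rows of 256 floats (function and variable names are ours): each of the 3 input word indices selects one row, and the 3 rows are concatenated into the 768-value projection-layer output.

```cpp
#include <vector>

// Hypothetical sketch of the projection-layer lookup for one 3-word input.
// 'table' holds the 14024 x 256 lookup table of the previous slide, row-major;
// the three selected rows are concatenated into a 768-value output vector.
std::vector<float> project(const std::vector<float>& table,   // 14024*256 values
                           int w0, int w1, int w2) {
    const int dim = 256;
    std::vector<float> out(3 * dim);
    const int words[3] = {w0, w1, w2};
    for (int k = 0; k < 3; ++k)
        for (int j = 0; j < dim; ++j)
            out[k * dim + j] = table[words[k] * dim + j];   // copy one table row
    return out;
}
```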
11. Hidden Layer
- For the forward pass, the output of the projection layer is fed as input to the hidden layer (a reference sketch follows below):
  - 192x768 weight matrix
  - 192x128 bias matrix
  - 768x128 output of projection layer
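In equation form, the hidden layer computes H = tanh(W·P + B) for a batch of 128 examples (the batch size implied by the matrix sizes above); the tanh activation is assumed from Schwenk's CSLM description. A naive CPU reference, for illustration only:

```cpp
#include <cmath>
#include <vector>

// Hypothetical CPU reference for the hidden-layer forward pass:
// H = tanh(W * P + B), with W 192x768, P 768x128 (projection output for a
// batch of 128 examples), B and H 192x128. All matrices are row-major.
void hidden_forward(const std::vector<float>& W,   // 192 x 768 weights
                    const std::vector<float>& P,   // 768 x 128 projection output
                    const std::vector<float>& B,   // 192 x 128 biases
                    std::vector<float>& H) {       // 192 x 128 hidden output
    const int rows = 192, inner = 768, cols = 128;
    H.assign(rows * cols, 0.0f);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j) {
            float acc = B[i * cols + j];
            for (int k = 0; k < inner; ++k)
                acc += W[i * inner + k] * P[k * cols + j];
            H[i * cols + j] = std::tanh(acc);      // hidden-layer activation
        }
}
```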
12. Output Layer
- For the forward pass, the output of the hidden layer is fed as input to the output layer.
- After applying these weights and biases, a softmax normalization is applied (written out after this list):
  - 14024x192 weight matrix
  - 192x128 output of hidden layer
  - 14024x128 bias matrix
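For reference, in our notation (V the 14024x192 weight matrix, K the 14024x128 bias matrix, H the hidden-layer output), each example column of the output layer computes:

```latex
O = V H + K, \qquad
P(w_j \mid w_1 w_2 w_3) = \frac{e^{O_j}}{\sum_{k=1}^{14024} e^{O_k}}
```

so that the 14,024 normalized values for each example sum to 1 and can be read directly as word probabilities.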
13. Backward Pass for Training
- The error of the output compared to the target value is propagated backward through the network.
- Weights and biases in the output layer and then the hidden layer are updated (a sketch of the output-layer update follows below).
- Finally, the projection layer table is updated to reflect the results of the forward pass.
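One common way to write the output-layer update (our formulation, not necessarily the exact form in Schwenk's code): with softmax outputs O, one-hot targets T, learning rate λ, and hidden output H, the error is E = O − T and the weights are updated as V ← V − λ·E·Hᵀ. That update maps onto a single cublasSgemm call with α = −λ and β = 1, assuming the cuBLAS v2 API and column-major storage:

```cpp
#include <cublas_v2.h>

// Hypothetical sketch of the output-layer weight update on the GPU,
// assuming the cuBLAS v2 API and column-major storage (cuBLAS convention).
// d_V: 14024x192 output-layer weights, d_E: 14024x128 output error
// (softmax output minus one-hot targets), d_H: 192x128 hidden-layer output.
void update_output_weights(cublasHandle_t handle,
                           float* d_V, const float* d_E, const float* d_H,
                           float learning_rate) {
    const int m = 14024, n = 192, k = 128;
    const float alpha = -learning_rate;  // scale the gradient by -lambda
    const float beta  = 1.0f;            // accumulate into the existing weights
    // V <- V - lambda * E * H^T  (one GEMM, entirely on the GPU)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                m, n, k,
                &alpha, d_E, m,
                        d_H, n,
                &beta,  d_V, m);
}
```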
14. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
15. CUDA for CSLM
- The GPU is specialized for compute-intensive, highly parallel computation.
- All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor.
- However, there is an overhead associated with using the GPU.
16. GPU Overhead
- To use the GPU, memory must be allocated on the host CPU as well as on the GPU.
- Variables to be used in the computation must be transferred to the GPU.
- The computation is then performed on the GPU.
- The results must be transferred back to the host CPU (these steps are sketched below).
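A minimal sketch of those four steps for a single array, using the standard CUDA runtime calls (the buffer size is an arbitrary example):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Minimal sketch of the overhead steps listed above for a single array:
// allocate on the device, copy host -> device, run the computation,
// then copy the result device -> host and free the device memory.
int main() {
    const int n = 768 * 128;                       // e.g. one projection-layer batch
    std::vector<float> h_data(n, 0.0f);            // host-side buffer

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));        // 1. allocate on the GPU
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);            // 2. upload inputs to the GPU

    // 3. ... kernel or CUBLAS calls operating on d_data go here ...

    cudaMemcpy(h_data.data(), d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);            // 4. download results to the CPU
    cudaFree(d_data);
    return 0;
}
```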
17. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
18. CUDA Architecture
- Diagram: a GPU is made up of streaming multiprocessors, each containing multiple processors (cores).
19. CUDA Architecture (Cont'd)
- The CUDA programmer defines functions, called kernels.
- A kernel is executed as a grid of thread blocks (a minimal example follows below).
- The number of threads per block and threads per multiprocessor depend on the compute capability of the CUDA device.
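For illustration only (not from the CSLM code), a trivial kernel and its launch configuration; the element-wise bias addition is our example, and the block size of 256 threads is an arbitrary choice:

```cpp
#include <cuda_runtime.h>

// Minimal illustrative kernel: each thread adds one element of the bias
// matrix to the corresponding output element.
__global__ void add_bias(float* out, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] += bias[i];
}

// Launch the kernel as a grid of thread blocks sized to cover n elements.
void launch_add_bias(float* d_out, const float* d_bias, int n) {
    int threadsPerBlock = 256;                                   // threads per block
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_bias<<<blocksPerGrid, threadsPerBlock>>>(d_out, d_bias, n);
}
```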
20. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
21. Implementation of CSLM Using CUDA
- The CSLM algorithm is highly computationally intensive and a good candidate for implementation with CUDA.
- The matrix multiplications in the hidden and output layers, in both the forward and backward passes, are highly parallel.
22. CUBLAS Routines for CSLM
- CUBLAS is a CUDA implementation of BLAS (Basic Linear Algebra Subprograms), which provides matrix multiplication operations.
- It handles all overhead issues regarding the programming of threads; it does not require the programmer to define kernels, grids, or thread blocks.
23. CUBLAS Implementation of CSLM
- The matrix operations were replaced with the CUBLAS function cublasSgemm(), which performs the operation C = α·op(A)·op(B) + β·C.
- A, B, and C are matrices containing single-precision values (floats).
- α and β are scalars.
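A hedged sketch of how the hidden-layer product of slide 11 could map onto cublasSgemm, assuming the cuBLAS v2 API and column-major storage (the cuBLAS convention); preloading the bias matrix into the output and using β = 1 folds the bias addition into the same call. Function and pointer names are ours:

```cpp
#include <cublas_v2.h>

// Hypothetical sketch: hidden-layer product W * P + B with one cublasSgemm
// call (cuBLAS v2 API, column-major storage). d_W: 192x768 weights,
// d_P: 768x128 projection-layer output, d_H: 192x128 matrix preloaded with
// the biases; after the call d_H holds W * P + B (activation applied later).
void hidden_gemm(cublasHandle_t handle,
                 const float* d_W, const float* d_P, float* d_H) {
    const int m = 192, n = 128, k = 768;
    const float alpha = 1.0f;   // scale of the product term
    const float beta  = 1.0f;   // keep the biases already stored in d_H
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_W, m,
                        d_P, k,
                &beta,  d_H, m);
}
```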
24. CUBLAS Implementation of CSLM (Cont'd)
- NVIDIA Performance Primitives Library (NPP):
  - nppsExp_32f_I performs an exponential operation in-place on single-precision values.
  - nppsMulC_32f_I performs in-place multiplication of a single-precision matrix by a constant.
- These functions were used to implement the softmax normalization operations (one possible assembly is sketched below).
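One way the softmax of a single output column (14,024 scores for one example) could be assembled from these routines plus a cuBLAS sum; this is our sketch, not necessarily the exact sequence in the CSLM code, and it omits the usual subtract-the-maximum step for numerical stability:

```cpp
#include <cublas_v2.h>
#include <npps.h>

// Hypothetical softmax for one column of the output layer.
// d_col points to 14024 single-precision scores on the GPU.
void softmax_column(cublasHandle_t handle, float* d_col, int n) {
    nppsExp_32f_I(d_col, n);                 // in-place exponential of every score

    float sum = 0.0f;                        // values are positive after exp,
    cublasSasum(handle, n, d_col, 1, &sum);  // so the absolute-value sum is the sum

    nppsMulC_32f_I(1.0f / sum, d_col, n);    // scale in place so the column sums to 1
}
```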
25. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
26. CUBLAS CSLM on Various Platforms

CUDA device | Compute capability | Number of MPs | Number of CUDA cores | Max threads per block | Max threads per MP | CPU platform | CPU operating system | Execution time per epoch (min)
Quadro FX 380 LP | 1.2 | 2 | 16 | 512 | 1024 | HP Z200 SFF workstation, 4 Intel Core i3-530 processors, 2.93 GHz | Fedora, kernel 2.6.33.3-85.fc13.x86_64 | 3
Quadro FX 2700M | 1.1 | 6 | 48 | 512 | 768 | Intel Core 2 Duo T9600, 2.8 GHz | Scientific Linux 6.0 | 2.5
Quadro FX 5800 | 1.3 | 30 | 240 | 512 | 1024 | HP Z800 workstation, 12 Intel Xeon X5660 processors, 2.8 GHz | CentOS Linux, kernel 2.6.32-71.29.1.el6.x86_64 | 1.33
27. Comparison of Revised CUDA Version Using Quadro FX 5800 vs. Original Schwenk Algorithm Using MKL

Algorithm | Time per epoch (sec)
Original Schwenk using MKL | 36
CUDA version | 26
28. Outline
- I. CSLM Algorithm
- II. Use of CUDA
- III. CUDA Architecture
- IV. CUDA Implementation of CSLM
- V. Results
- VI. Conclusions
29. Conclusions
- A framework has been provided to introduce CUDA to the CSLM, and a time savings over the traditional CPU approach has been demonstrated.
- The CUBLAS and NPP libraries provide a good starting point for the use of GPUs.
- For best performance, avoid redundant uploading and downloading of interim results (an illustration follows below).
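As an illustration of that last point (our code, reusing the hidden-layer GEMM and softmax sketches from earlier slides), the whole forward pass can run on device pointers so that only the final probabilities are copied back to the host:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Prototypes of the earlier sketches (hidden-layer GEMM, per-column softmax).
void hidden_gemm(cublasHandle_t handle, const float* d_W, const float* d_P, float* d_H);
void softmax_column(cublasHandle_t handle, float* d_col, int n);

// Hypothetical forward pass that keeps all interim results on the GPU.
// d_H feeds the output-layer GEMM directly (the hidden activation is omitted
// here for brevity); d_O is preloaded with the output-layer biases; only the
// final probabilities are downloaded to the host.
void forward_pass(cublasHandle_t handle,
                  const float* d_W, const float* d_P, float* d_H,  // hidden layer
                  const float* d_V, float* d_O,                    // output layer
                  float* h_probs) {
    const int vocab = 14024, hidden = 192, batch = 128;
    const float alpha = 1.0f, beta = 1.0f;

    hidden_gemm(handle, d_W, d_P, d_H);            // interim result stays on the GPU

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,  // output layer: V * H + biases
                vocab, batch, hidden,
                &alpha, d_V, vocab, d_H, hidden, &beta, d_O, vocab);

    for (int j = 0; j < batch; ++j)                // softmax over each example column
        softmax_column(handle, d_O + j * vocab, vocab);

    cudaMemcpy(h_probs, d_O, (size_t)vocab * batch * sizeof(float),
               cudaMemcpyDeviceToHost);            // single download at the end
}
```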
30. Conclusions (Cont'd)
- GPUs provide a substantial performance benefit at relatively low cost, making high-performance computing accessible to the average user.
- The availability of GPUs on laptops may make them more appealing and practical than a supercomputer in some applications.
31. Questions?