1
Use of CUDA for Continuous Space Language Model
  • Elizabeth A. Thompson, Ph.D. (a)
  • Timothy R. Anderson, Ph.D. (b)

(a) Purdue University, Fort Wayne; Fort Wayne, IN 46805, USA
(b) Air Force Research Lab, Wright-Patterson Air Force Base; Dayton, OH 45433, USA
2
Outline
  • I. CSLM Algorithm
  • II. Use of CUDA
  • III. CUDA Architecture
  • IV. CUDA Implementation of CSLM
  • V. Results
  • VI. Conclusions

3
Continuous-Space Language Models (CSLM)
  • This work was based on the article "Continuous-Space
    Language Models for Statistical Machine Translation" by
    Holger Schwenk of the University of Le Mans, France,
    published in the Prague Bulletin of Mathematical
    Linguistics, January 2010, and on his corresponding
    open-source implementation.

4
CSLM (Cont'd)
5
CSLM (Cont'd)
  • The CSLM consists of a 3-layer neural network: a
    projection layer, a hidden layer, and an output layer.
  • Input → a 3-word sequence.
  • Output → for each word in the vocabulary, the probability
    that it is the 4th word in the sequence.

6
Training of the CSLM
  • The neural network must be trained through a
    process of adaptive learning.
  • It is trained using a series of 63,070 4-grams, such as:
  • Prague Stock Market falls
  • Stock Market falls to
  • Market falls to minus
  • falls to minus by

(in each 4-gram, the last word is the target word)
7
Training of the CSLM (Cont'd)
  • Text file vocab.txt contains list of vocabulary
    terms
  • Each of 14,024 terms in vocab.txt is assigned a
    numerical index, which is used for training the
    neural network
    Index   Term
    0       </s>
    1       -
    619     abandon

8
Training the Neural Network
  • In the training stage, values are propagated in
    the forward direction through the neural network
    in order to assign weighting values to the input
    data, and then errors are propagated in the
    reverse direction to improve these weighting
    factors.

9
Projection Layer
  • The projection layer maps each of the 3 input words to a
    unique 256-element feature vector.
  • Initially, these vectors are generated as uniformly
    distributed random values, but their values change as the
    neural network is trained.
  • For each input word, the corresponding 256-element vector
    is the output of the projection layer.

10
Projection Layer (Cont'd)
  • The projection layer consists of a lookup table (a GPU
    lookup sketch follows the table below).

    Index   256-element feature vector
    0       -0.100000   0.009774  ...
    1       -0.099803   0.001762  ...
    2       -0.091674  -0.081308  ...
    ...
    14023   -0.079890  -0.067392  ...
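The kernel below is a minimal sketch of this table lookup for a single
4-gram; the kernel name, variable names, and the one-4-gram scope are
illustrative assumptions, not Schwenk's code.

    /* Illustrative lookup kernel: copy the 256-value feature vector of
     * each of the 3 input words into the concatenated 768-value
     * projection output (one block per word, one thread per feature). */
    __global__ void project(const float *table,  /* 14024 x 256 lookup table */
                            const int   *words,  /* 3 word indices           */
                            float       *out)    /* 768 = 3 x 256 outputs    */
    {
        int w = blockIdx.x;   /* which input word: 0..2   */
        int f = threadIdx.x;  /* which feature:    0..255 */
        out[w * 256 + f] = table[words[w] * 256 + f];
    }

    /* launch: project<<<3, 256>>>(d_table, d_words, d_out); */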
11
Hidden Layer
  • For the forward pass, the output of the projection layer
    is fed as input to the hidden layer.
  • A 192x768 weight matrix multiplies the 768x128
    projection-layer output (three 256-element vectors for
    each of the 128 4-grams processed together), and a 192x128
    bias matrix is added, giving a 192x128 hidden-layer
    output. A plain-C reference of this computation is
    sketched below.
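As a dimension check, here is a plain-C reference of that computation;
the names, the row-major layout, and the comment about a trailing
activation are illustrative assumptions, not Schwenk's implementation.

    #define N_HID   192   /* hidden units                */
    #define N_PROJ  768   /* 3 words x 256 features      */
    #define N_BUNCH 128   /* 4-grams processed per bunch */

    /* Reference CPU version of the hidden-layer forward pass:
     * H = W * P + B, with all matrices stored row-major. */
    void hidden_forward(const float W[N_HID][N_PROJ],
                        const float P[N_PROJ][N_BUNCH],
                        const float B[N_HID][N_BUNCH],
                        float       H[N_HID][N_BUNCH])
    {
        for (int i = 0; i < N_HID; i++)
            for (int j = 0; j < N_BUNCH; j++) {
                float s = B[i][j];              /* start from the bias */
                for (int k = 0; k < N_PROJ; k++)
                    s += W[i][k] * P[k][j];     /* accumulate W * P    */
                H[i][j] = s;   /* a nonlinear activation is applied next */
            }
    }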
12
Output Layer
  • For the forward pass, the output of the hidden layer is
    fed as input to the output layer: a 14024x192 weight
    matrix multiplies the 192x128 hidden-layer output, and a
    14024x128 bias matrix is added, giving one score per
    vocabulary word for each of the 128 4-grams in the bunch.
  • After applying these weights and biases, a softmax
    normalization is applied to convert the scores into
    probabilities.
13
Backward Pass for Training
  • The error of the output compared to the target
    value is propagated backward through the network.
  • Weights and biases in the output layer and then
    the hidden layer are updated.
  • Finally, the projection layer table is updated to
    reflect the results of the forward pass.
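As one concrete instance, the output-layer weights can be updated with
the standard gradient-descent rule; the sketch below is an assumption
about the form of the update (the names and the learning rate lr are
illustrative), not Schwenk's code. The inner loops compute exactly the
matrix product err x H^T, which is why the backward pass also maps
well onto GPU matrix routines.

    /* Hypothetical output-layer weight update, W <- W - lr * (err x H^T),
     * with row-major storage: err is 14024x128, H is 192x128,
     * W is 14024x192. */
    void update_output_weights(float *W, const float *err,
                               const float *H, float lr)
    {
        for (int i = 0; i < 14024; i++)        /* vocabulary words   */
            for (int j = 0; j < 192; j++) {    /* hidden units       */
                float grad = 0.0f;
                for (int b = 0; b < 128; b++)  /* sum over the bunch */
                    grad += err[i * 128 + b] * H[j * 128 + b];
                W[i * 192 + j] -= lr * grad;   /* gradient-descent step */
            }
    }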

14
Outline
  • I. CSLM Algorithm
  • II. Use of CUDA
  • III. CUDA Architecture
  • IV. CUDA Implementation of CSLM
  • V. Results
  • VI. Conclusions

15
CUDA for CSLM
  • The GPU is specialized for compute-intensive, highly
    parallel computation.
  • All NVIDIA GPUs can support at least 768
    concurrently active threads per multiprocessor.
  • However, there is an overhead associated with
    using the GPU.

16
GPU Overhead
  • To use the GPU, memory must be allocated on both
    the host CPU as well as on the GPU.
  • Variables to be used in the computation must be
    transferred to the GPU.
  • The computation is then performed on the GPU.
  • The results must then be transferred back to the host CPU
    (a sketch of this round trip follows).
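A minimal sketch of this round trip with the CUDA runtime API; the
function and variable names are illustrative, and error checking is
omitted for brevity.

    #include <cuda_runtime.h>

    void gpu_round_trip(const float *h_in, float *h_out, size_t n)
    {
        float *d_buf;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_buf, bytes);      /* allocate on the GPU  */
        cudaMemcpy(d_buf, h_in, bytes,
                   cudaMemcpyHostToDevice);      /* upload the inputs    */
        /* ... perform the computation on the GPU ... */
        cudaMemcpy(h_out, d_buf, bytes,
                   cudaMemcpyDeviceToHost);      /* download the results */
        cudaFree(d_buf);                         /* release GPU memory   */
    }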

17
Outline
  • I. CSLM Algorithm
  • II. Use of CUDA
  • III. CUDA Architecture
  • IV. CUDA Implementation of CSLM
  • V. Results
  • VI. Conclusions

18
CUDA Architecture
(diagram: a GPU contains multiple streaming multiprocessors, each made
up of processors, i.e., CUDA cores)
19
CUDA Architecture (Cont'd)
  • The CUDA programmer defines functions, called kernels.
  • A kernel is executed as a grid of thread blocks.
  • The number of threads per block and the number of threads
    per multiprocessor depend on the compute capability of the
    CUDA device (a minimal kernel sketch follows).
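The sketch below is an illustrative kernel, not taken from the paper:
it applies an element-wise tanh (the kind of activation a
neural-network layer might use), launched as a grid of thread blocks
with one thread per element.

    __global__ void tanh_inplace(float *data, int n)
    {
        /* global thread index across the whole grid */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                /* guard: the grid may overshoot n */
            data[i] = tanhf(data[i]);
    }

    /* launch with 256 threads per block and enough blocks to cover n: */
    /* tanh_inplace<<<(n + 255) / 256, 256>>>(d_data, n); */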

20
Outline
  • I. CSLM Algorithm
  • II. Use of CUDA
  • III. CUDA Architecture
  • IV. CUDA Implementation of CSLM
  • V. Results
  • VI. Conclusions

21
Implementation of CSLM using CUDA
  • The CSLM algorithm is highly computationally
    intensive and a good candidate for implementation
    with CUDA.
  • The matrix multiplications in the hidden and output
    layers, in both the forward and backward passes, are
    highly parallel.

22
CUBLAS Routines for CSLM
  • CUBLAS is a CUDA implementation of BLAS (Basic Linear
    Algebra Subprograms), which provide matrix multiplication
    operations.
  • CUBLAS handles all of the thread-programming overhead; it
    does not require the programmer to define kernels, grids,
    or thread blocks.

23
CUBLAS Implementation of CSLM
  • The matrix operations were replaced with the CUBLAS
    function cublasSgemm(), which performs the operation
    C ← αAB + βC (a usage sketch follows).
  • A, B, and C are matrices containing single-precision
    values (floats).
  • α and β are scalars.
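As a sketch of how the hidden-layer forward pass could map onto this
call (using the modern cuBLAS v2 API; the variable names and the
bias-preload trick are assumptions, not necessarily how Schwenk's code
is organized):

    #include <cublas_v2.h>

    /* H = 1.0 * W * P + 1.0 * H, where H has been pre-filled with the
     * biases so the bias addition rides along with the multiply.
     * cuBLAS is column-major: m = 192, n = 128, k = 768. */
    void hidden_forward_gpu(cublasHandle_t handle,
                            const float *d_W,  /* 192x768 weights           */
                            const float *d_P,  /* 768x128 projection output */
                            float       *d_H)  /* 192x128, pre-filled with biases */
    {
        const float alpha = 1.0f, beta = 1.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    192, 128, 768,
                    &alpha, d_W, 192,   /* lda = rows of W */
                            d_P, 768,   /* ldb = rows of P */
                    &beta,  d_H, 192);  /* ldc = rows of H */
    }

Setting β = 1 turns the call into a fused multiply-accumulate, which
avoids a separate bias-addition pass.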

24
CUBLAS Implementation of CSLM (Cont'd)
  • NVIDIA Performance Primitives Library (NPP):
  • nppsExp_32f_I performs an in-place exponential operation
    on single-precision values.
  • nppsMulC_32f_I performs in-place multiplication of a
    single-precision matrix by a constant.
  • These functions were used to implement the softmax
    normalization operations (a sketch follows).
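A minimal sketch of the softmax for one output column using the NPP
calls named above, with cublasSasum supplying the sum (the use of
cublasSasum and the per-column organization are assumptions, not
necessarily how Schwenk's code does it):

    #include <npps.h>
    #include <cublas_v2.h>

    /* Normalize one output column in place:
     * x_i <- exp(x_i) / sum_j exp(x_j). */
    void softmax_column(cublasHandle_t handle, Npp32f *d_col, int n)
    {
        float sum;
        nppsExp_32f_I(d_col, n);                /* x_i <- exp(x_i), in place   */
        cublasSasum(handle, n, d_col, 1, &sum); /* sum of the (positive) terms */
        nppsMulC_32f_I(1.0f / sum, d_col, n);   /* divide by the sum           */
    }

A production version would subtract the column maximum before
exponentiating to avoid overflow.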

25
Outline
  • I. CSLM Algorithm
  • II. Use of CUDA
  • III. CUDA Architecture
  • IV. CUDA Implementation of CSLM
  • V. Results
  • VI. Conclusions

26
CUBLAS CSLM on various platforms
  • Quadro FX 380 LP: compute capability 1.2, 2 MPs, 16 CUDA
    cores, max 512 threads per block, max 1024 threads per MP;
    host: HP Z200 SFF workstation, 4 Intel Core i3-530
    processors at 2.93 GHz, Fedora (kernel
    2.6.33.3-85.fc13.x86_64); execution time per epoch: 3 min
  • Quadro FX 2700M: compute capability 1.1, 6 MPs, 48 CUDA
    cores, max 512 threads per block, max 768 threads per MP;
    host: Intel Core 2 Duo T9600 at 2.8 GHz, Scientific Linux
    6.0; execution time per epoch: 2.5 min
  • Quadro FX 5800: compute capability 1.3, 30 MPs, 240 CUDA
    cores, max 512 threads per block, max 1024 threads per MP;
    host: HP Z800 workstation, 12 Intel Xeon X5660 processors
    at 2.8 GHz, CentOS Linux (kernel 2.6.32-71.29.1.el6.x86_64);
    execution time per epoch: 1.33 min
27
Comparison of revised CUDA version using Quadro
FX 5800 vs. original Schwenk algorithm using MKL
    Algorithm                      Time per epoch (sec)
    Original Schwenk using MKL     36
    CUDA version                   26
28
Outline
  • I. CSLM Algorithm
  • II. Use of CUDA
  • III. CUDA Architecture
  • IV. CUDA Implementation of CSLM
  • V. Results
  • VI. Conclusions

29
Conclusions
  • A framework has been provided for introducing CUDA into
    the CSLM, and a time savings over the traditional CPU
    approach has been demonstrated.
  • The CUBLAS and NPP libraries provide a good starting point
    for the use of GPUs.
  • For best performance, avoid redundant uploading and
    downloading of interim results.

30
Conclusions (Cont'd)
  • GPUs provide a substantial performance benefit at
    relatively low cost, making high-performance computing
    accessible to the average user.
  • The availability of GPUs in laptops may make GPU computing
    more appealing and practical than a supercomputer for some
    applications.

31
Questions?