Title: Lattice calculations with GPUs
1 Lattice calculations with GPUs
2 Outline
- 1. Graphics Processing Unit (GPU)
- 2. Nvidia CUDA programming model
- 3. GPU????
- 4. QCD with CUDA
- 5. Results
- 6. Summary
3 Graphics Processing Unit
- ?????????PC???
- ??????
- ????????
- 100 to 200 processors
- ???????
GPU????? Graphic Card
- ?? 510??
- ?? ??GFLOPS (???)
??????
An O(a)-improved Wilson-Dirac quark solver is implemented with CUDA and run on the GPU.
????
Győző I. Egri, hep-lat/0611022, "Lattice QCD as a video game"
? ???
????????
4 Mixed Precision Solver
Linear equation (Wilson-Dirac): D x = b
GPU: 300-900 GFlops
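The slide title names a mixed precision solver for D x = b. One common form of this idea is iterative refinement, sketched below under that assumption; the algorithm actually used on the slide is not stated here. The inner solve runs in single precision (standing in for the fast GPU arithmetic), while the residual and the solution are accumulated in double precision. The names `inner_solve_float`, `residual_norm` and `mixed_precision_solve`, and the Jacobi inner iteration, are hypothetical choices made only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Approximate solve of D e = r in single precision using Jacobi sweeps.
// Assumption: this plays the role of the low-precision (GPU) solver.
std::vector<float> inner_solve_float(const Mat& D, const Vec& r, int sweeps) {
    const std::size_t n = r.size();
    std::vector<float> e(n, 0.0f), next(n, 0.0f);
    for (int s = 0; s < sweeps; ++s) {
        for (std::size_t i = 0; i < n; ++i) {
            float sum = static_cast<float>(r[i]);
            for (std::size_t j = 0; j < n; ++j)
                if (j != i) sum -= static_cast<float>(D[i][j]) * e[j];
            next[i] = sum / static_cast<float>(D[i][i]);
        }
        e = next;
    }
    return e;
}

// Computes r = b - D x in double precision and returns |r|.
double residual_norm(const Mat& D, const Vec& x, const Vec& b, Vec& r) {
    double norm2 = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        r[i] = b[i];
        for (std::size_t j = 0; j < x.size(); ++j) r[i] -= D[i][j] * x[j];
        norm2 += r[i] * r[i];
    }
    return std::sqrt(norm2);
}

// Outer loop in double precision: correct x with single-precision solves
// until the double-precision residual drops below tol.
Vec mixed_precision_solve(const Mat& D, const Vec& b, double tol, int max_outer) {
    Vec x(b.size(), 0.0), r(b.size(), 0.0);
    for (int k = 0; k < max_outer; ++k) {
        if (residual_norm(D, x, b, r) <= tol) break;
        const std::vector<float> e = inner_solve_float(D, r, 20);
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += static_cast<double>(e[i]);
    }
    return x;
}
```

Each outer step can only gain roughly single-precision accuracy, so a few outer steps are needed to reach a double-precision residual. Jacobi converges here only because the example matrix used for checking is diagonally dominant; a production solver would use a Krylov method instead.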
5 CUDA Programming Model
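Sum of two L-dimensional vectors (L = N × M): c = a + b

    // GPU code (kernel), called from the host
    __global__ void vadd_kernel(float *a, float *b, float *c)
    {
        // one thread handles one vector element
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        c[idx] = a[idx] + b[idx];
    }

    // host code
    int main()
    {
        // allocate memory on the GPU
        cudaMalloc((void **)&a, ...); ...
        // compute c = a + b on the GPU
        // threads per block = N, number of blocks = M, executed in parallel
        vadd_kernel<<<M, N>>>(a, b, c);
    }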
????????
[Figure: a grid made of thread blocks (block 1, block 2, ..., block M); each block contains threads (thread 1, thread 2, ..., thread N)]
- thread: the basic unit of execution (max 512 per block)
- thread block: a group of threads executed on one multiprocessor (max 65535 blocks)
- grid: the set of thread blocks launched for one kernel
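The limits quoted above (512 threads per block, 65535 blocks) constrain how a vector of length L can be split into blocks. The following host-side sketch shows one way to pick the number of blocks and to reproduce the kernel's index formula on the CPU. `LaunchConfig`, `make_launch_config` and `global_index` are hypothetical helpers, not part of CUDA.

```cpp
#include <stdexcept>

// Limits quoted on the slide for this GPU generation.
constexpr int kMaxThreadsPerBlock = 512;
constexpr int kMaxBlocksPerGrid = 65535;

struct LaunchConfig {
    int blocks;             // M in vadd_kernel<<<M, N>>>
    int threads_per_block;  // N in vadd_kernel<<<M, N>>>
};

// Chooses enough blocks to cover `length` elements (rounded up).
LaunchConfig make_launch_config(int length, int threads_per_block) {
    if (length <= 0 || threads_per_block <= 0 ||
        threads_per_block > kMaxThreadsPerBlock)
        throw std::invalid_argument("invalid length or block size");
    const int blocks = (length + threads_per_block - 1) / threads_per_block;
    if (blocks > kMaxBlocksPerGrid)
        throw std::invalid_argument("too many blocks for a 1D grid");
    return {blocks, threads_per_block};
}

// Same formula as in the kernel: threadIdx.x + blockIdx.x * blockDim.x.
int global_index(int thread_idx, int block_idx, int block_dim) {
    return thread_idx + block_idx * block_dim;
}
```

For example, `make_launch_config(16 * 16 * 16 * 32, 128)` gives 1024 blocks of 128 threads. When the length is not a multiple of the block size, the last block has surplus threads, so a kernel launched this way would need a bounds check such as `if (idx < L)`.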
6???????
From the Nvidia CUDA Programming Guide:
- ????????
- ? 1thread ? 1 ??????
- ???????????????
- ? GPU?????????????????
7 Memories on GPU
Shared Memory
- access in about 4 clock cycles
- read-write
- shared by the threads of one block
- 16 KB/block
Global Memory
- located in device memory
- access in 400-600 clock cycles
- read-write
- accessible from all threads
Shared Memory ?????
8 CUDA with QCD, programming strategy
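- Data needed per lattice site:
- fermion: 8 neighbors (+1 on-site), each 3×4×2×4 Byte = 96 Byte
- gauge link: 2 per direction (8 in total), each 3×(3-1)×2×4 Byte = 48 Byte
- SU(3) reconstruction method.
- clover term: 1, 21×2×2×4 Byte = 336 Byte
CUDA block: 4^3×2 lattice sites, 128 threads
Data loaded per site: 1584 Byte; computation per site: 1896 Flop
Byte/Flop = 0.83. With the G80 memory bandwidth of 80 GB/s this gives roughly 100 GFlops.
Fermions are placed in shared memory: 4×4×4×2×96 Byte = 12.3 KB (max 16 KB/block)
Gauge links and the clover term are kept in device memory.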
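The byte and flop counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the reading that reproduces the quoted 1584 Byte total: 9 fermion fields (8 neighbors plus the site itself), 8 gauge links and 1 clover term per site, all in single precision. The constant and function names are hypothetical.

```cpp
constexpr int kBytesPerReal = 4;  // single precision

// Per-object sizes as given on the slide.
constexpr int kFermionBytes = 3 * 4 * 2 * kBytesPerReal;       // 96 Byte
constexpr int kGaugeBytes = 3 * (3 - 1) * 2 * kBytesPerReal;   // 48 Byte
constexpr int kCloverBytes = 21 * 2 * 2 * kBytesPerReal;       // 336 Byte

// Assumed multiplicities: 8 neighbors + 1 on-site fermion, 8 links, 1 clover.
constexpr int kBytesPerSite =
    9 * kFermionBytes + 8 * kGaugeBytes + kCloverBytes;        // 1584 Byte
constexpr int kFlopsPerSite = 1896;                            // from the slide

constexpr double kBytesPerFlop =
    static_cast<double>(kBytesPerSite) / kFlopsPerSite;        // about 0.83

// Upper bound on GFlops when the kernel is limited by memory bandwidth.
constexpr double bandwidth_bound_gflops(double bandwidth_gb_per_s) {
    return bandwidth_gb_per_s / kBytesPerFlop;
}

// Shared memory needed for the fermions of one 4x4x4x2 block of sites.
constexpr int kSharedBytesPerBlock = 4 * 4 * 4 * 2 * kFermionBytes;  // 12288
static_assert(kSharedBytesPerBlock <= 16 * 1024, "must fit in 16 KB/block");
```

With 80 GB/s the bound evaluates to a little under 96 GFlops, which is the "about 100 GFlops" quoted on the slide.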
9???????solver???
GPU: NVIDIA GeForce 8800 GTX, 345.6 GFLOPS
CPU: Intel Core 2 @ 2.66 GHz, 21.3 GFLOPS
- O(a)-improved Wilson-Dirac quark solver
- Bi-CGStab method
- ?????
- ??????GPU???
- even-odd preconditioning
10 Results: Calculation Time
- lattice size: 16^3 × 32
- quenched
- 0.15 fm
- quark mass (MeV): 23, 52, 81
[Figure: calculation time for quark masses of 23 MeV, 52 MeV and 81 MeV, with and without the GPU; plot labels include 10^-6, 10^-12 and 10^-15]
With the GPU, the calculation time is about 1/7.
11 Performance (Volume)
- quark mass: 23 MeV
- lattice sizes:
- 4^3 × 8
- 8^3 × 16
- 16^3 × 32
[Figure: GPU and CPU performance for the three lattice volumes]
GPU solver performance: about 17 GFLOPS
???????????? ? coalesced access
12 Coalesced Access
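[Figure: two memory layouts, each element being 4, 8, or 16 Byte. In the first, the elements belonging to the same index are stored together (0 0 0, 1 1 1, 2 2 2, ...). In the second, elements are interleaved (0 1 2 ..., 0 1 2 ..., 0 ...), so that thread 0, thread 1, thread 2, ... read consecutive addresses.]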
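One way to read the two layouts in the diagram is as array-of-structures versus structure-of-arrays indexing. The sketch below assumes that thread t works on index t and compares the distance between the addresses touched by neighboring threads in each layout. The function names are hypothetical, and the mapping of threads to lattice sites is an assumption.

```cpp
#include <cstddef>

// Layout 1 (0 0 0, 1 1 1, ...): all components of one site are contiguous.
std::size_t aos_index(std::size_t site, std::size_t comp, std::size_t n_comp) {
    return site * n_comp + comp;
}

// Layout 2 (0 1 2 ..., 0 1 2 ...): one component of all sites is contiguous.
std::size_t soa_index(std::size_t site, std::size_t comp, std::size_t n_sites) {
    return comp * n_sites + site;
}

// Distance, in elements, between what thread t and thread t+1 read
// when both fetch the same component of their own site.
std::size_t aos_thread_stride(std::size_t n_comp) {
    return aos_index(1, 0, n_comp) - aos_index(0, 0, n_comp);
}

std::size_t soa_thread_stride(std::size_t n_sites) {
    return soa_index(1, 0, n_sites) - soa_index(0, 0, n_sites);
}
```

In the second layout the stride between neighboring threads is one element, so a group of threads reads one contiguous chunk of memory, which is the access pattern the hardware can coalesce. In the first layout the stride equals the number of components per site, and the reads are scattered.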
13?????
GPU: Nvidia GeForce GTX 280, CPU: Core 2 Duo 3.0 GHz (6 MB)
- non-coalesced access, on shared memory: 20 GFLOPS
- coalesced access, on texture cache: 40-50 GFLOPS (hopping term 89 GFLOPS, clover term 100 GFLOPS)
- GPU solver: 10 versus 220, a factor of 22
14 Summary
- GPU???????????????? ? ??QCD??
- GPU??????????
- ??????????GPU solver ?????? ? ??????
- ????solver?O(a)??????clover?????????
- GeForce 8800 GTX:
- solver performance of about 17 GFLOPS
- calculation time 1/7 of that on a Core 2 Duo 2.66 GHz CPU
- GeForce GTX 280:
- 40-50 GFLOPS with coalesced access
- 22 times faster than a Core 2 Duo 3.0 GHz
- ???????coalesced access ????
- ???GPU??????