
Comparison of DGEMM performance on CPU and GPU: how big a matrix do you need to make GPU acceleration worthwhile? To find out, we wrote and ran a benchmark.
This test was run on the following machine:
- Intel Core2 Quad Q9450 (2.66GHz)
- Asus P5E3 Deluxe Motherboard
- NVIDIA Tesla C1060 GPU
The 'thunking' curves show CPU-to-CPU times: the data start and end in host memory, so the GPU is treated entirely as a black box and the transfers in both directions are included in the timing. The 'pageable thunk' curve is the purest example of this, with the matrices held in ordinary pageable host memory. For the 'pinned thunk' curve, the matrices were first placed in pinned host memory (allocated using cudaMallocHost). The 'GPU compute only' time is the actual time spent in DGEMM on the GPU, once the data are already on the card.
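As a rough sketch of how the curves relate, the model below treats a thunked call as transfer time plus compute time, while 'GPU compute only' is the compute term alone. It is written in Python and is not the benchmark's own code. Everything in it is an assumption: the function names, the guess that three matrices are uploaded and one is downloaded, and the throughput and bandwidth figures, which are illustrative placeholders and not measurements from this test. Fixed per-call latencies are ignored.

```python
def dgemm_time_model(n, gpu_gflops, bandwidth_gb_s):
    """Return (thunk_seconds, compute_only_seconds) for an n x n DGEMM.

    Hypothetical cost model, not taken from the benchmark:
      - transfers: A, B and C copied to the card, C copied back (assumed)
      - compute:   2 * n**3 floating-point operations
    """
    bytes_per_matrix = 8 * n * n            # 8-byte doubles
    bytes_moved = 4 * bytes_per_matrix      # assumed: 3 uploads + 1 download
    transfer = bytes_moved / (bandwidth_gb_s * 1e9)
    compute = 2 * n**3 / (gpu_gflops * 1e9)
    return transfer + compute, compute


def transfer_fraction(n, gpu_gflops, bandwidth_gb_s):
    """Fraction of the thunked time spent moving data rather than computing."""
    thunk, compute = dgemm_time_model(n, gpu_gflops, bandwidth_gb_s)
    return (thunk - compute) / thunk


if __name__ == "__main__":
    # Placeholder figures chosen only for illustration, not measured values.
    ASSUMED_GPU_GFLOPS = 70.0
    ASSUMED_PINNED_GB_S = 5.0
    ASSUMED_PAGEABLE_GB_S = 2.5

    for n in (256, 1024, 4096):
        pinned = transfer_fraction(n, ASSUMED_GPU_GFLOPS, ASSUMED_PINNED_GB_S)
        pageable = transfer_fraction(n, ASSUMED_GPU_GFLOPS, ASSUMED_PAGEABLE_GB_S)
        print(f"n={n:5d}  transfer share: pinned {pinned:.2f}, pageable {pageable:.2f}")
```

In this model the pinned and pageable thunk curves differ only in the bandwidth passed in, and the transfer share shrinks as n grows because the transfer term scales with n squared while the compute term scales with n cubed.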
The main point to note is that it rapidly becomes worthwhile to do the multiplication on the GPU, even if the data have to be moved to the GPU solely for the DGEMM call. Matrices of size 256x256 are already noticeably faster on the GPU. As one goes to larger matrices, the relative cost of the transfers becomes negligible, as one might expect, since the arithmetic grows as the cube of the matrix dimension while the data volume grows only as its square. The step in performance at the low end of the 'pageable thunk' curve is interesting. Making a definitive statement is not easy, but it appears that, if the data being transferred to the GPU fit on a single page, the performance penalty for transferring from pageable memory is minimal.
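To get a feel for what fitting on a single page would mean, the arithmetic below assumes a 4 KiB page (common on x86, but not stated above) and 8-byte doubles, and asks how large a square matrix can be before it no longer fits. The helper names are hypothetical, and the page count is only a lower bound, since a buffer that is not page-aligned can straddle a page boundary.

```python
import math

PAGE_BYTES = 4096    # assumed page size; not stated in the text
DOUBLE_BYTES = 8     # size of one double-precision element


def max_square_dim_in_page(page_bytes=PAGE_BYTES, elem_bytes=DOUBLE_BYTES):
    """Largest n such that an n x n matrix of elements fits in one page."""
    return math.isqrt(page_bytes // elem_bytes)


def min_pages_spanned(n, page_bytes=PAGE_BYTES, elem_bytes=DOUBLE_BYTES):
    """Minimum number of pages an n x n matrix occupies (perfect alignment)."""
    total_bytes = n * n * elem_bytes
    return -(-total_bytes // page_bytes)   # integer ceiling division


if __name__ == "__main__":
    print(max_square_dim_in_page())   # 22
    print(min_pages_spanned(22))      # 1
    print(min_pages_spanned(23))      # 2
    print(min_pages_spanned(256))     # 128
```

Under those assumptions only matrices up to about 22x22 could sit on one page, so if the single-page explanation is right, the step would be expected at very small sizes. The text above does not say where on the curve it actually occurs.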
