Benchmarking GPUs with a parallel Lattice-Boltzmann code

J. Kraus; M. Pivanti; Sebastiano Fabio Schifano; Raffaele Tripiccione; M. Zanella

doi:10.1109/SBAC-PAD.2013.37

Accelerators are an increasingly common option to boost the performance of codes that require extensive number crunching. In this paper we report on our experience using NVIDIA accelerators to study fluid systems with the Lattice Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for processor architectures with a large degree of parallelism, such as recent multi- and many-core processors and GPUs; however, exploiting a large fraction of the theoretically available performance of this new class of processors is a challenge that is not easily met. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a D2Q37 model) that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas. The computational features of this model make it a significant benchmark for analyzing the performance of new computational platforms, since critical kernels in this code require both high memory bandwidth on sparse memory-addressing patterns and high floating-point throughput. We consider two recent classes of GPU boards, based on the Fermi and Kepler architectures. We describe in detail all the steps taken to implement and optimize our LB code, and we analyze its performance first on single-GPU systems and then on parallel multi-GPU systems, both within one node and on a cluster of many nodes. In the latter case we use CUDA-aware MPI as an abstraction layer to assess the advantages of advanced GPU-to-GPU communication technologies such as GPUDirect. In our implementation, the aggregate sustained performance of the most compute-intensive part of the code breaks the 1 Tflops double-precision barrier on a single-host system with two GPUs.