A Comprehensive Study on Numerical Issues in GPU Programs

Tables

Table A lists the GitHub repositories included in the study after manual inspection, along with the number of issues included from each project.

GitHub Project Stars Commits Contributors Issues Included
CuPy 6.3K 25799 263 62
TensorFlow 168K 135063 3197 42
PyTorch 58.6K 51541 2430 30
cuDF 5K 35572 194 15
Numba 7.8K 23157 272 14
Transformers 70.1K 10611 1392 10
Ginkgo 234 5459 22 5
cuML 2.9K 14862 116 5
TensorRT 5.9K 415 48 5
Apache-incubator-mxnet 20.1K 11890 874 4
mmdetection 21.3K 2115 356 4
PyTorch-Lightning 20K 7597 744 4
MatX 549 270 15 4
Paddle 18.8K 37448 615 3
Cutlass 2.1K 266 51 3
GPUweb 3.2K 1786 83 3
Onnx 13.2K 2044 237 3
xgboost 23.2K 5923 554 3
pyVista 1.4K 3001 113 2
cuGraph 1.1K 5470 74 2
CUDA.jl 830 7244 113 2
pymc-dev 7K 8892 371 2
google-autoML 5.1K 685 34 2
CLIMA-Oceananigans.jl 707 10595 41 2
ray 22K 14191 722 1
napari 1.5K 2474 128 1
jiesutd-NCRFPP 1.8K 127 9 1
AMDMIGraphX 92 4410 28 1
Thrust 4.1K 4578 74 1
GPytorch 2.9K 3683 86 1
Stumpy 2.4K 1120 27 1
Spark-rapids 450 4587 59 1
UPIT 94 165 5 1
DeepSpeed 7.8K 1116 137 1
CSAROfeen 16 52129 1 1
OpenCV 63.7K 32140 1404 1
catboost 6.7K 23866 293 1
NX 1.9K 1410 57 1
Onnxruntime 7.4K 7334 420 1
mrdoob-three.js 85.2K 40515 1672 1
AMDVLK 1.3K 119 9 1
WGPU 5.6K 3849 260 1
FARM 1.6K 594 110 1
hoomd-blue 215 21504 79 1
blazingsql 1.8K 8208 33 1
awslabs-djl 2.7K 3383 73 1
MIOpen 720 8925 77 1
Kitty 15.9K 10546 199 1
faiss 17.9K 654 98 1
Charm 147 24927 82 1
GPFlow 1.7K 2390 72 1
OpenNMT 5.7K 2598 164 1
glslang 2.3K 4586 238 1
maskrcnn-benchmark 2 252 1 1
futhark 1.8K 11205 62 1
OMEinsum.jl 132 169 7 1
facebookresearch-meshrcnn 1K 21 7 1
uber-ludwig 8.5K 2586 126 1
tensorpack 6.2K 2939 56 1
spaCy 24.2K 15598 626 1
perses 107 3797 14 1
linux 216 983155 1 1
verification-classifier 8 911 9 1
pointcloud 7.7K 13932 471 1
darknet 19.7K 2222 1 1


Floating Point Support in GPU Architecture

GeForce 256 was the first chip to combine the vertex computations for transformation and lighting with the fragment computations on the same chip [1]. This fully integrated graphics engine was named a “graphics processing unit,” or GPU. Offloading the vertex computations from the host enabled higher geometric complexity in games, at the cost of requiring significant floating-point performance. For example, the perspective transformation requires a 4×4 matrix-vector multiply and a perspective division operation.
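As an illustration of that per-vertex arithmetic, the following minimal C sketch applies a 4×4 transformation matrix to a homogeneous vertex and performs the perspective division; the matrix layout and names are illustrative assumptions, not code from [1].

/* Multiply a 4x4 row-major matrix by a homogeneous vertex (x, y, z, w)
 * and perform the perspective division by the resulting w component.
 * Illustrative only: the per-vertex work of the GPU transform stage. */
static void transform_vertex(const float m[16], const float in[4], float out[3])
{
    float clip[4];
    for (int row = 0; row < 4; ++row) {
        clip[row] = m[row * 4 + 0] * in[0] + m[row * 4 + 1] * in[1] +
                    m[row * 4 + 2] * in[2] + m[row * 4 + 3] * in[3];
    }
    /* Perspective division: one divide per output coordinate. */
    out[0] = clip[0] / clip[3];
    out[1] = clip[1] / clip[3];
    out[2] = clip[2] / clip[3];
}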

GeForce 6 had a peak performance of 108 billion single-precision floating-point operations per second (108 GFLOPS) [1]. However, programming these GPUs was challenging because the input and output were very restricted. Fragment shaders only received input from interpolated vertex attributes and textures, and only deposited output into the frame buffer. To harness the FLOPS of these GPUs, programmers had to cast their applications as rendering texture-mapped and blended triangles [1].

The stream processing model influenced the design of GPUs intended for computing [1]. The GeForce 8 GPU, or G80, introduced streaming multiprocessors (SMs) that were used to run both vertex and fragment shaders. These SMs could also run “compute shaders” independent of the graphics pipeline, specifically in the form of CUDA cooperative thread arrays. A shared memory facilitated communication between the threads in an array. The G80 “Tesla” GPU and CUDA reduced the barriers to developing GPU-based scientific and other general-purpose applications [1]. G80 was a massively parallel threaded processor, and CUDA was a thread- and data-parallel C-like programming language.
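As a minimal sketch (not taken from [1]) of this model, the kernel below lets the threads of one cooperative thread array cooperate through shared memory to sum their portion of an input array; it assumes a launch with 256 threads per block.

#include <cuda_runtime.h>

// Each cooperative thread array (thread block) sums its slice of the input,
// using shared memory for intra-array communication as enabled by G80's SMs.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float partial[256];              // shared among the threads of one CTA
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                            // barrier across the CTA

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];
}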

The development of the Fermi generation of GPUs helped to address floating-point issues in GPUs [1][2]. Single-precision floating-point instructions support subnormal numbers by default in hardware. Subnormal numbers are small numbers that lie between zero and the smallest normalized number of a given floating-point number system. Fermi’s floating-point units handle subnormal numbers in hardware, allowing values to gradually underflow to zero with no performance penalty. Additionally, Fermi GPUs added support for double-precision computation.
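The short sketch below makes this gradual underflow visible: it repeatedly halves the smallest normalized single-precision value on the device, producing subnormal results rather than an immediate flush to zero. This is a minimal illustration, assuming the default nvcc settings (no forced flush-to-zero).

#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Divide the smallest normalized float by powers of two. With hardware
// subnormal support, the results pass through the subnormal range
// instead of flushing directly to zero.
__global__ void gradualUnderflow()
{
    float x = FLT_MIN;                 // smallest normalized single-precision value
    for (int i = 0; i < 4; ++i) {
        x = x / 2.0f;                  // now subnormal: between 0 and FLT_MIN
        printf("step %d: %.8e\n", i, x);
    }
}

int main()
{
    gradualUnderflow<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}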

Comparison of Fermi with previous GPUs [2]

With the development of the Kepler GPUs, improved support for atomic memory operations helped parallel programs gain massive performance benefits [3]. The throughput of global-memory atomic operations on Kepler GK110/210 is substantially improved compared to the Fermi generation. Atomic-operation throughput to a common global memory address is improved by 9x, to one operation per clock. Atomic-operation throughput to independent global addresses is also significantly accelerated, and the logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that some algorithms previously required to consolidate results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports atomicMin, atomicMax, atomicAnd, atomicOr, and atomicXor. Other atomic operations that are not supported natively (for example, 64-bit floating-point atomics) may be emulated using the compare-and-swap (CAS) instruction, as sketched below.
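The following device function is a minimal sketch of that CAS-based emulation for a double-precision atomic add, in the style of the well-known pattern from NVIDIA's CUDA documentation; the function name is ours.

#include <cuda_runtime.h>

// Emulate a 64-bit floating-point atomic add with the 64-bit
// compare-and-swap instruction, as described above for atomics
// that lack native hardware support.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr = reinterpret_cast<unsigned long long int *>(address);
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        double updated = __longlong_as_double(assumed) + val;
        old = atomicCAS(addr, assumed, __double_as_longlong(updated));
        // Retry if another thread changed the value between the read and the CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}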

According to Dally et al. [1], the next level of support is a set of numerical libraries, including cuBLAS, cuSPARSE, and cuFFT, that provide highly optimized code for the key functions of many numerical programs. Applications can then leverage these libraries to exploit GPU capabilities. Today, over 600 HPC applications are accelerated by GPUs, including molecular dynamics codes such as GROMACS, NAMD, AMBER, and LAMMPS; weather codes such as WRF; fluid dynamics codes such as ANSYS and OpenFOAM; chemistry codes such as Gaussian, VASP, Quantum Espresso, and GAMESS; and structural analysis codes such as LS-DYNA and ANSYS [1].
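As an illustration of how an application might call one of these libraries rather than write its own kernel, the minimal cuBLAS sketch below multiplies two single-precision matrices; the matrix size and values are arbitrary, and error checking is omitted for brevity (link with -lcublas).

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// C = alpha * A * B + beta * C on the GPU via cuBLAS (column-major layout).
int main()
{
    const int n = 512;
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}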

Pascal GPUs provide superior scheduling and overlapped load/store instructions to increase floating-point utilization [4]. New innovations in the Pascal architecture, including native 16-bit floating-point (FP16) precision, allow GP100 to deliver great speedups for many deep learning algorithms. These algorithms do not require high levels of floating-point precision, but they gain large benefits from the additional computational power FP16 affords and from the reduced storage requirements for 16-bit datatypes. Storing FP16 data rather than higher-precision FP32 or FP64 reduces the memory usage of the neural network and thus allows training and deploying larger networks. Using FP16 computation improves performance by up to 2x compared to FP32 arithmetic, and FP16 data transfers likewise take less time than FP32 or FP64 transfers. Also, the atomic addition operation in global memory has been extended to include FP64 data. The atomicAdd() function in CUDA now applies to 32- and 64-bit integer and floating-point data. The rounding mode is round-to-nearest-even for all floating-point atomic add operations (formerly, FP32 atomic addition used round-to-zero).
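The sketch below illustrates both capabilities: a half-precision fused multiply-add kernel and a reduction that relies on the native double-precision atomicAdd. It is an illustrative sketch rather than whitepaper code, and assumes compilation for sm_60 or newer (the FP16 intrinsics require sm_53 or newer).

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Half-precision elementwise multiply-add: FP16 halves the storage per
// element and, on GP100-class hardware, can roughly double throughput.
__global__ void fp16Fma(const __half *a, const __half *b, __half *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __hfma computes a*b + c entirely in FP16.
        out[i] = __hfma(a[i], b[i], out[i]);
    }
}

// On Pascal (sm_60) and newer, atomicAdd also accepts double operands natively.
__global__ void sumDouble(const double *in, double *total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);
}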

Tesla P100 Compared to Prior Generation Tesla Products [4]

Tesla V100 GPUs deliver industry-leading floating-point and integer performance [5]. The Tesla V100 GPU contains 640 Tensor Cores: eight (8) per SM and two (2) per processing block (partition) within an SM. In Volta GV100, each Tensor Core performs 64 floating-point FMA operations per clock, so the eight Tensor Cores in an SM perform a total of 512 FMA operations (or 1,024 individual floating-point operations) per clock. For deep learning inference, V100 Tensor Cores provide up to 6x higher peak TFLOPS compared to standard FP16 operations on P100. Tensor Cores and their associated data paths are custom-designed to dramatically increase floating-point compute throughput with high energy efficiency.
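The kernel below is a minimal sketch of how a program drives the Tensor Cores through CUDA's warp matrix (WMMA) API on a single 16x16x16 tile; it illustrates the mixed FP16-input/FP32-accumulate operation described above, is not code from the whitepaper, and assumes compilation for sm_70 or newer.

#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C on a single 16x16x16 tile using Tensor Cores.
__global__ void wmmaTile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);          // start the FP32 accumulator at zero
    wmma::load_matrix_sync(aFrag, a, 16);      // load the FP16 input tiles
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag); // one Tensor Core matrix multiply-accumulate
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}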

Comparison of NVIDIA Tesla GPUs [5]

A Turing TU102 GPU contains 576 Tensor Cores: eight per SM and two per processing block within an SM [6]. Each Tensor Core can perform up to 64 floating-point fused multiply-add (FMA) operations per clock using FP16 inputs. Eight Tensor Cores in an SM perform a total of 512 FP16 multiply-and-accumulate operations per clock, or 1,024 total FP operations per clock. The new INT8 precision mode works at double this rate, or 2,048 integer operations per clock. Turing Tensor Cores provide significant speedups to matrix operations and are used for both deep learning training and inference operations, in addition to new neural graphics functions.

Comparison of NVIDIA Pascal GP104 and Turing TU106 GPUs [6]

Ampere GPUs add TensorFloat-32 support to the GPU architecture [1][7]. The new TensorFloat-32 (TF32) Tensor Core operations in A100 provide an easy path to accelerate FP32 input/output data in DL frameworks and HPC, running 10x faster than V100 FP32 FMA operations, or 20x faster with sparsity. The Tensor Core architecture also supports high-throughput computation for different numerical representations, including binary (INT1), INT4, INT8, FP16, and BFloat16 (8-bit exponent, 7-bit mantissa). For FP16/FP32 mixed-precision DL, the A100 Tensor Core delivers 2.5x the performance of V100, increasing to 5x with sparsity. New BFloat16 (BF16)/FP32 mixed-precision Tensor Core operations run at the same rate as FP16/FP32 mixed precision. Tensor Core acceleration of INT8, INT4, and binary rounds out support for DL inferencing, with A100 sparse INT8 running 20x faster than V100 INT8. For HPC, the A100 Tensor Core includes new IEEE-compliant FP64 processing that delivers 2.5x the FP64 performance of V100.
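To make the precision trade-off of these reduced formats concrete, the minimal sketch below (assuming CUDA 11 or newer and the cuda_bf16.h header) round-trips a float through BF16 so that the effect of the 7-bit mantissa is visible.

#include <cuda_bf16.h>
#include <cstdio>
#include <cuda_runtime.h>

// Round-trip a float through BFloat16 (8-bit exponent, 7-bit mantissa) to show
// the precision given up in exchange for FP32-like dynamic range.
__global__ void bf16RoundTrip(float x)
{
    __nv_bfloat16 b = __float2bfloat16(x);   // keeps roughly 3 significant decimal digits
    float back = __bfloat162float(b);
    printf("original = %.9f, after BF16 round trip = %.9f\n", x, back);
}

int main()
{
    bf16RoundTrip<<<1, 1>>>(1.2345678f);
    cudaDeviceSynchronize();
    return 0;
}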

Comparison of NVIDIA Data Center GPUs [7]

Raw Data

GPU-NBDetect

Link to GPU-NBDetect

Figure 1. Overview of Detecting Numerical Bugs in GPU Programs

Overview of GPU-NBDetect

Our tool, GPU-NBDetect, is designed based on the findings of our study, which are summarized in Table 1 of our paper. Numerical bug detection in GPU-NBDetect begins with test input generation (F-7 and I-7 in Table 1), followed by differential testing against CPU programs (F-8 and I-8 in Table 1), and then applies varied manifestation strategies to trigger the bug (F-9, F-10, I-9, and I-10 in Table 1), based on the debugging techniques observed among GPU developers (F-11 to F-14 and I-11 to I-14 in Table 1). Next, we highlight each component of GPU-NBDetect.

GPU-NBDetect uses a template-based approach to generate both GPU and CPU programs. We leverage Python to dynamically generate C and CUDA files. For CPU program generation, the template reads an operator from a predefined operator list and integrates it into the main function. It also defines the input variables as arguments and includes the necessary header files, such as math.h, and the floating-point exception commands. For GPU program generation, the template handles both the host and the device code. The host code manages memory allocation, memory copying, and kernel invocation for the device-side operations. The necessary mathematical operators are inserted directly into the device code from the operator list. Once the CPU and GPU program templates are generated, we compile them with gcc for the CPU code and nvcc for the GPU code to produce the respective object files. The compiled programs are then executed using Python's subprocess module.
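To make this concrete, the following is a minimal sketch of what a generated GPU test program might look like; the operator (sqrt here), variable names, and structure are illustrative assumptions rather than GPU-NBDetect's actual template output.

#include <cstdio>
#include <cstdlib>
#include <math.h>
#include <cuda_runtime.h>

// Device code: the mathematical operator from the operator list is inserted here.
__global__ void evaluateOp(double x, double *result)
{
    *result = sqrt(x);   // placeholder for the operator under test
}

// Host code: memory allocation, memory copy, and kernel invocation.
int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    double x = atof(argv[1]);        // test input supplied by the driver script
    double *dResult, hResult;
    cudaMalloc(&dResult, sizeof(double));
    evaluateOp<<<1, 1>>>(x, dResult);
    cudaMemcpy(&hResult, dResult, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%.17g\n", hResult);      // parsed by the testing harness
    cudaFree(dResult);
    return 0;
}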

The process of GPU-NBDetect begins with user-defined inputs, which are typically provided by GPU programmers or users who rely on GPU applications. In our case, we refer to the NVIDIA Math Library to ensure the inputs are within the supported range and select appropriate values based on each mathematical operator.

During our study, we observed that developers often test GPU programs with random inputs to identify bugs. For example, to detect rounding errors, GPU developers might experiment with values like 0.5, 0.499999, or 0.50000001. To automate this process and make it more efficient, we developed seven mutation strategies inspired by existing fuzzing techniques, offering a more systematic and affordable approach to generating test inputs. To minimize false positives, each input is subjected to two constraints: 1) verifying that the inputs fall within the supported range for floating-point numbers, and 2) ensuring that the inputs do not overlap with known bug-triggering inputs as identified in the NVIDIA Math Library documentation.
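As an illustration of the idea (in host C++, since the generated programs are C/CUDA), the sketch below applies two mutation patterns in this spirit together with the two constraints. The specific mutations shown are examples, not the tool's actual seven strategies, and the check against documented bug-triggering inputs is left as a placeholder.

#include <cmath>
#include <vector>

// Two illustrative input mutations: nudging a value across a rounding
// boundary and scaling it toward the overflow/underflow ends of the
// double-precision range.
std::vector<double> mutate(double seed)
{
    std::vector<double> candidates = {
        std::nextafter(seed, INFINITY),   // smallest step up: probes rounding
        std::nextafter(seed, -INFINITY),  // smallest step down
        std::ldexp(seed, 512),            // push toward overflow
        std::ldexp(seed, -512)            // push toward underflow
    };

    std::vector<double> accepted;
    for (double c : candidates) {
        // Constraint 1: stay within the representable floating-point range.
        if (!std::isfinite(c)) continue;
        // Constraint 2 (placeholder): skip inputs already documented as
        // bug-triggering in the NVIDIA Math Library documentation.
        accepted.push_back(c);
    }
    return accepted;
}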

Next, we implemented the manifestation strategies outlined in Table IV, which were inspired by our findings from the study (Table 1 in our paper). To determine the appropriate thresholds for assertions in each manifestation strategy, we tested multiple error bounds to find the most effective range. Bug samples that could not be consistently detected within a specific error bound were excluded, which helped reduce false positives. The selected thresholds strike a balance between accurately detecting bugs and maintaining meaningful comparisons between expected and actual results.

From our findings, we observed that GPU programmers frequently compare the results of GPU programs with their CPU counterparts to confirm the presence of a bug. This observation led us to implement differential testing in GPU-NBDetect to initially verify whether GPU programs exhibit numerical errors. Once a bug is detected, we use assertions and manifestation strategies to further classify the type of bug involved.
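The following sketch illustrates this differential-testing step for a single operator; the operator (exp), the input, and the relative-error threshold are illustrative assumptions, not the bounds GPU-NBDetect actually uses.

#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpuExp(double x, double *out) { *out = exp(x); }

// Differential test sketch: compare a GPU result against its CPU counterpart,
// then classify the discrepancy with simple assertion-style checks.
int main()
{
    double input = 709.5;                       // near the overflow edge of exp()
    double cpu = std::exp(input);

    double *dOut, gpu;
    cudaMalloc(&dOut, sizeof(double));
    gpuExp<<<1, 1>>>(input, dOut);
    cudaMemcpy(&gpu, dOut, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    if (std::isinf(gpu) || std::isnan(gpu)) {
        printf("special value / overflow bug candidate: GPU returned %f\n", gpu);
    } else if (std::fabs(gpu - cpu) / std::fabs(cpu) > 1e-12) {
        printf("rounding / accuracy bug candidate: cpu=%.17g gpu=%.17g\n", cpu, gpu);
    } else {
        printf("results agree within the error bound\n");
    }
    return 0;
}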

Currently, GPU-NBDetect is capable of detecting overflow, underflow, rounding, special value bugs, as well as casting and data type bugs. For each mathematical function, GPU-NBDetect generates 100 input samples to evaluate its effectiveness in identifying potential GPU numerical bugs. Given the inherent randomness of the mutation operators, we conducted three independent runs to ensure the accuracy and consistency of the results.

Eliminating False Positives:

GPU-NBDetect identified 81 bugs, of which 57 were confirmed by developers. As detailed in Table IV, manifestation strategies one through six are designed to reliably detect numerical bugs. To further reduce false positives, we implemented specific constraints during the input generation process: 1) verifying that the inputs are within the supported range for floating-point numbers, and 2) ensuring that these inputs do not overlap with known bug-triggering inputs listed in the NVIDIA Math Library documentation.

However, for strategies seven through nine, false positives may still occur. This is due to the reliance on profiling mechanisms, which require both detailed profilers and a wide range of inputs to accurately determine the presence of a bug. Nonetheless, we have minimized the occurrence of false positives by carefully refining the input generation process and applying these constraints.

Confirmation with developers:

We reported the detected bugs on the NVIDIA developer forum, where NVIDIA developers could confirm the results or provide insights regarding them. Our testing was conducted using the CUDA Math library. In some cases, the NVIDIA Math Library documentation indicated that certain inputs might produce errors, but it was unclear whether these errors fell within the supported range of inputs. GPU-NBDetect successfully identified such numerical bugs, and we reported them to NVIDIA. In response, developers wrote unit tests to confirm the presence of these bugs. Additionally, developers acknowledged that GPUs lack a built-in mechanism to detect exceptions, and our tool identified exception-triggering inputs that were within the valid range of supported inputs.

Fixing these identified numerical bugs presents a challenge due to the nature of GPU architecture. Developers would need to implement floating-point exception flags, rounding registers, and conditional statements in the GPU code to address these issues. Although these measures can resolve numerical bugs, they may also introduce performance-related problems, as GPUs are optimized for parallel computations. For example, addressing overflow, underflow, and special value bugs requires developing exception-handling mechanisms, which can be computationally expensive and may negatively affect performance.

Given the complexity and performance trade-offs, it is often more practical to rely on third-party tools to detect such bugs. Our tool, GPU-NBDetect, provides a valuable starting point for identifying and addressing these issues without compromising the efficiency of GPU computations.

Comparison with other tools:

While existing tools primarily focus on detecting overflow, underflow, and special value bugs, GPU-NBDetect is designed to detect a broader range of bug types. Furthermore, these tools typically rely on user-defined inputs and lack an automated test input generation strategy, which is a key feature of GPU-NBDetect. This limitation makes it more challenging to reliably identify bug-triggering inputs with those tools.

Additionally, a direct comparison between GPU-NBDetect and other tools may not be entirely fair, as GPU-NBDetect emphasizes both bug detection and input generation techniques. However, incorporating the input-generation capabilities of GPU-NBDetect into existing tools could greatly enhance their effectiveness, highlighting another significant contribution our tool offers to the community.

Examples:

Next, we highlight some of the interesting examples detected by GPU-NBDetect.

Script to run on GitHub Archive (via Google BigQuery)

SELECT DISTINCT issue_html, issue_title
FROM (
  SELECT type,
         repo.name AS repo,
         JSON_EXTRACT(payload, '$.issue.html_url') AS issue_html,
         JSON_EXTRACT(payload, '$.issue.title') AS issue_title,
         JSON_EXTRACT(payload, '$.issue.body') AS issue_desc,
         JSON_EXTRACT(payload, '$.issue.state') AS issue_state
  FROM `githubarchive.year.2019`
  WHERE type = 'IssuesEvent'
)
WHERE (lower(issue_title) LIKE '%gpu%' OR lower(issue_desc) LIKE '%gpu%')
   OR (lower(issue_title) LIKE '%gpu programming%' OR lower(issue_desc) LIKE '%gpu programming%')
ORDER BY issue_html DESC;

References:

1. Dally, William J., et al. “Evolution of the Graphics Processing Unit (GPU).” IEEE Micro, vol. 41, no. 6, 2021, pp. 42–51., doi:10.1109/mm.2021.3113475.
2. Nvidia Fermi Architecture Whitepaper. www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf.
3. CUDA Compute Architecture: Kepler GK110/210 Whitepaper - Nvidia. www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf.
4. NVIDIA Tesla P100 (Pascal Architecture) Whitepaper - Nvidia.
5. NVIDIA Tesla V100 (Volta Architecture) Whitepaper - Nvidia.
6. Turing Architecture Whitepaper - Nvidia. images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
7. NVIDIA A100 Tensor Core GPU Architecture (Ampere) Whitepaper - Nvidia. images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.