A Comprehensive Study on Numerical Issues in GPU Programs

Tables

Table A lists the GitHub repositories included in the study after manual inspection, along with the number of issues included from each project.

GitHub Project Stars Commits Contributors Issues Included
CuPy 6.3K 25799 263 62
TensorFlow 168K 135063 3197 42
PyTorch 58.6K 51541 2430 30
cuDF 5K 35572 194 15
Numba 7.8K 23157 272 14
Transformers 70.1K 10611 1392 10
Ginkgo 234 5459 22 5
cuML 2.9K 14862 116 5
TensorRT 5.9K 415 48 5
Apache-incubator-mxnet 20.1K 11890 874 4
mmdetection 21.3K 2115 356 4
PyTorch-Lightning 20K 7597 744 4
MatX 549 270 15 4
Paddle 18.8K 37448 615 3
Cutlass 2.1K 266 51 3
GPUweb 3.2K 1786 83 3
Onnx 13.2K 2044 237 3
xgboost 23.2K 5923 554 3
pyVista 1.4K 3001 113 2
cuGraph 1.1K 5470 74 2
CUDA.jl 830 7244 113 2
pymc-dev 7K 8892 371 2
google-autoML 5.1K 685 34 2
CLIMA-Oceananigans.jl 707 10595 41 2
ray 22K 14191 722 1
napari 1.5K 2474 128 1
jiesutd-NCRFPP 1.8K 127 9 1
AMDMIGraphX 92 4410 28 1
Thrust 4.1K 4578 74 1
GPytorch 2.9K 3683 86 1
Stumpy 2.4K 1120 27 1
Spark-rapids 450 4587 59 1
UPIT 94 165 5 1
DeepSpeed 7.8K 1116 137 1
CSAROfeen 16 52129 1 1
OpenCV 63.7K 32140 1404 1
catboost 6.7K 23866 293 1
NX 1.9K 1410 57 1
Onnxruntime 7.4K 7334 420 1
mrdoob-three.js 85.2K 40515 1672 1
AMDVLK 1.3K 119 9 1
WGPU 5.6K 3849 260 1
FARM 1.6K 594 110 1
hoomd-blue 215 21504 79 1
blazingsql 1.8K 8208 33 1
awslabs-djl 2.7K 3383 73 1
MIOpen 720 8925 77 1
Kitty 15.9K 10546 199 1
faiss 17.9K 654 98 1
Charm 147 24927 82 1
GPFlow 1.7K 2390 72 1
OpenNMT 5.7K 2598 164 1
glslang 2.3K 4586 238 1
maskrcnn-benchmark 2 252 1 1
futhark 1.8K 11205 62 1
OMEinsum.jl 132 169 7 1
facebookresearch-meshrcnn 1K 21 7 1
uber-ludwig 8.5K 2586 126 1
tensorpack 6.2K 2939 56 1
spaCy 24.2K 15598 626 1
perses 107 3797 14 1
linux 216 983155 1 1
verification-classifier 8 911 9 1
pointcloud 7.7K 13932 471 1
darknet 19.7K 2222 1 1


Floating Point Support in GPU Architecture

GeForce 256 was the first chip to combine the vertex computations for transformation and lighting with the fragment computations on the same chip [1]. This fully integrated graphics engine was named a “graphics processing unit,” or GPU. Offloading the vertex computations from the host enabled higher geometric complexity in games, at the cost of requiring significant floating-point performance. For example, the perspective transformation requires a 4×4 matrix-vector multiply and a perspective division operation.
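As an illustration of that per-vertex arithmetic, the following minimal C sketch applies a 4×4 transformation matrix to a homogeneous vertex and performs the perspective division; the matrix layout and names are illustrative assumptions, not code from [1].

/* Multiply a 4x4 row-major matrix by a homogeneous vertex (x, y, z, w)
 * and perform the perspective division by the resulting w component.
 * Illustrative only: the per-vertex work of the GPU transform stage. */
static void transform_vertex(const float m[16], const float in[4], float out[3])
{
    float clip[4];
    for (int row = 0; row < 4; ++row) {
        clip[row] = m[row * 4 + 0] * in[0] + m[row * 4 + 1] * in[1] +
                    m[row * 4 + 2] * in[2] + m[row * 4 + 3] * in[3];
    }
    /* Perspective division: one divide per output coordinate. */
    out[0] = clip[0] / clip[3];
    out[1] = clip[1] / clip[3];
    out[2] = clip[2] / clip[3];
}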

GeForce 6 had a peak performance of 108 billion single-precision floating-point operations per second (108 GFLOPS) [1]. However, programming these GPUs was challenging because the input and output were very restricted. Fragment shaders only received input from interpolated vertex attributes and textures, and only deposited output into the frame buffer. To harness the FLOPS of these GPUs, programmers had to cast their applications as rendering texture-mapped and blended triangles [1].

The stream processing model influenced the design of GPUs intended for computing [1]. The GeForce 8 GPU, or G80, introduced streaming multiprocessors (SMs) that were used to run both vertex and fragment shaders. These SMs could also run “compute shaders” independent of the graphics pipeline, specifically in the form of CUDA cooperative thread arrays. A shared memory facilitated communication between the threads in an array. The G80 “Tesla” GPU and CUDA reduced the barriers to developing GPU-based scientific and other general-purpose applications [1]. G80 was a massively parallel threaded processor, and CUDA was a thread- and data-parallel C-like programming language.
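As a minimal sketch (not taken from [1]) of this model, the kernel below lets the threads of one cooperative thread array cooperate through shared memory to sum their portion of an input array; it assumes a launch with 256 threads per block.

#include <cuda_runtime.h>

// Each cooperative thread array (thread block) sums its slice of the input,
// using shared memory for intra-array communication as enabled by G80's SMs.
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float partial[256];              // shared among the threads of one CTA
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                            // barrier across the CTA

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];
}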

The development of the Fermi generation of GPUs helped to address floating-point issues in GPUs [1][2]. Single-precision floating-point instructions support subnormal numbers by default in hardware. Subnormal numbers are small numbers that lie between zero and the smallest normalized number of a given floating-point number system. Fermi’s floating-point units handle subnormal numbers in hardware, allowing values to gradually underflow to zero with no performance penalty. Additionally, Fermi GPUs added support for double-precision computation.
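The short sketch below makes this gradual underflow visible: it repeatedly halves the smallest normalized single-precision value on the device, producing subnormal results rather than an immediate flush to zero. This is a minimal illustration, assuming the default nvcc settings (no forced flush-to-zero).

#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Divide the smallest normalized float by powers of two. With hardware
// subnormal support, the results pass through the subnormal range
// instead of flushing directly to zero.
__global__ void gradualUnderflow()
{
    float x = FLT_MIN;                 // smallest normalized single-precision value
    for (int i = 0; i < 4; ++i) {
        x = x / 2.0f;                  // now subnormal: between 0 and FLT_MIN
        printf("step %d: %.8e\n", i, x);
    }
}

int main()
{
    gradualUnderflow<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}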

Comparison of Fermi with previous GPUs [2]

With the development of the Kepler GPUs, improved support for atomic memory operations helped parallel programs gain massive performance benefits [3]. The throughput of global-memory atomic operations on Kepler GK110/210 is substantially improved compared to the Fermi generation. Atomic-operation throughput to a common global memory address is improved by 9x, to one operation per clock. Atomic-operation throughput to independent global addresses is also significantly accelerated, and the logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that some algorithms previously required to consolidate results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports atomicMin, atomicMax, atomicAnd, atomicOr, and atomicXor. Other atomic operations that are not supported natively (for example, 64-bit floating-point atomics) may be emulated using the compare-and-swap (CAS) instruction, as sketched below.
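The following device function is a minimal sketch of that CAS-based emulation for a double-precision atomic add, in the style of the well-known pattern from NVIDIA's CUDA documentation; the function name is ours.

#include <cuda_runtime.h>

// Emulate a 64-bit floating-point atomic add with the 64-bit
// compare-and-swap instruction, as described above for atomics
// that lack native hardware support.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr = reinterpret_cast<unsigned long long int *>(address);
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        double updated = __longlong_as_double(assumed) + val;
        old = atomicCAS(addr, assumed, __double_as_longlong(updated));
        // Retry if another thread changed the value between the read and the CAS.
    } while (assumed != old);
    return __longlong_as_double(old);
}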

According to Dally et al. [1], the next level of support is a set of numerical libraries, including cuBLAS, cuSPARSE, and cuFFT, that provide highly optimized code for the key functions of many numerical programs. Applications can then leverage these libraries to exploit GPU capabilities. Today, over 600 HPC applications are accelerated by GPUs, including molecular dynamics codes such as GROMACS, NAMD, AMBER, and LAMMPS; weather codes such as WRF; fluid dynamics codes such as ANSYS and OpenFOAM; chemistry codes such as Gaussian, VASP, Quantum Espresso, and GAMESS; and structural analysis codes such as LS-DYNA and ANSYS [1].
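As an illustration of how an application might call one of these libraries rather than write its own kernel, the minimal cuBLAS sketch below multiplies two single-precision matrices; the matrix size and values are arbitrary, and error checking is omitted for brevity (link with -lcublas).

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// C = alpha * A * B + beta * C on the GPU via cuBLAS (column-major layout).
int main()
{
    const int n = 512;
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}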

Pascal GPUs provide superior scheduling and overlapped load/store instructions to increase floating-point utilization [4]. New innovations in the Pascal architecture, including native 16-bit floating-point (FP16) precision, allow GP100 to deliver great speedups for many deep learning algorithms. These algorithms do not require high levels of floating-point precision, but they gain large benefits from the additional computational power FP16 affords and from the reduced storage requirements for 16-bit datatypes. Storing FP16 data rather than higher-precision FP32 or FP64 reduces the memory usage of the neural network and thus allows training and deploying larger networks. Using FP16 computation improves performance by up to 2x compared to FP32 arithmetic, and FP16 data transfers likewise take less time than FP32 or FP64 transfers. Also, the atomic addition operation in global memory has been extended to include FP64 data. The atomicAdd() function in CUDA now applies to 32- and 64-bit integer and floating-point data. The rounding mode is round-to-nearest-even for all floating-point atomic add operations (formerly, FP32 atomic addition used round-to-zero).
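The sketch below illustrates both capabilities: a half-precision fused multiply-add kernel and a reduction that relies on the native double-precision atomicAdd. It is an illustrative sketch rather than whitepaper code, and assumes compilation for sm_60 or newer (the FP16 intrinsics require sm_53 or newer).

#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Half-precision elementwise multiply-add: FP16 halves the storage per
// element and, on GP100-class hardware, can roughly double throughput.
__global__ void fp16Fma(const __half *a, const __half *b, __half *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __hfma computes a*b + c entirely in FP16.
        out[i] = __hfma(a[i], b[i], out[i]);
    }
}

// On Pascal (sm_60) and newer, atomicAdd also accepts double operands natively.
__global__ void sumDouble(const double *in, double *total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);
}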

Tesla P100 Compared to Prior Generation Tesla Products [4]

Tesla V100 GPUs deliver industry-leading floating-point and integer performance [5]. The Tesla V100 GPU contains 640 Tensor Cores: eight (8) per SM and two (2) per processing block (partition) within an SM. In Volta GV100, each Tensor Core performs 64 floating-point FMA operations per clock, so the eight Tensor Cores in an SM perform a total of 512 FMA operations (or 1,024 individual floating-point operations) per clock. For deep learning inference, V100 Tensor Cores provide up to 6x higher peak TFLOPS compared to standard FP16 operations on P100. Tensor Cores and their associated data paths are custom-designed to dramatically increase floating-point compute throughput with high energy efficiency.
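The kernel below is a minimal sketch of how a program drives the Tensor Cores through CUDA's warp matrix (WMMA) API on a single 16x16x16 tile; it illustrates the mixed FP16-input/FP32-accumulate operation described above, is not code from the whitepaper, and assumes compilation for sm_70 or newer.

#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C on a single 16x16x16 tile using Tensor Cores.
__global__ void wmmaTile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);          // start the FP32 accumulator at zero
    wmma::load_matrix_sync(aFrag, a, 16);      // load the FP16 input tiles
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag); // one Tensor Core matrix multiply-accumulate
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}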

Comparison of NVIDIA Tesla GPUs [5]

A Turing TU102 GPU contains 576 Tensor Cores: eight per SM and two per processing block within an SM [6]. Each Tensor Core can perform up to 64 floating-point fused multiply-add (FMA) operations per clock using FP16 inputs. Eight Tensor Cores in an SM perform a total of 512 FP16 multiply-and-accumulate operations per clock, or 1,024 total FP operations per clock. The new INT8 precision mode works at double this rate, or 2,048 integer operations per clock. Turing Tensor Cores provide significant speedups to matrix operations and are used for both deep learning training and inference operations, in addition to new neural graphics functions.

Comparison of NVIDIA Pascal GP104 and Turing TU106 GPUs [6]

Ampere GPUs add TensorFloat-32 support to the GPU architecture [1][7]. The new TensorFloat-32 (TF32) Tensor Core operations in A100 provide an easy path to accelerate FP32 input/output data in DL frameworks and HPC, running 10x faster than V100 FP32 FMA operations, or 20x faster with sparsity. The Tensor Core architecture also supports high-throughput computation for different numerical representations, including binary (INT1), INT4, INT8, FP16, and BFloat16 (8-bit exponent, 7-bit mantissa). For FP16/FP32 mixed-precision DL, the A100 Tensor Core delivers 2.5x the performance of V100, increasing to 5x with sparsity. New BFloat16 (BF16)/FP32 mixed-precision Tensor Core operations run at the same rate as FP16/FP32 mixed precision. Tensor Core acceleration of INT8, INT4, and binary rounds out support for DL inferencing, with A100 sparse INT8 running 20x faster than V100 INT8. For HPC, the A100 Tensor Core includes new IEEE-compliant FP64 processing that delivers 2.5x the FP64 performance of V100.
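To make the precision trade-off of these reduced formats concrete, the minimal sketch below (assuming CUDA 11 or newer and the cuda_bf16.h header) round-trips a float through BF16 so that the effect of the 7-bit mantissa is visible.

#include <cuda_bf16.h>
#include <cstdio>
#include <cuda_runtime.h>

// Round-trip a float through BFloat16 (8-bit exponent, 7-bit mantissa) to show
// the precision given up in exchange for FP32-like dynamic range.
__global__ void bf16RoundTrip(float x)
{
    __nv_bfloat16 b = __float2bfloat16(x);   // keeps roughly 3 significant decimal digits
    float back = __bfloat162float(b);
    printf("original = %.9f, after BF16 round trip = %.9f\n", x, back);
}

int main()
{
    bf16RoundTrip<<<1, 1>>>(1.2345678f);
    cudaDeviceSynchronize();
    return 0;
}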

Comparison of NVIDIA Data Center GPUs [7]

Raw Data

GPU-NBDetect

Link to GPU-NBDetect

Figure 1. Overview of Detecting Numerical Bugs in GPU Programs

Overview of GPU-NBDetect

Our tool, GPU-NBDetect, is designed based on the findings of our study, which are summarized in Table 1 of our paper. Numerical bug detection in GPU-NBDetect begins with test input generation (F-7 and I-7 in Table 1), followed by differential testing against CPU programs (F-8 and I-8 in Table 1), and then applies varied manifestation strategies to trigger the bug (F-9, F-10, I-9, and I-10 in Table 1), based on the debugging techniques observed among GPU developers (F-11 to F-14 and I-11 to I-14 in Table 1). Next, we highlight each component of GPU-NBDetect.

GPU-NBDetect uses a template-based approach to generate both GPU and CPU programs. We leverage Python to dynamically generate C and CUDA files. For CPU program generation, the template reads an operator from a predefined operator list and integrates it into the main function. It also defines the input variables as arguments and includes the necessary header files, such as math.h, and the floating-point exception commands. For GPU program generation, the template handles both the host and the device code. The host code manages memory allocation, memory copying, and kernel invocation for the device-side operations. The necessary mathematical operators are inserted directly into the device code from the operator list. Once the CPU and GPU program templates are generated, we compile them with gcc for the CPU code and nvcc for the GPU code to produce the respective object files. The compiled programs are then executed using Python's subprocess module.
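To make this concrete, the following is a minimal sketch of what a generated GPU test program might look like; the operator (sqrt here), variable names, and structure are illustrative assumptions rather than GPU-NBDetect's actual template output.

#include <cstdio>
#include <cstdlib>
#include <math.h>
#include <cuda_runtime.h>

// Device code: the mathematical operator from the operator list is inserted here.
__global__ void evaluateOp(double x, double *result)
{
    *result = sqrt(x);   // placeholder for the operator under test
}

// Host code: memory allocation, memory copy, and kernel invocation.
int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    double x = atof(argv[1]);        // test input supplied by the driver script
    double *dResult, hResult;
    cudaMalloc(&dResult, sizeof(double));
    evaluateOp<<<1, 1>>>(x, dResult);
    cudaMemcpy(&hResult, dResult, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%.17g\n", hResult);      // parsed by the testing harness
    cudaFree(dResult);
    return 0;
}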

The process of GPU-NBDetect begins with user-defined inputs, which are typically provided by GPU programmers or users who rely on GPU applications. In our case, we refer to the NVIDIA Math Library to ensure the inputs are within the supported range and select appropriate values based on each mathematical operator.

During our study, we observed that developers often test GPU programs with random inputs to identify bugs. For example, to detect rounding errors, GPU developers might experiment with values like 0.5, 0.499999, or 0.50000001. To automate this process and make it more efficient, we developed seven mutation strategies inspired by existing fuzzing techniques, offering a more systematic and affordable approach to generating test inputs. To minimize false positives, each input is subjected to two constraints: 1) verifying that the inputs fall within the supported range for floating-point numbers, and 2) ensuring that the inputs do not overlap with known bug-triggering inputs as identified in the NVIDIA Math Library documentation.
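As an illustration of the idea (in host C++, since the generated programs are C/CUDA), the sketch below applies two mutation patterns in this spirit together with the two constraints. The specific mutations shown are examples, not the tool's actual seven strategies, and the check against documented bug-triggering inputs is left as a placeholder.

#include <cmath>
#include <vector>

// Two illustrative input mutations: nudging a value across a rounding
// boundary and scaling it toward the overflow/underflow ends of the
// double-precision range.
std::vector<double> mutate(double seed)
{
    std::vector<double> candidates = {
        std::nextafter(seed, INFINITY),   // smallest step up: probes rounding
        std::nextafter(seed, -INFINITY),  // smallest step down
        std::ldexp(seed, 512),            // push toward overflow
        std::ldexp(seed, -512)            // push toward underflow
    };

    std::vector<double> accepted;
    for (double c : candidates) {
        // Constraint 1: stay within the representable floating-point range.
        if (!std::isfinite(c)) continue;
        // Constraint 2 (placeholder): skip inputs already documented as
        // bug-triggering in the NVIDIA Math Library documentation.
        accepted.push_back(c);
    }
    return accepted;
}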

Next, we implemented the manifestation strategies outlined in Table IV, which were inspired by our findings from the study (Table 1 in our paper). To determine the appropriate thresholds for assertions in each manifestation strategy, we tested multiple error bounds to find the most effective range. Bug samples that could not be consistently detected within a specific error bound were excluded, which helped reduce false positives. The selected thresholds strike a balance between accurately detecting bugs and maintaining meaningful comparisons between expected and actual results.

From our findings, we observed that GPU programmers frequently compare the results of GPU programs with their CPU counterparts to confirm the presence of a bug. This observation led us to implement differential testing in GPU-NBDetect to initially verify whether GPU programs exhibit numerical errors. Once a bug is detected, we use assertions and manifestation strategies to further classify the type of bug involved.
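The following sketch illustrates this differential-testing step for a single operator; the operator (exp), the input, and the relative-error threshold are illustrative assumptions, not the bounds GPU-NBDetect actually uses.

#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpuExp(double x, double *out) { *out = exp(x); }

// Differential test sketch: compare a GPU result against its CPU counterpart,
// then classify the discrepancy with simple assertion-style checks.
int main()
{
    double input = 709.5;                       // near the overflow edge of exp()
    double cpu = std::exp(input);

    double *dOut, gpu;
    cudaMalloc(&dOut, sizeof(double));
    gpuExp<<<1, 1>>>(input, dOut);
    cudaMemcpy(&gpu, dOut, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    if (std::isinf(gpu) || std::isnan(gpu)) {
        printf("special value / overflow bug candidate: GPU returned %f\n", gpu);
    } else if (std::fabs(gpu - cpu) / std::fabs(cpu) > 1e-12) {
        printf("rounding / accuracy bug candidate: cpu=%.17g gpu=%.17g\n", cpu, gpu);
    } else {
        printf("results agree within the error bound\n");
    }
    return 0;
}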

Currently, GPU-NBDetect is capable of detecting overflow, underflow, rounding, special value bugs, as well as casting and data type bugs. For each mathematical function, GPU-NBDetect generates 100 input samples to evaluate its effectiveness in identifying potential GPU numerical bugs. Given the inherent randomness of the mutation operators, we conducted three independent runs to ensure the accuracy and consistency of the results.

Eliminating False Positives:

GPU-NBDetect identified 81 bugs, of which 57 were confirmed by developers. As detailed in Table IV, manifestation strategies one through six are designed to reliably detect numerical bugs. To further reduce false positives, we implemented specific constraints during the input generation process: 1) verifying that the inputs are within the supported range for floating-point numbers, and 2) ensuring that these inputs do not overlap with known bug-triggering inputs listed in the NVIDIA Math Library documentation.

However, for strategies seven through nine, false positives may still occur. This is due to the reliance on profiling mechanisms, which require both detailed profilers and a wide range of inputs to accurately determine the presence of a bug. Nonetheless, we have minimized the occurrence of false positives by carefully refining the input generation process and applying these constraints.

Confirmation with developers:

We reported the detected bugs on the NVIDIA developer forum, where NVIDIA developers could confirm the results or provide insights regarding them. Our testing was conducted using the CUDA Math library. In some cases, the NVIDIA Math Library documentation indicated that certain inputs might produce errors, but it was unclear whether these errors fell within the supported range of inputs. GPU-NBDetect successfully identified such numerical bugs, and we reported them to NVIDIA. In response, developers wrote unit tests to confirm the presence of these bugs. Additionally, developers acknowledged that GPUs lack a built-in mechanism to detect exceptions, and our tool identified exception-triggering inputs that were within the valid range of supported inputs.

Fixing these identified numerical bugs presents a challenge due to the nature of GPU architecture. Developers would need to implement floating-point exception flags, rounding registers, and conditional statements in the GPU code to address these issues. Although these measures can resolve numerical bugs, they may also introduce performance-related problems, as GPUs are optimized for parallel computations. For example, addressing overflow, underflow, and special value bugs requires developing exception-handling mechanisms, which can be computationally expensive and may negatively affect performance.

Given the complexity and performance trade-offs, it is often more practical to rely on third-party tools to detect such bugs. Our tool, GPU-NBDetect, provides a valuable starting point for identifying and addressing these issues without compromising the efficiency of GPU computations.

Comparison with other tools:

While existing tools primarily focus on detecting overflow, underflow, and special value bugs, GPU-NBDetect is designed to detect a broader range of bug types. Furthermore, these tools typically rely on user-defined inputs and lack an automated test input generation strategy, which is a key feature of GPU-NBDetect. This limitation makes it more challenging to reliably identify bug-triggering inputs with those tools.

Additionally, a direct comparison between GPU-NBDetect and other tools may not be entirely fair, as GPU-NBDetect emphasizes both bug detection and input generation techniques. However, incorporating the input-generation capabilities of GPU-NBDetect into existing tools could greatly enhance their effectiveness, highlighting another significant contribution our tool offers to the community.

Examples:

Next, we highlight some of the interesting examples detected by GPU-NBDetect.

Script to run on GitHub Archive (via Google BigQuery)

SELECT DISTINCT issue_html, issue_title
FROM (
  SELECT type,
         repo.name AS repo,
         JSON_EXTRACT(payload, '$.issue.html_url') AS issue_html,
         JSON_EXTRACT(payload, '$.issue.title') AS issue_title,
         JSON_EXTRACT(payload, '$.issue.body') AS issue_desc,
         JSON_EXTRACT(payload, '$.issue.state') AS issue_state
  FROM `githubarchive.year.2019`
  WHERE type = 'IssuesEvent'
)
WHERE (lower(issue_title) LIKE '%gpu%' OR lower(issue_desc) LIKE '%gpu%')
   OR (lower(issue_title) LIKE '%gpu programming%' OR lower(issue_desc) LIKE '%gpu programming%')
ORDER BY issue_html DESC;

References:

1. Dally, William J., et al. “Evolution of the Graphics Processing Unit (GPU).” IEEE Micro, vol. 41, no. 6, 2021, pp. 42–51., doi:10.1109/mm.2021.3113475.
2. Nvidia Fermi Architecture Whitepaper. www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf.
3. CUDA Compute Architecture: Kepler GK110/210 Whitepaper - Nvidia. www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf.
4. NVIDIA Tesla P100 (Pascal Architecture) Whitepaper - Nvidia.
5. NVIDIA Tesla V100 (Volta Architecture) Whitepaper - Nvidia.
6. Turing Architecture Whitepaper - Nvidia. images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
7. NVIDIA A100 Tensor Core GPU Architecture (Ampere) Whitepaper - Nvidia. images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.