What is the difference between CUDA and ROCm for GPGPU applications?


What you will learn:

  • Differences between CUDA and ROCm.
  • The strengths of each platform.

Graphics Processing Units (GPUs) were traditionally designed to handle graphics computing tasks, such as image and video processing and rendering, 2D and 3D graphics, vectorization, etc. General-purpose computing on GPUs became more practical and popular after 2001, with the advent of programmable shaders and floating-point support on graphics processors.

Notably, early work involved problems with matrices and vectors, including two-, three-, and four-dimensional vectors. These translated easily to the GPU, which supports such types natively and operates on them at full speed. A milestone for general-purpose GPUs (GPGPUs) came in 2003, when two research groups independently discovered GPU-based approaches to general linear algebra problems that ran faster on GPUs than on CPUs.

GPGPU Evolution

Early efforts to use GPUs as general-purpose processors required reframing computational problems in terms of graphics primitives, which were supported by two major APIs for graphics processors: OpenGL and DirectX.

These were followed by NVIDIA’s CUDA, which allowed programmers to set aside the underlying graphics concepts in favor of more common high-performance computing concepts. Vendor-independent frameworks such as OpenCL arrived later. This meant that modern GPGPU pipelines could take advantage of the speed of a GPU without requiring a complete and explicit conversion of the data to a graphical form.

NVIDIA describes CUDA as a parallel computing platform and application programming interface (API) that allows software to use certain types of GPUs for general-purpose processing. CUDA is a software layer that provides direct access to the GPU’s virtual instruction set and parallel computing elements for executing compute kernels.

Not to be left out, AMD launched its own general-purpose computing platform in 2016, called the Radeon Open Compute Ecosystem (ROCm). ROCm is primarily intended for discrete professional GPUs, such as AMD’s Radeon Pro line. However, official support is broader and extends to consumer products, including gaming GPUs.

The ROCm software stack targets several domains, including general-purpose GPU computing (GPGPU), high-performance computing (HPC), and heterogeneous computing. It also offers several programming models, such as HIP (GPU kernel-based programming), OpenMP/Message Passing Interface (MPI), and OpenCL. These also support microarchitectures including RDNA and CDNA for a myriad of applications ranging from AI and edge computing to IoT/IIoT.


Most of NVIDIA’s Tesla and RTX series cards come with a set of CUDA cores (Fig. 1) designed to tackle multiple calculations at the same time. These cores are similar to CPU cores, but they are integrated into the GPU and can process data in parallel. There can be thousands of these cores embedded in the GPU, making for highly efficient parallel systems capable of offloading CPU-centric tasks directly to the GPU.

Parallel computing is the process of breaking down larger problems into smaller, independent parts that can be executed simultaneously by multiple processors communicating through shared memory. The partial results are then combined at the end as part of an overall algorithm. The primary purpose of parallel computing is to increase available computing power to speed up application processing and problem solving.
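The decomposition described above can be sketched on an ordinary CPU. The C++ function below splits an array into chunks, lets each worker thread sum its own chunk independently, and combines the partial sums at the end. It is a minimal illustration rather than GPU code, and the names `parallel_sum` and `num_workers` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` by splitting it into `num_workers` contiguous chunks.
// Each worker thread sums only its own chunk; the partial sums are
// combined on the calling thread after every worker has finished.
long long parallel_sum(const std::vector<int>& data, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<long long> partial(num_workers, 0);  // one slot per worker
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(w * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, w, begin, end] {
            // Independent part: no other thread touches partial[w].
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : workers) t.join();  // wait for all parts

    // Combine step: add up the partial results.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(values, 4)` on a vector `values` would use four worker threads. A GPU applies the same idea with thousands of much lighter-weight threads instead of a handful of CPU threads.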

To this end, the CUDA architecture is designed to work with programming languages such as C, C++, and Fortran, making it easier for parallel programmers to use GPU resources. This contrasts with earlier APIs such as Direct3D and OpenGL, which required advanced graphics programming skills. CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC, OpenCL, and HIP by compiling such code to CUDA.
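To give a feel for the programming model without requiring a GPU, the following C++ sketch emulates on the CPU how a CUDA-style launch assigns one array element to each thread. It is not CUDA code: `saxpy_thread` and `launch_saxpy` are hypothetical names, and the nested loops stand in for threads that would run concurrently on a GPU. In real CUDA C++ the element index would come from the built-in variables `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of one emulated "thread": compute y[i] = a * x[i] + y[i] for the
// single element i owned by this thread. In CUDA the same index is
// written as blockIdx.x * blockDim.x + threadIdx.x.
void saxpy_thread(std::size_t block_idx, std::size_t block_dim,
                  std::size_t thread_idx, float a,
                  const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t i = block_idx * block_dim + thread_idx;
    if (i < x.size()) {  // guard: the last block may extend past the array
        y[i] = a * x[i] + y[i];
    }
}

// Emulated launch: enough blocks of `block_dim` threads to cover x.
// On a GPU these iterations would execute in parallel, not in a loop.
void launch_saxpy(float a, const std::vector<float>& x, std::vector<float>& y,
                  std::size_t block_dim = 256) {
    assert(x.size() == y.size() && block_dim > 0);
    const std::size_t grid_dim = (x.size() + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < grid_dim; ++b)
        for (std::size_t t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, a, x, y);
}
```

The point of the sketch is that the per-thread body is ordinary C++ arithmetic on arrays, with no graphics primitives involved.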

As with most APIs, SDKs, and software stacks, NVIDIA provides libraries, compiler directives, and extensions for the popular programming languages mentioned earlier, which makes programming easier and more efficient. These include cuSPARSE, NVRTC runtime compilation, GameWorks PhysX, MIG multi-instance GPU support, cuBLAS, and many more.

A good portion of these software stacks are designed to handle AI-based applications, including machine learning and deep learning, computer vision, conversational AI, and recommender systems.

Computer vision applications use deep learning to acquire knowledge from digital images and videos. Conversational AI applications help computers understand and communicate through natural language. Recommender systems use a user’s images, language, and interests to deliver meaningful and relevant search results and services.

GPU-accelerated deep learning frameworks provide the flexibility to design and train custom neural networks and offer interfaces for commonly used programming languages. All major deep learning frameworks, such as TensorFlow and PyTorch, are already GPU-accelerated, so data scientists and researchers can benefit from the speedup without writing any GPU code themselves.

Current uses of the CUDA architecture beyond AI include bioinformatics, distributed computing, simulations, molecular dynamics, medical analytics (CT, MRI, and other scan-imaging applications), encryption, etc.

AMD ROCm software stack

AMD’s ROCm software stack (Fig. 2) is similar to the CUDA platform, except that it is open source and uses the company’s GPUs to speed up computational tasks. The latest Radeon Pro W6000 and RX 6000 series cards are equipped with compute cores, ray accelerators (ray tracing), and stream processors that take advantage of the RDNA architecture for parallel processing, including GPGPU, HPC, HIP (a CUDA-like programming model), MPI, and OpenCL.

Since the ROCm ecosystem is composed of open technologies, including frameworks (TensorFlow/PyTorch), libraries (MIOpen/BLAS/RCCL), programming models (HIP), interconnects (OCD), and upstream Linux kernel support, the platform is regularly optimized for performance and efficiency across a wide range of programming languages.

AMD’s ROCm is designed to scale: it supports multi-GPU computing both within a server node and across nodes via Remote Direct Memory Access (RDMA), which offers the ability to access host memory directly, without CPU intervention. Thus, the more RAM the system has, the larger the processing loads ROCm can handle.

ROCm also simplifies the stack by integrating RDMA peer-sync support directly into the driver, which makes application development easier. Additionally, it includes the ROCr System Runtime, which is language-independent and leverages the HSA (Heterogeneous System Architecture) Runtime API, providing a foundation for running programming models such as HIP and OpenMP.

As with CUDA, ROCm is a good fit for AI applications, as a number of deep learning frameworks and related tools already support a ROCm backend (e.g., TensorFlow, PyTorch, MXNet, ONNX, and CuPy). According to AMD, any CPU/GPU vendor can take advantage of ROCm, as it is not a proprietary technology. This means that code written in CUDA or for another platform can be ported to the vendor-neutral HIP format, and from there users can compile the code for the ROCm platform.
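Much of that porting is mechanical because the HIP runtime API deliberately mirrors CUDA's naming. The toy C++ function below illustrates the idea by rewriting a handful of CUDA runtime identifiers to their HIP equivalents. It is only a sketch: `to_hip` is a hypothetical helper, the table covers just five well-known names, and plain substring replacement is far cruder than AMD's actual HIPIFY tools, which this example does not attempt to reproduce.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>

// Rewrite a few CUDA runtime identifiers found in `source` to HIP names.
// Illustrative only: a real port also involves headers, kernel launch
// syntax, and library calls, none of which are handled here.
std::string to_hip(std::string source) {
    // Longer names come first so each identifier is rewritten as a whole.
    static const std::array<std::pair<const char*, const char*>, 5> names{{
        {"cudaMemcpyHostToDevice", "hipMemcpyHostToDevice"},
        {"cudaDeviceSynchronize", "hipDeviceSynchronize"},
        {"cudaMemcpy", "hipMemcpy"},
        {"cudaMalloc", "hipMalloc"},
        {"cudaFree", "hipFree"},
    }};
    for (const auto& entry : names) {
        const std::string from = entry.first;
        const std::string to = entry.second;
        // Replace every occurrence, resuming the search after each rewrite.
        for (std::size_t pos = source.find(from); pos != std::string::npos;
             pos = source.find(from, pos + to.size())) {
            source.replace(pos, from.size(), to);
        }
    }
    return source;
}
```

For instance, passing the text `cudaMalloc(&d, n); cudaFree(d);` through `to_hip` yields `hipMalloc(&d, n); hipFree(d);`, which hints at why simple CUDA programs often port with little manual effort.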

The company offers a series of libraries, add-ons, and extensions to extend the functionality of ROCm, including a compiler (HCC) for the C++ programming language that allows users to combine CPU and GPU code in a single source file.

The feature set for ROCm is extensive and includes multi-GPU coarse-grained shared virtual memory, process concurrency and preemption, HSA signals and atomics, and user-mode queues and DMA. It also offers a standardized loader and code object format, dynamic and offline compilation support, P2P multi-GPU operation with RDMA support, a trace and event collection API, as well as system management APIs and tools. In addition, a growing third-party ecosystem packages custom ROCm distributions for specific applications across a multitude of Linux flavors.


This article is just a simple overview of NVIDIA’s CUDA platform and AMD’s ROCm software stack. Detailed information including guides, walkthroughs, libraries, etc. can be found on their respective websites linked above.

Neither platform is the clear winner in parallel computing. Each provides an excellent system to facilitate the development of applications in various sectors. Both are also easy to use, with intuitive menus, navigation elements, accessible documentation, and learning materials for beginners.

