
Nvidia's PTX

We discussed the history of Nvidia’s CUDA in Happy 18th Birthday CUDA!

Let’s recap how Wikipedia defines CUDA. It’s:

… a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels. In addition to drivers and runtime kernels, the CUDA platform includes compilers, libraries and developer tools to help programmers accelerate their applications.

So CUDA sits on top of what Wikipedia calls the GPU’s virtual instruction set.
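To make the ‘compute kernels’ in that definition concrete, here is a minimal sketch of a CUDA kernel and its launch (the kernel and variable names are purely illustrative). The nvcc compiler doesn’t translate this C++-like source straight into the GPU’s native machine code; it first lowers it to the virtual instruction set we’re about to meet.

#include <cstdio>
#include <cuda_runtime.h>

// A compute kernel: each GPU thread adds 1.0f to one element of the array.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main() {
    const int n = 256;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch one block of 256 threads; each thread handles one element.
    add_one<<<1, n>>>(d_data, n);
    cudaDeviceSynchronize();

    float h_data[n];
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);  // expect 1.000000

    cudaFree(d_data);
    return 0;
}

Compiling a file like this with nvcc -ptx produces a human-readable text file of that intermediate instruction set rather than final machine code, which is exactly the layer the rest of this post is about.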

Nvidia call this virtual instruction set PTX, for Parallel Thread eXecution. PTX has been in the news because DeepSeek highlighted their use of it in the technical report on their V3 Large Language Model.

Here’s Ben Thompson of Stratechery commenting on DeepSeek’s use of PTX:

Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.

So Ben calls the use of PTX an ‘insane level of optimization’, one that does things that are ‘impossible to do in CUDA’.
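To give a flavour of what ‘dropping down to PTX’ looks like in practice, here is a minimal, hypothetical sketch (not DeepSeek’s actual code) that uses CUDA’s inline asm statement to read the %laneid special register, a per-thread value that has no direct CUDA C++ equivalent:

#include <cuda_runtime.h>

// Hypothetical example: each thread records its lane index within its warp.
// %laneid is a PTX special register, so we fetch it with an inline PTX
// 'mov' instruction embedded in otherwise ordinary CUDA C++.
__global__ void record_lane_ids(unsigned int *out) {
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[threadIdx.x] = lane;
}

DeepSeek’s reported use went far beyond a single instruction like this, with 20 of each H800’s 132 processing units programmed to manage cross-chip communication as Ben describes, but the mechanism is the same kind of escape hatch: embedding PTX where CUDA alone doesn’t reach.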

So what’s PTX all about, and just what is a ‘virtual instruction set’? And, perhaps most intriguingly, why has Nvidia decided to use PTX rather than a real instruction set? In this post we’ll set out to understand why.

The Origins of Virtual Instruction Sets

The earliest programs on modern computers were written first in machine code and then in what we now call assembly language. Use of these ‘low level’ languages was both inefficient and error prone, and this soon led to the development of ‘high level’ languages, which a compiler converted to machine code.

This also enabled a greater level of portability between machines. In theory, code written in Fortran, for example, would then run on computers from IBM, Univac, GE or a number of other firms.

However, even this wasn’t ideal. In the 1950s ...

Read full article on The Chip Letter →