CUDA

NVidia chips and CUDA infrastructure are complicated. There are at least five different things that need to be compatible before a program will run:

  1. Hardware
    A chip with a specific compute capability, e.g., compute 8.6, and a more general architecture, e.g., Ampere. In fact the chip itself will have a code name, e.g., GA102, and a product name, e.g., RTX 3090.
  2. Kernel driver
    Characterised by a terse version number, e.g, 510.39.01, and may be tied to a specific range of Linux kernels.
  3. The library libcuda.so (see below)
  4. CUDA Toolkit
    Includes libraries such as libcublas.so and carries the familiar version number, e.g., 11.3.
  5. Application, such as pytorch
    Typically an application will match the API of a specific toolkit version.

Of the above, 1 and 2 are kernel-side things; the system manager will make sure that the kernel driver is compatible with the hardware.

Items 4 and 5 are user-side things. Any package manager, notably conda, will take care of that compatibility; any application in conda will have a dependency on the appropriate toolkit. Of course you need to make sure that the toolkit matches the hardware capability.

I single out item 3, libcuda.so, because it’s something that is associated with the application and toolkit. It needs to be present for an application to link. However, libcuda.so is distributed by NVidia with the driver, item 2. Following the links in Debian, libcuda.so.1 -> libcuda.so.510.39.01. Bizarrely, an application cannot link unless the driver package is installed on the machine.

That last requirement can be averted by a package called nvidia-compat. The compatibility package is actually meant to enable new applications to work on older drivers, but it happens to contain a version of libcuda.so that should allow an application to link when otherwise lacking a driver installation.

The kernel-side installation need not coincide exactly with the user-side. For instance, kernel driver 510.39.01 is a CUDA 11.6-capable driver, but works perfectly well with CUDA toolkit 11.X and several previous versions. This is important in the context of compatibility with applications: e.g., pytorch currently supports the 11.3 toolkit, but had issues with 11.2. This was independent of the driver.