# CUDA {#cuda}

CUDA-only packages are stored in the `cudaPackages` package set. This set
includes the `cudatoolkit`, portions of the toolkit in separate derivations,
`cudnn`, `cutensor` and `nccl`.

A package set is available for each CUDA version, for example
`cudaPackages_11_6`. Within each set is a matching version of the packages
listed above. Other compatible versions of these packages are packaged and
available as well; for example, there can be a
`cudaPackages.cudnn_8_3` package.

To use one or more CUDA packages in an expression, give the expression a `cudaPackages` parameter, and, in case CUDA is optional, a `cudaSupport` flag:
```nix
{
  config,
  cudaSupport ? config.cudaSupport,
  cudaPackages ? { },
  ...
}:
{ }
```
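
Inside the expression, CUDA dependencies are then added conditionally on `cudaSupport`. A minimal sketch, assuming the package only needs the NVCC compiler and the CUDA runtime (the package name and the exact components, such as `cudaPackages.cuda_nvcc` and `cudaPackages.cuda_cudart`, are illustrative; adjust to what your package actually needs):

```nix
{
  lib,
  stdenv,
  config,
  cudaSupport ? config.cudaSupport,
  cudaPackages ? { },
  ...
}:

stdenv.mkDerivation {
  pname = "mypkg";
  version = "0.1.0";
  src = ./.;

  # The compiler belongs in nativeBuildInputs, the runtime libraries in buildInputs.
  nativeBuildInputs = lib.optionals cudaSupport [ cudaPackages.cuda_nvcc ];
  buildInputs = lib.optionals cudaSupport [ cudaPackages.cuda_cudart ];
}
```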

When using `callPackage`, you can choose to pass in a different variant, e.g.
when a different version of the toolkit suffices (`./mypkg.nix` here stands for
the expression above):
```nix
{
  mypkg = callPackage ./mypkg.nix { cudaPackages = cudaPackages_11_5; };
}
```

If another version of, say, `cudnn` or `cutensor` is needed, you can override the
package set to make it the default. This guarantees you get a consistent package
set.
```nix
{
  mypkg =
    let
      cudaPackages = cudaPackages_11_5.overrideScope (
        final: prev: {
          cudnn = prev.cudnn_8_3;
        }
      );
    in
    callPackage ./mypkg.nix { inherit cudaPackages; };
}
```

The CUDA NVCC compiler requires flags to determine which hardware you
want to target, in terms of SASS (real hardware) or PTX (JIT kernels).

Nixpkgs targets a default set of real architectures based on the
CUDA toolkit version, with PTX support for future hardware. Experienced
users may optimize this configuration for a variety of reasons, such as
reducing binary size and compile time, supporting legacy hardware, or
optimizing for specific hardware.

You may provide capabilities to add support or reduce binary size through
`config`, using `cudaCapabilities = [ "6.0" "7.0" ];`, and set
`cudaForwardCompat = true;` if you want PTX support for future hardware.
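
For example, such a configuration could be passed when importing Nixpkgs (a minimal sketch; `allowUnfree` is included because the CUDA packages are unfree):

```nix
import <nixpkgs> {
  config = {
    allowUnfree = true;
    cudaSupport = true;
    # SASS for these architectures, plus PTX for future hardware.
    cudaCapabilities = [ "6.0" "7.0" ];
    cudaForwardCompat = true;
  };
}
```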

Please consult [GPUs supported](https://en.wikipedia.org/wiki/CUDA#GPUs_supported)
for your specific card(s).

Library maintainers should consult [NVCC Docs](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/)
and release notes for their software package.

## Adding a new CUDA release {#adding-a-new-cuda-release}

> **WARNING**
>
> This section of the docs is still very much in progress. Feedback is welcome in GitHub Issues tagging @NixOS/cuda-maintainers or on [Matrix](https://matrix.to/#/#cuda:nixos.org).

The CUDA Toolkit is a suite of CUDA libraries and software meant to provide a development environment for CUDA-accelerated applications. Until the release of CUDA 11.4, NVIDIA had only made the CUDA Toolkit available as a multi-gigabyte runfile installer, which we provide through the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute. From CUDA 11.4 onwards, NVIDIA has also provided CUDA redistributables (“CUDA-redist”): individually packaged CUDA Toolkit components meant to facilitate redistribution and inclusion in downstream projects. These packages are available in the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.

All new projects should use the CUDA redistributables available in [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) in place of [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit), as they are much easier to maintain and update.

### Updating CUDA redistributables {#updating-cuda-redistributables}

1. Go to NVIDIA's index of CUDA redistributables: <https://developer.download.nvidia.com/compute/cuda/redist/>
2. Make a note of the new version of CUDA available.
3. Run

   ```bash
   nix run github:connorbaker/cuda-redist-find-features -- \
     download-manifests \
     --log-level DEBUG \
     --version <newest CUDA version> \
     https://developer.download.nvidia.com/compute/cuda/redist \
     ./pkgs/development/cuda-modules/cuda/manifests
   ```

   This will download a copy of the manifest for the new version of CUDA.
4. Run

   ```bash
   nix run github:connorbaker/cuda-redist-find-features -- \
     process-manifests \
     --log-level DEBUG \
     --version <newest CUDA version> \
     https://developer.download.nvidia.com/compute/cuda/redist \
     ./pkgs/development/cuda-modules/cuda/manifests
   ```

   This will generate a `redistrib_features_<newest CUDA version>.json` file in the same directory as the manifest.
5. Update the `cudaVersionMap` attribute set in `pkgs/development/cuda-modules/cuda/extension.nix`.
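
For the last step, the `cudaVersionMap` entry ties a CUDA release to the fully-versioned manifest it should use; the exact shape of the attribute set may have changed, so mirror the existing entries in `extension.nix`. A hypothetical addition (version numbers illustrative):

```nix
{
  cudaVersionMap = {
    # ...existing entries...
    "12.2" = "12.2.2"; # hypothetical: new CUDA release -> manifest version downloaded above
  };
}
```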

### Updating cuTensor {#updating-cutensor}

1. Repeat the steps present in [Updating CUDA redistributables](#updating-cuda-redistributables) with the following changes:
   - Use the index of cuTensor redistributables: <https://developer.download.nvidia.com/compute/cutensor/redist>
   - Use the newest version of cuTensor available instead of the newest version of CUDA.
   - Use `pkgs/development/cuda-modules/cutensor/manifests` instead of `pkgs/development/cuda-modules/cuda/manifests`.
   - Skip the step of updating `cudaVersionMap` in `pkgs/development/cuda-modules/cuda/extension.nix`.

### Updating supported compilers and GPUs {#updating-supported-compilers-and-gpus}

1. Update `nvcc-compatibilities.nix` in `pkgs/development/cuda-modules/` to include the newest release of NVCC, as well as any newly supported host compilers.
2. Update `gpus.nix` in `pkgs/development/cuda-modules/` to include any new GPUs supported by the new release of CUDA.

### Updating the CUDA Toolkit runfile installer {#updating-the-cuda-toolkit}

> **WARNING**
>
> While the CUDA Toolkit runfile installer is still available in Nixpkgs as the [`cudaPackages.cudatoolkit`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages.cudatoolkit) attribute, its use is not recommended and it should be considered deprecated. Please migrate to the CUDA redistributables provided by the [`cudaPackages`](https://search.nixos.org/packages?channel=unstable&type=packages&query=cudaPackages) package set.
>
> To ensure packages relying on the CUDA Toolkit runfile installer continue to build, it will continue to be updated until a migration path is available.

1. Go to NVIDIA's CUDA Toolkit runfile installer download page: <https://developer.nvidia.com/cuda-downloads>
2. Select the appropriate OS, architecture, distribution, version, and installer type.

   - For example: Linux, x86_64, Ubuntu, 22.04, runfile (local)
   - NOTE: Typically, we use the Ubuntu runfile. It is unclear if the runfile for other distributions will work.

3. After selecting the installer type, take the download link provided by the installer instructions on the webpage and get its hash by running:

   ```bash
   nix store prefetch-file --hash-type sha256 <link>
   ```

4. Update `pkgs/development/cuda-modules/cudatoolkit/releases.nix` to include the release.

### Updating the CUDA package set {#updating-the-cuda-package-set}

1. Include a new `cudaPackages_<major>_<minor>` package set in `pkgs/top-level/all-packages.nix`.

   - NOTE: Changing the default CUDA package set should occur in a separate PR, allowing time for additional testing.

2. Successfully build the closure of the new package set, updating `pkgs/development/cuda-modules/cuda/overrides.nix` as needed. Below are some common failures:

| Unable to ... | During ... | Reason | Solution | Note |
| --- | --- | --- | --- | --- |
| Find headers | `configurePhase` or `buildPhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contains the headers |
| Find libraries | `configurePhase` | Missing dependency on a `dev` output | Add the missing dependency | The `dev` output typically contains CMake configuration files |
| Find libraries | `buildPhase` or `patchelf` | Missing dependency on a `lib` or `static` output | Add the missing dependency | The `lib` or `static` output typically contains the libraries |

If you are unable to run the resulting binary: this is arguably the most complicated case, as it could be any combination of the previous reasons. This type of failure typically occurs when a library attempts to load or open a library it depends on that it does not declare in its `DT_NEEDED` section. As a first step, ensure that dependencies are patched with [`autoAddDriverRunpath`](https://search.nixos.org/packages?channel=unstable&type=packages&query=autoAddDriverRunpath). Failing that, try running the application with [`nixGL`](https://github.com/guibou/nixGL) or a similar wrapper tool. If that works, it likely means that the application is attempting to load a library that is not in the `RPATH` or `RUNPATH` of the binary.
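
`autoAddDriverRunpath` is a setup hook that goes into `nativeBuildInputs`. A minimal sketch of wiring it into a derivation (the package name and the choice of `cuda_cudart` are illustrative):

```nix
{
  stdenv,
  autoAddDriverRunpath,
  cudaPackages,
}:

stdenv.mkDerivation {
  pname = "mypkg";
  version = "0.1.0";
  src = ./.;

  # The hook patches the RUNPATH of produced binaries so they can locate
  # the NVIDIA driver libraries at run time.
  nativeBuildInputs = [ autoAddDriverRunpath ];
  buildInputs = [ cudaPackages.cuda_cudart ];
}
```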

## Running Docker or Podman containers with CUDA support {#cuda-docker-podman}

It is possible to run Docker or Podman containers with CUDA support. The recommended mechanism to perform this task is to use the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html).

The NVIDIA Container Toolkit can be enabled in NixOS as follows:

```nix
{
  hardware.nvidia-container-toolkit.enable = true;
}
```

This will automatically enable a service that generates a CDI specification (located at `/var/run/cdi/nvidia-container-toolkit.json`) based on the auto-detected hardware of your machine. You can check this service by running:

```ShellSession
$ systemctl status nvidia-container-toolkit-cdi-generator.service
```

::: {.note}
Depending on what settings you had already enabled in your system, you might need to restart your machine in order for the NVIDIA Container Toolkit to generate a valid CDI specification for your machine.
:::

Once a valid CDI specification has been generated for your machine at boot time, both Podman and Docker (> 25) will use this spec if you provide them with the `--device` flag:

```ShellSession
$ podman run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: <REDACTED>)
```

```ShellSession
$ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: <REDACTED>)
```

You can check all the identifiers that have been generated for your auto-detected hardware by checking the contents of the `/var/run/cdi/nvidia-container-toolkit.json` file:

```ShellSession
$ nix run nixpkgs#jq -- -r '.devices[].name' < /var/run/cdi/nvidia-container-toolkit.json
0
1
all
```

### Specifying what devices to expose to the container {#specifying-what-devices-to-expose-to-the-container}

You can choose what devices are exposed to your containers by using the identifiers from the generated CDI specification, as follows:

```ShellSession
$ podman run --rm -it --device=nvidia.com/gpu=0 ubuntu:latest nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
```

You can repeat the `--device` argument as many times as necessary if you have multiple GPUs and you want to pick which ones to expose to the container:

```ShellSession
$ podman run --rm -it --device=nvidia.com/gpu=0 --device=nvidia.com/gpu=1 ubuntu:latest nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: <REDACTED>)
GPU 1: NVIDIA GeForce RTX 2080 SUPER (UUID: <REDACTED>)
```

::: {.note}
By default, the NVIDIA Container Toolkit will use the GPU index to identify specific devices. You can change how devices are identified by setting the `hardware.nvidia-container-toolkit.device-name-strategy` NixOS option.
:::
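
For instance, to identify devices by UUID rather than by index, something like the following could be set in the NixOS configuration (a sketch; the `"uuid"` value assumes the strategies accepted by the underlying `nvidia-ctk` CDI generator):

```nix
{
  hardware.nvidia-container-toolkit = {
    enable = true;
    # Assumption: "uuid" is among the supported device-name strategies.
    device-name-strategy = "uuid";
  };
}
```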

### Using docker-compose {#using-docker-compose}

It's possible to expose GPUs to a `docker-compose` environment as well, with a `docker-compose.yaml` file like the following:

```yaml
services:
  some-service:
    image: ubuntu:latest
    command: sleep infinity
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
```

In the same manner, you can pick specific devices that will be exposed to the container:

```yaml
services:
  some-service:
    image: ubuntu:latest
    command: sleep infinity
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=0
                - nvidia.com/gpu=1
```