Cycles: Tweak scheduling of GPU kernel compilation #129945

Sergey Sharybin · 2024-11-07T11:38:17+01:00

Sergey Sharybin commented

2024-11-07 11:38:17 +01:00

This change ensures that only kernels of the same vendor are compiled in
parallel. For example, release builds will compile in these stages:

1. All CUDA kernels
2. All OptiX kernels
3. All HIP kernels
4. All oneAPI kernels

This potentially leads to lower CPU utilization, but it makes it much
easier to manage memory usage and tweak per-vendor concurrency.
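
The staging idea can be illustrated with a small Python sketch. This is not how the PR implements it (the actual change lives in the build system); the stage list, the per-vendor limits, and the `compile_one` callback are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-vendor concurrency limits, for illustration only.
STAGES = [
    ("CUDA", 4),
    ("OptiX", 4),
    ("HIP", 2),
    ("oneAPI", 2),
]


def compile_in_stages(jobs_by_vendor, stages, compile_one):
    """Run each vendor's jobs in parallel, but never mix vendors."""
    results = []
    for vendor, max_jobs in stages:
        jobs = jobs_by_vendor.get(vendor, [])
        if not jobs:
            continue
        # Leaving the `with` block waits for every job of this vendor,
        # so the next vendor's stage starts only after this one is done.
        with ThreadPoolExecutor(max_workers=max_jobs) as pool:
            results.extend(pool.map(compile_one, jobs))
    return results
```

Because each stage has its own worker limit, the memory ceiling of a stage is roughly that limit multiplied by the vendor's per-job peak, which is what makes per-vendor tuning straightforward.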

The goal of this change is to solve occasional out-of-memory failures
during the GPU kernels compilation step on the CI/CD farm.

This change also includes tweaks to the parallel jobs for HIP-RT and
oneAPI. The tweak is based on measuring the apparent peak memory usage on
Linux when doing single-threaded compilation, and leaving some safety
margin within the memory available on the buildbot.
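
A rough sketch of that measurement and the resulting job count, in Python. The helper names are hypothetical, and this is only one way to obtain a peak figure: `ru_maxrss` for `RUSAGE_CHILDREN` reports the largest waited-for child process, not the sum over a process tree, so it can under-report a compiler driver that spawns several heavy sub-processes.

```python
import resource
import subprocess


def peak_child_rss_gib(cmd):
    """Run `cmd` to completion and return the peak RSS seen among
    waited-for child processes, in GiB (on Linux ru_maxrss is in KiB)."""
    subprocess.run(cmd, check=True)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak_kib / (1024 * 1024)


def safe_parallel_jobs(available_gib, peak_per_job_gib, margin_gib):
    """Number of jobs that fit into the memory left after the margin."""
    if peak_per_job_gib <= 0:
        raise ValueError("peak_per_job_gib must be positive")
    budget_gib = available_gib - margin_gib
    # Always allow at least one job, even when the budget is tight.
    return max(1, int(budget_gib // peak_per_job_gib))
```

For example, with hypothetical numbers of 28 GiB available, a 6 GiB per-job peak, and a 4 GiB margin, `safe_parallel_jobs(28, 6, 4)` yields 4 jobs.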


Sergey Sharybin added 1 commit 2024-11-07 11:38:25 +01:00

Cycles: Tweak scheduling of GPU kernel compilation

buildbot/vexp-code-patch-lint Build done.

buildbot/vexp-code-patch-darwin-arm64 Build done.

buildbot/vexp-code-patch-darwin-x86_64 Build done.

buildbot/vexp-code-patch-windows-amd64 Build done.

buildbot/vexp-code-patch-linux-x86_64 Build done.

buildbot/vexp-code-patch-coordinator Build done.

e7664e7852


Sergey Sharybin commented

2024-11-07 11:38:39 +01:00

@blender-bot package

Blender Bot commented

2024-11-07 11:38:43 +01:00


Package build started. [Download here](https://builder.blender.org/download/patch/PR129945) when ready.

Nikita Sirgienko commented

2024-11-07 13:09:07 +01:00


@Sergey, if we are now going to compile vendor kernels in stages, then it makes sense to adjust `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` in the same way as you have done for the number of HIP parallel jobs. In fact, I am a bit surprised that I do not see a change to the `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` value in `buildbot/config`. Does this imply that all this time Blender was using 1 thread for GPU binary compilation of oneAPI kernels? And if so, why? I believe we had enabled at least 2 threads at some point, hadn't we?

Sergey Sharybin commented

2024-11-07 14:01:21 +01:00

@Sirgienko This is a somewhat confusing situation. The oneAPI kernels are still (for historical reasons) disabled for PR compilation. Enabling them is one of the next items on our CI/CD TODO list. That's why you don't see certain things in the logs of this build.

We do set parallel jobs to 2, but we do it via the buildbot itself: https://projects.blender.org/infrastructure/blender-devops/src/branch/main/buildbot/worker/blender/compile.py#L229

It should be possible to set `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` from the code now that we have `blender_{linux,windows,macos}`. Initially I wanted to keep this part of the change for a separate PR.
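
As a rough illustration of setting the value from the buildbot side, here is a Python sketch. It assumes `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` is passed to CMake as a `-D` definition; the linked `compile.py` is not reproduced here, and the helper name and the source/build directory names are hypothetical.

```python
def sycl_jobs_cmake_option(parallel_jobs):
    """Build the CMake definition for the oneAPI offline compiler job count.

    Assumption: the variable is consumed as a CMake cache variable.
    """
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    return f"-DSYCL_OFFLINE_COMPILER_PARALLEL_JOBS={parallel_jobs}"


# Hypothetical configure command; "blender" and "build" are placeholder paths.
cmake_args = ["cmake", "-S", "blender", "-B", "build", sycl_jobs_cmake_option(2)]
```

Keeping the number in one helper means a later change from 2 to a higher value touches a single place.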

The timing seems to work out so that this PR compiles CUDA+OptiX+HIP in about the same time as the nightly builds compile `CUDA+OptiX+HIP+oneAPI`.

@Sirgienko Assuming we have 24-30 gig of RAM available for the build process, do you think we can raise `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` to something like 4? Or maybe even higher?

I'll also do some tests locally to see if per-GPU-thread compilation still requires 6 gig: https://projects.blender.org/infrastructure/blender-devops/src/branch/main/buildbot/worker/blender/compile.py#L485 Maybe we can tweak that and run more compilation commands in parallel.


Nikita Sirgienko commented

2024-11-07 14:23:59 +01:00

> @Sirgienko This is a bit confusing situation. The oneAPI kernels are still (historically) disabled for PR compilation. This is somewhere next in our TODO list w.r.t CI/CD. That's why you wouldn't see certain things in the logs of this build.
> We do set parallel jobs to 2, but we do it via the buildbot itself: https://projects.blender.org/infrastructure/blender-devops/src/branch/main/buildbot/worker/blender/compile.py#L229
> It should be possible to set the SYCL_OFFLINE_COMPILER_PARALLEL_JOBS from the code now when we have blender_{linux,windows,macos}. Initially I wanted to keep this part of the change to a separate PR.

Well, it is up to you how to split this work into PRs then. I just noticed an increase in HIP parallelization but not for oneAPI, and I was wondering why.

> @Sirgienko Assuming we have 24-30 gig of RAM available for the build process, do you think we can raise SYCL_OFFLINE_COMPILER_PARALLEL_JOBS to something like 4? Or, maybe, even higher?

Yes, I think this is a good starting point for that amount of memory. Then we could tweak it to 3 or 5, depending on the results of the first runs.


Sergey Sharybin commented

2024-11-07 16:12:55 +01:00

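I've made some graphs of building the GPU kernels on a single core to get a feel for time and memory usage.

CUDA: ![memory_usage_cuda.png](/attachments/0efbb4c4-b27f-4755-92cb-23a99d0faa0f)

OptiX: ![memory_usage_optix.png](/attachments/6383c21b-e745-4afe-b6af-a7ee14009503)

HIP: ![memory_usage_hip.png](/attachments/5eb21fdc-e773-4390-a77f-c47158c36067)

oneAPI: ![memory_usage_oneapi.png](/attachments/b8a22697-6ebe-40eb-9946-7ce6a7f080c9)

We have about 26-28 GiB of memory available on the Windows workers (the rest of the 32 gig goes to the OS etc.).
Currently the math works out so that compilation happens in 5 threads. We should be able to safely raise it to 8 (to keep some margin?), or even 10.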
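
The arithmetic behind these thread counts can be sketched as follows. The thread does not spell out how the current 5 is derived; one reading that reproduces it, assumed here, is dividing the total 32 gig by the 6 gig per compilation thread mentioned earlier. The function names are hypothetical.

```python
def jobs_for_budget(memory_gib, per_job_gib):
    """How many jobs fit when each one is budgeted `per_job_gib`."""
    return int(memory_gib // per_job_gib)


def per_job_budget_gib(memory_gib, jobs):
    """Memory each job may use if `jobs` of them share `memory_gib`."""
    return memory_gib / jobs


# Assumed reading of the current setup: 32 gig total, 6 gig per thread.
current_jobs = jobs_for_budget(32, 6)

# What the proposed counts imply for the low end of the 26-28 GiB range.
budget_at_8 = per_job_budget_gib(26, 8)
budget_at_10 = per_job_budget_gib(26, 10)
```

Under these assumptions, 8 threads are safe only if a single compilation stays below about 3.25 GiB, and 10 threads only if it stays below about 2.6 GiB, which is what the per-vendor graphs are meant to confirm.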

Interestingly enough, single-core compilation outside of a VM feels like it takes the same amount of time as on a VM on similar HW. Maybe this is something our Proxmox/IT team can investigate eventually.

As a step forward, I'd add the `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` setting in this PR and set it to 6 (just a safe-sounding number for now, we can tweak it later).


memory_usage_hip.png

51 KiB

memory_usage_oneapi.png

38 KiB

memory_usage_cuda.png

41 KiB

memory_usage_optix.png

37 KiB


Sergey Sharybin added 1 commit 2024-11-07 16:16:17 +01:00

Increase parallel jobs for HIP-RT and oneAPI 9faaaafec1

Sergey Sharybin merged commit f58522fc10 into blender-v4.3-release