Cycles: Tweak scheduling of GPU kernel compilation #129945

Merged
Sergey Sharybin merged 2 commits from Sergey/blender:cycles_kernel into blender-v4.3-release 2024-11-08 11:05:53 +01:00

This change makes it so that only kernels of the same vendor are compiled
in parallel. For example, for release builds the order will be:

  1. All CUDA kernels
  2. All OptiX kernels
  3. All HIP kernels
  4. All OneAPI kernels

This potentially leads to lower CPU utilization, but it makes it much
easier to manage memory usage and tweak per-vendor concurrency.
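
As an illustration of the scheduling idea only (this is not the build-system change from this PR), staged compilation can be sketched in Python. The stage table, the kernel names and the `compile_kernel` callback are hypothetical; the point is that stages run one after another while each stage has its own job limit.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def compile_in_stages(stages, compile_kernel):
    """Run vendor stages sequentially; parallelize only within a stage.

    `stages` is a list of (vendor, kernels, max_jobs) tuples and
    `compile_kernel(vendor, kernel)` is a hypothetical callback.
    """
    results = []
    for vendor, kernels, max_jobs in stages:
        with ThreadPoolExecutor(max_workers=max_jobs) as pool:
            # All kernels of this vendor finish before the next vendor
            # starts, so peak memory is bounded by a single vendor.
            results.extend(pool.map(partial(compile_kernel, vendor), kernels))
    return results
```

A call such as `compile_in_stages([("CUDA", cuda_kernels, 4), ("HIP", hip_kernels, 2)], compile_kernel)` would then give each vendor its own concurrency without the stages overlapping.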

The goal of this change is to solve occasional out-of-memory failures
during the GPU kernel compilation step on the CI/CD farm.

This change also includes tweaks to the number of parallel jobs for
HIP-RT and oneAPI. The tweak is based on measuring the apparent peak
memory usage on Linux when doing single-threaded compilation, and
leaving a safe margin relative to the memory available on the buildbot.
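
The reasoning in the last paragraph amounts to dividing the usable memory by the per-job peak. A minimal sketch, assuming a simple floor division; the function name and the numbers in the usage note are illustrative and are not the values measured for this PR:

```python
def parallel_jobs(available_gib, peak_per_job_gib, margin_gib=0.0):
    """Number of jobs that fit in memory, keeping `margin_gib` free.

    All three parameters are assumptions of this sketch, in GiB.
    """
    if peak_per_job_gib <= 0:
        raise ValueError("peak_per_job_gib must be positive")
    usable = available_gib - margin_gib
    # Always allow at least one job so the build can make progress.
    return max(1, int(usable // peak_per_job_gib))
```

For example, `parallel_jobs(24, 5, margin_gib=4)` returns 4: 20 GiB remain after the margin, which fits four 5 GiB jobs.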

Sergey Sharybin added 1 commit 2024-11-07 11:38:25 +01:00
Cycles: Tweak scheduling of GPU kernel compilation
Some checks failed
buildbot/vexp-code-patch-lint Build done.
buildbot/vexp-code-patch-darwin-arm64 Build done.
buildbot/vexp-code-patch-darwin-x86_64 Build done.
buildbot/vexp-code-patch-windows-amd64 Build done.
buildbot/vexp-code-patch-linux-x86_64 Build done.
buildbot/vexp-code-patch-coordinator Build done.
e7664e7852
This change makes it so only kernels of the same vendor are compiled in
parallel. For example for the release builds it will be:

1. All CUDA kernels
2. All OptiX kernels
3. All HIP kernels
4. All OneAPI kernels

This potentially leads to a lower CPU utilization, but it makes it much
easier to manage memory usage and tweak per-vendor concurrency.

The goal of this change is to solve occasional out-of-memory during the
GPU kernels compilation step on the CI/CD farm.
Author
Owner

@blender-bot package

Member

Package build started. [Download here](https://builder.blender.org/download/patch/PR129945) when ready.

@Sergey, if we are now going to compile vendor kernels in stages, then it makes sense to adjust `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` in the same way as you have done for the number of HIP parallel jobs. In fact, I am a bit surprised that I do not see a change to the `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` value in `buildbot/config`. Does that imply that all this time Blender was using 1 thread for GPU binary compilation of oneAPI kernels? And if so, why? I believe we had enabled at least 2 threads at some point, hadn't we?
Author
Owner

@Sirgienko This is a bit of a confusing situation. The oneAPI kernels are still (historically) disabled for PR compilation. Enabling them is one of the next items on our CI/CD TODO list. That's why you wouldn't see certain things in the logs of this build.

We do set parallel jobs to 2, but we do it via the buildbot itself: https://projects.blender.org/infrastructure/blender-devops/src/branch/main/buildbot/worker/blender/compile.py#L229

It should be possible to set `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` from the code now that we have `blender_{linux,windows,macos}`. Initially I wanted to keep this part of the change for a separate PR.

The timing seems to work out so that this PR compiles CUDA+OptiX+HIP in about the same time as the nightly builds compile CUDA+OptiX+HIP+oneAPI.

@Sirgienko Assuming we have 24-30 gig of RAM available for the build process, do you think we can raise `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` to something like 4? Or maybe even higher?

I'll also do some tests locally to see if per-GPU-thread compilation still requires 6 gig: https://projects.blender.org/infrastructure/blender-devops/src/branch/main/buildbot/worker/blender/compile.py#L485 Maybe we can tweak that and run more compilation commands in parallel.
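
For readers unfamiliar with how such a setting reaches the build: a hedged sketch, assuming `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` is passed to CMake as a cache define. The helper name is hypothetical and this is not the code at the `compile.py` links above.

```python
def kernel_cmake_defines(sycl_parallel_jobs):
    """Build the CMake argument for the oneAPI offline-compiler job count.

    Assumption of this sketch: the value is forwarded as a -D cache entry.
    """
    if sycl_parallel_jobs < 1:
        raise ValueError("at least one parallel job is required")
    return [f"-DSYCL_OFFLINE_COMPILER_PARALLEL_JOBS={sycl_parallel_jobs}"]
```

The returned list would be appended to the rest of the `cmake` command line; `kernel_cmake_defines(2)` corresponds to the value of 2 mentioned in this comment.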

> @Sirgienko This is a bit of a confusing situation. The oneAPI kernels are still (historically) disabled for PR compilation. This is somewhere next in our TODO list w.r.t CI/CD. That's why you wouldn't see certain things in the logs of this build.
> We do set parallel jobs to 2, but we do it via the buildbot itself: https://projects.blender.org/infrastructure/blender-devops/src/branch/main/buildbot/worker/blender/compile.py#L229
> It should be possible to set `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` from the code now that we have `blender_{linux,windows,macos}`. Initially I wanted to keep this part of the change for a separate PR.

Well, it is up to you how to split this work into PRs then. I just noticed the increase in parallelization for HIP but not for oneAPI, and I was wondering why.

> @Sirgienko Assuming we have 24-30 gig of RAM available for the build process, do you think we can raise `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` to something like 4? Or maybe even higher?

Yes, I think this is a good starting point for that amount of memory. Then we could tweak it to 3 or 5, depending on the results of the first runs.

Author
Owner

I've made some graphs of building the GPU kernels on a single core to get a feeling for the time and memory usage.

CUDA:
memory_usage_cuda.png

OptiX:
memory_usage_optix.png

HIP:
memory_usage_hip.png

oneAPI:
memory_usage_oneapi.png

We have about 26-28 GiB of memory available on Windows workers (the rest of the 32 GiB goes to the OS, etc.).
Currently the math works out so that compilation happens in 5 threads. We should be able to safely raise it to 8 (to have some margin?), or even 10.

Interestingly enough, single-core compilation outside of a VM seems to take about the same amount of time as it takes on a VM on similar hardware. Maybe this is something our Proxmox/IT team can investigate eventually.

As a next step, I'd add the `SYCL_OFFLINE_COMPILER_PARALLEL_JOBS` setting in this PR and set it to 6 (just a safe-sounding number for now, we can tweak it later).
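
A single-job peak of the kind discussed in this comment can be approximated on Linux or macOS with the standard library. This is a rough sketch, not the tooling used for the graphs: it reports only the peak resident set size of the largest child process, not usage over time or the sum over a process tree, and `peak_child_rss_kib` is a hypothetical helper name.

```python
import resource
import subprocess
import sys


def peak_child_rss_kib(command):
    """Run `command` and return the peak RSS of child processes in KiB.

    Unix-only. The value is the maximum over all children waited for so
    far by this process, so measure one command per interpreter run.
    """
    subprocess.run(command, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    return peak // 1024 if sys.platform == "darwin" else peak
```

Passing the command line of a single-threaded kernel compilation as `command` would give one number per vendor to compare against the memory available on a worker.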

Sergey Sharybin added 1 commit 2024-11-07 16:16:17 +01:00
Sergey Sharybin merged commit f58522fc10 into blender-v4.3-release 2024-11-08 11:05:53 +01:00
Sergey Sharybin deleted branch cycles_kernel 2024-11-08 11:05:57 +01:00
Reference: blender/blender#129945