VSE Improve video decoding delays #118155

Open
opened 2024-02-12 20:00:02 +01:00 by Richard Antalik · 4 comments

I experimented a bit yesterday and created 2 builds to compare delays during cuts: #118112 and #118114. The results on their own look promising, but the performance improvement is quite marginal.

Looking at the results, I realized that some memory could be reserved for `AVCodecContext` handles. About 10 GB can cover about 100 strips. Now this can be 100 frames or the whole timeline (more likely). VSE could ensure that n strips on either side of the playhead are fully loaded (in a thread), while the others have their `AVCodecContext` freed. This would ensure very good scrubbing performance near the area where you work. Technically this would work similarly to cache prefetching.

Technically, an `AVCodecContext` can be shared across multiple `ImBufAnim` structs. Or it would be possible to share an `ImBufAnim` across multiple strips using the same file. This way even more memory could be saved, allowing the prefetching idea above to cover an even bigger portion of the timeline.
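To illustrate the idea, here is a minimal C++ sketch of such a decoder window; `Strip`, `strip_open_decoder()` and `strip_close_decoder()` are hypothetical stand-ins, not existing Blender API:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Strip {
  int start_frame = 0;
  int end_frame = 0;
  bool decoder_open = false; /* Stands in for holding an AVCodecContext. */
};

/* Hypothetical helpers; in Blender these would allocate/free the FFmpeg
 * decoder state behind the strip's ImBufAnim. */
static void strip_open_decoder(Strip &s)  { s.decoder_open = true; }
static void strip_close_decoder(Strip &s) { s.decoder_open = false; }

/* Keep decoders allocated only for the n strips on either side of the
 * playhead; free the AVCodecContext of everything further away.
 * `strips` is assumed to be sorted by start_frame. */
void update_decoder_window(std::vector<Strip> &strips, int playhead, int n)
{
  /* Index of the first strip that ends at or after the playhead. */
  auto it = std::lower_bound(
      strips.begin(), strips.end(), playhead,
      [](const Strip &s, int frame) { return s.end_frame < frame; });
  const int center = int(it - strips.begin());

  for (int i = 0; i < int(strips.size()); i++) {
    const bool in_window = std::abs(i - center) <= n;
    if (in_window && !strips[i].decoder_open) {
      strip_open_decoder(strips[i]); /* Could happen in a background thread. */
    }
    else if (!in_window && strips[i].decoder_open) {
      strip_close_decoder(strips[i]);
    }
  }
}
```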

Richard Antalik added the Type/Design, Module/VFX & Video, Interest/Video Sequencer labels 2024-02-12 20:00:14 +01:00

Tangentially related, but my thinking was like below (most of it might be wrong/incorrect, since I've no idea what I'm talking about):

Primarily, you want two things out of the sequencer:

  1. "no hitching" during playback (i.e. no dropped frames). This should be achievable for 24FPS target at around 1080p resolution, somewhat harder for 60FPS or 4K resolutions now.
  2. "render performance". How long does it take to render the movie out.

Now, from the (limited) data sets I've seen, both of the above could be improved by using more CPU cores, i.e. doing more things in parallel. While most of the sequencer effects and image filtering are multi-threaded and try to use all the available CPU cores, "the rest" (reading image files, and possibly reading/writing movie frames) does not come anywhere near using all the CPU power.

So what I had in mind was something similar to a "prefetch" system, but only for input images / movie frames. That would work something like this:

  • At any time, look at the current playhead, plus N (e.g. +10) frames ahead of it,
  • Take all the image/movie strips that are in those frames,
  • Fire off jobs that "preload" these into an image cache.

The jobs would _not_ be processed on a single thread, since that severely under-utilizes the CPU (e.g. decoding a single JPG image typically uses just one CPU core). Instead, one "job" is just "read an image file / decode a movie frame and put it into the cache". And you'd fire off multiple of those, all in parallel.

The rest of the VSE playback/rendering pipeline would stay similar to how it is now: at any point, if an image/frame is needed, it is fetched from the cache (hopefully already pre-filled by the jobs above). If it's not there, it is loaded synchronously.
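A rough sketch of how such a prefetch could look; the cache, the key scheme and the `decode_frame()` stub here are all hypothetical, and a real implementation would use Blender's job system rather than `std::async`:

```cpp
#include <cstdint>
#include <future>
#include <mutex>
#include <unordered_map>
#include <vector>

struct ImBuf {}; /* Placeholder for a decoded image. */

/* (strip id, frame) packed into one cache key. */
using FrameKey = uint64_t;
static FrameKey make_key(int strip_id, int frame)
{
  return (FrameKey(uint32_t(strip_id)) << 32) | uint32_t(frame);
}

struct FrameCache {
  std::mutex mutex;
  std::unordered_map<FrameKey, ImBuf> images;

  bool contains(FrameKey key)
  {
    std::lock_guard<std::mutex> lock(mutex);
    return images.count(key) != 0;
  }
  void insert(FrameKey key, ImBuf buf)
  {
    std::lock_guard<std::mutex> lock(mutex);
    images[key] = buf;
  }
  ImBuf get(FrameKey key)
  {
    std::lock_guard<std::mutex> lock(mutex);
    return images[key];
  }
};

/* Stub: read an image file / decode one movie frame. */
static ImBuf decode_frame(int /*strip_id*/, int /*frame*/) { return ImBuf{}; }

/* Fire one job per (strip, frame) pair in the lookahead window, all in
 * parallel; each job decodes a single frame and puts it into the cache. */
void prefetch(FrameCache &cache, const std::vector<int> &strip_ids,
              int playhead, int lookahead)
{
  std::vector<std::future<void>> jobs;
  for (int frame = playhead; frame < playhead + lookahead; frame++) {
    for (const int strip_id : strip_ids) {
      const FrameKey key = make_key(strip_id, frame);
      if (cache.contains(key)) {
        continue;
      }
      jobs.push_back(std::async(std::launch::async, [&cache, strip_id, frame, key] {
        cache.insert(key, decode_frame(strip_id, frame));
      }));
    }
  }
  for (std::future<void> &job : jobs) {
    job.wait();
  }
}

/* Playback path: take the frame from the cache if the jobs pre-filled it,
 * otherwise load it synchronously. */
ImBuf frame_get(FrameCache &cache, int strip_id, int frame)
{
  const FrameKey key = make_key(strip_id, frame);
  if (!cache.contains(key)) {
    cache.insert(key, decode_frame(strip_id, frame));
  }
  return cache.get(key);
}
```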

This whole system would also be used while rendering animation: today during render time the overall CPU usage is like 20-30% of the available CPU cores, since input image/movie reading is very serial.

Author
Member

> Primarily, you want two things out of the sequencer:
>
>   1. "no hitching" during playback (i.e. no dropped frames). This should be achievable for a 24 FPS target at around 1080p resolution; somewhat harder for 60 FPS or 4K right now.
>   2. "render performance": how long it takes to render the movie out.

...3) good scrubbing performance :)

> The jobs would _not_ be processed on a single thread, since that severely under-utilizes the CPU (e.g. decoding a single JPG image typically uses just one CPU core). Instead, one "job" is just "read an image file / decode a movie frame and put it into the cache". And you'd fire off multiple of those, all in parallel.

I mean, I get your reasoning, and it is not impossible to do this. There are a few things to consider:

To render image composites you need `Scene`, which the current prefetch copies. It shouldn't be a large amount of data for VSE, but it can be. So this should be considered. Ultimately, VSE shouldn't need this, but it does now.

Your proposal would make sense for loading "raw images". But FFmpeg, for example, can use threads to decode the most used codecs. Perhaps this is not done perfectly, but still. For other strip types, I get the idea. So this would create 2 kinds of prefetching: decoding and compositing.

I will explain in a bit more detail how the current cache and prefetching loop work:

  • The cache must store images in the order they are rendered. This is because it links the images as a means to track their dependencies.
  • So when any image is freed from the cache, it follows the "children" and frees them too. This is partly because the speed effect strip may cause mayhem (though this may be broken already), and partly for cache limiting.
  • When the cache is full and another image needs to be stored, it goes over the limit until a final image for the frame is stored.
  • When the final image for the frame is stored and the cache limit is exceeded, it frees data until the limit is honored.
  • When any image is freed because the limit is applied, it frees all the "children" and "parents". If a whole frame is not freed at once, you can end up with "holes" in the cache: random raw images would be freed because the composite image of the strip above them exists. Then if you decide to tweak some modifier, you need the raw image and you free the "children", which is the opposite state of what you have...

As far as prefetch is concerned, it does not know when to stop. It stops when the cache reports that it is full. There is slightly different logic to decide which images are freed from the cache when prefetching is enabled, otherwise it wouldn't work.
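For illustration only, here is a toy version of that linking idea; these are not the actual VSE structs:

```cpp
#include <vector>

/* Each cached image links to the image it was rendered from ("parent") and
 * the images rendered from it ("children"). Illustrative only. */
struct CacheEntry {
  CacheEntry *parent = nullptr;
  std::vector<CacheEntry *> children;
  bool freed = false;
};

/* Freeing an entry follows the "children", so a composite never outlives
 * the raw image it was built from. */
void free_with_children(CacheEntry *entry)
{
  if (entry == nullptr || entry->freed) {
    return;
  }
  entry->freed = true;
  for (CacheEntry *child : entry->children) {
    free_with_children(child);
  }
}

/* When the limit is applied, "parents" are followed too: walk up to the
 * raw image first, then free the whole chain, so the frame disappears from
 * the cache at once instead of leaving "holes". */
void free_whole_chain(CacheEntry *entry)
{
  if (entry == nullptr) {
    return;
  }
  while (entry->parent != nullptr && !entry->parent->freed) {
    entry = entry->parent;
  }
  free_with_children(entry);
}
```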

Now, because prefetch does not know when to stop, and because of how cache limiting works, this multithreaded preloading would be impossible to implement with the current design. So the question is how much work it would take to change the design for it to work well.

Or you can cheat a bit: say you implement a function to tell you if the cache is less than 3/4 full; then you can use a "burst" mode to load images multithreaded. Also, during playback you can free images from the cache as soon as they are behind the playhead, so the cache frees up faster if playback is "catching up" to the prefetch. This would effectively implement what you propose with minimal changes to a quite delicate caching mechanism.
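A minimal sketch of that cheat, with made-up helper names:

```cpp
#include <cstddef>
#include <thread>

struct CacheStats {
  size_t used_bytes = 0;
  size_t limit_bytes = 0;
};

/* "Burst" mode is allowed while the cache is under 3/4 of its limit. */
static bool burst_allowed(const CacheStats &stats)
{
  return stats.used_bytes < stats.limit_bytes * 3 / 4;
}

/* How many decode jobs to fire in one prefetch iteration. */
static int prefetch_batch_size(const CacheStats &stats)
{
  if (burst_allowed(stats)) {
    /* Plenty of room: saturate the CPU with parallel decode jobs. */
    return int(std::thread::hardware_concurrency());
  }
  /* Near the limit: fall back to the careful one-frame-at-a-time loop. */
  return 1;
}

/* During playback, a frame behind the playhead can be freed immediately,
 * so the cache frees up faster when playback catches up to the prefetch. */
static bool can_evict_during_playback(int frame, int playhead, bool is_playing)
{
  return is_playing && frame < playhead;
}
```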

This is indeed a bit off-topic, but this would IMO be a good design for prefetch V2, which could improve performance quite a bit.


That's a curious cache design!

Some caching systems I've worked on in the past would have approached this differently, something like:

  • There's no "linking" of related results at all,
  • Cache is more or less just a very dumb 128 bit hash of inputs -> resulting image key-value store. Probably with timestamps on "when a value was last accessed", so it can remove oldest entries when it gets full.
  • Now, the key is how to compute the "hash of inputs" bit. And this one is crucial to get correct, in order for cache to work correctly. This would be like:
    • For image input, it's hash of filepath & frame index (for multi-image files), plus whatever else might affect "loading" the image.
    • For movie frame, it's hash of filepath & frame number, plus whatever else might affect "decoding" a movie frame.
    • For preprocessed image, it's hash of (input strip), plus hash of all preprocessing settings (transform, color correction, whatever).
    • For final image, it's combined hashes of all the input strips and their settings that might affect the result.

One big advantage of such a design is that the cache is "very dumb", so to speak. It does not know or care where the images come from, etc. A good-quality 128-bit hash ensures that you won't run into hash collisions before the heat death of the universe. And things like "same input -> same resulting image" fall out of that automatically, without taking up extra space in the cache (e.g. during a previs, an image strip placed and transformed over 10 frames is just one entry in the cache).

One big disadvantage is that doing a "hash of all relevant inputs" needs someone to actually hash all the relevant inputs, and not accidentally forget to hash some.
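A minimal sketch of that kind of dumb key-value cache; the mixing function here is only illustrative, and a real implementation would use an established 128-bit hash (e.g. xxHash128 or MurmurHash3):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct Hash128 {
  uint64_t lo = 0, hi = 0;
  bool operator==(const Hash128 &other) const
  {
    return lo == other.lo && hi == other.hi;
  }
};

struct Hash128Hasher {
  size_t operator()(const Hash128 &h) const { return size_t(h.lo ^ h.hi); }
};

/* Mix one value into an existing hash (illustrative mixer only). */
static Hash128 hash_combine(Hash128 h, uint64_t value)
{
  h.lo = (h.lo ^ value) * 0x9E3779B97F4A7C15ull;
  h.hi = (h.hi ^ h.lo) * 0xC2B2AE3D27D4EB4Full;
  return h;
}

static Hash128 hash_string(Hash128 h, const std::string &s)
{
  for (const char c : s) {
    h = hash_combine(h, uint64_t(uint8_t(c)));
  }
  return h;
}

struct ImBuf {}; /* Placeholder for a decoded image. */

/* Key for one decoded movie frame: filepath, frame number, and anything
 * else that affects decoding (a proxy size here, as an example). Keys for
 * image inputs, preprocessed and final images differ only in what they
 * feed into the hash. */
Hash128 movie_frame_key(const std::string &filepath, int frame, int proxy_size)
{
  Hash128 h;
  h = hash_string(h, filepath);
  h = hash_combine(h, uint64_t(frame));
  h = hash_combine(h, uint64_t(proxy_size));
  return h;
}

/* The cache itself is just a dumb key-value store. */
using DumbCache = std::unordered_map<Hash128, ImBuf, Hash128Hasher>;
```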

Author
Member

There is hashing in the VSE cache, a bit simpler though. The linking is really just to ensure everything is fully loaded. Sure, this could probably be simplified a bit at this point. At the time, cache invalidation was lacking, so linking helped with that as well.

Even without linking, there would have to be a system in place where the cache can only report being full after the final image is inserted. This is because if prefetch is done (all images after the playhead are in the cache) but there is 1 B free, it would go on to render more and could accidentally free an image right after the playhead.

My solutions could be overcomplicated (I really did not know how to write good code at that time. Still probably don't, lol), but I remember solving all these issues.

...Also, originally the cache freed the oldest image, but that is rarely what you want. Currently it frees the image furthest from the playhead position. If prefetching, images on the left side of the playhead are prioritized for freeing.
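For illustration, a sketch of that eviction order with hypothetical names:

```cpp
#include <cstdlib>
#include <vector>

struct CachedFrame {
  int frame = 0;
};

/* True if `a` should be freed before `b`: furthest from the playhead goes
 * first, and while prefetching, frames on the left (already played) side of
 * the playhead are prioritized. */
static bool evict_before(const CachedFrame &a, const CachedFrame &b,
                         int playhead, bool prefetching)
{
  if (prefetching) {
    const bool a_left = a.frame < playhead;
    const bool b_left = b.frame < playhead;
    if (a_left != b_left) {
      return a_left;
    }
  }
  return std::abs(a.frame - playhead) > std::abs(b.frame - playhead);
}

/* Pick which cached frame to free when the limit is exceeded. */
const CachedFrame *eviction_candidate(const std::vector<CachedFrame> &frames,
                                      int playhead, bool prefetching)
{
  const CachedFrame *best = nullptr;
  for (const CachedFrame &f : frames) {
    if (best == nullptr || evict_before(f, *best, playhead, prefetching)) {
      best = &f;
    }
  }
  return best;
}
```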
