VSE Improve video decoding delays #118155

Open
opened 2024-02-12 20:00:02 +01:00 by Richard Antalik · 4 comments

I experimented a bit yesterday and created 2 builds to compare delays during cuts: #118112 and #118114. The results on their own look promising, but the performance improvement is quite marginal.

Looking at the results, I realized that some memory could be reserved for `AVCodecContext` handles. About 10 GB can cover about 100 strips. Now this can be 100 frames or the whole timeline (more likely). VSE could ensure that n strips on either side of the playhead are fully loaded (in a thread), while the others have their `AVCodecContext` freed. This would ensure very good scrubbing performance near the area where you work. Technically this would work similarly to cache prefetching.

Technically, an `AVCodecContext` can be shared across multiple `ImBufAnim` structs. Or it would be possible to share an `ImBufAnim` across multiple strips using the same file. This way even more memory could be saved, allowing the prefetching idea above to cover an even bigger portion of the timeline.
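To illustrate the idea, here is a minimal C++ sketch of such a decoder window; `Strip`, `strip_open_decoder()` and `strip_close_decoder()` are hypothetical stand-ins, not existing Blender API:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Strip {
  int start_frame = 0;
  int end_frame = 0;
  bool decoder_open = false; /* Stands in for holding an AVCodecContext. */
};

/* Hypothetical helpers; in Blender these would allocate/free the FFmpeg
 * decoder state behind the strip's ImBufAnim. */
static void strip_open_decoder(Strip &s)  { s.decoder_open = true; }
static void strip_close_decoder(Strip &s) { s.decoder_open = false; }

/* Keep decoders allocated only for the n strips on either side of the
 * playhead; free the AVCodecContext of everything further away.
 * `strips` is assumed to be sorted by start_frame. */
void update_decoder_window(std::vector<Strip> &strips, int playhead, int n)
{
  /* Index of the first strip that ends at or after the playhead. */
  auto it = std::lower_bound(
      strips.begin(), strips.end(), playhead,
      [](const Strip &s, int frame) { return s.end_frame < frame; });
  const int center = int(it - strips.begin());

  for (int i = 0; i < int(strips.size()); i++) {
    const bool in_window = std::abs(i - center) <= n;
    if (in_window && !strips[i].decoder_open) {
      strip_open_decoder(strips[i]); /* Could happen in a background thread. */
    }
    else if (!in_window && strips[i].decoder_open) {
      strip_close_decoder(strips[i]);
    }
  }
}
```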

Richard Antalik added the Type/Design, Module/VFX & Video, Interest/Video Sequencer labels 2024-02-12 20:00:14 +01:00

Tangentially related, but my thinking was like below (most of it might be wrong/incorrect, since I've no idea what I'm talking about):

Primarily, you want two things out of the sequencer:

  1. "no hitching" during playback (i.e. no dropped frames). This should be achievable for 24FPS target at around 1080p resolution, somewhat harder for 60FPS or 4K resolutions now.
  2. "render performance". How long does it take to render the movie out.

Now, from the (limited) data sets I've seen, both of the above could be improved by using more CPU cores, i.e. doing more things in parallel. While most of the sequencer effects and image filtering are multi-threaded and try to use all the available CPU cores, "the rest" (reading image files, and possibly reading/writing movie frames) does not come anywhere near using all the CPU power.

So what I had in mind was something similar to a "prefetch" system, but only for input images / movie frames. That would work something like this:

  • At any time, look at the current playhead, plus N (e.g. +10) frames ahead of it,
  • Take all the image/movie strips that are in those frames,
  • Fire off jobs that "preload" these into an image cache.

The jobs would _not_ be processed on a single thread, since that severely under-utilizes the CPU (e.g. decoding a single JPG image typically uses just one CPU core). Instead, one "job" is just "read an image file / decode a movie frame and put it into the cache". And you'd fire off multiple of those, all in parallel.

The rest of the VSE playback/rendering pipeline would stay similar to how it is now: at any point, if an image/frame is needed, it is fetched from the cache (hopefully already pre-filled by the jobs above). If it's not there, it is loaded synchronously.
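A rough sketch of how such a prefetch could look; the cache, the key scheme and the `decode_frame()` stub here are all hypothetical, and a real implementation would use Blender's job system rather than `std::async`:

```cpp
#include <cstdint>
#include <future>
#include <mutex>
#include <unordered_map>
#include <vector>

struct ImBuf {}; /* Placeholder for a decoded image. */

/* (strip id, frame) packed into one cache key. */
using FrameKey = uint64_t;
static FrameKey make_key(int strip_id, int frame)
{
  return (FrameKey(uint32_t(strip_id)) << 32) | uint32_t(frame);
}

struct FrameCache {
  std::mutex mutex;
  std::unordered_map<FrameKey, ImBuf> images;

  bool contains(FrameKey key)
  {
    std::lock_guard<std::mutex> lock(mutex);
    return images.count(key) != 0;
  }
  void insert(FrameKey key, ImBuf buf)
  {
    std::lock_guard<std::mutex> lock(mutex);
    images[key] = buf;
  }
  ImBuf get(FrameKey key)
  {
    std::lock_guard<std::mutex> lock(mutex);
    return images[key];
  }
};

/* Stub: read an image file / decode one movie frame. */
static ImBuf decode_frame(int /*strip_id*/, int /*frame*/) { return ImBuf{}; }

/* Fire one job per (strip, frame) pair in the lookahead window, all in
 * parallel; each job decodes a single frame and puts it into the cache. */
void prefetch(FrameCache &cache, const std::vector<int> &strip_ids,
              int playhead, int lookahead)
{
  std::vector<std::future<void>> jobs;
  for (int frame = playhead; frame < playhead + lookahead; frame++) {
    for (const int strip_id : strip_ids) {
      const FrameKey key = make_key(strip_id, frame);
      if (cache.contains(key)) {
        continue;
      }
      jobs.push_back(std::async(std::launch::async, [&cache, strip_id, frame, key] {
        cache.insert(key, decode_frame(strip_id, frame));
      }));
    }
  }
  for (std::future<void> &job : jobs) {
    job.wait();
  }
}

/* Playback path: take the frame from the cache if the jobs pre-filled it,
 * otherwise load it synchronously. */
ImBuf frame_get(FrameCache &cache, int strip_id, int frame)
{
  const FrameKey key = make_key(strip_id, frame);
  if (!cache.contains(key)) {
    cache.insert(key, decode_frame(strip_id, frame));
  }
  return cache.get(key);
}
```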

This whole system would also be used while rendering animation: today during render time the overall CPU usage is like 20-30% of the available CPU cores, since input image/movie reading is very serial.

Author
Member

> Primarily, you want two things out of the sequencer:
>
>   1. "no hitching" during playback (i.e. no dropped frames). This should be achievable for a 24 FPS target at around 1080p resolution; somewhat harder for 60 FPS or 4K right now.
>   2. "render performance": how long it takes to render the movie out.

...3) good scrubbing performance :)

> The jobs would _not_ be processed on a single thread, since that severely under-utilizes the CPU (e.g. decoding a single JPG image typically uses just one CPU core). Instead, one "job" is just "read an image file / decode a movie frame and put it into the cache". And you'd fire off multiple of those, all in parallel.

I mean, I get your reasoning, and it is not impossible to do this. There are a few things to consider:

To render image composites you need `Scene`, which the current prefetch copies. It shouldn't be a large amount of data for VSE, but it can be. So this should be considered. Ultimately, VSE shouldn't need this, but it does now.

Your proposal would make sense for loading "raw images". But FFmpeg, for example, can use threads to decode the most used codecs. Perhaps this is not done perfectly, but still. For other strip types, I get the idea. So this would create 2 kinds of prefetching: decoding and compositing.

I will explain in a bit more detail how the current cache and prefetching loop work:

  • The cache must store images in the order they are rendered. This is because it links the images as a means to track their dependencies.
  • So when any image is freed from the cache, it follows the "children" and frees them too. This is partly because the speed effect strip may cause mayhem (though this may be broken already), and partly for cache limiting.
  • When the cache is full and another image needs to be stored, it goes over the limit until a final image for the frame is stored.
  • When the final image for the frame is stored and the cache limit is exceeded, it frees data until the limit is honored.
  • When any image is freed because the limit is applied, it frees all the "children" and "parents". If a whole frame is not freed at once, you can end up with "holes" in the cache: random raw images would be freed because the composite image of the strip above them exists. Then if you decide to tweak some modifier, you need the raw image and you free the "children", which is the opposite state of what you have...

As far as prefetch is concerned, it does not know when to stop. It stops when the cache reports that it is full. There is slightly different logic to decide which images are freed from the cache when prefetching is enabled, otherwise it wouldn't work.
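For illustration only, here is a toy version of that linking idea; these are not the actual VSE structs:

```cpp
#include <vector>

/* Each cached image links to the image it was rendered from ("parent") and
 * the images rendered from it ("children"). Illustrative only. */
struct CacheEntry {
  CacheEntry *parent = nullptr;
  std::vector<CacheEntry *> children;
  bool freed = false;
};

/* Freeing an entry follows the "children", so a composite never outlives
 * the raw image it was built from. */
void free_with_children(CacheEntry *entry)
{
  if (entry == nullptr || entry->freed) {
    return;
  }
  entry->freed = true;
  for (CacheEntry *child : entry->children) {
    free_with_children(child);
  }
}

/* When the limit is applied, "parents" are followed too: walk up to the
 * raw image first, then free the whole chain, so the frame disappears from
 * the cache at once instead of leaving "holes". */
void free_whole_chain(CacheEntry *entry)
{
  if (entry == nullptr) {
    return;
  }
  while (entry->parent != nullptr && !entry->parent->freed) {
    entry = entry->parent;
  }
  free_with_children(entry);
}
```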

Now, because prefetch does not know when to stop, and because of how cache limiting works, this multithreaded preloading would be impossible to implement with the current design. So the question is how much work it would take to change the design for it to work well.

Or you can cheat a bit: say you implement a function to tell you if the cache is less than 3/4 full; then you can use a "burst" mode to load images multithreaded. Also, during playback you can free images from the cache as soon as they are behind the playhead, so the cache frees up faster if playback is "catching up" to the prefetch. This would effectively implement what you propose with minimal changes to a quite delicate caching mechanism.
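A minimal sketch of that cheat, with made-up helper names:

```cpp
#include <cstddef>
#include <thread>

struct CacheStats {
  size_t used_bytes = 0;
  size_t limit_bytes = 0;
};

/* "Burst" mode is allowed while the cache is under 3/4 of its limit. */
static bool burst_allowed(const CacheStats &stats)
{
  return stats.used_bytes < stats.limit_bytes * 3 / 4;
}

/* How many decode jobs to fire in one prefetch iteration. */
static int prefetch_batch_size(const CacheStats &stats)
{
  if (burst_allowed(stats)) {
    /* Plenty of room: saturate the CPU with parallel decode jobs. */
    return int(std::thread::hardware_concurrency());
  }
  /* Near the limit: fall back to the careful one-frame-at-a-time loop. */
  return 1;
}

/* During playback, a frame behind the playhead can be freed immediately,
 * so the cache frees up faster when playback catches up to the prefetch. */
static bool can_evict_during_playback(int frame, int playhead, bool is_playing)
{
  return is_playing && frame < playhead;
}
```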

This is indeed a bit off-topic, but this would IMO be a good design for prefetch V2, which could improve performance quite a bit.


That's a curious cache design!

Some caching systems I've worked on in the past would have approached this differently, something like:

  • There's no "linking" of related results at all,
  • Cache is more or less just a very dumb 128 bit hash of inputs -> resulting image key-value store. Probably with timestamps on "when a value was last accessed", so it can remove oldest entries when it gets full.
  • Now, the key is how to compute the "hash of inputs" bit. And this one is crucial to get correct, in order for cache to work correctly. This would be like:
    • For image input, it's hash of filepath & frame index (for multi-image files), plus whatever else might affect "loading" the image.
    • For movie frame, it's hash of filepath & frame number, plus whatever else might affect "decoding" a movie frame.
    • For preprocessed image, it's hash of (input strip), plus hash of all preprocessing settings (transform, color correction, whatever).
    • For final image, it's combined hashes of all the input strips and their settings that might affect the result.

One big advantage of such a design is that the cache is "very dumb", so to speak. It does not know or care where the images come from, etc. A good-quality 128-bit hash ensures that you won't run into hash collisions before the heat death of the universe. And things like "same input -> same resulting image" fall out of that automatically, without taking up extra space in the cache (e.g. during a previs, an image strip placed and transformed over 10 frames is just one entry in the cache).

One big disadvantage is that doing a "hash of all relevant inputs" needs someone to actually hash all the relevant inputs, and not accidentally forget to hash some.
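A minimal sketch of that kind of dumb key-value cache; the mixing function here is only illustrative, and a real implementation would use an established 128-bit hash (e.g. xxHash128 or MurmurHash3):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

struct Hash128 {
  uint64_t lo = 0, hi = 0;
  bool operator==(const Hash128 &other) const
  {
    return lo == other.lo && hi == other.hi;
  }
};

struct Hash128Hasher {
  size_t operator()(const Hash128 &h) const { return size_t(h.lo ^ h.hi); }
};

/* Mix one value into an existing hash (illustrative mixer only). */
static Hash128 hash_combine(Hash128 h, uint64_t value)
{
  h.lo = (h.lo ^ value) * 0x9E3779B97F4A7C15ull;
  h.hi = (h.hi ^ h.lo) * 0xC2B2AE3D27D4EB4Full;
  return h;
}

static Hash128 hash_string(Hash128 h, const std::string &s)
{
  for (const char c : s) {
    h = hash_combine(h, uint64_t(uint8_t(c)));
  }
  return h;
}

struct ImBuf {}; /* Placeholder for a decoded image. */

/* Key for one decoded movie frame: filepath, frame number, and anything
 * else that affects decoding (a proxy size here, as an example). Keys for
 * image inputs, preprocessed and final images differ only in what they
 * feed into the hash. */
Hash128 movie_frame_key(const std::string &filepath, int frame, int proxy_size)
{
  Hash128 h;
  h = hash_string(h, filepath);
  h = hash_combine(h, uint64_t(frame));
  h = hash_combine(h, uint64_t(proxy_size));
  return h;
}

/* The cache itself is just a dumb key-value store. */
using DumbCache = std::unordered_map<Hash128, ImBuf, Hash128Hasher>;
```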

Author
Member

There is hashing in the VSE cache, a bit simpler though. The linking is really just to ensure everything is fully loaded. Sure, this could probably be simplified a bit at this point. At the time, cache invalidation was lacking, so linking helped with that as well.

Even without linking, there would have to be a system in place where the cache can only report being full after the final image is inserted. This is because if prefetch is done (all images after the playhead are in the cache) but there is 1 B free, it would go on to render more and could accidentally free an image right after the playhead.

My solutions could be overcomplicated (I really did not know how to write good code at that time. Still probably don't, lol), but I remember solving all these issues.

...Also, originally the cache freed the oldest image, but that is rarely what you want. Currently it frees the image furthest from the playhead position. If prefetching, images on the left side of the playhead are prioritized for freeing.
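For illustration, a sketch of that eviction order with hypothetical names:

```cpp
#include <cstdlib>
#include <vector>

struct CachedFrame {
  int frame = 0;
};

/* True if `a` should be freed before `b`: furthest from the playhead goes
 * first, and while prefetching, frames on the left (already played) side of
 * the playhead are prioritized. */
static bool evict_before(const CachedFrame &a, const CachedFrame &b,
                         int playhead, bool prefetching)
{
  if (prefetching) {
    const bool a_left = a.frame < playhead;
    const bool b_left = b.frame < playhead;
    if (a_left != b_left) {
      return a_left;
    }
  }
  return std::abs(a.frame - playhead) > std::abs(b.frame - playhead);
}

/* Pick which cached frame to free when the limit is exceeded. */
const CachedFrame *eviction_candidate(const std::vector<CachedFrame> &frames,
                                      int playhead, bool prefetching)
{
  const CachedFrame *best = nullptr;
  for (const CachedFrame &f : frames) {
    if (best == nullptr || evict_before(f, *best, playhead, prefetching)) {
      best = &f;
    }
  }
  return best;
}
```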
