Unit Tests - Docker/arm64 - FAIL projects.blender.org/studio/flamenco/internal/worker - context deadline exceeded #104290

Closed
opened 2024-03-07 15:58:57 +01:00 by MichaelC · 12 comments
Contributor

System Information
Operating System(s): Linux

Flamenco Version
Is Broken: Latest main branch arm64 build, when running in Docker
Worked OK: Latest main branch amd64 build, when running directly on the host

Short description of error

After unblocking other test failures (#104288), this new failure consistently appears only on arm64 Docker buildx builds, which fail 100% of the time. The amd64 buildx equivalent does not fail.

There is a huge amount of traceback info in the unit test log (see attached). I think it starts with a context deadline error on a job-status fetch.

#16 49.55 --- FAIL: TestFetchJobsInStatus (0.61s)
#16 49.55     jobs_test.go:400: 
#16 49.55         	Error Trace:	/code/internal/manager/persistence/jobs_test.go:400
#16 49.55         	Error:      	Received unexpected error:
#16 49.55         	            	fetching jobs in status ["canceled" "under-construction"]: context deadline

Exact steps for others to reproduce the error

Use a Docker buildx cross-platform build for arm64 on an amd64 host machine.
Example Docker buildx Dockerfile here: https://docs.dtmc.ca/section/flamenco-development/dev_container/example_dockerfile/

make with-deps
make test

This is likely caused by Docker-based tests being slower than regularly run tests. That test uses a 500ms timeout, which in general should be fine. After all, it shouldn't take more than half a second just to fetch a few jobs from an in-memory database. If that were to happen in a production environment, I'd be worried.

Can you figure out what is causing this slowdown when running in Docker?

Sybren A. Stüvel added the Status: Needs Information from User label 2024-03-07 16:16:42 +01:00
Sybren A. Stüvel changed title from Unit Tests - arm64 - FAIL projects.blender.org/studio/flamenco/internal/worker - context deadline exceeded to Unit Tests - Docker/arm64 - FAIL projects.blender.org/studio/flamenco/internal/worker - context deadline exceeded 2024-03-07 16:17:14 +01:00
Author
Contributor

I agree with the sentiment that 500ms is more than enough time for the unit test's DB fetch. This timeout is likely due to slowness in buildx (which uses a QEMU emulator for arm64 on an amd64 host); many people online report significantly longer build times when cross-building. Interestingly, the build time isn't any different for the Flamenco app, but the emulated runtime environment is probably what introduces this timeout scenario.

I will see what I can learn and optimise with the buildx environment.


If you can't get the execution speed fast enough, we can of course increase the timeout. But better to see if this can be solved at the root first.


@michael-2 Did you manage to speed up the build? Is this still an issue?

Author
Contributor

I've just got back from vacation, but from some reading around, QEMU is just slower. It's not so much that Docker is slow.

It's a problem for me but not a blocker, insofar as my arm nightlies consistently fail the tests. But I can build the executables without running tests during the build.

So probably not worth changing the test timeouts for this scenario.


If you can find a reliable way to detect that the tests are running within an emulator, we could use that ;-)

Author
Contributor

I think the most reliable way would be to pass the host's architecture into the multi-arch Dockerfile as an ENV or ARG, and then update the Flamenco Makefile to compare that value against the architecture it actually runs on. When using Docker Buildx on an amd64 host in my loadbuild setup, that will be the QEMU arm64 "host" environment.

If the Docker host and the make host are different, then we know make is running in QEMU and likely to be slower.
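
For illustration, here's a minimal sketch of the kind of check I mean, if it lived on the Go side; the BUILD_HOST_ARCH variable and the helper name are hypothetical, not existing Flamenco code:

```go
// Hypothetical sketch: the Dockerfile exports the architecture of the machine
// driving the build (e.g. ENV BUILD_HOST_ARCH=amd64, filled from an ARG), and
// this helper compares it to the architecture the binary actually runs on.
package main

import (
	"fmt"
	"os"
	"runtime"
)

// isEmulated reports whether we are likely running under emulation (e.g. QEMU
// via Docker Buildx), based on the hypothetical BUILD_HOST_ARCH variable.
func isEmulated() bool {
	hostArch := os.Getenv("BUILD_HOST_ARCH")
	if hostArch == "" {
		return false // not set: assume a native run
	}
	return hostArch != runtime.GOARCH // e.g. "amd64" host vs "arm64" test binary
}

func main() {
	fmt.Println("emulated:", isEmulated())
}
```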

What are your thoughts on how you'd like to remedy this?

I foresee a few possibilities...
a) ignore time-based tests for emulated builds
b) have some longer timeouts on the tests for emulated builds
c) only fail if not running in an emulated build environment
d) ignore this, as it's related to the loadbuild environment, and assume unit tests should not time out (and that they are run natively)

Author
Contributor

I also reviewed the past couple of weeks of nightly builds and noticed that this arm64 error is now happening consistently...

#16 34.86 --- FAIL: TestAddWorkerToJobBlocklist (1.34s)
#16 34.86     jobs_test.go:765: 
#16 34.86         	Error Trace:	/code/internal/manager/persistence/jobs_test.go:765
#16 34.86         	            				/code/internal/manager/persistence/jobs_test.go:815
#16 34.86         	            				/code/internal/manager/persistence/jobs_blocklist_test.go:13
#16 34.86         	Error:      	Received unexpected error:
#16 34.86         	            	storing job: interrupted (9)
#16 34.86         	Test:       	TestAddWorkerToJobBlocklist
#16 34.86         	Messages:   	error storing authored job in DB

See attached logs.


> If the Docker host and the make host are different, then we know make is running in QEMU and likely to be slower.

I think this decision should be made independently of whether Docker is used or not. QEMU can also be run in different ways, without Docker.

> I foresee a few possibilities...
> a) ignore time-based tests for emulated builds
> b) have some longer timeouts on the tests for emulated builds
> c) only fail if not running in an emulated build environment
> d) ignore this, as it's related to the loadbuild environment, and assume unit tests should not time out (and that they are run natively)

I'd prefer option b) here. Running tests but ignoring failure seems rather useless to me.
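
For illustration, option b) could be as simple as a single multiplier applied to all test timeouts; a minimal sketch, assuming a made-up FLAMENCO_TEST_TIMEOUT_FACTOR environment variable and helper name:

```go
// Illustrative sketch for option b): scale every test timeout by a factor
// taken from the environment, so emulated builds can export e.g.
// FLAMENCO_TEST_TIMEOUT_FACTOR=4 without touching individual tests.
// (The variable and helper names are made up for this example.)
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// testTimeout returns the base timeout, multiplied by an optional factor from
// the hypothetical FLAMENCO_TEST_TIMEOUT_FACTOR environment variable.
func testTimeout(base time.Duration) time.Duration {
	factor, err := strconv.ParseFloat(os.Getenv("FLAMENCO_TEST_TIMEOUT_FACTOR"), 64)
	if err != nil || factor <= 0 {
		return base // unset or invalid: keep the normal timeout
	}
	return time.Duration(float64(base) * factor)
}

func main() {
	// A test that normally allows 500ms would allow 2s with factor 4.
	fmt.Println(testTimeout(500 * time.Millisecond))
}
```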

Author
Contributor

Thanks for the input... I wasn't sure how accommodating you wanted the unit tests to be of different loadbuild/runtime environments.

I quickly looked at the code base and it was not clear to me where these timeouts are defined. They don't appear to be in the tests themselves, which just check for error vs. no error. If you have time, perhaps you can help direct me to them.

For (b), I was thinking that as a first step it would be prudent not to error on these timeouts without first letting the underlying calls complete, and then check/print the actual time against the permissible time. Then we would be able to document/configure "acceptable" timeouts based on the runtime environment. This is a bit more work, but it would allow easier tuning in the future than, say, a hard-coded list of per-platform timeouts. Perhaps, in front of these application tests, some platform tests could attenuate the acceptable timeouts with simple multipliers for slower platforms?

This may of course be a bit of a wild goose chase, as QEMU on a modern PC would probably run these tests within the acceptable timeouts... but in light of Docker, k8s, slower PCs (like mine), and possibly even IoT platforms like the Raspberry Pi, it may occur for some people and not others. That's why I'm thinking about automating the identification of these thresholds.

Is this over-thinking it, or something worth heading towards, given Flamenco is a best-effort multi-platform application?
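
To make that first step concrete, something like this rough sketch is what I have in mind; the fetch call below is just a stand-in, not the real Flamenco persistence API:

```go
// Rough sketch of the "measure first" idea: run the call without a deadline,
// time it, and report how the duration compares to the permissible budget,
// instead of failing immediately on a context timeout.
// fetchJobsInStatus is a placeholder, not the real Flamenco persistence API.
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	const budget = 500 * time.Millisecond // what the test would normally enforce

	start := time.Now()
	_ = fetchJobsInStatus(context.Background(), "canceled", "under-construction")
	elapsed := time.Since(start)

	if elapsed > budget {
		fmt.Printf("fetch took %v, %v over the %v budget\n", elapsed, elapsed-budget, budget)
	} else {
		fmt.Printf("fetch took %v, within the %v budget\n", elapsed, budget)
	}
}

// fetchJobsInStatus simulates some database work so the sketch runs on its own.
func fetchJobsInStatus(ctx context.Context, statuses ...string) error {
	time.Sleep(50 * time.Millisecond)
	return nil
}
```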


> I quickly looked at the code base and it was not clear to me where these timeouts are defined. They don't appear to be in the tests themselves, which just check for error vs. no error. If you have time, perhaps you can help direct me to them.

These are set as timeouts on the context variables. For example, job_compilers_test.go has:

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)

The Manager's persistence tests use similar timeouts, but there the timeout is passed as a parameter to the persistenceTestFixtures() function, which sets things up in a centralised way. Other tests take similar approaches.
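
The surrounding pattern typically looks roughly like the sketch below; the fetch call is only a placeholder, not the actual Flamenco test code:

```go
// Minimal sketch of the usual test-context pattern: a deadline on the context
// bounds everything done with it, and cancel() releases its resources.
// fetchJobsInStatus is a placeholder for the persistence call under test.
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	// If the call takes longer than 500ms, it is broken off and returns
	// context.DeadlineExceeded; the test then reports an unexpected error.
	if err := fetchJobsInStatus(ctx, "canceled", "under-construction"); err != nil {
		fmt.Println("unexpected error:", err)
	}
}

func fetchJobsInStatus(ctx context.Context, statuses ...string) error {
	select {
	case <-time.After(50 * time.Millisecond): // simulated database work
		return nil
	case <-ctx.Done():
		return ctx.Err() // deadline exceeded or cancelled
	}
}
```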

> For (b), I was thinking that as a first step it would be prudent not to error on these timeouts without first letting the underlying calls complete, and then check/print the actual time against the permissible time.

Not sure if that's a good idea, as then we'd have to add those timeout checks explicitly everywhere. Right now they're simply handled by Go's default, context-based approach for limiting how long a certain process can take. This means that it will also simply break off hanging I/O calls.

> Is this over-thinking it, or something worth heading towards, given Flamenco is a best-effort multi-platform application?

I think you're over-thinking things. On the tests that fail, keep doubling the timeout until it doesn't fail any more; then, if you want to be extra precise, start shaving off 10% at a time until it starts failing again. There you'll have the actual timeout that you need for that test to be successful.

Since you haven't even mentioned yet what kind of timeout would work for you, I think any way of automatically trying to figure out what timeout would work is overkill. If we need to increase that one test from 500ms to 600ms to have it run reliably, then let's just do that.


Closing due to silence by the original poster.
