Check for number of workers before soft failing the task. #104195

Merged
Sybren A. Stüvel merged 15 commits from Nitin-Rawat-1/flamenco:104190-job-stuck into main 2023-04-20 11:53:43 +02:00
Contributor

Manager: fixed issue #104190 job getting stuck with less workers than soft-failed threshold,
before soft-failing check the number of workers to decide if job should be failed or not.

Manager: fixed issue #104190 job getting stuck with less workers than soft-failed threshold, before soft-failing check the number of workers to decide if job should be failed or not.
Nitin-Rawat-1 added 1 commit 2023-03-09 19:19:27 +01:00
before soft-failing check the number of workers to decide if job should be failed or not.
Nitin-Rawat-1 requested review from Sybren A. Stüvel 2023-03-09 19:19:50 +01:00
Nitin-Rawat-1 changed title from Check for number of workers before soft failing the task. to WIP: Check for number of workers before soft failing the task. 2023-03-09 19:58:23 +01:00
Nitin-Rawat-1 changed title from WIP: Check for number of workers before soft failing the task. to Check for number of workers before soft failing the task. 2023-03-09 20:04:23 +01:00
Sybren A. Stüvel requested changes 2023-03-13 11:31:29 +01:00
Sybren A. Stüvel left a comment
Owner

Thanks for the patch!

Thanks for the patch!
@ -188,6 +188,13 @@ func (f *Flamenco) onTaskFailed(
Int("threshold", threshold).
Logger()
if numFailed < threshold {

If you flip this condition, you can return early and reduce the nesting level of the following code.

if numFailed > threshold {
    return f.hardFailTask(ctx, logger, worker, task, numFailed)
}
numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task)
...
If you flip this condition, you can return early and reduce the nesting level of the following code. ```go if numFailed > threshold { return f.hardFailTask(ctx, logger, worker, task, numFailed) } numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task) ... ```
Nitin-Rawat-1 marked this conversation as resolved
@ -191,0 +192,4 @@
if err != nil {
return err
}
if numWorkers == 1 {

I'm assuming that 1 is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the 1 comes from ;-)

Also it's better to use numWorkers <= 1 here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.

I'm assuming that `1` is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the `1` comes from ;-) Also it's better to use `numWorkers <= 1` here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.
Nitin-Rawat-1 marked this conversation as resolved
Nitin-Rawat-1 added 1 commit 2023-03-17 10:43:33 +01:00
Nitin-Rawat-1 requested review from Sybren A. Stüvel 2023-03-17 11:11:31 +01:00

This is looking good, thanks! Excited to see this nasty thing getting fixed.

Could you also extend worker_task_updates_test.go with a test that covers this situation? The best way is to comment out your changes first, then write a test that actually fails. Re-enabling your changes should then make the test pass. That way you'll be more confident that the test is doing the right thing.

This is looking good, thanks! Excited to see this nasty thing getting fixed. Could you also extend `worker_task_updates_test.go` with a test that covers this situation? The best way is to comment out your changes first, then write a test that actually fails. Re-enabling your changes should then make the test pass. That way you'll be more confident that the test is doing the right thing.
Nitin-Rawat-1 added 4 commits 2023-04-04 06:31:59 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-04 08:19:54 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-04 09:24:48 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-04 14:29:54 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-04 16:02:36 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-10 07:51:26 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-10 07:52:20 +02:00
Author
Contributor

@dr.sybren I have written the tests could you review it.

@dr.sybren I have written the tests could you review it.
Nitin-Rawat-1 added 1 commit 2023-04-14 11:11:58 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-14 17:08:23 +02:00
Nitin-Rawat-1 added 1 commit 2023-04-17 17:35:15 +02:00
Sybren A. Stüvel approved these changes 2023-04-20 11:53:29 +02:00
Sybren A. Stüvel left a comment
Owner

Looks good, thanks for the PR!

Looks good, thanks for the PR!
Sybren A. Stüvel merged commit 752597b8e1 into main 2023-04-20 11:53:43 +02:00
Sign in to join this conversation.
No description provided.