Check for number of workers before soft failing the task. #104195

Nitin-Rawat-1 · 2023-03-09T19:19:25+01:00

Nitin-Rawat-1 commented

2023-03-09 19:19:25 +01:00

Manager: fixed issue #104190 job getting stuck with less workers than soft-failed threshold,
before soft-failing check the number of workers to decide if job should be failed or not.

Manager: fixed issue #104190 job getting stuck with less workers than soft-failed threshold, before soft-failing check the number of workers to decide if job should be failed or not.

Nitin-Rawat-1 added 1 commit 2023-03-09 19:19:27 +01:00

Manager: fixed issue #104190 job getting stuck with less workers than soft-failed threshold 9fdf5aa7c5

before soft-failing check the number of workers to decide if job should be failed or not.

Nitin-Rawat-1 requested review from Sybren A. Stüvel 2023-03-09 19:19:50 +01:00

Nitin-Rawat-1 changed title from ~~Check for number of workers before soft failing the task.~~ to WIP: Check for number of workers before soft failing the task.

2023-03-09 19:58:23 +01:00

Nitin-Rawat-1 changed title from ~~WIP: Check for number of workers before soft failing the task.~~ to Check for number of workers before soft failing the task.

2023-03-09 20:04:23 +01:00

Sybren A. Stüvel requested changes 2023-03-13 11:31:29 +01:00

Sybren A. Stüvel left a comment

Thanks for the patch!

internal/manager/api_impl/worker_task_updates.go Outdated

						
				@ -188,6 +188,13 @@ func (f *Flamenco) onTaskFailed(

						Int("threshold", threshold).

						Logger()

					if numFailed < threshold {

Sybren A. Stüvel commented

2023-03-13 11:27:53 +01:00

If you flip this condition, you can return early and reduce the nesting level of the following code.

if numFailed > threshold {
    return f.hardFailTask(ctx, logger, worker, task, numFailed)
}
numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task)
...

If you flip this condition, you can return early and reduce the nesting level of the following code. ```go if numFailed > threshold { return f.hardFailTask(ctx, logger, worker, task, numFailed) } numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task) ... ```

Nitin-Rawat-1 marked this conversation as resolved

internal/manager/api_impl/worker_task_updates.go Outdated

						
				@ -191,0 +192,4 @@

						if err != nil {

							return err

						}

						if numWorkers == 1 {

Sybren A. Stüvel commented

2023-03-13 11:29:08 +01:00

I'm assuming that 1 is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the 1 comes from ;-)

Also it's better to use numWorkers <= 1 here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.

I'm assuming that `1` is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the `1` comes from ;-) Also it's better to use `numWorkers <= 1` here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.

Nitin-Rawat-1 marked this conversation as resolved

Nitin-Rawat-1 added 1 commit 2023-03-17 10:43:33 +01:00

reviese the conditions for job failure 6e24e0be3b

Nitin-Rawat-1 requested review from Sybren A. Stüvel 2023-03-17 11:11:31 +01:00

Sybren A. Stüvel commented

2023-03-20 12:26:55 +01:00

This is looking good, thanks! Excited to see this nasty thing getting fixed.

Could you also extend worker_task_updates_test.go with a test that covers this situation? The best way is to comment out your changes first, then write a test that actually fails. Re-enabling your changes should then make the test pass. That way you'll be more confident that the test is doing the right thing.

This is looking good, thanks! Excited to see this nasty thing getting fixed. Could you also extend `worker_task_updates_test.go` with a test that covers this situation? The best way is to comment out your changes first, then write a test that actually fails. Re-enabling your changes should then make the test pass. That way you'll be more confident that the test is doing the right thing.

👍 1

Nitin-Rawat-1 added 4 commits 2023-04-04 06:31:59 +02:00

Merge remote-tracking branch 'upstream/main' into 104190-job-stuck 65c6be1fe2

Merge remote-tracking branch 'upstream/main' into 104190-job-stuck 5ceafb1a9f

Merge remote-tracking branch 'upstream/main' into 104190-job-stuck ad96e3bb25

We should also hard fail the task when numFailed == threshold ac88d57ede

Nitin-Rawat-1 added 1 commit 2023-04-04 08:19:54 +02:00

Tests for TaskUpdate needs to be updated. 03533c1e49

Nitin-Rawat-1 added 1 commit 2023-04-04 09:24:48 +02:00

Add test to check the job failure condition when number of workers available for the job is less than failure threshold. ff0a36d19d

Nitin-Rawat-1 added 1 commit 2023-04-04 14:29:54 +02:00

Merge remote-tracking branch 'upstream/main' into 104190-job-stuck 5c101c47fb