Check for number of workers before soft failing the task. #104195

Merged
Sybren A. Stüvel merged 15 commits from Nitin-Rawat-1/flamenco:104190-job-stuck into main 2023-04-20 11:53:43 +02:00
Showing only changes of commit 9fdf5aa7c5 - Show all commits

View File

@ -188,6 +188,13 @@ func (f *Flamenco) onTaskFailed(
Int("threshold", threshold). Int("threshold", threshold).
Logger() Logger()
if numFailed < threshold { if numFailed < threshold {
Nitin-Rawat-1 marked this conversation as resolved Outdated

If you flip this condition, you can return early and reduce the nesting level of the following code.

if numFailed > threshold {
    return f.hardFailTask(ctx, logger, worker, task, numFailed)
}
numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task)
...
If you flip this condition, you can return early and reduce the nesting level of the following code. ```go if numFailed > threshold { return f.hardFailTask(ctx, logger, worker, task, numFailed) } numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task) ... ```
numWorkers, err := f.numWorkersCapableOfRunningTask(ctx, task)
if err != nil {
return err
}
if numWorkers == 1 {
Nitin-Rawat-1 marked this conversation as resolved Outdated

I'm assuming that 1 is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the 1 comes from ;-)

Also it's better to use numWorkers <= 1 here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.

I'm assuming that `1` is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the `1` comes from ;-) Also it's better to use `numWorkers <= 1` here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.
return f.failJobAfterCatastroficTaskFailure(ctx, logger, worker, task)
}
return f.softFailTask(ctx, logger, worker, task, numFailed) return f.softFailTask(ctx, logger, worker, task, numFailed)
} }
return f.hardFailTask(ctx, logger, worker, task, numFailed) return f.hardFailTask(ctx, logger, worker, task, numFailed)