Check for number of workers before soft failing the task. #104195
No reviewers
Labels
No Label
Good First Issue
Priority
High
Priority
Low
Priority
Normal
Status
Archived
Status
Confirmed
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Job Type
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: studio/flamenco#104195
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "Nitin-Rawat-1/flamenco:104190-job-stuck"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Manager: fixed issue #104190 job getting stuck with less workers than soft-failed threshold,
before soft-failing check the number of workers to decide if job should be failed or not.
Check for number of workers before soft failing the task.to WIP: Check for number of workers before soft failing the task.WIP: Check for number of workers before soft failing the task.to Check for number of workers before soft failing the task.Thanks for the patch!
@ -188,6 +188,13 @@ func (f *Flamenco) onTaskFailed(
Int("threshold", threshold).
Logger()
if numFailed < threshold {
If you flip this condition, you can return early and reduce the nesting level of the following code.
@ -191,0 +192,4 @@
if err != nil {
return err
}
if numWorkers == 1 {
I'm assuming that
1
is because this worker hasn't been registered as failing this task yet, and thus is still counted. If this is indeed the case, it should be mentioned in a comment, as it's not entirely obvious from glancing at the code. Or maybe I'm wrong, and then there should definitely be a comment that explains where the1
comes from ;-)Also it's better to use
numWorkers <= 1
here, as this should also hard-fail if there are zero workers left to run this task. Given that it's a very asynchronous system, there could be other factors that we just don't see right now here in this part of the code, that could lead to unexpected results.This is looking good, thanks! Excited to see this nasty thing getting fixed.
Could you also extend
worker_task_updates_test.go
with a test that covers this situation? The best way is to comment out your changes first, then write a test that actually fails. Re-enabling your changes should then make the test pass. That way you'll be more confident that the test is doing the right thing.@dr.sybren I have written the tests could you review it.
Looks good, thanks for the PR!