Job can get stuck with less workers than soft-fail threshold #104190
Reference: studio/flamenco#104190
Flamenco Version
Is Broken: 3.2
Worked OK: never
Short description of error
If a Worker fails a task, the task is marked as soft-failed. It only moves to failed status when more than 3 workers fail the same task. This means that if there are fewer than 3 workers in the farm, the job gets stuck.
Exact steps for others to reproduce the error
blender --python-expr 'raise SystemExit("fake failure")'
Expected behaviour: after the failure, Flamenco detects that there are no workers left to retry the task, so the task immediately fails. Because the job cannot complete, the job fails as well.
Actual behaviour: after the failure the task sits indefinitely at soft-failed, and the job remains active.
Marked as high priority, as on smaller render farms this can make any task failure a blocker of the entire job.
If the number of workers capable of running the failed task again is 1, then there is no worker besides the one that just failed the task. In that case we should just fail the job itself.
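The check described above could look roughly like the following Go sketch. It is not Flamenco's actual code: the type, the constants, and the function statusAfterFailure are hypothetical names made up for illustration, and the exact threshold comparison (>=) is an assumption.

```go
package main

import "fmt"

// TaskStatus is a hypothetical stand-in for the task statuses named in this report.
type TaskStatus string

const (
	StatusSoftFailed TaskStatus = "soft-failed"
	StatusFailed     TaskStatus = "failed"
)

// statusAfterFailure decides which status a task gets after a worker fails it.
//
// failedWorkers:  distinct workers that have failed this task, including the current one.
// capableWorkers: workers able to run this task at all, including those that already failed it.
// threshold:      number of failures after which the task hard-fails (assumed comparison: >=).
func statusAfterFailure(failedWorkers, capableWorkers, threshold int) TaskStatus {
	// Enough workers have failed the task: hard-fail it.
	if failedWorkers >= threshold {
		return StatusFailed
	}
	// No worker is left that has not already failed this task, so a retry
	// can never happen. Hard-fail instead of waiting at soft-failed forever.
	if capableWorkers-failedWorkers <= 0 {
		return StatusFailed
	}
	// Some other worker can still retry the task.
	return StatusSoftFailed
}

func main() {
	// One-worker farm: the only capable worker just failed the task.
	fmt.Println(statusAfterFailure(1, 1, 3)) // failed
	// Two-worker farm, first failure: the other worker can still retry.
	fmt.Println(statusAfterFailure(1, 2, 3)) // soft-failed
	// Larger farm, threshold reached.
	fmt.Println(statusAfterFailure(3, 5, 3)) // failed
}
```

With a check like this, a farm with fewer workers than the threshold no longer leaves the task at soft-failed: once every capable worker has failed it, the task fails, and the job can fail with it.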
This was actually fixed by @Nitin-Rawat-1, thanks again for the patch!