Job can get stuck with fewer workers than soft-fail threshold #104190

Closed
opened 2023-02-28 11:56:08 +01:00 by Sybren A. Stüvel · 2 comments

Flamenco Version
Is Broken: 3.2
Worked OK: never

Short description of error
If a Worker fails a task, the task is marked as soft-failed. It only moves to failed status when more than 3 workers fail the same task. This means that if there are fewer than 3 workers in the farm, the job gets stuck.
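
To make the failure mode concrete, here is a minimal sketch of the rule as described above. It is not Flamenco's actual code (Flamenco is written in Go); the names `SOFT_FAIL_THRESHOLD` and `task_status_after_failures` are hypothetical, and the "more than 3" comparison is taken from the wording of this report rather than from the source.

```python
# Assumption: threshold semantics follow the "more than 3 workers" wording above.
SOFT_FAIL_THRESHOLD = 3


def task_status_after_failures(failed_workers: set) -> str:
    """Return the task status given the distinct workers that failed it."""
    if len(failed_workers) > SOFT_FAIL_THRESHOLD:
        return "failed"
    if failed_workers:
        # Waiting for another worker to retry, even if no such worker exists.
        return "soft-failed"
    return "queued"


# A one-worker farm can only ever contribute one distinct failure,
# so the threshold is never crossed and the task never leaves soft-failed.
print(task_status_after_failures({"worker-1"}))
```

With only one worker, the set of failed workers cannot grow past one entry, so under this rule the task stays at soft-failed no matter how long the farm runs.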

Exact steps for others to reproduce the error

  • Set up Flamenco with one Worker
  • Submit a blend file that will fail (for example blender --python-expr 'raise SystemExit("fake failure")')
  • Start the job

Expected behaviour: after the failure, Flamenco detects that there are no workers left to retry the task, so the task immediately fails. Because the job cannot complete, the job fails as well.

Actual behaviour: after the failure the task sits indefinitely at soft-failed, and the job remains active.

Marked as high priority because on smaller render farms any task failure can block the entire job.

Sybren A. Stüvel added the Priority: High, Status: Confirmed, Type: Bug labels 2023-02-28 11:56:21 +01:00
Sybren A. Stüvel added the Good First Issue label 2023-02-28 12:44:30 +01:00
Contributor

If the number of workers capable of running the failed task again is 1, there is no worker left besides the one that just failed the task. In that case we should simply fail the job itself.
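
The check suggested here could be sketched roughly as follows. This is an illustration only, not the patch that was merged: the function `status_after_failure`, its parameters, and the default threshold are hypothetical names chosen for this example.

```python
def status_after_failure(failed_workers, capable_workers, threshold=3):
    """Decide the task status right after a worker reports a failure.

    failed_workers:  distinct workers that have already failed this task.
    capable_workers: all workers able to run this task, including failed ones.
    threshold:       assumed soft-fail limit ("more than 3" in the report).
    """
    if len(failed_workers) > threshold:
        return "failed"
    # Workers that could still retry the task.
    remaining = set(capable_workers) - set(failed_workers)
    if not remaining:
        # Nobody is left to retry, so waiting at soft-failed would hang the job.
        return "failed"
    return "soft-failed"


print(status_after_failure({"worker-1"}, {"worker-1"}))
print(status_after_failure({"worker-1"}, {"worker-1", "worker-2"}))
```

In this sketch the first call yields a hard failure because the only capable worker has already failed the task, while the second keeps the task at soft-failed because another worker can still retry it. Failing the job would then follow from the failed task, as described in the expected behaviour.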

Author
Owner

This was actually fixed by @Nitin-Rawat-1, thanks again for the patch!

Sybren A. Stüvel added this to the v3.3 milestone 2023-04-24 13:50:52 +02:00
Reference: studio/flamenco#104190