Handle Worker re-requesting tasks without finishing them #104319
Reference: studio/flamenco#104319
In the (not-so-hypothetical) case that a Worker runs out of memory and the OOM killer kicks in, killing the Flamenco Worker process itself, the Worker won't actually report the failure as such to the Manager. These cases should be detected by the Manager and handled as actual failures. It's probably best to block-list the Worker for that job.
Hi, I'm doing some research to work on this, and this is what I have so far.
In `workers.go`, under the `SignOn` method, I can see which worker is signing in; I'm able to get its name, UUID, last-seen time, and more. In `worker_task_updates.go`, under the `TaskUpdate` method, I'm able to programmatically blocklist a worker by its UUID.
Now, with that knowledge, I'm confident that I can achieve the solution. I'd need to use the last-seen date to calculate a frequency of sign-ins.
My issue is that I can tell whether a worker has signed in again in under a second, for example, by comparing the last-seen date with the present time. But to get a frequency over 2 or more sign-ins, I'd need to store that data somewhere for later validation.
I'm not sure whether this can be kept as in-memory state or whether the feature even deserves a place in the database; this is where things are getting fuzzy for me. I know that I can see and blocklist any worker, but I can't yet design robust logic for calculating the frequency.
I don't think this is the right place to do this. The problem is that the Worker gets killed, and so it does not send any task update at all. When it comes back to life, it just signs on and resumes whatever task it got assigned.
I think you're heading in the right direction, though. It's indeed something that'll have to be stored in the database. To determine what should be stored & how, let's look at the behaviour we want:
`offline`, something is fishy. I think we can be more lenient when it was `asleep` or `error`, but when the last-known state was `awake`, increment a "this Worker is fishy" counter.
I think such a counter would form a good basis for this feature. I'm not entirely sure if it is enough, or whether more complex tracking is necessary (or whether that more complex tracking would cause a hard-to-predict system).
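As a rough illustration of that counter, the check at sign-on could look like the Go sketch below. The `Worker` struct, its `FishyCounter` field, and `NoteSignOn` are hypothetical stand-ins rather than Flamenco's actual types, and a real implementation would persist the counter in the database instead of holding it in a struct.

```go
package main

import "fmt"

// Status names as they appear in the discussion above.
const (
	StatusOffline = "offline"
	StatusAsleep  = "asleep"
	StatusError   = "error"
	StatusAwake   = "awake"
)

// Worker is a hypothetical, trimmed-down stand-in for the Manager's
// record of a Worker. It is not Flamenco's real model.
type Worker struct {
	UUID         string
	Status       string // last-known status, as stored before this sign-on
	FishyCounter int    // assumed to live in the database in a real system
}

// NoteSignOn inspects the last-known status at sign-on time. A Worker that
// was last seen "awake" never signed off cleanly, so the counter goes up.
// "asleep" and "error" are treated leniently and "offline" is the normal
// case, so none of those touch the counter. It reports whether the sign-on
// was counted as fishy.
func NoteSignOn(w *Worker) bool {
	if w.Status != StatusAwake {
		return false
	}
	w.FishyCounter++
	return true
}

func main() {
	clean := &Worker{UUID: "worker-1", Status: StatusOffline}
	killed := &Worker{UUID: "worker-2", Status: StatusAwake}

	fmt.Println(NoteSignOn(clean), clean.FishyCounter)   // false 0
	fmt.Println(NoteSignOn(killed), killed.FishyCounter) // true 1
}
```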
`active`: `active`. To me this feels fragile, though, as some future update to the Worker could optimize this call away (only sending actual status changes to the Manager), and that would break our fault detection system.
For more complex tracking, we could check, when the worker signs on, which task type of which job it was working on before. The counter could be split up into multiple counters, one per job UUID + task type combo. That'll make it more robust when there are multiple jobs to work on simultaneously, and the worker is alternating between them. It might be overkill, though.
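A sketch of the split counters, again in Go and under stated assumptions: `FishyCounters`, `RecordUncleanSignOn`, and the threshold of 2 are invented for illustration, the job and task-type strings are placeholders, and the map would be a database table in practice.

```go
package main

import "fmt"

// counterKey identifies one counter: one Worker on one job + task type combo.
type counterKey struct {
	WorkerUUID string
	JobUUID    string
	TaskType   string
}

// FishyCounters is a hypothetical in-memory version of the per-combo
// counters. The threshold is an assumption, not a value from the discussion.
type FishyCounters struct {
	threshold int
	counts    map[counterKey]int
}

// NewFishyCounters creates counters that trip once `threshold` is reached.
func NewFishyCounters(threshold int) *FishyCounters {
	return &FishyCounters{threshold: threshold, counts: make(map[counterKey]int)}
}

// RecordUncleanSignOn increments the counter for the job + task type the
// Worker was working on before it disappeared. It reports whether the
// counter reached the threshold, meaning the Worker is a candidate for
// block-listing on that combo.
func (c *FishyCounters) RecordUncleanSignOn(workerUUID, jobUUID, taskType string) bool {
	key := counterKey{WorkerUUID: workerUUID, JobUUID: jobUUID, TaskType: taskType}
	c.counts[key]++
	return c.counts[key] >= c.threshold
}

func main() {
	counters := NewFishyCounters(2)

	// The Worker alternates between two jobs; each combo is counted separately.
	fmt.Println(counters.RecordUncleanSignOn("worker-1", "job-A", "blender")) // false
	fmt.Println(counters.RecordUncleanSignOn("worker-1", "job-B", "blender")) // false
	fmt.Println(counters.RecordUncleanSignOn("worker-1", "job-A", "blender")) // true
}
```

Keying on a struct keeps a failure on one job from counting against the Worker on another, which is the robustness gain described above, at the cost of more state to track.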