Handle Worker re-requesting tasks without finishing them #104319


Sybren A. Stüvel · 2024-07-01T10:33:44+02:00

Sybren A. Stüvel commented

2024-07-01 10:33:44 +02:00

In the (not-so-hypothetical) case that a Worker runs out of memory and the OOM killer kicks in, killing the Flamenco Worker process itself, the Worker won't report the failure to the Manager. The Manager should detect these cases and handle them as actual failures. It's probably best to block-list the Worker for that job.

Sybren A. Stüvel added the "Type: Design" label 2024-07-01 10:33:44 +02:00

Mateus Abelli commented

2024-07-10 01:32:39 +02:00

Hi, I'm doing some research to work on this, and this is what I have so far.

In workers.go, in the SignOn method, I can see which Worker is signing on; I can get its name, UUID, last-seen time, and more.

In worker_task_updates.go, in the TaskUpdate method, I'm able to programmatically blocklist a Worker by its UUID like this:

if worker.UUID == "03ec316b-849b-4404-b33c-4b136b14fc57" {
	f.blocklistWorker(bgCtx, logger, worker, dbTask)
}

With that knowledge I'm confident that I can implement a solution. I'd need to use the last-seen time to calculate the frequency of sign-ins.

My issue is this: by comparing the last-seen time with the current time, I can tell whether a Worker signed in again within, say, a second. But to measure a frequency over two or more sign-ins, I'd need to store that data somewhere for later validation.

I'm not sure whether this state can be kept in memory or whether this feature deserves a place in the database; this is where things get fuzzy for me. I know that I can see and blocklist any Worker, but I can't yet design robust frequency-calculation logic.


Sybren A. Stüvel commented

2024-07-11 11:06:36 +02:00

> In worker_task_updates.go under the TaskUpdate method I'm able to programmatically blocklist a worker by its UUID like this:

I don't think this is the right place to do this. The problem is that the Worker gets killed, and so it does not send any task update at all. When it comes back to life, it just signs on and resumes whatever task it got assigned.

I think you're heading in the right direction, though. It's indeed something that'll have to be stored in the database. To determine what should be stored & how, let's look at the behaviour we want:

On worker sign-on: Check what the last-known status of the Worker was. If this was anything but offline, something is fishy. I think we can be more lenient when it was asleep or error, but when the last-known state was awake, increment a "this Worker is fishy" counter.
On worker sign-off: Reset the counter to zero, as we've seen the Worker behave normally.
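
The two rules above could be sketched roughly as follows. This is only an illustration: the `Worker` struct is a trimmed-down stand-in for the real database model, and `FishyCount`, `OnSignOn`, and `OnSignOff` are hypothetical names. The status values are the ones named above.

```go
package main

// WorkerStatus holds the Worker statuses named in the discussion.
type WorkerStatus string

const (
	StatusOffline WorkerStatus = "offline"
	StatusAwake   WorkerStatus = "awake"
	StatusAsleep  WorkerStatus = "asleep"
	StatusError   WorkerStatus = "error"
)

// Worker is a hypothetical, trimmed-down stand-in for the persisted
// Worker record. FishyCount is the proposed new field.
type Worker struct {
	UUID       string
	Status     WorkerStatus // last-known status, as stored before this sign-on
	FishyCount int
}

// OnSignOn inspects the last-known status. A Worker that was still
// "awake" never signed off cleanly, so the counter is incremented.
// "asleep" and "error" are treated leniently and "offline" is the normal
// case; all three leave the counter untouched. Updating the status itself
// is left to the existing sign-on code.
func OnSignOn(w *Worker) {
	if w.Status == StatusAwake {
		w.FishyCount++
	}
}

// OnSignOff resets the counter, as a clean sign-off means the Worker
// behaved normally, and records the Worker as offline.
func OnSignOff(w *Worker) {
	w.FishyCount = 0
	w.Status = StatusOffline
}
```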

I think such a counter would form a good basis for this feature. I'm not entirely sure if it is enough, or whether more complex tracking is necessary (or whether that more complex tracking would cause a hard-to-predict system).

When the worker requests a task: There is some logic in the task scheduler to see if the Worker already had an active task assigned. If so, it just gets that task again. This function could be expanded, so that the caller knows whether the task was already assigned, or whether it just got assigned to this worker for the first time. The caller can then determine what to do with this info, like checking the counter to see if we're in the to-be-blocklisted situation.
When the worker sets the task status to active: This could be an alternative point to implement the above behaviour. Here the old & new task statuses could be compared, and we could see that the task was already active. To me this feels fragile, though, as some future update to the Worker could optimize this call away (only sending actual status changes to the Manager), and that would break our fault detection system.
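
A rough sketch of the first option, where the scheduler tells its caller whether the task was a re-assignment. None of this is the actual Flamenco scheduler: `Task`, `ScheduleResult`, `ScheduleTask`, `ShouldBlocklist`, the `queued` status string, and the `threshold` parameter are assumptions made for the example.

```go
package main

// Task is a hypothetical, simplified task record.
type Task struct {
	UUID       string
	WorkerUUID string // empty while the task is unassigned
	Status     string // "queued" and "active" are the only values used here
}

// ScheduleResult tells the caller which task was handed out, and whether
// the Worker already had it assigned before this request.
type ScheduleResult struct {
	Task               *Task
	WasAlreadyAssigned bool
}

// ScheduleTask first looks for an active task that is already assigned to
// the Worker and returns it again. Otherwise it assigns the first queued,
// unassigned task. A zero ScheduleResult means there is no work.
func ScheduleTask(workerUUID string, tasks []*Task) ScheduleResult {
	for _, t := range tasks {
		if t.WorkerUUID == workerUUID && t.Status == "active" {
			return ScheduleResult{Task: t, WasAlreadyAssigned: true}
		}
	}
	for _, t := range tasks {
		if t.WorkerUUID == "" && t.Status == "queued" {
			t.WorkerUUID = workerUUID
			return ScheduleResult{Task: t}
		}
	}
	return ScheduleResult{}
}

// ShouldBlocklist combines the scheduler's answer with the Worker's
// "fishy" counter: only a re-requested task from a Worker whose counter
// reached the threshold qualifies.
func ShouldBlocklist(res ScheduleResult, fishyCount, threshold int) bool {
	return res.Task != nil && res.WasAlreadyAssigned && fishyCount >= threshold
}
```

Keeping the blocklist decision in the caller, rather than in the scheduler, leaves the scheduler free of policy; that split is a design choice of this sketch, not something settled above.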

For more complex tracking, we could check, when the worker signs on, which task type of which job it was working on before. The counter could be split up into multiple counters, one per job UUID + task type combo. That'll make it more robust when there are multiple jobs to work on simultaneously, and the worker is alternating between them. It might be overkill, though.
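
As an illustration of the split counters, a per-Worker map keyed on the job UUID and task type could look like the sketch below. `CounterKey` and `FishyCounters` are hypothetical names, and how such counters would be persisted in the database is left open.

```go
package main

// CounterKey identifies one counter: a job UUID + task type combination.
type CounterKey struct {
	JobUUID  string
	TaskType string
}

// FishyCounters holds the counters of a single Worker. This is a
// hypothetical structure for illustration only.
type FishyCounters map[CounterKey]int

// Increment bumps the counter for the given job and task type, and
// returns its new value.
func (c FishyCounters) Increment(jobUUID, taskType string) int {
	key := CounterKey{JobUUID: jobUUID, TaskType: taskType}
	c[key]++
	return c[key]
}

// Get returns the current counter value; unknown combinations are zero.
func (c FishyCounters) Get(jobUUID, taskType string) int {
	return c[CounterKey{JobUUID: jobUUID, TaskType: taskType}]
}

// Reset clears all counters, for example after a clean sign-off.
func (c FishyCounters) Reset() {
	for key := range c {
		delete(c, key)
	}
}
```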

