After an initial delay (to allow workers to come back online after the
Manager was down), the `flamenco_workers` collection is scanned for workers
that have `status="offline"` and have not been seen for longer than
`worker_cleanup_max_age`. If that setting is zero, auto-removal is disabled.
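
The selection rule above can be sketched as a filter over in-memory
worker records. This is only an illustration: the `Worker` struct, its
field names, and the `staleWorkers` function are assumptions, and the
real code queries the `flamenco_workers` collection instead of a slice.

```go
package main

import (
	"fmt"
	"time"
)

// Worker holds only the fields this sketch needs; the field names are assumptions.
type Worker struct {
	ID       string
	Status   string
	LastSeen time.Time
}

// staleWorkers returns the offline workers that were last seen more than
// maxAge before now. A zero maxAge disables auto-removal entirely.
func staleWorkers(workers []Worker, now time.Time, maxAge time.Duration) []Worker {
	if maxAge == 0 {
		return nil
	}
	threshold := now.Add(-maxAge)
	var stale []Worker
	for _, w := range workers {
		if w.Status == "offline" && w.LastSeen.Before(threshold) {
			stale = append(stale, w)
		}
	}
	return stale
}

// Fixed example data, so the outcome does not depend on the wall clock.
var exampleNow = time.Date(2018, time.June, 15, 12, 0, 0, 0, time.UTC)

var exampleWorkers = []Worker{
	{ID: "w1", Status: "offline", LastSeen: exampleNow.Add(-48 * time.Hour)},
	{ID: "w2", Status: "offline", LastSeen: exampleNow.Add(-1 * time.Hour)},
	{ID: "w3", Status: "awake", LastSeen: exampleNow.Add(-48 * time.Hour)},
}

func main() {
	// With a 24-hour maximum age only w1 is both offline and old enough.
	for _, w := range staleWorkers(exampleWorkers, exampleNow, 24*time.Hour) {
		fmt.Println("would remove", w.ID)
	}
}
```
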
When the task log file is uncompressed we only show the first X and last
Y kilobytes of logging. When the log file is compressed this isn't (easily)
possible, so the file is sent as a GZipped attachment instead, forcing the
browser to download it to disk rather than loading it all into memory.
When the user agent is Wget or curl, the entire log is always served.
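
A minimal sketch of the head/tail truncation and the user agent check,
assuming the log is already available as a string. The function names,
the marker line between head and tail, and the prefix match on the
User-Agent header are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// headTail returns the first headBytes and last tailBytes of log, joined by a
// marker line. Logs that fit within both limits are returned unchanged.
// It slices on bytes, so a multi-byte UTF-8 character at a cut may be split.
func headTail(log string, headBytes, tailBytes int) string {
	if headBytes < 0 || tailBytes < 0 || len(log) <= headBytes+tailBytes {
		return log
	}
	return log[:headBytes] + "\n...\n" + log[len(log)-tailBytes:]
}

// wantsFullLog reports whether the client looks like Wget or curl, using a
// case-insensitive prefix match on the User-Agent header (an assumption).
func wantsFullLog(userAgent string) bool {
	ua := strings.ToLower(userAgent)
	return strings.HasPrefix(ua, "wget") || strings.HasPrefix(ua, "curl")
}

func main() {
	log := strings.Repeat("x", 5000)
	fmt.Println(len(headTail(log, 1024, 2048)))
	fmt.Println(wantsFullLog("curl/7.58.0"))
}
```
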
For this we keep track of which worker failed which task (in
`Task.FailedByWorkers`). The scheduler will not assign a worker any
task it failed before.
When there are no more workers left to run a task (either because of
blacklisting or because all workers have tried & failed this particular
task) the status will be 'failed', otherwise 'soft-failed'.
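
The choice between the two statuses could look roughly like this. It is
a sketch under assumptions: workers are identified by plain string IDs,
and the set of workers that failed the task and the blacklist are passed
in as maps rather than looked up in the database.

```go
package main

import "fmt"

// statusAfterFailure decides which status a task gets after a worker failed it.
// workers lists the IDs of workers that could run this kind of task,
// failedBy holds the workers that already failed this task, and
// blacklisted holds the workers blacklisted for this job and task type.
func statusAfterFailure(workers []string, failedBy, blacklisted map[string]bool) string {
	for _, id := range workers {
		if !failedBy[id] && !blacklisted[id] {
			// Someone else can still try, so the failure is not final.
			return "soft-failed"
		}
	}
	return "failed"
}

func main() {
	workers := []string{"w1", "w2", "w3"}
	failedBy := map[string]bool{"w1": true}
	blacklisted := map[string]bool{"w2": true}
	fmt.Println(statusAfterFailure(workers, failedBy, blacklisted)) // soft-failed: w3 is left

	failedBy["w3"] = true
	fmt.Println(statusAfterFailure(workers, failedBy, blacklisted)) // failed: nobody is left
}
```
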
The `extraUpdates` parameter should now be the "outer" update dict, so
instead of passing
`M{"field": "value-to-set"}`
pass
`M{"$set": M{"field": "value-to-set"}}`
This allows future code to pass things like `$unset` or `$addToSet`.
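
To illustrate why the outer dict is more flexible, here is a sketch of
how a function could combine its own update with `extraUpdates`. The `M`
type below is a local stand-in for the BSON map type, and `mergeUpdates`
and the field names are hypothetical, not the Manager's actual code.

```go
package main

import "fmt"

// M stands in for the Manager's BSON map type in this sketch.
type M map[string]interface{}

// mergeUpdates combines the update operators of extra into a copy of base.
// When both contain the same operator (for example "$set") and both values
// are of type M, their fields are merged, with extra winning on conflicts.
func mergeUpdates(base, extra M) M {
	merged := M{}
	for op, fields := range base {
		merged[op] = fields
	}
	for op, fields := range extra {
		existing, haveExisting := merged[op].(M)
		incoming, isMap := fields.(M)
		if !haveExisting || !isMap {
			merged[op] = fields
			continue
		}
		combined := M{}
		for k, v := range existing {
			combined[k] = v
		}
		for k, v := range incoming {
			combined[k] = v
		}
		merged[op] = combined
	}
	return merged
}

func main() {
	base := M{"$set": M{"status": "failed"}}
	extra := M{
		"$set":   M{"activity": "Task failed"},
		"$unset": M{"worker_id": ""},
	}
	fmt.Println(mergeUpdates(base, extra))
}
```
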
This makes it possible for a worker to disappear from the planet and
still have the task finished by another worker.
For this to work, the `active_task_timeout_interval` setting must be
bigger than the `active_worker_timeout_interval` setting.
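
The relation between the two settings could be verified when the
configuration is loaded. The sketch below assumes both settings have
already been parsed into `time.Duration` values; `checkTimeouts` is a
hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// checkTimeouts returns an error unless the task timeout is strictly
// bigger than the worker timeout.
func checkTimeouts(activeTaskTimeout, activeWorkerTimeout time.Duration) error {
	if activeTaskTimeout <= activeWorkerTimeout {
		return fmt.Errorf(
			"active_task_timeout_interval (%v) must be bigger than active_worker_timeout_interval (%v)",
			activeTaskTimeout, activeWorkerTimeout)
	}
	return nil
}

func main() {
	if err := checkTimeouts(10*time.Minute, 15*time.Minute); err != nil {
		fmt.Println("invalid configuration:", err)
	}
}
```
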
When a Worker sends a task update with `status='failed'`, the Manager
overrides that status to `status='soft-failed'` as long as there is a
worker that is *not* blacklisted for that specific task type/job. This
continues until the soft-failing worker is itself blacklisted, at which
point the failures are assumed to be caused by that worker. All the
previously soft-failed tasks are then set to `'claimed-by-manager'` so
that they can be picked up by another worker.
We now test the actually queued statuses, rather than just the queue size.
This didn't uncover any errors, but is a good preparation for introducing
new functionality in the future.
After blacklisting, the tasks failed by the blacklisted worker are now
only requeued if there is still a worker left that can execute them (based
on each worker's supported task types + blacklist).
When the Server asks for a log file that does not exist, just create a
log file that states it does not exist, and send that. This makes the
Server stop asking us for that file over and over again.
The server can pass us (job ID, task ID) tuples in the response of the
'task-update-batch' endpoint. These tuples are then used to find the
task's log file, compress it, and send it to Flamenco Server.
The queue of logfiles to send is maintained by the Server. This means
we'll repeatedly get the same (job ID, task ID) until we've actually
uploaded the logfile to the Server's satisfaction. As a result, we don't
persist the requested IDs, but rely on the server to pass us the list
again if need be.
Status changes can now be marked as 'lazy', in which case they are only
applied when the worker has finished its current task. This only required
changes to the 'may-I-run' endpoint; it now ignores lazy requests.
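
The effect on the 'may-I-run' check can be sketched as follows. The
`StatusChangeRequest` type and `mayKeepRunning` function are
hypothetical, and the sketch assumes that a pending non-lazy status
change is what makes the Manager tell a worker to stop.

```go
package main

import "fmt"

// StatusChangeRequest is a hypothetical model of a requested worker status change.
type StatusChangeRequest struct {
	Status string
	Lazy   bool
}

// mayKeepRunning reports whether a worker may continue with its current task.
// A nil request means no status change is pending. Lazy requests are ignored
// here, because they only take effect once the current task is finished.
func mayKeepRunning(req *StatusChangeRequest) bool {
	return req == nil || req.Lazy
}

func main() {
	fmt.Println(mayKeepRunning(nil))                                                  // true
	fmt.Println(mayKeepRunning(&StatusChangeRequest{Status: "asleep", Lazy: true}))  // true
	fmt.Println(mayKeepRunning(&StatusChangeRequest{Status: "asleep", Lazy: false})) // false
}
```
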
The browser would pop out a `<worker-row>` element from a table because it
ejects all non-`<tr>` elements there. Apparently this doesn't happen when
the template is in a `<script type='text/x-template'>` tag, so we can
simplify.
When a Worker notifies the Manager that a task failed, the Manager counts
how many tasks this worker has failed on this job with the same task type
as the task that just failed. If this count is above a threshold,
the (worker ID, job ID, task type) tuple is added to the blacklist. This
prevents the worker from getting such tasks. Matching failed tasks are
re-queued so that they can be executed by another worker.
This requires a new setting `blacklist_threshold`, which indicates the
number of failed tasks at which the above behaviour is triggered. It
defaults to 3. This means that it's likely that we should also increase
the TASK_FAIL_JOB_PERCENTAGE constant in Flamenco Server so that it's
more lenient towards failure (as excessive failure will trigger
requeueing anyway).
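
The counting and threshold check described above, sketched over an
in-memory task list. The struct and function names are assumptions, and
the sketch reads `blacklist_threshold` as "blacklist once the count
reaches the threshold"; the real code may compare differently.

```go
package main

import "fmt"

// Task holds only the fields this sketch needs; the field names are assumptions.
type Task struct {
	JobID    string
	TaskType string
	Status   string
	WorkerID string
}

// BlacklistEntry is the (worker ID, job ID, task type) tuple from the text.
type BlacklistEntry struct {
	WorkerID string
	JobID    string
	TaskType string
}

// countFailures counts the failed tasks that match the given tuple.
func countFailures(tasks []Task, key BlacklistEntry) int {
	count := 0
	for _, t := range tasks {
		if t.Status == "failed" && t.WorkerID == key.WorkerID &&
			t.JobID == key.JobID && t.TaskType == key.TaskType {
			count++
		}
	}
	return count
}

// shouldBlacklist reports whether the failure count has reached the threshold.
func shouldBlacklist(tasks []Task, key BlacklistEntry, threshold int) bool {
	return countFailures(tasks, key) >= threshold
}

func main() {
	key := BlacklistEntry{WorkerID: "w1", JobID: "job1", TaskType: "blender-render"}
	tasks := []Task{
		{JobID: "job1", TaskType: "blender-render", Status: "failed", WorkerID: "w1"},
		{JobID: "job1", TaskType: "blender-render", Status: "failed", WorkerID: "w1"},
		{JobID: "job1", TaskType: "file-management", Status: "failed", WorkerID: "w1"},
		{JobID: "job1", TaskType: "blender-render", Status: "failed", WorkerID: "w1"},
	}
	// Three matching failures with a threshold of 3: the tuple gets blacklisted.
	fmt.Println(shouldBlacklist(tasks, key, 3))
}
```
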
Note that there is NO starvation detection. In other words, if a job has
certain tasks that were failed by all available workers (and thus all
workers are blacklisted for this job & task type) there is no detection
that this happened. As a result, the job will be stuck in 'active'
status without it ever having a chance of being finished.
It wouldn't handle implicit end times properly when computing the 'next
check' timestamp. They are now correctly interpreted as 'midnight the next
day'.
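
One way to compute 'midnight the next day' in Go is shown below. This is
a generic sketch, not the code from this change; `nextMidnight` is a
hypothetical helper that works in the location of the given time.

```go
package main

import (
	"fmt"
	"time"
)

// nextMidnight returns 00:00 on the day after t, in t's own location.
// time.Date normalises the day overflow, so month and year ends work too.
func nextMidnight(t time.Time) time.Time {
	year, month, day := t.Date()
	return time.Date(year, month, day+1, 0, 0, 0, 0, t.Location())
}

// exampleNow is a fixed moment on the last day of a year.
var exampleNow = time.Date(2018, time.December, 31, 22, 30, 0, 0, time.UTC)

func main() {
	fmt.Println(nextMidnight(exampleNow))
}
```
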