1958 Commits

Author SHA1 Message Date
Sybren A. Stüvel 4fe11d99f6 Configurable name in dashboard
Now the title & version is also dynamically updated with Vue.
2019-02-22 16:12:41 +01:00
Sybren A. Stüvel 84b9eb2b09 Bumped version to 2.4-dev5 2019-02-21 17:42:58 +01:00
Sybren A. Stüvel 141addc371 build-via-docker.sh: made bundle-creation conditional based on $TARGET
This makes it possible to uncomment the if-target-specified-then-don't-bundle
condition and bundle for the given target.
2019-02-21 17:41:10 +01:00
Sybren A. Stüvel 72761a60fa Handle task timing metrics from the Worker
They are simply stored & forwarded to the Server, no processing is done.
2019-02-21 17:19:20 +01:00
Sybren A. Stüvel 1143a5b957 Bumped version to 2.4-dev4 2019-02-21 13:45:39 +01:00
Sybren A. Stüvel eb59c020dd Worker cleanup: Requeue active tasks before deleting worker 2019-02-21 13:45:22 +01:00
Sybren A. Stüvel c52c65d2b4 Worker cleanup: configurable set of statuses to auto-remove
This allows us to configure the Manager to also auto-delete timed-out
workers.
2019-02-21 12:11:53 +01:00
Sybren A. Stüvel 7bbbe3c0c5 Automatically delete offline workers
After an initial delay (to allow workers to come back online after the
Manager was down) the flamenco_workers collection is scanned for workers
that have `status="offline"` and haven't been seen in longer than
`worker_cleanup_max_age`. If that setting is zero, auto-removal is disabled.
2019-02-21 12:11:53 +01:00
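The selection rule from the commit above can be sketched as follows. This is a minimal illustration, not the Manager's actual code: the `Worker` struct and `workersToRemove` are hypothetical names, and the real Manager queries the flamenco_workers collection rather than filtering a slice. Only the `worker_cleanup_max_age` setting, the "zero disables" rule, and the configurable status set (from c52c65d2b4) come from the commit messages.

```go
package main

import "time"

// Worker is a hypothetical, reduced stand-in for a document in the
// flamenco_workers collection.
type Worker struct {
	ID       string
	Status   string
	LastSeen time.Time
}

// workersToRemove returns the workers that qualify for automatic removal.
// maxAge plays the role of the worker_cleanup_max_age setting; zero disables
// auto-removal. statuses is the configurable set of statuses to auto-remove.
func workersToRemove(workers []Worker, now time.Time, maxAge time.Duration, statuses map[string]bool) []Worker {
	if maxAge == 0 {
		return nil // auto-removal is disabled
	}
	var stale []Worker
	for _, w := range workers {
		// Only remove workers in a configured status that have not been
		// seen for longer than maxAge.
		if statuses[w.Status] && now.Sub(w.LastSeen) > maxAge {
			stale = append(stale, w)
		}
	}
	return stale
}
```

Passing `map[string]bool{"offline": true}` reproduces the behaviour of this commit; adding a timed-out status to the set would correspond to the configurable variant.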
Sybren A. Stüvel 7db66cb69d Log server: human-readable sizes in 'Skipped ... bytes' message 2019-02-20 09:47:50 +01:00
Sybren A. Stüvel ea917be44c When serving log file, conditionally only show head + tail of the log
When the task log file is uncompressed we only show the first X and last
Y kilobytes of logging. When the log file is compressed this isn't (easily)
possible, so then the file is sent as a GZipped attachment (so forcing the
browser to download it to disk instead of loading it all in memory).

When the user agent is Wget or curl, the entire log is always served.
2019-02-19 18:44:19 +01:00
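A minimal in-memory sketch of the head + tail selection described in the commit above. The function name and parameters are hypothetical; the commit does not state the values of X and Y, and the Manager presumably works on the file itself rather than on a byte slice.

```go
package main

// headAndTail returns the first headSize and the last tailSize bytes of log,
// plus the number of bytes skipped in between. When the log is small enough
// to be shown whole, it is returned as head, tail is nil and skipped is zero.
func headAndTail(log []byte, headSize, tailSize int) (head, tail []byte, skipped int) {
	if len(log) <= headSize+tailSize {
		return log, nil, 0
	}
	head = log[:headSize]
	tail = log[len(log)-tailSize:]
	return head, tail, len(log) - headSize - tailSize
}
```

The `skipped` count is the kind of value that the 'Skipped ... bytes' message from 7db66cb69d could report between the two parts.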
Sybren A. Stüvel c0de578817 Bumped version to 2.4-dev3 2019-02-19 17:03:42 +01:00
Sybren A. Stüvel 3478225239 Updated changelog 2019-02-19 16:34:15 +01:00
Sybren A. Stüvel 5f9bc2e2d4 Limit number of workers that can retry a task after it failed
This defaults to 3 workers, i.e. after three different workers have run
the task and failed, it will *not* be soft-failed, but really stay failed.
2019-02-19 16:14:53 +01:00
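The retry cap from the commit above could look roughly like this. The constant and function names are hypothetical; only the default of 3 comes from the commit message, and the sketch assumes `failedByWorkers` holds distinct worker IDs including the worker that just failed.

```go
package main

// defaultMaxRetryWorkers reflects the default of 3 mentioned in the commit
// message; the name is made up for this sketch.
const defaultMaxRetryWorkers = 3

// statusForFailure picks the status of a task that just failed. Once
// maxWorkers different workers have failed it, the task stays "failed"
// instead of being "soft-failed" for another attempt.
func statusForFailure(failedByWorkers []string, maxWorkers int) string {
	if len(failedByWorkers) >= maxWorkers {
		return "failed"
	}
	return "soft-failed"
}
```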
Sybren A. Stüvel e47780a633 Allow soft-failed tasks to be run by other workers
For this we keep track of which worker failed which task (in
`Task.FailedByWorkers`). The scheduler will not assign a worker
tasks it failed before.

When there are no more workers left to run a task (either because of
blacklisting or because all workers have tried & failed this particular
task) the status will be 'failed', otherwise 'soft-failed'.
2019-02-19 15:38:42 +01:00
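The bookkeeping described in the commit above, sketched under assumptions: `Task.FailedByWorkers` is named in the commit message, but its element type, the `mayRun` and `statusAfterFailure` helpers, and the `candidates` parameter (standing for the workers that are not blacklisted for this task) are hypothetical.

```go
package main

// Task is a hypothetical, reduced task document.
type Task struct {
	FailedByWorkers []string // IDs of workers that already failed this task
}

// mayRun reports whether the scheduler may hand this task to the worker.
func (t Task) mayRun(workerID string) bool {
	for _, id := range t.FailedByWorkers {
		if id == workerID {
			return false
		}
	}
	return true
}

// statusAfterFailure records that failedBy failed the task and picks the new
// status: "soft-failed" while some candidate worker may still run the task,
// "failed" when nobody is left.
func statusAfterFailure(t *Task, failedBy string, candidates []string) string {
	if t.mayRun(failedBy) {
		t.FailedByWorkers = append(t.FailedByWorkers, failedBy)
	}
	for _, id := range candidates {
		if t.mayRun(id) {
			return "soft-failed"
		}
	}
	return "failed"
}
```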
Sybren A. Stüvel 8f827a2eb2 TaskUpdateQueue::QueueTaskUpdateWithExtra now expects outer update dict
The `extraUpdates` parameter should now be the "outer" update dict, so
instead of passing
    `M{"field": "value-to-set"}`
pass
    `M{"$set": M{"field": "value-to-set"}}`

This allows future code to pass things like `$unset` or `$addToSet`.
2019-02-19 15:38:36 +01:00
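To illustrate why passing the outer update dict is more flexible, here is a sketch of how a queue could merge its own operators with the caller's. `M` is defined locally to mirror the shorthand in the commit message, and `mergeUpdates` is a hypothetical helper, not the signature of `QueueTaskUpdateWithExtra`.

```go
package main

// M mirrors the M{...} shorthand from the commit message. It is defined
// locally so the sketch does not depend on a MongoDB driver.
type M map[string]interface{}

// mergeUpdates combines two "outer" update dicts operator by operator, so
// that fields under the same operator (e.g. "$set") end up in one dict.
// Values that are not of type M are ignored in this sketch.
func mergeUpdates(base, extra M) M {
	merged := M{}
	for _, update := range []M{base, extra} {
		for op, fields := range update {
			fieldMap, ok := fields.(M)
			if !ok {
				continue
			}
			target, exists := merged[op].(M)
			if !exists {
				target = M{}
				merged[op] = target
			}
			for name, value := range fieldMap {
				target[name] = value
			}
		}
	}
	return merged
}
```

With the old "inner" form, the caller could only contribute fields to `$set`; with the outer form, an extra such as `M{"$addToSet": M{"some_list": "value"}}` merges in without special handling.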
Sybren A. Stüvel 9639e1cb47 Typo fix 2019-02-19 15:38:36 +01:00
Sybren A. Stüvel 2dc054b23f When a worker times out, its active task is now re-queued
This makes it possible for a worker to disappear from the planet and
still have the task finished by another worker.

For this to work, the `active_task_timeout_interval` setting must be
bigger than the `active_worker_timeout_interval` setting.
2019-02-15 17:26:10 +01:00
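The constraint on the two settings can be expressed as a small configuration check. This is a sketch only: `checkTimeouts` is a hypothetical helper, and the commit does not say whether the Manager validates this itself.

```go
package main

import (
	"fmt"
	"time"
)

// checkTimeouts returns an error unless the task timeout is strictly bigger
// than the worker timeout, as the commit message requires.
func checkTimeouts(activeTaskTimeout, activeWorkerTimeout time.Duration) error {
	if activeTaskTimeout <= activeWorkerTimeout {
		return fmt.Errorf(
			"active_task_timeout_interval (%v) must be bigger than active_worker_timeout_interval (%v)",
			activeTaskTimeout, activeWorkerTimeout)
	}
	return nil
}
```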
Sybren A. Stüvel 5a1b95f097 Bumped version to 2.4-dev2 2019-02-14 15:13:11 +01:00
Sybren A. Stüvel 338218f02a Updated CHANGELOG.md 2019-02-14 15:13:06 +01:00
Sybren A. Stüvel e9c67553a3 Soft-fail tasks when there are workers left to retry it
When a Worker sends a task update with `status='failed'`, that status is
actually overridden by the Manager to `status='soft-failed'` if there is
a worker that is *not* blacklisted for that specific task type/job. This
happens until the soft-failing worker is actually blacklisted, in which
case it is assumed to be an issue with the worker. All the previously
soft-failed tasks are set to `'claimed-by-manager'` so that they can be
picked up by another worker.
2019-02-14 14:32:00 +01:00
Sybren A. Stüvel d62e23d6f8 More detailed testing of task updates when blacklisting
We now test the actually queued statuses, rather than just the queue size.
This didn't uncover any errors, but is a good preparation for introducing
new functionality in the future.
2019-02-14 11:52:52 +01:00
Sybren A. Stüvel 46dd7659d4 Bumped version to 2.4-dev1 2019-02-12 15:06:25 +01:00
Sybren A. Stüvel 72c46706ea Fix T59491: Manager should detect starvation due to blacklisting
After blacklisting, the tasks failed by the blacklisted worker are now
only requeued if there is still a worker left who can execute it (based
on worker's supported task types + blacklist).
2019-02-12 15:05:23 +01:00
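The starvation check from the commit above amounts to asking whether any worker is left for a given job and task type. A minimal sketch, with hypothetical type and function names; the (worker ID, job ID, task type) blacklist tuple follows the description in a6e5900f09.

```go
package main

// Worker is a hypothetical, reduced worker record.
type Worker struct {
	ID                 string
	SupportedTaskTypes []string
}

// blacklistKey identifies one blacklist entry.
type blacklistKey struct {
	WorkerID, JobID, TaskType string
}

// workersLeft reports whether at least one worker supports taskType and is
// not blacklisted for that task type on the given job.
func workersLeft(workers []Worker, blacklist map[blacklistKey]bool, jobID, taskType string) bool {
	for _, w := range workers {
		if blacklist[blacklistKey{w.ID, jobID, taskType}] {
			continue
		}
		for _, supported := range w.SupportedTaskTypes {
			if supported == taskType {
				return true
			}
		}
	}
	return false
}
```

Tasks failed by the blacklisted worker would only be requeued when this returns true.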
Sybren A. Stüvel 0f9fb203b4 Send "this log file does not exist" as log file when it doesn't exist.
When the Server asks for a log file that does not exist, just create a
log file that states it does not exist, and send that. This makes the
Server stop asking us for that file over and over again.
2019-01-11 18:20:24 +01:00
Sybren A. Stüvel cfb5cc825d Added missing return statement 2019-01-11 17:48:45 +01:00
Sybren A. Stüvel 84f1718a4b Bumped version to 2.4-dev0 2019-01-11 11:06:18 +01:00
Sybren A. Stüvel fa2e914245 Updated example config with more concrete variables
Especially the Blender location on macOS is now more realistic.
2019-01-11 10:42:44 +01:00
Sybren A. Stüvel 916333dc25 Bumped version to 2.3 v2.3 2019-01-10 11:54:19 +01:00
Sybren A. Stüvel 5735bbee2e Upload task log files when requested from the Flamenco Server
The server can pass us (job ID, task ID) tuples in the response of the
'task-update-batch' endpoint. These tuples are then used to find the
task's log file, compress it, and send it to the Flamenco Server.

The queue of logfiles to send is maintained by the Server. This means
we'll repeatedly get the same (job ID, task ID) until we've actually
uploaded the logfile to the Server's satisfaction. As a result, we don't
persist the requested IDs, but rely on the server to pass us the list
again if need be.
2019-01-09 17:00:00 +01:00
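The compression step could be sketched with Go's standard `compress/gzip` package, as below. The commit message does not name the compression format used for uploads, so gzip is an assumption here, and `compressLog` is a hypothetical helper working on an in-memory log.

```go
package main

import (
	"bytes"
	"compress/gzip"
)

// compressLog gzips the contents of a task log and returns the result.
func compressLog(contents []byte) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	if _, err := gz.Write(contents); err != nil {
		return nil, err
	}
	// Close flushes the remaining data and writes the gzip footer.
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```

Because the Server keeps asking for a log until it has received it, a failed upload needs no local retry queue: the same (job ID, task ID) tuple simply arrives again in a later response.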
Sybren A. Stüvel 65c74bc303 Less strict timeout checks
This makes the unit test less likely to fail while the computer is already
doing other stuff.
2019-01-09 17:00:00 +01:00
Sybren A. Stüvel 135f195d9c Don't pass pointer to array
Arrays are by-reference structures already, so no need to use pointers.
2019-01-09 14:20:01 +01:00
Sybren A. Stüvel ea368d5c9b Dashboard: added checkbox to (de)select all workers 2018-12-18 15:53:24 +01:00
Sybren A. Stüvel e85a902fb7 Include ffmpeg variable in default settings 2018-12-18 14:35:16 +01:00
Sybren A. Stüvel 9a84e8cb7d Dashboard: Fixed tiling issue on latest-image viewer 2018-12-18 14:17:39 +01:00
Sybren A. Stüvel a2377990be Dashboard: Hide blacklist header when the blacklist is empty 2018-12-18 12:20:48 +01:00
Sybren A. Stüvel 9feccdfe4a Dashboard: reduced number of columns in worker table
The 'blacklist' toggle now toggles 'details' instead, which consists of the
blacklist and worker details (currently ID and Address).
2018-12-18 12:17:47 +01:00
Sybren A. Stüvel cc827807bb Bumped version to 2.3-dev2 2018-12-18 10:53:50 +01:00
Sybren A. Stüvel c5d6b6c6c2 Update changelog 2018-12-18 10:53:41 +01:00
Sybren A. Stüvel cfe561c79e Fix T58779: allow lazy status change requests
Status changes can now be marked as 'lazy', in which case they are only
applied when the worker has finished its current task. This only required
changes to the 'may-I-run' endpoint; it now ignores lazy requests.
2018-12-18 10:51:55 +01:00
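The effect on the 'may-I-run' endpoint can be sketched as a single decision. The `StatusChangeRequest` type and `mayKeepRunning` function are hypothetical and reduced to what the commit message describes.

```go
package main

// StatusChangeRequest is a hypothetical representation of a pending status
// change for a worker.
type StatusChangeRequest struct {
	Status string // the requested worker status
	Lazy   bool   // apply only after the current task has finished
}

// mayKeepRunning sketches the 'may-I-run' decision: a pending non-lazy
// request makes the worker stop its task, while a lazy request is ignored
// until the task is done.
func mayKeepRunning(pending *StatusChangeRequest) bool {
	return pending == nil || pending.Lazy
}
```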
Sybren A. Stüvel e64ffe098d Compatibility with older MongoDB 2018-12-17 17:18:05 +01:00
Sybren A. Stüvel 0742684326 Update changelog 2018-12-17 17:11:15 +01:00
Sybren A. Stüvel f15e445baa Dashboard: make worker blacklist visible 2018-12-17 17:08:17 +01:00
Sybren A. Stüvel 43050af48b Vue.js: no need for <tr is="worker-row"> in <script> template
The browser would pop out a `<worker-row>` element from a table because it
ejects all non-`<tr>` elements there. Apparently this doesn't happen when
the template is in a `<script type='text/x-template'>` tag, so we can
simplify.
2018-12-17 15:16:47 +01:00
Sybren A. Stüvel a6e5900f09 Fix T50981 Worker deallocation from job if fails n tasks
When a Worker notifies the Manager a task failed, the number of failed
tasks of this worker, on this job, of the same task type as the
currently failed task is counted. If this count is above a threshold,
the (worker ID, job ID, task type) tuple is added to the blacklist. This
prevents the worker from getting such tasks. Matching failed tasks are
re-queued so that they can be executed by another worker.

This requires a new setting `blacklist_threshold`, which indicates the
number of failed tasks at which the above behaviour is triggered. It
defaults to 3. This means that it's likely that we should also increase
the TASK_FAIL_JOB_PERCENTAGE constant in Flamenco Server so that it's
more lenient towards failure (as excessive failure will trigger
requeueing anyway).

Note that there is NO starvation detection. In other words, if a job has
certain tasks that were failed by all available workers (and thus all
workers are blacklisted for this job & task type) there is no detection
that this happened. As a result, the job will be stuck in 'active'
status without it ever having a chance of being finished.
2018-12-17 14:28:27 +01:00
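A minimal sketch of the blacklisting rule described in the commit above. All names are hypothetical, the counter lives in memory whereas the Manager counts failed tasks in its database, and the sketch assumes the blacklist triggers once the count reaches `blacklist_threshold` (the commit message says both "above a threshold" and "the number of failed tasks at which the above behaviour is triggered").

```go
package main

// blacklistKey is the (worker ID, job ID, task type) tuple from the commit
// message.
type blacklistKey struct {
	WorkerID, JobID, TaskType string
}

// failureTracker counts failures per tuple and blacklists a tuple once its
// count reaches the threshold.
type failureTracker struct {
	threshold int
	failures  map[blacklistKey]int
	blacklist map[blacklistKey]bool
}

func newFailureTracker(threshold int) *failureTracker {
	return &failureTracker{
		threshold: threshold,
		failures:  map[blacklistKey]int{},
		blacklist: map[blacklistKey]bool{},
	}
}

// recordFailure registers one failed task and reports whether the tuple is
// blacklisted afterwards.
func (ft *failureTracker) recordFailure(key blacklistKey) bool {
	ft.failures[key]++
	if ft.failures[key] >= ft.threshold {
		ft.blacklist[key] = true
	}
	return ft.blacklist[key]
}
```

With the default threshold of 3, the third failure of the same worker on the same job and task type blacklists that tuple; failures on other jobs or task types are counted separately.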
Sybren A. Stüvel 014788af1d Ignore files in default task logs directory 2018-12-17 14:26:15 +01:00
Sybren A. Stüvel 1cf3c1971f Fixed bug in sleep scheduler
It wouldn't handle implicit end times properly when computing the 'next
check' timestamp. They are now correctly interpreted as 'midnight the next
day'.
2018-12-17 14:26:08 +01:00
Sybren A. Stüvel 5fb98d6591 Dashboard: shortening more (time display + task ID) 2018-12-17 14:05:19 +01:00
Sybren A. Stüvel e9d6cb017f Sleep scheduler: only log at debug level when there is nothing to do 2018-12-14 16:39:18 +01:00
Sybren A. Stüvel 0c217968fa Formatting 2018-12-14 16:24:33 +01:00
Sybren A. Stüvel b63917b47c Sorted services in main.go 2018-12-14 16:13:52 +01:00