Sync branch magefile with main #104308

Merged
Sybren A. Stüvel merged 85 commits from abelli/flamenco:magefile into magefile 2024-05-13 16:26:32 +02:00
Contributor

This PR will merge the commits from `main` into `magefile` and sync it, so that we can continue its development.
Mateus Abelli added 85 commits 2024-05-12 18:55:34 +02:00
Add an OpenAPI operation to fetch the overall farm status from the Manager.
Add a new API operation to get the overall farm status. This is based on
the jobs and workers, and their status.

The statuses are:

- `active`: Actively working on jobs.
- `idle`: Farm could be active, but has no work to do.
- `waiting`: Work has been queued, but all workers are asleep.
- `asleep`: Farm is idle, and all workers are asleep.
- `inoperative`: Cannot work: no workers, or all are offline/error.
- `starting`: Farm is starting up.
- `unknown`: Unexpected configuration of worker and job statuses.
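As a rough illustration of how the statuses above could follow from worker and job counts, here is a minimal sketch. The `farmCounts` struct, its fields, and the exact decision order are assumptions made for this example and are not taken from the Manager's code; the `starting` status is left out because it depends on startup timing rather than on counts.

```go
package main

import "fmt"

// FarmStatus mirrors the status names listed above.
type FarmStatus string

const (
	StatusActive      FarmStatus = "active"
	StatusIdle        FarmStatus = "idle"
	StatusWaiting     FarmStatus = "waiting"
	StatusAsleep      FarmStatus = "asleep"
	StatusInoperative FarmStatus = "inoperative"
	StatusUnknown     FarmStatus = "unknown"
)

// farmCounts is a hypothetical summary of the jobs and workers.
type farmCounts struct {
	WorkersAwake   int // workers that can pick up tasks right now
	WorkersAsleep  int // workers that are asleep
	WorkersOffline int // workers that are offline or in error state
	JobsActive     int // jobs that are being worked on
	JobsQueued     int // jobs that are waiting for a worker
}

// farmStatus derives one overall status from the counts.
func farmStatus(c farmCounts) FarmStatus {
	usable := c.WorkersAwake + c.WorkersAsleep
	hasWork := c.JobsActive+c.JobsQueued > 0
	switch {
	case usable == 0:
		return StatusInoperative // no workers, or all are offline/error
	case c.WorkersAwake > 0 && hasWork:
		return StatusActive
	case c.WorkersAwake > 0:
		return StatusIdle // could be active, but has no work to do
	case hasWork:
		return StatusWaiting // work is queued, but all workers are asleep
	case c.WorkersAsleep > 0:
		return StatusAsleep // no work, and all workers are asleep
	default:
		// Safety net for combinations this sketch does not anticipate.
		return StatusUnknown
	}
}

func main() {
	fmt.Println(farmStatus(farmCounts{WorkersAsleep: 2, JobsQueued: 1})) // prints "waiting"
}
```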
Send an event to the event bus whenever the farm status changes. The event
contains a farm status report (like `{status: "active"}`), and is sent to
the `/status` topic.

Note that at this moment the status is only polled every X seconds, and
thus may lag behind other events.
SocketIO has 'rooms' and 'event types'. The 'event type' is set via
reflection of the OpenAPI type of the event payload. This has to be set
up in a mapping, though, and if that mapping is incomplete, an error will
now be logged.
This introduces the concept of 'event listener', which is now used by
the farm status service to respond to events on the event bus.

This makes it possible to lengthen the regular poll period from 5 to 30
seconds. Polling is now only necessary as a backup, in case events are
missed or things otherwise change without the event bus logic noticing.
Show the farm status in the webapp header bar, and respond to farm status
events to update it when necessary.
There are still issues with foreign keys getting disabled, so enable them
in the periodic database consistency check.

A more permanent solution is likely to drop GORM and switch to something
else that gives us an on-connect-callback, which can then be used to
turn on foreign key constraints for every connection made.
The database is polled every 30 seconds to determine the farm status; at
startup the first poll is done after 1 second to get a faster status.

Note that when jobs and workers change their status, the farm status is
always updated.
Better to not show the farm status if the connection is lost.
The exponential backoff was getting a bit too long, making the webapp
sometimes very slow to reconnect. This is now limited to max 3 seconds.
Split the header into two or three parts, depending on the number of
columns shown. The farm status indicator will be above the middle column
(in 3 col mode) or at the right edge of the left column (in 2 col mode).

Also, I reverted the hiding of the farm status when SocketIO has
disconnected, as that disconnect happens when navigating between tabs.
That made the interface too 'blinky', so now it just shows the last-known
farm status.
GORM has certain downsides:

- Code-first approach, where queries have to be translated to the Go code
  required to execute them.
- GORM comes with its own SQLite implementation, which doesn't provide an
  on-connect callback. This means that new connections cannot correctly
  enable foreign key constraints, causing database consistency issues.

[SQLC](https://sqlc.dev/) solves these issues for us.

This commit doesn't fully replace GORM with SQLC, but introduces it for
a few queries. Once all queries have been converted, GORM can be removed
completely.
No functional changes.
No functional changes.
No functional changes.
This makes it easier to later also create `query_workers.sql`,
`query_meta.sql` etc. so that the sqlc-generated code can follow the
same subdivision as the persistence service code itself.

No functional changes.
No functional changes.
Fix the database migration that adds `NOT NULL` clauses. It used
`INSERT INTO temp_x SELECT * from x;`, and the `*` returns the fields in
the order they are defined on the table. Since this might be different from
the order that the `INSERT INTO temp_x` expects, strange problems can
happen where columns get swapped (or constraints can fail on columns that
they should not fail for, because they got fed data from a different
column).
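One way to avoid the column-order problem is to name the columns explicitly on both sides of the copy. The helper below is a hypothetical sketch that builds such a statement; `copyTableSQL` and the column names in the example are made up for illustration, and the function does no identifier quoting, so it should only be fed trusted names.

```go
package main

import (
	"fmt"
	"strings"
)

// copyTableSQL builds an INSERT ... SELECT statement that lists the columns
// explicitly, so the result does not depend on the order in which the
// columns happen to be defined on either table.
func copyTableSQL(dst, src string, columns []string) string {
	cols := strings.Join(columns, ", ")
	return fmt.Sprintf("INSERT INTO %s (%s) SELECT %s FROM %s;", dst, cols, cols, src)
}

func main() {
	// Hypothetical column names, for illustration only.
	fmt.Println(copyTableSQL("temp_x", "x", []string{"id", "name"}))
	// prints "INSERT INTO temp_x (id, name) SELECT id, name FROM x;"
}
```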
Instead of storing the cached manager info in the Blender preferences,
store the info in a JSON file. The file is located in the user prefs
folder (`~/.config/blender/{version}/config` on Linux).

This also reduces the number of 'refresh' operators to a single one, which
then fetches all necessary info from the Manager.

This fixes an issue (reported via chat) where worker tags were sometimes
not retained across file saves.
No functional changes.
With a fuller database, 2 seconds is apparently not always long enough,
so increase the timeout to 10 seconds.
Add a 1ms delay in the test loop, so that other goroutines can be scheduled
as well. This should fix #104288.
Increase the 'database open' timeout from 5 seconds to 1 minute. This
timeout also covers database migrations, and the recently added one that
adds a bunch of `NOT NULL` clauses could time out with the old 5 sec
limit.

The reason this takes long is that SQLite doesn't directly support
adding `NOT NULL` clauses to columns. The only way to do this is to
create a new table with the desired schema, copy all data over, then
drop the old table. And with a big enough database, this takes time.
Just to make sure the DB is properly cleaned up after a big migration
happened.
Task log updates are big and frequent, and should not be sent via MQTT.
At least not until we have a practical reason to do so.
Remove commented-out sections in the configuration defaults. They're a
leftover from Flamenco v2.
Set the default MQTT topic prefix to 'flamenco'. It can still be overridden
by the config in the YAML file, but it's nice to have a sensible default
when people don't configure this.
Avoid these warnings on the console:

```
WARN (bpy.rna): source/blender/python/intern/bpy_rna.cc:1339
  pyrna_enum_to_py: current value '0' matches no enum in 'Scene', 'Scene',
  'flamenco_job_type'
```

The solution was two-fold:
- Use a non-empty string as the identifier for the 'Select a Job Type'
  choice.
- Give the property a default value.
Change the Tabulator layout mode from `fitData` to `fitDataFill`. The new
value adjusts the layout when the data has changed.
You can now set a page title and a separate title for the table of
contents with:

```
---
title: "Manager Configuration: MQTT"
titleTOC: MQTT
---
```
Add recent add-on improvements.
Back in the days when I wrote the code, I didn't know about the
`require` package yet. Using `require.NoError()` makes the test code
more straightforward.

No functional changes, except that when tests fail, they now fail
without panicking.
Back in the days when I wrote the code, I didn't know about the
`require` package yet. Using `require.NoError()` makes the test code
more straightforward.

No functional changes, except that when tests fail, they now fail
without panicking.
Pass `-failfast` to the `go test` command, so that it immediately stops
on test failure. This prevents the need to scroll back to see the actual
error, at the expense of only seeing one failure at a time.
There's still some confusion that this is a thing to solve, whereas it can
usually safely be ignored. Reduced the log level from Warn to Info to make
the message look more innocent.
Fix a bunch of security issues by upgrading to Go 1.22.2 and bumping
a few packages to their secure versions.

- [Incorrect forwarding of sensitive headers and cookies on HTTP redirect in net/http](https://pkg.go.dev/vuln/GO-2024-2600)
- [Memory exhaustion in multipart form parsing in net/textproto and net/http](https://pkg.go.dev/vuln/GO-2024-2599)
- [Verify panics on certificates with an unknown public key algorithm in crypto/x509](https://pkg.go.dev/vuln/GO-2024-2598)
- [HTTP/2 CONTINUATION flood in net/http](https://pkg.go.dev/vuln/GO-2024-2687)
This description will be shown as a tooltip in the job submission UI.
This description will be shown as a tooltip in the job submission UI.
The documentation itself has disappeared from the website, and it already
was obsolete for a long time anyway.
3rd party job compiler scripts should have their own tracker and handle
their own bug reports.
Explicitly use the `--mode` flag for the webapp development server
(`vite`) to make the web frontend choose the appropriate HTTP and
WebSocket port to communicate with the backend. This also makes sure
that when accessing the frontend via `https://`, the websocket
connection uses `wss://`.

As a side-effect, this also makes port `:8081` usable in production
environments; previously the frontend would assume it was served by the
development server and try to access the backend on port `:8080`.

Reviewed-on: #104296
Reviewed-by: Sybren A. Stüvel <sybren@blender.org>
Move some of the Worker Tags test code into a function of its own, to have
a clearer separation between 'the test' and 'what needs to happen to do
this part of the test'.

Also it'll make an upcoming change easier to implement.

No functional changes.
Before deleting a Worker Tag, check that foreign key constraints are
active for the current database connection.

Sometimes GORM decides to create a new database connection by itself,
without telling us, and then foreign key constraints are not active on
it. This commit is a workaround to avoid database corruption.
As a safety measure, refuse to delete Workers from the Manager's database
when foreign key constraints are disabled.

In the long term, the underlying problem should be solved. This is a stop-
gap measure to ensure database consistency.
Add a Worker configuration option to configure the Linux out-of-memory
behaviour. Add `oom_score_adjust=500` to `flamenco-worker.yaml` to increase
the chance that Blender gets killed when the machine runs out of memory,
instead of Flamenco Worker itself.
Reduce the log level from 'info' to 'debug' on some internal components
of Flamenco Worker. This makes the console output slightly less noisy,
and it's unlikely that these particular messages are commonly needed.
This reverts commit 7f14e6705d. v3.5 still
needs today's date as release date in the changelog.
Remove some Python 3.10 features to make the add-on compatible with py39.
This is the Python version that's bundled with Blender 2.93 LTS, for which
I got a request to see if it could be supported.

The Blender version still isn't officially supported, but this should make
things at least not immediately fail.
Updated the troubleshooting section of the FAQ to include guidance on checking the firewall and potential third-party antivirus issues when the Worker cannot connect to the Manager. This enhances the user experience by addressing common connectivity issues more comprehensively.
Reword so that the section starts with the suggestion that each problem
has a solution. And make it an enumerated list to clarify the structure
of the answer.
This gives job type authors more control over how settings are presented
in Blender's job submission GUI. If a job setting does not define a
label, its `key` is used to generate one (like Flamenco 3.5 and older).

Note that this isn't used in the web interface yet.
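The fallback from label to key can be sketched like this. The add-on itself is written in Python, and how it turns a key into a label is not described here; `settingLabel`, its title-casing rule, and the example key are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// settingLabel returns the explicit label when there is one, and otherwise
// generates one from the key by replacing underscores with spaces and
// capitalising each word. It assumes ASCII keys.
func settingLabel(label, key string) string {
	if label != "" {
		return label
	}
	words := strings.Split(key, "_")
	for i, word := range words {
		if word == "" {
			continue
		}
		words[i] = strings.ToUpper(word[:1]) + word[1:]
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(settingLabel("", "chunk_size"))            // prints "Chunk Size"
	fmt.Println(settingLabel("Frames per task", "chunk_size")) // prints "Frames per task"
}
```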
Add a function `shellSplit(string)` to the global namespace of job
compiler scripts. It splits a string into an array of strings using
shell/CLI semantics.

For example: `shellSplit("--python-expr 'print(1 + 1)'")` will return
`["--python-expr", "print(1 + 1)"]`.
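To give an idea of what splitting with shell/CLI semantics involves, here is a deliberately small sketch. It is not the implementation behind the job compiler function: it handles whitespace and single or double quotes only, ignores backslash escapes, and returns an error for an unterminated quote, which are all simplifications chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// shellSplit splits s on whitespace, keeping quoted sections together.
func shellSplit(s string) ([]string, error) {
	var (
		args    []string
		current strings.Builder
		inToken bool
		quote   rune // 0 outside quotes, otherwise the opening quote character
	)
	for _, r := range s {
		switch {
		case quote != 0:
			if r == quote {
				quote = 0 // closing quote, stay in the same token
			} else {
				current.WriteRune(r)
			}
		case r == '\'' || r == '"':
			quote = r
			inToken = true // an empty quoted string is still a token
		case r == ' ' || r == '\t' || r == '\n':
			if inToken {
				args = append(args, current.String())
				current.Reset()
				inToken = false
			}
		default:
			current.WriteRune(r)
			inToken = true
		}
	}
	if quote != 0 {
		return nil, errors.New("unterminated quote")
	}
	if inToken {
		args = append(args, current.String())
	}
	return args, nil
}

func main() {
	args, err := shellSplit("--python-expr 'print(1 + 1)'")
	fmt.Printf("%q %v\n", args, err)
}
```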
Add a few more unit tests for the persistence layer. The goal is to have
100% coverage of the happy flow, to aid in conversion from GORM to sqlc.

No functional changes.
Sybren A. Stüvel approved these changes 2024-05-13 16:25:56 +02:00
Sybren A. Stüvel left a comment
Owner

Thanks!
Sybren A. Stüvel merged commit b69640912d into magefile 2024-05-13 16:26:32 +02:00
Mateus Abelli deleted branch magefile 2024-05-14 00:32:10 +02:00