Support pausing jobs #104313
No reviewers
Labels
No Label
Good First Issue
Priority
High
Priority
Low
Priority
Normal
Status
Archived
Status
Confirmed
Status
Needs Info from Developers
Status
Needs Information from User
Status
Needs Triage
Status
Resolved
Type
Bug
Type
Design
Type
Job Type
Type
Known Issue
Type
Patch
Type
Report
Type
To Do
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: studio/flamenco#104313
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "David-Zhang-10/flamenco:paused-job-status"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Overview
This pull request implements Issue #99399 Support pausing jobs.
Deliverables
pause-requested
before the actualpause
stateWIP: Support pausing jobs and allow jobs to be submitted in paused statusto WIP: Support pausing jobs247c60b27d
toec540e47a7
Pause Job
button ba454cd4a0When pausing a job, I see this warning in the Manager's log:
WIP: Support pausing jobsto Support pausing jobspause-requested
status c32a591e55There's a unit test failing:
Ideally each commit should bring the source from working state to working state. Of course that's not always necessary, but it's good practice to ensure unit tests are working before the PR is reviewed.
I've left some inline notes.
@ -5,7 +5,6 @@ package task_state_machine
import (
"context"
"fmt"
Keep the newline between built-in packages and 3rd party packages.
@ -180,6 +179,16 @@ func (sm *StateMachine) updateJobOnTaskStatusCanceled(ctx context.Context, logge
return sm.JobStatusChange(ctx, job, api.JobStatusCanceled, "canceled task was last runnable task of job, canceling job")
}
numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)
This code block feels a bit out of place. If I understand correctly, this is the flow that this code would handle:
{queued, soft-failed, paused}
).pause-requested
.To me this doesn't look like a status in which the pausing was complete.
This logic is also repeated quite a bit below. I think it's better to make a new function responsible for answering the question "is the job pausing, and is that process complete now?", and call that function from the places that need this.
Finally, some reordering can be done to optimise the code. Currently every check will count the tasks in the job, even when
job.Status != api.JobStatusPauseRequested
. It's good practice to do the cheapest check first, and the most expensive check last.The comment makes sense to me! The only thing I am confused about is what you meant by "the pausing was complete".
"Pausing is complete" for me means that the work that should be done when the job goes to
pause-requested
is done, and it can be moved to statepaused
.So for the pausing to be complete, we need all tasks to be in
paused
state, including tasks that are runnable but not inactive
state? In other words, instead of checking the number of active tasks here, we should check if the job has any tasks other thanpaused
?@ -183,0 +185,4 @@
}
if numActive == 0 && job.Status == api.JobStatusPauseRequested {
// there is no active task, and the job is in pause-requested status, so we can pause the job
logger.Info().Msg("all tasks of job are completed, job is paused")
This log entry is giving the wrong information. If
updateJobOnTaskStatusCanceled()
is called, a task was cancelled and thus the log message cannot be correct. Same for the job status change reason "all tasks completed".The same is true for other copies of this code. Don't make your code lie; in a task-failure-handling part of the code, don't make it log "all tasks completed".
@ -441,0 +479,4 @@
logger.Info().Msg("pausing tasks of job")
// Any task that might run in the future should get paused.
// Active jobs should remain active until finished
jobs → tasks
And end comments with a period.
I think it's fine to keep cancelled tasks as cancelled. That's not a runnable state anyway, so that'll make the code consistent with the comment.
@ -22,0 +24,4 @@
const activeJob = jobs.activeJob;
// Check if the job status is 'pause-requested'
if (activeJob && activeJob.status === 'pause-requested') {
There's generally two reasons to test something with an
if
statement:In this case the check for
activeJob
is there to prevent errors when accessingactiveJob.status
in case there is no active job. This mixes the two above reasons, though.Another way to look at this, is asking yourself: does it make sense to even answer the query ("can I cancel the active task?") if there is no active job? If too little information is given, in this particular case we know the task but not the job, is that an error? Or is that a sensible state to be in?
When the front-end is displaying tasks, it should be aware of which job those tasks belong to. If that is not the case, I feel that this is a bug. Such bugs should not be hidden, not be defensively worked around. They should be visible, so that they are easy to recognise as bugs. That way people report them, and we can get them fixed.
All that to say: it's probably better to write something like:
Also the triple
===
is unnecessary. The job status is never going to be something other than a string, and if it's some other type, if it compares equal to"pause-requested"
, it's good enough a reason to disable cancellation of the task.@ -22,0 +27,4 @@
if (activeJob && activeJob.status === 'pause-requested') {
return false;
}
// Allow cancellation for specified task statuses
Make comments proper sentences, so end with a period.
@ -207,0 +223,4 @@
if job.Status != api.JobStatusPauseRequested {
return sm.jobStatusIfAThenB(ctx, logger, job, api.JobStatusQueued, api.JobStatusActive,
"task failed, but not enough to fail the job")
} else {
I don't think this is the right structure. The
if {}
block now contains a copy of the code below it, which is error-prone. There should be one piece of code responsible for one thing. By duplicating this code, there is the chance that one of the copies goes out of sync with the other, introducing hard to reason about bugs.Also, don't use
else
after areturn
, it doesn't mean anything.Better to just use:
It'll also make it easier to move the code into a function of its own. It's the same "single responsibility principle" as before: there should be one piece of code that checks whether pausing a job is complete.
Looking much better now :)
Just a few small nitpicks, I've wrote some inline notes.
Also I think there is one unit test missing (of course more could be written, but I think this one is important to cover):
active
, one tasks isactive
, one iscomplete
, the rest are queued.active
task goes tosoft-failed
.And, if you feel like it, a similar test, except that the
active
task goes tofailed
.@ -166,6 +166,21 @@ func (sm *StateMachine) jobStatusIfAThenB(
return sm.JobStatusChange(ctx, job, thenStatus, reason)
}
func (sm *StateMachine) shouldJobBePaused(ctx context.Context, logger zerolog.Logger, job *persistence.Job) (bool, error) {
I think the function name can be improved. The name
shouldJobBePaused
describes what should be done by the caller. I always try to pick a function name that matches what the function does. So for this one, I feel thatisJobPausingComplete
would be better.Also a documentation comment would be good. In Go, these always start with
// {function name} ...
. It should describe that this returns true when the job status ispause-requested
and there are no more active tasks.@ -167,2 +167,4 @@
}
func (sm *StateMachine) shouldJobBePaused(ctx context.Context, logger zerolog.Logger, job *persistence.Job) (bool, error) {
if job.Status == api.JobStatusPauseRequested {
Right now, the structure of the function is such that the entire function body is contained in a conditional. This is not a good idea, as it unnecessarily increases the complexity. Flipping the condition of the first
if
will un-indent the remaining code:@ -169,0 +174,4 @@
}
if numActive == 0 {
// There is no active task, and the job is in pause-requested status, so we can pause the job.
logger.Info().Msg("No more active tasks, job is paused")
I think it would be be better to leave the logging to the caller of this function. Also what is logged here is not entirely true -- the job is still in state
pause-requested
, not in statepaused
, so logging that it is paused is incorrect.By removing the logging, you can reduce the final
return
to justreturn numActive == 0, nil
.@ -441,0 +511,4 @@
// If pausing was requested, it has now happened, so the job can transition.
toBePaused, err := sm.shouldJobBePaused(ctx, logger, job)
if err != nil {
return "", fmt.Errorf("error when accessing number of active tasks")
This hides the underlying error, making things hard to debug. When wrapping an error, always use the
%w
format specifier to include the underlying error. This will make sure that functions likeerrors.Is()
anderrors.As()
can do their work. You can find more about this at https://pkg.go.dev/errorsAlso, this function shouldn't know what's going on inside of
shouldJobBePaused()
, so if you want to have "when accessing number of active tasks" in the error message, that should be done insideshouldJobBePaused()
, not here.@ -441,0 +514,4 @@
return "", fmt.Errorf("error when accessing number of active tasks")
}
if toBePaused {
logger.Info().Msg("all tasks of job paused, job can go to 'paused' status")
Perfect example of why
shouldJobBePaused()
should not be logging anything. It's better to leave that to places like this, where you know the context in which that function is called.Looks good to me!
In our meeting this afternoon, let's go over the steps to get this landed on the
main
branch together.pull
'ed the latestmain
main
.git rebase --interactive
to squash all commits together.git gui
to split out that one squashed commits into a few separate ones (and get rid of thego.sum
changes, that file shouldn't be touched in this PR):flamenco-openapi.yaml
make test
againgitk
to be sure of the final changes one last time.git push