Support pausing jobs #104313
@ -169,7 +169,7 @@ func (sm *StateMachine) jobStatusIfAThenB(
|
|||||||
func (sm *StateMachine) updateJobOnTaskStatusCanceled(ctx context.Context, logger zerolog.Logger, job *persistence.Job) error {
|
func (sm *StateMachine) updateJobOnTaskStatusCanceled(ctx context.Context, logger zerolog.Logger, job *persistence.Job) error {
|
||||||
David-Zhang-10 marked this conversation as resolved
Outdated
|
|||||||
// If no more tasks can run, cancel the job.
|
// If no more tasks can run, cancel the job.
|
||||||
David-Zhang-10 marked this conversation as resolved
Outdated
Sybren A. Stüvel
commented
Right now, the structure of the function is such that the entire function body is contained in a conditional. This is not a good idea, as it unnecessarily increases the complexity. Flipping the condition of the first
Right now, the structure of the function is such that the entire function body is contained in a conditional. This is not a good idea, as it unnecessarily increases the complexity. Flipping the condition of the first `if` will un-indent the remaining code:
```go
if job.Status != api.JobStatusPauseRequested {
return false, nil
}
// rest of the code
```
|
|||||||
numRunnable, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job,
|
numRunnable, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job,
|
||||||
api.TaskStatusActive, api.TaskStatusQueued, api.TaskStatusSoftFailed)
|
api.TaskStatusActive, api.TaskStatusQueued, api.TaskStatusSoftFailed, api.TaskStatusPaused)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
return err
|
return err
|
||||||
}
|
}
|
||||||
@ -179,6 +179,16 @@ func (sm *StateMachine) updateJobOnTaskStatusCanceled(ctx context.Context, logge
|
|||||||
return sm.JobStatusChange(ctx, job, api.JobStatusCanceled, "canceled task was last runnable task of job, canceling job")
|
return sm.JobStatusChange(ctx, job, api.JobStatusCanceled, "canceled task was last runnable task of job, canceling job")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)
|
||||||
David-Zhang-10 marked this conversation as resolved
Outdated
Sybren A. Stüvel
commented
This code block feels a bit out of place. If I understand correctly, this is the flow that this code would handle:
To me this doesn't look like a status in which the pausing was complete. This logic is also repeated quite a bit below. I think it's better to make a new function responsible for answering the question "is the job pausing, and is that process complete now?", and call that function from the places that need this. Finally, some reordering can be done to optimise the code. Currently every check will count the tasks in the job, even when This code block feels a bit out of place. If I understand correctly, this is the flow that this code would handle:
1. User cancels a task.
2. There are other tasks that are still runnable but not active (i.e. status in `{queued, soft-failed, paused}`).
3. Job is in status `pause-requested`.
To me this doesn't look like a status in which the pausing was complete.
This logic is also repeated quite a bit below. I think it's better to make a new function responsible for answering the question "_is the job pausing, and is that process complete now?_", and call that function from the places that need this.
Finally, some reordering can be done to optimise the code. Currently every check will count the tasks in the job, even when `job.Status != api.JobStatusPauseRequested`. It's good practice to do the cheapest check first, and the most expensive check last.
David Zhang
commented
The comment makes sense to me! The only thing I am confused about is what you meant by "the pausing was complete". The comment makes sense to me! The only thing I am confused about is what you meant by "the pausing was complete".
Sybren A. Stüvel
commented
"Pausing is complete" for me means that the work that should be done when the job goes to > The only thing I am confused about is what you meant by "the pausing was complete".
"Pausing is complete" for me means that the work that should be done when the job goes to `pause-requested` is done, and it can be moved to state `paused`.
David Zhang
commented
So for the pausing to be complete, we need all tasks to be in So for the pausing to be complete, we need all tasks to be in `paused` state, including tasks that are runnable but not in `active` state? In other words, instead of checking the number of active tasks here, we should check if the job has any tasks other than `paused`?
|
|||||||
|
if err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
if numActive == 0 && job.Status == api.JobStatusPauseRequested {
|
||||||
|
// there is no active task, and the job is in pause-requested status, so we can pause the job
|
||||||
|
logger.Info().Msg("all tasks of job are completed, job is paused")
|
||||||
David-Zhang-10 marked this conversation as resolved
Outdated
Sybren A. Stüvel
commented
This log entry is giving the wrong information. If The same is true for other copies of this code. Don't make your code lie; in a task-failure-handling part of the code, don't make it log "all tasks completed". This log entry is giving the wrong information. If `updateJobOnTaskStatusCanceled()` is called, a task was cancelled and thus the log message cannot be correct. Same for the job status change reason "all tasks completed".
The same is true for other copies of this code. Don't make your code lie; in a task-failure-handling part of the code, don't make it log "all tasks completed".
|
|||||||
|
return sm.JobStatusChange(ctx, job, api.JobStatusPaused, "all tasks completed")
|
||||||
|
}
|
||||||
|
|
||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -203,6 +213,16 @@ func (sm *StateMachine) updateJobOnTaskStatusFailed(ctx context.Context, logger
|
|||||||
}
|
}
|
||||||
// If the job didn't fail, this failure indicates that at least the job is active.
|
// If the job didn't fail, this failure indicates that at least the job is active.
|
||||||
failLogger.Info().Msg("task failed, but not enough to fail the job")
|
failLogger.Info().Msg("task failed, but not enough to fail the job")
|
||||||
|
|
||||||
|
numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)
|
||||||
|
if err != nil {
|
||||||
|
return err
|
||||||
|
}
|
||||||
|
if numActive == 0 && job.Status == api.JobStatusPauseRequested {
|
||||||
|
// there is no active task, and the job is in pause-requested status, so we can pause the job
|
||||||
|
logger.Info().Msg("all tasks of job are completed, job is paused")
|
||||||
|
return sm.JobStatusChange(ctx, job, api.JobStatusPaused, "all tasks completed")
|
||||||
|
}
|
||||||
return sm.jobStatusIfAThenB(ctx, logger, job, api.JobStatusQueued, api.JobStatusActive,
|
return sm.jobStatusIfAThenB(ctx, logger, job, api.JobStatusQueued, api.JobStatusActive,
|
||||||
David-Zhang-10 marked this conversation as resolved
Outdated
Sybren A. Stüvel
commented
I don't think this is the right structure. The Also, don't use Better to just use:
It'll also make it easier to move the code into a function of its own. It's the same "single responsibility principle" as before: there should be one piece of code that checks whether pausing a job is complete. I don't think this is the right structure. The `if {}` block now contains a copy of the code below it, which is error-prone. There should be one piece of code responsible for one thing. By duplicating this code, there is the chance that one of the copies goes out of sync with the other, introducing hard to reason about bugs.
Also, don't use `else` after a `return`, it doesn't mean anything.
Better to just use:
```go
if job.Status == api.JobStatusPauseRequested {
numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)
if err != nil {
return err
}
if numActive == 0 {
// There is no active task, and the job is in pause-requested status, so we can pause the job.
failLogger.Info().Msg("No more active tasks, job is paused")
return sm.JobStatusChange(ctx, job, api.JobStatusPaused, "all tasks completed")
}
}
```
It'll also make it easier to move the code into a function of its own. It's the same "single responsibility principle" as before: there should be one piece of code that checks whether pausing a job is complete.
|
|||||||
"task failed, but not enough to fail the job")
|
"task failed, but not enough to fail the job")
|
||||||
}
|
}
|
||||||
@ -473,6 +493,16 @@ func (sm *StateMachine) pauseTasks(
|
|||||||
return "", fmt.Errorf("pausing tasks of job %s: %w", job.UUID, err)
|
return "", fmt.Errorf("pausing tasks of job %s: %w", job.UUID, err)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// If pausing was requested, it has now happened, so the job can transition.
|
||||||
|
numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)
|
||||||
|
if err != nil {
|
||||||
|
return "", fmt.Errorf("error when accessing number of active tasks")
|
||||||
|
}
|
||||||
|
if job.Status == api.JobStatusPauseRequested && numActive == 0 {
|
||||||
|
logger.Info().Msg("all tasks of job paused, job can go to 'paused' status")
|
||||||
|
return api.JobStatusPaused, nil
|
||||||
|
}
|
||||||
|
|
||||||
return api.JobStatusPauseRequested, nil
|
return api.JobStatusPauseRequested, nil
|
||||||
}
|
}
|
||||||
|
|
||||||
|
I think the function name can be improved. The name
shouldJobBePaused
describes what should be done by the caller. I always try to pick a function name that matches what the function does. So for this one, I feel thatisJobPausingComplete
would be better.Also a documentation comment would be good. In Go, these always start with
// {function name} ...
. It should describe that this returns true when the job status ispause-requested
and there are no more active tasks.