Support pausing jobs #104313

David Zhang · 2024-05-31T09:47:13+02:00

David Zhang commented

2024-05-31 09:47:13 +02:00

Overview
This pull request implements Issue #99399 Support pausing jobs.

Deliverables

"Pause Job" button in the Manager web interface in the left panel next to the "Cancel Job" button
Backend implementation of job status transition logic, mostly in Manager
Introduce an intermediate state pause-requested before the actual pause state

**Overview** This pull request implements [Issue #99399 Support pausing jobs](https://projects.blender.org/studio/flamenco/issues/99399). **Deliverables** - "Pause Job" button in the Manager web interface in the left panel next to the "Cancel Job" button - Backend implementation of job status transition logic, mostly in Manager - Introduce an intermediate state `pause-requested` before the actual `pause` state

David Zhang added 1 commit 2024-05-31 09:47:17 +02:00

Manager: draft task and job status transition; regenerated openapi with pause-requested status; start frontend work ec540e47a7

David Zhang changed title from ~~WIP: Support pausing jobs and allow jobs to be submitted in paused status~~ to WIP: Support pausing jobs

2024-05-31 10:07:14 +02:00

David Zhang force-pushed paused-job-status from 247c60b27d to ec540e47a7

2024-06-01 10:59:35 +02:00

Compare

David Zhang added 1 commit 2024-06-01 11:10:15 +02:00

Manager: basic test case for paused status with all queued tasks 562f77bd74

David Zhang added 1 commit 2024-06-01 11:13:20 +02:00

Manager: leave out updateJobOnTaskStatusPaused 716f86c0f4

David Zhang added 1 commit 2024-06-01 11:18:31 +02:00

Manager: taskStatusesToPause should include soft-failed e30832aab4

David Zhang added 1 commit 2024-06-01 11:27:14 +02:00

Manager: improve error handling of pauseTasks() a84097e59f

David Zhang added 1 commit 2024-06-04 15:56:01 +02:00

Webapp: interactive Pause Job button ba454cd4a0

David Zhang added 1 commit 2024-06-06 19:05:23 +02:00

Manager: improve state transition logic 050c779995

David Zhang added 1 commit 2024-06-07 08:35:21 +02:00

Manager: cancelling tasks should not cause a pause-requested job to be cancelled 71c7715eab

David Zhang added 2 commits 2024-06-07 10:28:32 +02:00

Merge remote-tracking branch 'official/main' into paused-job-status 526e0f98fb

Manager: passing TestTaskStatusChangeActiveToCompleted 0e523a9b31

Sybren A. Stüvel reviewed 2024-06-07 16:35:58 +02:00

Sybren A. Stüvel left a comment

When pausing a job, I see this warning in the Manager's log:

2024-06-07T16:35:14+02:00 WRN unknown job status change, ignoring job=fcb01313-96aa-463d-ba08-ea0669145fa1 jobStatusNew=paused jobStatusOld=pause-requested reason="requested from web interface"

When pausing a job, I see this warning in the Manager's log: ``` 2024-06-07T16:35:14+02:00 WRN unknown job status change, ignoring job=fcb01313-96aa-463d-ba08-ea0669145fa1 jobStatusNew=paused jobStatusOld=pause-requested reason="requested from web interface" ```

David Zhang added 1 commit 2024-06-12 18:18:46 +02:00

Manager: fix manager warning 1c35617cb1

David Zhang changed title from ~~WIP: Support pausing jobs~~ to Support pausing jobs

2024-06-14 13:09:21 +02:00

David Zhang added 2 commits 2024-06-14 13:15:01 +02:00

Merge remote-tracking branch 'official/main' into paused-job-status ac5627e57a

Manager: pass TestTaskStatusChangeQueuedToFailed and TestTaskStatusChangeCancelSingleTask c3af84f6a6

David Zhang requested review from Sybren A. Stüvel 2024-06-14 13:15:24 +02:00

David Zhang added 1 commit 2024-06-15 17:52:34 +02:00

Manager: disallow task cancellation when the job has pause-requested status c32a591e55

Sybren A. Stüvel commented

2024-06-17 15:17:33 +02:00

There's a unit test failing:

--- FAIL: TestTaskStatusChangeCancelSingleTaskWithOtherFailed (0.00s)
    task_state_machine.go:171: Unexpected call to *mocks.MockPersistenceService.CountTasksOfJobInStatus([context.Background 0xc000374ee0 active queued soft-failed paused]) at /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine.go:171 because: 
        expected call at /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:224 has the wrong number of arguments. Got: 6, want: 5
    controller.go:269: missing call(s) to *mocks.MockChangeBroadcaster.BroadcastJobUpdate(is equal to {<nil> test-job-f3f5-4cef-9cd7-e67eb28eaf3e 0xc000374f28 0xc00036bac0 0 false canceled  0001-01-01 00:00:00 +0000 UTC <nil>} (api.EventJobUpdate)) /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:460
    controller.go:269: missing call(s) to *mocks.MockPersistenceService.CountTasksOfJobInStatus(is equal to context.Background (context.backgroundCtx), is equal to &{{47 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC} test-job-f3f5-4cef-9cd7-e67eb28eaf3e   0 cancel-requested  map[] map[] {0001-01-01 00:00:00 +0000 UTC false} {} <nil> <nil>} (*persistence.Job), is equal to active (api.TaskStatus), is equal to queued (api.TaskStatus), is equal to soft-failed (api.TaskStatus)) /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:224
    controller.go:269: missing call(s) to *mocks.MockPersistenceService.SaveJobStatus(is anything, is equal to &{{47 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC} test-job-f3f5-4cef-9cd7-e67eb28eaf3e   0 cancel-requested  map[] map[] {0001-01-01 00:00:00 +0000 UTC false} {} <nil> <nil>} (*persistence.Job)) /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:441
    controller.go:269: aborting test due to missing call(s)
FAIL

Ideally each commit should bring the source from working state to working state. Of course that's not always necessary, but it's good practice to ensure unit tests are working before the PR is reviewed.

There's a unit test failing: ``` --- FAIL: TestTaskStatusChangeCancelSingleTaskWithOtherFailed (0.00s) task_state_machine.go:171: Unexpected call to *mocks.MockPersistenceService.CountTasksOfJobInStatus([context.Background 0xc000374ee0 active queued soft-failed paused]) at /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine.go:171 because: expected call at /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:224 has the wrong number of arguments. Got: 6, want: 5 controller.go:269: missing call(s) to *mocks.MockChangeBroadcaster.BroadcastJobUpdate(is equal to {<nil> test-job-f3f5-4cef-9cd7-e67eb28eaf3e 0xc000374f28 0xc00036bac0 0 false canceled 0001-01-01 00:00:00 +0000 UTC <nil>} (api.EventJobUpdate)) /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:460 controller.go:269: missing call(s) to *mocks.MockPersistenceService.CountTasksOfJobInStatus(is equal to context.Background (context.backgroundCtx), is equal to &{{47 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC} test-job-f3f5-4cef-9cd7-e67eb28eaf3e 0 cancel-requested map[] map[] {0001-01-01 00:00:00 +0000 UTC false} {} <nil> <nil>} (*persistence.Job), is equal to active (api.TaskStatus), is equal to queued (api.TaskStatus), is equal to soft-failed (api.TaskStatus)) /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:224 controller.go:269: missing call(s) to *mocks.MockPersistenceService.SaveJobStatus(is anything, is equal to &{{47 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC} test-job-f3f5-4cef-9cd7-e67eb28eaf3e 0 cancel-requested map[] map[] {0001-01-01 00:00:00 +0000 UTC false} {} <nil> <nil>} (*persistence.Job)) /home/sybren/workspace/flamenco/internal/manager/task_state_machine/task_state_machine_test.go:441 controller.go:269: aborting test due to missing call(s) FAIL ``` Ideally each commit should bring the source from working state to working state. Of course that's not always necessary, but it's good practice to ensure unit tests are working before the PR is reviewed.

Sybren A. Stüvel requested changes 2024-06-17 15:48:32 +02:00

Dismissed

Sybren A. Stüvel left a comment

I've left some inline notes.

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -5,7 +5,6 @@ package task_state_machine

				import (

					"context"

					"fmt"

Sybren A. Stüvel commented

2024-06-17 15:42:40 +02:00

Keep the newline between built-in packages and 3rd party packages.

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -180,6 +179,16 @@ func (sm *StateMachine) updateJobOnTaskStatusCanceled(ctx context.Context, logge

						return sm.JobStatusChange(ctx, job, api.JobStatusCanceled, "canceled task was last runnable task of job, canceling job")

					}

					numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)

Sybren A. Stüvel commented

2024-06-17 15:35:21 +02:00

This code block feels a bit out of place. If I understand correctly, this is the flow that this code would handle:

User cancels a task.
There are other tasks that are still runnable but not active (i.e. status in {queued, soft-failed, paused}).
Job is in status pause-requested.

To me this doesn't look like a status in which the pausing was complete.

This logic is also repeated quite a bit below. I think it's better to make a new function responsible for answering the question "is the job pausing, and is that process complete now?", and call that function from the places that need this.

Finally, some reordering can be done to optimise the code. Currently every check will count the tasks in the job, even when job.Status != api.JobStatusPauseRequested. It's good practice to do the cheapest check first, and the most expensive check last.

This code block feels a bit out of place. If I understand correctly, this is the flow that this code would handle: 1. User cancels a task. 2. There are other tasks that are still runnable but not active (i.e. status in `{queued, soft-failed, paused}`). 3. Job is in status `pause-requested`. To me this doesn't look like a status in which the pausing was complete. This logic is also repeated quite a bit below. I think it's better to make a new function responsible for answering the question "_is the job pausing, and is that process complete now?_", and call that function from the places that need this. Finally, some reordering can be done to optimise the code. Currently every check will count the tasks in the job, even when `job.Status != api.JobStatusPauseRequested`. It's good practice to do the cheapest check first, and the most expensive check last.

David Zhang commented

2024-06-18 22:03:23 +02:00

The comment makes sense to me! The only thing I am confused about is what you meant by "the pausing was complete".

Sybren A. Stüvel commented

2024-06-19 10:32:31 +02:00

The only thing I am confused about is what you meant by "the pausing was complete".

"Pausing is complete" for me means that the work that should be done when the job goes to pause-requested is done, and it can be moved to state paused.

> The only thing I am confused about is what you meant by "the pausing was complete". "Pausing is complete" for me means that the work that should be done when the job goes to `pause-requested` is done, and it can be moved to state `paused`.

David Zhang commented

2024-06-19 17:45:08 +02:00

So for the pausing to be complete, we need all tasks to be in paused state, including tasks that are runnable but not in active state? In other words, instead of checking the number of active tasks here, we should check if the job has any tasks other than paused?

So for the pausing to be complete, we need all tasks to be in `paused` state, including tasks that are runnable but not in `active` state? In other words, instead of checking the number of active tasks here, we should check if the job has any tasks other than `paused`?

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -183,0 +185,4 @@

					}

					if numActive == 0 && job.Status == api.JobStatusPauseRequested {

						// there is no active task, and the job is in pause-requested status, so we can pause the job

						logger.Info().Msg("all tasks of job are completed, job is paused")

Sybren A. Stüvel commented

2024-06-17 15:42:03 +02:00

This log entry is giving the wrong information. If updateJobOnTaskStatusCanceled() is called, a task was cancelled and thus the log message cannot be correct. Same for the job status change reason "all tasks completed".

The same is true for other copies of this code. Don't make your code lie; in a task-failure-handling part of the code, don't make it log "all tasks completed".

This log entry is giving the wrong information. If `updateJobOnTaskStatusCanceled()` is called, a task was cancelled and thus the log message cannot be correct. Same for the job status change reason "all tasks completed". The same is true for other copies of this code. Don't make your code lie; in a task-failure-handling part of the code, don't make it log "all tasks completed".

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -441,0 +479,4 @@

					logger.Info().Msg("pausing tasks of job")

					// Any task that might run in the future should get paused.

					// Active jobs should remain active until finished

Sybren A. Stüvel commented

2024-06-17 15:44:37 +02:00

jobs → tasks

And end comments with a period.

I think it's fine to keep cancelled tasks as cancelled. That's not a runnable state anyway, so that'll make the code consistent with the comment.

jobs → tasks And end comments with a period. I think it's fine to keep cancelled tasks as cancelled. That's not a runnable state anyway, so that'll make the code consistent with the comment.

David-Zhang-10 marked this conversation as resolved

web/app/src/stores/tasks.js Outdated

						
				@ -22,0 +24,4 @@

				      const activeJob = jobs.activeJob;

				      // Check if the job status is 'pause-requested'

				      if (activeJob && activeJob.status === 'pause-requested') {

Sybren A. Stüvel commented

2024-06-17 15:27:39 +02:00

There's generally two reasons to test something with an if statement:

checking preconditions (for example to avoid errors), and
actual business logic.

In this case the check for activeJob is there to prevent errors when accessing activeJob.status in case there is no active job. This mixes the two above reasons, though.

Another way to look at this, is asking yourself: does it make sense to even answer the query ("can I cancel the active task?") if there is no active job? If too little information is given, in this particular case we know the task but not the job, is that an error? Or is that a sensible state to be in?

When the front-end is displaying tasks, it should be aware of which job those tasks belong to. If that is not the case, I feel that this is a bug. Such bugs should not be hidden, not be defensively worked around. They should be visible, so that they are easy to recognise as bugs. That way people report them, and we can get them fixed.

All that to say: it's probably better to write something like:

    canCancel() {
      const jobs = useJobs();
      const activeJob = jobs.activeJob;

      if (!activeJob) {
        console.warn('no active job, unable to determine whether the active task is cancellable');
        return false;
      }

      if (activeJob.status == 'pause-requested') {
        // Cancelling a task should not be possible while the job is being paused.
        // In the future this might be supported, see issue #104315.
        return false;
      }

      // Allow cancellation for specified task statuses.
      return this._anyTaskWithStatus(['queued', 'active', 'soft-failed']);
    },

Also the triple === is unnecessary. The job status is never going to be something other than a string, and if it's some other type, if it compares equal to "pause-requested", it's good enough a reason to disable cancellation of the task.

There's generally two reasons to test something with an `if` statement: - checking preconditions (for example to avoid errors), and - actual business logic. In this case the check for `activeJob` is there to prevent errors when accessing `activeJob.status` in case there is no active job. This mixes the two above reasons, though. Another way to look at this, is asking yourself: **does it make sense to even answer the query** ("can I cancel the active task?") if there is no active job? If too little information is given, in this particular case we know the task but not the job, is that an error? Or is that a sensible state to be in? When the front-end is displaying tasks, it should be aware of which job those tasks belong to. If that is not the case, I feel that this is a bug. Such bugs should not be hidden, not be defensively worked around. They should be visible, so that they are easy to recognise as bugs. That way people report them, and we can get them fixed. All that to say: it's probably better to write something like: ```js canCancel() { const jobs = useJobs(); const activeJob = jobs.activeJob; if (!activeJob) { console.warn('no active job, unable to determine whether the active task is cancellable'); return false; } if (activeJob.status == 'pause-requested') { // Cancelling a task should not be possible while the job is being paused. // In the future this might be supported, see issue #104315. return false; } // Allow cancellation for specified task statuses. return this._anyTaskWithStatus(['queued', 'active', 'soft-failed']); }, ``` Also the triple `===` is unnecessary. The job status is never going to be something other than a string, and if it's some other type, if it compares equal to `"pause-requested"`, it's good enough a reason to disable cancellation of the task.

David-Zhang-10 marked this conversation as resolved

web/app/src/stores/tasks.js Outdated

						
				@ -22,0 +27,4 @@

				      if (activeJob && activeJob.status === 'pause-requested') {

				        return false;

				      }

				      // Allow cancellation for specified task statuses

Sybren A. Stüvel commented

2024-06-17 15:28:20 +02:00

Make comments proper sentences, so end with a period.

David-Zhang-10 marked this conversation as resolved

David Zhang added 1 commit 2024-06-18 19:36:19 +02:00

Manager: keep the newline between built-in packages and 3rd party packages 5a4e225a6b

David Zhang added 1 commit 2024-06-18 22:13:16 +02:00

Manager: reorder updateJobOnTaskStatusCanceled logic 4b4511f9b6

David Zhang added 1 commit 2024-06-18 22:16:38 +02:00

Webapp: end comment with a period c13cc884ce

David Zhang added 1 commit 2024-06-18 22:27:34 +02:00

Manager: pauseTasks keeps cancelled tasks as cancelled a569a1ce00

David Zhang added 1 commit 2024-06-18 22:31:16 +02:00

Webapp: update canCancel to properly expose bugs 05a1388106

David Zhang added 1 commit 2024-06-18 22:48:06 +02:00

Manager: correct logger message and reorder logic 01754e9e3c

Sybren A. Stüvel requested changes 2024-06-19 10:44:26 +02:00

Dismissed

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -207,0 +223,4 @@

					if job.Status != api.JobStatusPauseRequested {

						return sm.jobStatusIfAThenB(ctx, logger, job, api.JobStatusQueued, api.JobStatusActive,

							"task failed, but not enough to fail the job")

					} else {

Sybren A. Stüvel commented

2024-06-19 10:40:48 +02:00

I don't think this is the right structure. The if {} block now contains a copy of the code below it, which is error-prone. There should be one piece of code responsible for one thing. By duplicating this code, there is the chance that one of the copies goes out of sync with the other, introducing hard to reason about bugs.

Also, don't use else after a return, it doesn't mean anything.

Better to just use:

if job.Status == api.JobStatusPauseRequested {
		numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive)
		if err != nil {
			return err
		}
		if numActive == 0 {
			// There is no active task, and the job is in pause-requested status, so we can pause the job.
			failLogger.Info().Msg("No more active tasks, job is paused")
			return sm.JobStatusChange(ctx, job, api.JobStatusPaused, "all tasks completed")
		}
}

It'll also make it easier to move the code into a function of its own. It's the same "single responsibility principle" as before: there should be one piece of code that checks whether pausing a job is complete.

I don't think this is the right structure. The `if {}` block now contains a copy of the code below it, which is error-prone. There should be one piece of code responsible for one thing. By duplicating this code, there is the chance that one of the copies goes out of sync with the other, introducing hard to reason about bugs. Also, don't use `else` after a `return`, it doesn't mean anything. Better to just use: ```go if job.Status == api.JobStatusPauseRequested { numActive, _, err := sm.persist.CountTasksOfJobInStatus(ctx, job, api.TaskStatusActive) if err != nil { return err } if numActive == 0 { // There is no active task, and the job is in pause-requested status, so we can pause the job. failLogger.Info().Msg("No more active tasks, job is paused") return sm.JobStatusChange(ctx, job, api.JobStatusPaused, "all tasks completed") } } ``` It'll also make it easier to move the code into a function of its own. It's the same "single responsibility principle" as before: there should be one piece of code that checks whether pausing a job is complete.

David-Zhang-10 marked this conversation as resolved

David Zhang added 1 commit 2024-06-19 17:21:24 +02:00

Manager: reorder logic d510eb390c

David Zhang added 1 commit 2024-06-20 21:08:27 +02:00

Manager: wrap job pause condition inside a function 8f067da901

David Zhang added 1 commit 2024-06-20 21:13:50 +02:00

Manager: all original test cases passed 009ad23da1

David Zhang added 1 commit 2024-06-20 21:56:28 +02:00

Manager: add more test cases for paused status 484f746f92

David Zhang requested review from Sybren A. Stüvel 2024-06-20 21:59:02 +02:00

David Zhang referenced this pull request

2024-06-22 16:40:45 +02:00

WIP: Allow jobs to be submitted in `paused` status #104318

Sybren A. Stüvel requested changes 2024-06-26 12:05:22 +02:00

Dismissed

Sybren A. Stüvel left a comment

Looking much better now :)

Just a few small nitpicks, I've wrote some inline notes.

Also I think there is one unit test missing (of course more could be written, but I think this one is important to cover):

Job is active, one tasks is active, one is complete, the rest are queued.
Pausing of the job is requested.
The active task goes to soft-failed.
This should pause the job.

And, if you feel like it, a similar test, except that the active task goes to failed.

Looking much better now :) Just a few small nitpicks, I've wrote some inline notes. Also I think there is one unit test missing (of course more could be written, but I think this one is important to cover): - Job is `active`, one tasks is `active`, one is `complete`, the rest are queued. - Pausing of the job is requested. - The `active` task goes to `soft-failed`. - This should pause the job. And, if you feel like it, a similar test, except that the `active` task goes to `failed`.

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -166,6 +166,21 @@ func (sm *StateMachine) jobStatusIfAThenB(

					return sm.JobStatusChange(ctx, job, thenStatus, reason)

				}

				func (sm *StateMachine) shouldJobBePaused(ctx context.Context, logger zerolog.Logger, job *persistence.Job) (bool, error) {

Sybren A. Stüvel commented

2024-06-26 11:16:17 +02:00

I think the function name can be improved. The name shouldJobBePaused describes what should be done by the caller. I always try to pick a function name that matches what the function does. So for this one, I feel that isJobPausingComplete would be better.

Also a documentation comment would be good. In Go, these always start with // {function name} .... It should describe that this returns true when the job status is pause-requested and there are no more active tasks.

I think the function name can be improved. The name `shouldJobBePaused` describes what should be done by the caller. I always try to pick a function name that matches what the function does. So for this one, I feel that `isJobPausingComplete` would be better. Also a documentation comment would be good. In Go, these always start with `// {function name} ...`. It should describe that this returns true when the job status is `pause-requested` and there are no more active tasks.

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -167,2 +167,4 @@

				}

				func (sm *StateMachine) shouldJobBePaused(ctx context.Context, logger zerolog.Logger, job *persistence.Job) (bool, error) {

					if job.Status == api.JobStatusPauseRequested {

Sybren A. Stüvel commented

2024-06-26 11:12:17 +02:00

Right now, the structure of the function is such that the entire function body is contained in a conditional. This is not a good idea, as it unnecessarily increases the complexity. Flipping the condition of the first if will un-indent the remaining code:

	if job.Status != api.JobStatusPauseRequested {
		return false, nil
	}

  // rest of the code

Right now, the structure of the function is such that the entire function body is contained in a conditional. This is not a good idea, as it unnecessarily increases the complexity. Flipping the condition of the first `if` will un-indent the remaining code: ```go if job.Status != api.JobStatusPauseRequested { return false, nil } // rest of the code ```

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -169,0 +174,4 @@

						}

						if numActive == 0 {

							// There is no active task, and the job is in pause-requested status, so we can pause the job.

							logger.Info().Msg("No more active tasks, job is paused")

Sybren A. Stüvel commented

2024-06-26 11:13:37 +02:00

I think it would be be better to leave the logging to the caller of this function. Also what is logged here is not entirely true -- the job is still in state pause-requested, not in state paused, so logging that it is paused is incorrect.

By removing the logging, you can reduce the final return to just return numActive == 0, nil.

I think it would be be better to leave the logging to the caller of this function. Also what is logged here is not entirely true -- the job is still in state `pause-requested`, not in state `paused`, so logging that it **is** paused is incorrect. By removing the logging, you can reduce the final `return` to just `return numActive == 0, nil`.

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -441,0 +511,4 @@

					// If pausing was requested, it has now happened, so the job can transition.

					toBePaused, err := sm.shouldJobBePaused(ctx, logger, job)

					if err != nil {

						return "", fmt.Errorf("error when accessing number of active tasks")

Sybren A. Stüvel commented

2024-06-26 11:23:57 +02:00

This hides the underlying error, making things hard to debug. When wrapping an error, always use the %w format specifier to include the underlying error. This will make sure that functions like errors.Is() and errors.As() can do their work. You can find more about this at https://pkg.go.dev/errors

Also, this function shouldn't know what's going on inside of shouldJobBePaused(), so if you want to have "when accessing number of active tasks" in the error message, that should be done inside shouldJobBePaused(), not here.

This hides the underlying error, making things hard to debug. When wrapping an error, always use the `%w` format specifier to include the underlying error. This will make sure that functions like `errors.Is()` and `errors.As()` can do their work. You can find more about this at https://pkg.go.dev/errors Also, this function shouldn't know what's going on inside of `shouldJobBePaused()`, so if you want to have "when accessing number of active tasks" in the error message, that should be done inside `shouldJobBePaused()`, not here.

David-Zhang-10 marked this conversation as resolved

internal/manager/task_state_machine/task_state_machine.go Outdated

						
				@ -441,0 +514,4 @@

						return "", fmt.Errorf("error when accessing number of active tasks")

					}

					if toBePaused {

						logger.Info().Msg("all tasks of job paused, job can go to 'paused' status")

Sybren A. Stüvel commented

2024-06-26 11:29:01 +02:00

Perfect example of why shouldJobBePaused() should not be logging anything. It's better to leave that to places like this, where you know the context in which that function is called.

Perfect example of why `shouldJobBePaused()` should not be logging anything. It's better to leave that to places like this, where you know the context in which that function is called.

David-Zhang-10 marked this conversation as resolved

David Zhang added 2 commits 2024-07-01 04:18:18 +02:00

Merge remote-tracking branch 'official/main' into paused-job-status 8e1f481bf1

# Conflicts:
#	go.mod

Manager: function rename adac7bbb37

David Zhang added 1 commit 2024-07-01 04:21:56 +02:00

Manager: flip condition 95f3472e6d

David Zhang added 1 commit 2024-07-01 04:31:22 +02:00

Manager: remove logging in isJobPausingComplete 4b20f8fad8

Sybren A. Stüvel approved these changes 2024-07-01 12:03:39 +02:00

Sybren A. Stüvel left a comment

Looks good to me!

In our meeting this afternoon, let's go over the steps to get this landed on the main branch together.

Make sure you've pull'ed the latest main
Rebase of your PR branch on top of main.
Use git rebase --interactive to squash all commits together.
Use git gui to split out that one squashed commits into a few separate ones (and get rid of the go.sum changes, that file shouldn't be touched in this PR):
1. API changes: flamenco-openapi.yaml
2. Generated code
3. The rest of the changes.
Run make test again
Use gitk to be sure of the final changes one last time.
git push

Looks good to me! In our meeting this afternoon, let's go over the steps to get this landed on the `main` branch together. 1. Make sure you've `pull`'ed the latest `main` 2. Rebase of your PR branch on top of `main`. 3. Use `git rebase --interactive` to squash all commits together. 4. Use `git gui` to split out that one squashed commits into a few separate ones (and get rid of the `go.sum` changes, that file shouldn't be touched in this PR): 1. API changes: `flamenco-openapi.yaml` 2. Generated code 3. The rest of the changes. 5. Run `make test` again 6. Use `gitk` to be sure of the final changes one last time. 7. `git push`

David Zhang referenced this issue from a commit

2024-07-01 17:00:11 +02:00

OAPI: add pause-requested job status

David Zhang referenced this issue from a commit

2024-07-01 17:00:11 +02:00

Manager: Support pausing jobs

Sybren A. Stüvel manually merged commit aac55e7e3c into main

2024-07-01 17:53:44 +02:00

Sybren A. Stüvel referenced this pull request

2024-07-01 17:54:28 +02:00

Support pausing jobs #99399

David Zhang referenced this pull request