Buildbot: when triggered from Gitea, sometimes builds are missing #57

Open
opened 2023-03-20 14:22:47 +01:00 by Sybren A. Stüvel · 13 comments

Sometimes when I trigger the buildbot with @blender-bot build or @blender-bot package on a PR, it seems to skip some platforms.

Here's a screenshot of the situation:

image

As you can see, only one builder (vexp-code-patch-darwin-x86_64) actually performed a build, and 3 more are "pending".

However, there are no further builders listed. Since 100% of the listed builders actually succeeded, the entire build is marked as successful. It is this 'success' status that is communicated back to Gitea.
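
To illustrate why a partial set of builders can still report success, here is a minimal sketch. It is not buildbot's actual aggregation code; the function names and the Linux/Windows builder names are hypothetical, and only the darwin builder name comes from this report. The idea is that aggregating over the builders that happen to be listed cannot notice the ones that never showed up, whereas comparing against the expected set can.

```python
from enum import Enum


class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    MISSING = "missing"


def naive_status(reported: dict[str, Result]) -> Result:
    """Aggregate only over builders that actually reported a result."""
    ok = all(r is Result.SUCCESS for r in reported.values())
    return Result.SUCCESS if ok else Result.FAILURE


def strict_status(expected: set[str], reported: dict[str, Result]) -> Result:
    """Flag the build when any expected builder has no result at all."""
    missing = expected - set(reported)
    if missing:
        return Result.MISSING
    return naive_status(reported)


# Only the darwin builder name is taken from the report above.
expected = {
    "vexp-code-patch-darwin-x86_64",
    "example-linux-builder",    # hypothetical name
    "example-windows-builder",  # hypothetical name
}
reported = {"vexp-code-patch-darwin-x86_64": Result.SUCCESS}

print(naive_status(reported).value)             # success
print(strict_status(expected, reported).value)  # missing
```

A check like `strict_status` would need to know the expected builder list up front, which is an assumption about how the trigger is configured.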

This issue happened to me twice before. In all those cases it was the Windows and Linux builders that were missing. I think once or twice it also missed one of the macOS builders, but I'm not 100% sure about that.

When triggering a rebuild from the buildbot web interface, two entries hang (for at least 5 minutes) on "loading buildrequest details...". Refreshing the page doesn't help.

image

I'm not sure what this means, but maybe it helps with finding the root cause.

Links:

  • this build: https://builder.blender.org/admin/#/builders/136/builds/794
  • triggered from https://projects.blender.org/blender/blender/pulls/105604#issuecomment-904231
Brecht Van Lommel added the buildbot label 2023-03-20 14:34:47 +01:00

This might be a bug in buildbot that would be fixed by upgrading to the latest version.
https://github.com/buildbot/buildbot/pull/6152

@Arnd I guess we should upgrade buildbot at some point regardless.

Brecht Van Lommel added the deployment label 2023-03-24 13:44:07 +01:00

For reference, I created this report some weeks ago, which seems to be the same issue: https://gitlab.com/blender/bdr-devops-core/-/issues/1

For completeness' sake: the builds do not seem to be missing only when triggered from Gitea, but perhaps those cases were the most visible to users.

This issue has likely been addressed by the upgrade of buildbot from 3.2.0 to 3.3.0.
Preliminary tests have not shown the issue so far.
Closing the ticket. Please re-open it if this happens again.

It appears this is still happening after the upgrade.

The latest buildbot version is 3.8.0 but we only upgraded to 3.3.0. I think we should upgrade to the latest.

As far as I can find, the last version in the 3.* series is 3.6.0.
I have updated the UATEST cluster to 3.6.0 on the master. The clients should, according to the documentation, stay compatible, but I will upgrade those at some point too.

So far no weird things.
I'm planning to upgrade PROD to 3.6.0 on Jul 26 (tomorrow) unless I find something weird (in which case I will report it here).

The latest is 3.8.0? https://github.com/buildbot/buildbot/releases https://pypi.org/project/buildbot/#history

Totally weird. Their website release-notes history only goes to 3.6.0
http://docs.buildbot.net/current/relnotes/index.html

I will investigate whether there's anything big going on in the releases after that and roll them out on UATEST as soon as possible, so I can hopefully still roll out 3.8.0 on PROD tomorrow.

Upgraded buildbot-master, worker and www + components to 3.8.0 on UATEST and PROD.
Initial results seem to indicate that we're likely still experiencing the same issue.

I did a little sleuthing to see if anything obvious could be found.
Using the api/JSON that the web-interface uses, and sqlite on the database, the following seems to happen:

  • A build request for the buildset comes in.
  • It creates sub-buildrequests in the DB.
  • It creates buildrequest_claims for each of them.
    ...but only a few were actually properly claimed and finalized.
    Not all build requests/buildrequest_claims have a build associated with them; the reason is not clear, but it is possibly a race condition somewhere or improper locking in general.

This result was seen when the workers were still running 3.2.0, however. I upgraded them to 3.8.0 just now to see if that fixes the issue.
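
The kind of check described above can be sketched as a single query: find build requests that have a claim but no associated build. The schema below is a simplified stand-in modeled on the table names mentioned in this comment; the column names are assumptions, the real buildbot schema has more columns and may differ between versions, and the sample rows are made up.

```python
import sqlite3

# Simplified, assumed schema; not a copy of buildbot's real tables.
SCHEMA = """
CREATE TABLE buildrequests (
    id INTEGER PRIMARY KEY, buildsetid INTEGER, builderid INTEGER);
CREATE TABLE buildrequest_claims (brid INTEGER, claimed_at INTEGER);
CREATE TABLE builds (id INTEGER PRIMARY KEY, buildrequestid INTEGER);
"""

# Claimed build requests of one buildset that never got a build.
CLAIMED_WITHOUT_BUILD = """
SELECT br.id, br.builderid
FROM buildrequests AS br
JOIN buildrequest_claims AS c ON c.brid = br.id
LEFT JOIN builds AS b ON b.buildrequestid = br.id
WHERE br.buildsetid = ? AND b.id IS NULL
ORDER BY br.id
"""


def claimed_without_build(conn: sqlite3.Connection, buildsetid: int) -> list:
    """Return (buildrequest id, builder id) pairs that were claimed but have no build."""
    return conn.execute(CLAIMED_WITHOUT_BUILD, (buildsetid,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up data: buildset 10 has three claimed requests, only one got a build.
conn.executemany(
    "INSERT INTO buildrequests VALUES (?, ?, ?)",
    [(1, 10, 101), (2, 10, 102), (3, 10, 103)],
)
conn.executemany(
    "INSERT INTO buildrequest_claims VALUES (?, ?)",
    [(1, 1000), (2, 1000), (3, 1000)],
)
conn.execute("INSERT INTO builds VALUES (?, ?)", (1, 1))

print(claimed_without_build(conn, 10))  # [(2, 102), (3, 103)]
```

Against a real database the connection would point at a copy of the buildbot state database rather than `:memory:`, and the column names would need to be checked against the installed version.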

Issue still present, sadly.
In preparation for submitting the issue upstream, I'm working on migrating buildbot away from SQLite first (as recommended in the deployment guide for non-small deployments).

The SQLite -> Postgres migration has been performed on both UATEST and PROD. I'm tidying up leftovers now (bdr-devops-core configs, etc.).
The next step is to wait for the problem to re-occur with Postgres in place and to create a report from that occurrence to send upstream.
If using Postgres fixed the issue, even better.
To be continued.
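
For context, buildbot selects its database backend through the `db_url` entry in the master configuration. A minimal sketch of such a fragment follows; the user, password, host and database name are placeholders, and this is not the actual bdr-devops-core configuration.

```python
# master.cfg fragment (sketch); all connection details are placeholders.
c = BuildmasterConfig = {}

c["db"] = {
    # SQLAlchemy-style URL pointing at a Postgres database instead of SQLite.
    "db_url": "postgresql://buildbot:CHANGE_ME@localhost/buildbot",
}
```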

If this build is anything to go by, it would seem we're still seeing the same issue. This was a re-trigger of a previous build that was cancelled before the upgrade, but apart from that it shouldn't make a difference.

https://builder.blender.org/admin/#/builders/36/builds/11800

I've filed an issue in the GitHub issue tracker for buildbot:
https://github.com/buildbot/buildbot/issues/7091

Reference: infrastructure/blender-projects-platform#57