Buildbot: when triggered from Gitea, sometimes builds are missing #57
Reference: infrastructure/blender-projects-platform#57
Sometimes when I trigger the buildbot with @blender-bot build or @blender-bot package on a PR, it seems to skip some platforms.
Here's a screenshot of the situation:
As you can see, only one builder (vexp-code-patch-darwin-x86_64) actually performed a build, and 3 more are "pending".
However, there are no further builders listed. Since 100% of the listed builders actually succeeded, the entire build is marked as successful. It is this 'success' status that is communicated back to Gitea.
This issue happened to me twice before. In all those cases it was the Windows and Linux builders that were missing. I think once or twice it also missed one of the macOS builders, but I'm not 100% sure about that.
When triggering a rebuild from the buildbot web interface, two entries hang (for at least 5 minutes) on "loading buildrequest details...". Refreshing the page doesn't help.
Not sure what it means, maybe it helps with finding the root cause.
Links:
This might be a bug in buildbot that would be fixed by upgrading to the latest version.
https://github.com/buildbot/buildbot/pull/6152
@Arnd I guess we should upgrade buildbot at some point regardless.
For reference, I created this report some weeks ago, which seems to be the same issue: https://gitlab.com/blender/bdr-devops-core/-/issues/1
For completeness' sake: builds do not only go missing when triggered from Gitea, but those cases were perhaps the most visible to users.
This issue is likely to have been addressed with upgrade of buildbot from 3.2.0 to 3.3.0.
Preliminary tests do not have the issue show up so far.
Closing ticket. Please re-open if happening again.
It appears this is still happening after the upgrade.
The latest buildbot version is 3.8.0 but we only upgraded to 3.3.0. I think we should upgrade to the latest.
As far as I can find, the last version in the 3.* series is 3.6.0
I have updated the UATEST cluster to 3.6.0 on the master. The clients should, according to documentation, stay compatible; but will upgrade those at some point too.
So far no weird things.
I'm planning to upgrade PROD to 3.6.0 on Jul 26 (tomorrow) unless I find something weird (will report that here then)
The latest is 3.8.0?
https://github.com/buildbot/buildbot/releases
https://pypi.org/project/buildbot/#history
Totally weird. Their website release-notes history only goes to 3.6.0
http://docs.buildbot.net/current/relnotes/index.html
Will investigate if there's anything big going on in the releases after that and roll 'm out on uatest asap so I can hopefully still roll out a 3.8.0 on prod, tomorrow.
Upgraded buildbot-master, worker and www + components to 3.8.0 on UATEST and PROD.
Initial results seem to indicate that we're likely still experiencing the same issue.
I did a little sleuthing to see if anything obvious could be found.
Using the api/JSON that the web-interface uses, and sqlite on the database, the following seems to happen:
...but only a few actually were properly claimed and finalized.
Not all build requests/buildrequest_claims have a build associated with them; the reason is not clear, but it is possibly a race condition somewhere or improper locking in general.
This result was seen when the workers were still running 3.2.0, however. I upgraded them to 3.8.0 just now to see if that fixes the issue.
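The kind of check described above can be sketched as a query for build requests that never got a build. This is only an illustration, not the query that was actually run: the schema is a simplified stand-in modelled on Buildbot's buildrequests, buildrequest_claims and builds tables (the column names are assumptions), and the sample rows are made up.

```python
import sqlite3

# Simplified stand-in schema. Table and column names are assumptions
# modelled on Buildbot's database, not a copy of the real schema.
SCHEMA = """
CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, builderid INTEGER, complete INTEGER);
CREATE TABLE buildrequest_claims (brid INTEGER, masterid INTEGER, claimed_at INTEGER);
CREATE TABLE builds (id INTEGER PRIMARY KEY, buildrequestid INTEGER, results INTEGER);
"""

# Build requests with no associated build, plus whether they were claimed.
ORPHAN_QUERY = """
SELECT br.id, br.builderid, c.claimed_at IS NOT NULL AS claimed
FROM buildrequests AS br
LEFT JOIN buildrequest_claims AS c ON c.brid = br.id
LEFT JOIN builds AS b ON b.buildrequestid = br.id
WHERE b.id IS NULL
ORDER BY br.id
"""


def find_requests_without_builds(conn):
    """Return (request id, builder id, claimed flag) for requests lacking a build."""
    return conn.execute(ORPHAN_QUERY).fetchall()


def make_sample_db():
    """Create an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO buildrequests VALUES (?, ?, ?)",
        [(1, 36, 1), (2, 37, 0), (3, 38, 0)],
    )
    # Requests 1 and 2 were claimed; request 3 never was.
    conn.executemany(
        "INSERT INTO buildrequest_claims VALUES (?, ?, ?)",
        [(1, 1, 100), (2, 1, 101)],
    )
    # Only request 1 ever produced a build.
    conn.execute("INSERT INTO builds VALUES (1, 1, 0)")
    conn.commit()
    return conn


if __name__ == "__main__":
    for row in find_requests_without_builds(make_sample_db()):
        print(row)
```

Against a real database the same query shape would need to be adapted to the actual schema of the installed Buildbot version.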
Issue still present, sadly.
In preparation for submitting the issue upstream, I'm working on migrating buildbot away from sqlite first (as stated in the deployment guide for non-small deployments).
The Sqlite->Postgres migration has been performed on both UATEST and PROD. Tidying up leftovers now (bdr-devops-core configs, etc)
The next step is to wait for the problem to re-occur with Postgres in place and create a report out of the occurrence to send upstream.
If using postgres fixed the issue; even better.
to be continued.
If this build's anything to go by; it'd seem we're still seeing the same issue. This was a re-trigger of a previous build that was cancelled before the upgrade but apart from that it shouldn't make a difference.
https://builder.blender.org/admin/#/builders/36/builds/11800
I've made an issue in the github issue-tracker for buildbot:
https://github.com/buildbot/buildbot/issues/7091
Currently trying to resolve this issue. One of the maintainers mentioned in the GitHub issue that it might have to do with collapsing build requests.
After disabling request collapsing globally in the Buildbot configuration and still seeing these pending builds disappear, I want to see if updating our Buildbot to the latest version will fix the issue at hand.
When trying to upgrade Buildbot to 4.0.1, I get the following error:
I've upgraded Buildbot to 3.11.6 on UATEST for temporary testing, needed the following changes:
~/.devops/services/buildbot-master/Pipfile
~/git/bdr-devops-core/buildbot/pipeline/__init__.py
cd /home/blender/git/bdr-devops-core && /snap/bin/pwsh -c "./cmd/buildbot/buildbot-master.ps1 -steps db-upgrade -serviceEnvId UATEST -serviceHostId pvep-lvm-buildbot-master-01"
@bartvdbraak The pollinterval field was deprecated and was removed for 4.0: 9c459aad37
The proper spelling is pollInterval, so it should be easy to update our code: https://docs.buildbot.net/latest/manual/configuration/changesources.html#gitpoller
I was already aware of this change: d0717598c8
@bartvdbraak Ah, great! I didn't see that commit at the time I saw the comment here about issues with upgrade to 4.0.1.
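As a rough illustration of the rename discussed above, the helper below rewrites a dictionary of poller keyword arguments so the deprecated pollinterval key becomes pollInterval. The function name and the settings dictionary are hypothetical and not taken from bdr-devops-core; the only fact relied on is the spelling change itself.

```python
def normalise_poller_kwargs(kwargs):
    """Return a copy of kwargs with the deprecated 'pollinterval' key renamed.

    Hypothetical helper for illustration only.
    """
    fixed = dict(kwargs)
    if "pollinterval" in fixed:
        old_value = fixed.pop("pollinterval")
        # An explicitly given pollInterval wins over the deprecated spelling.
        fixed.setdefault("pollInterval", old_value)
    return fixed


if __name__ == "__main__":
    # Made-up settings; the repository URL is a placeholder.
    settings = {"repourl": "https://example.invalid/repo.git", "pollinterval": 120}
    print(normalise_poller_kwargs(settings))
```

The resulting dictionary could then be passed on as keyword arguments when constructing the poller in the master configuration.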
After upgrading Buildbot to version 3.11.6 and adding the c["collapseRequests"] = False configuration, this issue no longer appears. I will close this issue for now, but I will reopen it if it happens again.
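For readers unfamiliar with the setting, a minimal sketch of where it lives: the Buildbot master configuration is a Python file that fills in a BuildmasterConfig dictionary. The excerpt below shows only the one key from this thread; everything else in a real master.cfg (workers, builders, schedulers) is omitted.

```python
# Excerpt in the style of a Buildbot master.cfg, which is plain Python.
# BuildmasterConfig is the dictionary Buildbot reads its settings from;
# 'c' is the conventional short alias for it.
c = BuildmasterConfig = {}

# Disable request collapsing globally, so pending build requests are
# not merged into one another.
c["collapseRequests"] = False
```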