All timestamps are in UTC
2023-07-31 11:53, We received the first support case about the merge queue unexpectedly dequeuing a pull request with the message: Base does not exist. We started the investigation.
2023-07-31 14:53, We opened an internal incident as our monitoring alerted us about an increasing number of unexpected GitHub API status codes while Mergify created or deleted draft pull requests.
2023-07-31 15:12, We understood that the Git branches we create, and the changes we make on them, with the GitHub Git Database API are not instantly visible to the GitHub Repository and Pulls APIs. The Git manipulation API call succeeds, but when we then fetch the Git resources just created, GitHub reports that they do not exist.
The issue was causing unexpected failures in many different code paths. For customers, this could result in two visible issues:
pull requests wrongly dequeued with one of these error messages:
No commits between XXXX and YYYY
Base does not exist
merge queue stuck at step: This queue is waiting for a batch to fill up.
We decided to implement a retry mechanism in the different code paths where this issue occurred.
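A retry of this kind can be sketched as follows. This is only a minimal illustration of retrying a read that follows a successful write, not Mergify's actual code: the `NotVisibleYet` exception, the `read_with_retry` helper, the `fake_get_branch` stand-in, and the attempt count and delays are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NotVisibleYet(Exception):
    """Hypothetical error: a resource we just created reads back as missing."""


def read_with_retry(
    read: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `read` until it stops raising NotVisibleYet or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return read()
        except NotVisibleYet:
            # Last attempt: give up and let the caller see the error.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ... (assumed values).
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


# Simulated read that only "sees" the new branch on the third call.
calls = {"n": 0}


def fake_get_branch() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise NotVisibleYet("Base does not exist")
    return {"ref": "refs/heads/example"}


# Record the delays instead of actually sleeping.
delays: list = []
result = read_with_retry(fake_get_branch, sleep=delays.append)
```

Injecting `sleep` keeps the sketch testable without waiting. In a real code path, the read would be the API call that follows the branch creation, and the attempt limit bounds how long a pull request waits before the failure is treated as genuine.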
2023-07-31 14:50, Our first change to mitigate the issue landed in production, and we continued to monitor closely.
2023-07-31 14:53, We enabled some full HTTP request/response logging to gather material for GitHub support.
2023-07-31 15:12, We decided to make the incident public.
2023-07-31 15:34, We deployed a second code change to improve the mitigation.
2023-07-31 15:49, We escalated the issue to GitHub support as we had enough material to show the API breakage.
2023-07-31 16:25, We extracted stats about the number of customers and pull requests impacted. We found that the GitHub API started to report existing Git resources as non-existent on 2023-07-27 at 14:14:10 UTC for some accounts. We discovered later that this was the date of the previous GitHub Pull Request API incident https://www.githubstatus.com/incidents/l59z35rhzdky.
2023-07-31 17:34, A third change was deployed to readjust the retry strategy. Mergify was always able to detect the issue and retry successfully when it occurred.
2023-08-01 07:15, A new change was deployed to cover a new code path where the issue occurred.
2023-08-01 09:51, GitHub support answered our support ticket, acknowledged that the GitHub API behavior had changed, and escalated it to the engineering team.
2023-08-01 16:53, GitHub fixed the issue; we asked for more details and why the GitHub status page didn’t get updated
2023-08-02 09:36, GitHub communicates more details about the API behavior change issue:
A feature flag related to spoke caching was turned on earlier that causes replication lag. Following reports of 404 errors occurring for newly created refs, the change was reversed.
2023-08-02 10:53, GitHub confirms this incident will be part of their next availability report:
Thanks for the feedback -- I'd pass those on to the relevant team. Hopefully it gets published in the monthly published availability report.