08:48: We receive alerts that an organization is experiencing unusual latency.
08:48: We start investigating the issue.
09:06: We receive new alerts about 5 more organizations experiencing unusual latency.
09:35: We open a public incident to inform users about the unusual latency.
10:18: We find out that a change merged at 08:03 introduced a regression that makes Mergify re-evaluate all open pull requests on each push event instead of just recomputing the conflict status.
10:24: The change is reverted. We are now waiting for Mergify to finish evaluating all wrongly categorized pull request changes.
10:46: GitHub declares a pull request, webhook, and API incident https://www.githubstatus.com/incidents/7txcg0j03kkg. Mergify can’t evaluate pull requests anymore.
10:46: We start getting an unusually high level of retries. We link this to the GitHub incident.
15:50: The GitHub incident informs us that the API has almost recovered. We still see a lot of errors, and the number of retries is not decreasing in line with the drop in GitHub API errors. We continue to investigate this mismatch.
15:50: The engine’s latency is not recovering as fast as we expected.
16:46: We discover that the retry mechanism has categorized some errors as “organization-wide issues” (e.g., API rate limit) instead of “pull request issues” (e.g., GitHub API issue, pull request in an incoherent state), which pauses the evaluation of pull requests for a while.
16:46: We start the manual reset of pending retry attempts for all organizations.
17:09: We find out why we are wrongly categorizing some pull request-related issues as organization-wide issues.
17:50: We fix the retry mechanism and reset the retry attempts for all organizations.
19:25: Latency returns to its usual values for all users.
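To illustrate the distinction found at 16:46, here is a minimal sketch of scoping a retry by error type. It is not Mergify's actual code: the names RetryScope, ApiError, and classify, as well as the rate_limit_remaining field, are hypothetical and only show the idea that an exhausted rate limit should pause an organization while any other API error should pause a single pull request.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RetryScope(Enum):
    """What gets paused while a retry is pending (hypothetical names)."""

    PULL_REQUEST = "pull_request"
    ORGANIZATION = "organization"


@dataclass(frozen=True)
class ApiError:
    """Simplified view of a failed GitHub API call (assumed shape)."""

    status: int
    rate_limit_remaining: Optional[int] = None


def classify(error: ApiError) -> RetryScope:
    # An exhausted rate limit affects every pull request of the organization,
    # so pausing the whole organization is justified in that case only.
    if error.status in (403, 429) and error.rate_limit_remaining == 0:
        return RetryScope.ORGANIZATION
    # Anything else (server errors, incoherent pull request state, ...) is
    # treated as local to the pull request being evaluated.
    return RetryScope.PULL_REQUEST
```

Defaulting to the narrower scope means an unrecognized error delays one pull request rather than a whole organization.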
This incident unveiled two issues:
GitHub computes the mergeability of a pull request (the mergeable/mergeable_state/rebaseable attributes) in the background and doesn’t send an event when it’s finished, forcing Mergify to poll GitHub. Most of the time, applications like ours do not need to poll because the background task has already finished. However, GitHub recomputes these states for all open pull requests on each push, so the time it takes grows with the number of open pull requests.

Going back to our new cache mechanism: it invalidates the cache when it receives a push event and relies on the retry mechanism to poll all open pull requests. That retry mechanism was, in some cases, pausing the whole organization instead of just the pull request. The cache code that relies on the retry mechanism was introduced on 2024/03/12, which made the issue occur frequently in organizations with many open pull requests, even though the underlying bug had been silently present for two years.
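The polling described above can be sketched as follows. This is an illustration under stated assumptions, not Mergify's implementation: wait_for_mergeable and its parameters are hypothetical, and fetch_mergeable stands for any callable that reads the pull request's mergeable attribute, which the GitHub API returns as null (None here) while the background computation is still running.

```python
import time
from typing import Callable, Optional


def wait_for_mergeable(
    fetch_mergeable: Callable[[], Optional[bool]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bool]:
    """Poll one pull request until GitHub has computed its mergeability.

    Returns True or False once the value is known, or None if it is still
    unknown after max_attempts. Only this pull request waits; nothing
    organization-wide is paused.
    """
    for attempt in range(max_attempts):
        mergeable = fetch_mergeable()
        if mergeable is not None:
            return mergeable
        # Back off exponentially between attempts, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2**attempt)
    return None
```

Passing sleep as a parameter keeps the sketch testable without real delays; a caller that gets None back would reschedule that single pull request rather than block the others.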