Engine latency

Incident Report for Mergify Status Page

Postmortem

Timeline (UTC)

08:48: We receive alerts that an organization has an unusual latency.

08:48: We start investigating the issue.

09:06: We receive new alerts about 5 more organizations having an unusual latency.

09:35: We open a public incident to inform users about the unusual latency.

10:18: We find out that a change merged at 08:03 AM introduces a regression, making Mergify re-evaluate all opened pull requests on each push event instead of just recomputing the conflict status.

10:24: The change is reverted. We are now waiting for Mergify to finish evaluating all wrongly categorized pull request changes.

10:46: GitHub declares a pull request, webhook, and API incident https://www.githubstatus.com/incidents/7txcg0j03kkg. Mergify can’t evaluate pull requests anymore.

10:46: We start getting an unusually high level of retries. We link this to the GitHub incident.

15:50: The GitHub incident informs us that the API has almost recovered. We still see a lot of errors, and the number of retries doesn’t decrease as expected. We continue to investigate why the number of retries does not match the decrease in GitHub API errors.

15:50: The engine’s latency is not recovering as fast as we expected.

16:46: We discover the retry mechanism has categorized some errors as “organization-wide issues” (e.g.: API rate limit) instead of “pull request issues” (GitHub API issue, Pull request incoherent state) making the evaluations of pull requests pause for a while.

16:46: We start the manual reset of pending retry attempts for all organizations.

17:09: We find out why we are wrongly categorizing some pull request-related issues as organization-wide issues.

17:50: We fix the retry mechanism and reset the retry attempts for all organizations.

19:25: The latency of all users reaches the expected usual values.

The root cause

This incident unveiled two issues:

The obvious one: due to a regression introduced the same morning, it was wrongly set to route the events we received from GitHub. Mergify uses push events to recompute the conflict state of all opened pull requests; instead, the regression was reevaluating all pull requests on each push. This put a massive load on our engine, significantly increasing the evaluation latency.
The long-standing one: discovered by a combination of our incident and GitHub’s incident, put a lot of pressure on our retry mechanism. The retry issue was underlying for a very long time. It is not the first time GitHub has had an incident, but as a consequence of our first issue, we decided to look further into why it was triggered way more than before. Mergify has reworked its caching mechanism over the past months to absorb more load; this new cache mechanism is almost enabled for all customers. The cache reduced a lot the latency during pull request evaluation as Mergify reduced the number of GitHub API calls… except for a couple of customers. Until today, we were still trying to understand why the new cache didn’t work as expected for these specific customers. Most people are not aware of GitHub internal mechanics. So, GitHub computes some pull request states (mergeable/mergeable_state/rebaseable attributes) in the background and doesn’t send an event when it’s finished, forcing Mergify to poll GitHub. Most of the time, applications like ours do not need to poll as the background task has already finished. GitHub recomputes these states on all opened pull requests on each push. The time it takes for GitHub to recompute these states is bound to the number of opened pull requests. If we go back to our new cache mechanism, it invalidates the cache when receiving push events and leverages the retry mechanism to poll all opened pull requests. Retry mechanism which was, in some cases, pausing the whole organization instead of just the pull request. The cache code that leverages the retry mechanism was introduced on 2024/03/12, making the issue occurring a lot in organizations with a lot of pull requests opened, even if it was an underlying silent two-years bug.

Posted Apr 25, 2024 - 09:51 UTC

Resolved

The issue causing the latency has been mitigated.

Posted Apr 24, 2024 - 16:02 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Apr 24, 2024 - 10:32 UTC

Investigating

We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.

Posted Apr 24, 2024 - 09:42 UTC

This incident affected: Engine.