Resolved
This incident has been resolved. :) Thanks for tuning in. 🌞 🌱 🐉
Posted Oct 30, 2024 - 20:15 UTC
Update
This incident has been resolved. :) Thanks for tuning in. 🌞 🌱 🐉
Posted Oct 30, 2024 - 19:32 UTC
Monitoring
Performance has been back to normal for an hour. I'm moving this incident to the "monitoring" state, and monitor it we shall. <3
Posted Oct 30, 2024 - 18:36 UTC
Update
We're seeing periodic returns to normalcy, though it's not time to call this issue closed yet.
... because y'all, this is an incredibly specific networking issue. 😂 Amazing. Both the Lightward/Mechanic team and the Fly.io team are actively riffing on this problem, testing different solutions.
Fly.io says:
> We're indeed seeing some flapping on the upstream link into AWS for the route to the particular subnet your Redis Cloud instance is on, which is something happening within AWS's own network we don't have much insight or control over. We're trying some things on our end to see if we can isolate and work around the flaky route, but if that doesn't work our (more time-consuming) last resort would be to follow up further upstream with either Equinix (who provides the link into AWS) or AWS directly.
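If you want to see what that kind of route flakiness looks like from your own vantage point, here's a rough sketch (the hostname and port are placeholders, not our actual Redis Cloud instance) that just times TCP connection setup to an endpoint. Run it from two hosts that take different upstream routes, and the gap - plus the jitter - shows up pretty quickly:

```python
# Rough latency probe: times TCP connection setup to an endpoint from
# wherever you run it, so you can compare hosts that take different routes.
# The host and port below are placeholders, not our actual Redis Cloud instance.
import socket
import statistics
import time

HOST = "redis.example.com"  # placeholder endpoint
PORT = 6379                 # default Redis port
SAMPLES = 20

rtts_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    with socket.create_connection((HOST, PORT), timeout=5):
        pass  # we only care about how long connection setup took
    rtts_ms.append((time.perf_counter() - start) * 1000)
    time.sleep(0.5)

print(f"min {min(rtts_ms):.2f} ms | median {statistics.median(rtts_ms):.2f} ms | max {max(rtts_ms):.2f} ms")
```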
We'll get through this, and I'm gonna see if we can't get Mechanic running even faster than before this happened. ;)
=Isaac
Posted Oct 30, 2024 - 17:48 UTC
Update
Our service provider has confirmed the incident. Because y'all are often a technical crowd, here's what they said:
> We tracked the slower route to us-east-1 (~5-7 ms) down to one particular route into our upstream network that about half of our hosts in iad use - the other half is configured with a different upstream that's significantly faster (~0.5-0.7ms).
:) Mechanic is an efficient system. Those extra milliseconds count.
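To make that concrete, here's a purely illustrative back-of-envelope. The per-round-trip latencies come from the quote above; the number of sequential round trips per event is a made-up assumption, not a measurement of Mechanic:

```python
# Purely illustrative: a few extra milliseconds per round trip, stacked
# across many sequential round trips, adds up fast. The per-round-trip
# latencies come from the quote above; round_trips is a made-up number.
fast_route_ms = 0.6    # ~0.5-0.7 ms via the healthy upstream
slow_route_ms = 6.0    # ~5-7 ms via the degraded one
round_trips = 1000     # hypothetical sequential round trips per event

extra_seconds = (slow_route_ms - fast_route_ms) * round_trips / 1000
print(f"~{extra_seconds:.1f} s of added latency per event")  # prints ~5.4 s
```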
Fly is getting this resolved with their upstream networking provider. In the meantime, this information opens the door for a new mitigation strategy, and we're on it. 🤩
Posted Oct 30, 2024 - 00:08 UTC
Update
State of play: our query time for external services is up across multiple vendors, and none of the vendors reflect intrinsic issues on their end. It really looks like an infra thing somewhere in there. We’re actively comparing notes with Fly.io (our primary compute provider). I think (based on recorded facts) and feel (based on vibes) that the issue is in their realm, whether or not it’s “theirs” per se.
We’re digging. It’s gonna emerge or disappear, one of the two. These things always do. :)
=Isaac
Posted Oct 29, 2024 - 15:16 UTC
Update
Narrowing in on it, tightening the scope and optimizing as we go. Actively working on it. :) I anticipate getting this resolved tomorrow (October 28).
Posted Oct 28, 2024 - 02:55 UTC
Update
Still working on this. The adjustment we made helped, but not as much as I expected it to. We're getting in touch with our infrastructure provider - I really want to get that RTT number back under 2 seconds.
Identified
Isaac here! We're aware of an issue causing Mechanic runs to be performed a *hair* more slowly than before. Typically, platform RTT (the time between an event arriving and a resulting action being dispatched, assuming a ~instantly-performing task run) hovers around 2 seconds. Since October 24 ~8am UTC, platform RTT has been about 10 seconds. (This data is all published in real time at https://status.mechanic.dev/.)
We've identified the component responsible for the slowdown, and are in the process of adjusting our infrastructure accordingly.