Degraded availability for webhooks.mechanic.dev; degraded platform performance
Incident Report for Mechanic
Resolved
Performance and throughput are back to normal. AWS has closed their incident, with the following summary:

> Jun 13 10:42 PM UTC Between 11:49 AM PDT and 3:37 PM PDT, we experienced increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use of other AWS services. Additionally, customers may have experienced authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS. Customers may also have experienced issues when attempting to initiate a Call or Chat to AWS Support. As of 2:47 PM, the issue initiating calls and chats to AWS Support was resolved. By 1:41 PM, the underlying issue with the subsystem responsible for AWS Lambda was resolved. At that time, we began processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services. As of 3:37 PM, the backlog was fully processed. The issue has been resolved and all AWS Services are operating normally.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 23:42 UTC
Update
Mechanic's availability and responsiveness are normal. We're still seeing some light irregularity in event and run volume, which may be related to AWS's backlog of Lambda invocations. We're going to keep this incident report open until both (1) AWS closes their incident, and (2) event and run volume return to normal.

From https://health.aws.amazon.com/health/status:

> Jun 13 9:49 PM UTC We are working to accelerate the rate at which Lambda asynchronous invocations are processed, and now estimate that the queue will be fully processed over the next hour. We expect that all queued invocations will be executed.

> Jun 13 9:29 PM UTC Lambda synchronous invocation APIs have recovered. We are still working on processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services (such as SQS and EventBridge). Lambda is working to process these messages during the next few hours and during this time, we expect to see continued delays in the execution of asynchronous invocations.
Posted Jun 13, 2023 - 22:25 UTC
Monitoring
Mechanic is seeing a return to normal throughput rates, and availability of webhooks.mechanic.dev is normal.

Latest update from AWS:

> Jun 13 9:00 PM UTC Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 21:12 UTC
Update
> Jun 13 8:48 PM UTC Beginning at 6:49 PM UTC, customers began experiencing errors and latencies with multiple AWS services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use by other AWS services. [...] We are now observing sustained recovery of the Lambda invoke error rates, and recovery of other affected AWS services. We are continuing to monitor closely as we work towards full recovery across all services.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 20:51 UTC
Update
From AWS:

> Jun 13 8:38 PM UTC We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 20:42 UTC
Update
Latest update from AWS:

> Jun 13 7:36 PM UTC We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 20:10 UTC
Update
AWS has posted the following update:

> Jun 13 7:26 PM UTC We have identified the root cause of the elevated errors invoking AWS Lambda functions, and are actively working to resolve this issue.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 19:31 UTC
Update
Mechanic's run throughput is affected, resulting in longer delays between event ingress and the execution of task/action runs.

Meanwhile, AWS has shared the following update at https://health.aws.amazon.com/health/status:

> Jun 13 7:19 PM UTC AWS Lambda function invocation is experiencing elevated error rates. We are working to identify the root cause of this issue.
Posted Jun 13, 2023 - 19:23 UTC
Identified
AWS (one of Mechanic's upstream providers) has confirmed an issue with the affected service (AWS Lambda):

> Jun 13 7:08 PM UTC We are investigating increased error rates and latencies in the US-EAST-1 Region.

https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 19:11 UTC
Investigating
We are investigating degraded performance for webhooks.mechanic.dev. This appears to be an issue with our cloud provider, and we are working on it now.
Posted Jun 13, 2023 - 19:09 UTC
This incident affected: Mechanic.