Serverless runs failing to launch
Incident Report for Dagster Cloud
Postmortem

Dagster Serverless places containers across three different availability zones for redundancy. This typically hedges us against issues such as capacity constraints in any single zone.
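
For illustration only, here is a minimal sketch of what this kind of multi-AZ placement can look like, assuming an ECS/Fargate-style setup. The cluster name, subnet IDs, and helper function are hypothetical; this is not Dagster's actual run launcher code.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical subnet IDs, one per availability zone.
SUBNETS_BY_AZ = {
    "us-west-2a": "subnet-aaa111",
    "us-west-2b": "subnet-bbb222",
    "us-west-2c": "subnet-ccc333",
}


def launch_run_container(task_definition: str) -> dict:
    """Launch a run container, allowing placement in any of the three AZs."""
    return ecs.run_task(
        cluster="serverless-runs",  # hypothetical cluster name
        launchType="FARGATE",
        taskDefinition=task_definition,
        networkConfiguration={
            "awsvpcConfiguration": {
                # Offering subnets in all three AZs is the redundancy hedge
                # described above: placement can land in whichever zone has
                # capacity and healthy networking.
                "subnets": list(SUBNETS_BY_AZ.values()),
                "assignPublicIp": "DISABLED",
            }
        },
    )
```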

At 10:00 AM PST, AWS began experiencing elevated latencies for a networking subsystem in two of these availability zones. You can find more details at health.aws.amazon.com.

This caused about 30% of new Serverless runs to fail at container startup between 10:00 AM and 12:00 PM.

Initially, we planned to fail over all new Serverless runs to our one remaining availability zone. However, AWS separately informed us of increased error rates affecting our container registry.

At 12:32 PM, we began manually failing over organizations one at a time to our remaining availability zone, so that we could closely monitor for container registry errors before failing over all organizations more aggressively.
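
As a rough, self-contained sketch of this gradual per-organization failover: failed-over organizations are pinned to the single healthy AZ while all others keep the default three-zone spread. The flag set, org IDs, and subnet IDs below are hypothetical, not Dagster Cloud internals.

```python
# Subnets for the three availability zones normally used for run placement
# (hypothetical IDs).
ALL_SUBNETS = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]

# Subnet in the single healthy availability zone.
HEALTHY_SUBNET = "subnet-ccc333"

# Organizations manually failed over so far; grown one at a time while
# container registry error rates are watched.
FAILED_OVER_ORGS: set[str] = {"org-alpha"}


def launch_subnets_for(org_id: str) -> list[str]:
    """Pick launch subnets for an organization's runs.

    Failed-over organizations are pinned to the healthy AZ; everyone else
    keeps the default three-AZ spread.
    """
    if org_id in FAILED_OVER_ORGS:
        return [HEALTHY_SUBNET]
    return ALL_SUBNETS
```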

At 5:22 PM, AWS reported that they had mitigated the issue and were monitoring for recovery. At 5:56 PM, they acknowledged that their mitigation had not worked and they were exploring new fixes.

At 6:40 PM, we decided we were comfortable with our container registry's performance during the partial rollout and began a full rollout of our mitigation to all organizations. By 8:00 PM, Serverless run launch error rates had stabilized, and we continued to monitor.

At 10:33 PM, AWS resolved the networking issues across all availability zones.

Posted Sep 19, 2023 - 12:34 UTC

Resolved
We've rolled out our fallback change, and the underlying network issue also appears to be resolved.
Posted Sep 19, 2023 - 04:03 UTC
Update
AWS has not been able to resolve the network issues that they identified in a number of AZs. We are continuing to monitor their progress on the underlying issue.

In the meantime, we have been rolling out a workaround to fall back to stable AZs. This workaround has been applied to most Dagster Cloud organizations on Serverless that encountered run launch failures today, and we are continuing to monitor as the rollout is completed across the full set of Serverless deployments.
Posted Sep 19, 2023 - 01:40 UTC
Monitoring
We've been rolling out changes across Serverless that have been mitigating the impact of the network issues.

Additionally, our underlying cloud provider (AWS) has been providing updates indicating that they expect full resolution within the next hour.
Posted Sep 18, 2023 - 23:53 UTC
Update
We are continuing to experience networking issues with our underlying cloud provider. We have started to roll out a workaround to mitigate these networking issues.
Posted Sep 18, 2023 - 21:44 UTC
Update
We are continuing to experience networking issues with our underlying cloud provider. We have begun implementing a workaround to fail over to an availability zone that is not experiencing these issues.
Posted Sep 18, 2023 - 20:06 UTC
Update
We are continuing to monitor the networking issues with our underlying cloud provider, which are causing Serverless run launch failures.
Posted Sep 18, 2023 - 19:24 UTC
Update
Our underlying cloud provider is experiencing networking issues that are likely the root cause of the Serverless run launch failures. We are continuing to investigate these errors.
Posted Sep 18, 2023 - 18:39 UTC
Investigating
We are currently investigating issues with some Serverless runs failing to launch.
Posted Sep 18, 2023 - 18:07 UTC