Dagster Serverless places containers across three different availability zones for redundancy. Typically, this hedges us against issues like capacity constraints in any single zone.
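In practice, multi-AZ placement comes down to handing the container scheduler a subnet in each zone. The sketch below illustrates the idea, assuming runs launch as ECS Fargate tasks; the cluster, task definition, and subnet IDs are hypothetical placeholders, not our actual launcher code.

```python
# Minimal sketch: spread run containers across availability zones by giving
# ECS Fargate one subnet per zone to choose from at placement time.
# All resource names below are hypothetical.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

# One subnet per availability zone.
SUBNETS_BY_AZ = {
    "us-west-2a": "subnet-aaa",
    "us-west-2b": "subnet-bbb",
    "us-west-2c": "subnet-ccc",
}

def launch_run_container(task_definition: str) -> str:
    """Launch a run container, letting ECS place it in any of the three zones."""
    response = ecs.run_task(
        cluster="serverless-runs",
        taskDefinition=task_definition,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": list(SUBNETS_BY_AZ.values()),
                "assignPublicIp": "DISABLED",
            }
        },
    )
    return response["tasks"][0]["taskArn"]
```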
At 10:00 AM PST, AWS began experiencing elevated latencies for a networking subsystem in two of these availability zones. You can find more details at health.aws.amazon.com.
This caused roughly 30% of new Serverless runs to fail at container startup between 10:00 AM and 12:00 PM.
Initially, we planned to fail over all new Serverless runs to our one unaffected availability zone. However, AWS separately informed us of increased error rates for our container registry.
At 12:32 PM, we began manually failing over organizations one at a time to the remaining availability zone, so that we could closely monitor for container registry errors before rolling the failover out more broadly to all organizations.
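The failover itself amounted to restricting which zones a given organization's runs could launch into. A rough sketch, continuing with the same hypothetical names as above:

```python
# Minimal sketch of the per-organization failover described above.
# Names and subnet IDs are hypothetical placeholders.
HEALTHY_AZ = "us-west-2c"  # the zone unaffected by the incident
SUBNETS_BY_AZ = {
    "us-west-2a": "subnet-aaa",
    "us-west-2b": "subnet-bbb",
    "us-west-2c": "subnet-ccc",
}
FAILED_OVER_ORGS: set[str] = set()  # grown one organization at a time

def subnets_for_org(org_id: str) -> list[str]:
    """Choose the subnets (and therefore zones) an org's runs may launch into."""
    if org_id in FAILED_OVER_ORGS:
        # Failed-over organizations launch only into the healthy zone.
        return [SUBNETS_BY_AZ[HEALTHY_AZ]]
    # Everyone else keeps the default three-zone spread for now.
    return list(SUBNETS_BY_AZ.values())
```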
At 5:22 PM, AWS reported that they had mitigated the issue and were monitoring for recovery. At 5:56 PM, they acknowledged that their mitigation had not worked and they were exploring new fixes.
At 6:40 PM, satisfied with our container registry's performance during the partial rollout, we began rolling the mitigation out to all organizations. By 8:00 PM, Serverless run launch error rates had stabilized, and we continued to monitor.
At 10:33 PM, AWS resolved the networking issues across all availability zones.