During this weekend's maintenance window, we conducted a major update of our storage layer to enable continued scale and resiliency. The major update was completed over the weekend without a problem.
Yesterday, the team worked through a set of follow-up tasks to finalize the update. During this process, a configuration error was applied that led to a temporary outage in the new storage layer. The error was identified immediately, and steps were taken to restore service. The total outage was approximately 30 minutes.
During the outage, the Dagster+ web interface was not accessible. Run instigators such as schedules and sensors were not run, and existing runs failed. When service was restored, run instigators were automatically restored and typically caught up automatically. Runs with retry policies were also evaluated and executed.
Going forward, we are taking a number of steps to improve and prevent future outages:
The storage update that preceded this event, now complete, will enable higher availability of the storage layer.
Additional steps are being put in place to make storage failover faster, and we are implementing changes to the service to allow existing runs to be resilient to short disruptions.
Additional guardrails are being put in place to prevent configuration errors from being applied.
We appreciate that downtime has a significant impact for our customers. Please reach out to your Dagster Labs account team if you have further questions or concerns.