Dagster Cloud Service

Incident Report for Dagster Cloud

Postmortem

During this weekend's maintenance window, we performed a major update of our storage layer to support continued scale and resiliency. The update was completed over the weekend without incident.

Yesterday, the team worked through a set of follow-up tasks to finalize the update. During this process, an incorrect configuration change was applied, which caused a temporary outage in the new storage layer. The error was identified immediately, and steps were taken to restore service. The total outage lasted approximately 30 minutes.

During the outage, the Dagster+ web interface was not accessible. Run instigators such as schedules and sensors were not run, and existing runs failed. When service was restored, run instigators resumed automatically and, in most cases, caught up without intervention. Runs with retry policies were also evaluated and executed.

Going forward, we are taking a number of steps to improve reliability and prevent future outages:

The storage update that preceded this event, now complete, will enable higher availability of the storage layer.
Additional steps are being put in place to make storage failover faster, and we are implementing changes to the service to allow existing runs to be resilient to short disruptions.
Additional guardrails are being put in place to prevent configuration errors from being applied.

We appreciate that downtime has a significant impact on our customers. Please reach out to your Dagster Labs account team if you have further questions or concerns.

Posted Oct 01, 2024 - 20:05 UTC

Resolved

This incident has been resolved. Some runs may have been delayed or failed during the outage. We do not expect any further interruption of service.

Posted Oct 01, 2024 - 00:25 UTC

Monitoring

The fix has been implemented and we are monitoring to confirm that service has been restored.

Posted Oct 01, 2024 - 00:00 UTC

Identified

We are in the process of rolling out a change that is intended to restore access. We will continue to post updates here as we work to restore service.

Posted Sep 30, 2024 - 23:45 UTC

Update

We are in the process of rolling out a change that is intended to restore access. We will continue to post updates here as we work to restore service.

Posted Sep 30, 2024 - 23:44 UTC

Investigating

We are currently experiencing an issue with Dagster Cloud's storage engine that is preventing access to the service. We are actively investigating the issue and will continue to post updates here.

Posted Sep 30, 2024 - 23:36 UTC

This incident affected: API and Dagster Cloud UI.