Scheduler and sensor disruptions

Incident Report for Dagster Cloud

Resolved

We upgraded our agent request queue backend at 2022-11-02 19:04 UTC.

During upgrades, Dagster Cloud's backend services read messages from the old queue and write messages onto the new queue. Once the old queue is empty, Dagster Cloud's services begin reading messages only from the new queue.

When rolling the new queue configuration out to Dagster Cloud's backend services, our scheduling services did not recognize the new configuration. These services continued to look only at the old queue.

This means new agent requests were being written to the new queue, but our scheduler services - which control things like executing schedules, sensors, and launching runs - continued to only read from the old queue.

Customers may need to manually retry runs that launched but were marked as failed during the outage window because Dagster Cloud never acknowledged the launch event.

Schedules and sensors will resume automatically.

To mitigate this in the future, when performing similar operations, we plan to use immutable configurations that will preclude services from continuing to run with an old configuration.

Posted Nov 02, 2022 - 20:34 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 02, 2022 - 20:29 UTC

Update

We are continuing to work on a fix for this issue.

Posted Nov 02, 2022 - 19:44 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Nov 02, 2022 - 19:04 UTC

This incident affected: API.