tag:dagstercloud.statuspage.io,2005:/historyDagster Cloud Status - Incident History2024-02-26T18:05:10ZDagster Cloudtag:dagstercloud.statuspage.io,2005:Incident/200807682024-02-26T18:05:10Z2024-02-26T18:05:10ZOutage in Dagster Cloud<p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>18:05</var> UTC</small><br><strong>Resolved</strong> - We have confirmed that the incident is resolved.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>17:44</var> UTC</small><br><strong>Monitoring</strong> - We experienced an outage while rotating secrets in our production environment. The outage lasted from 17:28 to 17:39 UTC. We are back to receiving traffic and are monitoring the recovery.</p>tag:dagstercloud.statuspage.io,2005:Incident/199348872024-02-07T23:43:27Z2024-02-07T23:43:27ZDagster Cloud Maintenance Outage<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>23:43</var> UTC</small><br><strong>Resolved</strong> - As a result of our maintenance window, Dagster Cloud was down from 23:36 to 23:39 UTC. This is now resolved; all systems should be operating normally.</p>tag:dagstercloud.statuspage.io,2005:Incident/199170612024-02-07T22:00:57Z2024-02-07T22:00:57ZScheduled Maintenance<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>22:00</var> UTC</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>21:04</var> UTC</small><br><strong>Update</strong> - The prerequisite steps for our database maintenance are running longer than estimated. We still expect a brief (1-2 minute) downtime during which active runs will be paused, agents will pause work, and the Dagster Cloud UI will show a maintenance banner. We now expect this downtime to occur later this afternoon outside of our reserved maintenance window.
Dagster Cloud will continue operating as normal until the downtime and will resume normal operations afterward.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>20:00</var> UTC</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. We will provide updates as necessary.</p><p><small>Feb <var data-var='date'> 5</var>, <var data-var='time'>23:52</var> UTC</small><br><strong>Scheduled</strong> - On Wednesday, February 7th, there will be brief downtime for Dagster Cloud during our reserved maintenance window of 12pm to 2pm PST. We expect there to be two separate downtimes in this window, both roughly five minutes long.<br /><br />Dagster Cloud will be inaccessible during the downtime. Any schedules, sensors, or experimental auto-materialization policies will pause submitting runs. Any in-flight runs will continue, but will not be marked complete until after the downtime. Following the downtime, all schedules, sensors, backfills, and auto-materialization policies will resume automatically.<br /><br />Scheduled downtimes like this are rare; thank you very much for your patience. Don't hesitate to reach out on Slack or contact your customer success manager if you have any concerns or questions about the downtime.</p>tag:dagstercloud.statuspage.io,2005:Incident/199338722024-02-07T20:00:00Z2024-02-07T21:01:15ZDuplicated runs during maintenance window<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>20:00</var> UTC</small><br><strong>Resolved</strong> - From 20:01 to 20:13 UTC, we saw a small number of duplicate run launches related to our maintenance window. This issue is now resolved.</p>tag:dagstercloud.statuspage.io,2005:Incident/198871172024-02-01T18:56:59Z2024-02-01T19:04:45ZOutage disrupting runs on Dagster Cloud<p><small>Feb <var data-var='date'> 1</var>, <var data-var='time'>18:56</var> UTC</small><br><strong>Resolved</strong> - The underlying incident has been resolved.
We're still investigating the breadth of the impact and will follow up with more information shortly.</p><p><small>Feb <var data-var='date'> 1</var>, <var data-var='time'>18:56</var> UTC</small><br><strong>Investigating</strong> - We experienced a brief outage storing events from 18:41:55 to 18:42:51 UTC. We recovered from the underlying issue, but a number of runs may have unexpectedly failed.</p>tag:dagstercloud.statuspage.io,2005:Incident/185373262023-09-19T04:03:05Z2023-09-19T12:34:45ZServerless runs failing to launch<p><small>Sep <var data-var='date'>19</var>, <var data-var='time'>04:03</var> UTC</small><br><strong>Resolved</strong> - We've rolled out our fallback change and the underlying network issue also appears to be resolved.</p><p><small>Sep <var data-var='date'>19</var>, <var data-var='time'>01:40</var> UTC</small><br><strong>Update</strong> - AWS has not been able to resolve the network issues that they identified in a number of AZs. We are continuing to monitor their progress on the underlying issue.<br /><br />In the meantime, we have been rolling out a workaround to fall back onto stable AZs. This workaround has been applied to most Dagster Cloud organizations on Serverless that encountered run launch failures today, and we are still monitoring as this rollout completes to the full set of Serverless deployments.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>23:53</var> UTC</small><br><strong>Monitoring</strong> - We've been rolling out changes across Serverless that have been stabilizing network issues.<br /><br />Also, our underlying cloud provider (AWS) has been providing updates indicating they expect full resolution within the next hour.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>21:44</var> UTC</small><br><strong>Update</strong> - We are continuing to experience networking issues with our underlying cloud provider.
We have started to roll out a workaround to mitigate these networking issues.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>20:06</var> UTC</small><br><strong>Update</strong> - We are continuing to experience networking issues with our underlying cloud provider. We have begun implementing a workaround to fail over to an availability zone that is not experiencing these issues.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>19:24</var> UTC</small><br><strong>Update</strong> - We are continuing to monitor the networking issues with our underlying cloud provider, which are causing Serverless run launch failures.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>18:39</var> UTC</small><br><strong>Update</strong> - Our underlying cloud provider is having some networking issues, which are likely the root cause of the Serverless run launch failures. We are continuing to investigate these errors.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>18:07</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating issues with some Serverless runs failing to launch.</p>tag:dagstercloud.statuspage.io,2005:Incident/185350092023-09-18T12:46:00Z2023-09-18T13:23:37ZDatabase maintenance failover<p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>12:46</var> UTC</small><br><strong>Resolved</strong> - Database maintenance by our cloud provider caused a 90-second service disruption between 12:46:10 UTC and 12:47:40 UTC while a primary database failed over to a secondary. Dagster operations during this window would have failed.
All systems have resumed normal operation.</p>tag:dagstercloud.statuspage.io,2005:Incident/184123902023-09-06T23:59:38Z2023-09-06T23:59:38ZSome asset runs failing with a "GraphQLStorageError: Error in GraphQL Response" error<p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>23:59</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>22:39</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>20:08</var> UTC</small><br><strong>Identified</strong> - The issue has been identified and we are in the process of testing a fix.</p><p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>19:30</var> UTC</small><br><strong>Investigating</strong> - We have observed an increase in timeouts from our storage engine during run execution, resulting in some runs failing with an error like the following in the event log:<br /><br />dagster_cloud_cli.core.errors.GraphQLStorageError: Error in GraphQL response: [{'message': 'Internal Server Error (Trace ID: XXX)', 'locations': [{'line': 22, 'column': 13}], 'path': ['eventLogs', 'getEventRecords']}]<br /><br />We are currently investigating the issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/180449282023-08-03T21:23:25Z2023-08-03T21:23:25ZCannot launch runs in Serverless<p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>21:23</var> UTC</small><br><strong>Resolved</strong> - AWS was experiencing increased API error rates and latencies for their EC2 APIs, which impacted AWS services making use of these APIs. This included AWS Fargate, which is used in Dagster Cloud Serverless. 
Their incident was resolved as of 21:18 UTC.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>21:09</var> UTC</small><br><strong>Monitoring</strong> - We are seeing runs successfully launch as of 20:50 UTC. We are continuing to monitor the situation.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>20:29</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/177924412023-07-07T19:03:10Z2023-07-07T20:36:45ZIntermittent read timeouts<p><small>Jul <var data-var='date'> 7</var>, <var data-var='time'>19:03</var> UTC</small><br><strong>Resolved</strong> - A recent change to how we autoscale our webservers caused a small number of customer read timeouts (on the order of a few dozen per hour) when connecting to our API endpoint. We've reverted these changes and the incident is now resolved.</p><p><small>Jul <var data-var='date'> 7</var>, <var data-var='time'>18:02</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Jul <var data-var='date'> 7</var>, <var data-var='time'>17:43</var> UTC</small><br><strong>Investigating</strong> - We're investigating reports of intermittent read timeouts to our agent API.</p>tag:dagstercloud.statuspage.io,2005:Incident/175742912023-06-14T18:00:25Z2023-06-14T18:00:25ZPartial outage of Dagster Cloud<p><small>Jun <var data-var='date'>14</var>, <var data-var='time'>18:00</var> UTC</small><br><strong>Resolved</strong> - DNS records have been updated and are propagating.
Users unable to access Dagster Cloud should clear their DNS cache.</p><p><small>Jun <var data-var='date'>14</var>, <var data-var='time'>17:51</var> UTC</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Jun <var data-var='date'>14</var>, <var data-var='time'>17:44</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an outage affecting dagster.cloud DNS routing.</p>tag:dagstercloud.statuspage.io,2005:Incident/171371532023-05-03T23:29:18Z2023-05-04T00:30:34ZReports of jobs not starting<p><small>May <var data-var='date'> 3</var>, <var data-var='time'>23:29</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.<br /><br />An operation to upgrade the nodes of our cluster overlapped with our weekly release. Interaction between the two operations caused the Dagster daemons that control run queuing, alerting, schedules, and sensors to be temporarily unavailable.</p><p><small>May <var data-var='date'> 3</var>, <var data-var='time'>23:12</var> UTC</small><br><strong>Update</strong> - The issue is confirmed as resolved, though we are continuing to investigate its root cause.</p><p><small>May <var data-var='date'> 3</var>, <var data-var='time'>22:58</var> UTC</small><br><strong>Investigating</strong> - We have reverted a recent change and believe this should resolve the issue.
We will provide an update after further investigation.</p>tag:dagstercloud.statuspage.io,2005:Incident/169738202023-04-26T17:29:15Z2023-04-26T17:33:06ZBrief outage<p><small>Apr <var data-var='date'>26</var>, <var data-var='time'>17:29</var> UTC</small><br><strong>Resolved</strong> - Dagster Cloud experienced two brief outages: one from 9:27:16 to 9:27:46 PDT, and one from 9:51:25 to 9:51:28 PDT.<br /><br />This may have caused issues with runs that were either starting or completing during these windows.</p>tag:dagstercloud.statuspage.io,2005:Incident/167634902023-04-05T19:30:00Z2023-04-05T20:51:41ZDagster Cloud run launches interrupted<p><small>Apr <var data-var='date'> 5</var>, <var data-var='time'>19:30</var> UTC</small><br><strong>Resolved</strong> - A network configuration change applied to Dagster Cloud caused run launches to fail for some Serverless users from 7:45 PM to 8:45 PM UTC on 4/5/2023. The change has since been rolled back and run launching behavior has returned to normal.</p>tag:dagstercloud.statuspage.io,2005:Incident/165127532023-03-15T15:49:35Z2023-03-15T15:49:35ZGitHub Actions downtime is causing code location update failures<p><small>Mar <var data-var='date'>15</var>, <var data-var='time'>15:49</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>15</var>, <var data-var='time'>15:13</var> UTC</small><br><strong>Monitoring</strong> - Users of the GitHub Actions CI/CD pipeline for Dagster Cloud may be affected.<br />Follow the GitHub incident here: https://www.githubstatus.com/incidents/ybnn77s3lyf8</p>tag:dagstercloud.statuspage.io,2005:Incident/159545172023-01-27T20:50:11Z2023-01-27T20:50:11ZDagster Cloud sensors and runs failing<p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>20:50</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>20:32</var> 
UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>19:40</var> UTC</small><br><strong>Identified</strong> - An error in the GraphQL layer is causing sensors and runs using the k8s_job_executor to fail.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>19:14</var> UTC</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>19:14</var> UTC</small><br><strong>Investigating</strong> - We are investigating a failure at the GraphQL layer that is causing runs to fail.</p>tag:dagstercloud.statuspage.io,2005:Incident/157714262023-01-04T16:47:42Z2023-01-04T16:47:42ZUsage metrics cannot be queried<p><small>Jan <var data-var='date'> 4</var>, <var data-var='time'>16:47</var> UTC</small><br><strong>Resolved</strong> - The incident has been resolved. The errors originated from an external incident with one of our service providers.</p><p><small>Jan <var data-var='date'> 4</var>, <var data-var='time'>14:54</var> UTC</small><br><strong>Investigating</strong> - Components in the Dagster Cloud UI that display usage information for billing purposes are unable to fetch data for display. We are currently investigating the issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/126980132022-11-02T20:34:42Z2022-11-02T20:35:43ZScheduler and sensor disruptions<p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>20:34</var> UTC</small><br><strong>Resolved</strong> - We upgraded our agent request queue backend at 2022-11-02 19:04 UTC.<br /><br />During upgrades, Dagster Cloud's backend services read messages from the old queue and write messages onto the new queue. 
Once the old queue is empty, Dagster Cloud's services begin reading messages only from the new queue.<br /><br />When we rolled the new queue configuration out to Dagster Cloud's backend services, our scheduling services did not recognize the new configuration. These services continued to look only at the old queue.<br /><br />This means new agent requests were being written to the new queue, but our scheduler services, which control executing schedules, evaluating sensors, and launching runs, continued to read only from the old queue.<br /><br />Customers may need to manually retry runs that launched but were marked as failed during the outage window because Dagster Cloud never acknowledged the launch event.<br /><br />Schedules and sensors will resume automatically.<br /><br />To mitigate this in the future, when performing similar operations, we plan to use immutable configurations that will preclude services from continuing to run with an old configuration.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>20:29</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>19:44</var> UTC</small><br><strong>Update</strong> - We are continuing to work on a fix for this issue.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>19:04</var> UTC</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p>tag:dagstercloud.statuspage.io,2005:Incident/119883992022-10-13T23:00:00Z2022-10-14T00:16:05ZServerless agent outage<p><small>Oct <var data-var='date'>13</var>, <var data-var='time'>23:00</var> UTC</small><br><strong>Resolved</strong> - We have identified and rolled back a change that prevented some Serverless agents from launching runs.</p>tag:dagstercloud.statuspage.io,2005:Incident/115698892022-10-06T23:44:28Z2022-10-06T23:44:28ZDagster Cloud Scheduler 
Outage<p><small>Oct <var data-var='date'> 6</var>, <var data-var='time'>23:44</var> UTC</small><br><strong>Resolved</strong> - The scheduler has returned to normal operation.</p><p><small>Oct <var data-var='date'> 6</var>, <var data-var='time'>23:33</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an interruption in the Dagster Cloud scheduler.</p>tag:dagstercloud.statuspage.io,2005:Incident/110894172022-09-09T15:25:38Z2022-09-09T15:26:39ZWe are investigating an issue causing some users' runs to fail to start.<p><small>Sep <var data-var='date'> 9</var>, <var data-var='time'>15:25</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Sep <var data-var='date'> 9</var>, <var data-var='time'>14:48</var> UTC</small><br><strong>Identified</strong> - We have identified the likely cause of the issue and are publishing a fix.</p><p><small>Sep <var data-var='date'> 9</var>, <var data-var='time'>14:35</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an issue where runs fail to start for some users.</p>tag:dagstercloud.statuspage.io,2005:Incident/108616122022-08-19T15:39:00Z2022-08-19T16:12:02ZElevated error rates on Dagster Cloud<p><small>Aug <var data-var='date'>19</var>, <var data-var='time'>15:39</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p>tag:dagstercloud.statuspage.io,2005:Incident/108231932022-08-14T09:30:54Z2022-08-15T17:49:46ZElevated error rates on Dagster Cloud<p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>09:30</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>08:54</var> UTC</small><br><strong>Update</strong> - We are continuing to monitor for any further issues.</p><p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>08:52</var> UTC</small><br><strong>Monitoring</strong> - A 
fix has been implemented and we are monitoring the results.</p><p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>08:33</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/107544232022-08-03T18:55:56Z2022-08-03T18:55:56ZScheduled maintenance 8-3-22<p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>18:55</var> UTC</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>18:33</var> UTC</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. We will provide updates as necessary.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>18:32</var> UTC</small><br><strong>Scheduled</strong> - We will be undergoing scheduled maintenance during this time.</p>tag:dagstercloud.statuspage.io,2005:Incident/107107672022-07-28T17:22:30Z2022-07-28T17:22:30ZElevated error rates on Dagster Cloud<p><small>Jul <var data-var='date'>28</var>, <var data-var='time'>17:22</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jul <var data-var='date'>28</var>, <var data-var='time'>17:15</var> UTC</small><br><strong>Investigating</strong> - We are actively investigating issues where users are unable to load Dagster Cloud. We will provide updates regularly until the issues are resolved.</p>