tag:dagstercloud.statuspage.io,2005:/historyDagster Cloud Status - Incident History2024-02-26T18:05:10ZDagster Cloudtag:dagstercloud.statuspage.io,2005:Incident/200807682024-02-26T18:05:10Z2024-02-26T18:05:10ZOutage in Dagster Cloud<p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>18:05</var> UTC</small><br><strong>Resolved</strong> - We have confirmed that the incident is resolved.</p><p><small>Feb <var data-var='date'>26</var>, <var data-var='time'>17:44</var> UTC</small><br><strong>Monitoring</strong> - We experienced an outage while rotating secrets in our production environment. The outage lasted from 17:28 to 17:39 UTC. We are back to receiving traffic and are monitoring the recovery.</p>tag:dagstercloud.statuspage.io,2005:Incident/199348872024-02-07T23:43:27Z2024-02-07T23:43:27ZDagster Cloud Maintenance Outage<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>23:43</var> UTC</small><br><strong>Resolved</strong> - As a result of our maintenance window, Dagster Cloud was down from 23:36 to 23:39 UTC. This is now resolved; all systems should be operating normally.</p>tag:dagstercloud.statuspage.io,2005:Incident/199170612024-02-07T22:00:57Z2024-02-07T22:00:57ZScheduled Maintenance<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>22:00</var> UTC</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>21:04</var> UTC</small><br><strong>Update</strong> - The prerequisite steps for our database maintenance are running longer than estimated. We still expect a brief (1-2 minute) downtime during which active runs will be paused, agents will pause work, and the Dagster Cloud UI will show a maintenance banner. We now expect this downtime to occur later this afternoon outside of our reserved maintenance window.
Dagster Cloud will continue operating as normal until the downtime and will resume normal operations afterward.</p><p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>20:00</var> UTC</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. We will provide updates as necessary.</p><p><small>Feb <var data-var='date'> 5</var>, <var data-var='time'>23:52</var> UTC</small><br><strong>Scheduled</strong> - On Wednesday, February 7th, there will be brief downtime for Dagster Cloud during our reserved maintenance window of 12pm to 2pm PST. We expect there to be two separate downtimes in this window, both roughly five minutes long.<br /><br />Dagster Cloud will be inaccessible during the downtime. Any schedules, sensors, or experimental auto-materialization policies will pause submitting runs. Any in-flight runs will continue, but will not be marked complete until after the downtime. Following the downtime, all schedules, sensors, backfills, and auto-materialization policies will resume automatically.<br /><br />Scheduled downtimes like this are rare; thank you very much for your patience. Don't hesitate to reach out on Slack or contact your customer success manager if you have any concerns or questions about the downtime.</p>tag:dagstercloud.statuspage.io,2005:Incident/199338722024-02-07T20:00:00Z2024-02-07T21:01:15ZDuplicated runs during maintenance window<p><small>Feb <var data-var='date'> 7</var>, <var data-var='time'>20:00</var> UTC</small><br><strong>Resolved</strong> - From 20:01 to 20:13 UTC, we saw a small number of duplicate run launches related to our maintenance window. This issue is now resolved.</p>tag:dagstercloud.statuspage.io,2005:Incident/198871172024-02-01T18:56:59Z2024-02-01T19:04:45ZOutage disrupting runs on Dagster Cloud<p><small>Feb <var data-var='date'> 1</var>, <var data-var='time'>18:56</var> UTC</small><br><strong>Resolved</strong> - The underlying incident has been resolved.
We're still investigating the breadth of the impact and will follow up with more information shortly.</p><p><small>Feb <var data-var='date'> 1</var>, <var data-var='time'>18:56</var> UTC</small><br><strong>Investigating</strong> - We experienced a brief outage storing events from 18:41:55 to 18:42:51 UTC. We recovered from the underlying issue, but a number of runs may have unexpectedly failed.</p>tag:dagstercloud.statuspage.io,2005:Incident/185373262023-09-19T04:03:05Z2023-09-19T12:34:45ZServerless runs failing to launch<p><small>Sep <var data-var='date'>19</var>, <var data-var='time'>04:03</var> UTC</small><br><strong>Resolved</strong> - We've rolled out our fallback change and the underlying network issue also appears to be resolved.</p><p><small>Sep <var data-var='date'>19</var>, <var data-var='time'>01:40</var> UTC</small><br><strong>Update</strong> - AWS has not been able to resolve the network issues that they identified in a number of AZs. We are continuing to monitor their progress on the underlying issue.<br /><br />In the meantime, we have been rolling out a workaround to fall back onto stable AZs. This workaround has been applied to most Dagster Cloud organizations on Serverless that encountered run launch failures today, and we are still monitoring as this rollout completes to the full set of Serverless deployments.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>23:53</var> UTC</small><br><strong>Monitoring</strong> - We've been rolling out changes across Serverless that have been stabilizing network issues.<br /><br />Also, our underlying cloud provider (AWS) has been providing updates indicating they expect full resolution within the next hour.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>21:44</var> UTC</small><br><strong>Update</strong> - We are continuing to experience networking issues with our underlying cloud provider.
We have started to roll out a workaround to mitigate these networking issues.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>20:06</var> UTC</small><br><strong>Update</strong> - We are continuing to experience networking issues with our underlying cloud provider. We have begun implementing a workaround to fail over to an availability zone that is not experiencing these issues.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>19:24</var> UTC</small><br><strong>Update</strong> - We are continuing to monitor the networking issues with our underlying cloud provider, which are causing Serverless run launch failures.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>18:39</var> UTC</small><br><strong>Update</strong> - Our underlying cloud provider is having some networking issues, which are likely the root cause of the Serverless run launch failures. We are continuing to investigate these errors.</p><p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>18:07</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating issues with some Serverless runs failing to launch.</p>tag:dagstercloud.statuspage.io,2005:Incident/185350092023-09-18T12:46:00Z2023-09-18T13:23:37ZDatabase maintenance failover<p><small>Sep <var data-var='date'>18</var>, <var data-var='time'>12:46</var> UTC</small><br><strong>Resolved</strong> - Database maintenance by our cloud provider caused a 90-second service disruption between 12:46:10 UTC and 12:47:40 UTC while a primary database failed over to a secondary. Dagster operations during this window would have failed.
All systems have resumed normal operation.</p>tag:dagstercloud.statuspage.io,2005:Incident/184123902023-09-06T23:59:38Z2023-09-06T23:59:38ZSome asset runs failing with a "GraphQLStorageError: Error in GraphQL Response" error<p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>23:59</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>22:39</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>20:08</var> UTC</small><br><strong>Identified</strong> - The issue has been identified and we are in the process of testing a fix.</p><p><small>Sep <var data-var='date'> 6</var>, <var data-var='time'>19:30</var> UTC</small><br><strong>Investigating</strong> - We have observed an increase in timeouts from our storage engine during run execution, resulting in some runs failing with an error like the following in the event log:<br /><br />dagster_cloud_cli.core.errors.GraphQLStorageError: Error in GraphQL response: [{'message': 'Internal Server Error (Trace ID: XXX)', 'locations': [{'line': 22, 'column': 13}], 'path': ['eventLogs', 'getEventRecords']}]<br /><br />We are currently investigating the issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/180449282023-08-03T21:23:25Z2023-08-03T21:23:25ZCannot launch runs in Serverless<p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>21:23</var> UTC</small><br><strong>Resolved</strong> - AWS was experiencing increased API error rates and latencies for their EC2 APIs, which impacted AWS services making use of these APIs. This included AWS Fargate, which is used in Dagster Cloud Serverless. 
Their incident was resolved as of 21:18 UTC.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>21:09</var> UTC</small><br><strong>Monitoring</strong> - We are seeing runs successfully launch as of 20:50 UTC. We are continuing to monitor the situation.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>20:29</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/177924412023-07-07T19:03:10Z2023-07-07T20:36:45ZIntermittent read timeouts<p><small>Jul <var data-var='date'> 7</var>, <var data-var='time'>19:03</var> UTC</small><br><strong>Resolved</strong> - A recent change to how we autoscale our webservers caused a small number of customer read timeouts (on the order of a few dozen per hour) when connecting to our API endpoint. We've reverted these changes and the incident is now resolved.</p><p><small>Jul <var data-var='date'> 7</var>, <var data-var='time'>18:02</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Jul <var data-var='date'> 7</var>, <var data-var='time'>17:43</var> UTC</small><br><strong>Investigating</strong> - We're investigating reports of intermittent read timeouts to our agent API.</p>tag:dagstercloud.statuspage.io,2005:Incident/175742912023-06-14T18:00:25Z2023-06-14T18:00:25ZPartial outage of Dagster Cloud<p><small>Jun <var data-var='date'>14</var>, <var data-var='time'>18:00</var> UTC</small><br><strong>Resolved</strong> - DNS records have been updated and are propagating.
Users unable to access Dagster Cloud should clear their DNS cache.</p><p><small>Jun <var data-var='date'>14</var>, <var data-var='time'>17:51</var> UTC</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Jun <var data-var='date'>14</var>, <var data-var='time'>17:44</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an outage affecting dagster.cloud DNS routing.</p>tag:dagstercloud.statuspage.io,2005:Incident/171371532023-05-03T23:29:18Z2023-05-04T00:30:34ZReports of jobs not starting<p><small>May <var data-var='date'> 3</var>, <var data-var='time'>23:29</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.<br /><br />An operation to upgrade the nodes of our cluster overlapped with our weekly release. Interaction between the two operations caused the Dagster daemons that control run queuing, alerting, schedules, and sensors to be temporarily unavailable.</p><p><small>May <var data-var='date'> 3</var>, <var data-var='time'>23:12</var> UTC</small><br><strong>Update</strong> - The issue is confirmed as resolved, though we are continuing to investigate its root cause.</p><p><small>May <var data-var='date'> 3</var>, <var data-var='time'>22:58</var> UTC</small><br><strong>Investigating</strong> - We have reverted a recent change and believe this should resolve the issue.
We will provide an update after further investigation.</p>tag:dagstercloud.statuspage.io,2005:Incident/169738202023-04-26T17:29:15Z2023-04-26T17:33:06ZBrief outage<p><small>Apr <var data-var='date'>26</var>, <var data-var='time'>17:29</var> UTC</small><br><strong>Resolved</strong> - Dagster Cloud experienced two brief outages: one from 9:27:16 to 9:27:46 PDT, and one from 9:51:25 to 9:51:28 PDT.<br /><br />This may have caused issues with runs that were either starting or completing during these windows.</p>tag:dagstercloud.statuspage.io,2005:Incident/167634902023-04-05T19:30:00Z2023-04-05T20:51:41ZDagster Cloud run launches interrupted<p><small>Apr <var data-var='date'> 5</var>, <var data-var='time'>19:30</var> UTC</small><br><strong>Resolved</strong> - A network configuration change applied to Dagster Cloud caused run launches to fail for some Serverless users from 7:45 PM to 8:45 PM UTC on 4/5/2023. The change has since been rolled back and run launching behavior has returned to normal.</p>tag:dagstercloud.statuspage.io,2005:Incident/165127532023-03-15T15:49:35Z2023-03-15T15:49:35ZGitHub Actions downtime is causing code location update failures<p><small>Mar <var data-var='date'>15</var>, <var data-var='time'>15:49</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Mar <var data-var='date'>15</var>, <var data-var='time'>15:13</var> UTC</small><br><strong>Monitoring</strong> - Users of the GitHub Actions CI/CD pipeline for Dagster Cloud may be affected.<br />Follow the GitHub incident here: https://www.githubstatus.com/incidents/ybnn77s3lyf8</p>tag:dagstercloud.statuspage.io,2005:Incident/159545172023-01-27T20:50:11Z2023-01-27T20:50:11ZDagster Cloud sensors and runs failing<p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>20:50</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>20:32</var> 
UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>19:40</var> UTC</small><br><strong>Identified</strong> - An error in the GraphQL layer is causing sensors and runs using the k8s_job_executor to fail.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>19:14</var> UTC</small><br><strong>Update</strong> - We are continuing to investigate this issue.</p><p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>19:14</var> UTC</small><br><strong>Investigating</strong> - We are investigating a failure at the GraphQL layer that is causing runs to fail.</p>tag:dagstercloud.statuspage.io,2005:Incident/157714262023-01-04T16:47:42Z2023-01-04T16:47:42ZUsage metrics cannot be queried<p><small>Jan <var data-var='date'> 4</var>, <var data-var='time'>16:47</var> UTC</small><br><strong>Resolved</strong> - The incident has been resolved. The errors originated from an external incident with one of our service providers.</p><p><small>Jan <var data-var='date'> 4</var>, <var data-var='time'>14:54</var> UTC</small><br><strong>Investigating</strong> - Components in the Dagster Cloud UI that display usage information for billing purposes are unable to fetch data for display. We are currently investigating the issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/126980132022-11-02T20:34:42Z2022-11-02T20:35:43ZScheduler and sensor disruptions<p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>20:34</var> UTC</small><br><strong>Resolved</strong> - We upgraded our agent request queue backend at 2022-11-02 19:04 UTC.<br /><br />During upgrades, Dagster Cloud's backend services read messages from the old queue and write messages onto the new queue. 
Once the old queue is empty, Dagster Cloud's services begin reading messages only from the new queue.<br /><br />When we rolled the new queue configuration out to Dagster Cloud's backend services, our scheduling services did not recognize the new configuration. These services continued to look only at the old queue.<br /><br />This means new agent requests were being written to the new queue, but our scheduler services, which control executing schedules, evaluating sensors, and launching runs, continued to read only from the old queue.<br /><br />Customers may need to manually retry runs that launched but were marked as failed during the outage window because Dagster Cloud never acknowledged the launch event.<br /><br />Schedules and sensors will resume automatically.<br /><br />To mitigate this in the future, when performing similar operations, we plan to use immutable configurations that will preclude services from continuing to run with an old configuration.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>20:29</var> UTC</small><br><strong>Monitoring</strong> - A fix has been implemented and we are monitoring the results.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>19:44</var> UTC</small><br><strong>Update</strong> - We are continuing to work on a fix for this issue.</p><p><small>Nov <var data-var='date'> 2</var>, <var data-var='time'>19:04</var> UTC</small><br><strong>Identified</strong> - The issue has been identified and a fix is being implemented.</p>tag:dagstercloud.statuspage.io,2005:Incident/119883992022-10-13T23:00:00Z2022-10-14T00:16:05ZServerless agent outage<p><small>Oct <var data-var='date'>13</var>, <var data-var='time'>23:00</var> UTC</small><br><strong>Resolved</strong> - We have identified and rolled back a change that prevented some Serverless agents from launching runs.</p>tag:dagstercloud.statuspage.io,2005:Incident/115698892022-10-06T23:44:28Z2022-10-06T23:44:28ZDagster Cloud Scheduler 
Outage<p><small>Oct <var data-var='date'> 6</var>, <var data-var='time'>23:44</var> UTC</small><br><strong>Resolved</strong> - The scheduler has returned to normal operation.</p><p><small>Oct <var data-var='date'> 6</var>, <var data-var='time'>23:33</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an interruption in the Dagster Cloud scheduler.</p>tag:dagstercloud.statuspage.io,2005:Incident/110894172022-09-09T15:25:38Z2022-09-09T15:26:39ZWe are investigating an issue causing some users' runs to fail to start.<p><small>Sep <var data-var='date'> 9</var>, <var data-var='time'>15:25</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Sep <var data-var='date'> 9</var>, <var data-var='time'>14:48</var> UTC</small><br><strong>Identified</strong> - We have identified the likely cause of the issue and are publishing a fix.</p><p><small>Sep <var data-var='date'> 9</var>, <var data-var='time'>14:35</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating an issue where runs fail to start for some users.</p>tag:dagstercloud.statuspage.io,2005:Incident/108616122022-08-19T15:39:00Z2022-08-19T16:12:02ZElevated error rates on Dagster Cloud<p><small>Aug <var data-var='date'>19</var>, <var data-var='time'>15:39</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p>tag:dagstercloud.statuspage.io,2005:Incident/108231932022-08-14T09:30:54Z2022-08-15T17:49:46ZElevated error rates on Dagster Cloud<p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>09:30</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>08:54</var> UTC</small><br><strong>Update</strong> - We are continuing to monitor for any further issues.</p><p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>08:52</var> UTC</small><br><strong>Monitoring</strong> - A 
fix has been implemented and we are monitoring the results.</p><p><small>Aug <var data-var='date'>14</var>, <var data-var='time'>08:33</var> UTC</small><br><strong>Investigating</strong> - We are currently investigating this issue.</p>tag:dagstercloud.statuspage.io,2005:Incident/107544232022-08-03T18:55:56Z2022-08-03T18:55:56ZScheduled maintenance 8-3-22<p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>18:55</var> UTC</small><br><strong>Completed</strong> - The scheduled maintenance has been completed.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>18:33</var> UTC</small><br><strong>In progress</strong> - Scheduled maintenance is currently in progress. We will provide updates as necessary.</p><p><small>Aug <var data-var='date'> 3</var>, <var data-var='time'>18:32</var> UTC</small><br><strong>Scheduled</strong> - We will be undergoing scheduled maintenance during this time.</p>tag:dagstercloud.statuspage.io,2005:Incident/107107672022-07-28T17:22:30Z2022-07-28T17:22:30ZElevated error rates on Dagster Cloud<p><small>Jul <var data-var='date'>28</var>, <var data-var='time'>17:22</var> UTC</small><br><strong>Resolved</strong> - This incident has been resolved.</p><p><small>Jul <var data-var='date'>28</var>, <var data-var='time'>17:15</var> UTC</small><br><strong>Investigating</strong> - We are actively investigating issues where users are unable to load Dagster Cloud. We will provide updates regularly until the issues are resolved.</p>