Cloudflare: Cloudflare Pages, Workers KV and Cloudflare Access Availability Issues

Incident report for DatoCMS

Resolved

We have analyzed this incident further, and here is a longer explanation.

We use Cloudflare Workers at the edge of our platform to cache GraphQL requests and to block traffic that goes over hard limits, for example from accounts with cancelled subscriptions. To know which projects to block, we store that information in Cloudflare Workers KV.
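As a rough illustration of that flow, here is a simplified TypeScript sketch rather than the production Worker. Everything named in it is an assumption made for the example: the `KeyValueStore` and `ResponseCache` interfaces (in-memory stand-ins for Workers KV and the edge cache), the `blocked:` key prefix, the `handleGraphql` function, and the 403 status code.

```typescript
// Hypothetical minimal interfaces standing in for Workers KV and the edge cache.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
}

interface ResponseCache {
  get(key: string): string | undefined;
  set(key: string, body: string): void;
}

type EdgeResult = {
  status: number;
  body: string;
  source: "blocked" | "cache" | "origin";
};

// Hypothetical edge handler: block flagged projects, then cache GraphQL responses.
async function handleGraphql(
  projectId: string,
  query: string,
  kv: KeyValueStore,
  cache: ResponseCache,
  origin: (projectId: string, query: string) => Promise<string>,
): Promise<EdgeResult> {
  // 1. Reject traffic for projects flagged in the key-value store.
  //    (The "blocked:" key prefix and the 403 status are assumptions.)
  const flag = await kv.get(`blocked:${projectId}`);
  if (flag !== null) {
    return { status: 403, body: "project blocked", source: "blocked" };
  }

  // 2. Serve repeated queries from the cache when possible.
  const cacheKey = `${projectId}:${query}`;
  const cached = cache.get(cacheKey);
  if (cached !== undefined) {
    return { status: 200, body: cached, source: "cache" };
  }

  // 3. Otherwise forward to the origin servers and cache the response.
  const body = await origin(projectId, query);
  cache.set(cacheKey, body);
  return { status: 200, body, source: "origin" };
}
```

Note that in a handler shaped like this, an exception thrown by the lookup in step 1 aborts the whole function, so the cache in step 2 is never consulted.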

While Cloudflare KV was down, our Workers were failing, which caused the cache to be bypassed.

As a result, all GraphQL requests went straight to our servers, creating an unusual load that slowed everything down significantly. Only a minority of requests timed out, but the user experience was still badly affected and the service appeared to be down. In addition, because every request reached our servers, the GraphQL rate limits were hit much more often.

To prevent this from happening again, we are going to improve our Workers so that they keep running when the KV service is down. If this happens again, at least the caching will keep working. We didn't anticipate that Workers and KV storage could fail independently, but we'll improve this part of the infrastructure soon.
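One possible shape for such a change is a "fail-open" lookup, sketched below in TypeScript. This is an assumption about how the fix could look, not a description of the final implementation: the `KeyValueStore` interface, the `blocked:` key prefix, and the `isBlockedFailOpen` function are hypothetical names used only for illustration.

```typescript
// Hypothetical minimal interface standing in for Workers KV.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
}

type BlockCheck = {
  blocked: boolean;  // true only when the store positively confirms a block
  degraded: boolean; // true when the store could not be reached
};

// Look up the block flag, but treat a store failure as "not blocked"
// so that the rest of the handler (including caching) can keep running.
async function isBlockedFailOpen(
  projectId: string,
  kv: KeyValueStore,
): Promise<BlockCheck> {
  try {
    const flag = await kv.get(`blocked:${projectId}`);
    return { blocked: flag !== null, degraded: false };
  } catch {
    // The store is unavailable: fail open instead of propagating the error.
    return { blocked: false, degraded: true };
  }
}
```

The trade-off in this sketch is that, during a KV outage, traffic that would normally be blocked is served for a while. The `degraded` flag lets the caller log or count those cases.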

Posted at Oct 31, 11:13 GMT+00:00

Resolved

This incident is now resolved. The impact start time was 2023-10-30, 20:03 UTC, and the end time was 2023-10-30, 20:35 UTC.

The incident caused longer-than-usual response times on the CDA API. We will investigate the issue and get back here with more details.

Posted at Oct 30, 20:40 GMT+00:00