On August 23, 2023, from 2:41 p.m. ET to 3:40 p.m. ET, an update to the server cluster hosting caching services essential to our data capture pipeline created a resource constraint that caused an outage in our session capture pipeline.
Approximately 95% of new session requests failed for the duration of the incident. As a result, customers may notice a significant drop in the number of newly captured sessions for this period. All missing activity during this window is unrecoverable and may impact Metrics, Funnels, Dashboards, and Conversions. The impacted sessions are also not available for session replay.
Customers using our mobile SDK may have experienced build errors that prevented apps from compiling. Re-running the build will properly upload the assets.
An update to the log collection process on the server cluster hosting caching services left insufficient resources to guarantee the availability of all caches used in the capture pipeline. The update was prompted by the deprecation of an existing cloud service: migrating to new clusters required logging configuration changes on the existing clusters to support the migrated services. These changes increased memory usage on the servers, which caused caching services to be preempted. With multiple levels of caching degraded, new session capture requests could not enforce privacy configuration or be validated. Because both cache layers failed, these requests timed out before settings could be retrieved, and all session and data capture requests were rejected with a server error.
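To make the failure mode concrete, here is a minimal Go sketch of a two-tier settings lookup under a request deadline. The names (getPrivacySettings, localCacheGet, sharedCacheGet, captureHandler), the tier structure, and the timeouts are all hypothetical illustrations, not FullStory's actual code; the point is only that when every cache tier is unavailable, the lookup runs out the request deadline and the capture request is rejected with a server error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// ErrCacheUnavailable stands in for a preempted cache backend.
var ErrCacheUnavailable = errors.New("cache unavailable")

// getPrivacySettings is a hypothetical two-tier lookup: a local
// in-process cache, then a shared cache cluster. When both tiers
// are down, the call retries until the request deadline expires.
func getPrivacySettings(ctx context.Context, orgID string) (string, error) {
	if v, err := localCacheGet(orgID); err == nil {
		return v, nil
	}
	for {
		select {
		case <-ctx.Done():
			// Neither cache tier answered before the deadline.
			return "", ctx.Err()
		case <-time.After(50 * time.Millisecond):
			if v, err := sharedCacheGet(orgID); err == nil {
				return v, nil
			}
		}
	}
}

// Both tiers are simulated as unavailable, mirroring the incident.
func localCacheGet(string) (string, error)  { return "", ErrCacheUnavailable }
func sharedCacheGet(string) (string, error) { return "", ErrCacheUnavailable }

func captureHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()
	if _, err := getPrivacySettings(ctx, r.URL.Query().Get("org")); err != nil {
		// Settings could not be retrieved in time: reject with a server error.
		http.Error(w, "settings lookup timed out", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "session accepted")
}

func main() {
	http.HandleFunc("/capture", captureHandler)
	http.ListenAndServe(":8080", nil)
}
```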
The change was first made and validated in our testing environment; however, because that environment differs from production in resource allocation and traffic volume, the problem was not detected there.
FullStory operations detected the problem within 5 minutes of the change and immediately began diagnosis. The update was reverted at 3:15 p.m., approximately 34 minutes after the initial change, which allowed caching services to restart and fully repopulate. Because of the delays in handling requests during the issue, capture services had become unresponsive both to new requests and to internal health monitoring. These services were automatically restarted, but without sufficient capacity to handle the backlog of requests. Additional capacity was provisioned and made available at 3:37 p.m., which fully restored session capture at 3:41 p.m.
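The reason a restart alone was not enough can be seen with rough queueing arithmetic: a service drains its backlog only when its processing rate exceeds the incoming request rate. The sketch below uses entirely assumed, illustrative numbers (worker counts, backlog size, arrival rate, per-request latency); it is not derived from FullStory's actual traffic.

```go
package main

import (
	"fmt"
	"time"
)

// drainTime estimates how long a fixed worker pool takes to drain a
// backlog while new requests keep arriving. If the pool cannot process
// faster than requests arrive, the backlog never drains, which is the
// post-restart, under-provisioned state described above.
func drainTime(workers, backlogReqs, arrivalPerSec, perReqMs int) (time.Duration, bool) {
	drainPerSec := workers * (1000 / perReqMs) // requests processed per second
	net := drainPerSec - arrivalPerSec         // net drain rate
	if net <= 0 {
		return 0, false // queue grows: service stays unresponsive
	}
	return time.Duration(backlogReqs/net) * time.Second, true
}

func main() {
	// Assumed numbers: 50,000 queued requests, 900 new requests/sec,
	// 20 ms per request.
	for _, w := range []int{10, 40} {
		if d, ok := drainTime(w, 50000, 900, 20); ok {
			fmt.Printf("%d workers: backlog drains in ~%v\n", w, d)
		} else {
			fmt.Printf("%d workers: backlog never drains\n", w)
		}
	}
}
```

With 10 workers the pool processes fewer requests per second than arrive, so it never catches up; raising capacity to 40 workers drains the assumed backlog in under a minute, which is why provisioning additional capacity was the step that restored capture.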
We are committed to preventing incidents like this in the future. We’ve completed the following action items:
Here are additional steps we’re taking: