On August 23, 2023, from 2:41 p.m. ET to 3:40 p.m. ET, an update to the server cluster hosting caching services essential to our data capture pipeline created a resource constraint that caused an outage in our session capture pipeline.
Approximately 95% of new session requests failed for the duration of the incident. As a result, customers may notice a significant drop in the number of newly captured sessions for this period. All missing activity during this window is unrecoverable and may impact Metrics, Funnels, Dashboards, and Conversions. The impacted sessions are also not available for session replay.
Customers using our mobile SDK may have experienced build errors that prevented apps from compiling. Re-running the build will properly upload the assets.
An update to the log collection process on the server cluster hosting caching services left insufficient resources to guarantee the availability of all caches used in the capture pipeline. The update was prompted by the deprecation of an existing cloud service: migrating to new clusters required logging configuration changes on the existing clusters to support the migrated services. These changes increased memory usage on the servers, which caused caching services to be preempted. With multiple levels of caching degraded, new session capture requests could not enforce privacy configuration or be validated. Because both cache layers failed, these requests timed out before settings could be retrieved, and all session and data capture requests were rejected with a server error.
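To make the failure mode concrete, here is a minimal Go sketch of a two-tier settings lookup under a request deadline. The names (getPrivacySettings, localCacheGet, sharedCacheGet, captureHandler), the tier structure, and the timeouts are all hypothetical illustrations, not FullStory's actual code; the point is only that when every cache tier is unavailable, the lookup runs out the request deadline and the capture request is rejected with a server error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// ErrCacheUnavailable stands in for a preempted cache backend.
var ErrCacheUnavailable = errors.New("cache unavailable")

// getPrivacySettings is a hypothetical two-tier lookup: a local
// in-process cache, then a shared cache cluster. When both tiers
// are down, the call retries until the request deadline expires.
func getPrivacySettings(ctx context.Context, orgID string) (string, error) {
	if v, err := localCacheGet(orgID); err == nil {
		return v, nil
	}
	for {
		select {
		case <-ctx.Done():
			// Neither cache tier answered before the deadline.
			return "", ctx.Err()
		case <-time.After(50 * time.Millisecond):
			if v, err := sharedCacheGet(orgID); err == nil {
				return v, nil
			}
		}
	}
}

// Both tiers are simulated as unavailable, mirroring the incident.
func localCacheGet(string) (string, error)  { return "", ErrCacheUnavailable }
func sharedCacheGet(string) (string, error) { return "", ErrCacheUnavailable }

func captureHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()
	if _, err := getPrivacySettings(ctx, r.URL.Query().Get("org")); err != nil {
		// Settings could not be retrieved in time: reject with a server error.
		http.Error(w, "settings lookup timed out", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "session accepted")
}

func main() {
	http.HandleFunc("/capture", captureHandler)
	http.ListenAndServe(":8080", nil)
}
```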
The change was first made and validated in our testing environment; however, because that environment differs from production in resource allocation and traffic volume, the problem was not detected there.
FullStory operations detected the problem within 5 minutes of the change and immediately began diagnosis. The update was reverted at 3:15 p.m., approximately 34 minutes after the initial change, which allowed caching services to restart and fully repopulate. Because of the delays in handling requests during the issue, capture services had become unresponsive both to new requests and to internal health monitoring. These services were automatically restarted, but without sufficient capacity to handle the backlog of requests. Additional capacity was provisioned and made available at 3:37 p.m., which fully restored session capture at 3:41 p.m.
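The reason a restart alone was not enough can be seen with rough queueing arithmetic: a service drains its backlog only when its processing rate exceeds the incoming request rate. The sketch below uses entirely assumed, illustrative numbers (worker counts, backlog size, arrival rate, per-request latency); it is not derived from FullStory's actual traffic.

```go
package main

import (
	"fmt"
	"time"
)

// drainTime estimates how long a fixed worker pool takes to drain a
// backlog while new requests keep arriving. If the pool cannot process
// faster than requests arrive, the backlog never drains, which is the
// post-restart, under-provisioned state described above.
func drainTime(workers, backlogReqs, arrivalPerSec, perReqMs int) (time.Duration, bool) {
	drainPerSec := workers * (1000 / perReqMs) // requests processed per second
	net := drainPerSec - arrivalPerSec         // net drain rate
	if net <= 0 {
		return 0, false // queue grows: service stays unresponsive
	}
	return time.Duration(backlogReqs/net) * time.Second, true
}

func main() {
	// Assumed numbers: 50,000 queued requests, 900 new requests/sec,
	// 20 ms per request.
	for _, w := range []int{10, 40} {
		if d, ok := drainTime(w, 50000, 900, 20); ok {
			fmt.Printf("%d workers: backlog drains in ~%v\n", w, d)
		} else {
			fmt.Printf("%d workers: backlog never drains\n", w)
		}
	}
}
```

With 10 workers the pool processes fewer requests per second than arrive, so it never catches up; raising capacity to 40 workers drains the assumed backlog in under a minute, which is why provisioning additional capacity was the step that restored capture.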
We are committed to preventing incidents like this in the future. We’ve completed the following action items:
Here are additional steps we’re taking: