Session Capture Outage
Incident Report for Fullstory
Postmortem

Between 2023-11-07 5:49 PM UTC and 2023-11-08 11:39 AM UTC, an update to our capture service caused some web sessions to be initialized in a corrupted state that prevented capture data from being processed successfully. Replay and analytics features that rely on this session and event data were impacted; activity missing from this time period may affect Metrics, Funnels, Dashboards, and Conversions. Additionally, the impacted sessions are not available for session replay.

This postmortem details the customer impact, the root cause, how we addressed the problem, and how we will prevent similar incidents in the future.

Customer Impact

During the incident, customers using FullStory Relay, or those with a Content Security Policy (CSP) that disallows access to the edge.fullstory.com CDN, may have failed to capture entire sessions or may have captured sessions that are missing some of their pages.

Root Cause

An update to our capture settings service caused some sessions to be initialized with inconsistent state, and our backend data capture service was unable to process some pages of those sessions. When the primary CDN-backed settings cannot be accessed (see the notes on Relay and CSP above), the client requests its settings from a fallback endpoint. After the update, this fallback endpoint did not contain accurate capture instructions for the client, which led to corruption of the local capture state.
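The primary-then-fallback lookup described above can be sketched as follows. This is a minimal illustration only, not FullStory's client code: the function names, the SettingsUnavailable exception, and the settings payload are all hypothetical. It shows why clients that cannot reach the primary source depend entirely on the fallback returning correct instructions.

```python
from typing import Callable, Dict, Tuple

Settings = Dict[str, object]


class SettingsUnavailable(Exception):
    """Raised when a settings source cannot be reached (hypothetical)."""


def load_capture_settings(
    fetch_primary: Callable[[], Settings],
    fetch_fallback: Callable[[], Settings],
) -> Tuple[str, Settings]:
    """Return (source, settings), preferring the primary source."""
    try:
        return "primary", fetch_primary()
    except SettingsUnavailable:
        # Primary is blocked or unreachable, so use the fallback endpoint.
        return "fallback", fetch_fallback()


def blocked_primary() -> Settings:
    # Simulates a CSP or proxy configuration that blocks the CDN.
    raise SettingsUnavailable("primary settings host not reachable")


def fallback() -> Settings:
    # Hypothetical payload; the real settings schema is not described here.
    return {"capture_enabled": True}


source, settings = load_capture_settings(blocked_primary, fallback)
print(source, settings)
```

In this sketch, whatever the fallback returns is used as-is, so an inaccurate fallback payload reaches only the clients that cannot use the primary source.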

Resolution

Our internal monitoring alerted Engineering to corrupted sessions being processed, and the deployment was rolled back at 8:52 PM UTC on 2023-11-07. Because data capture errors persisted after the rollback, we tested the possible impacts of the defect and developed a remediation process that recovers corrupted device identifiers for impacted users so that new capture sessions can be initiated.

Process Changes and Prevention

We are committed to preventing this type of incident in the future. We’ve completed the following action items:

  • Implemented a change that allows recovery of device identifiers when invalid settings have been propagated
  • Developed and ran a data recovery process for sessions captured with invalid identifiers during the brief period of time for which data recovery is possible
  • Increased the period of time for which data with invalid identifiers is preserved, allowing a greater possibility of data recovery in case of a similar incident

Here are additional steps we’re taking:

  • Add automated testing to ensure consistency between primary and fallback capture settings
  • Add checks in our web capture client to guard against corruption of device identity in the observed and similar circumstances
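
A consistency test between primary and fallback capture settings might look roughly like the sketch below. It is an assumption-laden illustration: the settings keys (capture_enabled, schema_version) and the settings_mismatches helper are hypothetical and do not reflect FullStory's actual schema or test suite.

```python
from typing import Dict, List

# Sentinel so that a key missing on one side counts as a mismatch.
_MISSING = object()


def settings_mismatches(
    primary: Dict[str, object], fallback: Dict[str, object]
) -> List[str]:
    """Return the sorted keys whose values differ between the two sources."""
    keys = sorted(set(primary) | set(fallback))
    return [
        key
        for key in keys
        if primary.get(key, _MISSING) != fallback.get(key, _MISSING)
    ]


# Hypothetical payloads: the fallback lags behind the primary.
primary = {"capture_enabled": True, "schema_version": 2}
fallback = {"capture_enabled": True, "schema_version": 1}

mismatches = settings_mismatches(primary, fallback)
print(mismatches)
```

A check of this kind could run before a settings deployment and block the release whenever the list of mismatches is not empty.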

We deeply regret this incident and invite any FullStory customer who was materially affected to contact support@fullstory.com. We stand ready to fully address all of your concerns.

Posted Nov 17, 2023 - 12:12 EST

Resolved
On November 7, 2023, a subset of FullStory orgs experienced a session ingestion outage. If your organization uses Relay or needs to update Content Security Policy (CSP) settings, sessions captured between November 7, 12:49 PM ET, and November 8, 6:39 AM ET may be lost or missing some pages.

The problem was caused by a change in settings, which was reversed on November 7, 3:52 PM ET, once we noticed the impact. However, users who initiated sessions during the specified time frame might still have been affected until a fix was deployed on November 8, 6:39 AM ET.

We're sorry for the inconvenience and are working to put safeguards in place to prevent and detect similar scenarios in the future. If you suspect your org was affected or want more information, contact us at support@fullstory.com.
Posted Nov 07, 2023 - 12:30 EST