On January 26th, 2022, the Aware production environment experienced a full service disruption. Ingestion, processing, and serving of data elements were fully disrupted between 3:55pm and 5:34pm Eastern Standard Time. The outage was caused by the expiration of authentication certificates in Aware’s container management system. There was no data leakage, and all data that would have been ingested during that window was recovered by 11:38am EST on January 28th.
| Date | Time | Description |
|---|---|---|
| Jan 26, 2022 | 3:55pm | Site Reliability Engineering (SRE) team notified of expired-certificate errors on the management plane of the container management system. |
| Jan 26, 2022 | 4:04pm | SRE identifies the specific expired certificates and begins researching corrective measures. |
| Jan 26, 2022 | 4:20pm | Manual certificate rotation begins. |
| Jan 26, 2022 | 4:50pm | Certificate rotation finishes. |
| Jan 26, 2022 | 4:53pm | Nodes report errors as refreshed deployments spin up. |
| Jan 26, 2022 | 5:34pm | Service is restored after the errors are addressed. |
| Jan 26, 2022 | 5:48pm | Recovery actions started (spotlight runs, recurring scans, searches). |
| Jan 27, 2022 | 8:32am | Data recovery started for the roughly two hours of ingestion downtime. |
| Jan 28, 2022 | 11:38am | Data recovery complete. |
From our analysis, we determined that the root cause of the outage was the inability of containers to communicate over the network due to the expiration of the certificates used for network encryption. The container-orchestration system provides the backbone for processing all data in the Aware system. With the individual containers offline, all processing ceased, causing ingestion, archival, and the UI to stop responding to requests. The management system’s API was also affected, which slowed the team’s response until its certificate was updated.
The team responded as soon as the issue was reported and quickly determined its source. The identified mitigation was to renew the certificates associated with management of the cluster; renewing all certificates across the cluster took 30 minutes. With the certificates renewed, the cluster was brought back online, at which point the team identified errors related to the startup order of the containers. Once the containers were brought up in the correct order, the system was fully online and services were restored.
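The startup-ordering problem described above can be sketched as a topological sort over a service dependency graph. The service names below are illustrative assumptions, not Aware’s actual components:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each service maps to the set of services it
# must wait for before starting. These names are illustrative only.
dependencies = {
    "message-queue": set(),
    "datastore": set(),
    "ingestion": {"message-queue", "datastore"},
    "search": {"datastore"},
    "ui": {"ingestion", "search"},
}

# static_order() yields services with every dependency listed before its
# dependents, i.e. a safe bring-up order for the cluster.
startup_order = list(TopologicalSorter(dependencies).static_order())
```

Bringing services up in such an order avoids the dependent-before-dependency errors seen during recovery.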
We understand the severity of this incident and that organizations depend on our software to report on insights, implement retention policies, and monitor for specific usage of the collaboration platforms we ingest data from. The data ingested into Aware is of the highest sensitivity, and we take that charge very seriously. We apologize for the outage and for any impact to your organization.
Recovery of any data that failed to ingest began immediately upon service restoration. We have taken the following actions to ensure that this failure will not recur and will take further actions in the future to mitigate potential failures.
We continue to investigate preventative actions. Further activities are planned that will prevent recurrence of this or similar disruptions. A non-exhaustive list of these activities includes:
Finally, we continue to evaluate and improve our incident management processes. After every incident, we perform a full root cause analysis (RCA) to identify and eliminate root causes, as well as a retrospective on the process itself, noting any improvements needed. In this case, additional monitoring and alerting were added across the system, and the team has committed to several process improvements to handle incidents faster in the future.
In the future, as during this incident, all customer communications will be centralized on our status page at Aware Status. Customers can subscribe on that page to be notified of any outages or service degradation.