Outage - Certificate Expiration
Incident Report for Aware
Postmortem

01/26/2022 Outage - Certificate Expiration

Summary

On January 26th, 2022, the Aware production environment had a full service disruption. Ingestion, processing, and servicing of data elements were fully disrupted between 3:55pm and 5:34pm Eastern Standard Time (EST). The outage was due to the expiration of authentication certificates in Aware’s container management system. There was no data leakage, and all data that would have been ingested during that timeframe was recovered by 11:38am EST the following day.

Timeframe

Date          Time     Description
Jan 26, 2022  3:55pm   Site Reliability Engineering (SRE) Team notified of expired certificate errors on the management plane for container management system.
Jan 26, 2022  4:04pm   SRE identifies the specific certificates that expired. SRE begins researching corrective measures.
Jan 26, 2022  4:20pm   Manual certificate rotation begins.
Jan 26, 2022  4:50pm   Certificate rotation finished.
Jan 26, 2022  4:53pm   Nodes report errors as refreshed deployments spin up.
Jan 26, 2022  5:34pm   Service is restored after addressing errors.
Jan 26, 2022  5:48pm   Recovery actions started (spotlight runs, recurring scans, searches)
Jan 27, 2022  8:32am   Started data recovery for the 2 hours of ingestion down time.
Jan 28, 2022  11:38am  Data recovery complete.

What Happened

From our analysis, we determined that the root cause of the outage was the inability of containers to communicate on the network due to the expiration of certificates used for network encryption. The container-orchestration system provides the backbone for processing all data in the Aware system. With the individual containers offline, all processing ceased, which caused ingestion, archival, and the UI to stop responding to requests. The management system’s API was also affected, which slowed the team’s response until its certificate was updated.

The team responded as soon as the issue was reported and quickly determined its source. The mitigation identified was to renew the certificates associated with management of the cluster, but renewing all certificates across the cluster took 30 minutes. With the certificates renewed, the cluster was brought back online, at which point the team identified errors related to the order in which the containers were brought up. Once containers were brought up in the correct order, the system was fully online and services were restored.
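As an illustration of what "the correct order" means in the paragraph above, the following is a minimal sketch that derives a startup order from a dependency map. The service names and their dependencies are hypothetical and do not describe Aware's actual architecture; the sketch only assumes that each service can declare which other services must be running before it starts.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (not taken from the incident report):
# each key lists the services that must already be running before it starts.
DEPENDS_ON = {
    "message-queue": set(),
    "database": set(),
    "ingestion": {"message-queue", "database"},
    "processing": {"ingestion"},
    "ui": {"processing", "database"},
}


def startup_order(depends_on):
    """Return a list of services in which every dependency precedes its dependents.

    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, service in enumerate(startup_order(DEPENDS_ON), start=1):
        print(position, service)
```

An explicit ordering like this is fragile when it lives only in a runbook, which is one reason to prefer workloads that retry until their dependencies are available and therefore tolerate any startup order.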

Mitigation Actions

We understand the severity of the incident and that organizations depend upon our software to report on insights, implement retention policies, and monitor for specific usage of the collaboration platforms we ingest. The data input to Aware is of the highest sensitivity and we take that charge very seriously. We apologize for the outage and any impact to your organization.

Recovery of any data that failed to ingest began immediately upon service restoration. We have taken the following actions to ensure that the failure will not recur, and we will take further actions in future to mitigate potential failures.

  • Alerts placed for upcoming renewal on all certificates in use in the environment
  • Fresh certificates replaced the expired certificates in the environment
  • Full inventory of certificates in use and expiration dates completed and published internally to the responsible parties
  • Automatic certificate renewal implemented for the container infrastructure
  • Pods have been reconfigured to start successfully, irrespective of order
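The alerting and inventory items in the list above can be sketched as a small expiry check. This is a minimal illustration and not Aware's actual tooling: the inventory entries, certificate names, dates, and the 30-day warning window are all assumptions made up for the example. It uses only the Python standard library, including `ssl.cert_time_to_seconds` to parse certificate `notAfter` timestamps.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Hypothetical inventory (names and dates are illustrative only):
# certificate name -> notAfter timestamp in the X.509 text format.
INVENTORY = {
    "cluster-api-server": "Jan 26 00:00:00 2022 GMT",
    "node-to-node-tls": "Mar 15 00:00:00 2022 GMT",
    "ingest-frontend": "Dec 31 23:59:59 2022 GMT",
}


def classify(inventory, now, warn_within=timedelta(days=30)):
    """Return {name: status}, where status is 'expired', 'expiring', or 'ok'."""
    report = {}
    for name, not_after in inventory.items():
        # cert_time_to_seconds converts the notAfter string to epoch seconds (UTC).
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(not_after), tz=timezone.utc
        )
        if expires <= now:
            report[name] = "expired"
        elif expires - now <= warn_within:
            report[name] = "expiring"  # inside the assumed warning window
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in classify(INVENTORY, datetime.now(timezone.utc)).items():
        print(f"{name}: {status}")
```

Run on a schedule, a check of this kind would flag any certificate in the "expiring" state well before it lapses; the appropriate warning window depends on how long a rotation takes in practice.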

We continue to investigate preventative actions. Further activities are planned which will prevent recurrence of this or similar disruptions. A non-exhaustive list of these activities includes:

  • Accelerating roadmap items related to architectural patterns to optimize resilience
  • Identifying and mitigating similar single points of failure
  • Deploying services redundantly within or across regions
  • Updating runbooks for container cluster management, certificate renewal, and restoration of services

Finally, we continue to evaluate and improve our incident management processes. With every incident, we perform a full root cause analysis (RCA) to identify and eliminate root causes, as well as a retrospective of the process itself, noting any improvements needed. In this case, additional monitoring and alerting were added to the system as a whole, and the team has committed to several process improvements to handle incidents faster in future.

In future, as during this incident, all customer communications will be centralized via our status page at Aware Status. Customers can subscribe via this page to be notified of any outages and service degradation.

Posted Feb 08, 2022 - 08:21 EST

Resolved
As of 8:30am, we have confirmed that the system is operating properly and all services have been restored.
Posted Jan 27, 2022 - 10:15 EST
Monitoring
The fix brought the system back up and all services are operating as expected. We will continue to monitor the system throughout the night to ensure proper operation.
Posted Jan 26, 2022 - 17:34 EST
Update
The issue has been resolved. The system is now in the process of recovering and restarting. We will continue to monitor until services are completely restored.
Posted Jan 26, 2022 - 17:28 EST
Identified
An issue has been identified with expired certificates which has caused all services to be inaccessible, including login. Further updates will happen as more information is available.
Posted Jan 26, 2022 - 16:55 EST
This incident affected: Spotlight (Analytics, Topic Reports) and Monitoring, Search & Discover, Retention, Data Hold, User Data Removal, Authentication & Authorization.