Degraded

Sensor Live interactive functionality not working

Nov 28 at 03:52am PST
Affected services
app.limacharlie.io

Resolved
Dec 13 at 07:37pm PST

Summary Root Cause Analysis Report - Post-Mortem

Incident Overview:
On November 28, 2023, an issue was identified with Sensor Live items not functioning properly. The problem was initially reported as an issue with the LimaCharlie console. This report summarizes the timeline of events, root causes, and corrective actions taken.

Timeline of Events:
- Nov 28, Early Morning: Customers began reporting the incident via Slack.
- Throughout the Morning: Various team members engaged to investigate the issue. The team validated the issue and observed that it appeared to impact both backend and frontend systems.
- Mid-Morning: The affected systems were identified. Service restarts were required to resolve the issue.

Root Causes:
1. Infrastructure Robustness: Reliance on some older technologies introduced complications that made the issue difficult to troubleshoot. This indicated a need for a more robust and autonomous infrastructure.
2. System Monitoring Gaps: System probes were unable to detect the service degradation due to limitations in their configuration.
3. Communication Delays: A subset of configuration was missed during the change of emergency notification tooling. As a result, the appropriate parties were notified late, which delayed the response to the issue.

Actions Taken:
1. Testing and Infrastructure Improvements: Initiatives to replace older technologies with more modern solutions and enhance monitoring with improved metrics and tests.
2. Incident Reporting Process Updates: An updated email system will be used.
3. Customer Communication: Plans to revise customer notifications regarding incidents and include incident response information in our onboarding process.
4. Internal Documentation: Creation of documentation on incident reporting best practices and severity levels, ensuring all team members are informed and prepared.

Lessons Learned:
- The importance of robust monitoring systems that can detect not just system uptime but also functional performance.
- The need for clear and efficient communication channels both internally and with customers.
- The value of having a resilient and autonomous infrastructure to reduce downtime and manual intervention requirements.
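
To illustrate the first lesson above, the following is a minimal sketch of a probe that reports uptime and functional health separately. It is not the probe configuration used in production: the names `functional_probe`, `ping`, and `run_task` are hypothetical, and the two callables are stand-ins for whatever reachability check and round-trip task a real deployment would use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    up: bool          # the service answered at all
    functional: bool  # the service completed the work we asked for
    detail: str


def functional_probe(
    ping: Callable[[], bool],
    run_task: Callable[[str], Optional[str]],
    max_seconds: float = 5.0,
) -> ProbeResult:
    """Hypothetical probe; `ping` and `run_task` are supplied by the caller."""
    # Step 1: plain reachability, which is all an uptime-only probe checks.
    try:
        up = bool(ping())
    except Exception as exc:
        return ProbeResult(False, False, f"ping failed: {exc}")
    if not up:
        return ProbeResult(False, False, "service did not answer ping")

    # Step 2: send a unique token through the interactive path and expect
    # the same token back within the deadline.
    token = f"probe-{time.monotonic_ns()}"
    started = time.monotonic()
    try:
        reply = run_task(token)
    except Exception as exc:
        return ProbeResult(True, False, f"task failed: {exc}")
    elapsed = time.monotonic() - started

    if reply != token:
        return ProbeResult(True, False, "task returned an unexpected reply")
    if elapsed > max_seconds:
        return ProbeResult(True, False, f"task took {elapsed:.2f}s")
    return ProbeResult(True, True, "ok")
```

A result with `up=True` and `functional=False` corresponds to the situation described in this report: the service is reachable, but interactive requests do not complete.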

Future Recommendations:
- Continue to enhance system monitoring and alerting capabilities.
- Streamline communication protocols for faster incident response.
- Regularly review and update infrastructure to ensure it meets the evolving needs of the service and customers.

Updated
Nov 28 at 07:02am PST

Console commands appear to be working as expected. We continue to monitor.

For reference, automated taskings continued to work as expected throughout this incident.

Created
Nov 28 at 03:52am PST

We are investigating an issue that is impacting Console functionality. Additional details will be posted to this page as they become available.