Sensor Live interactive functionality not working
Resolved
Dec 13 at 07:37pm PST
Summary Root Cause Analysis Report - Post-Mortem
Incident Overview:
On November 28, 2023, an issue was identified in which Sensor Live interactive functionality was not working properly. The problem was initially reported as an issue with the LimaCharlie console. This report summarizes the timeline of events, root causes, and corrective actions taken.
Timeline of Events:
- Nov 28, Early Morning: Customers began reporting the incident via Slack.
- Throughout the Morning: Various team members investigated the issue. The team validated it and observed that it appeared to affect both backend and frontend systems.
- Mid-Morning: The affected systems were identified. Service restarts were required to resolve the issue.
Root Causes:
1. Infrastructure Robustness: Reliance on some older technologies complicated troubleshooting of the issue. This indicated a need for a more robust and autonomous infrastructure.
2. System Monitoring Gaps: System probes were unable to detect the service degradation due to limitations in their configuration.
3. Communication Delays: A subset of configuration was missed during the change of emergency notification tooling. As a result, the appropriate parties were notified late, which delayed the response to the issue.
Actions Taken:
1. Testing and Infrastructure Improvements: Initiatives to replace older technologies with more modern solutions and enhance monitoring with improved metrics and tests.
2. Incident Reporting Process Updates: An updated email system will be used for incident reporting.
3. Customer Communication: Plans to revise customer notifications regarding incidents and include incident response information in our onboarding process.
4. Internal Documentation: Creation of documentation on incident reporting best practices and severity levels, ensuring all team members are informed and prepared.
Lessons Learned:
- The importance of robust monitoring systems that can detect not just system uptime but also functional performance.
- The need for clear and efficient communication channels both internally and with customers.
- The value of having a resilient and autonomous infrastructure to reduce downtime and manual intervention requirements.
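To illustrate the first lesson, the sketch below shows one way a probe can separate "the service answers" from "the service answers correctly". It is a minimal illustration, not the monitoring actually in use: `send_command`, the `"ping"` test command, and the expected reply are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    up: bool          # the service answered at all
    functional: bool  # the answer matched what a working service returns


def run_probe(send_command: Callable[[str], str], expected: str) -> ProbeResult:
    """Send a known test command and compare the reply with the expected one.

    send_command is a hypothetical callable that delivers a command to a
    test sensor and returns its reply; it is assumed to raise on failure.
    """
    try:
        reply = send_command("ping")
    except Exception:
        # No reply at all: the service is down, so it cannot be functional.
        return ProbeResult(up=False, functional=False)
    return ProbeResult(up=True, functional=(reply == expected))


def should_alert(result: ProbeResult) -> bool:
    # Alert on functional failure too, not only when the service is down.
    return not (result.up and result.functional)
```

An uptime-only check would pass whenever `up` is true. A functional check also alerts when the service responds but the reply is wrong or empty, which is the kind of degradation described in this report.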
Future Recommendations:
- Continue to enhance system monitoring and alerting capabilities.
- Streamline communication protocols for faster incident response.
- Regularly review and update infrastructure to ensure it meets the evolving needs of the service and customers.
Affected services
app.limacharlie.io
Updated
Nov 28 at 07:02am PST
Console commands appear to be working as expected. We continue to monitor.
For reference, automated taskings continued to work as expected throughout this incident.
Affected services
app.limacharlie.io
Created
Nov 28 at 03:52am PST
We are investigating an issue that is affecting Console functionality. Additional details will be posted to this page as they become available.
Affected services
app.limacharlie.io