Endpoint Agent v4.32.1 BSOD Issue
Resolved
Dec 11 at 07:18am PST
Summary Root Cause Analysis Report - Post-Mortem
Incident Overview:
On December 11th, 2024, a few LimaCharlie customers reported Blue Screen of Death (BSOD) issues affecting Windows systems running version 4.32.1 of the LC EDR Agent. The issue only affected a small subset of systems running lc:latest that were under high memory pressure.
Engineering quickly identified the issue and rolled back the agent version from 4.32.1 to 4.32.0 to mitigate the problem.
The incident response was swift, with engineering identifying and diagnosing the problem within approximately one hour of the first customer report. A community-wide announcement was made, and remediation was implemented through a version rollback within 2.5 hours of the initial report.
Timeline of Events:
- 2024-12-10 @ 9:41am PT: LC EDR Agent version 4.32.1 is released
- 2024-12-11 @ 05:02am PT: Multiple customers reported isolated Blue Screen of Death (BSOD) incidents on Windows systems running the LimaCharlie EDR Agent
- 2024-12-11 @ 06:06am PT: LC engineering diagnosed a kernel driver defect in latest endpoint agent version 4.32.1
- 2024-12-11 @ 07:18am PT: Announcement posted in LC Community Slack
- 2024-12-11 @ 07:30am PT: LC Agent latest version rolled back to 4.32.0
Root Causes:
The root cause was identified as a kernel driver bug related to improper IRQL handling when accessing paged memory while holding a spin lock. The issue manifested when the kernel driver (tmp_hbs_acq.sys) attempted to access pageable memory at an interrupt request level (IRQL) that was too high, triggering Windows bug check D1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL). This bug rarely surfaced on test systems with no memory pressure, where pageable memory typically stays resident and the access succeeds.
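The failure mode described above can be illustrated with a small user-mode toy model. This is a sketch for intuition only, not the driver's actual code: `ToyKernel`, `BugCheck`, and `buggy_lookup` are hypothetical names, and the model assumes only the general Windows rule that acquiring a spin lock raises the IRQL to DISPATCH_LEVEL, where a page fault cannot be serviced.

```python
from contextlib import contextmanager

# Simplified subset of Windows IRQLs.
PASSIVE_LEVEL = 0
APC_LEVEL = 1
DISPATCH_LEVEL = 2


class BugCheck(Exception):
    """Stands in for a system crash (BSOD) in this toy model."""


class ToyKernel:
    """User-mode model of one CPU's IRQL; hypothetical, not real driver code."""

    def __init__(self):
        self.irql = PASSIVE_LEVEL

    @contextmanager
    def spin_lock(self):
        # Acquiring a spin lock raises the IRQL to DISPATCH_LEVEL and
        # restores the previous level on release.
        previous = self.irql
        self.irql = max(self.irql, DISPATCH_LEVEL)
        try:
            yield
        finally:
            self.irql = previous

    def read_paged(self, resident):
        # A resident page can be read at any IRQL. A paged-out page needs a
        # page fault, which cannot be serviced at DISPATCH_LEVEL or above.
        if not resident and self.irql >= DISPATCH_LEVEL:
            raise BugCheck("0xD1 DRIVER_IRQL_NOT_LESS_OR_EQUAL")
        return "data"


def buggy_lookup(kernel, resident):
    # Hypothetical bug pattern: pageable memory touched while the lock is held.
    with kernel.spin_lock():
        return kernel.read_paged(resident)
```

In this model, `buggy_lookup(ToyKernel(), resident=True)` succeeds while the same call with `resident=False` raises `BugCheck`, which mirrors why the defect appeared mainly on systems under high memory pressure.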
Actions Taken:
- Implement stricter IRQL validation checks in the kernel driver code to ensure proper memory access patterns
- Add comprehensive memory access guards around paged pool operations when holding spin locks
- Enhance the driver's memory management to properly handle high-pressure scenarios
- Implement additional automated testing specifically for IRQL-related scenarios in the kernel driver
- Enhance pre-release testing environment to better simulate high-memory-pressure scenarios and kernel-level interactions across diverse Windows configurations
These changes will be implemented and thoroughly tested before being included in a future release. The development team will also review other areas of the kernel driver code for similar potential IRQL issues.
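One way to picture the first two actions is the following user-mode sketch. It is an illustration under stated assumptions, not the planned driver change: `ToyDriver`, `require_pageable_irql`, and `IrqlViolation` are hypothetical names. The guard plays a role similar to the WDK `PAGED_CODE()` macro, failing on any pageable access above APC_LEVEL regardless of whether the page happens to be resident, and the lookup reads pageable data only after the lock is released.

```python
PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2


class IrqlViolation(Exception):
    """Raised by the guard in this toy model; hypothetical name."""


def require_pageable_irql(current_irql):
    # Fail on every call made above APC_LEVEL, whether or not the page is
    # resident, so tests catch the bug even without memory pressure.
    if current_irql > APC_LEVEL:
        raise IrqlViolation(f"pageable access at IRQL {current_irql}")


class ToyDriver:
    """User-mode stand-in for a driver; not the real implementation."""

    def __init__(self, nonpaged_index, paged_records):
        self.irql = PASSIVE_LEVEL
        self.nonpaged_index = nonpaged_index  # safe to read under the lock
        self.paged_records = paged_records    # only read at low IRQL

    def lookup(self, key):
        # Hold the modelled spin lock only while reading non-paged state.
        previous, self.irql = self.irql, DISPATCH_LEVEL
        try:
            slot = self.nonpaged_index[key]
        finally:
            self.irql = previous  # lock released, IRQL restored
        # Pageable data is touched only after the lock is released.
        require_pageable_irql(self.irql)
        return self.paged_records[slot]
```

The design choice here is to copy the minimum needed out of non-paged state under the lock, then do the pageable work at the original IRQL, with the guard making any regression fail deterministically in testing.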
Lessons Learned:
- The critical importance of thorough IRQL validation in kernel driver development to prevent system crashes
- The need for enhanced pre-release testing environments that better simulate real-world conditions, particularly high-memory-pressure scenarios
- The value of quick rollback capabilities when dealing with kernel-level components
- The importance of maintaining detailed version control and change documentation to quickly identify potential causes of issues
- The effectiveness of rapid customer communication and transparent incident handling in maintaining trust
Future Recommendations:
- Enhance documentation and communication around the intended purpose of the lc:latest tag, emphasizing its role as a testing/staging version rather than for production use
- Develop best practices documentation for rolling out new agent versions across customer environments, including recommended testing procedures and rollback strategies
- Added “Best Practices” to docs: https://docs.limacharlie.io/docs/endpoint-agent-versioning-and-upgrades#updating-endpoint-agents
- Update web app to display both latest and stable as current versions on the Deployed Versions page, removing UI indicators that suggest stable to be outdated.
- Implement a formal notification system for lc:latest version changes to help customers better track and manage testing cycles
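The split between latest and stable recommended above could be expressed as a simple rollout policy. The sketch below is purely illustrative and does not use any LimaCharlie API: the group name "test", the functions `tag_for` and `should_roll_back`, and the crash-rate threshold are all assumptions.

```python
def tag_for(group):
    # Assumption: only a dedicated test group tracks "latest";
    # every other group stays on "stable".
    return "latest" if group == "test" else "stable"


def should_roll_back(crash_reports, endpoints, max_crash_rate=0.0):
    # Hypothetical health gate for the test group: roll back when the
    # observed crash rate exceeds the tolerated rate.
    if endpoints <= 0:
        raise ValueError("endpoints must be positive")
    return crash_reports / endpoints > max_crash_rate
```

With the default threshold of zero, a single crash report in the test group is enough to trigger a rollback before the version reaches production endpoints.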
Affected services
Endpoint Agent