Endpoint Agent v4.32.1 BSOD Issue

Dec 11 at 07:18am PST
Affected services
Endpoint Agent

Resolved
Dec 11 at 07:18am PST

Summary Root Cause Analysis Report - Post-Mortem

Incident Overview:

On December 11th, 2024, a few LimaCharlie customers reported Blue Screen of Death (BSOD) crashes on Windows systems running version 4.32.1 of the LC EDR Agent. The issue affected only a small subset of systems: those running lc:latest while under high memory pressure.

Engineering quickly identified the issue and rolled back the agent version from 4.32.1 to 4.32.0 to mitigate the problem.

The incident response was swift: engineering identified and diagnosed the problem within approximately one hour of the first customer report. A community-wide announcement was made, and remediation was implemented through a version rollback within 2.5 hours of the initial report.

Timeline of Events:

  • 2024-12-10 @ 09:41am PT: LC EDR Agent version 4.32.1 was released
  • 2024-12-11 @ 05:02am PT: Multiple customers reported isolated Blue Screen of Death (BSOD) incidents on Windows systems running the LimaCharlie EDR Agent
  • 2024-12-11 @ 06:06am PT: LC engineering diagnosed a kernel driver defect in the latest endpoint agent version, 4.32.1
  • 2024-12-11 @ 07:18am PT: Announcement posted in LC Community Slack
  • 2024-12-11 @ 07:30am PT: LC Agent latest version rolled back to 4.32.0

Root Causes:

The root cause was identified as a kernel driver bug: improper IRQL handling when accessing paged memory while holding a spin lock. Acquiring a spin lock raises the interrupt request level (IRQL) to DISPATCH_LEVEL or higher, where page faults cannot be serviced. When the kernel driver (tmp_hbs_acq.sys) accessed pageable memory at that elevated IRQL and the page was not resident, Windows raised bug check D1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL). The bug was infrequent on test systems because, without memory pressure, the pageable memory usually stayed resident and the faulty access went unnoticed.
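
To illustrate why this class of bug is intermittent, here is a minimal user-mode sketch in Python. It is a toy model, not driver code: the names ToyCpu, PageableBuffer, BugCheck and buggy_path are hypothetical, and the model only encodes the rule described above (a page fault at DISPATCH_LEVEL or higher is fatal, while a resident page can be read at any IRQL).

```python
from dataclasses import dataclass
from enum import IntEnum


class Irql(IntEnum):
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2


class BugCheck(Exception):
    """Stands in for a Windows bug check (BSOD) in this toy model."""

    def __init__(self, code: int, name: str):
        super().__init__(f"bug check 0x{code:X} ({name})")
        self.code = code


@dataclass
class PageableBuffer:
    resident: bool  # False once the memory manager has paged the data out


class ToyCpu:
    """Hypothetical model of one processor's IRQL state."""

    def __init__(self) -> None:
        self.irql = Irql.PASSIVE_LEVEL

    def acquire_spin_lock(self) -> Irql:
        # Acquiring a spin lock raises the IRQL; return the previous level.
        old = self.irql
        self.irql = Irql.DISPATCH_LEVEL
        return old

    def release_spin_lock(self, old: Irql) -> None:
        self.irql = old

    def read(self, buf: PageableBuffer) -> str:
        if buf.resident:
            return "ok"  # no page fault, so the elevated IRQL goes unnoticed
        if self.irql >= Irql.DISPATCH_LEVEL:
            # The page fault cannot be serviced at this IRQL.
            raise BugCheck(0xD1, "DRIVER_IRQL_NOT_LESS_OR_EQUAL")
        buf.resident = True  # page fault serviced at low IRQL
        return "ok after page-in"


def buggy_path(cpu: ToyCpu, buf: PageableBuffer) -> str:
    """Touches pageable memory while the (modelled) spin lock is held."""
    old = cpu.acquire_spin_lock()
    try:
        return cpu.read(buf)
    finally:
        cpu.release_spin_lock(old)
```

In this model, buggy_path succeeds whenever the buffer is resident and raises BugCheck only when it has been paged out, which mirrors how the defect could pass testing on systems without memory pressure.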

Actions Taken:

  • Implement stricter IRQL validation checks in the kernel driver code to ensure proper memory access patterns
  • Add comprehensive memory access guards around paged pool operations when holding spin locks
  • Enhance the driver's memory management to properly handle high-pressure scenarios
  • Implement additional automated testing specifically for IRQL-related scenarios in the kernel driver
  • Enhance pre-release testing environment to better simulate high-memory-pressure scenarios and kernel-level interactions across diverse Windows configurations

These changes will be implemented and thoroughly tested before being included in a future release. The development team will also review other areas of the kernel driver code for similar potential IRQL issues.
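
As a rough illustration of the first two items, the sketch below models an IRQL guard in user-mode Python. It is an assumption about what such a check could look like, not LimaCharlie's implementation; ToyDriver, IrqlViolation and assert_pageable_access_allowed are hypothetical names. The idea is the same as the Windows PAGED_CODE() macro, which asserts that the current IRQL is at or below APC_LEVEL: the guard fails on every violating call, whether or not the page happens to be resident, so the defect surfaces in testing without memory pressure.

```python
from enum import IntEnum


class Irql(IntEnum):
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2


class IrqlViolation(AssertionError):
    """Raised by the guard; a real driver would assert or bug check instead."""


def assert_pageable_access_allowed(current: Irql) -> None:
    # Pageable memory may only be touched at APC_LEVEL or below.
    if current > Irql.APC_LEVEL:
        raise IrqlViolation(f"pageable access at {current.name}")


class ToyDriver:
    """Hypothetical driver state: pageable settings plus a lock-protected list."""

    def __init__(self, max_events: int) -> None:
        self.irql = Irql.PASSIVE_LEVEL
        self._pageable_settings = {"max_events": max_events}
        self.events: list[str] = []

    def _read_pageable(self, key: str) -> int:
        assert_pageable_access_allowed(self.irql)  # guard every pageable read
        return self._pageable_settings[key]

    def record_event_buggy(self, event: str) -> None:
        old, self.irql = self.irql, Irql.DISPATCH_LEVEL  # "acquire spin lock"
        try:
            # Pageable read while the lock is held: the guard rejects this.
            if len(self.events) < self._read_pageable("max_events"):
                self.events.append(event)
        finally:
            self.irql = old  # "release spin lock"

    def record_event_fixed(self, event: str) -> None:
        # Copy what is needed out of pageable memory before taking the lock.
        limit = self._read_pageable("max_events")
        old, self.irql = self.irql, Irql.DISPATCH_LEVEL
        try:
            if len(self.events) < limit:
                self.events.append(event)
        finally:
            self.irql = old
```

Here record_event_fixed reads the pageable value first and touches only lock-protected state while the lock is held, whereas record_event_buggy trips the guard on every call. For the testing items, Driver Verifier's Force IRQL Checking option serves a similar purpose on real systems by making pageable memory unavailable when a driver raises IRQL; whether it is part of the planned test environment is not stated here.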

Lessons Learned:

  • The critical importance of thorough IRQL validation in kernel driver development to prevent system crashes
  • The need for enhanced pre-release testing environments that better simulate real-world conditions, particularly high-memory-pressure scenarios
  • The value of quick rollback capabilities when dealing with kernel-level components
  • The importance of maintaining detailed version control and change documentation to quickly identify potential causes of issues
  • The effectiveness of rapid customer communication and transparent incident handling in maintaining trust

Future Recommendations:

  • Enhance documentation and communication around the intended purpose of the lc:latest tag, emphasizing its role as a testing/staging version rather than for production use
  • Develop best practices documentation for rolling out new agent versions across customer environments, including recommended testing procedures and rollback strategies
  • Update the web app to display both latest and stable as current versions on the Deployed Versions page, removing UI indicators that suggest stable is outdated
  • Implement a formal notification system for lc:latest version changes to help customers better track and manage testing cycles
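
One possible shape for the notification item above is a small check that compares tag-to-version snapshots and describes what changed. This is a sketch under assumptions, not an existing LimaCharlie API: the function names and the snapshot format (a plain mapping from tag name to a dotted numeric version string) are hypothetical.

```python
from __future__ import annotations


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '4.32.1' into (4, 32, 1)."""
    return tuple(int(part) for part in version.split("."))


def describe_tag_changes(
    previous: dict[str, str], current: dict[str, str]
) -> list[str]:
    """Return one message per tag whose version differs between snapshots."""
    messages = []
    for tag in sorted(current):
        old, new = previous.get(tag), current[tag]
        if old is None:
            messages.append(f"{tag}: now points to {new}")
        elif old != new:
            # A lower version than before means the tag was rolled back.
            rolled_back = parse_version(new) < parse_version(old)
            kind = "rolled back" if rolled_back else "updated"
            messages.append(f"{tag}: {kind} from {old} to {new}")
    return messages
```

Applied to this incident, comparing a snapshot where latest is 4.32.1 against one where latest is 4.32.0 yields the single message "latest: rolled back from 4.32.1 to 4.32.0". How such messages would be delivered to customers is left open.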