From workaround to upstream: how AMD fixed the Strix Halo boot crash
In my previous post I wrote about bugs that prevented Linux 7.0-rc2 from booting on my HP ZBook Ultra with AMD Strix Halo. I ended up writing my own workarounds to get things working. At the time I wasn't sure if my fixes were the right approach, but they got the job done.
Since then, AMD has fixed all three issues upstream. I'm now running 7.0-rc6 with zero local patches. Everything works: the webcam, s2idle suspend/resume, and the NPU with XDNA. No workarounds needed.
Here's what AMD did.
The NULL pointer dereference
This was the showstopper. The ISP driver crashed during boot because dev->type could be NULL for devices added by a new ACPI wakeup source registration path. My fix was a simple NULL check before dereferencing dev->type->name.
AMD's fix, authored by Pratap Nirujogi, landed in rc4 via commit 3fc4648b53b7.
- if (!dev->type->name) {
+ if (!dev->type || !dev->type->name) {
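To make the short-circuit explicit, here is a minimal userspace sketch of the same guard. The two structs are simplified stand-ins I made up for illustration, not the kernel's real `struct device` and `struct device_type`, and `type_name_missing()` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct device_type {
    const char *name;
};

struct device {
    const struct device_type *type; /* may be NULL for some devices */
};

/*
 * Returns true when the device has no usable type name.
 * The || operator short-circuits, so dev->type->name is only
 * read when dev->type is known to be non-NULL.
 */
static bool type_name_missing(const struct device *dev)
{
    return !dev->type || !dev->type->name;
}
```

A device with a NULL type and a device whose type has a NULL name both count as "missing" here; only the first case would have crashed with the original single check.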
The wrong modalias breaking module autoloading
The ISP's MFD children were inheriting the GPU's ACPI companion, giving them modalias acpi:LNXVIDEO: instead of platform:amd_isp_capture. My workaround was to temporarily hide the parent's firmware node around the mfd_add_hotplug_devices() call so the MFD core wouldn't assign the wrong ACPI identity.
It worked, but it was a hack. AMD found the real root cause, and it turned out to be deeper than expected.
The actual culprit was a regression in the ACPI bus layer. Commit 336aae5c4e1a in rc1 added a shortcut in acpi_companion_match() for backlight-type ACPI devices. When a device's ACPI companion had pnp.type.backlight set, the function returned it immediately, bypassing the acpi_primary_dev_companion() check. That check is the gatekeeper that prevents secondary devices (like MFD children sharing a parent's ACPI companion) from matching via ACPI. The GPU's ACPI companion has the backlight flag set, so all its children, including the ISP sub-devices, got matched as ACPI devices instead of platform devices.
The fix by Pratap Nirujogi (e7648ffecb7f, merged in rc5) simply removes that shortcut:
-	if (adev->pnp.type.backlight)
-		return adev;
Now acpi_companion_match() always goes through the proper primary device check. MFD children correctly fall through to platform bus matching. Three lines removed, problem solved.
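The effect of the shortcut is easier to see in a toy model. The sketch below is my own simplification and not the code in drivers/acpi/bus.c: the structs, the `primary` pointer, and the names `is_primary()`, `match_with_shortcut()` and `match_fixed()` are all hypothetical, and the real function performs more checks than this.

```c
#include <stdbool.h>
#include <stddef.h>

struct acpi_dev;

/* Hypothetical physical device that may share an ACPI companion. */
struct dev {
    struct acpi_dev *companion;
};

/* Hypothetical ACPI companion; 'primary' is the one device that owns it. */
struct acpi_dev {
    bool backlight;
    const struct dev *primary;
};

/* Stand-in for the primary-device gatekeeper check. */
static bool is_primary(const struct dev *d)
{
    return d->companion && d->companion->primary == d;
}

/* Regressed behaviour: the backlight flag bypasses the gatekeeper. */
static const struct acpi_dev *match_with_shortcut(const struct dev *d)
{
    const struct acpi_dev *adev = d->companion;

    if (!adev)
        return NULL;
    if (adev->backlight)
        return adev; /* secondary devices match here too */
    return is_primary(d) ? adev : NULL;
}

/* Fixed behaviour: every device goes through the gatekeeper. */
static const struct acpi_dev *match_fixed(const struct dev *d)
{
    const struct acpi_dev *adev = d->companion;

    if (!adev)
        return NULL;
    return is_primary(d) ? adev : NULL;
}
```

In this model, a child that shares the GPU's backlight-flagged companion gets an ACPI match from `match_with_shortcut()` but NULL from `match_fixed()`, which corresponds to the child falling through to platform bus matching.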
I find this one particularly interesting because it shows how a seemingly unrelated change in the ACPI subsystem can break something as specific as webcam module autoloading. My ISP driver workaround masked the symptom, but the actual bug was in drivers/acpi/bus.c, nowhere near the GPU or ISP code.
The I2C runtime resume crash
The last bug was a race condition in the ISP's I2C controller driver. During probe, pm_runtime_get_sync() triggered a runtime resume callback that tried to access the register map before it was initialized. My fix was a NULL check on i_dev->map in the resume callback to skip re-initialization when the device hadn't been fully probed yet.
AMD took a different approach. Pratap Nirujogi and Bin Du restructured the probe sequence entirely in commit e2f1ada8e089 (merged in rc5). Instead of patching the resume callback, they moved pm_runtime_enable() to after the device is fully initialized and used dev_pm_genpd_resume() to power on the ISP directly during probe. This way the runtime resume callback can never fire on an uninitialized device. The race condition isn't just handled; it's eliminated.
This is a cleaner solution. My NULL check was a safety net for a situation that shouldn't happen. Their fix makes sure it can't happen in the first place.
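The difference between the two probe orderings can be sketched in plain userspace C. This is a toy model under my own assumptions, not the driver's code: `struct i2c_dev`, `pm_get_sync()`, `probe_old_order()` and `probe_new_order()` are hypothetical names, and setting `powered` directly stands in for the dev_pm_genpd_resume() call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder for the register map. */
struct regmap {
    int unused;
};

/* Hypothetical driver state. */
struct i2c_dev {
    struct regmap *map;  /* NULL until probe has set it up */
    bool pm_enabled;     /* models pm_runtime_enable() having run */
    bool powered;
    int early_resumes;   /* resumes that saw an uninitialized map */
};

static struct regmap fake_map;

/* Resume callback: in the real driver, touching a NULL map crashed. */
static void runtime_resume(struct i2c_dev *d)
{
    if (!d->map) {
        d->early_resumes++; /* record instead of crashing */
        return;
    }
    d->powered = true;
}

/* Models a get_sync: the callback only fires once PM is enabled. */
static void pm_get_sync(struct i2c_dev *d)
{
    if (d->pm_enabled)
        runtime_resume(d);
}

/* Old ordering: runtime PM is live before the map exists. */
static void probe_old_order(struct i2c_dev *d)
{
    d->pm_enabled = true;
    pm_get_sync(d);      /* callback runs with map == NULL */
    d->map = &fake_map;
}

/* New ordering: power on directly, initialize, then enable runtime PM. */
static void probe_new_order(struct i2c_dev *d)
{
    d->powered = true;   /* stand-in for dev_pm_genpd_resume() */
    d->map = &fake_map;
    d->pm_enabled = true;
}
```

With the old ordering the model records one resume against an uninitialized map; with the new ordering there is no point at which the callback can run before `map` is set, so no NULL check is needed.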
Where things stand now
All three fixes landed between rc4 and rc5. I've been running 7.0-rc6 as my daily driver for a while now and everything is solid. The webcam works with proper module autoloading, s2idle suspend and resume are reliable, and the NPU runs fine with XDNA. No manual patches, no workarounds.
It's satisfying to see these get fixed properly. All three upstream commits came from Pratap Nirujogi at AMD, with Bin Du contributing to the I2C fix. My patches were quick workarounds to get things working, but AMD's fixes addressed the root causes at the right architectural layers. The ACPI bus fix in particular is a good example of that.
If you're on Strix Halo and have been holding off on 7.0, rc6 is in good shape.