How A GFX803 OpenMP Printf Bug Became An AMDGPU M0 Backend Fix

Posted on Jun 20, 2026 (Updated on Jun 21, 2026)

The failing case looked like a device printf or ROCr runtime problem: a tiny target teams distribute parallel for num_teams(2) num_threads(6) loop on gfx803 printed extra stale lines or crashed. The first working patches targeted LLVM GPU libc and RPC lane masks, but those were too broad and explained too much. The decisive control was gfx906 forced to COV4: the same source worked there. That moved the search from ROCr, COV4, and printf toward GFX8-specific AMDGPU codegen. The final reduced test was pure llc: a noinline callee that forced dynamic v_writelane_b32 through m0, followed by a caller-side flat/private reload. On GFX6-GFX8, m0 is architectural LDS/flat state and must be preserved across calls. Both the SelectionDAG path in SIFixSGPRCopies and the GlobalISel path in AMDGPUInstructionSelector::selectWritelane can borrow it, so the final fix saves and restores m0 in both paths when ldsRequiresM0Init() is true.

This was one of those bugs where the first successful fix was not the right fix.

The visible failure was small:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
#include <omp.h>
#include <stdio.h>

void foo(void) {
#pragma omp target teams distribute parallel for num_teams(2) num_threads(6)
  for (int i = 0; i < 18; i++)
    printf("teams id = %d, thread id = %d\n", omp_get_team_num(),
           omp_get_thread_num());
}

int main(void) {
  foo();
  return 0;
}

On the working stack, that prints 18 lines and exits normally.

On the broken gfx803 stack, the same shape could print many more lines than the loop had iterations. A representative failure was 48 output lines instead of 18, dominated by stale duplicates such as:

1
teams id = 0, thread id = 0

Related variants could crash while reading bogus varargs.

That made the bug look like a printf problem. It was not. printf was only the first convenient operation that made the corrupted state visible.

This post documents the path from the first symptom to the final fix:

the constraints of the gfx803 OpenMP stack,
why ROCr and code object version 4 were plausible suspects,
why the first GPU-libc/RPC patches seemed convincing,
how gfx906 became the control experiment,
how the root cause moved into LLVM’s AMDGPU backend,
and why the final patch preserves m0 around dynamic v_writelane_b32 lowering in both SelectionDAG and GlobalISel.

A diagram showing the debugging path from a teams-loop printf failure through ROCr, COV4, OpenMP scheduling, GPU libc RPC, gfx906 control tests, and finally the AMDGPU backend m0 fix. — Figure 1. The investigation moved down the stack only when each layer failed to explain the evidence. The useful turning point was not a more clever printf patch; it was a control case that made the bug gfx803-specific rather than COV4-specific.

The Debugging Recipe

Before the retrospective, here is the workflow I wish I had written down at the beginning. It is intentionally mechanical. Each step answers one question and decides whether to stay at the current layer or move lower.

The commands assume the local stack from this work:

1
2
3
4
export STACK="$HOME/Projects/llvm-22"
export CASE="$HOME/Projects/whiteboard/openmp/declare_target"
source "$STACK/env.sh"
cd "$CASE"

env.sh adds the generated compiler wrappers to PATH and exports LLVM_INSTALL, ROCM_PATH, DEVLIB, LD_LIBRARY_PATH, and the ROCr queue-size workaround used by this stack. The clang-amdgpu-openmp wrapper is a shortcut for $LLVM_INSTALL/bin/clang plus -fopenmp, -fopenmp-targets=amdgcn-amd-amdhsa, the configured --offload-arch list, and COV4 forwarding for any architecture listed in OMP_AMDGPU_COV4_ARCHES.

For one-architecture control commands below, I spell out $LLVM_INSTALL/bin/clang and use -Xarch_device -mcode-object-version=4. That form sends the COV4 option to the only device compilation in the command and is easier to reproduce outside this local wrapper.

The source file is the small teams-loop case shown above. The expected output is 18 lines: two teams, six threads per team, and eighteen loop iterations.

Step 1: Reproduce The Symptom And Count It

Start by turning “it looks wrong” into a number:

1
2
3
clang-amdgpu-openmp test.c -o test-m0
OMP_TARGET_OFFLOAD=MANDATORY ./test-m0 | tee out.txt
wc -l out.txt

Why this matters:

OMP_TARGET_OFFLOAD=MANDATORY prevents a silent fallback to the host.
tee out.txt keeps the raw evidence.
wc -l makes the failure precise.

The healthy result is:

1
18 out.txt

The broken gfx803 result was not “a little noisy.” A representative failure printed 48 lines for an 18-iteration loop. That number is the first clue: the loop is not merely formatting output incorrectly; extra stale lane state is becoming visible.

Step 2: Prove The Loop Scheduler Is Not Simply Running 48 Iterations

Save this as count_only.c:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#include <stdio.h>

int main(void) {
  int counter = 0;

#pragma omp target teams distribute parallel for num_teams(2) num_threads(6) \
    map(tofrom: counter)
  for (int i = 0; i < 18; i++) {
    #pragma omp atomic update
    counter++;
  }

  printf("counter=%d\n", counter);
  return counter == 18 ? 0 : 1;
}

Then compile and run it the same way:

1
2
clang-amdgpu-openmp count_only.c -o count-only
OMP_TARGET_OFFLOAD=MANDATORY ./count-only

Why this matters:

If the counter is 18, OpenMP is not scheduling 48 clean loop iterations.
If only printf shows extra activity, output is exposing corrupted state.
That moves the search toward device runtime calls, active-lane masks, varargs, or caller state around an out-of-line call.

This step prevents a common beginner mistake: staring at the loop scheduler when the loop scheduler is only the messenger.

Step 3: Check Whether This Is Any Device Printf Or This Specific Shape

Save this as direct_printf.c:

1
2
3
4
5
6
7
8
9
#include <omp.h>
#include <stdio.h>

int main(void) {
#pragma omp target
  printf("initial=%d team=%d thread=%d\n",
         omp_is_initial_device(), omp_get_team_num(), omp_get_thread_num());
  return 0;
}

Compile and run:

1
2
clang-amdgpu-openmp direct_printf.c -o direct-printf
OMP_TARGET_OFFLOAD=MANDATORY ./direct-printf

Why this matters:

If direct printf works, GPU libc and the host RPC server are not completely broken.
The failing shape is now narrower: an out-of-line device print from a teams-loop context with divergent lane participation.

At this point, GPU libc/RPC lane masks are a reasonable hypothesis. They are not yet proven.

Step 4: Save Compiler Temps And Look For The First Suspicious Pattern

Ask Clang to keep the intermediate files:

1
2
3
rm -f test-*
clang-amdgpu-openmp -save-temps=obj test.c -o test-m0
rg "vprintf|printf|v_writelane|s_mov_b32 m0|flat_load|flat_store" .

Why this matters:

vprintf confirms the device-side call path.
v_writelane_b32 and s_mov_b32 m0 point at AMDGPU backend lowering, not just C library logic.
flat_load after a call is interesting because the caller may reload private loop state after the callee returns.

The first search is intentionally broad. You are not trying to prove the root cause yet. You are finding the next layer to inspect.

Step 5: Separate COV4 From GFX803

At this point, gfx803 and code object version 4 are entangled. The old card needs COV4 in this stack, and the bug appears on the old card, so COV4 is a natural suspect.

If a gfx906 device is available, force the same source through COV4 on that device:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
"$LLVM_INSTALL/bin/llvm-offload-device-info"

"$LLVM_INSTALL/bin/clang" \
  -fopenmp \
  -fopenmp-targets=amdgcn-amd-amdhsa \
  --offload-arch=gfx906 \
  -Xarch_device -mcode-object-version=4 \
  test.c -o test-gfx906-cov4

OMP_TARGET_OFFLOAD=MANDATORY ./test-gfx906-cov4 | tee gfx906-cov4.txt
wc -l gfx906-cov4.txt

On a machine with multiple GPUs, use the device order reported by llvm-offload-device-info and set OMP_DEFAULT_DEVICE=<id> if you need to pin the run to one device. The important thing is not the numeric ID; it is that the same source and same COV4 setting behave differently across gfx803 and gfx906.

Then compare it with the gfx803 COV4 result:

1
2
3
4
5
6
7
8
9
"$LLVM_INSTALL/bin/clang" \
  -fopenmp \
  -fopenmp-targets=amdgcn-amd-amdhsa \
  --offload-arch=gfx803 \
  -Xarch_device -mcode-object-version=4 \
  test.c -o test-gfx803-cov4

OMP_TARGET_OFFLOAD=MANDATORY ./test-gfx803-cov4 | tee gfx803-cov4.txt
wc -l gfx803-cov4.txt

Why this matters:

If gfx906 + COV4 prints 18 lines, COV4 alone is not the root cause.
If gfx803 + COV4 still fails, the failure follows an architecture generation boundary.
That justifies looking for a pre-GFX9 AMDGPU codegen behavior.

This was the decisive control experiment. Without it, the early GPU libc/RPC patches looked much more convincing than they deserved.

Step 6: Inspect The Device Assembly Around The Call Boundary

Now search the generated assembly more narrowly:

1
2
rg -n "v_writelane_b32|s_mov_b32 m0|s_setpc_b64|s_swappc_b64|flat_load_dword" \
  test-*.s

Read around the matching assembly lines:

1
2
rg -n -C 12 "v_writelane_b32|s_mov_b32 m0|s_setpc_b64|flat_load_dword" \
  test-*.s

In the broken shape, the important sequence is:

1
2
3
s_mov_b32 m0, lane
v_writelane_b32 ..., m0
s_setpc_b64 ...

with no matching restore before return.

Why this matters:

s_setpc_b64 is the callee return.
If m0 is changed before return and not restored, the caller sees the wrong architectural value.
A caller-side flat_load_dword after the call can then reload loop/private state with bad m0 state on GFX6-GFX8.

This is the point where the hypothesis changes from “printf has a bad mask” to “a callee clobbers a register the caller expects to survive.”

Step 7: Check The ABI And The Target Predicate

Do not assume m0 is call-preserved just because the theory needs it. Check the LLVM source and documentation:

1
2
3
cd "$STACK/llvm_src"
rg -n "GFX6-GFX8: M0|All other registers are clobbered" llvm/docs/AMDGPUUsage.rst
rg -n "ldsRequiresM0Init" llvm/lib/Target/AMDGPU llvm/include/llvm

Why this matters:

The ABI documentation says M0 is preserved on GFX6-GFX8.
ldsRequiresM0Init() is LLVM’s existing subtarget boundary for the same generation split.
Using that predicate keeps the fix architectural rather than gfx803-name specific.

Then find the two lowering paths that can borrow m0 for dynamic writelane:

1
2
rg -n "V_WRITELANE_B32|selectWritelane|SIFixSGPRCopies|SReg_32_XM0RegClass" \
  llvm/lib/Target/AMDGPU

The important files are:

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp

The first one covers the SelectionDAG path. The second one covers GlobalISel. The complete fix has to handle both.

Step 8: Reduce Below OpenMP

The source-level reproducer proves user impact, but it is too large for an LLVM backend regression test. It depends on OpenMP, ROCr, GPU libc, RPC, and hardware.

The backend reducer should keep only the invariant:

a noinline callee forces dynamic llvm.amdgcn.writelane.i32;
that callee would borrow m0;
the caller performs a private/flat reload after the call.

Run the reduced test with normal SelectionDAG lowering:

1
2
3
4
5
cd "$STACK/llvm_src"

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O0 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX8

Then run the same test through strict GlobalISel:

1
2
3
4
5
6
cd "$STACK/llvm_src"

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O0 \
  -global-isel -global-isel-abort=1 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX8

Why strict GlobalISel matters:

-global-isel-abort=0 can silently fall back to SelectionDAG.
-global-isel-abort=1 proves the GlobalISel selector really handles the function.
Without this, the test could accidentally verify only the path already fixed.

Finally, check the GFX9 control:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
cd "$STACK/llvm_src"

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -O0 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX9

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -O0 \
  -global-isel -global-isel-abort=1 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX9

Why this matters:

gfx803 should save and restore m0.
gfx906 should not grow unnecessary GFX8 preservation code.
Both SelectionDAG and GlobalISel should be tested explicitly.

That is the debugging ladder in compact form. The rest of the post explains why each rung mattered and why several tempting fixes were eventually removed.

The Stack Was Already Unusual

The machine was not a normal ROCm workstation.

The target GPU was an AMD Radeon Pro WX 3200, a Polaris12 card that reports as gfx803. The host was loongarch64. The compiler baseline was upstream LLVM 22. The ROCr baseline eventually settled on a forked rocm-6.4.4 runtime because that release still contains the legacy GFX8 doorbell path needed by this card.

The working OpenMP stack had several local constraints:

build LLVM’s AMDGPU OpenMP plugin on LoongArch64;
build LLVM with both openmp and the top-level offload runtime enabled;
use ROCr 6.4.4 with small non-x86 portability patches;
build LLVM-compatible AMD device libraries instead of using the system ROCm bitcode;
use code object version 4 for gfx803;
accept AMDGPU HSA ELF ABI version 4 in libomptarget;
use the COV4 implicit argument size, 56 bytes, when launching COV4 kernels;
set LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64, because larger HSA queues fail on this setup.

The openmp versus offload split is easy to misstate. In LLVM 22, the host offloading runtime, plugins, tests, and LIBOMPTARGET_GPU_LIBC_SUPPORT build policy live under the top-level offload/ tree, and this stack must build that runtime. The historical openmp/device/... paths discussed below are the device-side OpenMP bitcode runtime sources that still exist in the pinned llvmorg-22.1.8 tree. This tag does not have an offload/DeviceRTL/ directory.

That context matters because it gave the bug many plausible hiding places.

If a device program crashes while using:

1
2
3
4
-fopenmp
-fopenmp-targets=amdgcn-amd-amdhsa
--offload-arch=gfx803
-Xarch_device -mcode-object-version=4

on a private ROCr runtime, an old GPU, and a non-mainstream host architecture, the first instinct is not “LLVM backend register preservation.” It is more likely:

the runtime is submitting a bad packet,
COV4 launch metadata is wrong,
the old GFX8 doorbell path is incomplete,
the GPU libc RPC server is mishandling a lane mask,
or OpenMP’s teams loop runtime is scheduling the wrong lanes.

All of those were reasonable hypotheses. Most of them were wrong.

First, Prove That The Runtime Can Launch Anything

Before debugging printf, the lower layers had to work at all.

The ROCr path had already gone through a basic bring-up sequence:

llvm-offload-device-info had to see the device as gfx803.
A raw HSA queue of size 64 had to be created.
A raw HSA barrier packet had to complete.
A scalar OpenMP target region had to run on the GPU.
A reduction test had to print the expected result:
1
sum = 15050

Those checks mattered because they separated “the runtime cannot dispatch work” from “one generated device program corrupts itself.”

The raw barrier packet was especially important. It proved that ROCr could:

enumerate the GPU,
create a usable queue,
ring the legacy GFX8 doorbell,
and make the GPU consume at least one AQL packet.

So when printf failed later, ROCr was not the first place to patch. ROCr only sees submitted packets, memory accesses, and queue state. If the device code builds a bad varargs packet or corrupts loop state before talking to the host RPC server, ROCr cannot repair that.

The Failing Shape Was Suspiciously Specific

The failing source was small, but the shape mattered:

1
2
3
4
#pragma omp target teams distribute parallel for num_teams(2) num_threads(6)
for (int i = 0; i < 18; i++)
  printf("teams id = %d, thread id = %d\n", omp_get_team_num(),
         omp_get_thread_num());

This was not just “device printf never works.”

A direct target-region printf worked:

1
2
3
#pragma omp target
printf("target printf initial=%d thread=%d team=%d\n",
       omp_is_initial_device(), omp_get_thread_num(), omp_get_team_num());

Small teams-loop variants also behaved differently depending on team/thread shape. That made it tempting to blame divergent control flow and active-lane masks. A printf inside a teams loop is a natural stress test for lane-sensitive device runtime code:

only a subset of lanes may reach a particular call site,
the runtime has to describe the active lanes correctly,
and a variadic call carries enough pointer/state complexity to fault if the wrong lane participates.

The first useful reduction was not a fix. It was a count-only version.

If the loop body updated an atomic counter instead of printing, the loop executed the expected number of iterations. That was strong evidence that OpenMP loop scheduling was not simply launching 48 iterations. The extra lines were not the loop runtime doing extra work in a clean way. They were stale or inactive lanes becoming visible through printf.

That finding pointed toward GPU libc and RPC.

The First Working Theory: GPU Libc RPC Lane Masks

LLVM GPU libc implements device-side printf for AMDGPU through an RPC path. The device code opens a port, describes participating lanes, and the host-side RPC server performs the output operation.

The early theory was:

The output wrapper is out of line, so the active-lane mask is captured in the wrong function context. In a divergent OpenMP teams loop, that could produce a full-wave or stale mask instead of the subset of lanes that actually called printf.

That theory fit the symptom well enough to produce a patch series:

Split OpenMP GPU-libc stdio support from OpenMP device-runtime allocation.
LLVM’s OpenMP device runtime used OMPTARGET_HAS_LIBC for more than one policy. Locally, we wanted printf and puts to resolve to LLVM GPU libc, but we did not want OpenMP internal allocation to route through LLVM libc’s GPU malloc/free path.
Mark GPU libc output entry points as call-site-inline.
The local patch introduced a LIBC_GPU_RPC_INLINE macro and applied it to GPU stdio wrappers such as printf, vprintf, fprintf, puts, fwrite, and related output functions. The idea was to force RPC mask calculation to happen at the real divergent OpenMP call site.
Guard GPU-side RPC callback invocation by the lane mask.
The callback code was changed so a GPU lane only handled its slot if the RPC packet mask included that lane.

This was not irrational. It matched the visible bug, it made the local tests pass, and it produced a plausible story:

count-only loop works,
output path fails,
output path has lane masks,
inline the output path and guard the mask.

The problem was that it was too convenient. It fixed printf, but it did not prove that printf was the root cause.

Why The First Fix Was Not Good Enough

The warning sign was scope.

The patch touched broad GPU libc and device-runtime behavior across the LLVM 22 offload build surface and the device-side OpenMP runtime sources:

offload/CMakeLists.txt and the LIBOMPTARGET_GPU_LIBC_SUPPORT build policy,
openmp/device/CMakeLists.txt and openmp/device/src/LibC.cpp in the llvmorg-22.1.8 DeviceRTL source tree,
libc/src/__support/macros/attributes.h,
many libc/src/stdio/gpu/*.cpp files,
and libc/shared/rpc.h.

That is a lot of surface area for a bug that only appeared on gfx803 in one small teams-loop shape.

The more important concern was conceptual:

If printf only exposes corrupted device state, then every printf-specific fix is a mask.

Imagine adding a new GPU runtime operation tomorrow:

1
2
3
printx(...);
printy(...);
printz(...);

If all of them would fail because the caller’s state was already corrupt, then patching printf would be the wrong engineering move. The right fix would be where the state became corrupt.

That is why the first working patch series was eventually removed. It was useful as a debugging step, but it had the wrong shape for a root-cause fix.

The Control Experiment: Force COV4 On GFX906

The decisive experiment came from reinstalling the MI50, a Vega 20 GPU that reports as gfx906.

At first, the old card and code object version 4 were entangled:

gfx803 needed COV4 for the local stack,
the failing case ran on gfx803,
and COV4 support had been re-enabled locally in LLVM 22.

So COV4 itself remained a suspect.

The control was simple:

Run the same source on gfx803 with COV4 and without the local printf workaround.
Confirm that the old failure still reproduces.
Run the same source on gfx906, also forced to COV4.
Check whether the same corruption appears.

The result changed the investigation:

1
2
gfx803 + COV4: bad
gfx906 + COV4: good

That ruled out “COV4 alone” as the explanation.

It also made the broad RPC hypothesis less satisfying. The same LLVM GPU libc RPC path could run through COV4 on gfx906 without producing the same stale lines. Something about the pre-GFX9 backend path mattered.

A test matrix showing gfx803 with COV4 fails before the backend fix, gfx906 with COV4 passes, gfx906 default COV also passes, and gfx803 with the m0 preservation fix passes. — Figure 2. The matrix separated code object version from architecture generation. COV4 was necessary for the old card, but it was not sufficient to cause the bug.

Looking Below Printf

The next step was to inspect what printf made the compiler generate.

The important pattern was not “RPC” by itself. It was an out-of-line callee, generated for GPU libc/RPC support, followed by caller-side reloads of private loop state.

On AMDGPU, private memory can be accessed through flat addressing. On older generations, some local/LDS/flat behavior still depends on a special scalar register named m0.

That was the missing background.

I did not initially know what m0 was, and that was part of the debugging problem. It is not just another temporary SGPR. In the AMDGPU ABI for GFX6-GFX8:

kernel prolog code initializes m0 to a value suitable for LDS access;
m0 is used for LDS range checking on GFX6-GFX8;
GFX9 and later no longer need m0 for LDS range checking;
on function exit, GFX6-GFX8 m0 is documented as preserved.

The plain-language version is:

On GFX6-GFX8, if a callee borrows m0, it has to return it in the condition it found it. Otherwise the caller may use a bad architectural register value for later memory operations.

That is exactly the kind of bug that can make a later printf look guilty even though the actual corruption happened earlier.

The Actual Bad Sequence

The broken sequence looked like this:

The caller enters with a valid m0 value.
The caller calls an out-of-line GPU libc/RPC helper.
Inside the callee, LLVM lowers a dynamic v_writelane_b32.
SIFixSGPRCopies borrows physical m0 as the lane selector.
The callee returns without restoring the original m0.
Back in the caller, later flat/private reloads run with the wrong m0.
Loop state becomes stale or corrupt.
printf observes extra lanes or bad varargs.

In the concrete failure, small values such as lane ids were especially bad. A callee might return with:

1
m0 = 5

instead of the caller’s valid LDS/flat value. Subsequent private/flat reloads could then read the wrong state.

This explained the strange symptom:

the loop did not schedule extra clean iterations;
instead, stale lane state made extra output appear;
variadic data could be read through corrupted state;
and gfx906 avoided the same failure because GFX9 no longer has the same m0 requirement for LDS range checking.

A flow diagram showing caller m0 valid, callee borrows m0 for dynamic writelane, callee returns with lane id in m0, caller flat reload reads stale state, and printf exposes extra lanes. — Figure 3. `printf` was the visible symptom. The actual bug was a callee that borrowed architectural `m0` on GFX8 and returned it clobbered.

Why Dynamic Writelane Used M0

The first relevant LLVM code lives in SIFixSGPRCopies.

v_writelane_b32 is special. On subtargets with a one-SGPR constant-bus limit, the instruction still has to obey the “one SGPR” rule. The lane selector can use m0 without counting as a normal constant-bus use, so the pass may rewrite the instruction by moving one source operand into m0.

The simplified lowering shape is:

1
2
s_mov_b32 m0, lane_selector
v_writelane_b32 vN, src, m0

That is fine if m0 is just scratch state.

On GFX6-GFX8, it is not scratch state.

LLVM already has a subtarget predicate for this boundary:

1
2
3
bool ldsRequiresM0Init() const {
  return getGeneration() < GFX9;
}

That predicate means: this target is before GFX9, so LDS/flat behavior still requires a valid m0 value.

The bug was not that LLVM used m0 at all. The bug was that this lowering path borrowed m0 without preserving the incoming value on targets where m0 is part of the call-preserved architectural state.

The follow-up audit found the same idea in GlobalISel: AMDGPUInstructionSelector::selectWritelane can also copy a dynamic lane selector through physical M0. That made the first backend patch incomplete. The user-facing OpenMP bug happened to expose the SelectionDAG path, but an ABI fix has to cover every backend path that can return from a callee with m0 clobbered.

The Pure Backend Reproducer

The clean test had to remove OpenMP, ROCr, GPU libc, and the RPC server.

The reduced LLVM IR does three things:

Create a noinline callee that forces dynamic llvm.amdgcn.writelane.i32.
Call it from a kernel.
Perform a caller-side private/flat reload after the call.

The important part is the callee:

1
2
3
4
5
6
7
declare i32 @llvm.amdgcn.writelane.i32(i32, i32, i32)

define internal fastcc i32 @clobber_m0(i32 %src, i32 %lane, i32 %old) #0 {
entry:
  %r = call i32 @llvm.amdgcn.writelane.i32(i32 %src, i32 %lane, i32 %old)
  ret i32 %r
}

And the caller-side reload:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
define amdgpu_kernel void @caller(ptr addrspace(1) %out, i32 %lane) #1 {
entry:
  %slot = alloca i32, align 4, addrspace(5)
  store volatile i32 123, ptr addrspace(5) %slot, align 4
  %old = load volatile i32, ptr addrspace(5) %slot, align 4
  %r = call fastcc i32 @clobber_m0(i32 777, i32 %lane, i32 %old)
  %slot.generic = addrspacecast ptr addrspace(5) %slot to ptr
  %reload = load volatile i32, ptr %slot.generic, align 4
  %sum = add i32 %reload, %r
  store volatile i32 %sum, ptr addrspace(1) %out, align 4
  ret void
}

The test checks two targets and two lowering paths.

The RUN lines are deliberately explicit:

1
2
3
4
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O0 < %s | FileCheck %s --check-prefix=GFX8
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O0 -global-isel -global-isel-abort=1 < %s | FileCheck %s --check-prefix=GFX8
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -O0 < %s | FileCheck %s --check-prefix=GFX9
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -O0 -global-isel -global-isel-abort=1 < %s | FileCheck %s --check-prefix=GFX9

The first and third lines cover the normal SelectionDAG path. The second and fourth lines force GlobalISel and make fallback a hard error. That detail matters: -global-isel-abort=0 would allow a silent fallback to SelectionDAG, which could hide a missing AMDGPUInstructionSelector fix.

For gfx803, the generated assembly must save and restore m0:

1
2
3
4
; GFX8: s_mov_b32 [[SAVE:s[0-9]+]], m0
; GFX8: s_mov_b32 m0, [[LANE:s[0-9]+]]
; GFX8: v_writelane_b32 {{v[0-9]+}}, {{s[0-9]+}}, m0
; GFX8: s_mov_b32 m0, [[SAVE]]

For gfx906, the test verifies that the GFX8-specific save/restore is not inserted:

1
2
; GFX9: v_writelane_b32 {{v[0-9]+}}, {{s[0-9]+}}, m0
; GFX9-NOT: s_mov_b32 m0,

That is what made the fix upstream-grade enough for a local patch series: it was no longer an OpenMP reproducer, a ROCr reproducer, or a printf reproducer. It was a backend reproducer, and it checked both AMDGPU instruction selection routes that can borrow m0.

The Final Patch

The final patch is still small, but it has two code paths.

In the SelectionDAG path, SIFixSGPRCopies saves m0 into a normal SGPR before borrowing it if the subtarget still needs valid m0 for LDS/flat operations:

1
2
3
4
5
6
7
Register SavedM0;
if (ST.ldsRequiresM0Init()) {
  SavedM0 = MRI->createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);
  BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), TII->get(AMDGPU::COPY),
          SavedM0)
      .addReg(AMDGPU::M0);
}

Then perform the existing borrow:

1
2
3
4
BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), TII->get(AMDGPU::COPY),
        AMDGPU::M0)
    .add(Src1);
Src1.ChangeToRegister(AMDGPU::M0, false);

And restore immediately after the v_writelane_b32:

1
2
3
4
5
6
if (SavedM0.isValid()) {
  MachineBasicBlock::iterator RestorePt = std::next(MI.getIterator());
  BuildMI(*MI.getParent(), RestorePt, MI.getDebugLoc(),
          TII->get(AMDGPU::COPY), AMDGPU::M0)
      .addReg(SavedM0, RegState::Kill);
}

These snippets use LLVM 22’s BuildMI(MachineBasicBlock &, MachineInstr &, ...) overload. Passing MI here is not pseudocode; the local LLVM build used this exact form. The equivalent iterator form would also be possible, but the patch keeps the style already used in nearby AMDGPU backend code.

The same preservation is needed in the GlobalISel selector. The equivalent lowering lives in AMDGPUInstructionSelector::selectWritelane, where the lane selector can also be constrained to SReg_32_XM0RegClass and copied through physical M0:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
Register SavedM0;
...
if (STI.ldsRequiresM0Init()) {
  SavedM0 = MRI->createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);
  BuildMI(*MBB, *MIB, DL, TII.get(AMDGPU::COPY), SavedM0)
      .addReg(AMDGPU::M0);
}

BuildMI(*MBB, *MIB, DL, TII.get(AMDGPU::COPY), AMDGPU::M0)
    .addReg(LaneSelect);
MIB.addReg(AMDGPU::M0);
...
if (SavedM0.isValid()) {
  BuildMI(*MBB, std::next(MIB->getIterator()), DL, TII.get(AMDGPU::COPY),
          AMDGPU::M0)
      .addReg(SavedM0, RegState::Kill);
}

The GlobalISel part is easy to miss because OpenMP normally reached the SelectionDAG path in this setup. But the ABI rule is not conditional on which instruction selector produced the callee. If GlobalISel can emit the same dynamic writelane-through-m0 sequence on GFX6-GFX8, it has to preserve m0 too.

This is narrower than the earlier GPU-libc/RPC workaround in three ways.

First, it is not specific to printf. If another device runtime operation exposes the same caller-state corruption later, the same backend fix applies.

Second, it is not broad GPU libc policy. It does not change stdio wrappers, allocation selection, or generic RPC callback behavior.

Third, it is gated by an existing architectural predicate:

1
ST.ldsRequiresM0Init()

That currently means getGeneration() < GFX9, so the extra save/restore applies to GFX6-GFX8 and avoids changing the generated code for GFX9+.

Verification

The final validation happened at several layers.

Patch application against pinned sources:

1
PREPARE_ONLY=1 ./setup_openmp_offload_stack.sh /tmp/amd-omp-stack-prepare-check

Backend regression through the stack’s verification helper:

1
2
3
4
5
6
source scripts/common.sh
source scripts/verify.sh
LLVM_SRC="$HOME/Projects/llvm-22/llvm_src" \
LLVM_BUILD="$HOME/Projects/llvm-22/llvm_build" \
LLVM_INSTALL="$HOME/Projects/llvm-22/llvm_install" \
  run_amdgpu_backend_regression_if_available

Direct backend regression commands:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
cd "$STACK/llvm_src"

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O0 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX8

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O0 \
  -global-isel -global-isel-abort=1 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX8

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -O0 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX9

llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 -O0 \
  -global-isel -global-isel-abort=1 \
  < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
  FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX9

Optimization sweep for the GFX8 target:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
cd "$STACK/llvm_src"

for opt in 0 1 2 3; do
  llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O"$opt" \
    < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
    FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX8

  llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -O"$opt" \
    -global-isel -global-isel-abort=1 \
    < llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll |
    FileCheck llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll --check-prefix=GFX8
done

Device discovery:

1
2
3
Name: gfx906
Name: gfx803
Name: Virtual Host Device

Mixed-device OpenMP smoke test:

1
2
3
openmp amdgpu smoke: devices=2 checked=2
  device=0 sum=120 on_device=1
  device=1 sum=1720 on_device=1

The original gfx803 teams-loop reproducer:

1
status=0 lines=18

The reduction test:

1
sum = 15050

The most important verification was not the final “it works” run. It was the negative/positive split:

gfx803 without the fix still showed the old corruption;
gfx906 forced to COV4 did not;
gfx803 with the backend m0 fix printed exactly 18 lines.

That split is what kept the final patch narrow.

What The Old Patches Taught Us

The removed patches were not wasted work.

They proved several useful things:

GPU libc was actually linked and the RPC server could run.
printf and puts could work on this stack.
The failure was after kernel launch, not at device discovery or queue creation.
Lane-sensitive runtime code was a plausible exposure point.

But they also illustrated a common debugging trap:

A patch that makes the symptom disappear can still be at the wrong layer.

The first patch series made sense when the only failing observation was “device stdio in a divergent teams loop fails.” It became less defensible after the gfx906 COV4 control worked and after the reduced backend test reproduced the exact missing preservation pattern.

That is the main lesson from this bug. The fix should be where the invariant is violated, not where the symptom is printed.

Practical Takeaways

There are a few lessons I would carry into the next offloading bug.

First, keep a control GPU if possible. The MI50 was not just a faster or newer device. It was the experiment that separated “COV4 is broken” from “the pre-GFX9 backend path is broken.”

Second, never trust printf as the root cause just because printf is the first visible failure. Variadic output is excellent at exposing corrupted state. That does not mean output created the corruption.

Third, reduce below the runtime when the evidence points below the runtime. A pure llc reproducer is dramatically stronger than an OpenMP reproducer when the suspected bug is in instruction selection or a late machine pass.

Fourth, if the backend has more than one lowering path, test the one you think you fixed and the one you might have forgotten. SelectionDAG covered the path that exposed this OpenMP bug, but GlobalISel had equivalent m0 borrowing logic. Using -global-isel-abort=1 made that test strict instead of allowing a quiet fallback.

Fifth, old architecture support often fails at forgotten architectural seams. For GFX6-GFX8, m0 is one of those seams. It is easy to think of it as a temporary scalar register until a callee returns with the wrong value and the caller starts reloading stale private state.

Finally, keep local patches indexed and disposable. The early GPU-libc/RPC patches were easy to delete because they were isolated patch files. Once the root cause moved, the patch series could move with it.

Conclusion

The final state is much simpler than the path that found it.

The OpenMP stack still needs local work for this old card:

ROCr 6.4.4 for the legacy GFX8 doorbell path,
LoongArch64 host portability patches,
COV4 ELF acceptance,
COV4 implicit argument sizing,
explicit gfx803 COV4 compile flags,
queue size 64.

But the teams-loop printf corruption is no longer handled by a special GPU libc workaround. It is handled where it belongs: in the AMDGPU backend lowering paths that borrow m0.

The fixed rule is straightforward:

If any dynamic v_writelane_b32 lowering path borrows m0 on a target where LDS/flat operations require m0 to remain valid, save it first and restore it after the instruction.

That is the kind of conclusion I want from a debugging session: fewer patches, smaller scope, a pure regression test, and a reason the fix applies to the architecture rather than to one lucky symptom.

The Debugging Recipe

Step 1: Reproduce The Symptom And Count It

Step 2: Prove The Loop Scheduler Is Not Simply Running 48 Iterations

Step 3: Check Whether This Is Any Device Printf Or This Specific Shape

Step 4: Save Compiler Temps And Look For The First Suspicious Pattern

Step 5: Separate COV4 From GFX803

Step 6: Inspect The Device Assembly Around The Call Boundary

Step 7: Check The ABI And The Target Predicate

Step 8: Reduce Below OpenMP

The Stack Was Already Unusual

First, Prove That The Runtime Can Launch Anything

The Failing Shape Was Suspiciously Specific

The First Working Theory: GPU Libc RPC Lane Masks

Why The First Fix Was Not Good Enough

The Control Experiment: Force COV4 On GFX906

Looking Below Printf

The Actual Bad Sequence

Why Dynamic Writelane Used M0

The Pure Backend Reproducer

The Final Patch

Verification

What The Old Patches Taught Us

Practical Takeaways

Conclusion

Table of contents