Moving GFX803 LLVM OpenMP Offloading From COV4 To COV5

Posted on (Updated on )
The first working WX3200/gfx803 OpenMP stack used COV4 because it got real kernels running quickly. That was useful but wrong as a maintenance direction: LLVM 22’s offload loader rejects AMDGPU ABI v4, and keeping COV4 meant reintroducing more removed legacy behavior. Auditing upstream LLVM 10 through 22 and ROCm/AOMP LLVM from the old 3.3 HCC/OCL tags through the audited ROCm 7.2.4 tag showed the better path: old LLVM 10/11 worked because gfx803 was native in the old DeviceRTL/COV-era stack; modern LLVM kept pieces of the COV5 pre-gfx9 design, especially hidden implicit-argument offsets at 192/196/200, but LLVM 22’s libomptarget did not populate those fields and the backend loaded them through the wrong base in non-entry functions. The final stack compiles gfx803 as COV5, fills pre-gfx9 COV5 implicit args from the ROCr queue, uses the callee implicitarg.ptr for non-entry hidden-arg loads, keeps the separate m0 preservation fix, and verifies the result with llc regressions plus a mixed gfx803/gfx906 OpenMP smoke test.

The first end-to-end result for the WX3200 was intentionally pragmatic:

1
LLVM 22.1.8 + ROCr 6.4.4 + gfx803 + COV4

That stack proved the hard parts were possible on the LoongArch64 machine:

  • KFD could expose the Polaris12 GPU as gfx803.
  • ROCr 6.4.4 could create a queue and ring the old GFX8 doorbell.
  • LLVM’s AMDGPU OpenMP plugin could launch real OpenMP target regions.
  • The generated program could run reductions and device-side output.

But a working first path is not always the path to keep.

COV4 was a shortcut. It required local patches to make LLVM 22’s offload loader accept AMDGPU ABI v4 and to make libomptarget use the old 56-byte implicit argument area. That felt wrong once the stack also had an MI50 (gfx906) and once the debugging moved from “can this run at all?” to “what is the narrow patch set we can maintain?”

The better question became:

Can gfx803 run LLVM OpenMP offloading through COV5, using the modern ABI family, instead of resurrecting more of COV4?

The answer was yes. The route was not “turn on COV5 and hope.” It required following the traces left in old LLVM and ROCm/AOMP toolchains, then finishing the incomplete pre-gfx9 COV5 path in two specific places.

Timeline of LLVM and ROCm AMDGPU OpenMP support showing old gfx803 DeviceRTL support, COV5 hidden argument pieces, migration to offload, and final local COV5 fixes.

Figure 1. The important lesson from the history audit was not that old code should be copied back. It was that COV5 already had a pre-gfx9 design shape, but modern LLVM no longer exercised it end to end for gfx803 OpenMP.

Terms First

Code object version, or COV, is the AMDGPU executable ABI version encoded in the GPU image. Clang and LLD produce an AMDGPU ELF image, and libomptarget loads that image through ROCr.

For this work:

  • COV4 means AMDGPU HSA ELF ABI v4.
  • COV5 means AMDGPU HSA ELF ABI v5.
  • COV6 is LLVM 22’s default AMDGPU code object version.

Pre-gfx9 means AMD GPU generations before GFX9, including gfx803. Those chips do not behave the same as gfx906 for all private/shared address lowering. In particular, the compiler may need private and LDS aperture base information that newer chips can get differently.

Implicit arguments are hidden kernel-launch fields passed by the runtime to the device code. User C code never mentions them. LLVM-generated AMDGPU code can still load them. For COV5, the implicit-argument block is 256 bytes.

DeviceRTL is the OpenMP device runtime library. Seeing gfx803 in the AMDGPU backend’s processor table only means LLVM knows the ISA. Seeing it in DeviceRTL build lists means the OpenMP device runtime was intentionally built for that architecture.

The Initial COV4 State

The first repo state looked like this:

1
d00b0e2 Bootstrap gfx803 OpenMP offload stack

It carried these LLVM-side ideas:

1
2
3
4
5
6
0001 openmp-enable-amdgpu-plugin-on-loongarch64
0002 offload-accept-amdgpu-hsa-elf-abi-v4
0003 amdgpu-use-cov4-implicit-argument-size
0004 openmp-split-gpu-libc-stdio-and-allocation
0005 libc-inline-gpu-stdio-rpc-entrypoints
0006 libc-honor-rpc-lane-mask-on-gpu

The COV4 pieces did two things:

  1. Let the offload ELF checker accept AMDGPU ABI v4.
  2. Tell libomptarget that COV4 uses a 56-byte implicit-argument area.

That was enough to make the old card useful. It was also a sign that we were reintroducing a path LLVM 22 had moved away from. The printf patches were similarly useful as experiments, but the later m0 investigation showed that the real bug was in AMDGPU backend register preservation, not in the GPU libc RPC layer.

The COV4 stack was valuable because it established a working lower bound. It was not the right final design.

The Combinations Tried

The useful evidence came from trying combinations, not from assuming that gfx803 implied COV4.

Toolchain/runtime comboResultWhat it taught us
LLVM 22 default COV6 + early gfx803 setupNot the working pathLLVM 22 defaults to COV6, but old gfx803 support was not complete just by using the default.
LLVM 22 + ROCr 7.x + gfx803Proof-of-concept onlyROCr 7.x had removed the legacy GFX8 DoorbellType == 1 queue path; patching it back was too broad.
LLVM 22 + ROCr 6.4.4 + gfx803 + COV4Worked after local COV4 patchesGood bootstrap path, but it restored removed ELF/implicit-arg behavior.
LLVM 22 + ROCr 6.4.4 + gfx906 + COV5/defaultWorkedMI50 did not need the gfx803 COV4 escape route.
LLVM 22 + ROCr 6.4.4 + gfx906 forced to COV4Worked for the printf control caseCOV4 alone was not the cause of the m0/printf failure.
LLVM 22 + ROCr 6.4.4 + gfx803 + COV5 before fixesFailed in small OpenMP target/teams shapesThe modern ABI family existed, but pre-gfx9 launch metadata and non-entry hidden-arg lowering were incomplete.
LLVM 22 + ROCr 6.4.4 + gfx803 + COV5 after fixesWorkedThis became the maintained direction.
LLVM 22 + ROCr 6.4.4 + gfx803 COV5 + gfx906 default in one binaryWorkedOne process can use both GPUs with arch-specific code-object policy.

The final local wrapper policy is now:

1
2
--offload-arch=gfx803 -Xarch_gfx803 -mcode-object-version=5
--offload-arch=gfx906

That means gfx803 is explicitly held at COV5 while gfx906 uses the default newer path.

The Upstream LLVM Audit

The first suspicion was that COV5 and gfx803 were never meant to work together. The old source history did not support that simple answer.

The audit used shallow tag fetches and source grep over these families:

1
2
llvmorg-10.0.1 ... llvmorg-22.1.8
rocm-hcc-3.3.0, rocm-ocl-3.3.0, rocm-3.5.0 ... rocm-7.2.4

The commands were intentionally simple. For example:

1
2
3
4
5
6
7
8
9
LLVM_SRC=$HOME/Projects/llvm-22/llvm_src

git -C "$LLVM_SRC" grep -n 'gfx803' llvmorg-10.0.1 -- openmp

git -C "$LLVM_SRC" grep -n 'PRIVATE_BASE_OFFSET = 192' llvmorg-22.1.8 -- \
  llvm/lib/Target/AMDGPU

git -C "$LLVM_SRC" grep -n 'struct AMDGPUImplicitArgsTy' llvmorg-22.1.8 -- \
  offload/plugins-nextgen/amdgpu

The audit found this progression:

LLVM Release LineFinding
LLVM 10.0.1 / 11.1.0The old openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt built for gfx700 gfx701 gfx801 gfx803 gfx900. This matches the recollection that old LLVM worked natively for gfx803, but it predates the modern COV5/COV6 path.
LLVM 12 / 13The classic AMDGPU libomptarget plugin appears, and gfx803 remains in old DeviceRTL lists. COV4 constants appear in the AMDGPU backend around this era, but this is still not a modern COV5 OpenMP solution.
LLVM 14 / 15The newer openmp/libomptarget/DeviceRTL path includes gfx803. LLVM 15 has PRIVATE_BASE_OFFSET = 192 and QUEUE_PTR_OFFSET = 200 in the AMDGPU backend. That is the first strong clue that hidden COV5 pre-gfx9 fields are a real design, not something invented locally.
LLVM 16 / 17The nextgen AMDGPU plugin appears beside the classic plugin. It has a COV5 implicit-argument struct, but the runtime-side struct only covers the common fields up to dynamic LDS and padding. The backend knows the hidden offsets; the runtime does not fill them.
LLVM 18 / 19COV5 and COV4 coexist in parts of libomptarget. LLVM 19 moves the offload runtime out of openmp/libomptarget into offload. gfx803 is still listed in DeviceRTL in LLVM 19.
LLVM 20gfx803 disappears from the upstream DeviceRTL architecture list, even though the AMDGPU backend still knows the processor and still has the hidden implicit-argument offsets. This is the point where “the backend knows gfx803” and “OpenMP ships a complete gfx803 path” clearly diverge.
LLVM 21 / 22The offload plugin is COV5+ oriented. LLVM 22’s common ELF check rejects AMDGPU ABI versions below 5 with “must be version 5 or above”. The COV5 implicit-argument struct still does not expose or populate the pre-gfx9 hidden fields.

That history changed the direction of the patch.

If LLVM 10/11 worked, copying that whole old path back would mean going back to the old DeviceRTL/plugin/COV assumptions. That is a trap. The more relevant signal was LLVM 15 onward: the backend had COV5 hidden-field offsets for private base, shared base, and queue pointer. The design was already there. LLVM 22 just did not complete the path for our pre-gfx9 OpenMP use.

The ROCm/AOMP LLVM Audit

The AMD downstream history filled in another part of the story.

I am calling this the ROCm/AOMP audit because these tags represent AMD’s LLVM toolchain history for OpenMP offloading, including the old ROCm 3.3 HCC/OCL split tags and the later unified ROCm LLVM tags. The important range was not just “a recent ROCm release.” It was the whole arc from the old gfx803-capable runtime through the current audited 7.x line:

The tag audit included the old 3.3 split tags:

1
2
rocm-hcc-3.3.0
rocm-ocl-3.3.0

Regular ROCm LLVM tags then continue from rocm-3.5.0 through the audited rocm-7.2.4 tag.

The concrete grep checks are reproducible:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
LLVM_SRC=$HOME/Projects/llvm-22/llvm_src

# Use glob pathspecs so old tags that lack newer directories like offload do
# not fail, while still avoiding a full-tree grep.
for tag in rocm-hcc-3.3.0 rocm-3.5.0 rocm-5.7.1 rocm-6.4.4 rocm-7.2.4; do
  echo "== $tag =="
  git -C "$LLVM_SRC" grep -n 'gfx803' "$tag" -- \
    ':(glob)openmp/**' ':(glob)offload/**' ':(glob)libc/**' |
    head -20
done

for tag in rocm-5.7.1 rocm-6.4.4 rocm-7.2.4; do
  echo "== $tag implicit args =="
  git -C "$LLVM_SRC" grep -E -n 'AMDGPUImplicitArgsTy|IMPLICITARGS' "$tag" -- \
    ':(glob)openmp/**' ':(glob)offload/**' \
    ':(glob)llvm/lib/Target/AMDGPU/**' |
    head -40
done

git -C "$LLVM_SRC" grep -E -n \
  'PRIVATE_BASE_OFFSET|SHARED_BASE_OFFSET|QUEUE_PTR_OFFSET' \
  rocm-5.7.1 rocm-6.4.4 rocm-7.2.4 -- llvm/lib/Target/AMDGPU

The useful findings:

ROCm/AOMP lineFinding
ROCm HCC/OCL 3.3 and ROCm 3.5gfx803 is in the old AMDGPU DeviceRTL build list, for example openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt lists gfx700 gfx701 gfx801 gfx803 gfx900. There is no COV5 machinery. This confirms old gfx803 OpenMP support existed, but in an older ABI/toolchain shape.
ROCm 4.0 / 4.5gfx803 remains in DeviceRTL and hostcall/libm-related lists. COV4 constants are present. Still not the modern COV5 solution.
ROCm 5.0Both old and newer OpenMP runtime directories exist, and gfx803 remains in the build lists. This is the transition period where the old support and newer runtime structure overlap.
ROCm 5.7The tree has both the classic AMDGPU plugin and the nextgen plugin. openmp/libomptarget/DeviceRTL, old deviceRTLs/amdgcn, hostexec, hostrpc, libm, and libc GPU architecture lists still mention gfx803. The classic plugin defines COV4_SIZE = 56 and COV5_SIZE = 256, fills many COV5 fields explicitly, and the backend already has COV5 hidden offsets. This is a strong sign that AMD had overlapping COV4/COV5-era support, but not that LLVM 22’s final nextgen path is complete for gfx803.
ROCm 6.4.4offload/DeviceRTL still lists gfx803, and the nextgen plugin has both a COV5 implicit-argument struct and a 56-byte COV4 dummy struct. It also defaults to COV6 in the backend, and it still does not populate the pre-gfx9 COV5 private/shared/queue fields in the nextgen runtime path.
ROCm 7.2.4COV4 ELF loading is rejected by the common offload checker, like upstream LLVM 22. gfx803 appears only in source-level platform guards such as openmp/device/include/Platform.h, not as a normal DeviceRTL architecture list. The backend still has COV5 hidden-offset constants, but the launch-side population remains absent.

The old COV4 size was not guessed. ROCm 5.7’s classic plugin spells it out:

1
2
COV4_SIZE = 56
COV5_SIZE = 256

That explained why the first COV4 patch worked. It did not justify keeping it. The same audit showed that later toolchains were moving away from COV4 loading, while the backend kept COV5 pre-gfx9 offsets. That made COV5 the better target.

The conclusion was specific:

  • Do not recover the whole old COV4 path.
  • Do not blindly change generic COV5 behavior for all GPUs.
  • Continue the existing COV5 + pre-gfx9 design where the source already points.
  • Patch only the missing runtime population and the incorrect non-entry backend address base.

Root Cause 1: COV5 Launch Metadata Was Too Generic

The COV5 implicit-argument block is 256 bytes.

LLVM 22’s runtime-side struct looked like a generic COV5 block:

1
2
3
4
5
offset   0: block counts
offset  12: group sizes
offset  64: grid dimensions
offset 120: dynamic LDS size
rest: padding

For pre-gfx9 devices, the backend-side ABI has more fields:

1
2
3
offset 192: private segment aperture base high bits
offset 196: group/LDS segment aperture base high bits
offset 200: queue pointer

Those offsets already existed in the AMDGPU backend as PRIVATE_BASE_OFFSET, SHARED_BASE_OFFSET, and QUEUE_PTR_OFFSET. The runtime just was not filling them for COV5 launches.

COV5 implicit argument layout showing common launch fields, dynamic LDS size, and pre-gfx9 hidden fields at offsets 192, 196, and 200.

Figure 2. The local patch does not invent a new ABI. It expands libomptarget’s COV5 implicit-argument struct so the runtime can fill the hidden fields that the AMDGPU backend already knows how to load.

The fix in libomptarget does three narrow things:

  1. Keep the COV5 implicit-argument block at 256 bytes.
  2. Add explicit fields at offsets 192, 196, and 200.
  3. Fill the pre-gfx9 aperture fields only for gfx6, gfx7, and gfx8.

The queue pointer is always filled when the raw HSA queue is available. The private/shared aperture values come from ROCr’s AMD queue extension prefix. The patch uses static assertions for the ROCr 6.4.4 queue layout so a future layout change fails at build time instead of silently filling wrong offsets.

The subtle review fix here was important: QueuePtr must be uint64_t, not void *. The implicit-argument ABI is device-side and fixed-width. A host pointer type would make the structure layout depend on the LLVM build host.

Root Cause 2: Non-Entry Functions Used The Wrong Base Pointer

The second failure was lower in the AMDGPU backend.

There are two different cases:

  1. Entry kernel: the backend can address hidden implicit arguments through the kernel argument segment pointer plus the aligned explicit-argument size.
  2. Non-entry device helper: the helper does not have the entry kernel’s kernarg base. It receives a preloaded implicitarg.ptr SGPR pair.

The broken code used the entry-kernel addressing idea too broadly. In a non-entry function, this can generate loads that look like “load from field offset 0xc0” without using the callee’s real implicit-argument pointer as the base.

For gfx803 COV5, those fields matter:

1
2
3
0xc0 = 192 = private base
0xc4 = 196 = shared base
0xc8 = 200 = queue pointer

The fix was to split “field offset” from “addressing base”:

  • field offsets stay the same ABI constants;
  • entry functions use the kernel argument pointer path;
  • non-entry functions use IMPLICIT_ARG_PTR;
  • both SelectionDAG and GlobalISel get the same rule.

The regression test is deliberately an LLVM backend .ll file, not an OpenMP C test. It forces three non-entry loads:

  • private base through an alloca in private memory;
  • shared base through an LDS/generic pointer check;
  • queue pointer through llvm.trap().

The expected gfx803 code loads from the callee implicit-argument pointer at 0xc0, 0xc4, and 0xc8. The gfx906 checks are negative controls: it should not start using those pre-gfx9 hidden private/shared fields.

The M0 Patch Is Separate

The COV5 migration did not replace the m0 fix from the previous debugging round. It carried it forward.

That patch fixes a different bug: dynamic v_writelane_b32 lowering can borrow the physical m0 register as a lane selector. On GFX6-GFX8, m0 is also architectural state used by LDS/flat memory operations and must survive calls.

The final COV5 patch series keeps this as its own logical patch:

1
0003-amdgpu-preserve-m0-around-gfx6-gfx8-writelane.patch

That matters for maintainability. The COV5 implicit-argument fixes explain why gfx803 COV5 kernels can launch and use hidden arguments correctly. The m0 fix explains why small teams-loop device printf no longer corrupts caller state. They are adjacent in the verified stack, but they are not the same root cause.

The Final Patch Set

After the refactor, the LLVM patch series became:

1
2
3
4
0001-openmp-enable-amdgpu-plugin-on-loongarch64.patch
0002-amdgpu-populate-cov5-pre-gfx9-implicit-args.patch
0003-amdgpu-preserve-m0-around-gfx6-gfx8-writelane.patch
0004-amdgpu-use-callee-implicitarg-ptr-for-cov5-hidden-args.patch

What was removed:

1
2
0002-offload-accept-amdgpu-hsa-elf-abi-v4.patch
0003-amdgpu-use-cov4-implicit-argument-size.patch

The script policy changed from “gfx803 uses COV4” to “gfx803 uses COV5”:

1
AMDGPU_COV5_ARCHES="gfx803"

The generated wrapper adds:

1
-Xarch_gfx803 -mcode-object-version=5

The GPU libc build also needs to match. LLVM libc for the amdgcn-amd-amdhsa runtime target defaults to a newer code object version unless told otherwise, so the setup script now passes:

1
-DRUNTIMES_amdgcn-amd-amdhsa_LIBC_GPU_CODE_OBJECT_VERSION=5

That avoids a later link-time mismatch when device code uses GPU libc features such as printf.

Debugging path from COV4 bootstrap through historical audit, COV5 runtime launch fixes, COV5 backend non-entry fixes, m0 preservation, and final mixed-gpu verification.

Figure 3. The final path narrowed the patches by asking which layer had the missing COV5 pre-gfx9 behavior, not by making every generic COV5 path behave like gfx803.

Reproducing The Investigation

Start from the repo and a disposable workspace:

1
2
cd $HOME/Projects/amd-omp-gpu-offloading
export OMP_AMDGPU_WORKSPACE=$HOME/Projects/llvm-22

First verify that the patch stack applies to the pinned sources:

1
PREPARE_ONLY=1 ./setup_openmp_offload_stack.sh "$OMP_AMDGPU_WORKSPACE"

This resets the generated LLVM and ROCr source trees, then applies:

1
2
patches/llvm/*.patch
patches/rocr/*.patch

For this COV5 migration, a prepare-only pass is essential. If a patch only works because the workspace already had stale source edits, the patch series is not maintainable.

Then build and test:

1
./setup_openmp_offload_stack.sh "$OMP_AMDGPU_WORKSPACE"

The successful run prints the backend regressions:

1
2
==> Running AMDGPU m0 writelane backend regression
==> Running AMDGPU COV5 non-entry implicit argument regression

On the dual-GPU machine it also prints:

1
2
3
4
==> Running OpenMP AMDGPU smoke test for gfx906 gfx803
openmp amdgpu smoke: devices=2 checked=2
  device=0 sum=120 on_device=1
  device=1 sum=1720 on_device=1

Check the generated CMake cache for the GPU libc COV5 setting:

1
2
grep 'LIBC_GPU_CODE_OBJECT_VERSION' \
  "$OMP_AMDGPU_WORKSPACE/llvm_build/CMakeCache.txt"

Expected important line:

1
LIBC_GPU_CODE_OBJECT_VERSION:STRING=5

Check device discovery:

1
2
source "$OMP_AMDGPU_WORKSPACE/env.sh"
llvm-offload-device-info

Expected important devices:

1
2
Name: gfx906
Name: gfx803

Compile a small reduction:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#include <omp.h>
#include <stdio.h>

int main(void) {
  int sum = 0;

#pragma omp target teams distribute parallel for reduction(+ : sum)
  for (int i = 1; i <= 256; ++i)
    sum += i;

  printf("sum=%d devices=%d\n", sum, omp_get_num_devices());
  return sum == 32896 ? 0 : 1;
}

Build and run:

1
2
clang-amdgpu-openmp reduction.c -o reduction
OMP_TARGET_OFFLOAD=MANDATORY ./reduction

For single-architecture checks:

1
2
3
4
5
clang-gfx803-openmp reduction.c -o reduction-gfx803
OMP_TARGET_OFFLOAD=MANDATORY ./reduction-gfx803

clang-gfx906-openmp reduction.c -o reduction-gfx906
OMP_TARGET_OFFLOAD=MANDATORY ./reduction-gfx906

The manual equivalent for gfx803 is:

1
2
3
4
5
6
"$LLVM_INSTALL/bin/clang" \
  -fopenmp \
  -fopenmp-targets=amdgcn-amd-amdhsa \
  --offload-arch=gfx803 \
  -Xarch_gfx803 -mcode-object-version=5 \
  reduction.c -o reduction-gfx803

How To Debug This Alone

If you are starting from the same symptom, do not begin by editing LLVM. Walk the stack in this order.

1. Prove The GPU Is Visible

1
2
3
4
5
lspci -nn | grep -E -i 'amd|ati|vga|display|3d'

find /sys/class/kfd/kfd/topology/nodes -maxdepth 2 -type f \
  \( -name properties -o -name gpu_id -o -name name \) \
  -print -exec sed -n '1,80p' {} \;

For WX3200, expect gfx_target_version 80003.

2. Prove The Runtime Can Create A Small Queue

The current stack requires:

1
export LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64

The probe result that matters is:

1
2
queue_create(64): HSA_STATUS_SUCCESS
queue_create(128): HSA_STATUS_ERROR_OUT_OF_RESOURCES

If queue size 64 does not work, COV5 patches are not your first problem.

3. Separate Code Object Policy From Hardware Support

Build the same source for one architecture at a time:

1
2
clang-gfx803-openmp test.c -o test-gfx803
clang-gfx906-openmp test.c -o test-gfx906

Then build manually with explicit COV5:

1
2
3
4
5
6
"$LLVM_INSTALL/bin/clang" \
  -fopenmp \
  -fopenmp-targets=amdgcn-amd-amdhsa \
  --offload-arch=gfx803 \
  -Xarch_gfx803 -mcode-object-version=5 \
  test.c -o test-gfx803-cov5

If gfx906 works and gfx803 fails, do not conclude “AMDGPU offloading is broken.” Narrow the question to the pre-gfx9 path.

4. Inspect The LLVM Backend Before Changing The Runtime

Look for the hidden COV5 offsets:

1
2
grep -r -E 'PRIVATE_BASE_OFFSET|SHARED_BASE_OFFSET|QUEUE_PTR_OFFSET' \
  "$OMP_AMDGPU_WORKSPACE/llvm_src/llvm/lib/Target/AMDGPU"

If the backend contains these offsets but the runtime struct does not expose them, the runtime may be launching with a valid 256-byte block that is missing pre-gfx9-specific data.

5. Reduce To llc When Possible

Runtime tests prove behavior, but backend tests prove code generation. The final repo keeps two pure backend regression files in the patched LLVM source:

1
2
llvm/test/CodeGen/AMDGPU/m0-writelane-preserve.ll
llvm/test/CodeGen/AMDGPU/cov5-non-entry-implicitarg.ll

Run them through the generated verifier:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
source "$OMP_AMDGPU_WORKSPACE/env.sh"
cd $HOME/Projects/amd-omp-gpu-offloading

(
  set -euo pipefail
  source scripts/common.sh
  source scripts/verify.sh
  source "$OMP_AMDGPU_WORKSPACE/env.sh"
  export LLVM_BUILD="$OMP_AMDGPU_WORKSPACE/llvm_build"
  run_amdgpu_backend_regression_if_available
)

This catches regressions even if the machine temporarily has no working OpenMP runtime path.

Why This Is Better Than The COV4 Path

The COV4 path was good for discovery. The COV5 path is better for maintenance.

COV4 required LLVM 22 to accept and size an older ABI path that modern offload code no longer treats as normal. Every future LLVM update would ask the same question again: which removed COV4 assumption needs to be restored this time?

COV5 changes the maintenance question:

Which existing COV5 pre-gfx9 behavior is incomplete?

That is narrower and easier to audit. It also keeps gfx803 in the same ABI family as newer GPUs, while still allowing per-architecture policy:

1
2
gfx803 -> COV5
gfx906 -> default newer ABI

The final mixed-GPU result is the practical proof. One OpenMP binary can include both images, enumerate both devices, and run work on both GPUs.

Conclusion

The important result was not simply “COV5 works on gfx803.” The important result was why it works now.

Old LLVM and ROCm/AOMP toolchains showed that gfx803 support was real in the old DeviceRTL era. Modern LLVM showed that COV5 pre-gfx9 pieces still exist in the backend. The missing pieces were specific:

  • libomptarget did not populate the pre-gfx9 COV5 hidden fields;
  • non-entry device functions loaded those hidden fields through the wrong base;
  • the separate GFX8 m0 preservation bug still had to stay fixed;
  • GPU libc had to be built as COV5 when the gfx803 application image is COV5.

That is a much better patch shape than bringing COV4 back wholesale. The final repository keeps each behavior as a small indexed patch, verifies patch application from a clean workspace, verifies codegen with llc, and verifies runtime behavior with the actual gfx803 and gfx906 devices.

For an old GPU on an unusual host architecture, that is the difference between “it works on my current tree” and a stack that can survive the next rebuild.