Moving GFX803 LLVM OpenMP Offloading From COV4 To COV5
The first end-to-end result for the WX3200 was intentionally pragmatic:
| |
That stack proved the hard parts were possible on the LoongArch64 machine:
- KFD could expose the Polaris12 GPU as gfx803.
- ROCr 6.4.4 could create a queue and ring the old GFX8 doorbell.
- LLVM’s AMDGPU OpenMP plugin could launch real OpenMP target regions.
- The generated program could run reductions and device-side output.
But a working first path is not always the path to keep.
COV4 was a shortcut. It required local patches to make LLVM 22’s offload loader
accept AMDGPU ABI v4 and to make libomptarget use the old 56-byte implicit
argument area. That felt wrong once the stack also had an MI50 (gfx906) and
once the debugging moved from “can this run at all?” to “what is the narrow
patch set we can maintain?”
The better question became:
Can gfx803 run LLVM OpenMP offloading through COV5, using the modern ABI family, instead of resurrecting more of COV4?
The answer was yes. The route was not “turn on COV5 and hope.” It required following the traces left in old LLVM and ROCm/AOMP toolchains, then finishing the incomplete pre-gfx9 COV5 path in two specific places.
Figure 1. The important lesson from the history audit was not that old code should be copied back. It was that COV5 already had a pre-gfx9 design shape, but modern LLVM no longer exercised it end to end for gfx803 OpenMP.
Terms First
Code object version, or COV, is the AMDGPU executable ABI version encoded in the GPU image. Clang and LLD produce an AMDGPU ELF image, and libomptarget loads that image through ROCr.
For this work:
- COV4 means AMDGPU HSA ELF ABI v4.
- COV5 means AMDGPU HSA ELF ABI v5.
- COV6 is LLVM 22’s default AMDGPU code object version.
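As a minimal sketch of where that version lives, the following reads the OS ABI and ABI version bytes from an ELF identification header and maps them to a COV number. The numeric constants mirror the AMDGPU entries in LLVM's llvm/BinaryFormat/ELF.h; the helper names code_object_version and fake_ident are hypothetical and exist only for this illustration.

```python
# Constants mirroring the AMDGPU entries in llvm/BinaryFormat/ELF.h.
EI_OSABI = 7
EI_ABIVERSION = 8
ELFOSABI_AMDGPU_HSA = 64
# EI_ABIVERSION value -> code object version for the HSA OS ABI.
HSA_ABI_TO_COV = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6}


def code_object_version(e_ident: bytes) -> int:
    """Return the COV encoded in the first 16 bytes of an AMDGPU HSA ELF."""
    if len(e_ident) < 16 or e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if e_ident[EI_OSABI] != ELFOSABI_AMDGPU_HSA:
        raise ValueError("not an AMDGPU HSA code object")
    try:
        return HSA_ABI_TO_COV[e_ident[EI_ABIVERSION]]
    except KeyError:
        raise ValueError("unknown AMDGPU HSA ABI version") from None


def fake_ident(abi_version: int) -> bytes:
    """Build a synthetic 16-byte e_ident (hypothetical helper, demo only)."""
    # magic, ELFCLASS64, little-endian, EV_CURRENT, OS ABI, ABI version, padding
    header = b"\x7fELF" + bytes([2, 1, 1, ELFOSABI_AMDGPU_HSA, abi_version])
    return header + bytes(16 - len(header))
```

To use it on a real image, pass the first 16 bytes of the extracted device ELF; the synthetic header is only there so the mapping can be exercised without a GPU toolchain.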
Pre-gfx9 means AMD GPU generations before GFX9, including gfx803. Those chips do not lower every private/shared address the same way gfx906 does. In particular, GFX9 and later can read the private and LDS aperture bases from hardware aperture registers, while pre-gfx9 code has to load them from memory that the runtime filled in.
Implicit arguments are hidden kernel-launch fields passed by the runtime to the device code. User C code never mentions them. LLVM-generated AMDGPU code can still load them. For COV5, the implicit-argument block is 256 bytes.
DeviceRTL is the OpenMP device runtime library. Seeing gfx803 in the
AMDGPU backend’s processor table only means LLVM knows the ISA. Seeing it in
DeviceRTL build lists means the OpenMP device runtime was intentionally built
for that architecture.
The Initial COV4 State
The first repo state looked like this:
| |
It carried these LLVM-side ideas:
| |
The COV4 pieces did two things:
- Let the offload ELF checker accept AMDGPU ABI v4.
- Tell libomptarget that COV4 uses a 56-byte implicit-argument area.
That was enough to make the old card useful. It was also a sign that we were reintroducing a path LLVM 22 had moved away from. The printf patches were similarly useful as experiments, but the later m0 investigation showed that the real bug was in AMDGPU backend register preservation, not in the GPU libc RPC layer.
The COV4 stack was valuable because it established a working lower bound. It was not the right final design.
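The two COV4 behaviors can be modeled in a few lines. This is a simplified sketch, not the LLVM code: the function names and the allow_cov4 switch are hypothetical, and only the sizes (56 and 256 bytes) and the "version 5 or above" rule come from this article.

```python
COV4_IMPLICIT_ARGS_SIZE = 56   # old implicit-argument area restored by the local patch
COV5_IMPLICIT_ARGS_SIZE = 256  # COV5 implicit-argument block


def loader_accepts(cov: int, allow_cov4: bool = False) -> bool:
    """Model the offload ELF check: COV5+ by default, COV4 only if patched in."""
    minimum = 4 if allow_cov4 else 5
    return cov >= minimum


def implicit_args_size(cov: int) -> int:
    """Model the runtime's choice of implicit-argument area size."""
    if cov < 4:
        raise ValueError("code object versions below 4 are out of scope here")
    return COV4_IMPLICIT_ARGS_SIZE if cov == 4 else COV5_IMPLICIT_ARGS_SIZE
```

Seen this way, the COV4 stack needed both switches flipped at once: an image that passes the loader but gets the wrong implicit-argument size would still launch with a mismatched layout.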
The Combinations Tried
The useful evidence came from trying combinations, not from assuming that
gfx803 implied COV4.
| Toolchain/runtime combo | Result | What it taught us |
|---|---|---|
| LLVM 22 default COV6 + early gfx803 setup | Not the working path | LLVM 22 defaults to COV6, but old gfx803 support was not complete just by using the default. |
| LLVM 22 + ROCr 7.x + gfx803 | Proof-of-concept only | ROCr 7.x had removed the legacy GFX8 DoorbellType == 1 queue path; patching it back was too broad. |
| LLVM 22 + ROCr 6.4.4 + gfx803 + COV4 | Worked after local COV4 patches | Good bootstrap path, but it restored removed ELF/implicit-arg behavior. |
| LLVM 22 + ROCr 6.4.4 + gfx906 + COV5/default | Worked | MI50 did not need the gfx803 COV4 escape route. |
| LLVM 22 + ROCr 6.4.4 + gfx906 forced to COV4 | Worked for the printf control case | COV4 alone was not the cause of the m0/printf failure. |
| LLVM 22 + ROCr 6.4.4 + gfx803 + COV5 before fixes | Failed in small OpenMP target/teams shapes | The modern ABI family existed, but pre-gfx9 launch metadata and non-entry hidden-arg lowering were incomplete. |
| LLVM 22 + ROCr 6.4.4 + gfx803 + COV5 after fixes | Worked | This became the maintained direction. |
| LLVM 22 + ROCr 6.4.4 + gfx803 COV5 + gfx906 default in one binary | Worked | One process can use both GPUs with arch-specific code-object policy. |
The final local wrapper policy is now:
| |
That means gfx803 is explicitly held at COV5 while gfx906 uses the default
newer path.
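A per-architecture policy of this kind can be sketched as a small lookup. This is not the repository's wrapper: COV_POLICY and cov_flags are hypothetical names, and the only real interface used is Clang's -mcode-object-version= option. How the flag is forwarded to each device compilation in a multi-architecture build is left out.

```python
from typing import Dict, List, Optional

# Hypothetical policy table: architecture -> pinned code object version.
# None means "leave the compiler default alone".
COV_POLICY: Dict[str, Optional[int]] = {
    "gfx803": 5,     # held at COV5
    "gfx906": None,  # follows the LLVM default
}


def cov_flags(arch: str) -> List[str]:
    """Return the extra Clang flags implied by the policy for one architecture."""
    cov = COV_POLICY.get(arch)
    if cov is None:
        return []
    return [f"-mcode-object-version={cov}"]
```

Keeping the policy as data makes the intent auditable: only architectures listed with a number are pinned, and everything else keeps whatever the toolchain chooses.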
The Upstream LLVM Audit
The first suspicion was that COV5 and gfx803 were never meant to work
together. The old source history did not support that simple answer.
The audit used shallow tag fetches and source grep over these families:
| |
The commands were intentionally simple. For example:
| |
The audit found this progression:
| LLVM Release Line | Finding |
|---|---|
| LLVM 10.0.1 / 11.1.0 | The old openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt built for gfx700 gfx701 gfx801 gfx803 gfx900. This matches the recollection that old LLVM worked natively for gfx803, but it predates the modern COV5/COV6 path. |
| LLVM 12 / 13 | The classic AMDGPU libomptarget plugin appears, and gfx803 remains in old DeviceRTL lists. COV4 constants appear in the AMDGPU backend around this era, but this is still not a modern COV5 OpenMP solution. |
| LLVM 14 / 15 | The newer openmp/libomptarget/DeviceRTL path includes gfx803. LLVM 15 has PRIVATE_BASE_OFFSET = 192 and QUEUE_PTR_OFFSET = 200 in the AMDGPU backend. That is the first strong clue that hidden COV5 pre-gfx9 fields are a real design, not something invented locally. |
| LLVM 16 / 17 | The nextgen AMDGPU plugin appears beside the classic plugin. It has a COV5 implicit-argument struct, but the runtime-side struct only covers the common fields up to dynamic LDS and padding. The backend knows the hidden offsets; the runtime does not fill them. |
| LLVM 18 / 19 | COV5 and COV4 coexist in parts of libomptarget. LLVM 19 moves the offload runtime out of openmp/libomptarget into offload. gfx803 is still listed in DeviceRTL in LLVM 19. |
| LLVM 20 | gfx803 disappears from the upstream DeviceRTL architecture list, even though the AMDGPU backend still knows the processor and still has the hidden implicit-argument offsets. This is the point where “the backend knows gfx803” and “OpenMP ships a complete gfx803 path” clearly diverge. |
| LLVM 21 / 22 | The offload plugin is COV5+ oriented. LLVM 22’s common ELF check rejects AMDGPU ABI versions below 5 with “must be version 5 or above”. The COV5 implicit-argument struct still does not expose or populate the pre-gfx9 hidden fields. |
That history changed the direction of the patch.
If LLVM 10/11 worked, copying that whole old path back would mean going back to the old DeviceRTL/plugin/COV assumptions. That is a trap. The more relevant signal was LLVM 15 onward: the backend had COV5 hidden-field offsets for private base, shared base, and queue pointer. The design was already there. LLVM 22 just did not complete the path for our pre-gfx9 OpenMP use.
The ROCm/AOMP LLVM Audit
The AMD downstream history filled in another part of the story.
I am calling this the ROCm/AOMP audit because these tags represent AMD’s LLVM toolchain history for OpenMP offloading, including the old ROCm 3.3 HCC/OCL split tags and the later unified ROCm LLVM tags. The important range was not just “a recent ROCm release.” It was the whole arc from the old gfx803-capable runtime through the current audited 7.x line.
The tag audit included the old 3.3 split tags:
| |
Regular ROCm LLVM tags then continue from rocm-3.5.0 through the audited
rocm-7.2.4 tag.
The concrete grep checks are reproducible:
| |
The useful findings:
| ROCm/AOMP line | Finding |
|---|---|
| ROCm HCC/OCL 3.3 and ROCm 3.5 | gfx803 is in the old AMDGPU DeviceRTL build list, for example openmp/libomptarget/deviceRTLs/amdgcn/CMakeLists.txt lists gfx700 gfx701 gfx801 gfx803 gfx900. There is no COV5 machinery. This confirms old gfx803 OpenMP support existed, but in an older ABI/toolchain shape. |
| ROCm 4.0 / 4.5 | gfx803 remains in DeviceRTL and hostcall/libm-related lists. COV4 constants are present. Still not the modern COV5 solution. |
| ROCm 5.0 | Both old and newer OpenMP runtime directories exist, and gfx803 remains in the build lists. This is the transition period where the old support and newer runtime structure overlap. |
| ROCm 5.7 | The tree has both the classic AMDGPU plugin and the nextgen plugin. openmp/libomptarget/DeviceRTL, old deviceRTLs/amdgcn, hostexec, hostrpc, libm, and libc GPU architecture lists still mention gfx803. The classic plugin defines COV4_SIZE = 56 and COV5_SIZE = 256, fills many COV5 fields explicitly, and the backend already has COV5 hidden offsets. This is a strong sign that AMD had overlapping COV4/COV5-era support, but not that LLVM 22’s final nextgen path is complete for gfx803. |
| ROCm 6.4.4 | offload/DeviceRTL still lists gfx803, and the nextgen plugin has both a COV5 implicit-argument struct and a 56-byte COV4 dummy struct. It also defaults to COV6 in the backend, and it still does not populate the pre-gfx9 COV5 private/shared/queue fields in the nextgen runtime path. |
| ROCm 7.2.4 | COV4 ELF loading is rejected by the common offload checker, like upstream LLVM 22. gfx803 appears only in source-level platform guards such as openmp/device/include/Platform.h, not as a normal DeviceRTL architecture list. The backend still has COV5 hidden-offset constants, but the launch-side population remains absent. |
The old COV4 size was not guessed. ROCm 5.7’s classic plugin spells it out:
| |
That explained why the first COV4 patch worked. It did not justify keeping it. The same audit showed that later toolchains were moving away from COV4 loading, while the backend kept COV5 pre-gfx9 offsets. That made COV5 the better target.
The conclusion was specific:
- Do not recover the whole old COV4 path.
- Do not blindly change generic COV5 behavior for all GPUs.
- Continue the existing COV5 + pre-gfx9 design where the source already points.
- Patch only the missing runtime population and the incorrect non-entry backend address base.
Root Cause 1: COV5 Launch Metadata Was Too Generic
The COV5 implicit-argument block is 256 bytes.
LLVM 22’s runtime-side struct looked like a generic COV5 block:
| |
For pre-gfx9 devices, the backend-side ABI has more fields:
| |
Those offsets already existed in the AMDGPU backend as
PRIVATE_BASE_OFFSET, SHARED_BASE_OFFSET, and QUEUE_PTR_OFFSET. The runtime
just was not filling them for COV5 launches.
Figure 2. The local patch does not invent a new ABI. It expands libomptarget’s COV5 implicit-argument struct so the runtime can fill the hidden fields that the AMDGPU backend already knows how to load.
The fix in libomptarget does three narrow things:
- Keep the COV5 implicit-argument block at 256 bytes.
- Add explicit fields at offsets 192, 196, and 200.
- Fill the pre-gfx9 aperture fields only for gfx6, gfx7, and gfx8.
The queue pointer is always filled when the raw HSA queue is available. The private/shared aperture values come from ROCr’s AMD queue extension prefix. The patch uses static assertions for the ROCr 6.4.4 queue layout so a future layout change fails at build time instead of silently filling wrong offsets.
The subtle review fix here was important: QueuePtr must be uint64_t, not
void *. The implicit-argument ABI is device-side and fixed-width. A host
pointer type would make the structure layout depend on the LLVM build host.
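The layout constraints above can be checked with a small ctypes model. This is a sketch under stated assumptions, not the libomptarget struct: the class and field names are hypothetical, the first 192 bytes are collapsed into one opaque array, and the architecture-name parsing in is_pre_gfx9 is a simplification. Only the offsets (192, 196, 200), the 256-byte size, the fixed-width queue pointer, and the gfx6/gfx7/gfx8 gating come from the text.

```python
import ctypes
from typing import Optional


class Cov5ImplicitArgs(ctypes.Structure):
    """Hypothetical model of a 256-byte COV5 implicit-argument block."""
    _fields_ = [
        ("common", ctypes.c_uint8 * 192),   # generic COV5 fields, kept opaque here
        ("private_base", ctypes.c_uint32),  # offset 192 (0xc0)
        ("shared_base", ctypes.c_uint32),   # offset 196 (0xc4)
        ("queue_ptr", ctypes.c_uint64),     # offset 200 (0xc8), fixed width on purpose
        ("tail", ctypes.c_uint8 * 48),      # padding up to 256 bytes
    ]


# Layout checks, in the spirit of the patch's static assertions.
assert ctypes.sizeof(Cov5ImplicitArgs) == 256
assert Cov5ImplicitArgs.private_base.offset == 192
assert Cov5ImplicitArgs.shared_base.offset == 196
assert Cov5ImplicitArgs.queue_ptr.offset == 200


def is_pre_gfx9(arch: str) -> bool:
    """True for gfx6/gfx7/gfx8 names such as gfx803 (simplified name parsing)."""
    if not arch.startswith("gfx") or len(arch) < 6:
        raise ValueError(f"unexpected architecture name: {arch}")
    major = int(arch[3:-2])  # the last two characters are minor and stepping
    return major in (6, 7, 8)


def fill_hidden_fields(args: Cov5ImplicitArgs, arch: str,
                       queue_addr: Optional[int],
                       private_base: int, shared_base: int) -> None:
    """Fill the queue pointer when available; apertures only on pre-gfx9."""
    if queue_addr is not None:
        args.queue_ptr = queue_addr
    if is_pre_gfx9(arch):
        args.private_base = private_base
        args.shared_base = shared_base
```

The same model shows why a host pointer type is the wrong choice: ctypes.c_void_p changes size with the host, while ctypes.c_uint64 is 8 bytes everywhere, so the offsets after it stay fixed.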
Root Cause 2: Non-Entry Functions Used The Wrong Base Pointer
The second failure was lower in the AMDGPU backend.
There are two different cases:
- Entry kernel: the backend can address hidden implicit arguments through the kernel argument segment pointer plus the aligned explicit-argument size.
- Non-entry device helper: the helper does not have the entry kernel’s kernarg base. It receives a preloaded implicitarg.ptr SGPR pair.
The broken code used the entry-kernel addressing idea too broadly. In a non-entry function, this can generate loads that look like “load from field offset 0xc0” without using the callee’s real implicit-argument pointer as the base.
For gfx803 COV5, those fields matter:
| |
The fix was to split “field offset” from “addressing base”:
- field offsets stay the same ABI constants;
- entry functions use the kernel argument pointer path;
- non-entry functions use IMPLICIT_ARG_PTR;
- both SelectionDAG and GlobalISel get the same rule.
The regression test is deliberately an LLVM backend .ll file, not an OpenMP C
test. It forces three non-entry loads:
- private base through an alloca in private memory;
- shared base through an LDS/generic pointer check;
- queue pointer through llvm.trap().
The expected gfx803 code loads from the callee implicit-argument pointer at
0xc0, 0xc4, and 0xc8. The gfx906 checks are negative controls: it
should not start using those pre-gfx9 hidden private/shared fields.
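The split between "field offset" and "addressing base" can be modeled as plain address arithmetic. This is an illustrative sketch, not the backend code: the function names are hypothetical, and the 8-byte alignment of the implicit-argument area is an assumption made for the example.

```python
from typing import Optional

# ABI field offsets within the COV5 implicit-argument block.
PRIVATE_BASE_OFFSET = 192  # 0xc0
SHARED_BASE_OFFSET = 196   # 0xc4
QUEUE_PTR_OFFSET = 200     # 0xc8


def align_to(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def hidden_field_address(field_offset: int, *, is_entry: bool,
                         kernarg_base: Optional[int] = None,
                         explicit_args_size: int = 0,
                         implicit_arg_ptr: Optional[int] = None,
                         implicit_align: int = 8) -> int:
    """Pick the base by function kind; the field offset never changes."""
    if is_entry:
        if kernarg_base is None:
            raise ValueError("entry functions need the kernarg segment pointer")
        base = kernarg_base + align_to(explicit_args_size, implicit_align)
    else:
        if implicit_arg_ptr is None:
            raise ValueError("non-entry functions need the implicit-argument pointer")
        base = implicit_arg_ptr
    return base + field_offset
```

The bug described above corresponds to taking the entry branch for a non-entry function: the offset is right, but it is added to a base the callee does not have.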
The M0 Patch Is Separate
The COV5 migration did not replace the m0 fix from the previous debugging round. It carried it forward.
That patch fixes a different bug: dynamic v_writelane_b32 lowering can borrow
the physical m0 register as a lane selector. On GFX6-GFX8, m0 is also
architectural state used by LDS/flat memory operations and must survive calls.
The final COV5 patch series keeps this as its own logical patch:
| |
That matters for maintainability. The COV5 implicit-argument fixes explain why
gfx803 COV5 kernels can launch and use hidden arguments correctly. The m0 fix
explains why small teams-loop device printf no longer corrupts caller state.
They are adjacent in the verified stack, but they are not the same root cause.
The Final Patch Set
After the refactor, the LLVM patch series became:
| |
What was removed:
| |
The script policy changed from “gfx803 uses COV4” to “gfx803 uses COV5”:
| |
The generated wrapper adds:
| |
The GPU libc build also needs to match. LLVM libc for the amdgcn-amd-amdhsa
runtime target defaults to a newer code object version unless told otherwise,
so the setup script now passes:
| |
That avoids a later link-time mismatch when device code uses GPU libc features
such as printf.
Figure 3. The final path narrowed the patches by asking which layer had the missing COV5 pre-gfx9 behavior, not by making every generic COV5 path behave like gfx803.
Reproducing The Investigation
Start from the repo and a disposable workspace:
| |
First verify that the patch stack applies to the pinned sources:
| |
This resets the generated LLVM and ROCr source trees, then applies:
| |
For this COV5 migration, a prepare-only pass is essential. If a patch only works because the workspace already had stale source edits, the patch series is not maintainable.
Then build and test:
| |
The successful run prints the backend regressions:
| |
On the dual-GPU machine it also prints:
| |
Check the generated CMake cache for the GPU libc COV5 setting:
| |
Expected important line:
| |
Check device discovery:
| |
Expected important devices:
| |
Compile a small reduction:
| |
Build and run:
| |
For single-architecture checks:
| |
The manual equivalent for gfx803 is:
| |
How To Debug This Alone
If you are starting from the same symptom, do not begin by editing LLVM. Walk the stack in this order.
1. Prove The GPU Is Visible
| |
For WX3200, expect gfx_target_version 80003.
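The number reported by KFD packs the ISA version as decimal fields. A minimal sketch of the decoding, assuming the usual major * 10000 + minor * 100 + stepping encoding and hexadecimal digits for minor and stepping in the gfx name (the function name is hypothetical):

```python
def gfx_name(gfx_target_version: int) -> str:
    """Decode a KFD gfx_target_version value into a gfx architecture name."""
    major = gfx_target_version // 10000
    minor = (gfx_target_version // 100) % 100
    stepping = gfx_target_version % 100
    # Minor and stepping are rendered as hex digits, e.g. stepping 10 -> "a".
    return f"gfx{major}{minor:x}{stepping:x}"
```

With this, 80003 decodes to gfx803 and 90006 to gfx906, which makes it easy to confirm that the node you are reading is the card you think it is.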
2. Prove The Runtime Can Create A Small Queue
The current stack requires:
| |
The probe result that matters is:
| |
If queue size 64 does not work, COV5 patches are not your first problem.
3. Separate Code Object Policy From Hardware Support
Build the same source for one architecture at a time:
| |
Then build manually with explicit COV5:
| |
If gfx906 works and gfx803 fails, do not conclude “AMDGPU offloading is
broken.” Narrow the question to the pre-gfx9 path.
4. Inspect The LLVM Backend Before Changing The Runtime
Look for the hidden COV5 offsets:
| |
If the backend contains these offsets but the runtime struct does not expose them, the runtime may be launching with a valid 256-byte block that is missing pre-gfx9-specific data.
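The same check can be scripted so it runs against any LLVM checkout. The sketch below scans source text for the three constant names and compares their values with the offsets discussed in this article; the helper names are hypothetical, and the pattern assumes the constants are written as NAME = value.

```python
import re
from typing import Dict, Optional

EXPECTED_OFFSETS = {
    "PRIVATE_BASE_OFFSET": 192,
    "SHARED_BASE_OFFSET": 196,
    "QUEUE_PTR_OFFSET": 200,
}
_PATTERN = re.compile(
    r"\b(PRIVATE_BASE_OFFSET|SHARED_BASE_OFFSET|QUEUE_PTR_OFFSET)\s*=\s*(\d+)"
)


def find_hidden_offsets(source_text: str) -> Dict[str, int]:
    """Return every hidden-offset constant defined in the given source text."""
    return {name: int(value) for name, value in _PATTERN.findall(source_text)}


def offset_mismatches(source_text: str) -> Dict[str, Optional[int]]:
    """Map each missing or changed constant to the value found (None if absent)."""
    found = find_hidden_offsets(source_text)
    return {
        name: found.get(name)
        for name, expected in EXPECTED_OFFSETS.items()
        if found.get(name) != expected
    }
```

An empty result from offset_mismatches on the backend headers, combined with a runtime struct that has no fields at those offsets, is the signature of the gap described here.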
5. Reduce To llc When Possible
Runtime tests prove behavior, but backend tests prove code generation. The final repo keeps two pure backend regression files in the patched LLVM source:
| |
Run them through the generated verifier:
| |
This catches regressions even if the machine temporarily has no working OpenMP runtime path.
Why This Is Better Than The COV4 Path
The COV4 path was good for discovery. The COV5 path is better for maintenance.
COV4 required LLVM 22 to accept and size an older ABI path that modern offload code no longer treats as normal. Every future LLVM update would ask the same question again: which removed COV4 assumption needs to be restored this time?
COV5 changes the maintenance question:
Which existing COV5 pre-gfx9 behavior is incomplete?
That is narrower and easier to audit. It also keeps gfx803 in the same ABI
family as newer GPUs, while still allowing per-architecture policy:
| |
The final mixed-GPU result is the practical proof. One OpenMP binary can include both images, enumerate both devices, and run work on both GPUs.
Conclusion
The important result was not simply “COV5 works on gfx803.” The important result was why it works now.
Old LLVM and ROCm/AOMP toolchains showed that gfx803 support was real in the
old DeviceRTL era. Modern LLVM showed that COV5 pre-gfx9 pieces still exist in
the backend. The missing pieces were specific:
- libomptarget did not populate the pre-gfx9 COV5 hidden fields;
- non-entry device functions loaded those hidden fields through the wrong base;
- the separate GFX8 m0 preservation bug still had to stay fixed;
- GPU libc had to be built as COV5 when the gfx803 application image is COV5.
That is a much better patch shape than bringing COV4 back wholesale. The final
repository keeps each behavior as a small indexed patch, verifies patch
application from a clean workspace, verifies codegen with llc, and verifies
runtime behavior with the actual gfx803 and gfx906 devices.
For an old GPU on an unusual host architecture, that is the difference between “it works on my current tree” and a stack that can survive the next rebuild.