Enabling End-To-End LLVM OpenMP AMDGPU Offloading On GFX803 And LoongArch64
The goal sounded small:
Build LLVM 22 with OpenMP GPU offloading and make it run on a Radeon Pro WX 3200.
The actual target was more unusual:
- an old Polaris GPU, reported as gfx803;
- a LoongArch64 host;
- a modern LLVM 22 toolchain;
- no desire to rebuild the full ROCm stack.
That combination matters. gfx803 is old enough that current ROCm no longer
treats it as a normal supported target, and LoongArch64 is not one of the usual
ROCm host architectures. But LLVM OpenMP offloading does not need all of ROCm.
It needs a smaller chain to work end to end.
The simple test for the whole effort was a normal OpenMP reduction:
| |
When this prints sum=32896 with OMP_TARGET_OFFLOAD=MANDATORY, the compiler,
device libraries, OpenMP runtime, HSA runtime, kernel driver, queue submission,
and memory mapping all agreed enough to run real work on the GPU.
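A minimal sketch of a reduction of this kind follows. It assumes the loop sums the integers 1 through 256, which is one loop shape that yields 32896; the function name offload_sum and the exact pragma clauses are illustrative rather than taken from the original test.

```cpp
#include <cassert>

// Sums 1..n inside an OpenMP target region.
// Assumption: the bounds are illustrative; n = 256 gives 256 * 257 / 2 = 32896.
int offload_sum(int n) {
  assert(n >= 0);  // precondition for this sketch
  int sum = 0;
  // With an offload-capable compiler this loop runs on the device.
  // If the compiler ignores OpenMP pragmas, the same loop runs on the host.
  #pragma omp target teams distribute parallel for reduction(+ : sum) map(tofrom : sum)
  for (int i = 1; i <= n; ++i) {
    sum += i;
  }
  return sum;
}
```

Printing the result of offload_sum(256) in the form sum=%d would give sum=32896. The value alone does not prove the GPU was used; that is what OMP_TARGET_OFFLOAD=MANDATORY is for, since it makes the runtime fail instead of silently falling back to the host.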
This post documents the initial bring-up. It follows the history captured in the
amd-omp-gpu-offloading repository, especially the first durable stack commit:
| |
Figure 1. The bring-up only needed the layers that LLVM OpenMP offloading touches. Avoiding the full ROCm stack made the problem smaller and the patch set auditable.
Step 1: Prove The Machine Sees The GPU
Do not start by editing LLVM. First prove that the host kernel sees the GPU and that KFD exposes it as a compute device.
| |
The expected device is the WX3200:
| |
Then check the KFD topology:
| |
The important lines are:
| |
Why this matters:
- lspci proves PCI enumeration.
- KFD topology proves the compute-facing kernel path exists.
- gfx_target_version 80003 is the kernel-side clue that this is gfx803.
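The link between 80003 and gfx803 can be made explicit with a small decoder. This is a sketch that assumes KFD encodes the value as major * 10000 + minor * 100 + stepping, which is consistent with the number above; the struct and function names are hypothetical.

```cpp
// Decoded form of KFD's gfx_target_version.
// Assumption: the encoding is major * 10000 + minor * 100 + stepping.
struct GfxVersion {
  unsigned major_ver;
  unsigned minor_ver;
  unsigned stepping;
};

// Splits the decimal-packed value into its three fields.
constexpr GfxVersion decode_gfx_target_version(unsigned v) {
  return GfxVersion{v / 10000, (v / 100) % 100, v % 100};
}

// True when the topology value corresponds to gfx803 (8.0.3).
constexpr bool is_gfx803(unsigned v) {
  return decode_gfx_target_version(v).major_ver == 8 &&
         decode_gfx_target_version(v).minor_ver == 0 &&
         decode_gfx_target_version(v).stepping == 3;
}
```

With this encoding, 80003 decodes to major 8, minor 0, stepping 3, which is the gfx803 name that Clang and the device libraries use.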
If this layer is missing, LLVM cannot fix it. The compiler can generate the right image and still have nowhere to run it.
Step 2: Keep The ROCm Scope Small
The first design decision was to avoid “install ROCm” as the goal. Full ROCm support means HIP headers, libraries, math libraries, profilers, packaging, PyTorch-facing stacks, and many higher-level components. That is much more than LLVM OpenMP needs.
For this project, the required ROCm-side source set was:
- ROCr Runtime: builds libhsa-runtime64.so.
- ROCt / HSAKMT thunk: talks to /dev/kfd; in ROCr 6.4.4 it is bundled inside the ROCr source tree as libhsakmt/.
- ROCm device libraries: bitcode libraries consumed by Clang during AMDGPU device compilation.
The runtime source of truth was the fork:
| |
ROCr 6.4.4 was chosen because it still contains the old GFX8 doorbell path. That
path matters for WX3200. Newer ROCr 7.x code removed support for the legacy
DoorbellType == 1 queue path, so making 7.x work for this GPU means
reconstructing behavior that upstream already deleted.
The 6.4.4 plan is simpler:
- keep the native old doorbell code already present in that release;
- patch only the LoongArch64 host build/runtime issues;
- keep the whole diff as ordered patch files.
Step 3: Build A Disposable Workspace
The current maintained setup command is:
| |
Historically, the first version of the script was named
setup_gfx803_stack.sh. The current script name is broader because the stack
later grew beyond one GPU, but the single-gfx803 build is still controlled by
AMDGPU_ARCHES=gfx803.
For patch maintenance, first run only source checkout and patch application:
| |
That mode answers one narrow question: do the ordered patch files still apply to the pinned sources?
The generated workspace is disposable:
| |
The repository stays source-only. Build trees and installs live in the workspace.
Figure 2. Each validation step removes one layer from suspicion. The final OpenMP reduction is useful only after the lower HSA and code-object questions are already answered.
Step 4: Let LLVM Build The AMDGPU Plugin On LoongArch64
LLVM’s OpenMP offload build had a host architecture gate for the AMDGPU plugin. The allowed host list covered the common ROCm platforms:
| |
LoongArch64 was not in that list. The first LLVM patch adds:
| |
The patch is small, but important. Without it, the AMDGPU OpenMP target plugin
does not build on this host, and llvm-offload-device-info can only show the
host fallback device even if ROCr itself is present.
The tool name is intentional for this LLVM 22 stack: it is built from
offload/tools/deviceinfo as llvm-offload-device-info.
This is the cleanest part of the work and the most upstream-shaped part: it is not specific to WX3200. It says that the plugin can be built on a Linux LoongArch64 host.
Step 5: Make ROCr Build And Run On LoongArch64
The ROCr patch series is deliberately host-side and small:
| |
The intent is not to change queue behavior. ROCr 6.4.4 already has the GFX8 queue path we need. The patches only remove non-portable host assumptions:
- include the standard integer header where the source used fixed-width types;
- route PCIe fences through helper functions;
- use C++ atomic fences on non-x86 hosts;
- replace _mm_pause() with a portable yield fallback;
- avoid including mm_malloc.h on non-x86 builds.
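The shape of those portability changes can be sketched as follows. The helper names cpu_relax, store_fence, and spin_until_set are hypothetical and the real patch series may be structured differently; whether a C++ fence is sufficient for device-visible memory on a particular host is outside the scope of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#if defined(__x86_64__)
#include <immintrin.h>  // _mm_pause, _mm_sfence
#endif

// Hint that the caller is in a spin-wait loop.
inline void cpu_relax() {
#if defined(__x86_64__)
  _mm_pause();
#else
  std::this_thread::yield();  // portable fallback on non-x86 hosts
#endif
}

// Order earlier stores before later ones, routed through one helper
// so the host-specific choice lives in a single place.
inline void store_fence() {
#if defined(__x86_64__)
  _mm_sfence();
#else
  std::atomic_thread_fence(std::memory_order_release);
#endif
}

// Spins until `flag` is set or `max_spins` polls have elapsed.
// Returns true if the flag was observed set.
inline bool spin_until_set(const std::atomic<bool>& flag, unsigned max_spins) {
  assert(max_spins > 0);  // precondition for this sketch
  for (unsigned i = 0; i < max_spins; ++i) {
    if (flag.load(std::memory_order_acquire)) {
      return true;
    }
    cpu_relax();
  }
  return flag.load(std::memory_order_acquire);
}
```

Keeping the x86 intrinsics behind helpers like these means the rest of the runtime never names _mm_pause() or an x86-only header directly, which is what lets the same source build on LoongArch64.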
After that, the first HSA-level test is not OpenMP. It is queue creation.
The important result on this machine was:
| |
That result became a runtime policy:
| |
The generated env.sh exports it. This is intentionally not an LLVM source
default change. It is a local hardware/runtime configuration.
A raw barrier packet then proves the GPU consumes work:
| |
Only after that does it make sense to debug OpenMP.
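For readers unfamiliar with the barrier packet used in that test, the sketch below mirrors the 64-byte AQL barrier-AND packet layout and header encoding from the HSA specification. The struct and constants are local stand-ins written for illustration; real code should use hsa_barrier_and_packet_t and the enums from hsa.h.

```cpp
#include <cstdint>

// Local mirror of the AQL barrier-AND packet (assumption: matches hsa.h).
struct BarrierAndPacket {
  std::uint16_t header;             // packet type, barrier bit, fence scopes
  std::uint16_t reserved0;
  std::uint32_t reserved1;
  std::uint64_t dep_signal[5];      // signal handles to wait on (0 = none)
  std::uint64_t reserved2;
  std::uint64_t completion_signal;  // signal handle decremented on completion
};
static_assert(sizeof(BarrierAndPacket) == 64, "AQL packets are 64 bytes");

// Numeric values as defined by the HSA specification.
constexpr unsigned kPacketTypeBarrierAnd = 3;  // HSA_PACKET_TYPE_BARRIER_AND
constexpr unsigned kFenceScopeSystem = 2;      // HSA_FENCE_SCOPE_SYSTEM
constexpr unsigned kHeaderTypeShift = 0;
constexpr unsigned kHeaderBarrierShift = 8;
constexpr unsigned kHeaderAcquireShift = 9;
constexpr unsigned kHeaderReleaseShift = 11;

// Builds a header for a barrier-AND packet with the barrier bit set and
// system-scope acquire and release fences.
constexpr std::uint16_t make_barrier_header() {
  return static_cast<std::uint16_t>(
      (kPacketTypeBarrierAnd << kHeaderTypeShift) |
      (1u << kHeaderBarrierShift) |
      (kFenceScopeSystem << kHeaderAcquireShift) |
      (kFenceScopeSystem << kHeaderReleaseShift));
}
```

In a test of this kind, the packet body is typically written first, the header is stored last, and the doorbell is then rung with the new write index. A completion signal that reaches zero shows that the GPU actually consumed the packet, independent of any compiler or OpenMP machinery.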
Step 6: Use COV4 For GFX803
LLVM 22 defaults to newer AMDGPU code object versions. The WX3200 path used here needs code object version 4.
That has three consequences.
First, compile the gfx803 device image with COV4:
| |
The generated clang-gfx803-openmp wrapper adds this automatically.
Second, LLVM’s offload image checker must accept AMDGPU HSA ELF ABI version 4. The patch changes the accepted AMDGPU HSA ELF ABI versions from:
| |
to:
| |
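The idea behind that check can be sketched independently of the actual patch. The example assumes that "ABI version 4" here means code object V4, and that the ELF constants have the values I understand LLVM's BinaryFormat/ELF.h to define (EI_ABIVERSION values 0 through 4 for code object V2 through V6). The function names and the min_cov parameter are hypothetical and do not reproduce the plugin's real code.

```cpp
#include <cstdint>

// Assumed values from LLVM's BinaryFormat/ELF.h.
constexpr std::uint8_t kElfOsAbiAmdgpuHsa = 64;  // ELFOSABI_AMDGPU_HSA
constexpr std::uint8_t kMaxKnownAbiVersion = 4;  // ELFABIVERSION_AMDGPU_HSA_V6

// Maps the EI_ABIVERSION byte to a code object version, or 0 if unknown.
// Assumption: ABI version 0 is code object V2, so the offset is 2.
constexpr unsigned code_object_version(std::uint8_t abi_version) {
  return abi_version <= kMaxKnownAbiVersion ? abi_version + 2u : 0u;
}

// Hypothetical acceptance check: the image must be an AMDGPU HSA ELF whose
// code object version is known and at least `min_cov`.
constexpr bool image_accepted(std::uint8_t os_abi, std::uint8_t abi_version,
                              unsigned min_cov) {
  return os_abi == kElfOsAbiAmdgpuHsa &&
         code_object_version(abi_version) != 0 &&
         code_object_version(abi_version) >= min_cov;
}
```

Under these assumptions, a COV4 image carries EI_ABIVERSION 2. A checker whose minimum is 5 rejects it, and lowering the minimum to 4 accepts it without affecting newer images.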
Third, the AMDGPU plugin must use the COV4 implicit-argument size. LLVM’s COV5+ path used a larger implicit argument area:
| |
For COV4 the working size is:
| |
This became the third LLVM patch. Without it, the runtime can load an image but still launch it with the wrong metadata shape.
The important policy choice is that Clang’s global default stays unchanged. COV4 is selected by the wrapper for this old target.
Step 7: Build Device Libraries Matched To LLVM 22
AMDGPU device compilation links bitcode libraries. The system ROCm bitcode on the machine was too new for LLVM 22 to consume reliably, so the stack keeps a known device-libs snapshot in the source repo:
| |
The setup script copies that snapshot into the workspace, builds it there, and installs bitcode under:
| |
Then env.sh points Clang at that private ROCm prefix:
| |
This makes the compiler, device bitcode, and runtime prefix a coherent local stack instead of a mix of system ROCm and custom LLVM.
Step 8: Verify LLVM Sees The Device
Use the generated environment:
| |
Then check device discovery:
| |
This is LLVM 22’s OpenMP offload device-info tool, not a typo for an
llvm-omp-device-info binary.
For a single WX3200 setup, the important lines are:
| |
If this shows only the host device, check the loader paths first. The AMDGPU
plugin dlopens HSA at runtime, so LD_LIBRARY_PATH must include the private
ROCr install under the workspace.
Step 9: Run The Small OpenMP Tests
Start with a scalar target region:
| |
Compile and run:
| |
Expected:
| |
Then run the reduction:
| |
| |
Expected:
| |
The user’s original reduction test also became a final smoke test:
| |
Expected:
| |
At that point the stack is end-to-end: Clang emits the device image, the private
device libraries link, libomptarget loads the COV4 image, ROCr creates a
queue, KFD accepts the packet, and the old GPU runs the OpenMP kernel.
What Was Actually Needed
The first working stack needed fewer moving parts than “ROCm on gfx803” sounds like:
- a pinned LLVM 22 source baseline;
- a pinned ROCr 6.4.4 baseline from the ouankou/ROCR-Runtime fork;
- small ROCr LoongArch64 portability patches;
- LLVM AMDGPU OpenMP plugin enablement on LoongArch64;
- COV4 ELF ABI acceptance in libomptarget;
- COV4 implicit-argument sizing;
- LLVM-compatible AMD device libraries;
- LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64;
- wrapper policy that selects COV4 for gfx803.
The workflow lesson is more important than any single patch: debug from the bottom upward. Check hardware visibility first, then HSA queue creation, raw packet completion, code-object compatibility, device library compatibility, and only then the OpenMP source.
Starting with the OpenMP reduction is good for defining success. It is not good for locating the first failure.
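The bottom-up rule can be written down as a tiny helper that reports the lowest failing layer. This is only an illustration of the ordering; the enum, the bitmask convention, and the function name are hypothetical and not part of the repository.

```cpp
// Validation layers in bottom-up order, as listed in the text.
enum class Layer : unsigned {
  HardwareVisibility = 0,
  HsaQueueCreation,
  RawPacketCompletion,
  CodeObjectCompat,
  DeviceLibraryCompat,
  OpenMpSource,
  AllPassed  // sentinel: every layer passed
};

// `passed_mask` has bit i set when layer i passed its check.
// Returns the lowest layer that failed, which is where debugging should start.
constexpr Layer first_failure(unsigned passed_mask) {
  for (unsigned i = 0; i < static_cast<unsigned>(Layer::AllPassed); ++i) {
    if ((passed_mask & (1u << i)) == 0) {
      return static_cast<Layer>(i);
    }
  }
  return Layer::AllPassed;
}
```

For example, a mask of 0x03 means the GPU is visible and a queue can be created, so the first layer to investigate is raw packet completion, regardless of what the OpenMP test printed.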