Enabling End-To-End LLVM OpenMP AMDGPU Offloading On GFX803 And LoongArch64

Posted on (Updated on )
The working gfx803 stack was not a full ROCm resurrection. It was a narrow LLVM OpenMP offloading stack: pinned LLVM 22, a pinned ROCr 6.4.4 runtime from the ouankou fork, LLVM-compatible AMD device libraries, a LoongArch64 enablement patch for the AMDGPU OpenMP plugin, COV4 loading support in libomptarget, COV4 implicit-argument sizing, and a runtime queue-size override. The useful debugging order was hardware first, HSA runtime second, code-object compatibility third, then simple OpenMP scalar and reduction tests.

The goal sounded small:

Build LLVM 22 with OpenMP GPU offloading and make it run on a Radeon Pro WX 3200.

The actual target was more unusual:

  • an old Polaris GPU, reported as gfx803;
  • a LoongArch64 host;
  • a modern LLVM 22 toolchain;
  • no desire to rebuild the full ROCm stack.

That combination matters. gfx803 is old enough that current ROCm no longer treats it as a normal supported target, and LoongArch64 is not one of the usual ROCm host architectures. But LLVM OpenMP offloading does not need all of ROCm. It needs a smaller chain to work end to end.

The simple test for the whole effort was a normal OpenMP reduction:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
#include <omp.h>
#include <stdio.h>

int main(void) {
  int sum = 0;

#pragma omp target teams distribute parallel for reduction(+ : sum)
  for (int i = 1; i <= 256; ++i)
    sum += i;

  printf("sum=%d devices=%d default=%d\n", sum, omp_get_num_devices(),
         omp_get_default_device());
  return sum == 32896 ? 0 : 1;
}

When this prints sum=32896 with OMP_TARGET_OFFLOAD=MANDATORY, the compiler, device libraries, OpenMP runtime, HSA runtime, kernel driver, queue submission, and memory mapping all agreed enough to run real work on the GPU.

This post documents the initial bring-up. It follows the history captured in the amd-omp-gpu-offloading repository, especially the first durable stack commit:

1
d00b0e2 Bootstrap gfx803 OpenMP offload stack
Layered stack from OpenMP source through Clang, device libraries, libomptarget, ROCr, libhsakmt, KFD, and the gfx803 GPU.

Figure 1. The bring-up only needed the layers that LLVM OpenMP offloading touches. Avoiding the full ROCm stack made the problem smaller and the patch set auditable.

Step 1: Prove The Machine Sees The GPU

Do not start by editing LLVM. First prove that the host kernel sees the GPU and that KFD exposes it as a compute device.

1
lspci -nn | rg -i 'amd|ati|vga|display|3d'

The expected device is the WX3200:

1
Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3200] [1002:6981]

Then check the KFD topology:

1
2
3
find /sys/class/kfd/kfd/topology/nodes -maxdepth 2 -type f \
  \( -name properties -o -name gpu_id -o -name name \) \
  -print -exec sed -n '1,80p' {} \;

The important lines are:

1
2
3
name: polaris12
gfx_target_version 80003
device_id 27009

Why this matters:

  • lspci proves PCI enumeration.
  • KFD topology proves the compute-facing kernel path exists.
  • gfx_target_version 80003 is the kernel-side clue that this is gfx803.

If this layer is missing, LLVM cannot fix it. The compiler can generate the right image and still have nowhere to run it.

Step 2: Keep The ROCm Scope Small

The first design decision was to avoid “install ROCm” as the goal. Full ROCm support means HIP headers, libraries, math libraries, profilers, packaging, PyTorch-facing stacks, and many higher-level components. That is much more than LLVM OpenMP needs.

For this project, the required ROCm-side source set was:

  • ROCr Runtime: builds libhsa-runtime64.so.
  • ROCt / HSAKMT thunk: talks to /dev/kfd; in ROCr 6.4.4 it is bundled inside the ROCr source tree as libhsakmt/.
  • ROCm device libraries: bitcode libraries consumed by Clang during AMDGPU device compilation.

The runtime source of truth was the fork:

1
2
3
https://github.com/ouankou/ROCR-Runtime.git
rocm-6.4.4
044c4226baf27401483b2903400647eae96e6f44

ROCr 6.4.4 was chosen because it still contains the old GFX8 doorbell path. That path matters for WX3200. Newer ROCr 7.x code removed support for the legacy DoorbellType == 1 queue path, so making 7.x work for this GPU means reconstructing behavior that upstream already deleted.

The 6.4.4 plan is simpler:

  • keep the native old doorbell code already present in that release;
  • patch only the LoongArch64 host build/runtime issues;
  • keep the whole diff as ordered patch files.

Step 3: Build A Disposable Workspace

The current maintained setup command is:

1
2
3
4
5
6
cd /home/ouankou/Projects/amd-omp-gpu-offloading
export OMP_AMDGPU_WORKSPACE=/home/ouankou/Projects/llvm-22

AMDGPU_ARCHES=gfx803 \
AMDGPU_COV4_ARCHES=gfx803 \
  ./setup_openmp_offload_stack.sh "$OMP_AMDGPU_WORKSPACE"

Historically, the first version of the script was named setup_gfx803_stack.sh. The current script name is broader because the stack later grew beyond one GPU, but the single-gfx803 build is still controlled by AMDGPU_ARCHES=gfx803.

For patch maintenance, first run only source checkout and patch application:

1
2
PREPARE_ONLY=1 AMDGPU_ARCHES=gfx803 AMDGPU_COV4_ARCHES=gfx803 \
  ./setup_openmp_offload_stack.sh /tmp/amd-omp-gfx803-prepare-check

That mode answers one narrow question: do the ordered patch files still apply to the pinned sources?

The generated workspace is disposable:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
$OMP_AMDGPU_WORKSPACE/
  llvm_src/
  llvm_build/
  llvm_install/
  rocr_src/
  rocr_build/
  rocm-device-libs_src/
  rocm-device-libs_build/
  rocm_install/
  env.sh
  bin/clang-amdgpu-openmp
  bin/clang-gfx803-openmp

The repository stays source-only. Build trees and installs live in the workspace.

A bring-up ladder from hardware discovery to ROCr, LLVM COV4 loading, device libraries, libomptarget discovery, and OpenMP scalar and reduction tests.

Figure 2. Each validation step removes one layer from suspicion. The final OpenMP reduction is useful only after the lower HSA and code-object questions are already answered.

Step 4: Let LLVM Build The AMDGPU Plugin On LoongArch64

LLVM’s OpenMP offload build had a host architecture gate for the AMDGPU plugin. The allowed host list covered the common ROCm platforms:

1
2
3
x86_64
ppc64le
aarch64

LoongArch64 was not in that list. The first LLVM patch adds:

1
loongarch64

The patch is small, but important. Without it, the AMDGPU OpenMP target plugin does not build on this host, and llvm-offload-device-info can only show the host fallback device even if ROCr itself is present.

The tool name is intentional for this LLVM 22 stack: it is built from offload/tools/deviceinfo as llvm-offload-device-info.

This is the cleanest part of the work and the most upstream-shaped part: it is not specific to WX3200. It says that the plugin can be built on a Linux LoongArch64 host.

Step 5: Make ROCr Build And Run On LoongArch64

The ROCr patch series is deliberately host-side and small:

1
2
3
4
patches/rocr/0001-include-cstdint-for-elf-image.patch
patches/rocr/0002-use-portable-host-fences.patch
patches/rocr/0003-use-portable-mutex-pause.patch
patches/rocr/0004-avoid-non-x86-mm-malloc-include.patch

The intent is not to change queue behavior. ROCr 6.4.4 already has the GFX8 queue path we need. The patches only remove non-portable host assumptions:

  • include the standard integer header where the source used fixed-width types;
  • route PCIe fences through helper functions;
  • use C++ atomic fences on non-x86 hosts;
  • replace _mm_pause() with a portable yield fallback;
  • avoid including mm_malloc.h on non-x86 builds.

After that, the first HSA-level test is not OpenMP. It is queue creation.

The important result on this machine was:

1
2
3
4
queue_create(64): HSA_STATUS_SUCCESS
queue_create(128): HSA_STATUS_ERROR_OUT_OF_RESOURCES
queue_create(256): HSA_STATUS_ERROR_OUT_OF_RESOURCES
queue_create(512): HSA_STATUS_ERROR_OUT_OF_RESOURCES

That result became a runtime policy:

1
export LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64

The generated env.sh exports it. This is intentionally not an LLVM source default change. It is a local hardware/runtime configuration.

A raw barrier packet then proves the GPU consumes work:

1
2
gpu=gfx803 queue_max=131072
barrier value=0 read=0 write=1 packet=1

Only after that does it make sense to debug OpenMP.

Step 6: Use COV4 For GFX803

LLVM 22 defaults to newer AMDGPU code object versions. The WX3200 path used here needs code object version 4.

That has three consequences.

First, compile the gfx803 device image with COV4:

1
-mcode-object-version=4

The generated clang-gfx803-openmp wrapper adds this automatically.

Second, LLVM’s offload image checker must accept AMDGPU HSA ELF ABI version 4. The patch changes the accepted AMDGPU HSA ELF ABI versions from:

1
5, 6

to:

1
4, 5, 6

Third, the AMDGPU plugin must use the COV4 implicit-argument size. LLVM’s COV5+ path used a larger implicit argument area:

1
256 bytes

For COV4 the working size is:

1
56 bytes

This became the third LLVM patch. Without it, the runtime can load an image but still launch it with the wrong metadata shape.

The important policy choice is that Clang’s global default stays unchanged. COV4 is selected by the wrapper for this old target.

Step 7: Build Device Libraries Matched To LLVM 22

AMDGPU device compilation links bitcode libraries. The system ROCm bitcode on the machine was too new for LLVM 22 to consume reliably, so the stack keeps a known device-libs snapshot in the source repo:

1
rocm-device-libs/

The setup script copies that snapshot into the workspace, builds it there, and installs bitcode under:

1
$OMP_AMDGPU_WORKSPACE/rocm_install/amdgcn/bitcode

Then env.sh points Clang at that private ROCm prefix:

1
source "$OMP_AMDGPU_WORKSPACE/env.sh"

This makes the compiler, device bitcode, and runtime prefix a coherent local stack instead of a mix of system ROCm and custom LLVM.

Step 8: Verify LLVM Sees The Device

Use the generated environment:

1
2
export OMP_AMDGPU_WORKSPACE=/home/ouankou/Projects/llvm-22
source "$OMP_AMDGPU_WORKSPACE/env.sh"

Then check device discovery:

1
llvm-offload-device-info

This is LLVM 22’s OpenMP offload device-info tool, not a typo for an llvm-omp-device-info binary.

For a single WX3200 setup, the important lines are:

1
2
3
Num Devices: 2
Name: gfx803
Name: Virtual Host Device

If this shows only the host device, check the loader paths first. The AMDGPU plugin dlopens HSA at runtime, so LD_LIBRARY_PATH must include the private ROCr install under the workspace.

Step 9: Run The Small OpenMP Tests

Start with a scalar target region:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
#include <omp.h>
#include <stdio.h>

int main(void) {
  int x = 0;
#pragma omp target map(tofrom : x)
  { x = 42; }

  printf("x=%d devices=%d default=%d\n", x, omp_get_num_devices(),
         omp_get_default_device());
  return x == 42 ? 0 : 1;
}

Compile and run:

1
2
clang-gfx803-openmp -gline-tables-only scalar.c -o /tmp/scalar
OMP_TARGET_OFFLOAD=MANDATORY /tmp/scalar

Expected:

1
x=42 devices=1 default=0

Then run the reduction:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#include <omp.h>
#include <stdio.h>

int main(void) {
  int sum = 0;
#pragma omp target teams distribute parallel for reduction(+ : sum)
  for (int i = 1; i <= 256; ++i)
    sum += i;

  printf("sum=%d devices=%d default=%d\n", sum, omp_get_num_devices(),
         omp_get_default_device());
  return sum == 32896 ? 0 : 1;
}
1
2
clang-gfx803-openmp -gline-tables-only reduction.c -o /tmp/reduction
OMP_TARGET_OFFLOAD=MANDATORY /tmp/reduction

Expected:

1
sum=32896 devices=1 default=0

The user’s original reduction test also became a final smoke test:

1
2
3
4
cd /home/ouankou/Projects/whiteboard/openmp/reduction

clang-gfx803-openmp -gline-tables-only test.c -o /tmp/user-reduction
OMP_TARGET_OFFLOAD=MANDATORY /tmp/user-reduction

Expected:

1
sum = 15050

At that point the stack is end-to-end: Clang emits the device image, the private device libraries link, libomptarget loads the COV4 image, ROCr creates a queue, KFD accepts the packet, and the old GPU runs the OpenMP kernel.

What Was Actually Needed

The first working stack needed fewer moving parts than “ROCm on gfx803” sounds like:

  • a pinned LLVM 22 source baseline;
  • a pinned ROCr 6.4.4 baseline from the ouankou/ROCR-Runtime fork;
  • small ROCr LoongArch64 portability patches;
  • LLVM AMDGPU OpenMP plugin enablement on LoongArch64;
  • COV4 ELF ABI acceptance in libomptarget;
  • COV4 implicit-argument sizing;
  • LLVM-compatible AMD device libraries;
  • LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64;
  • wrapper policy that selects COV4 for gfx803.

The workflow lesson is more important than any single patch: debug from the bottom upward. Hardware visibility, HSA queue creation, raw packet completion, code-object compatibility, device library compatibility, then OpenMP source.

Starting with the OpenMP reduction is good for defining success. It is not good for locating the first failure.