Extending The LLVM OpenMP AMDGPU Stack To GFX906 And Mixed GPUs

Posted on (Updated on )
Adding MI50/gfx906 did not mean replacing the gfx803 stack. The right model was one common OpenMP runtime process, one private ROCr 6.4.4 runtime, and a fat AMDGPU OpenMP binary with per-architecture policy: gfx803 gets COV4, gfx906 keeps the newer default code object ABI. The setup script became architecture-list driven, generated both single-architecture and fat wrappers, and validated mixed execution with OpenMP device(n) target regions launched from host tasks.

The first milestone was one old GPU: WX3200, gfx803, LoongArch64, LLVM 22, and enough ROCr to run OpenMP target regions.

The next question was practical:

What happens when an MI50 is installed next to it?

The MI50 is a very different card from the WX3200:

  • MI50 is Vega 20, reported as gfx906.
  • It supports newer AMDGPU code object ABIs.
  • It does not need the gfx803 COV4 policy.
  • It is much closer to the generation LLVM 22 expects.

That made the extension promising. But there was one important constraint: an OpenMP program does not load one ROCr runtime per GPU architecture. One process loads one HSA runtime, and libomptarget uses that runtime to enumerate and launch work on the visible agents.

So the problem was not “build a separate MI50 stack.” The problem was:

  • keep the working gfx803 path alive;
  • add a gfx906 image to the same application binary;
  • keep COV4 scoped only to gfx803;
  • use one private ROCr runtime that can enumerate both devices;
  • teach users how to select devices explicitly in OpenMP source.

The repository commit that captured this transition was:

1
cd9125f Support mixed AMDGPU OpenMP stack
A mixed AMDGPU OpenMP binary containing a host image, a gfx803 COV4 device image, and a gfx906 default-ABI device image loaded by one libomptarget process.

Figure 1. The mixed setup is one host process and one OpenMP runtime. The binary can carry multiple AMDGPU images, but the process still uses one ROCr runtime at execution time.

Start With Discovery

After installing the MI50, first check what LLVM sees:

1
2
3
4
5
export OMP_AMDGPU_WORKSPACE=/home/ouankou/Projects/llvm-22
source "$OMP_AMDGPU_WORKSPACE/env.sh"

amdgpu-arch
llvm-offload-device-info

For this LLVM 22 stack, the device-info utility is named llvm-offload-device-info because it is built from offload/tools/deviceinfo.

The important amdgpu-arch result is:

1
2
gfx906
gfx803

The important llvm-offload-device-info output shape is:

1
2
3
4
Num Devices: 3
Name: gfx906
Name: gfx803
Name: Virtual Host Device

OpenMP device count is different from this tool’s total because the tool also prints the host plugin. In the mixed AMDGPU machine, the OpenMP program sees two target devices:

1
printf("devices=%d\n", omp_get_num_devices());

Expected:

1
devices=2

Always re-check device order after hardware changes. OpenMP device(0) and device(1) follow runtime enumeration, not the order you wish the cards had.

The Key Design Choice: One Runtime, Per-Arch Images

It is tempting to think of this as two builds:

  • one stack for gfx803;
  • another stack for gfx906.

That is not the model used here.

The maintained model is:

  • one pinned LLVM source baseline;
  • one private ROCr runtime;
  • one private device-library prefix;
  • one OpenMP host runtime;
  • multiple AMDGPU offload images in the application binary.

The common runtime stays ROCr 6.4.4 because the WX3200 needs the legacy GFX8 doorbell path still present there. The MI50 can run through that same runtime, so using ROCr 6.4.4 as the common denominator avoids maintaining two HSA runtime worlds.

The per-architecture part lives at compile time:

1
2
AMDGPU_ARCHES="gfx803 gfx906"
AMDGPU_COV4_ARCHES="gfx803"

That says:

  • build a gfx803 image;
  • build a gfx906 image;
  • apply COV4 only to gfx803;
  • let gfx906 use Clang’s normal newer AMDGPU code object ABI.
OpenMP host tasks launching target regions with device zero and device one, each running on a separate AMD GPU and returning partial results to the host.

Figure 2. OpenMP does not automatically split one target region over two GPUs. The host program has to launch work for each device, usually from host tasks or host threads, and combine results after the target regions finish.

Turn The Script From GFX803-Specific To AMDGPU-Specific

The initial script name encoded the first goal:

1
setup_gfx803_stack.sh

When MI50 entered the system, the script became architecture-list driven. The current maintained entry point is:

1
./setup_openmp_offload_stack.sh "$OMP_AMDGPU_WORKSPACE"

The relevant defaults are:

1
2
3
4
AMDGPU_ARCHES="gfx803 gfx906"
AMDGPU_COV4_ARCHES="gfx803"
ROCR_REPO_URL=https://github.com/ouankou/ROCR-Runtime.git
ROCR_REF=rocm-6.4.4

To build the mixed stack:

1
2
3
4
cd /home/ouankou/Projects/amd-omp-gpu-offloading
export OMP_AMDGPU_WORKSPACE=/home/ouankou/Projects/llvm-22

./setup_openmp_offload_stack.sh "$OMP_AMDGPU_WORKSPACE"

The generated wrappers are:

1
2
3
clang-amdgpu-openmp   # fat binary, default gfx803 + gfx906
clang-gfx803-openmp   # single WX3200 image, COV4
clang-gfx906-openmp   # single MI50 image, default newer ABI

For the mixed wrapper, the manual idea is:

1
2
3
4
5
6
7
"$LLVM_INSTALL/bin/clang" \
  -fopenmp \
  -fopenmp-targets=amdgcn-amd-amdhsa \
  --offload-arch=gfx803 \
  --offload-arch=gfx906 \
  -Xarch_gfx803 -mcode-object-version=4 \
  test.c -o test-mixed

The per-arch forwarding is the important part. A one-architecture command can use -Xarch_device, but a mixed gfx803 + gfx906 command needs COV4 only for the gfx803 device compilation.

This was checked with clang -###. In this LLVM 22 build, -Xarch_gfx803 adds -mcode-object-version=4 only to the -target-cpu gfx803 device job. -Xopenmp-target=amdgcn-amd-amdhsa-gfx803 is unused by the driver, while -Xopenmp-target=amdgcn-amd-amdhsa applies the option to both GPU device jobs.

Verify Single-Region OpenMP First

Start with the same scalar test as the original bring-up:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
#include <omp.h>
#include <stdio.h>

int main(void) {
  int x = 0;
#pragma omp target map(tofrom : x)
  { x = 42; }

  printf("x=%d devices=%d default=%d\n", x, omp_get_num_devices(),
         omp_get_default_device());
  return x == 42 ? 0 : 1;
}

Compile with the fat wrapper:

1
2
3
4
source "$OMP_AMDGPU_WORKSPACE/env.sh"

clang-amdgpu-openmp -gline-tables-only scalar.c -o /tmp/scalar
OMP_TARGET_OFFLOAD=MANDATORY /tmp/scalar

Expected on the mixed machine:

1
x=42 devices=2 default=0

Then run the reduction:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#include <omp.h>
#include <stdio.h>

int main(void) {
  int sum = 0;
#pragma omp target teams distribute parallel for reduction(+ : sum)
  for (int i = 1; i <= 256; ++i)
    sum += i;

  printf("sum=%d devices=%d default=%d\n", sum, omp_get_num_devices(),
         omp_get_default_device());
  return sum == 32896 ? 0 : 1;
}

Expected:

1
sum=32896 devices=2 default=0

This proves that the default OpenMP target device works. It does not yet prove that both physical GPUs can run work in the same process.

Use device(n) To Exercise Both GPUs

OpenMP will not split one target teams distribute parallel for region across both GPUs automatically. If you want both GPUs to do work, launch one target region per device.

The mixed smoke test used by the setup script follows this shape:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
#include <omp.h>
#include <stdio.h>

#define MAX_DEVICES 8

int main(void) {
  int num_devices = omp_get_num_devices();
  int checked = num_devices < MAX_DEVICES ? num_devices : MAX_DEVICES;
  int sums[MAX_DEVICES] = {0};
  int on_device[MAX_DEVICES] = {0};

  if (num_devices < 1) {
    printf("openmp amdgpu smoke: no OpenMP target devices found\n");
    return 1;
  }

#pragma omp parallel
#pragma omp single
  {
    for (int d = 0; d < checked; ++d) {
#pragma omp task firstprivate(d) shared(sums, on_device)
      {
        int local_sum = 0;
        int local_on_device = 0;

#pragma omp target teams distribute parallel for device(d) \
    reduction(+ : local_sum) reduction(max : local_on_device)
        for (int i = 0; i < 16; ++i) {
          local_sum += d * 100 + i;
          local_on_device = !omp_is_initial_device();
        }

        sums[d] = local_sum;
        on_device[d] = local_on_device;
      }
    }
#pragma omp taskwait
  }

  int failed = 0;
  printf("openmp amdgpu smoke: devices=%d checked=%d\n", num_devices, checked);
  for (int d = 0; d < checked; ++d) {
    int expected = 16 * d * 100 + 120;
    printf("  device=%d sum=%d on_device=%d\n", d, sums[d], on_device[d]);
    if (sums[d] != expected || !on_device[d])
      failed = 1;
  }

  return failed;
}

Expected output on the mixed machine:

1
2
3
openmp amdgpu smoke: devices=2 checked=2
  device=0 sum=120 on_device=1
  device=1 sum=1720 on_device=1

The values are intentionally simple:

  • device 0 computes 0 * 100 + 0..15, so the sum is 120;
  • device 1 computes 1 * 100 + 0..15, so the sum is 1720;
  • on_device=1 proves the target region did not fall back to the host.

OMP_TARGET_OFFLOAD=MANDATORY should be set while testing:

1
2
clang-amdgpu-openmp mixed_smoke.c -o /tmp/mixed-smoke
OMP_TARGET_OFFLOAD=MANDATORY /tmp/mixed-smoke

Data Is Per Device

A beginner mistake is to treat two OpenMP devices like two CPU threads sharing one memory space. They are not.

Each target region maps data for one selected device. If the host wants to use both GPUs for one larger problem, the host program should split the input and combine results explicitly.

For a reduction, the pattern is:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
int num_devices = omp_get_num_devices();
int checked = num_devices < MAX_DEVICES ? num_devices : MAX_DEVICES;
int partial[MAX_DEVICES] = {0};

#pragma omp parallel
#pragma omp single
for (int d = 0; d < checked; ++d) {
#pragma omp task firstprivate(d)
  {
    int local = 0;
#pragma omp target teams distribute parallel for device(d) reduction(+ : local)
    for (int i = begin_for_device(d); i < end_for_device(d); ++i)
      local += work(i);
    partial[d] = local;
  }
}

int total = 0;
for (int d = 0; d < checked; ++d)
  total += partial[d];

The runtime does the per-device mapping. The host program owns the domain split and final combine.

What Changed From The GFX803-Only Stack

The mixed support was mostly a refactor of policy, not a new low-level runtime port.

The script stopped assuming one AMDGPU architecture:

1
AMDGPU_ARCHES="gfx803 gfx906"

The COV4 policy became a list instead of a global AMDGPU flag:

1
AMDGPU_COV4_ARCHES="gfx803"

The wrappers became explicit:

1
2
3
clang-amdgpu-openmp
clang-gfx803-openmp
clang-gfx906-openmp

The documentation started teaching device(n) because a mixed binary is useful only if the source can choose where work runs.

What did not change:

  • the stack still uses the forked ROCr 6.4.4 runtime;
  • LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64 remains part of the generated environment;
  • the LLVM 22 device-libs prefix remains private to the workspace;
  • gfx803 still uses COV4;
  • gfx906 does not require the gfx803 COV4 workaround.

Why One Build Is Better Than Two

Separate installs would make the immediate tests easier, but they create the wrong maintenance model. Real OpenMP applications run in one process. That one process needs a coherent runtime view of all visible devices.

The mixed build is closer to the way users will actually compute:

  • compile once;
  • run one host binary;
  • query omp_get_num_devices();
  • choose devices with device(n);
  • split work on the host;
  • combine results on the host.

That is why the second milestone was not “MI50 works alone.” It was:

1
2
3
4
one LLVM OpenMP stack
one ROCr runtime
one fat AMDGPU binary
two AMD GPUs running target regions in one process

For this machine, that turned the initial gfx803 recovery into a usable multi-device OpenMP development stack.