Extending The LLVM OpenMP AMDGPU Stack To GFX906 And Mixed GPUs
The first milestone was one old GPU: WX3200, gfx803, LoongArch64, LLVM 22,
and enough ROCr to run OpenMP target regions.
The next question was practical:
What happens when an MI50 is installed next to it?
The MI50 is a very different card from the WX3200:
- MI50 is Vega 20, reported as gfx906.
- It supports newer AMDGPU code object ABIs.
- It does not need the gfx803 COV4 policy.
- It is much closer to the generation LLVM 22 expects.
That made the extension promising. But there was one important constraint: an
OpenMP program does not load one ROCr runtime per GPU architecture. One process
loads one HSA runtime, and libomptarget uses that runtime to enumerate and
launch work on the visible agents.
So the problem was not “build a separate MI50 stack.” The problem was:
- keep the working gfx803 path alive;
- add a gfx906 image to the same application binary;
- keep COV4 scoped only to gfx803;
- use one private ROCr runtime that can enumerate both devices;
- teach users how to select devices explicitly in OpenMP source.
The repository commit that captured this transition was:
| |
Figure 1. The mixed setup is one host process and one OpenMP runtime. The binary can carry multiple AMDGPU images, but the process still uses one ROCr runtime at execution time.
Start With Discovery
After installing the MI50, first check what LLVM sees:
| |
For this LLVM 22 stack, the device-info utility is named
llvm-offload-device-info because it is built from offload/tools/deviceinfo.
The important amdgpu-arch result is:
| |
The important llvm-offload-device-info output shape is:
| |
The OpenMP device count differs from this tool’s total because the tool also prints the host plugin. In the mixed AMDGPU machine, the OpenMP program sees two target devices:
| |
Expected:
| |
Always re-check device order after hardware changes. OpenMP device(0) and
device(1) follow runtime enumeration, not the order you wish the cards had.
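One way to re-check the order from inside OpenMP itself is a small probe like the sketch below. It is not the utility or the test used by this setup; the function names are hypothetical, and it only relies on the standard omp_get_num_devices() and omp_is_initial_device() calls. The _OPENMP guards let the file compile without OpenMP, in which case it reports zero devices.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of OpenMP target devices; 0 when built without OpenMP. */
int target_device_count(void) {
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Returns 1 if a target region sent to device `dev` ran off the host,
 * 0 if it fell back to the host (or OpenMP is disabled). */
int runs_on_device(int dev) {
    int on_host = 1;
#ifdef _OPENMP
    #pragma omp target device(dev) map(from: on_host)
    { on_host = omp_is_initial_device(); }
#else
    (void)dev;
#endif
    return !on_host;
}

/* Print one line per device in runtime enumeration order.
 * Returns the number of devices that were probed. */
int report_devices(void) {
    int n = target_device_count();
    printf("num_devices=%d\n", n);
    for (int d = 0; d < n; ++d)
        printf("device=%d on_device=%d\n", d, runs_on_device(d));
    return n;
}
```

Calling report_devices() from main after a hardware change shows how many devices the runtime enumerates and whether each index really offloads. It does not say which physical card sits behind each index; that still has to be read from the device-info tool.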
The Key Design Choice: One Runtime, Per-Arch Images
It is tempting to think of this as two builds:
- one stack for gfx803;
- another stack for gfx906.
That is not the model used here.
The maintained model is:
- one pinned LLVM source baseline;
- one private ROCr runtime;
- one private device-library prefix;
- one OpenMP host runtime;
- multiple AMDGPU offload images in the application binary.
The common runtime stays ROCr 6.4.4 because the WX3200 needs the legacy GFX8 doorbell path still present there. The MI50 can run through that same runtime, so using ROCr 6.4.4 as the common denominator avoids maintaining two HSA runtime worlds.
The per-architecture part lives at compile time:
| |
That says:
- build a gfx803 image;
- build a gfx906 image;
- apply COV4 only to gfx803;
- let gfx906 use Clang’s normal newer AMDGPU code object ABI.
Figure 2. OpenMP does not automatically split one target region over two GPUs. The host program has to launch work for each device, usually from host tasks or host threads, and combine results after the target regions finish.
Turn The Script From GFX803-Specific To AMDGPU-Specific
The initial script name encoded the first goal:
| |
When MI50 entered the system, the script became architecture-list driven. The current maintained entry point is:
| |
The relevant defaults are:
| |
To build the mixed stack:
| |
The generated wrappers are:
| |
For the mixed wrapper, the manual idea is:
| |
The per-arch forwarding is the important part. A one-architecture command can
use -Xarch_device, but a mixed gfx803 + gfx906 command needs COV4 only for
the gfx803 device compilation.
This was checked with clang -###. In this LLVM 22 build, -Xarch_gfx803
adds -mcode-object-version=4 only to the -target-cpu gfx803 device job.
-Xopenmp-target=amdgcn-amd-amdhsa-gfx803 is unused by the driver, while
-Xopenmp-target=amdgcn-amd-amdhsa applies the option to both GPU device jobs.
Verify Single-Region OpenMP First
Start with the same scalar test as the original bring-up:
| |
Compile with the fat wrapper:
| |
Expected on the mixed machine:
| |
Then run the reduction:
| |
Expected:
| |
This proves that the default OpenMP target device works. It does not yet prove that both physical GPUs can run work in the same process.
Use device(n) To Exercise Both GPUs
OpenMP will not split one target teams distribute parallel for region across
both GPUs automatically. If you want both GPUs to do work, launch one target
region per device.
The mixed smoke test used by the setup script follows this shape:
| |
Expected output on the mixed machine:
| |
The values are intentionally simple:
- device 0 computes 0 * 100 + 0..15, so the sum is 120;
- device 1 computes 1 * 100 + 0..15, so the sum is 1720;
- on_device=1 proves the target region did not fall back to the host.
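The arithmetic above can be reproduced with a sketch along these lines. This is not the smoke test shipped with the setup script. The names run_on_device, run_all, expected_sum, and device_result are hypothetical; only the per-device formula (device * 100 + i for i in 0..15) and the on_device flag come from the description above. It launches one target region per device from host tasks.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 16

typedef struct {
    long sum;       /* reduction result for this device */
    int on_device;  /* 1 if the probe region ran off the host */
} device_result;

/* Closed form of the per-device sum: N * dev * 100 + (0 + 1 + ... + N-1). */
long expected_sum(int dev) {
    return (long)N * dev * 100 + (long)N * (N - 1) / 2;
}

/* Run one target region on device `dev` and report where it executed. */
device_result run_on_device(int dev) {
    device_result r = { 0, 0 };
    long sum = 0;
    int on_host = 1;

    #pragma omp target teams distribute parallel for device(dev) \
            map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < N; ++i)
        sum += (long)dev * 100 + i;

#ifdef _OPENMP
    /* Separate tiny region: did work for this device leave the host? */
    #pragma omp target device(dev) map(from: on_host)
    { on_host = omp_is_initial_device(); }
#endif

    r.sum = sum;
    r.on_device = !on_host;
    return r;
}

/* Launch one host task per device so the target regions can overlap. */
void run_all(device_result *out, int ndev) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int d = 0; d < ndev; ++d) {
            #pragma omp task firstprivate(d) shared(out)
            out[d] = run_on_device(d);
        }
        #pragma omp taskwait
    }
}
```

On the mixed machine, calling run_all with the value of omp_get_num_devices() and comparing each out[d].sum against expected_sum(d) should give 120 and 1720. Without offload, the sums are still correct because the regions fall back to the host, which is why the on_device flag has to be checked separately.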
OMP_TARGET_OFFLOAD=MANDATORY should be set while testing:
| |
Data Is Per Device
A beginner mistake is to treat two OpenMP devices like two CPU threads sharing one memory space. They are not.
Each target region maps data for one selected device. If the host wants to use
both GPUs for one larger problem, the host program should split the input and
combine results explicitly.
For a reduction, the pattern is:
| |
The runtime does the per-device mapping. The host program owns the domain split and final combine.
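A minimal sketch of that division of labour, assuming a plain sum over a double array: the host picks a contiguous slice for each device, each target region maps only its own slice, and the host adds the partial results. The function names are hypothetical, and the even split is an illustrative choice. On cards as different as the WX3200 and the MI50, a real program would probably weight the slices.

```c
#include <stddef.h>

/* Sum x[0..n) on one device. Only this slice is mapped to that device. */
double sum_slice_on_device(const double *x, size_t n, int dev) {
    double s = 0.0;
    #pragma omp target teams distribute parallel for device(dev) \
            map(to: x[0:n]) map(tofrom: s) reduction(+: s)
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Host-side split and combine: one contiguous slice per device. */
double sum_across_devices(const double *x, size_t n, int ndev) {
    double total = 0.0;
    if (ndev < 1)
        ndev = 1;  /* device 0 is the host when no target device exists */

    /* One host thread per device launches that device's target region. */
    #pragma omp parallel for num_threads(ndev) reduction(+: total)
    for (int d = 0; d < ndev; ++d) {
        size_t lo = n * (size_t)d / (size_t)ndev;
        size_t hi = n * (size_t)(d + 1) / (size_t)ndev;
        total += sum_slice_on_device(x + lo, hi - lo, d);
    }
    return total;
}
```

Passing omp_get_num_devices() as ndev uses every visible device. Nothing is shared between the two GPUs: each one sees only the slice mapped for its own region, and the combine happens in host memory.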
What Changed From The GFX803-Only Stack
The mixed support was mostly a refactor of policy, not a new low-level runtime port.
The script stopped assuming one AMDGPU architecture:
| |
The COV4 policy became a list instead of a global AMDGPU flag:
| |
The wrappers became explicit:
| |
The documentation started teaching device(n) because a mixed binary is useful
only if the source can choose where work runs.
What did not change:
- the stack still uses the forked ROCr 6.4.4 runtime;
- LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE=64 remains part of the generated environment;
- the LLVM 22 device-libs prefix remains private to the workspace;
- gfx803 still uses COV4;
- gfx906 does not require the gfx803 COV4 workaround.
Why One Build Is Better Than Two
Separate installs would make the immediate tests easier, but they create the wrong maintenance model. Real OpenMP applications run in one process. That one process needs a coherent runtime view of all visible devices.
The mixed build is closer to the way users will actually compute:
- compile once;
- run one host binary;
- query omp_get_num_devices();
- choose devices with device(n);
- split work on the host;
- combine results on the host.
That is why the second milestone was not “MI50 works alone.” It was:
| |
For this machine, that turned the initial gfx803 recovery into a usable
multi-device OpenMP development stack.