How REX Organizes Generated Host, Device, and Helper Files
After lowering, each offloading input produces a rewritten host file rose_<input>.c, a synthesized device file rex_lib_<input>.cu, and a small set of shared helper/runtime files such as rex_kmp.h, register_cubin.cpp, rex_nvidia.h, rex_nvidia.cu, and xomp_cuda_lib_inlined.cu. The host file contains launch code and offload entries, the device file contains the outlined kernels, and the helper files provide the ABI glue to libomptarget. That split is what makes the source-to-source pipeline buildable, inspectable, and testable.

The previous post in this series focused on target data lowering and the lifetime relationships between mapped data and multiple kernels. That post stayed inside the transformed host code and the runtime calls it emits.
The next natural question is broader:
after REX lowers an input file for GPU offloading, what concrete files actually exist on disk, and why are there several of them instead of one?
This is where the source-to-source nature of REX becomes especially visible.
REX does not lower GPU offloading into one monolithic opaque blob. It emits a set of cooperating source artifacts:
- a rewritten host file, typically visible as rose_<input>.c or rose_<input>.cpp,
- a synthesized device file named rex_lib_<input>.cu,
- and shared helper/runtime files such as rex_kmp.h, register_cubin.cpp, rex_nvidia.h, rex_nvidia.cu, and xomp_cuda_lib_inlined.cu.
That split is not incidental. It is one of the key reasons the GPU offloading pipeline remains debuggable and maintainable. A contributor can inspect the host launch code, inspect the generated kernels, inspect the helper/runtime bridge, and reason about problems at the right layer instead of treating the whole compilation as a black box.
This post zooms into that generated-file model. It stays tightly focused on one section of the original lowering story: what the generated files represent and how they fit together in the final build.
Figure 1. GPU lowering in REX produces a small artifact stack rather than one opaque output. The host file, device file, and helper files each carry a different part of the offloading contract.
Why One Input File Becomes Several Artifacts
It is tempting to imagine that a source-to-source compiler should translate one input file into one output file. That would be simpler to describe, but it would be the wrong model for GPU offloading.
Offloading has at least three distinct responsibilities:
- Host control flow and runtime launch state: the host program still decides when kernels run, what mappings are active, and what arguments are passed to the runtime.
- Device-executable kernel bodies: the transformed target regions must exist as actual GPU kernels in a device translation unit.
- Runtime and toolchain glue: the host and device code have to meet a concrete ABI, register a device image, and share helper definitions that neither the original source nor the runtime can infer automatically.
Trying to flatten all of that into one source artifact would make the lowered output much harder to reason about.
REX instead chooses an explicit split:
- keep the host-side program as a rewritten host translation unit,
- move outlined kernels into a device-specific translation unit,
- and rely on a small shared helper layer to bridge generated code to the LLVM offloading runtime.
That is a source-to-source design decision, not just an implementation detail.
Why this is better than a single giant generated file
Because each artifact has a single dominant responsibility:
- the host file is about runtime sequencing and launch intent,
- the device file is about generated kernels and device-side declarations,
- the helper files are about ABI stability and reusable support code.
That separation makes debugging tractable. If a launch is missing, inspect the host file. If a kernel signature looks wrong, inspect the device file. If registration or wrapper behavior is suspect, inspect the helpers.
It also matches how the downstream build actually works. Clang or another downstream compiler can still compile ordinary host and device sources, and the final executable is assembled through normal compilation and linking steps rather than through a special opaque artifact format.
Artifact 1: The Rewritten Host File
The first artifact is the rewritten host translation unit. In practice, the lowering tests inspect files like:
rose_rodinia_axpy_multi_like.c, rose_rodinia_hotspot_like.c, and rose_rodinia_pathfinder_like.c
That rose_*.c naming is the familiar ROSE-generated host output naming convention, and it remains the main place where a developer reads the transformed host-side program.
What lives in the host file
This file contains:
- the original host program structure, rewritten around the lowered OpenMP regions,
- inserted runtime includes such as rex_kmp.h,
- host-side offload entry declarations,
- explicit launch blocks with __tgt_target_kernel(...),
- explicit data-lifetime blocks with __tgt_target_data_begin/end(...),
- and the startup insertion of rex_offload_init() near the top of main.
The lowering code makes this explicit through insertRTLHeaders(...) and insertAcceleratorInit(...): the former prepends the #include "rex_kmp.h" directive to the rewritten host file, and the latter inserts the rex_offload_init() call near the top of main.
This is a good reminder that the host file is not just “the original file with kernels removed.” It is the primary executable control-flow artifact after lowering.
Why the host file is still the main debugging surface
If something is wrong with:
- the order of runtime calls,
- the number of offload entries,
- the placement of rex_offload_init(),
- repeated calls to a lowered helper,
- or the lifetime of a target data mapping region,
the host file is usually the first place to look.
That is why the lowering_rodinia verifier checks this file so aggressively. For example, it expects:
- exactly one #include "rex_kmp.h" in the host file,
- the right number of __tgt_offload_entry objects,
- the right number of host-side __tgt_target_kernel(...) calls,
- and stable placement of generated includes and comments.
The host file is where those invariants are visible.
Figure 2. The rewritten host file and the generated device file do not duplicate responsibilities. The host file owns runtime sequencing; the device file owns executable kernels.
Artifact 2: The Synthesized Device File
The second artifact is the device translation unit. This is the file that holds the actual outlined GPU kernels.
Its naming is deliberate. Inside generate_outlined_function_file(...), the lowerer derives the new file name from the original input basename, producing rex_lib_<input>.cu.
For GPU offloading, that means files such as:
rex_lib_rodinia_axpy_multi_like.cu, rex_lib_rodinia_hotspot_like.cu, and rex_lib_rodinia_pathfinder_like.cu
The naming convention itself tells you what the file is:
- rex_lib_ means “compiler-generated support or outlined library code,”
- the original input basename tells you which source file it belongs to,
- and .cu tells you it is intended for device compilation.
How kernels get into the device file
The lowerer does not simply print kernels twice. It synthesizes a new source file and moves the outlined functions into it.
That process has two key steps.
First, generate_outlined_function_file(...) creates the new SgSourceFile, names it, disables token-based unparsing because the file is synthesized rather than token-faithful, and inserts the appropriate device-facing header.
Second, move_outlined_function(...) deep-copies the outlined function into the new file, clears the static storage modifier, and turns the declaration left behind in the original file into an extern declaration.
That is the exact mechanism behind the artifact split:
- the host file keeps only the declaration-level interface it still needs,
- the device file gets the actual kernel definition.
Why all kernels from one input end up in one rex_lib_*.cu
The post_processing(...) phase walks target_outlined_function_list, creates one synthesized .cu file for that original source, and moves all of its outlined GPU kernels into it.
This means a multi-kernel input such as rodinia_axpy_multi_like.c generates:
- one rewritten host file,
- one device file containing all three kernels.
That is an excellent fit for source-to-source debugging. If the input source produced three kernels, there is one obvious place to inspect them all.
What else is added to the device file
For C and C++ source inputs, the lowerer also wraps the generated device code with extern "C" linkage guards in the synthesized file. The lowering tests explicitly check that the device file contains exactly one extern "C" block and the right number of __global__ void OUT__... kernels.
That is another sign that this file is meant to be a first-class generated artifact, not a temporary intermediate hidden from users and tests.
Artifact 3: The Shared Helper and Runtime Files
The third artifact category is the shared helper layer. These files are not generated per input source, but they are just as important to the build as the rewritten host file and the generated device file.
The most important ones in the current GPU offloading path are:
- rex_kmp.h
- register_cubin.cpp
- rex_nvidia.h
- rex_nvidia.cu
- xomp_cuda_lib_inlined.cu
You can think of these as the reusable support boundary between compiler-generated source and the downstream runtime/toolchain.
rex_kmp.h: the host/runtime ABI facade
rex_kmp.h is the main host-side ABI header. It vendors the runtime-facing struct definitions and rewrites generated __tgt_* calls through the REX wrapper layer.
That is why the host file must include it. Without this header, generated host code would have to rely on system headers matching the expected offload ABI shape exactly, which is precisely the kind of coupling REX tries to avoid.
In other words:
- the host file is generated per input,
- but rex_kmp.h is the stable contract that lets those generated host files talk to libomptarget consistently.
register_cubin.cpp: the device-image registration bridge
register_cubin.cpp is the translation unit that loads the CUBIN image, builds the __tgt_device_image / __tgt_bin_desc structures, and registers the device image with the runtime.
It also provides rex_offload_init() and the wrapper entry points for:
- rex___tgt_target_kernel,
- rex___tgt_target_data_begin,
- rex___tgt_target_data_end,
- and related calls.
This file is not the main subject of this post, but it is central to understanding why the generated host and device files alone are not enough. The helper layer is what binds them into a runnable offloading program.
rex_nvidia.h and rex_nvidia.cu: device-side support
On the device side, the synthesized .cu file includes rex_nvidia.h rather than rex_kmp.h. That header declares device-side helper symbols and environment types needed by the generated kernels.
The support source rex_nvidia.cu shows the device-side packaging more clearly: it gathers the implementations behind rex_nvidia.h into a single shared device translation unit.
This is the device counterpart to the host helper layer. It packages shared CUDA/OpenMP helper logic that the outlined kernels can rely on without forcing the lowerer to inline every device utility into every generated kernel body.
xomp_cuda_lib_inlined.cu: shared device helper code
This file contains reusable device-side helper logic such as reduction utilities and loop-scheduling support. The fact that it is shared matters: REX does not want every generated .cu file to reinvent the same device helper implementations from scratch.
So the overall helper story is:
- generated host files include host/runtime glue,
- generated device files include device-side glue,
- and the helper translation units and headers absorb the ABI-sensitive or reusable support logic that should not be duplicated in every lowered input.
Figure 3. The helper layer is not an implementation accident. It isolates ABI-sensitive runtime glue and shared device support from the per-input generated source files.
Why This Split Makes Source-To-Source Compilation Practical
By this point, the generated-file model should look less like “a lot of files” and more like “a clean separation of responsibilities.”
That separation buys REX several important properties.
The build still looks like a normal build
The output of lowering is not a custom binary blob that only REX understands. It is a set of ordinary source files and helper files that a downstream compiler can compile in the usual way.
That makes the workflow feel normal:
- compile the rewritten host file,
- compile the generated device file,
- compile or link the helper layer,
- produce an executable plus device image.
For a source-to-source system, that is a major practical win.
Debugging becomes layered instead of monolithic
When something goes wrong, a contributor can ask a much sharper question:
- is the host file missing a launch or a data-lifetime bracket?
- is the device file missing a kernel or carrying the wrong signature?
- is the helper layer registering the image incorrectly or rewriting runtime calls incorrectly?
That is a better debugging model than “somewhere inside one giant generated artifact, something failed.”
Tests can inspect the right file for the right invariant
The reduced Rodinia verifier already does this. It checks:
- rose_*.c for host-side include count, offload entries, kernel calls, repeated host-call structure, and comment placement;
- rex_lib_*.cu for device include count, extern "C" wrapping, kernel count, and hidden launch-environment parameters.
That split is only possible because the artifact model itself is explicit.
Helper evolution stays centralized
If the LLVM offloading ABI drifts or if REX needs to change how device-image registration works, much of that change belongs in the helper layer rather than in every lowering site.
That keeps the compiler-generated files simpler and prevents the lowerer from becoming a pile of duplicated runtime glue.
The Generated-File Model In One Sentence
REX does not lower GPU offloading into one output file because GPU offloading is not one responsibility.
It lowers into:
- a rewritten host program that owns control flow and runtime sequencing,
- a generated device translation unit that owns executable kernels,
- and a shared helper layer that owns ABI glue and reusable support code.
That is what makes the source-to-source model workable in practice. The output is still ordinary source, but it is split along the same boundaries a human debugger would naturally care about.
The next post can now move to the final major slice of the original lowering story: how REX tests GPU lowering invariants with reduced Rodinia-style inputs, and why those invariant-driven checks are more valuable than brittle full-file golden outputs.