How REX Organizes Generated Host, Device, and Helper Files
After lowering, each offloading input produces a rewritten host file rose_<input>.c, a synthesized device file rex_lib_<input>.cu, and a small set of shared helper/runtime files such as rex_kmp.h, register_cubin.cpp, rex_nvidia.h, rex_nvidia.cu, and xomp_cuda_lib_inlined.cu. The host file contains launch code and offload entries, the device file contains the outlined kernels, and the helper files provide the ABI glue to libomptarget. That split is what makes the source-to-source pipeline buildable, inspectable, and testable.

The previous post in this series focused on target data lowering and the lifetime relationships between mapped data and multiple kernels. That post stayed inside the transformed host code and the runtime calls it emits.
The next natural question is broader:
after REX lowers an input file for GPU offloading, what concrete files actually exist on disk, and why are there several of them instead of one?
This is where the source-to-source nature of REX becomes especially visible.
REX does not lower GPU offloading into one monolithic opaque blob. It emits a set of cooperating source artifacts:
- a rewritten host file, typically visible as rose_<input>.c or rose_<input>.cpp,
- a synthesized device file named rex_lib_<input>.cu,
- and shared helper/runtime files such as rex_kmp.h, register_cubin.cpp, rex_nvidia.h, rex_nvidia.cu, and xomp_cuda_lib_inlined.cu.
That split is not incidental. It is one of the key reasons the GPU offloading pipeline remains debuggable and maintainable. A contributor can inspect the host launch code, inspect the generated kernels, inspect the helper/runtime bridge, and reason about problems at the right layer instead of treating the whole compilation as a black box.
This post zooms into that generated-file model. It stays tightly focused on one section of the original lowering story: what the generated files represent and how they fit together in the final build.
Figure 1. GPU lowering in REX produces a small artifact stack rather than one opaque output. The host file, device file, and helper files each carry a different part of the offloading contract.
Why One Input File Becomes Several Artifacts
It is tempting to imagine that a source-to-source compiler should translate one input file into one output file. That would be simpler to describe, but it would be the wrong model for GPU offloading.
Offloading has at least three distinct responsibilities:
- Host control flow and runtime launch state: the host program still decides when kernels run, what mappings are active, and what arguments are passed to the runtime.
- Device-executable kernel bodies: the transformed target regions must exist as actual GPU kernels in a device translation unit.
- Runtime and toolchain glue: the host and device code have to meet a concrete ABI, register a device image, and share helper definitions that neither the original source nor the runtime can infer automatically.
Trying to flatten all of that into one source artifact would make the lowered output much harder to reason about.
REX instead chooses an explicit split:
- keep the host-side program as a rewritten host translation unit,
- move outlined kernels into a device-specific translation unit,
- and rely on a small shared helper layer to bridge generated code to the LLVM offloading runtime.
That is a source-to-source design decision, not just an implementation detail.
Why this is better than a single giant generated file
Because each artifact has a single dominant responsibility:
- the host file is about runtime sequencing and launch intent,
- the device file is about generated kernels and device-side declarations,
- the helper files are about ABI stability and reusable support code.
That separation makes debugging tractable. If a launch is missing, inspect the host file. If a kernel signature looks wrong, inspect the device file. If registration or wrapper behavior is suspect, inspect the helpers.
It also matches how the downstream build actually works. Clang or another downstream compiler can still compile ordinary host and device sources, and the final executable is assembled through normal compilation and linking steps rather than through a special opaque artifact format.
Artifact 1: The Rewritten Host File
The first artifact is the rewritten host translation unit. In practice, the lowering tests inspect files like:
rose_rodinia_axpy_multi_like.c, rose_rodinia_hotspot_like.c, and rose_rodinia_pathfinder_like.c
That rose_*.c naming is the familiar ROSE-generated host output naming convention, and it remains the main place where a developer reads the transformed host-side program.
What lives in the host file
This file contains:
- the original host program structure, rewritten around the lowered OpenMP regions,
- inserted runtime includes such as rex_kmp.h,
- host-side offload entry declarations,
- explicit launch blocks with __tgt_target_kernel(...),
- explicit data-lifetime blocks with __tgt_target_data_begin/end(...),
- and the startup insertion of rex_offload_init() near the top of main.
The lowering code makes this explicit through insertRTLHeaders(...) and insertAcceleratorInit(...): the former prepends the #include "rex_kmp.h" directive to the rewritten host file, and the latter inserts the rex_offload_init() call near the top of main.
This is a good reminder that the host file is not just “the original file with kernels removed.” It is the primary executable control-flow artifact after lowering.
Why the host file is still the main debugging surface
If something is wrong with:
- the order of runtime calls,
- the number of offload entries,
- the placement of rex_offload_init(),
- repeated calls to a lowered helper,
- or the lifetime of a target data mapping region,
the host file is usually the first place to look.
That is why the lowering_rodinia verifier checks this file so aggressively. For example, it expects:
- exactly one #include "rex_kmp.h" in the host file,
- the right number of __tgt_offload_entry objects,
- the right number of host-side __tgt_target_kernel(...) calls,
- and stable placement of generated includes and comments.
The host file is where those invariants are visible.
Figure 2. The rewritten host file and the generated device file do not duplicate responsibilities. The host file owns runtime sequencing; the device file owns executable kernels.
Artifact 2: The Synthesized Device File
The second artifact is the device translation unit. This is the file that holds the actual outlined GPU kernels.
Its naming is deliberate. Inside generate_outlined_function_file(...), the lowerer derives the new file name from the original input basename, producing rex_lib_<input>.cu.
For GPU offloading, that means files such as:
rex_lib_rodinia_axpy_multi_like.cu, rex_lib_rodinia_hotspot_like.cu, and rex_lib_rodinia_pathfinder_like.cu
The naming convention itself tells you what the file is:
- rex_lib_ means “compiler-generated support or outlined library code,”
- the original input basename tells you which source file it belongs to,
- and .cu tells you it is intended for device compilation.
How kernels get into the device file
The lowerer does not simply print kernels twice. It synthesizes a new source file and moves the outlined functions into it.
That process has two key steps.
First, generate_outlined_function_file(...) creates the new SgSourceFile, names it, disables token-based unparsing because the file is synthesized rather than token-faithful, and inserts the appropriate device-facing header.
Second, move_outlined_function(...) deep-copies the outlined function into the new file, clears the static storage modifier, and turns the declaration left behind in the original file into an extern declaration.
That is the exact mechanism behind the artifact split:
- the host file keeps only the declaration-level interface it still needs,
- the device file gets the actual kernel definition.
Why all kernels from one input end up in one rex_lib_*.cu
The post_processing(...) phase walks target_outlined_function_list, creates one synthesized .cu file for that original source, and moves all of its outlined GPU kernels into it.
This means a multi-kernel input such as rodinia_axpy_multi_like.c generates:
- one rewritten host file,
- one device file containing all three kernels.
That is an excellent fit for source-to-source debugging. If the input source produced three kernels, there is one obvious place to inspect them all.
What else is added to the device file
For C and C++ source inputs, the lowerer also wraps the generated device code with extern "C" linkage guards in the synthesized file. The lowering tests explicitly check that the device file contains exactly one extern "C" block and the right number of __global__ void OUT__... kernels.
That is another sign that this file is meant to be a first-class generated artifact, not a temporary intermediate hidden from users and tests.
Artifact 3: The Shared Helper and Runtime Files
The third artifact category is the shared helper layer. These files are not generated per input source, but they are just as important to the build as the rewritten host file and the generated device file.
The most important ones in the current GPU offloading path are:
- rex_kmp.h
- register_cubin.cpp
- rex_nvidia.h
- rex_nvidia.cu
- xomp_cuda_lib_inlined.cu
You can think of these as the reusable support boundary between compiler-generated source and the downstream runtime/toolchain.
rex_kmp.h: the host/runtime ABI facade
rex_kmp.h is the main host-side ABI header. It vendors the runtime-facing struct definitions and rewrites generated __tgt_* calls through the REX wrapper layer.
That is why the host file must include it. Without this header, generated host code would have to rely on system headers matching the expected offload ABI shape exactly, which is precisely the kind of coupling REX tries to avoid.
In other words:
- the host file is generated per input,
- but rex_kmp.h is the stable contract that lets those generated host files talk to libomptarget consistently.
register_cubin.cpp: the device-image registration bridge
register_cubin.cpp is the translation unit that loads the CUBIN image, builds the __tgt_device_image / __tgt_bin_desc structures, and registers the device image with the runtime.
It also provides rex_offload_init() and the wrapper entry points for:
- rex___tgt_target_kernel,
- rex___tgt_target_data_begin,
- rex___tgt_target_data_end,
- and related calls.
This file is not the main subject of this post, but it is central to understanding why the generated host and device files alone are not enough. The helper layer is what binds them into a runnable offloading program.
rex_nvidia.h and rex_nvidia.cu: device-side support
On the device side, the synthesized .cu file includes rex_nvidia.h rather than rex_kmp.h. That header declares device-side helper symbols and environment types needed by the generated kernels.
The support source rex_nvidia.cu shows the device-side packaging more clearly: it gathers the implementations behind rex_nvidia.h into a single shared device translation unit.
This is the device counterpart to the host helper layer. It packages shared CUDA/OpenMP helper logic that the outlined kernels can rely on without forcing the lowerer to inline every device utility into every generated kernel body.
xomp_cuda_lib_inlined.cu: shared device helper code
This file contains reusable device-side helper logic such as reduction utilities and loop-scheduling support. The fact that it is shared matters: REX does not want every generated .cu file to reinvent the same device helper implementations from scratch.
So the overall helper story is:
- generated host files include host/runtime glue,
- generated device files include device-side glue,
- and the helper translation units and headers absorb the ABI-sensitive or reusable support logic that should not be duplicated in every lowered input.
Figure 3. The helper layer is not an implementation accident. It isolates ABI-sensitive runtime glue and shared device support from the per-input generated source files.
Why This Split Makes Source-To-Source Compilation Practical
By this point, the generated-file model should look less like “a lot of files” and more like “a clean separation of responsibilities.”
That separation buys REX several important properties.
The build still looks like a normal build
The output of lowering is not a custom binary blob that only REX understands. It is a set of ordinary source files and helper files that a downstream compiler can compile in the usual way.
That makes the workflow feel normal:
- compile the rewritten host file,
- compile the generated device file,
- compile or link the helper layer,
- produce an executable plus device image.
For a source-to-source system, that is a major practical win.
Debugging becomes layered instead of monolithic
When something goes wrong, a contributor can ask a much sharper question:
- is the host file missing a launch or a data-lifetime bracket?
- is the device file missing a kernel or carrying the wrong signature?
- is the helper layer registering the image incorrectly or rewriting runtime calls incorrectly?
That is a better debugging model than “somewhere inside one giant generated artifact, something failed.”
Tests can inspect the right file for the right invariant
The reduced Rodinia verifier already does this. It checks:
- rose_*.c for host-side include count, offload entries, kernel calls, repeated host-call structure, and comment placement;
- rex_lib_*.cu for device include count, extern "C" wrapping, kernel count, and hidden launch-environment parameters.
That split is only possible because the artifact model itself is explicit.
Helper evolution stays centralized
If the LLVM offloading ABI drifts or if REX needs to change how device-image registration works, much of that change belongs in the helper layer rather than in every lowering site.
That keeps the compiler-generated files simpler and prevents the lowerer from becoming a pile of duplicated runtime glue.
The Generated-File Model In One Sentence
REX does not lower GPU offloading into one output file because GPU offloading is not one responsibility.
It lowers into:
- a rewritten host program that owns control flow and runtime sequencing,
- a generated device translation unit that owns executable kernels,
- and a shared helper layer that owns ABI glue and reusable support code.
That is what makes the source-to-source model workable in practice. The output is still ordinary source, but it is split along the same boundaries a human debugger would naturally care about.
The next post can now move to the final major slice of the original lowering story: how REX tests GPU lowering invariants with reduced Rodinia-style inputs, and why those invariant-driven checks are more valuable than brittle full-file golden outputs.