How To Debug When REX GPU Offloading Builds But Does Not Run

When a REX-lowered application compiles but GPU offloading fails at runtime, the fastest way to debug it is to follow the same boundaries that the lowerer uses. First check whether the CUBIN exists and can be loaded. Then check whether the host binary and lowered host source actually contain a valid omp_offloading_entries table. Next confirm that the offload-entry kernel names match the generated device kernels. Then verify that rex_offload_init() runs early enough. If all of that is correct and execution is still wrong, inspect the generated map arrays and __kernel_args, because many apparent CUDA runtime failures are really mapping-shape bugs. REX’s current invariant suite already automates parts of this checklist, which is a good signal that these are the failure classes that matter most in practice.

The previous posts in this series split the runtime boundary into focused pieces:

  • how REX registers CUBIN images with libomptarget,
  • how rex_kmp.h rewrites generated runtime calls,
  • and why the lowerer inserts rex_offload_init() eagerly but avoids automatic rex_offload_fini().

Those posts explain the design.

This one is about failure handling.

Sometimes a REX-lowered program reaches a frustrating state:

  • the source lowers correctly,
  • the host and device files compile,
  • the final executable links,
  • and then GPU offloading still does not work.

That kind of failure is easy to misdiagnose because it feels like “a CUDA runtime problem” or “an LLVM issue” in the abstract.

In practice, the fastest path is much more concrete:

debug the failure in the same order that REX builds the runtime contract.

That is the right mental model because REX’s GPU path is layered:

  • generated host file,
  • generated device file,
  • helper/runtime layer,
  • offloading runtime ABI.

If something breaks, it usually breaks at one of those boundaries, not everywhere at once.

This post turns the brief checklist from the older runtime-glue overview into a focused troubleshooting guide. It stays practical and narrow:

  • what to inspect first,
  • which generated artifacts answer which questions,
  • which commands are useful,
  • and what the current test suite already checks for you.

[Figure: a ladder of debugging steps: confirm the CUBIN exists, confirm host offload entries exist, confirm kernel-name matching, confirm rex_offload_init placement, then inspect map arrays and kernel arguments.]

Figure 1. Runtime triage goes faster when you debug in contract order. Start with image presence and entry identity before you spend time on deeper mapping or execution details.

Why Runtime Failures Need A Layered Checklist

When an offloaded program fails only at runtime, the temptation is to jump straight to the most dramatic explanation:

  • the driver is broken,
  • LLVM changed behavior,
  • CUDA rejected the launch,
  • or the runtime plugin is buggy.

Those explanations are sometimes true, but they are poor first steps.

A REX-lowered program already exposes much more precise evidence:

  • the generated host file shows the launch sequence and mapping packet,
  • the generated device file shows the kernel names and signatures,
  • the helper layer shows how registration and runtime calls are wired together,
  • and the final build artifacts show whether the expected image and symbols even exist.

That is why the checklist in this post is ordered the way it is.

The earlier checks answer cheap, structural questions:

  • is the image there?
  • did the program register anything?
  • do the host and device identities match?

The later checks answer more semantic questions:

  • did init happen at the right time?
  • are the map arrays and packet fields actually correct?

If you reverse that order, you often waste time reading complex lowered packet code before proving that the program even has a valid registered image to launch.

Step 1: Check Whether The CUBIN Exists And Is Reachable

The first question is the simplest:

is register_cubin.cpp even able to read the device image it expects?

The default filename is rex_lib_nvidia.cubin, controlled by REX_CUBIN_NAME in register_cubin.cpp:

#ifndef REX_CUBIN_NAME
#define REX_CUBIN_NAME "rex_lib_nvidia.cubin"
#endif

And the load path is not complicated:

bool readFile(const char *filename, std::vector<unsigned char> &buffer) {
  FILE *file = fopen(filename, "rb");
  ...
}

So if the file is missing, unreadable, or not where the executable expects it, registration fails before any kernel can launch.

That is why the first debugging step should be boring and explicit:

ls -l rex_lib_nvidia.cubin

If you are debugging inside a benchmark directory with generated artifacts, also check that the file belongs to the right lowered build rather than an older leftover artifact.

This is especially important in source-to-source workflows, because the host executable and the device image are separate artifacts. You can absolutely end up with:

  • a fresh host build,
  • and a stale or missing CUBIN.

If the file is not present, fix that first. Do not debug launch packets yet.

This failure mode is easy to miss because the linker does not need the CUBIN file. The CUBIN is loaded at runtime, not linked into the host executable.

So “the binary linked successfully” tells you almost nothing about whether the image will actually be found when rex_offload_init() or a safe wrapper tries to register it.

That is why this check comes first.

Step 2: Check That The Host Side Actually Contains Offload Entries

If the CUBIN exists, the next question is whether the host side describes anything meaningful for the runtime to register.

REX’s lowerer emits a __tgt_offload_entry for each outlined kernel and places it into the omp_offloading_entries section:

SgExprListExp *offload_entry_parameters = buildExprListExp(
    buildCastExp(buildAddressOfOp(buildVarRefExp(outlined_kernel_id_decl)),
                 buildPointerType(buildVoidType())),
    buildStringVal(func_name + "kernel__"), buildIntVal(0), buildIntVal(0),
    buildIntVal(0));
...
offload_entry_decl->set_gnu_attribute_section_name("omp_offloading_entries");

register_cubin.cpp later uses the section boundaries:

extern struct __tgt_offload_entry __start_omp_offloading_entries;
extern struct __tgt_offload_entry __stop_omp_offloading_entries;

So if the lowerer never emitted the section correctly, or the build stripped or mangled it, the runtime may register an empty or meaningless entry range even though the CUBIN bytes are fine.

What to inspect

Start with the lowered host source.

You should see:

  • one char OUT__...__id__ = 0; symbol per kernel,
  • one struct __tgt_offload_entry OUT__...omp_offload_entry__ per kernel,
  • and those entries should be associated with omp_offloading_entries.

The lowered host file is often the easiest proof because it is fully readable:

rg -n 'omp_offload_entry__|omp_offloading_entries|__id__' rose_*.c

If the generated host file does not contain those declarations, the runtime registration path is already broken before you even think about plugins or launch calls.

If you want to inspect the final binary too, a command like this is useful:

readelf -sW ./your_binary | rg 'omp_offloading_entries|OUT__.*omp_offload_entry__|__start_omp_offloading_entries|__stop_omp_offloading_entries'

The exact linker view varies by platform and toolchain, but the goal is the same: prove that the host side actually carries a non-empty offload-entry range.

Step 3: Confirm That Host Entry Names Match The Generated Device Kernels

If the entry table exists, the next likely failure class is name mismatch.

The host offload entry records a kernel name string:

buildStringVal(func_name + "kernel__")

And the lowerer also generates the actual device kernel definitions using that naming scheme:

Outliner::generateFunction(body_block, func_name + "kernel__", all_syms, ...)

So there is a very specific contract:

  • the host entry name in rose_*.c,
  • and the generated device kernel name in rex_lib_*.cu

must match.

If they do not, the runtime/plugin may register the image successfully but still fail to resolve the requested kernel symbol when a launch occurs.

What to inspect

Compare the two generated sources directly:

rg -n 'kernel__|omp_offload_entry__' rose_*.c rex_lib_*.cu

You are looking for the same stem appearing on both sides:

  • host offload entry name string,
  • host kernel-id symbol,
  • device __global__ void OUT__...kernel__(...)

If you see a mismatch, fix that before going deeper.

This class of bug is especially plausible during toolchain migration or helper refactoring, because the failure can look like “runtime launch failed” even though the actual cause is just that the host asked for a kernel name the device image does not contain.

[Figure: a side-by-side comparison showing a host offload entry carrying a kernel__ name string and a device rex_lib file defining a matching __global__ kernel with the same suffix.]

Figure 2. The host and device sides meet through identity, not only through data. If the offload-entry name and generated kernel name diverge, registration can still appear fine while launches fail later.

Step 4: Verify That rex_offload_init() Runs Early Enough

If the image exists and the host/device identities line up, the next question is whether initialization actually happened before the first offload use.

For normal generated programs, REX intends this to be explicit:

SgExprStatement *expStmt = buildFunctionCallStmt(
    SgName("rex_offload_init"), buildVoidType(), NULL, currentscope);
prependStatement(expStmt, currentscope);

And the current invariant suite checks this for reduced Rodinia-style cases:

expect_count "${rose_file}" 'rex_offload_init[[:space:]]*\(' 1 "host offload init count"
...
(( init_line < time0_line )) || die "rex_offload_init moved after timer declaration"

So when you debug a runtime failure, inspect the lowered host source and make sure:

  • rex_offload_init() appears exactly once,
  • it appears near the top of main,
  • and it appears before any timing initialization you meant to exclude from offload startup cost.

Use a quick search first:

rg -n 'rex_offload_init|clock\(' rose_*.c

Why this matters even though safe wrappers exist

The safe wrappers in register_cubin.cpp can register lazily:

if (register_cubin(REX_CUBIN_NAME) == nullptr) {
  return -1;
}

So a missing early init does not necessarily mean “offloading can never work.”

But it does mean:

  • startup behavior is no longer what the lowerer intended,
  • the first offload site now pays registration cost,
  • and a custom build or manual edit may have dropped the generated lifecycle contract.

If the first launch fails and rex_offload_init() is missing or misplaced, fix that first. It is a cleaner root cause than speculating about later runtime behavior.

Step 5: If Registration Looks Fine, Inspect Map Arrays And __tgt_kernel_arguments

Once you have proven:

  • the CUBIN exists,
  • the offload-entry table exists,
  • host and device kernel identities match,
  • and init placement is sane,

the next likely failure class is no longer “registration failure.”

It is often a launch-packet or mapping bug.

This is where many apparent “CUDA runtime failures” actually come from.

The host launch block materializes:

  • __args_base
  • __args
  • __arg_sizes
  • __arg_types
  • __arg_num
  • and finally __kernel_args

before calling:

__tgt_target_kernel(__device_id, _num_blocks_, _threads_per_block_,
                    __host_ptr, &__kernel_args);

So if registration is correct but the runtime still faults or the kernel behaves incorrectly, inspect the generated packet values.

What to inspect in the lowered host file

Look for:

  • obviously wrong argument counts,
  • zero sizes where a mapped region should have nonzero extent,
  • wrong map-type bits,
  • missing implicit pointer entries,
  • or suspicious literal-parameter packing where ordinary mapping was expected.

A quick starting point is:

rg -n '__args_base|__args|__arg_sizes|__arg_types|__kernel_args|__tgt_kernel_arguments' rose_*.c

If dynamic mapping is involved, also check whether the compiler took the dynamic path and emitted heap-backed arrays plus populate loops instead of pretending a runtime-sized map could be expressed statically.

This is exactly why REX has separate posts and tests for:

  • literal target parameters,
  • dynamic mapper expansion,
  • and runtime packet construction.

Those are not “nice to know” internals. They are the concrete places where registration can be fully correct and execution can still fail.

Why this class of bug often looks like a runtime or CUDA error

Because from the runtime’s perspective, it is receiving a bad contract.

If:

  • sizes are wrong,
  • types are wrong,
  • or addresses are wrong,

then the plugin/runtime may report a low-level launch or memory error even though the true bug lives in the generated host-side packet.

So once the image and identity checks pass, stop treating the failure as “probably still registration.” At that point, the packet is often the better suspect.

What The Current Test Suite Already Checks For You

One useful part of the REX workflow is that several items in this checklist are already covered by the lowering invariant suite.

The reduced Rodinia-style verifier checks host-side invariants such as:

  • exactly one #include "rex_kmp.h",
  • exactly one rex_offload_init(),
  • zero rex_offload_fini() insertions,
  • the expected number of __tgt_offload_entry objects,
  • the expected number of kernel ID symbols,
  • and the expected number of __tgt_target_kernel(...) launch sites.

That means some of the easiest structural mistakes are already testable before you even run the program.

The suite README explicitly calls out the kinds of regressions it is meant to catch:

  • offload-entry integrity,
  • init ordering,
  • multi-kernel lowering shape,
  • and map-list lowering integrity.

That matters because it tells you something about the compiler’s failure history too. These are not random checks. They are the classes of bugs that were serious enough to earn invariant coverage.

So when you hit a runtime failure in a new case, the checklist in this post is not just a manual debugging recipe. It is aligned with the contract boundaries REX already treats as important in its own tests.

A Practical Triage Loop

If you want the shortest useful version of this whole post, use this loop:

  1. check that rex_lib_nvidia.cubin exists in the runtime working directory
  2. inspect the lowered host file for omp_offloading_entries, kernel ID symbols, and __tgt_offload_entry
  3. compare host entry names with the generated device kernel names
  4. verify rex_offload_init() exists and appears before timing-sensitive declarations
  5. inspect __args* arrays and __kernel_args if registration and identity both look correct

That order matters. It moves from cheapest structural failures to more semantic packet failures.

If you skip directly to step five every time, you will spend too much time reading complicated launch blocks when the true problem is that the image was missing or the entry table never existed.

[Figure: a decision tree splitting runtime failures into image-missing, entry-table/identity mismatch, init-ordering issue, and mapping-packet issue.]

Figure 3. Once the CUBIN, entry table, and kernel names are all correct, the failure class usually shifts from registration to launch-packet correctness.

Closing

The phrase “builds but does not run” sounds like one problem.

In REX GPU offloading, it is usually several different problems that only look similar from a distance:

  • image not found,
  • host entries missing,
  • names not matching,
  • init happening too late,
  • or launch packets being wrong.

The reason the checklist in this post works is that it follows the same boundaries the compiler itself uses.

REX does not generate one opaque magic artifact. It generates:

  • host launch code,
  • device kernels,
  • helper/runtime glue,
  • and an ABI packet.

So the right way to debug it is to ask which boundary broke first, prove that with the generated artifacts, and only then move deeper.