How REX Packs Literal Target Parameters For GPU Kernels

Posted on (Updated on )
REX does not force every mapped scalar through ordinary address-based OpenMP mapping. If a mapped value is a small scalar that fits in a host pointer slot and does not require a real copy-back, the lowerer marks it as a literal target parameter, packs its bytes with rex_pack_literal_arg_bytes(...), stores that packed value directly in the runtime argument arrays, and rewrites the outlined kernel signature so the original scalar is reconstructed with __builtin_memcpy on entry.

The previous posts in this series covered three adjacent pieces of the GPU path:

  • how REX outlines target regions into device kernels,
  • how target loops are rewritten into direct GPU execution,
  • and how the host side builds the runtime packet for __tgt_target_kernel(...).

There is one small-looking mechanism that sits across all three of those stories and turns out to matter a lot:

how should a mapped scalar travel from host code into the generated GPU kernel?

The naive answer is “by address, like everything else.” That works, but it is not always the best match for what the runtime and kernel actually need.

If the mapped object is really just a scalar value such as an int n, forcing it through the ordinary address-based path means treating that value like storage even when the kernel only needs its bits. REX now has a more precise option for this case: the scalar can be lowered as a literal target parameter.

That phrase is specific, and the mechanism behind it is precise:

  • the map analysis marks a subset of mapped scalars as literal-eligible,
  • the host-side lowering packs their bytes into a pointer-sized transport value,
  • the runtime argument arrays carry that packed value with OMP_TGT_MAPTYPE_LITERAL,
  • the outlined kernel signature is rewritten to accept the transport type,
  • and the original scalar is reconstructed inside the kernel body before the real work starts.

This post stays tightly on that mechanism inside src/midend/programTransformation/ompLowering/omp_lowering.cpp and src/midend/programTransformation/ompLowering/rex_kmp.h. It is not a general map-clause post. It is about one data path and why REX handles it explicitly.

A mapped scalar n moves through four stages: literal eligibility, host packing into rex_pack_literal_arg_bytes, runtime arrays marked with TARGET_PARAM and LITERAL, and kernel-side reconstruction into n__rex_value before the loop body runs.

Figure 1. Literal target parameters are a full end-to-end lowering path, not a late packet tweak. The scalar is classified, packed, transported, and reconstructed deliberately.

Why Literal Target Parameters Exist At All

The offloading runtime does not care about C syntax. It cares about an ABI packet:

  • base pointers,
  • effective mapped pointers,
  • sizes,
  • type flags,
  • launch geometry,
  • and the kernel entry symbol.

That means the compiler gets to choose how a particular source-level object is represented in that packet, as long as the final ABI is correct.

For arrays and true pointer-based mappings, address transport is the natural model. The kernel needs a device-visible address, and the runtime must understand extents and mapping kinds.

For a small scalar, the situation is different. Suppose the user writes:

1
2
3
4
5
6
7
int n = 32;

#pragma omp target map(to: n) map(tofrom: a[0:32])
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
  a[i] = a[i] + n;
}

The array a clearly needs normal mapping. The scalar n does not necessarily need “storage semantics.” The device code just needs the value 32.

REX therefore treats some mapped scalars as transport-by-value through the runtime packet, rather than as “take the address of the host scalar and build a normal mapped entry.”

That distinction matters for two reasons.

First, it better matches intent. When a scalar is only an input value, the lowerer should not pretend it is a device-resident object that needs full address-based handling.

Second, it keeps the packet shape honest. A literal target parameter is still represented in the same argument arrays as everything else, but its flags and expressions explicitly say “this slot is a literal value carried as a target parameter,” not “this is a normal mapped memory region.”

That is cleaner than silently overloading the ordinary mapping path and hoping the consumer will infer what happened.

Step 1: The Lowerer Decides Whether A Scalar Is Eligible

The decision point is canUseLiteralTargetParam(...) in omp_lowering.cpp:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
bool canUseLiteralTargetParam(const SgOmpClauseBodyStatement *target,
                              SgVariableSymbol *var_sym,
                              SgOmpClause::omp_map_operator_enum map_operator) {
  if (target == NULL || var_sym == NULL) {
    return false;
  }

  SgType *type = stripTypeAliasesAndReferences(var_sym->get_type());
  if (type == NULL || !SageInterface::isScalarType(type) ||
      isPointerType(type) || isSgTypeLongDouble(type) != NULL) {
    return false;
  }

  const bool is_implicit = isImplicitTargetMapVariable(target, var_sym);
  const bool need_copy_from =
      map_operator == SgOmpClause::e_omp_map_from ||
      map_operator == SgOmpClause::e_omp_map_tofrom;
  if (need_copy_from && !is_implicit) {
    return false;
  }

  return get_target_type_size_bytes(type, target) <=
         get_host_pointer_size_bytes(target);
}

This tells you almost everything important about the feature.

REX only takes the literal path when all of the following are true:

  1. the mapped object is a scalar,
  2. it is not a pointer,
  3. it is not long double,
  4. its target-side size fits within a host pointer-sized transport slot,
  5. and it does not require a real explicit copy-back path.

That last point is easy to miss and important to get right.

If the user writes an explicit map(from: x) or map(tofrom: x) for a scalar, REX does not quietly reinterpret that as a literal parameter unless the scalar is only present because of the implicit target-variable rules. That is the correct design. A true copy-back request is asking for storage semantics, not “please smuggle these bytes through one launch slot.”

So literal transport is not a trick applied to every scalar. It is a gated lowering choice for cases where by-value transport matches both semantics and ABI reality.

An eligibility funnel shows a mapped variable moving through checks: scalar, non-pointer, not long double, no explicit copy-back requirement, size less than or equal to host pointer size, then finally use literal target param.

Figure 2. The literal path is intentionally conservative. It is only used when the scalar can be represented as a pointer-sized value without violating map semantics.

Step 2: Map Analysis Tags The Variable Before Outlining

The feature does not start in the packet builder. It starts earlier, during map analysis in transOmpMapVariables(...):

1
2
3
4
5
6
7
8
if (!canUseLiteralTargetParam(target, item.direct_variable_symbol,
                              item.map_operator)) {
  continue;
}

item.use_literal_target_param = true;
offload_ctx->literal_target_param_syms.insert(
    item.direct_variable_symbol);

This is the right place for the decision.

At this stage, REX still has the semantic view of the mapping information:

  • what the original map operator was,
  • whether the variable is implicit,
  • whether it is a direct scalar rather than an array section,
  • and whether that symbol is actually used in the target region.

Once the compiler moves past this point and starts building concrete host expressions and outlined parameters, it is already too late to “rediscover” the original map semantics cleanly. So the lowerer records the choice directly in two places:

  • on the ResolvedMapItem itself with use_literal_target_param,
  • and in offload_ctx.literal_target_param_syms for the later kernel-signature rewrite.

That second record is especially important. The host packet and the device signature must agree on transport class. If the host side packs a value as literal but the outlined kernel still expects an ordinary pointer-shaped parameter, the launch ABI is broken even though both pieces may look locally reasonable.

This is one of the recurring design themes in REX GPU lowering: decisions that affect both sides of the host-device boundary are recorded in shared lowering context rather than inferred twice.

Step 3: The Host Packet Builder Emits A Pack Call Instead Of &scalar

Once a ResolvedMapItem is marked literal, buildResolvedMapItemArgumentExpressions(...) takes a different path:

1
2
3
4
5
6
7
8
if (item.use_literal_target_param) {
  result.mapping_expression = buildLiteralTargetParamArgExpression(
      item.direct_variable_symbol, scope);
  result.mapping_base_expression = copyExpression(result.mapping_expression);
  result.mapping_size_expression =
      buildCastExp(buildSizeOfOp(mapping_variable_type),
                   buildOpaqueType("int64_t", scope));
}

And buildLiteralTargetParamArgExpression(...) constructs exactly one helper call:

1
2
3
4
5
return buildFunctionCallExp(
    "rex_pack_literal_arg_bytes", buildPointerType(buildVoidType()),
    buildExprListExp(buildAddressOfOp(buildVarRefExp(var_sym)),
                     buildSizeOfOp(type)),
    scope);

So instead of generating something like:

1
2
__args_base[] = { &n, ... };
__args[]      = { &n, ... };

REX generates the semantic equivalent of:

1
2
void *__rex_packed_literal_arg_0 =
    rex_pack_literal_arg_bytes(&n, sizeof(int));

and then routes both __args_base and __args through that packed value.

The helper in rex_kmp.h is intentionally small:

1
2
3
4
5
6
7
8
9
static inline void *rex_pack_literal_arg_bytes(const void *src, size_t size) {
  assert(size <= sizeof(uintptr_t));
  if (size > sizeof(uintptr_t)) {
    return NULL;
  }
  uintptr_t bits = 0;
  __builtin_memcpy(&bits, src, size);
  return (void *)bits;
}

The design choice here is subtle but good:

  • REX does not type-pun through casts from arbitrary scalar types.
  • It copies raw bytes into a uintptr_t.
  • It asserts that the value fits in the transport slot.
  • Then it returns that bit-pattern as a void *, because that is the transport type expected by the runtime argument arrays.

The important point is not “a pointer is being dereferenced later.” It is that the runtime packet already uses pointer-shaped storage for argument transport, so REX uses that slot as a byte container for small scalars.

Step 4: Literal Arguments Need Stabilization Before Final Array Construction

If REX stopped at “insert a helper call directly into the expression list,” the final lowered code would still be fragile. The same pack call can conceptually feed both ArgsBase and Args, and those lists are materialized separately.

That is why the static packet path runs materializeLiteralTargetArgExpressions(...) before emitting the final arrays:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
if (isLiteralTargetParamPackCall(arg_exprs[i])) {
  packed_expr = arg_exprs[i];
} else if (isLiteralTargetParamPackCall(base_exprs[i])) {
  packed_expr = base_exprs[i];
}

SgVariableDeclaration *packed_decl = buildVariableDeclaration(
    "__rex_packed_literal_arg_" + std::to_string(literal_arg_counter++),
    buildPointerType(buildVoidType()),
    buildAssignInitializer(copyExpression(packed_expr)), scope);

arg_exprs[i] = buildVarRefExp(packed_sym);
base_exprs[i] = buildVarRefExp(packed_sym);

This helper does three things that are easy to underappreciate.

First, it gives the packed value one stable name in the generated host code. That improves readability and makes debugging the lowered output less confusing.

Second, it avoids evaluating the same pack expression twice. Even though rex_pack_literal_arg_bytes(...) is pure in practice, duplicated helper calls would make the intent look muddier than it is.

Third, it forces ArgsBase and Args to share exactly the same transport value. That is not merely an optimization. For literal target parameters, it is the semantic statement that base and effective mapping value are intentionally the same thing.

The dynamic packet path does the same kind of stabilization with its own temporary naming scheme. So this is not a quirk of the static array builder. It is the transport rule for literals.

Step 5: The Type Flags Explicitly Tell The Runtime This Is A Literal

Transport expressions alone are not enough. The runtime packet also needs the right type bits.

For literal parameters, the lowerer emits:

1
2
3
4
5
6
int literal_flags =
    OMP_TGT_MAPTYPE_TARGET_PARAM | OMP_TGT_MAPTYPE_LITERAL;
if (item.is_implicit_target_variable) {
  literal_flags |= OMP_TGT_MAPTYPE_IMPLICIT;
}
result.mapping_type_expression = buildIntVal(literal_flags);

That matters because the packet is not just a bag of pointers. The type flags tell the offloading runtime what kind of argument each slot represents.

So the literal lowering path is not:

  • “pretend the scalar is an address,”
  • then “hope the consumer notices the size is small.”

It is:

  • pack the scalar bytes into the transport slot,
  • set the target-param bit,
  • set the literal bit,
  • and preserve IMPLICIT when the original mapping semantics require it.

That is a much cleaner contract.

It also explains why this mechanism belongs in the lowerer rather than in a runtime wrapper. The runtime can only consume the packet it is given. Only the compiler still knows that this argument originated as a scalar n with specific map semantics.

Step 6: The Outlined Kernel Signature Is Rewritten To Match The Packet

At this point, the host packet is correct, but the device-side kernel boundary still has to be brought into agreement. That is the job of lowerLiteralTargetKernelParameters(...).

This helper always checks one ABI detail first:

1
2
3
4
5
6
if (params->get_args().empty() ||
    params->get_args().front()->get_name().getString() !=
        "__rex_kernel_launch_env") {
  prependArg(params, buildInitializedName("__rex_kernel_launch_env",
                                          buildPointerType(buildVoidType())));
}

That hidden environment parameter is not specific to literal scalars, but it matters here because literal transport depends on the final runtime-facing signature being exactly right. REX makes the launch-environment slot explicit instead of pretending the outlined CUDA kernel is a plain user-level function.

After that, every literal parameter is rewritten from its original scalar type to a pointer-sized transport type:

1
2
3
4
5
SgType *transport_type =
    get_host_pointer_size_bytes(body) <= 4
        ? static_cast<SgType *>(buildUnsignedIntType())
        : static_cast<SgType *>(buildUnsignedLongLongType());
param->set_type(transport_type);

This is the mirror image of the host-side packing rule. The packet carried raw scalar bytes inside a pointer-sized slot, so the generated kernel declaration must expose a parameter type large enough to receive exactly that slot.

Then REX reconstructs the original scalar immediately on entry:

1
2
3
4
5
6
7
8
9
SgVariableDeclaration *shadow_decl =
    buildVariableDeclaration(shadow_name, original_type, NULL, body);

SgExprStatement *memcpy_stmt = buildFunctionCallStmt(
    "__builtin_memcpy", buildPointerType(buildVoidType()),
    buildExprListExp(buildAddressOfOp(buildVarRefExp(shadow_sym)),
                     buildAddressOfOp(buildVarRefExp(param_sym)),
                     buildSizeOfOp(original_type)),
    body);

Finally, body references to the parameter are rewritten to use the shadow local rather than the transport-shaped parameter.

This part is worth pausing on, because it shows why the feature was implemented in the lowerer and not as some late codegen hack.

REX is not merely changing a type on the function declaration. It is changing the meaning of the parameter boundary:

  • externally, the kernel now accepts a raw pointer-sized transport slot;
  • internally, the kernel still behaves as if it received the original scalar;
  • and the bridge between the two is an explicit reconstruction step that the AST can reason about.

That is the right compiler shape for a feature like this.

A before and after kernel signature diagram shows an outlined kernel with parameter int n becoming void* __rex_kernel_launch_env, unsigned long long n, followed by a local int n__rex_value reconstructed with __builtin_memcpy and used by the original loop body.

Figure 3. The device kernel does not directly consume the original scalar type anymore. It receives a transport-sized parameter, reconstructs the original value into a shadow local, and the body continues against that reconstructed scalar.

Why The Same Mechanism Works Across SPMD And Worksharing Paths

One of the nice properties of this design is that it is not tied to one surface OpenMP form.

The literal decision is made in shared map analysis.

The host-side packing happens in shared map-argument construction.

The parameter rewrite happens after outlining, on the actual outlined function.

That means the mechanism works regardless of whether the target region eventually goes through:

  • the direct loop-lowering path,
  • the worksharing path,
  • or an SPMD-style outlined kernel path.

That is exactly where this feature belongs architecturally.

If literal parameter support had been bolted onto only one lowering branch, REX would have ended up with mismatched behavior across constructs that are supposed to share the same offloading model. Instead, the implementation sits at the common seams:

  • map resolution,
  • runtime argument construction,
  • outlined kernel parameter shaping.

So the compiler does not need a separate “literal scalar special case” in every target construct handler. It makes the decision once and lets the shared lowering machinery carry it through.

What The Current Tests Actually Prove

There is already a focused input for this feature in tests/nonsmoke/functional/CompileTests/OpenMP_tests/target_literal_scalar.c:

1
2
3
4
5
6
7
8
int n = 32;
int a[32];

#pragma omp target map(to : n) map(tofrom : a[0 : 32])
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
  a[i] = a[i] + n;
}

And the lowering test harness has a dedicated check script, tests/nonsmoke/functional/roseTests/ompLoweringTests/scripts/run_literal_target_param_check.sh, that verifies three concrete invariants in the lowered output:

  • rex_pack_literal_arg_bytes(&n,sizeof(int )) is present,
  • the emitted type bits include the literal target-parameter flag value,
  • and the generated __args_base / __args arrays no longer lower n by reference.

That is a good starting point because it checks the visible ABI shape, not just whether some internal flag was flipped.

What it does not prove yet is the full matrix of scalar types and target constructs. For example, future tests could tighten coverage around:

  • different scalar widths,
  • implicit versus explicit target variables,
  • rejection cases such as long double,
  • and mixed constructs where outlining shape differs but the literal mechanism should remain identical.

So the current test coverage is real and useful, but it is best understood as a targeted invariant test for the new lowering path rather than the last word on every legal literal case.

Why This Mechanism Is Better Than A “Just Pass The Address” Fallback

It is tempting to ask whether all of this is really worth having. Why not just keep passing &n and avoid the extra machinery?

The answer is that the literal path is more honest about the transport model.

When REX lowers a scalar as literal, it is making a compiler-visible statement:

  • this object is semantically a scalar value for the kernel,
  • it fits into one transport slot,
  • it does not need general mapped-storage behavior,
  • and both the host packet and device signature can be simplified accordingly.

That is not a hack around the runtime. It is a better use of the runtime ABI.

It also sets REX up for cleaner future work. Once the compiler can represent “this target argument is a literal value” explicitly, later optimizations do not need to rediscover that fact from address expressions and magic constants. The lowering already recorded the truth in the AST and in the packet.

Closing

Literal target parameters are a small feature in terms of line count, but they are a good example of what a source-to-source lowerer has to do well.

The important part is not the helper itself. The important part is that REX carries one semantic decision all the way through:

  • from map analysis,
  • to host argument expression building,
  • to runtime flag emission,
  • to kernel parameter rewriting,
  • to device-side value reconstruction.

If any one of those stages were missing, the feature would collapse into a brittle local tweak.

Because all five stages agree, REX can treat a mapped scalar as what it really is: not a pretend memory region, but a value that the generated GPU kernel needs to receive correctly and explicitly.