How REX Made Literal Scalar Target Parameters Match The Modern OpenMP Launch ABI

After launch geometry was made fair, the next source of avoidable overhead was target-argument representation. REX was still treating many scalar values like address-based mapped objects. Native LLVM’s modern OpenMP offload path distinguishes scalars that can be transported as literal target parameters from objects that require real mapping. REX adopted the same structural distinction: only safe scalar values become literal parameters, the host packs their bytes into pointer-sized runtime slots, map types mark them as literal target parameters, and the device kernel reconstructs the original scalar locally. This is not a benchmark hack. It is an ABI repair that makes scalar arguments cheaper and more faithful to the runtime contract.

The previous post fixed the rules around launch geometry. REX may use source-level tripcount knowledge to improve compiler-owned launch defaults, but it must not silently rewrite explicit user launch clauses just to win a benchmark.

That made the next performance layer easier to see.

Once the launch shape is fair, the generated host still has to describe the target region to the OpenMP offload runtime. That description includes every argument:

base pointer
argument pointer
argument size
argument map type

For arrays and real mapped objects, that address-based model is exactly what the runtime needs. For many scalar inputs, it is too heavy and semantically misleading.

Consider a target region that uses a scalar target_lat as an input:

#pragma omp target teams distribute parallel for \
    map(to: target_lat, target_long)             \
    map(from: distances[0:n])
for (int i = 0; i < n; ++i) {
  distances[i] = distance(locations[i], target_lat, target_long);
}

The scalar target_lat is not an array section. It does not need device allocation as an object. It does not need copy-back. The kernel only needs the value.

The old lowering path did not make that distinction sharply enough. It often put scalars through the same argument machinery used for mapped storage:

void *__args_base[] = { &target_lat, ... };
void *__args[]      = { &target_lat, ... };
int64_t __sizes[]   = { sizeof(float), ... };
int64_t __types[]   = { TARGET_PARAM | TO, ... };

That is correct enough for some runtime paths, but it is not the shape native LLVM’s modern offload pipeline uses for simple scalar kernel parameters. Native LLVM can encode eligible scalars as literal target parameters. In the LLVM host IR we inspected during the REX performance work, the scalar entries showed the literal-target-parameter map-type form. One representative native map-type array contained values like:

@.offload_maptypes = private unnamed_addr constant [5 x i64]
  [i64 800, i64 34, i64 33, i64 800, i64 800]

The exact flag value is less important than the structural distinction: some entries are target parameters whose values are carried literally, not by mapping a host object address.

REX needed to make the same distinction.

A diagram splitting OpenMP target arguments into mapped storage objects and literal scalar target parameters.

Figure 1. The ABI should not describe every target argument as mapped storage. Some scalar inputs are just values that the kernel needs directly.

Why Scalar Representation Affects Performance

It is easy to underestimate this layer because the kernel math does not change. A scalar value is still a scalar value once the device code runs.

But the launch path changes.

When a scalar is treated as a mapped object, the host-side runtime argument packet has to describe it like storage:

where is the host object?
what size is the object?
what map direction applies?
does the runtime need to create, look up, or synchronize a mapping?

For a scalar input that only needs to be copied by value into the kernel argument list, that is the wrong abstraction. It adds address-taking, larger mapping metadata, and a less precise contract between the generated host code and the runtime.

The effect is most visible on short kernels. Long-running kernels can hide host launch ceremony behind device work. Small kernels such as the nn target loop cannot. Once REX removed cold-start registration from the timed path and replaced the old scheduler loop with direct grid-stride lowering, scalar argument overhead became part of the remaining noise floor.

This is the recurring pattern in the REX GPU work:

first remove the huge wrong cost;
then remove the generic device scheduler tax;
then make launch shape fair;
then make the launch packet itself accurate.

Literal scalar target parameters belong in that fourth layer.

They are not just an optimization trick. They are an ABI accuracy fix.

The Compiler Rule Must Be Conservative

Not every scalar can be lowered as a literal target parameter.

The safe case is narrow:

the expression resolves to a direct variable symbol;
the variable has scalar type;
the type is not a pointer;
the type is not long double;
the map direction does not require copy-back;
the value fits in the pointer-sized transport slot used by the runtime path.

That rule is intentionally boring. It does not mention nn, gaussian, or any benchmark name. It only mentions source structure and ABI constraints.

The current REX analysis reflects that:

bool canUseLiteralTargetParam(const SgOmpClauseBodyStatement *target,
                              SgVariableSymbol *var_sym,
                              SgOmpClause::omp_map_operator_enum map_operator) {
  SgType *type = stripTypeAliasesAndReferences(var_sym->get_type());
  if (type == NULL || !SageInterface::isScalarType(type) ||
      isSgPointerType(type) != NULL || isSgTypeLongDouble(type) != NULL) {
    return false;
  }

  const SgOmpClause::omp_map_operator_enum normalized_op =
      normalizeMapperMapOperator(map_operator);
  const bool need_copy_from =
      normalized_op == SgOmpClause::e_omp_map_from ||
      normalized_op == SgOmpClause::e_omp_map_tofrom;
  if (need_copy_from) {
    return false;
  }

  return get_target_type_size_bytes(type, target) <=
         get_host_pointer_size_bytes(target);
}

The important part is the negative space.

Pointers are not literalized. Arrays are not literalized. long double is not literalized. Copy-back scalars are not literalized, whether the copy-back came from an explicit map clause or an implicit default mapping. Oversized scalar representations are not literalized.

Those exclusions keep the optimization honest. A scalar that needs memory semantics must still be represented as memory. A scalar that is only an input value can become a literal target parameter.

A decision diagram showing the checks REX applies before a scalar can become a literal target parameter.

Figure 2. Literal scalar lowering is useful because it is narrow. The compiler has to prove the value can safely fit the runtime’s literal-parameter contract.

What Changes On The Host Side

Once a variable is classified as a literal target parameter, the host argument expression changes.

The old address-oriented form was effectively:

&target_lat

The literal form is:

rex_pack_literal_arg_bytes(&target_lat, sizeof(target_lat))

The lowering builds that expression directly:

SgExpression *buildLiteralTargetParamArgExpression(SgVariableSymbol *var_sym,
                                                   SgScopeStatement *scope) {
  SgType *type = stripTypeAliasesAndReferences(var_sym->get_type());

  return buildFunctionCallExp(
      "rex_pack_literal_arg_bytes", buildPointerType(buildVoidType()),
      buildExprListExp(buildAddressOfOp(buildVarRefExp(var_sym)),
                       buildSizeOfOp(type)),
      scope);
}

The helper itself is small:

#include <assert.h>   /* assert */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uintptr_t */

static inline void *rex_pack_literal_arg_bytes(const void *src, size_t size) {
  assert(size <= sizeof(uintptr_t));
  if (size > sizeof(uintptr_t)) {
    return NULL;
  }
  uintptr_t bits = 0;
  __builtin_memcpy(&bits, src, size);
  return (void *)bits;
}

The assert is not cosmetic. The compiler analysis is expected to ensure the size is safe. If that invariant is violated, it is a compiler bug, not a user-program condition. The runtime helper still returns NULL defensively after the assertion path so the function is not a raw unchecked stack write.

The generated argument arrays now carry packed values for eligible scalars:

void *__args_base[] = {
  __rex_packed_literal_arg_0,
  __rex_packed_literal_arg_1,
  locations,
  distances
};

void *__args[] = {
  __rex_packed_literal_arg_0,
  __rex_packed_literal_arg_1,
  locations + 0,
  distances + 0
};

REX materializes those packed expressions into temporaries before building the arrays. That detail avoids duplicating side-effect-sensitive expressions and ensures ArgsBase and Args agree on the same packed token:

SgVariableDeclaration *packed_decl = buildVariableDeclaration(
    "__rex_packed_literal_arg_0", buildPointerType(buildVoidType()),
    buildAssignInitializer(copyExpression(packed_expr)), scope);

SgVariableSymbol *packed_sym = SageInterface::getFirstVarSym(packed_decl);
arg_exprs[i] = buildVarRefExp(packed_sym);
base_exprs[i] = buildVarRefExp(packed_sym);

That is a small generated-code hygiene improvement, but it matters. ABI repair should not introduce a new class of double-evaluation bugs.

The Map Type Changes Too

Packing the scalar bytes is only half of the contract.

The runtime also needs to know how to interpret the argument slot. For a literal scalar target parameter, REX emits the literal-target-parameter map-type flags:

if (item.use_literal_target_param) {
  int literal_flags =
      OMP_TGT_MAPTYPE_TARGET_PARAM | OMP_TGT_MAPTYPE_LITERAL;
  if (item.is_implicit_target_variable) {
    literal_flags |= OMP_TGT_MAPTYPE_IMPLICIT;
  }
  result.mapping_type_expression = buildIntVal(literal_flags);
}

That flag choice is the bridge to the modern OpenMP offload ABI. The pointer-shaped slot in Args is not a host address that should be mapped as storage. It is a value token.

This is where the old REX representation was most misleading. If the bytes are packed but the map type still says “ordinary mapped object”, the runtime and the generated code are not speaking the same language. If the map type says “literal” but the host passes an address, the runtime receives the wrong representation.

Both sides have to change together:

host argument expression: packed scalar bytes
map type: TARGET_PARAM | LITERAL

That is the ABI-level fix.

A flow diagram showing scalar source values being packed into pointer-sized tokens and paired with literal target parameter map types.

Figure 3. A literal target parameter is not just a different pointer expression. It is a paired representation: packed bytes plus a map type that tells the runtime the slot is literal.

The Device Side Has To Agree

A source-to-source compiler cannot stop at the host packet. The generated CUDA-side kernel signature must also agree with the representation.

If the host packs a scalar into a pointer-sized token, the device kernel cannot keep treating that parameter as the original scalar type in the old ABI layout. REX rewrites eligible literal parameters to use a transport-sized integer type, then reconstructs the original scalar as a local variable inside the kernel body.

Conceptually:

__global__ void kernel__(void *__rex_kernel_launch_env,
                         unsigned long long target_lat_bits,
                         float *distances) {
  float target_lat;
  __builtin_memcpy(&target_lat, &target_lat_bits, sizeof(float));

  /* normal lowered kernel body uses target_lat */
}

In schematic form, the lowering does this through the ROSE AST:

SgType *transport_type =
    get_host_pointer_size_bytes(body) <= 4
        ? buildUnsignedIntType()
        : buildOpaqueType("uint64_t", body);

param->set_type(transport_type);

SgVariableDeclaration *local_decl =
    buildVariableDeclaration(local_name, original_type, NULL, body);

SgExprStatement *unpack_stmt = buildFunctionCallStmt(
    "__builtin_memcpy", buildPointerType(buildVoidType()),
    buildExprListExp(buildAddressOfOp(buildVarRefExp(local_symbol)),
                     buildAddressOfOp(buildVarRefExp(param_symbol)),
                     buildSizeOfOp(original_type)),
    body);

That unpack step is not optional. It is what keeps the kernel body written in source-level types while the ABI uses transport-level slots.

This also explains why scalar literal work overlaps with the later direct __tgt_target_kernel migration. LLVM’s modern kernel-launch path has a specific argument layout, and REX eventually had to match that layout exactly. Literal scalar parameters were one piece of the same alignment problem. The next post covers the larger direct-kernel migration and the hidden launch-environment parameter. This post isolates the scalar-value part because it is independently important and easier to reason about.

Why This Is Not A Wrapper-Layer Story

A tempting but wrong reading is:

REX added wrappers around LLVM runtime calls, therefore it got faster.

That is not the point.

The important change is not the wrapper. The important change is the data contract.

The runtime call can only be efficient and correct if the argument packet is accurate. Literal scalar target parameters require all of these to line up:

the compiler proves the scalar is eligible;
the host packs the scalar bytes into a pointer-sized token;
the map type marks the slot as a literal target parameter;
the device signature receives a transport slot;
the device body reconstructs the source scalar locally.

If any of those pieces is missing, the result is either wrong or slower than it needs to be.

The wrapper declarations in rex_kmp.h still matter because REX-generated code has to compile against LLVM’s offload runtime. The header defines the runtime data structures and entry points that the generated code needs:

struct __tgt_kernel_arguments {
  int32_t Version;
  int32_t NumArgs;
  void **ArgsBase;
  void **Args;
  int64_t *ArgSizes;
  int64_t *ArgTypes;
  void **ArgNames;
  void **ArgMappers;
  int64_t Tripcount;
  int64_t Flags;
  int32_t Teams[3];
  int32_t Threads[3];
  int32_t DynCGroupMem;
};

But the header is support infrastructure. The performance-relevant design is that scalar arguments are no longer forced through object-mapping semantics when the ABI can carry them as values.

How This Helped The Performance Work

This change did not produce the cleanest standalone benchmark A/B result in the whole campaign. It landed amid other work: offload initialization placement, direct grid-stride lowering, launch-geometry policy, and the later direct-kernel migration.

That does not make it optional.

Without literal scalar target parameters, every later performance experiment would still be comparing a REX launch packet that describes scalar values less accurately than native LLVM. That would muddy the root cause analysis. A benchmark might still run. It might even run fast enough on a large kernel. But short-kernel comparisons would keep carrying avoidable host argument ceremony.

The generated-code evidence was the important checkpoint. For nn-like inputs, the host argument arrays moved from address-based scalar entries to packed scalar tokens and literal map types. That proved the compiler was generating the intended ABI shape, not relying on hand-edited benchmark files.

Correctness also became easier to reason about. The eligibility rule preserves copy-back cases and mapped storage cases. The device-side unpack gives the kernel body ordinary scalar locals again. The runtime sees a literal target parameter instead of an address pretending to be a value.

That is exactly the kind of compiler optimization REX should prefer:

generic;
source-structure based;
ABI-aware;
safe by construction;
and visible in generated code.

What To Test

A useful test for this layer should not only run a benchmark. It should inspect the generated host and device shapes.

The host-side checks should verify:

eligible scalar inputs use rex_pack_literal_arg_bytes;
ArgsBase and Args reference materialized packed temporaries;
literal entries use TARGET_PARAM | LITERAL map flags;
array and pointer entries still use address-based mappings;
copy-back scalars are not incorrectly literalized.

The device-side checks should verify:

literal kernel parameters use a transport-sized integer slot;
the original scalar type is reconstructed with __builtin_memcpy;
kernel body references use the reconstructed local scalar;
nonliteral mapped objects keep their pointer or address-based representation.

The benchmark checks then answer a different question: does the resulting program still produce the same output as native LLVM, and does it avoid the old scalar launch overhead in small kernels?

Those are separate checks. A program can be correct while still using a needlessly heavy launch packet. A program can also look fast while silently dropping copy-back semantics. The compiler needs both structural checks and benchmark checks.

The Design Rule To Keep

The durable rule from this phase is:

do not model values as mapped storage when the ABI has a value representation

That rule is simple, but it changes where the compiler has to be careful. The hard part is not emitting rex_pack_literal_arg_bytes. The hard part is deciding when it is legal, carrying that decision through host arrays and map types, and repairing the device signature so the kernel receives what the runtime actually passes.

This is why literal scalar target parameters are a real performance topic, not just cleanup. Performance work often starts with big visible costs, but it eventually reaches representation mismatches. Once the scheduler path and launch geometry are under control, the ABI details become part of the hot path.

For REX, the literal-scalar work was the bridge between “we can generate a direct-looking kernel” and “we can describe that kernel to LLVM’s offload runtime using the same kind of argument contract native LLVM uses.”

The next post moves from scalar values to the whole launch call: completing the direct __tgt_target_kernel lowering and repairing the device ABI so the generated kernel entry point matches the runtime’s argument layout end to end.