How REX Lowers `target data` Regions and Multi-Kernel Lifetimes

REX lowers OpenMP target data regions into explicit begin/end lifetime brackets, reusing the same mapping vocabulary as kernel launches. This post explains how static and dynamic lowering paths preserve device data lifetimes across multiple kernels and repeated launches, as demonstrated by reduced Rodinia benchmarks.

The previous post in this series focused on the runtime packet for kernel launches: how REX builds __args_base, __args, __arg_sizes, __arg_types, and finally __tgt_kernel_arguments before calling __tgt_target_kernel(...).

That still leaves the other half of GPU offloading:

what happens when the program wants data to live on the device across more than one kernel launch?

That is the job of target data.

In OpenMP offloading, kernel launch and data lifetime are related but not identical problems. A kernel launch says “run this work on the device.” A target data region says “make these mappings live on the device across the body of this region.”

REX lowers that distinction explicitly.

This post zooms into transOmpTargetData() in src/midend/programTransformation/ompLowering/omp_lowering.cpp and the multi-kernel patterns it enables. The focus is intentionally narrow:

  • how target data reuses the same mapping vocabulary as kernel launches,
  • how the static and dynamic lowering paths differ,
  • how __tgt_target_data_begin and __tgt_target_data_end are inserted,
  • why this lowering is about lifetime, not just copying arrays once,
  • and how reduced Rodinia cases exercise multi-kernel and repeated-call behavior on top of those lifetimes.

[Figure: A target data region lowered into an explicit begin/body/end lifetime bracket, with multiple target kernels inside the bracket reusing mapped device data.]

Figure 1. target data lowering is a lifetime transformation. REX turns the directive into an explicit begin/body/end region so multiple kernels can reuse one mapping window.

Why target data Is A Different Lowering Problem From Kernel Launch

At first glance, target data looks simpler than a kernel launch because there is no __tgt_kernel_arguments object and no launch geometry. That is true, but it can be misleading.

The real complexity here is not execution configuration. It is lifetime.

Kernel launch lowering answers questions like:

  • which kernel is this host block launching?
  • how many teams and threads should it use?
  • what exact packet layout does the runtime expect?

target data lowering answers different questions:

  • what mappings should become live on the device?
  • when does that lifetime start?
  • when does it end?
  • what host statements remain inside that lifetime window?
  • how do repeated kernels inside that window keep seeing the same mapped state?

That is why the runtime API is different too. Kernel launches go through __tgt_target_kernel(...). Data-lifetime regions go through:

  • __tgt_target_data_begin(...)
  • body statements
  • __tgt_target_data_end(...)

REX does not blur those two contracts. It lowers them as separate source-level shapes even though they share the same map-clause analysis.

Step 1: Reuse The Same Mapping Vocabulary As Kernel Launches

One of the cleanest parts of the current GPU lowerer is that target data does not invent a new representation for mappings.

transOmpTargetData() starts exactly where the launch path starts:

SgExprListExp *map_variable_list = buildExprListExp();
SgExprListExp *map_variable_base_list = buildExprListExp();
SgExprListExp *map_variable_size_list = buildExprListExp();
SgExprListExp *map_variable_type_list = buildExprListExp();
std::vector<ExpandedMapEntry> dynamic_map_entries;

transOmpMapVariables(target, map_variable_list, map_variable_base_list,
                     map_variable_size_list, map_variable_type_list,
                     &offload_ctx, &dynamic_map_entries);

That is the same column-oriented mapping vocabulary used in the kernel-launch path:

  • base addresses,
  • mapped addresses,
  • sizes,
  • and map-type bits.

This reuse is a deliberate design choice. Kernel launch and target data are not semantically identical, but they describe the same map clauses. If the compiler kept one mapping vocabulary for target data and another for kernel launch, it would eventually have to keep the two in sync by hand and debug every mapping bug twice.

Why this reuse matters

It gives REX a clean layering:

  • transOmpMapVariables(...) answers “what mappings exist and how should they be represented?”;
  • the surrounding lowering path answers “what runtime protocol should consume those mappings?”

For kernel launches, that protocol becomes:

  • runtime arrays,
  • __tgt_kernel_arguments,
  • __tgt_target_kernel.

For target data, that protocol becomes:

  • runtime arrays,
  • __tgt_target_data_begin,
  • region body,
  • __tgt_target_data_end.

The input vocabulary stays the same. Only the consumer changes.

It keeps dynamic-mapping support aligned too

The same dynamic_map_entries mechanism used by kernel launch lowering is also used here. That means mapper-expanded sections and runtime-sized array regions are not a special case for one API only. The lowerer treats them as a property of mapping itself.

This is the real architectural win:

target data and kernel launch lowering do not share behavior because they happen to look similar. They share behavior because they are both downstream consumers of the same mapping analysis.

[Figure: A comparison diagram showing transOmpMapVariables feeding both target-data lowering and target-kernel lowering, with the former going to target_data_begin/end and the latter going to __tgt_kernel_arguments and __tgt_target_kernel.]

Figure 2. REX keeps one mapping vocabulary and lets different lowering paths consume it. That is cleaner than teaching each offloading construct its own mapping representation.

Step 2: Lower Static target data Regions In Place

If there are no dynamic map entries, transOmpTargetData() takes the simpler path.

First it reuses the directive body block directly:

SgBasicBlock *body = isSgBasicBlock(target->get_body());
if (!body) {
  body = buildBasicBlock();
  body->append_statement(target->get_body());
  target->set_body(body);
}
...
SgBasicBlock *target_data_begin_block = body;

Then it prepends the local runtime state that the data-region API expects:

SgVariableDeclaration *device_id_decl = buildVariableDeclaration(
    "__device_id", buildOpaqueType("int64_t", p_scope),
    buildAssignInitializer(buildLongLongIntVal(-1)), p_scope);
target_data_begin_block->prepend_statement(device_id_decl);

and then, in the same style as the static kernel-launch packet path, it emits:

  • __args_base
  • __args
  • __arg_sizes
  • __arg_types
  • __arg_num

all as ordinary braced local arrays or scalars.

This is one of the easiest points to overlook if you think only in terms of runtime calls. The lowerer is not replacing target data with one function call. It is replacing it with a whole block that contains the mapping state required by that call.

Where begin and end are inserted

After those declarations exist, the begin call is inserted after the last mapping array declaration (such as __arg_num) to ensure the arrays are declared before the call that uses them:

SgExprStatement *func_offloading_stmt = buildFunctionCallStmt(
    "__tgt_target_data_begin", buildVoidType(), parameters, p_scope);
insertStatementAfter(arg_num_decl, func_offloading_stmt);

Then the end call is appended at the end of the original body block:

func_offloading_stmt = buildFunctionCallStmt(
    "__tgt_target_data_end", buildVoidType(), isSgExprListExp(deepCopy(parameters)), p_scope);
body->append_statement(func_offloading_stmt);

Finally, the original SgOmpTargetDataStatement node is replaced with that rewritten body:

replaceStatement(target, body, true);
attachComment(body, "Translated from #pragma omp target data ...");

That means the lowered host code has an explicit lifetime bracket:

  1. create mapping-state locals,
  2. call __tgt_target_data_begin,
  3. run the original body statements,
  4. call __tgt_target_data_end.

This is a very direct translation from directive semantics into ordinary source structure.

Why “in place” is the right mental model here

In the static case, REX does not need a synthetic wrapper block separate from the original body. The existing basic block already is the right lifetime body, so the compiler can prepend the begin-side state and append the end-side call around the user’s original statements.

That keeps the lowered source readable. When someone inspects the output, they can still recognize the original body statements sitting between explicit begin and end markers.

Step 3: Lower Dynamic target data Regions With A Fresh Wrapper Block

If dynamic_map_entries is not empty, the lowerer takes a different path:

if (!dynamic_map_entries.empty()) {
  SgBasicBlock *translated_body = buildBasicBlock();
  ...
}

This is not just a cosmetic variation. It reflects a real difference in how the mapping arrays must be constructed.

Why the body is wrapped instead of edited in place

In the dynamic case, the lowerer needs to:

  • count final runtime argument slots,
  • allocate heap-backed arrays,
  • populate them procedurally,
  • run the original body,
  • call __tgt_target_data_end,
  • and free the heap storage.

That shape is easier and safer to express as:

  1. a new wrapper block,
  2. dynamic-array setup,
  3. begin call,
  4. original body inserted as a child block,
  5. end call,
  6. cleanup.

That is exactly what the code does:

RuntimeMapArgumentArrayDeclarations dynamic_arrays =
    buildDynamicRuntimeMapArgumentArrays(
        translated_body,
        p_scope,
        map_variable_list,
        map_variable_base_list,
        map_variable_size_list,
        map_variable_type_list,
        dynamic_map_entries);

Then it emits:

SgExprStatement *begin_stmt = buildFunctionCallStmt(
    "__tgt_target_data_begin", buildVoidType(), parameters, p_scope);
translated_body->append_statement(begin_stmt);

body->set_parent(NULL);
target->set_body(NULL);
translated_body->append_statement(body);

SgExprStatement *end_stmt = buildFunctionCallStmt(
    "__tgt_target_data_end", buildVoidType(), isSgExprListExp(deepCopy(parameters)), p_scope);
translated_body->append_statement(end_stmt);

appendDynamicRuntimeMapArgumentArrayCleanup(dynamic_arrays, translated_body, p_scope);

Only after all of that does it replace the original target-data statement with translated_body.

The lifetime is still the same, even though the construction path is not

This is the same design principle we saw in the __tgt_kernel_arguments post:

  • static and dynamic cases use different construction strategies,
  • but they converge on the same runtime contract.

Here the runtime contract is not a kernel packet. It is the begin/body/end lifetime bracket plus the standard six mapping arguments:

  • device id,
  • arg count,
  • args base,
  • args,
  • sizes,
  • types.

The dynamic case changes how those arrays are built and cleaned up. It does not change the meaning of the region lifetime.

Step 4: target data Is What Makes Multi-Kernel Lifetimes Real

This is the most important conceptual point in the whole post.

Without target data, every offloaded kernel launch is free to behave like a mostly standalone transaction: establish mappings, launch, and finish. That is workable for some kernels, but it is not how many real applications are structured.

Real applications often want:

  • one device-resident working set,
  • several kernels that operate over it,
  • and only one begin/end lifetime for the region as a whole.

That is exactly what target data lowering makes explicit in host code.

rodinia_hotspot_like: two kernels inside one data lifetime

The reduced rodinia_hotspot_like.c input is a perfect example. It has:

  • one #pragma omp target data region,
  • then two target teams distribute parallel for collapse(2) kernels inside that region.

The first kernel computes result[idx] = temp[idx] + power[idx]. The second kernel updates temp[idx] = result[idx] * 0.5f.

The important point is not that there are two kernels. The important point is that they are deliberately bracketed by one data lifetime:

  • power, temp, and result become mapped for the region,
  • both kernels execute while those mappings remain live,
  • and the lifetime ends only after the second kernel finishes.

That is exactly the kind of source structure a lowerer must preserve. If the compiler accidentally lowered each nested kernel as if it owned an unrelated mapping lifetime, the source meaning would drift.

rodinia_pathfinder_like: one data lifetime around many looped kernels

The reduced rodinia_pathfinder_like.c input is even more interesting because it contains a host loop inside the target-data region:

#pragma omp target data ...
{
  for (t = 0; t < RODINIA_PATH_ROWS - 1; t++) {
    #pragma omp target teams distribute parallel for ...
    ...
  }
}

This is a classic lifetime test. The mapped arrays src, dst, and wall should remain valid across a sequence of kernel launches driven by a host loop.

That pattern exercises a stronger property than “can the compiler lower one data region and one kernel?” It asks:

can the compiler preserve a device mapping lifetime across repeated launches of the same lowered helper shape?

That is much closer to real HPC code than an isolated one-off launch.

Repeated calls outside a target-data region still matter too

The reduced rodinia_axpy_multi_like.c case does not use target data, but it is still valuable here because it tests the other half of lifetime correctness: repeated host calls to the same lowered offload helper.

It contains three kernels:

  • scale_like
  • axpy_like
  • bias_like

and then calls axpy_like(...) twice from main.

That tells us whether the lowered host/device artifacts are stable across repeated invocation, just as the target-data cases tell us whether mappings remain live across repeated kernels inside one lifetime region.

These are not the same thing, but together they form the broader “multi-kernel lifetime” story:

  • mapping lifetime across several kernels,
  • and kernel/helper identity stability across repeated calls.

[Figure: Three patterns side by side: one target data region containing two kernels, one target data region containing a host loop that repeatedly launches a kernel, and repeated host calls to the same lowered kernel helper outside a target data region.]

Figure 3. Multi-kernel correctness is not just about kernel count. It is about preserving the right lifetime relationships between mapped data, repeated launches, and generated helper identities.

Step 5: Why This Lowering Style Makes Source-To-Source Debugging Practical

A good source-to-source lowerer should not hide lifetime structure inside opaque helper internals if it can express that structure directly in the generated source.

REX’s target data lowering gets that right.

When you inspect the lowered host output, you can literally see:

  • the mapping declarations,
  • the begin call,
  • the user body,
  • the end call,
  • and, in the dynamic case, the cleanup.

That makes several debugging questions much easier to answer.

Did the compiler preserve the body inside the lifetime window?

Because the original body remains visibly between begin and end, you can answer this by inspection.

Did the compiler reuse the same mapping vocabulary as kernel launches?

Again, yes, because you can see __args_base, __args, __arg_sizes, __arg_types, and __arg_num in the lowered region.

Did the compiler choose the right construction path for dynamic mappings?

The source shape makes that visible too. Static regions show braced local arrays. Dynamic regions show heap-backed arrays plus cleanup.

This is one of the recurring strengths of REX’s design. Even at ABI-sensitive boundaries, the compiler emits ordinary C/C++ structures rather than disappearing into an opaque binary-only phase.

What This Buys REX

Looking at target data lowering directly makes a few design strengths obvious.

Data-lifetime lowering is explicit, not implied

REX does not assume the runtime will infer a region lifetime from nested constructs. It emits explicit begin/end calls in the host code.

Kernel launch and data lifetime stay separate but compatible

The same mapping vocabulary feeds both paths, but only kernel launches build a __tgt_kernel_arguments packet. That is a clean separation of concerns.

Dynamic-mapping complexity stays local

Dynamic regions require a different construction strategy, but that complexity stays in the array-building path rather than infecting the rest of the host lifetime logic.

Multi-kernel applications get the right lifetime model

A region with several kernels, or a host loop that repeatedly launches kernels inside one data lifetime, can be lowered without collapsing everything into isolated single-launch transactions.

That is exactly the kind of behavior the reduced Rodinia cases exercise, and it is why those cases matter more than toy single-kernel examples alone.

The companion post on the test layer covers how REX checks lowering invariants for multi-kernel GPU code without relying on brittle full-file golden outputs, and how those invariants catch the kind of regressions that a source-to-source lowerer is especially prone to introduce.