How REX Lowers Target Loops Into Direct GPU Kernels

Mon, 06 Apr 2026 00:00:00 +0000

The previous posts in this series covered two neighboring stages of GPU lowering:

how REX outlines a target region into a real device kernel,
and how the host side builds launch packets and runtime map arrays for that kernel.

One important step still sits between those two stories:

what happens to the actual for loop inside the outlined kernel body?

That step matters more than it first sounds.

How REX Validates GPU Offloading With Real Benchmarks

Sun, 05 Apr 2026 00:00:00 +0000

The previous post in this series covered the semantic checkpoint before GPU execution: lowering_cpu runs the original OpenMP program and the REX-lowered program on the same CPU runtime and asks whether the transformation still preserves meaning.

That is a strong test layer.

It is still not the last one.

At some point the compiler has to survive the real thing:

a real application,
real offloading,
real runtime glue,
real device launches,
real numerical output,
and real performance scrutiny.

That is the benchmark-validation layer.
