<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Libomptarget on ./Code</title><link>https://blog.ouankou.com/tags/libomptarget/</link><description>Recent content in Libomptarget on ./Code</description><generator>Hugo</generator><language>en-US</language><copyright>© Anjia Wang</copyright><lastBuildDate>Mon, 04 May 2026 14:24:09 -0700</lastBuildDate><atom:link href="https://blog.ouankou.com/tags/libomptarget/index.xml" rel="self" type="application/rss+xml"/><item><title>How REX Removed Process-Exit Offload Teardown From Generated GPU Programs</title><link>https://blog.ouankou.com/2026/04/29/how-rex-removed-process-exit-offload-teardown-from-generated-gpu-programs/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.ouankou.com/2026/04/29/how-rex-removed-process-exit-offload-teardown-from-generated-gpu-programs/</guid><description>&lt;p&gt;The previous post separated GPU-total performance from wall-clock noise. That was necessary because &lt;code&gt;pathfinder&lt;/code&gt;, &lt;code&gt;srad_v1&lt;/code&gt;, and &lt;code&gt;srad_v2&lt;/code&gt; looked suspicious in broad timing tables, but profiler totals did not show a remaining native LLVM advantage in the actual GPU work.&lt;/p&gt;
&lt;p&gt;That did not make the wall-clock discrepancy imaginary.&lt;/p&gt;
&lt;p&gt;It meant the discrepancy lived somewhere else.&lt;/p&gt;
&lt;p&gt;If REX's kernel time and copy time match or beat native LLVM's, but whole-process timing still sometimes looks worse, the remaining cost is probably not inside the generated device kernel. It is in process lifetime: setup, registration, host work, teardown, or the interaction between those pieces.&lt;/p&gt;</description></item></channel></rss>