<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Rodinia on ./Code</title><link>https://blog.ouankou.com/tags/rodinia/</link><description>Recent content in Rodinia on ./Code</description><generator>Hugo</generator><language>en-US</language><copyright>© Anjia Wang</copyright><lastBuildDate>Mon, 04 May 2026 12:42:33 -0700</lastBuildDate><atom:link href="https://blog.ouankou.com/tags/rodinia/index.xml" rel="self" type="application/rss+xml"/><item><title>How REX Separated GPU-Total From Wall-Clock Noise In pathfinder And srad</title><link>https://blog.ouankou.com/2026/04/28/how-rex-separated-gpu-total-from-wall-clock-noise-in-pathfinder-and-srad/</link><pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.ouankou.com/2026/04/28/how-rex-separated-gpu-total-from-wall-clock-noise-in-pathfinder-and-srad/</guid><description>&lt;p&gt;The previous post closed the last obvious fair &lt;code&gt;b+tree&lt;/code&gt; kernel-body gap. REX was no longer relying on an unfair launch-shape rewrite, and it no longer needed a global cache flag. It recovered read-only provenance in the generated device kernel and emitted selective &lt;code&gt;__ldg(...)&lt;/code&gt; loads where the proof was strong enough.&lt;/p&gt;
&lt;p&gt;That left a strange-looking benchmark table.&lt;/p&gt;
&lt;p&gt;Some rows were clearly resolved. &lt;code&gt;b+tree&lt;/code&gt; had moved from a fair loss into a clear REX win. Several other benchmarks already had stable REX advantages. But three rows still looked suspicious when judged only by the broad comparison table:&lt;/p&gt;</description></item><item><title>How REX Kept b+tree Launch Geometry Fair</title><link>https://blog.ouankou.com/2026/04/26/how-rex-kept-btree-launch-geometry-fair/</link><pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate><guid>https://blog.ouankou.com/2026/04/26/how-rex-kept-btree-launch-geometry-fair/</guid><description>&lt;p&gt;The previous post established the general fairness principles for GPU launch geometry. This post applies those rules to a specific case: the &lt;code&gt;b+tree&lt;/code&gt; benchmark, which exposed a performance gap that remained even after the direct &lt;code&gt;__tgt_target_kernel&lt;/code&gt; and ABI migration work.&lt;/p&gt;
&lt;p&gt;At that point, the easy failure modes were mostly gone. The &lt;code&gt;b+tree&lt;/code&gt; benchmark built. It registered its cubin. It launched its kernels. Its output matched.&lt;/p&gt;
&lt;p&gt;That left the uncomfortable kind of performance gap: a small one.&lt;/p&gt;</description></item></channel></rss>