How REX Cut Build Time With Private Precompiled Headers
With lld held constant, the controlled clean build dropped from 1:29.17 to 0:59.60, a 33.2% reduction in build time, while full CTest, sanitizer, and memcheck gates stayed green. A local 2x2 check showed the rough split: PCH saved about 29 seconds, while switching from bfd to lld saved about 13-14 seconds. The QEMU image-build workflow saw an even larger operational improvement after the PCH change and the required linker/build-script follow-up: riscv64 dropped from 5:13:21 on 2026-05-31 to 1:53:46 on 2026-06-02, and loong64 dropped from 4:05:10 to 1:29:55. For installed REX users, the change is intentionally invisible: same headers, same library, same include model, same ABI.
Compiler projects spend a lot of time compiling themselves.
That sounds obvious, but it becomes painfully concrete in REX. The project is a source-to-source compiler built around Sage IR, generated ROSETTA headers, a Clang frontend, a Flang frontend, OpenMP parsing, unparsing, midend analysis, and a large test suite. Many translation units include the same heavy internal headers. Every clean build asks the compiler to parse those headers again and again.
After the full CTest suite was stabilized, that cost became more visible. Linker selection had already been improved by preferring faster linkers such as lld, but linking was no longer the only major cost. The compile phase was still spending a lot of time re-reading the same large header surface.
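The optimization was to use precompiled headers, but only in a way that preserved the public contract: PCH applies privately to REX’s own targets, and installed users keep the same headers, the same library, the same include model, and the same ABI.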
That boundary is why the change worked as a clean optimization instead of becoming another compatibility burden.
Figure 1. The compile-time problem was repeated parsing. Many translation units paid the same header cost independently.
The Problem
REX has some naturally expensive headers.
The largest source is Sage itself. REX’s internal implementation files commonly start with the header sage3basic.h, which brings in generated declarations, configuration state, core IR classes, utility headers, and enough shared infrastructure for implementation files to talk about Sg* nodes. That header is central to the compiler. It is also expensive to parse.
The Clang frontend adds another heavy surface. It needs LLVM and Clang declarations, REX’s Clang-to-Sage conversion helpers, source-location utilities, declaration/type/statement lowering support, and frontend-private state. Those headers are not optional for the frontend. They are the working set.
The Flang frontend has the same shape. Parsing Fortran through Flang means seeing Flang parser headers, parse-tree visitors, provenance, and the REX builder layer that turns Flang structures into Sage.
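None of this is surprising. Compiler code has big internal APIs. The cost becomes a problem because clean builds repeat it per translation unit: every source file that includes one of these heavy headers pays the full parsing cost again, independently of all the others.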
Modern compilers are fast, but they are not magic. If a project asks them to parse the same heavy header graph hundreds of times, clean build time will show it.
Measurement Before The Fix
The first step was measurement, not guessing.
REX added scripts/measure-build-time.sh for this reason. It configures a Debug Ninja build, builds with a selected job count, and summarizes .ninja_log so the build can be grouped by subsystem and top individual actions. That matters because wall-clock time alone tells you that the build is slow; the Ninja log tells you where the compile work is concentrated.
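The kind of per-subsystem summary described above can be sketched in a few lines of Python. This is not the actual scripts/measure-build-time.sh; the function name, the grouping depth, and the sample log content are assumptions made up for illustration. It relies only on the tab-separated layout of .ninja_log entries (start time, end time, mtime, output path, command hash).

```python
from collections import defaultdict


def summarize_ninja_log(text, depth=2, top=3):
    """Group build-action durations by output-path prefix.

    Each non-comment line of .ninja_log is tab-separated:
    start_ms, end_ms, restat_mtime, output_path, command_hash.
    Summed action time is not wall-clock time, because actions run in parallel.
    """
    durations = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the "# ninja log vN" header and blank lines
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # ignore malformed entries
        start_ms, end_ms, output = int(fields[0]), int(fields[1]), fields[3]
        # A rebuilt output can appear more than once; keep the latest entry.
        durations[output] = (end_ms - start_ms) / 1000.0

    by_group = defaultdict(float)
    for output, seconds in durations.items():
        group = "/".join(output.split("/")[:depth])
        by_group[group] += seconds

    slowest = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return dict(by_group), slowest


# Hypothetical log content, invented for this example (not REX data).
SAMPLE_LOG = "\n".join([
    "# ninja log v5",
    "0\t4200\t0\tsrc/frontend/a.cpp.o\tabc",
    "10\t2510\t0\tsrc/frontend/b.cpp.o\tdef",
    "20\t1020\t0\tsrc/midend/c.cpp.o\t123",
    "4300\t5300\t0\tlib/librose.so\t456",
])

groups, slowest = summarize_ninja_log(SAMPLE_LOG, depth=2, top=2)
for name, seconds in sorted(groups.items(), key=lambda kv: -kv[1]):
    print(f"{seconds:8.2f} s  {name}")
```

Grouping by path prefix is only one possible choice; the point is that the log exposes per-action durations, so the expensive subsystems can be ranked rather than guessed.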
The script also makes the experiment repeatable:
- fresh build directory
- Debug Ninja build
- lld linker
- same source tree
- same measurement path
This is important because build-time optimization can be misleading. A warm build, a partially reused build tree, a different linker, or a different job count can make two measurements look comparable when they are not.
The controlled local result before the PCH change was a clean build time of 1:29.17.
Both the baseline and the PCH measurement used lld. That matters because switching from the default GNU bfd linker to lld can also improve REX build time. The local measurement was designed to keep that linker effect out of the PCH comparison.
The useful observation was that configure time was not the main issue. The compile phase was the target.
What PCH Is
A precompiled header is a compiler cache for parsed header state.
Normally, when a C++ source file includes a large header, the compiler lexes, preprocesses, parses, and semantically analyzes that included header as part of compiling that source file. If one hundred source files include the same large header, much of that work is repeated one hundred times.
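A PCH changes the flow: the compiler processes the shared header once, saves the resulting parsed state, and each source file that uses the PCH loads that saved state instead of parsing the header again.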
It does not make the header disappear. It does not change the C++ language. It does not change object-file semantics by itself. It is a build-time cache, and like every cache it only helps when the selected header is stable, broadly shared, and expensive enough to justify the cache creation cost.
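The cache trade-off can be made concrete with a deliberately simple cost model. The sketch below is an assumption-laden illustration, not a measurement: the function names and every number passed to them are hypothetical, and the model ignores parallel scheduling entirely. It only captures the idea that a PCH pays off once the one-time creation cost is smaller than the parsing it avoids.

```python
def header_cost(n_units, parse_s, pch_create_s=0.0, pch_load_s=None):
    """Total header-processing seconds across n_units translation units.

    Without a PCH (pch_load_s is None) every unit parses the header itself.
    With a PCH the header is parsed once and each unit loads the cached state.
    """
    if pch_load_s is None:
        return n_units * parse_s
    return pch_create_s + n_units * pch_load_s


def break_even_units(parse_s, pch_create_s, pch_load_s):
    """Smallest number of units for which the PCH is a strict net win."""
    saved_per_unit = parse_s - pch_load_s
    if saved_per_unit <= 0:
        return None  # loading is no cheaper than parsing: it never pays off
    # Need n > pch_create_s / saved_per_unit, so take floor + 1.
    return int(pch_create_s // saved_per_unit) + 1


# Hypothetical costs in seconds, chosen only to make the arithmetic easy.
without_pch = header_cost(100, parse_s=2.0)
with_pch = header_cost(100, parse_s=2.0, pch_create_s=3.5, pch_load_s=0.25)
print(without_pch, with_pch, break_even_units(2.0, 3.5, 0.25))
```

In this toy model a header shared by only a couple of files would not justify the cache, which matches the point above about needing a stable, broadly shared, expensive header.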
That makes PCH a good fit for REX’s internal implementation headers:
- sage3basic.h is widely shared by in-tree implementation files.
- Clang frontend files share an expensive Clang/LLVM-facing header surface.
- Flang frontend files share an expensive Flang-facing header surface.
- REX tools and module targets that link ROSE repeatedly include public ROSE headers.
It also explains what PCH is not good for. It should not be used to hide broken include dependencies. It should not be exported casually to downstream users. It should not force every consumer of an installed compiler library to adopt the producer’s build-cache strategy.
The REX Solution
The REX change added CMake-owned PCH support with a default-on option, ROSE_ENABLE_PCH.
The option can be disabled at configure time by passing -DROSE_ENABLE_PCH=OFF to CMake.
The important implementation choice is that PCH is applied privately.
REX calls the CMake command target_precompile_headers, and the helper applies headers with the PRIVATE scope keyword. That means the PCH affects how REX’s own targets are compiled. It does not become part of the installed target Rose::rose usage requirements.
The main internal layers are:
- A Sage PCH based on the generated build-tree sage3basic.h.
- A Clang frontend PCH based on a small wrapper that includes sage3basic.h plus frontend-private Clang headers.
- A Flang frontend PCH based on a small wrapper that includes sage3basic.h plus Flang parser headers.
- An internal rosePublicPch object target that precompiles rose.h for in-tree tools and modules that link ROSE.
The rosePublicPch name can sound public, but the build-system role is internal. It creates a reusable PCH seed for REX’s own executable and module targets. It is not an installed interface target for users.
Figure 2. PCH stays on the producer side. Installed users still see the normal REX headers and library interface.
This boundary matters because REX has two audiences:
- developers building REX itself,
- users building tools against an installed REX.
The PCH optimization is for the first audience. It should help developers and CI build REX faster. It should not ask the second audience to care.
Why It Is Invisible To Installed Users
For installed REX users, the change is intentionally invisible.
The installed files and usage model remain the same:
- include rose.h,
- link against librose / Rose::rose,
- use the same public headers,
- use the same compiler and linker model as before.
There is no installed PCH file that users must include. There is no new downstream CMake requirement saying “your target must reuse REX’s PCH.” There is no ABI effect, because PCH is not a runtime artifact. It is a compiler-side cache used while building REX.
That is the key distinction: PCH changes how REX itself is compiled, not what REX installs or exports.
This is also why the CMake scope matters. If PCH were exported as an interface requirement, downstream projects could inherit compiler-specific assumptions from the REX build. That would be fragile. The REX change avoids that by keeping PCH private to in-tree targets.
There is one visible CMake option for source builders: ROSE_ENABLE_PCH. If a compiler package, CMake version, or local environment has a PCH-specific issue, the optimization can be turned off without changing the rest of the build.
Include Discipline Still Matters
PCH works best when implementation files include the PCH header first, before meaningful C++ tokens are processed.
REX already had a sage3basic.h convention: implementation files that become part of librose should include that header early. That convention was not invented for this optimization, but PCH makes it more valuable.
The optimization does not mean every header should include sage3basic.h. In fact, that would be the wrong direction. Public and internal headers should stay disciplined about what they include. Implementation files can pay for broad context; headers should avoid forcing broad context on every user.
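That distinction preserves both build speed and public usability: implementation files include sage3basic.h first and benefit from the PCH, while headers keep their own includes narrow.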
PCH rewards the existing implementation-file convention without weakening the installed header contract.
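The include-first convention can be checked mechanically. The following is a minimal sketch of such a check, assuming a simple line-based scan; the function names are hypothetical and this is not a tool that ships with REX. It strips comments in a simplified way and does not understand string literals or disabled #if blocks.

```python
import re

INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')


def first_code_line(source_text):
    """Return the first line that is not blank or comment-only, or None."""
    # Simplification: remove /* ... */ comments with a regex, then cut // comments.
    text = re.sub(r"/\*.*?\*/", "", source_text, flags=re.S)
    for line in text.splitlines():
        code = line.split("//", 1)[0].strip()
        if code:
            return code
    return None


def starts_with_include(source_text, header="sage3basic.h"):
    """True if the first real line of the file is #include of `header`."""
    code = first_code_line(source_text)
    if code is None:
        return False
    match = INCLUDE_RE.match(code)
    return bool(match) and match.group(1) == header


good = '// license text\n/* block\n comment */\n#include "sage3basic.h"\n#include <vector>\n'
bad = '#include <vector>\n#include "sage3basic.h"\n'
print(starts_with_include(good), starts_with_include(bad))
```

A check like this would apply only to implementation files that become part of librose; running it over headers would push in exactly the wrong direction described above.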
Evaluation
The controlled clean-build measurement after the PCH change was 0:59.60.
Compared with the 1:29.17 baseline, that saves 29.57 seconds, a 33.2% reduction.
Configure time stayed effectively flat.
That is the shape we wanted. The optimization targeted compile work, and the measurement showed compile-time reduction rather than a shifted configure cost.
The validation gate mattered as much as the timing: the full CTest, sanitizer, and memcheck gates all stayed green.
Build-time improvements are not useful if they make the compiler less trustworthy. The result had to preserve the normal regression guarantees.
Figure 3. The optimization was accepted only after timing improved and the normal regression gates stayed green.
Separating PCH From Linker Choice
There were two build-speed improvements close together:
- use PCH to avoid repeated C++ header parsing,
- use a faster linker, especially lld, instead of the default bfd path.
They should not be treated as equal contributors.
On the local native build, PCH was the larger effect. A 2x2 clean Debug Ninja measurement on the same workspace gave:
| PCH | Linker | Clean build time (M:SS) |
|---|---|---|
| off | bfd | 1:42.40 |
| off | lld | 1:29.17 |
| on | bfd | 1:13.44 |
| on | lld | 0:59.60 |
The main deltas were:
- PCH effect with lld held constant: 1:29.17 -> 0:59.60, saving 29.57 seconds, about 33%.
- PCH effect with bfd held constant: 1:42.40 -> 1:13.44, saving 28.96 seconds, about 28%.
- lld effect with PCH off: 1:42.40 -> 1:29.17, saving 13.23 seconds, about 13%.
- lld effect with PCH on: 1:13.44 -> 0:59.60, saving 13.84 seconds, about 19%.
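These deltas can be recomputed directly from the four cells of the 2x2 table. The sketch below is only a worked check of the arithmetic; the helper names are assumptions, and the times are copied from the table above. Each percentage uses the slower configuration of the pair as its denominator.

```python
def parse_mss(text):
    """Convert an 'M:SS.ss' string to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


# Clean build times from the 2x2 table, keyed by (PCH, linker).
TIMES = {
    ("off", "bfd"): parse_mss("1:42.40"),
    ("off", "lld"): parse_mss("1:29.17"),
    ("on", "bfd"): parse_mss("1:13.44"),
    ("on", "lld"): parse_mss("0:59.60"),
}


def delta(before, after):
    """Return (seconds saved, percent saved relative to `before`)."""
    saved = TIMES[before] - TIMES[after]
    return round(saved, 2), round(100.0 * saved / TIMES[before], 1)


COMPARISONS = {
    "PCH effect, lld held constant": (("off", "lld"), ("on", "lld")),
    "PCH effect, bfd held constant": (("off", "bfd"), ("on", "bfd")),
    "lld effect, PCH off": (("off", "bfd"), ("off", "lld")),
    "lld effect, PCH on": (("on", "bfd"), ("on", "lld")),
}

for label, (before, after) in COMPARISONS.items():
    saved, pct = delta(before, after)
    print(f"{label}: {saved:.2f} s saved ({pct:.1f}%)")
```

Keeping the denominator explicit makes it clear why the lld percentage is higher with PCH on: the seconds saved are nearly the same, but the build they are divided by is shorter.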
The exact percentages depend on which denominator is used, and the effects are not perfectly additive because a highly parallel build changes scheduling as soon as link steps move. But the relative size of the two effects is clear: PCH saved about 29 seconds, while lld saved about 13 to 14 seconds.
The reason is structural. PCH affects many C++ compilation actions. lld affects the link actions. REX has many more expensive compile actions than final link actions, so PCH has the broader surface.
That does not make lld unimportant. Linker choice still matters, and under QEMU the cost of bfd can be much worse than on native x86_64. That cost can also be architecture-sensitive. The observed image-build behavior is consistent with loong64 benefiting especially strongly from avoiding the slow bfd path, plausibly because bfd work is more expensive there under emulation than it is for riscv64. That is a linker hypothesis, not a PCH result, and it would need a per-architecture image-build matrix with PCH held constant and only the linker changed to quantify precisely.
The practical split is still clear: PCH is the dominant compile-time optimization; lld is a smaller but meaningful link-time optimization whose value can be amplified on slow emulated targets.
Nightly QEMU Image Builds
The local measurement isolates the PCH change best. The nightly image workflow answers a different question: did the change make the slowest CI path meaningfully faster in practice?
That answer was yes.
The nightly image workflow builds riscv64 and loong64 Docker images under QEMU. Those are much slower than native local builds, so repeated C++ header parsing is amplified. The last successful scheduled image run before the PCH merge was on 2026-05-31 at commit 8df03f1947. The first fully successful scheduled image run after the PCH merge and the follow-up linker/build-script fix was on 2026-06-02 at commit 5103c6a496.
The job timings were:
| Workflow run | Date | Commit | Job | Duration (H:MM:SS) |
|---|---|---|---|---|
| 26708483908 | 2026-05-31 | 8df03f1947 | build-riscv64 | 5:13:21 |
| 26708483908 | 2026-05-31 | 8df03f1947 | build-loong64 | 4:05:10 |
| 26811176748 | 2026-06-02 | 5103c6a496 | build-riscv64 | 1:53:46 |
| 26811176748 | 2026-06-02 | 5103c6a496 | build-loong64 | 1:29:55 |
| 26876672195 | 2026-06-03 | 5103c6a496 | build-riscv64 | 1:59:21 |
| 26876672195 | 2026-06-03 | 5103c6a496 | build-loong64 | 1:23:38 |
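That is a large operational shift. Comparing 2026-05-31 with 2026-06-02:
| Job | 2026-05-31 | 2026-06-02 | Time saved (H:MM:SS) |
|---|---|---|---|
| build-riscv64 | 5:13:21 | 1:53:46 | 3:19:35 |
| build-loong64 | 4:05:10 | 1:29:55 | 2:35:15 |
The percentages are roughly:
| Job | Reduction |
|---|---|
| build-riscv64 | about 64% |
| build-loong64 | about 63% |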
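The comparison between the 2026-05-31 and 2026-06-02 runs is plain duration arithmetic on the job timings in the table above. A minimal sketch, with hypothetical helper names, that recomputes the time saved and the percentage reduction per job:

```python
def parse_hms(text):
    """Convert an 'H:MM:SS' string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Format whole seconds back into 'H:MM:SS'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Job durations from the workflow table: (2026-05-31 run, 2026-06-02 run).
JOBS = {
    "build-riscv64": ("5:13:21", "1:53:46"),
    "build-loong64": ("4:05:10", "1:29:55"),
}


def reduction(before, after):
    """Return (time saved as H:MM:SS, percent reduction relative to before)."""
    b, a = parse_hms(before), parse_hms(after)
    return format_hms(b - a), round(100.0 * (b - a) / b, 1)


for job, (before, after) in JOBS.items():
    saved, pct = reduction(before, after)
    print(f"{job}: {before} -> {after}, saved {saved} ({pct}%)")
```

These are whole-job durations, so the result describes the combined effect of everything that changed between the two runs, not PCH alone.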
The 2026-06-03 scheduled run repeated the same shape, staying under two hours for riscv64 and under ninety minutes for loong64.
There are two important caveats.
First, the QEMU workflow records do not give a perfect PCH-only and linker-only decomposition. They measure whole image jobs. Those jobs include configure, compile, link, install, image export, and push. They also run under emulation, where architecture-specific tool behavior can dominate in ways that a native x86_64 build does not show. In particular, if bfd is substantially slower on emulated loong64 than on emulated riscv64, the loong64 job can show a larger linker-choice benefit even though the PCH improvement remains the broader compile-time change.
Second, the first scheduled run after the PCH merge, on 2026-06-01 at commit f9d73a90cd, failed quickly in the riscv64 job because the image workflow still had a linker-selection problem. The loong64 job in that same run succeeded and used mold, finishing in 1:24:09, already showing the post-PCH image-build shape. The follow-up commit made the Docker path use build-rex.sh, selected versioned ld.lld-22 consistently, and avoided the failing riscv64 mold path. So the successful QEMU comparison measures the practical result after both pieces were present:
- PCH reduces the compile cost.
- lld reduces linker cost compared with bfd where it is on the path.
- The linker/build-script follow-up makes the image workflow take the intended build path.
The controlled local measurement remains the cleanest PCH-only number. The nightly image data shows why the combined build-system optimization mattered even more in practice: the slowest CI path was dominated enough by compile work that private PCH, helped by the consistent faster-linker path, turned four-to-five-hour QEMU builds into builds of roughly two hours or less.
What Was Not Optimized Away
The PCH change did not try to reduce compiler work by weakening the compiler.
It did not remove generated headers. It did not split the public API differently. It did not hide include errors with accidental order dependencies. It did not change test coverage. It did not mask sanitizer or memcheck findings. It did not ask the unparser or frontend to behave differently.
The improvement came from avoiding repeated parsing of stable, expensive headers.
That matters because build optimizations can easily become semantic changes in disguise. For a compiler infrastructure project, that would be a bad trade. Faster builds are valuable only when the project being built is still the same project.
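The accepted boundary was therefore narrow: a private, in-tree build cache that can be switched off with ROSE_ENABLE_PCH, with no change to the installed headers, the library, or the ABI.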
Conclusion
PCH was a good fit for REX because the project had a specific build-time problem: many in-tree targets repeatedly parsed the same expensive compiler headers.
The solution was not to expose a new user-facing build model. It was to keep PCH private, target the expensive internal header surfaces, reuse PCH where CMake could do so safely, and leave installed users alone.
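That is why the optimization is unusually clean: faster clean builds for developers and CI, the same regression gates, and no visible change for installed users.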
For REX, that is the ideal kind of build performance improvement. It removes waiting time without adding a new concept that users have to learn.