How REX Cut Build Time With Private Precompiled Headers

REX build time was dominated by repeatedly parsing the same expensive compiler headers, especially Sage, Clang, and Flang-facing headers. The fix was not a semantic refactor. It was a build-system optimization: add private precompiled headers for in-tree targets, reuse a public ROSE PCH for internal tools that link ROSE, and keep the installed interface unchanged. A controlled clean Debug Ninja measurement with lld held constant dropped from 1:29.17 to 0:59.60, a 33.2% reduction in build time, while full CTest, sanitizer, and memcheck gates stayed green. A local 2x2 check showed the rough split: PCH saved about 29 seconds, while switching from bfd to lld saved about 13-14 seconds. The QEMU image-build workflow saw an even larger operational improvement after the PCH change and the required linker/build-script follow-up: riscv64 dropped from 5:13:21 on 2026-05-31 to 1:53:46 on 2026-06-02, and loong64 dropped from 4:05:10 to 1:29:55. For installed REX users, the change is intentionally invisible: same headers, same library, same include model, same ABI.

Compiler projects spend a lot of time compiling themselves.

That sounds obvious, but it becomes painfully concrete in REX. The project is a source-to-source compiler built around Sage IR, generated ROSETTA headers, a Clang frontend, a Flang frontend, OpenMP parsing, unparsing, midend analysis, and a large test suite. Many translation units include the same heavy internal headers. Every clean build asks the compiler to parse those headers again and again.

After the full CTest suite was stabilized, that cost became more visible. Linker selection had already been improved by preferring faster linkers such as lld, but linking was no longer the only major cost. The compile phase was still spending a lot of time re-reading the same large header surface.

The optimization was to use precompiled headers, but only in a way that preserved the public contract:

speed up REX's own build;
do not change how installed REX users include headers;
do not export PCH requirements to downstream projects;
do not change compiler behavior, ABI, or runtime semantics.

That boundary is why the change worked as a clean optimization instead of becoming another compatibility burden.

A diagram showing many REX source files repeatedly parsing the same Sage, Clang, and Flang header surface before code generation and linking.

Figure 1. The compile-time problem was repeated parsing. Many translation units paid the same header cost independently.

The Problem

REX has some naturally expensive headers.

The largest source is Sage itself. REX’s internal implementation files commonly start with the header sage3basic.h, which brings in generated declarations, configuration state, core IR classes, utility headers, and enough shared infrastructure for implementation files to talk about Sg* nodes. That header is central to the compiler. It is also expensive to parse.

The Clang frontend adds another heavy surface. It needs LLVM and Clang declarations, REX’s Clang-to-Sage conversion helpers, source-location utilities, declaration/type/statement lowering support, and frontend-private state. Those headers are not optional for the frontend. They are the working set.

The Flang frontend has the same shape. Parsing Fortran through Flang means seeing Flang parser headers, parse-tree visitors, provenance, and the REX builder layer that turns Flang structures into Sage.

None of this is surprising. Compiler code has big internal APIs. The cost becomes a problem because clean builds repeat it per translation unit:

source file A parses sage3basic.h
source file B parses sage3basic.h
source file C parses sage3basic.h
...
frontend file A parses Clang/LLVM headers
frontend file B parses Clang/LLVM headers
...

Modern compilers are fast, but they are not magic. If a project asks them to parse the same heavy header graph hundreds of times, clean build time will show it.

Measurement Before The Fix

The first step was measurement, not guessing.

REX added scripts/measure-build-time.sh for this reason. It configures a Debug Ninja build, builds with a selected job count, and summarizes .ninja_log so the build can be grouped by subsystem and top individual actions. That matters because wall-clock time alone tells you that the build is slow; the Ninja log tells you where the compile work is concentrated.

The script also makes the experiment repeatable:

  • fresh build directory
  • Debug Ninja build
  • lld linker
  • same source tree
  • same measurement path

This is important because build-time optimization can be misleading. A warm build, a partially reused build tree, a different linker, or a different job count can make two measurements look comparable when they are not.
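The .ninja_log grouping described above can be sketched in a few lines of Python. This is not the REX script. It is a minimal illustration that assumes the usual tab-separated .ninja_log layout (start ms, end ms, mtime, output path, command hash); the function name, the group-by-directory rule, and the sample rows are all hypothetical.

```python
import posixpath
from collections import defaultdict


def summarize_ninja_log(log_text, top=3):
    """Sum action durations per output directory and list the slowest actions.

    Assumes tab-separated rows: start_ms, end_ms, mtime, output path, hash.
    """
    seconds_by_dir = defaultdict(float)
    actions = []
    for line in log_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the "# ninja log vN" header and blank lines
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # ignore malformed rows instead of failing
        seconds = (int(fields[1]) - int(fields[0])) / 1000.0
        output = fields[3]
        # Hypothetical grouping rule: the directory that holds the output file.
        seconds_by_dir[posixpath.dirname(output) or "."] += seconds
        actions.append((seconds, output))
    slowest = sorted(actions, reverse=True)[:top]
    return dict(seconds_by_dir), slowest


# Made-up sample rows for illustration, not REX measurements.
SAMPLE_LOG = "\n".join([
    "# ninja log v5",
    "0\t4200\t0\tsrc/frontend/a.cpp.o\t1a",
    "100\t2600\t0\tsrc/midend/b.cpp.o\t2b",
    "4200\t5000\t0\tbin/tool\t3c",
])

if __name__ == "__main__":
    groups, slowest = summarize_ninja_log(SAMPLE_LOG)
    for name, seconds in sorted(groups.items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {seconds:6.2f} s")
    for seconds, output in slowest:
        print(f"{seconds:6.2f} s  {output}")
```

Per-directory sums count action time, not wall time: actions overlap in a parallel build, so the totals can exceed the wall clock. They are useful for ranking where compile work is concentrated, not for predicting the final build duration.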

The controlled local result before the PCH change was:

baseline clean build: 1:29.17 wall time
configure time:        28.63 seconds

Both the baseline and the PCH measurement used lld. That matters because switching from the default GNU bfd linker to lld can also improve REX build time. The local measurement was designed to keep that linker effect out of the PCH comparison.

The useful observation was that configure time was not the main issue. The compile phase was the target.

What PCH Is

A precompiled header is a compiler cache for parsed header state.

Normally, when a C++ source file includes a large header, the compiler lexes, preprocesses, parses, and semantically analyzes that included header as part of compiling that source file. If one hundred source files include the same large header, much of that work is repeated one hundred times.

A PCH changes the flow:

parse the selected header once
save the compiler's parsed representation
reuse that representation when compiling matching source files

It does not make the header disappear. It does not change the C++ language. It does not change object-file semantics by itself. It is a build-time cache, and like every cache it only helps when the selected header is stable, broadly shared, and expensive enough to justify the cache creation cost.

That makes PCH a good fit for REX’s internal implementation headers:

  • sage3basic.h is widely shared by in-tree implementation files.
  • Clang frontend files share an expensive Clang/LLVM-facing header surface.
  • Flang frontend files share an expensive Flang-facing header surface.
  • REX tools and module targets that link ROSE repeatedly include public ROSE headers.

It also explains what PCH is not good for. It should not be used to hide broken include dependencies. It should not be exported casually to downstream users. It should not force every consumer of an installed compiler library to adopt the producer’s build-cache strategy.
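The cache trade-off in this section can be put into a toy cost model. The sketch below is illustrative arithmetic only: the function names and every input number are hypothetical, not REX measurements, and the model counts serial compiler seconds rather than parallel wall time.

```python
import math


def pch_savings(n_sources, parse_s, create_s, load_s):
    """Compiler seconds saved by a PCH shared across n_sources files.

    parse_s  : cost of parsing the header in one translation unit
    create_s : one-time cost of building the PCH
    load_s   : cost of loading the PCH in one translation unit
    """
    without_pch = n_sources * parse_s
    with_pch = create_s + n_sources * load_s
    return without_pch - with_pch


def break_even_sources(parse_s, create_s, load_s):
    """Smallest number of sharing files for which the PCH is a net win."""
    per_file_gain = parse_s - load_s
    if per_file_gain <= 0:
        return None  # loading is no cheaper than parsing: never pays off
    return math.floor(create_s / per_file_gain) + 1


if __name__ == "__main__":
    # Hypothetical costs in seconds, chosen only to show the shape of the model.
    print(break_even_sources(parse_s=2.0, create_s=3.0, load_s=0.25))
    print(pch_savings(n_sources=100, parse_s=2.0, create_s=3.0, load_s=0.25))
```

The model shows why "broadly shared" and "expensive" both matter: the one-time creation cost is repaid only when enough files share the header and each of them would otherwise pay a large parse cost.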

The REX Solution

The REX change added CMake-owned PCH support with a default-on option:

ROSE_ENABLE_PCH=ON

The option can be disabled with:

-DROSE_ENABLE_PCH=OFF

The important implementation choice is that PCH is applied privately.

REX calls the CMake command target_precompile_headers, and its helper applies the headers with the PRIVATE scope keyword. That means the PCH affects only how REX’s own targets are compiled. It does not become part of the usage requirements of the installed Rose::rose target.

The main internal layers are:

  • A Sage PCH based on the generated build-tree sage3basic.h.
  • A Clang frontend PCH based on a small wrapper that includes sage3basic.h plus frontend-private Clang headers.
  • A Flang frontend PCH based on a small wrapper that includes sage3basic.h plus Flang parser headers.
  • An internal rosePublicPch object target that precompiles rose.h for in-tree tools and modules that link ROSE.

The rosePublicPch name can sound public, but the build-system role is internal. It creates a reusable PCH seed for REX’s own executable and module targets. It is not an installed interface target for users.

A diagram showing private PCH on the REX build side and unchanged public headers, installed library, and user tools on the consumer side.

Figure 2. PCH stays on the producer side. Installed users still see the normal REX headers and library interface.

This boundary matters because REX has two audiences:

  • developers building REX itself,
  • users building tools against an installed REX.

The PCH optimization is for the first audience. It should help developers and CI build REX faster. It should not ask the second audience to care.

Why It Is Invisible To Installed Users

For installed REX users, the change is intentionally invisible.

The installed files and usage model remain the same:

  • include rose.h,
  • link against librose / Rose::rose,
  • use the same public headers,
  • use the same compiler and linker model as before.

There is no installed PCH file that users must include. There is no new downstream CMake requirement saying “your target must reuse REX’s PCH.” There is no ABI effect, because PCH is not a runtime artifact. It is a compiler-side cache used while building REX.

That is the key distinction:

PCH changes how REX is built from source.
PCH does not change what installed REX is.

This is also why the CMake scope matters. If PCH were exported as an interface requirement, downstream projects could inherit compiler-specific assumptions from the REX build. That would be fragile. The REX change avoids that by keeping PCH private to in-tree targets.

There is one visible CMake option for source builders: ROSE_ENABLE_PCH. If a compiler package, CMake version, or local environment has a PCH-specific issue, the optimization can be turned off without changing the rest of the build.

Include Discipline Still Matters

PCH works best when implementation files include the PCH header first, before meaningful C++ tokens are processed.

REX already had a sage3basic.h convention: implementation files that become part of librose should include that header early. That convention was not invented for this optimization, but PCH makes it more valuable.

The optimization does not mean every header should include sage3basic.h. In fact, that would be the wrong direction. Public and internal headers should stay disciplined about what they include. Implementation files can pay for broad context; headers should avoid forcing broad context on every user.

That distinction preserves both build speed and public usability:

implementation files can use PCH-friendly broad headers;
headers should expose the narrowest reasonable dependency surface.

PCH rewards the existing implementation-file convention without weakening the installed header contract.
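The include-first convention is easy to check mechanically. The sketch below is a hypothetical lint, not something the REX repository is known to ship: it reports whether the first meaningful line of an implementation file is an include of sage3basic.h. Comment handling is deliberately simplistic (whole-line // comments and /* ... */ blocks that start at the beginning of a line).

```python
import re

INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')


def first_include(source_text):
    """Return the header named by the first meaningful line, or None.

    None means something other than an #include came first.
    """
    in_block_comment = False
    for line in source_text.splitlines():
        stripped = line.strip()
        if in_block_comment:
            if "*/" in stripped:
                in_block_comment = False
            continue
        if stripped.startswith("/*"):
            if "*/" not in stripped:
                in_block_comment = True
            continue
        if not stripped or stripped.startswith("//"):
            continue
        match = INCLUDE_RE.match(line)
        return match.group(1) if match else None  # first real line decides
    return None


def includes_pch_header_first(source_text, header="sage3basic.h"):
    """True when the first meaningful line includes the given header."""
    return first_include(source_text) == header


if __name__ == "__main__":
    sample = '// license\n#include "sage3basic.h"\n#include <vector>\n'
    print(includes_pch_header_first(sample))
```

A check like this belongs on implementation files only. Running it on headers would push in the wrong direction, since headers should not be required to pull in sage3basic.h.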

Evaluation

The controlled clean-build measurement after the PCH change was:

PCH/public-reuse clean build: 0:59.60 wall time
configure time:                28.52 seconds

Compared with the baseline:

baseline clean build:          1:29.17
PCH clean build:               0:59.60
reduction:                     29.57 seconds
relative wall-time reduction:  about 33.2%

Configure time stayed effectively flat:

28.63 seconds before
28.52 seconds after

That is the shape we wanted. The optimization targeted compile work, and the measurement showed compile-time reduction rather than a shifted configure cost.

The validation gate mattered as much as the timing:

full CTest:            35220/35220 passed
targeted pre-push:      6857/6857 passed
sanitizer selection:      530/530 passed
memcheck selection:       530/530 passed

Build-time improvements are not useful if they make the compiler less trustworthy. The result had to preserve the normal regression guarantees.

A ladder showing baseline timing, PCH implementation, faster clean build, full CTest, sanitizer, and memcheck validation.

Figure 3. The optimization was accepted only after timing improved and the normal regression gates stayed green.

Separating PCH From Linker Choice

There were two build-speed improvements close together:

  • use PCH to avoid repeated C++ header parsing,
  • use a faster linker, especially lld, instead of the default bfd path.

They should not be treated as equal contributors.

On the local native build, PCH was the larger effect. A 2x2 clean Debug Ninja measurement on the same workspace gave:

PCH   Linker   Clean build time (M:SS.ss)
off   bfd      1:42.40
off   lld      1:29.17
on    bfd      1:13.44
on    lld      0:59.60

The main deltas were:

  • PCH effect with lld held constant: 1:29.17 -> 0:59.60, saving 29.57 seconds, about 33%.
  • PCH effect with bfd held constant: 1:42.40 -> 1:13.44, saving 28.96 seconds, about 28%.
  • lld effect with PCH off: 1:42.40 -> 1:29.17, saving 13.23 seconds, about 13%.
  • lld effect with PCH on: 1:13.44 -> 0:59.60, saving 13.84 seconds, about 19%.
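The deltas above can be recomputed directly from the four measured times. The snippet below is a small arithmetic sketch: the times are the ones in the 2x2 table, the helper names are illustrative, and each percentage uses the slower "before" time as its denominator.

```python
def to_seconds(stamp):
    """Convert an M:SS.ss wall-time string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Clean Debug Ninja wall times from the 2x2 measurement, keyed by (PCH, linker).
TIMES = {
    ("off", "bfd"): "1:42.40",
    ("off", "lld"): "1:29.17",
    ("on", "bfd"): "1:13.44",
    ("on", "lld"): "0:59.60",
}


def delta(before, after):
    """Return (seconds saved, percent of the 'before' time saved)."""
    b = to_seconds(TIMES[before])
    a = to_seconds(TIMES[after])
    saved = b - a
    return round(saved, 2), round(100 * saved / b, 1)


if __name__ == "__main__":
    print("PCH, lld held constant:", delta(("off", "lld"), ("on", "lld")))
    print("PCH, bfd held constant:", delta(("off", "bfd"), ("on", "bfd")))
    print("lld, PCH off:          ", delta(("off", "bfd"), ("off", "lld")))
    print("lld, PCH on:           ", delta(("on", "bfd"), ("on", "lld")))
```

The seconds saved by lld are nearly the same in both rows, but the percentage is larger with PCH on because the remaining build is shorter.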

The exact percentages depend on which denominator is used, and the effects are not perfectly additive because a highly parallel build changes scheduling as soon as link steps move. But the ordering is clear:

PCH saved roughly twice as much as the linker change in this native build.

The reason is structural. PCH affects many C++ compilation actions. lld affects the link actions. REX has many more expensive compile actions than final link actions, so PCH has the broader surface.

That does not make lld unimportant. Linker choice still matters, and under QEMU the cost of bfd can be much worse than on native x86_64. That cost can also be architecture-sensitive. The observed image-build behavior is consistent with loong64 benefiting especially strongly from avoiding the slow bfd path, plausibly because bfd work is more expensive there under emulation than it is for riscv64. That is a linker hypothesis, not a PCH result, and it would need a per-architecture image-build matrix with PCH held constant and only the linker changed to quantify precisely.

The practical split is still clear: PCH is the dominant compile-time optimization; lld is a smaller but meaningful link-time optimization whose value can be amplified on slow emulated targets.

Nightly QEMU Image Builds

The local measurement isolates the PCH change best. The nightly image workflow answers a different question:

does the same improvement matter in the slowest operational builds?

That answer was yes.

The nightly image workflow builds riscv64 and loong64 Docker images under QEMU. Those are much slower than native local builds, so repeated C++ header parsing is amplified. The last successful scheduled image run before the PCH merge was on 2026-05-31 at commit 8df03f1947. The first fully successful scheduled image run after the PCH merge and the follow-up linker/build-script fix was on 2026-06-02 at commit 5103c6a496.

The job timings were:

Workflow run   Date         Commit       Job             Duration (H:MM:SS)
26708483908    2026-05-31   8df03f1947   build-riscv64   5:13:21
26708483908    2026-05-31   8df03f1947   build-loong64   4:05:10
26811176748    2026-06-02   5103c6a496   build-riscv64   1:53:46
26811176748    2026-06-02   5103c6a496   build-loong64   1:29:55
26876672195    2026-06-03   5103c6a496   build-riscv64   1:59:21
26876672195    2026-06-03   5103c6a496   build-loong64   1:23:38

That is a large operational shift. Comparing 2026-05-31 with 2026-06-02:

riscv64: 5:13:21 -> 1:53:46
loong64: 4:05:10 -> 1:29:55

The percentages are roughly:

riscv64: about 64% less wall time
loong64: about 63% less wall time
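For readers who want to check the percentages, the following sketch converts the H:MM:SS job durations to seconds and derives both the relative reduction and the equivalent speedup factor. The durations are the 2026-05-31 and 2026-06-02 values from the table; the helper names are illustrative.

```python
def hms_to_seconds(stamp):
    """Convert an H:MM:SS duration string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def reduction(before, after):
    """Return (percent less wall time, speedup factor) for two durations."""
    b = hms_to_seconds(before)
    a = hms_to_seconds(after)
    return round(100 * (b - a) / b, 1), round(b / a, 2)


# (before, after) job durations: 2026-05-31 run versus 2026-06-02 run.
JOBS = {
    "riscv64": ("5:13:21", "1:53:46"),
    "loong64": ("4:05:10", "1:29:55"),
}

if __name__ == "__main__":
    for arch, (before, after) in JOBS.items():
        percent, speedup = reduction(before, after)
        print(f"{arch}: {percent}% less wall time, {speedup}x faster")
```

Both jobs land at a little over 2.7x faster, which is the same result expressed as a ratio instead of a percentage.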

The 2026-06-03 scheduled run repeated the same shape, staying under two hours for riscv64 and under ninety minutes for loong64.

There are two important caveats.

First, the QEMU workflow records do not give a perfect PCH-only and linker-only decomposition. They measure whole image jobs. Those jobs include configure, compile, link, install, image export, and push. They also run under emulation, where architecture-specific tool behavior can dominate in ways that a native x86_64 build does not show. In particular, if bfd is substantially slower on emulated loong64 than on emulated riscv64, the loong64 job can show a larger linker-choice benefit even though the PCH improvement remains the broader compile-time change.

Second, the first scheduled run after the PCH merge, on 2026-06-01 at commit f9d73a90cd, failed quickly in the riscv64 job because the image workflow still had a linker-selection problem. The loong64 job in that same run succeeded and used mold, finishing in 1:24:09, already showing the post-PCH image-build shape. The follow-up commit made the Docker path use build-rex.sh, selected versioned ld.lld-22 consistently, and avoided the failing riscv64 mold path. So the successful QEMU comparison measures the practical result after both pieces were present:

  • PCH reduces the compile cost.
  • lld reduces linker cost compared with bfd where it is on the path.
  • The linker/build-script follow-up makes the image workflow take the intended build path.

The controlled local measurement remains the cleanest PCH-only number. The nightly image data shows why the combined build-system optimization mattered even more in practice: the slowest CI path was dominated enough by compile work that private PCH, helped by the consistent faster-linker path, turned four-to-five-hour QEMU builds into builds of roughly two hours or less.

What Was Not Optimized Away

The PCH change did not try to reduce compiler work by weakening the compiler.

It did not remove generated headers. It did not split the public API differently. It did not hide include errors with accidental order dependencies. It did not change test coverage. It did not mask sanitizer or memcheck findings. It did not ask the unparser or frontend to behave differently.

The improvement came from avoiding repeated parsing of stable, expensive headers.

That matters because build optimizations can easily become semantic changes in disguise. For a compiler infrastructure project, that would be a bad trade. Faster builds are valuable only when the project being built is still the same project.

The accepted boundary was therefore narrow:

same source semantics,
same public install model,
same tests,
less repeated header parsing.

Conclusion

PCH was a good fit for REX because the project had a specific build-time problem: many in-tree targets repeatedly parsed the same expensive compiler headers.

The solution was not to expose a new user-facing build model. It was to keep PCH private, target the expensive internal header surfaces, reuse PCH where CMake could do so safely, and leave installed users alone.

That is why the optimization is unusually clean:

source builders get faster builds;
installed users keep the same headers and usage;
validation gates stay green;
the compiler's behavior does not change.

For REX, that is the ideal kind of build performance improvement. It removes waiting time without adding a new concept that users have to learn.