# CUDA PCH blocker
Why PCH is off by default, and what it would take to turn it on.
## The symptom
Attempting to build a precompiled header for `srdatalog.h` and
consume it in a normal compile:

```bash
# Build host-only PCH
clang++ -std=c++23 -fPIC -x cuda --cuda-host-only \
  --cuda-path=/opt/cuda-12.9 \
  -I... -D... -Xclang -emit-pch \
  -c _pch_stub.cu -o srdatalog.pch

# Consume it in a trivial TU
echo 'int main() {return 0;}' > t.cu
clang++ ... --cuda-host-only -include-pch srdatalog.pch -c t.cu -o t.o
```
fails with:

```
cuda_wrappers/new:95:51: error: 'operator new' has different
definitions in different modules; defined here first difference
is 1st parameter with type 'unsigned long'
   95 | __device__ inline void *operator new(__SIZE_TYPE__, void *__ptr)
/usr/include/c++/13/new:174:57: note: but in '' found 1st parameter
with type 'std::size_t' (aka 'unsigned long')
  174 | _GLIBCXX_NODISCARD inline void* operator new(std::size_t, ...)
```
Both `__SIZE_TYPE__` and `std::size_t` resolve to `unsigned long` on
x86_64, but clang-20's PCH module-ODR check compares the textual
declarations and rejects the mismatch.
## Why none of the obvious workarounds apply
| Workaround | Blocker |
|---|---|
| NVIDIA's CUDA SDK (…) | Doesn't compile our C++23 runtime (…) |
| Older clang (≤ 18) | No C++23 support (we use …) |
| … | No … |
| … | ODR check is semantic, not "validation" — these flags don't disable it. |
| Split host + device PCH (…) | The device-side … |
## What would unblock it
One of:

- **Clang fixes the ODR regression.** The check was tightened in clang-20 and now false-positives on `cuda_wrappers/new` vs. libstdc++'s `<new>`. Worth filing upstream.
- **Restructure `gpu/search.h`** (and a few other runtime headers) to isolate host-only vs. device-only code so that the consumer TU doesn't transitively pull in `cuda_wrappers/new`. A multi-day structural refactor, but it would unlock a ~40–60 s compile-time win for doop and should make future toolchain upgrades easier.
- **Swap out GCC's libstdc++** for a CUDA-compatible alternative. Not a realistic workaround on Linux with NVIDIA CUDA 12.x: every distribution ships libstdc++ and NVIDIA's headers assume it.
## The scaffold is still there
`srdatalog.codegen.jit.compiler_ninja.emit_build_ninja()`
accepts a `use_pch=True` kwarg that emits the full split host/device
PCH rules. It's off by default because it doesn't work today, but
the code path is preserved so someone (you, or clang, or a runtime
refactor) can flip it on and immediately measure the savings.

```python
from srdatalog.codegen.jit.compiler_ninja import emit_build_ninja

emit_build_ninja(project_result, config, use_pch=True)
```
If you try this and it works on your setup, please open an issue with your clang / CUDA / libstdc++ versions — it probably means the upstream bug is fixed.