srdatalog.ir.codegen.cuda.runner

target.cuda — per-rule runner emission.

emit_runner_full(ep, db, rel_index_types) is the canonical entry point that compile.compile_runner calls into. It produces the per-rule JitRunner_<rule> struct plus all kernel definitions and out-of-line phase methods — the content of jit_runner.<rule>.cpp.

Today the implementation delegates to the legacy ir.codegen.cuda.complete_runner.gen_complete_runner for the runner scaffolding (phase methods, type aliases, execute() dispatcher, LaunchParams struct, BG variants, fused kernel) and routes kernel bodies through compile_kernel_body when _dialect_safe_kernel holds. Subsequent milestones port the remaining pieces:

  • N2 Fused composer (count + materialize back-to-back operator())

  • N4 par.data.block_group dialect (BG warp-cumulative dispatch)

  • N5 relation.d2l dialect (multi-view plugin dispatch + setup)

  • N6 Dedup-hash WriteOutput variant

  • N7 Tiled-Cartesian ballot-reuse on relation.sorted_array

  • N8 par.data.atomic_ws dialect (WCOJ task queue)

Each milestone collapses one slice of the delegation into native dialect emission, validated by tests/test_runner_byte_equivalence.py.

The emission output of this module is byte-equivalent (modulo _cpp_norm) to the upstream Nim jit_runner.<rule>.cpp goldens on every fixture that the legacy emitter handled.
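As a rough illustration of what "byte-equivalent modulo a normalizer" means, the sketch below compares two C++ snippets after normalization. The `normalize_cpp` helper is hypothetical and only strips trailing whitespace and blank lines; the actual rules applied by `_cpp_norm` are not described here and may differ.

```python
def normalize_cpp(text: str) -> str:
    """Hypothetical stand-in for _cpp_norm: drop trailing whitespace and blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Two snippets that differ only in whitespace noise.
emitted = "struct JitRunner_Foo {  \n\n  void execute();\n};\n"
golden = "struct JitRunner_Foo {\n  void execute();\n};"

# Equivalence is checked on the normalized text, not on the raw bytes.
equivalent = normalize_cpp(emitted) == normalize_cpp(golden)
```

Under these assumed rules the raw strings differ while the normalized strings match, which is the property a golden-file test would check.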

Module Contents

Functions

emit_execute

<runner_prefix>::execute — top-level dispatcher. For BG materialize rules, fans out into a 5-step pipeline (histogram → prefix sum → BG count → scan + resize → BG materialize) with adaptive fallback to baseline below the size threshold.

emit_execute_fused

<runner_prefix>::execute_fused — single-pass fused dispatcher with speculative output buffer + automatic capacity growth on overflow.

emit_grid_config_code

Grid configuration template — populates <prefix>num_threads / num_blocks based on whether the rule is a binary join (row-based) or WCOJ (unique-key-based).

emit_launch_count

<runner_prefix>::launch_count — fires kernel_count (and the BG variant when is_block_group=True) on the given stream after the zero-key fast path. When is_dedup_hash=True, passes p.dedup_table to the kernel.

emit_launch_fused

<runner_prefix>::launch_fused — fires kernel_fused (or kernel_bg_fused with stream-ordered histogram) into the given stream.

emit_launch_materialize

<runner_prefix>::launch_materialize — fires the materialize kernel (and BG variant when ep.block_group). Pure template; ProvPtrType is always nullptr today (no provenance materialization yet).

emit_launch_params_struct

LaunchParams block — shared between full and decl emission. When for_decl is True the BG-block comment uses the decl variant (“must match JIT batch definition exactly!”) to mirror Nim exactly.

emit_method_forward_decls

Phase-method forward declarations inside struct JitRunner_X.

emit_read_fused_result

<runner_prefix>::read_fused_result reads back the fused write counts and the overflow flag (call after device sync).

emit_read_total

<runner_prefix>::read_total — read the post-scan total count (call after device sync).

emit_runner_decl

Emit the forward-declaration variant: type aliases + LaunchParams + method declarations only.

emit_runner_full

Emit the full per-rule runner — struct + kernel defs + out-of-line phase methods + execute(). Goes into the per-rule jit_batch_N.cpp file at production-build time.

emit_scan_and_resize

<runner_prefix>::scan_and_resize — exclusive prefix-scan over thread_counts, read total, resize each dest relation in place.

emit_scan_only

<runner_prefix>::scan_only — async prefix-scan, no host sync.

emit_struct_type_aliases

Type alias block shared between full and decl. Does NOT include struct JitRunner_X { or the closing brace.

Data

API

srdatalog.ir.codegen.cuda.runner.__all__

['emit_execute', 'emit_execute_fused', 'emit_grid_config_code', 'emit_launch_count', 'emit_launch_fu…

srdatalog.ir.codegen.cuda.runner.emit_execute(rule_name: str, runner_prefix: str, is_count: bool, *, is_block_group: bool = False, is_dedup_hash: bool = False, dest_specs: list[srdatalog.ir.mir.types.InsertInto] | None = None) -> str

<runner_prefix>::execute — top-level dispatcher. For BG materialize rules, fans out into a 5-step pipeline (histogram → prefix sum → BG count → scan + resize → BG materialize) with adaptive fallback to baseline below the size threshold.
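The count, scan + resize, and materialize steps follow the common two-pass pattern for writing variable-length output in parallel. The sketch below models that pattern on the host in plain Python for a toy join; it is not the emitted CUDA, and the `left` and `right` inputs are made-up illustration data. The histogram and block-group steps are not modelled.

```python
from itertools import accumulate

# Toy inputs (hypothetical): one simulated "thread" per left row.
left = [(1, "a"), (2, "b"), (3, "c")]   # (key, payload)
right = {1: ["x", "y"], 3: ["z"]}       # key -> matching payloads

# Pass 1 (count): each thread records how many rows it would emit.
thread_counts = [len(right.get(key, [])) for key, _ in left]

# Scan: an exclusive prefix sum turns counts into per-thread write offsets.
offsets = [0] + list(accumulate(thread_counts))[:-1]
total = sum(thread_counts)

# Resize: allocate exactly `total` output slots.
out = [None] * total

# Pass 2 (materialize): each thread writes into its own disjoint range.
for tid, (key, payload) in enumerate(left):
    pos = offsets[tid]
    for match in right.get(key, []):
        out[pos] = (key, payload, match)
        pos += 1
```

Because every thread owns a disjoint slice of the output, the second pass needs no synchronization between writers.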

srdatalog.ir.codegen.cuda.runner.emit_execute_fused(ep: srdatalog.ir.mir.types.ExecutePipeline, runner_prefix: str) -> str

<runner_prefix>::execute_fused — single-pass fused dispatcher with speculative output buffer + automatic capacity growth on overflow.
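The speculative-buffer idea can be sketched as a retry loop: run once into a buffer of guessed capacity, and if the run overflows, grow the buffer and run again. The Python model below assumes a doubling growth policy and a hypothetical `produce` callable; the growth factor and retry strategy used by the emitted code are not stated here.

```python
def run_fused(produce, capacity):
    """Model a speculative single-pass run that retries after overflow.

    `produce` is a hypothetical zero-argument callable returning an iterable
    of output rows. Doubling the capacity is an assumed growth policy.
    """
    capacity = max(1, capacity)  # guard against a zero capacity that never grows
    while True:
        buffer = [None] * capacity
        written = 0
        overflow = False
        for row in produce():
            if written == capacity:
                overflow = True  # the run would write past the buffer
                break
            buffer[written] = row
            written += 1
        if not overflow:
            return buffer[:written], capacity
        capacity *= 2  # grow and rerun the whole pass


rows, final_capacity = run_fused(lambda: range(10), capacity=4)
```

With ten rows and an initial capacity of 4, this model overflows twice and succeeds on the third attempt with a capacity of 16.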

srdatalog.ir.codegen.cuda.runner.emit_grid_config_code(prefix: str, root_is_scan: bool) -> str

Grid configuration template — populates <prefix>num_threads / num_blocks based on whether the rule is a binary join (row-based) or WCOJ (unique-key-based).
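To make the row-based versus unique-key-based distinction concrete, here is a small host-side sketch of how a launch size could be derived from either work-item count. The function name, the `row_based` flag, the block size of 256, and the reading of num_threads as the total thread count are all assumptions for illustration; how `root_is_scan` maps onto these cases is not specified here.

```python
def grid_config(num_rows, num_unique_keys, row_based, block_size=256):
    """Hypothetical launch sizing: one work item per row or per unique key."""
    work_items = num_rows if row_based else num_unique_keys
    # Ceiling division, with at least one block so the launch is never empty.
    num_blocks = max(1, -(-work_items // block_size))
    num_threads = num_blocks * block_size
    return num_threads, num_blocks


binary_join = grid_config(num_rows=1000, num_unique_keys=10, row_based=True)
wcoj = grid_config(num_rows=1000, num_unique_keys=10, row_based=False)
```

In this model the same relation yields four blocks when sized by rows and a single block when sized by unique keys.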

srdatalog.ir.codegen.cuda.runner.emit_launch_count(runner_prefix: str, *, is_block_group: bool = False, is_dedup_hash: bool = False) -> str

<runner_prefix>::launch_count — fires kernel_count (and the BG variant when is_block_group=True) on the given stream after the zero-key fast path. When is_dedup_hash=True, passes p.dedup_table to the kernel.

srdatalog.ir.codegen.cuda.runner.emit_launch_fused(ep: srdatalog.ir.mir.types.ExecutePipeline, runner_prefix: str) -> str

<runner_prefix>::launch_fused — fires kernel_fused (or kernel_bg_fused with stream-ordered histogram) into the given stream.

srdatalog.ir.codegen.cuda.runner.emit_launch_materialize(ep: srdatalog.ir.mir.types.ExecutePipeline, runner_prefix: str) -> str

<runner_prefix>::launch_materialize — fires the materialize kernel (and BG variant when ep.block_group). Pure template; ProvPtrType is always nullptr today (no provenance materialization yet).

srdatalog.ir.codegen.cuda.runner.emit_launch_params_struct(num_dests: int, is_fused_eligible: bool, is_block_group: bool = False, is_dedup_hash: bool = False, for_decl: bool = False) -> str

LaunchParams block — shared between full and decl emission. When for_decl is True the BG-block comment uses the decl variant (“must match JIT batch definition exactly!”) to mirror Nim exactly.

srdatalog.ir.codegen.cuda.runner.emit_method_forward_decls(is_count: bool, is_fused_eligible: bool) -> str

Phase-method forward declarations inside struct JitRunner_X.

srdatalog.ir.codegen.cuda.runner.emit_read_fused_result(ep: srdatalog.ir.mir.types.ExecutePipeline, runner_prefix: str) -> str

<runner_prefix>::read_fused_result reads back the fused write counts and the overflow flag (call after device sync).

srdatalog.ir.codegen.cuda.runner.emit_read_total(runner_prefix: str) -> str

<runner_prefix>::read_total — read the post-scan total count (call after device sync).

srdatalog.ir.codegen.cuda.runner.emit_runner_decl(ep: srdatalog.ir.mir.types.ExecutePipeline, db_type_name: str, rel_index_types: dict[str, str] | None = None) -> str

Emit the forward-declaration variant: type aliases + LaunchParams + method declarations only. Goes into the main compile unit so the orchestrator can call JitRunner_<rule>::execute().

srdatalog.ir.codegen.cuda.runner.emit_runner_full(ep: srdatalog.ir.mir.types.ExecutePipeline, db_type_name: str, rel_index_types: dict[str, str] | None = None) -> str

Emit the full per-rule runner — struct + kernel defs + out-of-line phase methods + execute(). Goes into the per-rule jit_batch_N.cpp file at production-build time.

srdatalog.ir.codegen.cuda.runner.emit_scan_and_resize(ep: srdatalog.ir.mir.types.ExecutePipeline, runner_prefix: str) -> str

<runner_prefix>::scan_and_resize — exclusive prefix-scan over thread_counts, read total, resize each dest relation in place.
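A host-side model of this step may help when reading the emitted method. The sketch below uses Python lists as stand-ins for device buffers and relations. It assumes every destination grows by the same total and that new rows are appended after the existing ones; both are illustrative assumptions rather than documented behaviour.

```python
from itertools import accumulate


def scan_and_resize(thread_counts, dest_relations):
    """Exclusive scan over per-thread counts, then grow each destination in place."""
    inclusive = list(accumulate(thread_counts))
    total = inclusive[-1] if inclusive else 0
    # Exclusive scan: offsets[i] is the sum of thread_counts[:i].
    offsets = ([0] + inclusive[:-1]) if inclusive else []
    # Remember the old sizes: a thread writes at old_size + offsets[tid].
    old_sizes = [len(rel) for rel in dest_relations]
    for rel in dest_relations:
        rel.extend([None] * total)  # in-place growth, existing rows preserved
    return offsets, total, old_sizes


dest = [["existing_row"], []]
offsets, total, old_sizes = scan_and_resize([2, 0, 3, 1], dest)
```

Keeping the pre-resize sizes lets a later materialize pass compute each thread's absolute write position without rescanning.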

srdatalog.ir.codegen.cuda.runner.emit_scan_only(runner_prefix: str) -> str

<runner_prefix>::scan_only — async prefix-scan, no host sync.

srdatalog.ir.codegen.cuda.runner.emit_struct_type_aliases(rule_name: str, db_type_name: str, first_schema: str, first_version: str, dest_specs: list[srdatalog.ir.mir.types.InsertInto], dest_arities: list[int], total_view_count: int) -> str

Type alias block shared between full and decl. Does NOT include struct JitRunner_X { or the closing brace.