Source code for srdatalog.ir.dialects.parallel.data.block_group

'''par.data.block_group — block-group work-balanced parallelism strategy.

The block-group strategy assigns each CUDA block a contiguous slice of
the total flat work-space `[0, bg_total_work)`. Per-block:

  1. Binary-search `bg_cumulative_work[]` for the starting key.
  2. Iterate keys from there until the block's work budget is consumed.
  3. Inside each key, redistribute work across warps proportional to
     the first source's degree.

Used for skewed root-key workloads where uniform warp-strided dispatch
would leave warps idle. The host produces three runtime inputs:

  - `bg_work_per_key[]`   filled by `kernel_bg_histogram` (per-key
                          work estimate = product of root-source degrees)
  - `bg_cumulative_work[]` exclusive prefix-sum over the histogram
  - `bg_total_work`        sum of all per-key work

This module owns the block-group **dialect ops**:

  - `BgRootCjMulti` — the BG dispatch shape for root multi-source
    ColumnJoin (count/materialize/fused kernel bodies). Lifted N4.1
    from legacy `jit_root_column_join_block_group`. Bundles the
    work-assignment preamble, binary-search key loop, per-source
    handle narrowing, warp-row redistribution, optional D2L
    segment-loops, and the wrapped body into one IR op.
  - `BgSourceSpec` — per-source descriptor consumed by `BgRootCjMulti`.

CUDA emission lives in `codegen/cuda/render/parallel_data.py` — both
`_render_bg_root_cj_multi` (BgRootCjMulti's renderer) and
`emit_bg_histogram_kernel` (the standalone histogram template called
by the runner). Per docs/stage3a_execution_plan.md §7 task S3A.9b,
the dialect file holds only data; rendering is the codegen's job.
'''

from __future__ import annotations

from dataclasses import dataclass
from typing import final

from srdatalog.ir.core import Op


@final
@dataclass(frozen=True, slots=True)
class BgSourceSpec:
    '''Per-source descriptor for `BgRootCjMulti`.

    Carries the legacy state that `jit_root_column_join_block_group`
    threads through its emit: rel_name + view/handle var names,
    multi-view view_count for D2L segment loops, base view-slot for
    per-segment view rebinding, and the index_type passed to legacy
    helpers (`gen_root_handle`, `gen_valid`).
    '''

    rel_name: str
    view_var: str
    handle_var: str
    view_count: int  # 1 = DSAI / single-view; 2 = D2L FULL_VER
    base_slot: int
    index_type: str


@final
@dataclass(frozen=True, slots=True)
class BgRootCjMulti(Op):
    '''Block-group root multi-source ColumnJoin (count/materialize body).

    Lifts legacy `jit_root_column_join_block_group` into a single
    dialect op. Emits the full BG scaffolding around `body`:

      - block-level work assignment preamble (work_per_block,
        block_begin/end, return-if-out-of-range, with
        `thread_counts[thread_id] = 0;` in count mode);
      - binary search for the starting key index;
      - per-key loop with key-range checks;
      - per-source handle narrow with the first source using the
        key_idx hint; multi-view (D2L FULL_VER) non-first sources
        defer their handle bind to a `_bg_seg_<idx>` segment loop;
      - warp-row redistribution narrowing the first source handle
        in proportion to its degree;
      - segment loops (when present) wrapping the body;
      - body emit;
      - segment-loop close braces, `bg_remaining_begin = ...;`,
        key-loop close brace.

    `var_name` is the sanitized inner-bind name
    (`auto <var> = root_val_<n>;`). `is_counting` toggles the
    count-phase `thread_counts[...] = 0;` early-exit branch.
    `key_idx_var`, `root_val_var`, `hint_lo`, `hint_hi` are the
    counter-allocated names. `sources` is in pipeline order
    (sources[0] = first/hint source; sources[i>0] = either
    single-view direct narrow or multi-view segment-loop deferred).
    '''

    var_name: str
    is_counting: bool
    key_idx_var: str
    root_val_var: str
    hint_lo: str
    hint_hi: str
    sources: tuple[BgSourceSpec, ...]
    body: Op


__all__ = ['BgRootCjMulti', 'BgSourceSpec']
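
The work-assignment scheme described in the module docstring (exclusive prefix sum, contiguous per-block slice, binary search for the starting key, then a key walk until the budget is consumed) can be illustrated with a small host-side Python model. This is only a sketch under stated assumptions, not the emitted CUDA: the function names `build_bg_arrays`, `block_slice`, and `keys_for_block` are hypothetical, and the ceiling-division split of `bg_total_work` across blocks is assumed rather than taken from the renderer.

```python
from bisect import bisect_right
from itertools import accumulate


def build_bg_arrays(work_per_key):
    """Return (cumulative_work, total_work); cumulative is an exclusive prefix sum."""
    cumulative = [0] + list(accumulate(work_per_key))
    total = cumulative.pop()  # the last inclusive entry is the grand total
    return cumulative, total


def block_slice(block_id, num_blocks, total_work):
    """Contiguous [begin, end) slice of the flat work-space for one block.

    Assumption: work is split by ceiling division, so late blocks may be empty.
    """
    work_per_block = -(-total_work // num_blocks)  # ceiling division
    begin = min(block_id * work_per_block, total_work)
    end = min(begin + work_per_block, total_work)
    return begin, end


def keys_for_block(block_id, num_blocks, work_per_key):
    """List (key_idx, lo, hi) ranges, relative to each key's own work, for one block."""
    cumulative, total = build_bg_arrays(work_per_key)
    begin, end = block_slice(block_id, num_blocks, total)
    ranges = []
    if begin >= end:
        return ranges  # out-of-range block: nothing to do
    # Step 1: binary search for the last key whose cumulative start is <= begin.
    key = bisect_right(cumulative, begin) - 1
    # Step 2: walk keys until the block's work budget is consumed.
    while key < len(work_per_key) and cumulative[key] < end:
        lo = max(begin, cumulative[key]) - cumulative[key]
        hi = min(end, cumulative[key] + work_per_key[key]) - cumulative[key]
        if lo < hi:  # skips zero-work keys
            ranges.append((key, lo, hi))
        key += 1
    return ranges


if __name__ == "__main__":
    skewed = [1, 8, 1, 2]  # key 1 dominates the workload
    for block in range(3):
        print(block, keys_for_block(block, 3, skewed))
```

With a skewed histogram such as `[1, 8, 1, 2]` and three blocks, the heavy key is shared by all three blocks instead of being owned by one, which is the balancing effect the strategy is after. Step 3 of the docstring (warp-level redistribution inside a key) is not modelled here.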