hilbertsfc.torch

PyTorch API for HilbertSFC.

This subpackage provides 2D/3D Hilbert and Morton encode/decode functions that operate on integer torch.Tensor inputs.

hilbert_encode_2d

hilbert_encode_2d(
    x: Tensor,
    y: Tensor,
    *,
    nbits: int | None = None,
    out: Tensor | None = None,
    lut_cache: TorchCacheMode = "device",
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> Tensor

Encode 2D integer coordinates to Hilbert indices.

This function provides a PyTorch equivalent of hilbert_encode_2d. It accepts integer torch.Tensor inputs of arbitrary shape on any device and dispatches to a backend-specific implementation based on the device and backend settings.
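For intuition about the mapping itself, here is a small pure-Python reference implementation of the classic per-bit rotate-and-accumulate Hilbert encoding. This is illustrative only: the library's kernels are LUT-based, and the curve orientation they produce may differ from this variant.

```python
def hilbert_encode_2d_ref(x: int, y: int, nbits: int) -> int:
    # Classic rotate-and-accumulate Hilbert mapping, one bit per axis
    # per iteration. Illustrative only; not the library's LUT kernels,
    # and the curve orientation may differ from the variant shown here.
    d = 0
    s = 1 << (nbits - 1)
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate the quadrant so the next level lines up
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s >>= 1
    return d

# With nbits=1 the four cells of a 2x2 grid are visited in Hilbert order:
# (0,0) -> 0, (0,1) -> 1, (1,1) -> 2, (1,0) -> 3
```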

Parameters:

Name Type Description Default
x Tensor

Integer coordinate tensors to encode.

Must have identical shape and be on the same device.

required
y Tensor

Integer coordinate tensors to encode.

Must have identical shape and be on the same device.

required
nbits int | None

Number of bits per coordinate axis. This defines a coordinate domain of [0, 2**nbits) on each axis. For inputs outside that domain, only the low nbits bits of each coordinate are used.

Must satisfy 1 <= nbits <= 32. If provided, it must also fit within the usable bits of the coordinate dtype.

If None:

  • Array mode: inferred from the coordinate dtype using its usable bit width, capped at 32. For example, uint16 -> 16, int16 -> 15 (sign bit excluded), and uint64/int64 -> 32.
  • Scalar mode: defaults to 32.

For best performance and tighter output dtypes, pass the smallest value that covers the input coordinate range.

None
out Tensor | None

Optional output tensor.

Must have the same shape and device as x and y and an integer dtype wide enough to hold 2 * nbits bits.

None
lut_cache TorchCacheMode

Cache mode for look-up tables (LUTs) used by the Torch/Triton kernels.

  • "device" (default): cache the converted LUT tensors per-device for reuse across calls.
  • "host_only": do not keep a torch-side LUT cache; materialize on demand from the (process-wide) NumPy LUT cache.

This setting is ignored by the CPU Numba path.

'device'
cpu_parallel bool | None

Controls whether the CPU Numba kernel may execute in parallel.

Only applies when dispatching to the CPU Numba backend and the input is not a scalar tensor. If None, a heuristic is used.

None
cpu_backend CPUBackend

CPU backend selection.

  • "auto" (default): use the Numba kernel unless inside torch.compile, in which case the torch backend is used.
  • "numba": always use the Numba kernel. This mode is not torch.compile-friendly.
  • "torch": always use the torch implementation.
'auto'
gpu_backend GPUBackend

GPU (accelerator) backend selection.

  • "auto" (default): on CUDA, use the Triton kernel when available and all tensors are contiguous; otherwise fall back to the Torch kernel. Fallbacks due to non-contiguity or Triton runtime failure emit a UserWarning.
  • "triton": force the Triton kernel. Requires CUDA tensors, Triton availability, and contiguous inputs/outputs; raises on violation or kernel failure.
  • "torch": force the Torch implementation.
'auto'
triton_tuning TritonTuningMode

Triton launch config selection policy.

  • "heuristic" (default): use static launch heuristics.
  • "autotune_bucketed": autotune from a fixed config set and cache by input size bucket.
  • "autotune_exact": autotune from the same config set and cache by exact input size.

Only applies when the Triton backend is used.

'heuristic'
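The low-bits reduction described for nbits above is plain bit masking, sketched here in pure Python (illustrative; the actual reduction happens inside the kernels):

```python
nbits = 4
mask = (1 << nbits) - 1   # coordinate domain is [0, 2**4) = [0, 16)

assert 9 & mask == 9      # in-domain coordinates pass through unchanged
assert 17 & mask == 1     # out of domain: only the low 4 bits are used
assert 16 & mask == 0     # 2**nbits wraps to 0
```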

Returns:

Type Description
Tensor

Hilbert indices.

  • Has the same shape/device as the inputs.
  • If out is provided, returns out.
  • Otherwise, chooses a minimal integer dtype that can represent 2 * nbits bits, preferring unsigned if all inputs are unsigned and a fitting unsigned dtype is available.
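The "minimal integer dtype" rule amounts to picking the smallest standard width that holds 2 * nbits value bits. A sketch of that rule (illustrative, not the library's code; it assumes an unsigned result dtype is available, since a signed dtype needs one extra bit for the sign):

```python
def minimal_bits(total_bits: int) -> int:
    # Smallest standard integer width with at least total_bits value bits.
    for width in (8, 16, 32, 64):
        if total_bits <= width:
            return width
    raise ValueError("more than 64 bits requested")

assert minimal_bits(2 * 4) == 8     # nbits=4  -> an 8-bit unsigned index fits
assert minimal_bits(2 * 16) == 32   # nbits=16 -> a 32-bit index
assert minimal_bits(2 * 32) == 64   # nbits=32 -> a 64-bit index
```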

Raises:

Type Description
TypeError

If a non-integer tensor is provided.

ValueError

If inputs are on different devices, have mismatched shapes, if nbits is invalid or does not fit in the input/output dtypes, or if backend arguments are invalid.

RuntimeError

If gpu_backend='triton' is requested but Triton is unavailable or the Triton kernel fails at runtime.

Notes

When using this function with torch.compile, call precache_compile_luts before compilation. This avoids materialization of LUTs inside the compiled region, which causes graph breaks, extra overhead, and failure with fullgraph=True.

hilbert_decode_2d

hilbert_decode_2d(
    index: Tensor,
    *,
    nbits: int | None = None,
    out_x: Tensor | None = None,
    out_y: Tensor | None = None,
    lut_cache: TorchCacheMode = "device",
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> tuple[Tensor, Tensor]

Decode Hilbert indices to 2D integer coordinates.

This function provides a PyTorch equivalent of hilbert_decode_2d. It accepts integer torch.Tensor inputs of arbitrary shape on any device and dispatches to a backend-specific implementation based on the device and backend settings.
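For intuition, here is a pure-Python reference for the inverse mapping, the classic per-bit decode. As with the encode sketch, this is illustrative only: the library's kernels are LUT-based, and the curve orientation may differ from this variant.

```python
def hilbert_decode_2d_ref(d: int, nbits: int) -> tuple[int, int]:
    # Classic per-bit inverse of the rotate-and-accumulate Hilbert
    # mapping. Illustrative only; not the library's LUT kernels.
    x = y = 0
    t = d
    s = 1
    while s < (1 << nbits):
        rx = 1 & (t >> 1)
        ry = 1 & (t ^ rx)
        if ry == 0:  # undo the quadrant rotation
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t >>= 2
        s <<= 1
    return x, y
```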

Parameters:

Name Type Description Default
index Tensor

Integer Hilbert index tensor to decode.

required
nbits int | None

Number of bits per coordinate axis. This defines a coordinate domain of [0, 2**nbits) on each axis and a Hilbert index range of [0, 2**(2 * nbits)). For indices outside that range, only the low 2 * nbits bits are used.

Must satisfy 1 <= nbits <= 32. If provided, it must also fit within the usable bits of the index dtype.

If None:

  • Inferred from the index dtype as half its usable bit width, capped at 32. For example, uint16 -> 8, uint64 -> 32, and int64 -> 31 (sign bit excluded).

For best performance and tighter output dtypes, pass the smallest value that covers the input index range.

None
out_x Tensor | None

Optional output coordinate tensors. Either provide both or neither.

Each must have the same shape and device as index and an integer dtype wide enough to hold nbits bits.

None
out_y Tensor | None

Optional output coordinate tensors. Either provide both or neither.

Each must have the same shape and device as index and an integer dtype wide enough to hold nbits bits.

None
lut_cache TorchCacheMode

Cache mode for look-up tables (LUTs) used by the Torch/Triton kernels.

  • "device" (default): cache the converted LUT tensors per-device for reuse across calls.
  • "host_only": do not keep a torch-side LUT cache; materialize on demand from the (process-wide) NumPy LUT cache.

This setting is ignored by the CPU Numba path.

'device'
cpu_parallel bool | None

Controls whether the CPU Numba kernel may execute in parallel.

Only applies when dispatching to the CPU Numba backend and the input is not a scalar tensor. If None, a heuristic is used.

None
cpu_backend CPUBackend

CPU backend selection.

  • "auto" (default): use the Numba kernel unless inside torch.compile, in which case the torch backend is used.
  • "numba": always use the Numba kernel. This mode is not torch.compile-friendly.
  • "torch": always use the torch implementation.
'auto'
gpu_backend GPUBackend

GPU (accelerator) backend selection.

  • "auto" (default): on CUDA, use the Triton kernel when available and all tensors are contiguous; otherwise fall back to the Torch kernel. Fallbacks due to non-contiguity or Triton runtime failure emit a UserWarning.
  • "triton": force the Triton kernel. Requires CUDA tensors, Triton availability, and contiguous inputs/outputs; raises on violation or kernel failure.
  • "torch": force the Torch implementation.
'auto'
triton_tuning TritonTuningMode

Triton launch config selection policy.

  • "heuristic" (default): use static launch heuristics.
  • "autotune_bucketed": autotune from a fixed config set and cache by input size bucket.
  • "autotune_exact": autotune from the same config set and cache by exact input size.

Only applies when the Triton backend is used.

'heuristic'
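The documented inference rule for nbits (half the usable bit width of the index dtype, capped at 32) is simple enough to state as code. This is a sketch of the rule as documented, not the library's implementation:

```python
def infer_nbits_2d_decode(usable_index_bits: int) -> int:
    # Half the usable bit width of the index dtype, capped at 32.
    return min(usable_index_bits // 2, 32)

assert infer_nbits_2d_decode(16) == 8    # uint16 -> 8
assert infer_nbits_2d_decode(64) == 32   # uint64 -> 32 (cap)
assert infer_nbits_2d_decode(63) == 31   # int64: 63 usable bits (sign bit excluded) -> 31
```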

Returns:

Type Description
tuple[Tensor, Tensor]

Decoded coordinates (x, y).

  • Each tensor has the same shape/device as index.
  • If out_x and out_y are provided, returns (out_x, out_y).
  • Otherwise, each result uses a minimal integer dtype that can represent nbits bits, preferring unsigned if the input is unsigned and a fitting unsigned dtype is available.

Raises:

Type Description
TypeError

If a non-integer tensor is provided.

ValueError

If nbits is invalid or does not fit in the input/output dtypes, if outputs are inconsistent or have incorrect shapes/devices, or if backend arguments are invalid.

RuntimeError

If gpu_backend='triton' is requested but Triton is unavailable or the Triton kernel fails at runtime.

Notes

When using this function with torch.compile, call precache_compile_luts before compilation. This avoids materialization of LUTs inside the compiled region, which causes graph breaks, extra overhead, and failure with fullgraph=True.

hilbert_encode_3d

hilbert_encode_3d(
    x: Tensor,
    y: Tensor,
    z: Tensor,
    *,
    nbits: int | None = None,
    out: Tensor | None = None,
    lut_cache: TorchCacheMode = "device",
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> Tensor

Encode 3D integer coordinates to Hilbert indices.

This function provides a PyTorch equivalent of hilbert_encode_3d. It accepts integer torch.Tensor inputs of arbitrary shape on any device and dispatches to a backend-specific implementation based on the device and backend settings.

Parameters:

Name Type Description Default
x Tensor

Integer coordinate tensors to encode.

Must have identical shape and be on the same device.

required
y Tensor

Integer coordinate tensors to encode.

Must have identical shape and be on the same device.

required
z Tensor

Integer coordinate tensors to encode.

Must have identical shape and be on the same device.

required
nbits int | None

Number of bits per coordinate axis. This defines a coordinate domain of [0, 2**nbits) on each axis. For inputs outside that domain, only the low nbits bits of each coordinate are used.

Must satisfy 1 <= nbits <= 21. If provided, it must also fit within the usable bits of the coordinate dtype.

If None:

  • Array mode: inferred from the coordinate dtype using its usable bit width, capped at 21. For example, uint16 -> 16, int16 -> 15 (sign bit excluded), and uint64/int64 -> 21.
  • Scalar mode: defaults to 21.

For best performance and tighter output dtypes, pass the smallest value that covers the input coordinate range.

None
out Tensor | None

Optional output tensor.

Must have the same shape and device as x, y, and z and an integer dtype wide enough to hold 3 * nbits bits.

None
lut_cache TorchCacheMode

Cache mode for look-up tables (LUTs) used by the Torch/Triton kernels.

  • "device" (default): cache the converted LUT tensors per-device for reuse across calls.
  • "host_only": do not keep a torch-side LUT cache; materialize on demand from the (process-wide) NumPy LUT cache.

This setting is ignored by the CPU Numba path.

'device'
cpu_parallel bool | None

Controls whether the CPU Numba kernel may execute in parallel.

Only applies when dispatching to the CPU Numba backend and the input is not a scalar tensor. If None, a heuristic is used.

None
cpu_backend CPUBackend

CPU backend selection.

  • "auto" (default): use the Numba kernel unless inside torch.compile, in which case the torch backend is used.
  • "numba": always use the Numba kernel. This mode is not torch.compile-friendly.
  • "torch": always use the torch implementation.
'auto'
gpu_backend GPUBackend

GPU (accelerator) backend selection.

  • "auto" (default): on CUDA, use the Triton kernel when available and all tensors are contiguous; otherwise fall back to the Torch kernel. Fallbacks due to non-contiguity or Triton runtime failure emit a UserWarning.
  • "triton": force the Triton kernel. Requires CUDA tensors, Triton availability, and contiguous inputs/outputs; raises on violation or kernel failure.
  • "torch": force the Torch implementation.
'auto'
triton_tuning TritonTuningMode

Triton launch config selection policy.

  • "heuristic" (default): use static launch heuristics.
  • "autotune_bucketed": autotune from a fixed config set and cache by input size bucket.
  • "autotune_exact": autotune from the same config set and cache by exact input size.

Only applies when the Triton backend is used.

'heuristic'

Returns:

Type Description
Tensor

Hilbert indices.

  • Has the same shape/device as the inputs.
  • If out is provided, returns out.
  • Otherwise, chooses a minimal integer dtype that can represent 3 * nbits bits, preferring unsigned if all inputs are unsigned and a fitting unsigned dtype is available.

Raises:

Type Description
TypeError

If a non-integer tensor is provided.

ValueError

If inputs are on different devices, have mismatched shapes, if nbits is invalid or does not fit in the input/output dtypes, or if backend arguments are invalid.

RuntimeError

If gpu_backend='triton' is requested but Triton is unavailable or the Triton kernel fails at runtime.

Notes

When using this function with torch.compile, call precache_compile_luts before compilation. This avoids materialization of LUTs inside the compiled region, which causes graph breaks, extra overhead, and failure with fullgraph=True.

hilbert_decode_3d

hilbert_decode_3d(
    index: Tensor,
    *,
    nbits: int | None = None,
    out_x: Tensor | None = None,
    out_y: Tensor | None = None,
    out_z: Tensor | None = None,
    lut_cache: TorchCacheMode = "device",
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> tuple[Tensor, Tensor, Tensor]

Decode Hilbert indices to 3D integer coordinates.

This function provides a PyTorch equivalent of hilbert_decode_3d. It accepts integer torch.Tensor inputs of arbitrary shape on any device and dispatches to a backend-specific implementation based on the device and backend settings.

Parameters:

Name Type Description Default
index Tensor

Integer Hilbert index tensor to decode.

required
nbits int | None

Number of bits per coordinate axis. This defines a coordinate domain of [0, 2**nbits) on each axis and a Hilbert index range of [0, 2**(3 * nbits)). For indices outside that range, only the low 3 * nbits bits are used.

Must satisfy 1 <= nbits <= 21. If provided, it must also fit within the usable bits of the index dtype.

If None:

  • Inferred from the index dtype as one third of its usable bit width, capped at 21. For example, uint16 -> 5, uint64 -> 21, and int64 -> 21 (sign bit excluded).

For best performance and tighter output dtypes, pass the smallest value that covers the input index range.

None
out_x Tensor | None

Optional output coordinate tensors. Either provide all three or none.

Each must have the same shape and device as index and an integer dtype wide enough to hold nbits bits.

None
out_y Tensor | None

Optional output coordinate tensors. Either provide all three or none.

Each must have the same shape and device as index and an integer dtype wide enough to hold nbits bits.

None
out_z Tensor | None

Optional output coordinate tensors. Either provide all three or none.

Each must have the same shape and device as index and an integer dtype wide enough to hold nbits bits.

None
lut_cache TorchCacheMode

Cache mode for look-up tables (LUTs) used by the Torch/Triton kernels.

  • "device" (default): cache the converted LUT tensors per-device for reuse across calls.
  • "host_only": do not keep a torch-side LUT cache; materialize on demand from the (process-wide) NumPy LUT cache.

This setting is ignored by the CPU Numba path.

'device'
cpu_parallel bool | None

Controls whether the CPU Numba kernel may execute in parallel.

Only applies when dispatching to the CPU Numba backend and the input is not a scalar tensor. If None, a heuristic is used.

None
cpu_backend CPUBackend

CPU backend selection.

  • "auto" (default): use the Numba kernel unless inside torch.compile, in which case the torch backend is used.
  • "numba": always use the Numba kernel. This mode is not torch.compile-friendly.
  • "torch": always use the torch implementation.
'auto'
gpu_backend GPUBackend

GPU (accelerator) backend selection.

  • "auto" (default): on CUDA, use the Triton kernel when available and all tensors are contiguous; otherwise fall back to the Torch kernel. Fallbacks due to non-contiguity or Triton runtime failure emit a UserWarning.
  • "triton": force the Triton kernel. Requires CUDA tensors, Triton availability, and contiguous inputs/outputs; raises on violation or kernel failure.
  • "torch": force the Torch implementation.
'auto'
triton_tuning TritonTuningMode

Triton launch config selection policy.

  • "heuristic" (default): use static launch heuristics.
  • "autotune_bucketed": autotune from a fixed config set and cache by input size bucket.
  • "autotune_exact": autotune from the same config set and cache by exact input size.

Only applies when the Triton backend is used.

'heuristic'

Returns:

Type Description
tuple[Tensor, Tensor, Tensor]

Decoded coordinates (x, y, z).

  • Each tensor has the same shape/device as index.
  • If out_x, out_y, and out_z are provided, returns (out_x, out_y, out_z).
  • Otherwise, each result uses a minimal integer dtype that can represent nbits bits, preferring unsigned if the input is unsigned and a fitting unsigned dtype is available.

Raises:

Type Description
TypeError

If a non-integer tensor is provided.

ValueError

If nbits is invalid or does not fit in the input/output dtypes, if outputs are inconsistent or have incorrect shapes/devices, or if backend arguments are invalid.

RuntimeError

If gpu_backend='triton' is requested but Triton is unavailable or the Triton kernel fails at runtime.

Notes

When using this function with torch.compile, call precache_compile_luts before compilation. This avoids materialization of LUTs inside the compiled region, which causes graph breaks, extra overhead, and failure with fullgraph=True.

morton_encode_2d

morton_encode_2d(
    x: Tensor,
    y: Tensor,
    *,
    nbits: int | None = None,
    out: Tensor | None = None,
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> Tensor

Encode 2D integer coordinate tensors to Morton (Z-order) indices.

API semantics for parameters, returns, and errors match hilbert_encode_2d, except that Morton kernels do not use lookup tables and therefore do not accept lut_cache.
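Morton encoding is plain bit interleaving, which a short pure-Python sketch makes concrete. This is illustrative only; which axis occupies the even bit positions is an assumption, not something this page specifies.

```python
def morton_encode_2d_ref(x: int, y: int, nbits: int) -> int:
    # Interleave the low nbits bits of x and y: x in the even bit
    # positions, y in the odd (axis-to-position convention assumed).
    idx = 0
    for i in range(nbits):
        idx |= ((x >> i) & 1) << (2 * i)
        idx |= ((y >> i) & 1) << (2 * i + 1)
    return idx

assert morton_encode_2d_ref(0b11, 0b00, 2) == 0b0101  # x bits -> even positions
assert morton_encode_2d_ref(0b00, 0b11, 2) == 0b1010  # y bits -> odd positions
```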

morton_decode_2d

morton_decode_2d(
    index: Tensor,
    *,
    nbits: int | None = None,
    out_x: Tensor | None = None,
    out_y: Tensor | None = None,
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> tuple[Tensor, Tensor]

Decode Morton (Z-order) index tensors to 2D integer coordinates.

API semantics for parameters, returns, and errors match hilbert_decode_2d, except that Morton kernels do not use lookup tables and therefore do not accept lut_cache.
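Morton decoding is the inverse de-interleave, sketched here in pure Python (illustrative; the even/odd axis-to-bit convention is an assumption):

```python
def morton_decode_2d_ref(idx: int, nbits: int) -> tuple[int, int]:
    # De-interleave: even bit positions -> x, odd bit positions -> y
    # (axis-to-position convention assumed, not taken from the library).
    x = y = 0
    for i in range(nbits):
        x |= ((idx >> (2 * i)) & 1) << i
        y |= ((idx >> (2 * i + 1)) & 1) << i
    return x, y

assert morton_decode_2d_ref(0b0101, 2) == (3, 0)
assert morton_decode_2d_ref(0b1010, 2) == (0, 3)
```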

morton_encode_3d

morton_encode_3d(
    x: Tensor,
    y: Tensor,
    z: Tensor,
    *,
    nbits: int | None = None,
    out: Tensor | None = None,
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> Tensor

Encode 3D integer coordinate tensors to Morton (Z-order) indices.

API semantics for parameters, returns, and errors match hilbert_encode_3d, except that Morton kernels do not use lookup tables and therefore do not accept lut_cache.
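The 3D case interleaves three bit streams instead of two; a pure-Python sketch (illustrative; the axis-to-position convention is an assumption):

```python
def morton_encode_3d_ref(x: int, y: int, z: int, nbits: int) -> int:
    # Three-way interleave: bit i of x, y, z lands at positions
    # 3*i, 3*i+1, 3*i+2 respectively (convention assumed).
    idx = 0
    for i in range(nbits):
        idx |= ((x >> i) & 1) << (3 * i)
        idx |= ((y >> i) & 1) << (3 * i + 1)
        idx |= ((z >> i) & 1) << (3 * i + 2)
    return idx

assert morton_encode_3d_ref(1, 0, 0, 1) == 0b001
assert morton_encode_3d_ref(0, 1, 0, 1) == 0b010
assert morton_encode_3d_ref(0, 0, 1, 1) == 0b100
```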

morton_decode_3d

morton_decode_3d(
    index: Tensor,
    *,
    nbits: int | None = None,
    out_x: Tensor | None = None,
    out_y: Tensor | None = None,
    out_z: Tensor | None = None,
    cpu_parallel: bool | None = None,
    cpu_backend: CPUBackend = "auto",
    gpu_backend: GPUBackend = "auto",
    triton_tuning: TritonTuningMode = "heuristic",
) -> tuple[Tensor, Tensor, Tensor]

Decode Morton (Z-order) index tensors to 3D integer coordinates.

API semantics for parameters, returns, and errors match hilbert_decode_3d, except that Morton kernels do not use lookup tables and therefore do not accept lut_cache.

precache_compile_luts

precache_compile_luts(
    device: TorchDeviceLike = None,
    *,
    op: TorchHilbertOp = "all",
) -> None

Pre-cache Torch LUT tensors for use with torch.compile.

When using HilbertSFC Torch functions with torch.compile, call this before compilation to avoid materializing LUT tensors inside the compiled region, which can cause graph breaks, extra overhead, and failure with fullgraph=True.

Parameters:

Name Type Description Default
device TorchDeviceLike

Device for which to cache LUT tensors.

None means CPU.

None
op TorchHilbertOp

Operation used to select which LUT tensors are pre-cached.

  • "all" (default): pre-cache all LUT tensors needed for supported operations.
  • Otherwise: pre-cache only the LUT tensors used by that operation.
'all'
Notes

Pre-caching with this function is generally not useful outside torch.compile: it eagerly materializes LUT tensors that may never be needed, whereas ordinary calls populate the per-device cache on demand.

clear_torch_lut_caches

clear_torch_lut_caches(
    device: TorchDeviceLike = None,
    *,
    op: TorchHilbertOp = "all",
) -> None

Clear Torch-side LUT caches.

Parameters:

Name Type Description Default
device TorchDeviceLike

Device whose cached LUT tensors should be cleared.

If None (default), clears cached LUT tensors for all devices.

None
op TorchHilbertOp

Operation used to filter which cached LUT tensors are cleared.

  • "all" (default): clear all cached LUT tensors.
  • Otherwise: clear only the cached LUT tensors used by that operation (for example, "hilbert_encode_2d").
'all'
Notes

This does not clear the root process-wide LUT cache. Use clear_lut_caches for that.