mstar.model.pi05.kernels.image_normalize

GPU-side image range normalization without CPU–GPU sync.

Replaces the two blocking transfers in _prepare_one():

    img_min = float(images.min())  # GPU → CPU (sync)
    img_max = float(images.max())  # GPU → CPU (sync)
    if img_min >= -1e-4 and img_max <= 1.0 + 1e-4:
        images = images * 2.0 - 1.0

with three GPU kernel launches and zero CPU transfers:

  1. torch.min → GPU scalar tensor (no .item())

  2. torch.max → GPU scalar tensor (no .item())

  3. Triton kernel reads both scalars from device memory and applies x*2-1 if the observed range lies within [0,1] (to a tolerance of 1e-4), identity otherwise.

The Triton kernel avoids materialising a CPU-visible boolean by keeping the “needs_rescale” predicate entirely in registers and using tl.where to select between the two outcomes per element.
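The same idea can be illustrated without Triton. The sketch below is a plain-PyTorch analogue, not the module's kernel: the function name `rescale_if_unit_range` and the `tol` parameter are hypothetical, and it launches more kernels than the three described above. It only shows how the predicate can stay a device-side tensor and be broadcast over every element.

```python
import torch


def rescale_if_unit_range(images: torch.Tensor, tol: float = 1e-4) -> torch.Tensor:
    """Hypothetical PyTorch-only analogue of the sync-free range check."""
    # Reductions return 0-dim tensors on the same device as `images`.
    img_min = images.min()
    img_max = images.max()
    # The predicate is a 0-dim bool tensor; no float() or .item() is called,
    # so nothing here forces a device-to-host copy.
    needs_rescale = (img_min >= -tol) & (img_max <= 1.0 + tol)
    # torch.where broadcasts the 0-dim predicate across all elements,
    # selecting either the rescaled value or the original one.
    return torch.where(needs_rescale, images * 2.0 - 1.0, images)
```

Because the decision is made from the global minimum and maximum, either every element is rescaled or none is.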

Falls back to the original sync-based path on CPU tensors or when Triton is not installed.

Functions

normalize_float_images(images)

Detect and rescale float32 images from [0,1] to [-1,1], sync-free.

mstar.model.pi05.kernels.image_normalize.normalize_float_images(images)

Detect and rescale float32 images from [0,1] to [-1,1], sync-free.

Intended to replace the inline range-check in _prepare_one() which performs two CPU–GPU synchronisations via float(images.min()) and float(images.max()). This function computes those reductions on the GPU and feeds the result directly to a Triton kernel; the values never surface to the CPU.

Parameters:

images (Tensor) – float32 tensor, any shape (typically (num_cameras, 3, H, W)). The sync-free path requires a CUDA tensor; CPU tensors take the sync-based fallback.

Returns:

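float32 tensor, same shape and device. The decision is made once for the whole tensor from its global minimum and maximum: if all values lie in [0,1] (to a tolerance of 1e-4), every element is remapped to [-1,1] via x*2-1; otherwise the tensor is returned with its values unchanged.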

Return type:

Tensor