Enables Programmatic Dependent Launch for better performance
Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (#22522)

* Adds initial PDL setup.
* Adds PDL barriers based on a simple heuristic: place "sync" before the first input pointer access, and "launch" after the last write, e.g. to tensors like dst.
* Further optimization pass of the first half of kernels
* Optimized PDL barriers for the second batch of kernels
* Further refinements after rebase.
* Moves PDL logic to separate function, removes some whitespace
* Strips post-hoc PDL logic
* Adds stream capture PDL setup.
* Enrolls quantize_q8_1 to leverage PDL to overlap execution with previous kernels
* Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL
* Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL
* Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility
* Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32
* Enrolls flash_attn_combine_results
* Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect.
* Enrolls flash-attention kernels to PDL
* Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernel args. This fixes PDL…
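For readers unfamiliar with the launch-wrapper idea mentioned above (one entry point that hides the backend launch call and forwards kernel arguments unchanged), here is a minimal host-only sketch. It is not the actual ggml_cuda_kernel_launch: the names kernel_launch, launch_params, scale_kernel and run_scale_example are hypothetical, the "kernel" is an ordinary C++ function, and the use_pdl flag is only carried along to mark where a real backend would decide whether to request PDL.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical launch parameters (assumption, not the llama.cpp struct).
struct launch_params {
    bool use_pdl; // where a real backend would opt in to PDL; unused in this mock
};

// Host-only stand-in for a device kernel: dst[i] = src[i] * scale.
inline void scale_kernel(const float * src, float * dst, float scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * scale;
    }
}

// Hypothetical wrapper. KernelArgs is deduced from the kernel's signature,
// Args from the call site; std::forward passes each argument on with its
// original value category, and the call converts it to the kernel's
// parameter type.
template <typename... KernelArgs, typename... Args>
inline void kernel_launch(void (*kernel)(KernelArgs...), const launch_params & params, Args &&... args) {
    (void) params.use_pdl; // a CUDA backend would set launch attributes here
    kernel(std::forward<Args>(args)...);
}

// Small driver so the sketch can be exercised end to end.
inline std::vector<float> run_scale_example() {
    std::vector<float> src = {1.0f, 2.0f, 3.0f};
    std::vector<float> dst(src.size(), 0.0f);
    const launch_params params = {true};
    // float* converts to const float*, and the int literal 2 converts to float.
    kernel_launch(scale_kernel, params, src.data(), dst.data(), 2, src.size());
    return dst;
}
```

Keeping the wrapper a thin inline template means call sites look the same regardless of backend, which is presumably what makes a hip/musa fallback straightforward; that reading is an assumption based on the commit text, not on the actual implementation.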
- github.com, llama.cpp b9254 (primary)