ollama/x
Jesse Gross 2bbe2405fe mlxrunner: decouple models from attention cache storage layout
Models build their own attention masks and read K/V directly from
the cache's buffers, which ties them to the cache's storage layout.
That blocks multi-sequence batching — right-padded rows need a
query-padding mask composed onto every model — and rules out
variants like paged attention where K/V isn't one contiguous tensor.

Caches now hand back a per-layer KVHistory holding post-update K, V,
and a MaskApplier that merges the cache's storage restrictions into
the model's logical mask. Models describe their mask in logical
terms; SDPA composes model, padding, and applier contributions and
dispatches to the kernel's causal or no-mask fast path when it can.
KVHistory still exposes K, V, and the composed mask for manual
attention paths (e.g. CUDA prefill at head_dim > 128).

Performance for single-sequence inference is unchanged.
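
The mask composition described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the `Mask`, `compose`, and `causal` names and the boolean-slice representation are assumptions for the example; only `KVHistory`, `MaskApplier`, and the idea of merging storage restrictions into a logical mask come from the commit message. Real masks are tensors and the applier would encode right-padding or paged-slot layout.

```go
package main

import "fmt"

// Mask is a query-by-key visibility mask: true means the key is attendable.
// (Hypothetical representation; the runner operates on tensors.)
type Mask [][]bool

// MaskApplier folds the cache's storage restrictions into a model's
// logical mask, as handed back per layer inside a KVHistory.
type MaskApplier func(logical Mask) Mask

// KVHistory sketches the per-layer handle: post-update K and V plus the
// cache's MaskApplier. (Field types are placeholders.)
type KVHistory struct {
	K, V  [][]float32
	Apply MaskApplier
}

// compose ANDs two masks elementwise; a nil mask means "no restriction",
// which is what lets SDPA take a no-mask fast path when possible.
func compose(a, b Mask) Mask {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	out := make(Mask, len(a))
	for i := range a {
		out[i] = make([]bool, len(a[i]))
		for j := range a[i] {
			out[i][j] = a[i][j] && b[i][j]
		}
	}
	return out
}

// causal builds the model's logical causal mask for q new queries
// attending over k total keys (history plus the new tokens).
func causal(q, k int) Mask {
	m := make(Mask, q)
	offset := k - q
	for i := range m {
		m[i] = make([]bool, k)
		for j := 0; j < k; j++ {
			m[i][j] = j <= offset+i
		}
	}
	return m
}

func main() {
	// A cache holding 4 key slots where slot 3 belongs to another
	// sequence's padding: the applier hides it from every query.
	applier := MaskApplier(func(logical Mask) Mask {
		storage := Mask{
			{true, true, true, false},
			{true, true, true, false},
		}
		return compose(logical, storage)
	})

	// Model describes its mask in logical terms (causal); the applier
	// merges in the storage restriction before SDPA runs.
	fmt.Println(applier(causal(2, 4)))
}
```

The point of the split is that the model never sees the storage layout: a paged cache would supply a different `MaskApplier` (and different K/V gathering) without any model change.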
2026-04-27 20:04:46 -07:00
agent x/cmd: enable web search and web fetch with flag (#13690) 2026-01-12 13:59:40 -08:00
cmd Reapply "don't require pulling stubs for cloud models" again (#14608) 2026-03-06 14:27:47 -08:00
create mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
imagegen mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
mlxrunner mlxrunner: decouple models from attention cache storage layout 2026-04-27 20:04:46 -07:00
models mlxrunner: decouple models from attention cache storage layout 2026-04-27 20:04:46 -07:00
safetensors mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
server mlx: Support NVIDIA TensorRT Model Optimizer import (#15566) 2026-04-27 18:28:10 -07:00
tokenizer mlx: quantized embeddings, fast SwiGLU, and runtime fixes (#14884) 2026-03-17 11:21:38 -07:00
tools add ability to disable cloud (#14221) 2026-02-12 15:47:00 -08:00