Environment
- Platform: Apple Silicon Mac
- Host: Apple M4 Max
- OS: macOS
- Compiler: Homebrew clang 18.1.8
- BitNet / submodule state: BitNet using vendored 3rdparty/llama.cpp at Eddie-Wang1120/llama.cpp commit 1f86f058de0c3f4098dedae2ae8653c335c868a1
- Model: microsoft/BitNet-b1.58-2B-4T-gguf / ggml-model-i2_s.gguf
- Build flags:
GGML_METAL=ON
GGML_ACCELERATE=ON
GGML_BLAS=ON
GGML_BLAS_VENDOR=Apple
BITNET_ARM_TL1=OFF
Problem
On Apple Silicon with Metal enabled, i2_s inference can segfault when BLAS is enabled and the physical micro-batch crosses the BLAS routing threshold.
The crash is tied to physical ubatch, not logical batch:
-b 2048 -ub 31 -> stable
-b 32 -ub 31 -> stable
-b 2048 -ub 32 -> segfault
-b 2048 -ub 512 -> segfault
The failure therefore begins at ubatch = 32, the same point at which the BLAS backend begins claiming the generic MUL_MAT path for larger batches.
Control Experiment
The same Metal runtime is stable when BLAS is disabled:
- BLAS ON + -b 2048 -ub 512 -> segfault
- BLAS OFF + -b 2048 -ub 512 -> stable
This strongly suggests the crash is in the BLAS-side handling of GGML_TYPE_I2_S, not in Metal itself and not in the outer chat request schema.
Root Cause
ggml-blas.cpp allows the generic BLAS MUL_MAT path to accept quantized source tensors when ggml_get_type_traits(src0->type)->to_float != NULL.
For GGML_TYPE_I2_S, that is not safe:
- I2_S stores an external scale outside the per-row payload
- the generic BLAS dequantize-to-float path assumes self-contained per-row data
- once ubatch >= 32, BLAS starts claiming MUL_MAT
- that eventually crashes in the i2_s dequant / BLAS matmul path
In crash reports, the top frames consistently land in:
dequantize_row_i2_s
ggml_backend_blas_mul_mat
Proposed Fix
Reject GGML_TYPE_I2_S in the generic BLAS MUL_MAT support check so that I2_S continues using its specialized non-BLAS path:
return src0->type != GGML_TYPE_I2_S &&
ggml_is_contiguous(src0) &&
ggml_is_contiguous(src1) &&
src1->type == GGML_TYPE_F32 &&
(ne0 >= min_batch && ne1 >= min_batch && ne10 >= min_batch) &&
(src0->type == GGML_TYPE_F32 || ggml_get_type_traits(src0->type)->to_float != NULL);
Result After Patch
After applying the BLAS guard above:
- BLAS ON + Metal + -b 2048 -ub 512 is stable
- managed broker end-to-end requests no longer segfault under the same settings
This does not solve all i2_s quality issues, but it does remove the native crash path.
Related Issues