Technical Blog

Beyond 4-bit: How BitNet.cpp and 1.58-bit Ternary LLMs are Making GPUs Obsolete for Inference

llm, bitnet, inference, optimization, cpu, c++

Multiply-accumulate limits in FP Transformers, why PTQ keeps you memory-bandwidth bound, the ternary BitNet b1.58 formulation with AbsMean activations, and how BitNet.cpp maps inference to SIMD-heavy CPU kernels.

The multiply–accumulate wall in dense Transformers

A standard Transformer block is dominated by batched matrix–matrix products: projections for self-attention (Q, K, V, output), feed-forward up/down projections, and (in MoE stacks) expert routing and expert MLPs. Each of these is implemented as a sequence of multiply–accumulate (MAC) operations. For a single output neuron in a fully connected layer,

y_j = \sum_{i=1}^{d} W_{ji} x_i,

which is one fused MAC per input dimension when the hardware maps the inner product to SIMD lanes.
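As a concrete baseline, the dense layer above is nothing but this MAC nest; a minimal scalar sketch (the function name and row-major layout are illustrative choices):

```cpp
#include <cstddef>
#include <vector>

// Naive fully connected layer y = W x: one fused multiply-accumulate per
// (output, input) pair. W is row-major with shape [out_dim][in_dim].
std::vector<float> linear(const std::vector<float>& W,
                          const std::vector<float>& x,
                          std::size_t out_dim, std::size_t in_dim) {
    std::vector<float> y(out_dim, 0.0f);
    for (std::size_t j = 0; j < out_dim; ++j)
        for (std::size_t i = 0; i < in_dim; ++i)
            y[j] += W[j * in_dim + i] * x[i]; // one MAC
    return y;
}
```

Every optimization discussed below is ultimately a transformation of this inner loop.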

Two scaling laws bite simultaneously:

  1. Arithmetic intensity vs. memory bandwidth. For large d, the dominant cost is often reading W and x from DRAM (or across the GPU’s memory hierarchy) at a rate bounded by memory bandwidth, not by peak FLOPS. Doubling model width roughly doubles both parameter traffic and activation traffic per token for the affected layers.

  2. Parameter count vs. footprint. Linear growth in parameter count implies linear growth in stored weight bytes at fixed precision. FP16/BF16 weights consume 2 bytes per element; scaling width and depth multiplies both capacity (VRAM or host RAM) and bytes moved per forward pass.

GPUs mitigate this with massive parallelism and high-bandwidth memory, but the fundamental coupling remains: each MAC requires fresh operands from memory unless weights stay resident in cache—a condition that fails for large layers at batch sizes typical of single-user inference.
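To see why the roofline bites, it helps to count operations against bytes for one d × d FP16 layer at batch size 1; a back-of-the-envelope sketch (illustrative, not measured):

```cpp
#include <cstdint>

// Back-of-the-envelope arithmetic intensity (FLOPs per byte of DRAM traffic)
// for a d x d linear layer with FP16 weights at batch size 1.
double arithmetic_intensity_fp16(std::uint64_t d) {
    double flops = 2.0 * d * d;   // one multiply + one add per weight
    double bytes = 2.0 * d * d    // weight bytes (2 B per element)
                 + 2.0 * 2.0 * d; // input + output activation bytes
    return flops / bytes;
}
```

At roughly 1 FLOP per byte, such a layer sits far below the ridge point of accelerators whose peak-FLOPS-to-bandwidth ratios run to hundreds of FLOPs per byte, so batch-1 decode is bandwidth-bound regardless of compute throughput.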

Post-training quantization as a partial fix

Uniform post-training quantization (PTQ) to INT8 or INT4 reduces storage and off-chip traffic for weights (and sometimes activations). That is not equivalent to removing the MAC bottleneck:

  • Dequantization in the hot path. Many kernels unpack low-bit packed weights into FP16/BF16 values on the fly to reuse existing GEMM microkernels. The forward pass still executes floating-point MACs after dequantization, so peak throughput remains tied to FP tensor-core or SIMD paths and wide register operands.

  • Mixed-precision epilogues. INT8 × INT8 → INT32 accumulation with FP scaling is closer to “true” integer inference, but per-channel or per-tensor scales, zero-points, and activation quantization inject additional ops and often force widening to FP32 for numerical stability in attention and layer-norm adjacent blocks.

  • Accuracy–throughput trade-offs. Aggressive 4-bit PTQ without retraining frequently requires outlier handling (e.g., mixed bit-width blocks, separate high-precision channels), which fragments the memory layout and prevents a single homogeneous integer kernel from saturating the machine.

Net effect: PTQ narrows the memory side of the roofline but often leaves execution semantically close to dense GEMM on accelerators built for FP16/BF16 throughput.
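The dequantize-in-the-hot-path pattern can be sketched as follows; the packing layout (two 4-bit codes per byte, offset-binary with zero at 8, one per-channel scale) is a common convention used for illustration, not any specific library's format:

```cpp
#include <cstddef>
#include <cstdint>

// INT4 PTQ hot path: weights are stored as 4-bit codes (two per byte),
// dequantized per element, then multiplied in floating point.
// Memory traffic shrinks, but the MAC itself is still an FP MAC.
float dot_int4_dequant(const std::uint8_t* packed, float scale,
                       const float* x, std::size_t d) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        std::uint8_t nib = (i % 2 == 0) ? (packed[i / 2] & 0x0F)
                                        : (packed[i / 2] >> 4);
        float w = (static_cast<int>(nib) - 8) * scale; // dequantize
        acc += w * x[i];                               // floating-point MAC
    }
    return acc;
}
```

The unpack-shift-scale work per element is exactly the overhead that keeps PTQ kernels tied to the FP execution units.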

BitNet b1.58: native ternary weights and AbsMean activations

BitNet b1.58 (Microsoft Research) trains language models whose weights are constrained to a ternary alphabet:

W_{ji} \in \{-1, 0, +1\}.

During training, continuous weights are steered toward this discrete set (via straight-through estimators or similar), so the learned model is natively low-bit rather than a compressed surrogate of an FP teacher.
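The b1.58 report describes the weight quantizer as an absmean round-and-clip; a sketch of that rule (details such as the small epsilon guarding the denominator are elided):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Absmean ternarization: scale by the mean absolute weight, then round and
// clip each element to {-1, 0, +1}. `gamma` is returned for the epilogue.
std::vector<std::int8_t> ternarize(const std::vector<float>& W, float& gamma) {
    gamma = 0.0f;
    for (float w : W) gamma += std::fabs(w);
    gamma /= static_cast<float>(W.size());
    std::vector<std::int8_t> q(W.size());
    for (std::size_t i = 0; i < W.size(); ++i) {
        float r = std::round(W[i] / gamma);
        q[i] = static_cast<std::int8_t>(r < -1.0f ? -1.0f
                                                  : (r > 1.0f ? 1.0f : r));
    }
    return q;
}
```

During training the rounded weights are used in the forward pass while gradients flow to the continuous shadow weights via the straight-through estimator.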

From MAC to add/subtract accumulation

For one output coordinate, the linear map reduces to a signed sum over inputs:

y_j = \sum_{i=1}^{d} W_{ji} x_i, \qquad W_{ji} \in \{-1, 0, +1\}.

No pairwise multiply is required for the discrete part: ternary selection replaces W_{ji} x_i with 0, +x_i, or -x_i. Algebraically this is still a dot product, but micro-architecturally it decomposes into masking, sign flips, and integer/fixed-point accumulation rather than full-width floating-point products.

Block-wise, if activations are scaled to a fixed dynamic range before accumulation (see below), the inner loop becomes load–sign–accumulate, which maps cleanly to integer SIMD pipelines on CPUs that lack the FP throughput density of a datacenter GPU.
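In scalar form, the load–sign–accumulate loop is just a three-way branch on the weight; SIMD kernels replace the branch with masks or sign instructions, but the arithmetic is identical:

```cpp
#include <cstddef>
#include <cstdint>

// Multiply-free ternary dot product: a three-way branch on the weight
// replaces the multiply entirely.
std::int32_t ternary_dot(const std::int8_t* w, const std::int16_t* x,
                         std::size_t d) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < d; ++i) {
        if (w[i] > 0)      acc += x[i]; // +1: add
        else if (w[i] < 0) acc -= x[i]; // -1: subtract
        // 0: contributes nothing, skip
    }
    return acc;
}
```

Note that the accumulator is a narrow int32, not a floating-point register: no FP unit is touched until the final rescale.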

Activation quantization with AbsMean

BitNet-style training pairs ternary weights with quantized activations. A common choice is to define a per-tensor (or per-row) scale from the first-order magnitude of the activation vector using AbsMean:

\gamma = \frac{1}{d} \sum_{i=1}^{d} |x_i| = \mathrm{mean}(|x|).

Activations are then normalized and rounded to a small discrete set (e.g. ternary {-1, 0, +1} in the 1.58-bit construction, whose fractional name reflects the log2(3) ≈ 1.58 bits of entropy per ternary symbol rather than three physical levels):

\tilde{x}_i = \mathrm{round}\left( \frac{x_i}{\gamma} \right), \qquad \hat{x}_i = \gamma \, \tilde{x}_i \quad \text{(STE through } \gamma \text{ as needed).}

Intuitively, γ tracks typical magnitude without being dominated by a single outlier as harshly as a max-based scale; it yields a smooth, data-dependent dynamic range for the integer path while staying cheap to compute (one reduction pass over x).
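A direct transcription of the two formulas above (the AbsMean formulation as written omits clipping; real integer kernels would also clamp the rounded value to the target range):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// AbsMean scale: gamma = mean(|x|), one reduction pass over x.
float absmean_scale(const std::vector<float>& x) {
    float g = 0.0f;
    for (float v : x) g += std::fabs(v);
    return g / static_cast<float>(x.size());
}

// Round-to-nearest quantization against gamma; x_hat[i] = gamma * q[i].
std::vector<std::int8_t> quantize_acts(const std::vector<float>& x,
                                       float gamma) {
    std::vector<std::int8_t> q(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        q[i] = static_cast<std::int8_t>(std::round(x[i] / gamma));
    return q;
}
```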

The composed linear output can be written as

y_j = \sum_i W_{ji} \, \hat{x}_i,

with both factors living in low-cardinality discrete spaces after quantization—this is what enables bit-packing, popcount-style or ternary-table lookups, and narrow accumulators in inference kernels.
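Low-cardinality operands are what make 2-bit packing possible: four ternary weights fit in one byte. The encoding below (00 → 0, 01 → +1, 10 → -1, 11 unused) is one illustrative choice, not BitNet.cpp's actual storage format:

```cpp
#include <cstdint>

// Pack four ternary weights into one byte, 2 bits each.
std::uint8_t pack4(const std::int8_t w[4]) {
    std::uint8_t b = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint8_t code = (w[i] == 0) ? 0u : (w[i] > 0 ? 1u : 2u);
        b |= static_cast<std::uint8_t>(code << (2 * i));
    }
    return b;
}

// Decode the i-th (0..3) ternary weight from a packed byte.
std::int8_t unpack(std::uint8_t byte, int i) {
    std::uint8_t code = (byte >> (2 * i)) & 0x3;
    return code == 0 ? 0 : (code == 1 ? 1 : -1);
}
```

Weight bytes shrink 8× versus FP16 before any kernel work, and the 2-bit codes can also index small lookup tables that fold several weights into one table access.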

Hardware path: BitNet.cpp and SIMD without GEMM semantics

BitNet.cpp is a reference C++ inference stack that does not rely on cuBLAS-style GEMM. Because the inner loop is not a general MAC on FP16/BF16 elements, the workload shifts from GPU-oriented tensor cores to CPU-oriented throughput on wide integer SIMD:

  • x86-64: AVX2 (__m256i) or AVX-512 (__m512i) for 8/16/32-bit lanes; VNNI can accelerate certain INT8 dot patterns, but ternary layouts often use custom packing (2 bits per weight or lookup tables) plus horizontal add trees.

  • ARM: NEON (int8x16_t, int32x4_t) with the same theme: vector loads, ternary expand, widening accumulate.

The absence of a hard dependency on GPU DRAM bandwidth and massive FP throughput changes the deployment envelope: a model that is 10× smaller in memory and compute-bound on integer SIMD can run fully on CPU RAM with competitive tokens/sec for small batch sizes—exactly the regime of local assistants and edge deployments.

Conceptual SIMD inner loop (ternary weights, scaled activations)

Weights are packed so that each SIMD lane can decode a small group of ternary values. Activations may remain int8/int16 after AbsMean scaling and clipping. The following pseudocode sketches horizontal accumulation for one output row; real kernels fuse multiple rows and use unrolled loads:

```cpp
// Conceptual: one output neuron j, hidden dimension d, AVX2-friendly batching.
// ternary_w[k] in {-1,0,+1}; x_q[k] fixed-point activation (e.g. int16).

int32_t acc = 0;
for (size_t k = 0; k < d; k += 16) {
  // Load 16 int16 activations and 16 ternary weights (unpacked from
  // bit-packed storage) as 256-bit vectors.
  __m256i xv = load_x_q(&x_q[k]);
  __m256i wv = load_ternary(&ternary_w[k]); // {-1,0,+1} lanes

  // Map ternary to a signed, multiply-free contribution: contrib = w * x.
  __m256i contrib = mul_ternary_by_x(wv, xv); // implementation-specific

  // Widen int16 lanes to int32 and reduce; real kernels keep a vector
  // accumulator and reduce only once after the loop.
  acc += horizontal_sum_i32(contrib);
}
float y_j = dequantize(acc, scale_w, scale_x);
```

Here mul_ternary_by_x stands in for a blend of masks and adds/subtracts (or table lookups) rather than _mm256_mullo_epi16 on full-width magnitudes; on AVX2 the single instruction _mm256_sign_epi16 (vpsignw) implements exactly this mapping, passing a lane through where the weight is +1, negating it where the weight is -1, and zeroing it where the weight is 0. Horizontal sums can be implemented with shuffle/add idioms or deferred into 64-bit chunk accumulators to reduce port pressure.
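Stripped of intrinsics, the whole pipeline (ternary weights, quantized activations, integer accumulation, scale epilogue) reduces to a few lines; a scalar reference like this is useful for validating SIMD kernels:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference for the full ternary linear path: integer accumulation
// of ternary-weighted activations, then one scale-product epilogue.
float ternary_linear(const std::int8_t* w, const std::int8_t* x_q,
                     std::size_t d, float scale_w, float scale_x) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < d; ++i)
        acc += static_cast<std::int32_t>(w[i]) * x_q[i]; // add/sub/skip in practice
    return static_cast<float>(acc) * scale_w * scale_x;  // dequantize epilogue
}
```

The FP work per output is a single multiply pair at the end, independent of d, which is why the bulk of the kernel lives entirely on the integer SIMD ports.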

Architectural implications

Models that match FP16 baselines on zero-shot benchmarks while cutting weight memory by an order of magnitude change where inference can run without a discrete GPU:

  • Autonomous and embedded agents can colocate policy + world model + tool-use LLM on the same CPU + iGPU envelope, avoiding PCIe transfers and VRAM fragmentation.

  • Privacy-first and air-gapped deployments can standardize on CPU-only nodes with deterministic latency profiles driven by L3 cache residency of hot layers rather than GPU scheduler jitter.

  • Cost structure shifts from $/GB/s of HBM to $/core-hour of AVX-512/NEON throughput—a different point on the cloud/edge pricing curve, and often favorable at batch size 1.

The transition is not “CPUs are faster than GPUs in the abstract”; it is that ternary BitNet inference is algorithmically aligned with integer SIMD and modest memory footprints, whereas FP Transformers remain aligned with GEMM-centric accelerators. When the latter’s memory bandwidth and idle GPU capacity dominate the bill, BitNet.cpp-style stacks are a credible path to production-grade local inference without a GPU in the critical path.