Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome

Hi all,

I've been working on a pure-Swift port of Google's Gemma 4 text decoder
that plugs into mlx-swift-lm as a sidecar model registration. I'm
sharing it here in case anyone else has hit the same wall I did, and to
get feedback from the MLX team and the community before proposing
anything upstream.

Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core

Why

As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box.
The obvious workaround — reusing the Gemma 3 text implementation with a patched config — fails at weight load because Gemma 4 differs from
Gemma 3 in several structural places. The chat-template path through
swift-jinja 1.x also silently corrupts the prompt, so the model loads
but generates incoherent text.

What's in the package

  • A from-scratch Swift implementation of the Gemma 4 decoder
    (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer)
  • Per-Layer Embedding (PLE) support — the shared embedding table that
    feeds every decoder layer through a gated MLP as a third residual
    stream
  • KV sharing across the back half of the decoder, threaded through the
    forward pass via a donor table with a single global rope offset
  • A custom Gemma4ProportionalRoPE class for the partial-rotation rope
    type that initializeRope doesn't currently recognize
  • A chat-template bypass that builds the prompt as a literal string
    with the correct turn markers and encodes via tokenizer.encode(text:),
    matching Python mlx-lm's apply_chat_template byte-for-byte
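For concreteness, the bypass amounts to something like the sketch below. This is illustrative rather than the exact code from the repo, and it assumes Gemma 4 keeps the `<start_of_turn>`/`<end_of_turn>` markers used by earlier Gemma releases:

```swift
import Foundation

/// Minimal sketch of the chat-template bypass: build the prompt as a
/// literal string with Gemma-family turn markers instead of going
/// through swift-jinja. The marker layout assumes Gemma 4 follows the
/// convention of earlier Gemma models; function and parameter names
/// here are made up for this post.
func buildGemmaPrompt(system: String?, user: String) -> String {
    var prompt = "<bos>"
    // Gemma-style templates fold the system prompt into the first
    // user turn rather than giving it its own role.
    let firstTurn = [system, user].compactMap { $0 }.joined(separator: "\n\n")
    prompt += "<start_of_turn>user\n\(firstTurn)<end_of_turn>\n"
    prompt += "<start_of_turn>model\n"
    return prompt
}

// The resulting string is then encoded directly:
//   let tokens = tokenizer.encode(text: buildGemmaPrompt(system: sys, user: msg))
```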

Measured on iPhone (A-series, 7.4 GB RAM)

Model: mlx-community/gemma-4-e2b-it-4bit

  • Warm load: ~6 s
  • Memory after load: 341–392 MB
  • Time to first token (end-to-end, 333-token system prompt): 2.82 s
  • Generation throughput: 12–14 tok/s

What I'd love feedback on

  1. Is the sidecar registration pattern the right way to extend
    mlx-swift-lm with new model families, or is there a more idiomatic path I missed?
  2. The chat-template bypass works but feels like a workaround. Is the
    right long-term fix in swift-jinja, in the tokenizer, or somewhere
    else entirely?
  3. Anyone running into the same PLE / KV-sharing issues on other
    Gemma-family checkpoints? I'd like to make sure the implementation
    generalizes beyond E2B before tagging a 0.2.0.
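On (3), to make comparing notes easier: the donor-table idea reduces to something like the sketch below. The names and the "back half maps onto the last self-computing layer" policy are illustrative (one simple policy of several), not the exact code from the repo; the real implementation also carries the single global rope offset so shared layers read donor caches at consistent positions.

```swift
/// Illustrative sketch of KV sharing across the back half of the
/// decoder. Layers below the cutover compute and cache their own
/// keys/values; layers at or above it read the cache of a donor layer
/// instead of computing their own.
struct KVSharingPlan {
    let numLayers: Int
    let sharedFraction: Double  // e.g. 0.5 → back half shares

    /// The layer whose KV cache a given layer reads. Front layers are
    /// their own donor; back layers map onto the last self-computing
    /// layer in this simplified policy.
    func donor(for layer: Int) -> Int {
        let firstShared = Int(Double(numLayers) * (1.0 - sharedFraction))
        return layer < firstShared ? layer : firstShared - 1
    }
}
```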

Happy to open a PR against mlx-swift-lm if the maintainers think any
of this belongs upstream. Thanks for reading.

Solid work — 12–14 tok/s on A-series with 4-bit is respectable. 341–392 MB resident on 7.4 GB does leave thin margins though. Have you profiled whether MLX is placing any matmuls on ANE, or is this pure GPU? In my experience with Whisper-scale models the GPU path is more predictable, but ANE helps with battery if the ops map cleanly.
