Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome

Hi all,

I've been working on a pure-Swift port of Google's Gemma 4 text decoder
that plugs into mlx-swift-lm as a sidecar model registration. I'm
sharing it here in case anyone else has hit the same wall I did, and to
get feedback from the MLX team and the community before proposing
anything upstream.

Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core

Why

As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box.
The obvious workaround — reusing the Gemma 3 text implementation with a patched config — fails at weight load because Gemma 4 differs from
Gemma 3 in several structural places. The chat-template path through
swift-jinja 1.x also silently corrupts the prompt, so the model loads
but generates incoherent text.

What's in the package

  • A from-scratch Swift implementation of the Gemma 4 decoder
    (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer)
  • Per-Layer Embedding (PLE) support — the shared embedding table that
    feeds every decoder layer through a gated MLP as a third residual
    stream
  • KV sharing across the back half of the decoder, threaded through the
    forward pass via a donor table with a single global rope offset
  • A custom Gemma4ProportionalRoPE class for the partial-rotation rope
    type that initializeRope doesn't currently recognize
  • A chat-template bypass that builds the prompt as a literal string
    with the correct turn markers and encodes via tokenizer.encode(text:),
    matching Python mlx-lm's apply_chat_template byte-for-byte
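For concreteness, the bypass amounts to something like the sketch below. This is illustrative rather than the exact code from the repo, and it assumes Gemma 4 keeps the `<start_of_turn>`/`<end_of_turn>` markers used by earlier Gemma releases:

```swift
import Foundation

/// Minimal sketch of the chat-template bypass: build the prompt as a
/// literal string with Gemma-family turn markers instead of going
/// through swift-jinja. The marker layout assumes Gemma 4 follows the
/// convention of earlier Gemma models; function and parameter names
/// here are made up for this post.
func buildGemmaPrompt(system: String?, user: String) -> String {
    var prompt = "<bos>"
    // Gemma-style templates fold the system prompt into the first
    // user turn rather than giving it its own role.
    let firstTurn = [system, user].compactMap { $0 }.joined(separator: "\n\n")
    prompt += "<start_of_turn>user\n\(firstTurn)<end_of_turn>\n"
    prompt += "<start_of_turn>model\n"
    return prompt
}

// The resulting string is then encoded directly:
//   let tokens = tokenizer.encode(text: buildGemmaPrompt(system: sys, user: msg))
```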

Measured on iPhone (A-series, 7.4 GB RAM)

Model: mlx-community/gemma-4-e2b-it-4bit

  • Warm load: ~6 s
  • Memory after load: 341–392 MB
  • Time to first token (end-to-end, 333-token system prompt): 2.82 s
  • Generation throughput: 12–14 tok/s

What I'd love feedback on

  1. Is the sidecar registration pattern the right way to extend
    mlx-swift-lm with new model families, or is there a more idiomatic path I missed?
  2. The chat-template bypass works but feels like a workaround. Is the
    right long-term fix in swift-jinja, in the tokenizer, or somewhere
    else entirely?
  3. Anyone running into the same PLE / KV-sharing issues on other
    Gemma-family checkpoints? I'd like to make sure the implementation
    generalizes beyond E2B before tagging a 0.2.0.
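On (3), to make comparing notes easier: the donor-table idea reduces to something like the sketch below. The names and the "back half maps onto the last self-computing layer" policy are illustrative (one simple policy of several), not the exact code from the repo; the real implementation also carries the single global rope offset so shared layers read donor caches at consistent positions.

```swift
/// Illustrative sketch of KV sharing across the back half of the
/// decoder. Layers below the cutover compute and cache their own
/// keys/values; layers at or above it read the cache of a donor layer
/// instead of computing their own.
struct KVSharingPlan {
    let numLayers: Int
    let sharedFraction: Double  // e.g. 0.5 → back half shares

    /// The layer whose KV cache a given layer reads. Front layers are
    /// their own donor; back layers map onto the last self-computing
    /// layer in this simplified policy.
    func donor(for layer: Int) -> Int {
        let firstShared = Int(Double(numLayers) * (1.0 - sharedFraction))
        return layer < firstShared ? layer : firstShared - 1
    }
}
```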

Happy to open a PR against mlx-swift-lm if the maintainers think any
of this belongs upstream. Thanks for reading.

Solid work — 12–14 tok/s on A-series with 4-bit is respectable. 341–392 MB resident on 7.4 GB does leave thin margins though. Have you profiled whether MLX is placing any matmuls on ANE, or is this pure GPU? In my experience with Whisper-scale models the GPU path is more predictable, but ANE helps with battery if the ops map cleanly.
