§ models · storyline

Adds Gemma 4 12B Unified, Sapiens2, and DeepSeek-OCR-2 models

Hugging Face Transformers v5.10.1 adds Gemma 4 12B Unified, Sapiens2, and DeepSeek-OCR-2 models, with Gemma 4 12B Unified notable for its encoder-free multimodal architecture.

Jun 3 · 17:37:41 · primary fetch · 2 sources · updated Jun 3 · 18:04:42

Release v5.10.1

v5.10.0 was yanked because we published from a corrupted branch. Sorry everyone, this happens when we rush a release!!!

New Model additions

Gemma4 unified + Gemma4 MTP

Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.

Key differences from standard Gemma 4:

No Vision Tower: Raw pixel patches are projected directly into LM space via a `Dense + LayerNorm` pipeline with factorized 2D positional embeddings, replacing the vision encoder.

No Audio Tower: Raw 16 kHz waveform samples are chunked into fixed-length frames and projected through a simple `RMSNorm → Linear` pipeline, replacing the mel spectrogram + Conformer encoder.

Shared Multimodal Pipeline: Both vision and audio use the same `Gemma4UnifiedMultimodalEmbedder` (RMSNorm → Linear) for the final projection to text hidden space.

You can find the original Gemma 4 12B Unified checkpoints under the…
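
To make the audio path described above concrete, here is a minimal sketch of "chunk into fixed-length frames, then RMSNorm → Linear" in plain Python. It is an illustration only, not the Transformers implementation: the frame length, hidden size, zero-padding of the final frame, the absence of a learned RMSNorm gain, and every function name below are assumptions made for the example.

```python
import math
import random


def chunk_waveform(samples, frame_len):
    """Split a 1-D waveform into non-overlapping frames.

    The last frame is zero-padded (an assumption of this sketch).
    """
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def rms_norm(x, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def linear(x, weight):
    """Project x (length d_in) with a (d_out, d_in) weight; bias omitted."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def embed_audio(samples, weight, frame_len):
    """One embedding vector per frame: RMSNorm, then a linear projection."""
    return [linear(rms_norm(f), weight) for f in chunk_waveform(samples, frame_len)]


# Toy sizes, chosen for illustration only.
SAMPLE_RATE = 16_000   # 16 kHz, as stated in the release notes
FRAME_LEN = 320        # hypothetical frame length (20 ms at 16 kHz)
HIDDEN = 8             # hypothetical text hidden size

rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.02) for _ in range(FRAME_LEN)] for _ in range(HIDDEN)]

# 1000 samples of a 440 Hz tone -> 4 frames (the last one zero-padded).
waveform = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(1000)]
tokens = embed_audio(waveform, weight, FRAME_LEN)
print(len(tokens), len(tokens[0]))  # 4 8
```

Each frame becomes one vector in the text hidden space, which is what lets the language model consume audio without a separate spectrogram or Conformer stage.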

read full article on github.com ↗

§ sources · 2 publications · timeline below

github.com · Transformers: Release v5.10.1 · primary · 17:37:41
blog.google · Gemma 4 12B: A unified, encoder-free multimodal model · 18:04:42

§ how this story moved

17:37:41 · primary · transformers Releases publishes the launch post.
18:04:42 · HN Algolia · Front-Page AI picks up coverage.