Adds Gemma 4 12B Unified, Sapiens2, and DeepSeek-OCR-2 models
Hugging Face Transformers v5.10.1 adds Gemma 4 12B Unified, Sapiens2, and DeepSeek-OCR-2 models, with Gemma 4 12B Unified notable for its encoder-free multimodal architecture.
Release v5.10.1

v5.10.0 was yanked as we published from a corrupted branch. Sorry everyone, this happens when we rush a release!!!

New Model additions

Gemma4 unified + Gemma4 MTP

Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.

Key differences from standard Gemma 4:
- No Vision Tower: Raw pixel patches are projected directly into LM space via a `Dense + LayerNorm` pipeline with factorized 2D positional embeddings, replacing the vision encoder.
- No Audio Tower: Raw 16 kHz waveform samples are chunked into fixed-length frames and projected through a simple `RMSNorm → Linear` pipeline, replacing the mel spectrogram + Conformer encoder.
- Shared Multimodal Pipeline: Both vision and audio use the same `Gemma4UnifiedMultimodalEmbedder` (RMSNorm → Linear) for the final projection to text hidden space.

You can find the original Gemma 4 12B Unified checkpoints under the…
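
As a rough illustration of the audio path described above (chunk the raw waveform into fixed-length frames, then apply `RMSNorm → Linear`), here is a minimal pure-Python sketch. It is not the Transformers implementation: the frame length, hidden size, random weights, and all function names are assumptions made for the example, and the RMSNorm shown has no learned gain. Only the 16 kHz sample rate comes from the release notes.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is ~1 (no learned gain here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def linear(x, weight):
    """Project x (length d_in) with weight given as d_out rows of length d_in."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def chunk_waveform(samples, frame_len):
    """Split raw samples into fixed-length frames, dropping any ragged tail."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def embed_audio(samples, weight, frame_len):
    """Apply RMSNorm then Linear to each raw-waveform frame."""
    return [linear(rms_norm(frame), weight)
            for frame in chunk_waveform(samples, frame_len)]


SAMPLE_RATE = 16_000  # stated in the release notes
FRAME_LEN = 320       # assumption: 20 ms frames, chosen only for illustration
HIDDEN = 8            # assumption: toy hidden size, far smaller than a real model

rng = random.Random(0)
# Hypothetical projection weights; a real checkpoint would supply trained values.
weight = [[rng.gauss(0.0, 0.02) for _ in range(FRAME_LEN)] for _ in range(HIDDEN)]

# One second of a 440 Hz tone as stand-in raw audio.
waveform = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

tokens = embed_audio(waveform, weight, FRAME_LEN)
print(len(tokens), len(tokens[0]))  # 50 8
```

With these assumed sizes, one second of audio becomes 50 frame embeddings of width 8, with no spectrogram or encoder stack in between, which is the simplification the release notes describe.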
§ how this story moved
- primary — transformers — Releases publishes the launch post.
- HN Algolia — Front-Page AI picks up coverage.