shipfeed · AI news, curated daily

§ models · storyline

Transformers v5.10.1 adds Gemma 4 12B Unified, Sapiens2, and DeepSeek-OCR-2 models

Hugging Face Transformers v5.10.1 adds Gemma 4 12B Unified, Sapiens2, and DeepSeek-OCR-2 models, with Gemma 4 12B Unified notable for its encoder-free multimodal architecture.

Jun 3 · primary fetch · 2 sources · updated Jun 3

Release v5.10.1

v5.10.0 was yanked as we published it from a corrupted branch. Sorry everyone, this happens when we rush a release!

New model additions: Gemma4 unified + Gemma4 MTP

Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.

Key differences from standard Gemma 4:

No Vision Tower: Raw pixel patches are projected directly into LM space via a `Dense + LayerNorm` pipeline with factorized 2D positional embeddings, replacing the vision encoder.

No Audio Tower: Raw 16 kHz waveform samples are chunked into fixed-length frames and projected through a simple `RMSNorm → Linear` pipeline, replacing the mel spectrogram + Conformer encoder.

Shared Multimodal Pipeline: Both vision and audio use the same `Gemma4UnifiedMultimodalEmbedder` (RMSNorm → Linear) for the final projection to text hidden space.

You can find the original Gemma 4 12B Unified checkpoints under the…
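To make the audio path described above concrete, here is a minimal pure-Python sketch of "chunk into fixed-length frames, then `RMSNorm → Linear`". It is an illustration under stated assumptions, not the Transformers implementation: the frame length, hidden size, and weights are made-up toy values, the function names are hypothetical, the ragged tail of the waveform is simply dropped, and the norm has no learned gain.

```python
import math
import random

FRAME_LEN = 4  # hypothetical samples per frame; the source does not give the real value
HIDDEN = 3     # hypothetical text hidden size, chosen only to keep the demo small


def chunk_waveform(samples, frame_len):
    """Split a 1-D waveform into fixed-length frames (assumption: drop the ragged tail)."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def rms_norm(frame, eps=1e-6):
    """Rescale a frame so its root-mean-square is about 1 (no learned gain in this sketch)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame) + eps)
    return [x / rms for x in frame]


def linear(frame, weight):
    """Project a frame with a (hidden x frame_len) weight matrix, without a bias term."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weight]


def embed_audio(samples, weight, frame_len=FRAME_LEN):
    """Raw samples -> frames -> RMSNorm -> Linear, giving one vector per frame."""
    return [linear(rms_norm(f), weight) for f in chunk_waveform(samples, frame_len)]


# Toy inputs: random weights and a short sine wave stand in for a trained
# projection and real 16 kHz audio.
rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(FRAME_LEN)] for _ in range(HIDDEN)]
waveform = [math.sin(0.3 * t) for t in range(10)]

tokens = embed_audio(waveform, weight)
print(len(tokens), len(tokens[0]))  # 2 frames, each projected to HIDDEN values
```

Each frame becomes one vector in the text hidden space, so audio length maps directly to sequence length without a spectrogram or encoder stage in between.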

read full article on github.com
§ sources · 2 publications · timeline below
  1. github.com · Transformers: Release v5.10.1 · primary
  2. blog.google · Gemma 4 12B: A unified, encoder-free multimodal model

§ how this story moved

  1. primary · transformers (Releases) publishes the launch post.
  2. HN Algolia (Front-Page AI) picks up coverage.