Transformers v5.5.0
## Release v5.5.0

### New model additions

#### Gemma 4

Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as in previous Gemma versions. The key differences are a vision processor that can process images of different sizes within a fixed token budget, and a spatial 2D RoPE that encodes vision-specific position information across the height and width axes. You can find all the original Gemma 4 checkpoints under the Gemma 4 release.
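The spatial 2D RoPE mentioned above can be illustrated with a minimal NumPy sketch: half of the channels are rotated using the patch's row (height) position and the other half using its column (width) position. This is an illustrative reimplementation of the general technique, not the library's actual modeling code, and the function names are assumptions:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Standard 1D rotary embedding applied over the last axis of x.
    x: (..., seq, dim) with even dim; pos: (seq,) integer positions."""
    dim = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = pos[:, None] * inv_freq[None, :]               # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: np.ndarray, rows: np.ndarray, cols: np.ndarray) -> np.ndarray:
    """Illustrative 2D RoPE: rotate the first half of the channels with
    row (height) positions and the second half with column (width) positions."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)],
        axis=-1,
    )
```

Because each channel pair is only rotated, the per-token feature norm is preserved, and a patch at position (0, 0) is left unchanged.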
Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 keeps the image's natural aspect ratio while resizing it. There are a couple of constraints to follow:

- The total number of pixels must fit within a patch budget.
- Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3).

> [!IMPORTANT]
> Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range)…
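The two constraints can be sketched as a small resizing helper. This is an illustrative sketch, not the library's actual image processor: the function name and the default patch budget of 256 are assumptions made for the example.

```python
import math

PATCH_SIZE = 16
POOL_KERNEL = 3
MULTIPLE = PATCH_SIZE * POOL_KERNEL  # 48: both sides must be divisible by this

def fit_image_size(height: int, width: int, max_patches: int = 256) -> tuple[int, int]:
    """Pick a (height, width) that roughly keeps the aspect ratio, fits the
    pixel budget implied by `max_patches`, and is divisible by 48 on each side."""
    # Each pooled patch covers a MULTIPLE x MULTIPLE pixel block.
    budget = max_patches * MULTIPLE * MULTIPLE
    # Scale down (never up) so the total pixel count fits the budget.
    scale = min(1.0, math.sqrt(budget / (height * width)))
    # Round each side down to the nearest multiple of 48, keeping at least one block.
    new_h = max(MULTIPLE, int(height * scale) // MULTIPLE * MULTIPLE)
    new_w = max(MULTIPLE, int(width * scale) // MULTIPLE * MULTIPLE)
    return new_h, new_w
```

For example, a 3000×4000 photo scales down until its pixel count fits the budget, then each side snaps down to a multiple of 48, so the 3:4 aspect ratio is only slightly perturbed rather than forced into a square.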