Transformers v5.4.0
Hugging Face releases Transformers v5.4.0, adding VidEoMT for fast video segmentation, UVDoc for document image rectification, and Jina Embeddings v3 for multilingual text embedding.
New Model additions

VidEoMT

Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames and employs a query fusion strategy that combines propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.

Links: Documentation | Paper

Add VidEoMT (#44285) by @NielsRogge in #44285

UVDoc

UVDoc is a machine learning model designed for document image rectification and correction.
The model applies geometric transformations to images to correct distortion, skew, perspective deformation, and similar problems in document images. It supports both single-input and batched inference for processing distorted document images.

Links: Documentation

[Model] Add UVDoc Model Support (#43385) by @XingweiDeng in #43385

Jina Embeddings v3

The Jina-Embeddings-v3 is a…