I'd like to share our most recent model release: a Vision-Language understanding Transformer.
It has 40% fewer parameters than vanilla CLIP yet performs noticeably better on text-to-image retrieval. Its output embeddings are also half the size (256 vs. 512 dimensions), which cuts the cost of storing and searching them.
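To make the embedding-size point concrete, here is a rough sketch of brute-force text-to-image retrieval over pre-computed embeddings; the array names, shapes, and random data below are placeholders for illustration, not our actual API:

    # Sketch: cosine-similarity retrieval over pre-computed 256-dim embeddings.
    # The embeddings here are random stand-ins for whatever the encoders return.
    import numpy as np

    def retrieve(query_embedding: np.ndarray, image_embeddings: np.ndarray, top_k: int = 10):
        # L2-normalize so the dot product equals cosine similarity
        query = query_embedding / np.linalg.norm(query_embedding)
        images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
        scores = images @ query                 # one score per image
        return np.argsort(-scores)[:top_k]      # indices of the best matches

    # At 256 dimensions, 1M float32 image embeddings take ~1 GB (vs. ~2 GB at 512)
    image_embeddings = np.random.rand(1_000_000, 256).astype(np.float32)
    query_embedding = np.random.rand(256).astype(np.float32)
    print(retrieve(query_embedding, image_embeddings))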
Moreover, it supports 21 languages, including widely spoken ones like English, Hindi, Chinese, and Arabic, as well as lower-resource languages like Ukrainian, Hebrew, and Armenian.
We have exported the models to ONNX and CoreML, and we provide PyTorch inference code for CPUs and GPUs, plus PopTorch code for Graphcore IPUs.
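For reference, running an exported encoder through ONNX Runtime could look roughly like this; the file name, input shape, and output layout are assumptions for the sketch, not the exact artifacts we ship:

    # Hypothetical ONNX Runtime inference for an exported image encoder.
    # "image_encoder.onnx" and the 224x224 input size are assumptions.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("image_encoder.onnx", providers=["CPUExecutionProvider"])
    pixels = np.zeros((1, 3, 224, 224), dtype=np.float32)  # a preprocessed image batch
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: pixels})
    embedding = outputs[0]
    print(embedding.shape)  # expected (1, 256) for this model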
Demo: http://usearch-images.com/
Blog: https://www.unum.cloud/blog/2023-08-17-uform-graphcore
Looking forward to your feedback!
It seems CLIP performs better for prompts like "three birds" or "man and woman".