I'd like to share our most recent model release: a Vision-Language understanding Transformer.
It has 40% fewer parameters than vanilla CLIP yet performs noticeably better on text-to-image retrieval. Its output embeddings are also half the size (256 vs. 512 dimensions), which cuts the cost of storing and searching them.
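To make the embedding-size point concrete, here is a rough sketch of brute-force text-to-image retrieval over pre-computed embeddings; the array names, shapes, and random data below are placeholders for illustration, not our actual API:

    # Sketch: cosine-similarity retrieval over pre-computed 256-dim embeddings.
    # The embeddings here are random stand-ins for whatever the encoders return.
    import numpy as np

    def retrieve(query_embedding: np.ndarray, image_embeddings: np.ndarray, top_k: int = 10):
        # L2-normalize so the dot product equals cosine similarity
        query = query_embedding / np.linalg.norm(query_embedding)
        images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
        scores = images @ query                 # one score per image
        return np.argsort(-scores)[:top_k]      # indices of the best matches

    # At 256 dimensions, 1M float32 image embeddings take ~1 GB (vs. ~2 GB at 512)
    image_embeddings = np.random.rand(1_000_000, 256).astype(np.float32)
    query_embedding = np.random.rand(256).astype(np.float32)
    print(retrieve(query_embedding, image_embeddings))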
Moreover, it supports 21 languages, including widely spoken ones like English, Hindi, Chinese, and Arabic, as well as lower-resource languages like Ukrainian, Hebrew, and Armenian.
We have exported the models to ONNX and CoreML, and we provide PyTorch inference code for CPUs and GPUs, plus PopTorch code for Graphcore IPUs.
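For reference, running an exported encoder through ONNX Runtime could look roughly like this; the file name, input shape, and output layout are assumptions for the sketch, not the exact artifacts we ship:

    # Hypothetical ONNX Runtime inference for an exported image encoder.
    # "image_encoder.onnx" and the 224x224 input size are assumptions.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("image_encoder.onnx", providers=["CPUExecutionProvider"])
    pixels = np.zeros((1, 3, 224, 224), dtype=np.float32)  # a preprocessed image batch
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: pixels})
    embedding = outputs[0]
    print(embedding.shape)  # expected (1, 256) for this model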
Demo: http://usearch-images.com/
Blog: https://www.unum.cloud/blog/2023-08-17-uform-graphcore
Looking forward to your feedback!
It seems CLIP performs better for prompts like "three birds" or "man and woman".