Import AI 423: Multilingual CLIP; anti-drone tracking; and Huawei kernel design
Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.
Meta makes CLIP multilingual:
…Meta CLIP 2 will help AI systems reason about text and images in hundreds of languages…
Researchers with Meta, Princeton University, and New York University have built Meta CLIP 2, a larger-scale, multilingual version of OpenAI's venerable CLIP model. CLIP, short for Contrastive Language-Image Pretraining, is a way to train a pair of neural nets to understand images and text and to map between them. CLIP is a utility technology used for a vast range of downstream purposes, from image generation to image search and classification.
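For readers unfamiliar with how CLIP works under the hood, here's a minimal, illustrative sketch of the contrastive objective (not Meta's actual code): two encoders embed images and captions into a shared space, and a symmetric cross-entropy loss pulls matching pairs together while pushing mismatched pairs apart.

```python
# Minimal sketch of CLIP-style contrastive training (illustrative, not Meta's code):
# image and text encoders produce embeddings in a shared space, and a symmetric
# cross-entropy loss treats each image's own caption as the positive example.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity between every image and every caption in the batch.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption; every other pairing is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
imgs = torch.randn(8, 512)
txts = torch.randn(8, 512)
print(clip_contrastive_loss(imgs, txts))
```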
The original CLIP was trained to map English text to images. Meta CLIP 2 is a scaled-up version which also maps non-English text to images. Along with releasing the model, Meta has also published a detailed paper laying out "the first recipe training CLIP from scratch on worldwide web-scale image-text pairs".
Scale is all that matters: As usual, the main lesson here is one of scale. Earlier attempts to train CLIP on multiple languages ran into the "curse of multilinguality": performance degraded relative to the English-only original. Meta's diagnosis is that this was a scaling problem rather than a fundamental one: "We empirically show that the curse of multilinguality in CLIP is the consequence of insufficient scaling due to the lack of a proper recipe for worldwide data curation and model training".
To scale the system, Meta had to do three things: 1) it gathered large-scale multilingual metadata across 300+ languages, 2) it built its own curation algorithm to help it curate a representative multilingual dataset to train on, and 3) it figured out the right proportion and ordering of data to use when training its system.
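To make the curation step concrete, here's a rough, illustrative sketch loosely in the spirit of MetaCLIP-style curation (the metadata lists, matching rule, and per-language caps below are assumptions, not the paper's actual recipe): alt-texts are matched against per-language metadata entries, and over-represented entries are capped so a handful of common concepts don't dominate the training set.

```python
# Rough sketch of per-language balanced curation (illustrative assumptions only):
# match each alt-text against a per-language metadata list, then cap how many
# pairs any single entry can contribute, subsampling over-represented "head" entries.
import random
from collections import defaultdict

def curate(pairs, metadata_by_lang, t_by_lang, seed=0):
    """pairs: list of (lang, alt_text, image_url); returns a balanced subset."""
    rng = random.Random(seed)
    matches = defaultdict(list)  # (lang, metadata entry) -> matching pairs
    for lang, alt_text, url in pairs:
        for entry in metadata_by_lang.get(lang, []):
            if entry in alt_text.lower():
                matches[(lang, entry)].append((lang, alt_text, url))

    curated = []
    for (lang, entry), bucket in matches.items():
        cap = t_by_lang.get(lang, 100)  # per-language cap on pairs per entry (assumed value)
        if len(bucket) > cap:
            bucket = rng.sample(bucket, cap)  # subsample the over-represented entries
        curated.extend(bucket)

    # Deduplicate pairs that matched more than one metadata entry.
    seen, unique = set(), []
    for lang, alt_text, url in curated:
        if url not in seen:
            seen.add(url)
            unique.append((lang, alt_text, url))
    return unique

# Toy example with two languages and a tiny metadata list per language.
metadata = {"en": ["cat", "bicycle"], "es": ["gato", "bicicleta"]}
caps = {"en": 2, "es": 2}
data = [("en", "A cat on a bicycle", "img1.jpg"),
        ("en", "Another cat photo", "img2.jpg"),
        ("en", "Cat number three", "img3.jpg"),
        ("es", "Un gato durmiendo", "img4.jpg")]
print(curate(data, metadata, caps))
```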
To get an idea of scale: the English-only Meta CLIP was trained on 12.8B image-text pairs, while Meta CLIP 2 is trained on 29B.
The main training trick was "increasing the global training batch size, which encourages cross-lingual learning, and meanwhile keeping the other training hyperparameters unchanged. We choose a 2.3× scaling of global batch to reflect that English pairs constitute 44% of our training data".
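To see why 2.3× is the chosen number: 0.44 × 2.3 ≈ 1.0, so the number of English pairs per batch stays roughly the same as in English-only training while non-English pairs are layered on top. A quick back-of-the-envelope check (the 32,768 base batch size is OpenAI CLIP's published figure, used here as an assumed English-only baseline):

```python
# Back-of-the-envelope check on the 2.3x batch scaling (illustrative numbers;
# the 32,768 base batch is OpenAI CLIP's published value, assumed as the
# English-only baseline here).
base_batch = 32_768                    # English-only global batch (assumption)
scaled_batch = int(2.3 * base_batch)   # ~75k pairs per global batch
english_share = 0.44                   # English fraction of the 29B-pair dataset
english_per_batch = english_share * scaled_batch
print(scaled_batch, round(english_per_batch))
# ~33k English pairs per batch: roughly the original English-only batch,
# so English sees about as much signal per step while non-English pairs are added.
```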
Best results: Meta CLIP 2 beats its English-only counterpart by 0.8% on zero-shot ImageNet classification and surpasses mSigLIP by 0.7%, and also sets new state-of-the-art scores on multilingual benchmarks like CVQA (57.4%), Babel-ImageNet (50.2%), and XM3600 (64.3%).
Why this matters - multilingual sensors for ...