What If We Recaption Billions of Web Images with LLaMA-3 ?

1UC Santa Cruz, 2University of Edinburgh, 3JHU, 4Adobe, 5UT Austin
alt text

Examples of the original caption and our recaption in DataComp-1B, and the word distributions.


Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. We release this dataset publicly at https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B.

Recaption Pipeline

Pipeline Overview. The illustration of our recaptioning pipeline on DataComp-1B. We use LLaMA-3-powered LLaVA to reception images, which enables us to train stronger CLIP models and Text-to-Image Diffusion models.

Recaption Model: LLaVA-LLaMA3-8B

Recaption Model Capabilities. We follow the setup of LLaVA-1.5. to build our captioner model, except that we use LLaMA-3-8B as the language decoder because of its superior performance.

Dataset Statistics

Recaption Data Statistics. Average length distributions of both the original captions and our recaptioned data in DataComp-1B.

Results of CLIP model Trained on Recaption-Datacomp-1B

Recap-CLIP Results. We set the training mixing ratio p of recaptioned data and the original data to 0.8 for recaption-based models. We report zero-shot top-1 accuracy on ImageNet-1K and top-1 recall on COCO and Flickr30K.

Results of Diffusion model Trained on Recaption-Datacomp-1B

Recap-Diffusion Results. Text-to-Image evaluation on COCO-30K results of DiT-BASE/4, trained with different mix ratios on Recap-DataComp-1B. Note for GPT-4V Score, we use a subset of 3K for the evaluation.

Examples of the Outputs of Recap-Diffusion Model

Results Examples. Visual comparison of results from DiT-L/2 and DiT-B/4 trained with the mixed raio p=0.0.

Dataset and Model Zoo

We are pleased to announce the release of our recaptioned datasets, including Recap-DataComp-1B and Recap-COCO-30K, as well as our caption model, LLaVA-1.5-LLaMA3-8B. Stay tuned for the upcoming release of our CLIP and T2I models!


Dataset Num. of Sample url
Recap-DataComp-1B 1.24B https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B
Recap-COCO-30K 30.5K https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K


Model Type url
LLaVA-1.5-LLaMA3-8B Our recaption model https://huggingface.co/tennant/llava-llama-3-8b-hqedit
Recap-CLIP CLIP https://huggingface.co/UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP
Recap-DiT Text2Image Coming Soon...

Poster on CVPR Workshop


This work is partially supported by a gift from Adobe, TPU Research Cloud (TRC) program, Google Cloud Research Credits program, AWS Cloud Credit for Research program, Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.


  title   = {What If We Recaption Billions of Web Images with LLaMA-3?},
  author  = {Li, Xianhang and Tu, Haoqin and Hui, Mude and Wang, Zeyu and Zhao, Bingchen and Xiao, Junfei and Mei, Jieru and Liu, Qing and Zheng, Huangjie and Zhou, Yuyin and Xie, Cihang},
  journal = {arXiv preprint arXiv:2406.08478},
  year    = {2024}