
AI Captions Images Better with Rubric-Guided Reinforcement Learning

March 23, 2026 · 3 min read


Dense image captioning is essential for aligning vision and language in AI systems, enabling applications from pretraining to text-to-image generation, but obtaining high-quality annotations at scale is costly. Traditional methods like supervised distillation from strong vision-language models often produce captions with limited diversity and poor generalization, while reinforcement learning has struggled in open-ended tasks due to the lack of deterministic reward signals. RubiCap, a novel framework, addresses this bottleneck by using large language models to create fine-grained, sample-specific rubrics that guide reinforcement learning, improving caption quality without prohibitive human effort.

RubiCap operates through a multi-step process that begins by assembling a diverse committee of candidate captions for a given image. An LLM rubric writer then analyzes these captions to identify consensus strengths and diagnose deficiencies in the current policy, extracting insights that are converted into explicit evaluation criteria. These criteria enable an LLM judge to perform structured, multi-faceted evaluations, decomposing holistic quality assessment into detailed components and replacing coarse scalar rewards with nuanced feedback. This approach allows the reinforcement learning policy to receive tailored guidance, improving its ability to generate accurate and diverse captions over time.
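The loop above can be sketched in miniature. This is an illustrative assumption, not the authors' implementation: the function names, the rubric format, and the scoring logic are all hypothetical, and the two LLM roles (rubric writer and judge) are stubbed with simple heuristics so the sketch is runnable.

```python
# Hedged sketch of a RubiCap-style reward pipeline. All names and the
# rubric representation are illustrative; real LLM calls are replaced
# with toy heuristics.

def write_rubric(candidate_captions):
    """Stand-in for the LLM rubric writer: derive sample-specific
    criteria from consensus strengths and diagnosed deficiencies."""
    criteria = []
    # Consensus strength: content shared by all candidates should be kept.
    common = set.intersection(*(set(c.split()) for c in candidate_captions))
    if common:
        criteria.append(("mentions_consensus_content", common))
    # Deficiency diagnosis: short candidates suggest under-description.
    if min(len(c.split()) for c in candidate_captions) < 8:
        criteria.append(("sufficient_detail", 8))
    return criteria

def judge(caption, criteria):
    """Stand-in for the LLM judge: score each criterion separately and
    average, replacing a coarse scalar reward with structured feedback."""
    scores = []
    words = caption.split()
    for name, spec in criteria:
        if name == "mentions_consensus_content":
            scores.append(sum(w in words for w in spec) / len(spec))
        elif name == "sufficient_detail":
            scores.append(1.0 if len(words) >= spec else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# One training step's worth of reward signal for a policy rollout.
candidates = [
    "a red bus near a bridge",
    "a red bus parked by a stone bridge",
    "red bus and bridge",
]
rubric = write_rubric(candidates)
reward = judge("a red bus parked near a stone bridge", rubric)
```

In an actual system, the per-criterion scores (not just their mean) could also be surfaced to the policy as fine-grained feedback, which is the structural advantage the article describes over a single scalar reward.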

In extensive benchmarks, RubiCap demonstrates superior performance, achieving the highest win rates on CapArena by outperforming supervised distillation, prior reinforcement learning methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it shows exceptional word efficiency, with the 7B model matching the performance of Qwen2.5-VL-32B-Instruct and the 3B model surpassing its 7B counterpart. Notably, using the compact RubiCap-3B as a captioner produces stronger pretrained vision-language models than those trained on captions from proprietary models, highlighting its effectiveness in enhancing downstream tasks.

The significance of RubiCap lies in its ability to scale high-quality caption generation without relying on expensive human annotations, addressing a critical need in multimodal AI development. By integrating LLM-driven rubrics into reinforcement learning, it overcomes the limitations of previous methods that depended on verifiable domains, making it applicable to open-ended creative tasks. This advancement supports the broader trend in AI research toward improving cross-modal alignment, as seen in related work like MobileCLIP2, which focuses on efficient image-text models, and studies on the role of caption data in pretraining multimodal foundation models.

Despite its achievements, RubiCap has limitations acknowledged in the research, such as its reliance on LLMs for rubric creation and evaluation, which may introduce biases or inconsistencies inherited from the underlying models. The framework's performance is also contingent on the quality and diversity of the initial candidate caption committee, which may limit generalization across varied image types. Future work could explore ways to mitigate these dependencies and further refine the rubric-generation process to enhance robustness and scalability.

Overall, RubiCap represents a methodological breakthrough in dense image captioning, demonstrating how structured feedback from LLMs can elevate reinforcement learning in vision-language tasks. Its benchmark results and efficiency gains underscore the potential for rubric-guided approaches to transform how AI systems generate and evaluate textual descriptions of visual content, paving the way for more advanced multimodal applications.