A Unified Image Captioning Benchmark Navigating Descriptive Richness, Societal Bias, and Preference-Oriented Customization

Anonymous

Abstract

Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed, context-rich descriptions. We introduce LOTUS, a unified leaderboard for evaluating such detailed captions, addressing three main gaps in existing evaluation approaches: the lack of standardized criteria, the absence of bias-aware assessment, and evaluations that disregard user preferences. LOTUS offers a comprehensive evaluation across various aspects, including caption quality (e.g., alignment, descriptiveness), potential risks (e.g., hallucination), and societal biases (e.g., gender bias), and supports preference-oriented evaluation by tailoring criteria to diverse user preferences. Our analysis of state-of-the-art LVLMs with LOTUS uncovers previously overlooked insights: no single model excels across all criteria, and caption detail correlates with bias risks. Preference-oriented evaluation shows that the optimal model choice depends on specific user priorities, which we validate through simulated user scenarios. In addition, our investigation of hallucination mitigation methods suggests they can reduce both hallucinations and gender bias, though their effectiveness varies by language.
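To make the preference-oriented evaluation concrete, the following is a minimal, hypothetical sketch of how per-category scores could be combined with user-specified weights so that the top-ranked captioner changes with the user's priorities. The category labels, weights, scores, and helper function are illustrative assumptions, not the LOTUS implementation.

# Hypothetical sketch of preference-oriented model selection: combine each
# model's per-category scores (higher is better) with weights chosen by the
# user. Category names, weights, and scores below are illustrative only.

def preference_score(category_scores, preference_weights):
    """Weighted average of a model's category scores under a user preference."""
    total_weight = sum(preference_weights.values())
    return sum(
        preference_weights[c] * category_scores[c] for c in preference_weights
    ) / total_weight

# Illustrative per-category scores for two hypothetical captioners.
models = {
    "detail_oriented_model": {"alignment": 0.8, "descriptiveness": 0.9,
                              "low_hallucination": 0.4, "low_bias": 0.5},
    "cautious_model":        {"alignment": 0.7, "descriptiveness": 0.4,
                              "low_hallucination": 0.9, "low_bias": 0.7},
}

# One user values rich descriptions; another prioritizes factuality and fairness.
detail_lover   = {"alignment": 0.2, "descriptiveness": 0.6, "low_hallucination": 0.1, "low_bias": 0.1}
safety_focused = {"alignment": 0.2, "descriptiveness": 0.1, "low_hallucination": 0.5, "low_bias": 0.2}

for name, scores in models.items():
    print(name,
          round(preference_score(scores, detail_lover), 2),
          round(preference_score(scores, safety_focused), 2))
# The best-scoring model flips between the two preference profiles, mirroring
# the finding that optimal model selection depends on user priorities.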

Leaderboard

Unified evaluation of LVLM captioners on LOTUS using CLIPScore (CLIP-S), CapScore (CapS_S, CapS_A), CLIP recall (Recall), noun/verb coverage (Noun, Verb), syntactic and semantic complexity (Syn, Sem), CHAIR_s (CH_s), FaithScore (FS, FS_s), and the presence of NSFW words (Harm). Bold and underlined values indicate the best and second-best results, respectively. All metrics are scaled by 100.

Columns are grouped as Alignment ↑ (CLIP-S, CapS_S, CapS_A, N-avg), Descriptiveness ↑ (Recall, Noun, Verb, N-avg), Complexity ↑ (Syn, Sem, N-avg), and Side effects (CH_s ↓, FS ↑, FS_s ↑, Harm ↓, N-avg ↑).

| Model | CLIP-S | CapS_S | CapS_A | N-avg | Recall | Noun | Verb | N-avg | Syn | Sem | N-avg | CH_s | FS | FS_s | Harm | N-avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 60.8 | 33.0 | 35.9 | 0.19 | 75.3 | 33.0 | 34.7 | 0.22 | 8.0 | 32.6 | 0.38 | 37.8 | 55.0 | 37.6 | 0.31 | 0.18 |
| InstructBLIP | 59.9 | 36.0 | 35.5 | 0.18 | 82.1 | 34.2 | 34.7 | 0.40 | 7.7 | 46.0 | 0.41 | 58.5 | 62.4 | 43.3 | 0.10 | 0.66 |
| LLaVA-1.5 | 60.1 | 38.5 | 45.0 | 0.67 | 80.5 | 32.5 | 31.0 | 0.11 | 7.1 | 39.6 | 0.08 | 49.0 | 65.7 | 41.6 | 0.12 | 0.71 |
| mPLUG-Owl2 | 59.7 | 39.7 | 40.0 | 0.49 | 83.3 | 35.0 | 32.8 | 0.34 | 7.4 | 45.6 | 0.28 | 59.1 | 62.0 | 41.3 | 0.08 | 0.58 |
| Qwen2-VL | 61.8 | 37.3 | 43.2 | 0.82 | 90.4 | 45.9 | 36.9 | 1.00 | 8.3 | 75.7 | 1.00 | 26.8 | 54.2 | 41.7 | 0.28 | 0.46 |
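The N-avg columns are consistent with min-max normalizing each metric across the five models (inverted for ↓ metrics) and averaging within the category. The sketch below reproduces the Alignment N-avg values under that assumption; it is an inference from the table, not the official LOTUS code.

# Sketch of how the N-avg columns appear to be computed: each metric is
# min-max normalized across the evaluated models (inverted for lower-is-better
# metrics), then averaged within its category. This reproduces the table
# values above, but the exact LOTUS procedure may differ.

def min_max_normalize(values, lower_is_better=False):
    """Min-max normalize a list of per-model metric values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if lower_is_better else scaled

def normalized_average(per_metric_values, lower_is_better_flags):
    """Average normalized scores across the metrics of one category.

    per_metric_values: one list of per-model values per metric.
    Returns one normalized-average score per model.
    """
    normalized = [
        min_max_normalize(vals, flag)
        for vals, flag in zip(per_metric_values, lower_is_better_flags)
    ]
    n_models = len(per_metric_values[0])
    return [
        round(sum(col[m] for col in normalized) / len(normalized), 2)
        for m in range(n_models)
    ]

# Alignment metrics from the leaderboard (one value per model, in table order).
alignment = [
    [60.8, 59.9, 60.1, 59.7, 61.8],  # CLIP-S
    [33.0, 36.0, 38.5, 39.7, 37.3],  # CapS_S
    [35.9, 35.5, 45.0, 40.0, 43.2],  # CapS_A
]
print(normalized_average(alignment, [False, False, False]))
# [0.19, 0.18, 0.67, 0.49, 0.82] -- matches the Alignment N-avg column above.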

Bias-aware Evaluation

Bias-aware evaluation of LVLM captioners on LOTUS. Language discrepancy evaluation is not applicable to InstructBLIP because it lacks Japanese support. Bold and underlined values indicate the best and second-best results, respectively. All metrics are scaled by 100.

Columns are grouped as Alignment (CLIP-S, CapS_S, CapS_A), Descriptiveness (Recall, Noun, Verb), Complexity (Syn, Sem), and Side effects (CH_s, FS, FS_s, Harm), followed by the normalized average (N-avg ↑).

Gender bias

| Model | CLIP-S | CapS_S | CapS_A | Recall | Noun | Verb | Syn | Sem | CH_s | FS | FS_s | Harm | N-avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 0.3 | 0.9 | 1.1 | 7.8 | 1.7 | 2.6 | 6.3 | 3.2 | 4.8 | 6.3 | 4.0 | 1.64 | 0.51 |
| InstructBLIP | 0.8 | 2.7 | 1.2 | 8.4 | 1.9 | 3.3 | 1.0 | 0.1 | 6.8 | 3.8 | 5.0 | 0.72 | 0.40 |
| LLaVA-1.5 | 0.7 | 2.2 | 0.7 | 9.5 | 2.2 | 4.1 | 1.5 | 0.2 | 7.6 | 3.8 | 3.7 | 0.39 | 0.46 |
| mPLUG-Owl2 | 0.6 | 2.2 | 1.2 | 9.1 | 2.3 | 3.5 | 1.6 | 0.0 | 7.2 | 3.1 | 5.8 | 0.33 | 0.40 |
| Qwen2-VL | 0.2 | 0.7 | 0.5 | 6.3 | 0.1 | 3.6 | 13.5 | 2.5 | 4.4 | 0.9 | 5.7 | 1.77 | 0.63 |

Skin tone bias

| Model | CLIP-S | CapS_S | CapS_A | Recall | Noun | Verb | Syn | Sem | CH_s | FS | FS_s | Harm | N-avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 0.8 | 1.5 | 0.8 | 4.8 | 0.2 | 2.3 | 19.4 | 0.2 | 2.0 | 0.9 | 0.5 | 0.09 | 0.55 |
| InstructBLIP | 0.5 | 1.4 | 0.2 | 8.4 | 1.9 | 1.1 | 6.8 | 0.1 | 4.0 | 2.4 | 1.1 | 0.09 | 0.51 |
| LLaVA-1.5 | 0.4 | 1.3 | 0.7 | 4.0 | 0.2 | 1.0 | 5.3 | 0.6 | 2.7 | 1.4 | 1.3 | 0.18 | 0.67 |
| mPLUG-Owl2 | 0.6 | 1.9 | 0.5 | 5.1 | 0.8 | 2.2 | 7.6 | 0.4 | 1.7 | 0.1 | 0.4 | 0.00 | 0.67 |
| Qwen2-VL | 0.2 | 1.1 | 1.5 | 2.3 | 0.5 | 1.3 | 14.9 | 2.3 | 2.7 | 3.1 | 1.8 | 0.09 | 0.50 |

Language discrepancy

| Model | CLIP-S | CapS_S | CapS_A | Recall | Noun | Verb | Syn | Sem | CH_s | FS | FS_s | Harm | N-avg ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-4 | 0.8 | 1.5 | 3.9 | 2.3 | 4.3 | 5.2 | 52.2 | 5.0 | 5.4 | 5.6 | 3.4 | 0.10 | 0.40 |
| InstructBLIP | - | - | - | - | - | - | - | - | - | - | - | - | - |
| LLaVA-1.5 | 0.4 | 0.8 | 2.0 | 1.1 | 1.1 | 1.8 | 11.4 | 1.8 | 4.7 | 2.0 | 1.6 | 0.06 | 0.95 |
| mPLUG-Owl2 | 1.4 | 1.6 | 4.9 | 1.5 | 1.1 | 3.7 | 37.5 | 8.4 | 17.0 | 6.3 | 1.3 | 0.02 | 0.57 |
| Qwen2-VL | 0.2 | 3.6 | 6.7 | 1.9 | 3.9 | 3.8 | 90.8 | 26.2 | 6.4 | 7.5 | 2.1 | 0.14 | 0.28 |
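The bias-aware scores above can be read as per-metric discrepancies between subgroups (e.g., perceived gender or skin tone) or between languages. The sketch below only illustrates that reading with a generic helper; the subgroup labels, toy metric, and aggregation are assumptions, not the LOTUS implementation.

# Hedged sketch: compute the gap in a caption metric's mean value between two
# subgroups of images. Larger gaps indicate larger discrepancies. The subgroup
# labels, metric function, and data below are illustrative assumptions only.

from statistics import mean

def subgroup_discrepancy(metric_fn, captions_by_group):
    """Absolute gap in a metric's mean value between two subgroups of captions.

    captions_by_group: dict mapping a group label (e.g., "female", "male")
    to a list of (caption, image) pairs belonging to that group.
    """
    group_means = [
        mean(metric_fn(caption, image) for caption, image in samples)
        for samples in captions_by_group.values()
    ]
    assert len(group_means) == 2, "this sketch compares exactly two subgroups"
    return abs(group_means[0] - group_means[1])

# Usage with a toy metric (word count as a stand-in for descriptiveness).
toy_metric = lambda caption, image: len(caption.split())
gap = subgroup_discrepancy(
    toy_metric,
    {
        "female": [("a woman riding a bike down a tree-lined street", None)],
        "male": [("a man on a bike", None)],
    },
)
print(gap)  # prints 4.0: the two subgroups receive captions of different detail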

BibTeX

@article{XXXX,
  title={A Unified Image Captioning Benchmark Navigating Descriptive Richness, Societal Bias, and Preference-Oriented Customization},
  author={authors},
  journal={arXiv},
  year={2024}
}