A Unified Image Captioning Benchmark Navigating Descriptive Richness, Societal Bias, and Preference-Oriented Customization

Anonymous¹

¹Anonymous


Abstract

Large Vision-Language Models (LVLMs) have transformed image captioning, shifting the task from concise captions to detailed, context-rich descriptions. We introduce LOTUS, a unified leaderboard for evaluating such detailed captions that addresses three main gaps in existing evaluation approaches: the lack of standardized criteria, the absence of bias-aware assessments, and evaluations that disregard user preferences. LOTUS offers a comprehensive evaluation across caption quality (e.g., alignment, descriptiveness), potential risks (e.g., hallucination), and societal biases (e.g., gender bias), and it supports preference-oriented evaluation by tailoring criteria to diverse user preferences. Our analysis of state-of-the-art LVLMs using LOTUS uncovers previously overlooked insights: no single model excels across all criteria, and correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that the optimal model depends on specific user priorities, which we validate through simulated user scenarios. Additionally, our investigation of hallucination mitigation methods suggests that they can reduce both hallucinations and gender bias, though performance varies by language.
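This page does not spell out how user preferences are turned into a model ranking. As a minimal sketch only, assume a preference can be written as non-negative weights over the four N-avg group scores reported in the leaderboard on this page. The weighting scheme, the function names, and the two example user profiles are hypothetical and are not taken from LOTUS itself. Under that assumption, a weighted average is enough to show the preferred model changing with the user's priorities:

```python
# N-avg values copied from the leaderboard table on this page.
N_AVG = {
    "MiniGPT-4":    {"alignment": 0.19, "descriptiveness": 0.22, "complexity": 0.38, "side_effects": 0.18},
    "InstructBLIP": {"alignment": 0.18, "descriptiveness": 0.40, "complexity": 0.41, "side_effects": 0.66},
    "LLaVA-1.5":    {"alignment": 0.67, "descriptiveness": 0.11, "complexity": 0.08, "side_effects": 0.71},
    "mPLUG-Owl2":   {"alignment": 0.49, "descriptiveness": 0.34, "complexity": 0.28, "side_effects": 0.58},
    "Qwen2-VL":     {"alignment": 0.82, "descriptiveness": 1.00, "complexity": 1.00, "side_effects": 0.46},
}


def preference_score(scores, weights):
    """Weighted average of group scores (hypothetical aggregation rule)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[group] * w for group, w in weights.items()) / total


def best_model(weights):
    """Return the model with the highest weighted score for this profile."""
    return max(N_AVG, key=lambda model: preference_score(N_AVG[model], weights))


# Two hypothetical user profiles; groups that are omitted get zero weight.
DETAIL_ORIENTED = {"descriptiveness": 0.5, "complexity": 0.5}
RISK_AVERSE = {"alignment": 0.2, "side_effects": 0.8}

print(best_model(DETAIL_ORIENTED))
print(best_model(RISK_AVERSE))
```

The two profiles select different models, which is the pattern the abstract describes. The actual LOTUS procedure may differ from this weighting.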

Leaderboard

Unified evaluation of LVLM captioners on LOTUS with CLIPScore (CLIP-S), CapScore (CapS_S, CapS_A), CLIP recall (recall), noun/verb coverage (noun, verb), syntactic and semantic complexities (syn, sem), CHAIR_s (CH_s), FaithScore (FS, FS_s), and existence of NSFW words (harm). Values in bold and underline indicate the best and second-best, respectively. All metrics are scaled by 100.

Model	Alignment ↑				Descriptiveness ↑				Complexity ↑			Side effects
Model	CLIP-S	CapS_S	CapS_A	N-avg	Recall	Noun	Verb	N-avg	Syn	Sem	N-avg	CH_s ↓	FS ↑	FS_s ↑	Harm ↓	N-avg ↑
MiniGPT-4	60.8	33.0	35.9	0.19	75.3	33.0	34.7	0.22	8.0	32.6	0.38	37.8	55.0	37.6	0.31	0.18
InstructBLIP	59.9	36.0	35.5	0.18	82.1	34.2	34.7	0.40	7.7	46.0	0.41	58.5	62.4	43.3	0.10	0.66
LLaVA-1.5	60.1	38.5	45.0	0.67	80.5	32.5	31.0	0.11	7.1	39.6	0.08	49.0	65.7	41.6	0.12	0.71
mPLUG-Owl2	59.7	39.7	40.0	0.49	83.3	35.0	32.8	0.34	7.4	45.6	0.28	59.1	62.0	41.3	0.08	0.58
Qwen2-VL	61.8	37.3	43.2	0.82	90.4	45.9	36.9	1.00	8.3	75.7	1.00	26.8	54.2	41.7	0.28	0.46
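
The caption does not define the N-avg columns. The reported values are consistent with min-max normalizing each metric across the five listed models, inverting the metrics marked ↓, and averaging within each group. The sketch below reproduces the Alignment and Side effects columns under that assumption. The function names are illustrative, and the normalization rule is inferred from the table, not stated by the authors.

```python
def min_max_normalize(values, higher_is_better=True):
    """Scale values to [0, 1] across models; flip when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate column: no spread to rank on
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def n_avg(columns, directions):
    """Average the normalized metrics of one group, model by model."""
    normalized = [min_max_normalize(c, d) for c, d in zip(columns, directions)]
    return [sum(per_model) / len(per_model) for per_model in zip(*normalized)]


MODELS = ["MiniGPT-4", "InstructBLIP", "LLaVA-1.5", "mPLUG-Owl2", "Qwen2-VL"]

# Columns copied from the leaderboard table, in MODELS order.
ALIGNMENT = [
    [60.8, 59.9, 60.1, 59.7, 61.8],  # CLIP-S
    [33.0, 36.0, 38.5, 39.7, 37.3],  # CapS_S
    [35.9, 35.5, 45.0, 40.0, 43.2],  # CapS_A
]
SIDE_EFFECTS = [
    [37.8, 58.5, 49.0, 59.1, 26.8],  # CH_s (lower is better)
    [55.0, 62.4, 65.7, 62.0, 54.2],  # FS
    [37.6, 43.3, 41.6, 41.3, 41.7],  # FS_s
    [0.31, 0.10, 0.12, 0.08, 0.28],  # Harm (lower is better)
]

ALIGNMENT_N_AVG = [round(x, 2) for x in n_avg(ALIGNMENT, [True, True, True])]
SIDE_EFFECTS_N_AVG = [round(x, 2) for x in n_avg(SIDE_EFFECTS, [False, True, True, False])]

for name, a, s in zip(MODELS, ALIGNMENT_N_AVG, SIDE_EFFECTS_N_AVG):
    print(f"{name}: alignment N-avg={a}, side-effects N-avg={s}")
```

Because the normalization is relative to the models in the table, adding or removing a model can change every N-avg value, so these scores are best read as a ranking within this set and not as absolute quality.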

Bias-aware Evaluation

Bias-aware evaluation of LVLM captioners on LOTUS. Language discrepancy evaluation is not applicable to InstructBLIP because it lacks Japanese support. Values in bold and underline indicate the best and second-best, respectively. All metrics are scaled by 100.

Model	Alignment			Descriptiveness			Complexity		Side effects
Model	CLIP-S	CapS_S	CapS_A	Recall	Noun	Verb	Syn	Sem	CH_s	FS	FS_s	Harm	N-avg ↑
Gender bias
MiniGPT-4	0.3	0.9	1.1	7.8	1.7	2.6	6.3	3.2	4.8	6.3	4.0	1.64	0.51
InstructBLIP	0.8	2.7	1.2	8.4	1.9	3.3	1.0	0.1	6.8	3.8	5.0	0.72	0.40
LLaVA-1.5	0.7	2.2	0.7	9.5	2.2	4.1	1.5	0.2	7.6	3.8	3.7	0.39	0.46
mPLUG-Owl2	0.6	2.2	1.2	9.1	2.3	3.5	1.6	0.0	7.2	3.1	5.8	0.33	0.40
Qwen2-VL	0.2	0.7	0.5	6.3	0.1	3.6	13.5	2.5	4.4	0.9	5.7	1.77	0.63
Skin tone bias
MiniGPT-4	0.8	1.5	0.8	4.8	0.2	2.3	19.4	0.2	2.0	0.9	0.5	0.09	0.55
InstructBLIP	0.5	1.4	0.2	8.4	1.9	1.1	6.8	0.1	4.0	2.4	1.1	0.09	0.51
LLaVA-1.5	0.4	1.3	0.7	4.0	0.2	1.0	5.3	0.6	2.7	1.4	1.3	0.18	0.67
mPLUG-Owl2	0.6	1.9	0.5	5.1	0.8	2.2	7.6	0.4	1.7	0.1	0.4	0.00	0.67
Qwen2-VL	0.2	1.1	1.5	2.3	0.5	1.3	14.9	2.3	2.7	3.1	1.8	0.09	0.50
Language discrepancy
MiniGPT-4	0.8	1.5	3.9	2.3	4.3	5.2	52.2	5.0	5.4	5.6	3.4	0.10	0.40
InstructBLIP	-	-	-	-	-	-	-	-	-	-	-	-	-
LLaVA-1.5	0.4	0.8	2.0	1.1	1.1	1.8	11.4	1.8	4.7	2.0	1.6	0.06	0.95
mPLUG-Owl2	1.4	1.6	4.9	1.5	1.1	3.7	37.5	8.4	17.0	6.3	1.3	0.02	0.57
Qwen2-VL	0.2	3.6	6.7	1.9	3.9	3.8	90.8	26.2	6.4	7.5	2.1	0.14	0.28
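
The N-avg column in the bias table is likewise not defined on this page. For the language discrepancy block, the reported values are consistent with treating every column as lower-is-better, min-max normalizing across the models that have values, and averaging over the twelve metrics. The following sketch works under those inferred assumptions. The helper names are illustrative, and skipping InstructBLIP before normalization is a guess that happens to match the table.

```python
# Language discrepancy rows copied from the table above (N-avg column omitted).
ROWS = """\
MiniGPT-4 0.8 1.5 3.9 2.3 4.3 5.2 52.2 5.0 5.4 5.6 3.4 0.10
InstructBLIP - - - - - - - - - - - -
LLaVA-1.5 0.4 0.8 2.0 1.1 1.1 1.8 11.4 1.8 4.7 2.0 1.6 0.06
mPLUG-Owl2 1.4 1.6 4.9 1.5 1.1 3.7 37.5 8.4 17.0 6.3 1.3 0.02
Qwen2-VL 0.2 3.6 6.7 1.9 3.9 3.8 90.8 26.2 6.4 7.5 2.1 0.14
"""


def parse_rows(text):
    """Parse 'model v1 v2 ...' lines, dropping models with missing ('-') cells."""
    table = {}
    for line in text.strip().splitlines():
        name, *cells = line.split()
        if "-" in cells:
            continue  # e.g. InstructBLIP, which has no Japanese results
        table[name] = [float(c) for c in cells]
    return table


def bias_n_avg(table):
    """Average of inverted min-max scores, so a smaller gap gives a higher score."""
    names = list(table)
    columns = list(zip(*table.values()))
    totals = [0.0] * len(names)
    for column in columns:
        lo, hi = min(column), max(column)
        for i, value in enumerate(column):
            totals[i] += (hi - value) / (hi - lo) if hi > lo else 0.0
    return {name: total / len(columns) for name, total in zip(names, totals)}


LANGUAGE_N_AVG = {m: round(s, 2) for m, s in bias_n_avg(parse_rows(ROWS)).items()}
print(LANGUAGE_N_AVG)
```

Under this reading, a model scores close to 1 when its gaps are the smallest in nearly every column, which is why LLaVA-1.5 leads this block despite not having the lowest value everywhere.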

BibTeX

@article{XXXX,
  title={A Unified Image Captioning Benchmark Navigating Descriptive Richness, Societal Bias, and Preference-Oriented Customization},
  author={authors},
  journal={arXiv},
  year={2024}
}