AICA-Bench

Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She*, Xianrong Yao*, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin†

* Equal contribution. †Corresponding author: zjin@scut.edu.cn

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this, we introduce AICA-Bench, a comprehensive benchmark comprising three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-guided Content Generation (EGCG). We evaluate 23 VLMs, revealing critical gaps: models struggle with intensity calibration and suffer from descriptive shallowness in open-ended tasks. To bridge these gaps, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that integrates visual scaffolding with hierarchical reasoning. Experiments show that GAT effectively corrects intensity errors and significantly enhances descriptive depth, establishing a robust baseline for future affective multimodal research.

Overview

AICA-Bench extends affective evaluation beyond recognition alone by organizing Affective Image Content Analysis into a unified understanding, reasoning, and generation pipeline. The benchmark is built in two stages: first, 8,086 affective images are curated from 9 public datasets through automatic selection and human inspection; second, GPT-4o is used to construct 18,124 structured instructions spanning Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-guided Content Generation (EGCG).


Figure 1. AICA-Bench first filters emotionally clear samples from 9 open-source affective datasets, then automatically generates benchmark instructions across EU, ER, and EGCG with both label-based and open-ended formats.
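The Stage-2 instruction construction described above boils down to a per-image call to GPT-4o with a task-specific prompt. The sketch below is a minimal illustration, assuming the OpenAI Python client; the prompt wording, output fields (question / answer / rationale), and file paths are assumptions for illustration, not the authors' actual curation pipeline.

```python
# Minimal sketch of Stage-2 instruction generation with GPT-4o.
# Prompt text, output schema, and paths are illustrative assumptions.
import base64
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read a local image and return a base64 data URL for the API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def generate_instruction(image_path: str, task: str) -> dict:
    """Ask GPT-4o to draft one benchmark instruction (EU, ER, or EGCG) for an image."""
    prompt = (
        f"Write one {task} benchmark instruction for this image. "
        "Return JSON with fields: question, answer, rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Example: draft one Emotion Reasoning instruction for a curated image.
# instruction = generate_instruction("curated/0001.jpg", "Emotion Reasoning (ER)")
```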

Why AICA-Bench

Compared with prior emotion benchmarks for VLMs and MLLMs, AICA-Bench is the only one in our comparison that jointly covers Emotion Understanding, Emotion Reasoning, and Emotion-guided Content Generation, while also spanning 9 affective datasets and providing 18,124 benchmark instructions with both Basic and CoT prompting.

| Benchmark | Model | Tasks | #Datasets | #AICA Datasets | #Instr. | #Models | Prompt |
|---|---|---|---|---|---|---|---|
| EVE | VLM | EU | 5 | 5 | 8,009 | 7 | B+CoT |
| AffectGPT | MLLM | EU | 9 | 3 | - | 17 | B |
| EEmo-Bench | MLLM | EU | 1 | 1 | 6,773 | 19 | B |
| EmoBench-M | MLLM | EU | 13 | - | 6,226 | 20 | B |
| MOSABench | MLLM | EU | 1 | 1 | 1,000 | 8 | B |
| AICA-Bench (Ours) | VLM | EU, ER, EGCG | 9 | 9 | 18,124 | 23 | B+CoT |

Adapted from Table 1 in the introduction. EU denotes Emotion Understanding, ER denotes Emotion Reasoning, EGCG denotes Emotion-guided Content Generation, and CoT denotes Chain-of-Thought prompting.

Two-stage construction

Stage 1 selects representative affective images and removes emotionally ambiguous or unsafe samples through inspection by trained human annotators. Stage 2 generates task-specific prompts with a consistent structure and scalable coverage.

Task coverage

EU tests expressed and evoked emotion prediction, ER requires grounded causal explanation, and EGCG evaluates whether a model can generate emotionally aligned descriptions conditioned on the source image and target emotion.
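For concreteness, the sketch below shows what one instruction of each type could look like. The field names, file paths, and label set are hypothetical placeholders, not the released benchmark schema.

```python
# Hypothetical examples of the three instruction types; field names and the
# emotion label set are assumptions for illustration, not the actual schema.

eu_instruction = {   # Emotion Understanding: label-based prediction
    "task": "EU",
    "image": "images/00123.jpg",
    "question": "Which emotion does this image most strongly evoke?",
    "options": ["amusement", "awe", "contentment", "excitement",
                "anger", "disgust", "fear", "sadness"],
    "answer": "awe",
}

er_instruction = {   # Emotion Reasoning: open-ended, grounded causal explanation
    "task": "ER",
    "image": "images/00123.jpg",
    "question": "Explain, grounded in visual evidence, why this image evokes awe.",
    "reference": "The vast mountain range and low vantage point convey scale ...",
}

egcg_instruction = {  # Emotion-guided Content Generation: conditioned on a target emotion
    "task": "EGCG",
    "image": "images/00123.jpg",
    "target_emotion": "sadness",
    "question": "Write a description of this scene so that it evokes sadness.",
}
```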

Benchmark

We benchmark 23 open- and closed-source VLMs under a unified zero-shot protocol. EU is scored with classification metrics, while ER and EGCG are evaluated by an AICA-Bench scoring model trained on 10K human-annotated samples to capture emotional alignment, descriptiveness, and causal soundness.
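A minimal sketch of this scoring protocol follows: EU outputs can be scored with standard classification metrics (accuracy and macro-F1 are assumed here; the paper's exact metric set may differ), while ER and EGCG per-sample rubric ratings from the scoring model are averaged into a task score. The rubric dimension names and 0-100 scale are assumptions.

```python
# Minimal sketch of the evaluation protocol; metric choices, rubric dimension
# names, and the 0-100 scale are assumptions, not the benchmark's exact setup.
from sklearn.metrics import accuracy_score, f1_score


def score_eu(predictions: list[str], references: list[str]) -> dict:
    """Score label-based Emotion Understanding outputs with classification metrics."""
    return {
        "accuracy": accuracy_score(references, predictions),
        "macro_f1": f1_score(references, predictions, average="macro"),
    }


def aggregate_open_ended(scores: list[dict]) -> float:
    """Average per-sample rubric scores from the scoring model into a task score."""
    dims = ("alignment", "descriptiveness", "causal_soundness")
    per_sample = [sum(s[d] for d in dims) / len(dims) for s in scores]
    return sum(per_sample) / len(per_sample)
```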

At a glance: 7 closed-source VLMs and 12 open-source VLMs (>6B) are reported below. Best overall: Gemini-2.5-Pro (73.49). Best open-source: Qwen2.5-VL-7B (65.78).

Main quantitative comparison

| Model | EU Basic | EU CoT | EU Avg. | ER Avg. | EGCG Avg. | Overall Avg. (%) |
|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | |
| Gemini-2.5-Pro | 66.97 | 67.57 | 67.27 | 79.08 | 74.13 | 73.49 |
| Qwen-VL-Max | 64.07 | 65.98 | 65.02 | 77.75 | 75.93 | 72.90 |
| ChatGPT-4o | 64.44 | 65.42 | 64.93 | 77.81 | 75.73 | 72.82 |
| Gemini-2.5-Flash | 68.05 | 69.32 | 68.68 | 76.55 | 68.19 | 71.14 |
| ChatGPT-4o-mini | 60.15 | 63.68 | 61.91 | 76.45 | 74.09 | 70.81 |
| Qwen-VL-Plus | 60.04 | 67.81 | 63.92 | 72.39 | 66.86 | 67.73 |
| Gemini-2.0-Flash | 67.16 | 68.98 | 68.07 | 71.05 | 63.93 | 67.68 |
| Open-source Models (Size > 6B) | | | | | | |
| Qwen2.5-VL-7B | 56.43 | 57.25 | 56.84 | 74.50 | 66.00 | 65.78 |
| Ovis2-16B | 54.38 | 54.70 | 54.54 | 68.24 | 71.56 | 64.78 |
| Ovis2-8B | 53.63 | 52.73 | 53.18 | 68.89 | 70.81 | 64.29 |
| InternVL3-14B | 52.91 | 52.04 | 52.47 | 68.27 | 66.50 | 62.41 |
| InternVL3-8B | 52.18 | 52.98 | 52.58 | 67.21 | 67.27 | 62.35 |
| InternVL2.5-8B | 51.89 | 51.03 | 51.46 | 66.48 | 68.86 | 62.27 |
| MiniCPM-O-2.6 | 52.73 | 48.65 | 50.69 | 70.16 | 64.98 | 61.94 |
| Qwen2-VL-7B | 53.52 | 55.19 | 54.36 | 65.23 | 64.76 | 61.45 |
| LLaVA-1.6-13B | 36.78 | 46.82 | 41.80 | 73.57 | 64.51 | 59.96 |
| LLaVA-1.6-7B | 36.58 | 50.22 | 43.40 | 73.81 | 59.58 | 58.93 |
| MiniCPM-V-2.6 | 43.70 | 47.25 | 45.48 | 65.77 | 63.00 | 58.08 |
| LLaVA-OneVision | 54.02 | 53.25 | 53.64 | 63.78 | 54.18 | 57.20 |

Table adapted from the paper's main results section. EU denotes Emotion Understanding, ER denotes Emotion Reasoning, and EGCG denotes Emotion-guided Content Generation.
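The aggregation in this table appears to follow a simple unweighted scheme: EU Avg. is the mean of the EU Basic and EU CoT columns, and Overall Avg. is the mean of the three task averages. This is inferred from the reported numbers rather than stated here; the quick check below reproduces two rows.

```python
# Checking the table's apparent aggregation (inferred from the reported values).
def eu_avg(basic: float, cot: float) -> float:
    """EU average as the mean of Basic and CoT prompting scores."""
    return (basic + cot) / 2


def overall_avg(eu: float, er: float, egcg: float) -> float:
    """Overall average as the mean of the three task averages."""
    return (eu + er + egcg) / 3


# Gemini-2.5-Pro row:
assert abs(eu_avg(66.97, 67.57) - 67.27) < 0.005
assert abs(overall_avg(67.27, 79.08, 74.13) - 73.49) < 0.005

# Qwen2.5-VL-7B row:
assert abs(eu_avg(56.43, 57.25) - 56.84) < 0.005
assert abs(overall_avg(56.84, 74.50, 66.00) - 65.78) < 0.005
```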

Understanding remains the bottleneck

Across nearly all models, EU scores lag far behind ER and EGCG, indicating that affective perception from raw pixels is still weaker than downstream language-heavy reasoning.

Reasoning gap is already narrowing

Top open-source systems remain behind proprietary models overall, but they are much closer on reasoning-heavy ER than on pure visual understanding.

Scaling alone is not enough

Larger parameter counts do not guarantee stronger affective performance, suggesting that alignment quality and visual grounding matter more than raw size.

BibTeX

@misc{she2026aicabenchholisticallyexaminingcapabilities,
  title         = {AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis},
  author        = {Dong She and Xianrong Yao and Liqun Chen and Jinghe Yu and Yang Gao and Zhanpeng Jin},
  year          = {2026},
  eprint        = {2604.05900},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.05900}
}