AICA-Bench

Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She*, Xianrong Yao*, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin†

* Equal contribution. †Corresponding author: zjin@scut.edu.cn

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this, we introduce AICA-Bench, a comprehensive benchmark comprising three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-guided Content Generation (EGCG). We evaluate 23 VLMs, revealing critical gaps: models struggle with intensity calibration and suffer from descriptive shallowness in open-ended tasks. To bridge these gaps, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that integrates visual scaffolding with hierarchical reasoning. Experiments show that GAT effectively corrects intensity errors and significantly enhances descriptive depth, establishing a robust baseline for future affective multimodal research.

Overview

AICA-Bench extends affective evaluation beyond recognition alone by organizing Affective Image Content Analysis into a unified understanding, reasoning, and generation pipeline. The benchmark is built in two stages: first, 8,086 affective images are curated from 9 public datasets through automatic selection and human inspection; second, GPT-4o is used to construct 18,124 structured instructions spanning Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-guided Content Generation (EGCG).


Figure 1. AICA-Bench first filters emotionally clear samples from 9 open-source affective datasets, then automatically generates benchmark instructions across EU, ER, and EGCG with both label-based and open-ended formats.
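The Stage-2 instruction construction described above boils down to a per-image call to GPT-4o with a task-specific prompt. The sketch below is a minimal illustration, assuming the OpenAI Python client; the prompt wording, output fields (question / answer / rationale), and file paths are assumptions for illustration, not the authors' actual curation pipeline.

```python
# Minimal sketch of Stage-2 instruction generation with GPT-4o.
# Prompt text, output schema, and paths are illustrative assumptions.
import base64
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read a local image and return a base64 data URL for the API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def generate_instruction(image_path: str, task: str) -> dict:
    """Ask GPT-4o to draft one benchmark instruction (EU, ER, or EGCG) for an image."""
    prompt = (
        f"Write one {task} benchmark instruction for this image. "
        "Return JSON with fields: question, answer, rationale."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Example: draft one Emotion Reasoning instruction for a curated image.
# instruction = generate_instruction("curated/0001.jpg", "Emotion Reasoning (ER)")
```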

Why AICA-Bench

Compared with prior emotion benchmarks for VLMs and MLLMs, AICA-Bench is the only one in our comparison that jointly covers Emotion Understanding, Emotion Reasoning, and Emotion-guided Content Generation, while also spanning 9 affective datasets and providing 18,124 benchmark instructions with both Basic and CoT prompting.

| Benchmark | Model | Tasks | #Datasets | #AICA Datasets | #Instr. | #Models | Prompt |
|---|---|---|---|---|---|---|---|
| EVE | VLM | EU | 5 | 5 | 8,009 | 7 | B+CoT |
| AffectGPT | MLLM | EU | 9 | 3 | - | 17 | B |
| EEmo-Bench | MLLM | EU | 1 | 1 | 6,773 | 19 | B |
| EmoBench-M | MLLM | EU | 13 | - | 6,226 | 20 | B |
| MOSABench | MLLM | EU | 1 | 1 | 1,000 | 8 | B |
| AICA-Bench (Ours) | VLM | EU, ER, EGCG | 9 | 9 | 18,124 | 23 | B+CoT |

Adapted from Table 1 in the introduction. EU denotes Emotion Understanding, ER denotes Emotion Reasoning, EGCG denotes Emotion-guided Content Generation, and CoT denotes Chain-of-Thought prompting.

Two-stage construction

Stage 1 selects representative affective images and removes emotionally ambiguous or unsafe samples through inspection by trained human annotators. Stage 2 generates task-specific prompts with a consistent structure and scalable coverage.

Task coverage

EU tests expressed and evoked emotion prediction, ER requires grounded causal explanation, and EGCG evaluates whether a model can generate emotionally aligned descriptions conditioned on the source image and target emotion.
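For concreteness, the sketch below shows what one instruction of each type could look like. The field names, file paths, and label set are hypothetical placeholders, not the released benchmark schema.

```python
# Hypothetical examples of the three instruction types; field names and the
# emotion label set are assumptions for illustration, not the actual schema.

eu_instruction = {   # Emotion Understanding: label-based prediction
    "task": "EU",
    "image": "images/00123.jpg",
    "question": "Which emotion does this image most strongly evoke?",
    "options": ["amusement", "awe", "contentment", "excitement",
                "anger", "disgust", "fear", "sadness"],
    "answer": "awe",
}

er_instruction = {   # Emotion Reasoning: open-ended, grounded causal explanation
    "task": "ER",
    "image": "images/00123.jpg",
    "question": "Explain, grounded in visual evidence, why this image evokes awe.",
    "reference": "The vast mountain range and low vantage point convey scale ...",
}

egcg_instruction = {  # Emotion-guided Content Generation: conditioned on a target emotion
    "task": "EGCG",
    "image": "images/00123.jpg",
    "target_emotion": "sadness",
    "question": "Write a description of this scene so that it evokes sadness.",
}
```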

Benchmark

We benchmark 23 open- and closed-source VLMs under a unified zero-shot protocol. EU is scored with classification metrics, while ER and EGCG are evaluated by an AICA-Bench scoring model trained on 10K human-annotated samples to capture emotional alignment, descriptiveness, and causal soundness.
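A minimal sketch of this scoring protocol follows: EU outputs can be scored with standard classification metrics (accuracy and macro-F1 are assumed here; the paper's exact metric set may differ), while ER and EGCG per-sample rubric ratings from the scoring model are averaged into a task score. The rubric dimension names and 0-100 scale are assumptions.

```python
# Minimal sketch of the evaluation protocol; metric choices, rubric dimension
# names, and the 0-100 scale are assumptions, not the benchmark's exact setup.
from sklearn.metrics import accuracy_score, f1_score


def score_eu(predictions: list[str], references: list[str]) -> dict:
    """Score label-based Emotion Understanding outputs with classification metrics."""
    return {
        "accuracy": accuracy_score(references, predictions),
        "macro_f1": f1_score(references, predictions, average="macro"),
    }


def aggregate_open_ended(scores: list[dict]) -> float:
    """Average per-sample rubric scores from the scoring model into a task score."""
    dims = ("alignment", "descriptiveness", "causal_soundness")
    per_sample = [sum(s[d] for d in dims) / len(dims) for s in scores]
    return sum(per_sample) / len(per_sample)
```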

At a glance: 7 closed-source VLMs and 12 open-source VLMs (>6B) are reported below. Best overall: Gemini-2.5-Pro (73.49). Best open-source: Qwen2.5-VL-7B (65.78).

Main quantitative comparison

| Model | EU Basic | EU CoT | EU Avg. | ER Avg. | EGCG Avg. | Overall Avg. (%) |
|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | |
| Gemini-2.5-Pro | 66.97 | 67.57 | 67.27 | 79.08 | 74.13 | 73.49 |
| Qwen-VL-Max | 64.07 | 65.98 | 65.02 | 77.75 | 75.93 | 72.90 |
| ChatGPT-4o | 64.44 | 65.42 | 64.93 | 77.81 | 75.73 | 72.82 |
| Gemini-2.5-Flash | 68.05 | 69.32 | 68.68 | 76.55 | 68.19 | 71.14 |
| ChatGPT-4o-mini | 60.15 | 63.68 | 61.91 | 76.45 | 74.09 | 70.81 |
| Qwen-VL-Plus | 60.04 | 67.81 | 63.92 | 72.39 | 66.86 | 67.73 |
| Gemini-2.0-Flash | 67.16 | 68.98 | 68.07 | 71.05 | 63.93 | 67.68 |
| Open-source Models (Size > 6B) | | | | | | |
| Qwen2.5-VL-7B | 56.43 | 57.25 | 56.84 | 74.50 | 66.00 | 65.78 |
| Ovis2-16B | 54.38 | 54.70 | 54.54 | 68.24 | 71.56 | 64.78 |
| Ovis2-8B | 53.63 | 52.73 | 53.18 | 68.89 | 70.81 | 64.29 |
| InternVL3-14B | 52.91 | 52.04 | 52.47 | 68.27 | 66.50 | 62.41 |
| InternVL3-8B | 52.18 | 52.98 | 52.58 | 67.21 | 67.27 | 62.35 |
| InternVL2.5-8B | 51.89 | 51.03 | 51.46 | 66.48 | 68.86 | 62.27 |
| MiniCPM-O-2.6 | 52.73 | 48.65 | 50.69 | 70.16 | 64.98 | 61.94 |
| Qwen2-VL-7B | 53.52 | 55.19 | 54.36 | 65.23 | 64.76 | 61.45 |
| LLaVA-1.6-13B | 36.78 | 46.82 | 41.80 | 73.57 | 64.51 | 59.96 |
| LLaVA-1.6-7B | 36.58 | 50.22 | 43.40 | 73.81 | 59.58 | 58.93 |
| MiniCPM-V-2.6 | 43.70 | 47.25 | 45.48 | 65.77 | 63.00 | 58.08 |
| LLaVA-OneVision | 54.02 | 53.25 | 53.64 | 63.78 | 54.18 | 57.20 |

Table adapted from the paper's main results section. EU denotes Emotion Understanding, ER denotes Emotion Reasoning, and EGCG denotes Emotion-guided Content Generation.
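The aggregation in this table appears to follow a simple unweighted scheme: EU Avg. is the mean of the EU Basic and EU CoT columns, and Overall Avg. is the mean of the three task averages. This is inferred from the reported numbers rather than stated here; the quick check below reproduces two rows.

```python
# Checking the table's apparent aggregation (inferred from the reported values).
def eu_avg(basic: float, cot: float) -> float:
    """EU average as the mean of Basic and CoT prompting scores."""
    return (basic + cot) / 2


def overall_avg(eu: float, er: float, egcg: float) -> float:
    """Overall average as the mean of the three task averages."""
    return (eu + er + egcg) / 3


# Gemini-2.5-Pro row:
assert abs(eu_avg(66.97, 67.57) - 67.27) < 0.005
assert abs(overall_avg(67.27, 79.08, 74.13) - 73.49) < 0.005

# Qwen2.5-VL-7B row:
assert abs(eu_avg(56.43, 57.25) - 56.84) < 0.005
assert abs(overall_avg(56.84, 74.50, 66.00) - 65.78) < 0.005
```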

Understanding remains the bottleneck

Across nearly all models, EU scores lag far behind ER and EGCG, indicating that affective perception from raw pixels is still weaker than downstream language-heavy reasoning.

Reasoning gap is already narrowing

Top open-source systems remain behind proprietary models overall, but they are much closer on reasoning-heavy ER than on pure visual understanding.

Scaling alone is not enough

Larger parameter counts do not guarantee stronger affective performance, suggesting that alignment quality and visual grounding matter more than raw size.

BibTeX

@misc{she2026aicabenchholisticallyexaminingcapabilities,
  title         = {AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis},
  author        = {Dong She and Xianrong Yao and Liqun Chen and Jinghe Yu and Yang Gao and Zhanpeng Jin},
  year          = {2026},
  eprint        = {2604.05900},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.05900}
}