Original Versus Zero-Shot Prompted AI Visuals: Examining Human Feedback to Corporate Sustainability Reporting
Alessandro Bruno; Chintan Bhatt; Federica Ricceri
2025-01-01
Abstract
Aligned with the broad field of Affective Computing research, this work examines the capacity of Multimodal Large Language Models to elicit attentional patterns and cognitive states in viewers as they engage with visuals related to a specific domain: corporate sustainability reporting. Sustainability reports are documents officially released by companies and organisations to communicate actions taken to achieve sustainability goals. They are often lengthy documents, which makes capturing, orienting and retaining stakeholder attention challenging. In this context, visual elements are commonly used not only to complement textual content but also to act as entry points that guide the reader’s initial focus and/or to convey information. This study examines the eye movements of 41 observers during a webcam-based eye-tracking session as they view pictures from sustainability reports and their corresponding AI-generated versions obtained with zero-shot prompting. Both visuals (original and AI-generated) are presented in the form of A/B tests. First, images are sourced from publicly available sustainability reports and captioned using Contrastive Language–Image Pretraining (CLIP), a Vision-Language Model trained on image-text pairs, integrated within GPT-4o. The captions serve as zero-shot prompts provided to DALL-E 3 for generating images from text. Perceptual features recorded include Time to First Fixation, Time to First Gaze, average gaze duration, number of fixations, total gaze count, and the K-coefficient, computed over two time windows: [0–3] and [0–10] seconds. Finally, Wilcoxon signed-rank and paired t-tests are used to assess the statistical significance of similarity and divergence in attention and cognitive dynamics between the two conditions.
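As an illustration of the analysis described in the abstract, the sketch below computes a per-observer K-coefficient from fixation durations and the amplitudes of the saccades that follow them (the standard ambient/focal formulation) and then contrasts the two viewing conditions with a Wilcoxon signed-rank test and a paired t-test via scipy.stats. The array names, sample values, and data layout are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch (not the authors' code): per-observer K-coefficient and a
# paired comparison of the original vs. AI-generated viewing conditions.
# Assumes each observer provides fixation durations d_i and the amplitudes
# of the saccades following each fixation, within a given time window.
import numpy as np
from scipy import stats

def k_coefficient(fix_durations, saccade_amplitudes):
    """Ambient/focal K: mean of z-scored fixation duration minus the
    z-scored amplitude of the saccade that follows it."""
    d = np.asarray(fix_durations, dtype=float)
    a = np.asarray(saccade_amplitudes, dtype=float)
    n = min(len(d), len(a))            # keep only matched fixation/saccade pairs
    d, a = d[:n], a[:n]
    z_d = (d - d.mean()) / d.std(ddof=1)
    z_a = (a - a.mean()) / a.std(ddof=1)
    return float(np.mean(z_d - z_a))   # K > 0 suggests focal, K < 0 ambient viewing

# Hypothetical per-observer metric vectors (e.g., K in the [0-3] s window),
# one value per observer for each condition of the A/B test.
k_original = np.array([0.21, -0.05, 0.34, 0.10, 0.02])
k_generated = np.array([0.15, -0.12, 0.28, 0.05, -0.01])

# Non-parametric and parametric paired tests, as named in the abstract.
w_stat, w_p = stats.wilcoxon(k_original, k_generated)
t_stat, t_p = stats.ttest_rel(k_original, k_generated)
print(f"Wilcoxon signed-rank: W={w_stat:.3f}, p={w_p:.3f}")
print(f"Paired t-test:        t={t_stat:.3f}, p={t_p:.3f}")
```

In the same spirit, each of the other recorded metrics (Time to First Fixation, Time to First Gaze, average gaze duration, fixation count, total gaze count) would be compared pairwise across conditions for each time window.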



