Benchmarking Visual Language Models on Standardized Visualization Literacy Tests

dc.contributor.author: Pandey, Saugat
dc.contributor.author: Ottley, Alvitta
dc.contributor.editor: Aigner, Wolfgang
dc.contributor.editor: Andrienko, Natalia
dc.contributor.editor: Wang, Bei
dc.date.accessioned: 2025-05-26T06:38:48Z
dc.date.available: 2025-05-26T06:38:48Z
dc.date.issued: 2025
dc.description.abstract: The increasing integration of Visual Language Models (VLMs) into visualization systems demands a comprehensive understanding of their visual interpretation capabilities and constraints. While existing research has examined individual models, systematic comparisons of VLMs' visualization literacy remain unexplored. We bridge this gap through a rigorous, first-of-its-kind evaluation of four leading VLMs (GPT-4, Claude, Gemini, and Llama) using standardized assessments: the Visualization Literacy Assessment Test (VLAT) and Critical Thinking Assessment for Literacy in Visualizations (CALVI). Our methodology uniquely combines randomized trials with structured prompting techniques to control for order effects and response variability - a critical consideration overlooked in many VLM evaluations. Our analysis reveals that while specific models demonstrate competence in basic chart interpretation (Claude achieving 67.9% accuracy on VLAT), all models exhibit substantial difficulties in identifying misleading visualization elements (maximum 30.0% accuracy on CALVI). We uncover distinct performance patterns: strong capabilities in interpreting conventional charts like line charts (76-96% accuracy) and detecting hierarchical structures (80-100% accuracy), but consistent difficulties with data-dense visualizations involving multiple encodings (bubble charts: 18.6-61.4%) and anomaly detection (25-30% accuracy). Significantly, we observe distinct uncertainty management behavior across models, with Gemini displaying heightened caution (22.5% question omission) compared to others (7-8%). These findings provide crucial insights for the visualization community by establishing reliable VLM evaluation benchmarks, identifying areas where current models fall short, and highlighting the need for targeted improvements in VLM architectures for visualization tasks. To promote reproducibility, encourage further research, and facilitate benchmarking of future VLMs, our complete evaluation framework, including code, prompts, and analysis scripts, is available at https://github.com/washuvis/VisLit-VLM-Eval.
dc.description.sectionheaders: AI-Enhanced Visualization
dc.description.seriesinformation: Computer Graphics Forum
dc.identifier.doi: 10.1111/cgf.70137
dc.identifier.issn: 1467-8659
dc.identifier.pages: 12 pages
dc.identifier.uri: https://doi.org/10.1111/cgf.70137
dc.identifier.uri: https://diglib.eg.org/handle/10.1111/cgf70137
dc.publisher: The Eurographics Association and John Wiley & Sons Ltd.
dc.rights: Attribution 4.0 International License
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: CCS Concepts: Human-centered computing → Information visualization
dc.title: Benchmarking Visual Language Models on Standardized Visualization Literacy Tests
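The abstract describes an evaluation protocol that combines randomized trials with structured prompting to control for order effects and response variability. The sketch below is a minimal illustration of what one such trial loop could look like; the `query_vlm` stub, the item record format, and the prompt wording are assumptions made for illustration only. The authors' actual framework, prompts, and analysis scripts are available at https://github.com/washuvis/VisLit-VLM-Eval.

```python
import json
import random


def query_vlm(image_path: str, prompt: str) -> str:
    """Stand-in for a call to a specific VLM API (e.g., GPT-4, Claude, Gemini, or Llama).
    Replace with the provider SDK of your choice; this placeholder simply omits."""
    return "omit"


def run_trial(items, seed):
    """Run one randomized multiple-choice trial: shuffle item order to control for
    order effects, then ask each question with a structured prompt that forces a
    single option or an explicit 'omit'."""
    rng = random.Random(seed)
    order = rng.sample(range(len(items)), k=len(items))
    answers = []
    for idx in order:
        item = items[idx]
        prompt = (
            "You are shown a chart. Answer the multiple-choice question below.\n"
            f"Question: {item['question']}\n"
            f"Options: {', '.join(item['options'])}\n"
            "Reply with exactly one option, or 'omit' if you cannot tell."
        )
        answers.append({"item": item["id"], "response": query_vlm(item["image"], prompt)})
    return answers


if __name__ == "__main__":
    # Hypothetical item; real VLAT/CALVI items would reference the actual chart images.
    items = [
        {
            "id": "item-01",
            "image": "chart.png",
            "question": "Which category has the highest value?",
            "options": ["A", "B", "C", "D"],
        },
    ]
    # Repeat over several seeds to average out response variability across runs.
    results = [run_trial(items, seed) for seed in range(5)]
    print(json.dumps(results, indent=2))
```

In this sketch, per-seed randomization addresses order effects and the constrained response format ("one option or 'omit'") makes omission rates directly measurable, mirroring the uncertainty-management comparison reported in the abstract.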
Files (Original bundle):
- cgf70137.pdf (Adobe Portable Document Format, 2.2 MB)
- 1011-file-i8.zip (Zip file, 11.94 MB)