Polish Cultural Vision Benchmark (PCVB)
Evaluating Vision-Language Models on Polish Cultural Heritage
A specialized evaluation dataset designed to assess vision-language models' understanding of Polish cultural heritage, history, geography, and traditions. This benchmark addresses the critical gap in multilingual and culturally-specific evaluation of multimodal AI systems.
Benchmark Scope:
- Domain: Polish Cultural Knowledge
- Modality: Vision + Language
- Task Type: Visual Recognition and Cultural Classification
- Dataset Size: ~220 curated image-text pairs across 11 subcategories
Categories Evaluated:
- Art & Entertainment: Movies, Art, Theatre
- Culture & Tradition: Food, Folk Culture, Traditions
- Geography: Cities, Nature, Architecture
- History: Historical Figures, Historical Sites
Help us develop the Polish large language model Bielik by using Arena.
We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/016951.
Model | Params | Avg | Avg (object) | Avg (country) | History (object) | History (country) | Geography (object) | Geography (country) | Art & Entertainment (object) | Art & Entertainment (country) | Culture & Tradition (object) | Culture & Tradition (country) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Anthropic Claude 3.7 Sonnet | 400 | 49.76 | 37.52 | 76.77 | 47.5 | 77.5 | 71.67 | 88.33 | 34.48 | 62.07 | 18.33 | 41.67 |
Benchmark Details:
Methodology: Each test item consists of carefully selected and manually verified images that represent authentic Polish cultural elements. Models are prompted to identify specific cultural objects, landmarks, foods, or personalities shown in images, along with their country of origin.
Evaluation Protocol: Responses are evaluated for both object accuracy and geographical attribution using binary scoring (correct/incorrect) across all categories.
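As an illustration of the binary scoring described above, the following is a minimal sketch of how per-category object and country accuracies could be aggregated. It is not the benchmark's actual code: the `ScoredItem` class and the `accuracy_by_category` function are hypothetical names, and the sketch assumes each response has already been judged correct or incorrect for both questions (how that judgment is made is not specified here).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredItem:
    """One judged model response (hypothetical structure, for illustration)."""
    category: str          # e.g. "History" or "Geography"
    object_correct: bool   # was the object, landmark, food, or person identified?
    country_correct: bool  # was the country of origin attributed correctly?


def accuracy_by_category(items):
    """Return object and country accuracy (in percent) for each category."""
    # Per category: [number of items, object hits, country hits]
    totals = defaultdict(lambda: [0, 0, 0])
    for item in items:
        counts = totals[item.category]
        counts[0] += 1
        counts[1] += int(item.object_correct)
        counts[2] += int(item.country_correct)
    return {
        category: {
            "object": round(100 * obj_hits / n, 2),
            "country": round(100 * country_hits / n, 2),
        }
        for category, (n, obj_hits, country_hits) in totals.items()
    }
```

Under this scheme the two scores are tracked independently, so a model can receive credit for naming the correct country even when it fails to identify the specific object, which is consistent with the country columns in the table being higher than the object columns.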
Unique Value Proposition:
- Cultural Specificity: Tests deep understanding of Polish heritage beyond generic object recognition
- Multimodal Integration: Requires both visual processing and cultural knowledge
- Bias Detection: Reveals potential Western-centric biases in vision-language models
- Real-world Relevance: Evaluates practically useful cultural knowledge for Polish applications
This benchmark is maintained as a private evaluation suite to ensure result integrity and prevent training data contamination.