Polish Cultural Vision Benchmark (PCVB)

Evaluating Vision-Language Models on Polish Cultural Heritage

PCVB is a specialized evaluation dataset for assessing how well vision-language models understand Polish cultural heritage, history, geography, and traditions. It addresses a gap in the multilingual, culture-specific evaluation of multimodal AI systems.

Benchmark Scope:

Domain: Polish Cultural Knowledge
Modality: Vision + Language
Task Type: Visual Recognition and Cultural Classification
Dataset Size: ~220 curated image-text pairs across 11 subcategories

Categories Evaluated:

🎭 Art & Entertainment: Movies, Art, Theatre
🏛️ Culture & Tradition: Food, Folk Culture, Traditions
🗺️ Geography: Cities, Nature, Architecture
📚 History: Historical Figures, Historical Sites

Help us develop Bielik, the Polish large language model, by using Arena.

We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/016951.

Model	Params (B)	Avg	Avg (object)	Avg (country)	History (object)	History (country)	Geography (object)	Geography (country)	Art & Entertainment (object)	Art & Entertainment (country)	Culture & Tradition (object)	Culture & Tradition (country)
Google Gemini 2.5 Pro	n/a	59.88	43.00	76.77	47.50	85.00	71.67	90.00	34.48	62.07	18.33	70.00
Google Gemini 2.5 Flash	n/a	52.59	37.52	67.66	32.50	77.50	68.33	88.33	27.59	44.83	21.67	60.00
Anthropic Claude 3.7 Sonnet	n/a	49.76	37.06	62.46	52.50	80.00	58.33	83.33	22.41	44.83	15.00	41.67
Qwen 2.5 VL 72B	72	37.71	23.91	51.51	35.00	70.00	31.67	71.67	18.97	31.03	10.00	33.33
OpenAI GPT-4o	n/a	35.72	28.94	42.49	30.00	37.50	45.00	55.00	22.41	24.14	18.33	53.33
Qwen 2.5 VL 32B	32	35.53	22.27	48.80	30.00	67.50	28.33	66.67	22.41	31.03	8.33	30.00
Qwen 2.5 VL 7B	7	33.17	21.62	44.72	32.50	65.00	28.33	66.67	18.97	15.52	6.67	31.67
Mistral Medium 3	n/a	31.72	17.45	45.99	12.50	65.00	31.67	56.67	18.97	18.97	6.67	43.33
Google Gemma 3 27B	27	31.45	19.14	43.76	12.50	52.50	28.33	48.33	22.41	25.86	13.33	48.33
Meta Llama 4 Maverick	400	30.23	17.49	42.98	17.50	52.50	20.00	50.00	24.14	32.76	8.33	36.67
Google Gemma 3 12B	12	26.55	13.06	40.04	10.00	42.50	15.00	46.67	17.24	29.31	10.00	41.67
Google Gemma 3 4B	4	22.78	9.72	35.84	5.00	47.50	8.33	38.33	17.24	25.86	8.33	31.67
Mistral Small 3.1 24B	24	19.29	12.41	26.17	7.50	37.50	21.67	36.67	13.79	15.52	6.67	15.00
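
The page does not say how the three Avg columns are derived. The published numbers are consistent with an unweighted mean over the four categories, and an overall Avg that is the mean of Avg (object) and Avg (country). The sketch below checks that reading against the Google Gemini 2.5 Pro row. This is an inference from the table, not a documented formula, and the helper name `mean` is purely illustrative.

```python
def mean(values):
    """Unweighted arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)

# Per-category scores for Google Gemini 2.5 Pro, copied from the table.
# Order: History, Geography, Art & Entertainment, Culture & Tradition.
object_scores = [47.50, 71.67, 34.48, 18.33]
country_scores = [85.00, 90.00, 62.07, 70.00]

# Assumed aggregation (inferred, not stated by the benchmark authors).
avg_object = mean(object_scores)
avg_country = mean(country_scores)
avg_overall = mean([avg_object, avg_country])

# Published values from the same row, for comparison.
published = {"Avg (object)": 43.00, "Avg (country)": 76.77, "Avg": 59.88}
recomputed = {
    "Avg (object)": avg_object,
    "Avg (country)": avg_country,
    "Avg": avg_overall,
}

for name, value in recomputed.items():
    # The inputs are already rounded to two decimals, so allow a small tolerance.
    matches = abs(value - published[name]) <= 0.01
    print(f"{name}: recomputed {value:.3f}, published {published[name]:.2f}, match={matches}")
```

If this reading is right, each category carries equal weight in the averages regardless of how many items it contains.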

Benchmark Details:

Methodology: Each test item consists of carefully selected and manually verified images that represent authentic Polish cultural elements. Models are prompted to identify the specific cultural object, landmark, food, or person shown in each image, along with its country of origin.

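Evaluation Protocol: Each response is scored twice using binary scoring (correct/incorrect): once for object accuracy and once for geographical attribution. The same protocol applies across all categories, and the two scores correspond to the (object) and (country) columns of the leaderboard.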
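
As a rough illustration of the protocol, the sketch below turns per-item binary judgments into per-category accuracy percentages. The `ScoredItem` record, the `accuracy_by_category` function, and the toy data are all hypothetical. The benchmark is private, so its actual data format and scoring code are not known, and how a response is judged correct is outside this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredItem:
    """One already-judged response (hypothetical record layout)."""
    category: str
    object_correct: bool   # did the model name the object correctly?
    country_correct: bool  # did the model attribute the right country?


def accuracy_by_category(items):
    """Return {category: {"object": pct, "country": pct}} from binary scores."""
    totals = defaultdict(lambda: {"n": 0, "object": 0, "country": 0})
    for item in items:
        bucket = totals[item.category]
        bucket["n"] += 1
        bucket["object"] += int(item.object_correct)
        bucket["country"] += int(item.country_correct)
    return {
        category: {
            "object": 100.0 * bucket["object"] / bucket["n"],
            "country": 100.0 * bucket["country"] / bucket["n"],
        }
        for category, bucket in totals.items()
    }


# Toy data, invented for illustration only (not benchmark results).
toy_items = [
    ScoredItem("History", object_correct=True, country_correct=True),
    ScoredItem("History", object_correct=False, country_correct=True),
    ScoredItem("History", object_correct=False, country_correct=True),
    ScoredItem("History", object_correct=False, country_correct=False),
    ScoredItem("Geography", object_correct=True, country_correct=True),
    ScoredItem("Geography", object_correct=False, country_correct=True),
]

results = accuracy_by_category(toy_items)
for category, scores in results.items():
    print(f"{category}: object={scores['object']:.2f}, country={scores['country']:.2f}")
```

Tracking the two judgments separately lets a model earn credit for the country even when it misses the specific object, which matches the pattern in the leaderboard where country scores are generally higher than object scores.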

Unique Value Proposition:

Cultural Specificity: Tests deep understanding of Polish heritage beyond generic object recognition
Multimodal Integration: Requires both visual processing and cultural knowledge
Bias Detection: Reveals potential Western-centric biases in vision-language models
Real-world Relevance: Evaluates practically useful cultural knowledge for Polish applications

This benchmark is maintained as a private evaluation suite to ensure result integrity and prevent training data contamination.