🥇 VideoEval-Pro Leaderboard

A More Robust and Realistic QA Evaluation Benchmark for Multimodal LLMs in Long Video Understanding

Introduction

Do existing long video benchmarks faithfully reflect a model's real capacity to understand long video content? Do the gains reported by newer models genuinely translate into stronger long video comprehension, or are they illusory? To probe these questions, we present VideoEval-Pro, a more robust and realistic long video understanding benchmark consisting of open-ended, short-answer QA problems. To construct VideoEval-Pro, we source questions from four existing long video understanding multiple-choice question (MCQ) benchmarks and reformat them into free-form questions. We then filter by video duration, question and answer type, answerability, and QA difficulty to ensure benchmark quality. The final benchmark contains 1,289 short-answer questions based on 465 videos, with an average video duration of 38 minutes.

Rank	Models	Model Size(B)	Type	Frames	LP_Open	LR_Open	HP_Open	HR_Open	Overall_Open
1	gemini-2.5-pro	unknown	Proprietary	512	47.2	35.4	41.3	42	44.2
2	GPT-4.1	unknown	Proprietary	256	47.2	29.9	28.1	34.5	40.8
3	GPT-4.1-mini	unknown	Proprietary	256	46	32	27.3	32.6	39.9
4	Gemini-1.5-Pro	unknown	Proprietary	512	43.7	32.7	35.5	31.8	39.3
5	gemini-2.0-flash	unknown	Proprietary	512	43.6	27.9	27.3	30.7	37.6
6	Gemini-2.5-Flash	unknown	Proprietary	256	42.4	30.6	25.6	26.9	36.3
7	Gemini-1.5-Flash	unknown	Proprietary	512	41.5	25.9	27.3	25.8	35.1
8	GPT-4o	unknown	Proprietary	256	39.4	23.1	26.4	29.2	34.2
9	MiMo-VL-RL	7	Open-source	512	35.5	18.4	28.1	18.9	29.5
10	MiMo-VL-SFT	7	Open-source	512	34.7	19	26.4	19.7	29.1
11	Video-XL-2	8	Open-source	512	33.3	25.2	21.5	20.5	28.6
12	Qwen2.5-VL	7	Open-source	512	33.9	15.6	24.8	17.8	27.7
13	InternVideo2.5	8	Open-source	512	33.6	17	19.8	18.2	27.2
14	VideoChat-Flash	7	Open-source	512	33.3	16.3	21.5	17.4	27
15	Qwen2-VL	7	Open-source	512	31.7	14.3	21.5	20.5	26.5
16	InternVL3	8	Open-source	64	30.3	17	24	13.3	24.7
17	InternVL2.5	8	Open-source	64	28.8	19.7	21.5	16.7	24.6
18	LLaVA-Video	7	Open-source	64	28.5	13.6	20.7	19.3	24.2
19	Vamba	10	Open-source	512	28.1	10.9	21.5	12.5	22.3
20	LongVU	7	Open-source	512	25.9	12.9	19.8	17.4	22.1
21	Video-XL	7	Open-source	512	22.3	15	18.2	10.2	18.6
22	LongLLaVA	9	Open-source	512	21.7	15	14	10.2	17.8
23	Phi-4-Mini	5.6	Open-source	128	19.2	12.9	18.2	10.2	16.5
24	LongVA	7	Open-source	64	20.5	6.8	19	9.5	16.5
25	Mantis-Idefics2	8	Open-source	24	17.8	9.5	16.5	8.3	14.8
26	Video-LLaVA	8	Open-source	8	13.2	6.1	14	6.1	11

Models are ranked based on Overall_Open.
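
As a minimal sketch of that ranking rule, the snippet below sorts a handful of rows copied from the table by Overall_Open in descending order. The function name `rank_by_overall` is hypothetical, the ranks it prints are positions within this small subset rather than leaderboard ranks, and the handling of ties (input order is preserved) is an assumption, since the page does not state a tie-breaking rule.

```python
# A few (model, Overall_Open) pairs copied from the table above.
rows = [
    ("GPT-4o", 34.2),
    ("gemini-2.5-pro", 44.2),
    ("Qwen2.5-VL", 27.7),
    ("GPT-4.1", 40.8),
]


def rank_by_overall(rows):
    """Sort by Overall_Open, highest first, and attach 1-based ranks.

    Hypothetical helper: ties keep their input order because Python's
    sort is stable; the leaderboard's own tie-breaking is not documented.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, model, score)
        for rank, (model, score) in enumerate(ordered, start=1)
    ]


for rank, model, score in rank_by_overall(rows):
    print(rank, model, score)
```

The per-category columns (LP_Open, LR_Open, HP_Open, HR_Open) play no part in the ordering here; only Overall_Open is used as the sort key.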

Submit on VideoEval-Pro Leaderboard

The evaluation model to use is GPT-4o-0806.

⚠ Please note that you need to submit the JSON file with the following format: