🥇 VideoEval-Pro Leaderboard

A more robust and realistic QA evaluation benchmark for multi-modal LLMs in long video understanding

Introduction

Do existing long video benchmarks faithfully reflect models' real capacity to understand long video content? Do the gains reported by newer models genuinely translate into stronger long video comprehension, or are they illusory? To probe these questions, we present VideoEval-Pro, a more robust and realistic long video understanding benchmark consisting of open-ended, short-answer QA problems. To construct VideoEval-Pro, we source questions from four existing long video understanding MCQ benchmarks and reformulate them as free-form questions. We then apply a series of filters based on video duration, question and answer type, answerability, and QA difficulty to ensure the quality of the benchmark. The final benchmark contains 1,289 short-answer questions over 465 videos, with an average duration of 38 minutes.
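The questions can also be explored programmatically. Below is a minimal sketch using the Hugging Face `datasets` library; the repo id `TIGER-Lab/VideoEval-Pro`, the split name, and the field names are assumptions here, so check the HuggingFace page linked below for the exact schema.

```python
# Minimal sketch: load VideoEval-Pro and inspect one item.
# Assumed: repo id "TIGER-Lab/VideoEval-Pro", split "test", and
# fields "video", "question", "answer" (verify against the HF page).
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/VideoEval-Pro", split="test")

print(len(ds))              # expected: 1,289 short-answer questions
example = ds[0]
print(example["video"])     # source video identifier (assumed field name)
print(example["question"])  # free-form, short-answer question
print(example["answer"])    # reference short answer
```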

| 📈Overview | 👨‍💻Github | 📖VideoEval-Pro Paper | 🤗HuggingFace |

| Rank | Models | Model Size (B) | Type | Frames | LP_Open | LR_Open | HP_Open | HR_Open | Overall_Open |
|------|--------|----------------|------|--------|---------|---------|---------|---------|--------------|
| 10 | | unknown | Proprietary | 256 | 47.2 | 29.9 | 28.1 | 34.5 | 40.8 |

Models are ranked by Overall_Open. LP, LR, HP, and HR denote Local Perception, Local Reasoning, Holistic Perception, and Holistic Reasoning question types; the _Open suffix indicates accuracy on the open-ended (free-form) setting.
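Because answers are free-form rather than multiple-choice, correctness is judged by comparing each prediction against the reference answer with an LLM judge. The sketch below shows one way such a judge could be wired up, assuming an OpenAI-compatible client and GPT-4o as the judge model; the actual judge prompt and model used for this leaderboard are defined in the GitHub repository linked above, so treat everything here as illustrative.

```python
# Illustrative sketch of open-ended answer scoring with an LLM judge.
# Assumptions: openai>=1.0 client, GPT-4o as judge, and a simple
# yes/no rubric (the official judge prompt may differ).
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Does the model answer convey the same meaning as the reference? "
    "Reply with exactly 'yes' or 'no'."
)

def judge(question: str, reference: str, prediction: str) -> bool:
    """Return True if the judge deems the free-form prediction correct."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

# Accuracy over (question, reference, prediction) triples:
# overall = 100 * sum(judge(q, r, p) for q, r, p in triples) / len(triples)
```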