🥇 VideoEval-Pro Leaderboard
A More Robust and Realistic QA Evaluation benchmark of Multi-modal LLMs in long video understanding
Introduction
Do existing long video benchmarks faithfully reflect a model's real capacity to understand long video content? Do the gains reported by newer models genuinely translate into stronger long video comprehension, or are they illusory? To probe these questions, we present VideoEval-Pro, a more robust and realistic long video understanding benchmark containing open-ended, short-answer QA problems. To construct VideoEval-Pro, we source questions from four existing long video understanding MCQ benchmarks and reformat them as free-form questions. We then apply a series of filters based on video duration, question and answer type, answerability, and QA difficulty to ensure the quality of the benchmark. The final benchmark contains a total of 1,289 short-answer questions based on 465 videos, with an average duration of 38 minutes.
| 📈Overview | 👨💻Github | 📖VideoEval-Pro Paper | 🤗HuggingFace |
Rank | Models | Model Size(B) | Type | Frames | LP_Open | LR_Open | HP_Open | HR_Open | Overall_Open |
---|---|---|---|---|---|---|---|---|---|
10 | | unknown | Proprietary | 256 | 47.2 | 29.9 | 28.1 | 34.5 | 40.8 |
Models are ranked based on Overall_Open.
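As a minimal sketch of this ranking rule, the snippet below sorts leaderboard rows by `Overall_Open` in descending order and assigns 1-based ranks. The function name `rank_by_overall_open`, the model names, and the scores are hypothetical placeholders, and tie handling (ties keep their input order) is an assumption rather than documented leaderboard behavior.

```python
def rank_by_overall_open(rows):
    """Return copies of rows sorted by Overall_Open (highest first) with a Rank field.

    Ties keep their input order because Python's sort is stable; how the real
    leaderboard breaks ties is not specified, so this is an assumption.
    """
    ordered = sorted(rows, key=lambda row: row["Overall_Open"], reverse=True)
    return [{**row, "Rank": rank} for rank, row in enumerate(ordered, start=1)]


# Hypothetical rows for illustration only; these are not real leaderboard results.
rows = [
    {"Models": "model-a", "Overall_Open": 21.5},
    {"Models": "model-b", "Overall_Open": 33.0},
    {"Models": "model-c", "Overall_Open": 27.25},
]

ranked = rank_by_overall_open(rows)
for row in ranked:
    print(row["Rank"], row["Models"], row["Overall_Open"])
```

The MCQ scores that a submission may also carry do not affect the ordering in this sketch, since the page states that only `Overall_Open` determines rank.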
Dataset Statistics and Tasks Info
- Local Perception (LP): LP focuses on identifying and retrieving visual elements or actions from a short video clip in a long video. Subtypes in this category include Segment QA, Needle-In-A-Haystack (NIAH) QA, Attribute Perception, Action Recognition, Object Recognition, Entity Recognition, Key Information Retrieval and a combined Other subtype.
- Local Reasoning (LR): LR focuses on reasoning within short temporal windows, such as inferring causality, temporal order, or changes that happen over a local sequence of events. The four subtypes in this category are Egocentric Video Reasoning, Object Reasoning, Temporal Reasoning and Action Reasoning.
- Holistic Perception (HP): HP involves a global and holistic understanding of statistical, structural, or spatial information, typically requiring visual aggregation. In VideoEval-Pro, HP consists of Visual Counting problems.
- Holistic Reasoning (HR): HR requires abstract or high-level understanding of long videos across events or scenes, often involving narrative or intent understanding. The two subtypes for HR are Event Understanding and Plot Reasoning.
Submit on VideoEval-Pro Leaderboard Introduction
The evaluation model to be used is GPT-4o-0806.
⚠ Please note that you need to submit a JSON file in the following format:
[
{
"Models": "<Model Name>",
"Model Size(B)": "100 or -",
"Frames": "<Number of Frames>",
"Type": "Proprietary or Open-source",
"URL": "<Model URL>" or null,
"LP_Open": 50.0 or null,
"LP_MCQ": 50.0 or null,
"LR_Open": 50.0 or null,
"LR_MCQ": 50.0 or null,
"HP_Open": 50.0 or null,
"HP_MCQ": 50.0 or null,
"HR_Open": 50.0 or null,
"HR_MCQ": 50.0 or null,
"Overall_Open": 50.0,
"Overall_MCQ": 50.0
}
]
You may refer to the GitHub page for instructions about evaluating your model.
Please send us an email at tonyyyma@gmail.com, attaching the JSON file. We will review your submission and update the leaderboard accordingly.
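The submission format above can also be produced programmatically, which avoids hand-editing mistakes such as leaving the template's `or null` placeholders in the file. The following is a minimal sketch, not an official tool: the helper names `make_entry` and `dump_submission`, the model name, and the scores are hypothetical, and writing `Frames` and `Model Size(B)` as strings is an assumption based on the quoted placeholders in the template.

```python
import json

TASKS = ("LP", "LR", "HP", "HR")
# Per-task scores may be null in the template; the two overall scores have no null option.
OPTIONAL_SCORES = [f"{task}_{fmt}" for task in TASKS for fmt in ("Open", "MCQ")]
REQUIRED_SCORES = ["Overall_Open", "Overall_MCQ"]


def make_entry(model, size_b, frames, model_type, scores, url=None):
    """Build one leaderboard entry; per-task scores that are missing become null."""
    if model_type not in ("Proprietary", "Open-source"):
        raise ValueError(f"unexpected Type: {model_type!r}")
    entry = {
        "Models": model,
        # Assumption: size is written as a string, with "-" when it is unknown.
        "Model Size(B)": str(size_b) if size_b is not None else "-",
        # Assumption: the frame count is written as a string, as in the template.
        "Frames": str(frames),
        "Type": model_type,
        "URL": url,
    }
    for key in OPTIONAL_SCORES:
        entry[key] = scores.get(key)
    for key in REQUIRED_SCORES:
        if scores.get(key) is None:
            raise ValueError(f"{key} is required")
        entry[key] = float(scores[key])
    return entry


def dump_submission(entries):
    """Serialize the list of entries as JSON text (None is written as null)."""
    return json.dumps(entries, indent=2)


# Hypothetical model and scores for illustration only.
entry = make_entry(
    "my-model", None, 32, "Open-source",
    {"LP_Open": 25.0, "Overall_Open": 20.0, "Overall_MCQ": 45.5},
)
submission = dump_submission([entry])
print(submission)
```

Saving `submission` to a file (for example with `pathlib.Path("submission.json").write_text(submission)`) gives a file that any JSON parser can load before it is attached to the email.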