LongBench v2

Benchmarking Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Introduction

LongBench v2 is designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 has the following features: (1) Length: context lengths ranging from 8k to 2M words, with the majority under 128k. (2) Difficulty: challenging enough that even human experts, using search tools within the document, cannot answer correctly in a short time. (3) Coverage: covers a variety of realistic scenarios. (4) Reliability: all questions are in multiple-choice format for reliable evaluation. To elaborate, LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collected data from nearly 100 highly educated individuals with diverse professional backgrounds. We employed both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which incorporates longer reasoning, achieves 57.7%, surpassing the human baseline by 4%.

🔍 With LongBench v2, we are eager to find out how scaling inference-time compute will affect deep understanding and reasoning in long-context scenarios.
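To make the task format concrete, here is a minimal, unofficial evaluation sketch for the w/o CoT setting. It assumes the data is distributed on Hugging Face as THUDM/LongBench-v2 with a train split and fields such as context, question, choice_A..choice_D, and answer; adjust the names if the actual release differs, and plug in your own model call as `generate`.

```python
# Minimal, unofficial evaluation sketch (w/o CoT). Dataset name, split, and field
# names are assumptions about the Hugging Face release; adjust as needed.
import re
from datasets import load_dataset

dataset = load_dataset("THUDM/LongBench-v2", split="train")

PROMPT = (
    "Please read the following text and answer the question below.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "A. {choice_A}\nB. {choice_B}\nC. {choice_C}\nD. {choice_D}\n\n"
    "Respond with only the letter of the correct choice."
)

def extract_choice(response: str) -> str:
    """Pull the first standalone A/B/C/D letter out of a model response."""
    match = re.search(r"\b([ABCD])\b", response.strip())
    return match.group(1) if match else ""

def evaluate(generate) -> float:
    """`generate` is any callable mapping a prompt string to a model response."""
    correct = sum(
        extract_choice(generate(PROMPT.format(**sample))) == sample["answer"]
        for sample in dataset
    )
    return 100.0 * correct / len(dataset)
```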

Leaderboard

📢 The leaderboard is constantly updating as we welcome new submissions!

We consider two test settings: w/o CoT and w/ CoT.

Short: 0 ~ 32k words          Medium: 32k ~ 128k words          Long: 128k ~ 2M words

By default, the leaderboard is sorted by the results without CoT (w/o CoT).

Each accuracy cell is shown as w/o CoT / w/ CoT.

| Model | Organization | Params | Context | Date | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| o1-preview | OpenAI | - | 128k | 2024-09-12 | 57.7 / 56.2 | 66.8 / 58.9 | 52.1 / 54.6 | 62.6 / 64.6 | 53.5 / 50.2 | 58.1 / 54.3 |
| Human | N/A | N/A | N/A | N/A | 53.7 / 53.7 | 100 / 100 | 25.1 / 25.1 | 47.2 / 47.2 | 59.1 / 59.1 | 53.7 / 53.7 |
| GPT-4o | OpenAI | - | 128k | 2024-08-06 | 50.1 / 51.2 | 57.4 / 57.9 | 45.6 / 47.1 | 53.3 / 53.9 | 52.4 / 50.7 | 40.2 / 47.7 |
| GLM-4-Plus | Zhipu AI & Tsinghua | - | 128k | 2024-10-11 | 44.3 / 46.1 | 47.4 / 52.1 | 42.4 / 42.4 | 50.0 / 53.3 | 46.5 / 44.7 | 30.6 / 37.0 |
| Claude 3.5 Sonnet | Anthropic | - | 200k | 2024-10-22 | 41.0 / 46.7 | 46.9 / 55.2 | 37.3 / 41.5 | 46.1 / 53.9 | 38.6 / 41.9 | 37.0 / 44.4 |
| Qwen2.5-72B | Alibaba | 72B | 128k | 2024-09-19 | 39.4 / 38.8 | 43.8 / 42.2 | 36.7 / 36.7 | 44.4 / 50.0 | 34.0 / 28.8 | 41.7 / 39.8 |
| o1-mini | OpenAI | - | 128k | 2024-09-12 | 37.8 / 38.9 | 38.9 / 42.6 | 37.1 / 36.6 | 48.6 / 48.9 | 33.3 / 32.9 | 28.6 / 34.3 |
| Mistral Large 24.11 | Mistral AI | 123B | 128k | 2024-11-24 | 34.4 / 39.6 | 38.0 / 43.8 | 32.2 / 37.0 | 41.7 / 46.1 | 30.7 / 34.9 | 29.6 / 38.0 |
| Llama 3.1 70B | Meta | 70B | 128k | 2024-07-23 | 31.6 / 36.2 | 32.3 / 35.9 | 31.2 / 36.3 | 41.1 / 45.0 | 27.4 / 34.0 | 24.1 / 25.9 |
| Nemotron 70B | Nvidia | 70B | 128k | 2024-10-15 | 31.0 / 35.2 | 32.8 / 37.0 | 29.9 / 34.1 | 38.3 / 46.7 | 27.9 / 29.8 | 25.0 / 26.9 |
| GLM-4-9B | Zhipu AI & Tsinghua | 9B | 128k | 2024-06-05 | 30.2 / 30.8 | 30.7 / 34.4 | 29.9 / 28.6 | 33.9 / 35.0 | 29.8 / 30.2 | 25.0 / 25.0 |
| Llama 3.1 8B | Meta | 8B | 128k | 2024-07-23 | 30.0 / 30.4 | 30.7 / 36.5 | 29.6 / 26.7 | 35.0 / 34.4 | 27.9 / 31.6 | 25.9 / 21.3 |
| Llama 3.3 70B | Meta | 70B | 128k | 2024-12-06 | 29.8 / 36.2 | 34.4 / 38.0 | 27.0 / 35.0 | 36.7 / 45.0 | 27.0 / 33.0 | 24.1 / 27.8 |
| GPT-4o mini | OpenAI | - | 128k | 2024-07-18 | 29.3 / 32.4 | 31.1 / 32.6 | 28.2 / 32.2 | 31.8 / 34.8 | 28.6 / 31.6 | 26.2 / 29.9 |
| Command R+ | Cohere | 104B | 128k | 2024-08-30 | 27.8 / 31.6 | 30.2 / 34.4 | 26.4 / 29.9 | 36.7 / 39.4 | 23.7 / 24.2 | 21.3 / 33.3 |
| Qwen2.5-7B | Alibaba | 7B | 128k | 2024-09-19 | 27.0 / 29.8 | 29.2 / 30.7 | 25.7 / 29.3 | 36.1 / 35.6 | 23.7 / 26.5 | 18.5 / 26.9 |
| Mistral Large 2 | Mistral AI | 123B | 128k | 2024-07-24 | 26.6 / 33.6 | 29.7 / 34.4 | 24.8 / 33.1 | 37.8 / 41.1 | 19.5 / 31.2 | 22.2 / 25.9 |
| Random | N/A | N/A | N/A | N/A | 25.0 / 25.0 | 25.0 / 25.0 | 25.0 / 25.0 | 25.0 / 25.0 | 25.0 / 25.0 | 25.0 / 25.0 |

A "-" in the Params column indicates a closed-source model.

1*. Human accuracy is based on experts' performance within a 15-minute time limit, after which they are allowed to respond with "I don't know the answer"; this occurred for 8% of the total test data.
2*. Models do not necessarily score lower on subsets with longer length ranges, because the distribution of tasks differs significantly across the length ranges.
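For reference, the Easy/Hard and Short/Medium/Long columns above can be reproduced by bucketing per-sample results with the definitions from this section (annotated difficulty, and word-count cut points at 32k and 128k). The record structure below is illustrative, not the official evaluation code.

```python
# Illustrative aggregation of the leaderboard columns from per-sample results.
# The Result fields are assumptions; the 32k/128k cut points follow the
# Short/Medium/Long definitions above.
from dataclasses import dataclass

@dataclass
class Result:
    difficulty: str   # "easy" or "hard" (annotated difficulty)
    words: int        # context length in words
    correct: bool     # whether the model answered this sample correctly

def length_bucket(words: int) -> str:
    if words < 32_000:
        return "short"    # 0 ~ 32k words
    if words < 128_000:
        return "medium"   # 32k ~ 128k words
    return "long"         # 128k ~ 2M words

def accuracy(results, keep=lambda r: True) -> float:
    subset = [r for r in results if keep(r)]
    return 100.0 * sum(r.correct for r in subset) / len(subset) if subset else float("nan")

def leaderboard_row(results):
    """One leaderboard row (for a single w/o CoT or w/ CoT run)."""
    return {
        "overall": accuracy(results),
        "easy":    accuracy(results, lambda r: r.difficulty == "easy"),
        "hard":    accuracy(results, lambda r: r.difficulty == "hard"),
        "short":   accuracy(results, lambda r: length_bucket(r.words) == "short"),
        "medium":  accuracy(results, lambda r: length_bucket(r.words) == "medium"),
        "long":    accuracy(results, lambda r: length_bucket(r.words) == "long"),
    }
```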

Benchmark

Data Collection


Data collection pipeline of LongBench v2. The annotator first uploads the document(s) and proposes a multiple-choice question based on their content. Automated and manual reviews are then conducted to ensure the data meets our requirements. Only data that passes these reviews is eligible for annotation rewards, meaning the annotator must revise the data until it passes all review stages.
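Schematically, this submit-review-revise loop can be sketched as follows. The review functions are hypothetical stand-ins for the project's actual automated and manual checks; only the control flow (revise until both stages pass, then grant the reward) mirrors the description above.

```python
# Schematic sketch of the annotation loop described above. automated_review and
# manual_review are hypothetical placeholders, not the project's actual checks.
def automated_review(sample: dict) -> bool:
    # Placeholder: quick structural checks (e.g., a question plus four choices).
    return bool(sample.get("question")) and len(sample.get("choices", [])) == 4

def manual_review(sample: dict) -> bool:
    # Placeholder for a human reviewer's verdict.
    return True

def collect_sample(propose, revise, max_rounds: int = 10):
    """The annotator revises the submission until it passes both review stages."""
    sample = propose()                 # upload document(s) and propose an MCQ
    for _ in range(max_rounds):
        if automated_review(sample) and manual_review(sample):
            return sample              # passes all reviews: eligible for the reward
        sample = revise(sample)        # revise and resubmit
    return None                        # not accepted within this sketch's round limit
```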

Benchmark Statistics


(Left) Length distribution of each task category;
(Right) Human expert solving time distribution.


Tasks and data statistics in LongBench v2. 'Length' is the median number of words. 'Expert Acc' and 'Expert Time' refer to the average accuracy and the median time human experts spent answering the questions.

Experiment Results

Citation


      @article{bai2024longbench2,
        title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks}, 
        author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
        journal={arXiv preprint arXiv:2412.15204},
        year={2024}
      }