📢 The leaderboard is constantly updating as we welcome new submissions!
We evaluate under two test settings: without chain-of-thought prompting (w/o CoT) and with it (w/ CoT).
Test samples are grouped into three length subsets by context length:

- Short: 0 ~ 32k words
- Medium: 32k ~ 128k words
- Long: 128k ~ 2M words
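The length buckets above can be expressed as a small helper. This is a minimal sketch; the function name is ours, and the boundary inclusivity (e.g., whether a sample of exactly 32k words counts as Short or Medium) is an assumption, since the page does not specify it:

```python
def length_subset(word_count: int) -> str:
    """Map a sample's context length (in words) to its leaderboard subset.

    Boundaries follow the definitions above: Short 0~32k, Medium 32k~128k,
    Long 128k~2M. Half-open intervals are an assumption here.
    """
    if word_count < 32_000:
        return "Short"
    if word_count < 128_000:
        return "Medium"
    if word_count <= 2_000_000:
        return "Long"
    raise ValueError("context length exceeds the 2M-word benchmark ceiling")
```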
By default, the leaderboard is sorted by the w/o CoT results. To sort by a different column, please click on the corresponding cell.
All scores are accuracy (%). For each metric, the "w/o" column is the w/o CoT result and the "CoT" column is the w/ CoT result.

| # | Model | Creator | Params | Context | Date | Overall w/o | Overall CoT | Easy w/o | Easy CoT | Hard w/o | Hard CoT | Short w/o | Short CoT | Medium w/o | Medium CoT | Long w/o | Long CoT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | o1-preview | OpenAI | - | 128k | 2024-09-12 | 57.7 | 56.2 | 66.8 | 58.9 | 52.1 | 54.6 | 62.6 | 64.6 | 53.5 | 50.2 | 58.1 | 54.3 |
| – | Human | N/A | N/A | N/A | N/A | 53.7 | 53.7 | 100 | 100 | 25.1 | 25.1 | 47.2 | 47.2 | 59.1 | 59.1 | 53.7 | 53.7 |
| 2 | GPT-4o | OpenAI | - | 128k | 2024-08-06 | 50.1 | 51.2 | 57.4 | 57.9 | 45.6 | 47.1 | 53.3 | 53.9 | 52.4 | 50.7 | 40.2 | 47.7 |
| 3 | GLM-4-Plus | Zhipu AI & Tsinghua | - | 128k | 2024-10-11 | 44.3 | 46.1 | 47.4 | 52.1 | 42.4 | 42.4 | 50.0 | 53.3 | 46.5 | 44.7 | 30.6 | 37.0 |
| 4 | Claude 3.5 Sonnet | Anthropic | - | 200k | 2024-10-22 | 41.0 | 46.7 | 46.9 | 55.2 | 37.3 | 41.5 | 46.1 | 53.9 | 38.6 | 41.9 | 37.0 | 44.4 |
| 5 | Qwen2.5-72B | Alibaba | 72B | 128k | 2024-09-19 | 39.4 | 38.8 | 43.8 | 42.2 | 36.7 | 36.7 | 44.4 | 50.0 | 34.0 | 28.8 | 41.7 | 39.8 |
| 6 | o1-mini | OpenAI | - | 128k | 2024-09-12 | 37.8 | 38.9 | 38.9 | 42.6 | 37.1 | 36.6 | 48.6 | 48.9 | 33.3 | 32.9 | 28.6 | 34.3 |
| 7 | Mistral Large 24.11 | Mistral AI | 123B | 128k | 2024-11-24 | 34.4 | 39.6 | 38.0 | 43.8 | 32.2 | 37.0 | 41.7 | 46.1 | 30.7 | 34.9 | 29.6 | 38.0 |
| 8 | Llama 3.1 70B | Meta | 70B | 128k | 2024-07-23 | 31.6 | 36.2 | 32.3 | 35.9 | 31.2 | 36.3 | 41.1 | 45.0 | 27.4 | 34.0 | 24.1 | 25.9 |
| 9 | Nemotron 70B | Nvidia | 70B | 128k | 2024-10-15 | 31.0 | 35.2 | 32.8 | 37.0 | 29.9 | 34.1 | 38.3 | 46.7 | 27.9 | 29.8 | 25.0 | 26.9 |
| 10 | GLM-4-9B | Zhipu AI & Tsinghua | 9B | 128k | 2024-06-05 | 30.2 | 30.8 | 30.7 | 34.4 | 29.9 | 28.6 | 33.9 | 35.0 | 29.8 | 30.2 | 25.0 | 25.0 |
| 11 | Llama 3.1 8B | Meta | 8B | 128k | 2024-07-23 | 30.0 | 30.4 | 30.7 | 36.5 | 29.6 | 26.7 | 35.0 | 34.4 | 27.9 | 31.6 | 25.9 | 21.3 |
| 12 | Llama 3.3 70B | Meta | 70B | 128k | 2024-12-06 | 29.8 | 36.2 | 34.4 | 38.0 | 27.0 | 35.0 | 36.7 | 45.0 | 27.0 | 33.0 | 24.1 | 27.8 |
| 13 | GPT-4o mini | OpenAI | - | 128k | 2024-07-18 | 29.3 | 32.4 | 31.1 | 32.6 | 28.2 | 32.2 | 31.8 | 34.8 | 28.6 | 31.6 | 26.2 | 29.9 |
| 14 | Command R+ | Cohere | 104B | 128k | 2024-08-30 | 27.8 | 31.6 | 30.2 | 34.4 | 26.4 | 29.9 | 36.7 | 39.4 | 23.7 | 24.2 | 21.3 | 33.3 |
| 15 | Qwen2.5-7B | Alibaba | 7B | 128k | 2024-09-19 | 27.0 | 29.8 | 29.2 | 30.7 | 25.7 | 29.3 | 36.1 | 35.6 | 23.7 | 26.5 | 18.5 | 26.9 |
| 16 | Mistral Large 2 | Mistral AI | 123B | 128k | 2024-07-24 | 26.6 | 33.6 | 29.7 | 34.4 | 24.8 | 33.1 | 37.8 | 41.1 | 19.5 | 31.2 | 22.2 | 25.9 |
| – | Random | N/A | N/A | N/A | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
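The default ordering can be reproduced by sorting on the w/o CoT overall score. Below is a minimal sketch over a three-row excerpt of the table; the helper name `sort_leaderboard` is ours, not part of the leaderboard:

```python
# Excerpt of the table: (model, overall w/o CoT, overall w/ CoT).
rows = [
    ("GPT-4o", 50.1, 51.2),
    ("o1-preview", 57.7, 56.2),
    ("Claude 3.5 Sonnet", 41.0, 46.7),
]

def sort_leaderboard(rows, use_cot=False):
    """Return rows sorted descending by the chosen setting.

    Index 1 holds the w/o CoT score, index 2 the w/ CoT score.
    """
    key = 2 if use_cot else 1
    return sorted(rows, key=lambda r: r[key], reverse=True)
```

Clicking a different column on the live page corresponds to changing the sort key here.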
On the live leaderboard, a green date marks newly added or updated models. A dash (-) in the Params column indicates a closed-source model whose parameter count is undisclosed.
1\*. Human accuracy is measured under a 15-minute time limit per question, after which annotators are allowed to respond with "I don't know the answer"; this occurred for 8% of the total test data.
2\*. Models do not necessarily score lower on the longer-length subsets, because the distribution of tasks differs significantly across the length ranges.