MERA Code
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks.
🚀 About
MERA Code brings together a rich collection of code-focused evaluation tasks—both private and public—under one roof. Built on top of the Language Model Evaluation Harness (v0.4.9), it enables researchers and practitioners to:
- Compare models on identical tasks and metrics
- Reproduce results with fixed prompts and few-shot settings
- Submit standardized ZIP archives for leaderboard integration
🔍 Datasets Overview
| Set | Task Name | Language | Metrics | Size | Prompts | Skills |
|---|---|---|---|---|---|---|
| Private | ruCodeEval | Python | pass@k | 164 | 10 | Instruction Following, Code Perception, Completion, Algorithms & Data Structures |
| | RuCodeReviewer | Java, Scala, Go, Python | Judge@k, BLEU, chrF | 689 | 10 | Instruction Following, Code Perception, Review, Simulation, Explanation, Design Patterns, Style Guides |
| | CodeLinterEval | Python | pass@k | 110 | 10 | Instruction Following, Code Perception, Style Guides, Review, Editing |
| Public | ruHumanEval | Python | pass@k | 164 | 10 | Instruction Following, Code Perception, Completion |
| | StRuCom | Python, Java, Go, C#, JavaScript | chrF | 500 | 10 | Instruction Following, Code Perception, Simulation, Documentation |
| | UnitTests | Python, Java, Go, C#, JavaScript | CodeBLEU | 2500 | 20 | Instruction Following, Code Perception, Synthesis, Testing, Long Context Comprehension |
| | CodeCorrectness | Python, Java, Go | EM | 1361 | 11 | Instruction Following, Code Perception, Simulation, Error Classification |
| | RealCode | Python | pass@k | 802 | 10 | Instruction Following, Code Perception, Completion |
| | RealCodeJava | Java | pass@k | 298 | 10 | Instruction Following, Code Perception, Completion |
| | JavaTestGen | Java | pass@k, compile@k | 227 | 10 | Instruction Following, Code Perception, Completion, Testing |
| | YABLoCo | C, C++ | pass@k, EM | 208 | 11 | Instruction Following, Code Perception, Completion, Long Context Comprehension |
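Most of the tasks above report pass@k. For reference, here is a minimal sketch of the standard unbiased pass@k estimator (the formulation from the HumanEval paper; this is illustrative and not necessarily the exact implementation used by the harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate.

    n: total generations sampled per problem
    c: number of those generations that pass the tests
    k: number of samples drawn when "solving" the problem

    Returns the probability that at least one of k samples
    (drawn without replacement from the n generations) is correct:
    1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer incorrect samples than k: at least one draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 generations of which 1 is correct, pass@1 is 0.5; averaging this estimate over all problems in a task gives the reported score.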
🛠 Getting Started
First, you need to clone the MERA_CODE repository and load the submodule:
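A minimal sketch of the setup (the repository URL is an assumption; substitute the actual remote if it differs):

```shell
git clone https://github.com/MERA-Evaluation/MERA_CODE.git
cd MERA_CODE
git submodule update --init --recursive
```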
Now, you can choose one of two evaluation regimes, depending on whether you want to obtain the metrics for public tasks locally or intend to use our remote scoring via the website.
Remote Scoring
Remote Scoring (default): quick setup for cloud-based scoring — install only core dependencies, run the evaluation, and submit the resulting ZIP archive to our website to get the score.
In this regime, the terminal output contains no metrics, even for public datasets: each dataset shows a "bypass" placeholder instead of actual scores.
Details on Remote Scoring
Install only those libraries that are required to get the model's generations (answers for the queries of each task).
How it works inside
You may also need additional libraries for model inference or evaluation. Use libraries and versions compatible with lm-eval:
Local Scoring
Local Scoring (optional): full setup for on-premise evaluation — install extra dependencies with metrics and run Docker containers. Available only for Public sets.
Ensure you have a stable internet connection, sufficient disk space, and adequate CPU resources.
Details on Local Scoring
Evaluation of RealCode, RealCodeJava, and JavaTestGen involves running hundreds of Docker containers, and YABLoCo also requires a significant amount of resources and time.
If you run the evaluation from inside a Docker container, the integrity of local scoring is not guaranteed (and this setup is not recommended).
Even aside from the Docker-in-Docker issue, running with insufficient resources means that although you will get metrics, they may come out lower than those computed in an environment with adequate resources.
How it works inside
Now, proceed to the evaluations, but with the flag that enables local metric computation.
More details on usage may be obtained by:
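For example, the CLI help of the underlying harness (assuming the standard lm-evaluation-harness entry point) lists all supported flags:

```shell
lm_eval --help
```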
📁 Repository Structure
💪 How to Join the Leaderboard
Follow these steps to see your model on the Leaderboard:
- **Run Remote Scoring.** Evaluate the benchmark in the Remote Scoring regime (see 🛠 Getting Started above). You may run Local Scoring instead, but then you will wait for scoring twice: once locally and once after submission. You'll end up with a logs folder and a ready-to-submit ZIP archive like `Qwen2.5-0.5B-Instruct_submission.zip`.
- **Submit on the website.** Head over to Create Submission, upload the archive, and move on to the form.
- **Fill in Model Details.** Provide accurate information about the model and the evaluation setup. These details are crucial for reproducibility; if something is missing, administrators may ping you, or your submission might be rejected.
- **Wait for Scoring.** ⏳ Scoring usually wraps up in ~2 hours, and a progress bar lets you track the process. Keep in mind that if you submit more than one archive, they are scored sequentially, one after another, not in parallel.
- **Publish your result.** Once scoring finishes, click "Submit for moderation". After approval, your model goes public and appears on the Leaderboard.
Good luck, and happy benchmarking! 🎉
🤝 Contributing
We are interested in improving MERA Code and invite the community to contribute new, complex tasks as well as improvements to the project's codebase.
Steps to Add a New Task:
- Develop a dataset (on the contributor's side; see task requirements)
- Convert the dataset to MERA format (guide)
- Upload the dataset to 🤗HF Hub (guide)
- Submit the dataset for MERA organizer review (guide)
- Write evaluation code using lm-harness (guide)
- Benchmark state-of-the-art baseline models on the dataset
- Final moderation, and your dataset is officially added!
📝 License
Distributed under the MIT License. See LICENSE for details.
📑 Cite as
@misc{chervyakov2025meracodeunifiedframework,
title={MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks},
author={Artem Chervyakov and
Alexander Kharitonov and
Pavel Zadorozhny and
Adamenko Pavel and
Rodion Levichev and
Dmitrii Vorobev and
Dmitrii Salikhov and
Aidar Valeev and
Alena Pestova and
Maria Dziuba and
Ilseyar Alimova and
Artem Zavgorodnev and
Aleksandr Medvedev and
Stanislav Moiseev and
Elena Bruches and
Daniil Grebenkin and
Roman Derunets and
Vikulov Vladimir and
Anton Emelyanov and
Dmitrii Babaev and
Vladimir V. Ivanov and
Valentin Malykh and
Alena Fenogenova},
year={2025},
eprint={2507.12284},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2507.12284},
}
Read the paper on arXiv