Metric Card for Competition MATH

Metric description

This metric is used to assess performance on the Mathematics Aptitude Test of Heuristics (MATH) dataset.

It first canonicalizes the inputs (e.g., converting 1/2 to \\frac{1}{2}) and then computes accuracy.
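Internally, the comparison is delegated to the math_equivalence helper from the hendrycks/math repository (see the installation note below). The sketch that follows, assuming that package exposes an is_equiv function as in that repository, illustrates the equivalence check the metric performs on each prediction/reference pair:

>>> import math_equivalence  # installed via pip install git+https://github.com/hendrycks/math.git
>>> # both strings are canonicalized (e.g. 1/2 -> \frac{1}{2}) before being compared
>>> math_equivalence.is_equiv("\\frac{1}{2}", "1/2")
True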

How to use

This metric takes two arguments:

predictions: a list of predictions to score. Each prediction is a string that contains natural language and LaTeX.

references: a list of references, one for each prediction. Each reference is a string that contains natural language and LaTeX.

>>> from datasets import load_metric
>>> math = load_metric("competition_math")
>>> references = ["\\frac{1}{2}"]
>>> predictions = ["1/2"]
>>> results = math.compute(references=references, predictions=predictions)

N.B. To be able to use Competition MATH, you need to install the math_equivalence dependency using pip install git+https://github.com/hendrycks/math.git.

Output values

This metric returns a dictionary that contains the accuracy after canonicalizing inputs, on a scale between 0.0 and 1.0.
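Continuing the usage example from the previous section, the accuracy can be read directly from the returned dictionary:

>>> results["accuracy"]
1.0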

The original MATH dataset paper reported accuracies ranging from 3.0% to 6.9% for different large language models.

More recent progress on the dataset can be found on the dataset leaderboard.

Examples

Maximal values (full match):

>>> from datasets import load_metric
>>> math = load_metric("competition_math")
>>> references = ["\\frac{1}{2}"]
>>> predictions = ["1/2"]
>>> results = math.compute(references=references, predictions=predictions)
>>> print(results)
{'accuracy': 1.0}

Minimal values (no match):

>>> from datasets import load_metric
>>> math = load_metric("competition_math")
>>> references = ["\\frac{1}{2}"]
>>> predictions = ["3/4"]
>>> results = math.compute(references=references, predictions=predictions)
>>> print(results)
{'accuracy': 0.0}

Partial match:

>>> from datasets import load_metric
>>> math = load_metric("competition_math")
>>> references = ["\\frac{1}{2}","\\frac{3}{4}"]
>>> predictions = ["1/5", "3/4"]
>>> results = math.compute(references=references, predictions=predictions)
>>> print(results)
{'accuracy': 0.5}

Limitations and bias

This metric is limited to datasets with the same format as the Mathematics Aptitude Test of Heuristics (MATH) dataset, and is meant to evaluate the performance of large language models at solving mathematical problems.

N.B. The MATH dataset also assigns a difficulty level to each problem, so disaggregating model performance by difficulty level (similarly to what was done in the original paper) can give a better indication of how a given model performs at each level of difficulty, compared to overall accuracy alone.
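A minimal sketch of such a per-level breakdown is shown below. It assumes each evaluated example carries a difficulty label such as "Level 1" through "Level 5" (as in the MATH dataset); the grouping logic is illustrative and not part of the metric itself:

>>> from collections import defaultdict
>>> from datasets import load_metric
>>> math = load_metric("competition_math")
>>> # hypothetical per-example records; "level" mirrors the MATH difficulty labels
>>> examples = [
...     {"level": "Level 1", "reference": "\\frac{1}{2}", "prediction": "1/2"},
...     {"level": "Level 5", "reference": "\\frac{3}{4}", "prediction": "1/5"},
... ]
>>> by_level = defaultdict(lambda: {"predictions": [], "references": []})
>>> for ex in examples:
...     by_level[ex["level"]]["predictions"].append(ex["prediction"])
...     by_level[ex["level"]]["references"].append(ex["reference"])
>>> for level, batch in sorted(by_level.items()):
...     acc = math.compute(predictions=batch["predictions"], references=batch["references"])
...     print(level, acc["accuracy"])
Level 1 1.0
Level 5 0.0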

Citation

@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks
    and Collin Burns
    and Saurav Kadavath
    and Akul Arora
    and Steven Basart
    and Eric Tang
    and Dawn Song
    and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2103.03874},
  year={2021}
}

Further References
