# Quanto generation benchmark
This repository contains scripts to evaluate the performance of quantized models using three metrics:

- `latency.py` evaluates the latency per generated token,
- `prediction.py` evaluates the accuracy when predicting the last token of prompts from the Lambada dataset,
- `perplexity.py` evaluates the perplexity of the model on the WikiText dataset, as defined in the transformers documentation (see the sketch after this list).
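
As a rough illustration of the perplexity metric, the sketch below follows the sliding-window evaluation described in the transformers perplexity guide. It is not the actual `perplexity.py` implementation: the model id, context length, and stride are only example values.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # example model, not prescribed by the benchmark scripts
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()

# Concatenate the WikiText-2 test split into a single token sequence.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = 1024  # example context window
stride = 512       # example stride between evaluation windows
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # only score tokens not scored in the previous window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context tokens out of the loss
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

print(f"Perplexity: {torch.exp(torch.stack(nlls).sum() / prev_end).item():.2f}")
```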
An `evaluate_model.py` utility script is also provided to evaluate the metrics on a specific model for several quantization configurations, and to output the results to a `png` bar chart and/or a `json` file.
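
To give an idea of what such a sweep looks like, the sketch below quantizes an example model with quanto at a few weight bit-widths, measures the latency per generated token, and writes a `json` file and a `png` bar chart. It is a minimal sketch, not the actual `evaluate_model.py` script: the model id, prompt, token counts, and output file names are arbitrary choices made for illustration.

```python
import json
import time

import matplotlib.pyplot as plt
import torch
from optimum.quanto import freeze, qint4, qint8, quantize
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"  # example model, not prescribed by the benchmark
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)


def latency_per_token(model, max_new_tokens=128):
    """Warm up, then time a greedy generation and average over the new tokens (ms/token)."""
    model.generate(**inputs, max_new_tokens=8, do_sample=False)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / (outputs.shape[1] - inputs["input_ids"].shape[1]) * 1000


results = {}
for name, weights in [("fp16", None), ("int8", qint8), ("int4", qint4)]:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
    if weights is not None:
        quantize(model, weights=weights)  # quantize the linear weights in place
        freeze(model)                     # replace them with their quantized version
    results[name] = latency_per_token(model)

# Mirror the json / png outputs mentioned above.
with open("latency.json", "w") as f:
    json.dump(results, f, indent=2)
plt.bar(list(results), list(results.values()))
plt.ylabel("latency per generated token (ms)")
plt.savefig("latency.png")
```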
The paragraphs below display results for some popular models on an NVIDIA A100 GPU.