
Scaling Transformer Inference Efficiency

This repo includes code to replicate the head-to-head inference benchmarks from the paper and to run text generation.

To replicate the head-to-head benchmarks from the paper at 540B scale

  • Ensure you are running on 64 TPUv4 chips; smaller chip counts are better suited to smaller models.
python3 run_benchmark.py

This generates the latency and MFU numbers for the PaLM and MT-NLG implementations in the corresponding plot from the paper. The FasterTransformer baseline numbers are drawn from NVIDIA's repo.
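Before launching run_benchmark.py on a pod slice, a quick check that all chips are visible can save a failed run. The snippet below is a minimal sketch, not part of run_benchmark.py, and assumes JAX with TPU support is installed:

    # Sketch: confirm the 64 TPU chips expected for the 540B benchmark are visible to JAX.
    import jax

    n_devices = jax.device_count()
    assert n_devices == 64, f"Expected 64 TPU chips for the 540B benchmark, found {n_devices}"
    print(f"Running on {n_devices} {jax.devices()[0].platform} devices")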

To generate text

python3 run_generation.py --model 540b --quantized False

The current weight paths only load internal PaLM weights, which are unavailable externally. Using this code externally requires modifying the checkpoint paths and the transformer layer definition to suit your own models. Text generation currently uses the pjit-based code paths; updating it to the faster xmap-based code paths is in progress.
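For readers unfamiliar with the two code paths, the pjit style partitions arrays across the TPU mesh with named shardings and lets the compiler handle cross-device communication. The snippet below is a minimal, self-contained sketch of that style; it is not taken from this repo's layer definitions, and the shapes and the "model" axis name are purely illustrative:

    # Sketch of pjit-style sharding: a weight matrix split over a 1D "model" mesh axis.
    # The sharded dimension (4096 here) must be divisible by the number of devices.
    import jax
    import jax.numpy as jnp
    import numpy as np
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

    x = jnp.ones((8, 1024))                         # activations, replicated
    w = jax.device_put(jnp.ones((1024, 4096)),      # weights, sharded on the output dim
                       NamedSharding(mesh, P(None, "model")))

    @jax.jit
    def forward(x, w):
        return x @ w                                # jit inserts any needed collectives

    y = forward(x, w)
    print(y.shape, y.sharding)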

TODO:

  • Insert table of benchmark results
  • Include benchmarks at larger setpoints
  • Update text generation to xmap code path
  • Include helper scripts for running TPU pod slices
  • Update this documentation
