
Text Generation Inference


A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.


Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

  • Simple launcher to serve most popular LLMs
  • Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous batching of incoming requests for increased total throughput
  • Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
  • Quantization with bitsandbytes and GPT-Q (see the Quantization section below)
  • Safetensors weight loading
  • Watermarking with A Watermark for Large Language Models
  • Logits warper (temperature scaling, top-p, top-k, repetition penalty; see transformers.LogitsProcessor for more details)
  • Stop sequences
  • Log probabilities
  • Speculative decoding for roughly 2x lower latency
  • Guidance/JSON: specify an output format to speed up inference and make sure the output is valid according to a given spec
  • Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
  • Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

Hardware support

TGI supports NVIDIA GPUs as well as AMD Instinct MI210 and MI250 GPUs (via ROCm); see the Supported Hardware documentation for details.

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:
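A minimal sketch (the model ID, volume path, and image tag below are illustrative):

```shell
model=HuggingFaceH4/zephyr-7b-beta  # any supported model ID from the Hugging Face Hub
volume=$PWD/data                    # share a volume with the container to avoid re-downloading weights

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```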

And then you can make requests like
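A sketch against the `/generate` route (the prompt and parameters are arbitrary):

```shell
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```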

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`; please note that CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, run

```shell
docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.4-rocm --model-id $model
```

instead of the command above.

To see all options to serve your models (in the code or in the CLI):

```shell
text-generation-launcher --help
```

API documentation

You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You have the option to utilize the `HUGGING_FACE_HUB_TOKEN` environment variable to configure the token used by `text-generation-inference`. This allows you to gain access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to https://huggingface.co/settings/tokens
  2. Copy your CLI READ token
  3. Export `HUGGING_FACE_HUB_TOKEN=<your CLI READ token>`

or with Docker:
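A sketch mirroring the Docker command above (the model, volume, and token values are placeholders):

```shell
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data
token=<your CLI READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```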

A note on Shared Memory (shm)

`NCCL` is a communication framework used by `PyTorch` to do distributed training/inference. `text-generation-inference` makes use of `NCCL` to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an `NCCL` group, `NCCL` might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add `--shm-size 1g` to the command above.

If you are running `text-generation-inference` inside `Kubernetes`, you can also add Shared Memory to the container by creating a volume with:
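A minimal sketch of such a volume, using an in-memory `emptyDir`:

```yaml
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi
```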

and mounting it to `/dev/shm`.

Finally, you can also disable SHM sharing by using the `NCCL_SHM_DISABLE=1` environment variable. However, note that this will impact performance.

Distributed Tracing

`text-generation-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature by pointing the `--otlp-endpoint` argument at the address of an OTLP collector.
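For example, assuming a collector listening on the default OTLP gRPC port:

```shell
text-generation-launcher --model-id $model --otlp-endpoint 127.0.0.1:4317
```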

Architecture

[Figure: TGI architecture diagram]

Local install

You can also opt to install `text-generation-inference` locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using `conda`:
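For example (the exact Python version is illustrative):

```shell
# install Rust via rustup
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# create and activate a Python virtual environment
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```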

You may also need to install Protoc.

On Linux:
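For example, installing a prebuilt release (the pinned version is illustrative):

```shell
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
```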

On MacOS, using Homebrew:
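```shell
brew install protobuf  # provides the protoc compiler
```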

Then run:
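A sketch of the build and launch, assuming the repository's `make install` target (with `BUILD_EXTENSIONS=True` additionally building the custom CUDA kernels) and an illustrative model:

```shell
BUILD_EXTENSIONS=True make install  # install the server and router
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```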

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:
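For example, on Debian- or Ubuntu-based systems:

```shell
sudo apt-get install libssl-dev gcc -y
```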

Optimized architectures

TGI works out of the box to serve optimized models for all modern architectures. They can be found in this list.

Other architectures are supported on a best-effort basis using:

```python
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
```

or

```python
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
```

Run locally

Run
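For example, via one of the repository's Makefile convenience targets (the target name below is assumed to exist in the Makefile):

```shell
make run-falcon-7b-instruct
```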

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
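For example, via the matching Makefile convenience target (assumed to exist alongside the one above):

```shell
make run-falcon-7b-instruct-quantize
```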

4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
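For instance:

```shell
text-generation-launcher --model-id $model --quantize bitsandbytes-nf4
```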

Develop
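A sketch of the typical development loop, assuming `server-dev` and `router-dev` Makefile targets:

```shell
make server-dev  # run the Python server in development mode
make router-dev  # run the Rust router in development mode
```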

Testing
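A sketch of the test entry points, assuming the Makefile targets named below:

```shell
# python server and client tests
make python-server-tests
make python-client-tests
# or both at once
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests
```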