Accelerating Large Language Models with TensorRT-LLM
In our previous discussion, we explored the capabilities of NVIDIA TensorRT for accelerating deep learning inference. Today, we extend that exploration to TensorRT-LLM, a specialized open-source library designed to optimize and accelerate Large Language Model (LLM) inference on NVIDIA GPUs.
TensorRT-LLM builds on TensorRT’s performance advantages to enhance LLM deployments, enabling powerful models like Meta’s Llama and Mistral to run efficiently not only in data centers but also on Windows PCs with RTX GPUs. This library offers up to 8x faster inference performance through advanced techniques such as in-flight batching and tensor parallelism, making it possible to run cutting-edge AI applications locally.
By leveraging TensorRT-LLM, developers can reduce their reliance on cloud infrastructure, resulting in cost savings and improved data privacy, all while achieving top-notch performance. With upcoming updates poised to offer even greater integration with popular models and tools, TensorRT-LLM is set to make high-performance LLM capabilities accessible across a range of platforms.
Code Time
Installing TensorRT-LLM and Testing the Installation
Here we first update our system packages and install essential dependencies like Python 3.10, pip, OpenMPI, Git, and wget. We then install TensorRT-LLM (version 0.8.0) via pip, sourcing it from NVIDIA's PyPI repository to ensure compatibility with Tensor Core GPUs.
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs wget
pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com
After installing TensorRT-LLM, we quickly verify the setup by importing the library. If the installation is successful, the output should confirm the version of TensorRT-LLM as 0.8.0.
python3 -c "import tensorrt_llm"
We should get : [TensorRT-LLM] TensorRT-LLM version: 0.8.0
Installing Requirements and Cloning a Model Example
Next, we clone the TensorRT-LLM repository at version 0.8.0 and navigate to the example directory for a supported model, such as GPT-2. Although GPT-2 is used here as an example, TensorRT-LLM supports various models, and you can substitute this with any other supported model of your choice. We then install the necessary dependencies from the requirements.txt
file, preparing the environment for working with the selected model in TensorRT-LLM.
git clone --branch v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/gpt
pip install -r requirements.txt
Here we clone the GPT-2 model repository from Hugging Face directly into the TensorRT-LLM example directory. We then clean up unnecessary files, keeping only the pytorch_model.bin file (if available), which is needed for conversion to the TensorRT-LLM format.
cd TensorRT-LLM/examples/gpt/
rm -rf gpt2 && git clone https://huggingface.co/gpt2 gpt2
cd gpt2
rm model.safetensors
cd ..
Handling Missing .bin
Models
In cases where the GPT-2 model repository lacks a pytorch_model.bin
, we use a script to generate it. The script loads the model and tokenizer using the Transformers library and saves the model state in .bin
format. It also removes .safetensors
files if present, ensuring compatibility with the conversion process.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
import glob
def save_model_as_bin(model_name, save_directory):
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Create save directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)
# Save the model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
# Save the model in .bin format
torch.save(model.state_dict(), os.path.join(save_directory, "pytorch_model.bin"))
# Remove .safetensors files if they exist
for safetensors_file in glob.glob(os.path.join(save_directory, "*.safetensors")):
os.remove(safetensors_file)
# Usage
save_model_as_bin("model_id", "TensorRT-LLM/examples/....")
Converting Weights to TensorRT-LLM Format
The conversion step transforms the Hugging Face model weights into the format required by TensorRT-LLM. This involves specifying the input directory, output path, tensor parallelism (1 in this case), and using float16 for storage, optimizing inference performance.
python3 TensorRT-LLM/examples/gpt/hf_gpt_convert.py -i TensorRT-LLM/examples/gpt/gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16
Building the TensorRT Engine
Now, we build the TensorRT engine, which is a highly optimized representation of the model ready for fast inference. This script uses the converted model files and specifies options like using the GPT attention plugin and removing input padding, further enhancing performance.
python3 TensorRT-LLM/examples/gpt/build.py --model_dir=./c-model/gpt2/1-gpu --use_gpt_attention_plugin --remove_input_padding
Inference Using Llama-index
We can use the Llama-index library to perform inference with the optimized TensorRT engine.
pip install llama-index llama-index-llms-nvidia-tensorrt
After installing the necessary Llama-index packages, the code instantiates a LocalTensorRTLLM
object, providing the paths to the engine, tokenizer, and other parameters.
from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM
llm = LocalTensorRTLLM(
model_path="./engine_outputs",
engine_name="gpt_float16_tp1_rank0.engine",
tokenizer_dir="gpt2",
max_new_tokens=15,
)
It then demonstrates a simple completion task, prompting the model with "once upon a time" and printing the generated text...
resp = llm.complete("once upon a time")
print(str(resp))
Deploy with Triton Inference Server
Here, we start by cloning the TensorRT-LLM backend repository specific to version 0.9.0 and copying the model files from the c-model/gpt2/1-gpu
directory into the Triton backend model directory.
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cp c-model/gpt2/1-gpu/* tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
This prepares the backend with the necessary model files for deployment.
Update some configuration
Next, we configure TensorRT-LLM to support in-flight batching by creating configuration files for different components. This includes filling in templates for preprocessing, postprocessing, and the main model configuration using the fill_template.py script. In-flight batching enables the processing of requests together to improve throughput and reduce latency.
HF_LLAMA_MODEL=/content/TensorRT-LLM/examples/gpt/gpt2
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt \
tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/postprocessing/config.pbtxt \
tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt \
triton_max_batch_size:64,decoupled_mode:True,bls_instance_count:1,accumulate_tokens:False
python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt \
triton_max_batch_size:64
python3 tensorrtllm_backend/tools/fill_template.py -i tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
triton_max_batch_size:64,decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
Configuration settings such as triton_max_batch_size
, decoupled_mode
(For streaming), and batching_strategy
are specified to optimize performance.
Serving the Model
To serve the model, we run the Triton Inference Server within a Docker container. The docker run command sets up the server with GPU support, mounts the current directory to the container, and specifies the working directory.
docker run -it --rm --gpus all --network host --shm-size=1g -v $(pwd):/tensorrtllm_backend --workdir /tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3
We can launch the Triton server using a provided Python script, setting the model repository path and world size to configure the server for model inference.
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
--model_repo tensorrtllm_backend/all_models/inflight_batcher_llm \
--world_size 1
Testing the Deployment
Here we test the deployed model by sending a request to the Triton server's API endpoint. The Python script constructs a payload with input text and parameters, sends a POST request to the model endpoint, and prints the response.
import requests
url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
"text_input": "Say this is a test.",
"parameters": {
"max_tokens": 128,
"stop_words": [""]
}
}
response = requests.post(url, json=payload)
if response.status_code == 200:
print("Response:", response.json())
else:
print(f"Error {response.status_code}: {response.text}")
This verifies that the model is correctly served and can generate responses as expected.
Streaming Test :
import requests
url = "http://localhost:8000/v2/models/ensemble/generate_stream"
payload = {
"text_input": "Tell me a short joke about llamas",
"parameters": {
"max_tokens": 128,
"stop_words": [""],
"stream": True
}
}
# Send a POST request with streaming enabled
response = requests.post(url, json=payload, stream=True)
if response.status_code == 200:
# Process the streaming response
for line in response.iter_lines():
if line:
# Print or process each line of the stream
print(line.decode('utf-8'))
else:
print(f"Error {response.status_code}: {response.text}")
Setting an OpenAI-compatible API
To integrate TensorRT-LLM with OpenAI-compatible APIs, we first clone the openai_trtllm repository, which provides a bridge between OpenAI’s API format and the Triton Inference Server.
git clone --recursive https://github.com/npuichigo/openai_trtllm.git
Ensure Rust is installed, then build the project using Cargo.
cargo run --release
This generates the necessary binary files for running the API server.
Running the OpenAI-Compatible API
After building the source code, we run the openai_trtllm binary to start the API server. We configure the server with options such as host, port, Triton endpoint...
./target/release/openai_trtllm --help
Usage: openai_trtllm [OPTIONS]
Options:
-H, --host <HOST>
Host to bind to [default: 0.0.0.0]
-p, --port <PORT>
Port to bind to [default: 3000]
-t, --triton-endpoint <TRITON_ENDPOINT>
Triton gRPC endpoint [default: http://localhost:8001]
-o, --otlp-endpoint <OTLP_ENDPOINT>
Endpoint of OpenTelemetry collector
--history-template <HISTORY_TEMPLATE>
Template for converting OpenAI message history to prompt
--history-template-file <HISTORY_TEMPLATE_FILE>
File containing the history template string
--api-key <API_KEY>
Api Key to access the server
-h, --help
Print help
The openai_trtllm --help command displays available options for customizing the server's behavior.
Testing the OpenAI-Compatible API
./target/release/openai_trtllm
Openai Client :
import pprint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="test")
result = client.completions.create(
model="ensemble",
prompt="Say this is a test",
)
pprint.pprint(result)
Streaming Test :
from sys import stdout
from openai import OpenAI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="test")
response = client.completions.create(
model="ensemble",
prompt="This is a story of a hero who went",
stream=True,
max_tokens=50,
)
for event in response:
if not isinstance(event, dict):
event = event.model_dump()
event_text = event["choices"][0]["text"]
stdout.write(event_text)
stdout.flush()
Langchain ChatOpenAI :
from langchain.chat_models import ChatOpenAI
from langchain.schema.messages import HumanMessage, SystemMessage
chat = ChatOpenAI(openai_api_base="http://localhost:3000/v1",
openai_api_key="test", model_name="ensemble",
max_tokens=100)
messages = [
SystemMessage(content="You're a helpful assistant"),
HumanMessage(content="What is the purpose of model regularization?"),
]
result = chat.invoke(messages)
print(result.content)
Deploying TensorRT-LLM with Triton Inference Server and setting up an OpenAI-compatible API enables efficient, high-performance inference for large language models. By leveraging in-flight batching and optimizing model configurations, you can achieve significant improvements in throughput and latency. The integration with Triton Inference Server allows for seamless model serving, while the OpenAI-compatible API facilitates easy integration with existing applications. This setup not only enhances performance but also provides flexibility in deploying and interacting with advanced AI models locally or in production environments.