
How To Deploy NVIDIA Triton Inference Server on GCP

A step-by-step guide to deploying NVIDIA Triton Inference Server on Google Cloud with Debian and a T4 GPU, covering drivers, storage, container toolkit setup, and inference validation.

Published: October 22, 2025 | Reading time: 8 min | Author: Pavel Gulin

We build AI systems that process complex financial and accounting documents for SMEs. Some of those workloads depend on multiple deep learning models for OCR, classification, and semantic understanding, which means inference speed and reproducibility matter.

As part of that work, we tested NVIDIA Triton Inference Server on a Google Cloud VM running Debian with a T4 GPU. This guide walks through the setup we used and explains why each step matters for stability, performance, and repeatability.

Confirm the GPU and operating system

Start by verifying that the VM can see the NVIDIA hardware and that the operating system is what you expect:

lspci | grep -i nvidia
grep PRETTY_NAME /etc/os-release

This gives you two useful checks immediately:

the T4 GPU is visible to the VM
the Debian version is clear before installing kernel headers and drivers
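For scripted provisioning, the same two checks can be turned into a guard that warns before any packages are installed. This is a minimal sketch; the function names `os_id` and `has_nvidia_device` are illustrative and not part of any tool.

```shell
# Hypothetical helpers: the function names are illustrative only.

# Print the ID field (e.g. "debian") from an os-release style file.
os_id() (
  [ -r "$1" ] || exit 1
  # Source the file inside the subshell so its variables do not leak out.
  . "$1" && printf '%s\n' "$ID"
)

# Succeed if the lspci output read from stdin mentions an NVIDIA device.
has_nvidia_device() {
  grep -qi nvidia
}

# Guarded so the sketch is harmless on machines without lspci.
if command -v lspci >/dev/null 2>&1; then
  [ "$(os_id /etc/os-release)" = "debian" ] || echo "warning: not Debian" >&2
  lspci | has_nvidia_device || echo "warning: no NVIDIA device visible" >&2
fi
```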

On GCP, I prefer the cloud-optimized kernel for better compatibility with GPU drivers:

sudo apt update
sudo apt install -y linux-image-cloud-amd64 linux-headers-cloud-amd64
sudo reboot

After the reboot, the VM runs the cloud kernel with matching headers installed, which the driver installer needs in order to build its kernel module.
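It is worth confirming that the VM actually booted the cloud kernel before running the driver installer. A minimal sketch, assuming Debian's cloud kernel releases carry `-cloud-` in the `uname -r` string (for example `6.1.0-18-cloud-amd64`); the function name is illustrative.

```shell
# Succeed if the given kernel release string looks like a Debian cloud kernel.
is_cloud_kernel() {
  case "$1" in
    *-cloud-*) return 0 ;;
    *) return 1 ;;
  esac
}

release=$(uname -r)
if is_cloud_kernel "$release"; then
  echo "running cloud kernel: $release"
else
  echo "not a cloud kernel: $release" >&2
fi

# The headers package normally provides this path for the running kernel.
[ -d "/lib/modules/$release/build" ] || echo "headers for $release not found" >&2
```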

Install NVIDIA drivers with an explicit version

For inference workloads, I prefer version pinning over a generic package install. That makes it easier to reproduce the same environment across development, testing, and production.

curl -O https://storage.googleapis.com/nvidia-drivers-us-public/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
chmod +x NVIDIA-Linux-x86_64-550.90.12.run
sudo ./NVIDIA-Linux-x86_64-550.90.12.run --silent

The --silent flag is especially useful when the setup needs to be automated for image baking or repeated environment provisioning.
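Pinning only helps if the installed version is verified afterwards. The sketch below keeps the version in one variable, rebuilds the download URL from it, and compares it with what `nvidia-smi` reports. The variable and function names are illustrative.

```shell
DRIVER_VERSION=550.90.12

# Build the download URL used in this guide from the pinned version.
driver_url() {
  printf 'https://storage.googleapis.com/nvidia-drivers-us-public/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

# Succeed if the pinned version ($1) equals the reported version ($2).
driver_matches() {
  [ "$1" = "$2" ]
}

driver_url "$DRIVER_VERSION"

# Guarded so the check only runs where the driver is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  driver_matches "$DRIVER_VERSION" "$installed" \
    || echo "driver mismatch: wanted $DRIVER_VERSION, found $installed" >&2
fi
```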

Mount separate storage for models and runtime data

Model repositories, logs, and intermediate artifacts can grow quickly. Instead of overloading the root disk, it is often cleaner and cheaper to attach a separate persistent disk.

That approach helps in three ways:

storage can be expanded independently from the VM
the disk can be detached or replaced without rebuilding the instance
model data survives instance recreation more cleanly

Here is the basic setup pattern:

lsblk                        # confirm the new disk really is /dev/sdb
sudo mkfs.ext4 /dev/sdb      # destroys any existing data on the disk
sudo mkdir -p /mnt/data
sudo mount /dev/sdb /mnt/data
sudo rsync -av /home/ /mnt/data/home/
sudo rsync -av /var/lib/ /mnt/data/var_lib/

Then add persistent mounts in /etc/fstab:

/dev/sdb  /mnt/data  ext4  defaults  0 0
/mnt/data/home     /home     none   bind  0 0
/mnt/data/var_lib  /var/lib  none   bind  0 0

This layout is useful when model repositories or datasets need to scale independently from the boot volume.
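Device names such as /dev/sdb are not guaranteed to stay the same across reboots or after more disks are attached, so the first fstab entry can reference the filesystem UUID instead. A minimal sketch, assuming the data disk is still /dev/sdb when the script runs. The function name, the `nofail` option, and the fsck pass value of 2 are additions here, not part of the setup described above.

```shell
# Print an fstab line that mounts an ext4 filesystem by UUID.
# nofail lets the VM finish booting even if the data disk is missing.
fstab_entry() {
  printf 'UUID=%s  %s  ext4  defaults,nofail  0 2\n' "$1" "$2"
}

# Run as root on the VM; the lookup is skipped anywhere else.
if [ "$(id -u)" -eq 0 ] && [ -b /dev/sdb ]; then
  uuid=$(blkid -s UUID -o value /dev/sdb) && fstab_entry "$uuid" /mnt/data
fi
```

After editing /etc/fstab, `sudo findmnt --verify` can catch syntax mistakes before the next reboot.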

Install the NVIDIA Container Toolkit

Triton is usually run in a container, so Docker needs GPU access through the current NVIDIA Container Toolkit rather than the deprecated nvidia-docker2 package. The commands below assume Docker Engine is already installed on the VM.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get install -y \
  nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Validate GPU access from inside Docker:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If the T4 appears correctly there, the container runtime is ready for GPU-backed workloads.
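If that command fails, a quick first check is whether `nvidia-ctk runtime configure` actually registered the runtime in Docker's daemon configuration. A minimal sketch, assuming the default config path /etc/docker/daemon.json; the function name is illustrative and the grep is a crude text match, not a JSON parser.

```shell
# Succeed if the given Docker daemon config mentions an "nvidia" runtime.
has_nvidia_runtime() {
  [ -r "$1" ] && grep -q '"nvidia"' "$1"
}

if has_nvidia_runtime /etc/docker/daemon.json; then
  echo "nvidia runtime is registered with Docker"
else
  echo "nvidia runtime not found in /etc/docker/daemon.json" >&2
fi
```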

Run Triton with explicit container versions

I prefer explicit image tags instead of floating versions so that upgrades happen intentionally.

docker pull nvcr.io/nvidia/tritonserver:25.09-py3
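The run command that follows mounts a host directory as the model repository. Triton's documented layout is one subdirectory per model, each containing at least one numeric version directory. The sketch below lists the models that satisfy that layout, so an empty or mis-mounted path is caught before the server starts. The function name is illustrative. The path looks like a checkout of the triton-inference-server/server repository with its example models already downloaded, which is an assumption here.

```shell
# Print the name of every model directory under $1 that contains
# at least one numeric version subdirectory (e.g. inception_onnx/1).
list_models() (
  repo=$1
  for model in "$repo"/*/; do
    [ -d "$model" ] || continue
    for version in "$model"*/; do
      name=$(basename "$version")
      case "$name" in
        ''|*[!0-9]*) continue ;;   # skip anything that is not all digits
      esac
      [ -d "$version" ] || continue
      basename "$model"
      break
    done
  done
)

# Path taken from the run command in this guide.
MODEL_REPO=/home/user/server/docs/examples/model_repository
if [ -d "$MODEL_REPO" ]; then
  list_models "$MODEL_REPO"
else
  echo "model repository not found: $MODEL_REPO" >&2
fi
```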

Run the server with a mapped model repository:

docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /home/user/server/docs/examples/model_repository:/models \
  nvcr.io/nvidia/tritonserver:25.09-py3 tritonserver --model-repository=/models

The most important parameters here are:

--shm-size=1g enlarges /dev/shm inside the container (Docker's default is 64 MB), which some backends use for inter-process communication
--ulimit memlock=-1 removes the limit on locked (pinned) memory
--ulimit stack=67108864 sets the stack size limit to 64 MiB
ports 8000, 8001, and 8002 expose HTTP/REST, gRPC, and Prometheus metrics

Once the container starts, confirm readiness. An HTTP 200 response means the server is ready to accept inference requests:

curl -v localhost:8000/v2/health/ready
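The server can take a while to load models, so a single probe right after `docker run` may fail even though nothing is wrong. A small retry loop is a common workaround. This is a sketch: `wait_until` and `triton_ready` are illustrative names, and the retry count and delay are arbitrary.

```shell
# Retry a command up to $1 times, sleeping RETRY_DELAY seconds between tries.
wait_until() (
  tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    "$@" && exit 0
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] && sleep "${RETRY_DELAY:-2}"
  done
  exit 1
)

# Succeed once Triton's readiness endpoint answers with HTTP 200.
triton_ready() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/v2/health/ready)" = "200" ]
}

# Example call on the VM: wait up to about a minute for the server.
# wait_until 30 triton_ready && echo "Triton is ready"
```

Keeping the probe separate from the retry logic means the same loop can wait on other conditions, such as a per-model readiness check.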

Test inference with the Triton SDK container

To validate the setup end to end, use NVIDIA's SDK image, which includes the client tools.

docker pull nvcr.io/nvidia/tritonserver:25.09-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.09-py3-sdk

Inside that container, test inference with a sample model:

/workspace/install/bin/image_client -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
python /workspace/install/python/image_client.py -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
exit

That gives you a quick end-to-end validation path for both the serving layer and the GPU runtime.

Lessons learned

This setup worked well, but a few practices mattered more than others:

pin both driver and container versions for reproducibility
use a dedicated disk for model repositories and related runtime data
prefer the modern NVIDIA Container Toolkit over deprecated runtime patterns
expose Triton metrics on port 8002 so monitoring can plug into Prometheus and Grafana cleanly
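Before wiring up Prometheus, the metrics endpoint can be spot-checked from the shell. The sketch below sums one counter across its label sets from the Prometheus text format. `sum_metric` is an illustrative name; `nv_inference_request_success` is one of the counters Triton exposes.

```shell
# Sum the values of metric $1 across all label sets, reading the
# Prometheus text format from stdin. Comment lines (# HELP, # TYPE) are skipped.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }
    {
      metric = $1
      sub(/\{.*/, "", metric)      # drop the label part after the name
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Guarded: only query the endpoint if something answers on port 8002.
if curl -s -o /dev/null --max-time 2 localhost:8002/metrics; then
  curl -s localhost:8002/metrics | sum_metric nv_inference_request_success
fi
```

A count that increases after running the image client is a simple sign that requests are reaching the server and being recorded.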

Final recommendation

If you are deploying Triton for serious experimentation or early production workloads, the most important thing is not just getting the container to run once. It is creating a setup that you can rebuild consistently, observe properly, and scale without surprises. On GCP, that usually means explicit versions, clean storage separation, and a GPU runtime configuration you can verify step by step.