How To Deploy NVIDIA Triton Inference Server on GCP
A step-by-step guide to deploying NVIDIA Triton Inference Server on Google Cloud with Debian and a T4 GPU, covering drivers, storage, container toolkit setup, and inference validation.
We build AI systems that process complex financial and accounting documents for SMEs. Some of those workloads depend on multiple deep learning models for OCR, classification, and semantic understanding, which means inference speed and reproducibility matter.
As part of that work, we tested NVIDIA Triton Inference Server on a Google Cloud VM running Debian with a T4 GPU. This guide walks through the setup we used and explains why each step matters for stability, performance, and repeatability.
Confirm the GPU and operating system
Start by verifying that the VM can see the NVIDIA hardware and that the operating system is what you expect:
lspci | grep -i nvidia
cat /etc/os-release | grep PRETTY_NAME
This gives you two useful checks immediately:
- the T4 GPU is visible to the VM
- the Debian version is clear before installing kernel headers and drivers
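The two checks above can be wrapped into a small preflight script for use in provisioning. This is a minimal sketch: the function names `os_pretty_name` and `gpu_visible` are hypothetical helpers, not part of any NVIDIA or GCP tooling, and the script assumes `/etc/os-release` follows the standard shell-compatible format.

```shell
# Print the PRETTY_NAME value from an os-release style file.
# The file is shell-compatible, so it is sourced in a subshell
# to avoid leaking its variables into the caller.
os_pretty_name() {
    local file="${1:-/etc/os-release}"
    ( . "$file" && printf '%s\n' "${PRETTY_NAME:-}" )
}

# Succeed only if lspci reports at least one NVIDIA device.
gpu_visible() {
    lspci | grep -qi nvidia
}

# Run each check only where its input or tool is available.
if [ -r /etc/os-release ]; then
    echo "OS: $(os_pretty_name)"
fi
if command -v lspci >/dev/null 2>&1; then
    if gpu_visible; then
        echo "GPU: visible"
    else
        echo "GPU: no NVIDIA device found" >&2
    fi
fi
```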
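On GCP, I prefer Debian's cloud kernel, installed together with its matching headers, because the NVIDIA installer builds its kernel module against the headers of the running kernel:
sudo apt update
sudo apt install -y linux-image-cloud-amd64 linux-headers-cloud-amd64
sudo reboot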
After the reboot, the machine is ready for driver installation on a stable kernel base.
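Before running the driver installer, it can help to confirm that the VM actually booted into the cloud kernel. The sketch below assumes Debian's naming convention, where cloud kernel releases contain `-cloud-`; `is_cloud_kernel` is a hypothetical helper name.

```shell
# Succeed if a kernel release string names a Debian cloud kernel.
# With no argument, the running kernel (uname -r) is checked.
is_cloud_kernel() {
    local release="${1:-$(uname -r)}"
    case "$release" in
        *-cloud-*) return 0 ;;
        *)         return 1 ;;
    esac
}

if is_cloud_kernel; then
    echo "cloud kernel active: $(uname -r)"
else
    echo "running kernel is not a cloud kernel: $(uname -r)" >&2
fi
```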
Install NVIDIA drivers with an explicit version
For inference workloads, I prefer version pinning over a generic package install. That makes it easier to reproduce the same environment across development, testing, and production.
curl -O https://storage.googleapis.com/nvidia-drivers-us-public/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
chmod +x NVIDIA-Linux-x86_64-550.90.12.run
sudo ./NVIDIA-Linux-x86_64-550.90.12.run --silent
The --silent flag is especially useful when the setup needs to be automated for image baking or repeated environment provisioning.
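For image baking, the pinned version can live in a single variable so the download URL and the post-install check stay in sync. This is a sketch under stated assumptions: `DRIVER_VERSION`, `driver_url`, and `driver_matches` are illustrative names, and the URL pattern is simply the one used above with the version substituted.

```shell
# Pinned driver version; override via the environment if needed.
DRIVER_VERSION="${DRIVER_VERSION:-550.90.12}"

# Build the download URL for a given driver version.
driver_url() {
    local version="$1"
    printf 'https://storage.googleapis.com/nvidia-drivers-us-public/tesla/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$version" "$version"
}

# Compare the version reported by nvidia-smi with the expected one.
driver_matches() {
    local expected="$1" actual
    actual="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    [ "$actual" = "$expected" ]
}

echo "pinned driver URL: $(driver_url "$DRIVER_VERSION")"

# Only verify where a driver is already installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if driver_matches "$DRIVER_VERSION"; then
        echo "driver $DRIVER_VERSION is installed"
    else
        echo "installed driver differs from $DRIVER_VERSION" >&2
    fi
fi
```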
Mount separate storage for models and runtime data
Model repositories, logs, and intermediate artifacts can grow quickly. Instead of overloading the root disk, it is often cleaner and cheaper to attach a separate persistent disk.
That approach helps in three ways:
- storage can be expanded independently from the VM
- the disk can be detached or replaced without rebuilding the instance
- model data survives instance recreation more cleanly
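Here is the basic setup pattern. Confirm the device name with lsblk first, because mkfs.ext4 erases whatever is on the target disk:
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/data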
sudo mount /dev/sdb /mnt/data
sudo rsync -av /home/ /mnt/data/home/
sudo rsync -av /var/lib/ /mnt/data/var_lib/
Then add persistent mounts in /etc/fstab:
/dev/sdb /mnt/data ext4 defaults 0 0
/mnt/data/home /home none bind 0 0
/mnt/data/var_lib /var/lib none bind 0 0
This layout is useful when model repositories or datasets need to scale independently from the boot volume.
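Device names such as /dev/sdb are not guaranteed to stay the same across reboots or disk reattachments, so one possible refinement is to reference the filesystem by UUID in /etc/fstab. The sketch below is an assumption-laden variant, not the setup tested above: `fstab_entry` and `DATA_DEVICE` are hypothetical names, `nofail` is added so the VM still boots if the disk is missing, and the final field is set to 2 so fsck checks the disk after the root filesystem.

```shell
# Build an fstab line that mounts an ext4 filesystem by UUID.
fstab_entry() {
    local uuid="$1" mountpoint="$2"
    printf 'UUID=%s %s ext4 defaults,nofail 0 2\n' "$uuid" "$mountpoint"
}

# Assumption: the data disk is the same device used in the steps above.
DATA_DEVICE="${DATA_DEVICE:-/dev/sdb}"

# Print the suggested entry only when the device exists.
# blkid may need root privileges to read the UUID.
if command -v blkid >/dev/null 2>&1 && [ -b "$DATA_DEVICE" ]; then
    uuid="$(blkid -s UUID -o value "$DATA_DEVICE")"
    if [ -n "$uuid" ]; then
        fstab_entry "$uuid" /mnt/data
    fi
fi
```

The script only prints the line; reviewing it before appending it to /etc/fstab keeps a typo from blocking the next boot.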
Install the NVIDIA Container Toolkit
Triton is usually run in a container, so Docker needs GPU access through the current NVIDIA Container Toolkit stack rather than older nvidia-docker2 patterns.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Validate GPU access from inside Docker:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
If the T4 appears correctly there, the container runtime is ready for GPU-backed workloads.
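If the container check fails, one thing worth inspecting is whether `nvidia-ctk runtime configure` actually registered the runtime with Docker. The following sketch assumes the default config location /etc/docker/daemon.json; `has_nvidia_runtime` is a hypothetical helper that only looks for the runtime name in that file.

```shell
# Succeed if the Docker daemon config mentions an "nvidia" runtime.
has_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    [ -r "$config" ] && grep -q '"nvidia"' "$config"
}

if has_nvidia_runtime; then
    echo "nvidia runtime is registered with Docker"
else
    echo "nvidia runtime not found in /etc/docker/daemon.json" >&2
fi
```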
Run Triton with explicit container versions
I prefer explicit image tags instead of floating versions so that upgrades happen intentionally.
docker pull nvcr.io/nvidia/tritonserver:25.09-py3
Run the server with a mapped model repository:
docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
-p8000:8000 -p8001:8001 -p8002:8002 \
-v /home/user/server/docs/examples/model_repository:/models \
nvcr.io/nvidia/tritonserver:25.09-py3 tritonserver --model-repository=/models
The most important parameters here are:
- --shm-size=1g for frameworks that need more shared memory
- --ulimit memlock=-1 to reduce memory locking issues during inference
- --ulimit stack=67108864 for deeper model stacks
- ports 8000, 8001, and 8002 for REST, gRPC, and metrics
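Triton only loads models from directories that follow its repository layout: one directory per model, each containing at least one numeric version subdirectory. A quick way to see which models in the mounted path meet that rule is sketched below. `list_models` and `MODEL_REPO` are hypothetical names, and the check covers directory structure only, not the model files or config.pbtxt.

```shell
# List model directories that contain at least one numeric version subdirectory.
list_models() {
    local repo="$1" model_dir version_dir
    for model_dir in "$repo"/*/; do
        [ -d "$model_dir" ] || continue
        for version_dir in "$model_dir"[0-9]*/; do
            if [ -d "$version_dir" ]; then
                basename "$model_dir"
                break
            fi
        done
    done
}

# Host path mapped to /models in the docker run command above.
MODEL_REPO="${MODEL_REPO:-/home/user/server/docs/examples/model_repository}"
if [ -d "$MODEL_REPO" ]; then
    list_models "$MODEL_REPO"
fi
```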
Once the container starts, confirm readiness:
curl -v localhost:8000/v2/health/ready
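Model loading can take a while, so a single curl call right after startup may fail even on a healthy server. In automated setups, a small polling loop is a common pattern. The sketch below uses hypothetical names (`wait_until`, `triton_ready`, `TRITON_URL`) and relies on the readiness endpoint returning a non-success HTTP status until the server is ready, which `curl -f` turns into a non-zero exit code.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Pause before the next attempt.
        sleep "$delay"
    done
    return 1
}

# Succeed when Triton's HTTP readiness endpoint answers with a success status.
triton_ready() {
    curl -fsS -o /dev/null "${TRITON_URL:-http://localhost:8000}/v2/health/ready"
}
```

For example, `wait_until 30 2 triton_ready && echo "Triton is ready"` polls every two seconds for up to 30 attempts.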
Test inference with the Triton SDK container
To validate the setup end to end, use NVIDIA's SDK image, which includes the client tools.
docker pull nvcr.io/nvidia/tritonserver:25.09-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:25.09-py3-sdk
Inside that container, test inference with a sample model:
/workspace/install/bin/image_client -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
python /workspace/install/python/image_client.py -m inception_onnx -s INCEPTION /workspace/images/mug.jpg
exit
That gives you a quick end-to-end validation path for both the serving layer and the GPU runtime.
Lessons learned
This setup worked well, but a few practices mattered more than others:
- pin both driver and container versions for reproducibility
- use a dedicated disk for model repositories and related runtime data
- prefer the modern NVIDIA Container Toolkit over deprecated runtime patterns
- expose Triton metrics on port 8002 so monitoring can plug into Prometheus and Grafana cleanly
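Before wiring up Prometheus, the metrics endpoint can be spot-checked from the shell. The sketch below filters the Prometheus text format for a single metric. `metric_samples` is a hypothetical helper, and `nv_inference_request_success` is used as an example of a metric Triton exposes per model.

```shell
# Print the sample lines for one metric from Prometheus text on stdin,
# skipping the "# HELP" and "# TYPE" comment lines.
# A missing metric produces no output rather than an error.
metric_samples() {
    local name="$1"
    grep -E "^${name}(\{| )" || true
}
```

With the server running, `curl -s localhost:8002/metrics | metric_samples nv_inference_request_success` shows the successful request count for each loaded model.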
Final recommendation
If you are deploying Triton for serious experimentation or early production workloads, the most important thing is not just getting the container to run once. It is creating a setup that you can rebuild consistently, observe properly, and scale without surprises. On GCP, that usually means explicit versions, clean storage separation, and a GPU runtime configuration you can verify step by step.