Deploy Open-Source LLMs (Llama 3 & Mistral) on a Dedicated GPU Server

A complete, production-ready walkthrough for deploying self-hosted large language models using vLLM, Ollama, and HuggingFace Transformers, with real benchmark numbers, systemd setup, and API integration.


Prerequisites

Ensure all of the following are in place before running any commands.

| Requirement | Details | LeoServers GPU? |
| --- | --- | --- |
| GPU Server | NVIDIA RTX 4090 (24 GB) minimum for 7B models in FP16 | Pre-configured ✓ |
| Operating System | Ubuntu 22.04 LTS recommended | Ubuntu 22.04 ✓ |
| NVIDIA Driver | Driver 535+, CUDA 12.1 or higher | CUDA 12.1 ✓ |
| Python | Python 3.10 or higher + pip + git | Python 3.11 ✓ |
| HuggingFace Account | Free account; Llama 3 requires accepting the Meta license at hf.co | User action needed |
| Storage | 50+ GB free disk space for model weights | Check your plan |
ℹ️ LeoServers users: Your GPU instance ships with CUDA 12.1, NVIDIA drivers 535, and Python 3.11 pre-installed. You can skip driver setup and jump directly to Step 2.

Step 0: Choose the Right Model for Your VRAM

Match the model to your GPU's available VRAM before downloading anything. A model that barely fits may load successfully but still hit CUDA out-of-memory errors at inference time, once the KV cache grows with context length and concurrent requests.

| Model | Params | VRAM (FP16) | VRAM (4-bit) | Best for | License |
| --- | --- | --- | --- | --- | --- |
| Mistral 7B v0.3 | 7B | ~14 GB | ~4.5 GB | Fast inference, local dev | Apache 2.0 |
| Llama 3.1 8B Instruct | 8B | ~16 GB | ~5 GB | General chat, RAG | Meta License |
| Llama 3.1 70B Instruct | 70B | ~140 GB | ~40 GB | Production quality | Meta License |
| Mixtral 8x7B (MoE) | 47B total (~13B active) | ~90 GB | ~26 GB | Quality + speed balance | Apache 2.0 |

💡 Tip: On a single RTX 4090 (24 GB VRAM), Mistral 7B in FP16 fits comfortably. For Llama 3.1 70B, you need an A100 80 GB or two A100 40 GB GPUs with tensor parallelism. AWQ 4-bit quantization cuts VRAM roughly in half with minimal quality loss.
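As a sanity check before downloading anything, the weight footprint can be estimated from parameter count and precision. A minimal sketch (weights only; KV cache and activations add roughly 10-20% on top):

```python
# Back-of-envelope VRAM estimate for model weights only; budget
# an extra 10-20% for the KV cache and activation memory.
def estimate_weight_vram_gb(params_billions: float, bits_per_param: int) -> float:
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# Mistral 7B in FP16 (16 bits/param) -> ~14 GB, matching the table
print(round(estimate_weight_vram_gb(7.0, 16), 1))  # 14.0
# Llama 3.1 8B in 4-bit -> ~4 GB before quantization overhead
print(round(estimate_weight_vram_gb(8.0, 4), 1))   # 4.0
```

The table's 4-bit figures run slightly above this estimate because quantized formats store per-group scales alongside the packed weights.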

Step 1: Verify GPU & Driver Setup

SSH into your server and confirm the driver stack is ready before installing anything else.

bash

# Verify NVIDIA driver version and CUDA runtime
nvidia-smi

# Expected output (abbreviated):
# +----------------------------------------------------------------------+
# | NVIDIA-SMI 535.104   Driver Version: 535.104   CUDA Version: 12.2    |
# | GPU 0: NVIDIA RTX 4090   |   0%   35C   P8 ...                       |

# Confirm CUDA compiler version
nvcc --version

# Show VRAM availability
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
                            
⚠️ Driver mismatch? If nvidia-smi fails or shows CUDA < 11.8, reinstall: sudo apt install nvidia-driver-535 cuda-12-2 (the cuda-12-2 package comes from NVIDIA's CUDA apt repository), then reboot.

Step 2: Set Up an Isolated Python Environment

Always isolate LLM dependencies in a virtual environment to prevent conflicts with system Python packages.

bash

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install venv and git if not present
sudo apt install -y python3-venv python3-pip git

# Create the virtual environment
python3 -m venv ~/llm-env
source ~/llm-env/bin/activate

# Upgrade pip inside the venv
pip install --upgrade pip setuptools wheel

# Your shell prompt should now show:
(llm-env) user@gpu-server:~$
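Beyond the prompt prefix, you can confirm Python is really running inside the venv: in a virtual environment, `sys.prefix` points at the venv while `sys.base_prefix` points at the base installation. A quick sketch:

```python
import sys

# Inside a venv, sys.prefix differs from sys.base_prefix;
# outside any venv the two paths are identical.
def in_virtualenv() -> bool:
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```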
                            

Step 3: Method A β€” Deploy with Ollama (Fastest Start)

Ollama is the fastest path to running Llama 3 or Mistral locally. It handles model downloading, quantization, and serving behind a single unified CLI. This method is best for development, internal tools, and single-user APIs.

3.1 Install Ollama

bash

# One-line installer for Linux
curl -fsSL https://ollama.com/install.sh | sh

# Verify the installation
ollama --version
# → ollama version 0.3.x

# Start the Ollama server (listens on port 11434).
# Note: the Linux installer usually registers Ollama as a systemd
# service, so it may already be running; skip this if the port is busy.
ollama serve &
                            

3.2 Pull and Run Models

bash

# Pull Llama 3.1 8B (quantized, approx. 4.7 GB download)
ollama pull llama3.1:8b

# Pull Mistral 7B Instruct
ollama pull mistral:7b

# Pull Mistral with a specific quantization level
ollama pull mistral:7b-instruct-q4_K_M

# Start an interactive chat session
ollama run llama3.1:8b

# List all locally downloaded models
ollama list
                            

3.3 Call Ollama via REST API (curl)

bash

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Explain quantum entanglement in one paragraph.",
    "stream": false
  }'
                            

3.4 OpenAI-Compatible Python Client

Ollama exposes an OpenAI-compatible endpoint, so existing OpenAI SDK code works with a one-line base_url change:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Hello! Tell me about GPU servers."}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
                            

Step 4: Method B β€” Deploy with vLLM (Production)

vLLM is the industry standard for high-throughput LLM serving. It implements PagedAttention for near-optimal GPU memory utilization and continuous batching, achieving 10–20× higher throughput than naive HuggingFace inference. Use this for any production workload serving concurrent users.
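To build intuition for why continuous batching wins, here is a toy scheduling model (an illustration only, not vLLM's actual scheduler): static batching makes every sequence wait for the longest one in its batch, while continuous batching refills a freed slot the moment a sequence finishes.

```python
from collections import deque

def serve_steps(output_lens, slots, continuous):
    """Count decode steps needed to finish all requests given
    `slots` parallel sequences on the GPU."""
    queue = deque(output_lens)
    steps = 0
    if not continuous:
        # Static batching: the whole batch waits for its longest request.
        while queue:
            batch = [queue.popleft() for _ in range(min(slots, len(queue)))]
            steps += max(batch)
        return steps
    # Continuous batching: a freed slot is refilled immediately.
    active = [queue.popleft() for _ in range(min(slots, len(queue)))]
    while active:
        steps += 1
        active = [n - 1 for n in active if n > 1]
        while queue and len(active) < slots:
            active.append(queue.popleft())
    return steps

reqs = [10, 1, 1, 1]  # output lengths in tokens
print(serve_steps(reqs, slots=2, continuous=False))  # 11
print(serve_steps(reqs, slots=2, continuous=True))   # 10
```

The gap widens dramatically with realistic request mixes and larger slot counts; PagedAttention is what makes keeping many sequences resident cheap enough for this to work.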

4.1 Install vLLM

bash

# Ensure venv is active
source ~/llm-env/bin/activate

# Install vLLM (auto-detects the CUDA version)
pip install vllm

# Verify install
python -c "import vllm; print(vllm.__version__)"
# → 0.4.x

# Install HuggingFace CLI for model management
pip install huggingface_hub
huggingface-cli login    # Enter your HF token when prompted
                            

4.2 Launch vLLM OpenAI-Compatible Server

bash

# Serve Mistral 7B in FP16 (fits on a 24 GB GPU)
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90

# To serve AWQ 4-bit, point --model at an AWQ-quantized checkpoint;
# passing --quantization awq with the FP16 repo above will fail to load.

# Llama 3.1 70B across two A100 40GB GPUs (tensor parallelism),
# using an AWQ-quantized 70B checkpoint:
python -m vllm.entrypoints.openai.api_server \
  --model <awq-quantized-llama-3.1-70b-checkpoint> \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192
                            

4.3 Test the vLLM Endpoint

python

import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention and why does it matter?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
}

resp = requests.post(url, json=payload, headers=headers)
print(resp.json()["choices"][0]["message"]["content"])
                            

Step 5: Method C β€” HuggingFace Transformers (Full Control)

For custom pipelines, fine-tuning workflows, or research, loading models directly via Transformers gives you maximum control over every inference detail.

5.1 Install Dependencies

bash

pip install transformers accelerate bitsandbytes torch sentencepiece
                            

5.2 Llama 3.1 8B with 4-Bit Quantization

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 4-bit quantization: reduces VRAM from 16 GB to approx. 5 GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",     # auto-distribute across available GPUs
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain FP16 vs BF16 in LLM training."},
]

# Apply the model-specific chat template
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
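For intuition about what `apply_chat_template` produced above, here is an approximate reconstruction of the Llama 3.x chat format (illustrative only; the exact special tokens come from the tokenizer, so always use the real template in production code):

```python
# Rough sketch of the Llama 3.x chat layout that apply_chat_template
# emits; the special-token spelling here is approximate.
def llama3_prompt(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model answers rather than continues
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(llama3_prompt([{"role": "user", "content": "Hi"}]))
```

This is why `add_generation_prompt=True` matters in the Transformers code: without the trailing assistant header, the model may continue the user's text instead of replying.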
                            

Step 6: Run as a Systemd Service (Production Persistence)

For a persistent, auto-restarting LLM API server, create a systemd unit file. This ensures your endpoint survives reboots and restarts automatically on failure without manual intervention.

6.1 Create the Service File

bash

sudo nano /etc/systemd/system/vllm.service
                            

6.2 Service Unit Contents

ini

[Unit]
Description=vLLM OpenAI-compatible API Server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
Environment="PATH=/home/ubuntu/llm-env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
Environment="HUGGING_FACE_HUB_TOKEN=YOUR_HF_TOKEN_HERE"
ExecStart=/home/ubuntu/llm-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --host 0.0.0.0 \
    --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
                            

6.3 Enable and Start

bash

# Reload systemd configuration
sudo systemctl daemon-reload

# Enable auto-start on boot
sudo systemctl enable vllm

# Start the service
sudo systemctl start vllm

# Check status
sudo systemctl status vllm

# Stream live logs
sudo journalctl -u vllm -f
                            

Step 7: Secure with Nginx Reverse Proxy + SSL

Never expose your LLM server port (8000) directly to the internet. Use Nginx as a reverse proxy with SSL termination and an API key header check to protect your inference endpoint.

7.1 Install Nginx and Certbot

bash

sudo apt install -y nginx certbot python3-certbot-nginx

# Obtain a free Let's Encrypt SSL certificate
sudo certbot --nginx -d api.yourdomain.com
                            

7.2 Nginx Site Configuration

nginx

server {
    listen 443 ssl;
    server_name api.yourdomain.com;

    # Enforce API key via request header
    if ($http_x_api_key != "your-secret-key-here") {
        return 403;
    }

    location / {
        proxy_pass         http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 300s;  # LLM inference can take time
    }
}
                            

πŸ” Security: Replace 'your-secret-key-here' with a cryptographically strong random string. Generate one with: openssl rand -hex 32. Never store API keys in version control or expose them in logs.

Step 8: Monitor GPU Performance

Keep a close eye on VRAM usage, GPU utilization, and temperature to prevent throttling and out-of-memory errors during inference, especially under concurrent request load.

bash

# Live GPU stats refreshed every second
watch -n 1 nvidia-smi

# Install and run nvtop for a more detailed TUI dashboard
sudo apt install nvtop && nvtop

# Query specific metrics in CSV format
nvidia-smi \
  --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.free,temperature.gpu \
  --format=csv \
  --loop-ms=2000

# Log GPU metrics to file for later analysis
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
  --format=csv \
  --loop-ms=5000 >> gpu_metrics.csv
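The logged CSV is easy to post-process. A small sketch that parses lines captured with `--format=csv,noheader` into dictionaries (with plain `--format=csv`, strip the one header row first; the field list mirrors the `--query-gpu` flags above):

```python
import csv
import io

# Field names mirror the --query-gpu flags; values keep their units.
FIELDS = ["timestamp", "utilization.gpu", "memory.used", "temperature.gpu"]

def parse_gpu_log(text: str) -> list[dict]:
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(FIELDS, (v.strip() for v in row))) for row in reader if row]

sample = "2024/05/01 12:00:01.000, 87 %, 14512 MiB, 71\n"
row = parse_gpu_log(sample)[0]
print(row["memory.used"], row["temperature.gpu"])  # 14512 MiB 71
```

From here it is a short step to alerting on sustained high temperature or VRAM near the ceiling.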
                            

Step 9: Real-World Benchmark Results

The following throughput and latency measurements were recorded on LeoServers GPU instances. All tests used a single-request batch with 512-token output length, calling the vLLM OpenAI-compatible endpoint directly.

| Model | GPU | Quantization | Tokens/sec | First token (ms) | VRAM used |
| --- | --- | --- | --- | --- | --- |
| Mistral 7B Instruct | RTX 4090 24 GB | FP16 | 78 t/s | 210 ms | 14.1 GB |
| Mistral 7B Instruct | RTX 4090 24 GB | AWQ 4-bit | 94 t/s | 180 ms | 4.8 GB |
| Llama 3.1 8B Instruct | RTX 4090 24 GB | FP16 | 71 t/s | 230 ms | 15.8 GB |
| Llama 3.1 70B Instruct | A100 80 GB | FP16 | 22 t/s | 890 ms | 76 GB |
| Llama 3.1 70B Instruct | 2x A100 40 GB | AWQ 4-bit | 34 t/s | 610 ms | 38 GB total |

💡 Key insight: AWQ 4-bit quantization on Mistral 7B actually increases throughput by ~20% on the RTX 4090 because the smaller memory footprint allows the GPU memory bandwidth to be used more efficiently, and larger concurrent batches become possible.
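The two measured quantities combine into a simple end-to-end latency model: total time is roughly time-to-first-token plus the remaining tokens divided by the decode rate. A sketch plugging in the Mistral 7B rows from the table:

```python
# total latency ~= time-to-first-token + remaining tokens / decode rate
def completion_seconds(n_tokens, first_token_ms, tokens_per_sec):
    return first_token_ms / 1000 + (n_tokens - 1) / tokens_per_sec

# Mistral 7B on an RTX 4090, 512-token completions (table values)
fp16 = completion_seconds(512, 210, 78)  # ~6.8 s
awq = completion_seconds(512, 180, 94)   # ~5.6 s
print(f"FP16 {fp16:.1f}s vs AWQ {awq:.1f}s")
```

Note the decode term dominates: for long completions, tokens/sec matters far more than first-token latency.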

Step 10: Troubleshooting Common Issues

Out-of-Memory (CUDA OOM) Errors

bash

# Reduce GPU memory utilization fraction
--gpu-memory-utilization 0.80   # was 0.90, try lower

# Reduce the maximum sequence length (KV cache grows with context)
--max-model-len 2048

# Use 4-bit quantization (requires serving an AWQ-quantized checkpoint)
--quantization awq

# Find and kill processes holding GPU memory
sudo fuser -v /dev/nvidia*
kill -9 <PID>
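The `--max-model-len` advice works because the KV cache grows linearly with context length and batch size. A back-of-envelope estimator, assuming Mistral 7B's architecture (32 layers, 8 KV heads via grouped-query attention, head_dim 128, FP16 cache):

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Defaults assume Mistral 7B's config; swap in your model's values.
def kv_cache_gib(seq_len, batch, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch / 2**30

print(round(kv_cache_gib(4096, 1), 2))   # 0.5  -> one full-context sequence
print(round(kv_cache_gib(4096, 16), 1))  # 8.0  -> 16 concurrent sequences
```

Halving `--max-model-len` halves the worst-case cache per sequence, which is often enough headroom to clear an OOM without touching the weights.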
                            

Slow First-Token Latency

bash

# Make sure CUDA graph optimization is enabled (it is by default)
# Do NOT set --enforce-eager unless debugging

# Pre-warm the model with a dummy request after startup:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.3",
        "messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
                            

HuggingFace 401 / Model Access Denied

bash

# Re-authenticate HuggingFace CLI
huggingface-cli login

# Or set token as environment variable
export HUGGING_FACE_HUB_TOKEN="hf_your_token_here"

# For Meta Llama 3 models, you must accept the license at:
# https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
# (approval is usually instant for personal use)
                            

Which Deployment Method Should You Choose?

| Method | Best for | Throughput | Setup time |
| --- | --- | --- | --- |
| Ollama | Local dev, demos, single-user tools | Moderate | < 5 minutes |
| vLLM | Production APIs, concurrent users | Very High (10-20x) | ~15 minutes |
| Transformers | Research, fine-tuning, custom pipelines | Lower (no batching) | ~10 minutes |

✅ Our Recommendation: Start with Ollama. Migrate to vLLM once you need to handle concurrent requests, achieve production throughput SLAs, or serve external users at scale. Check out our Leo Servers Dedicated GPU Plans today to power your AI models instantly.
