What You'll Learn
- Prerequisites
- Step 0: Choose the Right Model
- Step 1: Verify GPU & Driver Setup
- Step 2: Python Environment
- Step 3: Deploy with Ollama (Fastest)
- Step 4: Deploy with vLLM (Production)
- Step 5: HuggingFace Transformers
- Step 6: Systemd Service Setup
- Step 7: Nginx Reverse Proxy + SSL
- Step 8: Monitor Performance
- Step 9: Real-World Benchmarks
- Step 10: Troubleshooting
- Recommendation
Prerequisites
Ensure all of the following are in place before running any commands.
| Requirement | Details | LeoServers GPU? |
|---|---|---|
| GPU Server | NVIDIA RTX 4090 (24 GB) minimum for 7B models in FP16 | Pre-configured ✓ |
| Operating System | Ubuntu 22.04 LTS recommended | Ubuntu 22.04 ✓ |
| NVIDIA Driver | Driver 535+, CUDA 12.1 or higher | CUDA 12.1 ✓ |
| Python | Python 3.10 or higher + pip + git | Python 3.11 ✓ |
| HuggingFace Account | Free account; Llama 3 requires accepting Meta license at hf.co | User action needed |
| Storage | 50+ GB free disk space for model weights | Check your plan |
Step 0: Choose the Right Model for Your VRAM
Match the model to your GPU's available VRAM before downloading anything. A model that is too large for your VRAM triggers CUDA out-of-memory errors, sometimes only at inference time once the KV cache fills, rather than cleanly at startup.
| Model | Params | VRAM FP16 | VRAM 4-bit | Best for | License |
|---|---|---|---|---|---|
| Mistral 7B v0.3 | 7B | ~14 GB | ~4.5 GB | Fast inference, local dev | Apache 2.0 |
| Llama 3.1 8B Instruct | 8B | ~16 GB | ~5 GB | General chat, RAG | Meta License |
| Llama 3.1 70B Instruct | 70B | ~140 GB | ~40 GB | Production quality | Meta License |
| Mixtral 8x7B (MoE) | 47B total (~13B active) | ~90 GB | ~26 GB | Quality + speed balance | Apache 2.0 |
💡 Tip: On a single RTX 4090 (24 GB VRAM), Mistral 7B in FP16 fits comfortably. For Llama 3.1 70B, you need an A100 80 GB or two A100 40 GB GPUs with tensor parallelism. AWQ 4-bit quantization cuts VRAM roughly in half with minimal quality loss.
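The VRAM numbers in the table follow a simple rule of thumb: weights need roughly params × (bits / 8) bytes, plus headroom for the KV cache and activations. A minimal sketch of that calculation; the 20% overhead factor is an assumption for rough planning, not a measured constant:

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: model weights plus ~20% headroom for
    KV cache and activations (the overhead factor is an assumption)."""
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

# Mistral 7B in FP16: weights alone ~14 GB, ~16.8 GB with headroom
print(round(estimate_vram_gb(7, 16), 1))   # 16.8
# Llama 3.1 70B in 4-bit: ~42 GB with headroom
print(round(estimate_vram_gb(70, 4), 1))   # 42.0
```

Use this as a first filter only; the real ceiling also depends on context length and batch size.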
Step 1: Verify GPU & Driver Setup
SSH into your server and confirm the driver stack is ready before installing anything else.
```bash
# Verify NVIDIA driver version and CUDA runtime
nvidia-smi
# Expected output (abbreviated):
# +----------------------------------------------------------------------+
# | NVIDIA-SMI 535.104    Driver Version: 535.104    CUDA Version: 12.2  |
# | GPU 0: NVIDIA RTX 4090 | 0% 35C P8 ... |
# Confirm CUDA compiler version
nvcc --version
# Show VRAM availability
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```
If `nvidia-smi` fails or shows CUDA below 11.8, reinstall the driver stack with `sudo apt install nvidia-driver-535 cuda-12-2`, then reboot.
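Setup scripts can gate on available VRAM by parsing the CSV output of the `--query-gpu` command above. A minimal sketch; the field order matches the query flags used above (with `--format=csv,noheader`), and the sample line is illustrative:

```python
def parse_free_vram_mib(csv_line: str) -> int:
    """Extract free VRAM (MiB) from one line of:
    nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader
    """
    fields = [f.strip() for f in csv_line.split(",")]
    # fields: [name, "24564 MiB", "23010 MiB"] -> take the free-memory number
    return int(fields[2].split()[0])

line = "NVIDIA GeForce RTX 4090, 24564 MiB, 23010 MiB"
print(parse_free_vram_mib(line))  # 23010
```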
Step 2: Set up an Isolated Python Environment
Always isolate LLM dependencies in a virtual environment to prevent conflicts with system Python packages.
```bash
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install venv and git if not present
sudo apt install -y python3-venv python3-pip git
# Create the virtual environment
python3 -m venv ~/llm-env
source ~/llm-env/bin/activate
# Upgrade pip inside the venv
pip install --upgrade pip setuptools wheel
# Your shell prompt should now show:
# (llm-env) user@gpu-server:~$
```
Step 3: Method A – Deploy with Ollama (Fastest Start)
Ollama is the fastest path to running Llama 3 or Mistral locally. It handles model downloading, quantization, and serving behind a single unified CLI. This method is best for development, internal tools, and single-user APIs.
3.1 Install Ollama
```bash
# One-line installer for Linux
curl -fsSL https://ollama.com/install.sh | sh
# Verify the installation
ollama --version
# → ollama version 0.3.x
# Start the Ollama server (listens on port 11434)
ollama serve &
```
3.2 Pull and Run Models
```bash
# Pull Llama 3.1 8B (quantized, approx. 4.7 GB download)
ollama pull llama3.1:8b
# Pull Mistral 7B Instruct
ollama pull mistral:7b
# Pull Mistral with a specific quantization level
ollama pull mistral:7b-instruct-q4_K_M
# Start an interactive chat session
ollama run llama3.1:8b
# List all locally downloaded models
ollama list
```
3.3 Call Ollama via REST API (curl)
```bash
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Explain quantum entanglement in one paragraph.",
    "stream": false
  }'
```
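With `"stream": true` (Ollama's default), the response arrives as newline-delimited JSON, one object per line, each carrying a `"response"` text fragment. A minimal sketch of reassembling those fragments, demonstrated on hard-coded sample lines rather than a live server:

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' fragments from Ollama's streaming
    NDJSON output into the full completion text."""
    return "".join(
        json.loads(line).get("response", "")
        for line in ndjson_lines
        if line.strip()
    )

sample = [
    '{"model":"llama3.1:8b","response":"Quantum ","done":false}',
    '{"model":"llama3.1:8b","response":"entanglement...","done":true}',
]
print(join_stream(sample))  # Quantum entanglement...
```

In a real client you would iterate over the HTTP response line by line and emit each fragment as it arrives.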
3.4 OpenAI-Compatible Python Client
Ollama exposes an OpenAI-compatible endpoint, so existing OpenAI SDK code works with a one-line base_url change:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Hello! Tell me about GPU servers."}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
Step 4: Method B – Deploy with vLLM (Production)
vLLM is the industry standard for high-throughput LLM serving. It implements PagedAttention for near-optimal GPU memory utilization and continuous batching, achieving 10–20× higher throughput than naive HuggingFace inference. Use this for any production workload serving concurrent users.
4.1 Install vLLM
```bash
# Ensure venv is active
source ~/llm-env/bin/activate
# Install vLLM (auto-detects CUDA version)
pip install vllm
# Verify install
python -c "import vllm; print(vllm.__version__)"
# → 0.4.x
# Install HuggingFace CLI for model management
pip install huggingface_hub
huggingface-cli login  # Enter your HF token when prompted
```
4.2 Launch vLLM OpenAI-Compatible Server
```bash
# Serve Mistral 7B with AWQ 4-bit quantization
# Note: --quantization awq expects weights that are already AWQ-quantized;
# point --model at an AWQ checkpoint, or drop the flag to serve FP16.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --quantization awq \
  --max-model-len 4096 \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90

# For Llama 3.1 70B across two A100 40GB GPUs (tensor parallelism)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 8192
```
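The `--max-model-len` flag matters because the KV cache grows linearly with context length: each sequence stores a key and a value tensor per layer. For a model with grouped-query attention, per-sequence cache size is 2 × layers × KV heads × head dim × seq len × bytes per element. A quick sanity check using Mistral 7B's published config (32 layers, 8 KV heads, head dim 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Per-sequence KV cache size in GB: K and V tensors for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Mistral 7B (32 layers, 8 KV heads via GQA, head_dim 128), FP16 cache
print(kv_cache_gb(32, 8, 128, 4096))  # ~0.54 GB per 4096-token sequence
```

Halving `--max-model-len` halves this per-sequence cost, which is why it is one of the first knobs to turn when you hit CUDA OOM under concurrent load.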
4.3 Test the vLLM Endpoint
```python
import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention and why does it matter?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512
}
resp = requests.post(url, json=payload, headers=headers)
print(resp.json()["choices"][0]["message"]["content"])
```
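Production clients should also tolerate transient failures, such as the model still loading after a restart or brief overload. A minimal retry sketch with exponential backoff; `send` is a hypothetical injection point for whatever performs the HTTP call (e.g. a lambda wrapping `requests.post`):

```python
import time

def post_with_retry(send, payload, retries=3, backoff_s=1.0):
    """Call send(payload), retrying with exponential backoff on failure.
    `send` is any callable that raises on error (hypothetical injection
    point, e.g. lambda p: requests.post(url, json=p, timeout=300))."""
    for attempt in range(retries):
        try:
            return send(payload)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff_s * (2 ** attempt))
```

Injecting the transport as a callable keeps the retry logic trivially testable without a running server.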
Step 5: Method C – HuggingFace Transformers (Full Control)
For custom pipelines, fine-tuning workflows, or research, loading models directly via Transformers gives you maximum control over every inference detail.
5.1 Install Dependencies
```bash
pip install transformers accelerate bitsandbytes torch sentencepiece
```
5.2 Llama 3.1 8B with 4-Bit Quantization
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 4-bit quantization: reduces VRAM from 16 GB to approx. 5 GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # auto-distribute across available GPUs
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain FP16 vs BF16 in LLM training."},
]

# Apply the model-specific chat template
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
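The `temperature` and `top_p` arguments to `generate()` above work by reshaping the next-token probability distribution before sampling. A toy illustration of temperature scaling in plain Python (not the actual Transformers implementation):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature before softmax: values below 1.0
    sharpen the distribution, values above 1.0 flatten it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more diverse sampling
```

`top_p` then truncates this distribution to the smallest set of tokens whose cumulative probability exceeds the threshold (0.9 above) before renormalizing and sampling.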
Step 6: Run as a Systemd Service (Production Persistence)
For a persistent, auto-restarting LLM API server, create a systemd unit file. This ensures your endpoint survives reboots and restarts automatically on failure without manual intervention.
6.1 Create the Service File
```bash
sudo nano /etc/systemd/system/vllm.service
```
6.2 Service Unit Contents
```ini
[Unit]
Description=vLLM OpenAI-compatible API Server
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu
Environment="PATH=/home/ubuntu/llm-env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
Environment="HUGGING_FACE_HUB_TOKEN=YOUR_HF_TOKEN_HERE"
ExecStart=/home/ubuntu/llm-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --quantization awq \
    --host 0.0.0.0 \
    --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
6.3 Enable and Start
```bash
# Reload systemd configuration
sudo systemctl daemon-reload
# Enable auto-start on boot
sudo systemctl enable vllm
# Start the service
sudo systemctl start vllm
# Check status
sudo systemctl status vllm
# Stream live logs
sudo journalctl -u vllm -f
```
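Note that `systemctl status` reports the unit as active as soon as the process starts, which can be minutes before the model has finished loading into VRAM. Deployment scripts should therefore poll the endpoint itself. A minimal sketch; the curl URL in the usage comment assumes the host and port configured above:

```bash
wait_for_ready() {
  # Poll a command until it succeeds, or give up after N attempts.
  local cmd="$1" attempts="${2:-30}" delay="${3:-2}"
  local i
  for i in $(seq 1 "$attempts"); do
    if eval "$cmd" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    sleep "$delay"
  done
  echo "timeout"
  return 1
}

# Usage (assumes the vLLM host/port configured above):
# wait_for_ready "curl -sf http://localhost:8000/v1/models" 60 5
```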
Step 7: Secure with Nginx Reverse Proxy + SSL
Never expose your LLM server port (8000) directly to the internet. Use Nginx as a reverse proxy with SSL termination and an API key header check to protect your inference endpoint.
7.1 Install Nginx and Certbot
```bash
sudo apt install -y nginx certbot python3-certbot-nginx
# Obtain a free Let's Encrypt SSL certificate
sudo certbot --nginx -d api.yourdomain.com
```
7.2 Nginx Site Configuration
```nginx
server {
    listen 443 ssl;
    server_name api.yourdomain.com;

    # Enforce API key via request header
    if ($http_x_api_key != "your-secret-key-here") {
        return 403;
    }

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;  # LLM inference can take time
    }
}
```
🔒 Security: Replace 'your-secret-key-here' with a cryptographically strong random string. Generate one with `openssl rand -hex 32`. Never store API keys in version control or expose them in logs.
Step 8: Monitor GPU Performance
Keep a close eye on VRAM usage, GPU utilization, and temperature to prevent throttling and out-of-memory errors during inference, especially under concurrent request load.
```bash
# Live GPU stats refreshed every second
watch -n 1 nvidia-smi

# Install and run nvtop for a more detailed TUI dashboard
sudo apt install nvtop && nvtop

# Query specific metrics in CSV format
nvidia-smi \
  --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.free,temperature.gpu \
  --format=csv \
  --loop-ms=2000

# Log GPU metrics to file for later analysis
nvidia-smi \
  --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu \
  --format=csv \
  --loop-ms=5000 >> gpu_metrics.csv
```
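The `gpu_metrics.csv` log can then be post-processed, for example to find peak VRAM use during a load test. A minimal sketch; the header names follow nvidia-smi's CSV convention of appending units in brackets, and the sample text is illustrative:

```python
import csv
import io

def peak_memory_mib(csv_text: str) -> int:
    """Return the peak memory.used value (MiB) from nvidia-smi CSV logs."""
    rows = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(rows)]
    idx = header.index("memory.used [MiB]")
    # Values look like " 15350 MiB": take the leading number
    return max(int(r[idx].split()[0]) for r in rows if r)

sample = (
    "timestamp, utilization.gpu [%], memory.used [MiB], temperature.gpu\n"
    "2024/05/01 12:00:00, 87 %, 14100 MiB, 61\n"
    "2024/05/01 12:00:05, 92 %, 15350 MiB, 63\n"
)
print(peak_memory_mib(sample))  # 15350
```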
Step 9: Real-World Benchmark Results
The following throughput and latency measurements were recorded on LeoServers GPU instances. All tests used a single-request batch with 512-token output length, calling the vLLM OpenAI-compatible endpoint directly.
| Model | GPU | Quantization | Tokens/sec | First token (ms) | VRAM used |
|---|---|---|---|---|---|
| Mistral 7B Instruct | RTX 4090 24 GB | FP16 | 78 t/s | 210 ms | 14.1 GB |
| Mistral 7B Instruct | RTX 4090 24 GB | AWQ 4-bit | 94 t/s | 180 ms | 4.8 GB |
| Llama 3.1 8B Instruct | RTX 4090 24 GB | FP16 | 71 t/s | 230 ms | 15.8 GB |
| Llama 3.1 70B Instruct | A100 80 GB | FP16 | 22 t/s | 890 ms | 76 GB |
| Llama 3.1 70B Instruct | 2x A100 40 GB | AWQ 4-bit | 34 t/s | 610 ms | 38 GB total |
💡 Key insight: AWQ 4-bit quantization on Mistral 7B actually increases throughput by ~20% on the RTX 4090 because the smaller memory footprint allows the GPU memory bandwidth to be used more efficiently, and larger concurrent batches become possible.
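The table's two latency numbers combine into an end-to-end estimate: total response time ≈ time-to-first-token + output tokens / decode throughput. A quick sanity check using the Mistral 7B AWQ row above:

```python
def response_time_s(first_token_ms: float, out_tokens: int, tokens_per_s: float) -> float:
    """Approximate end-to-end latency: time-to-first-token plus decode time."""
    return first_token_ms / 1000 + out_tokens / tokens_per_s

# Mistral 7B AWQ on RTX 4090: 180 ms TTFT, 94 t/s decode, 512-token reply
print(round(response_time_s(180, 512, 94), 2))  # ~5.63 s
```

This is a single-request estimate; under concurrent load, continuous batching keeps aggregate throughput high while per-request decode speed drops.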
Step 10: Troubleshooting Common Issues
Out-of-Memory (CUDA OOM) Errors
```bash
# Reduce GPU memory utilization fraction
--gpu-memory-utilization 0.80   # was 0.90, try lower
# Reduce the maximum sequence length (KV cache grows with context)
--max-model-len 2048
# Force 4-bit quantization
--quantization awq

# Find and kill processes holding GPU memory
sudo fuser -v /dev/nvidia*
kill -9 <PID>
```
Slow First-Token Latency
```bash
# Make sure CUDA graph optimization is enabled (it is by default)
# Do NOT set --enforce-eager unless debugging

# Pre-warm the model with a dummy request after startup:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.3",
       "messages":[{"role":"user","content":"hi"}],"max_tokens":1}'
```
HuggingFace 401 / Model Access Denied
```bash
# Re-authenticate HuggingFace CLI
huggingface-cli login
# Or set token as environment variable
export HUGGING_FACE_HUB_TOKEN="hf_your_token_here"

# For Meta Llama 3 models, you must accept the license at:
# https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
# (approval is usually instant for personal use)
```
Which Deployment Method Should You Choose?
| Method | Best for | Throughput | Setup time |
|---|---|---|---|
| Ollama | Local dev, demos, single-user tools | Moderate | < 5 minutes |
| vLLM | Production APIs, concurrent users | Very High (10-20x) | ~15 minutes |
| Transformers | Research, fine-tuning, custom pipelines | Lower (no batching) | ~10 minutes |
✅ Our Recommendation: Start with Ollama. Migrate to vLLM once you need to handle concurrent requests, achieve production throughput SLAs, or serve external users at scale. Check out our Leo Servers Dedicated GPU Plans today to power your AI models instantly.
Discover Leo Servers Dedicated Server Locations
Leo Servers operates servers around the world, providing diverse options for hosting websites. Each region offers unique advantages, making it easier to choose the location that best suits your specific hosting needs.