How to Run a Private AI Coding Assistant on Your Own GPU Infrastructure

You want your developers to use AI coding assistants. You do not want your proprietary source code flowing through ChatGPT, Claude, or any other public AI API.

The solution: run an open-source coding model on infrastructure you control, connected to your developers through a private network.

This article explains how. It also explains where it gets complicated — because that part matters when you are deciding whether to build it yourself or hire someone who has done it before.

The architecture at a glance

A private AI coding assistant has five components:

Developer laptop
→ local coding assistant (e.g., OpenCode via ACP in IntelliJ)
→ local bridge (OpenAI-compatible Go client on localhost)
→ private mesh network (“VPN on Steroids” Headscale/Tailscale)
→ customer-controlled AI gateway (on your private VPN server)
→ GPU server or rented GPU pod
→ open-source coding model (e.g., Qwen 3.6 27B)

Plus private search indexes:

Proprietary source-code search
Confidential company documentation search
Technical-library documentation search
Redacted web search

All running on open-source software inside your private Tailscale network, available via MCP to your coding agent.

Let’s break down each component.

Component 1: The local coding assistant

Your developers need a tool on their laptop that can interact with the AI model. Options include:

OpenCode (via ACP in IntelliJ, or standalone)
Continue.dev (VS Code extension or for IntelliJ)
Any OpenAI-compatible client pointed at your local bridge

The key requirement: the client must support custom API endpoints. You point it at localhost:8080 (or whatever port your local bridge uses), not at api.openai.com.

What you need:

Install the coding assistant on each developer laptop
Configure it to point at the local bridge
Ensure developers understand they are using a private model, not ChatGPT

Where it gets tricky: Some assistants hardcode OpenAI endpoints or add friction when using custom endpoints. OpenCode and Continue.dev both support custom endpoints cleanly.

Component 2: The local bridge

The local bridge is a lightweight HTTP server running on the developer’s laptop. It receives requests from the coding assistant and forwards them through the private network to the AI gateway.

Why not connect directly to the GPU server? Two reasons:

Network isolation: The GPU server should not be directly accessible from developer laptops. The bridge routes traffic through the private mesh.
Request shaping: The bridge can add context, handle retries, and manage connection pooling.

A Go-based OpenAI-compatible bridge works well. It exposes localhost:8080/v1/chat/completions (or similar) and forwards requests to the AI gateway’s private Tailscale IP.

What you need:

A bridge binary for each developer platform (macOS, Linux, Windows)
Configuration pointing at the AI gateway’s Tailscale IP
Basic authentication or API key validation

Where it gets tricky: The bridge must handle streaming responses correctly. If the bridge buffers the entire response before sending it to the client, the developer experience is poor (long waits with no visible progress). Proper streaming support requires careful HTTP handling.

Component 3: The private mesh network

This is the component that makes the setup private. Instead of sending requests over the public internet to an OpenAI endpoint, requests travel through a private mesh VPN.

Tailscale (managed) or Headscale (self-hosted) both work. They create a private network with private IP addresses. Devices on the network can communicate with each other. Devices not on the network cannot see the traffic.

Architecture choice:

Option	Pros	Cons
Tailscale (managed)	Easier setup, stable, well-documented	Data plane goes through Tailscale’s coordination server (encrypted, but some teams prefer full self-hosting)
Headscale (self-hosted)	Full control, no third-party coordination server	More setup work, you manage the control plane

For most teams, Tailscale is fine. For teams with strict self-hosting requirements, Headscale is the answer.

What you need:

A Tailscale account (free for small teams) or a Headscale server
Install Tailscale/Headscale client on each developer laptop
Install Tailscale/Headscale on the AI gateway server
Install Tailscale/Headscale on the GPU server
Configure ACLs so developer laptops can reach the AI gateway but not the GPU server directly

Where it gets tricky: ACL configuration. You want developers to reach the AI gateway, the AI gateway to reach the GPU server, but developers should not reach the GPU server directly. Getting the ACL rules right requires understanding Tailscale’s policy syntax.

Component 4: The AI gateway

The AI gateway is a server (typically a small VPS) that sits between the developer laptops and the GPU server. It handles:

Request routing: Forward requests to the GPU server running the model
Private MCP server: Exposes tools to the coding assistant (code search, documentation search, web search)
SearXNG: Guarded internet search through a controlled path
Meilisearch: Private code and documentation indexes
Pod lease logic: Start/stop GPU pods on demand to save costs when not in use
Audit logging: Track usage patterns without logging code content

A small VPS (2 GiB RAM, 1 vCPU, around $12/month) is sufficient for the gateway. It does not run the model — it routes requests and serves indexes.

What you need:

A VPS running Linux
Tailscale/Headscale installed and connected to the private network
Docker or similar for running MCP, SearXNG, Meilisearch
Configuration for routing requests to the GPU server’s Tailscale IP

Where it gets tricky: The gateway is the orchestration layer. Getting MCP, search, indexing, and model routing to work together requires integration work. Each piece is simple. Making them all work as a coherent system takes time.

Component 5: The GPU server and model

This is where the actual AI inference happens. Options:

Option	Pros	Cons
Rented GPU pod (RunPod, etc.)	No hardware investment, scale up/down, start/stop on demand	Recurring cost, not “your” hardware, but in secure cloud
Your own GPU server	Full control, no recurring GPU cost (except electricity)	Upfront hardware cost, driver/CUDA/maintenance complexity

Model selection:

Qwen 3.6 27B is a strong private-coding candidate. It requires:

GPU: NVIDIA RTX 4090 (24 GB VRAM) or equivalent for comfortable inference
VRAM: 16 GB minimum, 24 GB recommended for full context length
Inference server: llama.cpp’s llama-server or vLLM for OpenAI-compatible serving

What you need:

GPU server with sufficient VRAM
CUDA drivers installed and working
llama.cpp or vLLM compiled and configured
Model weights downloaded (available from HuggingFace)
Tailscale/Headscale installed and connected to the private network
OpenAI-compatible API endpoint exposed on the private network

Where it gets tricky: This is where most DIY attempts stall.

CUDA version compatibility: The model, the inference server, and the GPU drivers must all agree on CUDA versions. They frequently do not.
Driver issues: NVIDIA drivers on Linux are notoriously finicky. A kernel update can break CUDA. A CUDA update can break the driver.
Memory management: Large models need careful memory allocation. OOM errors are common on first setup.
Quantization choices: Running a model in full precision (FP16) requires more VRAM than quantized versions (Q4_K_M, Q8_0). Choosing the right quantization affects quality and speed.
Context length: Longer context windows require more VRAM. Configuring context length for your GPU size requires calculation.

Component 6: Private codebase indexing (RAG)

Large language models are not trained on your private source code. They are weak on your internal architecture, proprietary libraries, and niche frameworks.

The solution: private retrieval-augmented generation (RAG). You index your codebase and documentation. When the assistant needs context, it searches the index first, then sends relevant snippets to the model.

What you need:

Meilisearch or similar search engine running on the AI gateway
Indexing pipeline for your repositories (clone, parse, chunk, index)
Indexing pipeline for your documentation (markdown, HTML, PDF)
MCP server exposing search tools to the coding assistant
Separation between customer code indexes and public/technical doc indexes

Where it gets tricky:

Chunking strategy: How do you split code into searchable chunks? Too large and the model gets irrelevant context. Too small and you lose structural information.
Index updates: Code changes. The index needs to stay current. Automated re-indexing on push is ideal but requires CI integration.
Scope boundaries: Which repositories get indexed? Which branches? How do you handle monorepos with multiple projects?

The full setup checklist

If you are building this yourself, here is what you need to do:

Set up Tailscale/Headscale network — install clients, configure ACLs
Provision AI gateway VPS — install Docker, build the gateway, Tailscale, configure routing
Provision GPU server — install CUDA, drivers, Tailscale
Download and configure model — choose model, download weights, configure llama.cpp/vLLM
Build local bridge — compile or download bridge binary, configure for each platform
Install coding assistant — configure OpenCode/Continue.dev on developer laptops
Set up codebase indexing — clone repos, build index, configure Meilisearch
Set up documentation indexing — parse docs, build index
Configure MCP server — expose search tools to coding assistant
Set up SearXNG — configure guarded web search
Test end-to-end — verify developer can query private codebase through assistant
Document the setup — write operating notes for your team
Handle edge cases — streaming, timeouts, error handling, reconnection

Estimated time for someone who has done it before: 2–3 days.

Estimated time for someone doing it the first time: 4-8 weeks, assuming no major CUDA/driver issues.

Where it gets complicated enough to hire help

Each individual component is well-documented. The complexity is in the integration:

CUDA/driver compatibility is the most common failure point. A version mismatch between your GPU driver, CUDA toolkit, and inference server can waste days.
Building the gateway that integrates indexes, redacted web search and secure communication between laptop and GPU requires proper software engineering
Network configuration requires understanding Tailscale ACLs, private IP routing, and firewall rules.
MCP integration is new. Documentation is sparse. Getting the coding assistant to properly use private search tools requires trial and error.
Streaming support must work end-to-end: from the model, through the gateway, through the bridge, to the client. A bug in any layer breaks the developer experience.
Codebase indexing requires decisions about chunking, scope, updates, and separation that affect quality.

If you have a platform engineer with GPU experience and two months to spare, build it yourself. The components are all open-source.

If you want a working setup in 5–7 days without the integration pain, that is what we do.

What a starter installation includes

Private AI Coding Assistant Starter Installation — from EUR 2,000

One bounded codebase
Private model route through customer-controlled infrastructure
One private codebase and one documentation index
Local developer assistant with private network connection
Redacted web-search path
Developer handover session and operating notes

What you provide:

A technical contact
Repository boundary
Two VPS (Headscale, AI gateway) or we guide you through provisioning
Read-only access or sanitized repository bundle or you upload it later yourself
Infrastructure decision: customer server or customer-controlled GPU pod
Allowed model budget and usage expectations

What is not included:

Production database access
Secret ingestion
Enterprise SSO
Guaranteed replacement for Claude/GPT in every task
Unlimited model benchmarking
Full managed operations, security and indexes update unless a support plan from 100 EUR per month is added

Timeline: 5–7 working days after access is confirmed.

The incentive argument (why this setup is structurally private)

The technical architecture matters. But the reason this setup is private is not just technical. It is economic.

Closed-source AI providers (ChatGPT, Claude, Gemini, Grok) have every incentive to log your traffic. Your code trains their next model. That is their business model.
GPU hosting providers have no incentive to log your traffic. They sell compute by the hour. They have no proprietary model to train. Your code is not their product.
Open-source models are already public. Nobody is extracting proprietary training data from your usage because the model weights are freely available.

This is incentive alignment. The GPU provider makes money from compute. The model is open. Your code stays yours.

The first step

If you want to build this yourself, this article gives you the architecture. Expect 2 months of integration work, with CUDA/driver issues as the most common blocker.

If you want a working private AI coding assistant in 5–7 days, the next step is a 20-minute fit call.

We check your code privacy constraints, current developer workflow, preferred infrastructure path, and whether a starter installation is realistic for your codebase.

Book a private AI Fit Call →