I run all of my production systems -- SCAFU, Nuculair, AURA, and supporting services -- on a Docker Swarm cluster across 3 bare-metal nodes. Not Kubernetes. Not managed cloud containers. Docker Swarm on machines I control. This is a deliberate architectural choice that I have revisited quarterly for two years, and every time the calculus comes out the same: for AI-heavy workloads at my scale, Swarm provides better GPU utilization, simpler operations, and lower cost than any alternative. Let me explain why, and how the rest of the stack fits together.

Why AI Workloads Break Traditional Infrastructure Assumptions

Traditional web applications have a predictable resource profile. CPU utilization scales linearly with request count. Memory usage is relatively stable. Response times are measured in single- or double-digit milliseconds. Horizontal scaling (add more instances) is the universal answer to capacity problems. Infrastructure orchestration tools were designed for this profile.

AI workloads violate every one of these assumptions:

GPU as the primary resource constraint. A single LLM inference call on a 7B parameter model running through Ollama consumes 6-8GB of VRAM for 2-15 seconds. During that time, no other inference can use those memory blocks. GPU memory is the bottleneck, not CPU or network bandwidth. Traditional orchestration tools that schedule based on CPU and memory have no concept of VRAM allocation.

Highly variable request duration. A simple LLM classification task completes in 800ms. A complex multi-step reasoning task takes 12-18 seconds. This 15-22x variance in request duration makes load balancing fundamentally harder. Round-robin distribution, the default for most load balancers, produces severe imbalance when one request takes 800ms and the next takes 15 seconds.
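The imbalance is easy to demonstrate. Here is a minimal sketch (not the production balancer) comparing round-robin against a least-outstanding-work policy on a synthetic mix of fast and slow requests, using the 800ms/15s spread from above:

```python
def assign(durations_ms, n_backends, policy):
    """Return per-backend total busy time (ms) under a given policy."""
    load = [0] * n_backends
    for i, duration in enumerate(durations_ms):
        if policy == "round_robin":
            target = i % n_backends
        else:  # "least_loaded": send to the backend with the least queued work
            target = load.index(min(load))
        load[target] += duration
    return load

# Alternating fast/slow traffic: round-robin piles every slow request onto
# the same backend, while least-loaded spreads the work nearly evenly.
traffic = [800, 15000] * 10
print(assign(traffic, 2, "round_robin"))   # [8000, 150000] -- severe imbalance
print(assign(traffic, 2, "least_loaded"))  # near-even split
```

With strictly alternating traffic, round-robin ends up with one backend doing almost 19x the work of the other, which is why the routing layer described later tracks outstanding work rather than rotating blindly.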

Model loading overhead. Loading a 7B parameter model into VRAM takes 3-8 seconds. Loading a 13B model takes 8-15 seconds. If your orchestration system treats model-serving containers as stateless and freely restarts them, you pay this loading penalty repeatedly. A Kubernetes pod restart that takes 200ms for a traditional web service takes 10-15 seconds for a model-serving container. Rolling deployments that take 30 seconds for a web application take 5 minutes for model infrastructure.

Memory residency matters. Once a model is loaded into VRAM, inference is fast. The critical optimization is keeping hot models resident and only swapping when necessary. This is the opposite of the stateless paradigm that cloud-native infrastructure assumes. Your model-serving infrastructure is essentially a caching layer where the cache is a multi-gigabyte neural network in GPU memory.

Docker Swarm vs. Kubernetes for AI Workloads

I evaluated both extensively. Here is the honest comparison at my scale (3 nodes, 4 GPUs, approximately 20 services).

Kubernetes advantages: Better ecosystem for GPU scheduling with the NVIDIA device plugin and time-slicing support. More mature monitoring and observability tooling. Better suited for large teams with dedicated DevOps engineers. Superior for multi-tenant environments where different teams share GPU resources.

Docker Swarm advantages at my scale: Dramatically simpler configuration. A complete Swarm cluster with GPU support requires approximately 200 lines of configuration. The equivalent Kubernetes setup with NVIDIA device plugin, GPU operator, and custom scheduler requires 1,200-1,500 lines across dozens of YAML files. GPU pass-through in Swarm is a generic-resources entry in the Docker daemon config plus a one-line reservation per service; in Kubernetes it requires operator installation plus resource quota configuration. Swarm's overlay networking is simpler and has lower overhead for service-to-service communication (approximately 0.3ms additional latency vs. 0.8-1.2ms for kube-proxy in iptables mode). Finally, I can understand the entire Swarm codebase. When something breaks at 3 AM, I can diagnose it. Kubernetes has a complexity surface area that requires dedicated expertise to operate reliably.
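For the curious, this is roughly what the Swarm side of that comparison looks like. A sketch, not a drop-in config: the image tag and GPU UUID are placeholders, and the daemon also needs the NVIDIA container runtime configured on each GPU node.

```yaml
# On each GPU node, advertise the GPU in /etc/docker/daemon.json:
#   "node-generic-resources": ["NVIDIA-GPU=GPU-<uuid>"]
# Then reserve it per service in the stack file:
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "NVIDIA-GPU"
                value: 1
```

That is essentially the whole scheduling story: Swarm places the service only on nodes advertising an unclaimed `NVIDIA-GPU` resource.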

The honest trade-off: Kubernetes would give me better GPU time-slicing, which means multiple models could share a single GPU more efficiently. Swarm's GPU allocation is all-or-nothing per container. I compensate by running multiple models within a single Ollama instance rather than one model per container, which works but is less isolated. If I were running 10+ nodes with 20+ GPUs, Kubernetes would be the clear choice. At 3 nodes with 4 GPUs, Swarm's simplicity wins.

Ollama in Production: What Works and What Hurts

Ollama has become the de facto standard for running open-weight models locally, and for good reason. The developer experience is excellent: ollama pull llama3:8b and you have a model running with an OpenAI-compatible API endpoint in under a minute. But running Ollama in production at scale has revealed several operational realities that the getting-started guides do not cover.

What Works Exceptionally Well

Model management. Ollama's model library and Modelfile system are genuinely excellent for managing multiple model versions across environments. I run 6 different models simultaneously (ranging from 1.5B parameter models for classification tasks to 13B models for complex reasoning), and Ollama handles model loading, unloading, and VRAM management automatically. The hot/warm/cold model hierarchy means frequently used models stay in VRAM while less-used models are paged out gracefully.

API compatibility. The OpenAI-compatible API means switching between local and cloud models requires changing a base URL and model name. No other code changes. This is crucial for the dual-model architecture I described in the architecture deep-dive, where security-sensitive operations use local models and non-sensitive operations can optionally use cloud models.
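To make the switch concrete, here is a minimal sketch of the idea (endpoints and model names are illustrative, not my exact configuration). The chat-completions payload is identical for both backends; only the base URL and model name differ:

```python
# Illustrative backend table: same OpenAI-style payload, different endpoints.
BACKENDS = {
    "local": {"base_url": "http://ollama:11434/v1", "model": "llama3:8b"},
    "cloud": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
}

def build_request(backend, prompt):
    """Assemble a chat-completions request for the chosen backend."""
    cfg = BACKENDS[backend]
    return {
        "url": f"{cfg['base_url']}/chat/completions",
        "json": {
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
    }

print(build_request("local", "hi")["url"])
# http://ollama:11434/v1/chat/completions
```

Because Ollama exposes an OpenAI-compatible `/v1` endpoint, the rest of the client code never needs to know which backend answered.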

Quantization support. Running a 13B parameter model in 4-bit quantization (Q4_K_M) requires approximately 8GB of VRAM and produces output quality that is within 5-8% of full-precision inference for most tasks. This means a single RTX 4090 (24GB VRAM) can simultaneously serve a 13B model for complex reasoning and a 7B model for routine classification, with room to spare for a small embedding model.
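The VRAM arithmetic behind that claim is simple. A back-of-the-envelope estimator (a rough rule of thumb, not Ollama's exact accounting): 4-bit quantization stores roughly half a byte per parameter, plus a fixed overhead for KV cache, activations, and CUDA context.

```python
def vram_gb(params_billion, bits_per_weight=4, overhead_gb=1.5):
    """Rough VRAM estimate: weight storage plus a flat overhead allowance."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at N bits
    return weights_gb + overhead_gb

print(round(vram_gb(13), 1))  # ~8.0 GB for a 13B Q4 model
print(round(vram_gb(7), 1))   # ~5.0 GB for a 7B Q4 model
```

Those two together come to roughly 13GB, which is how a 24GB RTX 4090 can host both with headroom left for an embedding model.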

What Hurts in Production

Concurrent request handling. Ollama processes requests sequentially per model by default. If three requests arrive for the same model simultaneously, the second and third wait in queue. With parallel request support (the OLLAMA_NUM_PARALLEL setting), concurrent processing is possible but requires careful VRAM budgeting because each concurrent request needs its own KV cache allocation. On a 13B Q4 model, each concurrent request slot consumes approximately 1.2GB of additional VRAM. I cap concurrency at 3 per model to avoid VRAM pressure.
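The budgeting rule reduces to one line of arithmetic. A sketch using the numbers above (~8GB for the 13B Q4 weights, ~1.2GB of KV cache per parallel slot; the 14GB budget assumes ~10GB is reserved for the other resident models):

```python
def max_parallel(vram_budget_gb, model_gb, kv_per_slot_gb, hard_cap=3):
    """How many concurrent request slots fit in the leftover VRAM budget."""
    slots = int((vram_budget_gb - model_gb) // kv_per_slot_gb)
    return max(1, min(slots, hard_cap))  # at least 1, capped to avoid pressure

# 13B Q4 on a 24 GB card, with ~10 GB reserved for other models:
print(max_parallel(vram_budget_gb=14, model_gb=8, kv_per_slot_gb=1.2))  # -> 3
```

Five slots would technically fit, but the hard cap at 3 leaves slack for prompt-length spikes, since KV cache grows with context length.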

No built-in request routing. Ollama serves whatever model you specify in the API call. It has no concept of routing logic: send classification requests to the small model, send reasoning requests to the large model. I built a routing layer (a 300-line FastAPI service) that examines each incoming request, determines the appropriate model based on task type and complexity estimation, and forwards to the correct Ollama endpoint. This routing layer also implements retry logic, fallback chains (if the primary model is overloaded, try the secondary), and request queuing with priority levels.
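The core of that routing decision fits in a few lines. A minimal sketch of the idea (the model names and the 4,000-character threshold are illustrative, not my production values; the real service also handles retries and queuing):

```python
# Illustrative task-to-model table.
ROUTES = {
    "classification": "qwen2.5:3b",
    "security_analysis": "llama3:13b",
    "report_generation": "llama3:8b",
}

def select_model(task_type, prompt=""):
    """Pick a model by task type; very long prompts escalate to the big model."""
    model = ROUTES.get(task_type, "llama3:8b")  # default for unknown tasks
    if len(prompt) > 4000 and task_type != "classification":
        model = "llama3:13b"  # crude complexity estimate: long context
    return model

print(select_model("classification"))  # qwen2.5:3b
print(select_model("mystery_task"))    # llama3:8b (default)
```

Everything else in the routing layer (priority queues, fallback chains, metrics) hangs off this one decision.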

Health monitoring gaps. Ollama exposes minimal health information. You can check if the service is running and which models are loaded. You cannot easily determine VRAM utilization percentage, inference queue depth, average response time per model, or error rates. I supplement this with nvidia-smi polling (every 5 seconds), custom Prometheus metrics exported from the routing layer, and Grafana dashboards that visualize GPU utilization, inference latency percentiles, and queue depth over time.
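The nvidia-smi polling is just CSV parsing. A sketch of the parser (the query flags in the comment are real nvidia-smi options; the sample line stands in for live output so the logic is testable without a GPU):

```python
# Polled every 5 seconds via:
#   nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu \
#              --format=csv,noheader,nounits

def parse_gpu_line(line):
    """Turn one CSV row of nvidia-smi output into a metrics dict."""
    idx, used, total, util = [field.strip() for field in line.split(",")]
    return {
        "gpu": int(idx),
        "vram_pct": round(100 * int(used) / int(total), 1),
        "util_pct": int(util),
    }

sample = "0, 20480, 24576, 87"
print(parse_gpu_line(sample))  # {'gpu': 0, 'vram_pct': 83.3, 'util_pct': 87}
```

Each dict becomes a set of Prometheus gauge values, labeled by GPU index.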

Polyglot Persistence: Choosing the Right Database for Each Job

Running multiple databases is an operational burden. I run four: PostgreSQL, Neo4j, Redis, and ChromaDB. Each one exists because it is genuinely the best tool for a specific data pattern, not because I enjoy managing database clusters.

PostgreSQL handles all structured, transactional data: user accounts, scan configurations, audit logs, API key management, and application state. It is the source of truth for anything that needs ACID guarantees. Version 16 with JSONB columns provides enough schema flexibility for semi-structured data without needing a document database. Approximately 85% of all database reads and writes go to PostgreSQL.

Neo4j handles graph data exclusively for the Nuculair OSINT platform. Entity relationships, connection traversal, and pattern matching queries that would be multi-join nightmares in PostgreSQL are native Cypher operations. The Neo4j instance handles approximately 47 million relationship edges with sub-100ms query times for 3-hop traversals. Running Neo4j adds operational complexity (JVM tuning, page cache sizing, transaction log management), but the performance difference for graph queries is 40-100x faster than PostgreSQL equivalents.

Redis serves three purposes: session caching (with 24-hour TTL), inter-service message queuing (using Redis Streams), and the shared context store for the multi-agent coordination layer. Redis Streams replaced RabbitMQ in the stack six months ago because it eliminated an entire service while providing sufficient message ordering guarantees for the use cases. At peak load, Redis handles approximately 2,000 operations per second, well within single-instance capacity.

ChromaDB handles vector embeddings for semantic search across the OSINT knowledge base and for RAG-based document retrieval in the report generation pipeline. It stores approximately 2.8 million embedding vectors (1536 dimensions each, using OpenAI's embedding model for non-sensitive content and a local embedding model for sensitive content). Query latency for top-10 nearest neighbor search is 8-15ms, which is fast enough for interactive use.

The Routing Layer: Making Intelligent Decisions About Model Usage

The routing layer is a component I have not seen discussed much in AI infrastructure writing, but it is one of the most impactful optimizations in the stack. Every LLM request passes through a FastAPI service that makes three decisions:

Model selection. Based on the task type (extracted from a header or request body field), the router selects the appropriate model. Security analysis goes to the 13B model. Simple classification goes to the 3B model. Text generation for reports goes to either local or cloud, depending on content sensitivity. This routing reduces average inference cost by approximately 60% compared to sending everything to the largest model, with no measurable quality difference because small models perform equivalently on simple tasks.

Priority queuing. Requests are assigned priority levels: CRITICAL (security scan in progress, user waiting), HIGH (background scan processing), NORMAL (report generation), LOW (analytics and batch processing). When GPU capacity is saturated, lower-priority requests are queued while higher-priority requests proceed. This ensures that interactive scan sessions are never blocked by batch processing jobs.
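Python's heapq makes this pattern compact. A sketch (not the production queue): lower number means higher priority, and a monotonic counter preserves FIFO order within a priority level.

```python
import heapq
import itertools

PRIORITY = {"CRITICAL": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3}
_counter = itertools.count()  # tie-breaker: FIFO within a priority level
_queue = []

def enqueue(level, request):
    heapq.heappush(_queue, (PRIORITY[level], next(_counter), request))

def dequeue():
    _, _, request = heapq.heappop(_queue)
    return request

enqueue("LOW", "analytics batch")
enqueue("CRITICAL", "interactive scan")
enqueue("NORMAL", "report")
print(dequeue())  # interactive scan -- jumps ahead of earlier arrivals
```

When VRAM is saturated, the dispatcher simply stops popping below a priority threshold, which is all "queued while higher-priority requests proceed" means in practice.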

Failover logic. If the primary model for a task is unavailable (Ollama is restarting, VRAM is fully allocated), the router has a fallback chain for each task type. Security analysis falls back from 13B local to 7B local (never to cloud). Report generation falls back from local to cloud (content is pre-sanitized). Classification falls back from 3B local to a simple heuristic function that handles the most common cases without any LLM call. This failover logic reduced "LLM unavailable" errors from approximately 3% of requests to 0.1%.
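The fallback chains themselves are just ordered lists walked until something answers. A sketch (target names are illustrative; the production version distinguishes error types and records metrics per link):

```python
# Each task type tries its chain in order; "heuristic" stands in for the
# no-LLM fallback at the end of the classification chain.
FALLBACKS = {
    "security_analysis": ["local:13b", "local:7b"],   # never cloud
    "report_generation": ["local:8b", "cloud:gpt"],   # content pre-sanitized
    "classification":    ["local:3b", "heuristic"],
}

def run_with_fallback(task_type, attempt):
    """attempt(target) raises on failure; return the first success."""
    last_err = None
    for target in FALLBACKS[task_type]:
        try:
            return attempt(target)
        except RuntimeError as err:
            last_err = err  # this link is down; try the next one
    raise RuntimeError(f"all fallbacks exhausted for {task_type}") from last_err

# Simulate the 13B being unavailable: the chain lands on the 7B model.
def fake_attempt(target):
    if target == "local:13b":
        raise RuntimeError("VRAM fully allocated")
    return f"answered by {target}"

print(run_with_fallback("security_analysis", fake_attempt))
# answered by local:7b
```

The key property is that the security chain can never fall through to a cloud target, because cloud endpoints simply do not appear in its list.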

Monitoring AI Infrastructure: What to Measure

Standard infrastructure monitoring (CPU, memory, disk, network) is necessary but insufficient for AI workloads. The additional metrics I track, all visualized in Grafana dashboards:

  • GPU VRAM utilization (percentage, per-GPU): The primary capacity indicator. When average utilization exceeds 85%, response latency begins to increase as models compete for VRAM.
  • Inference latency percentiles (p50, p95, p99, per-model): p50 tells you typical performance. p99 tells you worst-case performance. A growing gap between p50 and p99 indicates queuing congestion.
  • Model load/unload frequency: If a model is being loaded and unloaded frequently, it means VRAM pressure is causing thrashing. Each load costs 3-15 seconds of latency for the first request after loading.
  • Tokens per second (generation speed, per-model): Degradation here indicates thermal throttling or VRAM fragmentation. A 13B Q4 model should generate at 35-45 tokens/second on an RTX 4090. Sustained drops below 25 indicate a problem.
  • Queue depth (per-priority-level): Tells you whether capacity is adequate for current demand. A growing queue at NORMAL priority while CRITICAL queue is empty is fine. A growing queue at CRITICAL priority requires immediate attention.
  • Routing decision distribution: What percentage of requests go to each model and each priority level. Unexpected shifts in distribution can indicate changes in workload patterns or routing misconfigurations.
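The p50/p99 gap check is worth spelling out, since it is the single most useful alert in the list. A sketch with synthetic latencies (the real numbers come from the routing layer's Prometheus histograms, and this nearest-rank percentile is a simplification of what Grafana computes):

```python
import random
import statistics

def percentile(values, p):
    """Nearest-rank percentile -- good enough for a dashboard alert rule."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

random.seed(42)
latencies = [random.gauss(900, 150) for _ in range(500)]    # healthy traffic
latencies += [random.gauss(9000, 1500) for _ in range(25)]  # queued stragglers

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.0f}ms p99={p99:.0f}ms ratio={p99 / p50:.1f}x")
```

A small tail of queued requests barely moves p50 but sends p99 an order of magnitude higher, which is exactly the congestion signature the dashboard alerts on.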

n8n Workflow Automation: The Glue Layer

The final architectural component worth discussing is n8n, the workflow automation platform that connects services that were not designed to talk to each other. I run a self-hosted n8n instance that handles approximately 40 automated workflows including:

Scan result processing: When a SCAFU scan completes, an n8n workflow picks up the results, formats them into Markdown reports, sends notification emails to configured recipients, pushes summary metrics to the monitoring stack, and archives raw results to long-term storage. This workflow replaced 200 lines of custom Python code with a visual pipeline that non-engineers can understand and modify.

Data source health checks: Every 30 minutes, an n8n workflow pings each of Nuculair's 312 data sources with a lightweight test query. Sources that fail are flagged in the health dashboard, and repeated failures trigger a notification to investigate. This workflow was the fastest to build (45 minutes from concept to production) and one of the highest-value in terms of operational awareness.

Cross-platform synchronization: When a new vulnerability finding is confirmed in SCAFU, n8n creates a corresponding entry in the project management system, updates the client-facing dashboard, and logs the finding in the compliance audit trail. Three systems updated in lockstep from a single trigger event.

n8n occupies a layer in the architecture that I call the "integration tier." It sits between the core services (which communicate through direct API calls and Redis Streams) and the operational services (monitoring, notification, archival, reporting) that need to react to events without being tightly coupled to the core. If n8n goes down, core functionality is unaffected. Notifications stop, reports are delayed, and health checks pause, but scans continue and data flows normally.

Building AI infrastructure in 2026 requires accepting that the rules have changed. GPU memory matters more than CPU. Stateful model serving matters more than stateless scaling. Intelligent routing matters more than round-robin load balancing. And operational simplicity -- choosing Docker Swarm over Kubernetes at the right scale, choosing n8n over custom code for integration logic -- is an architectural decision, not a concession. The goal is infrastructure that supports the work, not infrastructure that becomes the work.
