Local AI runtime notes

Running Qwen3.6 27B INT4 on a 5090

A practical field note on making a local vision-language Critic run on a 32 GB RTX 5090: quantization, vLLM startup, CUDA/Blackwell gotchas, structured output, and why the app must own the runtime instead of relying on terminal heroics.

AI Business Consultant Engineering Lead Local VLM deployment

Running a serious local vision-language model on consumer hardware is not a one-line install story. The model may fit on paper and still fail during CUDA graph profiling, FlashInfer JIT, KV-cache allocation, multimodal warmup, or co-residency with another heavy runtime.

The useful question is not just "can it start?" The useful question is whether it can become a product runtime: scripted, health-checked, reproducible, observable, and safe to run without a human watching Docker logs.

In this project, Qwen3.6 27B INT4 became the local Critic runtime for reviewing generated product concepts. The target machine was a single RTX 5090 with 32 GB VRAM. The working shape used vLLM, an AutoRound INT4 quant, and an app-managed runtime profile.

Model Qwen3.6 27B vision-language model, AutoRound INT4 quantization.

Runtime vLLM 0.23.0 with an OpenAI-compatible multimodal request path.

Hardware One RTX 5090 with 32 GB VRAM.

Observed VRAM About 28.5 to 29.0 GiB resident after boot in the working profile.

Why INT4 Was The Practical Path

A 27B vision-language model is large before runtime overhead enters the picture. Full precision is not the useful local path for a 32 GB card when the same machine also needs a web app, storage services, model caches, and image-generation workflows.

INT4 made the model single-card-friendly enough to test. The first smoke test used a smaller context window to validate startup. The app-managed profile then moved to a 32768-token context because Critic jobs need multi-image requests, prompt overhead, per-image JSON, and room for useful explanations.

That context length is not a trophy number. It is a measured operating point. The next larger setting should be tested only after measuring startup reliability, VRAM pressure, KV-cache behavior, and throughput on real Critic batches.

The Blackwell/CUDA Details Matter

The 5090 path needed a modern CUDA and PyTorch stack. In the working image, the runtime used Python 3.12, vLLM 0.23.0, torch 2.11 with CUDA 13 packages, and explicit CUDA path fixes for FlashInfer.

The failure mode was not subtle. FlashInfer JIT could not find a compatible CUDA compiler/runtime layout. The fix was not to hope the local machine happened to be patched correctly. The fix was to bake the path assumptions into a reproducible Docker image.

Runtime image requirements:
- CUDA runtime visible from Python packages
- CUDA library path available where JIT expects it
- libcudart symlinked consistently
- FlashInfer cache cleared after failed builds
- repo-owned Dockerfile updated when dependencies change

This is the point where local AI work often becomes fragile. A manually patched container can work today and be impossible to explain tomorrow. For a product runtime, the Docker image recipe is part of the system.

Do Not Source Dotenv Blindly

Authenticated model downloads mattered. The working path used a Hugging Face token and disabled the slower opaque download route. But there was an operational detail worth remembering: application dotenv files are not always valid shell scripts.

Values that are fine for an app dotenv parser can contain spaces or quoting that breaks `source .env.local`. A runtime script should read only the keys it needs with a parser or a narrow extraction pattern, and it should never print secrets into logs.

Good runtime habit:
- read only HF_TOKEN
- export HF_TOKEN and HUGGING_FACE_HUB_TOKEN
- avoid printing either value
- avoid source .env.local as a shell script

Startup Is A User Experience

A warm-cache startup still had several meaningful phases: container creation, model configuration, distributed initialization, FlashInfer sampler setup, checkpoint loading, compile cache work, engine profiling, multimodal warmup, and HTTP readiness.

On the local machine, those phases added up to several minutes. That is long enough for the UI to lie if it only shows "starting" or "idle." The product should surface the actual phase, elapsed time, and likely next milestone. Otherwise the operator interrupts work that is merely warming, or assumes a failed state while the model is still progressing.

A local model runtime is not done when a command works once. It is done when the app can start it, explain it, use it, stop it, and recover from it.

ComfyUI And Qwen Compete For The Same Card

The 5090 could not comfortably host the heavy image-generation runtime and the Qwen/vLLM Critic profile at the same time. The Qwen runtime alone occupied most of the card after boot. That makes scheduling a product requirement, not an operator preference.

The queue should own runtime transitions. If a Critic job needs Qwen, it may need to stop or unload ComfyUI first. If an image-generation job needs ComfyUI, it may need to stop Qwen and verify VRAM release before starting. The system should preserve healthy warm state when it matches the active profile, but recreate stale or unhealthy containers when the runtime no longer matches.

Structured Output Is Part Of The Runtime

The local Critic was not a chat demo. It had to produce decisions that the app could store and compare. That means OpenAI-compatible vision requests were only the transport. The important contract was structured output.

A Critic request should use schema-constrained output with the repo-owned JSON schema. Loose JSON prompting is not enough for unattended review jobs. If the runtime cannot support structured output, that is a compatibility problem to fix in the runtime image, not a reason to retry with a weaker prompt and pretend the run succeeded.

The run metadata should keep the raw response, parsed decision, schema version, image object keys, prompt profile, timings, usage, and errors. If parsing or image-number matching fails, the queue job should fail visibly and leave the evidence behind.

Chunking Is Not A Quality Fallback

Large Critic jobs need planning. The worker should estimate prompt overhead, visual tokens, output-token budget, JSON size, and safety margin before deciding how many images to review in one request.

If the full set is too large, deterministic chunking is the correct scripted path. It is not a degraded mode. It preserves complete batch coverage while keeping each model request inside the active hardware profile.

Critic batch plan:
- one queue job for the generation batch
- ordered chunks when needed
- chunk metadata stored with the run
- raw response preserved per failure
- no silent retry with looser output rules

The Practical Lesson

Local AI deployment is product engineering, not just model loading. Quantization made Qwen3.6 27B feasible on a 5090, but the real work was the runtime contract: reproducible image, explicit profile, visible startup phases, GPU scheduling, health checks, structured output, and durable failure metadata.

A single 5090 can do serious local AI work when the workflow respects the card. The mistake is treating one successful terminal run as the finish line. The finish line is a system the app can operate without hidden human judgment: start, check, run, parse, store, stop, and tell the truth when any part fails.