ColdVault← 9robots.ai

ColdVault Platform — Architecture

How encrypted AI inference works at 9robots. Confidential VMs, self-hosted open models, attested per request, with optional bypass to partner providers when encryption is not required.

Last updated: 2026-05-29 · Living document

This document covers the ColdVault Platform — the confidential-AI inference layer. The ColdVault React client at coldvault.ai is built on top of this platform; its UI architecture is intentionally out of scope here. Everything below describes what the platform itself does, what it sees, and what it never sees.

1. Overview

ColdVault Platform is a shared AI inference gateway. It accepts inference requests from any client product (including the ColdVault chat UI, our agentic tooling, and direct API users) and routes each request to the right model, in the right mode, with the right encryption posture.

Two inference modes coexist behind a single API:

Vault Mode — the request and response are processed end-to-end on hardware-encrypted infrastructure: AMD SEV-SNP Confidential VMs for orchestration and NVIDIA Confidential Computing GPUs for model execution. Plaintext exists only inside the silicon — never in RAM, on the bus, in HBM, or across the links between CPU and GPU. This is the mode designed for regulated workloads.

Partner Mode — the request is routed to an established proprietary provider. The payload is not encrypted from the model's perspective (the provider can see it). This mode is for workloads where the customer is fine with the provider seeing the content, or where only the proprietary provider has the capability required. Providers are not named here by design; mode-and-routing is declared per request.

Both modes can run inside a single multi-model debate — the platform routes each member of the panel to the right mode based on whether the model is one of ours (Vault) or a partner provider (Partner). The verdict comes back as one response.

2. The trust model — Confidential VMs

The platform's control plane runs inside AMD SEV-SNP Confidential VMs. SEV-SNP encrypts the VM's memory in hardware — RAM contents are encrypted with a key the hypervisor cannot see. Hardware integrity checks (page-table protection, reverse-map table) prevent the hypervisor or another tenant from tampering with the VM's memory without detection.

On the GPU side, NVIDIA Confidential Computing extends the same principle to HBM: tensor data inside the GPU is encrypted between the GPU and the rest of the system. Together, this means the only place where customer plaintext exists is inside the CPU and GPU silicon themselves. In RAM, on PCIe, across NVLink, on the wire — your data is never in plaintext.

The platform's TLS termination is consistent with this trust model: TLS is decrypted inside the Confidential VM, not at the load balancer. We explain why in the next section.

3. Network — why L4, not L7

A typical web service uses an L7 (application-layer) load balancer that terminates TLS at the LB and forwards plaintext HTTP to backends. This is convenient — the LB can do WAF, path routing, caching, request shaping — but it requires the LB to hold a TLS private key and to see plaintext requests in memory.

For confidential inference, that is the wrong tradeoff. The cloud-managed L7 LB runs outside our Confidential VM boundary, so any plaintext it sees is, by definition, plaintext that escaped the encrypted enclave. We do not want the LB to ever see the request.

We use an L4 Network Passthrough Load Balancer instead. The L4 LB forwards encrypted TCP bytes; it does not hold a TLS key, it does not parse HTTP, and it does not see plaintext. TLS terminates inside the Confidential VM, in an nginx process that lives in encrypted memory. The certificate is pulled from Secret Manager at VM boot, written to disk, and held inside the encrypted RAM for the lifetime of the VM.

Client
  │  HTTPS (TCP 443, encrypted bytes)
  ▼
L4 Network Passthrough LB        ← no TLS key, no plaintext, no HTTP parsing
  │  TCP forwarded as-is
  ▼
Confidential VM (AMD SEV-SNP)    ← encrypted memory, hardware integrity
  ├─ nginx :443                  ← TLS terminates HERE
  │    cert from Secret Manager
  │    proxy_pass 127.0.0.1:8080
  ▼
  uvicorn :8080                  ← FastAPI, inside the same encrypted VM
  │
  ▼
  Model providers
    ├─ self-hosted vLLM (Vault Mode — encrypted GPU TEE)
    └─ partner providers (Partner Mode — proprietary endpoint)

The tradeoff: we give up the convenience features of the L7 LB (managed WAF, path-level routing at the LB). We accept this as the cost of never having plaintext outside the encrypted enclave. WAF and DDoS posture move into our own configuration on the VM side, and into the L4 LB's native DDoS protection plane.

4. Compute fabric

The control plane runs on a Managed Instance Group (MIG) of AMD-EPYC-based Confidential VMs. Today the typical production instance is a small CPU-only VM (n2d-standard-4 class — 4 vCPU, 16 GB RAM), running our FastAPI orchestrator. The instance template carries the boot script, startup wiring, and the version of our container image to run. MIG autoscaling brings new VMs in under load and drains them when load falls. Auto-healing replaces any VM that fails health checks.

Rolling updates are deployed one VM at a time (max-surge=1, max-unavailable=0). Each new VM pulls the latest container, the latest TLS cert from Secret Manager, and joins the LB's backend pool only after passing health checks. Certificate rotation is automated: a Cloud Scheduler job runs an ACME DNS-01 renewal, writes the new cert to Secret Manager, and an Eventarc-triggered Cloud Workflow drives a MIG rolling restart so every VM picks up the new cert at boot — no human in the loop at the 60-day renewal mark.

5. Vault Mode — encrypted GPU inference

Vault Mode runs on our own GPU infrastructure. Open-source models (Qwen, GLM, Kimi, and similar) are hosted on H100-class GPUs (80 GB) inside Confidential VMs, with vLLM serving an OpenAI-compatible HTTP API on an internal VPC port. The platform's router calls vLLM the same way it would call any other OpenAI-compatible endpoint, except that the call never leaves our encrypted infrastructure.

NVIDIA Confidential Computing on the GPU encrypts HBM and the PCIe link, so model weights and activations stay encrypted outside the GPU silicon, matching what SEV-SNP provides on the CPU side. The GPU performs inference; plaintext exists only on the GPU's execution units, never on the wire.

Multi-GPU tensor-parallel deployments for larger models run on 8 × H100 a3-highgpu-8g clusters. We have built and validated this end to end. Cluster bring-up is automated: cold-start to serving traffic within hours, on demand.

vLLM model lifecycle

Adding or replacing a Vault model is a 2-file code change plus one environment variable: the model registry entry, the adapter's endpoint map, and the secret holding the vLLM endpoint URL. Lifecycle scripts manage GPU bring-up, model deploy, and tear-down with progress-based health monitoring (stall detection, preemption detection).

6. Partner Mode — proprietary providers

When a request asks for a proprietary model, the platform routes through a managed provider channel. From the user's perspective, the API surface is identical; the difference is where the inference actually executes and what the encryption posture is. Partner Mode payloads are visible to the partner provider — by design — and the partner mode is offered for cases where the customer has accepted that.

We do not name partner providers in public documentation. The mode and the per-request capability matrix are exposed via the API (model discovery returns the model's supported capabilities — images, tools, etc.); the routing path is an implementation detail.

7. Multi-model debate

The platform exposes a debate engine as a first-class inference mode. The caller sends a normal prompt; the platform handles debate mechanics internally. The default debate panel comprises six expert models (a mix of Vault and Partner) and a moderator. In round one, each reviewer responds with FOR and AGAINST argumentation and a confidence score. In round two, the moderator synthesizes the panel into a verdict (Confirmed, Rejected, Needs Human Review) for each finding.

Because the panel can span Vault and Partner models in the same call, the debate output reflects multiple perspectives across both proprietary and open-source families. The moderator's verdict is the structured output; the underlying FOR/AGAINST arguments are preserved alongside.

A second style — divergent brainstorm — is available where preserving diverse alternatives matters more than landing on a verdict. Same API, different prompt pair.

8. Hardware attestation

The platform exposes an authenticated GET endpoint for hardware attestation. The caller passes a nonce (8 to 128 characters) and a bearer token; the endpoint returns a Google-signed OIDC JWT that proves the request was processed inside an AMD SEV-SNP Confidential VM. The token's claims include the hardware model (AMD SEV-SNP), the nonce echoed back, and other VM identity fields.

The verifier checks the token signature against Google's OIDC keys, confirms claims.hwmodel == "AMD SEV-SNP" and claims.nonce == <your-client-nonce>, and is now cryptographically convinced that whatever response came back from the platform was produced inside the attested enclave.

Endpoint rate limiting is 10 requests per 60 seconds per IP. The attestation flow is the same one a QP, CSV, or external auditor can run against any production request.

9. Models

The model registry contains thirteen canonical entries spanning seven expert and six light/fast variants. Canonical names are version-free ({vendor}-{family} for flagships, {vendor}-{family}-{variant} for variants). Per-model capabilities — text, images, tool calls — are surfaced through model discovery and are enforced at routing time: a call asking a non-image model to process an image returns HTTP 400 with a pointer to the capability matrix.

The thirteen-entry registry is the technical surface — every model the platform can route to. The 9 Robots Model Council is a smaller curated production set of 9 models drawn from the registry, selected to span the workloads we care about (debate quality, fast classification, multimodal grounding, code assistance), with the Vault tier handling encrypted inference and the Partner tier filling capabilities the Vault tier does not yet have at parity.

10. Benchmark

Where each model ranks on aggregation, debate, and multi-model coverage is published openly at benchmark.coldvault.ai. The benchmark is updated continuously as new models join the council and as the workloads evolve.

Notes

This page is a living document. It is updated when the platform's architecture changes. The "Last updated" date at the top reflects the most recent material edit. If you are evaluating the platform for a regulated workload and want a point-in-time PDF for your binder, the simplest path today is to print this page — versioned snapshots may be added later if there is demand.

This page describes the architecture of the ColdVault Platform — what we have built and what the platform can do. It is not an operations status page; which specific models route through Vault vs Partner at this exact moment is an operations question, surfaced through model discovery on the API. The architecture describes the design and the capability.

Questions about the architecture, or a request for a deeper briefing on a specific section, go through the contact form on 9robots.ai/contact.