Building an Air-Gapped AI Security Platform from Scratch

thomasbolton-braml
Apr 24

Digital brain model with glowing nodes and connections on a blurred orange background, conveying a sense of technology and innovation.

Most AI-powered security tooling assumes an internet connection. That works fine until you're operating in an environment where outbound traffic is a risk, not a feature. I needed a platform that worked with zero external connectivity, fully air-gapped, fully self-contained, and built for offensive security workflows.

So I built one.

What is SKYNET? (Yes, that SKYNET.)

It’s my internal name for a fully air-gapped AI penetration testing platform.

SKYNET is a collection of six interconnected tools, all containerised with Docker and running on a server with dual NVIDIA RTX 3090 GPUs. Every model, every embedding, every knowledge base, local. Nothing calls home at runtime.

Key Stats:

6 Tools
2 x RTX 3090 GPUs
248K+ RAG Entries
100% Offline

It took a lot of head-scratching, headaches, and code rewrites, but this is what came out of it:

The Tools:

AI Chatbot: Team-facing AI assistant with a custom pen-test persona, CVE lookup, and exec summary generation.
The RAG Database: Local knowledge base with 248K+ entries: CVE, MITRE ATT&CK, OWASP, ExploitDB, and internal pentest reports.
Security Config Auditor: Five-engine firewall & config review supporting Cisco, Fortinet, Palo Alto, Juniper, Check Point.
Hashcat Web UI: GPU-accelerated password cracking UI with 6,200+ wordlists and session management.
Admin Panel: Internal management interface.
Terminal CLI: Command-line interface to the chatbot for operators who live in the terminal, bringing AI-assisted pen testing to the command line.

GPU Partitioning Strategy

Two GPUs, six tools. One GPU runs the chatbot inference engine permanently through vLLM; because it's the highest-frequency tool, it stays resident. The second GPU hosts a larger reasoning model that spins up on demand via a GPU manager, used by the config auditor and firewall review tools.

When it's time to crack hashes, Hashcat doesn't just take over GPU 2; it also uses the unallocated headroom on GPU 1. Because the vLLM instance is capped, Hashcat can safely scavenge the free VRAM and spare compute on the first card without killing the chatbot container or interrupting Dave's uptime.
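The headroom arithmetic behind that sharing can be sketched in a few lines. This is only an illustration, not the platform's GPU manager: the 0.5 cap, the 1 GiB safety margin, and the names GpuBudget and can_share are all assumptions. The cap is modelled on vLLM's gpu_memory_utilization setting, which limits the fraction of a card's memory the engine will claim.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
RTX_3090_VRAM_BYTES = 24 * GIB  # an RTX 3090 has 24 GB of VRAM


@dataclass(frozen=True)
class GpuBudget:
    """VRAM split for one card; names and values here are hypothetical."""
    total_bytes: int
    vllm_fraction: float  # assumed cap, analogous to vLLM's gpu_memory_utilization

    @property
    def vllm_bytes(self) -> int:
        # Memory the capped inference engine is allowed to claim.
        return int(self.total_bytes * self.vllm_fraction)

    @property
    def free_bytes(self) -> int:
        # Whatever the cap leaves over is available to other workloads.
        return self.total_bytes - self.vllm_bytes


def can_share(budget: GpuBudget, requested_bytes: int,
              safety_margin_bytes: int = 1 * GIB) -> bool:
    """True if a second workload fits in the headroom with a margin to spare."""
    return requested_bytes + safety_margin_bytes <= budget.free_bytes


if __name__ == "__main__":
    gpu1 = GpuBudget(RTX_3090_VRAM_BYTES, vllm_fraction=0.5)
    print(gpu1.free_bytes // GIB, "GiB free;",
          "10 GiB job fits:", can_share(gpu1, 10 * GIB))
```

Keeping the check as plain arithmetic means the decision to let a cracking job onto GPU 1 can be made before anything touches the card.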

Building the RAG Layer

The local knowledge base holds over 248,000 entries across five collections: ~197,000 CVEs (NVD 2020–2025), 46,000+ ExploitDB entries, the full MITRE ATT&CK Enterprise/Mobile/ICS framework, OWASP WSTG and Top 10, and the team's own pentest reports for house-style consistency.

Embeddings run entirely locally. CVEs refresh nightly via cron; a full rebuild runs every month. Updates are downloaded on an internet-connected machine, transferred over, and re-ingested, skipping existing IDs so the process is incremental. As I mentioned, it's 100% local.
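The "skip existing IDs" step is simple enough to sketch. The version below is a minimal illustration that uses a plain dict in place of the vector database; the function name ingest_incremental, the record layout, and the placeholder IDs are assumptions, and the embedding step is left out.

```python
from typing import Dict, Iterable


def ingest_incremental(records: Iterable[dict], store: Dict[str, dict]) -> int:
    """Add only records whose ID is not already stored; return how many were added.

    `store` stands in for the real knowledge base (hypothetical interface).
    """
    added = 0
    for record in records:
        record_id = record["id"]
        if record_id in store:
            # Already ingested on a previous run, so skip it.
            continue
        store[record_id] = record
        added += 1
    return added


if __name__ == "__main__":
    # Placeholder IDs, not real CVEs.
    store = {"CVE-0000-0001": {"id": "CVE-0000-0001", "summary": "old entry"}}
    batch = [
        {"id": "CVE-0000-0001", "summary": "old entry"},
        {"id": "CVE-0000-0002", "summary": "new entry"},
    ]
    print("added:", ingest_incremental(batch, store))
```

Because the function is idempotent, re-running the same transfer twice does no harm, which is useful when the hand-off from the connected machine is manual.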

Critical detail: The chatbot uses a dedicated CVE lookup with sentinel-based hallucination prevention. If the CVE isn't in the database, the response says so explicitly: no fabricated CVSS scores or descriptions.
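One way to read "sentinel-based" is sketched below: the lookup returns a dedicated sentinel object on a miss, and the reply for that case is assembled in code without consulting the model. This is an assumption about the approach, not the platform's actual implementation; CVE_NOT_FOUND, lookup_cve, answer_cve_question, and the record fields are hypothetical names.

```python
import re

# Sentinel returned on a miss, so "not found" can never be confused with data.
CVE_NOT_FOUND = object()

# CVE IDs have the form CVE-YYYY-NNNN, with four or more digits in the last part.
CVE_ID_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def lookup_cve(cve_id: str, database: dict):
    """Return the stored record, or the sentinel if the ID is unknown."""
    return database.get(cve_id.upper(), CVE_NOT_FOUND)


def answer_cve_question(question: str, database: dict) -> str:
    """Build the reply in code; the model is never asked to fill in CVE facts."""
    match = CVE_ID_PATTERN.search(question)
    if match is None:
        return "No CVE ID found in the question."
    cve_id = match.group(0).upper()
    record = lookup_cve(cve_id, database)
    if record is CVE_NOT_FOUND:
        return f"{cve_id} is not in the local database; no details are available."
    return f"{cve_id} (CVSS {record['cvss']}): {record['description']}"


if __name__ == "__main__":
    # Placeholder record, not a real CVE.
    db = {"CVE-0000-0001": {"cvss": 9.8, "description": "Placeholder entry"}}
    print(answer_cve_question("Tell me about cve-0000-0001", db))
    print(answer_cve_question("What about CVE-0000-9999?", db))
```

The point of the sentinel is that the miss path is a fixed string produced by Python, so there is nothing for the model to embellish.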

What Actually Broke

A UI framework broke silently: A chatbot component in a widely-used Python UI library introduced a schema parsing regression. Rather than patch around it, the chatbot and admin panel were rewritten as FastAPI + plain HTML.
The LLM ignored its system prompt: One of the local models doesn't reliably follow formatting rules at prompt level. CVE accuracy and exec summary formatting now live in Python, not in prompts. Code enforcement, not hope.
CUDA version mismatches are silent killers: A pre-built inference image was compiled against an older CUDA version than the host driver. The fix was straightforward once diagnosed, but finding it cost hours.
Dependency pinning matters more than you think: A major version bump in a numerical computing library broke the vector database entirely. Pin your dependencies and treat upgrades as deliberate decisions.
Responder captures need cleaning before cracking: Raw NTLMv2 capture files need BOM stripping, line ending normalisation, and metadata filtering before they'll parse correctly.
Shell scripting edge cases bite: A PATH export in an install script didn't propagate to the parent shell because of how it was being invoked. One word changed in the run instructions fixed it.
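The Responder clean-up mentioned above lends itself to a short sketch. It assumes the capture is UTF-8 (possibly with a BOM) and that "metadata" means any line that does not look like a NetNTLMv2 hash of the form user::domain:challenge:proof:blob. The regex and the clean_capture name are illustrative assumptions, not the platform's code.

```python
import re
from typing import List

# Assumed shape of a NetNTLMv2 line: user::domain:16-hex challenge:32-hex proof:hex blob.
NTLMV2_LINE = re.compile(
    r"^[^:]+::[^:]*:[0-9A-Fa-f]{16}:[0-9A-Fa-f]{32}:[0-9A-Fa-f]+$"
)


def clean_capture(raw: bytes) -> List[str]:
    """Strip a BOM, normalise line endings, and keep only hash-shaped lines."""
    # 'utf-8-sig' drops a leading UTF-8 byte order mark if one is present.
    text = raw.decode("utf-8-sig", errors="replace")
    # Normalise Windows (CRLF) and old Mac (CR) endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    hashes = []
    for line in text.split("\n"):
        line = line.strip()
        # Anything that is not hash-shaped (banners, comments, blanks) is dropped.
        if NTLMV2_LINE.match(line):
            hashes.append(line)
    return hashes


if __name__ == "__main__":
    sample = (
        b"\xef\xbb\xbfalice::LAB:0123456789abcdef:" + b"0" * 32 + b":0101\r\n"
        b"# some metadata line\r\n"
    )
    for line in clean_capture(sample):
        print(line)
```

Filtering by shape rather than by a list of known junk prefixes means unexpected metadata is dropped by default instead of reaching the cracker.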

"The most important architectural decision was enforcing behaviour in code rather than trusting model instructions. Prompts are suggestions. Code is law." Lesson from building Dave

The Most Complex Piece: Security Configuration Auditor

The Security Configuration Auditor started as a 2,400-line monolith and was refactored into a 28-module Python package. It runs five analysis engines:

YAML-driven rule matching
Network rules analysis (shadowing, any-any policies, zero-hit rules)
Static code analysis (entropy, secret scanning, malware detection)
AI enrichment
Multi-device correlation
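To make the network rules analysis concrete, here is a minimal sketch of two of its checks on a deliberately simplified rule model. It assumes first-match semantics, IPv4 networks, and a single service string per rule; the Rule class and the function names are hypothetical, and real vendor configs need far more parsing than this.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network, ip_network
from typing import List, Tuple

ANY_NET = ip_network("0.0.0.0/0")


@dataclass(frozen=True)
class Rule:
    """Simplified firewall rule (hypothetical model)."""
    name: str
    src: IPv4Network
    dst: IPv4Network
    service: str  # e.g. "tcp/443", or "any"
    action: str   # "permit" or "deny"


def covers(earlier: Rule, later: Rule) -> bool:
    """True if every packet matching `later` would already match `earlier`."""
    return (
        later.src.subnet_of(earlier.src)
        and later.dst.subnet_of(earlier.dst)
        and earlier.service in ("any", later.service)
    )


def find_any_any(rules: List[Rule]) -> List[str]:
    """Names of permit rules with both source and destination set to any."""
    return [
        r.name for r in rules
        if r.action == "permit" and r.src == ANY_NET and r.dst == ANY_NET
    ]


def find_shadowed(rules: List[Rule]) -> List[Tuple[str, str]]:
    """Pairs (shadowed rule, shadowing rule) under first-match evaluation."""
    shadowed = []
    for i, later in enumerate(rules):
        for earlier in rules[:i]:
            if covers(earlier, later):
                shadowed.append((later.name, earlier.name))
                break  # the first covering rule is enough to report
    return shadowed


if __name__ == "__main__":
    rules = [
        Rule("r1", ip_network("10.0.0.0/8"), ANY_NET, "any", "deny"),
        Rule("r2", ip_network("10.1.0.0/16"), ip_network("192.168.1.0/24"), "tcp/443", "permit"),
        Rule("r3", ANY_NET, ANY_NET, "any", "permit"),
    ]
    print("shadowed:", find_shadowed(rules))
    print("any-any:", find_any_any(rules))
```

Zero-hit detection is not shown because it depends on hit counters exported by the device rather than on the rule definitions alone.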

Reports come out as HTML with visualisations, DOCX from a pentest template, and CSV. Firepower Management Centre (FMC) files are auto-detected and transparently converted.

What's Next: Red Teaming & Social Engineering

The platform handles technical assessment well. The next frontier is giving the same AI-assisted capability to the human side of an engagement: red teaming and social engineering.

The plan is a campaign planner module: feed in a target organisation, define the objective and scenario type, and get back a structured operation plan with pretext options ranked by plausibility, recommended delivery vectors, talking points, likely objection handling, and MITRE ATT&CK mappings for the report.

Why Offline-First Matters

The air-gap constraint is a design discipline. Every dependency has to be justified and vendored. The result: no runtime dependencies on third-party services, no data leaving the network, and predictable performance. For security assessment work, that's not a nice-to-have, it's the whole point.

By Carl Williamson