Noir Verdict
Noir Verdict: What I Built and What I Learned
Ten questions, four suspects, one bad night, and a 4B-parameter model that never talks to the internet.
Two weeks ago, I sat down to build something for the Hugging Face Build Small Hackathon — a competition where every model must be 32B parameters or smaller, and the best entries run entirely offline. I came out the other side with a noir detective game, nine bonus quests checked off, and a notebook full of lessons that I want to share while the scars are still fresh.
The Game
You walk into a 1950s radio station. A phonograph disc has been stolen. Four suspects are waiting in the room — nervous, arrogant, helpful, evasive. You have ten questions to find the truth. Each question either cracks a clue, catches a contradiction, or wastes a turn. At the end, you charge someone. The math decides if you were right. The typewriter decides how it sounds.
The whole thing (a custom 3D noir scene, a deterministic game engine, and a fine-tuned LLM) runs inside a single Hugging Face Gradio Space. The brain of the operation is an NVIDIA Nemotron 3 Nano 4B, fine-tuned with LoRA, quantized to Q4_K_M (just 2.84 GB), and running in-process via llama-cpp-python. No Together. No API keys at runtime. Just a model that can run on a free-tier CPU, talking noir.
You can play it here: build-small-hackathon/noir-verdict
Architecture: Deterministic Truth
The load-bearing idea of this project is what I call deterministic truth. The Python engine owns everything that needs to be reliable. The LLM owns everything that needs to be beautiful.
┌─────────────────────────────────────────────┐
│           Three.js Noir Frontend            │
│  (sepia desk lamp, 4 suspects, typewriter)  │
└──────────────────┬──────────────────────────┘
                   │ fetch() API calls
┌──────────────────▼──────────────────────────┐
│           app.py (gradio.Server)            │
│           @app.api("/new_game")             │
│           @app.api("/interrogate")          │
│           @app.api("/charge")               │
└──────┬───────────────────────┬──────────────┘
       │                       │
┌──────▼──────────┐  ┌─────────▼─────────────┐
│   Game Engine   │  │      LLM Backend      │
│   ───────────   │  │      ───────────      │
│  cases.py       │  │  llama-cpp-python     │
│  scoring.py     │  │  GGUF (Q4_K_M)        │
│  state.py       │  │  Nemotron 3 4B        │
│  prompts.py     │  │  in-process, offline  │
│  contradictions │  │                       │
└─────────────────┘  └───────────────────────┘
The model does exactly one job: stay in character and sound like a hard-boiled detective. A 4B model does that beautifully. A 4B model does not reliably decide if a player logically solved a mystery based on subtle clues — and with this architectural split, it doesn't have to. Python handles the 50/20/15/10/5 scoring formula, token-overlap contradiction detection, and clue revelation.
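To make the split concrete, here is a minimal sketch of what the deterministic side could look like. Only the 50/20/15/10/5 weights and the phrase "token-overlap" come from the game itself; the component names, the keyword sets, and the 0.5 threshold are hypothetical placeholders, not the real rubric.

```python
import re

# Weights from the 50/20/15/10/5 formula. The component names are
# hypothetical: the real engine may score different things.
WEIGHTS = {
    "culprit": 50,
    "clues": 20,
    "contradictions": 15,
    "efficiency": 10,
    "motive": 5,
}


def score(components: dict[str, float]) -> float:
    """Combine per-component fractions in [0, 1] into a 0-100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing components count as 0; out-of-range values are clamped.
        fraction = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * fraction
    return total


def tokens(text: str) -> set[str]:
    """Lowercase a sentence and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def catches(question: str, keywords: set[str], threshold: float = 0.5) -> bool:
    """Assumed rule: a question catches a contradiction when it mentions
    at least `threshold` of that contradiction's authored keywords."""
    if not keywords:
        return False
    hit = len(tokens(question) & keywords) / len(keywords)
    return hit >= threshold
```

Because both functions are pure Python, the same question against the same case always produces the same verdict, no matter what the model says in character.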
The Fine-Tune Saga (Across Two CUDA Versions)
The base model is unsloth/NVIDIA-Nemotron-3-Nano-4B — a hybrid Mamba-2 / Transformer architecture with a 1M-token context window. Fine-tuning it on Modal went through more iterations than I want to count. Here is what broke and how I fixed it:
- The wrong base image: Modal's default debian_slim does not have CUDA, causing the first build to silently install the CPU PyTorch wheel and run for 45 minutes before I noticed. Fix: Switched to nvidia/cuda:12.4.1-devel-ubuntu22.04.
- The wrong Torch backend: Because Modal's build environment is headless, Unsloth's resolver defaulted to cu124 (which tops out at Torch 2.6). Unsloth-zoo requires Torch ≥ 2.8. Fix: Pinned --torch-backend=cu128.
- Missing Mamba-ssm wheels: The Nemotron hybrid needs mamba-ssm compiled from source, because there is no prebuilt wheel for cu128 + Python 3.13. The toolchain defaulted to clang++, but the in-image CUDA expected gcc. Fix: Forced CC=gcc and CXX=g++ with --no-build-isolation, aligning the base to nvidia/cuda:12.8.1-devel-ubuntu22.04.
- GGUF converter mismatch: Unsloth's generated converter script missed the NemotronHForCausalLM signature, resulting in a garbage quantized model. Fix: Pinned the official llama.cpp at commit 6471e3c and quantized directly.
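Collected in one place, the build fixes above might look like the sketch below. It keeps the recipe as plain Python data so it can be pasted into whatever image builder is in use. The base image tag, the compiler variables, and the two flags come from the list above; the choice of uv as the installer and the exact package list are assumptions.

```python
import shlex

# Values taken from the fixes above.
BASE_IMAGE = "nvidia/cuda:12.8.1-devel-ubuntu22.04"
BUILD_ENV = {"CC": "gcc", "CXX": "g++"}
TORCH_BACKEND = "cu128"


def install_commands(installer: str = "uv pip install") -> list[str]:
    """Return install steps in order: Torch first, then the source build.

    Assumption: `installer` is uv's pip interface and the package list
    is illustrative. mamba-ssm comes last because --no-build-isolation
    makes its build reuse the Torch that is already installed.
    """
    return [
        f'{installer} --torch-backend={TORCH_BACKEND} "torch>=2.8" unsloth unsloth-zoo',
        f"{installer} --no-build-isolation mamba-ssm",
    ]


def render_script() -> str:
    """Render the compiler exports plus install steps as a shell script."""
    exports = [f"export {key}={shlex.quote(value)}" for key, value in BUILD_ENV.items()]
    return "\n".join(exports + install_commands())


if __name__ == "__main__":
    print(f"# base image: {BASE_IMAGE}")
    print(render_script())
```

The ordering matters more than the exact commands: the compiler variables have to be set before the mamba-ssm build starts, and Torch has to exist before a build that skips isolation.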
The Result: 240 steps on a Modal A10G (24GB VRAM), LoRA rank 16, resulting in a lightning-fast 2.84 GB GGUF running at ~125 tokens/sec with zero role-token leaks.
The Deployment Battles
Shipping a custom Gradio app with an in-process LLM to Hugging Face Spaces revealed the real-world constraints of the free tier.
- OOM during pip install: Compiling llama-cpp-python from source crashed the 2 vCPU Space builder. Pulling a prebuilt cp312 wheel cut build time from "crash" to 16 seconds.
- Hardcoded Gradio versions: HF Spaces Dockerfiles hardcode gradio[oauth]==5.5.0. My backend relied on gradio.Server and @app.api(), which were added in 5.7+. I had to write a ~30-line backward-compatibility shim to wrap the bare 5.5 App.
- Heavy Boot Times: A 2.84 GB GGUF exceeds free-tier RAM during a cold start, causing OOM kills before the app even booted. Fix: Lazy-load the model on the first /interrogate request. The first user pays the ~80s download cost; subsequent turns reuse the cached model.
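The lazy-load fix can be sketched roughly as follows. This is not the Space's actual code: the LazyModel wrapper is a hypothetical helper, and the repo id, filename, context size, and thread count are assumptions chosen for illustration.

```python
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Load an expensive model on first use and reuse it afterwards."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    @property
    def ready(self) -> bool:
        """True once the model has been loaded (useful for a warm-up message)."""
        return self._model is not None

    def get(self) -> Any:
        # Double-checked locking: concurrent first requests trigger one load.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model


def load_gguf() -> Any:
    # Imported here so that starting the app never touches the heavy libraries.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    path = hf_hub_download(
        repo_id="build-small-hackathon/noir-verdict-nemotron-4b-gguf",  # assumed repo id
        filename="model-Q4_K_M.gguf",  # hypothetical filename
    )
    return Llama(model_path=path, n_ctx=4096, n_threads=2)  # assumed settings


# Nothing is downloaded or loaded at import time.
MODEL = LazyModel(load_gguf)
```

A request handler would call MODEL.get() on each turn, and the frontend could poll MODEL.ready to decide whether to show the warm-up message.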
What I Learned
Looking back at a large number of commits across 19 distinct development phases, most of the work wasn't about the AI at all. It was about deployment, UI, documentation, and infrastructure. The fine-tuning took 2 hours of compute; the debugging took far longer.
Here are the biggest takeaways:
- Small models thrive on narrow jobs. When you ground outputs deterministically and use code for logic, small models shine. We often give them tasks that are too broad and blame them when they fail. The constraint is the gameplay.
- Always mirror your artifacts. The HF Space-creation rate limit is 20/day. When you hit it, you wait 24 hours. Push your models, datasets, and code to a personal profile so they survive the hackathon organization eventually being cleaned up.
- Lazy-loading is mandatory for CPU Spaces. Players understand a "warming up the model" UI message. They do not understand a silent server crash.
- Compile from source for cutting-edge architectures. Pre-built wheels for hybrid architectures (like Mamba-Transformer) will lag. Learn to pin commits and force compiler toolchains, or you will get stuck.
Artifacts
- Play the Game: Noir Verdict Space
- LoRA Adapter (40.5 MB): noir-verdict-nemotron-4b-lora
- Merged BF16 (7.95 GB): noir-verdict-nemotron-4b-merged
- GGUF (2.84 GB): noir-verdict-nemotron-4b-gguf
- Traces Dataset: noir-verdict-traces
- GitHub: Noir Verdict GitHub
"Everybody has a tell. You just need ten questions to find it."