Noir Verdict

Noir Verdict: What I Built and What I Learned

Ten questions, four suspects, one bad night, and a 4B parameter model that never talks to the internet.

Two weeks ago, I sat down to build something for the Hugging Face Build Small Hackathon — a competition where every model must be 32B parameters or smaller, and the best entries run entirely offline. I came out the other side with a noir detective game, nine bonus quests checked off, and a notebook full of lessons that I want to share while the scars are still fresh.


The Game

You walk into a 1950s radio station. A phonograph disc has been stolen. Four suspects are waiting in the room — nervous, arrogant, helpful, evasive. You have ten questions to find the truth. Each question either cracks a clue, catches a contradiction, or wastes a turn. At the end, you charge someone. The math decides if you were right. The typewriter decides how it sounds.

The whole thing (a custom 3D noir scene, a deterministic game engine, and a fine-tuned LLM) runs inside a single Hugging Face Gradio Space. The brain of the operation is an NVIDIA Nemotron 3 Nano 4B, fine-tuned with LoRA, quantized to Q4_K_M (just 2.84 GB), and running in-process via llama-cpp-python. No Together. No API keys at runtime. Just a model that can run on a free-tier CPU, talking noir.

You can play it here: build-small-hackathon/noir-verdict


Architecture: Deterministic Truth

The load-bearing idea of this project is what I call deterministic truth. The Python engine owns everything that needs to be reliable. The LLM owns everything that needs to be beautiful.


┌─────────────────────────────────────────────┐
│             Three.js Noir Frontend          │
│  (sepia desk lamp, 4 suspects, typewriter)  │
└──────────────────┬──────────────────────────┘
                   │ fetch() API calls
┌──────────────────▼──────────────────────────┐
│             app.py (gradio.Server)          │
│    @app.api("/new_game")                    │
│    @app.api("/interrogate")                 │
│    @app.api("/charge")                      │
└──────┬───────────────────────┬──────────────┘
       │                       │
┌──────▼──────────┐  ┌─────────▼─────────────┐
│  Game Engine    │  │  LLM Backend          │
│  ────────────   │  │  ───────────          │
│  cases.py       │  │  llama-cpp-python     │
│  scoring.py     │  │  GGUF (Q4_K_M)        │
│  state.py       │  │  Nemotron 3 4B        │
│  prompts.py     │  │  in-process, offline  │
│  contradictions │  │                       │
└─────────────────┘  └───────────────────────┘

The model does exactly one job: stay in character and sound like a hard-boiled detective. A 4B model does that beautifully. A 4B model does not reliably decide if a player logically solved a mystery based on subtle clues — and with this architectural split, it doesn't have to. Python handles the 50/20/15/10/5 scoring formula, token-overlap contradiction detection, and clue revelation.
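
To make the split concrete, here is a minimal sketch of what the Python side of that bargain could look like. It is not the code from scoring.py or the contradictions module: the function names, the stopword list, the 0.6 threshold, and the idea of matching a question against a per-case "key phrase" are all assumptions for illustration. Only the 50/20/15/10/5 weights and the token-overlap idea come from the game itself, and which component each weight belongs to is left open here.

```python
import re
from typing import Sequence

# Weights from the 50/20/15/10/5 formula. Which game component each weight
# belongs to is not specified here, so they are applied by position.
WEIGHTS = (50, 20, 15, 10, 5)

# Hypothetical stopword list; a real engine would likely use a longer one.
STOPWORDS = {"the", "a", "an", "you", "i", "at", "to", "of", "and", "but",
             "said", "did", "was", "were", "in", "on", "it", "that"}


def score(fractions: Sequence[float]) -> float:
    """Weighted sum of five component fractions, each clamped to [0, 1]."""
    if len(fractions) != len(WEIGHTS):
        raise ValueError("expected one fraction per weight")
    return sum(w * min(max(f, 0.0), 1.0) for w, f in zip(WEIGHTS, fractions))


def tokens(text: str) -> set[str]:
    """Lowercase word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower())
            if t not in STOPWORDS}


def key_coverage(question: str, key_phrase: str) -> float:
    """Fraction of the key phrase's tokens that also appear in the question."""
    key = tokens(key_phrase)
    if not key:
        return 0.0
    return len(key & tokens(question)) / len(key)


def catches_contradiction(question: str, key_phrase: str,
                          threshold: float = 0.6) -> bool:
    """Assumed rule: a question 'catches' a contradiction when it covers
    at least `threshold` of that contradiction's key tokens."""
    return key_coverage(question, key_phrase) >= threshold
```

Because every function here is pure and deterministic, the same question against the same case always produces the same verdict, no matter what the model says in character.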


The Fine-Tune Saga (Across Two CUDA Versions)

The base model is unsloth/NVIDIA-Nemotron-3-Nano-4B — a hybrid Mamba-2 / Transformer architecture with a 1M-token context window. Fine-tuning it on Modal went through more iterations than I want to count. Here is what broke and how I fixed it:

  • The wrong base image: Modal's default debian_slim does not have CUDA, causing the first build to silently install the CPU PyTorch wheel and run for 45 minutes before I noticed. Fix: Switched to nvidia/cuda:12.4.1-devel-ubuntu22.04.
  • The wrong Torch backend: Because Modal's build environment is headless, Unsloth's resolver defaulted to cu124 (which tops out at Torch 2.6). Unsloth-zoo requires Torch ≥ 2.8. Fix: Pinned --torch-backend=cu128.
  • Missing Mamba-ssm wheels: The Nemotron hybrid needs mamba-ssm compiled from source, but there is no prebuilt wheel for cu128 + Python 3.13. The toolchain defaulted to clang++, but the in-image CUDA expected gcc. Fix: Forced CC=gcc and CXX=g++ with --no-build-isolation, aligning the base to nvidia/cuda:12.8.1-devel-ubuntu22.04.
  • GGUF converter mismatch: Unsloth's generated converter script missed the NemotronHForCausalLM signature, resulting in a garbage quantized model. Fix: Pinned the official llama.cpp at commit 6471e3c and quantized directly.
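
Two of those fixes are easy to turn into guards that fail in seconds instead of 45 minutes. The sketch below is an illustration rather than my actual Modal build script: the helper names are hypothetical, and it assumes the source build goes through plain pip. The first guard expects the value of torch.version.cuda, which is None on a CPU-only PyTorch wheel; the second builds the pip command and environment that force gcc and g++ with --no-build-isolation.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional


def assert_cuda_build(cuda_version: Optional[str]) -> str:
    """Fail fast on a CPU-only PyTorch wheel.

    Pass in torch.version.cuda, which is None when the CPU wheel is installed.
    """
    if cuda_version is None:
        raise RuntimeError("PyTorch was installed without CUDA support")
    return cuda_version


def gcc_build_env(base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment and force the gcc toolchain for source builds."""
    env = dict(os.environ if base_env is None else base_env)
    env["CC"] = "gcc"
    env["CXX"] = "g++"
    return env


def source_build_command(package: str = "mamba-ssm") -> list:
    """pip command that compiles against the already-installed torch."""
    return [sys.executable, "-m", "pip", "install",
            "--no-build-isolation", package]


def build_from_source(package: str = "mamba-ssm", dry_run: bool = True) -> list:
    """Return the command; run it only when dry_run is False."""
    cmd = source_build_command(package)
    if not dry_run:
        subprocess.run(cmd, env=gcc_build_env(), check=True)
    return cmd
```

Calling assert_cuda_build(torch.version.cuda) as the first line of the training entry point would have caught the wrong base image immediately.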

The Result: 240 training steps on a Modal A10G (24 GB VRAM) at LoRA rank 16, producing a 2.84 GB GGUF that runs at ~125 tokens/sec with zero role-token leaks.


The Deployment Battles

Shipping a custom Gradio app with an in-process LLM to Hugging Face Spaces revealed the real-world constraints of the free tier.

  • OOM during pip install: Compiling llama-cpp-python from source crashed the 2 vCPU Space builder. Pulling a prebuilt cp312 wheel cut build time from "crash" to 16 seconds.
  • Hardcoded Gradio versions: HF Spaces Dockerfiles hardcode gradio[oauth]==5.5.0. My backend relied on gradio.Server and @app.api(), which were added in 5.7+. I had to write a ~30-line backward-compatibility shim to wrap the bare 5.5 App.
  • Heavy Boot Times: A 2.84 GB GGUF exceeds free-tier RAM during a cold start, causing OOM kills before the app even booted. Fix: Lazy-load the model on the first /interrogate request. The first user pays the ~80s download cost; subsequent turns reuse the cached model.
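
The lazy-load fix boils down to a small pattern. The sketch below is a generic version of it, not the code from app.py: the LazyModel class and its loader argument are hypothetical names. In the Space, the loader would be whatever downloads the GGUF and constructs the llama-cpp-python Llama object; here it is any zero-argument callable, which keeps the example runnable without the model.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Load an expensive object on first use and reuse it afterwards."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._model: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    @property
    def ready(self) -> bool:
        """True once the model is in memory; useful for a 'warming up' message."""
        return self._loaded

    def get(self) -> T:
        """Return the model, loading it exactly once even under concurrent requests."""
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock: another request may have finished
                # loading while this one was waiting.
                if not self._loaded:
                    self._model = self._loader()
                    self._loaded = True
        return self._model  # type: ignore[return-value]
```

The app boots without touching the model, the /interrogate handler calls get(), and the frontend can check ready to decide whether to show a warming-up message.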

What I Learned

Looking back at the commit history across 19 distinct development phases, I can see that most of the work wasn't about the AI at all. It was about deployment, UI, documentation, and infrastructure. The fine-tuning took 2 hours of compute; the debugging took a lot of time.

Here are the biggest takeaways:

  1. Small models thrive on narrow jobs. When you ground outputs deterministically and use code for logic, small models shine. We often give them tasks that are too broad and blame them when they fail. The constraint is the gameplay.
  2. Always mirror your artifacts. The HF Space-creation rate limit is 20/day. When you hit it, you wait 24 hours. Push your models, datasets, and code to a personal profile so they survive when the hackathon organization is eventually cleaned up.
  3. Lazy-loading is mandatory for CPU Spaces. Players understand a "warming up the model" UI message. They do not understand a silent server crash.
  4. Compile from source for cutting-edge architectures. Pre-built wheels for hybrid architectures (like Mamba-Transformer) will lag. Learn to pin commits and force compiler toolchains, or you will get stuck.

Artifacts


"Everybody has a tell. You just need ten questions to find it."
