Duel of Albion

How a Tiny 270M Gemma 3 Model Powers a Real-Time 3D Fighter

By Sathvik A R and Sankalp H S

If you've ever wondered whether you really need a massive, GPU-hungry language model to power smart NPCs, the short answer is: no, you don't.

Welcome to Duel of Albion, our new Hugging Face Space that delivers a real-time 3D fighting game where the NPC opponent is driven entirely by a fine-tuned Gemma 3 270M model. We replaced our older, heavier 4B Gemma model with this specialized 270M LoRA adapter (Sathvik0101/cyber-duel-tiny-users). The result? The tiny NPC beat its 4B predecessor in a 10-round head-to-head match by a 19 HP margin, and it does so while running on a free CPU tier.

Here’s a deep dive into how we built it, trained it, and optimized it for real-time inference.


🏗️ The Architecture: Three Core Pillars

To make this work seamlessly on Hugging Face Spaces, we broke the architecture into three distinct layers:

  1. The 3D Game Engine: We bundled a React + Three.js fighting game into a single 693 kB 3d_scene.html file. It handles real-time 3D rendering, character movement, and combat animations, running as a fullscreen iframe.
  2. The AI NPC Backend: A FastAPI server (app.py) loads the base Gemma 3 270M model (google/gemma-3-270m-it) alongside our custom LoRA adapter using PEFT. It exposes a /predict endpoint to evaluate game states.
  3. The Gradio Shell: Since we are hosting on Hugging Face Spaces, we wrapped the app in Gradio. We used heavy CSS overrides to strip away the Gradio UI chrome, leaving only a pixel-perfect iframe of the game.
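
For readers who want to see what the backend's model loading might look like, here is a minimal sketch using the Transformers and PEFT libraries. The function name `load_npc_model` and the deferred-import layout are our illustrative assumptions, not the exact contents of app.py; the two repository IDs are the ones named in this post. The base model is loaded in bf16, and the `token` argument is only needed if the base model repository requires authentication.

```python
BASE_MODEL_ID = "google/gemma-3-270m-it"
ADAPTER_ID = "Sathvik0101/cyber-duel-tiny-users"


def load_npc_model(token=None):
    """Load the base model and attach the LoRA adapter (hypothetical helper).

    The heavy imports are deferred so that this module can still be imported
    on a machine where the ML stack is not installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID, token=token)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID, torch_dtype=torch.bfloat16, token=token
    )
    # Wrap the base model with the LoRA weights from the adapter repository.
    model = PeftModel.from_pretrained(base, ADAPTER_ID)
    model.eval()  # inference only, no gradient updates
    return tokenizer, model
```

Calling `load_npc_model()` once at server startup and reusing the returned objects for every /predict request avoids paying the load cost per move.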

The Request/Response Flow

When you throw a punch, here is exactly what happens under the hood:

Player input (keyboard/gamepad)
  → 3D Scene JS collects last 5 moves + game state
  → POST /predict { sequence, player, npc, round, distance }
  → Fine-tuned Gemma 3 270M + LoRA generates a counter-move
  → Returns { reasoning, counterMove, sequence }
  → Game engine applies the NPC response in real-time

The API Payload

The model processes 9 legal moves: jab, cross, low_kick, roundhouse, uppercut, parry, backstep, clinch, and throw. Here is what the rich state payload looks like:

{
  "sequence": "jab,cross,low_kick,uppercut,roundhouse",
  "player": {
    "name": "ronin", "speed": 3, "power": 3, "range": 3,
    "weight": 1.0, "stance": "balanced", "stamina": 100, "hp": 100
  },
  "npc": {
    "name": "ronin", "speed": 3, "power": 3, "range": 3,
    "weight": 1.0, "stance": "balanced", "stamina": 100, "hp": 100
  },
  "round": 3,
  "distance": "close"
}
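
As a rough illustration of how a client could assemble this payload, here is a small Python sketch. The field names come from the example above; the helper names (`fighter`, `build_payload`) and the validation rule are our assumptions for illustration, not the code that ships in the Space.

```python
import json

# The nine legal moves listed in this post.
LEGAL_MOVES = (
    "jab", "cross", "low_kick", "roundhouse", "uppercut",
    "parry", "backstep", "clinch", "throw",
)


def fighter(name="ronin", speed=3, power=3, range_=3, weight=1.0,
            stance="balanced", stamina=100, hp=100):
    """Build one fighter dict using the keys from the example payload."""
    return {
        "name": name, "speed": speed, "power": power, "range": range_,
        "weight": weight, "stance": stance, "stamina": stamina, "hp": hp,
    }


def build_payload(history, player, npc, round_number, distance):
    """Assemble a /predict request body from the player's move history."""
    unknown = [m for m in history if m not in LEGAL_MOVES]
    if unknown:
        raise ValueError(f"illegal moves in history: {unknown}")
    last_five = history[-5:]  # only the last 5 moves are sent
    return {
        "sequence": ",".join(last_five),
        "player": player,
        "npc": npc,
        "round": round_number,
        "distance": distance,
    }


payload = build_payload(
    ["parry", "jab", "cross", "low_kick", "uppercut", "roundhouse"],
    fighter(), fighter(), round_number=3, distance="close",
)
body = json.dumps(payload)  # the JSON string a client would POST to /predict
```

Rejecting unknown moves on the client side keeps the prompt the model sees inside the vocabulary it was trained on.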

🧠 The Training Pipeline (SFT → GRPO → DPO)

The secret sauce behind the tiny model's success is its training pipeline. We didn't use an LLM teacher or distillation; the adapter was trained entirely on procedurally generated data over three stages:

  • Stage 1: Supervised Fine-Tuning (SFT)
    We generated 24,000 combat scenarios via a Python script. Using the TRL SFTTrainer with LoRA (rank=16, alpha=32), the model learned to associate game states with optimal counter-moves.
    Cost/Time: ~5 hours on Modal ($4.00)
  • Stage 2: GRPO (Group Relative Policy Optimization)
    This was the game-changer. We ported the in-game combat resolver to Python to compute verifiable rewards. No human feedback or LLM judge was needed. The model learned to maximize actual combat effectiveness rather than just imitating training data.
    Cost/Time: ~12 hours on Modal ($9.60)
  • Stage 3: DPO (Direct Preference Optimization)
    An optional final polish step that nudges the model's output distribution toward preferred responses.
    Cost/Time: ~2 hours on Modal ($1.60)
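
To make the GRPO stage more concrete: the method samples a group of completions for the same prompt, scores each one with a reward function, and normalizes the rewards within the group so that better-than-average samples get a positive advantage. The sketch below shows only that normalization step. The outcome table and `toy_reward` are hypothetical stand-ins for our ported combat resolver, and the numbers in it are made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical outcome table standing in for the ported combat resolver:
# net HP swing for the NPC when it answers an incoming "jab".
TOY_OUTCOMES_VS_JAB = {"parry": 6.0, "backstep": 2.0, "cross": -1.0, "clinch": -4.0}


def toy_reward(counter_move):
    """Verifiable reward: look up the outcome; unknown output is penalized."""
    return TOY_OUTCOMES_VS_JAB.get(counter_move, -10.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: several completions sampled for the same game state.
sampled_moves = ["parry", "cross", "clinch", "parry", "banana"]
rewards = [toy_reward(m) for m in sampled_moves]
advantages = group_relative_advantages(rewards)
```

Because the reward comes from a deterministic resolver rather than a learned judge, the same completion always receives the same score, which is what makes the reward verifiable.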

⚙️ Model Architecture & Inference Design

| Component | Detail |
| --- | --- |
| Base Model | google/gemma-3-270m-it (text-only, bf16, ~540 MB) |
| Adapter | Sathvik0101/cyber-duel-tiny-users (LoRA, r=16, α=32) |
| Total Memory | ~600 MB of RAM (fits entirely on the free CPU tier) |
| Latency | 2–4 seconds per move on a modern CPU |
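
The ~540 MB figure for the base model follows from simple arithmetic, sketched below. The parameter count is the nominal 270M from the model name, which is an approximation; the gap up to the ~600 MB total is presumably the adapter plus runtime overhead.

```python
PARAMS = 270_000_000   # nominal parameter count taken from the model name
BYTES_PER_PARAM = 2    # bf16 stores each weight in 16 bits

weights_mb = PARAMS * BYTES_PER_PARAM / 1_000_000
print(f"bf16 weights: ~{weights_mb:.0f} MB")  # prints "bf16 weights: ~540 MB"
```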

To keep inference reliable, we format the prompt to mirror the training format exactly, so that the model ends each completion with counter_move: <move>. We also built in a mock mode: if the HF token is missing or the ML dependencies fail to load, the backend falls back to returning random legal moves with a simulated 250 ms delay, which allows UI testing without the model.
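
Here is a minimal sketch of how the completion parsing and the mock fallback could fit together. All function names are hypothetical, and `generate` stands for any callable that maps a prompt to completion text. Falling back to a random move when the completion cannot be parsed is our assumption for this sketch; the post only describes the fallback for a missing token or missing dependencies.

```python
import random
import re
import time

LEGAL_MOVES = (
    "jab", "cross", "low_kick", "roundhouse", "uppercut",
    "parry", "backstep", "clinch", "throw",
)

# Matches a trailing "counter_move: <move>" at the end of a completion.
_COUNTER_RE = re.compile(r"counter_move:\s*([a-z_]+)\s*$")


def parse_counter_move(completion):
    """Return the legal move named at the end of a completion, else None."""
    match = _COUNTER_RE.search(completion.strip().lower())
    if match and match.group(1) in LEGAL_MOVES:
        return match.group(1)
    return None


def mock_predict(delay_s=0.25, rng=random):
    """Mock mode: wait briefly, then return a random legal move."""
    time.sleep(delay_s)
    return rng.choice(LEGAL_MOVES)


def predict(generate, prompt):
    """Use the model if one is available; otherwise fall back to mock mode.

    `generate` is a hypothetical callable (prompt -> completion text), or
    None when the token or ML dependencies are missing.
    """
    if generate is None:
        return mock_predict()
    move = parse_counter_move(generate(prompt))
    # Assumption: an unparseable completion also falls back to a random move.
    return move if move is not None else mock_predict(delay_s=0.0)
```

For example, `predict(lambda p: "reasoning: ...\ncounter_move: parry", "state")` returns `"parry"`, while `predict(None, "state")` returns a random legal move after the simulated delay.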


🚀 Deployment & David vs. Goliath Evaluation

The beauty of this setup is the deployment footprint. A simple Dockerfile grabs Python 3.11, installs the requirements (FastAPI, Gradio, Transformers, PEFT, Torch), and launches the FastAPI app. Because the total footprint is ~600 MB, the whole Space runs on Hugging Face's free cpu-basic tier.

The Showdown: 4B vs 270M

We put our new 270M tiny model into the ring against the original 4B Gemma advisor. Over a 10-round head-to-head match, the 270M variant won by a 19 HP margin. A single match is a small sample, but the result suggests that a well-tuned specialist can out-box a generic heavyweight on a narrowly defined task.

| Aspect | Original Generalist (4B) | Current Specialist (270M + LoRA) |
| --- | --- | --- |
| Model Size | ~8 GB | ~600 MB |
| Hardware Required | GPU | CPU (Free Tier) |
| Boot Time | Slow (GPU queueing) | Instant |
| H2H Result | Lost | Won (+19 HP Margin) |

🎯 Key Takeaways & Where to Find the Code

Three main Hugging Face Spaces are currently utilizing this adapter:

  • Sathvik0101/cyberpunk-duel-ai — The flagship 3D game Space.
  • Sathvik0101/cyber-duel-tiny — A standalone playground for the model.
  • sankalphs/duel — A remix.

What we learned:

  1. Small + Specialized > Large + Generic: Reinforcement learning on a specific combat resolver makes tiny models deadly effective.
  2. Kill the LLM Middleman: Procedural data generation works. No large LLMs were used (or harmed) in generating our training data.
  3. GRPO with Verifiable Rewards Shines: Using the actual game engine to calculate rewards meant the model learned to win, not just imitate.
  4. CPU is Enough: 2 to 4 seconds of inference latency is perfectly acceptable for turn-based combat processing, bringing the hosting cost to exactly $0.
