Duel of Albion
How a Tiny 270M Gemma 3 Model Powers a Real-Time 3D Fighter
By Sathvik A R and Sankalp H S
If you've ever wondered whether you really need a massive, GPU-hungry language model to power smart NPCs, the short answer is: no, you don't.
Welcome to Duel of Albion, our new Hugging Face Space: a real-time 3D fighting game where the NPC opponent is driven entirely by a fine-tuned Gemma 3 270M model. We replaced our older, heavier 4B Gemma model with this specialized 270M LoRA adapter (Sathvik0101/cyber-duel-tiny-users). The result? The tiny NPC beat its 4B predecessor in a 10-round head-to-head match by a 19 HP margin, and it does so while running on a free CPU tier.
Here’s a deep dive into how we built it, trained it, and optimized it for real-time inference.
🏗️ The Architecture: Three Core Pillars
To make this work seamlessly on Hugging Face Spaces, we broke the architecture into three distinct layers:
- The 3D Game Engine: We bundled a React + Three.js fighting game into a single 693 kB 3d_scene.html file. It handles real-time 3D rendering, character movement, and combat animations, running as a fullscreen iframe.
- The AI NPC Backend: A FastAPI server (app.py) loads the base Gemma 3 270M model (google/gemma-3-270m-it) alongside our custom LoRA adapter using PEFT. It exposes a /predict endpoint to evaluate game states.
- The Gradio Shell: Since we are hosting on Hugging Face Spaces, we wrapped the app in Gradio. We used heavy CSS overrides to strip away the Gradio UI chrome, leaving only a pixel-perfect iframe of the game.
The Request/Response Flow
When you throw a punch, here is exactly what happens under the hood:
Player input (keyboard/gamepad)
→ 3D Scene JS collects last 5 moves + game state
→ POST /predict { sequence, player, npc, round, distance }
→ Fine-tuned Gemma 3 270M + LoRA generates a counter-move
→ Returns { reasoning, counterMove, sequence }
→ Game engine applies the NPC response in real-time
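The client half of this flow lives in the 3D scene's JavaScript, but its logic can be sketched in Python for illustration. This is a minimal sketch, not the game's actual code: `MoveHistory` and `build_predict_payload` are hypothetical names, and the ordering of the sequence (oldest move first) is an assumption. The field names follow the request shape shown above.

```python
from collections import deque
import json

WINDOW = 5  # the flow above sends the last 5 player moves


class MoveHistory:
    """Rolling buffer of the most recent player moves (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        # deque with maxlen silently drops the oldest entry when full
        self._moves = deque(maxlen=window)

    def record(self, move):
        self._moves.append(move)

    def as_sequence(self):
        # Assumption: oldest move first, comma-separated
        return ",".join(self._moves)


def build_predict_payload(history, player, npc, round_number, distance):
    """Assemble the JSON body for POST /predict."""
    return {
        "sequence": history.as_sequence(),
        "player": player,
        "npc": npc,
        "round": round_number,
        "distance": distance,
    }


history = MoveHistory()
for move in ["parry", "jab", "cross", "low_kick", "uppercut", "roundhouse"]:
    history.record(move)  # the sixth move pushes "parry" out of the window

fighter = {"name": "ronin", "speed": 3, "power": 3, "range": 3,
           "weight": 1.0, "stance": "balanced", "stamina": 100, "hp": 100}
payload = build_predict_payload(history, fighter, dict(fighter), 3, "close")
print(json.dumps(payload, indent=2))
```

A fixed-size window keeps the prompt length bounded no matter how long the fight runs, which matters when every extra token costs CPU time.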
The API Payload
The model processes 9 legal moves: jab, cross, low_kick, roundhouse, uppercut, parry, backstep, clinch, and throw. Here is what the rich state payload looks like:
{
"sequence": "jab,cross,low_kick,uppercut,roundhouse",
"player": {
"name": "ronin", "speed": 3, "power": 3, "range": 3,
"weight": 1.0, "stance": "balanced", "stamina": 100, "hp": 100
},
"npc": {
"name": "ronin", "speed": 3, "power": 3, "range": 3,
"weight": 1.0, "stance": "balanced", "stamina": 100, "hp": 100
},
"round": 3,
"distance": "close"
}
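Since the model only knows the nine moves listed above, it can help to reject malformed requests before they reach it. The following is a minimal sketch of such a check, assuming the payload shape shown above; `validate_payload` and `FIGHTER_KEYS` are hypothetical names, and the real backend may validate differently (or not at all).

```python
LEGAL_MOVES = {"jab", "cross", "low_kick", "roundhouse", "uppercut",
               "parry", "backstep", "clinch", "throw"}
FIGHTER_KEYS = {"name", "speed", "power", "range",
                "weight", "stance", "stamina", "hp"}


def validate_payload(payload):
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    for key in ("sequence", "player", "npc", "round", "distance"):
        if key not in payload:
            errors.append(f"missing field: {key}")
    if errors:
        return errors  # no point checking contents of absent fields

    # Every move in the comma-separated sequence must be a legal move
    moves = [m.strip() for m in payload["sequence"].split(",") if m.strip()]
    for move in moves:
        if move not in LEGAL_MOVES:
            errors.append(f"illegal move: {move}")

    # Both fighters must carry the full stat block
    for side in ("player", "npc"):
        missing = FIGHTER_KEYS - set(payload[side])
        if missing:
            errors.append(f"{side} missing: {sorted(missing)}")
    return errors


fighter = {"name": "ronin", "speed": 3, "power": 3, "range": 3,
           "weight": 1.0, "stance": "balanced", "stamina": 100, "hp": 100}
request = {"sequence": "jab,cross,headbutt", "player": fighter,
           "npc": dict(fighter), "round": 3, "distance": "close"}
print(validate_payload(request))
```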
🧠 The Training Pipeline (SFT → GRPO → DPO)
The secret sauce behind the tiny model's success is its training pipeline. We didn't use an LLM teacher or distillation; the adapter was trained entirely on procedurally generated data over three stages:
- Stage 1: Supervised Fine-Tuning (SFT)
  We generated 24,000 combat scenarios via a Python script. Using the TRL SFTTrainer with LoRA (rank=16, alpha=32), the model learned to associate game states with optimal counter-moves.
  Cost/Time: ~5 hours on Modal ($4.00)
- Stage 2: GRPO (Group Relative Policy Optimization)
  This was the game-changer. We ported the in-game combat resolver to Python to compute verifiable rewards. No human feedback or LLM judge was needed. The model learned to maximize actual combat effectiveness rather than just imitating training data.
  Cost/Time: ~12 hours on Modal ($9.60)
- Stage 3: DPO (Direct Preference Optimization)
  An optional, final polish step to align the model's output distribution perfectly.
  Cost/Time: ~2 hours on Modal ($1.60)
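To make the Stage 2 idea concrete, here is a toy sketch of how a rule-based reward feeds GRPO's group-relative advantage. Everything combat-related in it is invented for illustration: `TOY_OUTCOMES` is a hypothetical stand-in for the ported resolver, whose real rules are not described here. The normalization step (reward minus group mean, divided by group standard deviation) is the core GRPO idea; implementations differ on details such as the standard deviation estimator and an added epsilon.

```python
import statistics

# Hypothetical stand-in for the ported combat resolver: net HP swing in the
# NPC's favor for (player_move, counter_move). These numbers are made up.
TOY_OUTCOMES = {
    ("jab", "parry"): 5.0,
    ("jab", "backstep"): 0.0,
    ("jab", "uppercut"): -4.0,
    ("jab", "clinch"): 1.0,
}


def reward(player_move, counter_move):
    """Verifiable reward: computed by rules, not by a human or an LLM judge."""
    # Assumption: unknown or illegal outputs receive a flat penalty
    return TOY_OUTCOMES.get((player_move, counter_move), -10.0)


def group_relative_advantages(rewards):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All samples scored the same, so none is preferred over another
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# One prompt (the player threw a jab), four completions sampled from the policy
completions = ["parry", "backstep", "uppercut", "clinch"]
rewards = [reward("jab", c) for c in completions]
advantages = group_relative_advantages(rewards)
for move, adv in zip(completions, advantages):
    print(f"{move:>9}: {adv:+.2f}")
```

Completions that score above their group's average get a positive advantage and are reinforced; those below are discouraged. No separate value model is needed, which keeps the training setup small.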
⚙️ Model Architecture & Inference Design
| Component | Detail |
|---|---|
| Base Model | google/gemma-3-270m-it (text-only, bf16, ~540 MB) |
| Adapter | Sathvik0101/cyber-duel-tiny-users (LoRA, r=16, α=32) |
| Total Memory | ~600 MB RAM (fits entirely on free CPU tier) |
| Latency | 2–4 seconds per move on modern CPU |
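The ~540 MB figure for the base model follows from simple arithmetic: parameter count times bytes per parameter. A quick check, assuming the nominal 270M parameter count from the model name and decimal megabytes:

```python
PARAMS = 270_000_000   # nominal parameter count, taken from the model name
BYTES_PER_PARAM = 2    # bf16 stores each weight in 16 bits

weights_mb = PARAMS * BYTES_PER_PARAM / 1_000_000  # decimal megabytes
print(f"bf16 weights: ~{weights_mb:.0f} MB")  # prints "bf16 weights: ~540 MB"
```

The gap between this and the ~600 MB total is presumably the adapter plus runtime overhead; the table does not break it down further.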
To keep inference reliable, we format the prompt to exactly mirror the format used in training, so that the model ends its completions with counter_move: <move>. We also built in a mock mode: if the HF token is missing or the ML dependencies fail to load, the backend falls back to returning random legal moves with a simulated 250 ms delay, which allows UI testing without the model.
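The output contract and the mock fallback can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in app.py: `parse_completion`, `mock_predict`, and `respond` are hypothetical names, the regular expression is one plausible way to read the trailing counter_move: <move> line, and falling back to a mock move when a completion fails to parse is an extra safeguard added for this sketch (the mock mode described above triggers on a missing token or failed imports).

```python
import random
import re
import time

LEGAL_MOVES = ["jab", "cross", "low_kick", "roundhouse", "uppercut",
               "parry", "backstep", "clinch", "throw"]

# Matches a trailing "counter_move: <move>" at the end of the completion
_COUNTER_RE = re.compile(r"counter_move:\s*([a-z_]+)\s*$", re.IGNORECASE)


def parse_completion(completion):
    """Split a completion into (reasoning, move); None if no legal move found."""
    text = completion.strip()
    match = _COUNTER_RE.search(text)
    if match is None:
        return None
    move = match.group(1).lower()
    if move not in LEGAL_MOVES:
        return None
    return text[:match.start()].strip(), move


def mock_predict(sequence, delay_s=0.25):
    """Mock mode: a random legal move after a simulated 250 ms delay."""
    time.sleep(delay_s)
    return {"reasoning": "mock mode", "counterMove": random.choice(LEGAL_MOVES),
            "sequence": sequence}


def respond(completion, sequence, delay_s=0.25):
    """Build the /predict response; completion=None means no model is loaded."""
    parsed = parse_completion(completion) if completion is not None else None
    if parsed is None:
        return mock_predict(sequence, delay_s)
    reasoning, move = parsed
    return {"reasoning": reasoning, "counterMove": move, "sequence": sequence}


print(respond("Player keeps jabbing high.\ncounter_move: parry", "jab,jab,cross"))
```

Checking the parsed move against the legal list means the game engine never receives a move it cannot animate, even if the model drifts off format.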
🚀 Deployment & David vs. Goliath Evaluation
The beauty of this setup is the deployment footprint. A simple Dockerfile grabs Python 3.11, installs the requirements (FastAPI, Gradio, Transformers, PEFT, Torch), and launches the FastAPI app. Because the total footprint is ~600 MB, the whole Space runs on Hugging Face's free cpu-basic tier.
The Showdown: 4B vs 270M
We put our new 270M tiny model into the ring against the original 4B Gemma advisor. Over a 10-round head-to-head match, the 270M variant won by a 19 HP margin. A single match is a small sample, but it suggests that a well-tuned specialist can out-box a generic heavyweight on a narrowly defined task.
| Aspect | Original Generalist (4B) | Current Specialist (270M + LoRA) |
|---|---|---|
| Model Size | ~8 GB | ~600 MB |
| Hardware Required | GPU | CPU (Free Tier) |
| Boot Time | Slow (GPU queueing) | Instant |
| H2H Result | Lost | Won (+19 HP Margin) |
🎯 Key Takeaways & Where to Find the Code
Three main Hugging Face Spaces are currently utilizing this adapter:
- Sathvik0101/cyberpunk-duel-ai: The flagship 3D game Space.
- Sathvik0101/cyber-duel-tiny: A standalone playground for the model.
- sankalphs/duel: A remix.
What we learned:
- Small + Specialized > Large + Generic: Reinforcement learning on a specific combat resolver makes tiny models deadly effective.
- Kill the LLM Middleman: Procedural data generation works. No large LLMs were used (or harmed) in generating our training data.
- GRPO with Verifiable Rewards Shines: Using the actual game engine to calculate rewards meant the model learned to win, not just imitate.
- CPU is Enough: 2 to 4 seconds of inference latency is perfectly acceptable for turn-based combat processing, bringing the hosting cost to exactly $0.