OpenEnv Competition
Blog of the OpenEnv Competition
DropOldest channels mean a 2-second-per-step LLM and a 40 ms scripted bot drive the same engine without changes.
Resources: 📄 PDF paper · 💻 Environment (GitHub) · 🧑💻 Training repo · 🎬 Demo video · 🤗 OpenEnv Hub
Real-time strategy games have driven landmark AI achievements: DeepMind's AlphaStar for StarCraft II, OpenAI Five for Dota 2, and earlier work on StarCraft. These results were impressive but built on bespoke neural-network architectures, imitation learning from human replays, and distributed RL across large accelerator clusters. That infrastructure does not generalize.
Meanwhile, LLM agents have become a credible general-purpose paradigm — pretrained world knowledge, natural-language reasoning, high-level semantic actions. Web navigation, code generation, tool use. The natural next question: can a frontier LLM, without any RTS-specific training, hold its own in a real-time strategy game?
The honest answer is: nobody knows yet, because no existing RTS platform actually supports LLM agents. Current platforms assume agents that act at millisecond timescales with low-level action spaces. LLM agents need the opposite — high-level interfaces, asynchronous interaction, and tolerance for variable inference latency that swings from 40 ms to multiple seconds. Trying to bolt an LLM onto SC2LE or PySC2 is possible but ad-hoc, and the resulting baselines are not comparable across papers.
OpenRA-RL is our attempt to close that gap. We picked the classic Westwood RTS Red Alert, as reimplemented in the open-source OpenRA engine, because it has rich strategic depth, a clean codebase we could modify, and a built-in AI ladder for opponents. The result is a platform that lets you point Qwen3, Claude, or a scripted Python bot at the same environment with no scaffolding changes.
OpenRA-RL has three layers: a modified OpenRA engine in C# that ticks the game at ~25 Hz, a gRPC bridge that streams observations and accepts commands, and a Python wrapper that exposes a Gymnasium-style reset / step / close interface via FastAPI. On top of that, an MCP server exposes 50 game actions as tools, so any MCP-compatible LLM client can drive a game.
Three layers: the LLM agent talks to an MCP server, which routes to a Python backend, which talks gRPC to the C# game engine. The same Python env is also a plain OpenEnv environment, so a TRL trainer can drive it without going through MCP at all.
The point of the layering is that agent computation is fully decoupled from game execution. A scripted bot at 40 ms/step and an LLM at 2 s/step both interact with the same 25 Hz engine without disrupting game flow.
A 25 Hz game engine produces an observation every ~40 ms. A single LLM step can take 2 seconds or more. A naïve "step the env, wait for the agent, step the env" loop falls over: either the game stalls waiting for the agent (no longer real-time), or the agent gets buried under a queue of stale observations that grows by roughly 50 entries for every 2-second step.
We used .NET System.Threading.Channels with bounded, non-blocking semantics:
Observation channel (game → agent) — BoundedChannel<GameObservation> with capacity 1 and DropOldest. Every tick the engine writes the latest world state; the channel silently overwrites whatever the agent hasn't read yet. The agent therefore always reads the most recent state, never a stale queue. A fast agent (~40 ms) loses nothing; a slow agent (~2 s) skips ~50 intermediate ticks but still acts on fresh data.
Action channel (agent → game) — BoundedChannel<AgentAction> with capacity 16. Agents often emit 5–10 commands in a single batch (train a unit, move a squad, set a rally point); 16 is enough buffer to never block the agent. The asymmetry matters: dropping a stale observation is harmless (a fresher one is coming), but dropping an action means the agent's intent silently disappears.
Non-blocking guarantee — Both channels use TryWrite. The game thread never waits for the agent. If no action has arrived by tick time, the engine proceeds with a no-op. Game progression is fully independent of agent latency, which is what makes it fair to benchmark a 40 ms scripted bot against a 3 s LLM on the same map.
Observation channel = capacity 1, DropOldest (always-fresh). Action channel = capacity 16 (buffers command batches). The two timing examples on the bottom row show how a fast and slow agent both run against the same 25 Hz engine.
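To make the two channel policies concrete, here is a minimal Python sketch of the same semantics. It is not the engine's C# code: `BoundedChannel`, `try_write`, and `try_read` here are illustrative Python stand-ins, the sketch is single-threaded, and a real implementation would need synchronization between the game thread and the agent.

```python
from collections import deque
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class BoundedChannel(Generic[T]):
    """Non-blocking bounded channel (illustrative stand-in, not thread-safe)."""

    def __init__(self, capacity: int, drop_oldest: bool) -> None:
        self._items: deque = deque()
        self._capacity = capacity
        self._drop_oldest = drop_oldest

    def try_write(self, item: T) -> bool:
        """Never blocks. Returns False only when the item is rejected."""
        if len(self._items) >= self._capacity:
            if not self._drop_oldest:
                return False           # full: the caller must handle rejection
            self._items.popleft()      # evict the stalest entry
        self._items.append(item)
        return True

    def try_read(self) -> Optional[T]:
        """Never blocks. Returns None when the channel is empty."""
        return self._items.popleft() if self._items else None


TICK_HZ = 25
observations = BoundedChannel(capacity=1, drop_oldest=True)
actions = BoundedChannel(capacity=16, drop_oldest=False)

# The engine keeps ticking for 2 seconds while a slow agent is thinking.
for tick in range(1, 2 * TICK_HZ + 1):
    observations.try_write({"tick": tick})

# The agent wakes up and sees only the newest state (tick 50).
latest = observations.try_read()
```

The asymmetry shows up in the return value: a full observation channel always accepts the write by evicting the old entry, while a full action channel in this sketch reports the rejection so the intent is never lost silently.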
Training and large-scale evaluation need many concurrent games. Our v1 design spawned a separate dotnet OpenRA.dll process per game. At 64 sessions: ~40 GB RAM, 5–15 seconds per reset. Unworkable.
In v2 we moved to a single .NET process hosting up to 64 sessions. The trick is that ModData (unit stats, building attributes, tech trees, map rules) is immutable after init — load it once, share it lock-free across sessions. That alone reclaims ~35 GB. Each session keeps its own World, OrderManager, and BotBridge, isolated from the others.
| Metric | Legacy (v1) | Multi-session (v2) | Improvement |
|---|---|---|---|
| Reset latency | 5–15 s | 256 ms | ~40× |
| RSS (64 sessions) | ~40 GB | ~6 GB | ~7× |
| JIT compilations | 64× | 1× | 64× |
| Active threads | ~200 | ~20 | ~10× |
| Aggregate ticks/sec | ~8 K | ~15 K | ~2× |
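The sharing pattern behind the memory saving can be sketched in Python as follows. This is an illustration only: the field names, the `example_unit` entry, and its cost are placeholders rather than real OpenRA data, and the actual engine does this in C#.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ModData:
    """Read-only rules shared by every session (hypothetical fields)."""
    unit_costs: Mapping[str, int]


@dataclass
class Session:
    """Per-session mutable state; only the ModData reference is shared."""
    session_id: int
    mod_data: ModData
    tick: int = 0
    cash: int = 5000


def load_mod_data() -> ModData:
    # Placeholder values; the real rules come from the game's mod files.
    return ModData(unit_costs=MappingProxyType({"example_unit": 100}))


shared = load_mod_data()                             # loaded once
sessions = [Session(i, shared) for i in range(64)]   # 64 isolated worlds

# Mutating one session leaves the others and the shared rules untouched.
sessions[0].cash -= shared.unit_costs["example_unit"]
```

Because the shared object cannot be mutated after construction, readers need no lock, which is the property the multi-session design relies on.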
One subtle gotcha worth flagging for anyone doing similar work: don't share the .NET ThreadPool between game ticks and gRPC handlers. We did this in an early v2 and saw 0/16 sessions complete — game-tick tasks saturated the pool, gRPC handlers starved, and the platform deadlocked itself. We now run game ticks on a dedicated BlockingCollection<WorkItem> worker pool sized to the CPU count, separate from the gRPC pool. Per-session SemaphoreSlim(1,1) serializes mutations within a world; sessions tick in parallel. When the worker queue fills up, we return gRPC RESOURCE_EXHAUSTED for natural backpressure instead of unbounded queue growth.
v2: one process, shared ModData, dedicated worker pool, per-session semaphore. A single gRPC channel routes by session_id.
v1, for comparison: 64 separate .NET processes, each paying the JIT and ModData tax.
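The scheduling idea (a dedicated bounded work queue, one lock per session, and rejection instead of unbounded growth) can be sketched in Python. This is a hedged analogue, not the engine's implementation: `TickWorkerPool` and `ResourceExhausted` are hypothetical names, `queue.Queue` stands in for `BlockingCollection<WorkItem>`, and `threading.Lock` stands in for `SemaphoreSlim(1,1)`.

```python
import os
import queue
import threading


class ResourceExhausted(Exception):
    """Stand-in for a gRPC RESOURCE_EXHAUSTED status."""


class TickWorkerPool:
    """Dedicated workers for game ticks, separate from any request pool."""

    def __init__(self, workers: int, max_pending: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._locks: dict = {}
        self._locks_guard = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _lock_for(self, session_id: int) -> threading.Lock:
        with self._locks_guard:
            return self._locks.setdefault(session_id, threading.Lock())

    def submit(self, session_id: int, work) -> None:
        """Enqueue without blocking; reject when the queue is full."""
        try:
            self._queue.put_nowait((session_id, work))
        except queue.Full:
            raise ResourceExhausted("tick queue is full") from None

    def _run(self) -> None:
        while True:
            session_id, work = self._queue.get()
            try:
                with self._lock_for(session_id):   # serialize within a session
                    work()
            finally:
                self._queue.task_done()

    def drain(self) -> None:
        """Block until every queued item has been processed."""
        self._queue.join()


pool = TickWorkerPool(workers=os.cpu_count() or 1, max_pending=256)
counters = {sid: 0 for sid in range(8)}


def make_tick(sid: int):
    def tick() -> None:
        counters[sid] += 1     # safe: guarded by the per-session lock
    return tick


for _ in range(10):
    for sid in counters:
        pool.submit(sid, make_tick(sid))
pool.drain()
```

Different sessions run in parallel on different workers, while two work items for the same session never overlap.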
The environment is an explicit 8-state machine: IDLE → LAUNCHING → LOADING → CONNECTING → STREAMING → PLAYING → GAME_OVER → CLEANUP. On reset() the system walks the chain through process startup and map loading; the Python BridgeClient retries GetState() up to 120 times until the gRPC channel is up. Two explicit error paths (TIMEOUT and CONN_LOST) trigger an immediate abort + cleanup so we never leave resources in a half-broken state. A health-check endpoint independently verifies daemon liveness and gRPC connectivity, and restarts the daemon if either check fails.
Eight states from IDLE to CLEANUP, plus two error transitions and a separate replay-playback path that reuses the same machinery.
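As a rough illustration, the lifecycle can be modelled with an enum and a bounded retry loop. The state and error names come from the description above; `next_state`, `abort`, and `wait_until_ready` are hypothetical helpers, and the retry delay is an assumption since the post only gives the attempt count.

```python
import time
from enum import Enum, auto


class EnvState(Enum):
    IDLE = auto()
    LAUNCHING = auto()
    LOADING = auto()
    CONNECTING = auto()
    STREAMING = auto()
    PLAYING = auto()
    GAME_OVER = auto()
    CLEANUP = auto()


class EnvError(Enum):
    TIMEOUT = auto()
    CONN_LOST = auto()


HAPPY_PATH = list(EnvState)   # definition order matches the chain


def next_state(state: EnvState) -> EnvState:
    """Advance one step along the happy path."""
    if state is EnvState.CLEANUP:
        raise ValueError("CLEANUP is the last state in this sketch")
    return HAPPY_PATH[HAPPY_PATH.index(state) + 1]


def abort(state: EnvState, error: EnvError) -> EnvState:
    """Either error path jumps straight to CLEANUP from any state."""
    return EnvState.CLEANUP


def wait_until_ready(get_state, max_attempts: int = 120, delay_s: float = 0.0) -> int:
    """Poll until get_state() stops raising; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            get_state()
            return attempt
        except ConnectionError:
            time.sleep(delay_s)   # assumed fixed delay between attempts
    raise TimeoutError(f"not ready after {max_attempts} attempts")


# Walk the full chain from IDLE to CLEANUP.
state = EnvState.IDLE
visited = [state]
while state is not EnvState.CLEANUP:
    state = next_state(state)
    visited.append(state)
```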
Every game is recorded as a deterministic .orarep replay file: orders + random seed, perfectly reproducible tick-by-tick via a ReplayConnection reader. Replays embed the Docker image version that produced them, so playback fidelity survives engine upgrades. Watching is via browser-based noVNC inside the Docker container (openra-rl replay watch) — no local install, no graphics drivers, works from a headless cloud box. Replays double as benchmark evidence: when you upload a result to the OpenRA-Bench leaderboard, the .orarep is attached and anyone can re-verify the game.
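The reason orders plus a seed are enough to reproduce a game can be shown with a toy simulation. This sketch does not read `.orarep` files or use `ReplayConnection`; `simulate` and the dictionary layout are invented for illustration, and the "game state" is a single integer.

```python
import hashlib
import random
from typing import Dict


def simulate(seed: int, orders: Dict[int, int], ticks: int) -> str:
    """Run a toy deterministic simulation and return a hash of every tick."""
    rng = random.Random(seed)      # all randomness flows from the seed
    state = 0
    digest = hashlib.sha256()
    for tick in range(ticks):
        state += rng.randint(0, 9)       # seeded in-game randomness
        state += orders.get(tick, 0)     # a recorded order applied on its tick
        digest.update(f"{tick}:{state};".encode())
    return digest.hexdigest()


# A toy "replay": the seed, the orders keyed by tick, and the length.
replay = {"seed": 0, "orders": {4: 100, 55: 300}, "ticks": 600}

live = simulate(replay["seed"], replay["orders"], replay["ticks"])
replayed = simulate(replay["seed"], replay["orders"], replay["ticks"])
tampered = simulate(replay["seed"], {4: 101, 55: 300}, replay["ticks"])
```

Here `live` and `replayed` are identical tick by tick, while changing a single order produces a different hash, which is the property that makes a replay usable as verifiable benchmark evidence.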
OpenRA-RL ships as a first-class OpenEnv environment. OpenEnv is the emerging PyTorch-native standard for RL environment authoring + distribution: a typed reset / step contract, structured observation/action spaces, and a Hugging Face Hub layer for discovery. Authors publish once, trainers consume anywhere, with no environment-specific glue.
Concretely, this means:
Most existing OpenEnv environments target narrow, short-horizon tasks: code execution, single-turn tool use, small-scale games. OpenRA-RL extends the standard into the long-horizon, adversarial, real-time, combinatorial-action regime with variable agent latency. The async decoupling, the multi-session runner, and the deterministic replay format are reusable design patterns for anyone authoring a similarly complex OpenEnv environment.
To exercise every design surface end-to-end, we ran a Qwen3 32B agent served locally via Ollama against the built-in Beginner AI on a 128×128 Allied map. The agent gets structured observations as tool responses and issues actions through the MCP tool set, including a pre-game planning phase and a post-game reflection step whose extracted lessons are injected into the next episode's system prompt.
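The reflection loop amounts to carrying text from one episode into the next prompt. A minimal sketch, assuming lessons are kept as plain strings: `build_system_prompt`, the `max_lessons` cap, and the prompt wording are hypothetical and not taken from the training repo.

```python
from typing import List


def build_system_prompt(base_prompt: str, lessons: List[str], max_lessons: int = 5) -> str:
    """Append the most recent reflection lessons to the base system prompt."""
    if not lessons:
        return base_prompt
    recent = lessons[-max_lessons:]      # cap the prompt growth (an assumption)
    bullets = "\n".join(f"- {lesson}" for lesson in recent)
    return f"{base_prompt}\n\nLessons from earlier episodes:\n{bullets}"


base = "You are playing Red Alert."
lessons: List[str] = []

# After an episode, the reflection step contributes a lesson (example text).
lessons.append("Build a Power Plant before a War Factory.")
prompt = build_system_prompt(base, lessons)
```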
Five episodes, two timing regimes — Games 1–2 with a 30-minute limit, Games 3–5 with a 5-minute limit — to show the platform supports variable episode lengths in a single experiment.
| Game | Duration | Ticks | Assets | Bldgs | Army | Explored | Calls |
|---|---|---|---|---|---|---|---|
| 1† | 30:23 | 1621 | $6,600 | 5 | $2,920 | 3.7% | 62 |
| 2† | 30:15 | 1477 | $4,000 | 3 | $2,340 | 2.7% | 81 |
| 3 | 5:01 | 540 | $2,800 | 3 | $640 | 2.7% | 18 |
| 4 | 5:19 | 509 | $2,300 | 2 | $540 | 2.2% | 19 |
| 5 | 5:17 | 621 | $2,800 | 3 | $740 | 2.7% | 21 |
† 30-minute limit. Games 3–5 use 5 minutes.
All five episodes ended in a draw at the time limit with zero combat engagement. The agent successfully bootstrapped an economy in every game but never produced an offensive force.
That zero-combat result is the actual interesting finding, and a scalar win/loss metric would have flattened it. Look at the multi-dimensional reward instead:
Left: per-dimension scores across all 5 games. Right: Game 1's skill profile as a radar plot. The agent registers non-trivial scores on economy, infrastructure, and tempo but zero on combat and disruption — a precise failure mode you can target for reward shaping or curriculum design.
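A small sketch of why the vector is more informative than the scalar. The dimension names are the five mentioned in this post; the numeric scores are placeholders chosen for illustration, not measured values, and `scalar_outcome` and `zero_dimensions` are hypothetical helpers.

```python
from typing import Dict, List


def scalar_outcome(result: str) -> float:
    """Classic win/loss signal: every draw looks the same."""
    return {"win": 1.0, "draw": 0.0, "loss": -1.0}[result]


def zero_dimensions(reward: Dict[str, float]) -> List[str]:
    """Names of the reward dimensions where the agent scored nothing."""
    return sorted(name for name, score in reward.items() if score == 0.0)


# Placeholder scores for one drawn game (illustrative only).
game_reward = {
    "economy": 0.58,
    "infrastructure": 0.5,
    "tempo": 0.5,
    "combat": 0.0,
    "disruption": 0.0,
}

print(scalar_outcome("draw"))         # 0.0, no hint about what went wrong
print(zero_dimensions(game_reward))   # ['combat', 'disruption']
```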
The build-order timelines confirm what the reward vector says: Power Plant, Barracks, sometimes Refinery — but unit production lags badly, and no offensive units reach the field before time runs out.
One row per episode, reconstructed from the deterministic .orarep replay files. Note the x-axes are not shared — Games 1–2 span ~1500 ticks (30 min), Games 3–5 span ~600 ticks (5 min).
The tool-call distribution explains why an LLM can play a 25 Hz game at all:
Left: by category. Right: top 10 individual tools. The dominance of advance (~57% of all calls) is the agent compressing idle game time by skipping ticks — exactly what the async architecture is designed for.
And the macro indicators show the agent's economic management is consistent across the two horizon regimes once you normalize for time:
Left: end-of-game asset value (drops in shorter games, as expected). Centre: per-tick-normalized economy reward (0.58–0.80 across all five games). Right: percent of map explored. The economy quality is steady whether the game runs 5 minutes or 30 — the only thing that changes is how much absolute output the agent gets.
To give you a sense of what the agent's decision loop actually looks like, here are the first 10 turns of Game 5 verbatim from the run log:
```text
Turn 1  | get_faction_briefing()
Turn 2  | end_planning_phase(strategy="Deploy MCV, build Power
          Plant (300 ticks), then Barracks/War Factory (500/750),
          train E1 + 1tnk, attack AI base at (64,64).")
          -> tick=2, cash=$5000, units=1, buildings=0
Turn 3  | deploy_unit(unit_id=120)
          -> tick=4, cash=$5000, units=1, buildings=0
Turn 4  | build_and_place(building_type="apwr")
Turn 5  | advance(ticks=100)
          -> tick=55, cash=$5000, units=0, buildings=1
Turn 6  | build_and_place(building_type="apwr")
Turn 7  | build_and_place(building_type="powr")
          -> tick=57, cash=$5000, units=0, buildings=1
Turn 8  | advance(ticks=60)
          -> tick=108, cash=$4923, units=0, buildings=1
Turn 9  | advance(ticks=130)
          -> tick=159, cash=$4838, units=0, buildings=1
Turn 10 | advance(ticks=80)
          -> tick=210, cash=$4753, units=0, buildings=1
```
You can see the three-phase rhythm clearly: intel + planning, build the economy, then advance to bridge the gap between LLM latency and game speed.
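The same kind of tool-call tally can be reproduced on this excerpt alone. The sketch below hard-codes the ten tool names from the turns above; the variable names are ours, and the 40% share applies only to these opening turns, not to the full five-game run.

```python
from collections import Counter

# Tool names for Turns 1-10 of Game 5, in order, copied from the log.
turns = [
    "get_faction_briefing",
    "end_planning_phase",
    "deploy_unit",
    "build_and_place",
    "advance",
    "build_and_place",
    "build_and_place",
    "advance",
    "advance",
    "advance",
]

counts = Counter(turns)
advance_share = counts["advance"] / len(turns)

for tool, n in counts.most_common():
    print(f"{tool:22s} {n}")
print(f"advance share: {advance_share:.0%}")   # 40% of these ten turns
```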
The full minimal example — instantiate, reset, step, close — using the standard OpenEnv contract:
```python
from openra_env.config import load_config
from openra_env.server.openra_environment import OpenRAEnvironment
from openra_env.models import ActionType, CommandModel, OpenRAAction

# 1. Configure and instantiate the environment.
config = load_config(game={
    "grpc_port": 8000,
    "map_name": "tank-duel-basic",
    "headless": True,
})
env = OpenRAEnvironment(config=config)

# 2. Reset; obs is a structured observation
#    (economy, military, unit/building lists, 9-channel spatial map).
obs = env.reset(seed=0)

# 3. Issue a structured action: one or more CommandModel entries
#    drawn from 21 ActionType values (MOVE, ATTACK, BUILD, TRAIN, DEPLOY, ...).
action = OpenRAAction(commands=[
    CommandModel(action=ActionType.BUILD, item_type="powr"),
])
obs = env.step(action)

# 4. Close; this finalizes the .orarep replay file.
env.close()
```
The same env.step is what gets called whether you're running a scripted bot, an MCP-tool-using LLM agent, or a TRL-driven GRPO training loop. The bridge translates MCP tool calls into the same OpenRAAction shape before forwarding. Both paths share observation and reward — which is what makes the same env usable by an in-the-loop LLM and a weight-updating RL trainer without rewiring.
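To illustrate what "the same OpenRAAction shape" means, here is a self-contained sketch of the translation step. It deliberately does not import `openra_env`: the classes below are local stand-ins that mirror the names used above, `tool_call_to_action` is a hypothetical function, the `unit_id` field name is an assumption, and the real bridge may map tools to commands differently (for example, `build_and_place` presumably also handles placement).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class ActionType(Enum):          # stand-in with two of the 21 values
    BUILD = "build"
    DEPLOY = "deploy"


@dataclass
class CommandModel:              # stand-in for openra_env.models.CommandModel
    action: ActionType
    item_type: Optional[str] = None
    unit_id: Optional[int] = None    # hypothetical field name


@dataclass
class OpenRAAction:              # stand-in for openra_env.models.OpenRAAction
    commands: List[CommandModel] = field(default_factory=list)


def tool_call_to_action(name: str, arguments: Dict[str, Any]) -> OpenRAAction:
    """Translate one MCP-style tool call into a structured action."""
    if name == "build_and_place":
        command = CommandModel(ActionType.BUILD, item_type=arguments["building_type"])
    elif name == "deploy_unit":
        command = CommandModel(ActionType.DEPLOY, unit_id=arguments["unit_id"])
    else:
        raise ValueError(f"unsupported tool in this sketch: {name}")
    return OpenRAAction(commands=[command])


# A tool call from the LLM path and a hand-built action from the scripted
# path end up as the same object, so both can go through the same step().
from_llm = tool_call_to_action("build_and_place", {"building_type": "powr"})
from_script = OpenRAAction(commands=[CommandModel(ActionType.BUILD, item_type="powr")])
```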
If you want to run an LLM agent end-to-end, it takes two commands:
```shell
pip install openra-rl
openra-rl play   # interactive wizard for OpenRouter / Ollama / LM Studio
```
For the full install / Docker / training paths, see the GitHub README.
We do not claim a winning agent. We claim a research testbed, and the five-episode run validates five things about it:
The environment is strategically deep. A frontier LLM playing the simplest opponent went 0–0–5 with zero engagements. That's not a platform failure — it's evidence that even tutorial-tier Red Alert requires real strategic reasoning (build order, army composition, attack timing) that prompt-driven LLMs do not yet capture. The gap is the headroom an RL testbed needs.
The multi-dimensional reward localizes weakness. A win/loss scalar collapses all five games into "draw." The 8-D vector says combat = 0, disruption = 0, economy = 0.58–0.80, infrastructure = high — a concrete target for reward shaping and curriculum work.
The async architecture is load-bearing. 57% of tool calls are advance. Without DropOldest observations and a non-blocking action channel, a 2-second LLM cannot meaningfully play a 25 Hz game. The async design is what makes the LLM a first-class citizen, not a workaround.
In-context reflection helps, but is not enough. Episode 2's reflection diagnoses a build-order mistake ("War Factory before Power Plant"); by Episode 4 the pre-game plan opens with a Power Plant. Injecting reflected lessons into the next prompt fixes build order, but it does not close the combat gap. That's exactly the kind of environment where the jump from in-context adaptation to weight-updating RL should measurably matter.
It plugs straight into the OpenEnv ecosystem. Every behaviour above (planning, acting, rewarding, reflecting, replaying) is exposed through the standard OpenEnv interface. Pointing TRL / torchforge / Unsloth at the env's Hub identifier requires no environment-side changes.
We're releasing OpenRA-RL as open-source software and inviting the community to push on it. Concretely interesting next steps from where we stand today:
If you build on top of it, we'd love to hear from you in the GitHub issues.