TRL documentation
OpenReward Integration for Training LLMs with Environments
OpenReward is an open ecosystem for RL environments built on the Open Reward Standard (ORS), a public, language-agnostic HTTP/SSE protocol that defines how an environment exposes its tasks, tools, sessions, and rewards. Because ORS is just a protocol, the same environment can run on the OpenReward platform, be self-hosted on any container service, or run on localhost for development. A catalog of ready-to-use environments is available at openreward.ai.
This guide covers how to integrate OpenReward with TRL. For more on the standard itself, see the ORS docs.
The integration lives at trl.experimental.openreward and is gated behind the trl[openreward] extra. It is lazy-imported, so users who never touch it pay no import cost.
When to use OpenReward environments
GRPOTrainer supports environment-based training via the environment_factory slot — see OpenEnv for the general contract. Use OpenReward when you want to train against an ORS-speaking environment: the OpenReward catalog (e.g. Eigent/SETA, kanishk/EndlessTerminals, nebius/SWE-rebench-V2), an env you self-host on your own infra, or a local server you’re developing.
Installation
pip install trl[openreward]
This installs the openreward Python SDK. The integration itself imports openreward lazily, so users who don’t touch trl.experimental.openreward aren’t affected.
Quick start
The OpenRewardSpec class wires a single ORS environment into the three TRL trainer slots — train_dataset, environment_factory, reward_funcs — by exposing properties that map 1:1 to those kwarg names:
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(
        num_generations=2,
        max_steps=5,
        max_tool_calling_iterations=20,
        log_completions=True,
    ),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
trainer.train()

Under the hood OpenRewardSpec does three things, lazily on first access:
- spec.train_dataset: derives a datasets.Dataset from the env’s task list (one HTTP roundtrip via the SDK). It has at minimum the prompt and task_index columns, plus per-task metadata columns folded in.
- spec.environment_factory: returns a zero-arg callable that produces a fresh per-rollout adapter on each call. The adapter exposes one Python method per ORS tool, with a typed signature and docstring auto-generated from the env’s JSON Schema. TRL’s tool collector picks them up via inspect.getmembers.
- spec.reward_funcs: an outcome-only reward function (last non-null reward in the trajectory) suitable for sparse-reward envs like SETA.
Using a hub environment
Pass an openreward.ai catalog name as the target. The SDK reads OPENREWARD_API_KEY from the environment for authentication.
spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

Using a self-hosted environment
Pass the URL directly. No API key is needed if your server doesn’t enforce one.
spec = OpenRewardSpec("https://my-org-my-env.hf.space", env_name="my_env")

The openreward SDK by default expects a two-subdomain platform layout (api.<host> for stateless calls and sessions.<host> for SSE-based session calls). For single-host self-hosted servers (one URL serving everything), set the override env vars below before constructing OpenRewardSpec:

import os

URL = "https://my-org-my-env.hf.space"
os.environ["OPENREWARD_API_URL"] = URL
os.environ["OPENREWARD_SESSION_URL"] = URL

spec = OpenRewardSpec(URL, env_name="my_env")
Running a minimal environment locally
The fastest way to try the integration end-to-end without external dependencies is a tiny ORS server defined with the openreward SDK’s Environment + Server scaffolding. The example below is a complete echo environment — the model wins by calling echo(text=...) with the task’s target string.
# server.py
from pydantic import BaseModel
from openreward.environments import Environment, JSONObject, Server, TextBlock, ToolOutput, tool


class EchoTaskSpec(BaseModel):
    target: str


class EchoParams(BaseModel):
    text: str


class EchoEnvironment(Environment):
    def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
        super().__init__(task_spec)
        self.config = EchoTaskSpec.model_validate(task_spec)

    @classmethod
    def list_splits(cls) -> list[str]:
        return ["train"]

    @classmethod
    def list_tasks(cls, split: str) -> list[JSONObject]:
        return [{"target": "hello"}, {"target": "world"}]

    def get_prompt(self) -> list[TextBlock]:
        return [TextBlock(type="text", text=f"Echo '{self.config.target}' to win.")]

    @tool
    async def echo(self, params: EchoParams) -> ToolOutput:
        """Submit a string. Reward 1.0 + finished if it matches the target.

        Args:
            text: The string to echo back.
        """
        correct = params.text == self.config.target
        return ToolOutput(
            blocks=[TextBlock(type="text", text="match" if correct else "no match")],
            reward=1.0 if correct else 0.0,
            finished=correct,
        )


if __name__ == "__main__":
    Server([EchoEnvironment]).run(host="0.0.0.0", port=8000)

Run it:
pip install openreward fastapi uvicorn pydantic
python server.py  # listens on :8000

Then point OpenRewardSpec at it (with the URL overrides described above):
import os
URL = "http://127.0.0.1:8000"
os.environ["OPENREWARD_API_URL"] = URL
os.environ["OPENREWARD_SESSION_URL"] = URL
from trl.experimental.openreward import OpenRewardSpec
spec = OpenRewardSpec(URL, env_name="echoenvironment")
print(spec.train_dataset)  # 2 rows, task_index + target columns

This is also the fixture pattern used by TRL’s own tests; see trl-internal-testing/openreward-echo-env for the deployed Space.
Selecting tasks
OpenRewardSpec accepts either a count or an explicit index list:
spec = OpenRewardSpec("Eigent/SETA", num_tasks=10) # first 10 tasks
spec = OpenRewardSpec("Eigent/SETA", indices=[0, 5, 13, 27]) # specific indices
spec = OpenRewardSpec("Eigent/SETA", indices=list(range(50, 100)))  # range

num_tasks and indices are mutually exclusive, and both fetch only the tasks they need (no full task list scan).
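The selection semantics can be illustrated with a small helper. This is a sketch of the behaviour described here, not the library's implementation: `resolve_task_indices` and its `total` argument are hypothetical, and raising `ValueError` when both arguments are given is an assumption.

```python
from typing import Optional


def resolve_task_indices(
    total: int,
    num_tasks: Optional[int] = None,
    indices: Optional[list] = None,
) -> list:
    """Hypothetical helper: turn (num_tasks, indices) into a concrete index list."""
    if num_tasks is not None and indices is not None:
        # Assumed error type; the source only states the two are mutually exclusive.
        raise ValueError("num_tasks and indices are mutually exclusive")
    if indices is not None:
        return list(indices)
    if num_tasks is not None:
        # num_tasks is a cap, so never go past the number of available tasks.
        return list(range(min(num_tasks, total)))
    # Neither given: use every task the env exposes.
    return list(range(total))


print(resolve_task_indices(100, num_tasks=3))
print(resolve_task_indices(100, indices=[0, 5, 13, 27]))
```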
How tool binding works
At construction the spec calls the env’s /tools endpoint to fetch a list of tool specs (each with a name, description, and JSON Schema for arguments). For each tool it generates a Python method on the per-rollout adapter with a typed signature and a docstring derived from the schema. So transformers.utils.get_json_schema and TRL’s inspect.getmembers(env, ismethod) both produce the right tool schema for the model with no per-env wrapper code.
If a tool description contains characters that aren’t safe to splice into Python source, the binder falls back to a sanitized form so binding never fails on real envs.
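To make the binding step concrete, here is a minimal sketch of turning one tool spec into a Python callable with a typed signature and a docstring. It is not the integration's code: `make_tool_method`, the `call_tool` callback, the type mapping, and the spec field names (`name`, `description`, `input_schema`) are assumptions for illustration. It also builds the signature with `inspect.Signature` instead of generating Python source text.

```python
import inspect
from typing import Any, Callable

# Assumed mapping from JSON Schema primitive types to Python annotations.
JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}


def make_tool_method(spec: dict, call_tool: Callable[[str, dict], Any]) -> Callable:
    """Build a function whose signature and docstring mirror a tool's JSON Schema."""
    schema = spec.get("input_schema", {})
    required = set(schema.get("required", []))

    params, doc_args = [], []
    for name, prop in schema.get("properties", {}).items():
        params.append(
            inspect.Parameter(
                name,
                inspect.Parameter.KEYWORD_ONLY,
                default=inspect.Parameter.empty if name in required else None,
                annotation=JSON_TO_PY.get(prop.get("type"), Any),
            )
        )
        doc_args.append(f"    {name}: {prop.get('description', '')}")
    signature = inspect.Signature(params)

    def method(**kwargs):
        # bind() raises TypeError on unknown or missing required arguments.
        bound = signature.bind(**kwargs)
        return call_tool(spec["name"], dict(bound.arguments))

    method.__name__ = spec["name"]
    method.__signature__ = signature
    method.__annotations__ = {p.name: p.annotation for p in params}
    method.__doc__ = spec.get("description", "") + "\n\nArgs:\n" + "\n".join(doc_args)
    return method


# Hypothetical tool spec shaped like the echo tool from the local example.
echo_spec = {
    "name": "echo",
    "description": "Submit a string.",
    "input_schema": {
        "type": "object",
        "properties": {"text": {"type": "string", "description": "The string to echo back."}},
        "required": ["text"],
    },
}
calls = []
echo = make_tool_method(echo_spec, lambda name, args: calls.append((name, args)) or "ok")
print(inspect.signature(echo))
print(echo.__doc__)
print(echo(text="hello"))
```

Because the generated callable carries real annotations and an `Args:` docstring, introspection-based schema extractors can treat it like a hand-written function.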
Reward functions
spec.reward_funcs defaults to an outcome-only reward — for each rollout it returns the last non-null reward observed during the trajectory. This is the right default for sparse-reward envs (e.g. SETA, where only submit_solution returns a non-null reward).
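The outcome-only rule fits in a few lines. The sketch below is an illustrative reimplementation, not TRL's actual code: the helper name `last_non_null_reward` and the 0.0 fallback for trajectories with no reward at all are assumptions.

```python
from typing import Optional, Sequence


def last_non_null_reward(rewards: Sequence[Optional[float]], default: float = 0.0) -> float:
    """Return the last reward that is not None, or `default` if there is none.

    Hypothetical helper: the name and the fallback value are assumptions.
    """
    for reward in reversed(rewards):
        if reward is not None:
            return float(reward)
    return default


# Sparse-reward trajectory: only the final tool call reports a reward.
print(last_non_null_reward([None, None, 1.0]))
# A real 0.0 counts as a reward and is not skipped.
print(last_non_null_reward([None, 0.0, None]))
# No reward observed at all: the assumed fallback applies.
print(last_non_null_reward([]))
```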
If you want a custom reward, write a regular TRL reward function and pass it directly:
def my_reward(environments, **kwargs) -> list[float]:
    return [env.reward * 2.0 for env in environments]  # double the env reward, etc.

trainer = GRPOTrainer(
    ...,
    reward_funcs=my_reward,
)

The per-rollout adapter exposes the running state TRL needs (env.reward, env.rewards, env.metadata, env.finished, env.last_output) for arbitrary post-hoc reward shaping.
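As one example of such shaping, the sketch below subtracts a small penalty per tool call from the final reward. It uses only the `env.reward`, `env.rewards`, and `env.finished` attributes named above. Everything else is an assumption made for illustration: that `env.rewards` holds one entry per tool call, the penalty value, the `step_penalized_reward` name, and the `FakeEnv` stand-in for the real adapter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeEnv:
    """Stand-in for the per-rollout adapter, for illustration only."""
    reward: Optional[float] = None
    rewards: list = field(default_factory=list)
    finished: bool = False


def step_penalized_reward(environments, step_penalty: float = 0.01, **kwargs) -> list:
    """Final env reward minus a per-step penalty; unfinished rollouts score 0.0."""
    scores = []
    for env in environments:
        if not env.finished:
            scores.append(0.0)
            continue
        base = env.reward if env.reward is not None else 0.0
        # Assumes env.rewards has one entry per tool call made in the rollout.
        scores.append(base - step_penalty * len(env.rewards))
    return scores


envs = [
    FakeEnv(reward=1.0, rewards=[None, None, 1.0], finished=True),
    FakeEnv(reward=None, rewards=[None], finished=False),
]
print(step_penalized_reward(envs))
```

Such a function has the same `(environments, **kwargs)` shape as `my_reward` and would be passed via `reward_funcs` in the same way.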
OpenRewardSpec
class trl.experimental.openreward.OpenRewardSpec
(target: str, num_tasks: int | None = None, split: str = 'train', indices: list[int] | None = None, api_key: str | None = None, secrets: dict[str, str] | None = None, env_name: str | None = None, include_metadata: bool = True)
Parameters
- target (str) — Either an openreward.ai catalog name ("Eigent/SETA") or a URL pointing at any ORS server ("https://you-seta.hf.space", "http://localhost:8080"). Auto-detected by the presence of :// in the string.
- num_tasks (int, optional) — Cap on the number of tasks pulled into the dataset. None uses every task the env exposes.
- split (str, optional, defaults to "train") — Which split’s task list to draw from.
- indices (list[int], optional) — Specific task indices to train on. Mutually exclusive with num_tasks. Useful for debugging or curriculum subsets.
- api_key (str, optional) — OPENREWARD_API_KEY override. Only used when target is a catalog name.
- secrets (dict[str, str], optional) — Per-session secrets forwarded to env.session(secrets=).
- env_name (str, optional) — Override for the env name to look up on the server. Rarely needed.
- include_metadata (bool, optional, defaults to True) — Fold per-task metadata (difficulty, category, tags, …) into the dataset rows so reward funcs can read them via TRL’s inputs argument.
Single spec object that wires an ORS environment into a TRL trainer.
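The catalog-name versus URL detection described for `target` amounts to a substring check. A minimal sketch, using the hypothetical helper name `is_url_target` (not part of the library):

```python
def is_url_target(target: str) -> bool:
    """Treat any target containing '://' as a URL; everything else is a catalog name."""
    return "://" in target


for target in ("Eigent/SETA", "http://localhost:8080", "https://you-seta.hf.space"):
    kind = "URL" if is_url_target(target) else "catalog name"
    print(f"{target} -> {kind}")
```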
Limitations
- The integration is in trl.experimental, so APIs may change. Set TRL_EXPERIMENTAL_SILENCE=1 to silence the warning in CI logs.
- Currently exposes a single OpenRewardSpec covering one environment; multi-environment training (à la the OpenEnv “meta-environment” pattern) is not supported yet.
- Long-running rollouts (>15 min per episode) need a keepalive ping, which is not yet wired.
Reference
- Open Reward Standard
- OpenReward platform
- openreward Python SDK
- Echo env Space: trl-internal-testing/openreward-echo-env