OpenReward Integration for Training LLMs with Environments

OpenReward is an open ecosystem for RL environments built on the Open Reward Standard (ORS) — a public, language-agnostic HTTP/SSE protocol for how an environment exposes its tasks, tools, sessions, and rewards. Because ORS is just a protocol, the same environment can run on the OpenReward platform, self-hosted on any container service, or locally on localhost for development. A catalog of ready-to-use environments is available at openreward.ai.

This guide covers how to integrate OpenReward with TRL. For more on the standard itself, see the ORS docs.

The integration lives at trl.experimental.openreward and is gated behind the trl[openreward] extra (lazy-imported — non-users pay nothing).

When to use OpenReward environments

GRPOTrainer supports environment-based training via the environment_factory slot — see OpenEnv for the general contract. Use OpenReward when you want to train against an ORS-speaking environment: the OpenReward catalog (e.g. Eigent/SETA, kanishk/EndlessTerminals, nebius/SWE-rebench-V2), an env you self-host on your own infra, or a local server you’re developing.

Installation

pip install trl[openreward]

This installs the openreward Python SDK. The integration itself imports openreward lazily, so users who don’t touch trl.experimental.openreward aren’t affected.

Quick start

The OpenRewardSpec class wires a single ORS environment into the three TRL trainer slots — train_dataset, environment_factory, reward_funcs — by exposing properties that map 1:1 to those kwarg names:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(
        num_generations=2,
        max_steps=5,
        max_tool_calling_iterations=20,
        log_completions=True,
    ),
    train_dataset=spec.train_dataset,
    environment_factory=spec.environment_factory,
    reward_funcs=spec.reward_funcs,
)
trainer.train()

Under the hood OpenRewardSpec does three things, lazily on first access:

  1. spec.train_dataset: derives a datasets.Dataset from the env’s task list (one HTTP roundtrip via the SDK). It contains at minimum prompt and task_index columns, plus per-task metadata columns folded in.
  2. spec.environment_factory: returns a zero-arg callable that produces a fresh per-rollout adapter on each call. The adapter exposes one Python method per ORS tool, with a typed signature and docstring auto-generated from the env’s JSON Schema. TRL’s tool collector picks them up via inspect.getmembers.
  3. spec.reward_funcs: an outcome-only reward function (last non-null reward in the trajectory) suitable for sparse-reward envs like SETA.
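
Each slot can be sanity-checked on its own before it touches a trainer. A minimal sketch using only the documented properties (it makes the same SDK calls training would, so a hub target needs OPENREWARD_API_KEY set):

from trl.experimental.openreward import OpenRewardSpec

spec = OpenRewardSpec("Eigent/SETA", num_tasks=4)

print(spec.train_dataset.column_names)   # prompt, task_index, plus metadata columns
env = spec.environment_factory()         # fresh per-rollout adapter, one method per ORS tool
print(spec.reward_funcs)                 # outcome-only reward, ready for GRPOTrainer(reward_funcs=...)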

Using a hub environment

Pass an openreward.ai catalog name as the target. The SDK reads OPENREWARD_API_KEY from the environment for authentication.

spec = OpenRewardSpec("Eigent/SETA", num_tasks=64)

Using a self-hosted environment

Pass the URL directly. No API key is needed if your server doesn’t enforce one.

spec = OpenRewardSpec("https://my-org-my-env.hf.space", env_name="my_env")

The openreward SDK by default expects a two-subdomain platform layout (api.<host> for stateless calls and sessions.<host> for SSE-based session calls). For single-host self-hosted servers (one URL serving everything), set the override env vars below before constructing OpenRewardSpec:

import os

URL = "https://my-org-my-env.hf.space"
os.environ["OPENREWARD_API_URL"]     = URL
os.environ["OPENREWARD_SESSION_URL"] = URL

spec = OpenRewardSpec(URL, env_name="my_env")

Running a minimal environment locally

The fastest way to try the integration end-to-end without external dependencies is a tiny ORS server defined with the openreward SDK’s Environment + Server scaffolding. The example below is a complete echo environment — the model wins by calling echo(text=...) with the task’s target string.

# server.py
from pydantic import BaseModel
from openreward.environments import Environment, JSONObject, Server, TextBlock, ToolOutput, tool


class EchoTaskSpec(BaseModel):
    target: str

class EchoParams(BaseModel):
    text: str


class EchoEnvironment(Environment):
    def __init__(self, task_spec: JSONObject = {}, secrets: dict[str, str] = {}):
        super().__init__(task_spec)
        self.config = EchoTaskSpec.model_validate(task_spec)

    @classmethod
    def list_splits(cls) -> list[str]:
        return ["train"]

    @classmethod
    def list_tasks(cls, split: str) -> list[JSONObject]:
        return [{"target": "hello"}, {"target": "world"}]

    def get_prompt(self) -> list[TextBlock]:
        return [TextBlock(type="text", text=f"Echo '{self.config.target}' to win.")]

    @tool
    async def echo(self, params: EchoParams) -> ToolOutput:
        """Submit a string. Reward 1.0 + finished if it matches the target.

        Args:
            text: The string to echo back.
        """
        correct = params.text == self.config.target
        return ToolOutput(
            blocks=[TextBlock(type="text", text="match" if correct else "no match")],
            reward=1.0 if correct else 0.0,
            finished=correct,
        )


if __name__ == "__main__":
    Server([EchoEnvironment]).run(host="0.0.0.0", port=8000)

Run it:

pip install openreward fastapi uvicorn pydantic
python server.py     # listens on :8000

Then point OpenRewardSpec at it (with the URL overrides described above):

import os
URL = "http://127.0.0.1:8000"
os.environ["OPENREWARD_API_URL"]     = URL
os.environ["OPENREWARD_SESSION_URL"] = URL

from trl.experimental.openreward import OpenRewardSpec
spec = OpenRewardSpec(URL, env_name="echoenvironment")
print(spec.train_dataset)        # 2 rows: prompt, task_index, target columns

This is also the fixture pattern used by TRL’s own tests — see trl-internal-testing/openreward-echo-env for the deployed Space.

Selecting tasks

OpenRewardSpec accepts either a count or an explicit index list:

spec = OpenRewardSpec("Eigent/SETA", num_tasks=10)                      # first 10 tasks
spec = OpenRewardSpec("Eigent/SETA", indices=[0, 5, 13, 27])            # specific indices
spec = OpenRewardSpec("Eigent/SETA", indices=list(range(50, 100)))      # range

num_tasks and indices are mutually exclusive; in both cases only the tasks actually selected are fetched (no full task-list scan).
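
Either one can be combined with the split argument. The "eval" split below is hypothetical, since the splits actually available come from the environment's own split list:

spec = OpenRewardSpec("Eigent/SETA", split="eval", indices=[0, 5, 13])   # "eval" is hypothetical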

How tool binding works

At construction the spec calls the env’s /tools endpoint to fetch a list of tool specs (each with a name, description, and JSON Schema for arguments). For each tool it generates a Python method on the per-rollout adapter with a typed signature and a docstring derived from the schema. So transformers.utils.get_json_schema and TRL’s inspect.getmembers(env, ismethod) both produce the right tool schema for the model with no per-env wrapper code.

If a tool description contains characters that aren’t safe to splice into Python source, the binder falls back to a sanitized form so binding never fails on real envs.
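
A quick way to see what gets generated is to build an adapter and inspect it directly. The sketch below assumes the echo environment spec from earlier, so the only tool method is echo; the discovery calls are the same ones named above:

import inspect
from transformers.utils import get_json_schema

env = spec.environment_factory()                       # fresh per-rollout adapter

# Same discovery path as TRL's tool collector (also lists non-tool helper methods):
for name, method in inspect.getmembers(env, inspect.ismethod):
    print(name, inspect.signature(method))

# JSON schema the model sees for a bound tool (`echo` comes from the echo env above):
print(get_json_schema(env.echo))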

Reward functions

spec.reward_funcs defaults to an outcome-only reward — for each rollout it returns the last non-null reward observed during the trajectory. This is the right default for sparse-reward envs (e.g. SETA, where only submit_solution returns a non-null reward).
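
Conceptually the default behaves like the sketch below. This is not the exact implementation; it assumes env.rewards is the per-step reward list described further down, and it falls back to 0.0 when no reward was ever emitted:

def outcome_reward(environments, **kwargs) -> list[float]:
    rewards = []
    for env in environments:
        non_null = [r for r in env.rewards if r is not None]
        rewards.append(non_null[-1] if non_null else 0.0)   # last non-null reward, else 0.0
    return rewards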

If you want a custom reward, write a regular TRL reward function and pass it directly:

def my_reward(environments, **kwargs) -> list[float]:
    return [env.reward * 2.0 for env in environments]   # double the env reward, etc.

trainer = GRPOTrainer(
    ...,
    reward_funcs=my_reward,
)

The per-rollout adapter exposes the running state TRL needs — env.reward, env.rewards, env.metadata, env.finished, env.last_output — for arbitrary post-hoc reward shaping.
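
For instance, a sketch (assuming env.reward holds the env's latest reward and may be None) that grants the outcome only when the episode actually finished:

def finished_only_reward(environments, **kwargs) -> list[float]:
    # Unfinished rollouts get 0.0; finished ones keep the env's final reward.
    return [(env.reward or 0.0) if env.finished else 0.0 for env in environments]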

OpenRewardSpec

class trl.experimental.openreward.OpenRewardSpec

( target: str num_tasks: int | None = None split: str = 'train' indices: list[int] | None = None api_key: str | None = None secrets: dict[str, str] | None = None env_name: str | None = None include_metadata: bool = True )

Parameters

  • target (str) — Either an openreward.ai catalog name (“Eigent/SETA”) or a URL pointing at any ORS server (“https://you-seta.hf.space”, “http://localhost:8080”). Auto-detected by the presence of :// in the string.
  • num_tasks (int, optional) — Cap on the number of tasks pulled into the dataset. None uses every task the env exposes.
  • split (str, optional, defaults to “train”) — Which split’s task list to draw from.
  • indices (list[int], optional) — Specific task indices to train on. Mutually exclusive with num_tasks. Useful for debugging or curriculum subsets.
  • api_key (str, optional) — OPENREWARD_API_KEY override. Only used when target is a catalog name.
  • secrets (dict[str, str], optional) — Per-session secrets forwarded to env.session(secrets=).
  • env_name (str, optional) — Override for the env name to look up on the server. Rarely needed.
  • include_metadata (bool, optional, defaults to True) — Fold per-task metadata (difficulty, category, tags, …) into the dataset rows so reward funcs can read them via TRL’s inputs argument.

Single spec object that wires an ORS environment into a TRL trainer.
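
With include_metadata=True, a reward function can read the folded-in metadata columns from the inputs TRL forwards to it as keyword arguments, the way custom TRL reward functions receive extra dataset columns in general. A sketch, where the difficulty column is hypothetical:

def difficulty_weighted_reward(environments, difficulty, **kwargs) -> list[float]:
    # `difficulty` is a hypothetical metadata column; real column names depend on the env.
    return [
        (env.reward or 0.0) * (2.0 if d == "hard" else 1.0)
        for env, d in zip(environments, difficulty)
    ]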

Limitations

  • The integration is in trl.experimental — APIs may change. Set TRL_EXPERIMENTAL_SILENCE=1 to silence the warning in CI logs.
  • Currently exposes a single OpenRewardSpec covering one environment; multi-environment training (à la the OpenEnv “meta-environment” pattern) is not supported yet.
  • Long-running rollouts (>15 min per episode) need a keepalive ping — not yet wired.
