CI/CD Pipeline Debugger Environment (OpenEnv)

1. Project Goal

This repository implements an AI training and evaluation environment where an agent learns to debug broken CI/CD pipelines automatically.

The environment targets real-world DevOps failure patterns, including:

  • YAML syntax and structure issues
  • Incorrect build/test commands (for example, npm tset -> npm test)
  • Dependency and setup failures
  • Multi-stage pipeline execution errors

This is designed as an RL-style interaction loop:

Observe -> Think -> Act -> Get Reward -> Repeat

2. Why This Matters

CI/CD failures are common, repetitive, and often require multiple steps to resolve. This project turns that debugging workflow into a structured learning environment where agents:

  • Read failure context
  • Reason about root causes
  • Propose and apply fixes
  • Get shaped rewards for robust behavior

3. System Architecture

High-level flow:

Agent (LLM) -> Action -> Environment.step() -> Reward/Evaluation -> Next step

Core integration path:

Model -> Action -> Environment.step() -> RewardCalculator

RewardCalculator integrates:

  • DeterministicGrader
  • LLMJudge
  • HiddenTestRunner
  • AntiHackingDetector

3.1 OpenEnv Interface (Typed)

Typed Pydantic models are defined in env/models.py:

  • Observation: strict schema for environment observations
  • Action: normalized tool + payload action schema
  • Reward: bounded reward model with components

Environment contract:

  • reset() returns the initial Observation payload
  • step(action) returns (observation, reward, done, info)
  • state() returns current environment state snapshot

Server/API contract models are exposed in server/app.py and use the same typed observation/action/reward structures.
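The reset/step/state contract above can be sketched as a minimal environment skeleton. This is an illustrative stdlib-only sketch: the class and field names here are hypothetical, while the real implementation lives in env/ and returns the typed Pydantic models.

```python
class CICDDebugEnv:
    """Minimal sketch of the OpenEnv-style contract (illustrative only)."""

    def __init__(self, task):
        self.task = task
        self.step_count = 0
        self.done = False

    def reset(self):
        # Return the initial observation payload for the task.
        self.step_count = 0
        self.done = False
        return {"task_id": self.task["task_id"],
                "logs": self.task["logs"],
                "step_count": 0}

    def step(self, action):
        # Apply one action and return the (observation, reward, done, info) tuple.
        self.step_count += 1
        observation = {"task_id": self.task["task_id"],
                       "step_count": self.step_count,
                       "last_action_error": None}
        reward = 0.0
        self.done = action.get("tool") == "submit_solution"
        info = {"tool": action.get("tool")}
        return observation, reward, self.done, info

    def state(self):
        # Snapshot of the current environment state.
        return {"step_count": self.step_count, "done": self.done}
```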

3.2 Action and Observation Spaces

Observation fields include:

  • task_id, difficulty, failure_stage, actual_bug
  • config, logs, error_message
  • available_tools, progress_flags
  • file_modification_count, hidden_test_pass_rate, step_count, last_action_error

Action schema:

  • tool: one of read_file, read_logs, analyze_error, edit_config, run_pipeline_stage, run_tests, validate_fix, submit_solution
  • payload: optional dict (for example { "raw": "replace npm tset with npm test" })

Reward schema:

  • value: bounded float in [0.0, 1.0]
  • components: reward breakdown dictionary
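The Action and Reward schemas can be sketched as follows. The repository defines these as Pydantic models in env/models.py; this stdlib dataclass version is only an approximation of the same invariants (canonical tool set, bounded reward value).

```python
from dataclasses import dataclass, field
from typing import Optional

# Canonical tool names from the action space above.
ALLOWED_TOOLS = {"read_file", "read_logs", "analyze_error", "edit_config",
                 "run_pipeline_stage", "run_tests", "validate_fix",
                 "submit_solution"}

@dataclass
class Action:
    tool: str
    payload: Optional[dict] = None

    def __post_init__(self):
        # Normalize and validate the tool name against the canonical set.
        self.tool = self.tool.strip().lower()
        if self.tool not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {self.tool}")

@dataclass
class Reward:
    value: float
    components: dict = field(default_factory=dict)

    def __post_init__(self):
        # Enforce the bounded-reward invariant: value must lie in [0.0, 1.0].
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"reward out of bounds: {self.value}")
```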

4. Core Modules

4.1 Quality Judge

  • File: env/graders/llm_judge.py
  • Purpose: quality-aware scoring of fixes
  • Output keys: correctness, minimalism, quality (all in [0,1])
  • Guarantees:
    • strict JSON parsing attempt
    • robust fallback parsing for messy output
    • no-crash behavior (safe zero scores on failure)
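The strict-then-fallback parsing guarantee can be sketched like this: attempt strict JSON first, fall back to scraping key/number pairs from messy model output, and return safe zero scores on total failure. Function names here are hypothetical; the actual logic lives in env/graders/llm_judge.py.

```python
import json
import re

SAFE_ZEROS = {"correctness": 0.0, "minimalism": 0.0, "quality": 0.0}

def clamp(x: float) -> float:
    # Keep every score inside [0, 1].
    return max(0.0, min(1.0, x))

def parse_judge_output(raw) -> dict:
    """Parse judge scores with a strict-then-fallback strategy; never raises."""
    # 1. Strict path: the whole string is a JSON object.
    try:
        data = json.loads(raw)
        if isinstance(data, dict):
            return {k: clamp(float(data.get(k, 0.0))) for k in SAFE_ZEROS}
    except (json.JSONDecodeError, TypeError, ValueError):
        pass
    # 2. Fallback path: scrape "key: number" pairs out of messy text.
    scores = dict(SAFE_ZEROS)
    for key in scores:
        m = re.search(rf'"?{key}"?\s*[:=]\s*([01](?:\.\d+)?)', raw or "")
        if m:
            scores[key] = clamp(float(m.group(1)))
    return scores
```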

4.2 Deterministic Grader

  • File: env/graders/deterministic.py
  • Purpose: reproducible correctness scoring in [0, 1]
  • Checks:
    • YAML validity
    • command and fix correctness
    • similarity and issue resolution
  • Rules:
    • deterministic only
    • same input, same score
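A grader with these properties can be sketched as below. This is a stdlib-only illustration with a hypothetical function name; the real grader in env/graders/deterministic.py additionally validates YAML with a proper parser, which is omitted here.

```python
import difflib

def grade_fix(broken: str, fixed: str, expected_command: str,
              bad_command: str) -> float:
    """Deterministic 0-1 score: the same inputs always yield the same score."""
    score = 0.0
    # Command correctness: the bad command is gone, the expected one is present.
    if bad_command not in fixed:
        score += 0.4
    if expected_command in fixed:
        score += 0.4
    # Minimalism proxy: a good fix stays similar to the original config.
    similarity = difflib.SequenceMatcher(None, broken, fixed).ratio()
    score += 0.2 * similarity
    return round(min(score, 1.0), 3)
```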

4.3 Anti-Hacking Detector

  • File: env/anti_hacking.py
  • Purpose: detect reward-hacking and shortcut behavior
  • Penalty detectors:
    • stage skipping (if: false, when: never)
    • fake success (echo tests passed, unsafe exit 0 patterns)
    • pipeline breakage between versions
    • excessive edits
    • timeout abuse via too many steps
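The pattern-based detectors can be sketched as a small rule table plus threshold checks. Names and thresholds here are illustrative assumptions, and the version-breakage detector is omitted; see env/anti_hacking.py for the real logic.

```python
import re

# Each detector maps a penalty label to a pattern that signals shortcut behavior.
HACK_PATTERNS = {
    "stage_skipping": re.compile(r"if:\s*false|when:\s*never"),
    "fake_success": re.compile(r"echo\s+.{0,20}tests? passed|;\s*exit 0|\|\|\s*exit 0"),
}

def detect_hacks(config: str, edit_count: int, step_count: int,
                 max_edits: int = 5, max_steps: int = 20) -> list:
    """Return the list of penalty labels triggered by a candidate fix."""
    hits = [name for name, pat in HACK_PATTERNS.items() if pat.search(config)]
    if edit_count > max_edits:
        hits.append("excessive_edits")
    if step_count > max_steps:
        hits.append("timeout_abuse")
    return hits
```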

4.4 Hidden Tests

  • File: env/hidden_tests.py
  • Purpose: test whether fixes are robust, rather than overfit to an exact-match solution
  • Method:
    • deterministic variant generation (OS, versions, env shifts)
    • evaluate pass rate across variants
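The variant-based evaluation can be sketched as: generate a fixed set of environment variants, check the fix under each, and report the pass rate. The variant list and function name below are illustrative; env/hidden_tests.py defines the actual variants.

```python
# Hypothetical deterministic variants (OS, runtime version shifts).
ILLUSTRATIVE_VARIANTS = [
    {"os": "ubuntu-22.04", "node": "18"},
    {"os": "ubuntu-20.04", "node": "16"},
    {"os": "macos-14", "node": "20"},
]

def hidden_test_pass_rate(fixed_config: str, check) -> float:
    """Evaluate a fix against deterministic environment variants.

    `check(config, variant)` returns True when the fix still works under
    that variant; the pass rate is the fraction of variants that pass.
    """
    results = [check(fixed_config, v) for v in ILLUSTRATIVE_VARIANTS]
    return sum(results) / len(results)
```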

4.5 Reward Shaping

  • File: env/rewards.py
  • Purpose: step-level learning signal
  • Components:
    • progress rewards (logs, analysis, fix proposal)
    • execution rewards (pipeline run, tests pass)
    • quality rewards (deterministic + hidden tests + LLM judge)
    • anti-hacking penalties
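Combining these components into the bounded Reward can be sketched as a weighted sum minus penalties, clamped to [0.0, 1.0]. The function name and weighting scheme are illustrative assumptions about env/rewards.py.

```python
def shaped_reward(components: dict, penalties: dict) -> dict:
    """Combine shaped reward components and penalties into a bounded value."""
    total = sum(components.values()) - sum(penalties.values())
    value = max(0.0, min(1.0, total))  # keep the final reward in [0.0, 1.0]
    breakdown = dict(components)
    breakdown.update({f"penalty_{k}": -v for k, v in penalties.items()})
    return {"value": round(value, 3), "components": breakdown}
```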

5. Inference and Evaluation

5.1 Prompt and Model Layers

  • inference/prompts.py: stable prompt templates and fallback action heuristics
  • inference/model_wrapper.py: OpenAI client action generation, candidate generation, and safe fallback

Canonical action tools used by environment and inference:

  • read_file
  • read_logs
  • analyze_error
  • edit_config
  • run_pipeline_stage
  • run_tests
  • validate_fix
  • submit_solution

5.2 Metrics and Artifacts

  • inference/metrics.py: reward, success-rate, and failure reason tracking
  • inference/visualize.py: reward curve and metrics artifact export

5.3 Submission-Critical Runtime

  • File: inference.py (root)
  • Responsibilities:
    • initialize model and environment
    • run step loop
    • calculate rewards
    • emit strict stdout contract
    • always emit END line

Required output format:

  • [START] task=... env=... model=...
  • [STEP] step= action=... reward=0.00 done=<true|false> error=<msg|null>
  • [END] success=<true|false> steps= score=<0.000> rewards=<r1,r2,...>

Rules enforced:

  • single-line logs only
  • reward values with 2 decimals
  • lowercase booleans
  • no extra runtime log noise
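The formatting rules above (2-decimal rewards, lowercase booleans, single-line output, null for missing errors) can be sketched as small formatter helpers. The helper names are hypothetical; inference.py emits the same line shapes.

```python
def start_line(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"

def step_line(step: int, action: str, reward: float, done: bool,
              error: str = None) -> str:
    # Rewards use exactly 2 decimals; booleans are lowercase; errors become "null".
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error if error else 'null'}")

def end_line(success: bool, steps: int, score: float, rewards: list) -> str:
    # The END line is always emitted, even on failure.
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.3f} rewards={','.join(f'{r:.2f}' for r in rewards)}")
```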

6. Task Coverage

The project includes 9 CI-fix tasks spanning:

  • easy: syntax and typo fixes
  • medium: dependency/env/cache/permissions issues
  • hard: matrix logic, conditional flow, orchestration-level failures

Representative baseline tasks (one per difficulty):

  • easy: easy-command-typo (fix invalid npm tset command)
  • medium: medium-python-version (align workflow Python version)
  • hard: hard-needs-order (repair deploy job dependency ordering)

7. Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Environment variables:

export API_BASE_URL="/static-proxy?url=https%3A%2F%2Frouter.huggingface.co%2Fv1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_openai_compatible_api_key>"
# Optional alias; if set, this takes precedence over HF_TOKEN in inference.py
export OPENAI_API_KEY="<same_token_optional>"
# Optional, only if your inference spins environments from local images.
export LOCAL_IMAGE_NAME="<local_env_image_name>"

If you want to use an OpenAI access token directly:

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="<your_openai_access_token>"
# Optional alias:
export OPENAI_API_KEY="<same_token_optional>"

8. Run Inference

Offline/local mode:

python inference.py --offline --force-local-env --max-steps 8 --policy-mode imp --trajectories 4

Model-backed mode:

python inference.py --max-steps 8 --policy-mode imp --trajectories 4

Run baseline across easy/medium/hard tasks:

OpenAI client mode:

OPENAI_API_KEY="<your_openai_compatible_api_key>" python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --force-local-env

Offline reproducible mode:

python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --offline --force-local-env

Policy modes:

  • sft: deterministic heuristic policy
  • direct: single model action per step
  • imp: multi-candidate generation and ranking

9. Baseline Scores

Reproducible baseline artifact:

  • artifacts/baseline_scores.json

Latest baseline run (max_steps=5, policy_mode=imp, trajectories=3):

Task ID                Difficulty  Score  Success
easy-command-typo      easy        0.541  false
medium-python-version  medium      0.679  false
hard-needs-order       hard        0.513  false

Aggregate:

  • average score: 0.578
  • success rate: 0.000

When OPENAI_API_KEY is provided, the same script runs with the OpenAI API client path in inference.py.

10. Tests

Run all tests:

python -m unittest discover -s tests -v

Coverage includes:

  • LLM judge
  • deterministic grader
  • anti-hacking detectors
  • hidden tests
  • reward system
  • end-to-end inference output format

11. Validation and Submission

OpenEnv validation:

python -m openenv.cli.__main__ validate

Pre-submission script:

./validate-submission.sh <your_hf_space_url>

Required environment variables:

export API_BASE_URL="/static-proxy?url=https%3A%2F%2Frouter.huggingface.co%2Fv1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export OPENAI_API_KEY="<your_openai_compatible_api_key>"
# Optional fallback:
export HF_TOKEN="<your_token>"

Docker run (Space/API mode):

docker build -t cicd-debugger-env .
docker run --rm -p 7860:7860 cicd-debugger-env

Server endpoints used by validators:

  • POST /reset
  • POST /step
  • GET /state
  • GET /health
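The request/response shapes behind these endpoints can be sketched as plain handler functions. This stdlib sketch uses hypothetical names and omits the HTTP layer; the real server in server/app.py wraps equivalent logic in routes that return the typed contract models.

```python
def make_handlers(env):
    """Plain-function sketch of the validator-facing endpoints."""

    def reset():  # POST /reset
        return {"observation": env.reset()}

    def step(body):  # POST /step
        obs, reward, done, info = env.step(body["action"])
        return {"observation": obs, "reward": reward, "done": done, "info": info}

    def state():  # GET /state
        return env.state()

    def health():  # GET /health
        return {"status": "ok"}

    return {"reset": reset, "step": step, "state": state, "health": health}
```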

12. Deploy to Hugging Face Space (OpenAI Token)

This repository is already configured for Docker Spaces (sdk: docker in this README front matter).

  1. Create a new Hugging Face Space with SDK set to Docker.
  2. Push this repository to the Space git remote.
  3. In Space Settings -> Variables and secrets, add these Secrets:
OPENAI_API_KEY=<your_openai_access_token>
API_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
  4. Optional Secrets:
HF_TOKEN=<optional_fallback_token>
OFFLINE_INFERENCE=0
MAX_STEPS=8
TEMPERATURE=0.2
MAX_TOKENS=120
  5. Keep the app port as 7860 (already configured).
  6. Wait for build completion, then verify:
curl -sS https://<your-space-name>.hf.space/health
curl -sS -X POST https://<your-space-name>.hf.space/reset -H 'Content-Type: application/json' -d '{}'

Notes:

  • .env.example is for local development reference only. Hugging Face Spaces use Secrets/Variables from Space Settings.
  • Runtime code reads OPENAI_API_KEY first and falls back to HF_TOKEN when OPENAI_API_KEY is not provided.

13. One-line Presentation Summary

We built an OpenEnv-compliant reinforcement learning environment where AI agents learn to debug real CI/CD pipelines using multi-step reasoning, hybrid grading, anti-hacking safeguards, and robust reward shaping.
