Imitation Learning Performance Evaluation Package

A GUI-based package that evaluates the performance of imitation-learning-based autonomous manipulation algorithms using 7 quantitative metrics.
It supports the full flow: configuration β†’ test loop (keyboard/mouse) β†’ per-trial record table β†’ final evaluation summary.


7 Evaluation Metrics

  • SR (Task Success Rate): Fraction of trials that succeeded (%)
  • DE (Data Efficiency): Efficiency derived from the success rate and the amount of demonstration data
  • TL (Learning Efficiency): Time to reach the target success rate (when a checkpoint log is provided)
  • AD (Adaptability): Difference in success rate between the baseline and Out-of-Distribution environments
  • T_avg (Task Time): Average task duration per trial (seconds)
  • TR (Real-time Performance): Average inference time collected during trials (ms)
  • SaR (Operational Safety): Fraction of trials with no safety incident (%)
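
For intuition, most of these metrics reduce to simple arithmetic over the per-trial records collected by the GUI. The sketch below is illustrative only: the trial data are made up, the AD sign convention is assumed, and DE and TL (which also depend on the demonstration count and the checkpoint log) are omitted; the authoritative implementations are the functions in metrics_eval/.

# Illustrative arithmetic only; see metrics_eval/*.py for the real implementations
successes     = [True, True, False, True]    # baseline trial outcomes (hypothetical)
ood_successes = [True, False]                # Out-of-Distribution trial outcomes (hypothetical)
task_times_s  = [12.4, 10.8, 15.1, 11.3]     # per-trial task duration (s)
inference_ms  = [8.2, 7.9, 8.5, 8.1]         # per-trial average inference time (ms)
incidents     = [False, False, True, False]  # safety incident flag per trial

sr    = 100 * sum(successes) / len(successes)                 # SR (%)
ad    = sr - 100 * sum(ood_successes) / len(ood_successes)    # AD (sign convention assumed)
t_avg = sum(task_times_s) / len(task_times_s)                 # T_avg (s)
tr    = sum(inference_ms) / len(inference_ms)                 # TR (ms)
sar   = 100 * sum(not i for i in incidents) / len(incidents)  # SaR (%)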

Quick Start

# From project root (metrics/)
python run_evaluation.py

Or run as a module:

python -m metrics_eval.runner

When the GUI opens, enter the evaluation settings and click ν…ŒμŠ€νŠΈ μ‹œμž‘ (Start Test).
For installation and CJK font setup, see INSTALL_ENG.md.


GUI Overview

1. Main Window

  • Top: Evaluation flow diagram image (picture2.png; scale is controlled by IMG_DISPLAY_SCALE in runner.py)
  • Evaluation settings panel
    • Total number of task trials, number of adaptability trials (extra Out-of-Distribution trials)
    • Number of demonstration data, target task success rate (%)
    • Checkpoint log (JSON) path, Policy (pt) path under evaluation (with browse buttons)
  • ν…ŒμŠ€νŠΈ μ‹œμž‘ (Start Test) button
  • Status message and safety-incident indicator (when F key is pressed)

2. Test Control Window

Opened when you start a test. It provides mouse buttons that mirror the keyboard shortcuts.

  • [S] μž‘μ—… μ‹œμž‘ (Start Task): records the trial start time and starts inference-time collection
  • [E] μž‘μ—… μ’…λ£Œ (End Task): records the trial end time and stops inference-time collection
  • [Y] 성곡 (Success) / [N] μ‹€νŒ¨ (Failure): marks the trial as a success or a failure
  • [F] μ•ˆμ „μ‚¬κ³  (Safety Incident): marks a safety incident for this trial (red notice on the main window)

The button style matches the ν…ŒμŠ€νŠΈ μ‹œμž‘ (Start Test) button on the main window.

3. Per-Trial Results Window (Record Table)

A table that adds one row per trial as the test runs.

Columns: Trial | Adapt. | Success | Task time (s) | Safety | Avg real-time (ms)
  • Adapt.: - for baseline trials, 적응성 (adaptability) for Out-of-Distribution trials
  • Safety: μ—†μŒ (none) / 있음 (incident)
  • With many trials, the table grows vertically and can be scrolled.
  • The window width is aligned with the Test Control window.

4. Evaluation Result Window

When all trials are done, an Evaluation Result window (toplevel) opens with a summary of the 7 metrics; the labels are shown in Korean and correspond to the metric list above.

β€”β€”β€” 평가 κ²°κ³Ό β€”β€”β€”
μž‘μ—… 성곡λ₯  (SR): ...
데이터 νš¨μœ¨μ„± (DE): ...
ν•™μŠ΅ νš¨μœ¨μ„± (TL): ...
적응성 (AD): ...
μž‘μ—… μ‹œκ°„ 평균: ...
μ‹€μ‹œκ°„ μ„±λŠ₯ (TR, 평균 inference time): ... ms
μž‘μ—… μ•ˆμ •μ„± (SaR): ...

Input: Keyboard / Mouse

During a test you can use either the keyboard or the Test Control window buttons.

Action           Key   Mouse
Start task       S     [S] μž‘μ—… μ‹œμž‘ (Start Task)
End task         E     [E] μž‘μ—… μ’…λ£Œ (End Task)
Success          Y     [Y] 성곡 (Success)
Failure          N     [N] μ‹€νŒ¨ (Failure)
Safety incident  F     [F] μ•ˆμ „μ‚¬κ³  (Safety Incident)

Typical flow per trial: S β†’ (run task) β†’ E β†’ Y or N. Press F during the task if a safety incident occurs.


User Code Integration

Learning Efficiency (TL): Checkpoint Log

If your training script logs each saved .pt file as shown below, the evaluation GUI can look up the training time for that checkpoint and display TL:

import time

import torch
from training_checkpoint_logger import log_checkpoint

start_time = time.time()
# ... training loop ...
for step in range(...):
    ...
    if step % 100 == 0:
        path = f"checkpoints/model_{step}.pt"
        torch.save(model.state_dict(), path)
        # Log the elapsed training time for this checkpoint (default file: checkpoint_times.json)
        log_checkpoint(path, time.time() - start_time)

In the GUI, set Checkpoint 둜그 (Checkpoint log) to the path of checkpoint_times.json and 평가 쀑인 Policy (pt) 경둜 (Policy (pt) path under evaluation) to the pt file you are evaluating; TL and SR will then be computed and shown together.
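
The checkpoint log itself is a small JSON file written by training_checkpoint_logger. Its exact schema is defined there; the sketch below only illustrates the idea, assuming (not verified) that the file maps each pt path to its elapsed training time in seconds:

import json

# Assumed layout: {"checkpoints/model_100.pt": 312.5, ...}; the real schema is
# whatever training_checkpoint_logger.log_checkpoint writes.
with open("checkpoint_times.json") as f:
    checkpoint_times = json.load(f)

print(checkpoint_times.get("checkpoints/model_100.pt"))  # elapsed training time (s), if present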

Real-time Performance (TR): Inference Time Collection

If you call record_inference_time after each inference in your policy loop, the inference times for that trial are collected and their average is used for TR and for the Avg real-time (ms) column (평균 μ‹€μ‹œκ°„μ„±) in the per-trial table.

import time
from metrics_eval.inference_recorder import record_inference_time

# Per trial: call between [S] and [E]
for step in range(...):
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)

The runner clears the buffer on [S] and uses the average for that trial when [E] is pressed.
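
If you prefer not to time each call by hand, the pattern above can be wrapped in a small local helper. The context manager below is a convenience sketch of your own code, not part of metrics_eval:

import time
from contextlib import contextmanager

from metrics_eval.inference_recorder import record_inference_time

@contextmanager
def timed_inference():
    # Record the wall-clock duration of one inference call (hypothetical local helper)
    t0 = time.perf_counter()
    try:
        yield
    finally:
        record_inference_time(time.perf_counter() - t0)

# Usage inside the policy loop, between [S] and [E]:
#     with timed_inference():
#         action = policy(obs)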


Project Structure

metrics/
β”œβ”€β”€ run_evaluation.py             # GUI entry point
β”œβ”€β”€ training_checkpoint_logger.py # pt ↔ training time log (TL integration)
β”œβ”€β”€ README_ENG.md                 # This document (English)
β”œβ”€β”€ README_KOR.md                 # Korean version
β”œβ”€β”€ INSTALL_ENG.md                # Installation guide (English)
β”œβ”€β”€ INSTALL_KOR.md                # Installation guide (Korean)
β”œβ”€β”€ metrics_eval/
β”‚   β”œβ”€β”€ __init__.py               # Exports 7 metric functions
β”‚   β”œβ”€β”€ task_success_rate.py      # SR
β”‚   β”œβ”€β”€ data_efficiency.py        # DE
β”‚   β”œβ”€β”€ learning_efficiency.py    # TL
β”‚   β”œβ”€β”€ adaptability.py           # AD
β”‚   β”œβ”€β”€ task_time.py              # T_avg
β”‚   β”œβ”€β”€ realtime_performance.py   # TR
β”‚   β”œβ”€β”€ operational_safety.py     # SaR
β”‚   β”œβ”€β”€ inference_recorder.py     # record_inference_time (TR)
β”‚   └── runner.py                 # GUI, test loop, result display
└── docs/
    └── design_metrics_evaluation.md  # Metric and GUI design

To use only the metric functions, import from metrics_eval:

from metrics_eval import (
    compute_sr,
    compute_de,
    compute_tl_display,
    compute_ad,
    compute_task_time_avg,
    compute_tr,
    compute_sar,
)
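
The snippet below shows what a standalone call might look like; the argument formats are assumptions made for illustration (check the individual modules in metrics_eval/ for the actual signatures):

from metrics_eval import compute_sr, compute_task_time_avg

# Hypothetical per-trial records; the real argument formats may differ.
successes  = [True, True, False, True]   # success flag per trial
task_times = [12.4, 10.8, 15.1, 11.3]    # task duration per trial (s)

print(compute_sr(successes))             # assumed to return the success rate in %
print(compute_task_time_avg(task_times)) # assumed to return the mean task time in s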

Summary

  • Run: python run_evaluation.py
  • Input: Evaluation settings β†’ Start Test β†’ per trial: S β†’ E β†’ Y/N (and F if needed)
  • Display: Test Control window (mouse), per-trial table (live), evaluation result window (7 metrics)
  • Integration: training_checkpoint_logger + checkpoint log path β†’ TL
    record_inference_time (between S and E) β†’ TR and per-trial average real-time

For installation and environment setup, see INSTALL_ENG.md.
