Imitation Learning Performance Evaluation Package
A GUI-based package that evaluates the performance of imitation-learning-based autonomous manipulation algorithms using 7 quantitative metrics.
It supports the full flow: configuration → test loop (keyboard/mouse) → per-trial record table → final evaluation summary.
7 Evaluation Metrics
| Metric | Description |
|---|---|
| SR (Task Success Rate) | Fraction of trials that succeeded (%) |
| DE (Data Efficiency) | Efficiency based on the success rate and the number of demonstrations |
| TL (Learning Efficiency) | Time to reach the target success rate (when a checkpoint log is provided) |
| AD (Adaptability) | Difference in success rate between baseline and Out-of-Distribution environments |
| T_avg (Task Time) | Average task duration per trial (seconds) |
| TR (Real-time Performance) | Average inference time collected during trials (ms) |
| SaR (Operational Safety) | Fraction of trials with no safety incident (%) |
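Most of these reduce to simple ratios over the per-trial records. The sketch below only illustrates the definitions above: the trial dictionaries and field names are invented for the example, the real implementations live in the `metrics_eval` modules, and DE/TL additionally depend on the demonstration count and checkpoint log, so they are omitted here.

```python
# Illustrative only: simplified versions of SR, T_avg, TR, SaR, and AD.
# Field names below are made up for the example, not the package's data model.
trials = [
    {"success": True,  "ood": False, "task_time_s": 12.3, "incident": False, "avg_infer_ms": 8.1},
    {"success": False, "ood": False, "task_time_s": 15.0, "incident": True,  "avg_infer_ms": 8.4},
    {"success": True,  "ood": True,  "task_time_s": 13.7, "incident": False, "avg_infer_ms": 8.2},
]

n = len(trials)
sr = 100.0 * sum(t["success"] for t in trials) / n            # SR: success rate (%)
t_avg = sum(t["task_time_s"] for t in trials) / n             # T_avg: mean task time (s)
tr = sum(t["avg_infer_ms"] for t in trials) / n               # TR: mean inference time (ms)
sar = 100.0 * sum(not t["incident"] for t in trials) / n      # SaR: incident-free rate (%)

# AD: success-rate gap between baseline and Out-of-Distribution trials
baseline = [t for t in trials if not t["ood"]]
ood = [t for t in trials if t["ood"]]
sr_baseline = 100.0 * sum(t["success"] for t in baseline) / len(baseline)
sr_ood = 100.0 * sum(t["success"] for t in ood) / len(ood)
ad = sr_baseline - sr_ood
```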
Quick Start
```bash
# From project root (metrics/)
python run_evaluation.py
```
Or run as a module:
```bash
python -m metrics_eval.runner
```
When the GUI opens, enter the evaluation settings and click Start Test.
For installation and CJK font setup, see INSTALL_ENG.md.
GUI Overview
1. Main Window
- Top: evaluation flow diagram image (`picture2.png`; the scale is controlled by `IMG_DISPLAY_SCALE` in `runner.py`)
- Evaluation settings panel
  - Total number of task trials, number of adaptability trials (extra Out-of-Distribution trials)
  - Number of demonstrations, target task success rate (%)
  - Checkpoint log (JSON) path, path of the Policy (pt) under evaluation (with browse buttons)
- Start Test button
- Status message and safety-incident indicator (shown when the F key is pressed)
2. Test Control Window
Opened when you start a test. It provides mouse buttons that mirror the keyboard shortcuts.
- [S] Start task → records the trial start time and begins inference-time collection
- [E] End task → records the trial end time and stops inference-time collection
- [Y] Success / [N] Failure → marks the trial as success or failure
- [F] Safety incident → marks a safety incident for this trial (red notice on the main window)
The button style matches the Start Test button on the main window.
3. Per-Trial Results Window (Record Table)
A table that adds one row per trial as the test runs.
| Trial | Adapt. | Success | Task time (s) | Safety | Avg real-time (ms) |
|---|---|---|---|---|---|
- Adapt.: `-` for baseline trials, "Adaptability" for the Out-of-Distribution trials
- Safety: none / incident
- For many trials, the table grows vertically and can be scrolled.
- The window width is aligned with the Test Control window.
4. Evaluation Result Window
When all trials are done, an Evaluation Result toplevel opens with a summary of the 7 metrics.
```text
─── Evaluation Result ───
Task success rate (SR): ...
Data efficiency (DE): ...
Learning efficiency (TL): ...
Adaptability (AD): ...
Task time average: ...
Real-time performance (TR, average inference time): ... ms
Operational safety (SaR): ...
```
Input: Keyboard / Mouse
During a test you can use either the keyboard or the Test Control window buttons.
| Action | Key | Mouse |
|---|---|---|
| Start task | S | [S] Start task |
| End task | E | [E] End task |
| Success | Y | [Y] Success |
| Failure | N | [N] Failure |
| Safety incident | F | [F] Safety incident |

Typical flow per trial: S → (run task) → E → Y or N. Press F during the task if a safety incident occurs.
User Code Integration
Learning Efficiency (TL): Checkpoint Log
If your training script logs each saved pt file like this, the evaluation GUI can load the training time for that pt and show TL:
```python
import time

import torch  # needed for torch.save below
from training_checkpoint_logger import log_checkpoint

start_time = time.time()

# ... training loop ...
for step in range(...):
    ...
    if step % 100 == 0:
        path = f"checkpoints/model_{step}.pt"
        torch.save(model.state_dict(), path)
        log_checkpoint(path, time.time() - start_time)  # default: checkpoint_times.json
```
In the GUI, set the Checkpoint log field to the path of `checkpoint_times.json` and the Policy (pt) path under evaluation to the pt file you use; TL and SR will be computed and shown together.
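The exact layout of the log file is defined by `training_checkpoint_logger`. As a rough sketch only, assuming it simply maps each saved pt path to the elapsed training time in seconds, you could inspect it like this:

```python
# Illustrative only: the JSON layout assumed here (pt path -> elapsed seconds)
# is a guess; check training_checkpoint_logger for the actual format.
import json

with open("checkpoint_times.json") as f:
    checkpoint_times = json.load(f)

for pt_path, elapsed_s in checkpoint_times.items():
    print(f"{pt_path}: trained for {elapsed_s:.1f} s")
```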
Real-time Performance (TR): Inference Time Collection
If you call `record_inference_time` after each inference in your policy loop, that trial's inference times are collected and their average is used for TR and for the Avg real-time (ms) column in the per-trial table.
```python
import time

from metrics_eval.inference_recorder import record_inference_time

# Per trial: call between [S] and [E]
for step in range(...):
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)
```
The runner clears the buffer on [S] and uses the average for that trial when [E] is pressed.
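If you prefer to keep timing out of the control loop, a thin local wrapper around the policy call does the same thing (`timed_policy` below is just an example helper, not part of the package):

```python
import time

from metrics_eval.inference_recorder import record_inference_time

def timed_policy(policy, obs):
    """Run one inference and report its wall-clock duration to the recorder."""
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)
    return action
```

Each call then contributes one sample to the buffer that the runner averages when [E] is pressed.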
Project Structure
```text
metrics/
├── run_evaluation.py               # GUI entry point
├── training_checkpoint_logger.py   # pt → training time log (TL integration)
├── README_ENG.md                   # This document (English)
├── README_KOR.md                   # Korean version
├── INSTALL_ENG.md                  # Installation guide (English)
├── INSTALL_KOR.md                  # Installation guide (Korean)
├── metrics_eval/
│   ├── __init__.py                 # Exports the 7 metric functions
│   ├── task_success_rate.py        # SR
│   ├── data_efficiency.py          # DE
│   ├── learning_efficiency.py      # TL
│   ├── adaptability.py             # AD
│   ├── task_time.py                # T_avg
│   ├── realtime_performance.py     # TR
│   ├── operational_safety.py       # SaR
│   ├── inference_recorder.py       # record_inference_time (TR)
│   └── runner.py                   # GUI, test loop, result display
└── docs/
    └── design_metrics_evaluation.md  # Metric and GUI design
```
To use only the metric functions, import from metrics_eval:
```python
from metrics_eval import (
    compute_sr,
    compute_de,
    compute_tl_display,
    compute_ad,
    compute_task_time_avg,
    compute_tr,
    compute_sar,
)
```
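The argument lists of these functions are not documented here; before calling them directly, a quick way to check the actual signatures (assuming the package is importable) is:

```python
# Print the signature of each exported metric function before wiring them in.
import inspect

import metrics_eval

for name in ("compute_sr", "compute_de", "compute_tl_display", "compute_ad",
             "compute_task_time_avg", "compute_tr", "compute_sar"):
    fn = getattr(metrics_eval, name)
    print(f"{name}{inspect.signature(fn)}")
```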
Summary
- Run: `python run_evaluation.py`
- Input: evaluation settings → Start Test → per trial: S → E → Y/N (and F if needed)
- Display: Test Control window (mouse), per-trial table (live), evaluation result window (7 metrics)
- Integration:
  - `training_checkpoint_logger` + checkpoint log path → TL
  - `record_inference_time` (between S and E) → TR and per-trial average real-time
For installation and environment setup, see INSTALL_ENG.md.