Imitation Learning Performance Evaluation Package

A GUI-based package that evaluates the performance of imitation-learning-based autonomous manipulation algorithms using 7 quantitative metrics.
It supports the full flow: configuration β†’ test loop (keyboard/mouse) β†’ per-trial record table β†’ final evaluation summary.


7 Evaluation Metrics

  • SR (Task Success Rate): Fraction of trials that succeeded (%)
  • DE (Data Efficiency): Efficiency derived from the success rate and the amount of demonstration data
  • TL (Learning Efficiency): Time to reach the target success rate (when a checkpoint log is provided)
  • AD (Adaptability): Difference in success rate between the baseline and Out-of-Distribution environments
  • T_avg (Task Time): Average task duration per trial (seconds)
  • TR (Real-time Performance): Average inference time collected during trials (ms)
  • SaR (Operational Safety): Fraction of trials with no safety incident (%)
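
For intuition, most of these metrics reduce to simple arithmetic over the per-trial records collected by the GUI. The sketch below is illustrative only: the trial data are made up, the AD sign convention is assumed, and DE and TL (which also depend on the demonstration count and the checkpoint log) are omitted; the authoritative implementations are the functions in metrics_eval/.

# Illustrative arithmetic only; see metrics_eval/*.py for the real implementations
successes     = [True, True, False, True]    # baseline trial outcomes (hypothetical)
ood_successes = [True, False]                # Out-of-Distribution trial outcomes (hypothetical)
task_times_s  = [12.4, 10.8, 15.1, 11.3]     # per-trial task duration (s)
inference_ms  = [8.2, 7.9, 8.5, 8.1]         # per-trial average inference time (ms)
incidents     = [False, False, True, False]  # safety incident flag per trial

sr    = 100 * sum(successes) / len(successes)                 # SR (%)
ad    = sr - 100 * sum(ood_successes) / len(ood_successes)    # AD (sign convention assumed)
t_avg = sum(task_times_s) / len(task_times_s)                 # T_avg (s)
tr    = sum(inference_ms) / len(inference_ms)                 # TR (ms)
sar   = 100 * sum(not i for i in incidents) / len(incidents)  # SaR (%)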

Quick Start

# From project root (metrics/)
python run_evaluation.py

Or run as a module:

python -m metrics_eval.runner

When the GUI opens, enter the evaluation settings and click ν…ŒμŠ€νŠΈ μ‹œμž‘ (Start Test).
For installation and CJK font setup, see INSTALL_ENG.md.


GUI Overview

1. Main Window

  • Top: Evaluation flow diagram image (picture2.png; scale is controlled by IMG_DISPLAY_SCALE in runner.py)
  • Evaluation settings panel
    • Total number of task trials, number of adaptability trials (extra Out-of-Distribution trials)
    • Number of demonstration data, target task success rate (%)
    • Checkpoint log (JSON) path, Policy (pt) path under evaluation (with browse buttons)
  • ν…ŒμŠ€νŠΈ μ‹œμž‘ (Start Test) button
  • Status message and safety-incident indicator (when F key is pressed)

2. Test Control Window

Opened when you start a test. It provides mouse buttons that mirror the keyboard shortcuts.

  • [S] μž‘μ—… μ‹œμž‘ (Start Task): records the trial start time and starts inference-time collection
  • [E] μž‘μ—… μ’…λ£Œ (End Task): records the trial end time and stops inference-time collection
  • [Y] 성곡 (Success) / [N] μ‹€νŒ¨ (Failure): marks the trial as a success or a failure
  • [F] μ•ˆμ „μ‚¬κ³  (Safety Incident): marks a safety incident for this trial (red notice on the main window)

The button style matches the ν…ŒμŠ€νŠΈ μ‹œμž‘ (Start Test) button on the main window.

3. Per-Trial Results Window (Record Table)

A table that adds one row per trial as the test runs.

Columns: Trial | Adapt. | Success | Task time (s) | Safety | Avg real-time (ms)
  • Adapt.: - for baseline trials, 적응성 (adaptability) for Out-of-Distribution trials
  • Safety: μ—†μŒ (none) / 있음 (incident)
  • With many trials, the table grows vertically and can be scrolled.
  • The window width is aligned with the Test Control window.

4. Evaluation Result Window

When all trials are done, an Evaluation Result window (toplevel) opens with a summary of the 7 metrics; the labels are shown in Korean and correspond to the metric list above.

β€”β€”β€” 평가 κ²°κ³Ό β€”β€”β€”
μž‘μ—… 성곡λ₯  (SR): ...
데이터 νš¨μœ¨μ„± (DE): ...
ν•™μŠ΅ νš¨μœ¨μ„± (TL): ...
적응성 (AD): ...
μž‘μ—… μ‹œκ°„ 평균: ...
μ‹€μ‹œκ°„ μ„±λŠ₯ (TR, 평균 inference time): ... ms
μž‘μ—… μ•ˆμ •μ„± (SaR): ...

Input: Keyboard / Mouse

During a test you can use either the keyboard or the Test Control window buttons.

Action           Key   Mouse
Start task       S     [S] μž‘μ—… μ‹œμž‘ (Start Task)
End task         E     [E] μž‘μ—… μ’…λ£Œ (End Task)
Success          Y     [Y] 성곡 (Success)
Failure          N     [N] μ‹€νŒ¨ (Failure)
Safety incident  F     [F] μ•ˆμ „μ‚¬κ³  (Safety Incident)

Typical flow per trial: S β†’ (run task) β†’ E β†’ Y or N. Press F during the task if a safety incident occurs.


User Code Integration

Learning Efficiency (TL): Checkpoint Log

If your training script logs each saved .pt file as shown below, the evaluation GUI can look up the training time for that checkpoint and display TL:

import time

import torch
from training_checkpoint_logger import log_checkpoint

start_time = time.time()
# ... training loop ...
for step in range(...):
    ...
    if step % 100 == 0:
        path = f"checkpoints/model_{step}.pt"
        torch.save(model.state_dict(), path)
        # Log the elapsed training time for this checkpoint (default file: checkpoint_times.json)
        log_checkpoint(path, time.time() - start_time)

In the GUI, set Checkpoint 둜그 (Checkpoint log) to the path of checkpoint_times.json and 평가 쀑인 Policy (pt) 경둜 (Policy (pt) path under evaluation) to the pt file you are evaluating; TL and SR will then be computed and shown together.
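
The checkpoint log itself is a small JSON file written by training_checkpoint_logger. Its exact schema is defined there; the sketch below only illustrates the idea, assuming (not verified) that the file maps each pt path to its elapsed training time in seconds:

import json

# Assumed layout: {"checkpoints/model_100.pt": 312.5, ...}; the real schema is
# whatever training_checkpoint_logger.log_checkpoint writes.
with open("checkpoint_times.json") as f:
    checkpoint_times = json.load(f)

print(checkpoint_times.get("checkpoints/model_100.pt"))  # elapsed training time (s), if present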

Real-time Performance (TR): Inference Time Collection

If you call record_inference_time after each inference in your policy loop, the inference times for that trial are collected and their average is used for TR and for the Avg real-time (ms) column (평균 μ‹€μ‹œκ°„μ„±) in the per-trial table.

import time
from metrics_eval.inference_recorder import record_inference_time

# Per trial: call between [S] and [E]
for step in range(...):
    t0 = time.perf_counter()
    action = policy(obs)
    record_inference_time(time.perf_counter() - t0)

The runner clears the buffer on [S] and uses the average for that trial when [E] is pressed.
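
If you prefer not to time each call by hand, the pattern above can be wrapped in a small local helper. The context manager below is a convenience sketch of your own code, not part of metrics_eval:

import time
from contextlib import contextmanager

from metrics_eval.inference_recorder import record_inference_time

@contextmanager
def timed_inference():
    # Record the wall-clock duration of one inference call (hypothetical local helper)
    t0 = time.perf_counter()
    try:
        yield
    finally:
        record_inference_time(time.perf_counter() - t0)

# Usage inside the policy loop, between [S] and [E]:
#     with timed_inference():
#         action = policy(obs)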


Project Structure

metrics/
β”œβ”€β”€ run_evaluation.py             # GUI entry point
β”œβ”€β”€ training_checkpoint_logger.py # pt ↔ training time log (TL integration)
β”œβ”€β”€ README_ENG.md                 # This document (English)
β”œβ”€β”€ README_KOR.md                 # Korean version
β”œβ”€β”€ INSTALL_ENG.md                # Installation guide (English)
β”œβ”€β”€ INSTALL_KOR.md                # Installation guide (Korean)
β”œβ”€β”€ metrics_eval/
β”‚   β”œβ”€β”€ __init__.py               # Exports 7 metric functions
β”‚   β”œβ”€β”€ task_success_rate.py      # SR
β”‚   β”œβ”€β”€ data_efficiency.py        # DE
β”‚   β”œβ”€β”€ learning_efficiency.py    # TL
β”‚   β”œβ”€β”€ adaptability.py           # AD
β”‚   β”œβ”€β”€ task_time.py              # T_avg
β”‚   β”œβ”€β”€ realtime_performance.py   # TR
β”‚   β”œβ”€β”€ operational_safety.py     # SaR
β”‚   β”œβ”€β”€ inference_recorder.py     # record_inference_time (TR)
β”‚   └── runner.py                 # GUI, test loop, result display
└── docs/
    └── design_metrics_evaluation.md  # Metric and GUI design

To use only the metric functions, import from metrics_eval:

from metrics_eval import (
    compute_sr,
    compute_de,
    compute_tl_display,
    compute_ad,
    compute_task_time_avg,
    compute_tr,
    compute_sar,
)
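
The snippet below shows what a standalone call might look like; the argument formats are assumptions made for illustration (check the individual modules in metrics_eval/ for the actual signatures):

from metrics_eval import compute_sr, compute_task_time_avg

# Hypothetical per-trial records; the real argument formats may differ.
successes  = [True, True, False, True]   # success flag per trial
task_times = [12.4, 10.8, 15.1, 11.3]    # task duration per trial (s)

print(compute_sr(successes))             # assumed to return the success rate in %
print(compute_task_time_avg(task_times)) # assumed to return the mean task time in s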

Summary

  • Run: python run_evaluation.py
  • Input: Evaluation settings β†’ Start Test β†’ per trial: S β†’ E β†’ Y/N (and F if needed)
  • Display: Test Control window (mouse), per-trial table (live), evaluation result window (7 metrics)
  • Integration: training_checkpoint_logger + checkpoint log path β†’ TL
    record_inference_time (between S and E) β†’ TR and per-trial average real-time

For installation and environment setup, see INSTALL_ENG.md.
