If you have difficult questions about robotics (and related models), I recommend asking on the LeRobot Discord.
What the ecosystem currently implies about “1-camera SmolVLA”
1) SmolVLA was trained with varying camera counts, but with standardized camera semantics
The SmolVLA paper describes a normalization step that prioritizes top, wrist, side perspectives and renames them into a standard ordering (OBS_IMAGE_1, OBS_IMAGE_2, OBS_IMAGE_3), and explicitly says unused views were dropped during training. (arXiv)
The Hugging Face blog repeats the same idea: each camera is mapped to a standardized scheme and unused views are dropped. (Hugging Face)
Implication: the model family can handle “not always 3 cameras,” but the semantics/order of the views matter.
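To make the convention concrete, here is the ordering the paper/blog describe, written out as a plain mapping. This dict is only a restatement of that convention, not LeRobot code:

```python
# Standardized view ordering described in the SmolVLA paper/blog
# (illustrative restatement, not an actual LeRobot constant).
STANDARD_VIEW_ORDER = {
    "OBS_IMAGE_1": "top",
    "OBS_IMAGE_2": "wrist",
    "OBS_IMAGE_3": "side",
}
```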
2) The released lerobot/smolvla_base checkpoint is configured for three image keys
The config.json for lerobot/smolvla_base defines three visual inputs:
- observation.images.camera1
- observation.images.camera2
- observation.images.camera3
It also contains empty_cameras: 0. (Hugging Face)
Implication: out-of-the-box, the policy expects three image features unless you modify the config or insert a processor that synthesizes/masks missing inputs.
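If you want to verify this yourself before changing anything, a quick way is to pull the checkpoint's config.json and print the image-related entries. A minimal sketch, assuming the visual inputs are listed under an "input_features" field (check the config layout in your LeRobot version):

```python
# Inspect which image keys lerobot/smolvla_base declares.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("lerobot/smolvla_base", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Print any feature entries that look like image inputs, plus empty_cameras.
for key, value in cfg.get("input_features", {}).items():
    if "image" in key:
        print(key, value)
print("empty_cameras:", cfg.get("empty_cameras"))
```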
3) Tooling pitfalls: camera key matching and camera order sensitivity
- A common runtime failure mode is a strict check that the robot’s camera keys match the policy’s image feature keys (assertion on key sets). (GitHub)
- A community report suggests the model can be sensitive to camera order: keeping the same order at inference as training helped; reordering hurt. (GitHub)
- Another recent SO-101 report trained with two cameras and had “hiccups” due to a missing third camera, leading them to remap cameras and create a dummy third input; they suspected this could confuse behavior. (GitHub)
Implication: even if the model architecture can accept fewer views, the LeRobot wiring and the implicit ordering can dominate your results.
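A cheap pre-flight check that catches the key-matching pitfall before you launch inference is to compare the two key sets directly. A minimal sketch (the key names are illustrative, taken from the smolvla_base config):

```python
# Pre-flight check: do the camera keys you will provide match the policy's
# expected image features? Key sets below are placeholders for your setup.
robot_camera_keys = {"observation.images.camera1"}   # what your robot/dataset provides
policy_image_keys = {
    "observation.images.camera1",
    "observation.images.camera2",
    "observation.images.camera3",
}                                                    # what the checkpoint config declares

missing = policy_image_keys - robot_camera_keys
extra = robot_camera_keys - policy_image_keys
if missing:
    print("Policy expects keys you are not providing:", sorted(missing))
if extra:
    print("You provide keys the policy does not expect:", sorted(extra))
```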
Similar cases and online resources (practical, directly relevant)
Primary references (most “load-bearing”)
- SmolVLA paper (view normalization and dropping unused views) (arXiv)
- Hugging Face SmolVLA blog (standardized camera scheme) (Hugging Face)
- LeRobot SmolVLA docs (recommended dataset scale, example commands) (Hugging Face)
- lerobot/smolvla_base config (3 camera inputs + empty_cameras) (Hugging Face)
“This broke for me” issues that map to your plan
- Camera order / naming confusion discussion (#1763) — highlights concatenation in order and empirical sensitivity to order (GitHub)
- Robot camera key mismatch assertion (#1620) — strict key-set matching can block inference if policy expects different camera keys (GitHub)
- Two-camera + dummy camera debugging (#2753) — exactly the sort of “missing camera feature” workaround you’re anticipating (GitHub)
Good “how-to” writeups with concrete debugging heuristics
- Correll Lab / Medium: stresses validating input dimensions, camera feed order/names, and especially using the correct stats.json for action scaling (a frequent cause of “arm moves weird then stalls”). (Medium)
- Community dataset stats (what people actually record): most datasets have 1–2 cameras, and 640×480 @ 30fps dominates; most episodes are <30s. (kamenski.me)
How I’d approach your single-wrist-camera SO101 setup
Your current plan is structurally aligned with what HF recommends
HF’s SmolVLA docs recommend ~50 episodes as a starting point and explicitly mention that 25 was not enough in their pick-place example (50 episodes across 5 positions, 10 repeats each). (Hugging Face)
Your “4 prompts × 5 locations × 10 repeats = 200 episodes” is, in principle, a strong base if the demos are clean and consistent.
The main technical decision: do you want to preserve pretraining camera-slot semantics or simplify to 1 input?
Because camera identity is largely implicit via position in the token sequence (and community reports indicate order can matter), you have two competing goals:
Option A — Simplify to one camera input (easiest wiring; may require more data)
Idea: edit the policy config so it only expects one visual feature, and record only that camera key (a config-editing sketch follows below).
Pros:
- Avoids “missing camera” hacks and strict key-set matching at runtime.
- Cleanest operationally: one physical camera, one key, no dummy streams.
Cons:
- Your wrist view becomes the first/only view; if pretraining expected “top-down first,” you’re relying on fine-tuning to compensate.
- With the default settings, where the vision encoder is frozen in the base config (Hugging Face), adaptation may be slower than you’d like if the shift in viewpoint is large.
When this works well: tasks where the object is always in the wrist view early, workspace is constrained, and demonstrations are consistent.
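Here is roughly what the Option A config edit looks like. This is a sketch under the assumption that the visual inputs live under "input_features" in the checkpoint's config.json; verify the exact field names against your LeRobot version:

```python
# Strip the policy config down to a single wrist-camera input before fine-tuning.
import json

CONFIG_PATH = "config.json"                 # local copy of the checkpoint's config
KEEP_KEY = "observation.images.camera1"     # the single key you will record

with open(CONFIG_PATH) as f:
    cfg = json.load(f)

# Drop every image feature except the one you actually record.
for key in list(cfg.get("input_features", {})):
    if "image" in key and key != KEEP_KEY:
        del cfg["input_features"][key]

# Depending on your LeRobot version, related fields (e.g. empty_cameras)
# may also need adjusting.
with open(CONFIG_PATH, "w") as f:
    json.dump(cfg, f, indent=2)
```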
Option B — Keep wrist in the “wrist slot” (camera2) and handle missing views (best semantic continuity; more plumbing)
Idea: preserve the standardized convention suggested by the paper/blog (wrist as the second view), even if you only have one camera.
In practice, that means mapping your wrist camera into the second image slot (camera2) and supplying a dummy or duplicated stream for the unused slot(s), either at recording time or via a processor step; a sketch follows below.
This aligns with the “order matters” concern raised in #1763. (GitHub)
But it’s also where people run into the “missing third camera” headaches seen in #2753. (GitHub)
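A sketch of the observation-assembly step Option B implies. The key names mirror the smolvla_base config; whether a black frame or a duplicated wrist frame is the better placeholder is exactly the open question from #2753:

```python
import numpy as np

def build_observation(wrist_frame: np.ndarray) -> dict:
    """Put the real wrist frame in the 'wrist slot' and pad the other slots."""
    placeholder = np.zeros_like(wrist_frame)        # or wrist_frame.copy() to duplicate
    return {
        "observation.images.camera1": placeholder,  # top slot: placeholder
        "observation.images.camera2": wrist_frame,  # wrist slot: real camera
        "observation.images.camera3": placeholder,  # side slot: placeholder
    }
```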
When this works well: you want maximum compatibility with pretraining conventions and you’re willing to engineer the input pipeline carefully.
My recommendation for your case: start with Option A (single input) as your baseline because it’s operationally robust, then try Option B as an ablation if results are underwhelming.
Concrete, practical implementation guidance
1) Camera key mapping: prefer “fix it at the source”
If you can, record your wrist camera directly under the key you intend the policy to consume (e.g., record it as camera1 or camera2) rather than recording it as wrist and renaming later. This reduces the number of places where things can silently diverge.
If you do need renaming or augmentation, LeRobot’s processor pipeline is the intended mechanism for transforming observation keys before the policy sees them. (Hugging Face)
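For orientation, the transformation such a processor step performs is just a key rename. The function below is not the LeRobot processor API, only a sketch of the mapping you would configure (the wrist key name is an example):

```python
# Illustrative key remapping; express the equivalent mapping in your
# LeRobot processor/pipeline configuration rather than using this directly.
KEY_MAP = {"observation.images.wrist": "observation.images.camera2"}

def remap_observation_keys(obs: dict) -> dict:
    return {KEY_MAP.get(key, key): value for key, value in obs.items()}
```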
2) Be deliberate about episode length and dataset size
If you record long episodes at 30 fps, your dataset gets big fast. Community stats show most episodes are fairly short (many under 30s), and 640×480@30fps is common. (kamenski.me)
For single-camera wrist-only, short, focused episodes often train better than long “wandering” sequences.
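To get a feel for the numbers with your planned 200 episodes, a back-of-the-envelope estimate (the 20 s episode length is an assumption; on-disk size will be far smaller after video encoding):

```python
episodes = 200
seconds_per_episode = 20          # assumed average; adjust to your task
fps = 30
width, height, channels = 640, 480, 3

frames = episodes * seconds_per_episode * fps      # 120,000 frames
raw_bytes = frames * width * height * channels     # ~111 GB uncompressed RGB
print(f"{frames:,} frames, ~{raw_bytes / 1e9:.0f} GB raw before video compression")
```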
3) Don’t skip “data sanity” checks (this is where most failures come from)
Before you run a full training (a quick check sketch follows this list):
- replay episodes and verify frames match what you think you recorded
- confirm the policy’s expected image keys match the dataset keys (otherwise you’ll hit the kind of key assertions seen in #1620) (GitHub)
- confirm your training/inference uses the correct stats.json (action scaling mismatch is a known cause of erratic or stalled motion). (Medium)
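A small check for the last point. The meta/stats.json path and field layout are assumptions about a typical LeRobot dataset; adjust to whatever your version actually writes:

```python
import json

with open("meta/stats.json") as f:     # path inside your dataset (assumed layout)
    stats = json.load(f)

action_stats = stats.get("action", {})
print("action mean:", action_stats.get("mean"))
print("action std :", action_stats.get("std"))
# If the mean/std vectors don't match your robot's action dimension, you are
# probably normalizing with the wrong stats, a known cause of erratic motion.
```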
Expected performance with only a wrist camera (realistically)
What should work
- Tasks where the object is in the wrist view early, the workspace is constrained, and the demonstrations are consistent (the same conditions noted under Option A).
Common failure modes
- Search/approach fails: wrist camera doesn’t provide global scene layout; if the object starts out-of-view, the policy can flail.
- Generalization across locations: wrist-only view can look similar across positions until late; the policy may need more diverse demos to infer where to go.
- Order/semantics mismatch: if you compress to one view, you may lose the “top then wrist” structure that pretraining implicitly learned (and #1763 suggests order can matter). (GitHub)
A practical experiment plan (fast feedback, minimizes wasted collection)
Given you’re collecting this afternoon, the fastest path to insight is:
- Baseline: 1 camera only (Option A).
- Ablation 1: same dataset, but map wrist to the “wrist slot” (camera2) with a dummy/duplicate for camera1 (Option B).
- Ablation 2: change nothing but swap order/slot — see if performance shifts, validating whether your setup is order-sensitive like #1763 reports. (GitHub)
You’ll learn quickly whether “slot semantics” matter in your environment with your data scale.
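To keep the three runs straight and the comparison honest, it can help to write the matrix down explicitly. The dict below is just bookkeeping with illustrative key names, not anything LeRobot consumes:

```python
# Simple experiment matrix for the baseline and two ablations above.
EXPERIMENTS = {
    "baseline_option_a":   {"image_keys": ["observation.images.camera1"], "placeholder_slots": []},
    "ablation_1_option_b": {"image_keys": ["observation.images.camera2"], "placeholder_slots": ["observation.images.camera1"]},
    "ablation_2_swapped":  {"image_keys": ["observation.images.camera1"], "placeholder_slots": ["observation.images.camera2"]},
}
```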
If you want a single “most likely to succeed” choice without extra engineering: train and deploy with one camera key only (Option A), keep episodes short and consistent, and only add dummy-slot complexity if baseline results are clearly capped.