If you have difficult questions about robotics (and related models), I recommend asking on the LeRobot Discord.
What the ecosystem currently implies about “1-camera SmolVLA”
1) SmolVLA was trained with varying camera counts, but with standardized camera semantics
The SmolVLA paper describes a normalization step that prioritizes top, wrist, side perspectives and renames them into a standard ordering (OBS_IMAGE_1, OBS_IMAGE_2, OBS_IMAGE_3), and explicitly says unused views were dropped during training. (arXiv)
The Hugging Face blog repeats the same idea: each camera is mapped to a standardized scheme and unused views are dropped. (Hugging Face)
Implication: the model family can handle “not always 3 cameras,” but the semantics/order of the views matter.
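To make the convention concrete, here is the ordering the paper/blog describe, written out as a plain mapping. This dict is only a restatement of that convention, not LeRobot code:

```python
# Standardized view ordering described in the SmolVLA paper/blog
# (illustrative restatement, not an actual LeRobot constant).
STANDARD_VIEW_ORDER = {
    "OBS_IMAGE_1": "top",
    "OBS_IMAGE_2": "wrist",
    "OBS_IMAGE_3": "side",
}
```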
2) The released lerobot/smolvla_base checkpoint is configured for three image keys
The config.json for lerobot/smolvla_base defines three visual inputs:
- observation.images.camera1
- observation.images.camera2
- observation.images.camera3
It also contains empty_cameras: 0. (Hugging Face)
Implication: out-of-the-box, the policy expects three image features unless you modify the config or insert a processor that synthesizes/masks missing inputs.
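If you want to verify this yourself before changing anything, a quick way is to pull the checkpoint's config.json and print the image-related entries. A minimal sketch, assuming the visual inputs are listed under an "input_features" field (check the config layout in your LeRobot version):

```python
# Inspect which image keys lerobot/smolvla_base declares.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("lerobot/smolvla_base", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# Print any feature entries that look like image inputs, plus empty_cameras.
for key, value in cfg.get("input_features", {}).items():
    if "image" in key:
        print(key, value)
print("empty_cameras:", cfg.get("empty_cameras"))
```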
3) Tooling pitfalls: camera key matching and camera order sensitivity
- A common runtime failure mode is a strict check that the robot’s camera keys match the policy’s image feature keys (assertion on key sets). (GitHub)
- A community report suggests the model can be sensitive to camera order: keeping the same order at inference as training helped; reordering hurt. (GitHub)
- Another recent SO-101 report trained with two cameras and had “hiccups” due to a missing third camera, leading them to remap cameras and create a dummy third input; they suspected this could confuse behavior. (GitHub)
Implication: even if the model architecture can accept fewer views, the LeRobot wiring and the implicit ordering can dominate your results.
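A cheap pre-flight check that catches the key-matching pitfall before you launch inference is to compare the two key sets directly. A minimal sketch (the key names are illustrative, taken from the smolvla_base config):

```python
# Pre-flight check: do the camera keys you will provide match the policy's
# expected image features? Key sets below are placeholders for your setup.
robot_camera_keys = {"observation.images.camera1"}   # what your robot/dataset provides
policy_image_keys = {
    "observation.images.camera1",
    "observation.images.camera2",
    "observation.images.camera3",
}                                                    # what the checkpoint config declares

missing = policy_image_keys - robot_camera_keys
extra = robot_camera_keys - policy_image_keys
if missing:
    print("Policy expects keys you are not providing:", sorted(missing))
if extra:
    print("You provide keys the policy does not expect:", sorted(extra))
```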
Similar cases and online resources (practical, directly relevant)
Primary references (most “load-bearing”)
- SmolVLA paper (view normalization and dropping unused views) (arXiv)
- Hugging Face SmolVLA blog (standardized camera scheme) (Hugging Face)
- LeRobot SmolVLA docs (recommended dataset scale, example commands) (Hugging Face)
- lerobot/smolvla_base config (3 camera inputs + empty_cameras) (Hugging Face)
“This broke for me” issues that map to your plan
- Camera order / naming confusion discussion (#1763) — highlights concatenation in order and empirical sensitivity to order (GitHub)
- Robot camera key mismatch assertion (#1620) — strict key-set matching can block inference if policy expects different camera keys (GitHub)
- Two-camera + dummy camera debugging (#2753) — exactly the sort of “missing camera feature” workaround you’re anticipating (GitHub)
Good “how-to” writeups with concrete debugging heuristics
- Correll Lab / Medium: stresses validating input dimensions, camera feed order/names, and especially using the correct stats.json for action scaling (a frequent cause of “arm moves weird then stalls”). (Medium)
- Community dataset stats (what people actually record): most datasets have 1–2 cameras, and 640×480 @ 30fps dominates; most episodes are <30s. (kamenski.me)
How I’d approach your single-wrist-camera SO101 setup
Your current plan is structurally aligned with what HF recommends
HF’s SmolVLA docs recommend ~50 episodes as a starting point and explicitly mention that 25 was not enough in their pick-place example (50 episodes across 5 positions, 10 repeats each). (Hugging Face)
Your “4 prompts × 5 locations × 10 repeats = 200 episodes” is, in principle, a strong base if the demos are clean and consistent.
The main technical decision: do you want to preserve pretraining camera-slot semantics or simplify to 1 input?
Because camera identity is largely implicit via position in the token sequence (and community reports indicate order can matter), you have two competing goals:
Option A — Simplify to one camera input (easiest wiring; may require more data)
Idea: edit the policy config so it only expects one visual feature, and record only that camera key (a config-editing sketch follows below).
Pros:
- Avoids “missing camera” hacks and strict key-set matching at runtime.
- Cleanest operationally: one physical camera, one key, no dummy streams.
Cons:
- Your wrist view becomes the first/only view; if pretraining expected “top-down first,” you’re relying on fine-tuning to compensate.
- With the default settings, where the vision encoder is frozen in the base config (Hugging Face), adaptation may be slower than you’d like if the shift in viewpoint is large.
When this works well: tasks where the object is always in the wrist view early, workspace is constrained, and demonstrations are consistent.
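Here is roughly what the Option A config edit looks like. This is a sketch under the assumption that the visual inputs live under "input_features" in the checkpoint's config.json; verify the exact field names against your LeRobot version:

```python
# Strip the policy config down to a single wrist-camera input before fine-tuning.
import json

CONFIG_PATH = "config.json"                 # local copy of the checkpoint's config
KEEP_KEY = "observation.images.camera1"     # the single key you will record

with open(CONFIG_PATH) as f:
    cfg = json.load(f)

# Drop every image feature except the one you actually record.
for key in list(cfg.get("input_features", {})):
    if "image" in key and key != KEEP_KEY:
        del cfg["input_features"][key]

# Depending on your LeRobot version, related fields (e.g. empty_cameras)
# may also need adjusting.
with open(CONFIG_PATH, "w") as f:
    json.dump(cfg, f, indent=2)
```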
Option B — Keep wrist in the “wrist slot” (camera2) and handle missing views (best semantic continuity; more plumbing)
Idea: preserve the standardized convention suggested by the paper/blog (wrist as the second view), even if you only have one camera.
In practice, that means mapping your wrist camera into the second image slot (camera2) and supplying a dummy or duplicated stream for the unused slot(s), either at recording time or via a processor step; a sketch follows below.
This aligns with the “order matters” concern raised in #1763. (GitHub)
But it’s also where people run into the “missing third camera” headaches seen in #2753. (GitHub)
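A sketch of the observation-assembly step Option B implies. The key names mirror the smolvla_base config; whether a black frame or a duplicated wrist frame is the better placeholder is exactly the open question from #2753:

```python
import numpy as np

def build_observation(wrist_frame: np.ndarray) -> dict:
    """Put the real wrist frame in the 'wrist slot' and pad the other slots."""
    placeholder = np.zeros_like(wrist_frame)        # or wrist_frame.copy() to duplicate
    return {
        "observation.images.camera1": placeholder,  # top slot: placeholder
        "observation.images.camera2": wrist_frame,  # wrist slot: real camera
        "observation.images.camera3": placeholder,  # side slot: placeholder
    }
```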
When this works well: you want maximum compatibility with pretraining conventions and you’re willing to engineer the input pipeline carefully.
My recommendation for your case: start with Option A (single input) as your baseline because it’s operationally robust, then try Option B as an ablation if results are underwhelming.
Concrete, practical implementation guidance
1) Camera key mapping: prefer “fix it at the source”
If you can, record your wrist camera directly under the key you intend the policy to consume (e.g., record it as camera1 or camera2) rather than recording it as wrist and renaming later. This reduces the number of places where things can silently diverge.
If you do need renaming or augmentation, LeRobot’s processor pipeline is the intended mechanism for transforming observation keys before the policy sees them. (Hugging Face)
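For orientation, the transformation such a processor step performs is just a key rename. The function below is not the LeRobot processor API, only a sketch of the mapping you would configure (the wrist key name is an example):

```python
# Illustrative key remapping; express the equivalent mapping in your
# LeRobot processor/pipeline configuration rather than using this directly.
KEY_MAP = {"observation.images.wrist": "observation.images.camera2"}

def remap_observation_keys(obs: dict) -> dict:
    return {KEY_MAP.get(key, key): value for key, value in obs.items()}
```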
2) Be deliberate about episode length and dataset size
If you record long episodes at 30 fps, your dataset gets big fast. Community stats show most episodes are fairly short (many under 30s), and 640×480@30fps is common. (kamenski.me)
For single-camera wrist-only, short, focused episodes often train better than long “wandering” sequences.
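To get a feel for the numbers with your planned 200 episodes, a back-of-the-envelope estimate (the 20 s episode length is an assumption; on-disk size will be far smaller after video encoding):

```python
episodes = 200
seconds_per_episode = 20          # assumed average; adjust to your task
fps = 30
width, height, channels = 640, 480, 3

frames = episodes * seconds_per_episode * fps      # 120,000 frames
raw_bytes = frames * width * height * channels     # ~111 GB uncompressed RGB
print(f"{frames:,} frames, ~{raw_bytes / 1e9:.0f} GB raw before video compression")
```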
3) Don’t skip “data sanity” checks (this is where most failures come from)
Before you run a full training (a quick check sketch follows this list):
- replay episodes and verify frames match what you think you recorded
- confirm the policy’s expected image keys match the dataset keys (otherwise you’ll hit the kind of key assertions seen in #1620) (GitHub)
- confirm your training/inference uses the correct stats.json (action scaling mismatch is a known cause of erratic or stalled motion). (Medium)
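A small check for the last point. The meta/stats.json path and field layout are assumptions about a typical LeRobot dataset; adjust to whatever your version actually writes:

```python
import json

with open("meta/stats.json") as f:     # path inside your dataset (assumed layout)
    stats = json.load(f)

action_stats = stats.get("action", {})
print("action mean:", action_stats.get("mean"))
print("action std :", action_stats.get("std"))
# If the mean/std vectors don't match your robot's action dimension, you are
# probably normalizing with the wrong stats, a known cause of erratic motion.
```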
Expected performance with only a wrist camera (realistically)
What should work
- Tasks where the object is in the wrist view early, the workspace is constrained, and the demonstrations are consistent (the same conditions noted under Option A).
Common failure modes
- Search/approach fails: wrist camera doesn’t provide global scene layout; if the object starts out-of-view, the policy can flail.
- Generalization across locations: wrist-only view can look similar across positions until late; the policy may need more diverse demos to infer where to go.
- Order/semantics mismatch: if you compress to one view, you may lose the “top then wrist” structure that pretraining implicitly learned (and #1763 suggests order can matter). (GitHub)
A practical experiment plan (fast feedback, minimizes wasted collection)
Given you’re collecting this afternoon, the fastest path to insight is:
- Baseline: 1 camera only (Option A).
- Ablation 1: same dataset, but map wrist to the “wrist slot” (camera2) with a dummy/duplicate for camera1 (Option B).
- Ablation 2: change nothing but swap order/slot — see if performance shifts, validating whether your setup is order-sensitive like #1763 reports. (GitHub)
You’ll learn quickly whether “slot semantics” matter in your environment with your data scale.
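To keep the three runs straight and the comparison honest, it can help to write the matrix down explicitly. The dict below is just bookkeeping with illustrative key names, not anything LeRobot consumes:

```python
# Simple experiment matrix for the baseline and two ablations above.
EXPERIMENTS = {
    "baseline_option_a":   {"image_keys": ["observation.images.camera1"], "placeholder_slots": []},
    "ablation_1_option_b": {"image_keys": ["observation.images.camera2"], "placeholder_slots": ["observation.images.camera1"]},
    "ablation_2_swapped":  {"image_keys": ["observation.images.camera1"], "placeholder_slots": ["observation.images.camera2"]},
}
```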
If you want a single “most likely to succeed” choice without extra engineering: train and deploy with one camera key only (Option A), keep episodes short and consistent, and only add dummy-slot complexity if baseline results are clearly capped.