# Segformer85M: Apple Orchard Semantic Segmentation
Segformer-B5 (85M parameters) fine-tuned for 8-class semantic segmentation of outdoor apple orchard scenes captured from a robotic platform.
This repo contains two checkpoints:
| File | When to use |
|---|---|
| `Segformer85Mv1.pt` | Original v1, trained only on the spring `oak_0415` dataset. Best baseline. |
| `Segformer85Mv2.pt` (recommended) | v1 + fine-tuned on a second dataset (different camera, autumn season). Use this for general deployment: same accuracy on the original orchard, dramatically better generalization to new cameras and new seasons. |
## Quick Use
```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch, cv2, numpy as np
import torch.nn.functional as F

# 1. Download weights: pick v1 OR v2 (use v2 by default)
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")

# 2. Init architecture from base + load fine-tuned weights
NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i: n for i, n in enumerate(NAMES)},
    label2id={n: i for i, n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,
).cuda().eval()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])

# 3. Inference (resize to multiples of 32, ImageNet-normalize, upsample logits back)
img = cv2.imread("your_image.jpg")
H, W = img.shape[:2]
H32, W32 = (H // 32) * 32, (W // 32) * 32
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406]); std = np.array([0.229, 0.224, 0.225])
x = torch.from_numpy(((rgb - mean) / std).transpose(2, 0, 1)).unsqueeze(0).float().cuda()
with torch.no_grad():
    logits = model(pixel_values=x).logits
logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
```
A ready-to-use `predict.py` is included in this repo.
## Classes (id → name)
| ID | Class | Notes |
|---|---|---|
| 0 | tree | Apple trees (priority class for downstream tasks) |
| 1 | ground | Grass / dirt / orchard floor |
| 2 | person | Workers in scene |
| 3 | sky | |
| 4 | road | Path between rows |
| 5 | mountain | Distant terrain |
| 6 | building | Sheds, equipment shelters |
| 7 | background | Unknown / unlabeled regions (model output rare) |
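For quick visual inspection, the predicted id map can be colorized with a per-class palette and blended onto the input frame. A minimal NumPy sketch; the palette colors below are arbitrary placeholder choices, not an official color scheme from this repo:

```python
import numpy as np

# Hypothetical BGR palette, one row per class id 0..7 (placeholder colors).
PALETTE = np.array([
    [0, 128, 0],      # 0 tree
    [42, 42, 165],    # 1 ground
    [0, 0, 255],      # 2 person
    [235, 206, 135],  # 3 sky
    [128, 128, 128],  # 4 road
    [96, 96, 160],    # 5 mountain
    [0, 165, 255],    # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)

def colorize(pred: np.ndarray) -> np.ndarray:
    """Map an H x W array of class ids (0..7) to an H x W x 3 BGR image."""
    return PALETTE[pred]

def overlay(frame_bgr: np.ndarray, pred: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend the colorized mask onto the original BGR frame."""
    blend = alpha * colorize(pred) + (1.0 - alpha) * frame_bgr
    return blend.astype(np.uint8)
```

Fancy-indexing the palette with the id map (`PALETTE[pred]`) vectorizes the lookup over the whole image in one step.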
## Architecture & Preprocessing
| | |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| mean | `[0.485, 0.456, 0.406]` |
| std | `[0.229, 0.224, 0.225]` |
| Input resolution | Any H×W where both are multiples of 32 |
| Trained at | 1024×576 (native 16:9) |
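The quick-start snippet resizes to the nearest multiple of 32, which slightly distorts the aspect ratio. An alternative that satisfies the same divisibility constraint is to zero-pad and crop the prediction back. A hypothetical helper (not part of `predict.py`):

```python
import numpy as np

def pad_to_multiple(img: np.ndarray, m: int = 32):
    """Zero-pad H and W up to the next multiple of m (bottom/right padding).
    Returns the padded image plus the original (H, W) so the upsampled
    prediction can be cropped back with pred[:H, :W]."""
    H, W = img.shape[:2]
    pad_h, pad_w = (-H) % m, (-W) % m
    padded = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
    return padded, (H, W)
```

Padding preserves pixel geometry at the cost of wasting a few border rows/columns of compute; resizing keeps the tensor smaller but warps the scene slightly.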
## Performance

### v1 (`Segformer85Mv1.pt`): original training only
Validated on a temporally-disjoint hold-out from the same recording (frames 4501+, no leakage):
| Metric | Value |
|---|---|
| Tree IoU | 0.742 |
| mIoU (7 real classes) | 0.714 |
| Pixel accuracy | 0.834 |
### v2 (`Segformer85Mv2.pt`): v1 + Orchard Navigation fine-tune (recommended)

Same v1 hold-out, with no regression on the old domain:
| Metric | v1 | v2 |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | 0.742 (unchanged) |
| mIoU (orig orchard) | 0.714 | 0.712 |
NEW orchard hold-out (different camera, autumn season; Aug + Sep captures):
| Metric | v1 | v2 |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | 0.999 |
Qualitatively, v1 sometimes misclassifies autumn foliage as person (red), while v2 cleanly segments it as tree. See `samples/` for side-by-side examples.
### v1 per-class IoU (8-class, no leak)
| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |
## Training Data

### v1 base
- ~5300 frames from a single `oak_0415_oneRadar_1` recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged and class-aligned (`vines` → `tree`; `moutain` → `mountain` typo fixed)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames `<= 4500` train (5177 samples), frames `> 4500` validation (155 samples), so there is no neighbor leakage
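The hard-cutoff split can be expressed as a simple filter on frame indices (illustrative only; the actual split lives in `train_v6_5090.py`):

```python
def temporal_split(frame_ids, cutoff=4500):
    """Hard temporal cutoff: every validation frame comes strictly after
    every training frame, so no validation frame is a temporal neighbor
    of a training frame (which would leak near-duplicate images)."""
    train = [f for f in frame_ids if f <= cutoff]
    val = [f for f in frame_ids if f > cutoff]
    return train, val
```

A random split over consecutive video frames would put near-identical images on both sides; a single cutoff avoids that by construction.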
### v2 fine-tune (NEW)
- +311 images from the "Orchard Navigation" dataset:
  - 178 frames from a Sep-16 recording (autumn season)
  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
  - Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in the new images are set to `ignore_index=255`, so the model only adapts its tree decisions, leaving other classes untouched
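The tree-only labeling described above can be sketched as: start from a label map filled with the ignore value and write the tree class id only where the polygons are. A sketch assuming a binary tree mask as input; the helper name is illustrative:

```python
import numpy as np

TREE_ID = 0         # "tree" class id, per the class table above
IGNORE_INDEX = 255  # pixels with this value contribute nothing to the loss

def tree_only_labels(tree_mask: np.ndarray) -> np.ndarray:
    """Convert a binary tree polygon mask into a training label map where
    every non-tree pixel is ignored, so gradients on the new-domain images
    only affect the model's tree predictions."""
    labels = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    labels[tree_mask.astype(bool)] = TREE_ID
    return labels
```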
## Training Recipe

### v1
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |
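The weighted cross-entropy in the table maps directly onto PyTorch's `nn.CrossEntropyLoss`. A minimal sketch; the tensor shapes here are illustrative, not taken from the training script:

```python
import torch
import torch.nn as nn

# Per-class weights from the table, indexed by class id 0..7
# (tree, ground, person, sky, road, mountain, building, background).
CLASS_WEIGHTS = torch.tensor([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])

# ignore_index=255 matches the convention used for unlabeled pixels in v2.
criterion = nn.CrossEntropyLoss(weight=CLASS_WEIGHTS, ignore_index=255)

# logits: (N, 8, H, W) raw model outputs; labels: (N, H, W) int64 class ids.
logits = torch.randn(2, 8, 4, 4)
labels = torch.randint(0, 8, (2, 4, 4))
loss = criterion(logits, labels)
```

Down-weighting easy, abundant classes (ground 0.5, background 0.1) keeps them from dominating the gradient, while tree and person get extra emphasis.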
### v2 fine-tune (delta from v1)
| Hyperparameter | Value |
|---|---|
| LR | 5e-6 (10× lower for a safe fine-tune) |
| Epochs | 8 (best at epoch 3) |
| ignore_index | 255 (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |
## Limitations
This model was trained on a single Korean apple orchard (spring 2024) with a single robot platform, plus a small fine-tune on a second autumn capture. Expect degradation on:
- Different orchards (different tree species, layouts, training systems)
- Different cameras (different FOV, color profiles, sensors)
- Different seasons not in training (winter dormant trees)
- Different lighting (rain, dawn/dusk, night)
- Aerial / drone perspectives
For deployment in a new context, plan to fine-tune on 100-300 in-domain images.
## Files in This Repo
| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB, recommended) |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | v1 vs v2 side-by-side on the new orchard (showcases the v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |
## License
Apache 2.0