# Segformer85M: Apple Orchard Semantic Segmentation
Segformer-B5 (85M parameters) fine-tuned for 8-class semantic segmentation of outdoor apple orchard scenes captured from a robotic platform.
This repo contains two checkpoints:
| File | When to use |
|---|---|
| `Segformer85Mv1.pt` | Original v1, trained only on the spring `oak_0415` dataset. Best baseline. |
| `Segformer85Mv2.pt` (recommended) | v1 + fine-tuned on a second dataset (different camera, autumn season). Use this for general deployment: same accuracy on the original orchard, dramatically better generalization to new cameras and new seasons. |
## Quick Use
```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch, cv2, numpy as np
import torch.nn.functional as F

# 1. Download weights: pick v1 OR v2 (use v2 by default)
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")

# 2. Init architecture from base + load fine-tuned weights
NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i: n for i, n in enumerate(NAMES)},
    label2id={n: i for i, n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,
).cuda().eval()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])

# 3. Inference (resize to multiples of 32, ImageNet-normalize, upsample logits back)
img = cv2.imread("your_image.jpg")
H, W = img.shape[:2]
H32, W32 = (H // 32) * 32, (W // 32) * 32
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406]); std = np.array([0.229, 0.224, 0.225])
x = torch.from_numpy(((rgb - mean) / std).transpose(2, 0, 1)).unsqueeze(0).float().cuda()
with torch.no_grad():
    logits = model(pixel_values=x).logits
logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
```
A ready-to-use `predict.py` is included in this repo.
## Classes (id → name)
| ID | Class | Notes |
|---|---|---|
| 0 | tree | Apple trees (priority class for downstream tasks) |
| 1 | ground | Grass / dirt / orchard floor |
| 2 | person | Workers in scene |
| 3 | sky | |
| 4 | road | Path between rows |
| 5 | mountain | Distant terrain |
| 6 | building | Sheds, equipment shelters |
| 7 | background | Unknown / unlabeled regions (model output rare) |
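For quick visual inspection, the predicted id map can be colorized with a per-class palette and blended onto the input frame. A minimal NumPy sketch; the palette colors below are arbitrary placeholder choices, not an official color scheme from this repo:

```python
import numpy as np

# Hypothetical BGR palette, one row per class id 0..7 (placeholder colors).
PALETTE = np.array([
    [0, 128, 0],      # 0 tree
    [42, 42, 165],    # 1 ground
    [0, 0, 255],      # 2 person
    [235, 206, 135],  # 3 sky
    [128, 128, 128],  # 4 road
    [96, 96, 160],    # 5 mountain
    [0, 165, 255],    # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)

def colorize(pred: np.ndarray) -> np.ndarray:
    """Map an H x W array of class ids (0..7) to an H x W x 3 BGR image."""
    return PALETTE[pred]

def overlay(frame_bgr: np.ndarray, pred: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend the colorized mask onto the original BGR frame."""
    blend = alpha * colorize(pred) + (1.0 - alpha) * frame_bgr
    return blend.astype(np.uint8)
```

Fancy-indexing the palette with the id map (`PALETTE[pred]`) vectorizes the lookup over the whole image in one step.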
## Architecture & Preprocessing
| | |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| mean | `[0.485, 0.456, 0.406]` |
| std | `[0.229, 0.224, 0.225]` |
| Input resolution | Any H×W where both are multiples of 32 |
| Trained at | 1024×576 (native 16:9) |
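The quick-start snippet resizes to the nearest multiple of 32, which slightly distorts the aspect ratio. An alternative that satisfies the same divisibility constraint is to zero-pad and crop the prediction back. A hypothetical helper (not part of `predict.py`):

```python
import numpy as np

def pad_to_multiple(img: np.ndarray, m: int = 32):
    """Zero-pad H and W up to the next multiple of m (bottom/right padding).
    Returns the padded image plus the original (H, W) so the upsampled
    prediction can be cropped back with pred[:H, :W]."""
    H, W = img.shape[:2]
    pad_h, pad_w = (-H) % m, (-W) % m
    padded = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
    return padded, (H, W)
```

Padding preserves pixel geometry at the cost of wasting a few border rows/columns of compute; resizing keeps the tensor smaller but warps the scene slightly.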
## Performance

### v1 (`Segformer85Mv1.pt`): original training only
Validated on a temporally-disjoint hold-out from the same recording (frames 4501+, no leakage):
| Metric | Value |
|---|---|
| Tree IoU | 0.742 |
| mIoU (7 real classes) | 0.714 |
| Pixel accuracy | 0.834 |
### v2 (`Segformer85Mv2.pt`): v1 + Orchard Navigation fine-tune (recommended)

Same v1 hold-out, with no regression on the old domain:
| Metric | v1 | v2 |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | 0.742 (unchanged) |
| mIoU (orig orchard) | 0.714 | 0.712 |
NEW orchard hold-out (different camera, autumn season; Aug + Sep captures):
| Metric | v1 | v2 |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | 0.999 |
Qualitatively, v1 sometimes misclassifies autumn foliage as person (red), while v2 cleanly segments it as tree. See `samples/` for side-by-side examples.
### v1 per-class IoU (8-class, no leak)
| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |
## Training Data

### v1 base
- ~5300 frames from a single `oak_0415_oneRadar_1` recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged and class-aligned (`vines` → `tree`; `moutain` → `mountain` typo fixed)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames `<= 4500` train (5177 samples), frames `> 4500` validation (155 samples), so there is no neighbor leakage
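The hard-cutoff split can be expressed as a simple filter on frame indices (illustrative only; the actual split lives in `train_v6_5090.py`):

```python
def temporal_split(frame_ids, cutoff=4500):
    """Hard temporal cutoff: every validation frame comes strictly after
    every training frame, so no validation frame is a temporal neighbor
    of a training frame (which would leak near-duplicate images)."""
    train = [f for f in frame_ids if f <= cutoff]
    val = [f for f in frame_ids if f > cutoff]
    return train, val
```

A random split over consecutive video frames would put near-identical images on both sides; a single cutoff avoids that by construction.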
### v2 fine-tune (NEW)
- +311 images from the "Orchard Navigation" dataset:
  - 178 frames from a Sep-16 recording (autumn season)
  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
  - Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in the new images are set to `ignore_index=255`, so the model only adapts its tree decisions, leaving other classes untouched
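The tree-only labeling described above can be sketched as: start from a label map filled with the ignore value and write the tree class id only where the polygons are. A sketch assuming a binary tree mask as input; the helper name is illustrative:

```python
import numpy as np

TREE_ID = 0         # "tree" class id, per the class table above
IGNORE_INDEX = 255  # pixels with this value contribute nothing to the loss

def tree_only_labels(tree_mask: np.ndarray) -> np.ndarray:
    """Convert a binary tree polygon mask into a training label map where
    every non-tree pixel is ignored, so gradients on the new-domain images
    only affect the model's tree predictions."""
    labels = np.full(tree_mask.shape, IGNORE_INDEX, dtype=np.uint8)
    labels[tree_mask.astype(bool)] = TREE_ID
    return labels
```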
## Training Recipe

### v1
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |
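The weighted cross-entropy in the table maps directly onto PyTorch's `nn.CrossEntropyLoss`. A minimal sketch; the tensor shapes here are illustrative, not taken from the training script:

```python
import torch
import torch.nn as nn

# Per-class weights from the table, indexed by class id 0..7
# (tree, ground, person, sky, road, mountain, building, background).
CLASS_WEIGHTS = torch.tensor([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])

# ignore_index=255 matches the convention used for unlabeled pixels in v2.
criterion = nn.CrossEntropyLoss(weight=CLASS_WEIGHTS, ignore_index=255)

# logits: (N, 8, H, W) raw model outputs; labels: (N, H, W) int64 class ids.
logits = torch.randn(2, 8, 4, 4)
labels = torch.randint(0, 8, (2, 4, 4))
loss = criterion(logits, labels)
```

Down-weighting easy, abundant classes (ground 0.5, background 0.1) keeps them from dominating the gradient, while tree and person get extra emphasis.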
### v2 fine-tune (delta from v1)
| Hyperparameter | Value |
|---|---|
| LR | 5e-6 (10× lower for a safe fine-tune) |
| Epochs | 8 (best at epoch 3) |
| ignore_index | 255 (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |
## Limitations
This model was trained on a single Korean apple orchard (spring 2024) with a single robot platform, plus a small fine-tune on a second autumn capture. Expect degradation on:
- Different orchards (different tree species, layouts, training systems)
- Different cameras (different FOV, color profiles, sensors)
- Different seasons not in training (winter dormant trees)
- Different lighting (rain, dawn/dusk, night)
- Aerial / drone perspectives
For deployment in a new context, plan to fine-tune on 100-300 in-domain images.
## Files in This Repo
| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB, recommended) |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | v1 vs v2 side-by-side on the new orchard (showcases the v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |
## License
Apache 2.0