# Segformer85M — Apple Orchard Semantic Segmentation

Segformer-B5 (85M parameters) fine-tuned for 8-class semantic segmentation of outdoor apple orchard scenes captured from a robotic platform.

This repo contains two checkpoints:

| File | When to use |
|---|---|
| `Segformer85Mv1.pt` | Original v1, trained only on the spring `oak_0415` dataset. Best baseline. |
| `Segformer85Mv2.pt` ⭐ | v1 fine-tuned on a second dataset (different camera, autumn season). Use this for general deployment: same accuracy on the original orchard, dramatically better generalization to new cameras / new seasons. |

## Quick Use

```python
from huggingface_hub import hf_hub_download
from transformers import SegformerForSemanticSegmentation
import torch
import torch.nn.functional as F
import cv2
import numpy as np

# 1. Download weights: pick v1 OR v2 (v2 recommended for deployment)
ckpt_path = hf_hub_download(repo_id="WEN0256/Segformer85Mv1", filename="Segformer85Mv2.pt")

# 2. Init the architecture from the base model, then load the fine-tuned weights
NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/segformer-b5-finetuned-ade-640-640",
    num_labels=8,
    id2label={i: n for i, n in enumerate(NAMES)},
    label2id={n: i for i, n in enumerate(NAMES)},
    ignore_mismatched_sizes=True,  # decoder head is reinitialized for 8 classes
).cuda()
model.load_state_dict(torch.load(ckpt_path, map_location="cuda")["model"])
model.eval()

# 3. Inference: resize so H and W are multiples of 32, normalize with ImageNet stats
img = cv2.imread("your_image.jpg")
H, W = img.shape[:2]
H32, W32 = (H // 32) * 32, (W // 32) * 32
rgb = cv2.cvtColor(cv2.resize(img, (W32, H32)), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
x = torch.from_numpy(((rgb - mean) / std).transpose(2, 0, 1)).unsqueeze(0).float().cuda()

with torch.no_grad():
    logits = model(pixel_values=x).logits  # 1 x 8 x H32/4 x W32/4
    logits = F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
    pred = logits.argmax(1)[0].cpu().numpy()  # H x W, values 0..7
```
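The `pred` id map can feed simple downstream sanity checks. As an illustration, here is a small hypothetical helper (not part of this repo) that summarizes a predicted mask as per-class pixel fractions:

```python
import numpy as np

# Class names as defined by this model's id2label mapping.
NAMES = ["tree", "ground", "person", "sky", "road", "mountain", "building", "background"]

def class_fractions(pred: np.ndarray) -> dict:
    """pred: H x W array of class ids 0..7 -> {class name: fraction of pixels}."""
    counts = np.bincount(pred.ravel(), minlength=len(NAMES))
    return {n: c / pred.size for n, c in zip(NAMES, counts)}

# Toy 2x3 mask: 3 tree pixels, 1 ground pixel, 2 sky pixels.
toy = np.array([[0, 0, 3], [0, 1, 3]])
print(class_fractions(toy))
```

A check like `class_fractions(pred)["tree"]` dropping near zero on a frame that clearly contains trees is a quick signal of domain shift.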

A ready-to-use `predict.py` is included in this repo.

## Classes (id → name)

| ID | Class | Notes |
|---|---|---|
| 0 | tree | Apple trees (priority class for downstream tasks) |
| 1 | ground | Grass / dirt / orchard floor |
| 2 | person | Workers in scene |
| 3 | sky | |
| 4 | road | Path between rows |
| 5 | mountain | Distant terrain |
| 6 | building | Sheds, equipment shelters |
| 7 | background | Unknown / unlabeled regions (rarely predicted) |
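For visualization, an id map can be colorized with a simple palette lookup. The colors below are arbitrary placeholders, not the palette used in `samples/`:

```python
import numpy as np

# Example palette, one RGB row per class id 0..7 (colors chosen here for illustration).
PALETTE = np.array([
    [34, 139, 34],    # 0 tree
    [139, 115, 85],   # 1 ground
    [255, 0, 0],      # 2 person
    [135, 206, 235],  # 3 sky
    [128, 128, 128],  # 4 road
    [110, 90, 70],    # 5 mountain
    [200, 180, 60],   # 6 building
    [0, 0, 0],        # 7 background
], dtype=np.uint8)

def colorize(pred: np.ndarray) -> np.ndarray:
    """Map an H x W array of class ids 0..7 to an H x W x 3 color image."""
    return PALETTE[pred]

mask = np.array([[0, 3], [7, 2]])
color = colorize(mask)  # shape (2, 2, 3)
```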

## Architecture & Preprocessing

| Property | Value |
|---|---|
| Base model | `nvidia/segformer-b5-finetuned-ade-640-640` |
| Parameters | ~85M |
| Decoder head | Reinitialized for 8 classes |
| Input format | RGB, normalized with ImageNet mean/std |
| Normalization mean | [0.485, 0.456, 0.406] |
| Normalization std | [0.229, 0.224, 0.225] |
| Input resolution | Any H×W where both are multiples of 32; trained at 1024×576 (native 16:9) |
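The multiple-of-32 constraint comes from the encoder's downsampling stages. A minimal helper (a sketch, not repo code) to snap an arbitrary resolution down to a valid size:

```python
def snap32(h: int, w: int) -> tuple:
    """Round a resolution down to the nearest multiples of 32 (minimum 32)."""
    return max(32, (h // 32) * 32), max(32, (w // 32) * 32)

print(snap32(576, 1024))  # training resolution already fits: (576, 1024)
print(snap32(720, 1280))  # 720p snaps down to (704, 1280)
```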

## Performance

### v1 (Segformer85Mv1.pt): original training only

Validated on a temporally disjoint hold-out from the same recording (frames 4501+, no leakage):

| Metric | Value |
|---|---|
| Tree IoU | 0.742 |
| mIoU (7 real classes) | 0.714 |
| Pixel accuracy | 0.834 |

### v2 (Segformer85Mv2.pt): v1 + Orchard Navigation fine-tune ⭐

On the same v1 hold-out, no regression on the old domain:

| Metric | v1 | v2 |
|---|---|---|
| Tree IoU (orig orchard, no leak) | 0.742 | 0.742 ✅ |
| mIoU (orig orchard) | 0.714 | 0.712 |

NEW orchard hold-out (different camera, autumn season, Aug + Sep capture):

| Metric | v1 | v2 |
|---|---|---|
| Tree recall on new orchard | ~0.55 (estimated) | 0.999 🚀 |

Qualitatively, v1 sometimes misclassifies autumn foliage as person (red); v2 cleanly segments it as tree. See `samples_v6_vs_v7/` for side-by-side examples.

### v1 per-class IoU (8-class, no leak)

| Class | IoU | Precision | Recall |
|---|---|---|---|
| tree | 0.742 | 0.79 | 0.93 |
| ground | 0.851 | 0.91 | 0.93 |
| person | 0.719 | 0.82 | 0.85 |
| sky | 0.769 | 0.83 | 0.91 |
| road | 0.804 | 0.86 | 0.92 |
| mountain | 0.437 | 0.62 | 0.66 |
| building | 0.711 | 0.84 | 0.83 |
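These metrics follow the standard confusion-matrix definitions. A numpy sketch of how per-class IoU, precision, and recall fall out of a pixel confusion matrix (illustrative, not the repo's evaluation script):

```python
import numpy as np

def per_class_metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels with true class i predicted as class j.
    Returns (iou, precision, recall) arrays, one entry per class."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp  # predicted as c but actually something else
    fn = conf.sum(axis=1) - tp  # actually c but predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return iou, precision, recall

# Toy 2-class confusion matrix for demonstration.
conf = np.array([[8, 2],
                 [1, 9]])
iou, prec, rec = per_class_metrics(conf)
```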

## Training Data

### v1 base

- ~5300 frames from a single `oak_0415_oneRadar_1` recording (spring, single camera)
- Initial annotations from 3 separate Roboflow projects (SAM-assisted polygons), merged and class-aligned (vines → tree; "moutain" → mountain typo fixed)
- Pseudo-labels generated by an earlier model to fill SAM annotation gaps
- Temporal split: frames <=4500 train (5177 samples), frames >4500 validation (155 samples), with no neighbor leakage
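A temporal split like this reduces to a filter on frame index. A sketch, assuming frame numbers are recoverable from filenames such as `frame_004501.png` (an assumed naming scheme, not necessarily the repo's):

```python
import re

SPLIT_FRAME = 4500  # frames <= 4500 train, > 4500 validation

def split_by_frame(filenames):
    """Partition filenames into (train, val) by the frame number in the name."""
    train, val = [], []
    for name in filenames:
        frame = int(re.search(r"(\d+)", name).group(1))
        (train if frame <= SPLIT_FRAME else val).append(name)
    return train, val

train, val = split_by_frame(["frame_000010.png", "frame_004500.png", "frame_004501.png"])
```

Because consecutive frames are nearly identical, a random split would leak near-duplicates into validation; the temporal cut avoids that.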

### v2 fine-tune (NEW)

- +311 images from the "Orchard Navigation" dataset:
  - 178 frames from a Sep 16 recording (autumn season)
  - 134 frames from a Windows webcam capture (Aug 23, different camera/sensor)
- Tree-only polygon annotations
- Mixed with 500 sampled v1 images (full 8-class masks) to prevent forgetting
- Non-tree pixels in new images set to `ignore_index=255` so the model only adapts its tree decisions, leaving other classes untouched
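Masking non-tree pixels to the ignore index can be done directly on the label map. A minimal numpy sketch, assuming the tree-only polygons were rasterized to a binary mask (an assumption about the pipeline, not repo code):

```python
import numpy as np

TREE_ID, IGNORE = 0, 255  # tree is class 0; 255 is excluded from the loss

def tree_only_labels(tree_mask: np.ndarray) -> np.ndarray:
    """tree_mask: H x W boolean (True where a tree polygon covers the pixel).
    Returns a label map where every non-tree pixel is ignored, so the loss
    only updates the model's tree decisions."""
    labels = np.full(tree_mask.shape, IGNORE, dtype=np.uint8)
    labels[tree_mask] = TREE_ID
    return labels

m = np.array([[True, False], [False, True]])
labels = tree_only_labels(m)
```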

## Training Recipe

### v1

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, weight_decay 0.01 |
| LR | 2e-5, cosine schedule |
| Epochs | 30 |
| Batch | 2 × grad_accum 4 (effective 8) |
| Resolution | 1024×576 |
| Precision | bfloat16 |
| Loss | weighted cross-entropy |
| Class weights | tree 1.5, ground 0.5, person 1.5, sky 1.0, road 1.0, mountain 1.0, building 1.0, background 0.1 |
| Hardware | RTX 5090 (32 GB), ~2.3 hours |
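The weighted cross-entropy with these class weights and an ignore index follows the standard formula. A numpy illustration of the per-pixel weighting (training itself uses the equivalent PyTorch loss; this is just the math spelled out):

```python
import numpy as np

# Class weights from the recipe above, in id order tree..background.
WEIGHTS = np.array([1.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 0.1])
IGNORE = 255

def weighted_ce(logits: np.ndarray, target: np.ndarray) -> float:
    """logits: N x 8, target: N integer labels (255 = ignored pixel).
    Weighted cross-entropy, normalized by the summed weights of kept pixels
    (matching PyTorch's 'mean' reduction when a weight tensor is given)."""
    keep = target != IGNORE
    logits, target = logits[keep], target[keep]
    z = logits - logits.max(axis=1, keepdims=True)  # log-softmax, stable form
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = WEIGHTS[target]
    return float(-(w * logp[np.arange(len(target)), target]).sum() / w.sum())

# Two pixels: one confidently correct "tree", one ignored.
logits = np.array([[5.0, 0, 0, 0, 0, 0, 0, 0],
                   [0, 5.0, 0, 0, 0, 0, 0, 0]])
loss = weighted_ce(logits, np.array([0, 255]))
```

Down-weighting `ground` (0.5) and `background` (0.1) keeps the abundant easy pixels from dominating the gradient, while `tree` and `person` (1.5) get extra emphasis.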

### v2 fine-tune (delta from v1)

| Hyperparameter | Value |
|---|---|
| LR | 5e-6 (10× lower for a safe fine-tune) |
| Epochs | 8 (best at epoch 3) |
| ignore_index | 255 (for unlabeled pixels in new data) |
| Everything else | Same as v1 |
| Hardware | RTX 5090, ~13 minutes |

## Limitations

This model was trained on a single Korean apple orchard (spring 2024) with a single robot platform, plus a small fine-tune on a second autumn capture. Expect degradation on:

- ⚠️ Different orchards (different tree species, layouts, training systems)
- ⚠️ Different cameras (different FOV, color profiles, sensors)
- 💀 Seasons not in training (winter dormant trees)
- 💀 Different lighting (rain, dawn/dusk, night)
- 💀 Aerial / drone perspectives

For deployment in a new context, plan to fine-tune on 100-300 in-domain images.

## Files in This Repo

| File | Purpose |
|---|---|
| `Segformer85Mv1.pt` | Original v1 weights (339 MB) |
| `Segformer85Mv2.pt` | v1 + Orchard Navigation fine-tune (339 MB) ⭐ |
| `predict.py` | Standalone inference script (defaults to v2) |
| `README.md` | This file |
| `samples/*.jpg` | v1 prediction examples (in-domain) |
| `samples_v6_vs_v7/*.jpg` | v1 vs v2 side-by-side on the new orchard (showcases the v2 improvement) |
| `train_v6_5090.py` | v1 training script |
| `finetune_v7.py` | v2 fine-tune script |
| `history_v6.json` | v1 per-epoch training history |
| `history_v7.json` | v2 fine-tune history |
| `v6_OOD_full_res.mp4` | 1-minute OOD inference video at native resolution |

## License

Apache 2.0
