🦺 Safety Rope v20260503 Ablation v3: Full 20-Way Comparison

RoI Align · cvat2 project 8 + 10 · manifest_v3 (10120 train / 2681 val / 4560 test) · 2026-05-02 → 2026-05-03

📊 Key takeaways (TL;DR)

Best Test AP: 0.9167
Best Precision (lowest FP): 0.854
Best FP (absolute count): 241
Improvement vs. v2 main version: +2.7pp AP / -18% FP
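Three key findings:
  1. DINOv3-S + RE upper-bbox is the new champion (a small 22M-parameter model): AP 0.9167 / P 0.854 / FP 241. Compared with the v2 mobilenetv3 HD main version, that is +2.7pp AP and -52 FP (-18%).
  2. The RE upper-bbox hypothesis was confirmed: adding RE to the DINOv3-S baseline gives +0.77pp AP / +2pp Precision / -55 FP (-18.6%), at a cost of 4.5pp Recall, matching the research agent's prediction.
  3. Smaller is better (on this task): DINOv3-S (22M) beats DINOv3-B (87M) by 0.9pp AP (0.9090 vs. 0.8996, both without RE). Large ViT models overfit easily on roughly 10K training rows.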
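The headline deltas can be re-derived from the table values with a few lines of arithmetic. The sketch below is only a sanity check; the helper names `ap_delta_pp` and `fp_change` are illustrative and not part of the project code.

```python
from typing import Tuple


def ap_delta_pp(new_ap: float, old_ap: float) -> float:
    """AP difference expressed in percentage points (pp)."""
    return (new_ap - old_ap) * 100.0


def fp_change(new_fp: int, old_fp: int) -> Tuple[int, float]:
    """Absolute change and relative change (in percent) of false positives."""
    diff = new_fp - old_fp
    return diff, 100.0 * diff / old_fp


# Values copied from the comparison table in this report.
champion = {"ap": 0.9167, "fp": 241}  # DINOv3-S + RE
v2_main = {"ap": 0.8898, "fp": 293}   # mobilenetv3 HD (v2 main version)
no_re = {"ap": 0.9090, "fp": 296}     # DINOv3-S without RE

for label, base in (("vs v2 main", v2_main), ("vs no-RE baseline", no_re)):
    d_ap = ap_delta_pp(champion["ap"], base["ap"])
    d_fp, d_fp_pct = fp_change(champion["fp"], base["fp"])
    print(f"{label}: {d_ap:+.2f}pp AP, {d_fp:+d} FP ({d_fp_pct:+.1f}%)")
```

Running it reproduces the rounded figures quoted in the findings: about +2.7pp AP and -52 FP against the v2 main version, and +0.77pp AP and -55 FP against the DINOv3-S baseline.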

📋 Full comparison table (sorted by AP)

| Rank | Model | Backbone | Resolution | Aug | test_AP | F1 | P | R | FP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DINOv3-S + RE on upper bbox ⭐⭐⭐ (new champion) | vit_small_patch16_dinov3 | 1280×720 | photo+RE | 0.9167 | 0.8449 | 0.854 | 0.836 | 241 |
| 2 | DINOv3-S (no RE) | vit_small_patch16_dinov3 | 1280×720 | photo | 0.9090 | 0.8568 | 0.834 | 0.881 | 296 |
| 3 | SigLIP-B @512 + photometric | vit_base_patch16_siglip_512 | 512×512 | | 0.9026 | 0.8329 | 0.818 | 0.849 | 320 |
| 4 | DINOv3-B | vit_base_patch16_dinov3 | 1280×720 | photo | 0.8996 | 0.8338 | 0.833 | 0.835 | 283 |
| 5 | mobilenetv3 HD widex (X=1.5) | mobilenetv3_large_100 | 1280×720 | | 0.8984 | 0.8324 | 0.790 | 0.879 | 395 |
| 6 | mobilenetv3 HD (v2 main version) | mobilenetv3_large_100 | 1280×720 | | 0.8898 | 0.8426 | 0.831 | 0.854 | 293 |
| 7 | mobilenetv3 HD noexp (X=0) | mobilenetv3_large_100 | 1280×720 | | 0.8776 | 0.8305 | 0.790 | 0.876 | 395 |
| 8 | mobilenetv3 HD + photometric | mobilenetv3_large_100 | 1280×720 | photo | 0.8651 | 0.8241 | 0.805 | 0.845 | 347 |
| 9 | SigLIP-B @512 (no aug) | vit_base_patch16_siglip_512 | 512×512 | | 0.8534 | 0.7737 | 0.723 | 0.832 | 538 |
| 10 | mobilenetv3 640 | mobilenetv3_large_100 | 640×640 | | 0.8450 | 0.8162 | 0.750 | 0.895 | 504 |
| 11 | SigLIP-L @384 | vit_large_patch16_siglip_384 | 384×384 | | 0.8319 | 0.7338 | 0.711 | 0.758 | 520 |
| 12 | CLIP-B @384 (no aug) | vit_base_patch16_clip_384 | 384×384 | | 0.8144 | 0.7329 | 0.663 | 0.819 | 704 |
| 13 | SigLIP-B HD1280 (interpolated) | vit_base_patch16_siglip_512 | 1280×720 | | 0.8006 | 0.7301 | 0.634 | 0.861 | 842 |
| 14 | CLIP-B @384 + photometric | vit_base_patch16_clip_384 | 384×384 | | 0.7868 | 0.7236 | 0.686 | 0.766 | 594 |
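As a consistency check on the table, F1 can be recomputed from the reported P and R columns using the standard harmonic-mean definition. This is a minimal sketch: `f1_score` is an illustrative helper, and small differences in the last digit are expected because P and R are reported to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (model, P, R, reported F1) copied from the table above.
rows = [
    ("DINOv3-S + RE", 0.854, 0.836, 0.8449),
    ("DINOv3-S (no RE)", 0.834, 0.881, 0.8568),
    ("mobilenetv3 HD (v2 main)", 0.831, 0.854, 0.8426),
]

for name, p, r, reported in rows:
    recomputed = f1_score(p, r)
    print(f"{name}: reported F1 {reported:.4f}, recomputed {recomputed:.4f}")
```

The recomputed values agree with the reported column to within rounding. The check also makes the trade-off visible: the champion has a lower F1 than the no-RE run because its recall is lower, even though its AP and precision are higher.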

Zero-shot baselines (measured by the research agent, no training)

| Method | AP | F1 |
|---|---|---|
| SigLIP-2 base zero-shot crop | 0.7735 | 0.7153 |
| Qwen2.5-VL-3B zero-shot crop | 0.7370 | 0.7310 |
| SigLIP-2 so400m zero-shot crop | 0.7100 | 0.7144 |
| CLIP zero-shot crop | 0.6730 | 0.6610 |
| SigLIP-2 base zero-shot mark | 0.5910 | 0.6330 |
| CLIP zero-shot mark | 0.5320 | 0.5860 |

🎯 RE upper-bbox ablation details

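| Variant | AP | F1 | P | R | FP | Delta vs. base |
|---|---|---|---|---|---|---|
| DINOv3-S baseline (v10) | 0.9090 | 0.857 | 0.834 | 0.881 | 296 | |
| DINOv3-S + RE upper (v11) | 0.9167 | 0.845 | 0.854 | 0.836 | 241 | +0.77pp AP / +2.0pp P / -55 FP (-18.6%) / -4.5pp R |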

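RE design: random erasing (prob 0.4, area 5-20%) is applied only within the upper 60% of the person bbox (head, chest, hands). The waist buckle and carabiner region below, the rope segment trailing on the ground inside the expanded region, and the anchor point above are all left intact. This forces the model to rely on visual evidence of the rope carabiner, anchor, and lifeline rather than the shortcut "wearing a PPE uniform = correct".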
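The RE design described above can be sketched as a function that samples the rectangle to erase. This is a hedged illustration, not the training code: the function name `sample_upper_erase_box` is hypothetical, the area range is assumed to be relative to the person bbox area, and the aspect-ratio range (0.3 to 3.3) and retry count are assumptions borrowed from common random-erasing defaults, since the report does not state them.

```python
import math
import random
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def sample_upper_erase_box(
    person: Box,
    rng: random.Random,
    prob: float = 0.4,                              # from the report
    area_range: Tuple[float, float] = (0.05, 0.20),  # from the report; assumed relative to bbox area
    upper_frac: float = 0.6,                        # from the report: upper 60% of the bbox
    ratio_range: Tuple[float, float] = (0.3, 3.3),   # assumption: not stated in the report
    max_tries: int = 10,                            # assumption
) -> Optional[Box]:
    """Return a rectangle to erase inside the upper part of `person`, or None."""
    if rng.random() >= prob:
        return None  # augmentation skipped for this sample
    x1, y1, x2, y2 = person
    w, h = x2 - x1, y2 - y1
    upper_h = upper_frac * h
    log_lo, log_hi = math.log(ratio_range[0]), math.log(ratio_range[1])
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * w * h
        ratio = math.exp(rng.uniform(log_lo, log_hi))  # width / height
        ew, eh = math.sqrt(area * ratio), math.sqrt(area / ratio)
        if ew <= w and eh <= upper_h:
            # Place the rectangle fully inside the upper region of the bbox.
            ex = rng.uniform(x1, x2 - ew)
            ey = rng.uniform(y1, y1 + upper_h - eh)
            return (ex, ey, ex + ew, ey + eh)
    return None  # no rectangle fit within max_tries


rng = random.Random(0)
print(sample_upper_erase_box((100.0, 50.0, 200.0, 350.0), rng, prob=1.0))
```

The returned rectangle would then be filled in the image, for example with a slice assignment on the image array using the rounded coordinates. Everything below `y1 + 0.6 * h` and everything in the expanded margin stays untouched, which is the point of the design.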

📈 Training curves (top-4)

🔬 v2 → v3 evolution

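  1. v2 stage: established that mobilenetv3 HD + bbox expansion 1.0/0.2/1.5 is a good design (baseline 0.890 AP)
  2. v3 stage: after adding DINOv3 / SigLIP / CLIP for comparison, photometric augmentation raised SigLIP by 5pp (SigLIP-B @512: 0.8534 → 0.9026 AP)
  3. Mid v3: SigLIP HD1280 failed (pos_embed interpolation damaged the pretrained embeddings); SigLIP-L overfit (10K rows cannot support a large model)
  4. Late v3: DINOv3-S (22M) surpassed ViT-B and mobilenetv3, then took first place once RE upper-bbox was added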
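One way to see how far the SigLIP HD1280 run stretched the position embeddings is to compare patch grids. The sketch below assumes only what the backbone name `vit_base_patch16_siglip_512` implies (patch size 16, native 512×512 input); `patch_grid` and `token_count` are illustrative helpers.

```python
from typing import Tuple


def patch_grid(width: int, height: int, patch: int = 16) -> Tuple[int, int]:
    """Number of patches along (width, height) for a ViT with square patches."""
    return width // patch, height // patch


def token_count(width: int, height: int, patch: int = 16) -> int:
    """Total number of patch tokens for the given input size."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows


native = patch_grid(512, 512)   # (32, 32)
hd = patch_grid(1280, 720)      # (80, 45)
print(native, token_count(512, 512))   # (32, 32) 1024
print(hd, token_count(1280, 720))      # (80, 45) 3600
```

Going from a square 32×32 grid to a non-square 80×45 grid means resampling the learned position embeddings to about 3.5 times as many tokens with a different aspect ratio, which is consistent with the drop observed for the interpolated SigLIP-B HD1280 run.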

🚀 Recommended usage

Main version (lowest FP): DINOv3-S + RE

# R2 public URL
https://pub-478929a98a5c440cb22c2241c0bde314.r2.dev/safety_rope_v20260503_p10_dinov3_small_re/best.pt

# ckpt schema: same RoIAlign + MLP 2-class head as v2 mobilenetv3; only the backbone is swapped for DINOv3 ViT-S/16
# Inference: Person YOLO bbox → expand 1.0/0.2/1.5 → image @1280×720 through backbone → RoIAlign(spatial_scale=1/16) → MLP
# 86 MB ckpt (ViT-Small 22M params + RoIAlign head + 1.0/0.2/1.5 expand settings)
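
The box-geometry part of the inference pipeline above can be sketched in plain Python. Note the assumptions: the report does not say which sides the three expansion factors 1.0/0.2/1.5 apply to, so `expand_box` takes hypothetical `side`, `top`, and `bottom` fractions, and the usage line shows only one possible reading. `to_feature_coords` illustrates what `spatial_scale=1/16` means for a patch-16 backbone; an op such as `torchvision.ops.roi_align` applies this scaling internally when given image-space boxes.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def expand_box(
    box: Box,
    side: float,    # hypothetical: fraction of box width added on the left and right
    top: float,     # hypothetical: fraction of box height added above
    bottom: float,  # hypothetical: fraction of box height added below
    img_w: int = 1280,
    img_h: int = 720,
) -> Box:
    """Grow a person bbox by the given fractions and clip it to the image."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (
        max(0.0, x1 - side * w),
        max(0.0, y1 - top * h),
        min(float(img_w), x2 + side * w),
        min(float(img_h), y2 + bottom * h),
    )


def to_feature_coords(box: Box, spatial_scale: float = 1.0 / 16.0) -> Box:
    """Map image-space box coordinates onto the stride-16 feature grid."""
    x1, y1, x2, y2 = box
    return (x1 * spatial_scale, y1 * spatial_scale,
            x2 * spatial_scale, y2 * spatial_scale)


# One possible reading of "expand 1.0/0.2/1.5" (an assumption, see above).
person = (600.0, 200.0, 700.0, 500.0)
roi = expand_box(person, side=1.0, top=0.2, bottom=1.5)
print(roi)
print(to_feature_coords(roi))
```

At 1280×720 with patch size 16 the feature grid is 80×45, so a box clipped to the image bottom (y = 720) lands exactly on the last feature row boundary (45).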

📦 R2 public download

🔬 Online comparison (ppe-demo)

The safety rope dropdown on ppe-demo.intemotech.com will integrate: