RoI Align cvat2 project 8 + 10 manifest_v3 (10120 train / 2681 val / 4560 test) 2026-05-02 → 2026-05-03
| Rank | Model | Backbone | Resolution | Aug | test_AP | F1 | P | R | FP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DINOv3-S + RE on upper bbox ⭐⭐⭐ new champion | vit_small_patch16_dinov3 | 1280×720 | photo+RE | 0.9167 | 0.8449 | 0.854 | 0.836 | 241 |
| 2 | DINOv3-S (no RE) | vit_small_patch16_dinov3 | 1280×720 | photo | 0.9090 | 0.8568 | 0.834 | 0.881 | 296 |
| 3 | SigLIP-B @512 + photometric | vit_base_patch16_siglip_512 | 512×512 | photo | 0.9026 | 0.8329 | 0.818 | 0.849 | 320 |
| 4 | DINOv3-B | vit_base_patch16_dinov3 | 1280×720 | photo | 0.8996 | 0.8338 | 0.833 | 0.835 | 283 |
| 5 | mobilenetv3 HD widex (X=1.5) | mobilenetv3_large_100 | 1280×720 | — | 0.8984 | 0.8324 | 0.790 | 0.879 | 395 |
| 6 | mobilenetv3 HD (v2 main version) | mobilenetv3_large_100 | 1280×720 | — | 0.8898 | 0.8426 | 0.831 | 0.854 | 293 |
| 7 | mobilenetv3 HD noexp (X=0) | mobilenetv3_large_100 | 1280×720 | — | 0.8776 | 0.8305 | 0.790 | 0.876 | 395 |
| 8 | mobilenetv3 HD + photometric | mobilenetv3_large_100 | 1280×720 | photo | 0.8651 | 0.8241 | 0.805 | 0.845 | 347 |
| 9 | SigLIP-B @512 (no aug) | vit_base_patch16_siglip_512 | 512×512 | — | 0.8534 | 0.7737 | 0.723 | 0.832 | 538 |
| 10 | mobilenetv3 640 | mobilenetv3_large_100 | 640×640 | — | 0.8450 | 0.8162 | 0.750 | 0.895 | 504 |
| 11 | SigLIP-L @384 | vit_large_patch16_siglip_384 | 384×384 | — | 0.8319 | 0.7338 | 0.711 | 0.758 | 520 |
| 12 | CLIP-B @384 (no aug) | vit_base_patch16_clip_384 | 384×384 | — | 0.8144 | 0.7329 | 0.663 | 0.819 | 704 |
| 13 | SigLIP-B HD1280 (interpolated) | vit_base_patch16_siglip_512 | 1280×720 | — | 0.8006 | 0.7301 | 0.634 | 0.861 | 842 |
| 14 | CLIP-B @384 + photometric | vit_base_patch16_clip_384 | 384×384 | photo | 0.7868 | 0.7236 | 0.686 | 0.766 | 594 |
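
The F1 column can be cross-checked against the P and R columns, since F1 is the harmonic mean of precision and recall. The following is a minimal sketch of that check; `f1_score` and `rows` are illustrative names, not part of the training code, and because P and R are reported to three decimals the recomputed values are only expected to agree to about 1e-3.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from ranks 1, 2 and 13 of the table above.
rows = {
    "DINOv3-S + RE": (0.854, 0.836, 0.8449),
    "DINOv3-S (no RE)": (0.834, 0.881, 0.8568),
    "SigLIP-B HD1280": (0.634, 0.861, 0.7301),
}

for name, (p, r, reported) in rows.items():
    recomputed = f1_score(p, r)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.4f}")
```
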
Zero-shot baselines:

| Method | AP | F1 |
|---|---|---|
| SigLIP-2 base zero-shot crop | 0.7735 | 0.7153 |
| Qwen2.5-VL-3B zero-shot crop | 0.7370 | 0.7310 |
| SigLIP-2 so400m zero-shot crop | 0.7100 | 0.7144 |
| CLIP zero-shot crop | 0.6730 | 0.6610 |
| SigLIP-2 base zero-shot mark | 0.5910 | 0.6330 |
| CLIP zero-shot mark | 0.5320 | 0.5860 |
RE ablation on DINOv3-S:

| Variant | AP | F1 | P | R | FP | delta vs base |
|---|---|---|---|---|---|---|
| DINOv3-S baseline (v10) | 0.9090 | 0.857 | 0.834 | 0.881 | 296 | — |
| DINOv3-S + RE upper (v11) | 0.9167 | 0.845 | 0.854 | 0.836 | 241 | +0.77pp AP / +2.0pp P / -4.5pp R / -55 FP (-18.6%) |
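
The deltas in the last column follow directly from the two rows. As a small worked check (a sketch; the dictionary and function names are illustrative only), percentage-point changes are plain differences times 100, and the FP change is also expressed relative to the baseline count:

```python
# Metrics copied from the two rows above.
base = {"AP": 0.9090, "P": 0.834, "R": 0.881, "FP": 296}
re_upper = {"AP": 0.9167, "P": 0.854, "R": 0.836, "FP": 241}


def pp_delta(new: float, old: float) -> float:
    """Difference between two ratios, in percentage points."""
    return (new - old) * 100


ap_pp = pp_delta(re_upper["AP"], base["AP"])
p_pp = pp_delta(re_upper["P"], base["P"])
r_pp = pp_delta(re_upper["R"], base["R"])
fp_abs = re_upper["FP"] - base["FP"]
fp_rel = fp_abs / base["FP"] * 100  # relative change in percent

print(f"AP {ap_pp:+.2f}pp, P {p_pp:+.1f}pp, R {r_pp:+.1f}pp, "
      f"FP {fp_abs:+d} ({fp_rel:+.1f}%)")
```

The trade-off is visible in the numbers: RE buys precision and fewer false positives at the cost of recall, which is why AP rises while F1 drops slightly.
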
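RE design: random erasing (prob 0.4, area 5-20%) is applied only within the upper 60% of the person bbox (head, chest, hands). It leaves intact the waist buckle / hook-ring region below, the rope segment trailing on the ground inside the expanded range, and the hook point above. This forces the model to rely on visual evidence of the rope hook ring, anchor, and lifeline rather than the shortcut "wearing a PPE uniform = correct".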
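
The region-restricted erasing described above could be sketched as follows. This is not the training code: `sample_upper_erase_rect` and `erase` are hypothetical helpers, the 5-20% area is assumed to be measured relative to the upper-60% region (the notes do not say whether it is relative to the region or the whole bbox), and the log-uniform aspect range of 0.3 to 3.3 is an assumption borrowed from common random-erasing defaults.

```python
import math
import random


def sample_upper_erase_rect(bbox, rng, prob=0.4, area_range=(0.05, 0.20),
                            upper_frac=0.6, aspect_range=(0.3, 3.3)):
    """Return an (x1, y1, x2, y2) patch inside the upper part of bbox, or None.

    Image coordinates are assumed, so "upper" means smaller y.
    """
    x1, y1, x2, y2 = bbox
    if rng.random() >= prob:
        return None  # no erasing for this sample
    region_w = x2 - x1
    region_h = (y2 - y1) * upper_frac
    if region_w <= 0 or region_h <= 0:
        return None
    # Assumption: area fraction is relative to the upper region.
    target_area = rng.uniform(*area_range) * region_w * region_h
    # Sample width/height ratio log-uniformly (assumed range).
    aspect = math.exp(rng.uniform(math.log(aspect_range[0]),
                                  math.log(aspect_range[1])))
    w = min(region_w, math.sqrt(target_area * aspect))
    h = min(region_h, math.sqrt(target_area / aspect))
    ex = rng.uniform(x1, x2 - w)
    ey = rng.uniform(y1, y1 + region_h - h)
    return (ex, ey, ex + w, ey + h)


def erase(image, rect, fill=0):
    """Fill rect in place in a row-major image given as a list of rows."""
    ex1, ey1, ex2, ey2 = (int(round(v)) for v in rect)
    for row in image[ey1:ey2]:
        row[ex1:ex2] = [fill] * len(row[ex1:ex2])


rng = random.Random(0)
person_bbox = (100.0, 50.0, 200.0, 250.0)  # illustrative values
rect = sample_upper_erase_rect(person_bbox, rng)
print(rect)  # None in roughly 60% of calls
```

Because the patch is sampled strictly inside the upper region, the lower 40% of the bbox and everything in the expanded margin are never erased, which matches the intent of keeping the rope and hook evidence visible.
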
# R2 public URL
https://pub-478929a98a5c440cb22c2241c0bde314.r2.dev/safety_rope_v20260503_p10_dinov3_small_re/best.pt
# ckpt schema: same RoIAlign + 2-class MLP as the v2 mobilenetv3 model; only the backbone is swapped for DINOv3 ViT-S/16
# Inference: person YOLO bbox → expand 1.0/0.2/1.5 → image @1280×720 through backbone → RoIAlign(spatial_scale=1/16) → MLP
# 86 MB ckpt (ViT-Small 22M params + RoIAlign head + 1.0/0.2/1.5 expand settings)
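
The box geometry of the inference path above can be illustrated in a few lines. This is a sketch under stated assumptions, not the deployed code: `expand_bbox` and `to_feature_coords` are hypothetical helpers, and reading 1.0/0.2/1.5 as side/top/bottom expansion fractions of the box width and height is a guess, since the notes do not say which side each factor applies to. Only the 1280×720 input size and `spatial_scale=1/16` come from the notes.

```python
IMG_W, IMG_H = 1280, 720  # backbone input size from the notes
SPATIAL_SCALE = 1 / 16    # ViT-S/16: one feature cell per 16x16 pixels


def expand_bbox(box, side, top, bottom, img_w=IMG_W, img_h=IMG_H):
    """Grow a pixel-space box by fractions of its size, clipped to the image."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (
        max(0.0, x1 - side * w),
        max(0.0, y1 - top * h),
        min(float(img_w), x2 + side * w),
        min(float(img_h), y2 + bottom * h),
    )


def to_feature_coords(box, spatial_scale=SPATIAL_SCALE):
    """Map pixel coordinates onto the backbone feature grid."""
    return tuple(v * spatial_scale for v in box)


# Illustrative person detection; the factor-to-side mapping is an assumption.
person = (600.0, 300.0, 700.0, 500.0)
roi_px = expand_bbox(person, side=1.0, top=0.2, bottom=1.5)
roi_feat = to_feature_coords(roi_px)
grid = (int(IMG_W * SPATIAL_SCALE), int(IMG_H * SPATIAL_SCALE))

print(roi_px, roi_feat, grid)
```

In practice an RoIAlign operator such as `torchvision.ops.roi_align` takes the pixel-space boxes together with a `spatial_scale` argument and performs this scaling internally, so the explicit conversion here is only meant to show where the box lands on the 80×45 feature grid.
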
The safety-rope dropdown on ppe-demo.intemotech.com will integrate: