🦺 Safety Rope v6 Training Report

version: v20260503_p10_dinov3_small_re_v6_camaug · trained 2026-05-03 · backbone vit_small_patch16_dinov3 @ 1280×720
arch: ViT-S patch16 + RoIAlign + MLP 2-cls, bbox expansion 1.0 / 0.2 / 1.5 (X / Y_top / Y_bot)

TL;DR

Metric | v6 | vs v4
-------|------|------
test AP | 0.8884 | +2.3pp
test F1 | 0.8267 | +1.6pp
Precision | 0.839 | +6.0pp
Recall | 0.815 | -2.8pp
FP (false alarms) | 283 | -149 (-34%)
Core conclusion: versus v4 (same backbone, same data), adding only two augmentations during training (rotation ±5° + Gaussian blur) cuts FP from 432 to 283 (−34%) while test_AP rises +2.3pp. This indicates that FPs caused by small differences in camera placement can be mitigated by viewpoint augmentation. Recall dips slightly (−2.8pp, i.e. more conservative), but the on-site goal of reducing false alarms is met.

Four-version comparison (test set)

Version | Data | test_AP | F1 | P | R | TP | FP | FN | TN | best ep | train s
--------|------|---------|----|---|---|----|----|----|----|---------|--------
v1 | 178-task baseline | 0.9167 | 0.8449 | 0.854 | 0.836 | 1414 | 241 | 278 | 2627 | 4 | 2401
v2 | +14 hard-neg (192) | 0.8755 | 0.8236 | 0.798 | 0.851 | 1473 | 374 | 257 | 2548 | 10 | 2736
v4 | 226-task baseline | 0.8651 | 0.8102 | 0.780 | 0.843 | 1528 | 432 | 284 | 2531 | 8 | 2612
v6 | 226 + rotation+blur ⭐ | 0.8884 | 0.8267 | 0.839 | 0.815 | 1476 | 283 | 336 | 2680 | 12 | 3300

v1 still holds the highest AP (its test set of 4560 rows is relatively small, and its 178 tasks are high purity), but its FP of 241 was measured on that smaller test set; v6's FP of 283, measured on the larger 226-task / 4775-row test set, already comes close.

v4 vs v6 Confusion Matrix (same 226-task data)

 | v4 (no camaug) | v6 (+rot+blur) | Δ
---|----------------|----------------|---
TP | 1528 | 1476 | −52
FP | 432 | 283 | −149
FN | 284 | 336 | +52
TN | 2531 | 2680 | +149

Augmentation config (v6 vs v4)

Augmentation | Status | Parameters
-------------|--------|-----------
Photometric jitter | carried over from v4 | brightness ±0.4 / contrast ±0.3 / saturation ±0.4
ROI bbox jitter | carried over from v4 | center ±20% / size 0.7-1.4× / expand ratio 0.5-2.0×
Random erasing | carried over from v4 | upper 60% of person bbox, prob 0.4 / area 5-20%
Horizontal flip | carried over from v4 | prob 0.5 (hat / rope / PPE are left-right symmetric)
Rotation ±5° | new in v6 | PIL rotate, gray fill, prob 0.5 (camera tilt)
Gaussian blur | new in v6 | ImageFilter.GaussianBlur(σ=0.5-1.5), prob 0.2 (camera focus drift)
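
For reference, the two v6 additions are easy to reproduce; below is a minimal sketch applying them to a PIL image. The probabilities and ranges come from the table above; the mid-gray fill value and plain `random` usage are assumptions, not the exact training code.

import random
from PIL import Image, ImageFilter

def v6_cam_aug(img: Image.Image) -> Image.Image:
    # Rotation ±5°, prob 0.5: simulates slight camera tilt
    if random.random() < 0.5:
        img = img.rotate(random.uniform(-5.0, 5.0),
                         resample=Image.BILINEAR,
                         fillcolor=(128, 128, 128))  # gray fill for exposed corners (assumed value)
    # Gaussian blur σ=0.5-1.5, prob 0.2: simulates focus drift
    if random.random() < 0.2:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.5)))
    return img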

Training history

ep | train_loss | val_AP | val_F1
---|-----------|--------|-------
1 | 0.3885 | 0.8681 | 0.7884
2 | 0.2949 | 0.8382 | 0.7603
3 | 0.1857 | 0.8613 | 0.7740
4 | 0.1516 | 0.8805 | 0.8271
5 | 0.1254 | 0.8600 | 0.7852
6 | 0.1086 | 0.8806 | 0.8432
7 | 0.0986 | 0.8573 | 0.8110
8 | 0.0805 | 0.8627 | 0.8299
9 | 0.0771 | 0.7951 | 0.7686
10 | 0.0641 | 0.8848 | 0.8356
11 | 0.0520 | 0.8711 | 0.8178
12 ⭐ | 0.0548 | 0.8939 | 0.8427
13 | 0.0432 | 0.8578 | 0.8126
14 | 0.0372 | 0.8489 | 0.8221
15 | 0.0286 | 0.8330 | 0.7970
16 | 0.0222 | 0.8553 | 0.8298
17 | 0.0273 | 0.8504 | 0.8278
18 | 0.0206 | 0.8687 | 0.8285
19 | 0.0144 | 0.8550 | 0.8310
20 | 0.0131 | 0.8537 | 0.8126

best_epoch=12, val_AP=0.8939 (patience=8 triggered early stop at ep20)
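
For clarity, here is the stopping rule replayed against the val_AP column above, as a self-contained sketch (function and variable names are illustrative):

def early_stop_demo(val_aps, patience=8):
    """Stop once val_AP hasn't improved for `patience` consecutive epochs."""
    best_ap, best_ep = float("-inf"), 0
    for ep, ap in enumerate(val_aps, start=1):
        if ap > best_ap:
            best_ap, best_ep = ap, ep          # training would save best.pt here
        elif ep - best_ep >= patience:
            return best_ep, best_ap, ep        # (best epoch, best AP, stop epoch)
    return best_ep, best_ap, len(val_aps)

# With the v6 history, the best is ep12 (0.8939) and patience=8 fires at ep20:
history = [0.8681, 0.8382, 0.8613, 0.8805, 0.8600, 0.8806, 0.8573, 0.8627,
           0.7951, 0.8848, 0.8711, 0.8939, 0.8578, 0.8489, 0.8330, 0.8553,
           0.8504, 0.8687, 0.8550, 0.8537]
print(early_stop_demo(history))   # → (12, 0.8939, 20)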

Inference speed (Mac MPS, full pipeline)

Backbone | Params | YOLO ms | ViT+ROI ms | Total ms | FPS
---------|--------|---------|------------|----------|----
v6 ViT-S DINOv3 | 22.5M | 47 | 82 | 129 | 7.7
v5 ViT-Tiny (reference) | 6.6M | 49 | 7.6 | 56 | 17.8

v6 shares its backbone with v4, so speed is unchanged (<2 ms difference, within noise). ViT-Tiny is 2.3× faster but produces 707 FP (2.5× v6's), so it is not suitable as the main version.

Next-step candidates

  1. YOLO fixes: the e2e audit shows that 78% of on-site FPs trace back to YOLO false boxes, and 69% of misses to YOLO missed boxes. Improving YOLO (raise conf threshold 0.35→0.5 / imgsz 640→1280 / retrain with on-site hard negatives) pays off far more than further ViT-side work.
  2. v7 = v6 + Gaussian noise (σ=2-5, prob 0.2): sensor-noise augmentation, low cost (5 min of code + 1 h of training).
  3. Class-weighted CE: weight the "wrong" class 1.3×, to cut another 30-80 FP (see the sketch after this list).
  4. Two-stage rescore: re-judge uncertain cases (prob 0.3-0.7) with a multi-scale ROI ensemble.
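
A minimal sketch of candidate 3, assuming the ckpt's label order ["wrong", "correct"] (so "wrong" is class index 0); the 1.3× value comes from the list above, everything else is illustrative:

import torch, torch.nn as nn

# 1.3× weight on ground-truth "wrong" samples (index 0 per the ckpt's LABELS order)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.3, 1.0]))

logits  = torch.randn(8, 2)            # [N, 2] logits, e.g. from SafetyRopeModel
targets = torch.randint(0, 2, (8,))    # 0 = wrong, 1 = correct
loss = criterion(logits, targets)      # misclassifying a "wrong" sample now costs 1.3×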

📦 Model Downloads

File | Size | Purpose | Download
-----|------|---------|---------
safety_rope_v20260503_v6_camaug/best.pt | 86 MB | fp32 full ckpt (training/eval) | R2 link
safety_rope_v20260503_v6_camaug/best_fp16.pt | 43 MB | fp16 inference ckpt (deployment; half the download) | R2 link
safety_rope_v20260503_v6_camaug/summary.json | 5 KB | training metadata (hyperparams, test metrics, history) | R2 link
person_yolo11n_v20260501/best.pt | 5.5 MB | YOLO person detector (required first stage of the pipeline) | R2 link

🧪 Quick-start for inference engineers

Full pipeline: RTSP/video → YOLO detects persons → expand each bbox by 1.0/0.2/1.5 → ViT RoIAlign → 2-cls prob (correct vs wrong). Both Apple MPS and CUDA are supported.

Requirements

pip install torch torchvision timm ultralytics opencv-python pillow numpy

Download ckpts

# fp16 for inference deployment (recommended, 43 MB)
curl -L -o best_fp16.pt \
  https://pub-478929a98a5c440cb22c2241c0bde314.r2.dev/safety_rope_v20260503_v6_camaug/best_fp16.pt

# YOLO person detector
curl -L -o person_yolo11n_v20260501.pt \
  https://pub-478929a98a5c440cb22c2241c0bde314.r2.dev/person_yolo11n_v20260501/best.pt

1) Model definition (copy-paste ready)

import torch, torch.nn as nn, timm
import torchvision.ops as tvops

class SafetyRopeModel(nn.Module):
    """DINOv3 ViT-S/16 backbone + RoIAlign + MLP 2-cls。
    Forward 一次整張 1280×720 圖,N 個 person bbox 共用 backbone feature。"""
    def __init__(self, backbone_name="vit_small_patch16_dinov3",
                 img_w=1280, img_h=720, patch=16, n_special=5):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=False,
                                          num_classes=0, global_pool="")
        self.feat_ch = self.backbone.embed_dim   # 384
        self.grid_h, self.grid_w = img_h // patch, img_w // patch  # 45, 80
        self.n_special = n_special  # DINOv3 = 1 CLS + 4 REG
        self.roi_align = tvops.RoIAlign(output_size=(7, 7),
                                        spatial_scale=1.0/patch, sampling_ratio=2)
        self.head = nn.Sequential(
            nn.Conv2d(self.feat_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1,1)), nn.Flatten(),
            nn.Dropout(0.3), nn.Linear(256, 2),
        )

    def forward(self, image, rois):
        # image: [1, 3, 720, 1280] normalized; rois: [N, 5] (batch_idx, x1,y1,x2,y2 in resized coords)
        feats = self.backbone.forward_features(image)
        # Drop CLS+REG tokens: DINOv3 outputs [B, 5+H*W, D]; keep only patch tokens
        if feats.shape[1] == self.grid_h*self.grid_w + self.n_special:
            feats = feats[:, self.n_special:]
        elif feats.shape[1] == self.grid_h*self.grid_w + 1:
            feats = feats[:, 1:]   # CLS only fallback
        B, N, D = feats.shape
        feats = feats.transpose(1,2).reshape(B, D, self.grid_h, self.grid_w)
        return self.head(self.roi_align(feats, rois))
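
A quick shape check with random weights verifies the token slicing and RoIAlign plumbing before downloading anything (the two ROIs below are arbitrary):

import torch

model = SafetyRopeModel().eval()                     # architecture only, random weights
dummy = torch.zeros(1, 3, 720, 1280)                 # [B, 3, H, W] normalized frame
rois  = torch.tensor([[0., 100.,  50., 400., 700.],  # (batch_idx, x1, y1, x2, y2)
                      [0., 600.,  80., 900., 710.]]) # in resized 1280×720 coordinates
with torch.no_grad():
    print(model(dummy, rois).shape)                  # torch.Size([2, 2]): one logit pair per ROI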

2) Load ckpt + run inference

import cv2, numpy as np, torch
from ultralytics import YOLO

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# 1. Load ckpt (fp16 works too; it is cast to fp32 below)
ck = torch.load("best_fp16.pt", map_location=DEVICE, weights_only=False)
IMG_W, IMG_H = ck["img_size"]                # 1280, 720
EXPAND_X, EXPAND_YT, EXPAND_YB = ck["expand_x"], ck["expand_y_top"], ck["expand_y_bot"]  # 1.0, 0.2, 1.5
THR = float(ck["thr"])                       # 0.432 (v6 default, tunable)
LABELS = ck.get("labels", ["wrong", "correct"])

model = SafetyRopeModel(ck["backbone_name"], IMG_W, IMG_H).to(DEVICE).eval()
# fp16 → fp32 for forward stability
model.load_state_dict({k: v.float() if v.dtype == torch.float16 else v
                       for k, v in ck["model_state"].items()})

yolo = YOLO("person_yolo11n_v20260501.pt")  # person_yolo11n_v20260501/best.pt from R2, renamed

@torch.no_grad()
def infer(frame_bgr, conf=0.35):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    H, W = rgb.shape[:2]
    # 1) YOLO person detection
    res = yolo(rgb, verbose=False, imgsz=640, conf=conf)[0]
    if res.boxes is None or len(res.boxes) == 0:
        return []
    persons = res.boxes.xyxy.cpu().numpy()
    # 2) Resize + normalize the full frame for the ViT
    img_resized = cv2.resize(rgb, (IMG_W, IMG_H))
    arr = (img_resized.astype(np.float32)/255.0 - MEAN) / STD
    x = torch.from_numpy(arr.transpose(2,0,1)).unsqueeze(0).float().to(DEVICE)
    # 3) Expand each bbox and scale into 1280×720 coordinates
    sx, sy = IMG_W/W, IMG_H/H
    rois = []
    for x1, y1, x2, y2 in persons:
        bw, bh = x2-x1, y2-y1
        ex1 = max(0, x1 - bw*EXPAND_X);  ey1 = max(0, y1 - bh*EXPAND_YT)
        ex2 = min(W, x2 + bw*EXPAND_X);  ey2 = min(H, y2 + bh*EXPAND_YB)
        rois.append([0.0, ex1*sx, ey1*sy, ex2*sx, ey2*sy])
    rois_t = torch.tensor(rois, dtype=torch.float32).to(DEVICE)
    # 4) One ViT forward → N probs
    logits = model(x, rois_t)
    probs = torch.softmax(logits, dim=-1)[:, 1].float().cpu().numpy()  # class index 1 = correct
    # 5) Assemble results
    out = []
    for (x1,y1,x2,y2), p in zip(persons, probs):
        out.append({
            "bbox": [float(x1), float(y1), float(x2), float(y2)],
            "prob": float(p),
            "pred": LABELS[1] if p >= THR else LABELS[0],   # "correct" or "wrong"
        })
    return out

# === Usage ===
frame = cv2.imread("sample.jpg")
results = infer(frame)
for r in results:
    print(f"  bbox={r['bbox']}  prob={r['prob']:.3f}  → {r['pred']}")

3) On-site deployment tips
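
The pipeline overview above starts from an RTSP/video feed; a minimal per-stream loop on top of infer() might look like the sketch below. The URL, frame skip, and alert handling are placeholders; skipping frames keeps the ~7.7 FPS model budget realistic against a live feed.

import cv2

cap = cv2.VideoCapture("rtsp://CAMERA_URL/stream")  # placeholder URL
FRAME_SKIP = 5                                      # score every 5th frame (placeholder)
i = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    i += 1
    if i % FRAME_SKIP:
        continue
    for r in infer(frame):                          # frame is BGR, as infer expects
        if r["pred"] == "wrong":
            print(f"ALERT frame={i} prob={r['prob']:.3f} bbox={r['bbox']}")
cap.release()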


Raw artifacts: 5090-2:~/runs_new/safety_rope_v20260503_p10_dinov3_small_re_v6_camaug/
Reference audit (v4 e2e): safety_rope_v4_audit_e2e.html (shows the on-site reality: 338 YOLO false boxes / 3314 missed boxes, etc.)
R2 bucket: rai-models (public read) · account rai.mobile.studio@gmail.com