🚪 Hatch — 8hr Autonomous Research Report (v20260505)

Image classifier, 2 binary heads | SWA + TTA | 22 ablations | Trained: 2026-05-05 | 5090-2 dual-GPU autonomous agent | Source: cvat2 project 7 (ported from raicvat #12) — 4536 train / 1170 val / 1765 test

🎯 Results (vs baseline v20260427)

test mAP: 0.9962 (+0.93pp ⬆)
has_close P: 0.991 (0.908 → 0.991)
has_close FP: 12 (~125 → 12, -90%)
close FP (aggressive thr=0.80): 4 (-97% vs baseline)
Inference cost: 2× (SWA + TTA), ~50ms/frame on 5090

📦 Model Downloads (Cloudflare R2)

★ SWA-4 + TTA (production champion)
close P=0.991, FP=12, mAP 0.9962
2× cost (TTA hflip), ~50ms/frame
⬇ best_tta.pt
★ SWA-4 (no TTA)
close P=0.988, FP=15, mAP 0.9962
1× cost, ~25ms/frame (use when latency is tight)
⬇ best.pt
📋 Loading example (Python)
import torch, timm, torch.nn as nn

class GenericClassifier(nn.Module):
    def __init__(self, backbone, n_attr, feat_dim):
        super().__init__()
        self.backbone = timm.create_model(backbone, pretrained=False, num_classes=0, global_pool="avg")
        self.dropout = nn.Dropout(0.3)
        self.cls = nn.Linear(feat_dim, n_attr)
    def forward(self, x): return self.cls(self.dropout(self.backbone(x)))

ckpt = torch.load("hatch_swa_v20260505.pt", weights_only=False)
# backbone_name = "convnext_tiny.fb_in1k", attrs = ["has_open", "has_close"]
# feat_dim = 768, img_size = 384
# thresholds = {"has_open": 0.50, "has_close": 0.76} (best.pt) or 0.52/0.76 (best_tta.pt)
model = GenericClassifier(ckpt["backbone_name"], len(ckpt["attrs"]), ckpt["feat_dim"])
model.load_state_dict(ckpt["model_state"])
model.eval()

# Inference: resize to 384×384 → ImageNet normalize → sigmoid over the 2 outputs
# TTA (best_tta.pt only): average with one hflip pass
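The inference recipe in the comments above (384×384 resize → ImageNet normalize → sigmoid, with an optional hflip-averaging pass for best_tta.pt) can be sketched as follows. This is a minimal illustration, not the report's actual pipeline; `preprocess` and `predict` are hypothetical helper names, and `model` is the `GenericClassifier` loaded above.

```python
import torch
import torch.nn.functional as F

# ImageNet normalization constants, shaped for (C, H, W) broadcasting.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(img_uint8):
    """(H, W, 3) uint8 tensor → (1, 3, 384, 384) normalized float batch."""
    x = img_uint8.permute(2, 0, 1).float() / 255.0
    x = F.interpolate(x.unsqueeze(0), size=(384, 384),
                      mode="bilinear", align_corners=False)
    return (x - IMAGENET_MEAN) / IMAGENET_STD

@torch.no_grad()
def predict(model, x, tta=False):
    """Sigmoid probabilities for [has_open, has_close]; optional hflip TTA."""
    probs = torch.sigmoid(model(x))
    if tta:  # best_tta.pt: average with one horizontally flipped pass
        probs = (probs + torch.sigmoid(model(torch.flip(x, dims=[3])))) / 2
    return probs  # shape (1, 2)
```

Apply the per-head thresholds from the checkpoint (e.g. `probs[0, 1] >= 0.76` for has_close) to turn probabilities into labels.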

🧠 Key insights (4)

1. cam aug is the single most impactful variable
Moving from ColorJitter only → cam (rotation ±5° + GaussianBlur σ 0.5–1.5) lifted close P from 0.908 → 0.938 (+3pp) on mobilenetv3, and every backbone benefited, without exception. Reproduces the safety_rope v6 finding.
2. RandomErasing hurts door imagery (!)
cam_erase vs cam: mAP 0.992 → 0.988, close FP 84 → 98. Likely cause: the small features critical to the close decision (door gaps, handles) get erased. Disable RandomErasing by default for door imagery.
3. SWA = free lunch
After training 4 independent seeds, averaging their state_dicts yields mAP 0.9962, beating the single best model's 0.9960, while inference cost stays at 1× (vs 8× for a 4-way ensemble). Evidence of a wide, flat minimum in the loss landscape.
4. convnext_tiny + strong + wsamp is the sweet spot
convnext_tiny beats mobilenetv3_l by +0.5pp mAP (the baseline is saturated). strong aug (rotation 8° + Perspective + Erasing) actually outperforms cam here (the subtler perturbations make close samples more robust). wsamp (oversampling close=0) addresses the 87:13 class imbalance. But this combination collapses on mobilenetv3 (FP 200), so it must be paired with convnext.
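The SWA averaging in insight 3 can be sketched as a plain state_dict average across the 4 seed checkpoints. A minimal sketch, assuming all checkpoints share the same architecture; `average_state_dicts` is a hypothetical helper, not the report's code (PyTorch also ships `torch.optim.swa_utils` for this):

```python
import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of several state_dicts with identical keys/shapes."""
    avg = {}
    for key in state_dicts[0]:
        vals = [sd[key] for sd in state_dicts]
        if vals[0].dtype.is_floating_point:
            avg[key] = torch.stack([v.float() for v in vals]).mean(dim=0)
        else:
            # Integer buffers (e.g. num_batches_tracked): keep the first copy.
            avg[key] = vals[0]
    return avg
```

Note that if the backbone contains BatchNorm layers, running statistics should be recomputed on training data after averaging (see `torch.optim.swa_utils.update_bn`).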

📊 Full experiment log


Generated 2026-05-05 | rai-vision-training | kaggle-reports.pages.dev | 8hr autonomous research on 5090-2