🚪 Hatch — 8hr Autonomous Research Report (v20260505)

Image classifier, 2 binary heads | SWA + TTA | 22 ablations | Trained: 2026-05-05 | 5090-2 dual-GPU autonomous agent | Source: cvat2 project 7 (ported from raicvat #12) — 4536 train / 1170 val / 1765 test

🎯 Results (vs baseline v20260427)

test mAP: 0.9962 (+0.93pp ⬆)
has_close P: 0.991 (0.908 → 0.991)
has_close FP: 12 (~125 → 12, -90%)
close FP (aggressive thr=0.80): 4 (-97% vs baseline)
Inference cost: 2× (SWA + TTA), ~50ms/frame on 5090

📦 Model Downloads (Cloudflare R2)

★ SWA-4 + TTA (production champion)
close P=0.991, FP=12, mAP 0.9962
2× cost (TTA hflip), ~50ms/frame
⬇ best_tta.pt
★ SWA-4 (no TTA)
close P=0.988, FP=15, mAP 0.9962
1× cost, ~25ms/frame (use when latency is tight)
⬇ best.pt
📋 Loading example (Python)
import torch, timm, torch.nn as nn

class GenericClassifier(nn.Module):
    def __init__(self, backbone, n_attr, feat_dim):
        super().__init__()
        self.backbone = timm.create_model(backbone, pretrained=False, num_classes=0, global_pool="avg")
        self.dropout = nn.Dropout(0.3)
        self.cls = nn.Linear(feat_dim, n_attr)
    def forward(self, x): return self.cls(self.dropout(self.backbone(x)))

ckpt = torch.load("hatch_swa_v20260505.pt", weights_only=False)
# backbone_name = "convnext_tiny.fb_in1k", attrs = ["has_open", "has_close"]
# feat_dim = 768, img_size = 384
# thresholds = {"has_open": 0.50, "has_close": 0.76} (best.pt) or 0.52/0.76 (best_tta.pt)
model = GenericClassifier(ckpt["backbone_name"], len(ckpt["attrs"]), ckpt["feat_dim"])
model.load_state_dict(ckpt["model_state"])
model.eval()

# Inference: resize to 384×384 → ImageNet normalize → sigmoid over the 2 outputs
# TTA (best_tta.pt only): average with one hflip pass
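The inference recipe in the comments above (384×384 resize → ImageNet normalize → sigmoid, with an optional hflip-averaging pass for best_tta.pt) can be sketched as follows. This is a minimal illustration, not the report's actual pipeline; `preprocess` and `predict` are hypothetical helper names, and `model` is the `GenericClassifier` loaded above.

```python
import torch
import torch.nn.functional as F

# ImageNet normalization constants, shaped for (C, H, W) broadcasting.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess(img_uint8):
    """(H, W, 3) uint8 tensor → (1, 3, 384, 384) normalized float batch."""
    x = img_uint8.permute(2, 0, 1).float() / 255.0
    x = F.interpolate(x.unsqueeze(0), size=(384, 384),
                      mode="bilinear", align_corners=False)
    return (x - IMAGENET_MEAN) / IMAGENET_STD

@torch.no_grad()
def predict(model, x, tta=False):
    """Sigmoid probabilities for [has_open, has_close]; optional hflip TTA."""
    probs = torch.sigmoid(model(x))
    if tta:  # best_tta.pt: average with one horizontally flipped pass
        probs = (probs + torch.sigmoid(model(torch.flip(x, dims=[3])))) / 2
    return probs  # shape (1, 2)
```

Apply the per-head thresholds from the checkpoint (e.g. `probs[0, 1] >= 0.76` for has_close) to turn probabilities into labels.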

🧠 Key insights (4)

1. cam aug is the single most impactful variable
Moving from ColorJitter only → cam (rotation ±5° + GaussianBlur σ 0.5–1.5) lifted close P from 0.908 → 0.938 (+3pp) on mobilenetv3, and every backbone benefited, without exception. Reproduces the safety_rope v6 finding.
2. RandomErasing hurts door imagery (!)
cam_erase vs cam: mAP 0.992 → 0.988, close FP 84 → 98. Likely cause: the small features critical to the close decision (door gaps, handles) get erased. Disable RandomErasing by default for door imagery.
3. SWA = free lunch
After training 4 independent seeds, averaging their state_dicts yields mAP 0.9962, beating the single best model's 0.9960, while inference cost stays at 1× (vs 8× for a 4-way ensemble). Evidence of a wide, flat minimum in the loss landscape.
4. convnext_tiny + strong + wsamp is the sweet spot
convnext_tiny beats mobilenetv3_l by +0.5pp mAP (the baseline is saturated). strong aug (rotation 8° + Perspective + Erasing) actually outperforms cam here (the subtler perturbations make close samples more robust). wsamp (oversampling close=0) addresses the 87:13 class imbalance. But this combination collapses on mobilenetv3 (FP 200), so it must be paired with convnext.
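The SWA averaging in insight 3 can be sketched as a plain state_dict average across the 4 seed checkpoints. A minimal sketch, assuming all checkpoints share the same architecture; `average_state_dicts` is a hypothetical helper, not the report's code (PyTorch also ships `torch.optim.swa_utils` for this):

```python
import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of several state_dicts with identical keys/shapes."""
    avg = {}
    for key in state_dicts[0]:
        vals = [sd[key] for sd in state_dicts]
        if vals[0].dtype.is_floating_point:
            avg[key] = torch.stack([v.float() for v in vals]).mean(dim=0)
        else:
            # Integer buffers (e.g. num_batches_tracked): keep the first copy.
            avg[key] = vals[0]
    return avg
```

Note that if the backbone contains BatchNorm layers, running statistics should be recomputed on training data after averaging (see `torch.optim.swa_utils.update_bn`).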

📊 Full experiment log


Generated 2026-05-05 | rai-vision-training | kaggle-reports.pages.dev | 8hr autonomous research on 5090-2