👥 Pedestrian Age + Gender — v20260507 Mid-Training Report

Multi-task model: binary gender + 4-class age image classification (no bbox) on a PA-100K + MSP60K mix (160K crops). ⏳ 8-hour research run in progress. Training date: 2026-05-07 | 5090-2 dual-GPU agent | 16 ablations completed; SWA and cross-dataset eval still pending.

🎯 Champion metrics (E convnext_tiny)

| metric | value |
|---|---|
| gender accuracy | 0.857 |
| age 4-class accuracy | 0.924 |
| age macro F1 | 0.683 |
| adult recall | 0.965 |
| child recall | 0.727 |
| elder recall ⚠ | 0.345 |
⚠ The elder weak class is still the bottleneck: across all 16 variants elder recall sits between 0.10 and 0.51, and **focal loss + class weighting is the only thing that breaks through** (variant D reaches 0.51), but it hurts overall a_acc. Root cause: elder is only 1.4-3.1% of the training sets (PA-100K 0.014, MSP60K 0.031). We need more elder samples (self-labeled construction-site elder crops via cvat2) or synthetic data.
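
A minimal sketch of the focal + class-weight recipe behind variant D, using γ=2 from the ablation table; the α weights and tensor shapes below are hypothetical illustrations, not values from the train script:

import torch
import torch.nn.functional as F

def focal_ce(logits, targets, alpha=None, gamma=2.0):
    # Focal cross-entropy: (1 - p_t)^gamma shrinks the loss on easy,
    # high-confidence examples so rare classes (elder) matter more.
    logp = F.log_softmax(logits, dim=-1)
    pt = logp.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    ce = F.nll_loss(logp, targets, weight=alpha, reduction="none")
    return ((1.0 - pt) ** gamma * ce).mean()

# Hypothetical inverse-frequency-style weights for [child, young, adult, elder]
alpha = torch.tensor([4.0, 1.0, 1.0, 20.0])
age_logits = torch.randn(8, 4)                 # (batch, 4 age classes)
age_targets = torch.randint(0, 4, (8,))
loss = focal_ce(age_logits, age_targets, alpha=alpha, gamma=2.0)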
💡 ConvNeXt-Tiny clearly beats MobileNetV3-L: a_acc 0.92 vs 0.88 (+4pp), at 28M vs 4.2M params and ~3× inference cost. That trade suits prod deployment: 5090 GPU inference is not the bottleneck, and for a single cascade pass the higher accuracy is worth it.

📦 Model download (mid-training champion)

⭐ E convnext_tiny (a_acc 0.924)
ConvNeXt-Tiny 28M, 384×192, 12 epochs
g_acc 0.857 / a_acc 0.924 / a_f1 0.683
Trained on PA-100K 80K + MSP60K 30K (mix train)
106 MB
⬇ best.pt
📋 Load + inference example (Python)
import torch, torch.nn as nn, timm
from PIL import Image
import torchvision.transforms as T

# Model architecture (extracted from the train script)
class MultiHead(nn.Module):
    def __init__(self, backbone_name, drop_rate=0.3, num_age=4):
        super().__init__()
        self.backbone = timm.create_model(
            backbone_name, pretrained=False, num_classes=0,
            global_pool="avg", drop_rate=drop_rate)
        # probe feat_dim (for mnv3/effb0 the pooled output differs from num_features)
        with torch.no_grad():
            feat_dim = self.backbone(torch.zeros(1, 3, 64, 64)).shape[-1]
        self.feat_dim = feat_dim
        self.gender_head = nn.Linear(feat_dim, 1)
        self.age_head = nn.Linear(feat_dim, num_age)
    def forward(self, x):
        f = self.backbone(x)
        return self.gender_head(f).squeeze(-1), self.age_head(f)

# Load
ckpt = torch.load("age_gender_v20260507E_convnext_tiny.pt",
                  map_location="cpu", weights_only=False)
# ckpt['args']['backbone'] = "convnext_tiny.fb_in22k_ft_in1k"
# ckpt['args']['img_h']=384, img_w=192
model = MultiHead(ckpt["args"]["backbone"]).eval()
model.load_state_dict(ckpt["model_state"])

# Inference (input: person crop)
mean = [0.485, 0.456, 0.406]; std = [0.229, 0.224, 0.225]
tf = T.Compose([T.Resize((384, 192)), T.ToTensor(), T.Normalize(mean, std)])

img = Image.open("person_crop.jpg").convert("RGB")
x = tf(img).unsqueeze(0)
with torch.no_grad():
    g_logit, a_logit = model(x)

gender_prob = torch.sigmoid(g_logit).item()      # > 0.5 = female
gender = "female" if gender_prob > 0.5 else "male"
age_idx = a_logit.argmax(dim=-1).item()
age_group = ["child", "young", "adult", "elder"][age_idx]

print(f"gender: {gender} ({gender_prob:.2f})")
print(f"age: {age_group}")
# Note: the young class has no supervised data in PA-100K or MSP60K,
# so the model never predicts young; only child/adult/elder come out
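
For the production cascade mentioned above, the same model would normally run batched on GPU; a minimal variant of the inference step, with hypothetical file names:

device = "cuda:0"
model = model.to(device)
crops = [Image.open(p).convert("RGB") for p in ["c1.jpg", "c2.jpg"]]
x = torch.stack([tf(c) for c in crops]).to(device)   # (B, 3, 384, 192)
with torch.no_grad():
    g_logit, a_logit = model(x)                      # shapes (B,), (B, 4)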

🧪 Full 16-ablation comparison

| variant | backbone | aug/loss | img | ep | g_acc | a_acc | a_f1 | child | adult | elder | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A baseline mnv3l | mobilenetv3_l | camaug | 384×192 | 12 | 0.838 | 0.886 | 0.621 | 0.54 | 0.95 | 0.38 | |
| B strongaug mnv3l | mobilenetv3_l | strong | 384×192 | 12 | 0.835 | 0.882 | 0.614 | 0.60 | 0.94 | 0.36 | |
| C balsamp mnv3l | mobilenetv3_l | camaug+wsamp | 384×192 | 12 | 0.830 | 0.913 | 0.617 | 0.54 | 0.98 | 0.17 | weighted sampler (sketch below) |
| D focal mnv3l | mobilenetv3_l | camaug+focal | 384×192 | 11 | 0.853 | 0.852 | 0.608 | 0.78 | 0.87 | 0.51 | focal γ=2 lifts elder |
| E convnext_tiny ⭐ | convnext_tiny | camaug | 384×192 | 12 | 0.857 | 0.924 | 0.683 | 0.73 | 0.96 | 0.34 | overall champion (lucky) |
| E2 convnext_tiny seed2026 | convnext_tiny | camaug | 384×192 | 11 | 0.860 | 0.923 | 0.679 | 0.74 | 0.96 | 0.32 | seed control |
| E3 convnext_tiny seed7 | convnext_tiny | camaug | 384×192 | 12 | 0.857 | 0.923 | 0.689 | 0.73 | 0.96 | 0.38 | best a_f1 |
| E4 convnext_tiny seed2024 | convnext_tiny | camaug | 384×192 | 12 | 0.855 | 0.919 | 0.679 | 0.71 | 0.96 | 0.36 | |
| F efficientnet_b0 | efficientnet_b0 | camaug | 384×192 | 12 | 0.850 | 0.905 | 0.659 | 0.67 | 0.95 | 0.37 | |
| G convnext_small | convnext_small | camaug | 384×192 | 12 | 0.860 | 0.923 | 0.680 | 0.74 | 0.96 | 0.32 | no win over tiny |
| H img224 mnv3l | mobilenetv3_l | camaug | 224×224 | 12 | 0.823 | 0.874 | 0.593 | 0.55 | 0.94 | 0.32 | lower resolution hurts |
| I ema mnv3l | mobilenetv3_l | camaug+ema | 384×192 | 12 | 0.835 | 0.885 | 0.617 | 0.57 | 0.94 | 0.36 | |
| J pa100k only | mobilenetv3_l | camaug | 384×192 | 7 | 0.841 | 0.924 | 0.608 | 0.65 | 0.98 | 0.11 | PA-100K single dataset |
| K msp60k only | mobilenetv3_l | camaug | 384×192 | 12 | 0.798 | 0.888 | 0.657 | 0.82 | 0.91 | 0.36 | MSP60K single dataset; higher elder but lower g_acc |
| M vanilla mnv3l | mobilenetv3_l | no-aug | 384×192 | 3 | 0.833 | 0.918 | 0.634 | 0.58 | 0.99 | 0.16 | no-aug control |
| N sqrtinv mnv3l | mobilenetv3_l | camaug+sqrtinv | 384×192 | 6 | 0.837 | 0.906 | 0.647 | 0.60 | 0.96 | 0.33 | sqrt-inv class weight (sketch below) |
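
For the C and N rows, a minimal sketch of the two balancing mechanisms; the labels below are hypothetical placeholders, and the actual weighting in the train script may differ:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# age_labels: one int per train crop (0=child, 1=young, 2=adult, 3=elder);
# hypothetical — pull these from whatever the train dataset exposes
age_labels = np.asarray([2, 2, 2, 3, 0, 2, 2, 3])
counts = np.bincount(age_labels, minlength=4).clip(min=1)

# C "wsamp": inverse-frequency sampling, so batches over-sample elder/child
sample_w = 1.0 / counts[age_labels]
sampler = WeightedRandomSampler(torch.as_tensor(sample_w, dtype=torch.double),
                                num_samples=len(age_labels), replacement=True)

# N "sqrtinv": milder sqrt-inverse weights applied in the loss instead
sqrtinv_w = torch.tensor((counts.max() / counts) ** 0.5, dtype=torch.float32)
criterion = torch.nn.CrossEntropyLoss(weight=sqrtinv_w)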

🆚 Cross-source comparison (PA-100K only vs MSP60K only vs mix)

| train source | g_acc | a_acc | child | adult | elder |
|---|---|---|---|---|---|
| PA-100K only (J) | 0.841 | 0.924 | 0.65 | 0.98 | 0.11 |
| MSP60K only (K) | 0.798 | 0.888 | 0.82 | 0.91 | 0.36 |
| PA+MSP mix (A baseline) | 0.838 | 0.886 | 0.54 | 0.95 | 0.38 |
| PA+MSP mix (E convnext) | 0.857 | 0.924 | 0.73 | 0.96 | 0.34 |
Observation: each single source fails differently. PA-100K only collapses elder recall to 0.11, while MSP60K only drops g_acc to 0.798; the mix keeps elder recall near the MSP60K level without giving up gender accuracy.
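
The mix rows are plain dataset concatenation; a minimal sketch, assuming `pa100k_train` and `msp60k_train` are crop datasets with an identical label space (both names hypothetical):

from torch.utils.data import ConcatDataset, DataLoader

# PA-100K 80K + MSP60K 30K crops -> one 110K-crop mix train set
mix_train = ConcatDataset([pa100k_train, msp60k_train])
loader = DataLoader(mix_train, batch_size=128, shuffle=True,
                    num_workers=8, pin_memory=True)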

🧠 Core insights (mid-training)

1. ConvNeXt-Tiny > MobileNetV3-L: a_acc 0.92 vs 0.88, gender +1.9pp. 28M params vs 4.2M, ~3× slower to infer on the 5090, but acceptable for prod. Reproduces the hatch v505 finding (convnext_tiny is the sweet spot).
2. cam aug beats strong aug: rotation ±5° + blur is enough; strong aug (rotation 8° + Perspective + heavy ColorJitter) actually costs -0.4pp a_acc. Matches hatch v505's "cam = best, strong = overkill" conclusion. That said, strong aug clearly helps mnv3l's weak classes (child 0.54→0.60).
3. Focal loss rescues elder: variant D lifts elder 0.34→0.51 (+50%), but drops overall a_acc 0.886→0.852. The trade-off is stark; it could serve as a production-time fallback head for elder (see the focal sketch above).
4. 384×192 beats 224×224: variant H's a_acc 0.874 vs A's 0.886 (-1.2pp), confirming that portrait-shaped inputs suit person crops better than square ones. Matches ppe21's 384×192 design.
5. Multi-seed stability: the four convnext seeds (E/E2/E3/E4) land at a_acc 0.919-0.924, σ ≤ 0.003, so model variance is small. SWA over the 4 seeds should be a free lunch (hatch v505 measured +0.2pp); not run yet, see the sketch after this list.
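
A minimal sketch of that pending step as uniform weight averaging over the four checkpoints; the E2/E3/E4 paths are hypothetical, and this assumes all four runs fine-tune the same pretrained init (cross-seed averaging can fail otherwise):

import torch

paths = [f"age_gender_v20260507{v}_convnext_tiny.pt" for v in ("E", "E2", "E3", "E4")]
states = [torch.load(p, map_location="cpu", weights_only=False)["model_state"]
          for p in paths]

# Uniform average of the four seed checkpoints ("model soup" style)
avg = {k: torch.stack([s[k].float() for s in states]).mean(0) for k in states[0]}

model = MultiHead("convnext_tiny.fb_in22k_ft_in1k").eval()
model.load_state_dict({k: v.to(states[0][k].dtype) for k, v in avg.items()})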

⚠ Known limitations

- Elder recall tops out at 0.345 on the champion (0.51 only with focal loss, which costs overall a_acc); elder is just 1.4-3.1% of the training data.
- The young class has no supervised data in PA-100K or MSP60K, so the model can never predict it.
- All numbers are mix-validation results; SWA and cross-dataset eval have not been run yet.

🚀 Next steps (by ROI)

  1. SWA over the 4 trained seeds (E/E2/E3/E4): agent hasn't run it yet; expected +0.2pp for free (see the sketch above)
  2. Self-label 1-2K real construction-site samples via cvat2 (using the existing three-layer schema: age_group + 12-class age_fine + continuous age_estimate)
  3. Fine-tune the champion ckpt on the self-labeled calibration set to patch the elder weak class
  4. Cross-dataset eval: run the PA-100K test and MSP60K test splits separately to measure the domain gap (agent hasn't run it yet)
  5. Integrate into the ppe21 multi-task model: add the age + gender heads to the existing 21-attr partial-label BCE pipeline with a shared backbone (see the sketch after this list)
  6. Elder hard-negative mining: collect hard examples from the weak models' elder predictions and oversample them
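
For next step 5, a hedged sketch of what the shared-backbone merge could look like; every ppe21-side name here is hypothetical, and the masking mirrors what a partial-label BCE pipeline needs (unlabeled attributes contribute nothing to the loss):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackboneHeads(nn.Module):
    # Hypothetical merge: ppe21's 21-attr head plus this report's
    # age/gender heads on one shared backbone trunk
    def __init__(self, backbone, feat_dim, num_attrs=21, num_age=4):
        super().__init__()
        self.backbone = backbone
        self.attr_head = nn.Linear(feat_dim, num_attrs)   # partial-label BCE
        self.gender_head = nn.Linear(feat_dim, 1)
        self.age_head = nn.Linear(feat_dim, num_age)

    def forward(self, x):
        f = self.backbone(x)
        return (self.attr_head(f),
                self.gender_head(f).squeeze(-1),
                self.age_head(f))

def partial_label_bce(logits, targets, label_mask):
    # Only attributes actually labeled for a crop contribute to the loss
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (loss * label_mask).sum() / label_mask.sum().clamp(min=1)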

Generated 2026-05-07 | training in progress (8hr autonomous agent research run) | rai-vision-training | kaggle-reports.pages.dev