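Orchestrated from 5090-2 · VLM inference on gx10-4t (DGX Spark GB10, production untouched) · Same test set used for every comparison · Goal: verify whether having a VLM double-check PPE detection results can reduce false positives (FP); this round measures the effect only and builds no tooling.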
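
InternVL3 8B with visual e (context-margin crop) and the English yes/no prompt: fuzzy-zone acc 85.0% (vs model 65.0%), overall 82.8%, 0.73 s/image.

Person instances were pulled from cvat #12 (factory_ppe) and run through factory_ppe_v20260610_nv (MNv3-L, 27-attr, crop 384×192); samples were then bucketed by the model's hard_hat confidence. 270 samples in total:

| Subset | Bucket (model conf) | n | GT distribution |
|---|---|---|---|
| broad | fuzzy 0.3–0.7 | 60 | - |
| broad | high confidence ≥0.7 | 60 | - |
| broad | low confidence ≤0.3 | 60 | - |
| forklift | fuzzy 0.3–0.7 | 18 | - |
| forklift | high confidence ≥0.7 | 18 | - |
| forklift | low confidence ≤0.3 | 54 | - |
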
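broad = a balanced set drawn from all sources in the Test split (90 yes / 90 no), which measures the verifier's intrinsic capability; forklift = a set representative of the deployment site, taken from the 4 forklift cameras currently in service. Bboxes were located by building the on-disk path from data_id and aligning on cvat frame/person_idx (verified to match the original manifest crops). Only 116 fuzzy-zone samples exist (the model's confidence is extreme on most samples), and all of them were used.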
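
The bucketing described above can be sketched as follows. This is a minimal illustration, not the script used for the report: the function names are hypothetical, and because the table lists both "≤0.3" and "0.3–0.7", the handling of the exact boundary values is an assumption (boundaries go to the low/high buckets here).

```python
from collections import Counter

LOW_MAX = 0.3   # from the bucket table: low confidence <= 0.3
HIGH_MIN = 0.7  # from the bucket table: high confidence >= 0.7


def confidence_bucket(conf: float) -> str:
    """Map a hard_hat confidence in [0, 1] to 'low', 'fuzzy', or 'high'.

    Assumption: 0.3 and 0.7 themselves fall in the low and high buckets,
    so the fuzzy zone is the open interval (0.3, 0.7).
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence out of range: {conf}")
    if conf <= LOW_MAX:
        return "low"
    if conf >= HIGH_MIN:
        return "high"
    return "fuzzy"


def bucket_counts(confs):
    """Count how many confidences land in each bucket."""
    return Counter(confidence_bucket(c) for c in confs)
```

Calling `bucket_counts` on the full list of model confidences gives the pool size per bucket before sampling.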

| VLM | Best visual | Best prompt | Overall acc | Fuzzy acc | High-conf acc | Low-conf acc | s/image | Parse failures |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL 7B | c crop×2 | Chinese few-shot | 78.9% | 75.0% | 81.7% | 80.0% | 0.45 | 0% |
| Qwen2.5-VL 3B | c crop×2 | English yes/no | 76.7% | 70.0% | 83.3% | 76.7% | 0.32 | 0% |
| InternVL3 8B | e context-margin crop | English yes/no | 82.8% | 85.0% | 83.3% | 80.0% | 0.73 | 0% |
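
Per-visual averages (Qwen2.5-VL 7B rows of the full matrix, mean over the five prompts):

| Visual | Fuzzy acc (mean) | Overall acc (mean) | s/image |
|---|---|---|---|
| a full frame + yellow box | 64.3% | 72.4% | 2.18 |
| b crop | 64.0% | 73.3% | 1.15 |
| c crop×2 | 73.3% | 77.2% | 1.40 |
| d full frame + yellow box + arrow | 61.7% | 72.0% | 2.18 |
| e context-margin crop | 58.7% | 73.3% | 1.26 |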
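
As a rough illustration of the two crop-based visuals, the sketch below computes the pixel box for a context-margin crop and the output size for a 2× upscaled crop. The report does not state the margin used or how "crop×2" was produced, so the 25% margin, the reading of "×2" as a 2× upscale, and both function names are assumptions.

```python
def context_margin_box(bbox, image_size, margin=0.25):
    """Expand a person bbox by a relative margin, clamped to the image.

    bbox is (x1, y1, x2, y2) in pixels, image_size is (width, height).
    margin=0.25 is a placeholder, not the value used in the experiment.
    """
    x1, y1, x2, y2 = bbox
    width, height = image_size
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    return (
        max(0, int(round(x1 - pad_x))),
        max(0, int(round(y1 - pad_y))),
        min(width, int(round(x2 + pad_x))),
        min(height, int(round(y2 + pad_y))),
    )


def upscaled_size(bbox, factor=2):
    """Output (width, height) of a tight crop resized by an integer factor."""
    x1, y1, x2, y2 = bbox
    return ((x2 - x1) * factor, (y2 - y1) * factor)
```

The returned box can be passed directly to an image library's crop call; clamping keeps the box valid for persons near the frame edge.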
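
Per-prompt averages (Qwen2.5-VL 7B rows of the full matrix, mean over the five visuals):

| Prompt | Fuzzy acc (mean) | Overall acc (mean) | s/image |
|---|---|---|---|
| Chinese yes/no | 63.3% | 73.6% | 0.65 |
| Chinese CoT | 64.0% | 73.2% | 3.15 |
| Chinese few-shot | 67.3% | 74.0% | 0.66 |
| Chinese JSON | 64.0% | 74.1% | 3.03 |
| English yes/no | 63.3% | 73.4% | 0.68 |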
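
The averaged rows can be cross-checked against the full result matrix. The sketch below recomputes the per-visual row for visual c from the five qwen25vl-7b rows for that visual; the numbers are copied from the matrix, while the dictionary layout and function name are illustrative.

```python
from statistics import mean

# prompt -> (overall %, fuzzy %, s/image), copied from the qwen25vl-7b
# rows for visual c in the full result matrix.
QWEN7B_VISUAL_C = {
    "en_yesno":   (77.8, 75.0, 0.47),
    "zh_cot":     (74.4, 71.7, 2.78),
    "zh_fewshot": (78.9, 75.0, 0.45),
    "zh_json":    (76.7, 73.3, 2.85),
    "zh_yesno":   (78.3, 71.7, 0.45),
}


def column_means(rows):
    """Return (mean overall %, mean fuzzy %, mean s/image) over all prompts."""
    overall, fuzzy, seconds = zip(*rows.values())
    return round(mean(overall), 1), round(mean(fuzzy), 1), round(mean(seconds), 2)


print(column_means(QWEN7B_VISUAL_C))  # (77.2, 73.3, 1.4)
```

The result matches the 73.3% fuzzy, 77.2% overall, and 1.40 s/image reported for visual c in the per-visual averages.
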

| VLM | Visual | Prompt | Overall | Fuzzy | High | Low | s |
|---|---|---|---|---|---|---|---|
| internvl3-8b | a | en_yesno | 75.0% | 56.7% | 85.0% | 83.3% | 0.86 |
| internvl3-8b | a | zh_cot | 68.9% | 56.7% | 80.0% | 70.0% | 2.32 |
| internvl3-8b | a | zh_fewshot | 76.1% | 66.7% | 83.3% | 78.3% | 0.78 |
| internvl3-8b | a | zh_json | 73.9% | 65.0% | 81.7% | 75.0% | 2.53 |
| internvl3-8b | a | zh_yesno | 72.8% | 60.0% | 85.0% | 73.3% | 0.78 |
| internvl3-8b | b | en_yesno | 83.3% | 80.0% | 81.7% | 88.3% | 0.73 |
| internvl3-8b | b | zh_cot | 81.1% | 70.0% | 88.3% | 85.0% | 2.34 |
| internvl3-8b | b | zh_fewshot | 78.3% | 66.7% | 86.7% | 81.7% | 0.65 |
| internvl3-8b | b | zh_json | 79.4% | 70.0% | 90.0% | 78.3% | 2.49 |
| internvl3-8b | b | zh_yesno | 81.7% | 73.3% | 88.3% | 83.3% | 0.64 |
| internvl3-8b | c | en_yesno | 81.7% | 76.7% | 80.0% | 88.3% | 0.74 |
| internvl3-8b | c | zh_cot | 79.4% | 71.7% | 85.0% | 81.7% | 2.32 |
| internvl3-8b | c | zh_fewshot | 78.9% | 68.3% | 86.7% | 81.7% | 0.67 |
| internvl3-8b | c | zh_json | 78.3% | 66.7% | 90.0% | 78.3% | 2.52 |
| internvl3-8b | c | zh_yesno | 80.6% | 71.7% | 88.3% | 81.7% | 0.66 |
| internvl3-8b | d | en_yesno | 73.9% | 60.0% | 80.0% | 81.7% | 0.85 |
| internvl3-8b | d | zh_cot | 70.0% | 56.7% | 80.0% | 73.3% | 2.38 |
| internvl3-8b | d | zh_fewshot | 74.4% | 63.3% | 83.3% | 76.7% | 0.78 |
| internvl3-8b | d | zh_json | 70.6% | 56.7% | 81.7% | 73.3% | 2.55 |
| internvl3-8b | d | zh_yesno | 71.7% | 58.3% | 81.7% | 75.0% | 0.77 |
| internvl3-8b | e | en_yesno | 82.8% | 85.0% | 83.3% | 80.0% | 0.73 |
| internvl3-8b | e | zh_cot | 77.8% | 66.7% | 90.0% | 76.7% | 2.18 |
| internvl3-8b | e | zh_fewshot | 81.1% | 78.3% | 86.7% | 78.3% | 0.65 |
| internvl3-8b | e | zh_json | 79.4% | 70.0% | 91.7% | 76.7% | 2.32 |
| internvl3-8b | e | zh_yesno | 79.4% | 71.7% | 90.0% | 76.7% | 0.65 |
| qwen25vl-3b | a | en_yesno | 72.2% | 60.0% | 78.3% | 78.3% | 0.90 |
| qwen25vl-3b | a | zh_cot | 60.6% | 50.0% | 61.7% | 70.0% | 1.57 |
| qwen25vl-3b | a | zh_fewshot | 71.1% | 56.7% | 71.7% | 85.0% | 0.88 |
| qwen25vl-3b | a | zh_json | 73.3% | 56.7% | 78.3% | 85.0% | 2.07 |
| qwen25vl-3b | a | zh_yesno | 70.0% | 51.7% | 71.7% | 86.7% | 0.87 |
| qwen25vl-3b | b | en_yesno | 75.0% | 61.7% | 86.7% | 76.7% | 0.16 |
| qwen25vl-3b | b | zh_cot | 47.8% | 43.3% | 45.0% | 55.0% | 1.01 |
| qwen25vl-3b | b | zh_fewshot | 72.8% | 56.7% | 83.3% | 78.3% | 0.14 |
| qwen25vl-3b | b | zh_json | 73.9% | 61.7% | 88.3% | 71.7% | 1.34 |
| qwen25vl-3b | b | zh_yesno | 73.9% | 56.7% | 88.3% | 76.7% | 0.13 |
| qwen25vl-3b | c | en_yesno | 76.7% | 70.0% | 83.3% | 76.7% | 0.32 |
| qwen25vl-3b | c | zh_cot | 55.0% | 51.7% | 55.0% | 58.3% | 1.08 |
| qwen25vl-3b | c | zh_fewshot | 75.6% | 63.3% | 85.0% | 78.3% | 0.30 |
| qwen25vl-3b | c | zh_json | 73.9% | 60.0% | 90.0% | 71.7% | 1.53 |
| qwen25vl-3b | c | zh_yesno | 73.9% | 61.7% | 83.3% | 76.7% | 0.31 |
| qwen25vl-3b | d | en_yesno | 75.6% | 66.7% | 83.3% | 76.7% | 0.89 |
| qwen25vl-3b | d | zh_cot | 57.8% | 48.3% | 61.7% | 63.3% | 1.59 |
| qwen25vl-3b | d | zh_fewshot | 75.0% | 60.0% | 78.3% | 86.7% | 0.87 |
| qwen25vl-3b | d | zh_json | 72.8% | 60.0% | 78.3% | 80.0% | 2.03 |
| qwen25vl-3b | d | zh_yesno | 75.6% | 60.0% | 80.0% | 86.7% | 0.86 |
| qwen25vl-3b | e | en_yesno | 73.3% | 63.3% | 86.7% | 70.0% | 0.20 |
| qwen25vl-3b | e | zh_cot | 61.7% | 51.7% | 63.3% | 70.0% | 0.97 |
| qwen25vl-3b | e | zh_fewshot | 71.7% | 61.7% | 71.7% | 81.7% | 0.16 |
| qwen25vl-3b | e | zh_json | 73.9% | 55.0% | 90.0% | 76.7% | 1.34 |
| qwen25vl-3b | e | zh_yesno | 70.6% | 53.3% | 85.0% | 73.3% | 0.16 |
| qwen25vl-7b | a | en_yesno | 71.7% | 61.7% | 78.3% | 75.0% | 1.19 |
| qwen25vl-7b | a | zh_cot | 75.0% | 68.3% | 75.0% | 81.7% | 3.79 |
| qwen25vl-7b | a | zh_fewshot | 70.0% | 66.7% | 70.0% | 73.3% | 1.18 |
| qwen25vl-7b | a | zh_json | 72.8% | 60.0% | 73.3% | 85.0% | 3.57 |
| qwen25vl-7b | a | zh_yesno | 72.8% | 65.0% | 80.0% | 73.3% | 1.17 |
| qwen25vl-7b | b | en_yesno | 73.9% | 63.3% | 80.0% | 78.3% | 0.25 |
| qwen25vl-7b | b | zh_cot | 70.0% | 65.0% | 73.3% | 71.7% | 2.51 |
| qwen25vl-7b | b | zh_fewshot | 74.4% | 65.0% | 76.7% | 81.7% | 0.23 |
| qwen25vl-7b | b | zh_json | 75.0% | 65.0% | 80.0% | 80.0% | 2.53 |
| qwen25vl-7b | b | zh_yesno | 73.3% | 61.7% | 80.0% | 78.3% | 0.23 |
| qwen25vl-7b | c | en_yesno | 77.8% | 75.0% | 80.0% | 78.3% | 0.47 |
| qwen25vl-7b | c | zh_cot | 74.4% | 71.7% | 75.0% | 76.7% | 2.78 |
| qwen25vl-7b | c | zh_fewshot | 78.9% | 75.0% | 81.7% | 80.0% | 0.45 |
| qwen25vl-7b | c | zh_json | 76.7% | 73.3% | 76.7% | 80.0% | 2.85 |
| qwen25vl-7b | c | zh_yesno | 78.3% | 71.7% | 85.0% | 78.3% | 0.45 |
| qwen25vl-7b | d | en_yesno | 71.1% | 60.0% | 80.0% | 73.3% | 1.18 |
| qwen25vl-7b | d | zh_cot | 75.0% | 63.3% | 76.7% | 85.0% | 3.86 |
| qwen25vl-7b | d | zh_fewshot | 70.0% | 63.3% | 73.3% | 73.3% | 1.18 |
| qwen25vl-7b | d | zh_json | 73.3% | 61.7% | 76.7% | 81.7% | 3.55 |
| qwen25vl-7b | d | zh_yesno | 70.6% | 60.0% | 81.7% | 70.0% | 1.15 |
| qwen25vl-7b | e | en_yesno | 72.8% | 56.7% | 85.0% | 76.7% | 0.29 |
| qwen25vl-7b | e | zh_cot | 71.7% | 51.7% | 80.0% | 83.3% | 2.80 |
| qwen25vl-7b | e | zh_fewshot | 76.7% | 66.7% | 78.3% | 85.0% | 0.27 |
| qwen25vl-7b | e | zh_json | 72.8% | 60.0% | 83.3% | 75.0% | 2.66 |
| qwen25vl-7b | e | zh_yesno | 72.8% | 58.3% | 85.0% | 75.0% | 0.27 |
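
Note: the high/low buckets were deliberately sampled to be balanced in order to test the verifier (they include many of the rare cases the model gets wrong), so the in-bucket "model acc" does not reflect the real proportion in the field. Refer to the FP/FN catch rates above rather than to bucket acc.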
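
Claude manually reviewed the 60 fuzzy samples one by one (as crop×2 thumbnails), sorted them into four categories, and compared model vs VLM (best) accuracy in each. This shows what the fuzzy zone is made of and how much the VLM gains in each category:

| Category | Share | Model acc | VLM acc | Interpretation |
|---|---|---|---|---|
| Clearly judgeable, label correct | 28% | 76.5% | 100.0% | VLM clearly beats the model (solid fundamentals) |
| Confusable headwear (cap / cover / hood) | 35% | 61.9% | 76.2% | ★ Where the VLM adds value: it distinguishes headwear semantically, clearly better than the CNN |
| Thumbnail too small / blurry | 28% | 58.8% | 88.2% | VLM can still decide at full resolution (the thumbnail misled the manual audit) |
| cvat label wrong | 8% | 60.0% | 60.0% | Most VLM "errors" here are caught label errors; usable as a label audit |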
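
Replacing the ground truth with Claude's manual reading of the clearly judgeable samples (which removes cvat label noise): on judgeable samples, the VLM agrees with the manual reading 90.6% of the time (n=32). This reflects the VLM's real capability better than figures measured against the noisy cvat ground truth.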
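
The category table can be reconciled with the headline fuzzy-zone figures. In the sketch below the per-category counts are inferred from the reported percentages with 60 samples (for example 13/17 = 76.5%); they are not read from the raw data, so treat them as a consistency check rather than as the original tallies.

```python
# (category, n, model_correct, vlm_correct); counts inferred from the
# reported percentages with 60 samples, not taken from the raw data.
AUDIT = [
    ("clearly judgeable, label correct", 17, 13, 17),
    ("confusable headwear",              21, 13, 16),
    ("thumbnail too small / blurry",     17, 10, 15),
    ("cvat label wrong",                  5,  3,  3),
]


def pct(correct, n):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / n, 1)


def totals(rows):
    """Pool all categories and return (n, model acc %, VLM acc %)."""
    n = sum(r[1] for r in rows)
    model = sum(r[2] for r in rows)
    vlm = sum(r[3] for r in rows)
    return n, pct(model, n), pct(vlm, n)


print(totals(AUDIT))  # (60, 65.0, 85.0)
```

Pooling the four categories reproduces the 65.0% model and 85.0% VLM fuzzy-zone accuracy reported for the best combination.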

| VLM | Prompt type | s/image |
|---|---|---|
| Qwen2.5-VL 7B | yes/no, short output | 0.68 |
| Qwen2.5-VL 7B | CoT, long output | 3.15 |
| Qwen2.5-VL 7B | JSON | 3.03 |
| Qwen2.5-VL 3B | yes/no, short output | 0.49 |
| Qwen2.5-VL 3B | CoT, long output | 1.24 |
| Qwen2.5-VL 3B | JSON | 1.66 |
| InternVL3 8B | yes/no, short output | 0.78 |
| InternVL3 8B | CoT, long output | 2.31 |
| InternVL3 8B | JSON | 2.48 |

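Hardware = gx10-4t DGX Spark GB10 (unified memory), single GPU, single slot. The yes/no short-output prompts are fastest; CoT and JSON are slower because they generate more tokens. The verifier only needs to run on fuzzy-zone samples (a very small share of production traffic), so overall cost is manageable.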
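
The gating idea (call the VLM only for fuzzy-zone detections) could look roughly like the sketch below. No tooling was built in this round, so everything here is hypothetical: `ask_vlm` stands for any callable that returns the VLM's raw text answer, the 0.5 decision threshold is a placeholder, and falling back to the model on a parse failure is one possible policy.

```python
from typing import Callable, Optional


def parse_yes_no(text: str) -> Optional[bool]:
    """Return True/False for a yes/no answer, or None on a parse failure."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def verify_hard_hat(conf: float, ask_vlm: Callable[[], str],
                    threshold: float = 0.5,
                    low: float = 0.3, high: float = 0.7) -> bool:
    """Decide hard_hat presence, consulting the VLM only in the fuzzy zone."""
    model_says = conf >= threshold
    if conf <= low or conf >= high:
        return model_says              # confident: no VLM call, no extra latency
    answer = parse_yes_no(ask_vlm())   # fuzzy zone: one VLM call
    return model_says if answer is None else answer
```

Because `ask_vlm` is only invoked inside the fuzzy zone, the added per-image latency applies to that small share of detections only.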
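
On the forklift subset, the best VLM combination reaches acc = 88.9% (InternVL3 8B, c/zh_json).

Extending to other attributes (reflective vest / goggles / mask): the best hard_hat combination (Qwen-7B + crop×2 + yes/no) was run on the samples in the existing question bank that have ground truth for the attribute. The table compares overall acc of VLM vs model (each using its own threshold). The 82.8% hard-hat figure in that table matches the InternVL3 8B e/en_yesno row of the full matrix rather than any Qwen-7B row.
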
| attr | n (with ground truth) | VLM overall acc | Model overall acc | Is the VLM stronger? |
|---|---|---|---|---|
| Hard hat | 180 | 82.8% | 55.0% | VLM stronger |
| Reflective vest | 125 | 88.0% | 91.2% | VLM weaker |
| Goggles | 27 | 74.1% | 92.6% | VLM weaker |
| Mask | 32 | 81.2% | 87.5% | VLM weaker |
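
One way to read the table is as a per-attribute routing decision: only attributes where the VLM beats the model are candidates for verification. The sketch below encodes the table's figures; the attribute keys other than hard_hat and both function names are illustrative, and small-n rows (goggles, mask) should be read with caution.

```python
# attr -> (n with ground truth, VLM overall acc %, model overall acc %),
# copied from the attribute comparison table.
RESULTS = {
    "hard_hat":        (180, 82.8, 55.0),
    "reflective_vest": (125, 88.0, 91.2),
    "goggles":         (27, 74.1, 92.6),
    "mask":            (32, 81.2, 87.5),
}


def vlm_gain(attr):
    """VLM overall acc minus model overall acc, in percentage points."""
    _, vlm_acc, model_acc = RESULTS[attr]
    return round(vlm_acc - model_acc, 1)


def attrs_worth_verifying(min_gain=0.0):
    """Attributes where the VLM beats the model by more than min_gain points."""
    return [attr for attr in RESULTS if vlm_gain(attr) > min_gain]


print(attrs_worth_verifying())  # ['hard_hat']
```

On these numbers only hard_hat qualifies; the other three attributes show a negative gain on this question bank.
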

Generated 2026-06-14 08:04 · 5090-2 · data/scripts in ~/vlm_verifier_research/ · VLM inference on gx10-4t (production untouched)