🔬 VLM-as-verifier feasibility study: using a multimodal VLM to double-check PPE detections (20260614)

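Orchestrated from 5090-2 · VLM inference on gx10-4t (DGX Spark GB10, production untouched) · every comparison runs on the same test set · Goal: check whether having a VLM double-check PPE detections to reduce FP is feasible; this round measures the effect only and builds no tooling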

TL;DR: VLM double-checking of PPE detections is feasible, and its most cost-effective role is an FP-reduction filter.

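0. Study setup

0.1 Question bank (same-set comparison)

Person instances were pulled from cvat #12 (factory_ppe) and run through factory_ppe_v20260610_nv (MNv3-L, 27-attr, crop 384×192); the samples were then bucketed by hard_hat confidence. 270 samples in total: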

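subset | bucket (model conf) | n | gt distribution
broad | ambiguous 0.3–0.7 | 60 |
broad | high confidence ≥0.7 | 60 |
broad | low confidence ≤0.3 | 60 |
forklift | ambiguous 0.3–0.7 | 18 |
forklift | high confidence ≥0.7 | 18 |
forklift | low confidence ≤0.3 | 54 |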

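broad = a balanced set drawn from all sources in the Test split (90 yes / 90 no), used to measure the verifier's intrinsic ability; forklift = a set representing the sites of the 4 forklift cameras currently in service. Each bbox is obtained by building the disk path from data_id and aligning it with the cvat frame/person_idx (verified to match the original manifest crop). Only 116 samples fell in the ambiguous zone (the model's confidence is extreme for most samples), and all of them were used.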
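The bucketing rule in the table above can be sketched as follows. This is a minimal illustration, not the study's script: the function names are hypothetical, and because the table lists 0.3 and 0.7 in two buckets each, the sketch assumes the endpoints belong to the low and high buckets.

```python
from collections import Counter
from typing import Dict, Iterable


def confidence_bucket(conf: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a model hard_hat confidence to a bucket name.

    Assumption: conf <= low is "low", conf >= high is "high",
    and everything strictly in between is "ambiguous".
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {conf}")
    if conf <= low:
        return "low"
    if conf >= high:
        return "high"
    return "ambiguous"


def bucket_counts(confs: Iterable[float]) -> Dict[str, int]:
    """Count how many confidences fall in each bucket."""
    return dict(Counter(confidence_bucket(c) for c in confs))
```

Counting buckets this way over the model's raw scores is one way to see how few samples land in the ambiguous zone.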

0.2 Variable matrix

1. Full variable-matrix results (hard_hat, broad set)

1.1 Best combination per VLM

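VLM | best visual | best prompt | overall acc | ambiguous acc | high-conf acc | low-conf acc | s/img | parse failures
Qwen2.5-VL 7B | c crop×2 | zh few-shot | 78.9% | 75.0% | 81.7% | 80.0% | 0.45 | 0%
Qwen2.5-VL 3B | c crop×2 | en yes/no | 76.7% | 70.0% | 83.3% | 76.7% | 0.32 | 0%
InternVL3 8B | e context-margin crop | en yes/no | 82.8% | 85.0% | 83.3% | 80.0% | 0.73 | 0%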

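1.2 Effect of visual marking method (Qwen-7B, averaged across prompts, ambiguous-zone acc)

visual | ambiguous acc (mean) | overall acc (mean) | s/img
a full image + yellow box | 64.3% | 72.4% | 2.18
b crop | 64.0% | 73.3% | 1.15
c crop×2 | 73.3% | 77.2% | 1.40
d full image + yellow box + arrow | 61.7% | 72.0% | 2.18
e context-margin crop | 58.7% | 73.3% | 1.26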
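The report does not spell out how the crop variants are produced, so the following is only a sketch under stated assumptions: "context-margin crop" is taken to mean expanding the person bbox by a fraction of its size on each side (the 0.25 default is hypothetical), and "crop×2" is taken to mean upscaling the crop by a factor of 2. Both function names are illustrative.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def context_margin_box(box: Box, img_w: int, img_h: int, margin: float = 0.25) -> Box:
    """Expand a bbox by `margin` of its width/height on each side, clamped to the image."""
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box must have positive width and height")
    dx = round((x2 - x1) * margin)
    dy = round((y2 - y1) * margin)
    return (max(0, x1 - dx), max(0, y1 - dy), min(img_w, x2 + dx), min(img_h, y2 + dy))


def upscaled_size(box: Box, factor: int = 2) -> Tuple[int, int]:
    """Output (width, height) of a crop after integer upscaling (assumed meaning of crop×2)."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) * factor, (y2 - y1) * factor)
```

For a 100×200 px box at (100, 50) in a 1920×1080 frame, the margin box grows by 25 px horizontally and 50 px vertically, and is clipped at the top edge of the image.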

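1.3 Effect of prompt strategy (Qwen-7B, averaged across visuals, ambiguous-zone acc)

prompt | ambiguous acc (mean) | overall acc (mean) | s/img
zh yes/no | 63.3% | 73.6% | 0.65
zh CoT | 64.0% | 73.2% | 3.15
zh few-shot | 67.3% | 74.0% | 0.66
zh JSON | 64.0% | 74.1% | 3.03
en yes/no | 63.3% | 73.4% | 0.68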
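Each prompt style returns free text that has to be reduced to a yes/no verdict, and a reply that cannot be reduced counts as a parse failure. The sketch below shows one possible parser; it is not the study's code. It assumes the verdict sits on the last non-empty line (as a CoT reply would end), and the token lists are illustrative.

```python
import re
from typing import Optional

# Chinese tokens are checked negatives-first because "沒有" contains "有".
_NO_ZH = ("沒有", "没有", "不是", "否", "未")
_YES_ZH = ("是", "有")
# Match yes/no/true/false only when not embedded in a longer ASCII word.
_EN_PATTERN = re.compile(r"(?<![a-z])(yes|no|true|false)(?![a-z])")


def parse_verdict(text: str) -> Optional[bool]:
    """Return True (present), False (absent), or None on parse failure."""
    lines = [ln.strip().lower() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1]  # assumption: the final line carries the answer
    english = _EN_PATTERN.findall(last)
    if english:
        return english[-1] in ("yes", "true")
    if any(tok in last for tok in _NO_ZH):
        return False
    if any(tok in last for tok in _YES_ZH):
        return True
    return None
```

Because the same function handles short answers, CoT endings, and JSON values such as `{"hard_hat": false}`, one parse-failure counter can be shared across prompt styles.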

1.4 Full table of 75 combinations

VLM | visual | prompt | overall | ambiguous | high-conf | low-conf | s/img
internvl3-8b | a | en_yesno | 75.0% | 56.7% | 85.0% | 83.3% | 0.86
internvl3-8b | a | zh_cot | 68.9% | 56.7% | 80.0% | 70.0% | 2.32
internvl3-8b | a | zh_fewshot | 76.1% | 66.7% | 83.3% | 78.3% | 0.78
internvl3-8b | a | zh_json | 73.9% | 65.0% | 81.7% | 75.0% | 2.53
internvl3-8b | a | zh_yesno | 72.8% | 60.0% | 85.0% | 73.3% | 0.78
internvl3-8b | b | en_yesno | 83.3% | 80.0% | 81.7% | 88.3% | 0.73
internvl3-8b | b | zh_cot | 81.1% | 70.0% | 88.3% | 85.0% | 2.34
internvl3-8b | b | zh_fewshot | 78.3% | 66.7% | 86.7% | 81.7% | 0.65
internvl3-8b | b | zh_json | 79.4% | 70.0% | 90.0% | 78.3% | 2.49
internvl3-8b | b | zh_yesno | 81.7% | 73.3% | 88.3% | 83.3% | 0.64
internvl3-8b | c | en_yesno | 81.7% | 76.7% | 80.0% | 88.3% | 0.74
internvl3-8b | c | zh_cot | 79.4% | 71.7% | 85.0% | 81.7% | 2.32
internvl3-8b | c | zh_fewshot | 78.9% | 68.3% | 86.7% | 81.7% | 0.67
internvl3-8b | c | zh_json | 78.3% | 66.7% | 90.0% | 78.3% | 2.52
internvl3-8b | c | zh_yesno | 80.6% | 71.7% | 88.3% | 81.7% | 0.66
internvl3-8b | d | en_yesno | 73.9% | 60.0% | 80.0% | 81.7% | 0.85
internvl3-8b | d | zh_cot | 70.0% | 56.7% | 80.0% | 73.3% | 2.38
internvl3-8b | d | zh_fewshot | 74.4% | 63.3% | 83.3% | 76.7% | 0.78
internvl3-8b | d | zh_json | 70.6% | 56.7% | 81.7% | 73.3% | 2.55
internvl3-8b | d | zh_yesno | 71.7% | 58.3% | 81.7% | 75.0% | 0.77
internvl3-8b | e | en_yesno | 82.8% | 85.0% | 83.3% | 80.0% | 0.73
internvl3-8b | e | zh_cot | 77.8% | 66.7% | 90.0% | 76.7% | 2.18
internvl3-8b | e | zh_fewshot | 81.1% | 78.3% | 86.7% | 78.3% | 0.65
internvl3-8b | e | zh_json | 79.4% | 70.0% | 91.7% | 76.7% | 2.32
internvl3-8b | e | zh_yesno | 79.4% | 71.7% | 90.0% | 76.7% | 0.65
qwen25vl-3b | a | en_yesno | 72.2% | 60.0% | 78.3% | 78.3% | 0.90
qwen25vl-3b | a | zh_cot | 60.6% | 50.0% | 61.7% | 70.0% | 1.57
qwen25vl-3b | a | zh_fewshot | 71.1% | 56.7% | 71.7% | 85.0% | 0.88
qwen25vl-3b | a | zh_json | 73.3% | 56.7% | 78.3% | 85.0% | 2.07
qwen25vl-3b | a | zh_yesno | 70.0% | 51.7% | 71.7% | 86.7% | 0.87
qwen25vl-3b | b | en_yesno | 75.0% | 61.7% | 86.7% | 76.7% | 0.16
qwen25vl-3b | b | zh_cot | 47.8% | 43.3% | 45.0% | 55.0% | 1.01
qwen25vl-3b | b | zh_fewshot | 72.8% | 56.7% | 83.3% | 78.3% | 0.14
qwen25vl-3b | b | zh_json | 73.9% | 61.7% | 88.3% | 71.7% | 1.34
qwen25vl-3b | b | zh_yesno | 73.9% | 56.7% | 88.3% | 76.7% | 0.13
qwen25vl-3b | c | en_yesno | 76.7% | 70.0% | 83.3% | 76.7% | 0.32
qwen25vl-3b | c | zh_cot | 55.0% | 51.7% | 55.0% | 58.3% | 1.08
qwen25vl-3b | c | zh_fewshot | 75.6% | 63.3% | 85.0% | 78.3% | 0.30
qwen25vl-3b | c | zh_json | 73.9% | 60.0% | 90.0% | 71.7% | 1.53
qwen25vl-3b | c | zh_yesno | 73.9% | 61.7% | 83.3% | 76.7% | 0.31
qwen25vl-3b | d | en_yesno | 75.6% | 66.7% | 83.3% | 76.7% | 0.89
qwen25vl-3b | d | zh_cot | 57.8% | 48.3% | 61.7% | 63.3% | 1.59
qwen25vl-3b | d | zh_fewshot | 75.0% | 60.0% | 78.3% | 86.7% | 0.87
qwen25vl-3b | d | zh_json | 72.8% | 60.0% | 78.3% | 80.0% | 2.03
qwen25vl-3b | d | zh_yesno | 75.6% | 60.0% | 80.0% | 86.7% | 0.86
qwen25vl-3b | e | en_yesno | 73.3% | 63.3% | 86.7% | 70.0% | 0.20
qwen25vl-3b | e | zh_cot | 61.7% | 51.7% | 63.3% | 70.0% | 0.97
qwen25vl-3b | e | zh_fewshot | 71.7% | 61.7% | 71.7% | 81.7% | 0.16
qwen25vl-3b | e | zh_json | 73.9% | 55.0% | 90.0% | 76.7% | 1.34
qwen25vl-3b | e | zh_yesno | 70.6% | 53.3% | 85.0% | 73.3% | 0.16
qwen25vl-7b | a | en_yesno | 71.7% | 61.7% | 78.3% | 75.0% | 1.19
qwen25vl-7b | a | zh_cot | 75.0% | 68.3% | 75.0% | 81.7% | 3.79
qwen25vl-7b | a | zh_fewshot | 70.0% | 66.7% | 70.0% | 73.3% | 1.18
qwen25vl-7b | a | zh_json | 72.8% | 60.0% | 73.3% | 85.0% | 3.57
qwen25vl-7b | a | zh_yesno | 72.8% | 65.0% | 80.0% | 73.3% | 1.17
qwen25vl-7b | b | en_yesno | 73.9% | 63.3% | 80.0% | 78.3% | 0.25
qwen25vl-7b | b | zh_cot | 70.0% | 65.0% | 73.3% | 71.7% | 2.51
qwen25vl-7b | b | zh_fewshot | 74.4% | 65.0% | 76.7% | 81.7% | 0.23
qwen25vl-7b | b | zh_json | 75.0% | 65.0% | 80.0% | 80.0% | 2.53
qwen25vl-7b | b | zh_yesno | 73.3% | 61.7% | 80.0% | 78.3% | 0.23
qwen25vl-7b | c | en_yesno | 77.8% | 75.0% | 80.0% | 78.3% | 0.47
qwen25vl-7b | c | zh_cot | 74.4% | 71.7% | 75.0% | 76.7% | 2.78
qwen25vl-7b | c | zh_fewshot | 78.9% | 75.0% | 81.7% | 80.0% | 0.45
qwen25vl-7b | c | zh_json | 76.7% | 73.3% | 76.7% | 80.0% | 2.85
qwen25vl-7b | c | zh_yesno | 78.3% | 71.7% | 85.0% | 78.3% | 0.45
qwen25vl-7b | d | en_yesno | 71.1% | 60.0% | 80.0% | 73.3% | 1.18
qwen25vl-7b | d | zh_cot | 75.0% | 63.3% | 76.7% | 85.0% | 3.86
qwen25vl-7b | d | zh_fewshot | 70.0% | 63.3% | 73.3% | 73.3% | 1.18
qwen25vl-7b | d | zh_json | 73.3% | 61.7% | 76.7% | 81.7% | 3.55
qwen25vl-7b | d | zh_yesno | 70.6% | 60.0% | 81.7% | 70.0% | 1.15
qwen25vl-7b | e | en_yesno | 72.8% | 56.7% | 85.0% | 76.7% | 0.29
qwen25vl-7b | e | zh_cot | 71.7% | 51.7% | 80.0% | 83.3% | 2.80
qwen25vl-7b | e | zh_fewshot | 76.7% | 66.7% | 78.3% | 85.0% | 0.27
qwen25vl-7b | e | zh_json | 72.8% | 60.0% | 83.3% | 75.0% | 2.66
qwen25vl-7b | e | zh_yesno | 72.8% | 58.3% | 85.0% | 75.0% | 0.27
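The per-visual means in section 1.2 can be reproduced from this full table. As a worked check, the snippet below averages the five qwen25vl-7b rows for visual c; the numbers are copied from the table, while the variable names are illustrative.

```python
from statistics import mean

# (prompt, overall acc %, ambiguous acc %, s/img) for qwen25vl-7b, visual c
rows = [
    ("en_yesno", 77.8, 75.0, 0.47),
    ("zh_cot", 74.4, 71.7, 2.78),
    ("zh_fewshot", 78.9, 75.0, 0.45),
    ("zh_json", 76.7, 73.3, 2.85),
    ("zh_yesno", 78.3, 71.7, 0.45),
]

mean_overall = round(mean(r[1] for r in rows), 1)    # overall acc, mean over prompts
mean_ambiguous = round(mean(r[2] for r in rows), 1)  # ambiguous-zone acc, mean over prompts
mean_seconds = round(mean(r[3] for r in rows), 2)    # seconds per image, mean over prompts

print(mean_ambiguous, mean_overall, mean_seconds)
```

The three means come out to 73.3, 77.2 and 1.4, matching the c crop×2 row of section 1.2.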

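2. Ambiguous-zone accuracy: can the VLM rescue the model? (core of the study)

Three-way comparison in the ambiguous zone (best combo vs model baseline vs cvat ground truth):

2.1 ★ Used as an FP-reduction filter: the VLM's ability to catch the model's wrong calls

The VLM is attached after the model's extreme-confidence outputs as a second check (best combo). Conclusion: the most cost-effective use is as an FP filter on high-confidence detections, where it catches about 60% of false positives and almost never rejects correct detections.

Note: the high/low buckets were deliberately balanced to test the verifier (they contain many of the model's rare misjudged cases), so the in-bucket model acc is not the true field proportion. Read the FP/FN catch rates above rather than the bucket acc.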
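To make the two filter metrics concrete, here is a small sketch of how an FP catch rate and a false-kill rate could be computed from per-sample records. The `Sample` type, the function name, and the toy data are all hypothetical; the toy numbers are not results from this study.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    model_positive: bool  # the model's high-confidence call
    gt_positive: bool     # cvat ground truth
    vlm_positive: bool    # VLM verdict on the same person


def fp_filter_rates(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Return (fp_catch_rate, false_kill_rate) over the model's positive calls."""
    positives = [s for s in samples if s.model_positive]
    false_pos = [s for s in positives if not s.gt_positive]
    true_pos = [s for s in positives if s.gt_positive]
    if not false_pos or not true_pos:
        raise ValueError("need at least one FP and one TP among model positives")
    caught = sum(not s.vlm_positive for s in false_pos)  # FPs the VLM rejects
    killed = sum(not s.vlm_positive for s in true_pos)   # TPs the VLM wrongly rejects
    return caught / len(false_pos), killed / len(true_pos)


# Toy data only: 4 model FPs (VLM rejects 3) and 4 model TPs (VLM rejects 1).
toy = (
    [Sample(True, False, False)] * 3 + [Sample(True, False, True)]
    + [Sample(True, True, True)] * 3 + [Sample(True, True, False)]
)
catch_rate, kill_rate = fp_filter_rates(toy)
```

Reporting both rates together matters because a filter that rejects everything would catch every FP while also killing every correct detection.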

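2.2 ★ Ambiguous-zone audit by category: on which kinds of ambiguity the VLM strengthens the model

Claude manually reviewed each of the 60 ambiguous samples (as crop×2 thumbnails) and sorted them into four categories, then compared model vs VLM (best) acc per category. This shows what the ambiguous zone is made of and how much the VLM gains in each category:

category | share | model acc | VLM acc | interpretation
clearly judgeable + label correct | 28% | 76.5% | 100.0% | VLM clearly beats the model (solid fundamentals)
confusable headwear (cap/cover/hood) | 35% | 61.9% | 76.2% | ★ where the VLM adds value: it distinguishes headwear semantically, clearly better than the CNN
thumbnail too small/blurry | 28% | 58.8% | 88.2% | VLM can still judge at full resolution (the thumbnail misled the manual audit)
cvat label wrong | 8% | 60.0% | 60.0% | most VLM "errors" are label errors it caught, so it can serve as a label audit

Replacing the ground truth with Claude's clear manual readings (which removes the cvat label noise), the VLM agrees with the manual reading on 90.6% of judgeable samples (n=32). This reflects the VLM's real ability better than the numbers measured against the noisy cvat ground truth.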
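As a worked check on the category table above, the per-category accuracies can be combined into an overall ambiguous-zone accuracy. The per-category counts below are not given in the report; they are inferred from the rounded shares of the 60 samples (17, 21, 17, 5), which is an assumption.

```python
# (category, inferred count, model acc %, VLM acc %)
audit = [
    ("clearly judgeable + label correct", 17, 76.5, 100.0),
    ("confusable headwear", 21, 61.9, 76.2),
    ("thumbnail too small/blurry", 17, 58.8, 88.2),
    ("cvat label wrong", 5, 60.0, 60.0),
]

total = sum(n for _, n, _, _ in audit)
# Recover whole-sample counts of correct answers from the rounded percentages.
model_correct = sum(round(n * m / 100) for _, n, m, _ in audit)
vlm_correct = sum(round(n * v / 100) for _, n, _, v in audit)

model_acc = round(100 * model_correct / total, 1)
vlm_acc = round(100 * vlm_correct / total, 1)
print(total, model_acc, vlm_acc)
```

Under these inferred counts the model gets 39 of 60 correct (65.0%) and the VLM 51 of 60 (85.0%); the latter equals the 85.0% ambiguous acc listed for InternVL3 8B with e/en_yesno in section 1.1.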

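3. Speed evaluation

VLM | prompt type | s/img
Qwen2.5-VL 7B | yes/no short output | 0.68
Qwen2.5-VL 7B | CoT long output | 3.15
Qwen2.5-VL 7B | JSON | 3.03
Qwen2.5-VL 3B | yes/no short output | 0.49
Qwen2.5-VL 3B | CoT long output | 1.24
Qwen2.5-VL 3B | JSON | 1.66
InternVL3 8B | yes/no short output | 0.78
InternVL3 8B | CoT long output | 2.31
InternVL3 8B | JSON | 2.48

Hardware = gx10-4t DGX Spark GB10 (unified memory), single GPU, single slot. Short yes/no outputs are fastest; CoT/JSON are slower because they generate more tokens. The verifier only needs to run on ambiguous-zone samples (a very small share of production traffic), so the overall cost is controllable.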
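The cost argument can be turned into a back-of-the-envelope estimate. The helpers below are a sketch: the function names are illustrative, and the traffic volume and verified fraction in the usage note are made-up inputs, not measurements from this study. Only the 0.68 s/img figure comes from the table above.

```python
def images_per_hour(seconds_per_image: float, slots: int = 1) -> float:
    """Sustained throughput assuming each slot processes images back to back."""
    if seconds_per_image <= 0 or slots < 1:
        raise ValueError("seconds_per_image must be > 0 and slots >= 1")
    return 3600.0 / seconds_per_image * slots


def verifier_hours(total_detections: int, verified_fraction: float,
                   seconds_per_image: float) -> float:
    """Single-slot hours needed to verify a fraction of the detections."""
    if not 0.0 <= verified_fraction <= 1.0:
        raise ValueError("verified_fraction must be in [0, 1]")
    return total_detections * verified_fraction * seconds_per_image / 3600.0
```

For example, at 0.68 s/img a single slot handles roughly 5,294 images per hour, and verifying a hypothetical 5% of 100,000 daily detections would take a little under one slot-hour.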

4. Findings at the sites of the 4 in-service forklift cameras

At the FOX_[TC-1-102/124] / IRODA_[頻道4/5] / HONCHUAN sites, hard_hat is almost always "absent":

Best VLM combination acc on the forklift subset = 88.9% (InternVL3 8B c/zh_json).

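5. Feasibility differences for other attrs

The extended attrs (reflective vest / goggles / mask) were run with the best hard_hat combination (Qwen-7B + crop×2 + yes/no) on the question-bank samples that have ground truth for that attr. The table compares the overall acc of VLM vs model (each using its own threshold).

attr | n (with ground truth) | VLM overall acc | model overall acc | is the VLM stronger?
hard hat | 180 | 82.8% | 55.0% | VLM stronger
reflective vest | 125 | 88.0% | 91.2% | VLM weaker
goggles | 27 | 74.1% | 92.6% | VLM weaker
mask | 32 | 81.2% | 87.5% | VLM weaker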
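One way to act on the table above is to route only those attrs to the VLM where it beats the model. The snippet is a minimal sketch of that selection: the accuracies are copied from the table, while the dictionary keys other than hard_hat and the variable names are hypothetical.

```python
# attr -> (n with ground truth, VLM overall acc %, model overall acc %)
results = {
    "hard_hat": (180, 82.8, 55.0),
    "reflective_vest": (125, 88.0, 91.2),
    "goggles": (27, 74.1, 92.6),
    "mask": (32, 81.2, 87.5),
}

# Keep only the attrs where the VLM's overall acc exceeds the model's.
verify_with_vlm = [attr for attr, (_, vlm, model) in results.items() if vlm > model]

# Gap in percentage points (positive means the VLM is ahead).
gaps = {attr: round(vlm - model, 1) for attr, (_, vlm, model) in results.items()}
print(verify_with_vlm, gaps)
```

On these numbers only hard_hat qualifies. The goggles and mask rows rest on 27 and 32 samples, so their gaps are less certain than the others.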

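6. Recommendation to the operator: whether to build it, and how

Recommendation: worth building. VLM double-checking gives a real gain for PPE (at least for hard_hat), and the most pragmatic deployment is an FP-reduction filter on the model's high-confidence detections.

Generated 2026-06-14 08:04 · 5090-2 · data/scripts in ~/vlm_verifier_research/ · VLM inference on gx10-4t (production untouched)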