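Can AI already replace dermatologists in seeing patients?

Not yet. In the 2026 JAAD study, the DEXI AI misclassified 10 of 114 dermoscopic images, and 5 of 20 melanomas were labeled as ordinary nevi, a miss rate that is not clinically acceptable. A reasonable role for AI today is a "second set of eyes"; the final diagnosis still requires a physician to integrate history, naked-eye examination, dermoscopy, and biopsy when needed.

If an AI heat map highlights the same areas a dermatologist looks at, does the AI think the same way?

Not necessarily. A heat map only shows which pixel regions are associated with the model output; it cannot prove that the AI uses the same clinical reasoning as a physician. Earlier studies documented AI being misled by rulers, hair, or light reflections in photographs, so heat-map overlap is a lead worth further study, not proof that AI can already read skin.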

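Can AI Help Dermatologists See Patients Now? A Combined Answer from Three 2026 JAAD Studies

One-sentence conclusion

The most reasonable current role for AI in dermatology is diagnostic support, explainability checking, and educational data augmentation, not replacing dermatologists. The JAAD study by Kremer et al shows that DEXI dermoscopy AI heat maps overlap considerably with dermatologists' eye-tracking heat maps, but similar heat maps do not mean the AI uses exactly the same clinical logic as a physician.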

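Patient education: the 8 questions patients ask most

Q1. Is a phone app that photographs my mole accurate?

Commercial phone apps vary enormously, and most have not published complete prospective clinical validation. The DEXI system in this study works on professional dermoscopic images (dermoscopy is the polarized magnifier physicians use) combined with a commercial AI system, which is very different from a casual phone snapshot. A phone app can serve as a reminder to see a physician; it cannot replace a formal diagnosis.

Q2. Can AI already "rule out" melanoma for me?

Not yet. In this 2026 JAAD study, the DEXI AI read 114 dermoscopic images and misclassified 10; most importantly, 5 of the 20 melanomas were labeled as ordinary nevi (a 25% miss rate). That level of accuracy cannot be used on its own to exclude a malignant lesion. If a mole changes (growth, asymmetry, irregular border, uneven color, bleeding, elevation), see a dermatologist directly.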

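Q3. AI looks at the same places as the physician. Does that mean AI can already read skin?

A heat map only shows which regions are associated with the AI's output; it cannot prove that the AI is reasoning with the same clinical logic. Earlier studies found that AI can be "fooled" by rulers, hair, and light reflections in a photograph, producing an answer that looks right for the wrong reason. So "the AI heat map resembles a physician's" is a good start, not the finish line.

Q4. Can I compare my own skin against "skin disease photos" generated by ChatGPT-style AI?

Not recommended. In the same JAAD Reviews issue, Lipner cautions that generative AI images in dermatology still underrepresent skin tones and reproduce disease features inaccurately. Their most reasonable use is as teaching material for physicians or for data augmentation, not for the public to match against their own rash for self-diagnosis.

Q5. Will future visits be with AI only?

On current evidence, the more reasonable picture is "AI as a second set of eyes". AI can remind the physician that an area deserves another look and can help prioritize higher-risk lesions for scheduling, but history taking, overall naked-eye judgment, and biopsy when needed remain things only a physician can do. AI changes the workflow; it does not replace the physician.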

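Q6. Is AI accurate for eczema and psoriasis?

Much less accurate than for moles (the melanoma domain). In 2026 JAAD, Mahajan et al tested GPT-5, Gemini-2.5-Pro, and Janus-pro-7b on 1,758 clinical photographs covering 12 inflammatory skin diseases. Overall accuracy was only 46.2% for GPT-5, 45.1% for Gemini, and 30.8% for Janus. The same GPT-5 reached 60.5% for atopic dermatitis (eczema) but only 18.1% for psoriasis and 30% for erythema multiforme.
In other words, inflammatory skin diseases are numerous and their patterns overlap, and AI is not yet accurate enough for self-diagnosis with a phone app. If an AI calls your rash "psoriasis", it could actually be eczema, rosacea, a lichenoid eruption, or something else.

Q7. During my clinic visit the physician recorded audio on a phone or tablet. What was that?

It was most likely an AI digital scribe, which automatically transcribes the physician-patient conversation into a draft note during the visit; the physician then edits it and saves it to the electronic medical record. In 2026 JAAD, Cao et al at the Medical College of Wisconsin in the United States evaluated such a tool in 56 physicians (including 12 dermatologists). Dermatologists saved about 15.4 minutes of documentation time per day, and the share of note content typed by the physician (PNC) fell from 95.5% to 43.1%, meaning the AI now does more than half of the drafting. For patients, the benefit is that the physician has more time to look at you and spends less time typing with head down. On privacy, hospitals typically set rules that recordings are deleted once processed and are not shared externally. If you do not want to be recorded, tell the physician directly and the feature will be turned off.

Q8. My skin is darker. Is AI as accurate for me as for lighter skin?

Current evidence says it is not, and the gap is clear. Mahajan 2026 JAAD shows GPT-5 accuracy of 50.8% for Fitzpatrick 3-4 (medium skin tones) but only 37.5% for Fitzpatrick 5-6 (dark skin tones), a difference of 13 percentage points. Gemini and Janus show a gap in the same direction, and all reach statistical significance (p < 0.05). The reason is that these AI models are trained mainly on lighter-skin data and have learned too little about how disease presents on darker skin. Most of the population in Taiwan falls in Fitzpatrick 3-4, so AI performance is "middling"; for patients with Indigenous or Southeast Asian ancestry, or with darker skin in general, AI self-diagnosis is even less reliable, and an in-person assessment by a dermatologist matters more.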

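30-second takeaways

Lipner's JAAD Reviews commentary places dermatology AI on two main tracks: diagnosis / image analysis, and medical education / data augmentation.
Kremer et al compared the eye-tracking heat maps of 4 dermatologists with DEXI AI heat maps across 114 dermoscopic images.
The median pixel-wise correlation between dermatologists and DEXI was r = 0.540, close to the r = 0.591 among dermatologists themselves and higher than the r = 0.434 of mismatched comparisons.
DEXI misclassified 10/114 images in this study, and 5/20 melanomas were labeled as nevi; this article therefore does not read the results as "AI can independently rule out melanoma".
Generative AI can be used for medical education and data augmentation, but Lipner also cautions that bias, underrepresentation, and domain accuracy still need validation.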

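Three current roles for AI in dermatology

Thinking of "AI in dermatology" only as "photograph a mole with a phone and let AI say whether it is melanoma" frames the question too narrowly. Lipner's 2026 JAAD Reviews commentary puts two recent articles in one context: one addresses the explainability of diagnostic AI, the other the use of text-to-image models in medical education and data augmentation.

Role	Reasonable use today	Main risk
Diagnostic support	Analyze dermoscopic images as a second opinion or a risk-stratification cue.	Dataset bias, variation in image quality, too few rare patterns, and melanoma can still be missed.
Explainability tool	Use a heat map / saliency map to show where the model attends, so physicians can check whether the AI is looking at clinically meaningful locations.	A heat map is not causal proof; different methods can produce different maps, and maps can be driven by image artifacts.
Education and data augmentation	Build teaching material, supplement images of rare diseases, and expand training data.	May perpetuate bias; generated images may not match true pathology or dermoscopic features.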

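How was the Kremer 2026 JAAD study done? (Comparing dermatologist and AI heat maps in dermoscopic image analysis: an eye-tracking study)

Kremer et al asked a specific question: do the heat maps produced by an AI dermoscopy classifier actually fall on the diagnostic regions dermatologists look at? Four dermatologists viewed dermoscopic images without knowing the diagnosis while their eye movements were recorded, and the eye-tracking data were converted to heat maps; the same images were also run through the DEXI algorithm to produce class activation maps.

Design element	Details
Image sources and categories	Mainly HAM10000, with a small number from MSKCC and BCN200000; melanoma, BCC, SCC, nevi, benign keratoses, and vascular lesions, with 20 images originally planned per category.
Included in analysis	After 6 images were excluded for technical reasons, 114 remained: 60 benign, 54 malignant.
Readers	4 dermatologists, including 1 dermoscopy expert with more than 35 years of experience and 3 younger physicians.
AI system	DEXI (Dermoscopy EXplainable Intelligence), with heat maps generated by the Vectra software system.
Main analysis	Pixel-wise rank correlation between dermatologist eye-tracking heat maps and DEXI heat maps; correlation among dermatologists served as the upper reference and mismatched images as the lower reference.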
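
As a rough illustration of what a pixel-wise rank correlation between two heat maps involves, the sketch below flattens two equally sized maps, ranks the pixel intensities (ties share an average rank), and takes the Pearson correlation of the ranks, which is a Spearman-style coefficient. This is an assumption about the general technique only: the study's actual preprocessing (smoothing, resolution, masking) is not described here, and the function names and toy 3x3 maps are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def pixelwise_rank_correlation(map_a, map_b):
    """Spearman-style correlation over the flattened pixels of two heat maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("heat maps must have the same number of pixels")
    return pearson(average_ranks(flat_a), average_ranks(flat_b))


# Hypothetical toy maps: a gaze heat map and a model heat map for one image.
gaze = [[0.0, 0.2, 0.1], [0.3, 0.9, 0.4], [0.1, 0.5, 0.2]]
model = [[0.1, 0.3, 0.0], [0.2, 0.8, 0.6], [0.0, 0.4, 0.1]]
print(round(pixelwise_rank_correlation(gaze, model), 3))
```

Because only ranks are compared, any monotonic rescaling of one map (for example a different color scale) leaves the coefficient unchanged, which is one reason a rank-based measure suits heat maps produced by different pipelines.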

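Figure 1. Conceptual diagram of the study workflow. The point is not to compare "who diagnoses more accurately" but whether AI heat maps overlap with the regions of physicians' visual attention.

How should the main numbers be read?

0.540: median correlation between dermatologist gaze heat maps and DEXI heat maps.

0.591: median correlation among the dermatologists' own heat maps, the upper reference value.

0.434: null correlation between DEXI and eye-tracking heat maps from a different image, the lower reference value.

The most conservative reading of these numbers is that DEXI heat maps overlap substantially with dermatologists' regions of visual attention, at a level close to the overlap among dermatologists themselves. This supports the idea that DEXI may have a degree of explainability, meaning its predictions do not rest entirely on locations unrelated to human diagnosis.

That is not the same as "AI already diagnoses like a physician". Kremer et al also report that DEXI misclassified 10/114 images; 5/20 melanomas were labeled as nevi, an error that cannot be played down clinically. The study is therefore better read as answering "are the AI's explanation maps worth further study" than as declaring "AI can rule out melanoma".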
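
To make the size of these error counts concrete, the sketch below recomputes the two proportions from the reported counts (5 of 20 melanomas missed, 10 of 114 images misclassified) and attaches a Wilson score interval to each. The intervals are an added illustration and are not reported in the source; they are included only to show how wide the uncertainty around a proportion from 20 lesions is. The function name is hypothetical.

```python
from math import sqrt


def wilson_interval(events, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("need 0 <= events <= total and total > 0")
    p = events / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Counts taken from the text above.
missed_melanomas, melanomas = 5, 20
misclassified, images = 10, 114

miss_rate = missed_melanomas / melanomas
error_rate = misclassified / images
miss_low, miss_high = wilson_interval(missed_melanomas, melanomas)
err_low, err_high = wilson_interval(misclassified, images)

print(f"melanoma miss rate: {miss_rate:.1%} (interval {miss_low:.1%} to {miss_high:.1%})")
print(f"overall error rate: {error_rate:.1%} (interval {err_low:.1%} to {err_high:.1%})")
```

With only 20 melanomas the interval around the miss rate is wide, which is another reason the study cannot support using the system as a standalone rule-out tool.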

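Why do gaze maps and fixation maps differ?

In the study, the dermatologist-DEXI correlation was higher for gaze heat maps than for fixation heat maps (about r = 0.53 vs r = 0.46, P < .001). The authors discuss one possible explanation: diagnostic anchors may be picked up during early, brief, global visual responses, whereas later fixations may reflect the detailed search each physician performs to confirm the diagnosis and therefore vary more between individuals.

Overlap was higher for misdiagnosed lesions. What does that mean?

The study found that for lesions the dermatologists diagnosed incorrectly, the dermatologist-DEXI heat-map correlation was higher than for correctly diagnosed lesions (r = 0.568 vs 0.521). The authors speculate that difficult lesions are viewed longer and scanned more widely, so more regions overlap with the DEXI heat map. This cannot be read as "higher overlap means higher accuracy"; rather, it is a reminder that heat-map overlap is an explainability cue, not an accuracy metric.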

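Limits of heat-map similarity: looking plausible is not a causal explanation

The warning most worth taking away for residents is that "a heat map is not a recording of the model's thinking". A heat map can show which pixel regions are associated with the model output, but it does not necessarily prove that the model used those clinical features to classify. Kremer et al also note that different saliency methods can produce different heat maps; past experience from the ISIC Challenge likewise shows that models may attend to image artifacts rather than true lesion features.

Methodological limitations

Small sample size in each lesion category, which limits subgroup interpretation.
No lesion size data, which may have affected heat-map correlations.
Heat-map overlap cannot prove that the AI's basis for classification matches a physician's dermoscopic logic.
The study did not directly test whether AI improves clinician diagnostic accuracy, confidence, or patient outcomes.
DEXI is a specific commercial system with a specific data workflow; the results cannot be extrapolated directly to every AI dermoscopy app.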

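Key points for residents: how to read these studies

1. This is not an AI accuracy paper but an explainability paper

Do not center the discussion on "is DEXI accurate". The main question is whether the regions the model attends to resemble the regions of physicians' visual attention.

2. Explainability is not a nice-to-have but a threshold for clinical adoption

If a model gives only a score, it is hard for a physician to know whether it saw pigment network, asymmetry, border irregularity, and color heterogeneity, or was merely influenced by a ruler mark, hair, or illumination artifact. An explainability tool at least provides an entry point for checking.

3. Generative AI can support education, but dermatologists need to gatekeep

Lipner's summary of text-to-image models is cautious and clear: they may help with rare disease imaging, data augmentation, and medical education, but problems of bias and domain-specific accuracy remain. In dermatology, skin-tone representation, disease morphology, photographic conditions, and pathologic fidelity all require human review.

Clinical practice: the most reasonable place for AI is as a second set of eyes

At this stage, the most reasonable place for AI in dermatology is not final arbiter but second set of eyes: reminding the physician that an area deserves a look, providing teaching comparisons, and helping produce model output that can be audited. For patients, an AI app cannot replace a complete history, naked-eye examination, dermoscopy, follow-up imaging, and biopsy when needed. For physicians, AI cannot replace dermoscopy training either; on the contrary, it pushes the question of "what exactly are we looking at" to a place where it has to be spelled out more clearly.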


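Mahajan 2026 JAAD: evaluating multimodal large language models on inflammatory skin diseases (cross-sectional study)

The cross-sectional study by Mahajan et al (Brigham & Women's Hospital, Harvard), published in June 2026, is currently the largest single study to evaluate mainstream mLLMs side by side on inflammatory skin diseases (ISDs). Key design points: 1,758 clinical photographs from the Stanford + Google Skin Condition Image Network (SCIN), covering 12 ISDs (atopic dermatitis, urticaria, psoriasis, acne, leukocytoclastic vasculitis, lichenoid eruption, rosacea, granuloma annulare, seborrheic dermatitis, cutaneous lupus erythematosus, erythema multiforme, and pityriasis lichenoides). Three models were tested: GPT-5 (proprietary), Gemini-2.5-Pro (proprietary), and Janus-pro-7b (open-source). The primary outcome was diagnostic accuracy with 95% CI, stratified by disease, Fitzpatrick skin type, age, morphologic pattern, and anatomic site.

Main result: overall accuracy below 50%, with very large differences between diseases

Overall accuracy: GPT-5 46.2% (95% CI 42.1-50.3), Gemini 45.1%, Janus 30.8%. These look similar, but the gaps at the disease level are large and each model has different strengths. For example, GPT-5 reached 60.5% for atopic dermatitis but only 18.1% for psoriasis; Gemini, conversely, reached 81.8% for seborrheic dermatitis but only 10% for cutaneous lupus erythematosus. Janus scored 0% for granuloma annulare but 52.2% for vasculitis.

Figure: recognition accuracy of three mainstream multimodal LLMs across 12 inflammatory skin diseases. Overall accuracy is below 50% (dashed line), but the differences between diseases are very large and the models' strengths do not coincide (for example, GPT-5 is stronger on granuloma annulare and Gemini on seborrheic dermatitis). The clinical implication is that no single model can serve as a general-purpose ISD screening tool; an ensemble or specialist verification is needed. LCV = leukocytoclastic vasculitis, 環肉 (figure label) = granuloma annulare, PL = pityriasis lichenoides.

Secondary result: skin-tone and age bias are systematic

All three models were significantly less accurate on Fitzpatrick 5-6 (dark skin tones) than on Fitzpatrick 3-4 (medium skin tones), with all p values < 0.05. GPT-5 fell from 50.8% at FST 3-4 to 37.5% at FST 5-6, a difference of 13 percentage points. The age trend is also clear: accuracy was highest in the 60-69 age group (GPT-5 51.2%) and lowest at 18-29 (34.7%), possibly because rash photographs from older patients are closer to the typical textbook presentations in the training data, while early or atypical lesions in younger patients are less often included.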
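
The "13 percentage points" figure is an absolute difference; the same gap can also be expressed relative to the reference group, which gives a different and larger-looking number. The short sketch below recomputes both from the two GPT-5 accuracies quoted above. The relative figure is derived arithmetic rather than a number reported in the source, and the function name is hypothetical.

```python
def accuracy_gap(reference_pct, comparison_pct):
    """Return (absolute gap in percentage points, relative drop in percent)."""
    absolute = reference_pct - comparison_pct
    relative = absolute / reference_pct * 100
    return absolute, relative


# GPT-5 accuracies quoted in the text, in percent.
fst_3_4 = 50.8
fst_5_6 = 37.5

points, relative_drop = accuracy_gap(fst_3_4, fst_5_6)
print(f"absolute gap: {points:.1f} percentage points")
print(f"relative drop versus FST 3-4: {relative_drop:.1f}%")
```

Keeping the two measures separate avoids the common mistake of reporting a percentage-point gap as if it were a percent change.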

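Figure: accuracy of the three mLLMs across Fitzpatrick skin-type groups. GPT-5 reached 50.8% at FST 3-4 but fell to 37.5% at FST 5-6, a difference of 13 percentage points (p < 0.05). In Taiwanese clinical practice most of the population falls in FST 3-4, but for patients with darker skin (Southeast Asian or Indigenous ancestry) the reliability of AI tools should be discounted.

Comparison with the melanoma domain: why are ISDs harder?

The most striking aspect of this study is the direct contrast with existing melanoma AI results. Several earlier systematic reviews report that mLLMs can reach 70-85% recognition accuracy for melanoma and pigmented lesions (Zarfati 2024 JCM, Daneshjou 2022 Sci Adv), in some cases approaching the level of board-certified dermatologists. Why does ISD performance drop so much?

Comparison item	Melanoma domain	ISD domain
Number of disease classes	Mainly binary (melanoma vs non-melanoma) or a few classes	12 or more classes, with overlapping morphology
Image type	Dermoscopy, standardized and high resolution	Clinical photos, with large variation in camera / lighting / angle
Training data volume	ISIC archive with hundreds of thousands of images, fully annotated	SCIN and similar datasets are relatively small and spread across many diseases
Variability of presentation	Bounded (asymmetry, border, color, diameter, evolution)	Very large; strongly affected by age / skin tone / stage / site
Clinical task	Screening / risk stratification; well suited to binary decisions	Diagnosis + treatment, multi-class, requires contextual reasoning

This also explains why Mahajan's conclusion is restrained: the current generation of mLLMs is not yet ready for ISD diagnosis but has potential as a screening / triage tool. Set against the explainability conclusion of Kremer 2026, AI being able to "look in a reasonable place" does not mean it will "give a reasonable diagnosis", and the latter is harder for ISDs than for melanoma.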

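Cao 2026 JAAD: impact of AI digital scribes on workflow (high-volume vs low-volume outpatient specialties compared with dermatology)

This June 2026 brief report by Cao et al (Medical College of Wisconsin) shifts the focus from "AI seeing patients" to "AI writing notes". Design: single-center retrospective cohort. Of 72 physicians who received training, 56 (77.8%) used the AI digital scribe (Nuance Dragon Ambient eXperience, DAX, 2021 version) continuously for more than one month and were included in the analysis, spanning 16 specialties and including 12 dermatologists. Data covered 2021/02 to 2023/07; EMR time metrics were extracted with Epic Signal and compared against each physician's own pre-use baseline (self-control). Independent-samples t tests were used, with p < .05 considered significant.

Main result: dermatology saves 15.4 minutes per day, and AI takes over about half of note drafting

Among the 12 dermatologists, total daily EMR/note time fell from 82.6 to 67.2 minutes, a saving of 15.4 minutes (p = 0.002); note time during working hours fell from 48.5 to 37.2 minutes (-11.3 min, p = 0.002); after-hours EMR time went from 34.1 to 30.0 minutes (NS, p = 0.11). The most dramatic metric was the share of note content typed by the physician (Provider Note Contribution, PNC), which fell from 95.5% to 43.1%, with the AI taking over more than half of the drafting. Note length (NL) actually rose from 4,446 to 4,913 characters (p = 0.002) without an increase in time, suggesting that AI-generated content (templated / boilerplate) accounts for much of the note.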
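
The before/after figures in the paragraph above can be checked with simple arithmetic. The sketch below stores each reported pair, computes the absolute and percent change, and confirms that in-hours and after-hours time add up to the reported daily total. The dictionary keys are descriptive labels chosen for this example, not the names of Epic Signal fields, and the percent changes are derived values that the source does not report.

```python
# Reported dermatology values from the text: (before, after).
metrics = {
    "total EMR/note time per day (min)": (82.6, 67.2),
    "in-hours note time (min)": (48.5, 37.2),
    "after-hours EMR time (min)": (34.1, 30.0),
    "provider note contribution (%)": (95.5, 43.1),
    "note length": (4446, 4913),
}


def change(before, after):
    """Return (absolute change, percent change relative to the baseline)."""
    delta = after - before
    return delta, delta / before * 100


for label, (before, after) in metrics.items():
    delta, pct = change(before, after)
    print(f"{label}: {before} -> {after} ({delta:+.1f}, {pct:+.1f}%)")

# Consistency check: the two time components should sum to the daily total.
for idx, phase in ((0, "before"), (1, "after")):
    parts = metrics["in-hours note time (min)"][idx] + metrics["after-hours EMR time (min)"][idx]
    total = metrics["total EMR/note time per day (min)"][idx]
    print(f"{phase}: components sum to {parts:.1f}, reported total {total:.1f}")
```

Note that the drop in provider note contribution is a change of 52.4 percentage points, which should not be confused with a 52.4% reduction in time.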

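Figure: change in total daily EMR/note time by specialty group before and after use of the AI digital scribe (Nuance DAX). Dermatology, internal medicine, and internal medicine subspecialties saved time significantly (green, all p < 0.05); orthopedics, plastics / ENT, neurology, urology, and oncology / palliative care did not reach statistical significance, and the oncology / palliative care group even showed a trend of spending 25.4 more minutes after adoption. Cao et al speculate that for low-volume, complex, individualized clinic types, the cost of editing the AI draft may exceed that of typing the note oneself.

High-volume vs low-volume clinics: who actually benefits?

The authors divided the 56 physicians by whether a half-day clinic exceeded 10 patients into high-volume clinicians (HVCs, n = 34, including the 12 dermatologists) and low-volume clinicians (LVCs, n = 22). HVCs saved 2.0 minutes per appointment (p < .001) and 28.3 minutes per day (p < .001), freeing up 1.3-2.4 hours per week. LVCs went the other way: note time during working hours also fell (-9.3 min, p = .003), but after-hours EMR time rose by 16.4 minutes (p = .004), for a negative net effect overall.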
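
The grouping rule and the weekly projection described above can be written as two small functions. This is a minimal sketch: the threshold of more than 10 patients per half-day clinic comes from the text, while the number of clinic days per week is a hypothetical parameter, since the source does not state how the 1.3-2.4 hour range was derived.

```python
def volume_group(patients_per_half_day, threshold=10):
    """Label a clinician HVC when a half-day clinic exceeds the threshold."""
    return "HVC" if patients_per_half_day > threshold else "LVC"


def weekly_hours_saved(minutes_saved_per_day, clinic_days_per_week):
    """Convert a daily saving in minutes into hours per week."""
    if clinic_days_per_week < 0:
        raise ValueError("clinic days per week cannot be negative")
    return minutes_saved_per_day * clinic_days_per_week / 60


# 28.3 minutes per day is the reported HVC saving; 5 clinic days is an assumption.
print(volume_group(35))
print(round(weekly_hours_saved(28.3, 5), 1))
```

Under the assumed five clinic days, the reported daily saving lands near the upper end of the 1.3-2.4 hour weekly range quoted in the text.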

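Figure: different benefit patterns of HVCs and LVCs with the AI digital scribe. HVCs saved significantly on every metric, freeing up 1.3-2.4 hours per week; LVCs saved time during working hours, but after-hours EMR time rose by 16.4 minutes (p = 0.004), for a negative net effect. The suggested reason: each low-volume visit is more complex, the AI draft needs more manual editing, and that work accumulates until after hours.

Implications for Taiwan NHI clinics: how to use the three pieces of evidence

The pattern of Taiwan NHI outpatient care happens to match the HVC condition in the Cao study: in most dermatology clinics and hospitals, a half-day clinic far exceeds 10 patients (30-60 is common), the disease mix is highly structured (acne, eczema, onychomycosis, rosacea, and corns make up the bulk), and notes are heavily templated. In theory, the ROI of an AI digital scribe in Taiwan NHI clinics should be higher than for the 12 dermatology HVCs in the study, possibly freeing up 2-4 hours per week.

But there are several Taiwan-specific caveats:

① Language: Taiwanese clinics mix Taiwanese Hokkien, Mandarin, and medical English, so DAX-like tools trained mainly on English may recognize speech less accurately; the more mature multilingual tools are still strongest in English, and localized versions should be evaluated.
② Personal data and medical law: under the Personal Data Protection Act and the Medical Care Act, recording requires the patient's explicit consent, recordings must be deleted once processed, and they must not be uploaded to servers outside the country (some cloud-based AI services need special review). Adoption is best preceded by review from the hospital's information security team and medical ethics committee.
③ NHI reimbursement: the National Health Insurance Administration has not yet opened a reimbursement item for "AI diagnosis"; AI note writing, however, is an administrative workflow improvement that does not involve reimbursement claims, so each hospital can adopt it on its own. If AI image recognition (for example dermoscopy AI screening) is reimbursed in the future, it would need separate IDE approval, following the route of GIM / cytology AI screening.
④ Mahajan's warning for Taiwan: psoriasis, rosacea, and lichenoid eruption, all common in Taiwan, are among the diseases on which GPT-5 performed worst (psoriasis 18%, rosacea 42%, lichenoid 34%). Patients who self-diagnose with ChatGPT or a phone app face a considerable chance of error. If AI is used clinically for initial triage, a "final physician review" mechanism should be set up alongside it, and medication should not be prescribed directly from the AI result.
⑤ Kremer's lesson for Taiwan: DEXI shows high explainability at the dermoscopy level, but melanoma is still missed; although skin cancer incidence (including melanoma) in Taiwan is lower than in White populations, acral melanoma (melanoma of the palms and soles) makes up a larger share in Asian patients and is scarce in AI training data, so more caution is warranted.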

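The three fronts combined: opportunities vs risks for AI in dermatology

Placing the three JAAD 2026 studies on one 2D chart shows how AI in dermatology is distributed by "maturity vs clinical impact":

Figure: relative positions of the three fronts of AI in dermatology on a "maturity × clinical impact" quadrant. Writing notes (Cao): high maturity and high impact, most immediately useful in high-volume clinics (including Taiwan NHI clinics). Explaining itself (Kremer): moderate methodological maturity, with clinical impact depending on supporting mechanisms. Reading images: already practical in the melanoma domain, but in the ISD domain (Mahajan) accuracy is only 46% with skin-tone bias, so it is not ready for independent clinical use.

Current status of the three fronts in summary:

Writing notes: already practical; adopt first in high-volume clinics. Privacy and localization safeguards must be in place.
Reading images: usable as a second opinion for melanoma; ISD is still at the screening stage and must not be used independently. Accuracy drops clearly for darker skin tones.
Explaining itself: the methods are maturing, but "similar heat maps ≠ similar reasoning". It should serve as a health check on "is the AI looking in a reasonable region", not as proof that "the AI's reasoning is correct".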

References

Lipner SR. Highlights from JAAD Reviews: May 2026 - Artificial intelligence: Where are we now? J Am Acad Dermatol. 2026;94(5):1434-1435. doi:10.1016/j.jaad.2026.02.070.
Kremer N, Polo-Silveira L, Bajaj S, et al. Comparing dermatologists' and artificial intelligence heat maps in dermoscopic image analysis via eye tracking. J Am Acad Dermatol. 2026;94(5):1461-1468. doi:10.1016/j.jaad.2025.12.104.
Chanda T, Haggenmueller S, Bucher TC, et al. Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study. Nat Commun. 2025;16(1):4739. doi:10.1038/s41467-025-59532-5.
Hauser K, Kurz A, Haggenmueller S, et al. Explainable artificial intelligence in skin cancer recognition: a systematic review. Eur J Cancer. 2022;167:54-69. doi:10.1016/j.ejca.2022.02.025.
Brancaccio G, Balato A, Malvehy J, Puig S, Argenziano G, Kittler H. Artificial intelligence in skin cancer diagnosis: a reality check. J Invest Dermatol. 2024;144(3):492-499. doi:10.1016/j.jid.2023.10.004.
Giavina-Bianchi M, Vitor WG, Fornasiero de Paiva V, Okita AL, Sousa RM, Machado B. Explainability agreement between dermatologists and five visual explanations techniques in deep neural networks for melanoma AI classification. Front Med (Lausanne). 2023;10:1241484. doi:10.3389/fmed.2023.1241484.
Mahajan A, Whittelsey M, Grullon K, et al. Multimodal large language models for inflammatory skin disease evaluation: A cross-sectional study. J Am Acad Dermatol. 2026;94(6):1788-1790. doi:10.1016/j.jaad.2026.01.079.
Cao DY, Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: Impact on workflow among high-volume versus low-volume specialties compared with dermatology. J Am Acad Dermatol. 2026;94(6):1819-1820. doi:10.1016/j.jaad.2026.02.037.
Cao DY, Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: pilot study assessing the impact on dermatologist workflow and patient encounters. JAAD Int. 2024;15:149-151. doi:10.1016/j.jdin.2024.02.009.
Zarfati M, Nadkarni GN, Glicksberg BS, et al. Exploring the role of large language models in melanoma: a systematic review. J Clin Med. 2024;13(23):7480. doi:10.3390/jcm13237480.
Daneshjou R, Vodrahalli K, Novoa RA, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022;8(32):eabq6147. doi:10.1126/sciadv.abq6147.
Ward A, Li J, Wang J, et al. Creating an empirical dermatology dataset through crowdsourcing with web search advertisements. JAMA Netw Open. 2024;7(11):e2446615. doi:10.1001/jamanetworkopen.2024.46615.
Duggan MJ, Gervase J, Schoenbaum A, et al. Clinician experiences with ambient scribe technology to assist with documentation burden and efficiency. JAMA Netw Open. 2025;8(2):e2460637.
Haberle T, Cleveland C, Snow GL, et al. The impact of nuance DAX ambient listening AI documentation: a cohort study. J Am Med Inform Assoc. 2024;31(4):975-979.

Bottom Line

TL;DR: The current role of AI in dermatology is best framed as diagnostic support, explainability review, and education / data augmentation, not replacement of dermatologists. Kremer et al's JAAD eye-tracking study shows meaningful overlap between DEXI dermoscopy AI heat maps and dermatologist gaze heat maps, but similar heat maps do not prove that AI uses the same clinical reasoning as human readers.

For Patients: 5 Common Questions

Q1. Are smartphone mole-check apps accurate?

Available apps vary widely and most lack rigorous prospective clinical validation. The DEXI system used in this study runs on professional dermoscopy images plus a commercial AI workflow — very different from a casual phone photo. Apps can prompt you to see a dermatologist, but they cannot replace clinical diagnosis.

Q2. Can AI safely rule out melanoma for me?

Not yet. In this 2026 JAAD study, DEXI misclassified 10/114 dermoscopic images; critically, 5 out of 20 melanomas were classified as ordinary nevi (25% miss rate). That is not safe enough for AI to act as a standalone rule-out tool. If a mole changes (growth, asymmetry, irregular border, color variation, bleeding, elevation), see a dermatologist.

Q3. AI looks at the same areas dermatologists do — does that mean AI is reading skin like a human?

A heat map shows regions associated with model output, but it does not prove the model uses the same clinical reasoning. Prior work has shown AI being misled by ruler marks, hair, or illumination artifacts. "Heat maps look like dermatologists" is a promising starting point, not a finish line.

Q4. Can I use generative AI images to compare my rash at home?

No. Lipner notes in the same JAAD Reviews issue that generative AI images in dermatology still struggle with skin-tone representativeness and morphologic accuracy. They suit teaching material and data augmentation — not lay self-diagnosis.

Q5. Will future visits be AI-only?

Realistically, AI is heading toward a "second set of eyes" role. AI can flag regions for a closer look or help triage higher-risk lesions, but history, holistic visual assessment, and biopsy when needed remain physician tasks. AI changes the workflow, not the role of the dermatologist.

30-second takeaways

Lipner's JAAD Reviews commentary frames dermatology AI along two active fronts: diagnosis / image analysis and medical education / data augmentation.
Kremer et al compared eye-tracking heat maps from 4 dermatologists with DEXI AI heat maps across 114 dermoscopic images.
The median dermatologist-DEXI pixel-wise correlation was r = 0.540, approaching inter-dermatologist agreement at r = 0.591, and higher than null comparisons at r = 0.434.
DEXI misclassified 10/114 images; importantly, 5/20 melanomas were misclassified as nevi. This should prevent any over-reading that AI can independently rule out melanoma.
Generative AI may support medical education and data augmentation, but Lipner emphasizes ongoing problems with bias, representativeness, and domain-specific accuracy.

Three Current Roles for AI in Dermatology

If we reduce dermatology AI to "take a phone photo of a mole and get a melanoma answer," we miss the real landscape. Lipner's 2026 JAAD Reviews commentary links two complementary directions: explainable diagnostic AI and generative models for education and data augmentation.

Role	Reasonable use today	Main risk
Diagnostic support	Analyze dermoscopic images as a second-read or risk-stratification aid.	Dataset bias, image-quality variation, underrepresentation of rare patterns, and missed melanoma.
Explainability tool	Use heat maps or saliency maps to ask whether the model is attending to clinically meaningful areas.	Heat maps are not causal proof; different methods may produce different maps and may attend to artifacts.
Education and data augmentation	Create teaching material, augment rare-disease imagery, and support training sets.	Generated images may perpetuate bias and may not preserve true morphologic or pathologic features.

Taiwan NHI context

This article is about AI roles and research methodology, not drug treatment. It therefore does not include Taiwan NHI drug reimbursement criteria. If a future article discusses melanoma treatment, immune checkpoint inhibitors, or BRAF/MEK targeted therapy, reimbursement criteria should be checked separately against pathology, staging, BRAF testing, ECOG status, imaging, and prior-authorization requirements.

How Was the Kremer 2026 JAAD Study Designed?

Kremer et al asked a focused question: do AI-generated heat maps in dermoscopy highlight the same regions that dermatologists visually inspect? Four dermatologists, blinded to diagnosis, reviewed dermoscopic images while their eye movements were tracked. The same images were analyzed by DEXI to generate class activation maps, and the overlap was measured using pixel-wise rank correlation.

Design element	Details
Images and lesion types	Mainly HAM10000, with a small additional contribution from MSKCC and BCN200000; melanoma, BCC, SCC, nevi, benign keratoses, and vascular lesions.
Final analysis set	Six images were excluded for technical reasons, leaving 114 images: 60 benign and 54 malignant.
Readers	Four dermatologists, including one dermoscopy expert with over 35 years of experience and three younger dermatologists.
AI system	DEXI (Dermoscopy EXplainable Intelligence), implemented through Vectra software.
Main analysis	Pixel-wise rank correlation between dermatologist gaze maps and DEXI maps; inter-dermatologist correlations served as the upper reference, and non-homologous pairings as the lower reference.

How Should We Read the Main Numbers?

0.540: Median correlation between dermatologist gaze maps and DEXI heat maps.

0.591: Median correlation among dermatologists, the upper reference.

0.434: Median null correlation between DEXI and non-homologous dermatologist maps.

The most conservative interpretation is that DEXI heat maps substantially overlap with dermatologist visual attention and approach the agreement seen among dermatologists themselves. This supports potential interpretability: the model is not simply attending to areas obviously unrelated to human diagnostic inspection.

It does not mean that AI diagnoses like a dermatologist. DEXI misclassified 10/114 images; 5/20 melanomas were misclassified as nevi. That is clinically consequential and should prevent any conclusion that AI can independently exclude melanoma.

Why did gaze maps differ from fixation maps?

Dermatologist-DEXI correlation was higher for gaze maps than fixation maps (about r = 0.53 vs r = 0.46, P < .001). The authors suggest that diagnostically important anchors may be seen briefly during early global visual processing, whereas later fixation patterns may reflect individual confirmation strategies and therefore vary more between readers.

Why was overlap higher in incorrectly diagnosed lesions?

Dermatologist-DEXI heat map correlation was higher for incorrectly diagnosed lesions than correctly diagnosed lesions (r = 0.568 vs 0.521). The likely interpretation is not "higher overlap equals greater accuracy," but rather that difficult lesions may trigger longer and broader visual search, creating more overlap with AI maps.

Limits: Plausible Heat Maps Are Not Causal Explanations

The most important methodological warning is that a heat map is not a recording of model reasoning. It can show regions associated with model output, but it does not prove that the model used the same dermoscopic logic as a human reader. Kremer et al note that different saliency methods can yield different maps, and prior work has shown that models may attend to image artifacts rather than lesion features.

Methodological limits

Small sample size within each lesion subtype.
Lesion size data were unavailable and may have influenced correlations.
Heat-map overlap does not prove shared clinical reasoning.
The study did not test whether AI improved clinician diagnostic accuracy, confidence, or patient outcomes.
DEXI is a specific commercial AI system and workflow; the results should not be generalized to every dermoscopy app.

Resident Takeaways for Journal Club

1. This is an explainability paper, not primarily an accuracy paper

The key question is not "Is DEXI accurate?" but "Does DEXI attend to areas that human dermatologists visually inspect?"

2. Explainability is a clinical adoption threshold

If a model only gives a score, clinicians cannot tell whether it noticed pigment network, asymmetry, border irregularity, color heterogeneity, or whether it was distracted by ruler marks, hair, illumination, or other artifacts.

3. Generative AI may help education, but dermatologists must audit it

Lipner's summary of text-to-image models is appropriately cautious: they may help rare disease imaging, data augmentation, and medical education, but bias and domain-specific accuracy remain unresolved.

Clinical Position: AI as a Second Set of Eyes

At present, AI is best used as a second set of eyes: to flag regions worth review, support teaching, and make model output more auditable. For patients, AI apps cannot replace history, clinical examination, dermoscopy, longitudinal follow-up, and biopsy when indicated. For clinicians, AI should sharpen rather than replace dermoscopy training.

