Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis
Abstract
The rapid advancement of artificial intelligence (AI) in healthcare has significantly improved diagnostic accuracy and clinical decision-making. This review examines four pivotal studies that illustrate the integration of large language models (LLMs) and multimodal systems into medical diagnostics. BioBERT demonstrates the efficacy of domain-specific pretraining on biomedical text, boosting performance on tasks such as named entity recognition, relation extraction, and question answering. Med-PaLM, a large-scale language model tailored for clinical question answering, leverages instruction prompt tuning to improve accuracy and reduce harmful outputs, as validated on the MultiMedQA benchmark. DR.KNOWS integrates medical knowledge graphs with LLMs, grounding model predictions in structured medical knowledge to strengthen diagnostic reasoning and interpretability. Medical Multimodal Foundation Models (MMFMs) fuse textual and imaging data to advance tasks such as segmentation, lesion detection, and automated report generation. Together, these studies underscore the importance of domain adaptation, structured knowledge integration, and multimodal data fusion in building robust and interpretable AI-driven diagnostic tools.
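The knowledge-graph grounding described above can be made concrete with a toy sketch: findings extracted from a clinical note are looked up in a small directed graph, and diagnosis nodes reachable within a few hops are scored by how many findings point to them and how directly. Everything here (the graph, the concept names, the scoring rule) is a simplified assumption for illustration, not the DR.KNOWS algorithm itself and not clinical guidance.

```python
from collections import deque

# Toy directed graph in the spirit of a UMLS-style knowledge graph:
# edges point from findings/conditions toward related diagnoses.
# All concepts and edges are illustrative placeholders.
TOY_KG = {
    "fever": ["infection", "influenza"],
    "cough": ["influenza", "pneumonia"],
    "chest pain": ["pneumonia", "myocardial infarction"],
    "infection": ["sepsis"],
    "influenza": [],
    "pneumonia": ["sepsis"],
    "myocardial infarction": [],
    "sepsis": [],
}

DIAGNOSES = {"influenza", "pneumonia", "myocardial infarction", "sepsis"}

def candidate_diagnoses(findings, kg=TOY_KG, max_hops=2):
    """Rank diagnosis nodes reachable from the extracted findings.

    Each finding contributes 1/hops to every diagnosis it reaches
    within max_hops, so diagnoses linked to more findings by shorter
    paths rank higher.
    """
    scores = {}
    for finding in findings:
        if finding not in kg:
            continue
        seen = {finding}
        frontier = deque([(finding, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if node in DIAGNOSES and hops > 0:
                scores[node] = scores.get(node, 0.0) + 1.0 / hops
            if hops < max_hops:
                for nxt in kg.get(node, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, hops + 1))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Findings would come from an upstream NER model (e.g. BioBERT-style);
# here they are hard-coded. "influenza" ranks first because both
# findings reach it in one hop.
ranked = candidate_diagnoses(["fever", "cough"])
```

In a full system, the ranked candidates would be passed to an LLM as structured context, which is what lets the model's diagnostic reasoning be traced back to explicit graph paths rather than opaque parametric knowledge.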
Cite This Paper
Tong, R., Xu, T., Ju, X., & Wang, L. (2025). Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis. AI Med, 1(1), 5. doi:10.71423/aimed.20250105
References
- Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682
- Alsentzer E, Murphy J, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop; 2019:72-78. doi: 10.18653/v1/W19-1909
- Gao Y, Li R, Caskey J, Dligach D, Miller T, Churpek M, Afshar M. DR.KNOWS: leveraging a medical knowledge graph into large language models for diagnosis prediction. arXiv preprint arXiv:2308.14321; 2023. doi: 10.48550/arXiv.2308.14321
- Kwon T, Ong KT, Kang D, Moon S, Lee JR, Hwang D, Sim Y, Lee D, Yeo J. Clinical chain-of-thought: Reasoning-aware diagnosis framework with prompt-generated rationales. arXiv preprint arXiv:2312.07399; 2023. doi: 10.48550/arXiv.2312.07399
- McDuff D, Schaekermann M, Tu T, Palepu A, Wang A, Garrison J, Singhal K, Sharma Y, Azizi S, Kulkarni K, et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2307.08922; 2023. doi: 10.48550/arXiv.2307.08922
- Bian J, Wang S, Yao Z, Guo J, Zhang Q, Sun C, Windle SR, Liu X. GatorTron: a large language model for electronic health records. J Am Med Inform Assoc. 2022;29(2):283-291. doi: 10.1093/jamia/ocac005
- Singhal K, Tu D, Palepu A, Wang A, Sunshine J, Corrado GS. Med-PaLM: large language models encode clinical knowledge. arXiv preprint arXiv:2212.09162; 2022. doi: 10.48550/arXiv.2212.09162
- Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901. doi: 10.5555/3454287.3454612
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1; 2019:4171-4186. doi: 10.18653/v1/N19-1423
- Wu CK, Chen WL, Chen HH. Large language models perform diagnostic reasoning. arXiv preprint arXiv:2306.01567; 2023. doi: 10.48550/arXiv.2306.01567
- Rajpurkar P, Irvin J, Ball M, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul T, Langlotz D. Medical question answering with large language models. Nat Mach Intell. 2021;3:343-348. doi: 10.1038/s42256-021-00283-9
- Li X, Hu S, Liu J. Towards automatic diagnosis from multi-modal medical data. IEEE Trans Med Imaging. 2018;37(4):888-900. doi: 10.1109/TMI.2017.2781965
- Khader F, Ali H, Yousaf M. Medical diagnosis with large scale multimodal transformers: Leveraging diverse data for more accurate diagnosis. arXiv preprint arXiv:2212.09162; 2022. doi: 10.48550/arXiv.2212.09162
- Ma MD, Singh P, Smith R, Brown J. CliBench: multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders, and prescriptions. arXiv preprint arXiv:2406.09923; 2024. doi: 10.48550/arXiv.2406.09923
- Kumar A, Sharma S, Srinivasan P. Medimage: integrating multimodal data for medical diagnostics. arXiv preprint arXiv:2205.06109; 2022. doi: 10.48550/arXiv.2205.06109
- Ruan C, Wang F, Chen T. Comprehensive evaluation of multimodal ai models in medical imaging diagnosis. arXiv preprint arXiv:2406.07853; 2024. doi: 10.48550/arXiv.2406.07853
- Zhou H, Li X, Chen Y. Towards personalized multimodal medical diagnostics with large-scale ai models. arXiv preprint arXiv:2407.02164; 2024. doi: 10.48550/arXiv.2407.02164
- Baumgartner C. The potential impact of chatgpt in clinical and translational medicine. Clin Transl Med. 2023;13(3). doi: 10.1002/ctm2.1259
- Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying large language models and knowledge graphs: a roadmap. arXiv preprint arXiv:2306.08302; 2023. doi: 10.48550/arXiv.2306.08302
- Savova G, Masanz J, Ogren P, Zheng J, Sohn S, Schuler K, Chute C. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507-513. doi: 10.1136/amiajnl-2010-000108
- Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. MedIR Workshop. 2016:1-4. doi: 10.18653/v1/W16-1616
- Liu F, Shareghi E, Meng Z, Basaldella M, Collier N. Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021:4228-4238. doi: 10.18653/v1/N21-1501
- Sun K, Xue S, Sun F, Sun H, Luo Y, Wang L, Wang S, Guo N, Liu L, Zhao T, Wang X, Yang L, Jin S, Yan J, Dong J. Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: applications, challenges, and future directions. arXiv preprint arXiv:2412.02621; 2024. doi: 10.48550/arXiv.2412.02621
- Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654. doi: 10.1038/s41467-023-36524-5
- Wang H, Guo S, Ye J, Deng Z, Cheng J, Li T, Chen J, Su Y, Huang Z, Shen Y, Fu B, Zhang S, He J, Qiao Y. SAM-Med3D. arXiv preprint arXiv:2310.15161; 2023. doi: 10.48550/arXiv.2310.15161
- Gong S, Zhong Y, Ma W, Li J, Wang Z, Zhang J, Heng PA, Dou Q. 3dsam-adapter: holistic adaptation of sam from 2d to 3d for promptable tumor segmentation. Med Image Anal. 2024;98:103324. doi: 10.1016/j.media.2024.103324
- Chen C, Miao J, Wu D, Zhong A, Yan Z, Kim S, Hu J, Liu Z, Sun L, Li X, et al. Ma-sam: modality-agnostic sam adaptation for 3d medical image segmentation. Med Image Anal. 2024;98:103310. doi: 10.1016/j.media.2024.103310
- Xie Y, Gu L, Harada T, Zhang J, Xia Y, Wu Q. MedIM: boost medical image representation via radiology report-guided masking. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023:113-123.
- Wang Z, Lyu J, Tang X. AutoSMIM: automatic superpixel-based masked image modeling for skin lesion segmentation. IEEE Trans Med Imaging. 2023.
- Luo Y, Chen Z, Zhou S, Gao X. Self-distillation augmented masked autoencoders for histopathological image classification. arXiv preprint arXiv:2203.16983; 2022. doi: 10.48550/arXiv.2203.16983
- Zhuang JX, Luo L, Chen H. Advancing volumetric medical image segmentation via global-local masked autoencoder. arXiv preprint arXiv:2306.08913; 2023. doi: 10.48550/arXiv.2306.08913
- Wang H, Tang Y, Wang Y, Guo J, Deng ZH, Han K. Masked image modeling with local multi-scale reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023:2122-2131.
- Yang Q, Li W, Li B, Yuan Y. MRM: masked relation modeling for medical image pretraining with genetics. In: IEEE/CVF International Conference on Computer Vision; 2023:21452-21462.
- Liu H, Wei D, Lu D, Sun J, Wang L, Zheng Y. M3AE: multimodal representation learning for brain tumor segmentation with missing modalities. AAAI Conf Artif Intell. 2023;37(2):1657-1665. doi: 10.1609/aaai.v37i2.1657
- Du J, Guo J, Zhang W, Yang S, Liu H, Li H, Wang N. Ret-clip: a retinal image foundation model pre-trained with clinical diagnostic reports. arXiv preprint arXiv:2405.14137; 2024. doi: 10.48550/arXiv.2405.14137
- Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018;5(1):1-10. doi: 10.1038/sdata.2018.18
- He X, Zhang Y, Mou L, Xing E, Xie P. PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286; 2020. doi: 10.48550/arXiv.2003.10286
- Liu B, Zhan LM, Xu L, Ma L, Yang Y, Wu XM. SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: IEEE 18th International Symposium on Biomedical Imaging (ISBI); 2021:1650-1654.
- Zhou HY, Lian C, Wang L, Yu Y. Advancing radiograph representation learning with masked record modeling. arXiv preprint arXiv:2301.13155; 2023. doi: 10.48550/arXiv.2301.13155
- Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, Wang Y, Xie W. PMC-CLIP: contrastive language-image pretraining using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023:525-536.
- Giorgi JM, Bader GD. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics. 2018;34:4087. doi: 10.1093/bioinformatics/bty400
- Mikolov T, et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013;26:3111-3119. doi: 10.5555/2999792.2999959
- Peters ME, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1; 2018:2227-2237. doi: 10.18653/v1/N18-1202
- Pyysalo S, et al. Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan; 2013:39-43. doi: 10.1093/bioinformatics/btt140
- Wu Y, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144; 2016. doi: 10.48550/arXiv.1609.08144
- Rajpurkar P, et al. SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX; 2016:2383-2392. doi: 10.18653/v1/D16-1264
- Wiese G, et al. Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada; 2017:281-289. doi: 10.18653/v1/K17-1029
- Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017:5998-6008. doi: 10.5555/3295222.3295349
- Krallinger M, et al. Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA; 2017:141-146. doi: 10.1093/database/bay073
- Esteva A, Chou K, Yeung S, Naik N, Madani A, Mottaghi A, Liu Y, Topol E, Dean J, Socher R. Deep learning-enabled medical computer vision. NPJ Digit Med. 2021;1:1-9. doi: 10.1038/s41746-021-00457-4
- Lakkaraju H, Slack D, Chen Y, Tan C, Singh S. Rethinking explainability as a dialogue: a practitioner’s perspective. arXiv preprint arXiv:2202.01875; 2022. doi: 10.48550/arXiv.2202.01875
- Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258; 2021. doi: 10.48550/arXiv.2108.07258
- Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl Sci. 2021;11:6421. doi: 10.3390/app11146421
- Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning; 2022:248-260.
- Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146; 2019. doi: 10.48550/arXiv.1909.06146
- Abacha AB, Agichtein E, Pinter Y, Demner-Fushman D. Overview of the medical question answering task at TREC 2017 LiveQA. TREC. 2017:1-12.
- Abacha AB, Mrabet Y, Sharp M, Goodwin TR, Shooshan SE, Demner-Fushman D. Bridging the gap between consumers’ medication questions and trusted answers. In: MedInfo; 2019:25-29.
- Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300; 2020. doi: 10.48550/arXiv.2009.03300
- Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S, et al. PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311; 2022. doi: 10.48550/arXiv.2204.02311
- Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X, Dehghani M, Brahma S, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416; 2022. doi: 10.48550/arXiv.2210.11416
- Feng SY, Khetan V, Sacaleanu B, Gershman A, Hovy E. CHARD: clinical health-aware reasoning across dimensions for text generation models. arXiv preprint arXiv:2210.04191; 2022. doi: 10.48550/arXiv.2210.04191
- Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A, Fisch A, Brown AR, Santoro A, Gupta A, Garriga-Alonso A, et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615; 2022. doi: 10.48550/arXiv.2206.04615
- Barham P, Chowdhery A, Dean J, Ghemawat S, Hand S, Hurt D, Isard M, Lim H, Pang R, Roy S, et al. Pathways: asynchronous distributed dataflow for ML. In: Proceedings of Machine Learning and Systems; 2022;4:430-449. doi: 10.1145/3507221.3507248
- Wei J, Bosma M, Zhao VY, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652; 2021. doi: 10.48550/arXiv.2109.01652
- Wang X, Wei J, Schuurmans D, Le Q, Chi E, Zhou D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171; 2022. doi: 10.48550/arXiv.2203.11171
- Lewkowycz A, Andreassen A, Dohan D, Dyer E, Michalewski H, Ramasesh V, Slone A, Anil C, Schlag I, Gutman-Solo T, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858; 2022. doi: 10.48550/arXiv.2206.14858