Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis
Abstract
The rapid advancement of artificial intelligence (AI) in healthcare has significantly improved diagnostic accuracy and clinical decision-making. This review examines four pivotal studies that illustrate the integration of large language models (LLMs) and multimodal systems into medical diagnostics. BioBERT demonstrates the efficacy of domain-specific pretraining on biomedical text, boosting performance on tasks such as named entity recognition, relation extraction, and question answering. Med-PaLM, a large-scale language model tailored for clinical question answering, leverages instruction prompt tuning to improve accuracy and reduce harmful outputs, as validated on the MultiMedQA benchmark. DR.KNOWS integrates medical knowledge graphs with LLMs, grounding model predictions in structured medical knowledge to strengthen diagnostic reasoning and interpretability. Medical Multimodal Foundation Models (MMFMs) fuse textual and imaging data to advance tasks such as segmentation, lesion detection, and automated report generation. Together, these studies underscore the importance of domain adaptation, structured knowledge integration, and multimodal data fusion in building robust and interpretable AI-driven diagnostic tools.
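The knowledge-graph grounding described above can be made concrete with a toy sketch: findings extracted from a clinical note are looked up in a small directed graph, and diagnosis nodes reachable within a few hops are scored by how many findings point to them and how directly. Everything here (the graph, the concept names, the scoring rule) is a simplified assumption for illustration, not the DR.KNOWS algorithm itself and not clinical guidance.

```python
from collections import deque

# Toy directed graph in the spirit of a UMLS-style knowledge graph:
# edges point from findings/conditions toward related diagnoses.
# All concepts and edges are illustrative placeholders.
TOY_KG = {
    "fever": ["infection", "influenza"],
    "cough": ["influenza", "pneumonia"],
    "chest pain": ["pneumonia", "myocardial infarction"],
    "infection": ["sepsis"],
    "influenza": [],
    "pneumonia": ["sepsis"],
    "myocardial infarction": [],
    "sepsis": [],
}

DIAGNOSES = {"influenza", "pneumonia", "myocardial infarction", "sepsis"}

def candidate_diagnoses(findings, kg=TOY_KG, max_hops=2):
    """Rank diagnosis nodes reachable from the extracted findings.

    Each finding contributes 1/hops to every diagnosis it reaches
    within max_hops, so diagnoses linked to more findings by shorter
    paths rank higher.
    """
    scores = {}
    for finding in findings:
        if finding not in kg:
            continue
        seen = {finding}
        frontier = deque([(finding, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if node in DIAGNOSES and hops > 0:
                scores[node] = scores.get(node, 0.0) + 1.0 / hops
            if hops < max_hops:
                for nxt in kg.get(node, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, hops + 1))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Findings would come from an upstream NER model (e.g. BioBERT-style);
# here they are hard-coded. "influenza" ranks first because both
# findings reach it in one hop.
ranked = candidate_diagnoses(["fever", "cough"])
```

In a full system, the ranked candidates would be passed to an LLM as structured context, which is what lets the model's diagnostic reasoning be traced back to explicit graph paths rather than opaque parametric knowledge.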
Cite This Paper
Tong, R., Xu, T., Ju, X., & Wang, L. (2025). Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis. AI Med, 1(1), 5. doi:10.71423/aimed.20250105
References
- Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682
- Alsentzer E, Murphy J, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop; 2019:72-78. doi: 10.18653/v1/W19-1909
- Gao Y, Li R, Caskey J, Dligach D, Miller T, Churpek M, Afshar M. DR.KNOWS: leveraging a medical knowledge graph into large language models for diagnosis prediction. arXiv preprint arXiv:2308.14321; 2023. doi: 10.48550/arXiv.2308.14321
- Kwon T, Ong KT, Kang D, Moon S, Lee JR, Hwang D, Sim Y, Lee D, Yeo J. Clinical chain-of-thought: Reasoning-aware diagnosis framework with prompt-generated rationales. arXiv preprint arXiv:2312.07399; 2023. doi: 10.48550/arXiv.2312.07399
- McDuff D, Schaekermann M, Tu T, Palepu A, Wang A, Garrison J, Singhal K, Sharma Y, Azizi S, Kulkarni K, et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2307.08922; 2023. doi: 10.48550/arXiv.2307.08922
- Bian J, Wang S, Yao Z, Guo J, Zhang Q, Sun C, Windle SR, Liu X. GatorTron: a large language model for electronic health records. J Am Med Inform Assoc. 2022;29(2):283-291. doi: 10.1093/jamia/ocac005
- Singhal K, Tu D, Palepu A, Wang A, Sunshine J, Corrado GS. Med-PaLM: large language models encode clinical knowledge. arXiv preprint arXiv:2212.09162; 2022. doi: 10.48550/arXiv.2212.09162
- Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901. doi: 10.5555/3454287.3454612
- Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1; 2019:4171-4186. doi: 10.18653/v1/N19-1423
- Wu CK, Chen WL, Chen HH. Large language models perform diagnostic reasoning. arXiv preprint arXiv:2306.01567; 2023. doi: 10.48550/arXiv.2306.01567
- Rajpurkar P, Irvin J, Ball M, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul T, Langlotz D. Medical question answering with large language models. Nat Mach Intell. 2021;3:343-348. doi: 10.1038/s42256-021-00283-9
- Li X, Hu S, Liu J. Towards automatic diagnosis from multi-modal medical data. IEEE Trans Med Imaging. 2018;37(4):888-900. doi: 10.1109/TMI.2017.2781965
- Khader F, Ali H, Yousaf M. Medical diagnosis with large scale multimodal transformers: Leveraging diverse data for more accurate diagnosis. arXiv preprint arXiv:2212.09162; 2022. doi: 10.48550/arXiv.2212.09162
- Ma MD, Singh P, Smith R, Brown J. CliBench: multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders, and prescriptions. arXiv preprint arXiv:2406.09923; 2024. doi: 10.48550/arXiv.2406.09923
- Kumar A, Sharma S, Srinivasan P. Medimage: integrating multimodal data for medical diagnostics. arXiv preprint arXiv:2205.06109; 2022. doi: 10.48550/arXiv.2205.06109
- Ruan C, Wang F, Chen T. Comprehensive evaluation of multimodal ai models in medical imaging diagnosis. arXiv preprint arXiv:2406.07853; 2024. doi: 10.48550/arXiv.2406.07853
- Zhou H, Li X, Chen Y. Towards personalized multimodal medical diagnostics with large-scale ai models. arXiv preprint arXiv:2407.02164; 2024. doi: 10.48550/arXiv.2407.02164
- Baumgartner C. The potential impact of chatgpt in clinical and translational medicine. Clin Transl Med. 2023;13(3). doi: 10.1002/ctm2.1259
- Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying large language models and knowledge graphs: a roadmap. arXiv preprint arXiv:2306.08302; 2023. doi: 10.48550/arXiv.2306.08302
- Savova G, Masanz J, Ogren P, Zheng J, Sohn S, Schuler K, Chute C. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507-513. doi: 10.1136/amiajnl-2010-000108
- Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. MedIR Workshop. 2016:1-4. doi: 10.18653/v1/W16-1616
- Liu F, Shareghi E, Meng Z, Basaldella M, Collier N. Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021:4228-4238. doi: 10.18653/v1/N21-1501
- Sun K, Xue S, Sun F, Sun H, Luo Y, Wang L, Wang S, Guo N, Liu L, Zhao T, Wang X, Yang L, Jin S, Yan J, Dong J. Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: applications, challenges, and future directions. arXiv preprint arXiv:2412.02621; 2024. doi: 10.48550/arXiv.2412.02621
- Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654. doi: 10.1038/s41467-023-36524-5
- Wang H, Guo S, Ye J, Deng Z, Cheng J, Li T, Chen J, Su Y, Huang Z, Shen Y, Fu B, Zhang S, He J, Qiao Y. SAM-Med3D. arXiv preprint arXiv:2310.15161; 2023. doi: 10.48550/arXiv.2310.15161
- Gong S, Zhong Y, Ma W, Li J, Wang Z, Zhang J, Heng PA, Dou Q. 3dsam-adapter: holistic adaptation of sam from 2d to 3d for promptable tumor segmentation. Med Image Anal. 2024;98:103324. doi: 10.1016/j.media.2024.103324
- Chen C, Miao J, Wu D, Zhong A, Yan Z, Kim S, Hu J, Liu Z, Sun L, Li X, et al. Ma-sam: modality-agnostic sam adaptation for 3d medical image segmentation. Med Image Anal. 2024;98:103310. doi: 10.1016/j.media.2024.103310
- Xie Y, Gu L, Harada T, Zhang J, Xia Y, Wu Q. MedIM: boost medical image representation via radiology report-guided masking. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023:113-123.
- Wang Z, Lyu J, Tang X. AutoSMIM: automatic superpixel-based masked image modeling for skin lesion segmentation. IEEE Trans Med Imaging. 2023.
- Luo Y, Chen Z, Zhou S, Gao X. Self-distillation augmented masked autoencoders for histopathological image classification. arXiv preprint arXiv:2203.16983; 2022. doi: 10.48550/arXiv.2203.16983
- Zhuang JX, Luo L, Chen H. Advancing volumetric medical image segmentation via global-local masked autoencoder. arXiv preprint arXiv:2306.08913; 2023. doi: 10.48550/arXiv.2306.08913
- Wang H, Tang Y, Wang Y, Guo J, Deng ZH, Han K. Masked image modeling with local multi-scale reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023:2122-2131.
- Yang Q, Li W, Li B, Yuan Y. MRM: masked relation modeling for medical image pretraining with genetics. In: IEEE/CVF International Conference on Computer Vision; 2023:21452-21462.
- Liu H, Wei D, Lu D, Sun J, Wang L, Zheng Y. M3AE: multimodal representation learning for brain tumor segmentation with missing modalities. AAAI Conf Artif Intell. 2023;37(2):1657-1665. doi: 10.1609/aaai.v37i2.1657
- Du J, Guo J, Zhang W, Yang S, Liu H, Li H, Wang N. Ret-clip: a retinal image foundation model pre-trained with clinical diagnostic reports. arXiv preprint arXiv:2405.14137; 2024. doi: 10.48550/arXiv.2405.14137
- Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018;5(1):1-10. doi: 10.1038/sdata.2018.18
- He X, Zhang Y, Mou L, Xing E, Xie P. PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286; 2020. doi: 10.48550/arXiv.2003.10286
- Liu B, Zhan LM, Xu L, Ma L, Yang Y, Wu XM. SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: IEEE 18th International Symposium on Biomedical Imaging (ISBI); 2021:1650-1654.
- Zhou HY, Lian C, Wang L, Yu Y. Advancing radiograph representation learning with masked record modeling. arXiv preprint arXiv:2301.13155; 2023. doi: 10.48550/arXiv.2301.13155
- Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, Wang Y, Xie W. PMC-CLIP: contrastive language-image pretraining using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023:525-536.
- Giorgi JM, Bader GD. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics. 2018;34:4087. doi: 10.1093/bioinformatics/bty400
- Mikolov T, et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013;26:3111-3119. doi: 10.5555/2999792.2999959
- Peters ME, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1; 2018:2227-2237. doi: 10.18653/v1/N18-1202
- Pyysalo S, et al. Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan; 2013:39-43. doi: 10.1093/bioinformatics/btt140
- Wu Y, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144; 2016. doi: 10.48550/arXiv.1609.08144
- Rajpurkar P, et al. SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX; 2016:2383-2392. doi: 10.18653/v1/D16-1264
- Wiese G, et al. Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada; 2017:281-289. doi: 10.18653/v1/K17-1029
- Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017:5998-6008. doi: 10.5555/3295222.3295349
- Krallinger M, et al. Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA; 2017:141-146. doi: 10.1093/database/bay073
- Esteva A, Chou K, Yeung S, Naik N, Madani A, Mottaghi A, Liu Y, Topol E, Dean J, Socher R. Deep learning-enabled medical computer vision. NPJ Digit Med. 2021;1:1-9. doi: 10.1038/s41746-021-00457-4
- Lakkaraju H, Slack D, Chen Y, Tan C, Singh S. Rethinking explainability as a dialogue: a practitioner’s perspective. arXiv preprint arXiv:2202.01875; 2022. doi: 10.48550/arXiv.2202.01875
- Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258; 2021. doi: 10.48550/arXiv.2108.07258
- Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl Sci. 2021;11:6421. doi: 10.3390/app11146421
- Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning; 2022:248-260.
- Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146; 2019. doi: 10.48550/arXiv.1909.06146
- Abacha AB, Agichtein E, Pinter Y, Demner-Fushman D. Overview of the medical question answering task at TREC 2017 LiveQA. TREC. 2017:1-12.
- Abacha AB, Mrabet Y, Sharp M, Goodwin TR, Shooshan SE, Demner-Fushman D. Bridging the gap between consumers’ medication questions and trusted answers. In: MedInfo; 2019:25-29.
- Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300; 2020. doi: 10.48550/arXiv.2009.03300
- Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S, et al. PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311; 2022. doi: 10.48550/arXiv.2204.02311
- Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X, Dehghani M, Brahma S, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416; 2022. doi: 10.48550/arXiv.2210.11416
- Feng SY, Khetan V, Sacaleanu B, Gershman A, Hovy E. CHARD: clinical health-aware reasoning across dimensions for text generation models. arXiv preprint arXiv:2210.04191; 2022. doi: 10.48550/arXiv.2210.04191
- Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A, Fisch A, Brown AR, Santoro A, Gupta A, Garriga-Alonso A, et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615; 2022. doi: 10.48550/arXiv.2206.04615
- Barham P, Chowdhery A, Dean J, Ghemawat S, Hand S, Hurt D, Isard M, Lim H, Pang R, Roy S, et al. Pathways: asynchronous distributed dataflow for ML. In: Proceedings of Machine Learning and Systems; 2022;4:430-449. doi: 10.1145/3507221.3507248
- Wei J, Bosma M, Zhao VY, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652; 2021. doi: 10.48550/arXiv.2109.01652
- Wang X, Wei J, Schuurmans D, Le Q, Chi E, Zhou D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171; 2022. doi: 10.48550/arXiv.2203.11171
- Lewkowycz A, Andreassen A, Dohan D, Dyer E, Michalewski H, Ramasesh V, Slone A, Anil C, Schlag I, Gutman-Solo T, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858; 2022. doi: 10.48550/arXiv.2206.14858