Open Access Review

Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis

by Ran Tong 1,*, Ting Xu 2, Xinxin Ju 1 and Lanruo Wang 3
1 Mathematics and Statistics Department, University of Texas at Dallas, Richardson, TX, USA
2 Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA
3 Naveen Jindal School of Management, University of Texas at Dallas, Richardson, TX, USA
* Author to whom correspondence should be addressed.
Received: 27 December 2024 / Accepted: 5 February 2025 / Published Online: 11 February 2025

Abstract

The rapid advancement of artificial intelligence (AI) in healthcare has significantly enhanced diagnostic accuracy and clinical decision-making processes. This review examines four pivotal studies that highlight the integration of large language models (LLMs) and multimodal systems in medical diagnostics. BioBERT demonstrates the efficacy of domain-specific pretraining on biomedical texts, improving performance in tasks such as named entity recognition, relation extraction, and question answering. Med-PaLM, a large-scale language model tailored for clinical question answering, leverages instruction prompt tuning to enhance accuracy and reduce harmful outputs, validated through the MultiMedQA benchmark. DR.KNOWS integrates medical knowledge graphs with LLMs, enhancing diagnostic reasoning and interpretability by grounding model predictions in structured medical knowledge. Medical Multimodal Foundation Models (MMFMs) combine textual and imaging data to improve tasks like segmentation, lesion detection, and automated report generation. These studies demonstrate the importance of domain adaptation, structured knowledge integration, and multimodal data fusion in developing robust and interpretable AI-driven diagnostic tools.
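To make the domain-adaptation theme above concrete, the short sketch below shows one common way a domain-specific encoder such as BioBERT can be loaded for a biomedical named entity recognition task. It is an illustration only, not a pipeline from any of the reviewed studies: the Hugging Face transformers API, the dmis-lab/biobert-base-cased-v1.1 checkpoint, and the three-label BIO scheme are assumptions made here for demonstration, and the classification head would still need fine-tuning on an annotated corpus before its predictions are meaningful.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative checkpoint: a publicly released BioBERT model (an assumption of
# this sketch, not a detail taken from the reviewed papers).
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"

# Toy BIO label scheme for disease mentions; a real system would define this to
# match its annotated training corpus (e.g., NCBI-disease).
LABELS = ["O", "B-Disease", "I-Disease"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The token-classification head is randomly initialized here and would need
# fine-tuning before the labels it emits carry any meaning.
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))
model.eval()

text = "The patient was diagnosed with type 2 diabetes mellitus."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, label_id in zip(tokens, pred_ids):
    print(f"{token}\t{LABELS[label_id]}")

The same load-and-fine-tune pattern carries over to the relation extraction and question answering tasks mentioned above; the point the BioBERT work makes is that starting from biomedically pretrained weights, rather than general-domain BERT, is what drives the gains.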


Copyright: © 2025 by Tong, Xu, Ju and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) (Creative Commons Attribution 4.0 International License). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Cite This Paper
APA Style
Tong, R., Xu, T., Ju, X., & Wang, L. (2025). Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis. AI Med, 1(1), 5. doi:10.71423/aimed.20250105
ACS Style
Tong, R.; Xu, T.; Ju, X.; Wang, L. Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis. AI Med, 2025, 1, 5. doi:10.71423/aimed.20250105
AMA Style
Tong R, Xu T, Ju X, Wang L. Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis. AI Med. 2025;1(1):5. doi:10.71423/aimed.20250105
Chicago/Turabian Style
Tong, Ran, Ting Xu, Xinxin Ju, and Lanruo Wang. 2025. "Progress in Medical AI: Reviewing Large Language Models and Multimodal Systems for Diagnosis." AI Med 1, no. 1: 5. doi:10.71423/aimed.20250105


References

  1. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682
  2. Alsentzer E, Murphy J, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop; 2019:72-78. doi: 10.18653/v1/W19-1909
  3. Gao Y, Li R, Caskey J, Dligach D, Miller T, Churpek M, Afshar M. DR.KNOWS: leveraging a medical knowledge graph into large language models for diagnosis prediction. arXiv preprint arXiv:2308.14321; 2023. doi: 10.48550/arXiv.2308.14321
  4. Kwon T, Ong KT, Kang D, Moon S, Lee JR, Hwang D, Sim Y, Lee D, Yeo J. Clinical chain-of-thought: Reasoning-aware diagnosis framework with prompt-generated rationales. arXiv preprint arXiv:2312.07399; 2023. doi: 10.48550/arXiv.2312.07399
  5. McDuff D, Schaekermann M, Tu T, Palepu A, Wang A, Garrison J, Singhal K, Sharma Y, Azizi S, Kulkarni K, et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2307.08922; 2023. doi: 10.48550/arXiv.2307.08922
  6. Bian J, Wang S, Yao Z, Guo J, Zhang Q, Sun C, Windle SR, Liu X. GatorTron: a large language model for electronic health records. J Am Med Inform Assoc. 2022;29(2):283-291. doi: 10.1093/jamia/ocac005
  7. Singhal K, Tu D, Palepu A, Wang A, Sunshine J, Corrado GS. Med-PaLM: large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138; 2022. doi: 10.48550/arXiv.2212.13138
  8. Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901.
  9. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1; 2019:4171-4186. doi: 10.18653/v1/N19-1423
  10. Wu CK, Chen WL, Chen HH. Large language models perform diagnostic reasoning. arXiv preprint arXiv:2306.01567; 2023. doi: 10.48550/arXiv.2306.01567
  11. Rajpurkar P, Irvin J, Ball M, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul T, Langlotz D. Medical question answering with large language models. Nat Mach Intell. 2021;3:343-348. doi: 10.1038/s42256-021-00283-9
  12. Li X, Hu S, Liu J. Towards automatic diagnosis from multi-modal medical data. IEEE Trans Med Imaging. 2018;37(4):888-900. doi: 10.1109/TMI.2017.2781965
  13. Khader F, Ali H, Yousaf M. Medical diagnosis with large scale multimodal transformers: Leveraging diverse data for more accurate diagnosis. arXiv preprint arXiv:2212.09162; 2022. doi: 10.48550/arXiv.2212.09162
  14. Ma MD, Singh P, Smith R, Brown J. CliBench: multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders, and prescriptions. arXiv preprint arXiv:2406.09923; 2024. doi: 10.48550/arXiv.2406.09923
  15. Kumar A, Sharma S, Srinivasan P. MedImage: integrating multimodal data for medical diagnostics. arXiv preprint arXiv:2205.06109; 2022. doi: 10.48550/arXiv.2205.06109
  16. Ruan C, Wang F, Chen T. Comprehensive evaluation of multimodal ai models in medical imaging diagnosis. arXiv preprint arXiv:2406.07853; 2024. doi: 10.48550/arXiv.2406.07853
  17. Zhou H, Li X, Chen Y. Towards personalized multimodal medical diagnostics with large-scale ai models. arXiv preprint arXiv:2407.02164; 2024. doi: 10.48550/arXiv.2407.02164
  18. Baumgartner C. The potential impact of ChatGPT in clinical and translational medicine. Clin Transl Med. 2023;13(3). doi: 10.1002/ctm2.1259
  19. Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying large language models and knowledge graphs: a roadmap. arXiv preprint arXiv:2306.08302; 2023. doi: 10.48550/arXiv.2306.08302
  20. Savova G, Masanz J, Ogren P, Zheng J, Sohn S, Schuler K, Chute C. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17(5):507-513. doi: 10.1136/amiajnl-2010-000108
  21. Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop; 2016:1-4.
  22. Liu F, Shareghi E, Meng Z, Basaldella M, Collier N. Self-alignment pretraining for biomedical entity representations. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021:4228-4238. doi: 10.18653/v1/N21-1501
  23. Sun K, Xue S, Sun F, Sun H, Luo Y, Wang L, Wang S, Guo N, Liu L, Zhao T, Wang X, Yang L, Jin S, Yan J, Dong J. Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: applications, challenges, and future directions. arXiv preprint arXiv:2412.02621; 2024. doi: 10.48550/arXiv.2412.02621
  25. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654. doi: 10.1038/s41467-024-44824-z
  26. Wang H, Guo S, Ye J, Deng Z, Cheng J, Li T, Chen J, Su Y, Huang Z, Shen Y, Fu B, Zhang S, He J, Qiao Y. SAM-Med3D. arXiv preprint arXiv:2310.15161; 2023. doi: 10.48550/arXiv.2310.15161
  27. Gong S, Zhong Y, Ma W, Li J, Wang Z, Zhang J, Heng PA, Dou Q. 3DSAM-adapter: holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation. Med Image Anal. 2024;98:103324. doi: 10.1016/j.media.2024.103324
  28. Chen C, Miao J, Wu D, Zhong A, Yan Z, Kim S, Hu J, Liu Z, Sun L, Li X, et al. MA-SAM: modality-agnostic SAM adaptation for 3D medical image segmentation. Med Image Anal. 2024;98:103310. doi: 10.1016/j.media.2024.103310
  29. Xie Y, Gu L, Harada T, Zhang J, Xia Y, Wu Q. MedIM: boost medical image representation via radiology report-guided masking. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023:113-123.
  30. Wang Z, Lyu J, Tang X. AutoSMIM: automatic superpixel-based masked image modeling for skin lesion segmentation. IEEE Trans Med Imaging. 2023.
  31. Luo Y, Chen Z, Zhou S, Gao X. Self-distillation augmented masked autoencoders for histopathological image classification. arXiv preprint arXiv:2203.16983; 2022. doi: 10.48550/arXiv.2203.16983
  32. Zhuang JX, Luo L, Chen H. Advancing volumetric medical image segmentation via global-local masked autoencoder. arXiv preprint arXiv:2306.08913; 2023. doi: 10.48550/arXiv.2306.08913
  33. Wang H, Tang Y, Wang Y, Guo J, Deng ZH, Han K. Masked image modeling with local multi-scale reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023:2122-2131.
  34. Yang Q, Li W, Li B, Yuan Y. MRM: masked relation modeling for medical image pretraining with genetics. In: IEEE/CVF International Conference on Computer Vision; 2023:21452-21462.
  35. Liu H, Wei D, Lu D, Sun J, Wang L, Zheng Y. M3AE: multimodal representation learning for brain tumor segmentation with missing modalities. AAAI Conf Artif Intell. 2023;37(2):1657-1665. doi: 10.1609/AAAI.v37i2.1657
  36. Du J, Guo J, Zhang W, Yang S, Liu H, Li H, Wang N. RET-CLIP: a retinal image foundation model pre-trained with clinical diagnostic reports. arXiv preprint arXiv:2405.14137; 2024. doi: 10.48550/arXiv.2405.14137
  37. Lau JJ, Gayen S, Ben Abacha A, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018;5(1):1-10. doi: 10.1038/sdata.2018.18
  38. He X, Zhang Y, Mou L, Xing E, Xie P. PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286; 2020. doi: 10.48550/arXiv.2003.10286
  39. Liu B, Zhan LM, Xu L, Ma L, Yang Y, Wu XM. SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: IEEE 18th International Symposium on Biomedical Imaging (ISBI); 2021:1650-1654.
  40. Zhou HY, Lian C, Wang L, Yu Y. Advancing radiograph representation learning with masked record modeling. arXiv preprint arXiv:2301.13155; 2023. doi: 10.48550/arXiv.2301.13155
  41. Lin W, Zhao Z, Zhang X, Wu C, Zhang Y, Wang Y, Xie W. PMC-CLIP: contrastive language-image pretraining using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2023:525-536.
  42. Giorgi JM, Bader GD. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics. 2018;34:4087. doi: 10.1093/bioinformatics/bty400
  43. Mikolov T, et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013;26:3111-3119.
  44. Peters ME, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1; 2018:2227-2237. doi: 10.18653/v1/N18-1202
  45. Pyysalo S, et al. Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan; 2013:39-43. doi: 10.1093/bioinformatics/btt140
  46. Wu Y, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144; 2016. doi: 10.48550/arXiv.1609.08144
  47. Rajpurkar P, et al. SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX; 2016:2383-2392. doi: 10.18653/v1/D16-1264
  48. Wiese G, et al. Neural domain adaptation for biomedical question answering. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada; 2017:281-289. doi: 10.18653/v1/K17-1029
  49. Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017:5998-6008.
  50. Krallinger M, et al. Overview of the BioCreative VI chemical-protein interaction track. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA; 2017:141-146. doi: 10.1093/database/bay073
  51. Esteva A, Chou K, Yeung S, Naik N, Madani A, Mottaghi A, Liu Y, Topol E, Dean J, Socher R. Deep learning-enabled medical computer vision. NPJ Digit Med. 2021;1:1-9. doi: 10.1038/s41746-021-00457-4
  52. Lakkaraju H, Slack D, Chen Y, Tan C, Singh S. Rethinking explainability as a dialogue: a practitioner’s perspective. arXiv preprint arXiv:2202.01875; 2022. doi: 10.48550/arXiv.2202.01875
  53. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258; 2021. doi: 10.48550/arXiv.2108.07258
  54. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl Sci. 2021;11:6421. doi: 10.3390/app11146421
  55. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning; 2022:248-260.
  56. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146; 2019. doi: 10.48550/arXiv.1909.06146
  57. Abacha AB, Agichtein E, Pinter Y, Demner-Fushman D. Overview of the medical question answering task at TREC 2017 LiveQA. TREC. 2017:1-12.
  58. Abacha AB, Mrabet Y, Sharp M, Goodwin TR, Shooshan SE, Demner-Fushman D. Bridging the gap between consumers’ medication questions and trusted answers. In: MedInfo; 2019:25-29.
  59. Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, Steinhardt J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300; 2020. doi: 10.48550/arXiv.2009.03300
  60. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S, et al. PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311; 2022. doi: 10.48550/arXiv.2204.02311
  61. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X, Dehghani M, Brahma S, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416; 2022. doi: 10.48550/arXiv.2210.11416
  62. Feng SY, Khetan V, Sacaleanu B, Gershman A, Hovy E. CHARD: clinical health-aware reasoning across dimensions for text generation models. arXiv preprint arXiv:2210.04191; 2022. doi: 10.48550/arXiv.2210.04191
  63. Srivastava A, Rastogi A, Rao A, Shoeb AAM, Abid A, Fisch A, Brown AR, Santoro A, Gupta A, Garriga-Alonso A, et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615; 2022. doi: 10.48550/arXiv.2206.04615
  64. Barham P, Chowdhery A, Dean J, Ghemawat S, Hand S, Hurt D, Isard M, Lim H, Pang R, Roy S, et al. Pathways: asynchronous distributed dataflow for ML. In: Proceedings of Machine Learning and Systems; 2022;4:430-449.
  65. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877-1901.
  66. Wei J, Bosma M, Zhao VY, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652; 2021. doi: 10.48550/arXiv.2109.01652
  67. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Zhou D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171; 2022. doi: 10.48550/arXiv.2203.11171
  68. Lewkowycz A, Andreassen A, Dohan D, Dyer E, Michalewski H, Ramasesh V, Slone A, Anil C, Schlag I, Gutman-Solo T, et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858; 2022. doi: 10.48550/arXiv.2206.14858