Chest X-ray Image Captioning Using Vision Transformer and Biomedical Language Models with GRU and Optuna Tuning
Abstract
Chest X-ray (CXR) interpretation is time-intensive and contributes to radiologist workload and potential diagnostic delays. We propose a multimodal deep learning framework that integrates a Vision Transformer (ViT) for global visual feature extraction, a biomedical pre-trained language model (ClinicalBERT) for domain-specific semantic encoding, and a Gated Recurrent Unit (GRU) decoder for sequential report generation. Images from the Indiana University CXR dataset were converted from DICOM to PNG and enhanced with contrast-limited adaptive histogram equalization (CLAHE); reports were cleaned, tokenized, and augmented. Three hyperparameters (GRU size, learning rate, and batch size) were optimized using Optuna. On the test set, the ViT + ClinicalBERT + GRU configuration achieved BLEU-4 = 0.278, METEOR = 0.221, ROUGE-L = 0.434, CIDEr = 0.846, and SPICE = 0.530, outperforming CNN–RNN baselines and remaining competitive with transformer-based approaches while being computationally efficient.
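To make the BLEU-4 figure easier to read, the sketch below shows how such a score can be computed for one generated sentence against one reference. It is an illustration only and not the evaluation script used in this study: it assumes lowercased whitespace tokenization, a single reference, and no smoothing, whereas reported scores are normally aggregated over the whole test corpus. The function names `ngram_counts` and `bleu4` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against one reference, without smoothing.

    Assumption: plain whitespace tokenization on lowercased text.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            # Unsmoothed BLEU is zero when any n-gram order has no match.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates that are not longer than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the four precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    # Made-up report sentences, used only to exercise the function.
    reference = "the heart size is normal and the lungs are clear"
    candidate = "the heart size is normal and lungs are clear"
    print(f"BLEU-4 = {bleu4(candidate, reference):.3f}")
```

In this example, dropping a single word from a ten-word sentence already lowers the score well below 1, because the missing word breaks several overlapping 3-grams and 4-grams and also triggers the brevity penalty. This is one reason BLEU-4 values for free-text radiology reports tend to be low in absolute terms.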