Development of a Semantic-based Image Retrieval Model Using Contrastive Pre-trained between Image and Text

Authors

  • Chakkarin Santirattanaphakdi, Institute of Digital Arts and Science, Suranaree University of Technology; Department of Digital Business Technology, Faculty of Business Administrator, Vongchavalitkul University
  • Suphakit Niwattanakul, Institute of Digital Arts and Science, Suranaree University of Technology

Keywords:

Image retrieval, Pre-trained, Contrastive learning, Deep learning

Abstract

The objective of this research is to develop a semantic image retrieval model based on contrastive pre-training between images and text, and to evaluate its effectiveness. The approach comprises three modules. 1) The image description set generation module is trained to encode both image content and textual content into their respective embeddings and then estimates the output probability distribution using the softmax function. It computes a loss that compares the experts' evaluation of image meaning with the label likelihoods predicted by the model. In this step, the parameters for learning image meaning are adjusted using the concepts of retroactive distribution learning and self-paced learning, and the learned knowledge is transferred to label images according to the high-level abstract concepts obtained from meaning-based similarity learning. The module then creates feature vectors for the images. 2) The natural language processing module encodes the user's natural language query into a query feature vector. 3) The feature matching module matches image feature vectors against the query feature vector by vector similarity, ranks the results by relevance, and presents the retrieved images to the user. The evaluation of semantic image retrieval performance shows that the mean reciprocal rank (MRR) over the top k retrievals at k = 5 is 0.628 on the Flickr30k dataset and 0.617 on the self-collected dataset, and that the precision at k (P@k) values on the Flickr30k dataset for k = 1, 3, and 5 are 0.585, 0.664, and 0.761, respectively. On the self-collected dataset the P@k values are slightly lower, but the results show a consistent trend.
The outcomes of this research will aid in addressing the semantic gap problem and support users in their natural language queries, which are linked to the image semantics rather than following the grammatical rules of the language.
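
As an illustration of the feature matching module and the two reported metrics, the following is a minimal Python sketch, not the authors' implementation. It assumes cosine similarity as the vector similarity measure and the standard textbook definitions of reciprocal rank and precision at k; the abstract does not state which similarity measure or exact metric definitions were used. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(query_vec, image_vecs):
    """Return image ids ordered from most to least similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), image_id)
              for image_id, vec in image_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / position of the first relevant image within the top k, else 0.0."""
    for position, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id in relevant_ids:
            return 1.0 / position
    return 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant (standard definition)."""
    hits = sum(1 for image_id in ranked_ids[:k] if image_id in relevant_ids)
    return hits / k


def mean_reciprocal_rank(rankings, relevant_sets, k):
    """Average reciprocal rank at k over all queries (0.0 for no queries)."""
    scores = [reciprocal_rank_at_k(ranked, relevant, k)
              for ranked, relevant in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): three image feature vectors and two query vectors.
image_vecs = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "car": [-1.0, 0.0]}
queries = [[0.9, 0.1], [0.2, 0.8]]
relevant_sets = [{"dog"}, {"dog"}]

rankings = [rank_images(q, image_vecs) for q in queries]
mrr_at_5 = mean_reciprocal_rank(rankings, relevant_sets, k=5)
p_at_1 = [precision_at_k(r, rel, 1) for r, rel in zip(rankings, relevant_sets)]
```

In a real system the vectors would come from the trained image and text encoders rather than being written by hand, and the ranking would typically be computed with a vectorised or approximate nearest-neighbour search instead of a Python loop.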

Published

2023-12-20

How to Cite

Santirattanaphakdi, C., & Niwattanakul, S. (2023). Development of a Semantic-based Image Retrieval Model Using Contrastive Pre-trained between Image and Text. Huachiew Chalermprakiet Science and Technology Journal, 9(2), 34–51. Retrieved from https://ph02.tci-thaijo.org/index.php/scihcu/article/view/249669

Section

Research Articles