Image Caption Generation with the CLIP Prefix Caption Model on the Traffy Fondue Dataset

วสิศ ลิ้มประเสริฐ

Abstract

Traffy Fondue is a complaint-reporting system provided by the Bangkok Metropolitan Administration to receive citizens' opinions and suggestions about the city. However, because reports come from a large number of users, many are unclear; for example, a report's description and image may be inconsistent, which makes it difficult for the receiving officer to coordinate and resolve the problem. Our team therefore proposed a clustering approach, based on data-processing techniques, to make reports easier to group and handle. In this research, we applied the CLIP Prefix Caption model to generate image captions so that the system can group related words or search for related problems. Captions were generated from Traffy Fondue images using the CLIP, CLIP Prefix Caption, and GPT-2 models. The experimental results can be summarized as follows: BLEU 0.93% and ROUGE-1 16.39%, which is not good enough for real applications. We therefore reorganized the experiment, proposing to group images using the vectors from the prefix embeddings instead of captioning directly from the images. The results indicate that the embeddings can be applied for further development.
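The grouping step proposed in the final experiment can be illustrated with a minimal sketch (not the authors' exact pipeline): extract the CLIP image embedding that CLIP Prefix Caption would map into a GPT-2 prefix, cluster those vectors, and project them to 2-D for inspection, as in UMAP (McInnes et al., 2018) or t-SNE (van der Maaten & Hinton, 2008). The sketch below assumes the Hugging Face transformers implementation of CLIP plus scikit-learn and umap-learn; the image directory, checkpoint, and cluster count are illustrative assumptions.

import numpy as np
import torch
import umap  # umap-learn (McInnes et al., 2018)
from pathlib import Path
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical folder of Traffy Fondue report photos.
image_paths = sorted(Path("traffy_images").glob("*.jpg"))

embeddings = []
with torch.no_grad():
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        feat = model.get_image_features(**inputs)      # the vector ClipCap maps to a GPT-2 prefix
        feat = feat / feat.norm(dim=-1, keepdim=True)  # L2-normalise so distances are cosine-like
        embeddings.append(feat.squeeze(0).cpu().numpy())

X = np.stack(embeddings)

# Group related reports directly in embedding space; k = 8 is an illustrative choice.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# 2-D projection for visual inspection of the clusters (t-SNE would serve equally).
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

Clustering the embeddings sidesteps the lossy image-to-text step: reports with visually similar problems land in the same group even when the generated captions are poor.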

Article Details

Section
Research Article

References

Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G., Cao, Z., Zhang, J., Huang, S., Huang, F., Zhou, J., & Si, L. (2022). mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. arXiv:2205.12005v2 [cs.CL]. https://doi.org/10.48550/arXiv.2205.12005

van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426v3 [stat.ML]. https://doi.org/10.48550/arXiv.1802.03426

Meister, C., Vieira, T., & Cotterell, R. (2020). Best-first beam search. Transactions of the Association for Computational Linguistics, 8, 795–809. https://doi.org/10.1162/tacl_a_00346

Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv:2111.09734v1 [cs.CV]. https://doi.org/10.48550/arXiv.2111.09734

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020v1 [cs.CV]. https://doi.org/10.48550/arXiv.2103.00020

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., & Yang, H. (2022). OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 23318–23340). PMLR. https://proceedings.mlr.press/v162/wang22al/wang22al.pdf