Image Caption Generation with the CLIP Prefix Caption Model on the Traffy Fondue Dataset
Abstract
Traffy Fondue is a complaint-reporting system provided by the Bangkok Metropolitan Administration to collect citizens' opinions and suggestions about the city. However, because reports come from a large number of users, many are unclear; for example, the text description and the attached image are often inconsistent, which makes it difficult for receiving officers to coordinate and resolve the problem. We therefore propose a data-clustering approach that uses data-processing techniques to make grouping reports more convenient. In this research, we applied the CLIP Prefix Caption model to generate captions so that the system can group keywords or search for related problems. Captions were generated from Traffy Fondue images using the CLIP, CLIP Prefix Caption, and GPT-2 models. The experimental results were a BLEU score of 0.93% and a ROUGE-1 score of 16.39%, which is not good enough for real applications. We therefore reorganized the experiment, proposing to group images using vectors from the prefix embeddings instead of captioning the images directly. The results indicate that these embeddings can be applied in further development.
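To make the pipeline described in the abstract concrete, the sketch below shows a minimal CLIP Prefix Caption setup in Python with the Hugging Face transformers library: a CLIP image embedding is mapped by a small MLP into a sequence of prefix embeddings, which are fed to GPT-2 for greedy decoding; the same vectors can then be clustered as in the follow-up experiment. This is an illustration under stated assumptions, not the paper's actual implementation: the file name traffy_report.jpg is hypothetical, the prefix length and MLP sizes are assumed, and the mapping network here is randomly initialized, whereas in the ClipCap approach (Mokady et al., 2021) it would be trained on image-caption pairs.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

# 1) Encode the report photo with CLIP (ViT-B/32 checkpoint assumed).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("traffy_report.jpg")            # hypothetical Traffy Fondue photo
pixel = processor(images=image, return_tensors="pt")
with torch.no_grad():
    clip_embed = clip.get_image_features(**pixel)  # shape (1, 512)

# 2) Map the CLIP vector to a GPT-2 prefix with an MLP (ClipCap-style).
#    NOTE: random weights for illustration only; in the paper's setting
#    this mapper would be trained on image-caption pairs.
prefix_len, gpt2_dim = 10, 768                     # assumed prefix length
hidden = (512 + prefix_len * gpt2_dim) // 2
mapper = torch.nn.Sequential(
    torch.nn.Linear(512, hidden),
    torch.nn.Tanh(),
    torch.nn.Linear(hidden, prefix_len * gpt2_dim),
)
with torch.no_grad():
    prefix = mapper(clip_embed).view(1, prefix_len, gpt2_dim)

# 3) Greedy-decode a caption from the prefix embeddings with GPT-2.
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
embeds, token_ids = prefix, []
with torch.no_grad():
    for _ in range(20):                            # cap the caption at 20 tokens
        next_id = gpt2(inputs_embeds=embeds).logits[:, -1, :].argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)
print(tokenizer.decode(token_ids))

# 4) Alternatively, cluster the prefix (or CLIP) vectors directly, as in the
#    follow-up experiment. With many images, stack one flattened vector per
#    row (at least n_clusters rows) and group them with k-means, e.g.:
# from sklearn.cluster import KMeans
# labels = KMeans(n_clusters=8, n_init="auto").fit_predict(vectors)

With a trained mapper, the generated caption describes the report image, and grouping the prefix vectors clusters visually similar reports without relying on the noisy generated text.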
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G., Cao, Z., Zhang, J., Huang, S., Huang, F., Zhou, J., & Si, L. (2022). mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. arXiv:2205.12005v2 [cs.CL]. https://doi.org/10.48550/arXiv.2205.12005
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579−2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426v3 [stat.ML]. https://doi.org/10.48550/arXiv.1802.03426
Meister, C., Vieira, T., & Cotterell, R. (2020). Best-first beam search. Transactions of the Association for Computational Linguistics, 8, 795−809. https://doi.org/10.1162/tacl_a_00346
Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv:2111.09734v1. https://doi.org/10.48550/arXiv.2111.09734
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020v1. https://doi.org/10.48550/arXiv.2103.00020
Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., & Yang, H. (2022). OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 23318–23340). https://proceedings.mlr.press/v162/wang22al/wang22al.pdf