Low-Resource Language Text Generation Performance Comparison: LLaMA 3.1-13B-Instruct vs GPT-4-mini with Dataset Augmentation
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in text generation, yet their effectiveness for Indonesian, a Low-Resource Language (LRL), remains underexplored. This study systematically compares LLaMA 3.1-13B-Instruct and GPT-4-mini on Indonesian text generation using an augmented dataset designed to mitigate data scarcity. Three augmentation strategies (back-translation, synonym substitution via mBERT embeddings, and paraphrasing with GPT-4-mini) were used to expand lexical and syntactic diversity. Quantitatively, GPT-4-mini achieves higher BLEU (0.55 vs 0.52), ROUGE-L (0.64 vs 0.61), METEOR (0.57 vs 0.54), and mBERTScore (0.90 vs 0.88), along with lower perplexity (11.5 vs 12.8), than LLaMA, indicating stronger lexical and semantic alignment. Conversely, LLaMA exhibits greater lexical diversity (Distinct-2 = 0.34 vs 0.31). A human evaluation by four native-speaker raters confirms that GPT-4-mini scores higher on fluency (4.5 vs 4.1), coherence (4.4 vs 4.0), and relevance (4.4 vs 4.3), while LLaMA scores slightly higher on factual accuracy (4.5 vs 4.2). These findings highlight the complementary strengths of the two models: GPT-4-mini for fluent and coherent generation, and LLaMA for factual precision and lexical richness. They also demonstrate the positive impact of data augmentation on Indonesian LLM performance.
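For readers unfamiliar with the Distinct-2 score reported above, the following is a minimal sketch of how a corpus-level Distinct-n value is commonly computed: the number of unique n-grams divided by the total number of n-grams across all generated texts. The function names, the lowercasing, and the whitespace tokenization are illustrative assumptions; the abstract does not state how the study tokenized its outputs or which implementation it used.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Corpus-level Distinct-n: unique n-grams / total n-grams.

    Assumption: lowercased whitespace tokenization, chosen only to keep
    the sketch self-contained; the study's tokenizer is not specified.
    """
    counts: Counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    # Guard against empty input or texts shorter than n tokens.
    return len(counts) / total if total else 0.0


# Hypothetical generated outputs, used only for illustration.
samples = ["saya suka makan nasi", "saya suka minum teh"]
score = distinct_n(samples, n=2)
```

In this toy example the two sentences share one bigram ("saya suka"), so 5 of the 6 bigrams are unique. A higher value means less repetition across outputs, which is the sense in which the abstract describes LLaMA as lexically more diverse.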
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
