Bayesian Meta-Ensemble of Stacked Long Short-Term Memory Convolutional Neural Networks for Detection of AI-Generated Text under Uncertainty
Abstract
The growing misuse of AI-generated text poses a serious risk to information integrity, yet most existing detectors rely on deterministic, single-model architectures that lack uncertainty quantification and are computationally demanding. This paper introduces a Bayesian Meta-Ensemble of stacked LSTM-CNNs for detecting AI-generated text, a lightweight probabilistic framework that addresses these limitations by combining ensemble stacking with variational Bayesian inference. We make three main contributions: (i) a two-tier stacked ensemble trained on resampled subsets that feeds into a Bayesian Neural Network (BNN) meta-learner, enabling uncertainty-aware classification; (ii) a statistical analysis of Monte Carlo sampling behavior using Kolmogorov–Smirnov and Mann–Whitney tests to validate predictive stability; and (iii) the integration of parametric and non-parametric hypothesis testing directly into the prediction pipeline to enhance decision reliability. On the MAGE dataset, we benchmark our model against several state-of-the-art detectors, including FastText, GLTR, Longformer, and DetectGPT. Our model achieves 90.84% human recall, 89.59% average recall, and 0.96 AUROC, outperforming FastText, GLTR, and DetectGPT, and closely matching Longformer while requiring 7.5 times fewer parameters (18.77M vs. 142M). We further investigate predictive stability through normality testing and examine the effect of the Monte Carlo (MC) sample size, confirming that larger samples yield more consistent model outputs. We also demonstrate how statistical inference can enhance decision-making. These results show that a compact Bayesian ensemble can deliver competitive detection accuracy while providing uncertainty estimates, making it suitable for resource-constrained and high-stakes environments.
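To make the prediction pipeline described above concrete, the sketch below illustrates the general idea under stated assumptions rather than reproducing the authors' implementation: a meta-learner whose stochastic forward passes stand in for variational Bayesian inference (MC dropout is used here as a simple substitute), T Monte Carlo draws per input, and two-sample Kolmogorov–Smirnov and Mann–Whitney tests applied to the resulting score distribution before a label is committed. All names (MetaBNN, mc_predict, decide) and the specific test construction are hypothetical.

```python
# Minimal sketch (not the authors' code): Monte Carlo sampling from a
# dropout-based Bayesian meta-learner, followed by the Kolmogorov–Smirnov
# and Mann–Whitney tests that the abstract builds into the pipeline.
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import ks_2samp, mannwhitneyu

class MetaBNN(nn.Module):
    """Hypothetical meta-learner over the stacked LSTM-CNN base-model scores.
    Dropout is left active at inference to approximate variational sampling."""
    def __init__(self, n_base_models: int, hidden: int = 32, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_base_models, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),      # stays on at test time (MC dropout)
            nn.Linear(hidden, 2),    # two classes: human vs. AI-generated
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def mc_predict(model: nn.Module, x: torch.Tensor, T: int = 100) -> np.ndarray:
    """Draw T Monte Carlo samples of the 'AI-generated' class probability."""
    model.train()                    # keep dropout stochastic between draws
    with torch.no_grad():
        draws = [torch.softmax(model(x), dim=-1)[:, 1] for _ in range(T)]
    return torch.stack(draws).cpu().numpy()      # shape: (T, batch)

def decide(ai_scores: np.ndarray, alpha: float = 0.05):
    """Commit to a label only when the MC score distribution for 'AI-generated'
    differs significantly from its human-class complement; otherwise abstain."""
    human_scores = 1.0 - ai_scores
    ks_p = ks_2samp(ai_scores, human_scores).pvalue
    mw_p = mannwhitneyu(ai_scores, human_scores, alternative="two-sided").pvalue
    if max(ks_p, mw_p) < alpha:      # both tests must reject at level alpha
        label = "AI-generated" if ai_scores.mean() > 0.5 else "human"
        return label, (ks_p, mw_p)
    return "uncertain", (ks_p, mw_p)

# Usage with dummy base-model scores for a single document:
torch.manual_seed(0)
model = MetaBNN(n_base_models=3)
x = torch.tensor([[0.91, 0.87, 0.78]])           # first-tier ensemble outputs
samples = mc_predict(model, x, T=200)[:, 0]      # 200 MC draws for this input
label, (ks_p, mw_p) = decide(samples)
print(f"{label}  KS p={ks_p:.4f}  MW p={mw_p:.4f}")
```

A larger T tightens the empirical score distribution, which is consistent with the abstract's observation that larger MC sample sizes produce more consistent outputs; the abstention branch is one plausible way hypothesis testing could be wired into the decision step.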
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57. https://doi.org/10.1007/s10462-024-10888-y
Li, Z.; Zhang, W.; Zhang, H.; Fang, X. Global digital compact: A mechanism for the governance of online discriminatory and misleading content generation. Int. J. Hum.-Comput. Interact. 2025, 41(2), 1381–1396. https://doi.org/10.1080/10447318.2024.2314350
Stokel-Walker, C. AI bot ChatGPT writes smart essays—Should professors worry? Nature 2022. https://doi.org/10.1038/d41586-022-04397-7
Ghosal, S. S.; Chakraborty, S.; Geiping, J.; Huang, F.; Manocha, D.; Bedi, A. S. Towards possibilities & impossibilities of AI-generated text detection: A survey. arXiv 2023, arXiv:2310.15264.
Solaiman, I.; et al. Release strategies and the social impacts of language models. arXiv 2019, arXiv:1908.09203.
Zellers, R.; et al. Defending against neural fake news. In Advances in Neural Information Processing Systems; 2019; Vol. 32.
Jawahar, G.; Abdul-Mageed, M.; Lakshmanan, L. V. Automatic detection of machine generated text: A critical survey. arXiv 2020, arXiv:2011.01314.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT; 2019; pp 4171–4186.
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
Liu, Y.; et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
Pawlowski, N.; Brock, A.; Lee, M. C.; Rajchl, M.; Glocker, B. Implicit weight uncertainty in neural networks. arXiv 2017, arXiv:1711.01297.
Chen, W.; Li, B.; Zhang, R.; Li, Y. Bayesian computation in deep learning. arXiv 2025, arXiv:2502.18300.
Li, Y.; et al. MAGE: Machine-generated text detection in the wild. arXiv 2024, arXiv:2305.13242. https://doi.org/10.48550/arXiv.2305.13242
Zhang, J.; Li, Y.; Tian, J.; Li, T. LSTM-CNN hybrid model for text classification. In Proc. IEEE IAEAC; 2018; pp 1675–1680. https://doi.org/10.1109/IAEAC.2018.8577620
Xiao, L.; Wang, G.; Zuo, Y. Research on patent text classification based on Word2Vec and LSTM. In Proc. ISCID; 2018; pp 71–74. https://doi.org/10.1109/ISCID.2018.00023
Xie, J.; Chen, B.; Gu, X.; Liang, F.; Xu, X. Self-attention-based BiLSTM model for short text fine-grained sentiment classification. IEEE Access 2019, 7, 180558–180570. https://doi.org/10.1109/ACCESS.2019.2957510
Wadud, M. A. H.; Kabir, M. M.; Mridha, M. F.; Ali, M. A.; Hamid, M. A.; Monowar, M. M. Managing offensive text in social media: A text classification approach using LSTM-BOOST. Int. J. Inf. Manage. Data Insights 2022, 2(2), 100095. https://doi.org/10.1016/j.jjimei.2022.100095
Anggrainingsih, R.; Hassan, G. M.; Datta, A. Evaluating BERT-based language models for detecting misinformation. Neural Comput. Appl. 2025. https://doi.org/10.1007/s00521-025-11101-z
O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882. https://doi.org/10.48550/arXiv.1408.5882
Alshubaily, I. TextCNN with attention for text classification. arXiv 2021, arXiv:2108.01921. https://doi.org/10.48550/arXiv.2108.01921
Zhao, W.; Zhu, L.; Wang, M.; Zhang, X.; Zhang, J. WTL-CNN: A news text classification method based on weighted word embedding. Connect. Sci. 2022, 34(1), 2291–2312. https://doi.org/10.1080/09540091.2022.2117274
Ran, Y.; Han, H. Text classification algorithm based on sparse distributed representation. In Proc. IEEE AEECA; 2020; pp 876–880. https://doi.org/10.1109/AEECA49918.2020.9213479
Li, C.; Zhan, G.; Li, Z. News text classification based on improved Bi-LSTM-CNN. In Proc. IEEE ITME; 2018; pp 890–893. https://doi.org/10.1109/ITME.2018.00199
Jang, B.; Kim, M.; Harerimana, G.; Kang, S.; Kim, J. W. Bi-LSTM model to increase accuracy in text classification. Appl. Sci. 2020, 10(17), 5841. https://doi.org/10.3390/app10175841
Abdullah, A. A.; Hassan, M. M.; Mustafa, Y. T. Bayesian deep learning in healthcare: Applications and challenges. IEEE Access 2022, 10, 36538–36562. https://doi.org/10.1109/ACCESS.2022.3163384
Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems; 2011; Vol. 24.
Abburi, H.; et al. A simple yet efficient ensemble approach for AI-generated text detection. In Proc. GEM Workshop; Association for Computational Linguistics: Singapore, 2023; pp 413–421.
Aggarwal, K.; Singh, S.; Parul; Pal, V.; Yadav, S. S. Enhancing accuracy in AI-generated text detection using ensemble modelling. In Proc. IEEE TENSYMP; 2024; pp 1–8. https://doi.org/10.1109/TENSYMP61132.2024.10752173
Massey, F. J., Jr. The Kolmogorov–Smirnov test for goodness of fit. J. Am. Stat. Assoc. 1951, 46(253), 68–78. https://doi.org/10.1080/01621459.1951.10500769
Two-Sample Kolmogorov–Smirnov Test. Real Statistics Using Excel. https://real-statistics.com/non-parametric-tests/goodness-of-fit-tests/two-sample-kolmogorov-smirnov-test/ (accessed Jul 23, 2025).
Divine, G. W.; Norton, H. J.; Barón, A. E.; Juarez-Colunga, E. The Wilcoxon–Mann–Whitney procedure fails as a test of medians. Am. Stat. 2018, 72(3), 278–286.
Feltovich, N. Nonparametric tests of differences in medians. Exp. Econ. 2003, 6(3), 273–297.
Yue, S.; Wang, C. Y. Power of the Mann–Whitney test. Stoch. Environ. Res. Risk Assess. 2002, 16(4), 307–323.
Mishra, P.; Singh, U.; Pandey, C. M.; Mishra, P.; Pandey, G. Application of Student’s t-test. Ann. Card. Anaesth. 2019, 22(4), 407–411.
Lumley, T.; Diehr, P.; Emerson, S.; Chen, L. The importance of the normality assumption in large public health data sets. Annu. Rev. Public Health 2002, 23, 151–169.
Ratcliffe, J. F. The effect on the t distribution of non-normality in the sampled population. Appl. Stat. 1968, 17(1), 42.
Sawilowsky, S. S.; Blair, R. C. A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychol. Bull. 1992, 111(2), 352–360.
Althouse, L. A.; Ware, W. B.; Ferron, J. M. Detecting departures from normality: A Monte Carlo simulation of a new omnibus test based on moments. Paper presented at the Annual Meeting of the American Educational Research Association, 1998.
Razali, N. M.; Yap, B. W. Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J. Stat. Model. Anal. 2011, 2(1), 21–33.
UMA Technology. The dangers of AI writing and how to spot AI-generated text. https://umatechnology.org/the-dangers-of-ai-writing-and-how-to-spot-ai-generated-text/ (accessed Mar 14, 2025).
AI’s Jurassic Park moment. Commun. ACM. https://cacm.acm.org/blogcacm/ais-jurassic-park-moment/ (accessed Mar 14, 2025).
Vincent, J. AI-generated answers temporarily banned on Stack Overflow. The Verge. https://www.theverge.com/ (accessed Mar 14, 2025).
Dergaa, I.; Chamari, K.; Zmijewski, P.; Saad, H. B. AI-generated text in academic writing. Biol. Sport 2023, 40(2), 615–622.
Bohannon, J. Who’s afraid of peer review? Science 2013, 342(6154), 60–65.
Fagerland, M. W. t-tests, non-parametric tests, and large studies—a paradox of statistical practice? BMC Med. Res. Methodol. 2012, 12, 78.
Zhang, J.; et al. Improving Bayesian neural networks by adversarial sampling. Proc. AAAI 2022, 36(9), 10110–10117.