Bayesian Meta-Ensemble of Stacked Long Short-Term Memory Convolutional Neural Networks for Detection of AI-Generated Text under Uncertainty
Abstract
The growing misuse of AI-generated text poses a serious risk to information integrity, yet most existing detectors rely on deterministic, single-model architectures that lack uncertainty quantification and are computationally demanding. This paper introduces a Bayesian Meta-Ensemble of stacked LSTM-CNNs for detecting AI-generated text, a lightweight probabilistic framework that addresses these limitations by combining ensemble stacking with variational Bayesian inference. We make three main contributions: (i) a two-tier stacked ensemble trained on resampled subsets that feeds into a Bayesian Neural Network (BNN) meta-learner, enabling uncertainty-aware classification; (ii) a statistical analysis of Monte Carlo sampling behavior using Kolmogorov–Smirnov and Mann–Whitney tests to validate predictive stability; and (iii) the integration of parametric and non-parametric hypothesis testing directly into the prediction pipeline to enhance decision reliability. On the MAGE dataset, we benchmark our model against several state-of-the-art detectors, including FastText, GLTR, Longformer, and DetectGPT. Our model achieves 90.84% human recall, 89.59% average recall, and 0.96 AUROC, outperforming FastText, GLTR, and DetectGPT, and closely matching Longformer while requiring 7.5 times fewer parameters (18.77M vs. 142M). We further investigate predictive stability through normality testing and examine the effect of the Monte Carlo (MC) sample size, confirming that larger samples yield more consistent model outputs. We also demonstrate how statistical inference can enhance decision-making. These results show that a compact Bayesian ensemble can deliver competitive detection accuracy while providing uncertainty estimates, making it suitable for resource-constrained and high-stakes environments.
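To make the prediction pipeline described above concrete, the sketch below illustrates the general idea under stated assumptions rather than reproducing the authors' implementation: a meta-learner whose stochastic forward passes stand in for variational Bayesian inference (MC dropout is used here as a simple substitute), T Monte Carlo draws per input, and two-sample Kolmogorov–Smirnov and Mann–Whitney tests applied to the resulting score distribution before a label is committed. All names (MetaBNN, mc_predict, decide) and the specific test construction are hypothetical.

```python
# Minimal sketch (not the authors' code): Monte Carlo sampling from a
# dropout-based Bayesian meta-learner, followed by the Kolmogorov–Smirnov
# and Mann–Whitney tests that the abstract builds into the pipeline.
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import ks_2samp, mannwhitneyu

class MetaBNN(nn.Module):
    """Hypothetical meta-learner over the stacked LSTM-CNN base-model scores.
    Dropout is left active at inference to approximate variational sampling."""
    def __init__(self, n_base_models: int, hidden: int = 32, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_base_models, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),      # stays on at test time (MC dropout)
            nn.Linear(hidden, 2),    # two classes: human vs. AI-generated
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def mc_predict(model: nn.Module, x: torch.Tensor, T: int = 100) -> np.ndarray:
    """Draw T Monte Carlo samples of the 'AI-generated' class probability."""
    model.train()                    # keep dropout stochastic between draws
    with torch.no_grad():
        draws = [torch.softmax(model(x), dim=-1)[:, 1] for _ in range(T)]
    return torch.stack(draws).cpu().numpy()      # shape: (T, batch)

def decide(ai_scores: np.ndarray, alpha: float = 0.05):
    """Commit to a label only when the MC score distribution for 'AI-generated'
    differs significantly from its human-class complement; otherwise abstain."""
    human_scores = 1.0 - ai_scores
    ks_p = ks_2samp(ai_scores, human_scores).pvalue
    mw_p = mannwhitneyu(ai_scores, human_scores, alternative="two-sided").pvalue
    if max(ks_p, mw_p) < alpha:      # both tests must reject at level alpha
        label = "AI-generated" if ai_scores.mean() > 0.5 else "human"
        return label, (ks_p, mw_p)
    return "uncertain", (ks_p, mw_p)

# Usage with dummy base-model scores for a single document:
torch.manual_seed(0)
model = MetaBNN(n_base_models=3)
x = torch.tensor([[0.91, 0.87, 0.78]])           # first-tier ensemble outputs
samples = mc_predict(model, x, T=200)[:, 0]      # 200 MC draws for this input
label, (ks_p, mw_p) = decide(samples)
print(f"{label}  KS p={ks_p:.4f}  MW p={mw_p:.4f}")
```

A larger T tightens the empirical score distribution, which is consistent with the abstract's observation that larger MC sample sizes produce more consistent outputs; the abstention branch is one plausible way hypothesis testing could be wired into the decision step.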
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57. https://doi.org/10.1007/s10462-024-10888-y
Li, Z.; Zhang, W.; Zhang, H.; Fang, X. Global digital compact: A mechanism for the governance of online discriminatory and misleading content generation. Int. J. Hum.-Comput. Interact. 2025, 41(2), 1381–1396. https://doi.org/10.1080/10447318.2024.2314350
Stokel-Walker, C. AI bot ChatGPT writes smart essays—Should professors worry? Nature 2022. https://doi.org/10.1038/d41586-022-04397-7
Ghosal, S. S.; Chakraborty, S.; Geiping, J.; Huang, F.; Manocha, D.; Bedi, A. S. Towards possibilities & impossibilities of AI-generated text detection: A survey. arXiv 2023, arXiv:2310.15264.
Solaiman, I.; et al. Release strategies and the social impacts of language models. arXiv 2019, arXiv:1908.09203.
Zellers, R.; et al. Defending against neural fake news. In Advances in Neural Information Processing Systems; 2019; Vol. 32.
Jawahar, G.; Abdul-Mageed, M.; Lakshmanan, L. V. Automatic detection of machine generated text: A critical survey. arXiv 2020, arXiv:2011.01314.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT; 2019; pp 4171–4186.
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
Liu, Y.; et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
Pawlowski, N.; Brock, A.; Lee, M. C.; Rajchl, M.; Glocker, B. Implicit weight uncertainty in neural networks. arXiv 2017, arXiv:1711.01297.
Chen, W.; Li, B.; Zhang, R.; Li, Y. Bayesian computation in deep learning. arXiv 2025, arXiv:2502.18300.
Li, Y.; et al. MAGE: Machine-generated text detection in the wild. arXiv 2024, arXiv:2305.13242. https://doi.org/10.48550/arXiv.2305.13242
Zhang, J.; Li, Y.; Tian, J.; Li, T. LSTM-CNN hybrid model for text classification. In Proc. IEEE IAEAC; 2018; pp 1675–1680. https://doi.org/10.1109/IAEAC.2018.8577620
Xiao, L.; Wang, G.; Zuo, Y. Research on patent text classification based on Word2Vec and LSTM. In Proc. ISCID; 2018; pp 71–74. https://doi.org/10.1109/ISCID.2018.00023
Xie, J.; Chen, B.; Gu, X.; Liang, F.; Xu, X. Self-attention-based BiLSTM model for short text fine-grained sentiment classification. IEEE Access 2019, 7, 180558–180570. https://doi.org/10.1109/ACCESS.2019.2957510
Wadud, M. A. H.; Kabir, M. M.; Mridha, M. F.; Ali, M. A.; Hamid, M. A.; Monowar, M. M. Managing offensive text in social media: A text classification approach using LSTM-BOOST. Int. J. Inf. Manage. Data Insights 2022, 2(2), 100095. https://doi.org/10.1016/j.jjimei.2022.100095
Anggrainingsih, R.; Hassan, G. M.; Datta, A. Evaluating BERT-based language models for detecting misinformation. Neural Comput. Appl. 2025. https://doi.org/10.1007/s00521-025-11101-z
O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882. https://doi.org/10.48550/arXiv.1408.5882
Alshubaily, I. TextCNN with attention for text classification. arXiv 2021, arXiv:2108.01921. https://doi.org/10.48550/arXiv.2108.01921
Zhao, W.; Zhu, L.; Wang, M.; Zhang, X.; Zhang, J. WTL-CNN: A news text classification method based on weighted word embedding. Connect. Sci. 2022, 34(1), 2291–2312. https://doi.org/10.1080/09540091.2022.2117274
Ran, Y.; Han, H. Text classification algorithm based on sparse distributed representation. In Proc. IEEE AEECA; 2020; pp 876–880. https://doi.org/10.1109/AEECA49918.2020.9213479
Li, C.; Zhan, G.; Li, Z. News text classification based on improved Bi-LSTM-CNN. In Proc. IEEE ITME; 2018; pp 890–893. https://doi.org/10.1109/ITME.2018.00199
Jang, B.; Kim, M.; Harerimana, G.; Kang, S.; Kim, J. W. Bi-LSTM model to increase accuracy in text classification. Appl. Sci. 2020, 10(17), 5841. https://doi.org/10.3390/app10175841
Abdullah, A. A.; Hassan, M. M.; Mustafa, Y. T. Bayesian deep learning in healthcare: Applications and challenges. IEEE Access 2022, 10, 36538–36562. https://doi.org/10.1109/ACCESS.2022.3163384
Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems; 2011; Vol. 24.
Abburi, H.; et al. A simple yet efficient ensemble approach for AI-generated text detection. In Proc. GEM Workshop; Association for Computational Linguistics: Singapore, 2023; pp 413–421.
Aggarwal, K.; Singh, S.; Parul; Pal, V.; Yadav, S. S. Enhancing accuracy in AI-generated text detection using ensemble modelling. In Proc. IEEE TENSYMP; 2024; pp 1–8. https://doi.org/10.1109/TENSYMP61132.2024.10752173
Massey, F. J., Jr. The Kolmogorov–Smirnov test for goodness of fit. J. Am. Stat. Assoc. 1951, 46(253), 68–78. https://doi.org/10.1080/01621459.1951.10500769
Two-Sample Kolmogorov–Smirnov Test. Real Statistics Using Excel. https://real-statistics.com/non-parametric-tests/goodness-of-fit-tests/two-sample-kolmogorov-smirnov-test/ (accessed Jul 23, 2025).
Divine, G. W.; Norton, H. J.; Barón, A. E.; Juarez-Colunga, E. The Wilcoxon–Mann–Whitney procedure fails as a test of medians. Am. Stat. 2018, 72(3), 278–286.
Feltovich, N. Nonparametric tests of differences in medians. Exp. Econ. 2003, 6(3), 273–297.
Yue, S.; Wang, C. Y. Power of the Mann–Whitney test. Stoch. Environ. Res. Risk Assess. 2002, 16(4), 307–323.
Mishra, P.; Singh, U.; Pandey, C. M.; Mishra, P.; Pandey, G. Application of Student’s t-test. Ann. Card. Anaesth. 2019, 22(4), 407–411.
Lumley, T.; Diehr, P.; Emerson, S.; Chen, L. The importance of the normality assumption in large public health data sets. Annu. Rev. Public Health 2002, 23, 151–169.
Ratcliffe, J. F. The effect on the t distribution of non-normality in the sampled population. Appl. Stat. 1968, 17(1), 42.
Sawilowsky, S. S.; Blair, R. C. A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychol. Bull. 1992, 111(2), 352–360.
Althouse, L. A.; Ware, W. B.; Ferron, J. M. Detecting departures from normality: A Monte Carlo simulation of a new omnibus test based on moments. Paper presented at the Annual Meeting of the American Educational Research Association, 1998.
Razali, N. M.; Yap, B. W. Power comparisons of Shapiro–Wilk, Kolmogorov–Smirnov, Lilliefors and Anderson–Darling tests. J. Stat. Model. Anal. 2011, 2(1), 21–33.
UMA Technology. The dangers of AI writing and how to spot AI-generated text. https://umatechnology.org/the-dangers-of-ai-writing-and-how-to-spot-ai-generated-text/ (accessed Mar 14, 2025).
AI’s Jurassic Park moment. Commun. ACM. https://cacm.acm.org/blogcacm/ais-jurassic-park-moment/ (accessed Mar 14, 2025).
Vincent, J. AI-generated answers temporarily banned on Stack Overflow. The Verge. https://www.theverge.com/ (accessed Mar 14, 2025).
Dergaa, I.; Chamari, K.; Zmijewski, P.; Saad, H. B. AI-generated text in academic writing. Biol. Sport 2023, 40(2), 615–622.
Bohannon, J. Who’s afraid of peer review? Science 2013, 342(6154), 60–65.
Fagerland, M. W. t-tests, non-parametric tests, and large studies—a paradox of statistical practice? BMC Med. Res. Methodol. 2012, 12, 78.
Zhang, J.; et al. Improving Bayesian neural networks by adversarial sampling. Proc. AAAI 2022, 36(9), 10110–10117.