Development of a Python-Based Program for ASVs Integration and Deduplication in Microbiome Datasets

Tawin Fakaim
Tanaporn Uengwetwanit
Monthol Fak-Aim
Piyapong Olranthichachat
Arnon Jankasem

Abstract

Integrating amplicon sequence variant (ASV) datasets from multiple experiments for microbiome analysis often leads to data redundancy, resulting in slow processing, inaccurate taxonomic classification, and an increased burden of manual data management. This study aimed to develop and evaluate the performance of a Python-based program for ASV integration and deduplication. The program was developed under the System Development Life Cycle (SDLC) framework and employs an exact-match method to preserve ASV sequence resolution. It supports flexible operation through a graphical user interface (GUI), Jupyter environment, and command-line interface (CLI).
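The exact-match idea can be illustrated with a minimal sketch. It is not the program described in the article: the function name `merge_asvs`, the input layout (dataset name mapped to `(asv_id, sequence)` pairs), the upper-casing of sequences before comparison, and the MD5-based identifier are all assumptions made for illustration. Two ASVs are treated as the same only when their sequences are identical character for character, so a single-base difference keeps them separate.

```python
import hashlib


def merge_asvs(datasets):
    """Merge ASVs from several datasets, keeping one entry per exact sequence.

    `datasets` maps a dataset name to a list of (asv_id, sequence) pairs.
    Returns a dict keyed by sequence; each value records a stable identifier
    and every (dataset, original_id) pair in which that sequence occurred.
    """
    merged = {}
    for dataset_name, records in datasets.items():
        for asv_id, sequence in records:
            # Assumption: whitespace and letter case are not meaningful.
            key = sequence.strip().upper()
            entry = merged.get(key)
            if entry is None:
                # Hypothetical ID scheme: a hash of the sequence itself, so
                # the same sequence gets the same ID in every dataset.
                merged[key] = {
                    "id": hashlib.md5(key.encode("ascii")).hexdigest(),
                    "sources": [(dataset_name, asv_id)],
                }
            else:
                entry["sources"].append((dataset_name, asv_id))
    return merged


example = {
    "exp1": [("a1", "ACGT"), ("a2", "TTGA")],
    "exp2": [("b1", "acgt"), ("b2", "ACGTA")],
}
merged = merge_asvs(example)
print(len(merged))  # 3 unique sequences remain out of 4 input records
```

Keying on the full sequence, rather than clustering by similarity, is what keeps single-nucleotide resolution intact; the trade-off is that sequences trimmed to different lengths in different experiments will not be merged.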


Technical performance was evaluated on sets of 2 to 10 datasets comprising 33,445 to 117,678 ASV sequences. The program achieved 100% deduplication accuracy, with average processing times ranging from 4.29 seconds for the smallest set to 35.58 seconds for the largest. User satisfaction with the program was rated good (mean score = 4.29), with the highest satisfaction reported for the accuracy of the results. Overall, the program effectively reduces ASV redundancy and enhances the reliability of microbiome dataset preparation for downstream bioinformatics analyses.
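One way such an evaluation could be set up is sketched below. The abstract does not describe the actual test protocol, so everything here is an assumption: the helper names `deduplicate`, `dedup_accuracy`, and `timed` are hypothetical, accuracy is defined as agreement between the output and an independent set-based reference, and timing is a mean over repeated runs using `time.perf_counter`.

```python
import time


def deduplicate(sequences):
    """Keep the first occurrence of each exact sequence, preserving order."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique


def dedup_accuracy(sequences, unique):
    """Score the output against a set-based reference (hypothetical metric).

    Returns 1.0 only when the output holds every distinct input sequence
    exactly once and nothing else; returns 0.0 if duplicates remain.
    """
    expected = set(sequences)
    got = set(unique)
    if len(unique) != len(got):
        return 0.0  # the output still contains duplicates
    if not expected and not got:
        return 1.0  # nothing to deduplicate, nothing returned
    return len(expected & got) / len(expected | got)


def timed(func, *args, repeats=3):
    """Run func(*args) `repeats` times; return (last result, mean seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        durations.append(time.perf_counter() - start)
    return result, sum(durations) / len(durations)


sequences = ["ACGT", "ACGT", "TTGA", "ACGTA", "TTGA"]
unique, mean_seconds = timed(deduplicate, sequences)
print(dedup_accuracy(sequences, unique))  # 1.0
```

Under this definition, a reported accuracy of 100% would mean the output matched the reference exactly, with no duplicate retained and no distinct sequence lost.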

Article Details

How to Cite
Fakaim, T., Uengwetwanit, T., Fak-Aim, M., Olranthichachat, P., & Jankasem, A. (2025). Development of a Python-Based Program for ASVs Integration and Deduplication in Microbiome Datasets. Journal of Applied Information Technology, 11(2), 18–32. Retrieved from https://ph02.tci-thaijo.org/index.php/project-journal/article/view/260477
