Development and Evaluation of a Thai Automatic Speech Recognition Model Using the Conformer Model

Main Article Content

Siwakorn Kaewwichai
Kwanchiva Thangthai
Pattara Tipakorn
Wasit Limprasert

Abstract

This research project aims to develop and evaluate the performance of an Automatic Speech Recognition (ASR) system for the Thai language by leveraging the Conformer architecture. The Conformer integrates the strengths of Convolutional Neural Networks (CNNs), which effectively capture local acoustic features, with Transformers, which model long-range contextual dependencies; this combination enhances overall Thai speech transcription capability. The experiments were conducted on a diverse Thai speech dataset encompassing various accents, speaker demographics, and acoustic conditions, with samples drawn from Common Voice, regional dialects, elderly speakers, and audio with background noise from sources such as YouTube and podcasts. Performance evaluation metrics included the Word Error Rate (WER), Insertion Error Rate (IER), and Deletion Error Rate (DER), along with model-related factors such as the number of parameters and processing efficiency measured by the Inverse Real-Time Factor (RTFx). In conclusion, the study demonstrates the moderate potential of the Conformer architecture for Thai ASR and highlights the need for further development, including expanding the quantity and diversity of training data to reflect real-world conditions and improving model robustness in complex acoustic environments. Moreover, the Fast Conformer model (115M parameters) has approximately 13 times fewer parameters than comparable Whisper Large models (1.54B parameters) and achieves an RTFx of approximately 6,400, about 44 times faster than a baseline Whisper Large v3 model (RTFx 146), which suggests strong suitability for streaming and real-time ASR applications.
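As an illustrative sketch of the reported metrics, the Python code below computes WER, IER, and DER from a word-level alignment and RTFx from audio and decoding durations. The Thai transcripts and timing numbers are hypothetical, and the input is assumed to be already word-segmented (space-delimited), since Thai text has no word boundaries and would normally be segmented before scoring.

# Illustrative computation of the metrics discussed above: WER, IER, DER, and RTFx.
# Assumes reference and hypothesis are already word-segmented (Thai has no spaces,
# so a word segmenter would normally be applied first).

def align_counts(ref_words, hyp_words):
    """Count substitutions, deletions, and insertions via Levenshtein alignment."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] = (edit_distance, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)   # only deletions remain
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)   # only insertions remain
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            sub = dp[i - 1][j - 1]
            dele = dp[i - 1][j]
            ins = dp[i][j - 1]
            dp[i][j] = min(
                (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),
                (dele[0] + 1, dele[1], dele[2] + 1, dele[3]),
                (ins[0] + 1, ins[1], ins[2], ins[3] + 1),
            )
    _, subs, dels, inss = dp[R][H]
    return subs, dels, inss

def asr_metrics(reference, hypothesis, audio_seconds, decode_seconds):
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    subs, dels, inss = align_counts(ref_words, hyp_words)
    n = max(len(ref_words), 1)
    return {
        "WER": (subs + dels + inss) / n,        # overall word error rate
        "IER": inss / n,                        # insertion error rate
        "DER": dels / n,                        # deletion error rate
        "RTFx": audio_seconds / decode_seconds, # >1 means faster than real time
    }

# Hypothetical example: 3 s of audio decoded in 0.02 s gives RTFx = 150.
print(asr_metrics("สวัสดี ครับ ผม ชื่อ สมชาย",
                  "สวัสดี ครับ ผม ชือ สมชาย นะ",
                  audio_seconds=3.0, decode_seconds=0.02))

Under this definition, the reported RTFx of roughly 6,400 means the model decodes about 6,400 seconds of audio per second of compute under the benchmark conditions.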

Article Details

How to Cite
Siwakorn Kaewwichai, Kwanchiva Thangthai, Pattara Tipakorn, & Wasit Limprasert. (2025). Development and Evaluation of a Thai Automatic Speech Recognition Model Using the Conformer Model. Science & Technology Asia, 30(3), 60–69. Retrieved from https://ph02.tci-thaijo.org/index.php/SciTechAsia/article/view/261609
Section
Articles

References

Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. arXiv [eess.AS]. 2022 Dec 6 [cited 2025 Jul 29]. Available from: https://cdn.openai.com/papers/whisper.pdf

Gulati A, Qin J, Chiu C-C, Parmar N, Zhang Y, Yu J, et al. Conformer: Convolution-augmented Transformer for Speech Recognition. In: Interspeech 2020. Baixas, France: ISCA; 2020 Oct. p. 5036–40.

NVIDIA. NVIDIA NeMo. [cited 2025 Jul 29]. Available from: https://nvidia.github.io/NeMo/publications/category/automatic-speech-recognition/

Kaewwichai S. Development and Evaluation of Thai Automatic Speech Recognition Model using Conformer Model. [cited 2025 Jul 29]. Available from: https://drive.google.com/file/d/1qvW906-GrNsDLcTDUlsWA69TpQSqRiZ8x/view?usp=drive_link

Rekesh D, et al. Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition. 2023 [cited 2025 Jul 29]. Available from: https://research.nvidia.com/labs/convai/publications/2023/2023-fastconformer/

OpenAI. whisper-large-v3 Model by OpenAI. NVIDIA NIM. [cited 2025 Jul 29]. Available from: https://build.nvidia.com/openai/whisper-large-v3/modelcard

University of Florida. NaviGator AI. NaviGator AI Docs. [cited 2025 Jul 29]. Available from: https://docs.ai.it.ufl.edu/docs/navigator_models/models/oai-whisper-large-v3/

Hugging Face. openai/whisper-large-v3. [cited 2025 Jul 29]. Available from: https://huggingface.co/openai/whisper-large-v3

Papers with Code. Speech recognition on Common Voice Thai. [cited 2025 Jul 29]. Available from: https://paperswithcode.com/sota/speech-recognition-on-common-voice-thai

SEACrowd. gowajee. Hugging Face. [cited 2025 Jul 29]. Available from: https://huggingface.co/datasets/SEACrowd/gowajee

SLSCU. Thai dialect corpus. GitHub. [cited 2025 Jul 29]. Available from: https://github.com/SLSCU/thai-dialect-corpus

Wang Data Market. [cited 2025 Jul 29]. Available from: https://www.wang.in.th/dataset/64a228ab-41c99c04544f2556

SpeechColab. gigaspeech2. Hugging Face; 2024. doi:10.57967/HF/3107