Development and Evaluation of a Thai Automatic Speech Recognition Model Using the Conformer Model
Abstract
This research project aims to develop and evaluate the performance of an Automatic Speech Recognition (ASR) system for the Thai language by leveraging the Conformer architecture. Conformers integrate the strengths of Convolutional Neural Networks (CNNs), which effectively capture local acoustic features, and Transformers, which model long-range contextual dependencies. This combination enhances the overall capability of Thai speech transcription. The experiments were conducted using a diverse Thai speech dataset encompassing various accents, speaker demographics, and acoustic conditions. The dataset includes samples from Common Voice, regional dialects, elderly speakers, and audio with background noise from sources such as YouTube and podcasts. Performance evaluation metrics included Word Error Rate (WER), Insertion Error Rate (IER), and Deletion Error Rate (DER), along with model-related factors such as the number of parameters and processing efficiency measured by the Inverse Real-Time Factor (RTFx). In conclusion, the study demonstrates the moderate potential of the Conformer architecture for Thai ASR tasks, highlighting the need for further development. This includes expanding the quantity and diversity of training data to reflect real-world conditions and enhancing model robustness to complex acoustic environments. Moreover, the Fast Conformer model (115M parameters) contains approximately 13 times fewer parameters than comparable Whisper Large models (1.54B parameters) and achieves an RTFx of approximately 6400, about 44 times faster than a baseline Whisper Large v3 model (RTFx 146). This suggests its strong suitability for streaming and real-time ASR applications.
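The metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: it assumes whitespace-tokenized transcripts (real Thai evaluation would first require word segmentation, since Thai is written without spaces) and defines IER and DER as insertions and deletions per reference word, with RTFx as audio duration divided by processing time:

```python
def wer_breakdown(reference, hypothesis):
    """Word-level Levenshtein alignment; returns WER, IER, DER.

    Assumes whitespace tokenization -- Thai text would need a word
    segmenter (e.g. a dictionary-based tokenizer) before this step.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Each DP cell holds (total_errors, substitutions, insertions, deletions).
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            sub, ins, dele = dp[i - 1][j - 1], dp[i][j - 1], dp[i - 1][j]
            dp[i][j] = min(
                (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),    # substitution
                (ins[0] + 1, ins[1], ins[2] + 1, ins[3]),    # insertion
                (dele[0] + 1, dele[1], dele[2], dele[3] + 1) # deletion
            )
    total, _subs, inserts, deletes = dp[len(ref)][len(hyp)]
    n = max(len(ref), 1)
    return {"WER": total / n, "IER": inserts / n, "DER": deletes / n}


def rtfx(audio_seconds, wall_seconds):
    """Inverse real-time factor: values above 1 mean faster than real time."""
    return audio_seconds / wall_seconds
```

For example, `wer_breakdown("a b c", "a x c d")` aligns with one substitution and one insertion, giving WER 2/3 and IER 1/3; an RTFx of 6400 means 6400 seconds of audio are transcribed per second of compute.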
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. arXiv [eess.AS]. 2022 Dec 6 [cited 2025 Jul 29]. Available from: https://cdn.openai.com/papers/whisper.pdf
Gulati A, Qin J, Chiu C-C, Parmar N, Zhang Y, Yu J, et al. Conformer: Convolution-augmented Transformer for Speech Recognition. In: Interspeech 2020. Baixas, France: ISCA; 2020 Oct. p. 5036–40.
NVIDIA. NVIDIA NeMo. [cited 2025 Jul 29]. Available from: https://nvidia.github.io/NeMo/publications/category/automatic-speech-recognition/
Kaewwichai S. Development and Evaluation of Thai Automatic Speech Recognition Model using Conformer Model. [cited 2025 Jul 29]. Available from: https://drive.google.com/file/d/1qvW906-GrNsDLcTDUlsWA69TpQSqRiZ8x/view?usp=drive_link
Rekesh D, et al. Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition. 2023 [cited 2025 Jul 29]. Available from: https://research.nvidia.com/labs/convai/publications/2023/2023-fastconformer/
OpenAI. whisper-large-v3 Model by OpenAI. NVIDIA NIM. [cited 2025 Jul 29]. Available from: https://build.nvidia.com/openai/whisper-large-v3/modelcard
University of Florida. NaviGator AI. NaviGator AI Docs. [cited 2025 Jul 29]. Available from: https://docs.ai.it.ufl.edu/docs/navigator_models/models/oai-whisper-large-v3/
Hugging Face. openai/whisper-large-v3. [cited 2025 Jul 29]. Available from: https://huggingface.co/openai/whisper-large-v3
Papers with Code. Speech recognition on Common Voice Thai. [cited 2025 Jul 29]. Available from: https://paperswithcode.com/sota/speech-recognition-on-common-voice-thai
SEACrowd. gowajee. [cited 2025 Jul 29]. Available from: https://huggingface.co/datasets/SEACrowd/gowajee
SLSCU. Thai dialect corpus. GitHub. [cited 2025 Jul 29]. Available from: https://github.com/SLSCU/thai-dialect-corpus
Wang and Data Market. Data Market. [cited 2025 Jul 29]. Available from: https://www.wang.in.th/dataset/64a228ab-41c99c04544f2556
SpeechColab. gigaspeech2. Hugging Face; 2024. doi:10.57967/HF/3107