Enhancement of Character-Level Representation in Bi-LSTM model for Thai NER

Kitiya Suriyachay
Thatsanee Charoenporn
Virach Sornlertlamvanich
Natsuda Kaothanthong

Abstract

Named Entity Recognition (NER) in Thai is challenging because the language has no explicit word boundaries. This makes word segmentation difficult, and segmentation errors reduce the effectiveness of downstream NLP tasks such as NER. Another important problem is the ambiguity that arises when common nouns are used to express named entities. In Thai, most named entities are placed close to a verb or a preposition in a specific pattern, so the part of speech (POS) can be used effectively as a feature for determining the type of a named entity. For these reasons, in this paper we build a BiLSTM-CNN-CRF model to investigate the effectiveness of combining word, POS, and Thai character cluster (TCC) features. We use TCCs instead of single characters to minimize word segmentation errors in the corpora and to increase the efficiency of building the model. Experimental results show that our proposed model outperforms other models, and that the TCC is a suitable unit for character embedding, providing better results than single-character embedding.
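
To illustrate the idea of embedding character clusters rather than single characters, the following is a minimal sketch under a loud assumption: it does not implement the TCC rules used in the paper. The hypothetical helper `approx_clusters` only attaches Unicode nonspacing marks (Thai upper and lower vowels and tone marks) to the preceding base character, which is a much coarser grouping than real TCCs (for example, it does not bind leading vowels to the following consonant).

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified stand-in for illustration only; this is NOT the Thai
    character cluster (TCC) rule set referred to in the abstract.
    """
    clusters = []
    for ch in text:
        # Category "Mn" covers Thai above/below vowels and tone marks.
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch  # attach the mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster at a base character
    return clusters


if __name__ == "__main__":
    for word in ["กิน", "ที่"]:
        print(word, approx_clusters(word))
```

Under this approximation, a vowel or tone mark never becomes an embedding unit on its own, which is the intuition behind using clusters as the character-level input.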

Article Details

How to Cite
Suriyachay, K., Charoenporn, T., Sornlertlamvanich, V., & Kaothanthong, N. (2021). Enhancement of Character-Level Representation in Bi-LSTM model for Thai NER. Science & Technology Asia, 26(2), 61–78. Retrieved from https://ph02.tci-thaijo.org/index.php/SciTechAsia/article/view/230527
Section
Engineering