Enhancement of Character-Level Representation in Bi-LSTM model for Thai NER
Abstract
Named Entity Recognition (NER) in Thai is a challenging task because the language has no explicit word boundaries. This complicates word segmentation, and segmentation errors in turn reduce the effectiveness of downstream NLP tasks such as NER. Another important problem is ambiguity: common nouns are often used to express named entities. In Thai, most named entities appear close to a verb or a preposition in a specific pattern, so part of speech (POS) can be used effectively as a feature for determining the type of a named entity. For these reasons, in this paper we build a BiLSTM-CNN-CRF model to investigate the effectiveness of combining word, POS, and Thai character cluster (TCC) features. We use TCCs instead of single characters to minimize word segmentation errors in the corpora and to increase efficiency in building the model. Experimental results show that our proposed model outperforms other models, and that the TCC is a suitable unit for character-level embedding, giving better results than single-character embedding.
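As a rough illustration of the feature combination described above, the following minimal sketch shows one way a per-token input vector could be assembled from a word vector, a POS vector, and a pooled vector over sub-word units such as TCCs. It is not the paper's implementation: the function names, the toy vocabulary, the vector sizes, and the random lookup tables are all assumptions made for illustration. The convolution filters, BiLSTM, and CRF layers are omitted, and the TCC units are assumed to be supplied already segmented.

```python
import random


def make_table(keys, dim, seed):
    """Build a toy lookup table of random vectors (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in keys}


def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def token_representation(word, pos, units, word_table, pos_table, unit_table):
    """Concatenate word, POS, and pooled sub-word unit vectors for one token."""
    unit_vectors = [unit_table[u] for u in units]
    return word_table[word] + pos_table[pos] + max_pool(unit_vectors)


# Hypothetical toy vocabulary: placeholder ids, not real Thai words or TCCs.
word_table = make_table(["w_a", "w_b"], dim=4, seed=0)
pos_table = make_table(["NOUN", "VERB"], dim=2, seed=1)
unit_table = make_table(["u1", "u2", "u3"], dim=3, seed=2)

vec = token_representation(
    "w_a", "NOUN", ["u1", "u2"], word_table, pos_table, unit_table
)
print(len(vec))  # 4 + 2 + 3 = 9
```

In a sketch like this, swapping single characters for TCCs only changes the keys of `unit_table` and the contents of `units`; the rest of the pipeline stays the same, which is what makes the two kinds of unit directly comparable.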