Information Extraction for Thai Celebrities from Free Text

  • Jian Qu, Faculty of Engineering and Technology, Panyapiwat Institute of Management, Nonthaburi 11120, Thailand
  • Chinorot Wangtragulsang, Faculty of Engineering and Technology, Panyapiwat Institute of Management, Nonthaburi 11120, Thailand
Keywords: Information extraction, Named entity, Personal information extraction, Social media, Unstructured data

Abstract

Automatic information extraction from Thai-language text remains challenging because of the language's structure, the absence of explicit word boundaries, the presence of vowel and tone marks, and words that do not appear in any dictionary. For personal information extraction in Thai, the main difficulties are low candidate recall and candidate ambiguity. This work proposes an automatic personal information extraction approach that extracts the date of birth, height, heritage, Instagram account, Twitter account, and film names of Thai celebrities from 22,484 Thai-language webpage snippets. It uses novel pattern matching, feature selection, and machine learning methods to select the most likely piece of information from a number of possible candidates. We compare our method against MThai.com, a large, actively maintained website that contains some personal information; MThai.com reaches up to 70% in recall and precision. We also compare it against state-of-the-art work on automatic Thai information extraction based on a tokenizer and rule-based extraction, which achieves only 40-50% in recall and precision. In our experiments, the proposed approach extracts date of birth, height, Instagram, and Twitter with recall and precision between 70% and 90%. Furthermore, it can extract some heritage and film names that existing methods cannot.
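
The idea of gathering candidates by pattern matching and then choosing one among them can be illustrated with a minimal Python sketch. It is not the authors' method: the regular expressions, function names, and sample snippets are hypothetical, and simple frequency voting stands in for the feature selection and machine learning step described above.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the paper's actual patterns
# are not given in the abstract.
HEIGHT_RE = re.compile(r"(\d{3})\s*(?:cm|ซม\.?|เซนติเมตร)")
INSTAGRAM_RE = re.compile(r"instagram\.com/([A-Za-z0-9_.]+)", re.IGNORECASE)


def extract_candidates(snippets, pattern):
    """Collect every match of `pattern` across all snippets."""
    candidates = []
    for text in snippets:
        candidates.extend(pattern.findall(text))
    return candidates


def pick_most_frequent(candidates):
    """Return the most common candidate, or None if there are none.

    Frequency voting is an assumption made here; the paper selects
    candidates with feature selection and machine learning instead.
    """
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


# Made-up snippets for a fictional celebrity.
snippets = [
    "สูง 170 ซม. ติดตามได้ที่ instagram.com/example_star",
    "ส่วนสูง 170 cm น้ำหนัก 50 กก.",
    "Height 168 cm, IG: instagram.com/example_star",
]

height = pick_most_frequent(extract_candidates(snippets, HEIGHT_RE))
instagram = pick_most_frequent(extract_candidates(snippets, INSTAGRAM_RE))
print(height, instagram)
```

The conflicting height values in the sample snippets show candidate ambiguity in miniature: several snippets disagree, so some selection step is needed after matching.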

Published
2021-03-16
How to Cite
Qu, J., & Wangtragulsang, C. (2021). Information Extraction for Thai Celebrities from Free Text. Science & Technology Asia, 26(1), 64-83. Retrieved from https://ph02.tci-thaijo.org/index.php/SciTechAsia/article/view/192017
Section
Engineering