Information Extraction for Thai Celebrities from Free Text

Jian Qu; Chinorot Wangtragulsang

Published: Mar 16, 2021

Keywords:

Information extraction; Named entity; Personal information extraction; Social media; Unstructured data

Jian Qu

Faculty of Engineering and Technology, Panyapiwat Institute of Management, Nonthaburi 11120, Thailand

Chinorot Wangtragulsang

Faculty of Engineering and Technology, Panyapiwat Institute of Management, Nonthaburi 11120, Thailand

Abstract

Automatic extraction of Thai-language information remains challenging because of the language's structure, the lack of explicit word segmentation, the presence of vowel and tone marks, and specific words that are not in any dictionary. Thai-language personal information extraction in particular suffers from low candidate recall and candidate ambiguity. This work proposes an automatic personal information extraction approach that extracts the date of birth, height, heritage, Instagram, Twitter, and film names of Thai celebrities from 22,484 Thai-language webpage snippets. It uses novel pattern matching, feature selection, and machine learning methods to select the most likely piece of information from a number of possible candidates. We compare the performance of our method with MThai.com, a large, actively maintained website that contains some personal information; MThai.com reaches up to 70% in recall and precision. We further compare it with state-of-the-art work on automatic Thai information extraction based on tokenizers and rule-based extraction, which achieves only 40-50% in recall and precision. In our experiments, our approach extracts date of birth, height, Instagram, and Twitter with recall and precision between 70% and 90%. Furthermore, it extracts some heritage and film names that existing methods cannot.
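
The two-stage idea in the abstract (generate candidates by pattern matching, then pick the most likely one) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The regular expressions, the function names `extract_candidates` and `select_most_likely`, and the sample snippets are all hypothetical, and a simple majority vote stands in for the feature selection and machine learning step, which the abstract does not specify.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; the paper's actual
# patterns are not given in the abstract.
# Height: the Thai word for "tall/height" followed by a 3-digit number and a cm unit.
HEIGHT_RE = re.compile(r"สูง\s*(\d{3})\s*(?:ซม|เซนติเมตร|cm)", re.IGNORECASE)
# Instagram: either a profile URL or an "IG:" label followed by a handle.
INSTAGRAM_RE = re.compile(r"(?:instagram\.com/|ig\s*:\s*@?)([A-Za-z0-9_.]+)", re.IGNORECASE)


def extract_candidates(snippets, pattern):
    """Collect every pattern match across all snippets as a candidate value."""
    candidates = []
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            # Strip a trailing period that may belong to the sentence, not the value.
            candidates.append(match.group(1).rstrip("."))
    return candidates


def select_most_likely(candidates):
    """Pick the most frequent candidate (a stand-in for the learned selector)."""
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


# Made-up snippets for a fictional person, with one conflicting height value.
snippets = [
    "สูง 168 ซม. IG: @example_star",
    "ส่วนสูง 168 cm",
    "สูง 170 ซม. instagram.com/example_star",
]

height = select_most_likely(extract_candidates(snippets, HEIGHT_RE))
instagram = select_most_likely(extract_candidates(snippets, INSTAGRAM_RE))
print(height, instagram)
```

Majority voting only addresses candidate ambiguity when the correct value appears most often; a learned selector of the kind the abstract describes could also weigh features such as the source or surrounding context of each candidate.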

How to Cite

Qu, J., & Wangtragulsang, C. (2021). Information Extraction for Thai Celebrities from Free Text. Science & Technology Asia, 26(1), 64–83. Retrieved from https://ph02.tci-thaijo.org/index.php/SciTechAsia/article/view/192017

Issue

Vol.26 No.1 (January-March 2021)

Section

Engineering
