Information Extraction for Thai Celebrities from Free Text
Keywords: Information extraction, Named entity, Personal information extraction, Social media, Unstructured data
Automatic information extraction from Thai-language text remains challenging because of the language's structure, the absence of explicit word boundaries, the presence of vowel and tone marks, and specific words that do not appear in any dictionary. For personal information extraction in Thai, the main difficulties are low candidate recall and candidate ambiguity. This work proposes an automatic personal information extraction approach that extracts the date of birth, height, heritage, Instagram account, Twitter account and film names of Thai celebrities from 22,484 Thai-language webpage snippets. It uses novel pattern matching, feature selection and machine learning methods to select the most likely piece of information from a number of possible candidates. We compare the performance of our method with that of MThai.com, a large, actively maintained website that contains some personal information; MThai.com reaches up to 70% in recall and precision. We also compare against state-of-the-art work on automatic Thai information extraction based on a tokenizer and rule-based extraction, which achieves only 40-50% in recall and precision. In our experiments, the proposed approach extracts date of birth, height, Instagram and Twitter with recall and precision between 70% and 90%. Furthermore, it extracts some heritage and film names in cases where existing methods cannot.
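The idea of generating candidates by pattern matching and then selecting the most likely one can be illustrated with a minimal sketch for the height attribute. Everything in it is an assumption made for illustration: the regular expression, the function names, the sample snippets, and in particular the majority vote, which merely stands in for the feature selection and machine learning step and is not the method evaluated in this work.

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not the paper's rule set): a height
# keyword in Thai or English, a three-digit number, and a centimetre unit.
HEIGHT_PATTERN = re.compile(
    r"(?:ส่วนสูง|สูง|height)\s*:?\s*(\d{3})\s*(?:ซม\.?|เซนติเมตร|cm)",
    re.IGNORECASE,
)


def extract_height_candidates(snippets):
    """Collect every height value (in cm) matched in any snippet."""
    candidates = []
    for snippet in snippets:
        candidates.extend(int(value) for value in HEIGHT_PATTERN.findall(snippet))
    return candidates


def select_most_likely(candidates):
    """Return the most frequent candidate, or None if there are none.

    Majority voting is a simple stand-in for a learned selector.
    """
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


# Made-up snippets for one person; two sources agree and one disagrees.
snippets = [
    "ส่วนสูง 165 ซม. น้ำหนัก 48 กก.",
    "Height: 165 cm",
    "สูง 170 เซนติเมตร",
]
candidates = extract_height_candidates(snippets)
best = select_most_likely(candidates)
```

The conflicting values in the sample snippets show why candidate ambiguity matters: pattern matching alone returns several plausible heights, so a separate selection step is needed to settle on one.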