TY  - JOUR
AU  - Dangmadee, Apichai
AU  - Sanguansat, Parinya
AU  - Haruechaiyasak, Choochart
PY  - 2019/03/05
Y2  - 2024/03/29
TI  - Generalized Information Extraction for Thai Web Boards
JF  - INTERNATIONAL SCIENTIFIC JOURNAL OF ENGINEERING AND TECHNOLOGY (ISJET)
JA  - int. sci. j eng. tech.
VL  - 2
IS  - 1
SE  - Research Article
DO  -
UR  - https://ph02.tci-thaijo.org/index.php/isjet/article/view/175903
SP  - 20-26
AB  - Web content extraction is the process of extracting user-specified information from web pages. Traditionally, web content extraction has relied on rule-based or pattern-based approaches. Typically, the rule or pattern set is hand-engineered and applies only to an individual web site. To increase efficiency, we have proposed a machine-learning-based approach that applies Long Short-Term Memory (LSTM), a sequence-to-sequence learning model, to dynamically extract the title and content from web pages. Our error analysis shows that misclassified tokens form a minority within an otherwise correctly tagged sequence. To improve performance, in this paper we propose a post-processing technique that merges predicted tokens carrying minority tags into the majority tag of the token sequence. For evaluation, we use the same data set as in our previous work, a collection of web pages from 10 different Thai web boards such as Dek-D, MThai, Sanook and Pantip. Our post-processing technique improves the accuracy up to 99.53%, an improvement of 0.11% over the previously proposed model. The overall improvement may seem small; however, for title extraction, the accuracy improves significantly, from 88.04% to 100%.
ER  -