Generalized Information Extraction for Thai Web Boards

Apichai Dangmadee
Parinya Sanguansat
Choochart Haruechaiyasak

Abstract

Web content extraction is the process of extracting user-specified information from web pages. Traditionally, web content extraction has relied on rule-based or pattern-based approaches. The rule or pattern set is typically prepared by hand and applies only to an individual web site. To increase efficiency, we previously proposed a machine-learning approach that applies Long Short-Term Memory (LSTM), a sequence-to-sequence learning method, to dynamically extract the title and content from web pages. Our error analysis shows that misclassified tokens form a minority within an otherwise correctly tagged sequence. To improve performance, in this paper we propose a post-processing technique that merges predicted tokens carrying minority tags into the majority tag of the token sequence. For evaluation, we use the same data set as in our previous work, a collection of web pages from 10 different Thai web boards such as Dek-D, MThai, Sanook and Pantip. Our post-processing technique improves the accuracy to as much as 99.53%, an improvement of 0.11% over the previously proposed model. Although the overall improvement may seem small, for Title extraction the accuracy improves significantly, from 88.04% to 100%.
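
The abstract does not spell out how minority tags are merged into the majority tag, so the following is only a minimal sketch of one plausible reading: a sliding-window majority vote over the predicted tag sequence. The function name `smooth_tags`, the `radius` parameter, the tie-breaking rule, and the tag labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def smooth_tags(tags, radius=2):
    """Relabel each token with the most common tag in a window around it.

    Hypothetical post-processing step: isolated minority tags inside a run
    of another tag are merged into that surrounding majority tag.
    """
    smoothed = []
    for i, current in enumerate(tags):
        # Window of up to `radius` tokens on each side, clipped at the ends.
        window = tags[max(0, i - radius): i + radius + 1]
        counts = Counter(window)
        best = max(counts.values())
        top = [tag for tag, count in counts.items() if count == best]
        # On a tie, keep the original prediction if it is among the leaders.
        smoothed.append(current if current in top else top[0])
    return smoothed


# Hypothetical LSTM output with two isolated misclassified tokens.
predicted = [
    "title", "title", "other", "title", "title",
    "content", "content", "content", "title", "content", "content",
]
cleaned = smooth_tags(predicted)
```

With this input, the stray `other` tag inside the title run and the stray `title` tag inside the content run are both absorbed by their neighbours, while the boundary between the two runs stays where it was. A larger `radius` smooths more aggressively but risks erasing short genuine segments, which matters for titles because they are typically only a few tokens long.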

Article Details

How to Cite
Dangmadee, A., Sanguansat, P., & Haruechaiyasak, C. (2019). Generalized Information Extraction for Thai Web Boards. INTERNATIONAL SCIENTIFIC JOURNAL OF ENGINEERING AND TECHNOLOGY (ISJET), 2(1), 20–26. Retrieved from https://ph02.tci-thaijo.org/index.php/isjet/article/view/175903
Section
Research Article
