Generalized Information Extraction for Thai Web Boards

Apichai Dangmadee; Parinya Sanguansat; Choochart Haruechaiyasak

Published: Mar 5, 2019

Keywords: Web Content Extraction, LSTM, Sequence-to-Sequence Learning, Post processing

Abstract

Web content extraction is the process of extracting user-specified information from web pages. Traditionally, web content extraction has relied on rule-based or pattern-based approaches. Typically, the rule or pattern set is hand-engineered and applies only to an individual web site. To increase efficiency, we previously proposed a machine-learning approach that applies Long Short-Term Memory (LSTM), a sequence-to-sequence learning model, to dynamically extract the title and content from web pages. Our error analysis shows that misclassified tokens form a minority within an otherwise correct sequence. To improve performance, in this paper we propose a post-processing technique that merges predicted tokens carrying minority tags into the majority tag of the token sequence. For evaluation, we use the same data set as in our previous work, a collection of web pages from 10 different Thai web boards such as Dek-D, MThai, Sanook and Pantip. The post-processing technique improves accuracy up to 99.53%, an improvement of 0.11% over the previously proposed model. Although the overall improvement may seem small, accuracy for Title extraction improves significantly, from 88.04% to 100%.
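The post-processing idea described in the abstract can be illustrated with a minimal sketch. It assumes that the tagger's per-token predictions have already been grouped into segments that should share one tag; the function name `merge_minority_tags` and the tag labels `TITLE` and `OTHER` are hypothetical and are not taken from the paper, which may delimit segments and resolve ties differently.

```python
from collections import Counter
from typing import List, Sequence


def merge_minority_tags(predicted: Sequence[str]) -> List[str]:
    """Relabel every token in one segment with the segment's most frequent tag.

    Assumption: `predicted` holds the tags for a single segment that is
    expected to carry one label (for example, all tokens of a title).
    """
    if not predicted:
        return []
    # Counter.most_common orders equal counts by first occurrence, so a tie
    # resolves in favour of the tag that appears first in the segment.
    majority_tag, _ = Counter(predicted).most_common(1)[0]
    return [majority_tag] * len(predicted)


# Hypothetical per-token predictions for one title segment, with a single
# misclassified token in the middle.
title_segment = ["TITLE", "TITLE", "OTHER", "TITLE", "TITLE"]
cleaned = merge_minority_tags(title_segment)
print(cleaned)
```

In this sketch the single `OTHER` prediction is absorbed into the surrounding `TITLE` tags, which is the kind of correction that would explain a large gain on short sequences such as titles.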

How to Cite

Dangmadee, A., Sanguansat, P., & Haruechaiyasak, C. (2019). Generalized Information Extraction for Thai Web Boards. International Scientific Journal of Engineering and Technology (ISJET), 2(1), 20–26. Retrieved from https://ph02.tci-thaijo.org/index.php/isjet/article/view/175903

Issue

Vol. 2 No. 1 (2018): January-June

Section

Research Article
