Headline2Vec: A CNN-based Feature for Thai Clickbait Headlines Classification

Natsuda Kaothanthong
Sarawoot Kongyoung
Thanaruk Theeramunkong

Abstract

Clickbait is an article title or social media post that entices readers to follow a link to the article’s content. It is one of the major contributors to the spread of fake news. To limit that spread, such content should be detected as early as possible. This paper presents a content-based feature called headline2vec, extracted from the concatenation layer of a convolutional neural network (CNN) built on the well-known word2vec model for high-dimensional word embeddings, to improve the automatic detection of Thai clickbait headlines. A pioneering dataset of Thai clickbait headlines is collected using a top-down strategy. In the experiment, we evaluate the headline2vec feature for Thai clickbait news detection on 132,948 Thai headlines. The CNN features are constructed using a non-static modeling technique with 50-dimensional word2vec embeddings, window sizes of two, three, and four, and 5 training epochs. Using the proposed features, we compare three classifiers: naïve Bayes, support vector machine, and multilayer perceptron. The results show that headline2vec with a multilayer perceptron achieves up to 93.89% accuracy and outperforms sequential features based on n-grams with tf-idf.
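The feature construction summarized in the abstract can be illustrated with a minimal sketch. It assumes a Kim-style sentence CNN: windows of two, three, and four consecutive 50-dimensional word vectors are convolved, passed through ReLU, max-pooled over the headline, and concatenated into one vector. The number of filters per window size (`NUM_FILTERS`), the random weights, and the function names are illustrative assumptions, not the authors' implementation; a real model would learn the filters and use word2vec vectors of segmented Thai words.

```python
import random

EMBED_DIM = 50            # word2vec dimensionality stated in the abstract
WINDOW_SIZES = (2, 3, 4)  # convolution window sizes stated in the abstract
NUM_FILTERS = 8           # assumption: filters per window size (not given in the abstract)


def make_filters(rng):
    """Create random filters; a trained model would learn these weights."""
    return {
        w: [[rng.uniform(-0.1, 0.1) for _ in range(w * EMBED_DIM)]
            for _ in range(NUM_FILTERS)]
        for w in WINDOW_SIZES
    }


def headline_feature(embeddings, filters):
    """Convolve, apply ReLU, max-pool over time, and concatenate."""
    feature = []
    for w in WINDOW_SIZES:
        # Flatten each window of w consecutive word vectors into one vector.
        windows = [
            [x for vec in embeddings[i:i + w] for x in vec]
            for i in range(len(embeddings) - w + 1)
        ]
        for filt in filters[w]:
            activations = [
                max(0.0, sum(a * b for a, b in zip(filt, win)))
                for win in windows
            ]
            # Max-over-time pooling keeps one value per filter;
            # headlines shorter than the window contribute 0.0.
            feature.append(max(activations) if activations else 0.0)
    return feature


rng = random.Random(0)
filters = make_filters(rng)
# Stand-in for a segmented headline of 6 words, each mapped to a 50-d vector.
headline = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(6)]
feature = headline_feature(headline, filters)
print(len(feature))  # one value per filter per window size
```

Under these assumptions the concatenated vector has `len(WINDOW_SIZES) * NUM_FILTERS` entries, and a vector of this kind is what would be passed to a downstream classifier such as naïve Bayes, a support vector machine, or a multilayer perceptron.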

How to Cite
Kaothanthong, N., Kongyoung, S., & Theeramunkong, T. (2021). Headline2Vec: A CNN-based Feature for Thai Clickbait Headlines Classification. INTERNATIONAL SCIENTIFIC JOURNAL OF ENGINEERING AND TECHNOLOGY (ISJET), 5(1), 20-31. Retrieved from https://ph02.tci-thaijo.org/index.php/isjet/article/view/240815
Section
Research Article
