Applying an extremely imbalanced technique on big data: A case study of web intrusion

Kesinee Boonchuay
Sureerat Kaewkeeree
Youppadee Intasorn


Web intrusions are a frequently occurring type of network intrusion, and web logs can be used to identify them. However, web logs tend to contain a huge amount of data that is difficult to process on a stand-alone computer. The data is also imbalanced, since the numbers of intrusion threats and normal accesses differ extremely. Handling web intrusion therefore involves two main topics: big data and the class imbalance problem. This research applies an imbalanced technique for big data to improve performance on a web intrusion dataset. It is built on Apache Spark, a popular open-source big data framework. The goal of this paper is to enhance the efficiency of intrusion prediction, which is categorized as an extremely imbalanced problem. The idea of minority class instance broadcasting is applied to improve prediction performance for web intrusion threats. According to the results, a decision tree combined with the imbalanced technique performs better overall than a standard decision tree. When compared by F-measure and geometric mean on 7 partitions, performance with the extremely imbalanced technique improves considerably, at 0.92 and 0.81 for F-measure and geometric mean, respectively. For logistic regression, applying the imbalanced technique does not show a statistically significant improvement.
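The two evaluation measures named in the abstract can be illustrated with a short sketch. It assumes the usual definitions, F-measure as the harmonic mean of precision and recall and geometric mean as the square root of recall times specificity, with intrusion as the positive class; the abstract does not spell out the formulas. The function name and all counts below are hypothetical and are not taken from the paper's experiments.

```python
import math


def f_measure_and_gmean(tp, fp, fn, tn):
    """Return (F-measure, geometric mean) from confusion-matrix counts.

    Assumes the minority (intrusion) class is the positive class.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean


# Hypothetical log: 10 intrusions among 10,000 requests.
# A classifier that labels every request "normal" is 99.9% accurate...
accuracy_all_normal = (0 + 9990) / 10000
# ...yet both measures are zero, because no intrusion is detected.
f_all_normal, g_all_normal = f_measure_and_gmean(tp=0, fp=0, fn=10, tn=9990)

# A classifier that finds 8 of the 10 intrusions with 2 false alarms.
f_detect, g_detect = f_measure_and_gmean(tp=8, fp=2, fn=2, tn=9988)
```

Plain accuracy rewards the first classifier even though it never flags an intrusion, which is why measures that depend on minority-class recall are the usual choice for extremely imbalanced data.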

Article Details

How to Cite
Boonchuay, K., Kaewkeeree, S., & Intasorn, Y. (2018). Applying an extremely imbalanced technique on big data: A case study of web intrusion. Interdisciplinary Research Review, 13(2). Retrieved from
Research Articles