Using k-means Clustering to Confirm Provincial COVID-19 Cases during the Omicron Epidemic in Thailand

Authors

  • Worrawate Leela-apiradee, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
  • Sathinee Wareepornthep, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand
  • Chutikan Premboon, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani, Thailand

Keywords:

Dynamic time warping, k-means clustering, Pearson correlation coefficient, time series k-means

Abstract

The novel coronavirus disease 2019 (COVID-19) pandemic has infected and killed millions of people worldwide. This work uses k-means clustering and a time series k-means algorithm to give an overview of COVID-19 cases and deaths in grouped provinces of Thailand before the country entered the post-pandemic period on 1 July 2022. The study has two parts. The first applies k-means clustering with the Euclidean distance to cumulative confirmed cases and deaths per 100,000 population by province from 1 January 2022 to 30 June 2022, during the Omicron (B.1.1.529) outbreak. Based on the elbow method, the optimal number of clusters (groups of provinces) is k = 5. The second cluster, consisting of two provinces, Phuket and Samut Sakhon, has the highest cluster means of confirmed cases and deaths. We then investigate the linear relationship between confirmed cases (and deaths) and 12 feature variables associated with social, economic, health, and environmental factors. Pearson correlation analysis indicates that four feature variables correlate positively with confirmed cases and deaths: Gross Regional and Provincial Product (GPP) per capita; number of medical personnel per 100,000 population (pop.); average monthly household income; and number of dengue cases per 100,000 pop. The second part applies k-means clustering with the dynamic time warping distance to time series data, namely daily confirmed cases per 100,000 population by province over the same 181-day period, for which the optimal number of clusters is k = 3. The time series of infections reaches its highest peak in the third cluster, consisting of three provinces: Phuket, Samut Songkhram, and Samut Sakhon.
These findings, illustrated in choropleth maps, provide a record of the COVID-19 pandemic in Thailand during the first half of 2022, and the provincial groupings could inform future government decisions on public health budget allocation related to COVID-19.
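
To illustrate why the second part uses dynamic time warping (DTW) rather than the Euclidean distance for daily case curves, the following is a minimal sketch of a textbook DTW distance in plain Python. It is not the authors' implementation: the abstract does not state which software or DTW variant was used, so the function names `dtw_distance` and `euclidean_distance`, the squared local cost, and the two toy series are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two 1-D sequences (illustrative convention).

    Local cost is the squared difference between aligned points; the result
    is the square root of the minimal accumulated cost over all warping paths.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return math.sqrt(cost[n][m])


def euclidean_distance(a, b):
    """Point-by-point Euclidean distance between equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical daily-case curves: the same wave, shifted by one day.
early = [0, 1, 5, 1, 0, 0]
late = [0, 0, 1, 5, 1, 0]

print("DTW:", dtw_distance(early, late))
print("Euclidean:", euclidean_distance(early, late))
```

In this toy case DTW aligns the two peaks and treats the curves as identical in shape, while the Euclidean distance penalizes the one-day shift. That behavior is the usual motivation for DTW when grouping epidemic curves whose waves arrive at slightly different times.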


Published

2024-06-29

How to Cite

Leela-apiradee, W., Wareepornthep, S., & Premboon, C. (2024). Using k-means Clustering to Confirm Provincial COVID-19 Cases during the Omicron Epidemic in Thailand. Thailand Statistician, 22(3), 594–609. Retrieved from https://ph02.tci-thaijo.org/index.php/thaistat/article/view/254770
