Apply Monte Carlo Simulation to Synthesize Data

Orawan Hensirisak
Chaiyaporn Khemapatpapan

Abstract

This research presents a method for synthesizing data using Monte Carlo simulation. Six datasets were synthesized and categorized into two types: (1) datasets with more categorical variables than numerical variables, and (2) datasets with more numerical variables than categorical variables. For each dataset, 1,500 rows were synthesized, and the synthetic data were compared with the real data using (1) the Kolmogorov-Smirnov two-sample test, (2) the t-test, (3) a cosine similarity test, (4) multiple linear regression analysis, and (5) direct data comparison. The results showed that the Monte Carlo method was the most efficient of the methods compared for synthesizing data, especially for categorical data. Based on the coefficients of determination, Monte Carlo simulation was 60.47% more efficient than Generative Adversarial Networks (GANs) and 52.41% more efficient than Variational Autoencoders (VAEs). Additionally, the Monte Carlo simulation method allows for adjustments to better represent the population in cases where the sample does not fully cover it.
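The abstract does not describe the sampling procedure in detail, so the following is only a minimal sketch of one common way to synthesize tabular data by Monte Carlo simulation. It is not the authors' implementation. It assumes that each column is sampled independently from its own empirical distribution: categorical columns by frequency-weighted draws, numerical columns by inverse-transform sampling on the sorted observations. The function names, the toy "real" data, and the column names `color` and `age` are all hypothetical. Only two of the five comparisons listed above are illustrated: a two-sample Kolmogorov-Smirnov statistic and a cosine similarity on category frequencies.

```python
import math
import random
from bisect import bisect_right
from collections import Counter


def synthesize_categorical(values, n, rng):
    """Draw n categories with probabilities equal to the observed frequencies."""
    counts = Counter(values)
    categories = sorted(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def synthesize_numerical(values, n, rng):
    """Inverse-transform sampling from the empirical distribution,
    with linear interpolation between adjacent sorted observations."""
    xs = sorted(values)
    out = []
    for _ in range(n):
        pos = rng.random() * (len(xs) - 1)
        i = min(int(pos), len(xs) - 2)  # guard against rounding at the upper end
        frac = pos - i
        out.append(xs[i] + frac * (xs[i + 1] - xs[i]))
    return out


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs, evaluated at every observed value."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def frequency_vector(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


# Hypothetical "real" data standing in for one of the study's datasets.
rng = random.Random(42)
real_color = rng.choices(["red", "green", "blue"], weights=[5, 3, 2], k=500)
real_age = [rng.gauss(40, 10) for _ in range(500)]

# Synthesize 1,500 rows, matching the row count used in the abstract.
synth_color = synthesize_categorical(real_color, 1500, rng)
synth_age = synthesize_numerical(real_age, 1500, rng)

categories = sorted(set(real_color))
ks_age = ks_statistic(real_age, synth_age)
cos_color = cosine_similarity(
    frequency_vector(real_color, categories),
    frequency_vector(synth_color, categories),
)
print(f"KS statistic (age): {ks_age:.4f}")
print(f"Cosine similarity (color frequencies): {cos_color:.4f}")
```

A small KS statistic and a cosine similarity close to 1 indicate that the synthetic columns follow the real ones closely. Because the columns are drawn independently in this sketch, relationships between variables are not preserved; a regression-based comparison such as the one named in the abstract would expose that limitation.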

Article Details

How to Cite
[1]
O. Hensirisak and C. Khemapatpapan, “Apply Monte Carlo Simulation to Synthesize Data”, JIST, vol. 15, no. 2, pp. 15–23, Dec. 2025.
Article Type
Research Article: Soft Computing
