Thai Text Compression Algorithm Employing Word-Formation Creation

Prayat Le-wan
Chouvalit Khancome

Abstract

Lossless text compression is a fundamental topic in computer science and is crucial for reducing the storage space required by large datasets. The field has been developed continuously and has consistently attracted the interest of researchers. This research article presents a highly efficient design for a new text compression method tailored specifically to Thai-language text. The method builds a dictionary-like structure, termed the "Pre-Processing Section", from the word-formation patterns of the Thai language; this structure is used to look up terms during both compression and decompression. The compressed data is stored in a binary file using the newly developed Word-Formation Thai Text Compression Algorithm (WFTTCA). In theoretical terms, the method achieves compression rates of 37.50% to 79.17% for ASCII-TIS620 encoding, with a maximum average of 63.75%; 68.75% to 89.58% for Unicode encoding, with a maximum average of 81.88%; and 79.17% to 93.06% for UTF-8 encoding, with a maximum average of 87.92%. These rates correspond to a size reduction of 3.51 to 10.50 times relative to the original data. In experiments with a program implementing the new method, run on actual Thai-language data imported from news websites and randomly sampled in sizes from 1 KB to 100 KB, the program compressed ASCII-TIS620 data by 78.09% to 84.55%, Unicode data by 81.50% to 86.62%, and UTF-8 data by 88.09% to 91.11%. Compared with popular current compression software, the program achieves significantly higher compression, in terms of both percentage compression and compression ratio.
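
The three theoretical ranges quoted above appear to be mutually consistent under one simple assumption, which the abstract does not state: the compressed output has the same size whatever the input encoding, and every input character is a Thai character occupying 1 byte in TIS-620, 2 bytes in a 16-bit Unicode encoding, and 3 bytes in UTF-8. The Python sketch below checks that arithmetic. It is an illustration only, not part of WFTTCA; the names `rescale_rate` and `encoded_sizes` are hypothetical helpers introduced here.

```python
"""Check that the per-encoding theoretical rates are mutually consistent.

Assumption (ours, not stated in the abstract): the compressed output has the
same size for every input encoding, and each character is a Thai character
taking 1 byte in TIS-620, 2 bytes in UTF-16, and 3 bytes in UTF-8.
"""


def rescale_rate(rate_one_byte: float, bytes_per_char: int) -> float:
    """Convert a compression rate (%) measured against 1-byte characters
    into the rate (%) measured against wider characters."""
    # Fraction of the 1-byte-per-character input that remains after compression.
    compressed_fraction = 1.0 - rate_one_byte / 100.0
    # The same output compared with an input that is bytes_per_char times larger.
    return 100.0 * (1.0 - compressed_fraction / bytes_per_char)


def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in the three encodings discussed."""
    return {
        "tis_620": len(text.encode("tis_620")),
        "utf_16": len(text.encode("utf_16_le")),  # LE variant adds no BOM
        "utf_8": len(text.encode("utf_8")),
    }


# Theoretical ASCII-TIS620 rates quoted in the abstract.
TIS620_RATES = {"minimum": 37.50, "maximum": 79.17, "maximum average": 63.75}

if __name__ == "__main__":
    # A 7-character Thai word shows the 1 : 2 : 3 size relationship.
    print(encoded_sizes("ภาษาไทย"))
    for label, rate in TIS620_RATES.items():
        print(
            f"{label}: TIS-620 {rate:.2f}%, "
            f"Unicode {rescale_rate(rate, 2):.2f}%, "
            f"UTF-8 {rescale_rate(rate, 3):.2f}%"
        )
```

Under this assumption the rescaled values agree with the Unicode and UTF-8 figures in the abstract to within rounding. Real text also contains spaces, digits, and Latin characters that take fewer bytes in UTF-8, which may be one reason the experimental figures do not follow the same simple scaling.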

Article Details

How to Cite
P. Le-wan and C. Khancome, “Thai Text Compression Algorithm Employing Word-Formation Creation”, JIST, vol. 14, no. 1, pp. 9–21, Jun. 2024.
Section
Research Article: Soft Computing (Detail in Scope of Journal)
