THAI TREEBANK: CONCEPTS, CONSTRUCTION, AND APPLICATIONS

Theerapol Limsatta

Abstract

A treebank is a repository of natural language sentences annotated with syntactic analyses in tree form, which capture the grammatical relationships between words or phrases. Treebanks are a fundamental resource in natural language processing because these tree structures support accurate sentence interpretation. They can be built manually or semi-automatically and fall into two main types: phrase structure treebanks and dependency treebanks. Notable Thai treebanks include the CG Treebank, which uses Categorial Grammar, and Suthee’s treebank, which uses dependency grammar. The key components of a treebank are the original text data, accurate word segmentation, part-of-speech (POS) tagging with a tag set suited to Thai, syntactic tree structures, standard annotation guidelines, and data formats. Comparisons with the English Penn Treebank and with the multilingual Universal Dependencies framework highlight characteristics specific to Thai, such as the absence of spaces between words, the ellipsis of sentence components, and polysemous word usage. These characteristics make Thai treebank construction challenging. This article also describes X-bar theory, which explains the internal structure of phrases. Applying X-bar theory to Thai grammar requires adaptation: handling the absence of clear specifiers, managing nested structures, and accommodating null nodes for omitted elements. Establishing robust annotation standards requires comprehensive guidelines, standardized constituent types and POS tags, validation tools, and diverse annotated examples. Thai treebanks are crucial for advancing NLP technologies, particularly automatic parsing and machine translation, and they also support Thai language education. Beyond their technical utility, Thai treebanks serve as valuable linguistic, cultural, and language preservation databases, fostering further research and innovation in Thai computational linguistics.
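
To make the two treebank types and the idea of a null node concrete, the following minimal Python sketch may help. It is an illustration only, not the annotation scheme of any treebank named above: the `Node` class, the `*pro*` placeholder, the example sentence ฉัน กิน ข้าว ("I eat rice"), and the choice of Universal Dependencies style labels (`nsubj`, `obj`, `root`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a phrase structure tree (hypothetical, for illustration)."""
    label: str                      # constituent type or POS tag, e.g. "NP", "VERB"
    word: Optional[str] = None      # surface word for leaf nodes
    children: List["Node"] = field(default_factory=list)
    is_null: bool = False           # True for an omitted (elided) element


def to_brackets(node: Node) -> str:
    """Render a tree in bracketed notation."""
    if node.is_null:
        return f"({node.label} *pro*)"      # assumed placeholder for a null node
    if node.word is not None:
        return f"({node.label} {node.word})"
    inner = " ".join(to_brackets(child) for child in node.children)
    return f"({node.label} {inner})"


def leaves(node: Node) -> List[str]:
    """Return the surface words, skipping null nodes."""
    if node.is_null:
        return []
    if node.word is not None:
        return [node.word]
    words: List[str] = []
    for child in node.children:
        words.extend(leaves(child))
    return words


def verb_phrase() -> Node:
    """VP for กิน ข้าว ("eat rice")."""
    return Node("VP", children=[
        Node("VERB", word="กิน"),
        Node("NP", children=[Node("NOUN", word="ข้าว")]),
    ])


# Phrase structure analysis of the full sentence ฉัน กิน ข้าว.
full = Node("S", children=[
    Node("NP", children=[Node("PRON", word="ฉัน")]),
    verb_phrase(),
])

# Same sentence with the subject omitted: the subject NP is kept as a null node.
elided = Node("S", children=[Node("NP", is_null=True), verb_phrase()])

# Dependency analysis of the full sentence: (id, word, POS, head id, relation).
# Head id 0 marks the root of the sentence.
dependencies = [
    (1, "ฉัน", "PRON", 2, "nsubj"),
    (2, "กิน", "VERB", 0, "root"),
    (3, "ข้าว", "NOUN", 2, "obj"),
]

print(to_brackets(full))
print(to_brackets(elided))
print(leaves(elided))
```

The phrase structure tree groups words into nested constituents, while the dependency list links each word directly to its head. The null node keeps the subject position in the tree even though no word is pronounced there, which is one possible way to record ellipsis.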

How to Cite
[1]
T. Limsatta, “THAI TREEBANK: CONCEPTS, CONSTRUCTION, AND APPLICATIONS: Academic Articles”, JSCI-SBU, vol. 5, no. 2, pp. 94–105, Dec. 2025.
Section
Academic Article

References

T. Ruangrajitpakorn, K. Trakultaweekoon, and T. Supnithi, “A syntactic resource for Thai: CG treebank,” in Proc. of the 7th Workshop on Asian Language Resources, pp. 96–101, 2009. (in Thai)

S. Sudprasert, “A Dependency Tree Annotation Manual for Thai Language (Version 1.4),” 2008. [Online]. Available: http://github.com/crishoj/thcg. [Accessed: May 30, 2024]. (in Thai)

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated corpus of English: The Penn Treebank,” Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, et al., “Universal Dependencies v1: A multilingual treebank collection,” in Proc. of the 10th International Conference on Language Resources and Evaluation (LREC), 2016.

K. Kosawat, M. Boriboon, T. Charoenporn, and V. Sornlertlamvanich, “The Thai National Corpus (TNC): Corpus-based linguistic resources for Thai language processing,” in Proc. of the 7th Workshop on Asian Language Resources, 2009.

H. Isahara, C. Kruengkrai, and S. Shirai, “Thai Treebank and applications,” in Proc. 6th Workshop on Asian Language Resources (ALR), Hyderabad, India, pp. 65–72, 2008.

P. Boonkwan, N. Thanachart, and T. Charoenporn, “Thai Dependency Treebank: Annotation guideline and corpus,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pp. 1234–1240, 2016.

T. Aroonmanakun, “Issues in tagging and parsing the Thai language,” in Proc. 17th Pacific Asia Conf. on Language, Information and Computation (PACLIC 17), Sentosa, Singapore, pp. 219–226, 2003.

C. Wirote and V. Sornlertlamvanich, “Thai grammar extraction using statistical and rule-based approach,” in Proc. 4th Int. Conf. on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 2004.

NECTEC, “Thai Treebank Project: Guidelines and Corpus Development,” 2010. [Online]. Available: http://www.thaicorpora.net. [Accessed: Jun. 9, 2023]. (in Thai)

K. M. K. Boriboon, K. Kriengket, P. Chootrakool, S. Phaholphinyo, S. Purodakananda, T. Thanakulwarapas, and K. Kosawat, “Best corpus development and analysis,” in Proc. 2009 Int. Conf. on Asian Language Processing, pp. 322–327, 2009.

J. Nivre, M.-C. de Marneffe, F. Ginter, J. Hajič, C. D. Manning, S. Pyysalo, S. Schuster, F. Tyers, and D. Zeman, “Universal Dependencies v2: An evergrowing multilingual treebank collection,” arXiv preprint arXiv:2004.10643, 2020.

S. Sornlertlamvanich, K. Charoenporn, and T. Aroonmanakun, “A deep syntactic parsing approach for Thai using Universal Dependencies,” in Proc. 34th Pacific Asia Conf. on Language, Information and Computation (PACLIC 34), 2020.

A. Piamsa-Nga, “Improving Thai-English machine translation via syntactic reordering based on treebank,” Kasetsart Journal of Social Sciences, vol. 39, no. 2, pp. 235–244, 2018. (in Thai)

D. Li, N. Noordin, L. Ismail, and D. Cao, “A systematic review of corpus-based instruction in EFL classroom,” Heliyon, vol. 11, no. 2, pp. 1–14, 2025.