A Statistical and Rule-based Method for Chunking Verbal Units in Thai Texts
Abstract
This work addresses the extraction of verbal units, which are groups of words that express an action or a state of being. A verbal unit is a fundamental element of a clause or sentence. We define three layers of verbal units: verbal sequences, verbal phrases (i.e., verbal chunks, causative forms, and event occurrences), and elementary discourse units (EDUs). In the first layer, a verbal sequence is a single verb or a sequence of contiguous verbs with no intervening nouns or particles. In the second layer, a verbal phrase (i.e., a causative form and an event occurrence form) is a phrase that may include auxiliary verbs, verbs, and nouns acting as subjects or objects. In the third layer, a Thai elementary discourse unit is a sentence-like or clause-like unit that contains only one actual verb. We propose a hybrid approach that combines a statistical method and a rule-based method to chunk Thai verbal units. The statistical method is based on conditional random fields (CRFs), while the rule-based method applies grammatical rules with chart parsing. The two methods complement each other to improve correctness. We compare three approaches: statistical, rule-based, and hybrid. The experimental results show that the hybrid approach performs best at chunking verbal units.
Keywords: Chunking, Thai Verbal Sequence, CRF, Grammatical Rules, Hybrid Approach
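The first-layer definition in the abstract (a verbal sequence is a single verb or a run of contiguous verbs with no intervening nouns or particles) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the input is already word-segmented and part-of-speech tagged, and the tag names (`VERB`, `NOUN`, `PRON`), the function `verbal_sequences`, and the example sentence are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# A token is a (word, POS tag) pair. The tag names are assumptions;
# the abstract does not specify the tag set used in the paper.
Token = Tuple[str, str]


def verbal_sequences(tokens: List[Token]) -> List[List[str]]:
    """Group maximal runs of contiguous verbs into verbal sequences.

    Any non-verb token (noun, particle, pronoun, ...) ends the current run.
    """
    sequences: List[List[str]] = []
    current: List[str] = []
    for word, tag in tokens:
        if tag == "VERB":
            current.append(word)
        elif current:
            # A non-verb token interrupts the run, so close it.
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences


# Illustrative sentence (word glosses: he / walk / go / buy / rice / eat).
tagged = [
    ("เขา", "PRON"),
    ("เดิน", "VERB"),
    ("ไป", "VERB"),
    ("ซื้อ", "VERB"),
    ("ข้าว", "NOUN"),
    ("กิน", "VERB"),
]

print(verbal_sequences(tagged))
```

In this sketch the noun between the third and fourth verbs splits the sentence into two verbal sequences, one of three contiguous verbs and one of a single verb. The higher layers (verbal phrases and EDUs) and the CRF and chart-parsing components described in the abstract are beyond what this simple grouping covers.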