Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations
Keywords:Natural language processing, machine learning, artificial neural networks, sequence tagging model, Thai sentence segmentation, Thai language
A sentence is typically treated as the minimal syntactic unit used to extract valuable information from long text. However, in written Thai, there are no explicit sentence markers. Some prior works use machine learning; however, a deep learning approach has never been employed. We propose a deep learning model for sentence segmentation that includes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, due to the scarcity of labeled data, for which annotation is difficult and time-consuming, we also investigate two techniques that allow us to utilize unlabeled data: Cross-View Training (CVT) as a semi-supervised learning technique, and a pre-trained language model (ELMo) to improve word representation. In the experiments, our model reduced the relative error by 7.4% and 18.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. Ablation studies revealed that the main contributing factor was adopting n-gram features, which were further analyzed using the interpretation technique and indicated that the model utilizes the features in the same way that humans do.
Authors who publish with Engineering Journal agree to transfer all copyright rights in and to the above work to the Engineering Journal (EJ)'s Editorial Board so that EJ's Editorial Board shall have the right to publish the work for nonprofit use in any media or form. In return, authors retain: (1) all proprietary rights other than copyright; (2) re-use of all or part of the above paper in their other work; (3) right to reproduce or authorize others to reproduce the above paper for authors' personal use or for company use if the source and EJ's copyright notice is indicated, and if the reproduction is not made for the purpose of sale.