The Effect of POS Tag Information on Sentence Boundary Detection in Turkish Texts


Bektas Y., ÖZEL S. A.

Innovations in Intelligent Systems and Applications Conference (ASYU), Adana, Türkiye, 4-6 October 2018, pp. 161-165

  • Publication Type: Conference Paper / Full Text Paper
  • Volume Number:
  • DOI Number: 10.1109/asyu.2018.8554031
  • City of Publication: Adana
  • Country of Publication: Türkiye
  • Page Numbers: pp. 161-165
  • Çukurova University Affiliated: Yes

Abstract

Natural language processing (NLP) applications have become crucial as the amount of digitized written and oral text documents has increased. Sentence boundary detection is highly important because it is the first step of most NLP applications. In this study, we examine the effect of using POS (Part-of-Speech) tags on the performance of machine learning-based sentence boundary detection in Turkish texts. To this end, a dataset of 30,000 instances, 15,000 of which are sentences and the remaining 15,000 non-sentence samples, was drawn from a subset of the TNC (Turkish National Corpus). The sub-corpus has 10,000,000 words in total. To develop the dataset, the sub-corpus is searched for characters that may represent the end of a sentence, and the text is divided into pieces at these characters. Each piece is manually labeled as sentence or non-sentence, and 30,000 instances are randomly selected to form the dataset. Each instance in the dataset is converted to a vector using a total of 9 attributes, some taken from rule-based sentence boundary detection studies and some proposed in this study. Two more attributes, the POS tags of the terms before and after the character that may represent the end of the sentence, are then added to the attribute set, and the dataset is converted to vectors again using these 11 attributes. The two datasets are classified using Back Propagation Neural Network, RBF Network, Naive Bayes classifier, Decision Tree, and Support Vector Machines to evaluate the performance of supervised learning methods on sentence boundary detection. The experimental evaluation shows that including POS tags increases the success of sentence boundary detection for all classifiers, and that the most successful classifier is the decision tree, whose classification accuracy improves from 84.7% to 86.2% when POS tags are considered.
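
The dataset construction described in the abstract (locate candidate end-of-sentence characters, then encode each candidate as an attribute vector with and without POS tags) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not list the candidate characters or the 9 attributes, so the candidate set `.!?`, the three surface attributes, and the `TOY_POS` lookup table standing in for a real Turkish POS tagger are all hypothetical.

```python
import re

# Hypothetical candidate characters; the abstract does not enumerate them.
CANDIDATE_PATTERN = re.compile(r"[.!?]")

# Hypothetical toy lookup standing in for a real Turkish POS tagger.
TOY_POS = {"Dr": "Abbr", "Ali": "Noun", "geldi": "Verb",
           "Sonra": "Adv", "gitti": "Verb"}
POS_IDS = {"Unknown": 0, "Noun": 1, "Verb": 2, "Adv": 3, "Abbr": 4}


def candidate_contexts(text):
    """Return (word_before, char, word_after) for every candidate character."""
    contexts = []
    for match in CANDIDATE_PATTERN.finditer(text):
        i = match.start()
        before = text[:i].split()
        after = text[i + 1:].split()
        prev_word = before[-1] if before else ""
        next_word = after[0] if after else ""
        contexts.append((prev_word, text[i], next_word))
    return contexts


def base_features(prev_word, next_word):
    """Illustrative surface attributes (assumed, not the paper's 9 attributes)."""
    return [
        int(next_word[:1].isupper()),  # next word starts with an uppercase letter
        int(len(prev_word) <= 3),      # short previous word, possibly an abbreviation
        int(prev_word.isdigit()),      # previous token is a number
    ]


def pos_features(prev_word, next_word):
    """POS tag ids of the words before and after the candidate character."""
    return [POS_IDS[TOY_POS.get(prev_word, "Unknown")],
            POS_IDS[TOY_POS.get(next_word, "Unknown")]]


def vectorize(text, use_pos):
    """Build one attribute vector per candidate, optionally with POS attributes."""
    vectors = []
    for prev_word, _char, next_word in candidate_contexts(text):
        vector = base_features(prev_word, next_word)
        if use_pos:
            vector += pos_features(prev_word, next_word)
        vectors.append(vector)
    return vectors


sample = "Dr. Ali geldi. Sonra gitti."
without_pos = vectorize(sample, use_pos=False)
with_pos = vectorize(sample, use_pos=True)
```

In this toy sample, the period after "Dr" and the period after "geldi" have similar surface attributes, but their POS attributes differ, which is the kind of extra signal the study evaluates. The two vector sets would then be passed, together with the manual sentence / non-sentence labels, to the supervised classifiers named in the abstract.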