An Evaluation of Character Level N-gram Termsets in Text Categorization


Coban O., ÖZEL S. A.

International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Türkiye, 28 - 30 September 2018

  • Volume number:
  • City of publication: Malatya
  • Country of publication: Türkiye

Abstract

Text categorization is a text mining process that aims to discover relevant information and relationships in large amounts of text data. Feature extraction is an important preprocessing step in text categorization, as the extracted features are used to represent texts. Several methods, feature models, and algorithms are needed to extract useful features from textual content. One of these methods is frequent itemset mining, a basic data mining technique employed to find interesting patterns in data. Because frequent itemsets (termsets) reflect strong associations between items, they provide more underlying contextual semantics than an individual word. Therefore, they are used in the text mining domain for different purposes (e.g., frequent itemset based text clustering). In this paper, we employ termsets for text representation and use both binary and cardinality based approaches for termset weighting. Unlike existing studies, we use character level n-grams to represent the items in a transaction, in addition to the traditional bag of words model. Our experimental results on the ModApte split of the Reuters-21578 dataset show that performing document-to-transaction conversion at the level of character n-grams improves the performance of the Support Vector Machine classifier.
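
The pipeline described in the abstract (documents converted to transactions of character n-grams, frequent termsets mined from those transactions, and each document weighted against each termset) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (char_ngrams, frequent_termsets, termset_weight), the brute-force counting of fixed-size itemsets in place of a real miner such as Apriori, the toy documents, and in particular the reading of "cardinality based" weighting as the fraction of a termset's members present in the document are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Convert a document into a transaction: the set of its character n-grams."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def frequent_termsets(transactions, min_support=2, size=2):
    """Return itemsets of a fixed size that occur in at least min_support transactions.

    Brute-force counting, used here only for clarity (assumption); a real
    system would use a proper frequent itemset miner.
    """
    counts = Counter()
    for items in transactions:
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    return sorted(ts for ts, c in counts.items() if c >= min_support)


def termset_weight(transaction, termset, scheme="binary"):
    """Weight of one termset for one document.

    binary: 1.0 only if every member of the termset is in the document.
    cardinality: fraction of members present (an assumed interpretation,
    not necessarily the weighting used in the paper).
    """
    matched = sum(1 for item in termset if item in transaction)
    if scheme == "binary":
        return 1.0 if matched == len(termset) else 0.0
    return matched / len(termset)


# Toy documents (hypothetical), not taken from Reuters-21578.
docs = ["oil price", "oil prices", "gold price"]
transactions = [char_ngrams(d, n=3) for d in docs]
termsets = frequent_termsets(transactions, min_support=2, size=2)

# One feature vector per document, one dimension per frequent termset.
vectors = [
    [termset_weight(t, ts, scheme="cardinality") for ts in termsets]
    for t in transactions
]
```

The resulting vectors could then be passed to any classifier, for example a Support Vector Machine as in the paper. Treating transactions as sets discards n-gram frequency, which matches the usual itemset mining setting.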