Utilizing Language Model for Term Weighting in Text Categorization


Coban O., ÖZEL S. A.

International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Türkiye, 28 - 30 September 2018

  • Volume number:
  • City of publication: Malatya
  • Country of publication: Türkiye

Abstract

In information retrieval (IR), the language model is an alternative to the vector space model and to other probabilistic term weighting models. Its basic principle is to construct a model for each document and to rank the documents by a score estimated from that model. The score represents the likelihood that the query was generated from the given document. Its simplicity and effectiveness make the language model an attractive basis for developing new text retrieval strategies. In text classification, which borrows methods from the IR domain, documents are generally represented through the vector space model (VSM). The success of the VSM depends on term weighting, an important step that quantifies the contribution of a term to the semantics of a text. In this paper, we investigate the use of the language model for term weighting and its effect on text classification performance. We compare language model based term weighting with several popular traditional term weighting methods, including Binary, TF (Term Frequency), and TF-IDF (Term Frequency-Inverse Document Frequency), on three different Turkish datasets. Our experimental results show that language model based term weighting generally outperforms the traditional methods, with the exception of binary weighting.
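
The four weighting schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact language model formula, so the `lm` weight below is an assumption: a unigram document model smoothed with the collection model (Jelinek-Mercer interpolation), which is one common choice in the IR literature and not necessarily the one used in the paper. The function name `term_weights` and the parameter `lam` are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs, lam=0.5):
    """Compute Binary, TF, TF-IDF and an LM-style weight per term and document.

    docs: list of documents, each a list of tokens.
    lam:  interpolation weight between document and collection model
          (assumed value; the paper's setting is not stated in the abstract).
    """
    n_docs = len(docs)
    # Collection-wide term counts, used for the smoothing background model.
    collection = Counter(t for d in docs for t in d)
    coll_len = sum(collection.values())
    # Number of documents containing each term, used for IDF.
    doc_freq = Counter(t for d in docs for t in set(d))

    all_weights = []
    for d in docs:
        tf = Counter(d)
        doc_len = len(d)
        weights = {}
        for term, freq in tf.items():
            idf = math.log(n_docs / doc_freq[term])
            p_doc = freq / doc_len                # maximum-likelihood P(t | d)
            p_coll = collection[term] / coll_len  # background P(t | C)
            weights[term] = {
                "binary": 1.0,
                "tf": float(freq),
                "tfidf": freq * idf,
                # Assumption: smoothed unigram probability used as the weight.
                "lm": lam * p_doc + (1 - lam) * p_coll,
            }
        all_weights.append(weights)
    return all_weights


if __name__ == "__main__":
    toy_docs = [["a", "b", "a"], ["b", "c"]]
    for i, w in enumerate(term_weights(toy_docs)):
        print(i, w)
```

Each resulting weight vector could then be passed to any standard classifier; the sketch covers only the weighting step, not the classification experiments reported in the paper.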