A Web page classification system based on a genetic algorithm using tagged-terms as features

Ozel, SELMA

doi:10.1016/j.eswa.2010.08.126

Ozel S. A.

EXPERT SYSTEMS WITH APPLICATIONS, vol.38, no.4, pp.3407-3415, 2011 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 38 Issue: 4
Publication Date: 2011
DOI: 10.1016/j.eswa.2010.08.126
Journal Name: EXPERT SYSTEMS WITH APPLICATIONS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Pages: pp.3407-3415
Çukurova University Affiliated: Yes

Abstract

The rapid growth in the amount of information on the World Wide Web has given rise to topic-specific crawling of the Web. During a focused crawling process, an automatic Web page classification mechanism is needed to determine whether the page being considered is on topic. In this study, we develop a genetic algorithm (GA) based automatic Web page classification system that uses both HTML tags and the terms belonging to each tag as classification features, and that learns an optimal classifier from the positive and negative Web pages in the training dataset. Our system classifies Web pages by simply computing the similarity between the learned classifier and the new Web pages. Existing GA-based classifiers use only HTML tags or only terms as features; in this study both are used together, and optimal weights for the features are learned by our GA. We found that using both HTML tags and the terms in each tag as separate features improves classification accuracy. The composition of the training dataset also affects accuracy: when the number of negative documents is larger than the number of positive documents, the classification accuracy of our system increases up to 95% and exceeds that of the well-known Naive Bayes and k-nearest-neighbor classifiers. (C) 2010 Elsevier Ltd. All rights reserved.
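
The idea of tagged-term features and similarity-based classification can be illustrated with a minimal Python sketch. It is not the system from the paper: the abstract does not name the similarity measure, the tag set, or a decision threshold, so cosine similarity, the tags listed in `extract_features`, the value of `threshold`, and all function and class names here are assumptions made for illustration. The classifier weights are set by hand and merely stand in for weights that a GA would learn.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TaggedTermExtractor(HTMLParser):
    """Count (tag, term) pairs, keying each term by its innermost tracked tag."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.stack = []            # currently open tracked tags
        self.features = Counter()  # (tag, term) -> frequency

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.stack:
            return  # text outside every tracked tag is ignored
        tag = self.stack[-1]
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.features[(tag, term)] += 1


def extract_features(html, tags=("title", "h1", "a", "b", "p")):
    # The tag set is an assumption; the abstract does not list the tags used.
    parser = TaggedTermExtractor(tags)
    parser.feed(html)
    parser.close()
    return parser.features


def cosine(a, b):
    """Cosine similarity between two sparse feature-weight mappings."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_on_topic(html, classifier, threshold=0.3):
    # threshold is a hypothetical cut-off chosen only for this sketch.
    return cosine(extract_features(html), classifier) >= threshold


# Hand-set weights standing in for GA-learned ones (hypothetical values).
classifier = {("title", "course"): 0.9, ("h1", "course"): 0.6, ("p", "homework"): 0.3}

on_topic_page = (
    "<html><head><title>Course syllabus</title></head>"
    "<body><h1>Course</h1><p>Weekly homework</p></body></html>"
)
off_topic_page = "<title>Cheap flights</title><p>Book hotels</p>"

print(is_on_topic(on_topic_page, classifier))
print(is_on_topic(off_topic_page, classifier))
```

Because the same term receives a separate weight under each tag, "course" in a title can count for more than "course" in body text, which is the distinction the abstract credits for the accuracy gain.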