Performance of Using Tag-based Feature Sets in Web Page Classification


ÜNAL H. E., ÖZEL S. A., ÜNAL İ.

Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü Dergisi, vol.22, no.2, pp.583-594, 2018 (Peer-Reviewed Journal)

Abstract

The Web is a large collection of data that grows daily, so an automatic Web page classification mechanism is needed to reach useful information effectively. Most Web pages are HTML documents; this study therefore explores the effect of HTML tags on the classification process and tries to determine the most valuable HTML tags for feature extraction in the classification task. To this end, we employ 13 different datasets and 5 popular classifiers: SVM, naïve Bayes (NB), kNN, C4.5, and OneR. The statistical analysis shows that features extracted solely from the anchor, <p>, or <title> tags can be used as an alternative to features extracted from the whole Web page. SVM is the best among the classifiers used in this study. Using HTML tags for feature extraction improves classification accuracy.
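
To make the idea of tag-based feature sets concrete, the following is a minimal sketch of how term counts might be collected separately for the anchor, <p>, and <title> tags of a page. It is not the authors' pipeline: the names `TagTextCollector` and `tag_features`, the simple tokenizer, and the sample page are all assumptions made for illustration, and only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tags named in the abstract: anchor (<a>), paragraph (<p>), and <title>.
TAGS = ("a", "p", "title")


class TagTextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside selected tags."""

    def __init__(self, tags=TAGS):
        super().__init__()
        self.tags = set(tags)
        self.open_tags = []                  # stack of tracked tags currently open
        self.text = {t: [] for t in tags}    # tag name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching tag, so that an
            # unclosed inner tag does not stay open forever.
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Text nested in several tracked tags counts for each of them.
        for tag in set(self.open_tags):
            self.text[tag].append(data)


def tag_features(html):
    """Return one bag-of-words Counter per tracked tag (assumed tokenizer)."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return {
        tag: Counter(re.findall(r"[a-z0-9]+", " ".join(chunks).lower()))
        for tag, chunks in parser.text.items()
    }


if __name__ == "__main__":
    page = (
        "<html><head><title>Course Page</title></head><body>"
        '<p>Welcome to the <a href="/x">course syllabus</a></p>'
        "<div>ignored footer</div></body></html>"
    )
    for tag, counts in tag_features(page).items():
        print(tag, dict(counts))
```

Each per-tag counter could then be turned into a feature vector and passed to any of the classifiers listed in the abstract; which weighting scheme and which classifier settings the study used is not stated here, so those choices are left open.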