Similarity Detection between Turkish Text Documents with Distance Metrics

Kaya Keles M., ÖZEL S. A.

2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Türkiye, 5 - 08 Ekim 2017, ss.316-321, (Tam Metin Bildiri)

Yayın Türü: Bildiri / Tam Metin Bildiri
Cilt numarası:
Doi Numarası: 10.1109/ubmk.2017.8093399
Basıldığı Şehir: Antalya
Basıldığı Ülke: Türkiye
Sayfa Sayıları: ss.316-321
Çukurova Üniversitesi Adresli: Evet

Özet

The aim of this study is to compare the successes of various distance metrics and to determine the most appropriate methods in order to detect similarities among textual documents written in Turkish. Computing similarities between text documents is the basic step of plagiarism detection, and text mining methods like author detection, text classification and clustering. Therefore, plagiarism detection and text mining applications will be more successful by using the distance metrics that are determined according to the results obtained in this study. For this purpose, chunks of texts in different lengths are selected as the experimental dataset in this study. After that, preprocessing methods are applied to the dataset that is used; therefore new and different experimental scenarios are created by removing stopwords and Turkish characters, and stemming words with Zemberek. According to the experimental results, it is observed that the preprocessing phase increases the accuracy of similarity detection. Especially, stemming using Zemberek increases the success rate. In all cases, the Cosine Similarity method has been observed as more successful than other distance metrics, because of producing more realistic results.