IEEE Access, vol.13, pp.209723-209728, 2025 (SCI-Expanded, Scopus)
Identifying remote homologous proteins is an important problem in computational biology. An experimental study was conducted to address it using machine learning and natural language processing algorithms. The SCOP 1.53 dataset, which has 54 families, was used. In this study, two new methods were developed. As a preprocessing step, numerical features were obtained from protein sequences using the TF-IDF vectorization method. Then, data augmentation was performed using the SMOTE-Tomek algorithm. The same preprocessing steps were used in both methods. The first method is a two-stage classifier combining Logistic Regression and a Deep Belief Network (LR-DBN), which achieved an average accuracy of 77% and an F1 score of 75%. The second is a Logistic Regression classifier with Bat optimization (LR-B), which achieved an average accuracy of 84% and an F1 score of 86%. LR-B with the SMOTE-Tomek method performed best, with an ROC-AUC score of 89%. Although LR-DBN with the SMOTE-Tomek method performed slightly worse than LR-B with the SMOTE-Tomek method, it still performed well in detecting remote homologous proteins.
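The TF-IDF preprocessing step mentioned above can be illustrated with a minimal sketch. The abstract does not say how protein sequences are split into tokens, so this example assumes overlapping 3-mers as "words". The smoothed IDF formula, the L2 normalization, and the function names kmers and tfidf_matrix are likewise assumptions made for illustration, not the authors' implementation. The sequences are toy strings, not SCOP 1.53 entries. The SMOTE-Tomek resampling and the LR-DBN and LR-B classifiers are not shown.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_matrix(sequences, k=3):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)

    # Document frequency: in how many sequences each k-mer occurs.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in docs:
        total = sum(doc.values())
        # Term frequency is the k-mer count divided by the number of k-mers.
        row = [(doc[t] / total) * idf[t] if total else 0.0 for t in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm if norm else 0.0 for x in row])
    return vocab, rows


# Toy usage: three short made-up sequences.
sequences = ["MKVLA", "MKVLG", "AAAAA"]
vocab, rows = tfidf_matrix(sequences, k=3)
for seq, row in zip(sequences, rows):
    print(seq, [round(x, 3) for x in row])
```

In this sketch, a k-mer that occurs in fewer sequences receives a larger weight than one shared by many sequences, which is the property that makes TF-IDF features useful as classifier input.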