Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition

Dokuz, Yesim; TÜFEKCİ, ZEKERİYA

doi:10.1007/s11042-022-12304-5

Dokuz Y., TÜFEKCİ Z.

MULTIMEDIA TOOLS AND APPLICATIONS, vol.81, no.7, pp.9969-9988, 2022 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 81 Issue: 7
Publication Date: 2022
DOI Number: 10.1007/s11042-022-12304-5
Journal Name: MULTIMEDIA TOOLS AND APPLICATIONS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
Page Numbers: pp.9969-9988
Keywords: Speech recognition, Deep learning, Mini-batch gradient descent, Hybrid sample selection strategies, LSTM
Affiliated with Çukurova University: Yes

Abstract

With the increasing popularity of deep learning, deep learning architectures are being used in speech recognition. Deep learning based speech recognition has become the state-of-the-art approach for speech recognition tasks due to its outstanding performance over other methods. Deep learning architectures are generally trained with a variant of gradient descent optimization. Mini-batch gradient descent is one such variant, which updates the network parameters after processing a number of training instances. One limitation of mini-batch gradient descent is that mini-batch samples are selected randomly from the training set. This is undesirable in speech recognition, where the training features should cover all possible variations in a speech database. To overcome this limitation, this study proposes hybrid mini-batch sample selection strategies. The proposed strategies combine the gender and accent features of speech databases to select mini-batch samples when training deep learning architectures. Experimental results show that using a hybrid of gender and accent features yields better speech recognition performance than using only one feature. The proposed hybrid mini-batch sample selection strategies could also benefit other application areas that have metadata information, including image recognition and machine vision.
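
The abstract does not spell out the selection rule, so the following is only a minimal sketch of one way a mini-batch could be drawn using gender and accent metadata together, not the authors' method. It assumes each training utterance carries hypothetical `gender` and `accent` fields, groups utterances by the (gender, accent) pair, and interleaves the groups round-robin so that each mini-batch mixes as many pairs as possible. The function name `hybrid_minibatches` and the field names are illustrative assumptions.

```python
import random
from collections import defaultdict


def hybrid_minibatches(samples, batch_size, seed=0):
    """Yield mini-batches that interleave (gender, accent) groups.

    `samples` is a list of dicts with the hypothetical keys
    "id", "gender" and "accent" (assumed metadata, not from the paper).
    """
    rng = random.Random(seed)

    # Group utterances by the combined (gender, accent) feature.
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["gender"], sample["accent"])].append(sample)

    # Shuffle inside each group so the order is random within a group.
    for members in groups.values():
        rng.shuffle(members)

    # Round-robin over the groups: take one utterance from each
    # non-empty group per pass, so neighbours differ in metadata.
    keys = sorted(groups)
    ordered = []
    while any(groups[key] for key in keys):
        for key in keys:
            if groups[key]:
                ordered.append(groups[key].pop())

    # Cut the interleaved sequence into consecutive mini-batches.
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


if __name__ == "__main__":
    # Toy metadata: 2 genders x 2 accents, 4 utterances per pair.
    toy = [
        {"id": i, "gender": g, "accent": a}
        for i, (g, a) in enumerate(
            (g, a) for g in ("f", "m") for a in ("us", "uk") for _ in range(4)
        )
    ]
    for batch in hybrid_minibatches(toy, batch_size=4):
        print([(s["gender"], s["accent"]) for s in batch])
```

With balanced groups and a batch size equal to the number of (gender, accent) pairs, every mini-batch in this sketch contains one utterance per pair. With unbalanced groups, later batches are dominated by the larger groups, which a real strategy would need to handle.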