Classification of Bisyllabic Lexical Stress Patterns Using Deep Neural Networks
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2016, Issue 1, March 2016, ICTPP1469
Abstract
Background and Objectives: As English is a stress-timed language, lexical stress plays an important role in how native speakers perceive and process speech. Incorrect stress placement can reduce a speaker's intelligibility and their ability to communicate effectively. The accurate identification of lexical stress patterns is thus a key tool for assessing a speaker's pronunciation in applications such as second language (L2) learning, language proficiency testing and speech therapy. With the increasing use of Computer-Aided Language Learning (CALL) and Computer-Aided Speech and Language Therapy (CASLT) tools, the automatic assessment of lexical stress has become an important component of measuring the quality of a speaker's pronunciation. In this work we propose a Deep Neural Network (DNN) classifier to discriminate between the unequal lexical stress patterns in English words, strong-weak (SW) and weak-strong (WS). The features used to train the DNN are derived from the duration, pitch and intensity of each of the two consecutive syllables, along with a set of energies in different frequency bands. The robustness of the proposed lexical stress detector was validated by testing it on the standard TIMIT dataset, collected from adult male and female speakers distributed over 8 dialect regions.
Method: Our lexical stress classifier takes the speech signal together with the prompted word as input. Figure 1 shows a block diagram of the overall system. The speech signal is first force-aligned with the predetermined phoneme sequence of the word to obtain the time boundaries of each phoneme. The alignment is performed using a Hidden Markov Model (HMM) Viterbi decoder with a set of HMM acoustic models trained on the same corpus, to reduce the error caused by inaccurate phone-level segmentation.
A set of features is then extracted from each syllable, and the features of each pair of consecutive syllables are combined by concatenating the raw extracted features directly into one wide feature vector.
Lexical stress is identified by the variation in pitch, energy and duration between the syllables of a multi-syllabic word. The stressed syllable is characterized by increased energy and pitch as well as a longer duration compared to the other syllables within the same word. We therefore extracted seven features, f1–f7, related to these characteristics, as listed in Table 1. The energy-based features (f1, f2, f3) were extracted after applying the non-linear Teager energy operator (TEO) to the speech signal, to obtain a better estimate of the signal energy and reduce the effect of noise. These seven features are commonly used to detect the stressed syllable in a word. As the speech signal energy is distributed over different frequency bands, we also computed the energy in the Mel-scale frequency bands in each frame of the syllable nucleus. The speech signal was divided into non-overlapping 10 ms frames, and the energy, pitch and frequency-band energies were calculated for each frame.
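As an illustration of the energy features, the discrete Teager energy operator is commonly written as psi[n] = x[n]^2 - x[n-1]*x[n+1]. The Python sketch below applies it to a signal and averages the result over non-overlapping 10 ms frames. The function names, the use of the frame mean, and the dropping of a trailing partial frame are assumptions made for illustration; the abstract does not specify these details.

```python
from typing import List


def teager_energy(x: List[float]) -> List[float]:
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The first and last samples lack a neighbour on one side, so the
    output is two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def frame_energies(x: List[float], sample_rate: int,
                   frame_ms: float = 10.0) -> List[float]:
    """Mean Teager energy in consecutive non-overlapping frames.

    Assumption: samples that do not fill a whole final frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    psi = teager_energy(x)
    n_frames = len(psi) // frame_len
    return [
        sum(psi[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]
```

For a pure sinusoid A*cos(w*n), the operator returns the constant A^2 * sin^2(w), which is why it tracks both amplitude and frequency rather than amplitude alone.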
As seen in Figure 1, to input the raw extracted features directly to the DNN, we concatenate the extracted features into one wide feature vector. Each syllable has 7 scalar values f1–f7 and 27*n Mel-coefficients where n is the number of frames in each syllable's vowel.
To handle variable vowel lengths, we limit the number of input frames provided to the DNN to a maximum of N frames for each syllable. This provides the DNN with a fixed-length Mel-energy input vector and allows it to use information about the distribution of the Mel-energy bands over the vowel. If the vowel length (n) is greater than N frames, only the middle N frames are used. If the vowel length (n) is smaller than N frames, the input frames are padded to N frames. The final size of the input vector to the DNN is 2*(7+27*N) for a pair of consecutive syllables, with N tuned empirically.
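The cropping and padding rule above can be sketched in Python as follows. Zero padding appended after the vowel frames, and the helper names `fix_frames`, `syllable_vector` and `pair_vector`, are assumptions for illustration; the abstract states only that short vowels are padded to N frames.

```python
from typing import List, Sequence, Tuple

N_SCALARS = 7   # scalar features f1-f7 per syllable
N_MEL = 27      # Mel-band energies per frame

Syllable = Tuple[Sequence[float], Sequence[Sequence[float]]]


def fix_frames(frames: Sequence[Sequence[float]], max_frames: int,
               pad_value: float = 0.0) -> List[float]:
    """Keep the middle max_frames frames, or pad up to max_frames, then flatten."""
    n = len(frames)
    if n > max_frames:
        start = (n - max_frames) // 2
        kept = [list(f) for f in frames[start:start + max_frames]]
    else:
        # Assumption: padding frames are constant-valued and appended at the end.
        padding = [[pad_value] * N_MEL for _ in range(max_frames - n)]
        kept = [list(f) for f in frames] + padding
    return [value for frame in kept for value in frame]


def syllable_vector(scalars: Sequence[float],
                    mel_frames: Sequence[Sequence[float]],
                    max_frames: int) -> List[float]:
    """Concatenate f1-f7 with the fixed-length Mel-energy block (7 + 27*N values)."""
    if len(scalars) != N_SCALARS:
        raise ValueError("expected 7 scalar features per syllable")
    return list(scalars) + fix_frames(mel_frames, max_frames)


def pair_vector(first: Syllable, second: Syllable, max_frames: int) -> List[float]:
    """Wide input vector for two consecutive syllables: 2*(7 + 27*N) values."""
    return (syllable_vector(first[0], first[1], max_frames)
            + syllable_vector(second[0], second[1], max_frames))
```

With N = 3, for example, a five-frame vowel contributes its frames 1 to 3 (counting from 0), a one-frame vowel is followed by two padding frames, and the pair vector has 2*(7+27*3) = 176 entries.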
The DNN is trained using mini-batch stochastic gradient descent (MSGD) with an adaptive learning rate. The learning rate starts at an initial value (typically 0.1), and after each epoch the change in the error on the validation data set is computed. If the change is greater than zero (i.e. the error increases), training continues with the same learning rate.
If the error continues increasing for 10 consecutive epochs, the learning rate is halved and the classifier parameters are restored to those that achieved the minimum error. Training is terminated when the learning rate reaches its minimum value (typically 0.0001) or after 200 epochs, whichever is earlier. The performance of the DNN is then computed using a separate testing set.
Experiments and Results: We extracted raw features from consecutive syllables belonging to the same word in the TIMIT speech corpus. On this corpus, we achieved a minimum error rate of 12.6% using a DNN classifier with 6 hidden layers and 100 hidden units per layer. Because there was not enough data for each gender, we were unable to build a separate model for male and female speakers. In Fig. 2, we present the error rate for each gender using a model trained on both male and female data. The results show that the SW pattern is classified more accurately for male speakers than for female speakers, while the WS error rate is lower for female speakers. However, the overall misclassification rate is almost the same for male and female speakers.
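The learning-rate schedule from the Method description (keep the rate while the validation error rises, halve it after 10 consecutive bad epochs, restore the best parameters, stop at the minimum rate or 200 epochs) can be sketched in Python as below. Several points are assumptions rather than details from the abstract: an "increase" is counted whenever the validation error fails to improve on the best value so far, `run_epoch` and `validation_error` are hypothetical callables supplied by the caller, and `run_epoch` is assumed to return a new parameter object rather than modify its argument in place.

```python
from typing import Callable, List, Tuple, TypeVar

Params = TypeVar("Params")


def train_with_halving(
    params: Params,
    run_epoch: Callable[[Params, float], Params],
    validation_error: Callable[[Params], float],
    initial_lr: float = 0.1,
    min_lr: float = 1e-4,
    patience: int = 10,
    max_epochs: int = 200,
) -> Tuple[Params, float, List[float]]:
    """Return (best parameters, best validation error, learning rate per epoch)."""
    lr = initial_lr
    best_params = params
    best_error = validation_error(params)
    bad_epochs = 0
    lr_history: List[float] = []

    for _ in range(max_epochs):
        if lr <= min_lr:          # stop once the rate has reached its minimum
            break
        lr_history.append(lr)
        params = run_epoch(params, lr)   # hypothetical: one pass of mini-batch SGD
        error = validation_error(params)

        if error < best_error:
            best_params, best_error = params, error
            bad_epochs = 0
        else:
            bad_epochs += 1       # same learning rate is kept for now
            if bad_epochs >= patience:
                lr /= 2.0             # halve the learning rate
                params = best_params  # roll back to the best parameters
                bad_epochs = 0

    return best_params, best_error, lr_history
```

Starting from 0.1, ten halvings bring the rate to about 0.0000977, just under the 0.0001 minimum, so under these assumptions training sees at most ten distinct learning rates.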
To study the influence of dialect on the algorithm, we compared the error rate for each dialect under two conditions: with a model trained on the training data of all dialects, and with a model trained on data from all dialects except the tested one, as shown in Fig. 3. The error rate of most dialects remains unchanged, except for DR1, where it increased significantly from 4.8% to 8%. This can be explained by the small number of test samples for this dialect (only 5% of the test samples). DR4 also shows a considerable increase in the error rate. Although the DR1 (New England) dialect had the smallest number of training samples, it produced the lowest error rate of all the dialects. Further work is needed to explain this behavior.
Conclusion: In this work we present a DNN classifier to detect bisyllabic lexical stress patterns in multi-syllabic English words. The DNN classifier is trained using a set of features extracted from pairs of consecutive syllables, related to pitch, intensity and duration, along with energies in different frequency bands. The feature set of each pair of consecutive syllables is combined by concatenating the raw features into one wide vector. When applied to the standard TIMIT adult speech, the algorithm achieved a classification accuracy of 87.4%. The system's performance shows high stability across dialects and genders.