Analysis of Previous Research on Automatic Speech Recognition Models for Low-Resource Languages

Tharmakulasingham Inthirakumaaran
10 min read · Feb 15, 2024

Today, speech-related applications are widely used in many fields, such as speech topic identification and spoken command recognition. Automatic Speech Recognition (ASR) has evolved considerably: recent research shows human-level performance on some tasks, and state-of-the-art speech-based user interfaces can recognize free-form speech commands with ease. However, creating an ASR system for a language is a resource-intensive task, and not all languages have ASR models of the quality available for well-resourced languages such as English. ASR for Low-Resource Languages (LRLs) remains a challenge because of the limited corpora and data available. Many efforts have been made to improve ASR for these languages. Through observation, researchers have found that building domain-specific models yields better results for LRLs. At the same time, past research has looked into classifying speech while addressing data scarcity, and some studies even use English phoneme-based ASR models as a base on which to build ASR for LRLs. In this article, we review these methods and analyze the research done on ASR for LRLs.

Speech recognition started to become inseparable from modern society with Alexander Graham Bell's discovery of how to convert sound waves into electrical impulses, and with the first speech recognition system, developed by Davis et al. [1] to recognize telephone-quality digits spoken at a normal speech rate. Today, innovations such as Google Assistant and Amazon Alexa, which dominate everything from smartphones to home automation, are applications that depend on speech recognition services. These applications are now capable of identifying the intent of free-form speech commands given by the user. However, according to Ram et al. [2], achieving this requires ASR systems and Natural Language Understanding (NLU) systems to work together with a very high level of accuracy.

The ASR system converts a sequence of acoustic features into the most likely sequence of words. Typically, Mel-Frequency Cepstral Coefficients (MFCCs) of the speech signal are used as input features. Early ASR systems combined an acoustic model, a pronunciation lexicon that maps phones to words, and a language model that ranks the likelihood of word sequences [3]. Today we have deep neural network-based end-to-end ASR models [4], [5], whose advantage is that they model the acoustics, pronunciation, lexicon, and language jointly in a single model. The output of the ASR module is then used as the input to the NLU model, which processes it and outputs semantic labels for a given text sequence; the NLU model is trained on labeled data using supervised learning. A few studies use either an n-best list of ASR outputs or the intermediate features of the ASR, in order to avoid the errors introduced by relying on the single best ASR output [6], [7], [8].
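
As a rough illustration of the acoustic front end described above, the snippet below is a minimal sketch of MFCC extraction using the librosa library. The file name `utterance.wav`, the 16 kHz sample rate, and the choice of 13 coefficients are assumptions for illustration only, not part of any cited system.

```python
# Minimal sketch: extracting MFCC features as ASR input (assumes librosa is installed).
import librosa
import numpy as np

def extract_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load a waveform and return a (num_frames, n_mfcc) matrix of MFCC features."""
    signal, sample_rate = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # transpose so each row is one analysis frame

features = extract_mfcc("utterance.wav")  # placeholder file name
print(features.shape)                     # e.g. (num_frames, 13)
```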

However, the languages for which ASR systems have been developed so far are just a fraction of the languages that exist. According to the Ethnologue website (https://www.ethnologue.com/), there are an estimated 7,139 living languages in the world. Ethnologue defines a living language as "one that has at least one speaker for whom it is their first language"; extinct languages and languages spoken only as a second language are excluded from this count, but both verbal and visual-kinetic (signed) languages are included. Apart from a small fraction, most of these languages lack the electronic resources needed for speech and/or language processing; we refer to them as Low-Resource Languages (LRLs). A rough count of well-resourced languages can be obtained by listing how many languages are supported by core technologies and resources, such as Google Translate (around 100 languages in 2019), Google Search (more than one hundred languages in 2021), Wiktionary (~180 languages in 2021), and Google Voice Search (30+ languages and accents in 2021).

Because of this lack of resources, innovative data collection methodologies (for instance via crowdsourcing [10]) or models in which information is shared between languages (e.g., multilingual acoustic models [11]) are required. Ideally, more than 1,000 hours of transcribed speech data are needed to train an accurate ASR model [9]. Then there is the issue of ASR accuracy itself. Social and cultural aspects of the targeted language bring additional problems: languages with many regional dialects, code-switching or code-mixing (switching from one language to another within the discourse), and a large proportion of non-native speakers. Errors made by the ASR component can propagate into the NLU system and result in false outputs [7]. Thus building an accurate ASR model for such a language is a problem in itself.

According to [12], another challenge with LRLs is bridging the gap between language experts (the speakers themselves) and technology experts (system developers). Indeed, it is often almost impossible to find native speakers with the technical skills needed to develop ASR systems for their own language. Moreover, LRLs are often poorly covered in the linguistic literature, and very few studies describe them. To bootstrap systems for such languages, one has to borrow resources and knowledge from similar languages, which requires the help of dialectologists (to find proximity indices between languages), phoneticians (to map the phonetic inventory of the targeted under-resourced language onto those of better-resourced ones), and so on.

Moreover, some LRLs are interesting because they challenge common paradigms and practices: is the word the best unit for language modeling? Is the phoneme the best unit for acoustic modeling? In addition, for some (rare, endangered) languages it is often necessary to work with ethno-linguists in order to access native speakers and to collect data in accordance with basic technical and ethical rules ([13] studies these issues extensively). All of these aspects make building an ASR model for these languages a multi-disciplinary challenge.

Wiesner et al. [16] use a low-resource ASR development method together with a multilingual speech corpus carrying universal phone annotations [17]. Although the training data was very limited, the results were good. The text output of this ASR is then fed into a classifier model that identifies the topic of the utterance.
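
To make the second stage of such a pipeline concrete, here is a minimal sketch of a topic classifier trained directly on ASR transcript text, using scikit-learn. The example transcripts, topic labels, and test query are invented placeholders; Wiesner et al.'s actual system is considerably more involved.

```python
# Sketch of a topic classifier over ASR transcript text (invented toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

transcripts = [
    "please check my account balance",
    "transfer money to my savings account",
    "will it rain in colombo tomorrow",
    "weather forecast for the weekend",
]
topics = ["banking", "banking", "weather", "weather"]

topic_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
topic_clf.fit(transcripts, topics)

# Classify the 1-best transcript of a new utterance.
print(topic_clf.predict(["check the balance of my account"]))  # -> ['banking']
```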

The best output of an ASR system is not always the correct one, and it is hard to select the correct hypothesis from the list of candidates the system produces. He et al. [8] and Yaman et al. [7] have explored these issues and proposed frameworks that overcome them by using the n-best list of ASR outputs together with joint optimization techniques. However, the overhead of developing an ASR system for the targeted language remains.
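
The following sketch illustrates the general n-best idea under simplified assumptions; it is not the discriminative or joint-optimization framework of [7] or [8]. Class probabilities from a small text classifier are weighted by each hypothesis's ASR confidence and summed, so that evidence from the whole n-best list, not just the 1-best output, contributes to the decision. The training sentences, labels, and confidence values are invented for illustration.

```python
# Hedged sketch: pooling classifier evidence over an ASR n-best list.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Fit a tiny text classifier (invented data) that exposes class probabilities.
train_texts = ["check my account balance", "transfer money to savings",
               "will it rain tomorrow", "weather in colombo today"]
train_labels = ["banking", "banking", "weather", "weather"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Three competing ASR hypotheses for one utterance, with illustrative confidences.
nbest = [("check my account glance", 0.5),
         ("check my account balance", 0.3),
         ("chick my count balance", 0.2)]

# Weight each hypothesis's class probabilities by its ASR confidence and sum.
pooled = sum(conf * clf.predict_proba([hyp])[0] for hyp, conf in nbest)
print(clf.classes_[np.argmax(pooled)])  # label with the highest pooled score
```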

Liu et al. [18] and Wiesner et al. [16] have used unsupervised feature extraction methods that do not require transcribed or annotated speech data. They make use of features such as phone-like units discovered via acoustic unit discovery, or word-like units discovered via unsupervised term discovery. Still, these unsupervised methods require more data to learn accurate feature representations, and considerable computational power to process it.
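
As a deliberately crude stand-in for the acoustic unit discovery idea, the sketch below clusters MFCC frames with k-means so that each frame receives a discrete, phone-like label without any transcripts. Real acoustic unit discovery systems (e.g., the Bayesian models used in this line of work) are far more sophisticated, and the random features here are placeholders for real MFCCs.

```python
# Crude illustration of unsupervised "unit" labelling via k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for a real (num_frames, 13) matrix of MFCCs from untranscribed audio.
rng = np.random.default_rng(0)
mfcc_frames = rng.normal(size=(2000, 13))

# Cluster frames into 50 "units" so each frame gets a discrete, phone-like label.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(mfcc_frames)
pseudo_units = kmeans.labels_
print(pseudo_units[:20])  # sequence of discovered unit ids for the first 20 frames
```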

Buddhika et al. [19] use MFCC features to build speech intent classifiers for a low-resource language. They restricted their approach to the banking domain and built domain-specific models, using classifiers such as Support Vector Machines (SVMs) and Convolutional Neural Networks (CNNs) to identify intents directly from MFCC features. They achieved 74% classification accuracy on a 10-hour banking dataset.
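
A minimal sketch in the spirit of this approach, not Buddhika et al.'s actual system: each utterance is reduced to a fixed-length vector (here assumed to be its mean MFCC vector) and an SVM is trained to predict the intent. The random features, the two intent labels, and the train/test split are placeholders, so the printed accuracy is meaningless beyond demonstrating the workflow.

```python
# Sketch: SVM intent classifier over fixed-length per-utterance MFCC summaries.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: 200 utterances, each already reduced to a 13-dim mean-MFCC vector,
# with two invented intents (0 = "check_balance", 1 = "money_transfer").
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 13))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
intent_clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, intent_clf.predict(X_test)))  # ~chance on random data
```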

Chen et al. [6] proposed an intent identification method that uses intermediate features of a pre-trained English ASR model: the character probability values generated by the ASR serve as features for a CNN-based intent classification model, with good results in the call-center domain. Similarly, Lugosch et al. [20] used a pre-training strategy and obtained good results on a 14.7-hour dataset, identifying not only the intent but also slot values such as the action, object, and location mentioned in the speech query. With this approach there is no need to rely on the 1-best output of the ASR, and the system can be optimized jointly. However, both studies use a large English corpus to train an ASR model that is then used to identify intents in the same language.
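
Below is a hedged sketch of the kind of model this describes: the per-frame character probabilities produced by an ASR are treated as an (alphabet × time) input and fed to a small 1-D CNN that outputs intent logits. The alphabet size, number of intents, layer sizes, and the random input batch are assumptions for illustration, not the architectures used in [6] or [20].

```python
# Sketch: 1-D CNN intent classifier over ASR character-probability sequences (PyTorch).
import torch
import torch.nn as nn

class CharProbIntentCNN(nn.Module):
    """Convolves over time, pools to a fixed-length vector, then predicts the intent."""
    def __init__(self, alphabet_size: int = 29, num_intents: int = 6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(alphabet_size, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over time to a fixed-length vector
        )
        self.fc = nn.Linear(64, num_intents)

    def forward(self, char_probs):                  # char_probs: (batch, alphabet, time)
        pooled = self.conv(char_probs).squeeze(-1)  # (batch, 64)
        return self.fc(pooled)                      # (batch, num_intents) intent logits

model = CharProbIntentCNN()
dummy_batch = torch.rand(8, 29, 120)  # 8 utterances, 120 ASR frames each (placeholder)
print(model(dummy_batch).shape)       # torch.Size([8, 6])
```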

Karunanayake et al. [21] made use of a pre-training strategy to build speech intent identification for LRLs: a pre-trained English model is used to identify the intent of LRL speech commands. They obtained an overall accuracy of 80% using a 1-hour dataset. In another study [22], the same authors presented a phoneme-based, domain-specific speech intent identification methodology for LRLs.
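
The transfer-learning pattern behind this line of work can be sketched as follows. This is an illustrative pattern, not Karunanayake et al.'s exact setup: a feature extractor standing in for a model pre-trained on English speech is frozen, and only a small new classification head is trained on the low-resource-language commands. The GRU encoder, dimensions, and random inputs are stand-ins.

```python
# Sketch of the transfer-learning pattern: freeze a pre-trained encoder, train a new head.
import torch
import torch.nn as nn

feature_dim, num_lrl_intents = 256, 6  # assumed sizes for illustration

# Stand-in for an encoder whose weights were pre-trained on English speech.
pretrained_encoder = nn.GRU(input_size=13, hidden_size=feature_dim, batch_first=True)
for param in pretrained_encoder.parameters():
    param.requires_grad = False        # freeze the pre-trained weights

# Only this small head is trained on the (e.g. 1-hour) low-resource-language data.
new_head = nn.Linear(feature_dim, num_lrl_intents)
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)

mfcc_batch = torch.rand(4, 120, 13)    # (batch, time, mfcc) placeholder input
_, hidden = pretrained_encoder(mfcc_batch)
logits = new_head(hidden.squeeze(0))   # (batch, num_lrl_intents) intent scores
print(logits.shape)
```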

In short, much of the research has focused either on improving low-resource ASR itself or on speech intent identification using features other than the ASR text output, such as the intermediate features of ASR models. Researchers have even begun exploring BERT-style pre-trained models, which are considered state of the art for tasks such as TIMIT phoneme classification.

Although much of the recent progress has produced better ASR models for LRLs, they still cannot be compared with those for high-resource languages, and this continues to keep LRLs from having wide speech recognition applications. It is clear that we need more advanced ASR models, as well as support from organizations working with those LRLs, to address many of the pertinent issues. In particular, progress with smaller languages and those with extremely limited resources will most likely depend on significant resource sharing, and such sharing will benefit greatly from organizations and facilities that make it easy for researchers and technologists to access available resources in a wide range of languages. Let's hope that the current wave of interest in LRLs will stimulate cooperation along these lines, together with continuing scientific research to support such languages and, ultimately, support from the native speakers themselves.

References

[1] K. Davis, R. Biddulph, and S. Balashek, "Automatic recognition of spoken digits," J. Acoust. Soc. Am., vol. 24, p. 637, Nov 1952.

[2] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar et al., "Conversational AI: The science behind the Alexa Prize," arXiv preprint arXiv:1801.03604, 2018.

[3] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011.

[4] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.

[5] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.

[6] Y.-P. Chen, R. Price, and S. Bangalore, “Spoken language understanding without speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6189–6193.

[7] S. Yaman, L. Deng, D. Yu, Y.-Y. Wang, and A. Acero, “An integrative and discriminative technique for spoken utterance classification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 6, pp. 1207–1214, 2008.

[8] X. He and L. Deng, “Speech-centric information processing: An optimization-oriented approach,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1116–1135, 2013.

[9] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, “Automatic speech recognition for under-resourced languages: A survey,” Speech Communication, vol. 56, pp. 85–100, 2014.

[10] H. Gelas, S. Teferra Abate, L. Besacier, and F. Pellegrino, "Quality assessment of crowdsourcing transcriptions for African languages," in Interspeech 2011, Florence, Italy, 28–31 August 2011.

[11] T. Schultz and A. Waibel, "Language independent and language adaptive acoustic modeling for speech recognition," Speech Communication, vol. 35, pp. 31–51, 2001.

[12] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic speech recognition for under-resourced languages: A survey," January 2014.

[13] A. Ramesh, V. B. Parthasarathy, R. Haque, and A. Way, "Investigating low-resource machine translation for English-to-Tamil," 2020.

[14] N. Singh, "Literature review on automatic speech recognition," International Journal of Computer Applications (0975–8887), vol. 41, no. 8, March 2012.

[15] J. F. Hemdal and G. W. Hughes, "A feature based computer recognition program for the modeling of vowel perception," in Models for the Perception of Speech and Visual Form, W. Wathen-Dunn, Ed. Cambridge, MA: MIT Press.

[16] M. Wiesner, C. Liu, L. Ondel, C. Harman, V. Manohar, J. Trmal, Z. Huang, N. Dehak, and S. Khudanpur, “Automatic speech recognition and topic identification for almost-zero-resource languages,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2018-September, 2018, pp. 2052–2056.

[17] K. M. Knill, M. J. Gales, A. Ragni, and S. P. Rath, "Language independent and unsupervised acoustic models for speech recognition and keyword spotting," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2014, pp. 16–20.

[18] C. Liu, J. Trmal, M. Wiesner, C. Harman, and S. Khudanpur, “Topic identification for speech without asr,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2017-August, 2017, pp. 2501–2505.

[19] D. Buddhika, R. Liyadipita, S. Nadeeshan, H. Witharana, S. Jayasena, and U. Thayasivam, "Domain specific intent classification of Sinhala speech data," in 2018 International Conference on Asian Language Processing (IALP). IEEE, 2018, pp. 197–202.

[20] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, “Speech model pre-training for end to end spoken language understanding,” arXiv preprint arXiv:1904.03670, 2019.

[21] Y. Karunanayake, U. Thayasivam, and S. Ranathunga, "Transfer learning based free-form speech command classification for low-resource languages," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019, pp. 288–294.

[22] Y. Karunanayake, U. Thayasivam, and S. Ranathunga, “Sinhala and Tamil Speech Intent Identification From English Phoneme Based ASR”, 2019
