Table of Links
2 Approach
2.2 Multimodal Instruction Finetuning
2.3 Curriculum Learning with Parameter Efficient Finetuning
4 Results
4.1 Evaluation of SpeechVerse models
4.2 Generalization Across Instructions
4.3 Strategies for Improving Performance
6 Conclusion, Limitations, Ethics Statement, and References
A Appendix
A.1 Audio Encoder Pre-training
6 Conclusion
In this work, we propose SpeechVerse, a multimodal framework that enables LLMs to follow natural language instructions for performing diverse speech processing tasks. By combining representations from frozen pre-trained speech and text foundation models through supervised instruction finetuning, SpeechVerse achieves strong zero-shot generalization on unseen tasks. Extensive benchmarking against conventional baselines shows SpeechVerse’s superiority on 9 out of 11 tasks, demonstrating its formidable instruction-following capability. Crucially, SpeechVerse maintains robust performance on out-of-domain datasets, unseen prompts, and even unseen tasks. This highlights the efficacy of our proposed training methodology in imbuing the model with a generalizable skill for mapping text-based instructions to speech processing outputs. Moving forward, we aim to expand SpeechVerse’s capabilities to follow even more complex instructions and generalize to new domains. By separating task specification from model design, SpeechVerse represents a versatile framework that can dynamically adapt to new tasks through natural language without retraining.
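To make the recipe summarized above concrete, below is a minimal PyTorch sketch of the core idea: a frozen speech encoder and a frozen LLM bridged by a small trainable adapter, so that adapted audio features and the embedded text instruction form a single input sequence. The module names, dimensions, and toy stand-in encoder/LLM are illustrative assumptions, not the authors' implementation (which uses pre-trained speech foundation models, FlanT5, and LoRA-based parameter-efficient finetuning).

```python
# Illustrative sketch only -- not the authors' code. Sizes and stand-in modules
# are assumptions chosen to keep the example self-contained and runnable.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Trainable 1-D convolution that downsamples frame-level speech features
    and projects them into the LLM embedding space."""
    def __init__(self, enc_dim=512, llm_dim=1024, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, feats):                 # feats: (batch, frames, enc_dim)
        x = self.conv(feats.transpose(1, 2))  # (batch, llm_dim, frames // stride)
        return x.transpose(1, 2)              # (batch, frames // stride, llm_dim)

class SpeechToLLMBridge(nn.Module):
    """Frozen speech encoder + frozen LLM; only the adapter is trained here
    (LoRA adapters on the LLM could supply extra trainable capacity)."""
    def __init__(self, speech_encoder, llm, llm_embed, enc_dim=512, llm_dim=1024):
        super().__init__()
        self.speech_encoder, self.llm, self.llm_embed = speech_encoder, llm, llm_embed
        self.adapter = AudioAdapter(enc_dim, llm_dim)
        for module in (self.speech_encoder, self.llm, self.llm_embed):
            for p in module.parameters():
                p.requires_grad = False       # keep the foundation models frozen

    def forward(self, audio, instruction_ids):
        with torch.no_grad():
            feats = self.speech_encoder(audio)      # (batch, frames, enc_dim)
        audio_emb = self.adapter(feats)             # (batch, frames', llm_dim)
        text_emb = self.llm_embed(instruction_ids)  # (batch, tokens, llm_dim)
        joint = torch.cat([audio_emb, text_emb], dim=1)
        return self.llm(joint)                      # LLM sees audio + instruction

# Toy usage with placeholder modules standing in for real foundation models.
toy_encoder = nn.Linear(80, 512)                    # stand-in speech encoder
toy_llm = nn.Linear(1024, 1024)                     # stand-in decoder LLM
toy_embed = nn.Embedding(32000, 1024)               # stand-in token embedding
model = SpeechToLLMBridge(toy_encoder, toy_llm, toy_embed)
audio = torch.randn(2, 160, 80)                     # (batch, frames, log-mel bins)
instr = torch.randint(0, 32000, (2, 12))            # tokenized text instruction
output = model(audio, instr)                        # (2, 160 // 4 + 12, 1024)
```

In the paper's actual setup, the frozen encoder and LLM are pre-trained foundation models (e.g., FlanT5 on the text side), and parameter-efficient finetuning as described in Section 2.3, rather than a toy adapter alone, provides the trainable capacity.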
Limitations
While this work demonstrated strong instruction-following capabilities for the multitask SpeechVerse model across a variety of tasks, some limitations remain. The study relied on a single underlying LLM architecture (FlanT5) rather than exploring more recent models tailored for instruction following. Additionally, the trade-off between generalized capabilities on unseen tasks and specialized performance on the original training tasks poses challenges for a single multitask model. Finally, while the model showed promise in handling diverse unseen tasks, its limitations have not been fully characterized across the wide scope of possible instructions, and its performance on such unseen tasks has not been quantitatively measured.
Ethics Statement
All speech datasets we use have anonymous speakers. We neither have access to nor attempt to create any personally identifiable information (PII) about the speakers, and our model neither identifies speakers nor uses speaker embeddings. Most of the work used public open-source datasets for both training and testing. The in-house datasets used for pre-training the Best-RQ encoder and for the SNS task were collected via third-party speech data vendors. No additional data collection was carried out for the work described in this paper.
References
[1] T. Brown et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[2] A. Chowdhery et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[3] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., “Improving language understanding by generative pre-training,” 2018.
[4] J. Achiam et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[5] H. W. Chung et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
[6] L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
[7] H. Touvron et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[8] R. Huang et al., “AudioGPT: Understanding and generating speech, music, sound, and talking head,” arXiv preprint arXiv:2304.12995, 2023.
[9] T. Gemini et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
[10] T. Guo et al., “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024.
[11] W. R. Huang et al., “Multilingual and fully non-autoregressive asr with large language model fusion: A comprehensive study,” arXiv preprint arXiv:2401.12789, 2024.
[12] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2023, pp. 1–8.
[13] R. Ma, M. Qian, P. Manakul, M. Gales, and K. Knill, “Can generative large language models perform asr error correction?” arXiv preprint arXiv:2307.04172, 2023.
[14] P. K. Rubenstein et al., “Audiopalm: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.
[15] T. Wang et al., “VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation,” arXiv preprint arXiv:2305.16107, 2023.
[16] Y. Chu et al., “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023.
[17] M. Wang et al., “SLM: Bridge the thin gap between speech and text foundation models,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2023, pp. 1–8.
[18] D. Zhang et al., “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” arXiv preprint arXiv:2305.11000, 2023.
[19] J. Ao et al., “SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing,” arXiv preprint arXiv:2110.07205, 2021.
[20] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023, pp. 28 492–28 518.
[21] E. J. Hu et al., Lora: Low-rank adaptation of large language models, 2021. arXiv: 2106.09685 [cs.CL].
[22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 5206–5210.
[23] R. Ardila et al., “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
[24] O. Dekel and O. Shamir, “Vox populi: Collecting high-quality labels from a crowd,” in COLT, 2009.
[25] E. Bastianelli, A. Vanzo, P. Swietojanski, and V. Rieser, “Slurp: A spoken language understanding resource package,” arXiv preprint arXiv:2011.13205, 2020.
[26] P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” in Proceedings of machine translation summit x: papers, 2005, pp. 79–86.
[27] R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,” IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, 2017.
[28] C. Wang, A. Wu, and J. Pino, “Covost 2 and massively multilingual speech-to-text translation,” arXiv preprint arXiv:2007.10310, 2020.
[29] C. Cieri et al., “Fisher english training speech part 1 speech ldc2004s13,” Web Download. Philadelphia: Linguistic Data Consortium, 2004.
[30] C. Cieri et al., “Fisher english training part 2, speech ldc2005s13,” Web Download. Philadelphia: Linguistic Data Consortium, 2005.
[31] R. Taori et al., Stanford alpaca: An instruction-following llama model, 2023.
[32] S. Chen et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1505–1518, 2021.
[33] C.-C. Chiu, J. Qin, Y. Zhang, J. Yu, and Y. Wu, “Self-supervised learning with random-projection quantizer for speech recognition,” in International Conference on Machine Learning, PMLR, 2022, pp. 3915–3924.
[34] A. Bapna et al., “Mslam: Massively multilingual joint pre-training for speech and text,” arXiv preprint arXiv:2202.01374, 2022.
[35] Seamless Communication et al., Seamlessm4t: Massively multilingual & multimodal machine translation, 2023. arXiv: 2308.11596 [cs.CL].
[36] X. Li et al., “Multilingual speech translation from efficient finetuning of pretrained models,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds., Online: Association for Computational Linguistics, Aug. 2021, pp. 827–838. DOI: 10.18653/v1/2021.acl-long.68. [Online]. Available: https://aclanthology.org/2021.acl-long.68.
[37] S. Seo, D. Kwak, and B. Lee, “Integration of pre-trained networks with continuous token interface for end-to-end spoken language understanding,” in Proc. ICASSP, 2022, pp. 7152–7156.
[38] Y. Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2022.
[39] A. Derington, H. Wierstorf, A. Özkil, F. Eyben, F. Burkhardt, and B. W. Schuller, “Testing speech emotion recognition machine learning models,” arXiv preprint arXiv:2312.06270, 2023.
[40] B. T. Willard and R. Louf, “Efficient guided generation for llms,” arXiv preprint arXiv:2307.09702, 2023.
[41] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
[42] L. Kaiser et al., “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017.
[43] C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[44] Y.-C. Chen et al., “Speechnet: A universal modularized model for speech processing tasks,” arXiv preprint arXiv:2105.03070, 2021.
[45] J.-B. Alayrac et al., “Flamingo: A visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
[46] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning, PMLR, 2022, pp. 12 888–12 900.
[47] J. Y. Koh, R. Salakhutdinov, and D. Fried, “Grounding language models to images for multimodal inputs and outputs,” in International Conference on Machine Learning, PMLR, 2023, pp. 17 283–17 300.
[48] Z. Peng et al., “Kosmos-2: Grounding multimodal large language models to the world,” arXiv preprint arXiv:2306.14824, 2023.
[49] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
[50] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” arXiv preprint arXiv:2305.11834, 2023.
[51] Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass, “Listen, think, and understand,” arXiv preprint arXiv:2305.10790, 2023.
[52] Y. Shu et al., “Llasm: Large language and speech model,” arXiv preprint arXiv:2308.15930, 2023.
[53] J. Iranzo-Sánchez et al., “Europarl-st: A multilingual corpus for speech translation of parliamentary debates,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8229–8233.
[54] R. Lotfian and C. Busso, “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings,” IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, Oct. 2019. DOI: 10.1109/TAFFC.2017.2736999.
[55] R. Paturi, S. Srinivasan, and X. Li, “Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction,” in Proc. INTERSPEECH 2023, 2023, pp. 3567–3571. DOI: 10.21437/Interspeech.2023-1982.
[56] C. Cieri et al., “2000 hub5 english evaluation speech ldc2002s09,” Web Download. Philadelphia: Linguistic Data Consortium, 2002.
Authors:
(1) Nilaksh Das, AWS AI Labs, Amazon (Equal Contribution);
(2) Saket Dingliwal, AWS AI Labs, Amazon (skdin@amazon.com);
(3) Srikanth Ronanki, AWS AI Labs, Amazon;
(4) Rohit Paturi, AWS AI Labs, Amazon;
(5) Zhaocheng Huang, AWS AI Labs, Amazon;
(6) Prashant Mathur, AWS AI Labs, Amazon;
(7) Jie Yuan, AWS AI Labs, Amazon;
(8) Dhanush Bekal, AWS AI Labs, Amazon;
(9) Xing Niu, AWS AI Labs, Amazon;
(10) Sai Muralidhar Jayanthi, AWS AI Labs, Amazon;
(11) Xilai Li, AWS AI Labs, Amazon;
(12) Karel Mundnich, AWS AI Labs, Amazon;
(13) Monica Sunkara, AWS AI Labs, Amazon;
(14) Daniel Garcia-Romero, AWS AI Labs, Amazon;
(15) Kyu J. Han, AWS AI Labs, Amazon;
(16) Katrin Kirchhoff, AWS AI Labs, Amazon.