
What It Takes to Train a Versatile Speech AI System
20 Jun 2025
Explore the key tasks used to train our speech AI model—from transcription and translation to emotion, sentiment, and speaker detection.

How We Pre-Trained a 300M Parameter Audio Encoder With Random Quantization
19 Jun 2025
We pre-trained a 300M parameter audio encoder using the BEST-RQ method on 300K hours of audio, with detailed hyperparameter tuning and data filtering.

A Unified Multimodal Approach to Speech Processing with LLMs
19 Jun 2025
SpeechVerse is a multimodal AI framework that uses natural language prompts to perform diverse speech tasks with strong zero-shot generalization.

Merging Audio Input and Textual Instructions into One Unified Model
19 Jun 2025
SpeechVerse is a unified model that uses multi-task learning to convert audio and text instructions into textual output, outperforming prior audio-language models.

How Constrained and Joint Decoding Improve Multimodal Speech Models
18 Jun 2025
Learn how constrained and joint decoding strategies boost generalization and accuracy in unseen speech classification tasks with the SpeechVerse model.

Can AI Models Follow Instructions They've Never Seen Before?
18 Jun 2025
Multitask-WLM shows strong generalization to new prompts and unseen tasks, demonstrating its robustness across varied instructions and modalities.

SpeechVerse vs. SOTA: Multi-Task Speech Models in Real-World Benchmarks
18 Jun 2025
See how SpeechVerse's multi-task speech models stack up against state-of-the-art systems on real-world benchmarks.

Evaluating Multimodal Speech Models Across Diverse Audio Tasks
18 Jun 2025
We benchmark multimodal speech models across ASR, SLU, and PSP tasks using diverse datasets and compare them with strong cascaded baselines.

The Science Behind Audio-Aware Language Models
17 Jun 2025
Learn how curriculum learning and LoRA adapters optimize multimodal models for speech tasks using pre-trained audio encoders and language models.