What It Takes to Train a Versatile Speech AI System

20 Jun 2025

Explore the key tasks used to train our speech AI model—from transcription and translation to emotion, sentiment, and speaker detection.

How We Pre-Trained a 300M Parameter Audio Encoder With Random Quantization

19 Jun 2025

We pre-trained a 300M parameter audio encoder using the BEST-RQ method on 300K hours of audio, with detailed hyperparameter tuning and data filtering.
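
The random quantization at the heart of BEST-RQ fits in a few lines: a frozen random projection and a frozen random codebook turn each speech frame into a discrete label that the encoder learns to predict at masked positions. The sketch below is a minimal PyTorch illustration; the input dimension, codebook size, and class name are illustrative defaults, not the post's exact configuration.

```python
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer(torch.nn.Module):
    # BEST-RQ-style quantizer: a frozen random projection plus a frozen
    # random codebook map each speech frame to a discrete target label.
    def __init__(self, input_dim=80, codebook_size=8192, code_dim=16, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # Both tensors are drawn once at init time and never trained.
        self.register_buffer("projection",
                             torch.randn(input_dim, code_dim, generator=g))
        codebook = torch.randn(codebook_size, code_dim, generator=g)
        # Normalizing codes makes nearest-neighbour search a cosine argmax.
        self.register_buffer("codebook", F.normalize(codebook, dim=-1))

    @torch.no_grad()
    def forward(self, features):
        # features: (batch, time, input_dim) log-mel frames.
        projected = F.normalize(features @ self.projection, dim=-1)
        # Index of the closest codebook entry is the masked-prediction target.
        return (projected @ self.codebook.t()).argmax(dim=-1)  # (batch, time)
```

The encoder is trained to predict these frame labels wherever the input is masked; the quantizer itself never receives gradients.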

A Unified Multimodal Approach to Speech Processing with LLMs

19 Jun 2025

SpeechVerse is a multimodal AI framework that uses natural language prompts to perform diverse speech tasks with strong zero-shot generalization.

Merging Audio Input and Textual Instructions into One Unified Model

19 Jun 2025

SpeechVerse is a unified model that uses multi-task learning to map audio and text instructions to textual output, outperforming prior audio-language models.

How Constrained and Joint Decoding Improve Multimodal Speech Models

18 Jun 2025

Learn how constrained and joint decoding strategies boost generalization and accuracy in unseen speech classification tasks with the SpeechVerse model.
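
Constrained decoding here means restricting generation to a closed set of class labels rather than sampling freely from the vocabulary. A minimal sketch of one common way to do this, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; `pick_label` and its arguments are hypothetical names, not the SpeechVerse implementation.

```python
import torch

@torch.no_grad()
def pick_label(model, prompt_ids, label_token_ids):
    # Constrained decoding over a closed label set: score every candidate
    # label by the sum of its token log-probabilities under the model and
    # return the index of the best-scoring label.
    scores = []
    for label in label_token_ids:
        ids = torch.tensor([prompt_ids + label])          # (1, seq_len)
        logprobs = model(ids).logits.log_softmax(dim=-1)  # (1, seq_len, vocab)
        # Logits at position p predict the token at position p + 1, so
        # label token i is scored by position len(prompt_ids) + i - 1.
        score = sum(
            logprobs[0, len(prompt_ids) + i - 1, tok].item()
            for i, tok in enumerate(label)
        )
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because every candidate is a valid label by construction, the model can classify into categories it never saw during training.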

Can AI Models Follow Instructions They've Never Seen Before?

18 Jun 2025

Multitask-WLM generalizes strongly to new prompts and unseen tasks, demonstrating robustness across varied instructions and modalities.

SpeechVerse vs. SOTA: Multi-Task Speech Models in Real-World Benchmarks

18 Jun 2025

Evaluating Multimodal Speech Models Across Diverse Audio Tasks

18 Jun 2025

We benchmark multimodal speech models across ASR, SLU, and PSP tasks using diverse datasets and compare them with strong cascaded baselines.

The Science Behind Audio-Aware Language Models

17 Jun 2025

Learn how curriculum learning and LoRA adapters optimize multimodal models for speech tasks using pre-trained audio encoders and language models.
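
A LoRA adapter adds a small trainable low-rank update to a frozen weight matrix, which is what lets the pre-trained audio encoder and language model stay fixed during multimodal fine-tuning. A minimal PyTorch sketch, with illustrative rank and scaling defaults rather than the article's settings:

```python
import torch

class LoRALinear(torch.nn.Module):
    # Wraps a frozen linear layer with a trainable low-rank update:
    # y = W x + (alpha / r) * B(A(x)), where only A and B receive gradients.
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = torch.nn.Linear(base.in_features, r, bias=False)
        self.lora_b = torch.nn.Linear(r, base.out_features, bias=False)
        torch.nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Because `lora_b` starts at zero, training begins from the unmodified pre-trained model, and only the two low-rank factors are ever updated.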