The RapVerse Dataset: A New Benchmark in Text-to-Music and Motion Generation

7 Aug 2025

Abstract and 1. Introduction

  2. Related Work

    2.1 Text to Vocal Generation

    2.2 Text to Motion Generation

    2.3 Audio to Motion Generation

  3. RapVerse Dataset

    3.1 Rap-Vocal Subset

    3.2 Rap-Motion Subset

  4. Method

    4.1 Problem Formulation

    4.2 Motion VQ-VAE Tokenizer

    4.3 Vocal2unit Audio Tokenizer

    4.4 General Auto-regressive Modeling

  5. Experiments

    5.1 Experimental Setup

    5.2 Main Results Analysis and 5.3 Ablation Study

  6. Conclusion and References

A. Appendix

2.1 Text to Vocal Generation

Text-to-audio Dataset. Several singing vocal datasets exist, but each has its constraints. For instance, PopCS [32] and OpenSinger [22] are limited to Chinese, while NUS-48E [8] and NHSS [55] contain only a few hours of songs. Moreover, JVS-MuSiC [58] and NUS-48E [8] offer a few hours of songs from dozens of singers, whereas OpenSinger [22] provides a more extensive collection with tens of hours from a single singer. Notably, our dataset is the first curated specifically for rap songs, offering 108 hours of vocals from multiple singers.

Text-to-vocal Models. Recent advancements in text-to-speech (TTS) models, including WaveNet [40], FastSpeech 1 and 2 [52, 51], and EATS [7], have significantly improved the quality of synthesized speech. However, singing voice synthesis (SVS) presents a greater challenge due to its reliance on musical scores in addition to lyrics. Recent generation models [32, 16, 74, 26] perform excellently at generating singing voices, but their performance tends to deteriorate on out-of-distribution data. To handle this, StyleSinger [73] introduces a Residual Style Adaptor and an Uncertainty Modeling Layer Normalization.
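To make the TTS/SVS distinction concrete, the sketch below shows the kind of conditioning an SVS system typically consumes on top of plain text; the class, field names, and values are illustrative assumptions, not taken from any of the cited models.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: the extra score-level conditioning SVS needs,
# compared with plain TTS, which conditions on text alone.

@dataclass
class SVSInput:
    lyrics: List[str]             # phonemes or syllables, aligned to notes
    note_pitches: List[int]       # MIDI note numbers from the musical score
    note_durations: List[float]   # duration of each note in seconds

def validate(sample: SVSInput) -> None:
    # Each lyric unit must be paired with a pitch and a duration,
    # which is part of what makes SVS data harder to collect than TTS data.
    assert len(sample.lyrics) == len(sample.note_pitches) == len(sample.note_durations)

sample = SVSInput(
    lyrics=["shine", "bright"],
    note_pitches=[67, 69],        # G4, A4
    note_durations=[0.5, 1.0],
)
validate(sample)
```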

2.2 Text to Motion Generation

Text-to-motion Dataset. Current text-motion datasets, such as KIT [44], AMASS [35], and HumanML3D [13], are constrained by limited data coverage, typically spanning only tens of hours and lacking whole-body motion representation. To overcome these shortcomings, Motion-X [30] was developed, providing an extensive dataset that encompasses whole-body motions.

Text-to-motion Models. Text-to-motion models [43, 71, 14, 13, 34, 4] have gained popularity because natural language is a convenient interface for specifying motion. Recent advancements have focused on diffusion models [71, 25, 61], which, unlike deterministic generation models [2, 11], enable finer-grained and more diverse outputs. Chen et al. [4] proposed a motion latent-based diffusion model to enhance generative quality and reduce computational cost. However, the misalignment between natural language and human motions remains a challenge. Lu et al. [34] introduced the Text-aligned Whole-body Motion generation framework to address this issue by simultaneously generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions.
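As a rough illustration of why diffusion-based generators yield more diverse motions than deterministic regressors, the following minimal sketch runs a generic DDPM-style ancestral sampling loop with a placeholder denoiser; it is not the architecture of any cited work, and all shapes and hyperparameters are assumptions.

```python
import numpy as np

# Generic diffusion sampling sketch (placeholder denoiser, assumed shapes).
rng = np.random.default_rng(0)
T, D = 50, 8                      # denoising steps, motion feature dimension
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, text_emb):
    # Stand-in for a learned noise-prediction network conditioned on text.
    return 0.1 * x_t + 0.01 * text_emb

def sample_motion(text_emb):
    x = rng.standard_normal(D)                    # start from pure noise
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, text_emb)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(D)  # stochastic step
    return x

text_emb = rng.standard_normal(D)
motions = [sample_motion(text_emb) for _ in range(3)]  # three distinct motions, same text
```

The noise injected at every step is what lets repeated sampling for the same text produce distinct motions, whereas a deterministic regressor maps each text to a single output.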

2.3 Audio to Motion Generation

Audio-to-motion Dataset. Co-speech datasets fall into two categories: pseudo-labeled (PGT) and motion-captured (mocap). PGT datasets [3, 15, 12, 68] use disconnected 2D or 3D keypoints to represent the body, which allows extraction at lower cost but with limited accuracy. Mocap datasets, on the other hand, provide annotations for only limited body parts, with some focusing solely on the head [9, 5, 65] or the body [57, 10]. Differing from these, BEATX [31] contains mesh data of both head and body. However, all of these datasets target speech-to-motion generation. Our RapVerse, in contrast, contains paired text-vocal-motion data, enabling simultaneous motion and vocal generation.

Table 1: Comparison of Audio and Motion Datasets. (a) compares our Rap-Vocal Subset with existing singing vocal datasets. (b) compares our Rap-Motion Subset with existing human motion datasets. †: “Whole” means whole-body motion including face, body, and hand.

Audio-to-motion Models. Audio-to-motion models [59, 56, 72, 31, 68] aim to produce human motion from audio inputs. Recognizing the intricate relationship between audio and the human face, TalkSHOW [68] generates the face separately from the other body parts. However, this approach has limitations, such as the absence of lower-body and global motion generation. EMAGE [31] utilizes masked gestures together with vocals for generation. Our model goes a step further by being the first to generate paired audio-motion data directly from text.
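Since the contrast with speech-only corpora hinges on what a paired text-vocal-motion example contains, here is a hypothetical sketch of such a record; the class, field names, rates, and dimensions are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical paired sample: lyrics aligned with a vocal waveform and a
# whole-body motion sequence, so one model can be supervised on both outputs.

@dataclass
class PairedSample:
    lyrics: str                  # text input
    vocal: np.ndarray            # waveform, shape (num_audio_samples,)
    motion: np.ndarray           # whole-body pose sequence, shape (num_frames, pose_dim)
    fps: float = 30.0            # assumed motion frame rate
    sample_rate: int = 24_000    # assumed audio sample rate

sample = PairedSample(
    lyrics="keep the beat alive",
    vocal=np.zeros(24_000 * 2),      # 2 s of placeholder audio
    motion=np.zeros((60, 156)),      # 2 s of placeholder whole-body poses
)
# Speech-to-motion corpora supervise only motion from audio; a paired record like
# this lets text drive both the vocal and the motion targets jointly.
```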

Figure 2: RapVerse dataset collection pipeline. There are two pathways for rap-vocal and rap-motion subsets, respectively.

Authors:

(1) Jiaben Chen, University of Massachusetts Amherst;

(2) Xin Yan, Wuhan University;

(3) Yihang Chen, Wuhan University;

(4) Siyuan Cen, University of Massachusetts Amherst;

(5) Qinwei Ma, Tsinghua University;

(6) Haoyu Zhen, Shanghai Jiao Tong University;

(7) Kaizhi Qian, MIT-IBM Watson AI Lab;

(8) Lie Lu, Dolby Laboratories;

(9) Chuang Gan, University of Massachusetts Amherst.


This paper is available on arxiv under CC BY 4.0 DEED license.