Parler-TTS Mini v1 ft. ParaSpeechCaps

We finetuned parler-tts/parler-tts-mini-v1 on our ParaSpeechCaps dataset to create a TTS model that can generate speech while controlling for rich styles (pitch, rhythm, clarity, emotion, etc.) with a textual style prompt ('A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.').

ParaSpeechCaps (PSC) is our large-scale dataset that provides rich style annotations for speech utterances, supporting 59 style tags covering speaker-level intrinsic style tags and utterance-level situational style tags. It consists of a human-annotated subset ParaSpeechCaps-Base and a large automatically-annotated subset ParaSpeechCaps-Scaled. Our novel pipeline combining off-the-shelf text and speech embedders, classifiers and an audio language model allows us to automatically scale rich tag annotations for such a wide variety of style tags for the first time.

Please take a look at our paper, our codebase and our demo website for more information.

License: CC BY-NC SA 4.0

Usage

Installation

This repository has been tested with Python 3.11 (conda create -n paraspeechcaps python=3.11), but most other versions should probably work.

git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]

Running Inference

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

generation = model.generate(input_ids=input_description_tokenized.input_ids, prompt_input_ids=input_transcription_tokenized.input_ids, guidance_scale=guidance_scale)

audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)

For a full inference script that includes ASR-based selection via repeated sampling and other scripts, refer to our codebase.

Citation

If you use this model, the dataset or the repository, please cite our work as follows:

@misc{diwan2025scalingrichstylepromptedtexttospeech,
      title={Scaling Rich Style-Prompted Text-to-Speech Datasets}, 
      author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
      year={2025},
      eprint={2503.04713},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.04713}, 
}
Downloads last month
119
Safetensors
Model size
878M params
Tensor type
F32
·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Model tree for ajd12342/parler-tts-mini-v1-paraspeechcaps

Finetuned
(8)
this model

Dataset used to train ajd12342/parler-tts-mini-v1-paraspeechcaps

Space using ajd12342/parler-tts-mini-v1-paraspeechcaps 1