Coqui TTS

Introduction

Coqui TTS is a text-to-speech (TTS) library that converts regular text into speech. Unlike the other text-to-speech engines supported by text2speech, it is completely free to use.

Coqui TTS ships pre-trained TTS and vocoder models as part of its package. To get a sense of the best TTS and vocoder models, take a look at this GitHub Discussion post. In the Coqui TTS Hugging Face Space, you can try a few of these models by entering text and listening to the generated audio.

The underlying technology of text-to-speech is highly intricate and will not be the focus of this vignette. However, if you're interested in delving deeper into the subject, here are some recommended talks:

Pushing the frontier of neural text to speech, a webinar by Xu Tan at Microsoft Research Asia
Towards End-to-end Speech Synthesis, a talk given by Yu Zhang, Research Scientist at Google
Text to Speech Deep Dive, a talk given during the ML for Audio Study Group hosted by Hugging Face.

Coqui TTS includes pre-trained models of three kinds: spectrogram models (such as Tacotron2 and FastSpeech2), end-to-end models (including VITS and YourTTS), and vocoder models (like MelGAN and WaveGrad).

Installation

To install Coqui TTS, you will need to enter the following command in the terminal:

$ pip install TTS

Note: If you are using a Mac with an M1 chip, first run the following command in the terminal:

$ brew install mecab

Afterward, you can proceed to install TTS by executing the following command:

$ pip install TTS

Authentication

library(text2speech)

To use Coqui TTS, text2speech needs to know the correct path to the Coqui TTS executable. This path can be obtained through two methods: manual and automatic.

Manual

You have the option to manually specify the path to the Coqui TTS executable in R. This can be done by setting a global option using the set_coqui_path() function:

set_coqui_path("your/path/to/tts")

To determine the location of the Coqui TTS executable, you can enter the command which tts in the terminal.

Internally, the set_coqui_path() function runs options("path_to_coqui" = path) to set the provided path as the value for the path_to_coqui global option, as long as the Coqui TTS executable exists at that location.
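The behaviour described above can be sketched in base R. This is a minimal illustration and not the package's source: set_path_sketch() is a hypothetical name, and the only details taken from the text are the existence check and the path_to_coqui option. Sys.which() is the base R counterpart of running which tts in the terminal.

```r
# Hypothetical helper mirroring the behaviour described above;
# this is NOT the actual source of set_coqui_path().
set_path_sketch <- function(path) {
  # Refuse to store a path that does not point at an existing file
  if (!file.exists(path)) {
    stop("No Coqui TTS executable found at: ", path)
  }
  options("path_to_coqui" = path)
  invisible(path)
}

# Base R counterpart of `which tts`; yields "" when tts is not on the PATH
tts_path <- unname(Sys.which("tts"))

if (nzchar(tts_path)) {
  set_path_sketch(tts_path)
  print(getOption("path_to_coqui"))
}
```

If tts is not on the PATH, the sketch sets nothing, and the path has to be supplied by hand.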

Automatic

The functions tts_auth(service = "coqui"), tts_voices(service = "coqui"), and tts(service = "coqui") incorporate a way to search through a predetermined list of known locations for the Coqui TTS executable. If none of these paths yield a valid TTS executable, an error message will be generated, directing you to use set_coqui_path() to manually set the correct path.
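A search of this kind might look roughly like the following sketch. The function name find_coqui_sketch() and the candidate locations are assumptions made for illustration; the actual list of locations checked by text2speech is not reproduced here.

```r
# Hypothetical search helper; the candidate list below is illustrative
# and is NOT the list of locations used by text2speech.
find_coqui_sketch <- function(candidates) {
  # Keep only the candidates that exist on disk
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    stop("Coqui TTS executable not found. ",
         "Use set_coqui_path() to set the path manually.")
  }
  # Return the first match
  found[[1]]
}

candidates <- c(
  getOption("path_to_coqui", default = ""),  # a path set earlier, if any
  unname(Sys.which("tts")),                  # whatever is on the PATH
  path.expand("~/.local/bin/tts")            # assumed example location
)
candidates <- candidates[nzchar(candidates)]

# Either the path that was found or the error message
result <- tryCatch(find_coqui_sketch(candidates), error = conditionMessage)
print(result)
```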

List Voices

The function tts_voices(service = "coqui") is a wrapper for the system command tts --list_models, which lists the released Coqui TTS models.

tts_voices(service = "coqui")

The result is a tibble with the following columns: language, dataset, model_name, and service.

language: The language code associated with the speaker.
dataset: The dataset on which the text-to-speech model, denoted by model_name, was trained.
model_name: The name of the text-to-speech model.
service: The TTS service used (Amazon, Google, Microsoft, or Coqui TTS).
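To make the columns concrete, here is a sketch of how they relate to Coqui model identifiers, which take the form type/language/dataset/model. The three identifiers below are illustrative input rather than captured output, and the parsing is an assumption about one way to build such a table, not necessarily how text2speech does it. A plain data frame stands in for the tibble.

```r
# Illustrative Coqui model identifiers (type/language/dataset/model)
ids <- c(
  "tts_models/en/ljspeech/tacotron2-DDC_ph",
  "tts_models/en/ljspeech/fast_pitch",
  "tts_models/es/mai/tacotron2-DDC"
)

# Split each identifier on "/" and stack the pieces into a matrix
parts <- do.call(rbind, strsplit(ids, "/", fixed = TRUE))

# Column 1 is the model type; columns 2 to 4 map onto the tibble columns
voices <- data.frame(
  language   = parts[, 2],
  dataset    = parts[, 3],
  model_name = parts[, 4],
  service    = "coqui",
  stringsAsFactors = FALSE
)
print(voices)
```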

You can find a list of papers associated with some of the implemented models for Coqui TTS here.

By providing the values from this tibble (language, dataset, and model_name) in tts(), you can select the specific voice you want for text-to-speech synthesis.
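As a rough sketch of that selection step, the example below filters a small stand-in table and picks a model name. The rows are illustrative. In practice the table would come from tts_voices(service = "coqui"), and the final tts() call is left commented out because it needs a working Coqui TTS installation.

```r
# Stand-in for the tibble returned by tts_voices(service = "coqui");
# the rows are illustrative only.
voices <- data.frame(
  language   = c("en", "en", "es"),
  dataset    = c("ljspeech", "ljspeech", "mai"),
  model_name = c("tacotron2-DDC_ph", "fast_pitch", "tacotron2-DDC"),
  service    = "coqui",
  stringsAsFactors = FALSE
)

# Keep the English voices trained on the ljspeech dataset
english <- voices[voices$language == "en" & voices$dataset == "ljspeech", ]

# Pick one model name to pass on to tts()
chosen <- english$model_name[english$model_name == "fast_pitch"]
print(chosen)

# Requires Coqui TTS:
# tts(text = "Hello world!", service = "coqui", model_name = chosen)
```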

Text-to-Speech

To convert text to speech, you can use the function tts(text = "Hello world!", service = "coqui").

tts(text = "Hello world!", service = "coqui")

The result is a tibble with the following columns: index, original_text, text, wav, file, audio_type, duration, and service. Some of the noteworthy ones are:

text: If the original_text exceeds the character limit, text represents the outcome of splitting original_text. Otherwise, text remains the same as original_text.
file: The location where the audio output is saved.
audio_type: The format of the audio file, either mp3 or wav.

By default, the function tts(service = "coqui") uses the tacotron2-DDC_ph model and the ljspeech/univnet vocoder. You can specify a different model with the argument model_name, or a different vocoder with the argument vocoder_name.

tts(text = "Hello world, using a different voice!",
    service = "coqui",
    model_name = "fast_pitch",
    vocoder_name = "ljspeech/hifigan_v2")
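For orientation, the call above corresponds roughly to a Coqui command-line invocation with the --text, --model_name, --vocoder_name, and --out_path flags. The sketch below only builds the argument vector. The helper name build_tts_args_sketch(), the full model identifiers, and the output path are assumptions for illustration, and the way text2speech assembles its command internally may differ.

```r
# Hypothetical helper that assembles arguments for the Coqui `tts` CLI
build_tts_args_sketch <- function(text, model_id, vocoder_id, out_path) {
  c(
    "--text", shQuote(text),         # quote so the shell keeps the text intact
    "--model_name", model_id,
    "--vocoder_name", vocoder_id,
    "--out_path", shQuote(out_path)
  )
}

args <- build_tts_args_sketch(
  text       = "Hello world, using a different voice!",
  model_id   = "tts_models/en/ljspeech/fast_pitch",      # assumed full identifier
  vocoder_id = "vocoder_models/en/ljspeech/hifigan_v2",  # assumed full identifier
  out_path   = file.path(tempdir(), "hello.wav")
)
print(args)

# Requires Coqui TTS on the PATH:
# system2("tts", args)
```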

By default, tts(service = "coqui") also saves the audio output in a temporary folder; its path is shown in the file column of the resulting tibble. However, a temporary directory lasts only as long as the current R session, so that path will no longer exist after you restart R.

A more sustainable workflow is to save the audio output in a local folder. To do so, set the arguments save_local = TRUE and save_local_dest = "/full/path/to/local/folder". Make sure to provide the full path to the local folder.

tts(text = "Hello world! I am saving the audio output in a local folder",
    service = "coqui",
    save_local = TRUE,
    save_local_dest = "/full/path/to/local/folder")
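If audio has already been written to the temporary folder, it can also be copied elsewhere before the session ends. The sketch below uses only base R. The empty .wav file stands in for a path taken from the file column, and the destination folder is a placeholder that sits inside tempdir() only so the example runs anywhere.

```r
# Stand-in for a path found in the file column of a tts() result
temp_wav <- tempfile(fileext = ".wav")
file.create(temp_wav)

# Placeholder destination; replace it with a folder that should persist
dest_dir <- file.path(tempdir(), "audio_output")
dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

# Copy the file, keeping its name
dest_wav <- file.path(dest_dir, basename(temp_wav))
copied <- file.copy(temp_wav, dest_wav, overwrite = TRUE)
print(copied)
```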