Multilingual Text-to-Speech (TTS) with Realistic AI Voices Using the ElevenLabs API

Photo by Kane Reinholdtsen / Unsplash

After my recent article on mass translating text into different languages with ChatGPT and OpenAI, it’s time for the next exciting step: bringing my creation to life and making it talk!

As usual, let's kick things off with a process flow diagram:

Fire up PowerShell and define one variable for the directory that holds the text files and another for the ffmpeg binary:

$directoryPath = "C:\Users\gregory.laroche\Downloads\Speak2text"
$ffmpegPath = "$directoryPath\ffmpeg.exe"

Select the first text file whose name contains "_translated":

$inputFile = Get-ChildItem -Path $directoryPath -Filter "*_translated*.txt" | Select-Object -First 1

Verify that a matching file was actually found, and stop if not:

if ($null -eq $inputFile) {
    Write-Error "No file found containing '_translated' in the name in the directory $directoryPath"
    exit
}

Read the text file and split it into sentences after every period that is followed by whitespace, dropping any empty fragment left over at the end of the file:

$text = Get-Content -Path $inputFile.FullName -Raw
$sentences = @($text -split '(?<=\.\s)' | Where-Object { $_.Trim() })
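
To see what the lookbehind pattern does, here is a small sketch on a made-up sample string (the sample text is my own, not from the translated files). Note that only a period followed by whitespace triggers a split, so a question mark or a decimal number such as "2.5" does not start a new sentence.

```powershell
# Hypothetical sample text; the real script reads it from the translated file
$sample = "Bonjour. Comment allez-vous? Version 2.5 est sortie. Merci."

# The lookbehind matches the empty position right after a period plus one
# whitespace character, so each piece keeps its period and trailing space
$parts = $sample -split '(?<=\.\s)'

# Show each piece between brackets to make the trailing spaces visible
$parts | ForEach-Object { "[$_]" }
# [Bonjour. ]
# [Comment allez-vous? Version 2.5 est sortie. ]
# [Merci.]
```

If your texts rely heavily on "?" or "!" as sentence endings, the groups sent to the API will simply be a bit longer.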

Then group the text into sets of three sentences each:

$sentenceGroups = [System.Collections.Generic.List[System.String]]::new()
for ($i = 0; $i -lt $sentences.Length; $i += 3) {
    $group = $sentences[$i..[math]::Min($i + 2, $sentences.Length - 1)] -join ' '
    $sentenceGroups.Add($group)
}
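
As a quick illustration of the grouping loop, the sketch below runs the same logic on seven placeholder sentences (the sentences are invented for the demo). The last group simply holds whatever is left over.

```powershell
# Seven placeholder sentences standing in for the split text
$sentences = 1..7 | ForEach-Object { "Sentence $_." }

$groups = [System.Collections.Generic.List[string]]::new()
for ($i = 0; $i -lt $sentences.Length; $i += 3) {
    # Clamp the upper index so the final group can be shorter than three
    $last = [math]::Min($i + 2, $sentences.Length - 1)
    $groups.Add(($sentences[$i..$last] -join ' '))
}

$groups
# Sentence 1. Sentence 2. Sentence 3.
# Sentence 4. Sentence 5. Sentence 6.
# Sentence 7.
```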

Define the ElevenLabs API endpoint and the request headers:

$url = "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb?output_format=mp3_44100_192"
$headers = @{
    "Content-Type" = "application/json"
    "xi-api-key"   = "your ElevenLabs API key here"
}

Don't forget to replace the placeholder with your own private ElevenLabs API key.
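
If you would rather not paste the key into the script, one option is to read it from an environment variable. This is only a sketch: the function name New-ElevenLabsHeaders and the variable name ELEVENLABS_API_KEY are my own choices for illustration, not something ElevenLabs prescribes.

```powershell
# Hypothetical helper: builds the request headers from an API key that
# defaults to the (assumed) ELEVENLABS_API_KEY environment variable
function New-ElevenLabsHeaders {
    param(
        [string]$ApiKey = $env:ELEVENLABS_API_KEY
    )

    if ([string]::IsNullOrWhiteSpace($ApiKey)) {
        throw "No API key found. Set the ELEVENLABS_API_KEY environment variable first."
    }

    return @{
        "Content-Type" = "application/json"
        "xi-api-key"   = $ApiKey
    }
}

# Usage (uncomment once the environment variable is set):
# $headers = New-ElevenLabsHeaders
```

This keeps the key out of the script file, which helps if you share the script or put it under version control.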

Create a temporary directory to store the audio files for the sentences:

$tempDir = "$directoryPath\temp"
if (-Not (Test-Path -Path $tempDir)) {
    New-Item -ItemType Directory -Path $tempDir | Out-Null
}

Call the ElevenLabs API once per sentence group, collecting the paths of the generated audio files in a list:

$audioFiles = @()
foreach ($group in $sentenceGroups) {
    # Ensure the group is treated as a string
    $group = [string]$group

    $body = @{
        "text" = $group
        "voice_settings" = @{
            "use_speaker_boost" = $true
            "stability" = 0.4
            "similarity_boost" = 0.15
        }
        "model_id" = "eleven_multilingual_v2"
    } | ConvertTo-Json -Depth 3

    # Define the output file path for this group
    $outputFilePath = "$tempDir\$(New-Guid).mp3"

    try {
        # Make the POST request and save the response to the file
        Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $body -OutFile $outputFilePath

        # Add the file path to the list
        $audioFiles += $outputFilePath
    } catch {
        Write-Error "Failed to process group: $group"
        Write-Error $_.Exception.Message
    }
}

The audio returned in each response is saved to the temporary directory, and its path is stored in the list for later concatenation.
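
One thing to watch with accented languages: depending on the PowerShell version, a string body may not be sent as UTF-8, which can garble characters such as "é" or "ç". A cautious workaround, sketched below with a made-up French sentence, is to convert the JSON to UTF-8 bytes yourself and pass the byte array as the body. Reading the input file with Get-Content -Encoding UTF8 may be needed for the same reason.

```powershell
# Hypothetical request body with accented characters
$body = @{
    "text"     = "Voilà, ça marche très bien."
    "model_id" = "eleven_multilingual_v2"
} | ConvertTo-Json

# Encode the JSON explicitly so the accents survive the trip to the API
$bodyBytes = [System.Text.Encoding]::UTF8.GetBytes($body)

# The byte array can then be sent instead of the string
# ($url, $headers and $outputFilePath as defined in the script):
# Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $bodyBytes -OutFile $outputFilePath
```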

The model uses pre-made voices that are trained for specific use cases, such as narration, news, documentary, or video game voices.

Several languages are available, including French, Spanish, Italian, German, and Polish.

Here is the link to ElevenLabs documentation.

You need to include the correct "voice_id" in the URL of the API endpoint. Make sure to listen to a sample first, as accent, intonation, and speaking style can vary from one voice model to another.
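
To look up voice IDs without leaving PowerShell, something like the sketch below should work. It assumes the voices listing endpoint (GET /v1/voices) and a response containing a "voices" array with "name" and "voice_id" fields, so check the ElevenLabs documentation before relying on it. The function name Get-ElevenLabsVoices is my own.

```powershell
# Hypothetical helper: lists the voices available to the given API key.
# Assumes the response has a "voices" array with "name" and "voice_id".
function Get-ElevenLabsVoices {
    param(
        [Parameter(Mandatory)]
        [string]$ApiKey
    )

    $response = Invoke-RestMethod -Uri "https://api.elevenlabs.io/v1/voices" `
        -Method Get -Headers @{ "xi-api-key" = $ApiKey }

    # Keep only the two columns needed to pick a voice
    $response.voices | Select-Object name, voice_id
}

# Usage (requires a valid key):
# Get-ElevenLabsVoices -ApiKey "your ElevenLabs API key here"
```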

I personally use Daniel (voice_id: JBFqnCBsd6RMkjVDRZzb) for French speech:

$url = "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb?output_format=mp3_44100_192"

The output format at the end of the request URL depends on the kind of subscription you have.
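
Since both the voice and the output format live in the URL, a tiny helper makes them easier to swap. This is a sketch: the function name Get-TtsUrl is hypothetical, and the default format "mp3_44100_128" is an assumption on my side for plans that do not allow 192 kbps, so verify the available formats in the documentation.

```powershell
# Hypothetical helper: builds the text-to-speech URL for a voice and format
function Get-TtsUrl {
    param(
        [Parameter(Mandatory)]
        [string]$VoiceId,

        # Assumed fallback format; adjust to what your subscription allows
        [string]$OutputFormat = "mp3_44100_128"
    )

    # $($VoiceId) keeps PowerShell from reading "?" as part of the variable name
    "https://api.elevenlabs.io/v1/text-to-speech/$($VoiceId)?output_format=$OutputFormat"
}

$url = Get-TtsUrl -VoiceId "JBFqnCBsd6RMkjVDRZzb" -OutputFormat "mp3_44100_192"
$url
# https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb?output_format=mp3_44100_192
```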

For the voice settings, I’ve defined "Stability," "Similarity," and "Speaker Boost":

There are a few more. You can find them in the documentation here.

The settings are defined in this part of the request body:

"voice_settings" = @{
    "use_speaker_boost" = $true
    "stability" = 0.4
    "similarity_boost" = 0.15
}
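
If you plan to experiment with these values, a small helper can catch typos before a request is sent. The sketch below assumes that "stability" and "similarity_boost" accept values between 0 and 1 (please confirm in the documentation). The function name New-VoiceSettings is hypothetical.

```powershell
# Hypothetical helper: builds the voice_settings hashtable and rejects
# values outside the assumed 0..1 range
function New-VoiceSettings {
    param(
        [ValidateRange(0.0, 1.0)]
        [double]$Stability = 0.4,

        [ValidateRange(0.0, 1.0)]
        [double]$SimilarityBoost = 0.15,

        [bool]$UseSpeakerBoost = $true
    )

    @{
        "use_speaker_boost" = $UseSpeakerBoost
        "stability"         = $Stability
        "similarity_boost"  = $SimilarityBoost
    }
}

# Same values as in the script; -Stability 1.5 would raise a validation error
$voiceSettings = New-VoiceSettings -Stability 0.4 -SimilarityBoost 0.15
```

The resulting hashtable can be dropped into the request body in place of the inline "voice_settings" block.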

Next up, let’s save our awesome creation as "output.mp3":

$finalOutputFilePath = "$directoryPath\output.mp3"

Concatenate all the audio files into one using ffmpeg. The concat demuxer reads the list of files from a small text file:

$concatListFile = "$tempDir\concat_list.txt"
$audioFiles | ForEach-Object { "file '$_'" } | Set-Content -Path $concatListFile
& $ffmpegPath -f concat -safe 0 -i $concatListFile -c copy $finalOutputFilePath
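
For reference, this is roughly what ends up in the list file, shown here with two made-up file names instead of the GUID-based ones. The commented lines at the end sketch one way to stop the script if ffmpeg reports a failure, using PowerShell's $LASTEXITCODE variable.

```powershell
# Hypothetical clip paths standing in for the GUID-named files
$audioFiles = @("C:\temp\a.mp3", "C:\temp\b.mp3")

# One "file '<path>'" line per clip, in playback order
$concatLines = $audioFiles | ForEach-Object { "file '$_'" }
$concatLines
# file 'C:\temp\a.mp3'
# file 'C:\temp\b.mp3'

# After the real ffmpeg call, a non-zero exit code signals a problem:
# & $ffmpegPath -f concat -safe 0 -i $concatListFile -c copy $finalOutputFilePath
# if ($LASTEXITCODE -ne 0) { throw "ffmpeg failed with exit code $LASTEXITCODE" }
```

Checking the exit code before the cleanup step avoids deleting the individual clips when the merge did not succeed.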

Then clean up the temporary files and folder:

Remove-Item -Path $tempDir -Recurse -Force

And finally, rename "output.mp3" after the input file, dropping "_translated" from the name and swapping the extension for ".mp3":

$baseFileName = $inputFile.Name -replace '_translated', ''
$newFileName = [System.IO.Path]::ChangeExtension($baseFileName, ".mp3")
$newFilePath = Join-Path -Path $directoryPath -ChildPath $newFileName
Rename-Item -Path $finalOutputFilePath -NewName $newFileName -Force
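
Here is the name transformation on its own, with an invented input file name, so you can see what the final audio file will be called:

```powershell
# Hypothetical input file name
$inputName = "article_translated.txt"

# Drop the "_translated" marker, then swap the extension
$baseFileName = $inputName -replace '_translated', ''
$newFileName = [System.IO.Path]::ChangeExtension($baseFileName, ".mp3")

$newFileName
# article.mp3
```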

Et voilà, my creation is alive and talking back to me!😅

Photo by Chris Luengas / Unsplash

As usual, here’s the complete script:

# Define the directory path
$directoryPath = "C:\Users\gregory.laroche\Downloads\Speak2text"

# Get the first .txt file containing "_translated" in the name
$inputFile = Get-ChildItem -Path $directoryPath -Filter "*_translated*.txt" | Select-Object -First 1

# Check if the file exists
if ($null -eq $inputFile) {
    Write-Error "No file found containing '_translated' in the name in the directory $directoryPath"
    exit
}

# Read the text from the input file
$text = Get-Content -Path $inputFile.FullName -Raw

# Split text into sentences after each period followed by whitespace, dropping empty fragments
$sentences = @($text -split '(?<=\.\s)' | Where-Object { $_.Trim() })

# Group sentences into groups of 3
$sentenceGroups = [System.Collections.Generic.List[System.String]]::new()
for ($i = 0; $i -lt $sentences.Length; $i += 3) {
    $group = $sentences[$i..[math]::Min($i + 2, $sentences.Length - 1)] -join ' '
    $sentenceGroups.Add($group)
}

# Define the URL and headers
$url = "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb?output_format=mp3_44100_192"
$headers = @{
    "Content-Type" = "application/json"
    "xi-api-key"   = "your ElevenLabs API key here"
}

# Define a temporary directory for storing individual audio files
$tempDir = "$directoryPath\temp"
if (-Not (Test-Path -Path $tempDir)) {
    New-Item -ItemType Directory -Path $tempDir | Out-Null
}

# Initialize a list to store the paths of the downloaded audio files
$audioFiles = @()

# Process each group of sentences
foreach ($group in $sentenceGroups) {
    # Ensure the group is treated as a string
    $group = [string]$group

    # Define the body of the request
    $body = @{
        "text" = $group
        "voice_settings" = @{
            "use_speaker_boost" = $true
            "stability" = 0.4
            "similarity_boost" = 0.15
        }
        "model_id" = "eleven_multilingual_v2"
    } | ConvertTo-Json -Depth 3

    # Log the body for debugging
    Write-Output "Request Body: $body"

    # Define the output file path for this group
    $outputFilePath = "$tempDir\$(New-Guid).mp3"

    try {
        # Make the POST request and save the response to the file
        Invoke-RestMethod -Uri $url -Method Post -Headers $headers -Body $body -OutFile $outputFilePath

        # Add the file path to the list
        $audioFiles += $outputFilePath
    } catch {
        Write-Error "Failed to process group: $group"
        Write-Error $_.Exception.Message
    }
}

# Define the final output file path
$finalOutputFilePath = "$directoryPath\output.mp3"

# Path to ffmpeg
$ffmpegPath = "$directoryPath\ffmpeg.exe"

# Concatenate the audio files into a single file
$concatListFile = "$tempDir\concat_list.txt"
$audioFiles | ForEach-Object { "file '$_'" } | Set-Content -Path $concatListFile

& $ffmpegPath -f concat -safe 0 -i $concatListFile -c copy $finalOutputFilePath

# Clean up the temporary directory
Remove-Item -Path $tempDir -Recurse -Force

# Output a message indicating the file has been saved
Write-Output "The final audio file has been saved to $finalOutputFilePath"

# Remove "_translated" from the $inputFile variable
$baseFileName = $inputFile.Name -replace '_translated', ''

# Define the new file name with .mp3 extension
$newFileName = [System.IO.Path]::ChangeExtension($baseFileName, ".mp3")

# Define the new file path
$newFilePath = Join-Path -Path $directoryPath -ChildPath $newFileName

# Rename the file (Rename-Item expects a name here, not a full path)
Rename-Item -Path $finalOutputFilePath -NewName $newFileName -Force

# Output a message indicating the file has been renamed
Write-Output "The final audio file has been renamed to $newFilePath"

And below are the script and ffmpeg binary:

Speaking of AI voice quality, check out the text below about LLMs, with the voice generated in English, Spanish, and Italian (download the audio file for each language):

📕
"LLMs are artificial neural networks that utilize the transformer architecture, invented in 2017. The largest and most capable LLMs, as of June 2024, are built with a decoder-only transformer-based architecture, which enables efficient processing and generation of large-scale text data."