Using Gentle forced aligner

In my first post I outlined a few of my interests in creative use of the voice and started to talk about the prep work I’m doing for an upcoming project. That project will use a large body of sound files, each containing a single spoken word, that can be called up and played back at will in a Max/MSP patch. Or that was the hope.

That was March 9; it is now late May. I’ve done a lot of work since then but neglected to document it. Time to fix that. I left off with a breezy explanation of how I used Gentle forced aligner to segment a long audio file of spoken text into individual words, but it wasn’t that simple. Let’s start by talking about Gentle.

It’s a forced aligner: a tool that aligns an audio file containing speech with a script or transcript of the expected content, giving precise timing information and additional data. Professor Devaney suggested using one of these tools because it should give me much more accurate results when segmenting individual words, whereas segmenting based on parameters like loudness or silence detection would likely lead to many errors. Like many forced aligners, Gentle is built upon Kaldi, a speech recognition toolkit intended for researchers. The website has a friendly demonstration of how it works:

Gentle running its distributed alignment example, Alvin Lucier’s “I Am Sitting in a Room.” Note the breakdown of the word “different” into phones, and also the greyed out words “my speech,” which could not be aligned.

and a link to the GUI version of the program, but I wasn’t able to get the GUI to work. In the end this was a boon: being forced to install the command line version of Gentle gave me access to many more possibilities that I would later realize were desirable.

It was also difficult. Compared to other forced aligners, Gentle claims to be “easier to install and use.” If that’s the case, the others must be impossible. I was very rusty with the command line at the start of the semester and assumed that I was simply not doing it right, but it became apparent that others were having similar problems. Much googling led to a thread on Gentle’s GitHub repository full of people complaining about installation issues wherein one user posted their own “bug-free version” and another gave step-by-step installation instructions.

If you have plans to use Gentle, save yourself some hassle and install this version.

It took some playing around with the various scripts, but I eventually figured out how to get the web version running on a local server: run the “serve.py” script, then open http://localhost:8765/ in a browser.
From there I had access to the same interface as on Gentle’s website. I loaded in my audio file of 1000 words and pasted the transcript into the text box. After running the alignment, I checked the CSV file that was created and found many unaligned words with no timestamps, so I tried using shorter bits. I settled on running 50 words at a time because at that size I was consistently getting no errors.
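
For anyone who wants to automate the 50-words-at-a-time step rather than cutting up the transcript by hand, here is a minimal Python sketch. The function name chunk_words and the placeholder transcript are my own inventions for illustration, and the chunk size of 50 is just the value that happened to work for me. It only splits the text; the audio would still need to be cut into matching pieces.

```python
def chunk_words(words, size=50):
    """Split a list of words into consecutive chunks of at most `size` words."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [words[i:i + size] for i in range(0, len(words), size)]


# Hypothetical stand-in for the real 1000-word transcript.
transcript = " ".join(f"word{n}" for n in range(1000))

chunks = chunk_words(transcript.split())
print(len(chunks), len(chunks[0]))
```

Each chunk can then be joined back into a string with " ".join(chunk) and pasted into the text box alongside its matching piece of audio.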

I’m not sure why Gentle got hung up on larger files, and why sending it the exact same information but in smaller chunks caused no problems. Perhaps it’s the artificial nature of the input. The word lists I’m using have no grammatical structure, nor is the timing of their delivery anything like normal speech — the words are read to a click track at 60 bpm, one word per beat. If part of Gentle’s machine learning approach to alignment wants to match the input to typical speech patterns, I am doing my best to muddy the waters.

After running each set of 50 words through the alignment process, I was given a CSV file with precise timing info for each word. Time to write a script to segment the words into individual sound files.
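
Before segmenting anything, it helps to confirm that every word in a chunk actually got a timestamp. The sketch below is one way to do that in Python. It assumes the CSV has no header row and that each row holds the word, the aligned word, a start time and an end time in seconds, with the time fields left empty for unaligned words; check your own output before relying on that layout. The function name read_alignment and the sample rows are made up for illustration and are not real Gentle output.

```python
import csv
import io


def read_alignment(csv_text):
    """Return (aligned, unaligned) from Gentle-style CSV text.

    Assumes headerless rows of: word, aligned word, start, end (seconds),
    with start/end left empty when a word could not be aligned.
    """
    aligned, unaligned = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        word = row[0]
        start = row[2] if len(row) > 2 else ""
        end = row[3] if len(row) > 3 else ""
        if start and end:
            aligned.append((word, float(start), float(end)))
        else:
            unaligned.append(word)
    return aligned, unaligned


# Made-up sample rows; the last word has no timestamps.
sample = "apple,apple,0.12,0.58\nriver,river,1.1,1.49\nlantern,,,\n"

aligned, unaligned = read_alignment(sample)
print(aligned)
print(unaligned)
```

If the second list comes back non-empty, that chunk is a candidate for re-running in smaller pieces.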
