So I had my segmented words; now I just needed to run batches of analysis to get information that my Max patch could use to trigger the words. Professor Devaney showed me some of the features I might look to extract and methods in Python to do so. Getting F0 and MFCC content looked particularly promising. It seemed like this would be the end of my need for the Gentle aligner, but in addition to the CSV file that simply gives each word’s start and end time, Gentle also outputs a JSON file that breaks down each word by phone and gives each phone’s duration. Prof. Devaney suggested that this data might also come in handy, and that I should hold on to it.
At first I thought Gentle’s use of the term “phone” was an abbreviation for “phoneme,” but looking into it I found that these are distinct, if related, concepts. A phoneme, according to the dubious research source Wikipedia, “is a speech sound in a given language that, if swapped with another phoneme, could change one word to another.” Phonemes distinguish meaning and are language-based. Phones, on the other hand, are physical speech sounds outside of, or perhaps before, specific languages or meaning. It makes sense, then, that Gentle, which analyzes a sound file without direct human input, is analyzing and outputting information about the sounds rather than their meaning.
Now I needed to extract and store the phone data, which ended up requiring a few steps. Instead of having a single JSON file with data for each chunk of 50 words, I wanted one JSON file per audio file, but I didn’t want to do this manually for thousands of files. So I wrote some Python code to send a directory of audio files through Gentle.
The aligner expects a .txt file for the transcript, but I don’t have one for each word. So the code strips the file extension and the info I’ve added (“-i-1” and so on) from the file name, leaving just the word to be transcribed, and makes that the content of a temporary text file, which is fed to Gentle as the transcript. It does something similar to name the JSON file:
import os
import tempfile

# list_of_files, dir_audio and dir_output are assumed to be defined earlier
for file_name in list_of_files:
    # strip everything after the word in the file name for content of temporary file
    sep1 = '-'
    stripped1 = file_name.split(sep1, 1)[0]
    # strip ".wav" from the file name for name of .json file
    sep2 = '.'
    stripped2 = file_name.split(sep2, 1)[0]
    # make a temporary file with the word to be aligned as the content
    with tempfile.NamedTemporaryFile('w+t', suffix='.txt') as fp:
        fp.write(stripped1)
        # rewind the file (this also flushes the word to disk)
        fp.seek(0)
        # run gentle forced aligner while the temporary file still exists
        os.system('python3 align.py ' f'{dir_audio}{file_name} {fp.name} -o {dir_output}{stripped2}.json')
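To make the two string operations concrete, here is a minimal sketch that isolates them in a helper. The function name `names_for` and the example file name are made up for illustration. The sketch assumes the naming pattern described above: the word, then “-i-1”-style info, then “.wav”.

```python
def names_for(file_name):
    """Return (transcript word, JSON base name) for one audio file name."""
    # everything before the first "-" is the word itself
    word = file_name.split('-', 1)[0]
    # everything before the first "." is the file name without ".wav"
    json_base = file_name.split('.', 1)[0]
    return word, json_base


print(names_for('any-i-1.wav'))  # ('any', 'any-i-1')
```

A hyphenated word or a file name containing an extra dot would be cut short by this approach, so it only suits file names that follow the pattern exactly.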
Running this loop over the directory produces one JSON file of phone information per word. Let’s look at a representative file:
{
  "transcript": "any",
  "words": [
    {
      "alignedWord": "any",
      "case": "success",
      "end": 0.29000000000000004,
      "endOffset": 3,
      "phones": [
        {
          "duration": 0.01,
          "phone": "eh_B"
        },
        {
          "duration": 0.01,
          "phone": "n_I"
        },
        {
          "duration": 0.12,
          "phone": "iy_E"
        }
      ],
      "start": 0.15,
      "startOffset": 0,
      "word": "any"
    }
  ]
}
JSON, I learned, is a common data storage format that is more flexible than CSV because data can be nested at multiple levels. Here, for example, some of the entries under “words” hold a single value (“alignedWord”: “any”), whereas the “phones” entry holds multiple sub-levels, one for each phone, each of which also has a duration.
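As a rough illustration of that nesting, the sketch below loads the representative file with Python’s built-in `json` module and walks down to the phones. The JSON is pasted in as a string, trimmed to the fields used here, so the example stands alone; in practice it would be read from one of the output files.

```python
import json

# the representative file from above, trimmed to the fields used here
sample = '''
{
  "transcript": "any",
  "words": [
    {
      "alignedWord": "any",
      "case": "success",
      "start": 0.15,
      "end": 0.29000000000000004,
      "phones": [
        {"duration": 0.01, "phone": "eh_B"},
        {"duration": 0.01, "phone": "n_I"},
        {"duration": 0.12, "phone": "iy_E"}
      ]
    }
  ]
}
'''

data = json.loads(sample)

# "words" is a list; each entry is a dictionary with its own nested "phones" list
for word in data['words']:
    phones = [(p['phone'], p['duration']) for p in word['phones']]
    total = sum(duration for _, duration in phones)
    print(word['alignedWord'], phones)
    # compare the summed phone durations with the word's own start and end
    print(round(total, 2), round(word['end'] - word['start'], 2))
```

In this file, at least, the phone durations add up to the length of the word: both numbers come out to 0.14 seconds.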
The phones are also tagged with their position in the word: “_B” marks the first one, “_E” the last, and “_I” any in the middle. I assume these stand for “beginning,” “ending,” and “inside” or “interior,” although this doesn’t seem to be documented and I had to figure it out by going through many JSON files and looking for patterns. So the word we see broken down above, “any,” contains the phones “eh,” “n,” and “iy.”
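Separating a phone from its position tag takes only a few lines. This is a minimal sketch: the helper name `split_phone` and the `POSITIONS` table are made up for illustration, and the meanings in the table are the guesses described above rather than anything taken from Gentle’s documentation.

```python
# guessed meanings of the position tags, not taken from Gentle's documentation
POSITIONS = {'B': 'beginning', 'I': 'inside', 'E': 'ending'}


def split_phone(tag):
    """Split a tag such as 'eh_B' into ('eh', 'B')."""
    # a tag with no underscore is returned whole, with an empty position
    if '_' not in tag:
        return tag, ''
    # split on the last underscore only
    phone, _, position = tag.rpartition('_')
    return phone, position


for tag in ['eh_B', 'n_I', 'iy_E']:
    phone, position = split_phone(tag)
    print(phone, POSITIONS.get(position, 'unknown'))
```

Any tag outside the three listed here would print as “unknown,” which is one way to spot patterns that haven’t turned up yet.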
How many of these phones are there, I wondered? Who is keeping track of them?