Some last thoughts.

I’m struck by how different this project might have turned out if I’d made some divergent choices along the way. I briefly played with other forced aligners besides Gentle that return different kinds of information. Darla, for example, gave me results with the formants for each vowel. Some forced aligners output their data in TextGrid files, which are used primarily with Praat, a software tool for analyzing phonetics. I’m sure those files can be accessed by Python just like CSV or JSON files, but the different experience of working with them might have opened up other possibilities and obstacles. Before I began this independent study I had a vague inkling of where I wanted to go, and a sense that it should be possible, but I didn’t know enough about this area to make informed decisions about the best path. Johanna Devaney’s guidance was crucial throughout the semester.

There are other Max/MSP libraries besides MuBu that deal with organizing a large corpus of sound, such as FluCoMa. I plan to look into those a bit more and may incorporate multiple approaches into my Max patch.

In a few weeks I’ll meet up with my collaborator Aaron Snyder to put all of this work to compositional use. That will really be the test.

The Max patch

As mentioned in a previous post, I decided to use the MuBu set of Max/MSP objects to organize and access my audio files. A MuBu is a “multi-buffer” that can hold lots of associated data. It’s typically used to store audio files, and analysis data derived from each file. This can be accessed by various related objects to drive playback of the files or their constituent parts — in fact it seems to be largely used for granular and concatenative synthesis. Somewhat perversely, I’m using it to play back entire sound files — my “grains” are the size of words.

MuBu has built-in tools that can analyze audio files in the buffers for the same features I was looking at in Python, like fundamental frequency and MFCCs. The most immediately interesting playback feature is a scatterplot, where the user chooses two features to plot the analyzed sections of sound files as dots on an X-Y surface. In the primary mode, moving the mouse over a dot triggers playback of the associated file section.

Each colored dot represents a part of a sound file (or in my usage an entire sound file).

It took some trial and error to get this working with my somewhat unorthodox approach, at which point I spent a lot of time playing with the parameters given by these typical analysis functions, with disappointing results. I would plot, say, center frequency against spectral centroid, move the mouse around, and not feel like there was any meaningful connection among the selected sounds. I think there is still room to explore — there are many analysis options that take some work to put in place, and as I come to understand the MuBu tools better I may find more effective methods. But when I turned to the phone data I had gathered, things really opened up.

Plotting the first phone against the last phone, there is a clear relationship, both visible and audible, between adjacent words.

Scatterplot with first phone on X axis, last phone on Y axis.
Phone keys are shown above and to the left.

There is still a lot to try:

  • I want to look further into feature analysis and see if I can integrate that in a meaningful way with the phone data.
  • I want to see if I can bring back the inner phone data, although I’ve tried a couple things that haven’t worked.
  • I want to look at different playback methods. This one is fun and useful to quickly find interesting groupings of words, but it’s limited. I’m interested in collecting a list of words that I like — with the order perhaps created by mousing over the scatterplot, or maybe with another algorithm — and having them played back rhythmically.
  • Another issue with this playback mechanism is that words with identical phone values are plotted directly on top of each other and only one is accessible. The actual playback is driven by a k-nearest neighbors algorithm (see the sketch after this list). At this point I only understand this in the broadest terms, but I imagine if I play with its parameters, or use a different algorithm, I can get access to more of the words in my corpus.
  • One result I’d really like to achieve is to have live incoming audio drive the playback. MuBu has an example patch of “granular mosaicing” that gives really interesting results with small slices of sound. I’d like to adapt it to work on my larger level, to listen for a while and then give back some words that it thinks are related to the input.
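On the k-nearest neighbors point above: this is not how MuBu implements it internally, but a toy Python sketch (using scikit-learn purely for illustration) shows why words with identical phone values collide:

# Toy illustration of nearest-neighbor lookup over (first phone, last phone) points.
# This is not the MuBu implementation, just a sketch of the duplicate problem.
import numpy as np
from sklearn.neighbors import NearestNeighbors

words = ["any", "any", "anything", "anything"]
points = np.array([[11, 18], [11, 18], [11, 24], [11, 24]])  # (first, last) phone numbers

knn = NearestNeighbors(n_neighbors=1).fit(points)
_, idx = knn.kneighbors([[11.2, 18.5]])  # a hypothetical mouse position
print(words[idx[0][0]])  # returns one of the two identical "any" points; the other is effectively unreachable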

The current state of my Max patch is available here. It includes the MuBu externals that are used, as well as a small selection of audio files and a pre-made phone file for loading into Max for demonstration purposes. I’ll close with a link to a short video demonstrating the patch with a fuller corpus. Apologies, I can’t get it to embed at the moment.

More phones

Well, since phones are tied to possible sounds regardless of language and meaning, it would appear they are the same worldwide. One organization that has kept track of them is the Speech Group at Carnegie Mellon University. Their CMU Pronouncing Dictionary is a dictionary of mostly English words with pronunciations given in phones. Their GitHub repository has a handy list of those 39 phones, which correspond to those used by Gentle; it makes a good starting point for building up a database of phone information for my words.

At first I worked on building up a database for each phone. The phone “eh,” for example, would have its own CSV file. Each line would list a word in which “eh” was present, its corresponding sound file, and the position of “eh” in the word — B, I, or E, as discussed in the last post. As I began to narrow down my plans in Max/MSP, however, I realized I needed a different approach. I settled on using a suite of objects called MuBu, which can store a variety of data quite handily, but the objects want each audio file to be paired with data about that file, rather than databases that refer to many audio files.

The MuBu objects store data in a matrix that expects a fixed number of columns. Words will always have a first and last phone, but as the number of inner phones is variable, I wasn’t sure how to deal with them. For now, then, I have decided to discard those inner phones.

MuBu with phone data for the first and last phones of the word “any”, interpreted as numbers.

With that decision made, I needed to write code that would go through a directory of JSON files with phone data about my words, assign numbers to the first and last phone of each word, and write those into a text file in a format that Max/MSP can understand.

First I gave number values to each phone. The CMU list and Gentle both have them in alphabetical order, which is what I’ve adopted for now — so “aa” has a value of 1 and “zh” is 39. Eventually I may order them by sound and/or sound type (vowel, fricative, etc.).

The code then opens this list as a CSV file and reads it into a lookup dictionary. It reads through a directory of JSON files and looks for a few things. Some words weren’t aligned properly by Gentle, and these are given phone values of 0 and 0. In the future I’d like to manually fix such words, but this at least gets them into the Max patch.
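As a minimal sketch, building that lookup dictionary might look something like this (the file name “phones.csv” is a placeholder for wherever the 39-phone list lives):

import csv

# read the alphabetized phone list ("aa" through "zh"), one phone per line
with open("phones.csv", newline="") as f:
    phone_list = [row[0] for row in csv.reader(f)]

# number the phones starting from 1, so "aa" -> "1" and "zh" -> "39"
lookup_dict = {phone: str(i) for i, phone in enumerate(phone_list, start=1)}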

For words that were handled well, the code ignores the inner phones, strips “_B” and “_E” off the first and last phone, looks them up in the dictionary, and replaces them with a number. So the word “any” gets the values 11 and 18, for the phones “eh” and “iy.”

elif "_I" not in j:
                    
    # remove the "_B" or "_E" tag before comparing to dictionary
    sep1 = '_'
    j = j.split(sep1, 1)[0]
        
    # Look phoneme up in dictionary, replace with number value
    k = " ".join(lookup_dict.get(ele, ele) for ele in j.split())
                    
    # Add number to phoneme list
    phone_to_num.phones.append(k)

Another Python function writes this data into a text file meant for Max. A few lines might look like this:

1, 0 11 18;
2, 0 11 18;
3, 0 11 24;
4, 0 11 24;

The first value is the index, which corresponds to the order of both the audio files and the JSON files. The zero goes into the “time” column of the MuBu, which tells it that this refers to the beginning of the file, or as I’m treating it, the entire file. The next two numbers are the first and last phone. This snippet of file refers to the words “any” and “anything,” each of which occurs twice in my corpus, as they were found on both my list and Aaron’s.
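The writing step itself can be sketched roughly like this, assuming the (first, last) phone numbers for each word have already been collected into a list of pairs (the names here are mine):

def write_max_text(phone_pairs, out_path):
    """Write one line per word in the 'index, time values;' format shown above."""
    with open(out_path, "w") as f:
        for i, (first, last) in enumerate(phone_pairs, start=1):
            # index, then 0 for the MuBu time column, then the first and last phone numbers
            f.write(f"{i}, 0 {first} {last};\n")

# e.g. write_max_text([(11, 18), (11, 18), (11, 24), (11, 24)], "phones_for_max.txt")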

With all of this data collected, I could finally use it. In Max/MSP. For creative purposes.

Phones

So I had my segmented words; now I just needed to run batches of analysis to get information that my Max patch could use to trigger the words. Professor Devaney showed me some of the features I might extract and Python methods for doing so. Getting F0 and MFCC content looked particularly promising. It seemed like this would be the end of my need for the Gentle aligner — but in addition to the CSV file that simply gives each word’s start and end time, Gentle also outputs a JSON file that breaks down each word by phone and gives its duration. Prof. Devaney suggested that data might also come in handy and that I should hold on to it.
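To give a sense of what that analysis looks like in code, here is a minimal sketch of per-word F0 and MFCC extraction using librosa, one common Python audio library (not necessarily the exact methods I ended up using):

import librosa

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)              # load one word's audio file
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)        # frame-by-frame F0 estimate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 MFCCs per frame
    return f0, mfcc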

At first I thought Gentle’s use of the term phone was an abbreviation for phoneme, but looking into it I found these were distinct if related concepts. A phoneme, according to the dubious research source Wikipedia, “is a speech sound in a given language that, if swapped with another phoneme, could change one word to another.” Phonemes have meaning and are language-based. Phones, on the other hand, are physical speech sounds outside of, or perhaps before, specific languages or meaning. It makes sense, then, that Gentle, which analyzes a sound file without direct human input, is analyzing and outputting information about the sounds rather than their meaning.

Now I needed to extract and store the phone data, which ended up requiring a few steps. Instead of having a single JSON file with data for each chunk of 50 words, I wanted a file for each audio file, but I didn’t want to do this manually for thousands of files. So I wrote some Python code to send a directory of audio files through Gentle.

The aligner expects a .txt file for the transcript, but I don’t have one for each word. So the code strips the file type and info I’ve added (“-i-1” and so on) from the file name, leaving just the word to be transcribed, and makes that the content of a temporary text file, which is fed to Gentle as the transcript. It does something similar for the naming conventions of the JSON file:

import os
import tempfile

# dir_audio, dir_output, and list_of_files (the names of the audio files in dir_audio)
# are defined earlier in the script
for file_name in list_of_files:

    # strip everything after the word in the file name for content of temporary file
    sep1 = '-'
    stripped1 = file_name.split(sep1, 1)[0]

    # strip ".wav" from the file name for name of .json file
    sep2 = '.'
    stripped2 = file_name.split(sep2, 1)[0]

    # make a temporary file with the word to be aligned as the content, read the file
    with tempfile.NamedTemporaryFile('w+t', suffix='.txt') as fp:
        fp.write(stripped1)
        fp.seek(0)

        # run the Gentle forced aligner, writing one JSON file per audio file
        os.system(f'python3 align.py {dir_audio}{file_name} {fp.name} -o {dir_output}{stripped2}.json')

This results in a directory of JSON files with phone information for each word. Let’s look at a representative file:

{
  "transcript": "any",
  "words": [
    {
      "alignedWord": "any",
      "case": "success",
      "end": 0.29000000000000004,
      "endOffset": 3,
      "phones": [
        {
          "duration": 0.01,
          "phone": "eh_B"
        },
        {
          "duration": 0.01,
          "phone": "n_I"
        },
        {
          "duration": 0.12,
          "phone": "iy_E"
        }
      ],
      "start": 0.15,
      "startOffset": 0,
      "word": "any"
    }
  ]
}

JSON files, I learned, are a common data storage format with more flexibility than CSV files, because data can be nested in multiple levels. Here, for example, some of the fields under “words” hold just a single value (“alignedWord”: “any”), whereas the “phones” field holds a list with one entry per phone, each of which also has a duration.

The phones are also tagged with their position in the word, where “_B” is the first one, “_E” the last, and “_I” any in the middle (for “beginning,” “ending,” and “inside” or “interior,” I assume), although this doesn’t seem to be documented and I had to figure it out by going through many JSON files and looking for patterns. So the word we see broken down above, “any,” contains the phones “eh,” “n,” and “iy.”
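Pulling the phone list back out of one of these files in Python is straightforward. A quick sketch, using the file for “any” shown above (the file name follows my naming scheme):

import json

with open("any-i-1.json") as f:
    data = json.load(f)

word = data["words"][0]
phones = [p["phone"] for p in word["phones"]]
print(word["alignedWord"], phones)  # any ['eh_B', 'n_I', 'iy_E']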

How many of these phones are there, I wondered? Who is keeping track of them?

Segmentation and my first Python script

The next thing I did was create a Python script to segment the larger audio file into 50 individual files with useful filenames. I adapted a script that uses the Python bindings for FFmpeg, a tool for converting audio. The script takes the following arguments:

  1. Audio file to be segmented
  2. File with timing information
  3. Speaker name
  4. List number

Aside from argument 1, these need some explanation. Let’s go through them.

For timing information, I used the CSV file output by Gentle in the previous step. A few lines of that look like this:

which,which,20.8,21.51
doc,doc,21.91,22.46
https,<unk>,22.94,23.98

The first item of each line is the expected word, followed by the found word, then the timestamps for the start and end of the word respectively. As you can see in the sample above, some words weren’t found by Gentle but were given accurate timing info anyway. I can only assume that the “word” “https” was not in the dictionary that Gentle uses in its alignment process. To avoid issues like this, my script looks only at the expected word and timing info.

The script tells FFmpeg to create a new audio file beginning at the word’s start time. It then subtracts the start time from the end time and passes that difference as the duration the new file should last.
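A sketch of that loop using the ffmpeg-python package (one set of Python bindings for FFmpeg; the script I adapted may differ in its details):

import csv
import ffmpeg  # the ffmpeg-python package

def segment(audio_file, timing_csv):
    with open(timing_csv, newline="") as f:
        for expected, found, start, end in csv.reader(f):
            start, end = float(start), float(end)
            out_name = f"{expected}.wav"  # the real script builds a richer name (see below)
            # start at the word's start time and keep only (end - start) seconds
            ffmpeg.input(audio_file, ss=start, t=end - start).output(out_name).run(quiet=True)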

I foresee that, as the body of word files increases in size, there could be repeated words. We may also involve additional speakers, whose lists of words may overlap, and we may want to do multiple takes where words are spoken with different affect. Hence the speaker name and list number. If the speaker name argument is ‘aaron’ or ‘ian’ (my collaborator and myself, so the two most common speakers), it appends a ‘-a’ or ‘-i’ to the filename. For other speakers it simply passes on their name, like ‘-fernanda’. The list number adds a number after another dash. So a typical file from my list might have the name ‘any-i-1.wav’.
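The naming logic boils down to something like this (the helper name is mine):

def output_name(word, speaker, list_number):
    # 'aaron' and 'ian' get short tags; other speakers keep their full name
    tag = {"aaron": "a", "ian": "i"}.get(speaker, speaker)
    return f"{word}-{tag}-{list_number}.wav"

print(output_name("any", "ian", 1))  # any-i-1.wav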

Here is what the command line code to run the script might look like:

python3 splitnames.py 1-50.wav 1-50-gentle.csv i 1

After running each set of 50 words, I listened to the files to make sure they were segmented properly and manually re-edited any that didn’t work. For an entire 1,000-word list, fewer than ten files needed fixing — saving me a lot of time over manually chopping each word.

Using Gentle forced aligner

In my first post I outlined a few of my interests in creative use of the voice and started to talk about the prep work I’m doing for an upcoming project that will use a large body of sound files, each containing a single spoken word, that can be called up and played back at will in a Max/MSP patch. Or that was the hope.

That was March 9; it is now late May. I’ve done a lot of work since then but neglected to document it. Time to fix that. I left off with a breezy explanation of how I used Gentle forced aligner to segment a long audio file of spoken text into individual words, but it wasn’t that simple. Let’s start by talking about Gentle.

It’s a forced aligner — a tool that aligns an audio file containing speech with a script or transcript of the expected content, giving precise timing information and additional data. Professor Devaney suggested using one of these tools because it should give me much more accurate results when segmenting individual words, whereas segmenting based on parameters like loudness or silence detection would likely lead to many errors. Like many forced aligners, Gentle is built upon Kaldi, a toolkit for speech recognition for use by researchers. The website has a friendly demonstration of how it works:

Gentle running its distributed alignment example, Alvin Lucier’s “I Am Sitting in a Room.” Note the breakdown of the word “different” into phones, and also the greyed out words “my speech,” which could not be aligned.

and a link to the GUI version of the program, but I wasn’t able to get the GUI to work. In the end this was a boon — being forced to install the command-line version of Gentle gave me access to many more possibilities that I would later realize were desirable.

It was also difficult. Compared to other forced aligners, Gentle claims to be “easier to install and use.” If that’s the case, the others must be impossible. I was very rusty with the command line at the start of the semester and assumed that I was simply not doing it right, but it became apparent that others were having similar problems. Much googling led to a thread on Gentle’s GitHub repository full of people complaining about installation issues, in which one user posted their own “bug-free version” and another gave step-by-step installation instructions.

If you have plans to use Gentle, save yourself some hassle and install this version.

It took some playing around with the various scripts, but I eventually figured out how to get the web version running on a local server — one runs the “serve.py” script and then opens http://localhost:8765/ in a browser. From there I had access to the same interface as on Gentle’s website. I loaded in my audio file of 1,000 words and pasted the transcript into the text box. After running the alignment, I checked the CSV file that was created and found many unaligned words with no timestamps, so I tried feeding it shorter chunks. I settled on running 50 words at a time, because at that size I was consistently getting no errors.

I’m not sure why Gentle got hung up on larger files, and why sending it the exact same information but in smaller chunks caused no problems. Perhaps it’s the artificial nature of the input. The word lists I’m using have no grammatical structure, nor is the timing of their delivery anything like normal speech — the words are read to a click track at 60 bpm, one word per beat. If part of Gentle’s machine learning approach to alignment wants to match the input to typical speech patterns, I am doing my best to muddy the waters.

After running each set of 50 words through the alignment process, I was given a CSV file with precise timing info for each word. Time to write a script to segment the words into individual sound files.

First steps

Hello. I’m working on an independent study. Please see the About page for general information.

This is a skeletal look at my progress so far, to be fleshed out soon.

What I’m working on:

  • Learning about the workings of the voice — how sound is produced and varied, what the results are. Reading The Science of the Singing Voice by Johan Sundberg.
  • Familiarizing myself with analysis tools — spectrograms, how to read them, how different settings change their effectiveness.
  • Learning the basics of Python and libraries that can process and analyze audio.
  • Identifying and segmenting recordings of speech with the aid of forced aligners.

One compositional area I’m interested in involves exploring and exaggerating vocal resonance, so learning more about formants and other frequency zones that can be accessed and altered is an ongoing interest here.

In another area, I’m working on a piece with my colleague Aaron Snyder using lists of words collected by computer password programs. In performance, we will likely read from these lists live, as well as have accompanying playback of pre-recorded speech. We want to build up a large set of recorded words that can be flexibly called up by the computer, and to look into ways of organizing the words that could bring out compositionally interesting patterns.

This has been the bulk of my work the last few weeks. The first goal was to find a way to automate (or at least semi-automate) segmentation of a recording of 1,000 words. I did this by:

  • Breaking the recording down into sets of 50 words
  • Feeding each set of 50, along with a “transcription” of the words therein, into Gentle forced aligner, which returns a CSV file with timestamps for the start and end of each word.
  • Running the same audio files and the CSV file through a Python script I adapted which outputs a file for each individual word, with a naming convention that I hope will be useful for later organization and manipulation.
  • Listening and manually correcting words that were not aligned well by Gentle.

The process takes some time, but is still much less work than cutting 1,000 words manually.

Gentle also returns the phonemes found in each word; one of my next steps will be to find a way to store that and other extracted information (perhaps fundamental frequency and MFCCs) in a list that can be called by Python for offline composing, or by Max/MSP for real-time calling of words and/or audio files.