As mentioned in a previous post, I decided to use the MuBu set of Max/MSP objects to organize and access my audio files. A MuBu is a “multi-buffer” that can hold many kinds of associated data. It’s typically used to store audio files along with analysis data derived from each file, which various related objects can then access to drive playback of the files or their constituent parts; in fact the toolkit seems to be used largely for granular and concatenative synthesis. Somewhat perversely, I’m using it to play back entire sound files: my “grains” are the size of words.
MuBu has built-in tools that can analyze audio files in the buffers for the same features I was looking at in Python, like fundamental frequency and MFCCs. The most immediately interesting playback feature is a scatterplot, where the user chooses two features to plot the analyzed sections of sound files as dots on an X-Y surface. In the primary mode, moving the mouse over a dot triggers playback of the associated file section.
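MuBu computes these features internally, so none of this code is needed in the patch, but as a rough illustration of what one of them measures, here is a minimal numpy sketch of spectral centroid (the magnitude-weighted mean frequency of a signal's spectrum). This is not MuBu's implementation, just the underlying idea:

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the signal's spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

# Sanity check: a pure 440 Hz sine should have a centroid at ~440 Hz.
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(spectral_centroid(tone, sr))  # ≈ 440.0
```

Each analyzed file section becomes one number per feature, and two such features give each dot its X-Y position on the scatterplot.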

It took some trial and error to get this working with my somewhat unorthodox approach. Once it did work, I spent a lot of time playing with the parameters of the standard analysis functions, with disappointing results. I would plot, say, center frequency against spectral centroid, move the mouse around, and not feel any meaningful connection among the selected sounds. I think there is still room to explore: there are many analysis options that take some work to set up, and as I come to understand the MuBu tools better I may find methods that work better. But when I turned to the phone data I had gathered, things really opened up.
Plotting the first phone against the last phone, there is a clear, visible and audible relationship between adjacent words.

Phone keys are shown above and to the left.
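The mapping behind that plot can be sketched in a few lines of Python. The words and phone transcriptions below are made up for illustration (my real data comes from the phone files described earlier), but the idea is the same: index the phones, then turn each word into an (x, y) point from its first and last phone.

```python
# Hypothetical word -> phone-sequence data; transcriptions invented
# for illustration only.
words = {
    "cat":  ["K", "AE", "T"],
    "kite": ["K", "AY", "T"],
    "sun":  ["S", "AH", "N"],
    "seen": ["S", "IY", "N"],
}

# Index each distinct first/last phone so it can serve as an axis value.
phones = sorted({p for seq in words.values() for p in (seq[0], seq[-1])})
axis = {p: i for i, p in enumerate(phones)}

# Each word becomes an (x, y) point: first phone vs. last phone.
points = {w: (axis[seq[0]], axis[seq[-1]]) for w, seq in words.items()}
for w, xy in points.items():
    print(w, xy)
```

Note that “cat” and “kite” land on exactly the same point, which previews the overlap problem discussed below.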
There is still a lot to try:
- I want to look further into feature analysis and see if I can integrate that in a meaningful way with the phone data.
- I want to see if I can bring back the inner phone data, although I’ve tried a couple of things that haven’t worked.
- I want to look at different playback methods. This one is fun and useful for quickly finding interesting groupings of words, but it’s limited. I’m interested in collecting a list of words that I like, with the order perhaps created by mousing over the scatterplot, or maybe with another algorithm, and having them played back rhythmically.
- Another issue with this playback mechanism is that words with identical phone values are plotted directly on top of each other and only one is accessible. The actual playback is driven by a k-nearest neighbors algorithm. At this point I only understand this in the broadest terms, but I imagine if I play with its parameters, or use a different algorithm, I can get access to more of the words in my corpus.
- One result I’d really like to achieve is to have live incoming audio drive the playback. MuBu has an example patch of “granular mosaicing” that gives really interesting results with small slices of sound. I’d like to adapt it to work at my larger scale: to listen for a while and then give back some words that it thinks are related to the input.
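The stacked-points issue in the list above comes down to how many neighbors the lookup returns. This toy Python sketch (not MuBu's actual k-nearest-neighbors implementation, and with made-up coordinates) shows why raising k makes overlapping words reachable:

```python
import math

# Hypothetical (x, y) scatter positions for some words; invented data.
corpus = {
    "cat":  (0.0, 3.0),
    "kite": (0.0, 3.0),   # identical first/last phones -> identical position
    "sun":  (2.0, 1.0),
    "seen": (2.0, 1.0),
}

def k_nearest(query, k=2):
    """Return the k words whose plotted positions are closest to query."""
    dist = lambda p: math.hypot(p[0] - query[0], p[1] - query[1])
    return sorted(corpus, key=lambda w: dist(corpus[w]))[:k]

# With k=1 only one of the words stacked at (0, 3) is ever returned;
# with k=2 both become reachable from a nearby mouse position.
print(k_nearest((0.1, 2.9), k=2))
```

With k=1 the tie between co-located points is broken arbitrarily, so one word shadows the other; returning several neighbors (and then choosing among them, perhaps randomly) is one way to expose more of the corpus.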
The current state of my Max patch is available here. It includes the MuBu externals that are used, as well as a small selection of audio files and a pre-made phone file for loading into Max for demonstration purposes. I’ll close with a link to a short video demonstrating the patch with a fuller corpus. Apologies, I can’t get it to embed at the moment.