PhD in Mathematics.
Email: mashburn [dot] jacob [at] gmail [dot] com
Objective: Given an audio sample, use neural networks to predict the YM2612 synthesizer chip registers (settings) needed to replicate the sound (or a close approximation of it). This is something many in the currently active Sega Mega Drive/Genesis homebrewing community may find useful for their projects.
Sega Enterprises released the Mega Drive in Japan in October 1988, and in North America in August 1989, rebranded as the Genesis. Its onboard audio hardware included the Yamaha YM2612, a frequency-modulation (FM) synthesizer chip capable of producing 6 sounds simultaneously.
Frequency modulation synthesis, to be brief, is fast vibrato applied to a primitive sound wave (called the carrier). The vibrato comes in the form of another sound wave (called the modulator). Usually, sine waves are used in both roles, though many professional FM synthesizers allow square, triangle, sawtooth, or other simple waveforms as well. At low modulator frequencies, the modulation simply produces audible vibrato (the pitch moves up and down), but at higher modulator frequencies one begins to hear a new kind of sound at the original carrier's pitch, with no audible vibrato. With this technique, a much wider range of sounds is possible than with, for instance, additive synthesis, which is limited to adding primitive waves together.
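A minimal two-operator sketch of this idea in Python with NumPy (the frequencies and modulation index below are illustrative, not YM2612 register values):

```python
import numpy as np

def fm_tone(f_carrier, f_modulator, mod_index, duration=1.0, sr=44100):
    """FM synthesis: a sine carrier whose phase is modulated by a second sine."""
    t = np.arange(int(duration * sr)) / sr
    modulator = np.sin(2 * np.pi * f_modulator * t)
    return np.sin(2 * np.pi * f_carrier * t + mod_index * modulator)

# Slow modulator -> audible vibrato; modulator near the carrier's frequency ->
# a new timbre heard at the carrier's pitch.
vibrato   = fm_tone(261.63, 5.0,    mod_index=2.0)   # 4th octave C with vibrato
fm_timbre = fm_tone(261.63, 261.63, mod_index=2.0)   # classic FM timbre
```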
The YM2612 chip in the Sega Genesis has 6 channels for FM, each capable of producing one type of sound at a time. Within each channel, 4 sine waves (called operators) are used. How they are combined depends on the algorithm chosen: for example, algorithm 7 simply adds the four operators together, while algorithm 4 sums two modulator-carrier pairs. In addition, one operator is special in that a feedback effect can be applied to it before the overall algorithm is executed.
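To make the difference concrete, here is a rough sketch of those two algorithms; the operator frequencies and modulation index are made-up values for illustration only.

```python
import numpy as np

sr, dur = 44100, 1.0
t = np.arange(int(dur * sr)) / sr
# Four operator sine waves at made-up frequencies (multiples of a 4th octave C).
ops = [np.sin(2 * np.pi * f * t) for f in (261.63, 523.25, 784.89, 1046.5)]

# Algorithm 7: plain additive synthesis -- the four operators are simply summed.
alg7 = sum(ops) / 4

# Algorithm 4: two modulator->carrier pairs, summed. Here ops[1] modulates one
# carrier and ops[3] modulates the other, both carriers at the same pitch.
mod_index = 2.0
pair1 = np.sin(2 * np.pi * 261.63 * t + mod_index * ops[1])
pair2 = np.sin(2 * np.pi * 261.63 * t + mod_index * ops[3])
alg4 = (pair1 + pair2) / 2
```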
Throughout the Genesis’ tenure in the early 1990s, many Western developers struggled with creating sounds using FM synthesis, and opted instead to use preset libraries provided by, for instance, the Sega GEMS (Genesis Editor for Music and Sound Effects) tool. Even understanding the mathematics involved does not help musicians who are just starting out with FM. (This is history repeating itself, as most professional users of the famous Yamaha DX7 FM synthesizer rarely ventured beyond the factory preset patches. Many songs from the most famous artists of the 1980s feature these same preset sounds.)
Given an audio sample of some instrument or sound, can we train a neural network to predict the appropriate register inputs for the YM2612 chip to replicate it?
Side note: to produce a sound, a synthesizer is given a set of numbers describing various aspects of the sound, such as the fade-in (called attack) for each sine wave involved. These numbers are fed to the synth chip's registers, i.e. places in the chip that store information temporarily, and the sine waves the chip generates are then shaped according to the register values. Collectively, these numbers are called a patch, and they can be stored in a variety of binary file formats.
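Purely for illustration, a patch can be pictured as something like the following (a hypothetical layout, not the actual Y12 register format):

```python
# Hypothetical patch layout for illustration only -- not the Y12 file format.
example_patch = {
    "algorithm": 4,     # which operator routing to use (0-7)
    "feedback": 3,      # feedback level applied to the special operator
    "op1": {"attack": 31, "decay": 12, "sustain": 8, "release": 5, "multiple": 1},
    "op2": {"attack": 25, "decay": 10, "sustain": 6, "release": 7, "multiple": 2},
    # ...operators 3 and 4 would follow the same pattern
}
```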
The training data consists of:
Just over 5,600 instrument patches, stored in the Y12 file format (target data)
Recorded audio samples of each instrument patch being played for just over 3 seconds. To try to keep things simple, every instrument plays a 4th octave C.
To start off, I computed the spectrogram of each sample: a graph that shows how the frequency content of a sound wave changes over time.
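A minimal sketch of this step with SciPy, assuming a mono WAV file (the file name and STFT parameters are placeholders):

```python
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, audio = wavfile.read("patch_0001.wav")   # hypothetical mono sample file
freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=2048, noverlap=1024)
# sxx has shape (n_frequency_bins, n_time_frames): the energy in each
# frequency band at each point in time.
```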
I also used PCA (principal component analysis) on the raw audio sample data, keeping ~92% explained variance for each model (with sharply diminishing returns beyond that point; see the Jupyter notebook for this step for details). Update Feb 27: once I switch to CNNs, I most likely will not be using this.
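For reference, the dimension-reduction step looks roughly like this with scikit-learn, assuming the raw samples are stacked into a 2-D array of shape (n_samples, n_audio_points); the file name is a placeholder:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.load("raw_audio_matrix.npy")   # hypothetical (n_samples, n_audio_points) array
pca = PCA(n_components=0.92)          # keep enough components for ~92% explained variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```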
As mentioned above, the chip is capable of producing sounds with 4 sine wave operators, used in algorithms ranging from simple additive synthesis (algorithm 7) to a single modulation chain (algorithm 0). There is significant overlap between the sound capabilities of the algorithms (for instance, many algorithm 0 sounds could be produced just as well with algorithm 4). To avoid the ambiguity this overlap creates, 8 models will be constructed, one for each algorithm, and the training data will be split according to which algorithm produced each sample.
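The split itself is straightforward; a sketch with pandas, assuming the patch targets are tabulated with an "algorithm" column (names are illustrative):

```python
import pandas as pd

patches = pd.read_csv("patch_targets.csv")   # hypothetical table of register values
subsets = {alg: df for alg, df in patches.groupby("algorithm")}
for alg, df in sorted(subsets.items()):
    print(f"algorithm {alg}: {len(df)} patches")
```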
Preprocessed the audio sample data captured from Sega Genesis audio output by computing spectrograms and by applying dimension-reduction techniques to the raw audio data (though I may not use the latter).
Trained neural network models on the spectra and YM2612 synthesizer chip registers using TensorFlow and Keras.
Wrote the main user-facing script, which by default runs a batch of WAV files through all 8 algorithm models, though individual models can be omitted.
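The kind of loop that script runs looks roughly like the sketch below; the directory layout, model file names, and spectrogram parameters are assumptions, not the actual pipeline.

```python
import glob, os
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from tensorflow import keras

# Load the eight trained models (file names are placeholders).
models = {alg: keras.models.load_model(f"models/alg{alg}.h5") for alg in range(8)}
os.makedirs("output", exist_ok=True)

for wav_path in glob.glob("input/*.wav"):
    sr, audio = wavfile.read(wav_path)
    _, _, sxx = spectrogram(audio, fs=sr, nperseg=2048, noverlap=1024)
    x = sxx[np.newaxis, ..., np.newaxis]          # add batch and channel axes for a CNN
    name = os.path.splitext(os.path.basename(wav_path))[0]
    for alg, model in models.items():
        registers = model.predict(x, verbose=0)   # predicted register values
        np.save(f"output/alg{alg}_{name}.npy", registers)
```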
I have tested 7 out of 8 CNN models so far, and will test the last one tonight (it had the largest training set, so I saved it for last). I had to implement a separate testing loop for it because my GPU only has 2 GB of RAM… It shouldn't affect the speed very much, though. Anyway, the other models' error metrics look great so far! You can see for yourself in the metrics directory.
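For anyone curious what such a memory-constrained test looks like, here is a sketch of evaluating in small chunks; the file names and chunk size are placeholders, not the actual test script.

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("models/alg7.h5")   # hypothetical largest model
X_test = np.load("alg7_X_test.npy")
y_test = np.load("alg7_y_test.npy")

chunk, errors = 64, []
for i in range(0, len(X_test), chunk):
    preds = model.predict(X_test[i:i + chunk], verbose=0)
    errors.append(np.abs(preds - y_test[i:i + chunk]))
mae = np.concatenate(errors).mean()   # mean absolute error over the whole test set
print(f"test MAE: {mae:.4f}")
```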
I’ve implemented 8 neural networks in TensorFlow and Keras, one for each FM algorithm. I’ve run preliminary tests and concluded that all 8 models are severely overfitted. To address this, I am cutting back on the number of layers and nodes for each model, as well as decreasing the number of epochs and possibly increasing the batch size. Moreover, I am switching to convolutional neural networks for each, since these tend to do better with audio data (actually, they’re meant for images, but a sound’s spectrogram can be interpreted by a CNN as a greyscale image). I have a prototype model sketched out in a Jupyter notebook and will be integrating it into the training scripts today, and running tests the rest of this week. Now that I have a proper GPU for ML purposes, I hope to finish testing by this weekend.
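The prototype is along the lines of the sketch below, which treats the spectrogram as a single-channel image; the layer sizes, input shape, and number of register outputs here are assumptions, not the final architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(128, 128, 1), n_registers=32):
    model = keras.Sequential([
        layers.Input(shape=input_shape),          # spectrogram as a greyscale image
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_registers),                # one output per register value
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_cnn()
model.summary()
```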
After a lengthy hiatus, I have returned to this project with a more reliable means of obtaining sound samples for the ~5,600 patch files I have. As I type, a recording macro is running on my desktop; I don't expect it to finish before noon tomorrow. In the meantime, I am considering several options for extending my spectrum-processing pipeline to handle changes over time:
1: I modify it to split each sample into N subintervals of equal length, then repeat the spectral analysis for each piece (a sketch of this approach appears after this list). The advantage is that it retains the current idea of looking at the amplitudes at each multiple of the sample's fundamental frequency in a pitch-agnostic manner. The disadvantage is that this approach assumes the sounds being analyzed are all harmonic, i.e. their main frequencies occur only at integer multiples of the sound's fundamental frequency (FF).
2: I modify it to extract the spectrographic data as-is. The advantage is simplicity of extraction, as well as the ability to handle inharmonic sounds. The disadvantage is that the locations of a sound's main frequencies depend heavily on its pitch (a 4th octave C does not have the same fundamental frequency for every instrument). The first approach is a form of standardizing the spectral data, which may aid significantly in neural network training; without that standardization, we may run into issues.
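Here is the sketch of the first option mentioned above, assuming mono WAV input; the fundamental frequency, number of slices, and number of harmonics are placeholders.

```python
import numpy as np
from scipy.io import wavfile

def harmonic_amplitudes(path, fundamental, n_slices=8, n_harmonics=16):
    """Split a sample into equal time slices and read off the amplitude at the
    first few integer multiples of its fundamental frequency."""
    sr, audio = wavfile.read(path)                 # assumes a mono file
    slices = np.array_split(audio.astype(float), n_slices)
    features = []
    for s in slices:
        spectrum = np.abs(np.fft.rfft(s))
        freqs = np.fft.rfftfreq(len(s), d=1 / sr)
        # nearest FFT bin to each multiple of the fundamental
        idx = [np.argmin(np.abs(freqs - k * fundamental)) for k in range(1, n_harmonics + 1)]
        features.append(spectrum[idx])
    return np.stack(features)                      # shape: (n_slices, n_harmonics)

feats = harmonic_amplitudes("patch_0001.wav", fundamental=261.63)
```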
My samples also now consist of the note being played for four beats, then released for four beats. Since the synthesizer allows the sound either to trail off in some fashion or to mute completely when the note is released, I thought it best to record that behavior alongside the usual samples of the note being played continuously. I may or may not use this, though: few composers made use of the release behavior, which would bias the data in favor of notes that mute immediately upon release.