We present a Generative Adversarial Network designed to perform audio inpainting on magnitude spectra. The architecture can be deployed in two ways, denoted CIGAN and PIGAN. CIGAN reconstructs the magnitude spectrum of an audio frame of length T seconds, given the previous and subsequent T seconds. PIGAN instead performs audio prediction: it is trained to predict an audio frame of T seconds, given the previous T seconds. This web page accompanies the GitHub repository.
Here are some samples output by CIGAN for T = 0.064 s. To reconstruct the audio from the magnitude spectra, the Griffin-Lim algorithm was used. As the examples show, the reconstructed spectra are rather good, even if somewhat blurred. The reconstructed waveforms look similar to the ground truth without matching it closely. This is partly because Griffin-Lim does not recover the exact waveform even when fed the true magnitude spectrum, and here it works from an approximation of that spectrum. To obtain these results, CIGAN was trained for 200 epochs on the small FMA dataset.
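For reference, the resynthesis step can be sketched as follows. This is a minimal example using librosa's Griffin-Lim implementation; the hop length and iteration count are illustrative assumptions, not necessarily the values used in the repository.

```python
# Minimal sketch: recover a waveform from a (possibly inpainted) magnitude
# spectrogram with the Griffin-Lim algorithm. STFT parameters are assumptions.
import librosa

HOP_LENGTH = 128   # assumed hop size used when the magnitude was computed

def magnitude_to_audio(magnitude, n_iter=64):
    """Estimate a time-domain signal whose STFT magnitude matches `magnitude`.

    Griffin-Lim iteratively estimates the missing phase, so the output only
    approximates the original waveform even for a ground-truth magnitude.
    """
    return librosa.griffinlim(magnitude, n_iter=n_iter, hop_length=HOP_LENGTH)

# `fake_magnitude` would be the spectrum produced by CIGAN:
# y_hat = magnitude_to_audio(fake_magnitude)
```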
*Spectrum and waveform comparisons (original vs. reconstruction), with the corresponding original and reconstructed audio samples.*
Here are other samples for T = 0.128 s. One can notice that the reconstruction quality is lower, especially in the first example, which is a particularly poor one.
*Spectrum and waveform comparisons (original vs. reconstruction), with the corresponding original and reconstructed audio samples.*
PIGAN is an adaptation of the network to perform audio prediction: it predicts the magnitude spectrum of an audio frame of length T seconds given the previous T seconds. In the examples below, one can see that PIGAN performs less well than CIGAN, because the prediction task is much more challenging. The results are still interesting, however. These examples are obtained with T = 0.064 s.
*Spectrum and waveform comparisons (original vs. prediction), with the corresponding original and reconstructed audio samples.*
To perceptually evaluate the model, we use it to fill in a single gap of T = 0.064 s, located at 0.5 s in a two-second audio sample. We then run a MATLAB function to compute the Objective Difference Grade (ODG) between the original audio and the audio with the filled gap. The examples below are obtained with CIGAN and T = 0.064 s.
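As an illustration, the gap-filling step might look like the hedged sketch below; `cigan_inpaint` is a hypothetical wrapper around the trained generator (context magnitudes in, reconstructed magnitude out), and the sample rate and STFT parameters are assumptions.

```python
# Hedged sketch of the gap-filling evaluation: a 64 ms gap at t = 0.5 s in a
# 2 s clip is replaced by the frame CIGAN reconstructs from its context.
import numpy as np
import librosa

SR = 16000                          # assumed sample rate
GAP_LEN = int(0.064 * SR)           # T = 0.064 s
GAP_START = int(0.5 * SR)           # gap located at 0.5 s

def fill_gap(y, cigan_inpaint, n_fft=512, hop=128):
    prev_ctx = y[GAP_START - GAP_LEN:GAP_START]                  # T s before
    next_ctx = y[GAP_START + GAP_LEN:GAP_START + 2 * GAP_LEN]    # T s after
    mag_prev = np.abs(librosa.stft(prev_ctx, n_fft=n_fft, hop_length=hop))
    mag_next = np.abs(librosa.stft(next_ctx, n_fft=n_fft, hop_length=hop))
    mag_fill = cigan_inpaint(mag_prev, mag_next)   # hypothetical generator call
    fill = librosa.griffinlim(mag_fill, hop_length=hop, length=GAP_LEN)
    out = y.copy()
    out[GAP_START:GAP_START + GAP_LEN] = fill
    return out
```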
In the last example of this section, one can hear that CIGAN does its job: it is very difficult to hear a difference between the original audio and the reconstructed one, while the difference between the original audio and the one with the hole is clearly audible. In the second example, it is also difficult to hear a difference between the original audio and the one with the hole, probably because the gap falls in a rather quiet part of the song. Still, one can notice that CIGAN does not replace the silence with annoying noise.
*Original audio, reconstructed audio, and audio with the hole.*
The considered frame length, 0.064 s, is very short. Previously, we filled in a single 0.064 s gap in a 2 s audio clip. In this experiment, we reconstruct the whole 2 s clip frame by frame: each 64 ms frame is reconstructed by CIGAN, using as input the previous and subsequent 64 ms taken from the original audio, and all the reconstructed frames are then concatenated. As the examples below show, the resulting audio is easily recognizable, even if noisy. The noise might come from discontinuities at the boundaries between consecutive frames.
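A hedged sketch of this frame-by-frame reconstruction loop is given below; `reconstruct_frame(prev_ctx, next_ctx)` is a hypothetical helper that runs the STFT, the CIGAN generator and Griffin-Lim and returns 64 ms of audio, and the sample rate is an assumption.

```python
# Hedged sketch: rebuild a whole clip by reconstructing each 64 ms frame from
# the original previous and next frames, then concatenating the results.
SR = 16000
FRAME = int(0.064 * SR)

def reconstruct_clip(y, reconstruct_frame):
    out = y.copy()
    # The first and last frames are kept as-is: they lack context on one side.
    for start in range(FRAME, len(y) - 2 * FRAME + 1, FRAME):
        prev_ctx = y[start - FRAME:start]
        next_ctx = y[start + FRAME:start + 2 * FRAME]
        out[start:start + FRAME] = reconstruct_frame(prev_ctx, next_ctx)
    return out

# Discontinuities at the frame boundaries are a likely source of the
# audible noise mentioned above.
```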
*Original and reconstructed audio samples.*
In this experiment, we do the same as previously, but with PIGAN: each 64 ms frame is predicted by PIGAN from the previous frame, taken from the original audio, and all the predicted frames are then concatenated. As the examples below show, the resulting audio is again easily recognizable, even if noisy.
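The PIGAN variant of the loop only differs in its context; again a hedged sketch, with `predict_frame(prev)` a hypothetical wrapper around the PIGAN generator.

```python
# Hedged sketch: each 64 ms frame is predicted by PIGAN from the original
# previous frame only, then the predictions are concatenated.
SR = 16000
FRAME = int(0.064 * SR)

def predict_clip(y, predict_frame):
    out = y.copy()
    for start in range(FRAME, len(y) - FRAME + 1, FRAME):
        out[start:start + FRAME] = predict_frame(y[start - FRAME:start])
    return out
```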
*Original and reconstructed audio samples.*
To test the generalization ability of our model, we used it to inpaint the MAESTRO dataset while having trained it only on the small version of the FMA dataset. The FMA dataset is composed of 3,000 tracks of 30 s covering 8 different music styles, while the MAESTRO dataset consists of classical piano recordings. The examples below are obtained with CIGAN and T = 0.064 s. When comparing these examples with the ones shown previously, one can see that CIGAN behaves similarly in both cases, so the generalization ability of the model seems good.
*Spectrum and waveform comparisons (original vs. reconstruction) on MAESTRO, with the corresponding original and reconstructed audio samples.*
"Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset." , Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. In International Conference on Learning Representations, 2019.
"FMA: A Dataset for Music Analysis," Michael Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. In 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.