Audio: Artificial Neural Networks

For a while now, it has been my secret goal to learn the dark arts of designing & training Artificial Neural Networks (ANNs).


This is mostly directed at producing and editing audio in a way that sounds more natural, as ANNs can deal with much more complex structure in data than traditional mathematical methods.

Think object identification and filtering – something that iZotope has potentially implemented in their most recent RX6 audio editor for removing microphone scuffles.

Or consider the problems of text to speech, speech recognition, or user identification: all are handled by ANNs these days if you’re looking for the best performance on highly parameterised, difficult tasks.


I now introduce an attempt at what I would deem an intermediate/advanced concept in ANNs: the Generative Adversarial Network (GAN). This architecture is actually made up of two networks (the Discriminator and the Generator) that compete against each other. The generator produces samples, and the discriminator tries to determine whether each incoming sample is real or fake. The generator then uses the discriminator’s verdicts as feedback, improving its output until it matches the distribution of the real data. What is interesting about this is that the generator does not just imitate the real dataset: it can ‘imagine’ new samples that are not in the dataset but could pass as real (depending on the quality of the discriminator).
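To make the two-player setup a bit more concrete, here is a minimal sketch of the textbook GAN losses in plain Python. It assumes the discriminator outputs a probability between 0 and 1 that a sample is real. The function names are mine, purely for illustration, and this is the standard formulation rather than the exact code behind my experiment.

```python
import math

def discriminator_loss(real_probs, fake_probs):
    """Binary cross-entropy for the discriminator.

    real_probs: discriminator outputs on real samples (should approach 1).
    fake_probs: discriminator outputs on generated samples (should approach 0).
    """
    real_term = sum(-math.log(p) for p in real_probs) / len(real_probs)
    fake_term = sum(-math.log(1.0 - p) for p in fake_probs) / len(fake_probs)
    return real_term + fake_term

def generator_loss(fake_probs):
    """Non-saturating generator loss: low when the discriminator is fooled,
    i.e. when it assigns a high 'real' probability to generated samples."""
    return sum(-math.log(p) for p in fake_probs) / len(fake_probs)

# A confident, correct discriminator has a low loss, while the generator's
# loss is high because its fakes are being spotted.
d_loss = discriminator_loss(real_probs=[0.9], fake_probs=[0.1])
g_loss = generator_loss(fake_probs=[0.1])
```

Training alternates between the two: one step that lowers the discriminator loss, then one step that lowers the generator loss, so each network keeps pushing against the other.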


I use this concept to try to design a lightweight generator that makes STFT windows of audio, using a dataset of speech as the target distribution.
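For anyone unfamiliar with STFT windows, the sketch below shows roughly what one training example looks like: a short, Hann-windowed slice of audio turned into a magnitude spectrum. The frame length, hop size and function names are assumptions chosen for illustration, not the settings I used, and a real pipeline would use an FFT library rather than this slow direct DFT.

```python
import cmath
import math

def hann(n_points):
    """Periodic Hann window of length n_points."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points)
            for n in range(n_points)]

def stft_magnitudes(signal, frame_len=64, hop=32):
    """Split `signal` into overlapping windowed frames and return the
    magnitude spectrum (bins 0..frame_len/2) of each frame."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Direct DFT of one bin: correlate the frame with a complex sinusoid.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Hypothetical test tone: a sine that completes 8 cycles per 64-sample frame,
# so its energy should concentrate around bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
frames = stft_magnitudes(tone)
peak_bin = frames[0].index(max(frames[0]))
```

Each entry of `frames` is one window of the kind the generator is asked to produce, and a speech dataset would supply many thousands of them.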

Specifically, the network demonstrated below is a type of GAN called a WGAN (Wasserstein GAN).
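The main practical difference in a WGAN is that the discriminator becomes a ‘critic’ that outputs an unbounded score instead of a probability, and the losses are simple differences of mean scores. Below is a rough sketch of those losses plus the weight clipping from the original WGAN formulation. The names are illustrative, and I am assuming the clipping variant here; other variants (such as a gradient penalty) constrain the critic differently.

```python
def mean(values):
    return sum(values) / len(values)

def critic_loss(real_scores, fake_scores):
    """The critic wants real scores high and fake scores low, so it
    minimises mean(fake) - mean(real)."""
    return mean(fake_scores) - mean(real_scores)

def wgan_generator_loss(fake_scores):
    """The generator wants the critic to score its samples highly."""
    return -mean(fake_scores)

def clip_weights(weights, c=0.01):
    """Clamp every critic weight to [-c, c] after each update.
    This is the crude constraint used in the original WGAN formulation."""
    return [max(-c, min(c, w)) for w in weights]

# Toy numbers: the critic scores real samples higher than fakes,
# so its loss is negative.
c_loss = critic_loss(real_scores=[1.0, 3.0], fake_scores=[0.0, 1.0])
clipped = clip_weights([-0.5, 0.005, 0.2])
```

The gap between the mean real and fake scores acts as a rough estimate of how far apart the two distributions are, which is part of why WGANs are often reported to train more stably than the standard GAN.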

As it trains, it goes through phases of making a decent attempt at replicating the real dataset. However, it suffers from a well-known problem of GANs: low variance in the output, often called mode collapse. (You can see how most of the generated distributions look like they belong to a set of 1-4 different distributions.) There has been much research recently into making GANs more reliable and representative of the real data. I plan to implement some of this research at some point.


This initial test shows promise, but can be considered unsuccessful – albeit, not a failure. It took me nearly a month to learn all the tricks and tools to make a stable GAN, and I’ll take those tricks on to help my study in other areas.
