Humans have been making music for as long as we can remember — but the tools and methods we use to do so have evolved significantly, from simple wooden drums, to wind and string instruments, to electronic synthesizers. And now, with projects like Google’s Magenta and Sony’s FlowMachines, we’re beginning to see the emergence of music that’s not just played by computers, but actually composed by artificial intelligence.
This post is the first in a two-part series addressing the intersection of music composition and AI. Today, I’ll cover the technical side — how AI music composition works. In a second post, I’ll take a look at some of the artistic, ethical, and legal questions that AI compositions raise.
AI music composition using neural networks
Training the network: The MIDI music format
Most AI for music production relies on neural networks fed with large amounts of training data. The training data, of course, consists of lots of music — but it also needs to be music that’s transcribed in a way that a computer can understand.
Most AI models are trained on MIDI, a music format that's well suited to this purpose because it's essentially a stream of numeric event codes arranged in time. MIDI is a technical standard, comprising a communications protocol and digital interface that let computers exchange data with a wide variety of audio devices. In addition to clock data, MIDI carries information about the notes or pitches played, their velocity and duration, tempo, volume, and so on. Unlike an MP3 track, MIDI does not encode a waveform of the actual audio signal, which is exactly what makes it easier for a computer to process. We can further simplify MIDI data down to a single dimension by mapping the properties of each MIDI event onto an artificial alphabet, in which every musical element corresponds to a unique character or hash code.
Recurrent neural networks and the network generator step
Once we have our library of MIDI files, it's time to train up our models, plug in an input or some initial set of parameters, and generate some chart-toppers of our own. Recurrent neural networks, or RNNs, are generally a good starting point for processing this sort of well-structured time-series data.
The goal of the RNN, simply put, is to generate data that sounds as close to real data as possible without actually being real data. An RNN processes a sequence one step at a time, carrying an internal state forward from each step to the next, so that every prediction can take account of what came before. By taking the output of one forward pass and feeding it in as the input to the next, the network can generate completely new sequences of data.
Let’s consider the simpler case of a text generator. To generate new text content with an RNN, we’d feed a model many different strings of text and then ask it to predict the next character in a string based on the previous characters — e.g., there is a very high probability that the string ‘appl’ is followed by an ‘e’ to spell out the full word ‘apple’. RNNs for music composition follow the same principle, but the data in question is an artificial alphabet with many, many more than 26 characters and much more complicated musical “words” to spell.
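To make the prediction task concrete, here is a deliberately simplified sketch. It uses raw frequency counts over a fixed-length context rather than a trained RNN, but the task is the same one the network learns: given the previous characters, predict the next one.

```python
from collections import Counter, defaultdict

# Illustrative next-character prediction via frequency counts. A real RNN
# learns this mapping with trainable weights and a hidden state; here we
# just tally which character follows each fixed-length context.

def train_counts(corpus, context_len=4):
    """Count, for each context of `context_len` characters, what follows it."""
    counts = defaultdict(Counter)
    for text in corpus:
        for i in range(context_len, len(text)):
            counts[text[i - context_len:i]][text[i]] += 1
    return counts

def predict_next(counts, context):
    """Return the most frequently observed character after `context`."""
    return counts[context].most_common(1)[0][0]

corpus = ["apples are sweet", "apple pie", "an apple a day"]
counts = train_counts(corpus)
print(predict_next(counts, "appl"))  # → 'e'
```

For music, the corpus would be strings over the artificial alphabet described above, and the model would predict the next musical "character" instead of the next letter.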
Specialized RNN variants are often used to capture particular aspects of the output. Long Short-Term Memory (LSTM) networks, for example, are better at retaining long-term musical structure and relationships across a sequence. For a basic composition, though, a plain RNN is enough.
Generative adversarial networks and the discriminator step
Once we have some musical data output, we hand it over to a second network — typically a convolutional neural network, or CNN.
The CNN’s job isn’t to create new data, but rather to predict the label or category to which provided data belongs. Specifically, it’s trained to differentiate between “authentic” human-composed music and “fake” AI-composed music. It takes in both human-composed and AI-composed sequences of musical data and returns a probability for each sequence, with 100% representing a prediction of “authentic”, and 0% representing a prediction of “fake”.
The CNN and the RNN together are referred to as a Generative Adversarial Network, and they play an ongoing game of AI cat-and-mouse. The RNN’s goal as the “generator” network is to output data that the CNN classifies as “authentic”, while the CNN’s goal as the “discriminator” is to correctly classify the RNN’s data as “fake”. We will get the best output — the “realest” sounding music — when both networks are performing at their best, even if we ultimately want the generator to “win”.
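The cat-and-mouse dynamic can be caricatured in a few lines of Python. This toy replaces both networks with single numbers: a generator output `g` and a discriminator decision threshold `t`, with hand-written update rules. A real GAN updates millions of weights by gradient descent, but the alternating structure of training is the same.

```python
# Toy, deterministic sketch of the adversarial training loop. These are
# stand-ins, not real networks: `g` is the generator's output quality and
# `t` is the discriminator's threshold for calling a sample "authentic".

REAL = 1.0       # stand-in for authentic, human-composed data
g, t = 0.0, 0.5  # generator output and discriminator threshold
lr = 0.2         # how aggressively each side adapts per round

for step in range(100):
    fake = g
    # Discriminator step: move the threshold toward the midpoint between
    # the current fake output and the real data, to separate the two.
    t += lr * ((fake + REAL) / 2 - t)
    # Generator step: the further the fake falls below the threshold
    # (i.e. the more confidently it is caught), the bigger the correction.
    g += lr * max(0.0, t - fake)

print(round(g, 2), round(t, 2))  # both values approach 1.0 as the game converges
```

Each round, the discriminator gets harder to fool and the generator gets better at fooling it, which is the sense in which the best output comes from both networks performing at their best.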
Training the generator network is like training an impersonator. We don’t just need the person who can do a reasonable initial impersonation of, say, Beyoncé — we also need a critic, who can sit in the room with them and give them consistent feedback, so that they gradually become more and more successful at producing music that sounds like something Beyoncé would sing.
Algorithmic AI music composition
Neural networks are the most common approach to AI composition — but some systems don’t use neural networks at all. Instead, they employ different forms of AI and develop their musical “intelligence” based on the rules of music theory. This method is referred to as algorithmic composition.
Algorithmic composition is ideal for composing unique, custom music in seconds. The results tend to be more stylistically agnostic, and thus function better for background music — think hotel lobby music, podcast themes, etc. But it’s also been used to produce pieces like this one, so saying that it’s only good for background music would clearly be a disservice to the technique.
Algorithmic composition isn’t as technically interesting as neural network composition, but it does raise some fascinating music theory questions. For example, Western music notation and music theory tell us that the most important aspect of music is its multidimensionality, and that its dimensions can be measured in fixed, discrete units: pitch, duration, loudness, etc. But methods of measuring dimensions vary across disciplines and cultures — and our actual perceptions of music are much subtler than any discrete measurement. For example, when a pianist builds to an emotional climax in a piece, do we all agree on exactly how much they slow their tempo or build their loudness? In an orchestral piece, do we always know exactly which instruments are playing? At what point along all of the various dimensional axes do we switch from hearing a piece as “happy” to hearing it as “sad”? In order to create rules for an algorithmic composer, its designers will need to make decisions about which of these things are important to measure, and also about how to measure them.
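As a concrete (and entirely hypothetical) illustration of those design decisions, here is a sketch of a rule-based composer. Its three hand-picked rules — stay in the C major scale, move mostly stepwise, begin and end on the tonic — are illustrative choices, not a standard algorithm; a production system would encode far richer music theory.

```python
import random

# Rule-based (algorithmic) composition sketch: no training data, no
# neural network. Hard-coded music-theory rules choose every note.

C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI pitches, C4 to C5

def compose(length=8, seed=0):
    rng = random.Random(seed)        # seeded, so output is reproducible
    melody = [C_MAJOR[0]]            # rule 1: begin on the tonic
    for _ in range(length - 2):
        i = C_MAJOR.index(melody[-1])
        # rule 2: move at most two scale steps from the previous note
        choices = [j for j in range(len(C_MAJOR)) if abs(j - i) <= 2]
        melody.append(C_MAJOR[rng.choice(choices)])
    melody.append(C_MAJOR[0])        # rule 3: end on the tonic
    return melody

print(compose())  # a playable 8-note melody in C major
```

Note that even this tiny example forced measurement decisions: pitch is quantized to a fixed scale, melodic motion is measured in scale steps rather than semitones, and duration and loudness are ignored entirely.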
Regardless of the method used, technical hurdles are only one piece of the story. AI music composers also have to grapple with thorny artistic, ethical, and legal questions, which we’ll dig into in the second post in this series. Stay tuned!