Music

Anna Yanchenko

We submit classical piano pieces generated by training hidden Markov models (HMMs) and time-varying autoregression (TVAR) models on original piano pieces from the Romantic era. We find that the models we explored are fairly successful at generating new pieces with largely consonant harmonies, especially when trained on original pieces with simple harmonic structure. However, we conclude that the major limitation of using these models to generate music that sounds as though it were composed by a human is the lack of melodic progression in the composed pieces.

The most successful models we considered were first-order HMMs, layered HMMs, and TVAR models. We also develop quantitative metrics to evaluate the generated pieces in terms of originality, musicality, and temporal structure.
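
As a rough illustration of the first-order HMM idea (a minimal sketch, not the submission's actual code; it assumes hmmlearn's CategoricalHMM and pretty_midi, and the file name and number of hidden states are placeholders), one can fit a discrete-emission HMM to the pitch sequence of a MIDI file and then sample a new sequence from it:

import numpy as np
import pretty_midi
from hmmlearn import hmm

# Extract the observed pitch sequence from a training piece (hypothetical file name).
midi = pretty_midi.PrettyMIDI("ode_to_joy.mid")
pitches = np.array([[note.pitch] for inst in midi.instruments for note in inst.notes])

# Fit a first-order HMM with discrete (categorical) emissions over MIDI pitch numbers.
model = hmm.CategoricalHMM(n_components=8, n_iter=100)  # 8 hidden states, illustrative
model.fit(pitches)

# Sample a new 200-note pitch sequence from the trained model.
new_pitches, _ = model.sample(200)
print(new_pitches.ravel()[:16])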

Beethoven’s Ode to Joy – First Order HMM
Beethoven’s Ode to Joy – Layered HMM

Music courtesy of Anna Yanchenko. More here.

Zack Zukowski & CJ Carr

This audio was generated using SampleRNN trained on various vocal performance stems from the mix of a single song. Eight of the resulting one-minute sequences were manually organized by similarity, concatenated, and vertically stacked (with minimal cropping) to create a four-minute collage. This was a study of long-term phrase generation and of modeling a mix of dry and reverberated space.

The generator used LSTM gated units in a two-tiered network. The final audio output was normalized and arranged manually in a DAW with no extra post-processing or cross-fading. There are at most two layers of audio at any given point in the composition.
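
The two-tier idea can be sketched as follows (a minimal illustration, not the artists' model: a frame-level LSTM summarizes blocks of 8-bit quantized samples and conditions a sample-level predictor; the frame size, hidden size, and toy input are placeholders, and a real implementation shifts the conditioning so each prediction only sees past audio):

import torch
import torch.nn as nn

class TwoTierSketch(nn.Module):
    def __init__(self, frame_size=16, hidden=512, levels=256):
        super().__init__()
        self.frame_size = frame_size
        # Tier 1: frame-level LSTM over blocks of samples.
        self.frame_rnn = nn.LSTM(frame_size, hidden, batch_first=True)
        # Tier 2: per-sample predictor over the 256 quantization levels.
        self.embed = nn.Embedding(levels, hidden)
        self.sample_mlp = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(), nn.Linear(hidden, levels))

    def forward(self, samples):
        # samples: (batch, time) integer codes in [0, 256)
        b, t = samples.shape
        frames = (samples.float() / 255.0).view(b, t // self.frame_size, self.frame_size)
        frame_ctx, _ = self.frame_rnn(frames)                      # one vector per frame
        ctx = frame_ctx.repeat_interleave(self.frame_size, dim=1)  # broadcast to samples
        per_sample = self.embed(samples)
        return self.sample_mlp(torch.cat([ctx, per_sample], dim=-1))  # logits per position

model = TwoTierSketch()
x = torch.randint(0, 256, (2, 64))   # two dummy 64-sample clips
print(model(x).shape)                # torch.Size([2, 64, 256])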

dadabots.com/nips2017/singingRNN-1.mp3 

Hernán Ordiales

The challenge of recycling past vibrations into new emotions. Unknown sounds of unknown people, who recorded them for unknown reasons, are taken, divided, mixed and merged. Their physical properties are the excuse to express new feelings.

Sounds are retrieved in real time from a massive, previously processed sound database in which sounds are clustered using machine learning algorithms. The required timbre is defined using Music Information Retrieval descriptors such as spectral centroid, inharmonicity, or high-frequency content. The retrieved sounds are then processed live using real-time effects and granular synthesis.
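
A minimal sketch of the retrieval step (assuming librosa and scikit-learn; the file names, the two-descriptor feature space, and the nearest-neighbour lookup are illustrative stand-ins for the actual clustering and descriptor set):

import numpy as np
import librosa
from sklearn.neighbors import NearestNeighbors

def describe(path):
    # Compute a tiny timbre descriptor vector for one sound.
    y, sr = librosa.load(path, mono=True)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    flatness = librosa.feature.spectral_flatness(y=y).mean()
    return [centroid, flatness]

database = ["crowd.wav", "machine.wav", "rain.wav"]   # hypothetical corpus
features = np.array([describe(p) for p in database])

index = NearestNeighbors(n_neighbors=1).fit(features)
target = np.array([[2500.0, 0.1]])                    # desired timbre (centroid in Hz, flatness)
_, nearest = index.kneighbors(target)
print("closest sound:", database[nearest[0][0]])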

Mutantes (Lennie & Ordiales)

“Dialectic in suspense” is a sound work about the conflicted relation between nature and humanity’s contradictory development. Natural spaces and ambient sounds, mixed with residual human pollution, are combined with real-time audio and data processing that reveals the strategies of both humans and nature for overcoming the critical anthropocentric presence.

The work has three movements, giving the audience the opportunity to deepen their environmental awareness and perspective.
The first is a sound landscape built from original environments and natural sounds. These sounds begin to be contaminated by human interventions, which are modeled as sound processes. Meanwhile, noise generated by feedback (“no input” techniques) starts to be heard, and during the second movement the level of the “no input” feedback surpasses the natural landscape, arriving at the concept of the Anthropocene.

In the last movement, both nature and the human are given strategies to overcome the present crisis. On one hand, information on relevant climatic factors is used to model sound processes in real time. On the other hand, live performers play through chains of parallel sound processes characteristic of the human.
The sound is processed digitally using different live-coding techniques. A pre-analysis based on MIR (Music Information Retrieval) descriptors, stored in an online database, is combined with real-time processing and synthesis, random processes, and human control via external interfaces.

http://redpanal.org/a/dialectica-en-suspenso/ 

Robbie Barrat

Earlier this year, after introducing the idea of neural networks to my high school’s programming club, I was telling everyone there about how AI could be used to produce more artful things. They weren’t very convinced, and, half-jokingly, one of my friends in the club suggested that I should get a neural network to write rap songs in order to prove that AI could be used to make art. The other members of the club really loved this idea and challenged me to do it: if I could come to the next week’s meeting and show a neural network that could rap better than most of the people in the club, they’d admit that I was right and that AI had a place in art. By the time the next meeting came around, after scraping ~6,000 Kanye West lyrics, writing a basic recurrent neural network to control rhyme and meter, and getting a Markov chain to generate individual rap lines, I had a program that could reliably output coherent, flowing rap songs and perform them over a rap beat using text-to-speech.
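
As a rough illustration of the Markov-chain part (a minimal sketch, not the actual program; the corpus file name is hypothetical), a word-level bigram model over scraped lyrics can already produce individual rap lines:

import random
from collections import defaultdict

lines = open("kanye_lyrics.txt").read().lower().splitlines()

# Count bigram transitions, with start and end markers for each line.
transitions = defaultdict(list)
for line in lines:
    words = ["<s>"] + line.split() + ["</s>"]
    for prev, nxt in zip(words, words[1:]):
        transitions[prev].append(nxt)

def generate_line(max_words=12):
    word, out = "<s>", []
    while len(out) < max_words:
        word = random.choice(transitions[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

for _ in range(4):
    print(generate_line())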

Pablo Samuel Castro

We are building LyricAI, an AI assistant to help songwriters in the process of lyric writing. Rather than have AI generate completed outputs (whether music or words), we are interested in exploring the space of AI pushing artists towards new or uncomfortable territory that can then spur creativity. Our submission is the first step in this process. For this submission we used a dataset consisting of the Billboard top-50 songs since 1960. We converted all the words in this dataset into phoneme-syllables and then trained a character-level RNN over them; the output of the model was a sequence of phonemes, which we then mapped back to English words. We used some of David Usher’s previous lyrics as seeds and generated a set of lyrics. David then wrote completely new lyrics for one of his existing songs, based on the output of LyricAI.
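
A minimal sketch of the preprocessing step (assuming the pronouncing package, which wraps the CMU Pronouncing Dictionary; the example line is illustrative): map each word to a phoneme sequence for training, and keep a reverse map for turning generated phonemes back into words.

import pronouncing

line = "you can go your own way"
phoneme_corpus, reverse_map = [], {}

for word in line.split():
    phones = pronouncing.phones_for_word(word)
    if phones:                                # some words are missing from the dictionary
        phoneme_corpus.append(phones[0])
        reverse_map[phones[0]] = word

print(" | ".join(phoneme_corpus))             # text the character-level RNN trains on
print(reverse_map)                            # used to map generated phonemes back to words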

Nao Tokui

“AI DJ Project – A dialogue between AI and a human” is a live performance featuring an Artificial Intelligence (AI) DJ playing alongside a human DJ. Utilizing deep neural network technology, the AI system selects and mixes songs. Playing back to back, each DJ selects one song at a time, embodying a dialogue between the human and AI through music.

In “Back to Back,” the AI system and the human DJ perform under conditions that are as similar as possible. For example, the AI daringly uses physical vinyl records and turntables. The system listens to songs played by the human DJ, detects the tempo, infers the genre with a deep neural network, and processes this information on the spot. Following this process, the AI chooses the next record to be played. Of course, since the AI does not have a physical body, it cannot set the new record on the turntable, so it requires some human assistance. After a record is set, the AI begins the process again, adjusting the tempo of the next song to the tempo of the song played by its human counterpart. The beats of both songs are matched by controlling the pitch of the custom-made turntable. For this purpose, the AI uses a second model trained via reinforcement learning.
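
The listening step can be sketched as follows (a minimal illustration, not the actual AI DJ system; the file name and the candidate tempo are placeholders, and it assumes librosa for beat tracking):

import numpy as np
import librosa

# Estimate the tempo of the track currently playing on the human DJ's turntable.
y, sr = librosa.load("incoming_dj_mix.wav", mono=True)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
tempo = float(np.atleast_1d(tempo)[0])   # beat_track may return a scalar or a 1-element array

# Compute the turntable pitch offset needed to match a candidate record's tempo.
candidate_bpm = 126.0
pitch_adjust = (tempo / candidate_bpm - 1.0) * 100.0
print(f"detected {tempo:.1f} BPM, set pitch to {pitch_adjust:+.1f}%")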

Oded Ben-Tal

To compose this piece I used a machine learning system we call Folk-RNN. The system was trained on over 20,000 transcriptions of “session” folk music (broadly speaking, Celtic dance tunes). After training, the system can generate new tunes in a plausible ‘Irish’ style. As a composer I was more interested in exploring the limits of the system’s capabilities and in trying to extract material that deviates, in interesting ways, from the style the system learned. The first stage of the composition in each of the four movements consisted of many iterations of asking the system to generate tunes and changing some parameters in order to pull the system’s outputs away from the regular patterns it learned. Each movement is based on melodic material generated by the system, but I treated it with a fair amount of freedom when arranging these ‘folk tunes’. Hence the title: these are like the illegitimate offspring of me and folk-rnn.

Bastard Tunes is an instrumental composition (notated score and concert recording at the URL) which uses a machine learning system called Folk-RNN. This is a long short-term memory network (LSTM) that models music transcription sequences. The details of this system are described in two publications:
Sturm, B. L., Santos, J. F., Ben-Tal, O., and Korshunova, I. (2016). Music transcription modelling and composition using deep learning. 1st Conference on Computer Simulation of Musical Creativity, Huddersfield, June 17–19.
Sturm, B. L. and Ben-Tal, O. Back to music practice: The evaluation of deep learning approaches to music transcription modelling and generation. Journal of Creative Music Systems, 2(1).

The training data used to build the model consists of over 26,600 ABC transcriptions of “session” folk music (like Irish traditional music), with over 4 million tokens. We acquired this data from the online forum of thesession.org, a website devoted to the practice of this kind of music. We train the network using a minibatch approach to model the joint probability distributions of token sequences. We design the token vocabulary from the ABC notation, including symbolic representations of pitch, rhythm, and some structural aspects (e.g. bar lines and repeats).
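
A minimal sketch of this kind of tokenization (the regex vocabulary here is illustrative, not the actual folk-rnn token set): split an ABC transcription into meter, key, bar-line, repeat, and pitch/duration tokens.

import re

abc = "M:6/8\nK:Gmaj\n|:G2B d2B|c2A B2G:|"
pattern = r"M:\d+/\d+|K:[A-Ga-g]\w*|:\||\|:|\||[=^_]?[A-Ga-g][,']*\d*|\d+|."
tokens = [t for t in re.findall(pattern, abc.replace("\n", " ")) if t.strip()]
print(tokens)
# ['M:6/8', 'K:Gmaj', '|:', 'G2', 'B', 'd2', 'B', '|', 'c2', 'A', 'B2', 'G', ':|']
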
After training, the system can generate new tunes in a plausible ‘Irish’ style (for example, here). As a composer I was more interested in exploring the limits of the system’s capabilities and in trying to extract material that deviates, in interesting ways, from the style the system learned. Interactively working with the system involves changing two parameters when invoking the Python code: the initial sequence and a temperature setting. The system learned a conditional probability distribution for the next token based on the sequence of tokens from the beginning of the tune. We can therefore seed the system with an opening sequence and ask it to complete the tune (and we can generate more than one continuation). This initial seed can be just a meter token (M:4/4 or M:6/8), or it can include the opening notes or bars. The temperature parameter affects the way the distribution is sampled: the lower the temperature, the more ‘conservative’ the system is in its selection of the next token; with higher temperatures, less probable events occur more frequently. The four movements of my piece employed different strategies for generating material with the system and arranging it afterwards.
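
The effect of the temperature parameter can be seen in a small sketch (toy scores, not the folk-rnn code itself): dividing the scores by the temperature before the softmax sharpens the distribution at low temperatures and flattens it at high ones.

import numpy as np

def sample_with_temperature(logits, temperature):
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.1]                       # toy scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    _, probs = sample_with_temperature(logits, t)
    print(f"T={t}: {np.round(probs, 2)}")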

Bob L. Sturm

My music composition (for computer playback) is titled “Eight short outputs generated by a long short-term memory network with three fully connected hidden layers of 512 units each trained on over 23,000 ABC transcriptions of session music (Irish, English, etc.), and arranged by my own “personal” neural network trained on who knows what for who knows how long (I can’t remember any of the settings)”. My composition consists of eight short movements exploring the use of deep learning to assist the process of music composition. In particular, I trained an LSTM network (effectively char-rnn) on textual music transcriptions. I then generated over 70,000 new transcriptions, half of which I automatically synthesised and posted to the website The Endless Traditional Music Session. While the trained system generates transcriptions that often exemplify the conventions of its training data, it sometimes produces ones that fall far from those conventions (errors), but that is not necessarily musically bad. My composition features some of these “useful errors” and demonstrates how even poor model outputs can lead to pleasant music.

Sets 1-3 feature computer-generated tunes alongside traditional Irish tunes played by professional musicians. London-based musician Daren Banarsë assembled each set in 2017, starting with a tune generated by the machine learning system folk-rnn. Set #1 (Jigs) consists of “The Cuil Aodha”, “The Dusty Windowsill”, and the folk-rnn tune “The Glas Herry Comment”. Set #2 (Slow Reels) features “Maghera Mountain” and tune X:2897 from “The folk-rnn Session Book Volume 1 of 10”. Set #3 (Fast Reels) features “The Rookery”, X:1068 from “The folk-rnn Session Book Volume 1 of 10”, and “Toss The Feathers”. An interesting aside comes from an oversight in a news article in The Daily Mail that features a video of Set #1. The video was inadvertently edited such that it contains only the first 30 seconds of the existing traditional tune, “The Cuil Aodha.” The people who bothered to comment in the online forum described how this music is “missing the ‘human’ element”, “Totally lifeless without warmth”, “Sounds like a robotic Irish jig”, and “Well, it’s music but it’s not very good. It really represents nothing more than a presentation of a ‘stereotype’ that ‘seems’ like that ‘style’ or ‘flavor’ of music. It’s not really interesting and it’s not unique in a genuine, good way.”

My website, The Endless folk-rnn Traditional Music Session, provides an unfiltered experience of the synthesized output of three different long short-term memory models trained on over 23,000 textual music transcriptions of traditional dance music from Ireland and the UK. Every five minutes, the website selects at random seven transcriptions for playback from 47,747 generated transcriptions. Each synthesis is created automatically by a simple performance script that randomly selects a tempo and a combination of instruments. Some feature slight changes to timing and dynamics to “humanize” the output. Some of the praise given to this website by members of the community behind the source data includes: “Jeez. A computer that noodles. That’s all we need.”; “Basically it’s crude turntabling without the sense of a musician familiar with the significance of various motifs & phrases.”; “This sounds like evil devil work.”; “It’s a slightly surreal experience, like you are listening to the output of someone locked in a cell and forced to write tunes!” More discussion can be found here. The project itself is here.
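
The performance script might look roughly like this (a sketch under stated assumptions, not the project's actual script: it assumes abc2midi from the abcMIDI package, its -Q tempo option and %%MIDI program directive; the file names, tempo range, and instrument list are placeholders):

import random
import subprocess

tempo = random.randint(90, 180)            # quarter notes per minute
program = random.choice([0, 21, 40, 73])   # piano, accordion, violin, flute (General MIDI)

# Prepend an instrument directive to one generated ABC transcription, then synthesize it.
tune = open("generated_tune.abc").read()
with open("performed_tune.abc", "w") as f:
    f.write(f"%%MIDI program {program}\n" + tune)

subprocess.run(["abc2midi", "performed_tune.abc", "-o", "tune.mid", "-Q", str(tempo)],
               check=True)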

Alexey Tikhonov and Ivan Yamshchikov

This is an EP of four songs whose lyrics were created by a neural network trained to resemble Kurt Cobain (who would have turned 50 this year). We generated the lyrics and recorded the music, and Rob Carrol (an independent musician from New York) sang the generated lyrics.

During training we used concatenated embeddings that, in addition to the standard word2vec representation of each word, also included its transcription (so that the network could learn phonetics), the author of the document, and other meta-information. The architecture of the network was close to the one described here.
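
The concatenated-embedding idea can be sketched as follows (assuming gensim and pronouncing; the toy corpus, vector sizes, and author count are illustrative, not the original pipeline): each token is represented by its word2vec vector, a bag-of-phonemes vector from its transcription, and a one-hot author indicator.

import numpy as np
import pronouncing
from gensim.models import Word2Vec

# ARPAbet phoneme inventory used by the CMU Pronouncing Dictionary.
ARPABET = ("AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
           "OW OY P R S SH T TH UH UW V W Y Z ZH").split()

sentences = [["come", "as", "you", "are"], ["smells", "like", "teen", "spirit"]]
w2v = Word2Vec(sentences, vector_size=32, min_count=1)   # toy word2vec model

def phoneme_vector(word):
    vec = np.zeros(len(ARPABET))
    for pron in pronouncing.phones_for_word(word)[:1]:    # first dictionary pronunciation
        for ph in pron.split():
            vec[ARPABET.index(ph.rstrip("012"))] += 1.0   # strip stress markers
    return vec

def token_embedding(word, author_id, n_authors=4):
    author = np.eye(n_authors)[author_id]                 # one-hot meta-information
    return np.concatenate([w2v.wv[word], phoneme_vector(word), author])

print(token_embedding("spirit", author_id=2).shape)       # (32 + 39 + 4,) = (75,)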