SingSong: Generating musical accompaniments from singing


Chris Donahue^*, Antoine Caillon^*¹, Adam Roberts^*,
Ethan Manilow, Philippe Esling¹, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel

Google Research, ¹IRCAM, ^*Equal Contribution

Overview

We present SingSong, a system which generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM, a state-of-the-art approach for unconditional audio generation, to be suitable for conditional "audio-to-audio" generation tasks, and train it on the source-separated (vocal, instrumental) pairs. To improve our system's generalization from source-separated training data (where the vocals contain artifacts of the instrumental) to isolated vocals we might expect from users, we explore a number of different featurizations of vocal inputs, the best of which improves quantitative performance on isolated vocals by 53% relative to the default AudioLM featurization. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline.

Listener Study MUSDB18

For our study, listeners are presented with a pair of 10s vocal-instrumental mixtures, where the vocals are identical between the two mixtures and come from MUSDB18-test, and the instrumentals come from different sources (ground truth, our models, or baselines). Listeners are asked to indicate in which of the two mixtures the instrumental accompaniment seems more musically compatible with the vocals.
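One simple way to summarize pairwise judgments of this kind is a per-source win rate. The sketch below is illustrative only and is not the analysis used in the study: the function name `win_rates`, the tuple layout, and the toy judgments are all assumptions, and the placeholder labels "A", "B", and "C" do not correspond to any real system or result.

```python
from collections import defaultdict


def win_rates(judgments):
    """Return, for each source, the fraction of its comparisons that it won.

    `judgments` is an iterable of (source_a, source_b, winner) tuples, where
    `winner` is whichever of the two sources the listener preferred.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for source_a, source_b, winner in judgments:
        if winner not in (source_a, source_b):
            raise ValueError(f"winner {winner!r} is not one of the compared sources")
        appearances[source_a] += 1
        appearances[source_b] += 1
        wins[winner] += 1
    return {name: wins[name] / appearances[name] for name in appearances}


# Made-up judgments for illustration only; these are not results from the study.
toy_judgments = [
    ("A", "B", "A"),
    ("A", "C", "C"),
    ("B", "C", "C"),
    ("A", "B", "A"),
]
print(win_rates(toy_judgments))
```

A win rate alone does not establish significance; a statistical test over the raw judgments would be needed for that.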

We compare to two retrieval-based baselines: one that retrieves an instrumental clip uniformly at random from the MUSDB18-dev set (Random), and one that uses musical features of the ground truth mixture to retrieve an appropriate instrumental from MUSDB18-dev and adapt it to further improve alignment (Retrieval).
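The Random baseline can be sketched in a few lines. This is a minimal illustration that assumes the MUSDB18-dev instrumentals have already been cut into candidate clips and collected in a list; the function name `random_baseline` and the clip identifiers are hypothetical. The Retrieval baseline is not sketched, since the musical features it relies on are not specified here.

```python
import random


def random_baseline(dev_instrumental_clips, rng=None):
    """Pick one instrumental clip uniformly at random, ignoring the vocal input."""
    if not dev_instrumental_clips:
        raise ValueError("need at least one candidate instrumental clip")
    rng = random.Random() if rng is None else rng
    return rng.choice(dev_instrumental_clips)


# Hypothetical clip identifiers standing in for real audio files.
candidates = ["dev_clip_000", "dev_clip_001", "dev_clip_002"]
chosen = random_baseline(candidates, rng=random.Random(0))
```

Because the choice ignores the vocals entirely, this baseline serves as a floor for how compatible an unrelated accompaniment sounds.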

Audio examples for each track below: Vocal Input, Ground Truth, SingSong-XL, SingSong-Base, Retrieval, Random.

"Bounty" Steven A. Clark	"Punch Drunk" Grant$	"High Horse" Secret Mountains

"Britpop" Music Delta	"Die For Us" Celestial Shore	"Sister Cities" Hop Along

30 Second Examples

While the listener study was conducted with 10s clips, we find that we are able to generate coherent backing tracks for longer segments, such as those heard below.


"Waterduct" Ava Luna	"PunchDrunk" Grant$	"Disturbing Wildlife" Invisible Familiars	"Take A Step" Meaxic

"Bounty" Steven A. Clark	"Spacestation" Strand Of Oaks	"Vermont" The Districts	"Night Owl" A Classic Education

"Take A Step" Meaxic	"You Listen" Meaxic	"Stay Even" Port St. Willow	"Curfews" Snowmine

"NightOwl" A Classic Education	"NightOwl" A Classic Education	"Die For Us" Celestial Shore	"Air Traffic" Clara Berry And Wooldog

"Heavy Love" Dreamers Of The Ghetto	"Heavy Love" Dreamers Of The Ghetto	"Disturbing Wildlife" Invisible Familiars	"Dont You Ever" Matthew Entwistle

"Take A Step" Meaxic	"You Listen" Meaxic	"You Listen" Meaxic	"Beatles" Music Delta

"Disco" Music Delta	"Gospel" Music Delta	"Fire" Night Panther	"Stay Even" Port St Willow

"High Horse" Secret Mountains	"High Horse" Secret Mountains	"Curfews" Snowmine	"Spacestation" Strand Of Oaks

"You Let Me Down" Sweet Lights	"PunchDrunk" Grant$	"PunchDrunk" Grant$	"PunchDrunk" Grant$

Amateur Singing

The previous examples used professional vocals from the MUSDB18 dataset. We are also interested in the potential for SingSong to accompany and empower anyone to make music with their voice. We explore this here with singing examples from the Vocadito dataset, which contains recordings of novice singers made on consumer electronics.

Source Separation Artifacts

Source separation models introduce artifacts into the vocals that can leak information about the instrumental to the decoder and hurt generalization to isolated vocals. To overcome this, we add noise to the source-separated audio during training.
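The noise augmentation described above might look roughly like the following. This is a sketch under stated assumptions, not the training code: the text does not say what kind of noise is added or at what level, so white Gaussian noise at a fixed signal-to-noise ratio is assumed here, and the function name `add_white_noise`, the `snr_db` parameter, and the toy sine-wave input are all hypothetical.

```python
import math
import random


def add_white_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with white Gaussian noise added.

    The noise level is set relative to the signal power so that the result
    has approximately `snr_db` dB signal-to-noise ratio (an assumed scheme).
    """
    rng = random.Random() if rng is None else rng
    if not samples:
        return []
    signal_power = sum(x * x for x in samples) / len(samples)
    if signal_power == 0.0:
        return list(samples)  # silent input: no reference level for the noise
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in samples]


# Toy stand-in for a source-separated vocal: one second of a 220 Hz sine at 16 kHz.
sample_rate = 16000
vocal = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
noisy = add_white_noise(vocal, snr_db=20.0, rng=random.Random(0))
```

The intuition is that the added noise masks faint separation residue, so the model cannot rely on it as a shortcut to the instrumental.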