We tested several source separation methods on a dataset of synthesized Bach chorales. Below are listening examples for each of the methods we tested. The code used to create the chorales dataset is available in the synthesize-chorales repository.
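As a rough illustration of how such a dataset can be produced, the sketch below renders each voice of a chorale to its own stem with naive sine-tone synthesis and sums the stems into a mix. The sample rate, tempo, and synthesis method are assumptions made for illustration; the actual pipeline lives in the synthesize-chorales repository.

```python
import numpy as np
from music21 import corpus

SR = 22050           # sample rate in Hz (assumed)
TEMPO = 80           # assumed quarter notes per minute
SPQ = 60.0 / TEMPO   # seconds per quarter note

def render_part(part, total_seconds):
    """Render one (monophonic) voice as sine tones; placeholder synthesis."""
    out = np.zeros(int(total_seconds * SR) + SR)   # small pad for rounding
    for note in part.flatten().notes:
        start = int(float(note.offset) * SPQ * SR)
        length = int(float(note.duration.quarterLength) * SPQ * SR)
        t = np.arange(length) / SR
        out[start:start + length] += 0.2 * np.sin(2 * np.pi * note.pitch.frequency * t)
    return out

score = corpus.parse('bach/bwv360')      # assumes BWV 360 ships with music21
total = float(score.highestTime) * SPQ   # piece length in seconds
stems = {p.partName: render_part(p, total) for p in score.parts}
mix = sum(stems.values())                # the "mix" column in the tables below
```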

The following listening examples are from chorale BWV 360 (chorale number 350 in the Riemenschneider edition). The chorale’s score is available here.

Baseline: Score-informed NMF

As a baseline, we have implemented the score-informed NMF technique described in this paper (PDF):

Ewert, S., & Müller, M. (2012). Using score-informed constraints for NMF-based source separation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 129–132).

Our Python implementation is available in the score-informed-nmf repository. We ran four experiments:

[Audio players: the mix and the separated soprano, alto, tenor, and bass for the ground-truth reference and Ex. A–D.]
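The core idea of the technique is that standard NMF factorizes a magnitude spectrogram V ≈ WH, and the score enters through initialization: entries of the templates W (harmonics a pitch cannot produce) and of the activations H (frames where the score says a pitch is not sounding, plus a tolerance) are set to zero. Multiplicative updates can only rescale entries, so these zeros, and with them the score constraints, persist throughout the optimization. Below is a minimal sketch with KL-divergence updates, assuming the masks have already been built from the score (names and defaults are ours, not the paper's):

```python
import numpy as np

def score_informed_nmf(V, W_mask, H_mask, n_iter=200, eps=1e-9):
    """KL-divergence NMF of V (freq x time) ~ W @ H. Zeros in W_mask and
    H_mask encode the score constraints; multiplicative updates keep them."""
    rng = np.random.default_rng(0)
    W = rng.random(W_mask.shape) * W_mask   # templates: one column per pitch
    H = rng.random(H_mask.shape) * H_mask   # activations: one row per pitch
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H
```

Each voice is then recovered by soft masking: its spectrogram is estimated as (W_s H_s) / (W H) ⊙ V, where W_s and H_s are the columns and rows belonging to that voice's pitches, and inverted to audio with the mixture phase.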

Wave-U-Net

We evaluated a deep learning separation technique called Wave-U-Net on our dataset. Wave-U-Net is described in the following paper (arXiv):

Stoller, D., Ewert, S., & Dixon, S. (2018). Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (pp. 334–340).
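Architecturally, Wave-U-Net is a one-dimensional U-Net applied directly to waveform samples: convolutions with decimation on the way down, linear-interpolation upsampling with skip connections on the way up, and one output channel per source. The PyTorch sketch below captures this shape with illustrative sizes; the paper's exact configuration (12 layers, a difference output layer, and so on) differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNetSketch(nn.Module):
    """Shape-only sketch of Wave-U-Net: 1-D convs, decimation downsampling,
    linear-interpolation upsampling, skips. Input length: multiple of 2**depth."""
    def __init__(self, n_sources=4, depth=4, growth=24):
        super().__init__()
        self.down, c = nn.ModuleList(), 1
        for i in range(depth):                      # kernel size 15 going down
            self.down.append(nn.Conv1d(c, growth * (i + 1), 15, padding=7))
            c = growth * (i + 1)
        self.bottleneck = nn.Conv1d(c, c, 15, padding=7)
        self.up = nn.ModuleList()
        for i in reversed(range(depth)):            # kernel size 5 going up
            self.up.append(nn.Conv1d(c + growth * (i + 1), growth * (i + 1), 5, padding=2))
            c = growth * (i + 1)
        self.out = nn.Conv1d(c + 1, n_sources, 1)   # +1: skip from the raw input

    def forward(self, x):                           # x: (batch, 1, samples)
        skips, h = [], x
        for conv in self.down:
            h = F.leaky_relu(conv(h))
            skips.append(h)
            h = h[:, :, ::2]                        # downsample by decimation
        h = F.leaky_relu(self.bottleneck(h))
        for conv in self.up:
            h = F.interpolate(h, scale_factor=2, mode='linear', align_corners=False)
            h = F.leaky_relu(conv(torch.cat([h, skips.pop()], dim=1)))
        return torch.tanh(self.out(torch.cat([h, x], dim=1)))  # one waveform per voice

# e.g. WaveUNetSketch()(torch.randn(1, 1, 16384)).shape == (1, 4, 16384)
```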

Soprano and bass mixtures

Experiment 1 tested extracting soprano and bass from mixtures containing only these two voices.

[Audio players: the mix and the separated soprano and bass for the ground-truth reference and Ex. 1.]

Extract all four voices

Experiments 2–3 tested extracting all four voices from mixtures of all four voices. Experiment 2 used one model to extract all voices, and Experiment 3 used one model per voice.

[Audio players: the mix and the separated soprano, alto, tenor, and bass for the ground-truth reference and Ex. 2–3.]

Higher-variability dataset

Experiment 4 tested a dataset that contains more sources of variability than the original Bach chorales:

[Audio players: the mix and the separated soprano, alto, tenor, and bass for the ground-truth reference and Ex. 4.]

Score-informed Wave-U-Net

We created a variant of Wave-U-Net that is conditioned on the musical score. Our code is available in the score-informed-Wave-U-Net repository. We experimented with different score representations and conditioning locations.
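The general recipe is to turn the target voice's score into a signal at audio sample rate and feed it to the network alongside the audio. The sketch below builds a normalized pitch channel from (onset, offset, MIDI pitch) triples and applies it at the input location by stacking it with the mixture; the exact encodings, ranges, and names in the repository may differ (the constants here are assumptions).

```python
import numpy as np

SR = 22050
MIDI_LO, MIDI_HI = 30, 90          # assumed pitch range for normalization

def normalized_pitch(notes, n_samples):
    """Sample-aligned score channel: active MIDI pitch mapped to (0, 1],
    zero where the voice is silent. notes: (onset_s, offset_s, midi) triples."""
    sig = np.zeros(n_samples, dtype=np.float32)
    for onset, offset, midi in notes:
        lo, hi = int(onset * SR), min(int(offset * SR), n_samples)
        # +1 keeps the lowest pitch distinct from silence (encoded as 0)
        sig[lo:hi] = (midi - MIDI_LO + 1) / (MIDI_HI - MIDI_LO + 1)
    return sig

# Input conditioning: the network sees a 2-channel signal (mixture + score).
mix = np.zeros(4 * SR, dtype=np.float32)          # stand-in mixture
soprano_notes = [(0.0, 0.5, 67), (0.5, 1.0, 69)]  # stand-in score for one voice
net_input = np.stack([mix, normalized_pitch(soprano_notes, len(mix))])
```

Output conditioning applies the same signal near the network's output (for example by combining it with the last feature map), and input-output applies it at both ends.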

One model per voice vs. multi-source training

Experiments 6–7 used the normalized pitch score representation and the input-output conditioning location. Experiment 6 used one model per voice, whereas Experiment 7 used multi-source training, in which one model is trained to extract any of the four voices guided only by the score (a sketch of this training setup follows the examples below).

[Audio players: the mix and the separated soprano, alto, tenor, and bass for the ground-truth reference and Ex. 6–7.]
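To make Experiment 7's setup concrete, here is a minimal sketch of one multi-source training step, assuming a score-conditioned network and a batch dictionary holding the mixture plus per-voice score channels and target stems (all names here are hypothetical). Each step draws a voice at random, so a single set of weights learns to extract whichever voice the score channel describes.

```python
import random
import torch

VOICES = ['soprano', 'alto', 'tenor', 'bass']

def multi_source_step(model, optimizer, batch):
    """One optimization step of multi-source training.
    batch['mix']: (B, 1, T); batch['score'][v], batch['stem'][v]: (B, 1, T)."""
    voice = random.choice(VOICES)          # same weights serve every voice
    net_in = torch.cat([batch['mix'], batch['score'][voice]], dim=1)
    pred = model(net_in)                   # (B, 1, T) estimate of that voice
    loss = torch.mean((pred - batch['stem'][voice]) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

By contrast, Experiment 6 trains four such models, each seeing only one voice's scores and stems.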

Comparison of score conditioning methods

Experiments 8–10 compared four score representations (normalized pitch, pitch and amplitude, piano roll, and pure tone) and three score conditioning locations (input, output, and input-output), for a total of 12 configurations in each experiment. All three experiments used the higher-variability dataset, and each experiment used a different model type. The sketch below illustrates how two of these representations can be built from a voice's note list.
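This is a hedged sketch, again working from (onset, offset, MIDI pitch) triples; "pitch and amplitude" adds a level channel to the normalized pitch signal shown earlier, and the frame rates and constants here are illustrative assumptions.

```python
import numpy as np

SR = 22050

def pure_tone(notes, n_samples):
    """'Pure tone': a sine wave at the score's frequency, silent between notes."""
    sig = np.zeros(n_samples, dtype=np.float32)
    for onset, offset, midi in notes:
        lo, hi = int(onset * SR), min(int(offset * SR), n_samples)
        freq = 440.0 * 2.0 ** ((midi - 69) / 12.0)   # MIDI number -> Hz
        sig[lo:hi] = np.sin(2 * np.pi * freq * np.arange(hi - lo) / SR)
    return sig

def piano_roll(notes, n_samples, hop=256):
    """'Piano roll': binary pitch-by-frame matrix (one row per MIDI pitch),
    upsampled or tiled to audio rate before being fed to the network."""
    roll = np.zeros((128, n_samples // hop + 1), dtype=np.float32)
    for onset, offset, midi in notes:
        roll[int(midi), int(onset * SR / hop):int(offset * SR / hop) + 1] = 1.0
    return roll
```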

Experiment 8

[Audio players: the mix and the separated soprano, alto, tenor, and bass for the ground-truth reference and for each combination of score type (normalized pitch, pitch and amplitude, piano roll, pure tone) and conditioning location (input, output, input-output).]

Experiment 9

[Audio players: the mix and the separated tenor for the ground-truth reference and for each combination of score type (normalized pitch, pitch and amplitude, piano roll, pure tone) and conditioning location (input, output, input-output).]

Experiment 10

In this experiment, the output conditioning location failed to train, so results are only shown for input and input-output.

[Audio players: the mix and the separated soprano, alto, tenor, and bass for the ground-truth reference and for each combination of score type (normalized pitch, pitch and amplitude, piano roll, pure tone) and conditioning location (input, input-output).]

Download

A zip file containing all listening examples in their original quality can be downloaded here (382 MB).