We tested several source separation methods on a dataset of synthesized Bach chorales and on choral recordings from the Choral Singing Dataset. Below are listening examples for each of the methods we tested. The code used to create the chorales dataset is available in the synthesize-chorales repository.

The following listening examples are from chorale BWV 359 (chorale number 365 in the Riemenschneider edition). The chorale’s original score is available here. After running our data augmentation procedure, the chorale was modified with simulated breaths and randomly omitted notes. The modified score is available here.
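The augmentation procedure can be sketched roughly as follows. The note format, omission probability, and breath length used here are illustrative assumptions, not the repository's actual parameters.

```python
import random

def augment_voice(notes, omit_prob=0.1, breath_s=0.3, seed=0):
    """Rough sketch of the augmentation described above: randomly omit
    notes and shorten the remainder to leave breath-like gaps.

    notes: list of (onset_s, offset_s, midi_pitch) tuples for one voice,
    sorted by onset. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    out = []
    for onset, offset, midi in notes:
        if rng.random() < omit_prob:
            continue  # randomly omit this note
        # carve a short silent gap off the end of the note, as a
        # stand-in for the singer taking a breath
        offset = max(onset, offset - breath_s)
        out.append((onset, offset, midi))
    return out
```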

Baseline: Score-informed NMF

As a baseline, we have implemented the score-informed NMF (SI-NMF) technique described in this paper (PDF):

Ewert, S., & Müller, M. (2012). Using score-informed constraints for NMF-based source separation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 129–132).

Our Python implementation is available in the score-informed-nmf repository. We ran several experiments to select the best parameters for SI-NMF. Results for the best experiment are shown below.
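The core idea of SI-NMF can be sketched as follows: binary constraint matrices derived from the aligned score zero out spectral-template and activation entries that the score rules out, and multiplicative updates preserve those zeros throughout optimization. This is a minimal sketch under standard KL-divergence updates, not our full implementation; in particular, the construction of the masks from the score is omitted.

```python
import numpy as np

def score_informed_nmf(V, W_mask, H_mask, n_iter=100, eps=1e-10):
    """Minimal sketch of score-informed NMF.

    V: magnitude spectrogram (freq x time).
    W_mask (freq x components), H_mask (components x time): binary
    constraints derived from the aligned score; zeros mark entries the
    score rules out (e.g. activations outside a note's duration).
    """
    rng = np.random.default_rng(0)
    W = rng.random(W_mask.shape) * W_mask  # zeros stay zero under
    H = rng.random(H_mask.shape) * H_mask  # multiplicative updates
    for _ in range(n_iter):
        # standard KL-divergence multiplicative updates
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Because the updates are multiplicative, any entry initialized to zero by a mask can never become nonzero, which is how the score constraints are enforced.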

[Audio examples: the mixture, plus soprano, alto, tenor, and bass tracks for the reference and the SI-NMF separation.]

Wave-U-Net

We evaluated a deep learning separation technique called Wave-U-Net on our dataset. Wave-U-Net is described in the following paper (arXiv):

Stoller, D., Ewert, S., & Dixon, S. (2018). Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (pp. 334–340).

Experiments 2–3 tested extracting all four voices from mixtures of all four voices. Experiment 2 used one model to extract all voices, and Experiment 3 used one model per voice.
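Wave-U-Net's multi-scale structure, repeated convolution and decimation on the way down, then linear-interpolation upsampling combined with skip connections on the way up, can be illustrated with a toy single-channel sketch. The fixed smoothing filter and depth here are placeholders, not learned weights.

```python
import numpy as np

def conv1d(x, w):
    """'Valid' 1-D convolution, standing in for a learned conv layer."""
    return np.convolve(x, w, mode="valid")

def wave_u_net_sketch(x, depth=3):
    """Toy single-channel illustration of the Wave-U-Net topology.

    Downsampling path: conv, then decimate by 2 at each level.
    Upsampling path: linear-interpolation upsampling (the paper's
    alternative to transposed convolutions) plus a skip connection
    from the matching downsampling level.
    """
    w = np.array([0.25, 0.5, 0.25])  # placeholder smoothing filter
    skips = []
    for _ in range(depth):
        x = conv1d(x, w)
        skips.append(x)       # saved for the skip connection
        x = x[::2]            # decimate by 2
    for skip in reversed(skips):
        # upsample by linear interpolation back to the skip's length
        up = np.interp(np.arange(len(skip)), np.arange(len(x)) * 2, x)
        x = conv1d(up + skip, w)  # combine with same-scale features
    return x
```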

[Audio examples: the mixture, plus soprano, alto, tenor, and bass tracks for the reference and the separations from Experiments 2 and 3.]

Score-informed Wave-U-Net

We created a variant of Wave-U-Net that is conditioned on the musical score. Our code is available in the score-informed-Wave-U-Net repository. We experimented with different score representations and conditioning locations.
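The three conditioning locations (input, output, and input-output) amount to where the score features enter the network. Below is a shape-level sketch with placeholder stand-ins for the network body and output layer; the concatenation mechanism is an assumption for illustration, not necessarily the repository's exact scheme.

```python
import numpy as np

def body(x):
    """Placeholder for the Wave-U-Net encoder/decoder stack."""
    return np.tanh(x)

def head(x):
    """Placeholder for the final output layer (here: channel mean)."""
    return x.mean(axis=0)

def conditioned_forward(audio, score, location):
    """Shape-level sketch of the three conditioning locations: the
    score features are concatenated as an extra channel at the network
    input, just before the output layer, or at both points."""
    x = audio[None, :]                        # (channels, samples)
    if location in ("input", "input-output"):
        x = np.concatenate([x, score[None, :]])
    h = body(x)
    if location in ("output", "input-output"):
        h = np.concatenate([h, score[None, :]])
    return head(h)
```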

Experiments 4–6 compared four score representations (normalized pitch, pitch and amplitude, piano roll, and pure tone) and three score conditioning locations (input, output, and input-output), a total of 12 configurations per experiment. Each experiment used a different model type.
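The four score representations can be rendered at audio rate from a voice's note events. This sketch assumes an (onset, offset, MIDI pitch) note format; the constant amplitude and exact array shapes are illustrative, not the repository's implementation.

```python
import numpy as np

def score_representations(notes, n_samples, sr=22050, n_pitches=128):
    """Render the four score representations for one voice.

    notes: list of (onset_s, offset_s, midi_pitch) tuples (assumed
    format). Returns sample-rate signals/matrices for each type.
    """
    t = np.arange(n_samples) / sr
    norm_pitch = np.zeros(n_samples)          # MIDI pitch scaled to [0, 1]
    amplitude = np.zeros(n_samples)           # 1 while a note sounds
    piano_roll = np.zeros((n_pitches, n_samples))
    pure_tone = np.zeros(n_samples)           # sine at the notated pitch
    for onset, offset, midi in notes:
        idx = (t >= onset) & (t < offset)
        norm_pitch[idx] = midi / 127.0
        amplitude[idx] = 1.0
        piano_roll[midi, idx] = 1.0
        freq = 440.0 * 2 ** ((midi - 69) / 12)  # MIDI -> Hz
        pure_tone[idx] = np.sin(2 * np.pi * freq * t[idx])
    return norm_pitch, amplitude, piano_roll, pure_tone
```

The "pitch and amplitude" representation pairs the normalized pitch curve with the amplitude curve, while "pure tone" is itself an audio-like signal that can be fed to the network like an extra input channel.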

Comparison of score conditioning methods

Experiment 4

[Audio examples: the mixture and reference tracks (soprano, alto, tenor, bass), plus separations for each score type (normalized pitch, pitch and amplitude, piano roll, pure tone) at each conditioning location (input, output, input-output).]

Experiment 5

For the tenor, which is the most challenging voice to separate, all score types were tested. For the other voices, only the score types that performed best on the tenor were tested.

[Audio examples: the mixture and reference tracks (soprano, alto, tenor, bass), plus separations for each score type (normalized pitch, pitch and amplitude, piano roll, pure tone) at each conditioning location (input, output, input-output).]

Experiment 6

In this experiment, the output conditioning location failed to train, so results are only shown for input and input-output.

[Audio examples: the mixture and reference tracks (soprano, alto, tenor, bass), plus separations for each score type (normalized pitch, pitch and amplitude, piano roll, pure tone) at the input and input-output conditioning locations.]

Evaluation on real choir recording

The following table shows the results of models from several experiments, evaluated on the song ‘El Rossinyol’ from the Choral Singing Dataset, described in the following paper:

Cuesta, H., Gómez, E., Martorell, A., & Loáiciga, F. (2018). Analysis of intonation in unison choir singing. In Proceedings of the 15th International Conference on Music Perception and Cognition (ICMPC).

Note that all models were trained only on synthesized examples without any lyrics.

[Audio examples: the mixture and reference tracks (soprano, alto, tenor, bass), plus separations from SI-NMF and the models from Experiments 3, 5, and 6.]

Supplementary Evaluation Figures

In addition to the evaluation figures presented in the paper, we include here a supplementary notebook with figures showing all BSS Eval metrics (SDR, SIR, SAR, and ISR) for all experiments in the paper.
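As a reminder of what the headline metric measures, SDR compares the reference signal's energy to the energy of the estimation error. The snippet below is a basic definition for intuition only; full BSS Eval additionally decomposes the error into interference, artifact, and spatial terms to report SIR, SAR, and ISR.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB. Unlike full BSS Eval,
    this simplified version lumps all error energy together instead of
    separating interference, artifact, and spatial components."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den)
```

For example, an estimate that is the reference scaled by 0.5 scores about 6 dB under this definition.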