Project on Music/Voice Separation
Alex Berrian
Most recent update: August 22, 2014

This page contains a presentation of the final project I did for the course MAT 271: Applied & Computational Harmonic Analysis, taught by Professor Naoki Saito. Please feel free to email me if you have any comments or questions!

Introduction

Music/voice separation refers to the problem of trying to separate vocals from instrumentals in a song, in order to produce an acappella track containing only vocals, and an instrumental track containing only the instruments. For aspiring DJs and other musicians, as well as for researchers studying MIR (Music Information Retrieval), this is a problem of great interest.

For this project, I compared the performance of four different algorithms which can be used for music/voice separation. (You can learn more about these algorithms in the Analysis portion of this study.) MATLAB code for each of these algorithms is available online at the websites listed in the table below:

Algorithm | Authors | Website
REPET (REpeating Pattern Extraction Technique) | Z. Rafii and B. Pardo [1] | REPET website
Adaptive REPET | Z. Rafii and B. Pardo [1] | REPET website
SVS-RPCA (Singing-Voice Separation from Monaural Recordings Using Robust Principal Component Analysis) | P. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson [2] | SVS-RPCA website
MLRR (Multiple Low-Rank Representation) | Y. Yang [3] | MLRR website

I tested these four algorithms on three songs in my personal library: "Woman Got Culture" by Uness, "Buggin' Out" by Uness, and "On To You" by Clinton Sparks featuring Pitbull and Fatman Scoop. Uness has provided the first two songs' acappella and instrumental tracks for free download here and here. (The separated tracks of the last song were previously accessible as a free download years ago on Clinton Sparks' website, but the download is unfortunately no longer available.)

Although this dataset of songs is much too small to statistically determine the overall relative performance of the algorithms, it nonetheless allows us to explore possible strengths and deficiencies of each algorithm depending on the features of the song being processed. Additionally, the researchers behind these algorithms often do not test them on songs in more modern styles like the ones presented here. Thus, the goal for this study is to carefully examine the performance of these algorithms on a small set of modern-styled songs.

Results: Audio Clips and Numerical Evaluation

Due to copyright restrictions, I cannot present audio for the song "On To You," so I will omit my results for that song. However, Uness has graciously given me permission to present the clips of his songs here, in both original and modified form. (Thanks, Uness!) You can listen to 20-second mono versions of the separated and mixed tracks for Uness's songs below. (Note that for the purposes of this project, the acappella and instrumental tracks have been converted to mono, then mixed together. The actual stereo versions of these songs sound better! For the full mixed versions of these songs, see the Audio Credits at the end of this document.)

[Audio players: mixed, acappella, and instrumental tracks for "Woman Got Culture" and "Buggin' Out"]

In order to test and compare these algorithms, I proceeded as follows. First, I converted all the separated tracks to mono. Then, for each song I mixed the acappella and the instrumental tracks together, de-amplifying these tracks before mixing in order to avoid clipping. Next, I shortened each of the mixed tracks, using only select parts of each song for the tests. Finally, I ran the MATLAB codes for each algorithm (using appropriately chosen parameters for each algorithm) on these shortened, mixed tracks.
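Concretely, the preprocessing amounts to something like the following MATLAB sketch. The file names are placeholders, and the 0.5 gain is one simple way to avoid clipping rather than the exact scaling I used:

    % Convert both stems to mono, then mix with de-amplification.
    [voc, fs]  = audioread('acappella.wav');
    [acc, fs2] = audioread('instrumental.wav');
    assert(fs == fs2, 'Sample rates must match');
    voc = mean(voc, 2);                        % stereo -> mono
    acc = mean(acc, 2);
    n   = min(numel(voc), numel(acc));         % trim to a common length
    mix = 0.5 * (voc(1:n) + acc(1:n));         % de-amplify to avoid clipping
    audiowrite('mix.wav', mix, fs);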

In the first group of tests, I used "super-short" versions of each song. These are clips of the songs which are less than 30 seconds long, include the song's chorus, and start and end approximately in time with the beat. For "Woman Got Culture," the clip consists of the first chorus. For "Buggin' Out," the clip contains the first bridge and half of the first chorus. Below, you can listen to 20-second clips of the results we get from applying each algorithm to these super-short versions:

[Audio players: approximated acappella and instrumental tracks (super-short versions) from REPET, Adaptive REPET, SVS-RPCA, and MLRR, for both songs]

In the second group of tests, I used "short" versions of each song. These are cuts of the songs which are approximately half the original song length (about 1.5 to 2 minutes). For "Woman Got Culture," the cut contains the first chorus, first verse, and second chorus (which is the same as the first chorus). For "Buggin' Out," the cut includes the first verse, bridge, and chorus, and the second verse and bridge. Since the codes for MLRR and SVS-RPCA truncate songs to 30 seconds if they exceed this duration, I did not attempt to run these codes on the short versions. (In practice, to analyze these algorithms on songs longer than 30 seconds, one would separate the songs into clips of 30 seconds or less, as sketched below.) Thus, we only compare the performance of REPET and Adaptive REPET here.
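Such chunking is straightforward; here is a minimal MATLAB sketch, with hypothetical file names, that splits a mixed track into clips of at most 30 seconds:

    % Split a mono mix into clips of at most 30 seconds for SVS-RPCA or MLRR.
    % File names are placeholders for illustration.
    [x, fs] = audioread('mix.wav');
    clipLen = 30 * fs;                         % samples per 30-second clip
    for k = 1:ceil(numel(x) / clipLen)
        seg = x((k-1)*clipLen + 1 : min(k*clipLen, numel(x)));
        audiowrite(sprintf('mix_clip%02d.wav', k), seg, fs);
    end

Below, you can listen to 20-second clips of the results we get from applying each algorithm to these short versions: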

[Audio players: approximated acappella and instrumental tracks (short versions) from REPET and Adaptive REPET, for both songs]

To evaluate the performance of the algorithms, I used version 3.0 of the BSS_EVAL toolbox [4] to measure the quality of the approximate acappella and instrumental tracks produced by each algorithm, comparing these approximate separated tracks with the original separated tracks. Using BSS_EVAL, I calculated three different measures of source separation performance: source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR). For a given separated track, these quantities respectively measure the energy ratio between the source and the algorithm's distortion of the source, the amount of interference from other separated tracks, and the amount of undesired artifacts in the track due to the algorithm [3]. Higher values of SDR, SIR and SAR mean better separation [1].
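Concretely, for each song the evaluation reduces to a single call per algorithm. Here is a sketch, assuming BSS_EVAL 3.0 is on the MATLAB path and that the reference and estimated tracks have been trimmed to a common length (variable names are placeholders):

    % Each row of the matrices below is one source: acappella, then instrumental.
    ref = [acappella(:).'; instrumental(:).'];           % original separated stems
    est = [est_acappella(:).'; est_instrumental(:).'];   % an algorithm's output
    [SDR, SIR, SAR, perm] = bss_eval_sources(est, ref);  % BSS_EVAL 3.0 [4]
    % SDR(1), SIR(1), SAR(1) score the acappella and index 2 the instrumental,
    % up to the source permutation returned in perm.

The following table provides the SDR, SIR, and SAR of the tested algorithms on the "super-short" versions of the songs: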

Super-short versions (all values in dB)

                       |            Woman Got Culture              |               Buggin' Out
Measure  Track         |  REPET   Adapt. REPET  SVS-RPCA    MLRR   |  REPET   Adapt. REPET  SVS-RPCA    MLRR
SDR      Acappella     |  3.3851     3.8246      -3.7959   0.4829  |  1.7857    -4.8574      -1.9389   1.4094
         Instrumental  |  8.3719     9.4220       4.0509  -0.8765  |  7.5184     2.3938       0.1678   4.4081
SIR      Acappella     |  9.7937     8.1015       1.8849   1.4658  |  3.5074    -2.0991      -0.4417   3.5803
         Instrumental  | 10.8160    13.8202       5.6582   0.3999  | 15.8072     3.7428       9.2497   6.3206
SAR      Acappella     |  4.9456     6.4808      -0.2577   9.7566  |  8.2380     2.6058       6.6501   7.0403
         Instrumental  | 12.3791    11.5587      10.1906   7.8792  |  8.3280     9.6589       1.2282   9.8019

The next table provides the SDR, SIR, and SAR of the tested algorithms on the "short" versions of the songs:

Short versions (all values in dB)

                       |    Woman Got Culture     |       Buggin' Out
Measure  Track         |  REPET   Adaptive REPET  |  REPET   Adaptive REPET
SDR      Acappella     |  4.5112      3.6119      |  1.6033     -4.2002
         Instrumental  | 11.7347     10.7645      |  7.3287      2.8912
SIR      Acappella     |  8.8479      8.6794      |  3.4336     -1.1654
         Instrumental  | 17.0143     15.1553      | 15.6480      4.6744
SAR      Acappella     |  7.0379      5.7840      |  7.8633      2.4178
         Instrumental  | 13.3477     12.8594      |  8.1373      8.8920

Analysis

Description of the Algorithms

To properly analyze the performance of these algorithms, it is necessary to first describe the individual algorithms, as well as their possible strengths and weaknesses. For more precise, mathematical details about these algorithms, please consult the papers [1], [2], and [3].

REPET is meant as a "background/foreground" separation algorithm for popular music, and takes advantage of the kind of repetition that is seen in such music. In particular, many popular songs contain instrumental elements that repeat periodically for the entirety of the song, like a repeating chord progression or a looping drum track. REPET searches for these repeating elements and attempts to isolate this repeating "background" material from the "foreground" material, which is non-repeating and generally consists of the vocal line and other non-repeating instrumental lines that accentuate or carry the melody [1].

Adaptive REPET is designed to operate on full songs, by applying REPET to local time windows of the song, with a given overlap time scale for the windows. In particular, whereas REPET assumes that the instrumental contains some portion which repeats for the entire song, Adaptive REPET assumes that this repeating portion may change in various parts of the song (for instance, the verse and the chorus may have different repeating backgrounds) [1].
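To make the central step concrete, here is a minimal MATLAB sketch of REPET's core masking idea. This is not the authors' released code: it assumes the repeating period is already known, whereas the real algorithm estimates it from a beat spectrum and handles several details omitted here [1]. (It needs MATLAB R2019a+ for stft/istft and the Signal Processing Toolbox.)

    % repet_sketch: core REPET step only, assuming the repeating period p
    % (measured in STFT frames) is known. Input x is a mono column vector.
    function [background, foreground] = repet_sketch(x, fs, p)
        N = 2048; win = hann(N, 'periodic'); hop = N/4;
        X = stft(x, fs, 'Window', win, 'OverlapLength', N - hop, 'FFTLength', N);
        V = abs(X);                              % magnitude spectrogram
        W = zeros(size(V));
        for t = 1:size(V, 2)
            idx = [t:-p:1, t+p:p:size(V, 2)];    % frames spaced p apart
            W(:, t) = median(V(:, idx), 2);      % repeating-background model
        end
        M = min(W, V) ./ max(V, eps);            % soft background mask in [0, 1]
        y = real(istft(M .* X, fs, 'Window', win, ...
                       'OverlapLength', N - hop, 'FFTLength', N));
        n = min(numel(x), numel(y));
        background = y(1:n);
        foreground = x(1:n) - background;        % vocals/melody as the residual
    end

Adaptive REPET's windowed procedure then amounts to applying a step like this on overlapping local windows of the song, re-estimating the repeating background in each window.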

      Possible strengths and weaknesses of REPET and Adaptive REPET:
  • REPET and Adaptive REPET carefully take into account the repeating structure of a song.
  • They also run quickly compared to other music separation algorithms (see [1] for more details).
  • However, the theoretical formulation does not account for gradual tempo or volume changes, and if the instrumental changes too often during the song, then even Adaptive REPET may see undesirable results due to insufficient repetition.
  • Additionally, REPET and Adaptive REPET do not distinguish between a melodic, non-repeating line in the instrumental and a melodic, non-repeating vocal line. This may prove undesirable for music/voice separation.
  • These algorithms require input from the user to help determine the repeating section(s). This may make automatic processing of a large number of songs more difficult.

SVS-RPCA applies the robust principal component analysis (RPCA) algorithm of Candès, Li, Ma, and Wright [5], a matrix factorization technique that decomposes a matrix into a low-rank part plus a sparse part. The method assumes that the repetitive nature of the instrumental elements relegates them to a low-rank subspace, whereas the voice, which has more variation, is sparse [2].
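For intuition, the low-rank-plus-sparse split can be computed with the standard inexact augmented Lagrange multiplier (ALM) iteration associated with [5]. The following is a generic textbook sketch applied to a magnitude spectrogram M, not the released SVS-RPCA code, and its parameter choices are illustrative:

    % rpca_sketch: generic inexact-ALM RPCA [5]; illustrative, not the released
    % SVS-RPCA code. Splits M into low-rank Lr (accompaniment) plus sparse S
    % (voice) by approximately minimizing ||Lr||_* + lambda*||S||_1.
    function [Lr, S] = rpca_sketch(M)
        [m, n]  = size(M);
        lambda  = 1 / sqrt(max(m, n));            % standard RPCA weight [5]
        mu = 1.25 / norm(M, 2);  rho = 1.5;  mu_max = mu * 1e7;
        Y  = M / max(norm(M, 2), norm(M(:), inf) / lambda);  % dual initialization
        Lr = zeros(m, n);  S = zeros(m, n);
        for iter = 1:500
            % Singular value thresholding: update the low-rank part
            [U, Sig, V] = svd(M - S + Y/mu, 'econ');
            Lr = U * max(Sig - 1/mu, 0) * V';
            % Soft thresholding: update the sparse part
            R = M - Lr + Y/mu;
            S = sign(R) .* max(abs(R) - lambda/mu, 0);
            Z  = M - Lr - S;                      % constraint residual
            Y  = Y + mu * Z;
            mu = min(mu * rho, mu_max);
            if norm(Z, 'fro') / norm(M, 'fro') < 1e-7, break; end
        end
    end

Roughly speaking, SVS-RPCA then recovers a vocal waveform from the sparse part using a time-frequency mask together with the mixture's phase [2].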

MLRR attempts to improve on the SVS-RPCA algorithm by instead using a decomposition into two low-rank matrices and one sparse matrix. The two low-rank matrices represent subspaces of sounds corresponding to the voice and instrumentals, and the sparse matrix contains deviations from these subspaces. An online dictionary algorithm is used to infer the subspace structures of the vocal and instrumental sounds [3].
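Schematically, and up to constants and notational choices, the decomposition described in [3] has the form

    M ≈ A*Z1 + B*Z2 + E,

where M is the mixture's magnitude spectrogram, A and B are learned dictionaries for the instrumental and vocal subspaces, the coefficient matrices Z1 and Z2 are encouraged to be low-rank (via nuclear-norm penalties), and the residual E is encouraged to be sparse (via an entrywise 1-norm penalty).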

      Possible strengths and weaknesses of SVS-RPCA and MLRR:
  • SVS-RPCA is essentially a blind separation algorithm; that is, it requests no information from the user other than the original song. This allows for easier automatic processing, but leaves the user without a way to tailor the separation to a particular song.
  • As Yang discusses in [3], SVS-RPCA may fail to properly distinguish between sparse data in the vocals and sparse data in the instrumental. If the data are simultaneously sparse and resemble each other in some way, this may lead to errors.
  • Though MLRR provides more structure than SVS-RPCA, it is no longer a blind separation, requiring more computation time and an appropriate online dictionary algorithm.
  • Nonetheless, the online dictionary algorithm of MLRR provides a wealth of information for better estimation.
  • Still, as Yang explains in [3], the separated sounds are a linear combination of the dictionary material, which leads to unwanted artifacts. Indeed, if the dictionary is not chosen carefully, the instruments in MLRR's approximation may sound quite different from the original instruments.

Description of the Songs

Having described the algorithms, we can now discuss their performance on these two songs. First, we should describe these songs in depth.

"Woman Got Culture" is an R&B song with a tempo of 100 beats per minute (BPM) and a length of 3:34. The elements of this song are as follows. Uness sings the vocals himself, with vocal lines sometimes overlapping each other. He is backed by an instrumental consisting of an acoustic guitar, a light percussive drum sound, and a snapping sound. These are looped over the entire song, and the chord progression consists of the same four chords every four measures. The instrumental appears to change slightly in volume in transitions between the chorus and the verse. Additionally, the guitar part has a different melodic line in the verse and in the chorus. Nonetheless, despite these nuances, the instrumental stays fairly consistent throughout the whole song. Thus, we expect good results with REPET. However, we must also note that the guitar melody and the vocal melody tend to stay in the same range of notes (frequencies), often landing on the same notes simultaneously. Thus, we expect that this may cause some problems for RPCA, which may associate the melodic guitar line with the vocals.

"Buggin' Out" is a song in the dance genre with a tempo of about 128 BPM, and a length of 3:25. This song consists of a reverberated vocal track above artificially-synthesized instrumental sounds. We expect this song to be difficult for our algorithms, for the following reasons. Uness's vocals are smooth and drawn out, whereas the instrumental includes a melodic line with a more staccato character, with an especially sharp, punctuating sound in the chorus. We expect this to give some problems for RPCA and MLRR, which may mistake the instrumental melody for the vocal melody. Indeed, since the instrumental and vocal both contain melody lines, REPET and Adaptive REPET may separate the instrumental melody from the rest of the instrumental track (as they are designed to do). Finally, the instrumental notably changes in volume leading up to the chorus, and the number of instrumental elements increases once the chorus begins. We expect all these factors to give problems for both REPET and Adaptive REPET.

Choice of Parameters

Three of these algorithms' codes require the specification of parameters by the user. The code for MLRR requires the user to specify whether an acappella or an instrumental track is desired from the separation process. REPET requires the user to specify a time interval that approximates the location of a repeating segment in the instrumental. Adaptive REPET does the same, and additionally requires the specification of parameters for a windowing procedure, where it searches for different repeating instrumental segments as the song progresses. (For instance, a verse and a chorus may have different instrumentals that nonetheless repeat [1].) These parameters, in order, describe a window length, a step length, and the order of a median filter (see [1] for more details).

In the following table, I give the time markers of the original songs which delineate the "super-short" and "short" versions under the heading "Song cut location". (For example, the super-short version of "Buggin' Out" consists of the portion of the original song which goes from the time 0:30.958 to the time 1:00.955, in minutes:seconds.) Then, I give the parameters used for REPET and Adaptive REPET.

Parameter choices

"Woman Got Culture"
  Version           Song cut location      Approx. repeating segment location   Window parameters
  super-short ver.  0:01.707 to 0:30.057   0:00.700 to 0:10.300                 [24,12,7]
  short ver.        0:00.000 to 1:37.600   0:02.400 to 0:12.000                 [24,12,7]

"Buggin' Out"
  Version           Song cut location      Approx. repeating segment location   Window parameters
  super-short ver.  0:30.958 to 1:00.955   0:00.100 to 0:07.610                 [24,12,7]
  short ver.        0:15.900 to 1:46.259   0:00.100 to 0:07.610                 [24,12,7]

Note 1: Although I happened to use the same window parameters for all versions of both songs, in general these parameters should be chosen per song.
Note 2: The approximate repeating segment location is the position in the super-short or short version, and not the position in the original song. This is important because our cuts do not always start at time 0 of the original song.

Performance Evaluation: "Super-Short" Versions

Now, we evaluate the performance of the algorithms on the "super-short" versions of our two songs.

"Woman Got Culture"
  • REPET and Adaptive REPET give overall superior results to the other two algorithms, both subjectively and numerically. This performance is likely due to the consistency of the repetition in the instrumental. Still, these two algorithms do not completely succeed at keeping the guitar out of the acappella, or the vocals out of the instrumental.
  • SVS-RPCA seems to have trouble distinguishing between the vocals and the guitar line in the instrumental, which often hit the same notes at the same times. Much of the vocal material shows up in the approximate instrumental track, leading to a low SDR for the acappella track. A low-frequency drum may be the reason for the low SIR, but the drum could likely be removed with a high-pass filter (sketched at the end of this section).
  • MLRR improves slightly on the results of SVS-RPCA for the acappella track. However, the vocals feature very prominently in the instrumental track. Additionally, in both tracks, there is a percussion sound which does not appear in the original song, replacing the snapping noise. This is an artifact from the online dictionary algorithm. Despite this new sound, the SAR values for MLRR are fairly high, which suggests a low amount of other, less audible artifacts. Nonetheless, the presence of the new sound suggests one should be cautious when using statistical measures to evaluate performance.
"Buggin' Out"
  • As expected, this song gives all four of the algorithms trouble.
  • Subjectively (but not numerically), SVS-RPCA gives the best results. Most elements of the instrumental are quiet in the acappella track, except for a low-frequency drum (which is the likely cause of the bad numbers). This can be removed via the high-pass filter sketched after this list. However, the vocals are of a lower quality than would be desirable in practice (e.g. for creating a remix). Additionally, SVS-RPCA accurately captures the "character" (volume, rhythm, and emphasis) of the instrumental, which cannot be said for two of the other algorithms.
  • MLRR and REPET are quite successful numerically, due to the relative loudness of the vocals in the acappella tracks and the synth in the instrumental tracks. However, the melodic line of the instrumental features more loudly in these acappella tracks than it does for SVS-RPCA, and the "character" of the instrumental track is not preserved. (Additionally, in the case of REPET, the low-frequency drum is prominent in the acappella.)
  • It is not a surprise that REPET and Adaptive REPET include the instrumental's melodic line in the acappella track, since these algorithms are designed to separate melodic elements from non-melodic ones.
  • Adaptive REPET scores poorly numerically because it forces too much of the vocals into the instrumental. Notably, during the chorus, the smooth, drawn-out vocals are mistaken for the instrumental, and the staccato melodic line in the instrumental is mistaken for vocals. However, the "character" of the instrumental is preserved, since Adaptive REPET adjusts when the song switches from the verse to the chorus. (For REPET, the instrumental sounds messy in terms of rhythm and emphasis. This is especially true during the chorus, because I specified a portion of the verse for REPET to use as an approximate repeating element.)
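As mentioned in the bullets above, a low-frequency drum leaking into an approximated acappella can be suppressed with a simple high-pass filter. Here is a minimal MATLAB sketch, where the 120 Hz cutoff is an illustrative choice rather than a tuned value:

    % Suppress a low-frequency drum in an estimated acappella track.
    fc = 120;                                      % cutoff in Hz (illustrative)
    [b, a] = butter(4, fc / (fs/2), 'high');       % 4th-order Butterworth high-pass
    acappella_hp = filtfilt(b, a, est_acappella);  % zero-phase filtering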

Performance Evaluation: "Short" Versions

Lastly, we evaluate the performance of REPET and Adaptive REPET on the "short" versions of our two songs. We note that the performance is similar to the super-short versions, for both algorithms and both songs.

"Woman Got Culture"
  • REPET performs generally better here than it does for the super-short version. This is likely because the instrumental changes very little as the song goes along, and because the short version gives more information for REPET to work with.
  • Adaptive REPET performs better for the most part. However, the SDR and SAR for the acappella tracks are slightly lower here. This is due to the second half of the first verse, which features smooth, non-melodic background vocals. Since this algorithm is built to separate melodic parts from non-melodic parts, the non-melodic vocals are mostly sent to the instrumental track.
  • Note that REPET does not send the non-melodic background vocals to the instrumental track here (for the most part), because I specified a portion of the song without background vocals, from which REPET approximates a single repeating background for the entire song. (Recall that Adaptive REPET re-estimates the repeating background as the song goes on.)

"Buggin' Out"
  • In terms of SDR, SIR, and SAR, REPET numerically performs worse than it did for the super-short version, and Adaptive REPET performs slightly better than it did for the super-short version. This is due to REPET not accounting for the major changes in the instrumental as the song progresses, while Adaptive REPET adjusts for these changes.

Conclusion

Though the dataset for this study is small, we have nonetheless seen various problems one may encounter while attempting audio source separation on modern-styled songs. To obtain statistically significant results, I would like to test these algorithms on a larger dataset consisting of a diverse set of modern-styled songs. In fact, at the current time, there is still a need for a large, genre-diverse, and publicly available dataset of songs with their separated acappella and instrumental tracks - not just for music/voice separation but also for other MIR-related studies such as melody extraction [6]. We note that the authors of the four algorithms all used the MIR-1K dataset of Hsu and Jang [7], consisting of 1000 short clips of Chinese songs with separated vocal and instrumental tracks, to statistically verify their results. However, as explained in [6], the utility of this dataset is limited by the strong stylistic similarity between the songs.

As another note, we have seen that the measures of BSS_EVAL may not give a description of separation performance that completely matches a subjective evaluation. In future work, I would like to use the newer PEASS toolkit, which provides perceptually motivated evaluation measures, to evaluate the performance of these algorithms.

References

[1] Z. Rafii and B. Pardo, Repeating pattern extraction technique (REPET): A simple method for music/voice separation, IEEE Transactions on Audio, Speech, and Language Processing, 21(1) (2013), 73-84.

[2] P. Huang, S. D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis, Proc. ICASSP (2012), 57-60.

[3] Y. Yang, Low-rank representation of both singing voice and music accompaniment via learned dictionaries, International Society for Music Information Retrieval Conference (ISMIR), November 2013.

[4] C. Févotte, R. Gribonval and E. Vincent, BSS_EVAL Toolbox User Guide, IRISA Technical Report 1706, Rennes, France, April 2005. http://www.irisa.fr/metiss/bss_eval/.

[5] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, J. ACM, 58(3), Art. 11, pp. 11:1-11:37, Jun. 2011.

[6] J. Salamon, E. Gómez, D. P. W. Ellis and G. Richard, Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges, IEEE Signal Processing Magazine, 31(2):118-134, Mar. 2014.

[7] C. Hsu and J. Jang, On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset, IEEE Transactions on Audio, Speech, and Language Processing, 18(2) (2010), 310-319.

Audio Credits

Songs used with permission from Uness. Thanks again to Uness for allowing me to present his songs in the audio clips here!

"Woman Got Culture" is ©2009 Uness/Crime in the City, written by Uness. It comes from Uness's mixtape The Seeker, Vol. 1.

"Buggin' Out" is ©2010 Uness/Crime in the City, written by Uness and Mandee, and produced by Dennis-Manuel Peters, Daniel Coriglie and Mario Bakovic for T-Town Productions. It comes from Uness's album Fashionably Late.

You can download a lot of Uness's music for free at his Bandcamp page here. Uness's style varies, but his music is generally in the R&B, hip-hop, and dance genres. I personally recommend Gunslinger, Used to Love You, Back to Your Heart, and That Day, although he has many other great songs.

"On To You" is ©2009 C.Sparks Entertainment, Inc. It is produced by Logan de Gaulle (Clinton Sparks & DJ Snake) and written by Clinton Sparks, William Grigahcine, Armando Christian Pérez, and Isaac Freeman III. It comes from Clinton Sparks's single Pilot Episode: On To You Feat. Pitbull & Fatman Scoop.

