This page contains a presentation of the final project I did for the course MAT 271: Applied & Computational Harmonic Analysis, taught by Professor Naoki Saito. Please feel free to email me if you have any comments or questions!
Music/voice separation refers to the problem of trying to separate vocals from instrumentals in a song, in order to produce an acappella track containing only vocals, and an instrumental track containing only the instruments. For aspiring DJs and other musicians, as well as for researchers studying MIR (Music Information Retrieval), this is a problem of great interest.
For this project, I compared the performance of four different algorithms which can be used for music/voice separation. (You can learn more about these algorithms in the Analysis portion of this study.) These four algorithms have their MATLAB codes available online at the websites listed in the table below:
I tested these four algorithms on three songs in my personal library: "Woman Got Culture" by Uness, "Buggin' Out" by Uness, and "On To You" by Clinton Sparks featuring Pitbull and Fatman Scoop. Uness has provided the first two songs' acappella and instrumental tracks for free download here and here. (The separated tracks of the last song were previously accessible as a free download years ago on Clinton Sparks' website, but the download is unfortunately no longer available.)
Although this dataset of songs is much too small to statistically determine the overall relative performance of the algorithms, the dataset nonetheless allows us to explore possible strengths and deficiencies of each algorithm depending on the features of the song being processed. Additionally, researchers of these algorithms often do not test their results on songs of more modern styles as presented here. Thus, the goal for this study is to carefully examine the performance of these algorithms on a small set of modern-styled songs.
Results: Audio Clips and Numerical Evaluation
Due to copyright restrictions, I cannot present audio for the song "On To You," so I will omit my results for that song. However, Uness has graciously given me permission to present the clips of his songs here, in both original and modified form. (Thanks, Uness!) You can listen to 20-second mono versions of the separated and mixed tracks for Uness's songs below. (Note that for the purposes of this project, the acappella and instrumental tracks have been converted to mono, then mixed together. The actual stereo versions of these songs sound better! For the full mixed versions of these songs, see the Audio Credits at the end of this document.)
In order to test and compare these algorithms, I proceeded as follows. First, I converted all the separated tracks to mono. Then, for each song I mixed the acappella and the instrumental tracks together, de-amplifying these tracks before mixing in order to avoid clipping. Next, I shortened each of the mixed tracks, using only select parts of each song for the tests. Finally, I ran the MATLAB codes for each algorithm (using appropriately chosen parameters for each algorithm) on these shortened, mixed tracks.
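The mono conversion and clipping-safe mixdown can be sketched in Python (the project itself used MATLAB; the function names and the 0.5 gain below are illustrative choices, not the actual values used):

```python
import numpy as np

def to_mono(stereo):
    """Average the two channels of an (n_samples, 2) float array."""
    return stereo.mean(axis=1)

def mix_tracks(acappella, instrumental, gain=0.5):
    """Sum two mono tracks after de-amplifying each by `gain`,
    so the mixture stays within [-1, 1] and does not clip."""
    n = min(len(acappella), len(instrumental))  # align lengths
    return gain * acappella[:n] + gain * instrumental[:n]
```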
In the first group of tests, I used "super-short" versions of each song: clips under 30 seconds in length that include the song's chorus and are cut so that they start and end approximately in time with the beat. For "Woman Got Culture," the clip consists of the first chorus. For "Buggin' Out," the clip contains the first bridge and half of the first chorus. Below, you can listen to 20-second clips of the results we get from applying each algorithm to these super-short versions:

In the second group of tests, I used "short" versions of each song. These are cuts of the songs which are approximately half the original song length (about 1.5 to 2 minutes). For "Woman Got Culture," the cut contains the first chorus, first verse, and second chorus (which is the same as the first chorus). For "Buggin' Out," the cut includes the first verse, bridge, and chorus, and the second verse and bridge. Since the codes for MLRR and RPCA truncate songs to 30 seconds when they exceed this duration, I did not attempt to run these codes on the short versions. (In practice, to apply these algorithms to songs longer than 30 seconds, one would split the songs into clips of 30 seconds or less.) Thus, we only compare the performance of REPET and Adaptive REPET here. Below, you can listen to 20-second clips of the results we get from applying each algorithm to these short versions:
To evaluate the performance of the algorithms, I used version 3.0 of the BSS_EVAL toolbox to measure the quality of the approximate acappella and instrumental tracks produced by each algorithm, comparing these approximate separated tracks with the original separated tracks. Using BSS_EVAL, I calculated three different measures of source separation performance: source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR). For a given separated track, these quantities respectively measure the energy ratio between the source and the algorithm's distortion of the source, the amount of interference from other separated tracks, and the amount of undesired artifacts in the track due to the algorithm. Higher values of SDR, SIR, and SAR mean better separation. The following table provides the SDR, SIR, and SAR of the tested algorithms on the "super-short" versions of the songs:
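As an illustration of what these ratios measure, here is a deliberately simplified SDR in Python: the estimate is projected onto the reference (gain only), and everything left over is counted as distortion. BSS_EVAL's actual computation is more general; it allows a short time-invariant distortion filter and further splits the error into interference and artifact terms to produce SIR and SAR.

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified source-to-distortion ratio in dB.
    Projects the estimate onto the reference signal and treats
    the entire residual as distortion (gain-invariant, no filter)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    s_target = (estimate @ reference) / (reference @ reference) * reference
    error = estimate - s_target
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum(error**2))
```

A nearly clean estimate yields a high SDR, while a heavily contaminated one yields a low SDR, matching the "higher is better" convention above.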
The next table provides the SDR, SIR, and SAR of the tested algorithms on the "short" versions of the songs:
Description of the Algorithms
To properly analyze the performance of these algorithms, it is necessary to first describe the individual algorithms, as well as their possible strengths and weaknesses. For more precise, mathematical details about these algorithms, please consult the papers listed in the references at the end of this page.
REPET is meant as a "background/foreground" separation algorithm for popular music, and takes advantage of the kind of repetition that is seen in such music. In particular, many popular songs contain instrumental elements that repeat periodically for the entirety of the song, like a repeating chord progression or a looping drum track. REPET searches for these repeating elements and attempts to isolate this repeating "background" material from the non-repeating "foreground" material, which generally consists of the vocal line and any non-repeating instrumental lines that accentuate or provide the melody.
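The core masking step of REPET can be sketched as follows, assuming the repeating period (measured in spectrogram frames) is already known. This is only the median-model step: the real algorithm also estimates the period automatically from a beat spectrum, and applies the resulting mask to the complex STFT before inverting back to audio.

```python
import numpy as np

def repet_mask(V, period):
    """Given a magnitude spectrogram V (freq x frames) and a repeating
    period in frames, build a soft mask for the repeating background.
    The median across repetitions models the repeating part, since
    non-repeating foreground events are outliers at each position."""
    n_freq, n_frames = V.shape
    n_seg = int(np.ceil(n_frames / period))
    pad = n_seg * period - n_frames
    Vp = np.pad(V, ((0, 0), (0, pad)), constant_values=np.nan)
    segs = Vp.reshape(n_freq, n_seg, period)   # group frames by period
    W = np.nanmedian(segs, axis=1)             # repeating segment model
    W_full = np.tile(W, n_seg)[:, :n_frames]
    W_full = np.minimum(W_full, V)             # background cannot exceed mix
    return W_full / np.maximum(V, 1e-12)       # soft mask in [0, 1]
```

Multiplying the mixture spectrogram by this mask estimates the instrumental; multiplying by its complement estimates the vocals.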
Adaptive REPET is designed to operate on full songs by applying REPET to local time windows of the song, with a given overlap between windows. In particular, whereas REPET assumes that the instrumental contains some portion which repeats for the entire song, Adaptive REPET assumes that this repeating portion may change in various parts of the song (for instance, the verse and the chorus may have different repeating backgrounds).
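The windowing itself amounts to sweeping overlapping local segments over the song. A minimal sketch (with illustrative function and parameter names; the actual method additionally smooths per-frame period estimates with a median filter):

```python
def local_windows(n_frames, win_len, step):
    """Yield (start, stop) frame indices of overlapping local windows.
    Adaptive REPET models a possibly different repeating background
    inside each window, so the background may change between sections."""
    start = 0
    while start < n_frames:
        stop = min(start + win_len, n_frames)
        yield start, stop
        if stop == n_frames:
            break
        start += step
```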
Possible strengths and weaknesses of REPET and Adaptive REPET:
SVS-RPCA implements an algorithm from Candès, Li, Ma, and Wright called robust principal component analysis. It assumes that the repetitive nature of the instrumental elements relegates them to a low-rank subspace, whereas the voice is sparse and has more variation. Thus, Huang et al. implement this robust principal component analysis technique, a matrix factorization algorithm that decomposes a matrix into a low-rank part and a sparse part.
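A minimal version of this low-rank-plus-sparse decomposition can be written with singular value thresholding and entrywise soft-thresholding. The sketch below follows the standard inexact augmented Lagrangian scheme for Principal Component Pursuit; it is not the tuned solver released with SVS-RPCA, and in the music setting M would be the magnitude spectrogram of the mixture.

```python
import numpy as np

def rpca(M, lam=None, n_iter=100):
    """Decompose M into low-rank L (repeating accompaniment) plus
    sparse S (voice) via a basic inexact augmented Lagrangian loop."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(M, 2)          # initial penalty weight
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    for _ in range(n_iter):
        # low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt
        # sparse update: entrywise soft-thresholding
        S = shrink(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - S)                 # dual variable update
        mu = min(mu * 1.5, 1e7)               # gradually tighten penalty
    return L, S
```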
MLRR attempts to improve on the SVS-RPCA algorithm by instead using a decomposition into two low-rank matrices and one sparse matrix. The two low-rank matrices represent subspaces of sounds corresponding to the voice and the instrumentals, and the sparse matrix contains deviations from these subspaces. An online dictionary-learning algorithm is used to infer the subspace structures of the vocal and instrumental sounds.
Possible strengths and weaknesses of SVS-RPCA and MLRR:
Description of the Songs
Having described the algorithms, we can now discuss their performance on these two songs. First, we should describe these songs in depth.
"Woman Got Culture" is an R&B song with a tempo of 100 beats per minute (BPM) and a length of 3:34. The elements of this song are as follows. Uness sings the vocals himself, with vocal lines sometimes overlapping each other. He is backed by an instrumental consisting of an acoustic guitar, a light percussive drum sound, and a snapping sound. These are looped over the entire song, and the chord progression consists of the same four chords every four measures. The instrumental appears to change slightly in volume in transitions between the chorus and the verse. Additionally, the guitar part has a different melodic line in the verse and in the chorus. Nonetheless, despite these nuances, the instrumental stays fairly consistent throughout the whole song. Thus, we expect good results with REPET. However, we must also note that the guitar melody and the vocal melody tend to stay in the same range of notes (frequencies), often landing on the same notes simultaneously. Thus, we expect that this may cause some problems for RPCA, which may associate the melodic guitar line with the vocals.
"Buggin' Out" is a song in the dance genre with a tempo of about 128 BPM and a length of 3:25. This song consists of a reverberated vocal track above artificially synthesized instrumental sounds. We expect this song to be difficult for our algorithms, for the following reasons. Uness's vocals are smooth and drawn out, whereas the instrumental includes a melodic line with a more staccato character, with an especially sharp, punctuating sound in the chorus. We expect this to cause some problems for RPCA and MLRR, which may mistake the instrumental melody for the vocal melody. Similarly, since the instrumental contains its own melodic line, REPET and Adaptive REPET may separate this instrumental melody from the rest of the instrumental track (as they are designed to do with non-repeating foreground material). Finally, the instrumental notably changes in volume leading up to the chorus, and the number of instrumental elements increases once the chorus begins. We expect all these factors to cause problems for both REPET and Adaptive REPET.
Choice of Parameters
Three of these algorithms' codes require the user to specify parameters. The code for MLRR requires the user to specify whether an acappella or an instrumental track is desired from the separation process. REPET requires the user to specify a time interval that approximates the location of a repeating segment in the instrumental. Adaptive REPET does the same, and additionally requires parameters for a windowing procedure, in which it searches for different repeating instrumental segments as the song progresses. (For instance, a verse and a chorus may have different instrumentals that nonetheless repeat.) These parameters, in order, describe a window length, a step length, and the order of a median filter (see the REPET references for more details).
In the following table, I give the time markers of the original songs which delineate the "super-short" and "short" versions under the heading "Song cut location". (For example, the super-short version of "Buggin' Out" consists of the portion of the original song which goes from the time 0:30.958 to the time 1:00.955, in minutes:seconds.) Then, I give the parameters used for REPET and Adaptive REPET.
Note 1: Although the window parameters are the same for all versions of both songs, they generally change depending on the song used.
Note 2: The approximate repeating segment location is the position in the super-short or short version, and not the position in the original song. This is important because our cuts do not always start at time 0 of the original song.
Performance Evaluation: "Super-Short" Versions
Now, we evaluate the performance of the algorithms on the "super-short" versions of our two songs.
"Woman Got Culture"
Performance Evaluation: "Short" Versions
Lastly, we evaluate the performance of REPET and Adaptive REPET on the "short" versions of our two songs. We note that, for both algorithms and both songs, the performance is similar to that on the super-short versions.
"Woman Got Culture"
Though the dataset for this study is small, we have nonetheless seen various problems one may encounter while attempting audio source separation on modern-styled songs. To extend this study to statistically significant results, I would like to test these algorithms on a larger dataset consisting of a diverse set of modern-styled songs. In fact, at the current time, there is still a need for a large, genre-diverse, and publicly available dataset of songs containing their separated acappella and instrumental tracks - not just for music/voice separation but also for other MIR-related studies such as melody extraction. We note that the authors of the four algorithms all used the MIR-1K dataset of Hsu & Jang, consisting of 1000 short clips of Chinese songs with their vocal and instrumental tracks separated, to statistically verify the results of their work. However, as has been noted in the literature, the utility of this dataset is limited by the strong similarity in style between the songs.
As another note, we have seen how the measures of BSS_EVAL may not necessarily give a description of separation performance that completely matches a subjective evaluation. In future work, I would like to use the newer PEASS toolkit, which is designed for perceptually motivated evaluation, to assess the performance of these algorithms.
Z. Rafii and B. Pardo, Repeating pattern extraction technique (REPET): A simple method for music/voice separation, IEEE Transactions on Audio, Speech, and Language Processing 21.1 (2013), 73-84.
 P. Huang, S. D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis, Proc. ICASSP (2012), 57-60.
 Y. Yang, Low-rank representation of both singing voice and music accompaniment via learned dictionaries, International Society for Music Information Retrieval Conference (ISMIR), November 2013.
 C. Févotte, R. Gribonval and E. Vincent, BSS_EVAL Toolbox User Guide, IRISA Technical Report 1706, Rennes, France, April 2005. http://www.irisa.fr/metiss/bss_eval/.
E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, J. ACM, vol. 58, pp. 11:1-11:37, Jun. 2011.
 J. Salamon, E. Gómez, D. P. W. Ellis and G. Richard, Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges, IEEE Signal Processing Magazine, 31(2):118-134, Mar. 2014.
C. Hsu and J. Jang, On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset, IEEE Transactions on Audio, Speech, and Language Processing 18.2 (2010), 310-319.
Songs used with permission from Uness. Thanks again to Uness for allowing me to present his songs in the audio clips here!
"Woman Got Culture" is ©2009 Uness/Crime in the City, written by Uness. It comes from Uness's mixtape The Seeker, Vol. 1.
"Buggin' Out" is ©2010 Uness/Crime in the City, written by Uness and Mandee, and produced by Dennis-Manuel Peters, Daniel Coriglie and Mario Bakovic for T-Town Productions. It comes from Uness's album Fashionably Late.
You can download a lot of Uness's music for free at his Bandcamp page here. Uness's style varies, but his music is generally in the R&B, hip-hop, and dance genres. I personally recommend Gunslinger, Used to Love You, Back to Your Heart, and That Day, although he has many other great songs.
"On To You" is ©2009 C.Sparks Entertainment, Inc. It is produced by Logan de Gaulle (Clinton Sparks & DJ Snake) and written by Clinton Sparks, William Grigahcine, Armando Christian Pérez, and Isaac Freeman III. It comes from Clinton Sparks's single Pilot Episode: On To You Feat. Pitbull & Fatman Scoop.