Saturday 4 January 2014

Ministry of Recycled Sound

An advert for the latest Ministry of Sound album came on TV the other day. As usual, they played samples of some of the tunes you'll get on the album - one of which was Darude - Sandstorm. It occurred to me that all of the Ministry of Sound albums I can remember watching adverts for contained this one track. This made me wonder how many other songs kept getting put on the MoS albums repeatedly.

What I wanted to see was a list of tracks sorted by how many times they had appeared on a Ministry of Sound album. I had a hunch that Sandstorm would be at the top.


Fetch the database

FreeDB is a license-free database for looking up the track listings of CDs. We start by downloading the whole database (a bzipped file of roughly 800MB) and extracting its contents to the filesystem.
Each extracted file (000a9612, 00073614 etc.) represents a single CD, and the files are organised into genre directories. The files contain, among other things, the track listing of the CD.

Search for Ministry of Sound CDs

Since there are over three million CDs listed in FreeDB, the files needed paring down a little.
A grep command searches recursively and case-insensitively for the exact string "ministry of sound". grep normally outputs the path of each matching file followed by the matching line,
but all we care about is the file name (4512b117, for example), so we use a Perl regex to capture and print it with a newline appended. This produces duplicate file names, so we pipe the output to uniq and finally write it to a file called ministry.txt. The command took a long time to complete, but it would have been a whole lot slower had we done it using Python.
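For anyone without grep to hand, the same search can be sketched in Python, accepting that it will be slower. This is only an illustration: the function name find_matching_cds and the choice of latin-1 decoding (picked because it never fails on arbitrary bytes) are assumptions, not part of the original workflow.

```python
import os


def find_matching_cds(root, needle="ministry of sound"):
    """Return the sorted, de-duplicated names of files under root containing needle."""
    needle = needle.lower()
    matches = set()  # a set plays the role of uniq here
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: latin-1 decodes any byte sequence, so no file is skipped.
            with open(path, encoding="latin-1") as handle:
                if needle in handle.read().lower():
                    matches.add(name)
    return sorted(matches)


def write_list(names, list_path):
    """Write one file name per line, as ministry.txt is described above."""
    with open(list_path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
```

Calling write_list(find_matching_cds("freedb"), "ministry.txt") would then produce the list, where "freedb" is a placeholder for wherever the database was extracted.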

Now we have a list of Ministry of Sound CDs, we copy the files into their own directory since we don't care what genre they're listed as.
We now have all of the CD files in one directory so we can move on to using Python to figure out the answer to our question.
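As an illustration of the copy step, here is a minimal Python sketch. It assumes ministry.txt holds one file name per line and that the destination directory sits outside the extracted database; the function name copy_listed_cds is hypothetical. If the same file name exists under two genres, the later copy simply overwrites the earlier one.

```python
import os
import shutil


def copy_listed_cds(list_path, source_root, dest_dir):
    """Copy every file named in list_path from under source_root into dest_dir."""
    with open(list_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    os.makedirs(dest_dir, exist_ok=True)
    copied = 0
    # Walk every genre directory; the genre itself is deliberately ignored.
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            if name in wanted:
                shutil.copy(os.path.join(dirpath, name), os.path.join(dest_dir, name))
                copied += 1
    return copied
```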

Parse the CD files

The first issue we need to solve is that some of the track names are too long to fit on a single line in the CD file. FreeDB handles this by splitting the name across several lines that repeat the same key.
We create a function which parses each line of the CD file and creates a dictionary of the keys and values. If a key is already in the dictionary, it will concatenate the value onto it.
Within this function we also replace certain characters and strings. Some of the delimiters are inconsistent in FreeDB, so we make an effort to replace them all with "/". We also remove any occurrences of " (original mix)" since these are indeed the original tracks.
We create a REPLACEMENTS constant to put at the top of our script and pass it to the parse_cd function at runtime.
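A minimal sketch of how such a parser might look is given here. The entries in REPLACEMENTS are illustrative guesses rather than the list actually used, and lowercasing the values is an added assumption so that " (original mix)" matches regardless of capitalisation. The replacements run only after the split lines have been joined, so a delimiter broken across two lines is still caught.

```python
# Hypothetical replacement pairs; the real list depends on what turns up in the data.
REPLACEMENTS = [
    (" - ", " / "),
    (" (original mix)", ""),
]


def parse_cd(lines, replacements):
    """Turn the lines of one CD file into a dictionary of keys and values."""
    raw = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip comments and anything that is not of the form KEY=value.
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # A repeated key continues the previous value, so concatenate.
        raw[key] = raw.get(key, "") + value

    cleaned = {}
    for key, value in raw.items():
        value = value.lower()  # assumption: compare titles case-insensitively
        for old, new in replacements:
            value = value.replace(old, new)
        cleaned[key] = value
    return cleaned


sample = [
    "# xmcd",
    "DISCID=4512b117",
    "TTITLE0=Darude - Sand",
    "TTITLE0=storm (Original Mix)",
]
print(parse_cd(sample, REPLACEMENTS)["TTITLE0"])  # prints "darude / sandstorm"
```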

Count up the tracks

We now have all the tools we need to count up the tracks.
We create a tracks_count dictionary to store the number of appearances of each track. Next, we list the directory which contains all of the CD files and parse the files. We grab the tracks (the key always starts with "TTITLE") and either increment an existing entry in tracks_count or set it to 1 if it is the track's first appearance on a CD. Finally, we sort the tracks_count dictionary in ascending order of count, so that the most common track appears last on our console output, and then print the whole lot.
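The counting and sorting can be sketched as follows. To keep the example self-contained it works on already-parsed CD dictionaries rather than reading the directory, and the function names count_tracks and sort_ascending are hypothetical.

```python
def count_tracks(cds):
    """Count how often each track title appears across the parsed CDs."""
    tracks_count = {}
    for cd in cds:
        for key, title in cd.items():
            # Track titles are stored under TTITLE0, TTITLE1, and so on.
            if not key.startswith("TTITLE"):
                continue
            if title in tracks_count:
                tracks_count[title] += 1
            else:
                tracks_count[title] = 1
    return tracks_count


def sort_ascending(tracks_count):
    """Return (title, count) pairs ordered so the most common track comes last."""
    return sorted(tracks_count.items(), key=lambda item: item[1])


# In the full script each dictionary would come from parsing one CD file.
cds = [
    {"DTITLE": "various / example one", "TTITLE0": "darude / sandstorm", "TTITLE1": "a / b"},
    {"DTITLE": "various / example two", "TTITLE0": "darude / sandstorm"},
]
for title, count in sort_ascending(count_tracks(cds)):
    print(count, title)
```

With this ordering the track we are most interested in ends up at the bottom of the console output, so there is no need to scroll back up.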

The results

The top 30 most common tracks used on Ministry of Sound CD releases. Look who's sitting pretty with 25 appearances :)

I must admit, I thought these tracks would show up more regularly, considering the number of CD files we were searching through (1,229), but I'm happy that at least my prediction was correct. Next time you see an advert for a MoS album, keep an ear out for Darude - Sandstorm!

The full script