This can also be a random X generator, where X is any medium with a substantial corpus of raw text that can be used to train a model.

You can check out the source on GitHub here.

Most of my hacks are pretty CRUDdy. That’s not to say they aren’t interesting; some of the most useful things I’ve made have pretty simple algorithms behind them. What my projects do tend to lack is any kind of interesting integration of advanced topics in computer science. So I decided to combine two things I’ve been wanting to dabble in, NLP and statistical modeling, to put together a naive robot rapper whom I’ve dubbed MC Andrey.

A Brief Background on Markov Chains

Markov chains are used by some of the biggest players in the tech industry. To put things in perspective, the entire premise of Google’s PageRank algorithm is built on the concept of Markov chains. Of course, I’ve decided to use this extremely powerful statistical model for something actually useful: building a rapping robot, yo.

The concept of a Markov chain is simple. Imagine a system of various interconnected states, with pre-assigned probabilities for moving from one state to another. For example, imagine a system with two states, SUN and RAIN. If it’s sunny today, it’ll probably be sunny tomorrow, but there is a chance of rain. Eventually the system transitions from sunny to rainy, rain lingers for a few days, and then it transitions back to sun. This is of course a gross simplification of weather patterns, but it illustrates how a Markov model actually works.
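In code, that two-state model is just a table of transition probabilities plus a weighted coin flip at each step. Here’s a minimal Python sketch, with transition probabilities invented purely for illustration:

```python
import random

# Toy two-state weather chain. These probabilities are made up.
TRANSITIONS = {
    "SUN":  {"SUN": 0.8, "RAIN": 0.2},
    "RAIN": {"SUN": 0.4, "RAIN": 0.6},
}

def simulate(start, days):
    """Walk the chain for `days` steps, sampling each transition."""
    state, history = start, [start]
    for _ in range(days):
        states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in states]
        state = random.choices(states, weights=weights)[0]
        history.append(state)
    return history

print(simulate("SUN", 7))  # e.g. ['SUN', 'SUN', 'SUN', 'RAIN', 'RAIN', ...]
```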

Now imagine a basic sentence like “The cat in the red hat”. A sentence is essentially a state machine where each word has a 100% probability of transitioning to the word directly after it. What if we just looked at the raw set of words in the sentence (the, cat, in, red, hat) and assigned to each word a list of the words that come after it?

the: [cat, red]

cat: [in]

in: [the]

red: [hat]

hat: []
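Building this successor table in Python takes only a few lines (a minimal sketch, not the project’s actual code):

```python
from collections import defaultdict

sentence = "the cat in the red hat"
words = sentence.split()

# Map each word to the list of words seen directly after it.
successors = defaultdict(list)
for current, following in zip(words, words[1:]):
    successors[current].append(following)

print(dict(successors))
# {'the': ['cat', 'red'], 'cat': ['in'], 'in': ['the'], 'red': ['hat']}
# 'hat' never gets an entry, since nothing follows it, so chains stop there.
```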

If we start at the word ‘the’, we can come up with the following sentences.

The red hat.

The cat in the red hat.

The cat in the cat in the red hat.

The cat in the cat in the cat in the cat in the cat in the red hat.

And so on. All of these Markov chain sentences are based on a single base sentence. What if we added a sentence like “Who is afraid of the big bad wolf” to the mix? What if we added an entire paragraph of text? It becomes apparent that the number of interesting chains grows dramatically as more text sources are added.
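Generating those chains is just a random walk over the successor table until we reach a word with no successors (or hit a length cap). A quick sketch:

```python
import random

# The successor table from the example above; 'hat' has no entry,
# so any walk that reaches it stops.
successors = {
    "the": ["cat", "red"],
    "cat": ["in"],
    "in":  ["the"],
    "red": ["hat"],
}

def generate(start, max_words=20):
    """Random-walk the chain from `start`, one word at a time."""
    word, chain = start, [start]
    while word in successors and len(chain) < max_words:
        word = random.choice(successors[word])
        chain.append(word)
    return " ".join(chain)

for _ in range(3):
    print(generate("the"))
# e.g. "the red hat"
#      "the cat in the cat in the red hat"
```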

Why Build a Rap Generator?

I was primarily inspired by this XKCD classic, whose alt text reads “freestyle rapping is basically applied Markov chains”. What’s interesting is how well suited a Markov chain is to rap generation. A lot of rap comes down to the clever linguistics and subtle wordplay embedded in each verse. A common technique (which probably has a real name, but I call it the “before and after” effect) takes a popular phrase A and strings on another popular phrase B that continues off the last word or words of phrase A. For example, “runway model citizen”, “second string cheese”, and “getting away with murder she wrote” are the kinds of wordplay that are both found in rap and likely to fall out of a Markov chain.

Data Munging

I started with a list of the top 100 raps of all time and scraped out the song titles and artists. I then took this list and started scraping lyrics off of azlyrics. I removed things in brackets and parens, like [Chorus] and (yeah), and stripped all punctuation, which left me with a bunch of clean tokens. I then combined all the lyrics into one massive text file and began training the Markov model.

The initial results were good, but each randomly generated line varied too much in length from the others, so I enforced a strict eight-words-per-line rule. This kept the results looking a little more uniform.

The next step was getting the lines to rhyme. I decided to use NLTK’s cmudict, which is basically a pronunciation dictionary. I split the word at the end of each line into its syllables and returned a list of words whose pronunciations matched exactly for the last two syllables. Pretty naive, but it does a fairly decent job of generating a rhyming dictionary on the fly!
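The cleanup step might look roughly like this; a sketch with an invented clean_lyrics helper, not the project’s actual code:

```python
import re
import string

def clean_lyrics(raw):
    # Strip [Chorus]-style and (yeah)-style annotations...
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", raw)
    # ...then strip punctuation and lowercase everything.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower().split()

print(clean_lyrics("[Verse 1] I got ninety-nine problems (uh huh)"))
# ['i', 'got', 'ninetynine', 'problems']
```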
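And the on-the-fly rhyming dictionary could be sketched like so, matching the last two syllables’ worth of cmudict phonemes exactly (the helper names are mine, and this is an approximation of the approach rather than the real implementation):

```python
from nltk.corpus import cmudict  # requires nltk.download('cmudict')

pron = cmudict.dict()  # word -> list of ARPAbet pronunciations

def tail(word, n_syllables=2):
    """Phonemes from the nth-from-last vowel onward (vowels carry stress digits)."""
    phones = pron[word.lower()][0]  # just use the first listed pronunciation
    vowels = [i for i, p in enumerate(phones) if p[-1].isdigit()]
    start = vowels[-n_syllables] if len(vowels) >= n_syllables else 0
    return tuple(phones[start:])

def rhymes(word, vocabulary):
    """Words in `vocabulary` whose last two syllables match `word` exactly."""
    target = tail(word)
    return [w for w in vocabulary
            if w != word and w.lower() in pron and tail(w) == target]

print(rhymes("murder", ["herder", "girder", "wordplay"]))
# e.g. ['herder', 'girder']
```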

Concluding Thoughts

Markov chains and NLTK are awesome. In conjunction, they give you a powerful text generator backed by a smart linguistics engine. This project was just something I hacked together in a few hours, but it has the potential for so much more. What if I trained the model with lyrics from only one particular artist? That would make the style of each rap a lot more uniform and potentially more coherent. What if, instead of raps, I trained the model on pop songs? How about Shakespeare? The possibilities are endless!