Adventures with Markov Chains

A Markov chain is an incredibly simple model describing a sequence of events in which the probability of each event depends only on the state attained in the previous event. And yes, I did lift that from Wikipedia, but it explains it much better than I can. Basically, the idea is that you have a number of states, like whether it's raining or not, and the probability of a certain state tomorrow depends only on the state today. In other words, the probability of it raining tomorrow depends only on whether it's raining today.
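To make the rain example concrete, here is a minimal Python sketch of a two-state chain. The state names and the probabilities in TRANSITIONS are made up for illustration, not measured from any real weather data.

```python
import random

# Hypothetical transition probabilities: P(tomorrow | today).
# Each inner dict sums to 1.
TRANSITIONS = {
    "rain": {"rain": 0.7, "dry": 0.3},
    "dry": {"rain": 0.2, "dry": 0.8},
}

def simulate(start, days, rng):
    """Walk the chain for `days` steps, starting from `start`."""
    state = start
    history = [state]
    for _ in range(days):
        options = TRANSITIONS[state]
        # Tomorrow's state is drawn using only today's state.
        state = rng.choices(list(options), weights=list(options.values()))[0]
        history.append(state)
    return history

print(simulate("rain", 7, random.Random(0)))
```

Notice that simulate never looks at anything in history except the current state, which is the whole point of the model.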

I've used this model to generate text in particular styles. You feed the model a huge amount of text, and it builds up a probability table of what the next word may be given the previous word. You then use a random number generator to walk that table, picking each next word according to its probability, to generate chains of text.
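A rough sketch of that idea in Python might look like the following. It is not the code I actually used: the function names build_table and generate and the tiny corpus are invented for illustration, and it stores raw follower counts rather than normalized probabilities, since weighted sampling only needs the counts.

```python
import random
from collections import Counter, defaultdict

def build_table(text):
    """Map each word to a Counter of the words that follow it."""
    words = text.split()
    table = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, max_words, rng):
    """Generate up to `max_words` words, starting from `start`."""
    word = start
    out = [word]
    for _ in range(max_words - 1):
        followers = table.get(word)
        if not followers:
            break  # this word was never followed by anything in the corpus
        # Pick the next word in proportion to how often it followed this one.
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        out.append(word)
    return " ".join(out)

# Made-up toy corpus; a real one would be far larger.
corpus = "the wine is ripe and the wine is fresh and the finish is long"
table = build_table(corpus)
print(generate(table, "the", 10, random.Random(1)))
```

In this toy corpus "the" is followed by "wine" twice and "finish" once, so the generator picks "wine" after "the" about two thirds of the time.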

I experimented a bit with order and token choice. I found that training the model on individual letters produced pronounceable garbage. Any words more than a couple of letters long were not real words; the output was incomprehensible but pronounceable. If I wanted to make an English-sounding fake language, this is probably how I'd do it. I wanted some generated text chains that made sense, however, so I trained it on individual words instead. I found that a first-order model (looking back only one word) didn't produce very good sentences, and a third-order model (looking back three words) tended to reproduce the input data because I didn't have enough of it. The second-order model seemed like a good middle ground.
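Both knobs are easy to expose in code. The sketch below is an illustration rather than my original implementation: train and generate are hypothetical names, the corpus is a toy one, and the order is just the length of the tuple used as the lookup key. Switching between letters and words only changes how the text is split into tokens.

```python
import random
from collections import defaultdict

def train(tokens, order):
    """Map each run of `order` tokens to the list of tokens seen after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context].append(tokens[i + order])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` (a sequence of `order` tokens) up to `length` tokens."""
    out = list(seed)
    order = len(seed)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break  # context never seen with a follower in the training data
        # Repeated followers appear repeatedly in the list, so a uniform
        # choice is already weighted by frequency.
        out.append(rng.choice(followers))
    return out

# Made-up toy corpus; a real one would be far larger.
text = "the wine is ripe and the wine is fresh and the finish is long"
word_tokens = text.split()  # word-level tokens
char_tokens = list(text)    # letter-level tokens

word_model = train(word_tokens, 2)
char_model = train(char_tokens, 2)

print(" ".join(generate(word_model, word_tokens[:2], 12, random.Random(0))))
print("".join(generate(char_model, char_tokens[:2], 40, random.Random(0))))
```

With a small corpus, raising the order means most contexts have only one recorded follower, so the generator has no choice but to replay the training text. That is the same effect I saw with the third-order model.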

I trained the model on a variety of corpora. Generated wine-tasting notes were surprisingly believable, as were generated Trump tweets and dreams. Here are some generated wine-tasting notes:

"This wine combines weight and a low altitude vineyard. It's packed with both power and makes it refreshing too. Drink now."

"A kitchen-sink blend of 90% Sangiovese and the wine is still closed but delivers a creamy, textural wine. Those body powder and lime aromas are less attractive."

"The voluble, opinionated Jean-Luc Colombo has crafted this classic Gewurztraminer. It's plush on the finish."

"Clove aromas are outweighed by a touch of volatility along with dried apricot while the palate shows ripe aromas of sweet oak tones that will age well. Drink from 2016."

These are, of course, hand-picked, but the generated sentences were of unreasonably good quality in general. The wine-tasting notes worked especially well, which may be because I had a huge amount of data.