
I've seen Markov chains applied to language generation - producing sentences that make sense grammatically but not literally. Anyone know what the connection is here? I think I have an idea but would like to see if it gets independently verified by someone else.


There are complicated ways of doing this, but the naïve way is as follows:

First, you need a corpus of text that's grammatically correct.

Each node in the chain is a word or piece of punctuation. Each word has a certain probability of being followed by every other word in the corpus, including itself. There are a few different ways to start the sentence. One approach is to start from the node for the punctuation mark "." and only select a following node that is not a period, since sentences don't tend to start with punctuation. From there, use a random number generator to pick a following node based on your probability matrix, rinse, repeat.

If you'll notice, there's no guarantee that it will be grammatically correct. There's just some statistical likelihood that it will be.
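The naïve recipe above can be sketched in a few lines of Python. The toy corpus and function names here are my own inventions for illustration; storing each observed follower in a list means `random.choice` samples proportionally to observed frequency, which stands in for the probability matrix.

```python
import random
from collections import defaultdict

# Toy corpus (made up for this sketch); real use needs much more text.
corpus = "the cat sat on the mat . the dog sat on the cat ."
tokens = corpus.split()

# For each token, record every token that was observed to follow it.
# Duplicates in the list make random.choice sample by frequency.
transitions = defaultdict(list)
for cur, nxt in zip(tokens, tokens[1:]):
    transitions[cur].append(nxt)

def generate_sentence():
    # Start from the "." node, but never begin a sentence with punctuation.
    word = random.choice([w for w in transitions["."] if w != "."])
    sentence = [word]
    while word != ".":
        word = random.choice(transitions[word])  # pick next node, rinse, repeat
        sentence.append(word)
    return " ".join(sentence)

print(generate_sentence())
```

As the comment says, nothing here guarantees grammaticality; the output just tends to be locally plausible because every adjacent pair was seen in the corpus.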


If you'll notice, there's no guarantee that it will be grammatically correct. There's just some statistical likelihood that it will be.

Which is also true for human speakers.


Here is a cool generator that demonstrates this in action: http://projects.haykranen.nl/markov/demo/


Markov chains work well for artificial language and name generation. One of my first programs, in 1979, when I was a 13-year-old kid, summed up the occurrences of characters following other characters in text files, then rolled dice on that table to generate names for role-playing games.

I later learned that I had reinvented Markov chains this way. I still have those printouts, and use them whenever I need a name for a role-playing game NPC.
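That character-frequency-table-plus-dice approach translates directly to code. A minimal sketch, with a made-up seed list of names (the sentinel characters and function name are my own choices):

```python
import random
from collections import defaultdict

# Sample seed names (invented for this sketch); any name list works.
names = ["aldric", "brenna", "caelum", "dorian", "elara", "fendrel"]

# Tally which character follows which, with "^" and "$" marking
# the start and end of a name.
followers = defaultdict(list)
for name in names:
    padded = "^" + name + "$"
    for cur, nxt in zip(padded, padded[1:]):
        followers[cur].append(nxt)

def roll_name():
    ch, out = "^", []
    while True:
        ch = random.choice(followers[ch])  # the "dice roll" on the table
        if ch == "$":
            return "".join(out)
        out.append(ch)

print(roll_name())
```

This is an order-1 chain (each letter depends only on the previous one); conditioning on the previous two or three letters gives more name-like output at the cost of a bigger table.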


Someone made a Twitter account to generate baby names with markov chains, actually: https://twitter.com/markovbaby


Yeah, same. The first time I ever heard "Markov chains" was when I was a 13-year-old hanging out on black hat SEO forums. Back in the day, software like "YASG" (or something like that, "yet another site generator") made a lot of money with simple PHP scripts for rewriting Wikipedia articles with Markov chains, thus skirting Google's duplicate content algorithms, which were much weaker at the time.

There were actually quite a few vulnerabilities back then. I wrote a script that allowed me to create a Google account with the IP address of a visitor to the website, all without them knowing. All I had to do was open the registration page with a server-side script, download the CAPTCHA, display it to the user, and ask them to fill it out. When they filled it out, the submitted form targeted the Google registration form in a 1x1 iframe, then another button targeted the logout form. Google was not checking the referrer of the sign-up form, nor was it comparing the IP address that received the CAPTCHA to the one that submitted it.

I had a friend load that script into thousands of generated Blogspot blogs, which got long-tail Google traffic and asked the user to "fill the CAPTCHA to continue." The script ran for ~2 weeks and generated ~60,000 Google accounts, all from unique IP addresses.

That was around 2007, so obviously it's all patched up by now. I was 15 years old and never did anything with the accounts, so if anyone from Google is reading this, keep the lawyers away from me please.

Blackhat SEO actually has a lot of clever tricks. I haven't been part of that space in a while, for a lot of reasons, but I can attribute 99% of my knowledge of marketing to time spent trawling through blackhat SEO forums, reading not only about that but also about landing page optimization, conversion tracking, copywriting, etc. It was definitely a worthwhile learning experience for me.

Oops that veered off topic. But yes, Markov chains. Blackhat content generation. :)


Usually, you'd have a row and a column per word in your dictionary. Rows would represent the current word, and the columns would represent the next word in the sentence, with the transition probability of each cell in the matrix determined by counting the frequency of occurrence of each word pair in some body of text.
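Building that matrix is just pair counting followed by row normalization. A small sketch with a toy corpus of my own (a dict-of-dicts stands in for the matrix, keyed row-then-column):

```python
# Toy corpus (invented for this sketch).
tokens = "the cat sat . the cat ran .".split()
vocab = sorted(set(tokens))

# Rows: current word; columns: next word; cells: word-pair counts.
counts = {row: {col: 0 for col in vocab} for row in vocab}
for cur, nxt in zip(tokens, tokens[1:]):
    counts[cur][nxt] += 1

# Normalize each row so its cells hold transition probabilities.
probs = {}
for row, cols in counts.items():
    total = sum(cols.values())
    if total:
        probs[row] = {col: c / total for col, c in cols.items()}
```

With this corpus, "the" is always followed by "cat" (probability 1.0), while "cat" splits evenly between "sat" and "ran" (0.5 each); each row sums to 1, which is what makes it a valid transition matrix.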


Here's a nice introduction/exercise that's part of a Python tutorial:

http://www.greenteapress.com/thinkpython/html/thinkpython014...



