
I've seen Markov chains applied to language generation - producing sentences that make sense grammatically but not literally. Anyone know what the connection is here? I think I have an idea but would like to see if it gets independently verified by someone else.


There are complicated ways of doing this, but the naïve way is as follows:

First, you need a corpus of text that's grammatically correct.

Each node in the chain is a word or piece of punctuation. Each word has a certain probability of being followed by every other word in the corpus, including itself. There are a few different ways to start the sentence. One approach is to start from the node for the punctuation mark "." and only select a following node that is not a period, since sentences don't tend to start with punctuation. From there, use a random number generator to pick a following node based on your probability matrix, rinse, repeat.

If you'll notice, there's no guarantee that it will be grammatically correct. There's just some statistical likelihood that it will be.
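The naïve recipe above can be sketched in a few lines of Python. The toy corpus and function names here are my own inventions for illustration; storing each observed follower in a list means `random.choice` samples proportionally to observed frequency, which stands in for the probability matrix.

```python
import random
from collections import defaultdict

# Toy corpus (made up for this sketch); real use needs much more text.
corpus = "the cat sat on the mat . the dog sat on the cat ."
tokens = corpus.split()

# For each token, record every token that was observed to follow it.
# Duplicates in the list make random.choice sample by frequency.
transitions = defaultdict(list)
for cur, nxt in zip(tokens, tokens[1:]):
    transitions[cur].append(nxt)

def generate_sentence():
    # Start from the "." node, but never begin a sentence with punctuation.
    word = random.choice([w for w in transitions["."] if w != "."])
    sentence = [word]
    while word != ".":
        word = random.choice(transitions[word])  # pick next node, rinse, repeat
        sentence.append(word)
    return " ".join(sentence)

print(generate_sentence())
```

As the comment says, nothing here guarantees grammaticality; the output just tends to be locally plausible because every adjacent pair was seen in the corpus.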


If you'll notice, there's no guarantee that it will be grammatically correct. There's just some statistical likelihood that it will be.

Which is also true for human speakers.


Here is a cool generator that demonstrates this in action: http://projects.haykranen.nl/markov/demo/


Markov chains work well for artificial language and name generation. One of my first programs, in 1979, when I was a 13-year-old kid, summed up the occurrences of characters following other characters in text files, then rolled dice on that table to generate names for role-playing games.

I later learned that I had reinvented Markov chains this way. I still have those printouts, and use them whenever I need a name for a role-playing game NPC.
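That character-frequency-table-plus-dice approach translates directly to code. A minimal sketch, with a made-up seed list of names (the sentinel characters and function name are my own choices):

```python
import random
from collections import defaultdict

# Sample seed names (invented for this sketch); any name list works.
names = ["aldric", "brenna", "caelum", "dorian", "elara", "fendrel"]

# Tally which character follows which, with "^" and "$" marking
# the start and end of a name.
followers = defaultdict(list)
for name in names:
    padded = "^" + name + "$"
    for cur, nxt in zip(padded, padded[1:]):
        followers[cur].append(nxt)

def roll_name():
    ch, out = "^", []
    while True:
        ch = random.choice(followers[ch])  # the "dice roll" on the table
        if ch == "$":
            return "".join(out)
        out.append(ch)

print(roll_name())
```

This is an order-1 chain (each letter depends only on the previous one); conditioning on the previous two or three letters gives more name-like output at the cost of a bigger table.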


Someone made a Twitter account to generate baby names with markov chains, actually: https://twitter.com/markovbaby


Yeah, same. The first time I ever heard "Markov chains" was when I was a 13-year-old hanging out on black hat SEO forums. Back in the day, software like "YASG" (or something like that, "yet another site generator") made a lot of money with simple PHP scripts for rewriting Wikipedia articles with Markov chains, thus skirting Google's duplicate content algorithms, which were much weaker at the time.

There were actually quite a few vulnerabilities back then. I wrote a script that allowed me to create a Google account with the IP address of a visitor to the website, all without them knowing. All I had to do was open the registration page with a server-side script, download the CAPTCHA, display it to the user, and ask them to fill it out. When they filled it out, the submitted form targeted the Google registration form in a 1x1 iframe, then another button targeted the logout form. Google was not checking the referrer of the sign-up form, nor was it comparing the IP address that received the CAPTCHA to the one that submitted it.

I had a friend load that script into thousands of generated Blogspot blogs, which got long-tail Google traffic and asked the user to "fill the CAPTCHA to continue." The script ran for ~2 weeks and generated ~60,000 Google accounts, all from unique IP addresses.

That was around 2007, so obviously it's all patched up by now. I was 15 years old and never did anything with the accounts, so if anyone from Google is reading this, keep the lawyers away from me please.

Blackhat SEO actually has a lot of clever tricks. I haven't been part of that space in a while, for a lot of reasons, but I can attribute 99% of my knowledge of marketing to time spent trawling through blackhat SEO forums, reading not only about that but also about landing page optimization, conversion tracking, copywriting, etc. It was definitely a worthwhile learning experience for me.

Oops that veered off topic. But yes, Markov chains. Blackhat content generation. :)


Usually, you'd have a row and a column per word in your dictionary. Rows would represent the current word, and the columns would represent the next word in the sentence, with the transition probability of each cell in the matrix determined by counting the frequency of occurrence of each word pair in some body of text.
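Building that matrix is just pair counting followed by row normalization. A small sketch with a toy corpus of my own (a dict-of-dicts stands in for the matrix, keyed row-then-column):

```python
# Toy corpus (invented for this sketch).
tokens = "the cat sat . the cat ran .".split()
vocab = sorted(set(tokens))

# Rows: current word; columns: next word; cells: word-pair counts.
counts = {row: {col: 0 for col in vocab} for row in vocab}
for cur, nxt in zip(tokens, tokens[1:]):
    counts[cur][nxt] += 1

# Normalize each row so its cells hold transition probabilities.
probs = {}
for row, cols in counts.items():
    total = sum(cols.values())
    if total:
        probs[row] = {col: c / total for col, c in cols.items()}
```

With this corpus, "the" is always followed by "cat" (probability 1.0), while "cat" splits evenly between "sat" and "ran" (0.5 each); each row sums to 1, which is what makes it a valid transition matrix.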


Here's a nice introduction/exercise that's part of a Python tutorial:

http://www.greenteapress.com/thinkpython/html/thinkpython014...



