I just learned what a Markov chain is, this evening. I'm not going to
try to explain it, as I don't understand it well enough to explain quite
yet, but if you are interested, Wikipedia's entry is a reasonable
starting point: http://en.wikipedia.org/wiki/Markov_chain
My thing for today is an algorithm for extracting Markov chain data from
text input. It's naive and in no way fully thought through, and will
probably turn out to be entirely the wrong way of doing it, but it's
what I have so far. Not gonna research how it Should Be Done until I
have either a working model from this algorithm, or have discovered
unfixable problems with it. No pic, just a text file. (And how will
posterous handle a text file? We'll see shortly!)
# markov chain for chat
each user's statement, process it for tokens and store in db
how to store the info? naively, use a db with fields token1, token2, count.
"one two three four five one two three four" =>
- one 1
one two 1
one two three 1
one two three four 2
two three four five 1
three four five one 1
four five one two 1
five one two three 1
two three four - 1
three four - 1
four - 1
(maybe the tokens should be stored in the other order, for thinking about)
store(string):
token1 = findFirstToken(string)
token2 = findNextWord(string, token1) // but what if token1 appears more than once?
// probably should track index in the string instead
count = findInDb(tokan1, token2)
if( count == 0 ) // this means it's not there
enter into db token1, token2
increment count and save
repeat until the token1 finder returns nothing
how to handle finding token2 if token1 is the end of the string? is it reasonable to consider null as a token? Yes; we need to be able to stop making messages at some reasonable point, and to know what tokens to start with.
findFirstToken(string)
token = ""
for(numWordsPerToken) // how many words are in a token? this should be a global setting
word = findAWord(string)
token += word
return token
findNextWord(string, token)
iterate through string until we find token, drop a marker
substring the string at marker
find the first word in the new string
Comments [0]