I did something similar many years ago. I fed about half a million words (two decades of mostly fantasy and science fiction writing) into a Markov model that could generate text using a “gram slider” ranging from 2-grams to 5-grams.
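For anyone curious about the mechanics: the heart of a "gram slider" like that is just a lookup table from word prefixes to the words that followed them in the corpus. Rewriting it from memory, the core looked something like this (a rough sketch, not my original code; the corpus filename is a placeholder):

    # Minimal word-level Markov generator with a tunable n-gram order.
    import random
    from collections import defaultdict

    def build_model(words, n):
        # Map each (n-1)-word prefix to every word that followed it.
        model = defaultdict(list)
        for i in range(len(words) - n + 1):
            model[tuple(words[i:i + n - 1])].append(words[i + n - 1])
        return model

    def generate(model, n, length=30, seed=None):
        rng = random.Random(seed)  # fixing the seed makes a run reproducible
        out = list(rng.choice(list(model.keys())))
        while len(out) < length:
            followers = model.get(tuple(out[-(n - 1):]))
            if not followers:  # this prefix was never continued in the corpus
                break
            out.append(rng.choice(followers))
        return " ".join(out)

    words = open("corpus.txt").read().split()  # placeholder filename
    print(generate(build_model(words, n=3), n=3))  # slide n from 2 up to 5

Exposing the seed means the same corpus and settings reproduce the same output, which turns out to be handy if you ever want to reshare a particular run.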
I used it as a kind of “dream well” whenever I wanted to draw inspiration from the same deep spring. It felt like a spiritual successor to what I used to do as a kid: flipping to a random page in an old 1950s Funk & Wagnalls dictionary and using whatever I found there as a writing seed.
Curious if you've heard of or participated in NaNoGenMo[0] before. With such a corpus at your fingertips, it could be a fun little project; obviously, pure Markov generation wouldn't quite be sufficient, but it might be a good starting point.
Hey that's neat! I hadn't heard of it. It says you need to publish the novel and the source at the end - so I guess as part of the submission you'd include the RNG seed.
The only thing I'm a bit wary of is the submission size - a minimum of 50,000 words. At that length, it'd be really difficult to maintain a cohesive story without manual oversight.
There was an MS-DOS tool by James Korenthal called Babble[0], which did something similar. It apparently worked according to a set of grammatical transformers rather than by generating n-grams, so it was more akin to the "cut-up" technique[1]. He reported that he got better output from smaller, more focused corpora. Its output was surprisingly interesting.
I think you're absolutely right about the easiest approach. I hope you don't mind me asking for a bit more difficulty.
Wouldn't fine-tuning produce better results, so long as you don't catastrophically forget? You'd preserve more context window space, too, right? Especially if you wanted it to memorize years of facts?
I gave a talk in 2015 about doing the same thing with my tweet history (about 20K tweets at the time) and how I used it as source material for a Twitter bot that could reply to users. [1]
What a fantastic idea! I have about 30 years of writing, mostly chapters and plots for novels that never coalesced. I'd love to know how it turns out, too.
So that's the key difference. A lot of people train these Markov models with the expectation that they're going to be able to use the generated output in isolation.
The problem with that is either your n-gram level is too low, in which case it can't maintain any kind of cohesion, or your n-gram level is too high, in which case it's basically just spitting out your existing corpus verbatim.
For me, I was more interested in something that could potentially combine two or three highly disparate concepts found in my previous works into a single output sentence - and then I would ideate upon it.
I haven't opened the program in a long time, so I just spun it up and generated a few outputs:
    A giant baby is navel corked which if removed causes a vacuum.
I'm not sure which pieces of my original text that particular sentence was drawn from, but it starts me thinking about a kind of strange void Harkonnen with heart plugs that lead to weird negatively pressurized areas. That's the idea behind the dream well.