I did something similar many years ago. I fed about half a million words (two decades of mostly fantasy and science fiction writing) into a Markov model that could generate text using a “gram slider” ranging from 2-grams to 5-grams.
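For anyone curious about the mechanics: the heart of a "gram slider" like that is just a lookup table from word prefixes to the words that followed them in the corpus. Rewriting it from memory, the core looked something like this (a rough sketch, not my original code; the corpus filename is a placeholder):

    # Minimal word-level Markov generator with a tunable n-gram order.
    import random
    from collections import defaultdict

    def build_model(words, n):
        # Map each (n-1)-word prefix to every word that followed it.
        model = defaultdict(list)
        for i in range(len(words) - n + 1):
            model[tuple(words[i:i + n - 1])].append(words[i + n - 1])
        return model

    def generate(model, n, length=30, seed=None):
        rng = random.Random(seed)  # fixing the seed makes a run reproducible
        out = list(rng.choice(list(model.keys())))
        while len(out) < length:
            followers = model.get(tuple(out[-(n - 1):]))
            if not followers:  # this prefix was never continued in the corpus
                break
            out.append(rng.choice(followers))
        return " ".join(out)

    words = open("corpus.txt").read().split()  # placeholder filename
    print(generate(build_model(words, n=3), n=3))  # slide n from 2 up to 5

Exposing the seed means the same corpus and settings reproduce the same output, which turns out to be handy if you ever want to reshare a particular run.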
I used it as a kind of “dream well” whenever I wanted to draw inspiration from the same deep spring. It felt like a spiritual successor to what I used to do as a kid: flipping to a random page in an old 1950s Funk & Wagnalls dictionary and using whatever I found there as a writing seed.
Curious if you've heard of or participated in NaNoGenMo[0] before. With such a corpus at your fingertips, it could be a fun little project; obviously, pure Markov generation wouldn't quite be sufficient, but it might be a good starting point.
Hey that's neat! I hadn't heard of it. It says you need to publish the novel and the source at the end - so I guess as part of the submission you'd include the RNG seed.
The only thing I'm a bit wary of is the submission size - a minimum of 50,000 words. At that length, it'd be really difficult to maintain a cohesive story without manual oversight.
There was an MS-DOS tool by James Korenthal called Babble[0], which did something similar. It apparently worked according to a set of grammatical transformers rather than by generating n-grams, so it was more akin to the "cut-up" technique[1]. He reported that he got better output from smaller, more focused corpora. Its output was surprisingly interesting.
I think you're absolutely right about the easiest approach. I hope you don't mind me asking for a bit more difficulty.
Wouldn't fine-tuning produce better results, so long as you don't catastrophically forget? You'd preserve more context window space, too, right? Especially if you wanted it to memorize years of facts?
I gave a talk in 2015 about doing the same thing with my tweet history (about 20K tweets at the time) and how I used it as source material for a Twitter bot that could reply to users. [1]
What a fantastic idea! I have about 30 years of writing, mostly chapters and plots for novels that never coalesced. I'd love to know how it turns out, too.
So that's the key difference. A lot of people train these Markov models with the expectation that they're going to be able to use the generated output in isolation.
The problem with that is either your n-gram level is too low, in which case it can't maintain any kind of cohesion, or your n-gram level is too high, in which case it's basically just spitting out your existing corpus verbatim.
For me, I was more interested in something that could potentially combine two or three highly disparate concepts found in my previous works into a single output sentence - and then I would ideate upon it.
I haven't opened the program in a long time, so I just spun it up and generated a few outputs:
    A giant baby is navel corked which if removed causes a vacuum.
I'm not sure which pieces of my original text that particular sentence was drawn from, but it starts me thinking about a kind of strange void Harkonnen with heart plugs that lead to weird negatively pressurized areas. That's the idea behind the dream well.