
Here's what I don't get. Wikimedia claims to be a nonprofit for spreading knowledge. They sit on nearly half a billion dollars in assets.

Every customer would prefer a firehose of content deltas over having to scrape the site and compute diffs themselves.

They obviously have the capital to provide this, and still grow their funds for eternity without ever needing a single dollar in external revenue.

Why don't they?



They provide database dumps already, and those dumps have the diff information. Crawlers are ignoring the dumps and scraping the websites anyway.
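
For anyone who hasn't looked at them: the dumps are bzip2-compressed XML with per-page revision metadata, so "what changed since X" can be answered offline. A minimal sketch, assuming the enwiki pages-meta-current dump; the exact filename and export-schema namespace on dumps.wikimedia.org may differ:

  # Sketch: pull per-page change info out of a Wikimedia XML dump instead of crawling.
  import bz2
  import xml.etree.ElementTree as ET

  DUMP_PATH = "enwiki-latest-pages-meta-current.xml.bz2"  # downloaded from dumps.wikimedia.org
  NS = "{http://www.mediawiki.org/xml/export-0.10/}"      # export schema version may vary
  CUTOFF = "2024-01-01T00:00:00Z"                         # report pages revised after this

  def changed_pages(path, cutoff):
      """Yield (title, timestamp) for pages whose current revision is newer than cutoff."""
      with bz2.open(path, "rb") as f:
          for _, elem in ET.iterparse(f, events=("end",)):
              if elem.tag == NS + "page":
                  title = elem.findtext(NS + "title")
                  ts = elem.findtext(NS + "revision/" + NS + "timestamp")
                  if ts and ts > cutoff:  # ISO 8601 strings compare lexicographically
                      yield title, ts
                  elem.clear()            # free parsed pages while streaming

  if __name__ == "__main__":
      for title, ts in changed_pages(DUMP_PATH, CUTOFF):
          print(ts, title)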


Have they ever asked the customers why they prefer scraping over the data deltas?


I would bet the answer is that it is easier to write a script that simply downloads everything it can (foreach <a href=>: download and recurse) than to look into which sites provide data dumps and how to use them.
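
Which really is only a handful of lines. A sketch of that naive crawler, stdlib only; the start URL and page limit are placeholders, and there is deliberately no rate limiting or robots.txt handling, which is exactly the problem:

  # Sketch of the naive "foreach <a href>: download and recurse" crawler.
  import urllib.request
  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse

  class LinkExtractor(HTMLParser):
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(start_url, max_pages=100):
      seen, queue = set(), [start_url]
      host = urlparse(start_url).netloc
      while queue and len(seen) < max_pages:
          url = queue.pop()
          if url in seen:
              continue
          seen.add(url)
          try:
              with urllib.request.urlopen(url, timeout=10) as resp:
                  html = resp.read().decode("utf-8", errors="replace")
          except Exception:
              continue
          # ...store html somewhere...
          parser = LinkExtractor()
          parser.feed(html)
          for href in parser.links:
              absolute = urljoin(url, href)
              if urlparse(absolute).netloc == host:  # stay on one site
                  queue.append(absolute)
      return seen

  if __name__ == "__main__":
      crawl("https://en.wikipedia.org/wiki/Main_Page", max_pages=10)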


So the solution would be an edge-cached site, exactly like the full site, that serves just the deltas since some periodic point in time?

The crawler still crawls, but can rest assured that with base + delta it has all the info, as if it had recrawled everything?
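
Concretely, something in the spirit of the sketch below, where BASE_URL, /snapshot, /delta?since=... and the JSON fields are entirely made up to illustrate the shape of the idea; nothing like them exists today:

  # Sketch of a base+delta consumer for a hypothetical delta-serving mirror.
  import json
  import urllib.request

  BASE_URL = "https://mirror.example.org"   # hypothetical mirror
  STATE_FILE = "mirror_state.json"

  def fetch_json(url):
      with urllib.request.urlopen(url, timeout=30) as resp:
          return json.load(resp)

  def sync():
      try:
          with open(STATE_FILE) as f:
              state = json.load(f)           # {"pages": {...}, "as_of": "..."}
      except FileNotFoundError:
          # First run: pull the full base snapshot once.
          state = fetch_json(BASE_URL + "/snapshot")
      # Later runs: only pull pages changed since the last sync point.
      delta = fetch_json(BASE_URL + "/delta?since=" + state["as_of"])
      state["pages"].update(delta["pages"])  # changed/added pages overwrite old ones
      state["as_of"] = delta["as_of"]
      with open(STATE_FILE, "w") as f:
          json.dump(state, f)

  if __name__ == "__main__":
      sync()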


>Every customer would prefer a firehose of content deltas over having to scrape the site and compute diffs themselves.

"Customers" is a strong word, especially when you're saying they should provide a new service that is useful, more or less exclusively, to AI startups and megacorps.


Why not? If your mission as a non-profit is to share knowledge, aren't these just welcome new value-adding channels for achieving that goal?


Why should they create and support a whole new architecture when you can find the changed articles between two dumps with a simple query? I'd rather load a big file into a database than maintain a firehose consumer.
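
E.g. something like the query below, assuming you've already loaded the two dumps into SQLite as dump_old and dump_new with (page_id, title, rev_id) columns; that layout is illustrative, the real dump schema carries more fields:

  # Sketch: find added or edited articles between two dump imports with one query.
  import sqlite3

  QUERY = """
  SELECT curr.page_id, curr.title
  FROM dump_new AS curr
  LEFT JOIN dump_old AS prev ON prev.page_id = curr.page_id
  WHERE prev.page_id IS NULL          -- article added since the old dump
     OR curr.rev_id <> prev.rev_id    -- article edited since the old dump
  """

  def changed_articles(db_path="dumps.sqlite"):
      con = sqlite3.connect(db_path)
      try:
          return con.execute(QUERY).fetchall()
      finally:
          con.close()

  if __name__ == "__main__":
      for page_id, title in changed_articles():
          print(page_id, title)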


That is a question of latency, though: a dump-based diff is only as fresh as the most recent published dump, while a firehose would be close to real time.



