
Here's what I don't get. Wikimedia claims to be a nonprofit for spreading knowledge. They sit on nearly half a billion dollars in assets.

Every customer would prefer a firehose of content deltas over having to scrape the site and compute diffs themselves.

They obviously have the capital to provide this, and still grow their funds for eternity without ever needing a single dollar in external revenue.

Why don't they?



They provide database dumps already, and those dumps have the diff information. Crawlers are ignoring the dumps and scraping the websites anyway.
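
For anyone who hasn't looked at them: the dumps are bzip2-compressed XML with per-page revision metadata, so "what changed since X" can be answered offline. A minimal sketch, assuming the enwiki pages-meta-current dump; the exact filename and export-schema namespace on dumps.wikimedia.org may differ:

  # Sketch: pull per-page change info out of a Wikimedia XML dump instead of crawling.
  import bz2
  import xml.etree.ElementTree as ET

  DUMP_PATH = "enwiki-latest-pages-meta-current.xml.bz2"  # downloaded from dumps.wikimedia.org
  NS = "{http://www.mediawiki.org/xml/export-0.10/}"      # export schema version may vary
  CUTOFF = "2024-01-01T00:00:00Z"                         # report pages revised after this

  def changed_pages(path, cutoff):
      """Yield (title, timestamp) for pages whose current revision is newer than cutoff."""
      with bz2.open(path, "rb") as f:
          for _, elem in ET.iterparse(f, events=("end",)):
              if elem.tag == NS + "page":
                  title = elem.findtext(NS + "title")
                  ts = elem.findtext(NS + "revision/" + NS + "timestamp")
                  if ts and ts > cutoff:  # ISO 8601 strings compare lexicographically
                      yield title, ts
                  elem.clear()            # free parsed pages while streaming

  if __name__ == "__main__":
      for title, ts in changed_pages(DUMP_PATH, CUTOFF):
          print(ts, title)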


Have they ever asked the customers why they prefer scraping over the data deltas?


I would bet the answer is that it is easier to write a script that simply downloads everything it can (foreach <a href=>: download and recurse) than to look into which sites provide data dumps and how to use them.
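
Which really is only a handful of lines. A sketch of that naive crawler, stdlib only; the start URL and page limit are placeholders, and there is deliberately no rate limiting or robots.txt handling, which is exactly the problem:

  # Sketch of the naive "foreach <a href>: download and recurse" crawler.
  import urllib.request
  from html.parser import HTMLParser
  from urllib.parse import urljoin, urlparse

  class LinkExtractor(HTMLParser):
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  def crawl(start_url, max_pages=100):
      seen, queue = set(), [start_url]
      host = urlparse(start_url).netloc
      while queue and len(seen) < max_pages:
          url = queue.pop()
          if url in seen:
              continue
          seen.add(url)
          try:
              with urllib.request.urlopen(url, timeout=10) as resp:
                  html = resp.read().decode("utf-8", errors="replace")
          except Exception:
              continue
          # ...store html somewhere...
          parser = LinkExtractor()
          parser.feed(html)
          for href in parser.links:
              absolute = urljoin(url, href)
              if urlparse(absolute).netloc == host:  # stay on one site
                  queue.append(absolute)
      return seen

  if __name__ == "__main__":
      crawl("https://en.wikipedia.org/wiki/Main_Page", max_pages=10)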


So the solution would be an edge-cached site, exactly like the full site, that serves just the deltas since some periodic point in time?

The crawler still crawls, but can rest assured that with base + delta it has all the info, as if it had recrawled everything?
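
Concretely, something in the spirit of the sketch below, where BASE_URL, /snapshot, /delta?since=... and the JSON fields are entirely made up to illustrate the shape of the idea; nothing like them exists today:

  # Sketch of a base+delta consumer for a hypothetical delta-serving mirror.
  import json
  import urllib.request

  BASE_URL = "https://mirror.example.org"   # hypothetical mirror
  STATE_FILE = "mirror_state.json"

  def fetch_json(url):
      with urllib.request.urlopen(url, timeout=30) as resp:
          return json.load(resp)

  def sync():
      try:
          with open(STATE_FILE) as f:
              state = json.load(f)           # {"pages": {...}, "as_of": "..."}
      except FileNotFoundError:
          # First run: pull the full base snapshot once.
          state = fetch_json(BASE_URL + "/snapshot")
      # Later runs: only pull pages changed since the last sync point.
      delta = fetch_json(BASE_URL + "/delta?since=" + state["as_of"])
      state["pages"].update(delta["pages"])  # changed/added pages overwrite old ones
      state["as_of"] = delta["as_of"]
      with open(STATE_FILE, "w") as f:
          json.dump(state, f)

  if __name__ == "__main__":
      sync()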


>Every customer would prefer a firehose of content deltas over having to scrape the site and compute diffs themselves.

"Customers" is a strong word, especially when you're saying they should provide a new service that is useful, more or less exclusively, to AI startups and megacorps.


Why not? If your mission as a non-profit is to share knowledge, aren't these just welcome new value-adding channels for achieving that goal?


Why should they create and support a whole new architecture when you can find the changed articles between two dumps with a simple query? I'd rather load a big file into a database than maintain a firehose consumer.
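
E.g. something like the query below, assuming you've already loaded the two dumps into SQLite as dump_old and dump_new with (page_id, title, rev_id) columns; that layout is illustrative, the real dump schema carries more fields:

  # Sketch: find added or edited articles between two dump imports with one query.
  import sqlite3

  QUERY = """
  SELECT curr.page_id, curr.title
  FROM dump_new AS curr
  LEFT JOIN dump_old AS prev ON prev.page_id = curr.page_id
  WHERE prev.page_id IS NULL          -- article added since the old dump
     OR curr.rev_id <> prev.rev_id    -- article edited since the old dump
  """

  def changed_articles(db_path="dumps.sqlite"):
      con = sqlite3.connect(db_path)
      try:
          return con.execute(QUERY).fetchall()
      finally:
          con.close()

  if __name__ == "__main__":
      for page_id, title in changed_articles():
          print(page_id, title)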


That is a question of latency, though: a dump-based diff is only as fresh as the most recent published dump, while a firehose would be close to real time.



