
> Crawling the internet is a natural monopoly.

How so?

A caching proxy costs you almost nothing and will serve thousands of requests per second on ancient hardware. Actually, there has never been a better time in the history of the Internet to have competing search engines: performance, bandwidth, and software have never been so abundant, or so cheap to get (often free).
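To make that concrete, here's a rough Python sketch of the kind of caching proxy I mean. The upstream address, port, and TTL are made-up examples, and a real deployment would use nginx/Varnish or a CDN rather than this toy:

    # Toy in-memory caching reverse proxy (illustrative only; real setups
    # would use nginx, Varnish, or a CDN). UPSTREAM and CACHE_TTL are made up.
    import time
    import urllib.request
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    UPSTREAM = "http://127.0.0.1:8080"   # the origin server being protected
    CACHE_TTL = 300                      # seconds a cached response stays fresh
    _cache = {}                          # path -> (expires_at, status, body)

    class CachingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            entry = _cache.get(self.path)
            if entry and entry[0] > time.time():
                _, status, body = entry          # cache hit: never touch the origin
            else:
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    status, body = resp.status, resp.read()
                _cache[self.path] = (time.time() + CACHE_TTL, status, body)
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8000), CachingProxy).serve_forever()

Even something this naive turns repeat crawler hits into dictionary lookups, which is why the hardware cost is basically a rounding error.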



Costs almost nothing, but returns even less.*

There are so many other bots/scrapers out there that literally return zero that I don’t blame site owners for blocking all bots except googlebot.

Would it be nice if they also allowed altruist-bot or common-crawler-bot? Maybe, but that’s their call and a lot of them have made it on a rational basis.

* - or is perceived to return


> that I don’t blame site owners for blocking all bots except googlebot

I run a number of sites with decent traffic, and spam/scam requests outnumber crawling bots 1000 to 1.

I would guess that the number of sites allowing just Googlebot is 0.


> that I don’t blame site owners for blocking all bots except googlebot.

I doubt this is happening outside of a few small hobbyist websites where crawler traffic looks significant relative to human traffic. Even among those, it's so common to move to static hosting with essentially zero cost and/or sign up for free tiers of CDNs that it's just not worth it outside of edge cases like trying to host public-facing GitLab instances with large projects.

Even then, the ROI on setting up proper caching and rate limiting far outweighs the ROI on trying to play whack-a-mole with non-Google bots.
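For the rate-limiting part, something as simple as a per-IP token bucket goes a long way. A rough Python sketch, with made-up numbers you'd tune to your own traffic:

    # Per-IP token bucket (sketch; RATE and BURST are made-up numbers).
    import time
    from collections import defaultdict

    RATE = 5.0    # tokens refilled per second, per client
    BURST = 20.0  # maximum bucket size (allows short bursts)

    _buckets = defaultdict(lambda: (BURST, time.monotonic()))  # ip -> (tokens, last_seen)

    def allow_request(client_ip: str) -> bool:
        tokens, last = _buckets[client_ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last request
        if tokens < 1.0:
            _buckets[client_ip] = (tokens, now)
            return False   # over the limit: serve a 429 instead of the page
        _buckets[client_ip] = (tokens - 1.0, now)
        return True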

Even if someone did go to all the lengths to try to block the majority of bots, I have a really hard time believing they wouldn’t take the extra 10 minutes to look up the other major crawlers and put those on the allow list, too.

This whole argument about sites going to great lengths to block search indexers but then stopping just short of allowing a couple more of the well-known ones feels like mental gymnastics for a situation that doesn’t occur.


> sites going to great lengths to block search indexers

That's not it. They're going to great lengths to block all bot traffic because of abusive and generally incompetent actors chewing through their resources. I'll cite that Anubis has made the front page of HN several times within the past couple of months. It is far from the first or only solution in that space, merely one of many alternatives to the solutions provided by centralized services such as Cloudflare.


Regarding allowlisting the other major crawlers: I've never seen any significant amount of traffic coming from anything but Google or Bing. There's the occasional click from one of the resellers (Ecosia, Brave Search, DuckDuckGo, etc.), but that's about it. Yahoo? Haven't seen them in ages, except in Japan. Baidu or Yandex? Might be relevant if you're in their primary markets, but I've never seen them. Huawei's Petal Search? Apple Search? Nothing. Ahrefs & friends? No need to crawl _my_ website, even if I wanted to use them for competitor analysis.

So practically, there's very little value in allowing those. I usually don't bother blocking them, but if my content wasn't easy to cache, I probably would.


In the past month there were dozens of posts about using proof of work and other methods to defeat crawlers. I don't think most websites tolerate heavy crawling in the era of Vercel/AWS's serverless "per request" and bandwidth billing.
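The proof-of-work idea is simple enough to sketch in a few lines of Python. This is a toy hashcash-style illustration of the general scheme, not what Anubis or any other specific tool actually does:

    # Toy hashcash-style challenge: the client must find a nonce whose
    # SHA-256 digest starts with DIFFICULTY zero hex digits before the page
    # is served. Cheap for one human visit, expensive at crawler scale.
    # (Illustrative only; not the scheme any real tool uses.)
    import hashlib
    import itertools
    import os

    DIFFICULTY = 4  # leading zero hex digits required (made-up setting)

    def issue_challenge() -> str:
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
            if digest.startswith("0" * DIFFICULTY):
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    challenge = issue_challenge()
    nonce = solve(challenge)          # in practice, done in the visitor's browser
    assert verify(challenge, nonce)   # done server-side before serving content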


Not everyone wants to deal with a caching proxy; they think the load on their site under normal operation is fine even when it's rendered server-side.


You don't get to tell site owners what to do. The actual facts on the ground are that they're trying to block your bot. It would be nice if they didn't block your bot, but the other, completely unnatural and advertising-driven monopoly of hosting providers with insane per-request costs makes that impossible until they switch away.


They try to block your bot because Google is a monopoly and there's little to no cost for blocking everything except Google.

This isn't a "natural" monopoly, it's more like Internet Explorer 6.0 and everyone designing their sites to use ActiveX and IE-specific quirks.


One possible answer: pay them for their trouble until you provide value to them, e.g. by paying some fraction of a cent for each (document) request.


Cool, you wanna solve micropayments now or wait until we've got cold fusion rolling first...?


You wouldn't have to make them micropayments; you could pay out once some threshold is reached.

Of course, it would incentivize the sites to make you want to crawl them more, but that might be a good thing: it would put pressure on you to focus on quality over quantity, which would probably benefit your product.
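As a sketch of what I mean, in Python; the per-request price, threshold, and settle() hook are all hypothetical:

    # Accumulate per-request charges and settle only past a threshold
    # (all numbers and the settle() hook are made up).
    from collections import defaultdict

    PRICE_PER_REQUEST = 0.0005   # dollars per crawled document
    PAYOUT_THRESHOLD = 5.00      # settle once a site is owed this much

    _owed = defaultdict(float)   # site -> accrued balance in dollars

    def settle(site: str, amount: float) -> None:
        # Placeholder for a real transfer (bank payout, PayPal, whatever).
        print(f"paying {site} ${amount:.2f}")

    def record_crawl(site: str) -> None:
        _owed[site] += PRICE_PER_REQUEST
        if _owed[site] >= PAYOUT_THRESHOLD:
            settle(site, _owed[site])
            _owed[site] = 0.0

That way nobody has to move fractions of a cent over the wire; the accounting is fractional but the transfers aren't.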


>You wouldn't have to make them micropayments; you could pay out once some threshold is reached.

Believe it or not, this is a potential solution for micropayments that has been explored.


I could even pay a fixed amount to my ISP every month for a fixed amount of data transfer.


> The actual facts on the ground are that they're trying to block your bot

Based on what evidence?


Based on them matching the user-agent and sending you a block page? I don't know what else to tell you. It's in plain sight.
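It's also a trivial check to implement. Roughly this, as a Python/WSGI sketch with made-up patterns; in practice people also verify Googlebot by reverse DNS, since the header is trivially forged:

    # Illustrative WSGI middleware: match the user-agent, send bots a block page.
    # The patterns are made up; real setups also verify Googlebot via reverse
    # DNS because the header is trivially spoofed.
    import re

    BOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)  # crude bot check
    ALLOWED_BOT = re.compile(r"Googlebot", re.IGNORECASE)

    def block_non_google_bots(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if BOT_PATTERN.search(ua) and not ALLOWED_BOT.search(ua):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Blocked: bots other than Googlebot are not welcome.\n"]
            return app(environ, start_response)
        return middleware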



