
A few?

We are routinely fighting off hundreds of bots at any given moment; thousands and thousands per day, easily. They come from the US, China, and Brazil, from hundreds of different IPs, using dozens of different (and falsified!) user agents, all ignoring robots.txt and pushing over services that human beings need to get work done.

EDIT: Just checked our Anubis stats for the last 24h:

    CHALLENGE: 829,586
    DENY:      621,462
    ALLOW:      96,810

This is with a pretty aggressive "DENY" rule for a lot of the AI-related bots, and on just two pretty small sites at $JOB. We have hundreds, if not thousands, of different sites that aren't protected by Anubis (yet).

Anubis and efforts like it are a godsend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.



This seems like two different issues.

One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Serving each resource on the site once to each of a thousand bots is going to be irrelevant to a site serving millions of users (because it's a trivial percentage of traffic) and to a site with a small amount of content (because it's a small absolute number), and those two categories together cover the vast majority of all sites.
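
Sitemaps with <lastmod> already get most of the way there. A minimal sketch of what the crawler side could look like, assuming the site publishes a sitemap with lastmod dates (the URL and timestamp below are placeholders):

    from datetime import datetime, timezone
    from urllib.request import urlopen
    import xml.etree.ElementTree as ET

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def changed_since(sitemap_url, last_crawl):
        # Return URLs whose <lastmod> is newer than our previous crawl.
        with urlopen(sitemap_url) as resp:
            tree = ET.parse(resp)
        changed = []
        for url in tree.iter(NS + "url"):
            loc = url.findtext(NS + "loc")
            lastmod = url.findtext(NS + "lastmod")
            if not (loc and lastmod):
                continue
            dt = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if dt.tzinfo is None:  # date-only lastmod entries
                dt = dt.replace(tzinfo=timezone.utc)
            if dt > last_crawl:
                changed.append(loc)
        return changed

    # changed_since("https://example.com/sitemap.xml",
    #               datetime(2024, 1, 1, tzinfo=timezone.utc))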

The other is that there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed, and that ignore robots.txt. But then you set traps like disallowing something in robots.txt and then ban anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.
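
A rough sketch of that trap, using Flask purely as an illustration; the /trap/ prefix and the in-memory ban set are stand-ins for whatever path your robots.txt disallows and however you actually block at the edge:

    # robots.txt contains:  Disallow: /trap/
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()  # in production this would feed your firewall/WAF instead

    @app.before_request
    def honeypot():
        ip = request.remote_addr
        if ip in banned_ips:
            abort(403)
        if request.path.startswith("/trap/"):
            banned_ips.add(ip)  # it ignored robots.txt; drop it from now on
            abort(403)

    @app.route("/")
    def index():
        return "normal content"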


> then you set traps like disallowing something in robots.txt and then ban anything that tries to access it

That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis. All you can be certain of is that a significant portion of your traffic is abusive.

That results in aggressive filtering schemes, which in turn means permitted bots must be whitelisted on a case-by-case basis.


> That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis.

Well sure you can. If it's requesting something which is allowed in robots.txt, it's a legitimate request. It's only when it requests something that isn't that you have to start deciding whether to filter it or not.

What does it matter if they use multiple IP addresses to request only things you would have allowed them to request from a single one?
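
Concretely, the per-request check is cheap; a sketch using Python's stdlib robots.txt parser, with a made-up user agent and placeholder URLs:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # Requests for allowed paths are legitimate regardless of source IP;
    # only disallowed paths require any filtering decision at all.
    rp.can_fetch("SomeCrawler/1.0", "https://example.com/articles/123")
    rp.can_fetch("SomeCrawler/1.0", "https://example.com/trap/anything")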


> If it's requesting something which is allowed in robots.txt, it's a legitimate request.

An abusive scraper is pushing over your boxes. It is intentionally circumventing rate limits and (more generally) accurate attribution of the traffic source. In this example you have deemed such behavior to be abusive and would like to put a stop to it.

Any given request looks pretty much normal. The vast majority are coming from residential IPs (in this example your site serves mostly residential customers to begin with).

So what if 0.001% of requests hit a disallowed resource and you ban those IPs? That removes roughly 0.001% of the traffic you're currently experiencing. It does not solve your problem at all: the excessive traffic that is disrespecting rate limits and gumming up your service for other well-behaved users.


Why would it be only 0.001% of requests? You can fill your actual pages with links to pages disallowed in robots.txt which are hidden from a human user but visible to a bot scraping the site. Adversarial bots ignoring robots.txt would be following those links everywhere. It could just as easily be 50% of requests and each time it happens, they lose that IP address.
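
The hidden link itself is trivial to seed into every page; a sketch, where the /trap/ path and per-page token are made up, paired with a matching Disallow line in robots.txt:

    # Emits an anchor a human never sees or clicks, but a naive scraper will follow.
    # Pair it with "Disallow: /trap/" in robots.txt and ban whatever requests it.
    def trap_link(token: str) -> str:
        return ('<a href="/trap/%s" style="display:none" '
                'aria-hidden="true" tabindex="-1" rel="nofollow">ignore</a>' % token)

    # print(trap_link("per-page-or-per-session-token"))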


I mean, sure, but if there were 3 search engines instead of one, would you disallow two of them? The spam problem is one thing, but I don't think having ten search engines rather than two is going to destroy websites.

The claim that search is a natural monopoly, because of the impact on websites of having a few more search competitors crawling them, seems silly. I don't think it's a natural monopoly at all.



