A small part. On my server AI bots outnumber real visitors 300 to one.


I don't mean that users are following the links to `acme.com` and `demo.com` type domains in documentation; I mean that bots are likely finding and following many links to them because those domains are so widely used as examples.

If you search Google for `site:github.com "acme.com"`, you'll find numerous instances of the domain being used in contrived documentation links as an example of how URLs might be structured on an arbitrary domain, and in issues to demonstrate a fully qualified URL without giving away the actual domain someone was using.

This means numerous links point to non-existent paths on `acme.com`, purely because of how people use the domain in documentation and examples.


That is very possible.

But it isn't necessary in order to see the results being described.

If a site like my tiny little browser game, with roughly 120 weekly unique users, is getting absolutely hammered by scraper bots (it was, last year, until I put the wiki behind a login wall; I still get a significant amount of bot traffic, it's just no longer enough to actually crash the game), then sites that people actually know and consider important, like acme.com, are very likely getting massive deluges of traffic purely from first-order hits.


The article describes that a lot of the requests are for non-existent URLs. Do you observe the same?


Yes; I get a lot of requests, mostly for a small set of paths on my site, that look like attempts at finding exploitable surfaces. Things like /auth/bind-session, /auth/check?jwt=, etc. (And those are just the ones that come up in the obvious error reports; when I go looking at the logs there are more.)
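
For anyone curious how to spot these, here's roughly what I do, as a minimal sketch; the log path, log format, and the set of suspicious path patterns are just examples, adjust for your own setup:

  # Rough sketch: pull probable exploit probes out of an access log.
  # Assumes an nginx/Apache "combined" log format and this particular
  # list of suspicious patterns; neither is universal.
  import re
  from collections import Counter

  LOG_FILE = "/var/log/nginx/access.log"
  SUSPICIOUS = re.compile(r"/auth/|wp-login|/\.env|/\.git|phpmyadmin", re.I)
  REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

  hits = Counter()
  with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
      for line in f:
          m = REQUEST.search(line)
          if m and SUSPICIOUS.search(m.group(1)):
              ip = line.split(" ", 1)[0]
              hits[(ip, m.group(1))] += 1

  for (ip, path), count in hits.most_common(20):
      print(f"{count:6d}  {ip:15s}  {path}")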


That's such an absolutely ludicrous thing to hear, in a "wtf are these people doing" kind of way. I can't imagine a non-social-media site generating enough new content that these bots would need to be essentially scraping it continuously. It's just gross to me that they're okay with that level of unsophisticated effort, doing the same thing over and over with zero gain.


On top of the massive amounts of energy they're burning in their own datacenters, they're burning up other people's datacenters as well. Plus all the extra energy used by every router, hub, and switch in between.


How are you measuring this? Does your solution rely on user agent or device fingerprinting? Curious to know what tools are available today and how accurate they are.


I'm popular in Europe; there's no reason for people from Singapore, Russia, Brazil, and literally every other country in the world to all start visiting very old articles and comment permalinks en masse.
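
Roughly the kind of check I mean, as a sketch; it assumes the free GeoLite2 country database, the geoip2 Python package, and an access log where the client IP is the first field:

  # Count requests per country to spot geographic anomalies.
  # Assumes GeoLite2-Country.mmdb is on disk and geoip2 is installed.
  from collections import Counter
  import geoip2.database
  import geoip2.errors

  reader = geoip2.database.Reader("GeoLite2-Country.mmdb")
  countries = Counter()

  with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
      for line in f:
          ip = line.split(" ", 1)[0]
          try:
              countries[reader.country(ip).country.iso_code or "??"] += 1
          except (geoip2.errors.AddressNotFoundError, ValueError):
              countries["??"] += 1

  for code, count in countries.most_common(15):
      print(f"{code}  {count}")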

Having honeypot links is the only thing that helps, but I'm ending up with massive iptables rule sets that slow things down.
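
For what it's worth, the honeypot itself can be tiny. The sketch below (assuming Flask, plus an invisible link in the page templates to a made-up path like /very-secret-page) just appends offending IPs to a flat file, which a cron job can sync into a single ipset or an nginx deny list rather than one iptables rule per IP:

  # Minimal honeypot endpoint sketch. The path, file location, and proxy
  # handling are assumptions; adapt to your own stack.
  from flask import Flask, request, abort

  app = Flask(__name__)
  BLOCKLIST = "/var/lib/honeypot/blocked-ips.txt"

  @app.route("/very-secret-page")
  def honeypot():
      # Naive client-IP extraction; only trust X-Forwarded-For if you control the proxy.
      ip = request.headers.get("X-Forwarded-For", request.remote_addr).split(",")[0].strip()
      with open(BLOCKLIST, "a", encoding="utf-8") as f:
          f.write(ip + "\n")
      abort(404)  # look like any other dead link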

This is not what I want to do with my time. I can't afford the expensive specialised tools. I'm just a solo entrepreneur on a shoestring budget. I just want to improve the website for my 3k real users and 10k real daily guests, not for bots.


Where from? And, quite frankly, why? There are existing training data sets that are large enough for smaller models, and larger models have been focusing on data quality more than quantity. There's limited utility to further indiscriminate widespread scraping.


Tell that to the idiots doing the scraping.

Small site operators like us know very well that the utility they can get from scraping us is marginal at best. Based on their patterns of behavior, though, my best guess is that they've simply configured their bots to scrape absolutely everything, all the time, forever, as aggressively as possible, and to treat any attempt to indicate "hey, this data isn't useful to you" as an adversarial signal that the site operator is trying to hide things they consider their God-given right.



