Attack of the RSS Aggregators!

My server is under attack by RSS aggregators! They eat bandwidth and resources at four times the rate of regular viewing mortals like you and me. I love RSS, don't get me wrong. But the current crop of brute-force aggregators is really driving me crazy. (AmphetaDesk and NetNewsWire seem to be the worst offenders, but they may simply be the most popular.) Some stop by as often as every ten minutes without so much as identifying themselves. It's just rude to make so many requests.

Aggregator authors could create polite software very simply: use conditional HTTP GETs. The aggregator sends the time it last saw the feed along with each request. And my server politely says, "No, it's the same one you have. 304." Or, "Yes, it has changed since your version, here you go. 200." It's much more civilized than, "Gimme! Gimme! Gimme! 200! 200! 200!"

The other alternative is to set up a centralized ping server (like weblogs.com) where RSS authors can let every aggregator on earth know that their feed has changed recently. It's not as elegant or scalable as conditional HTTP GETs, but it would be better than our current state of RSS anarchy. As it is, I'm going to have to write some sort of filter to slow them down.
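For aggregator authors, the conditional GET described above might look something like this in Python. It's only a sketch: the function names, the `ExampleAggregator/0.1` User-Agent string, and the shape of the returned dictionary are made up for illustration, and a real aggregator would keep the ETag and Last-Modified values it received on disk between runs.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whatever validators we saved last time."""
    # Identify ourselves; the User-Agent string here is a made-up example.
    req = urllib.request.Request(url, headers={"User-Agent": "ExampleAggregator/0.1"})
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch the feed only if the server says it changed since our copy."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            # 200: the feed changed, so keep the body and the new validators.
            return {
                "changed": True,
                "body": resp.read(),
                "etag": resp.headers.get("ETag"),
                "last_modified": resp.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # 304: nothing new, no body was sent, keep the old validators.
            return {"changed": False, "body": None,
                    "etag": etag, "last_modified": last_modified}
        raise
```

The first poll passes no validators and gets the full feed. Every later poll passes back the `etag` and `last_modified` values from the previous result, and a 304 costs the server a few hundred bytes of headers instead of the whole file.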

Comments

I've recently rolled my own aggregator, starting with Aaron Swartz's rss2email script and expanding it to use ZODB. The brains of the outfit come from Mark Pilgrim's ultra-liberal RSS parser, rssparser.py. That puppy supports ETag and Last-Modified headers, which lightens the load on the server where the aggregator runs as well as being nice to the remote HTTP servers. I also don't see any reason to poll more than once an hour... I'd say you've got every right to block hosts that aren't being nice.
I was surprised to find someone grabbing my rss file once every five minutes (he doesn't know I never update).

I considered writing a little script to handle him and others, but then I felt bad about blocking him and I didn't set it up.

Maybe when someone subscribes to an rss feed they also have to jump through a hoop or two on your server so you can track them and manage them. I'm thinking you give them a subscription key that their aggregator uses when asking for updates.

I suppose you could even charge access to the subscription key. (I'm sure this exists, I'm just so behind in all of this rss/xml talk)
There's been a lot of talk about the demands of RSS aggregators. There's a good summary in this recent thread at K5:
http://www.kuro5hin.org/comments/2002/11/10/122820/97/27#27
I'm behind on all of the xml/rss talk too. I just noticed that the requests have been unusually heavy lately. Instead of filtering out frequent requests, I'll first make sure the ETag and Last-Modified headers are being used properly on my end, and see if that makes a difference. It sounds like tying a key to an IP address would work well--and make people responsible for their access. Though I'm not sure many people would take the time to go through the registration process.
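On the server end, checking that those headers are "used properly" comes down to one decision per request: 304 or 200. Here is a rough sketch of that decision, assuming the server already knows the feed's current ETag and modification time. The function name `feed_status` and the plain-dict request headers are invented for the example, and it only does exact ETag matching (no weak `W/` validators).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def feed_status(request_headers, etag, modified):
    """Return 304 if the client's copy is current, otherwise 200.

    request_headers: dict of header name -> value (hypothetical shape).
    etag: the feed's current ETag, quotes included, e.g. '"v1"'.
    modified: timezone-aware datetime of the feed's last change.
    """
    # If-None-Match takes precedence over If-Modified-Since when both arrive.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if (etag in candidates or "*" in candidates) else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: just send the feed
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if modified.replace(microsecond=0) <= since:
            return 304

    return 200
```

A request with no validators at all always gets 200, which is exactly what the brute-force aggregators are doing today.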
Good day. I'm the creator of AmphetaDesk. How have you identified AmphetaDesk as an offender, perchance? Your report is the direct opposite of Joel Spolsky's report:

http://www.intertwingly.net/blog/925.html
http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=17465

In any case: AmphetaDesk always identifies itself with a User-Agent, and cannot be set lower than one hour on a check - in fact, it's set to three hours by default. Could you give me more information on what you're seeing? AmphetaDesk tries like mad (often to the chagrin of its users) to NOT waste bandwidth - I'd be really surprised if indeed AmphetaDesk is a willing culprit here.
AmphetaDesk was not one of the same-IP, every-ten-minutes offenders. And one copy of AmphetaDesk checking my RSS feed every hour is no problem. But 30+ people with copies of AmphetaDesk, each set to check my server every 1-3 hours, is what's causing the problem. So the hits roll in every few minutes--granted, from different IP addresses. It's different from a standard Web browser visit because of the frequency of repetition. A person wouldn't return every single hour without fail to see if something has changed. As your software and other brute-force aggregators become more popular, the problem will increase. A single copy of AmphetaDesk doesn't hurt my bandwidth. It's all of the copies working independently that will hurt content providers. The bandwidth bill will be a part of the decision to make or not-make RSS feeds available.
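To put rough numbers on "every few minutes": a back-of-the-envelope calculation, assuming exactly 30 clients that all poll at the same fixed interval and are spread evenly over time (real traffic is lumpier). The function name is just for illustration.

```python
def mean_gap_minutes(clients, poll_interval_hours):
    """Average minutes between requests when `clients` copies each poll
    once every `poll_interval_hours`, spread evenly over time."""
    requests_per_hour = clients / poll_interval_hours
    return 60 / requests_per_hour


for interval in (1, 2, 3):
    gap = mean_gap_minutes(30, interval)
    print(f"30 clients polling every {interval} h: one request every {gap:.0f} min")
```

So even at the polite three-hour default, 30 subscribers add up to a hit roughly every six minutes, around the clock, and the gap shrinks in direct proportion as the subscriber count grows.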
Alright, in that regard, it is an issue of popularity then, and one that will happen with any aggregator once it gets a decent number of users. Until someone decides to create a non-app-specific gateway (to receive information on when RSS files like yours are updated, like weblogs.com), and then all the blog makers include support for that gateway (to adequately report that your RSS file has changed), and then all RSS readers support that gateway (to get information on when your RSS feed was updated), it's not going to magically resolve itself.
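The core of such a gateway is small; the hard part is getting everyone to use it. As a toy illustration only, here is an in-memory version of the bookkeeping. The class and method names are hypothetical and this is not how weblogs.com works internally; a real gateway would need a network interface, persistence, and some defense against bogus pings.

```python
import time


class PingRegistry:
    """Toy in-memory record of when each feed last reported a change."""

    def __init__(self):
        self._last_ping = {}  # feed URL -> Unix timestamp of latest ping

    def ping(self, feed_url, when=None):
        """Called by a publisher (or its blog tool) after the feed changes."""
        self._last_ping[feed_url] = time.time() if when is None else when

    def changed_since(self, feed_urls, since):
        """Called by an aggregator: which of my feeds pinged after `since`?

        Feeds that never pinged are left out, so a reader would still need
        an occasional fallback poll for publishers that don't participate.
        """
        return [url for url in feed_urls if self._last_ping.get(url, 0.0) > since]
```

With something like this, an aggregator asks the gateway one question covering all of its subscriptions, then fetches only the feeds that come back in the list.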
I agree there's no easy solution. But I think the weblogs.com centralized notification system is only one method that could work. Andre's suggestion of controlling it at the provider point would work. And subscription systems built to form individual connections between publisher and subscriber could work. Or the P2P distribution method I discussed in another post could help the situation. Finding a solution to this problem should be a matter of survival for RSS aggregators, though. You wouldn't want the most popular feeds to be pulled in the future because of high bandwidth bills.
But then whoever runs that gateway is the one that's going to be hammered by all the aggregators - and this will be multiplied mayhem because of the number of feeds they will be monitoring. So whoever builds the central ping monitoring site better have bandwidth up the wazoo or they'll crumble as well.
Exactly. That's why I think people should share RSS feeds like they share MP3s.