revjim.net

damned if I do, and damned if I don't

In my quest to develop a good, web-based RSS aggregator, I've learned one thing: RSS is fucked. Not RSS itself, but RSS as it is used. There are just too many variations of everything to accommodate them all. For instance, one complaint I've always had about the available RSS readers is that feeds are jumbled together with, seemingly, no regard for any kind of date order. In my aggregator, I set out to correct this.

First, I look for a date on the item: maybe a dc:date in Dublin Core format, maybe an rss:pubDate in standard format. I try to accommodate multiple timezones and formats. If, for some reason, a feed author doesn't include item-level dates, I fall back to channel-level dates. If an item is new to a channel, and the channel has an rss:pubDate, a dc:date, or an rss:lastBuildDate, then I assume that date can be safely applied to the new item. This isn't always the case, but it's a fairly safe assumption. Then, in case none of those is provided, I also record the date/time when the item is parsed. When a feed is first parsed, all the items will have the same date (which is annoying). However, as new items are added, the date/time becomes closer and closer to accurate (+/- the amount of time between updates). This seemed to be working okay, except, when you look into it deeper, it's still really screwy.
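To make the fallback order concrete, here's a rough Python sketch of it. This is not the aggregator's actual code. The function names, the plain-dict shape for items and channels, and the choice to treat a date with no timezone as UTC are all assumptions made for illustration. It tries both date styles on every field, since feeds don't always put the right format in the right tag.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(text):
    """Parse an RFC 822 style or ISO 8601 style date; return UTC datetime or None."""
    if not text:
        return None
    text = text.strip()
    parsed = None
    try:
        # Handles dates like "Sat, 07 Sep 2002 09:42:31 GMT"
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        pass
    if parsed is None:
        try:
            # Older Pythons reject a trailing "Z", so rewrite it as an offset
            parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
        except ValueError:
            return None
    if parsed.tzinfo is None:
        # Assumption: a date without a timezone is treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def resolve_item_date(item, channel, is_new, fetched_at):
    """Return (date, source) using item, then channel, then parse time.

    item and channel are hypothetical dicts mapping tag name to text.
    """
    for key in ("dc:date", "pubDate"):
        found = parse_feed_date(item.get(key))
        if found:
            return found, "item"
    if is_new:
        # Only a newly seen item borrows the channel-level date
        for key in ("pubDate", "dc:date", "lastBuildDate"):
            found = parse_feed_date(channel.get(key))
            if found:
                return found, "channel"
    return fetched_at, "parsed"
```

Returning the source alongside the date makes it possible to tell a real item date from a guessed one later, which matters when the guess turns out to be wrong.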

If the system clock of the publisher is off a little (or if they should choose to date an entry 12 years into the future), things get skewed. Some users incorrectly(?) put a Dublin Core style date in an rss:pubDate field. It's just annoying, more than anything.

I guess I jumped the gun a bit. RSS isn't really fucked, and neither is the use of it. I just wish there were some consistency to make life a bit easier. I'm now realizing that all of the time and effort I've spent attempting to deduce a date for each item was futile because, in some cases, it just isn't possible.

Aside from that, the RSS aggregator is coming along nicely. I actually use it to do most of my daily blog reading. The HTTP GET portion of it doesn't play real nice just yet. It gets a new copy of the feed every time, without bothering to see if the feed has changed any. This is wasteful of bandwidth. The HTTP GET portion also doesn't support basic authentication, though I am uncertain there is actually a need for this (unless, of course, LiveJournal were to modify its code to allow users to supply Basic HTTP authentication without requesting it). Then, by hardcoding an LJ username and password into the RSS reader, one could even get updates from those who post "friends only".
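One common way to avoid re-downloading an unchanged feed is an HTTP conditional GET: send back the ETag and Last-Modified values from the previous response and let the server answer 304 Not Modified. Here's a minimal Python sketch using only the standard library. The function names and the User-Agent string are made up for illustration, and it assumes the feed's server actually supports these headers (not all of them do).

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to skip unchanged feeds."""
    headers = {"User-Agent": "example-aggregator/0.1"}  # hypothetical name
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            # Remember the validators so the next request can send them back
            return (
                body,
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep the old validators and skip re-parsing
            return None, etag, last_modified
        raise
```

The aggregator would store the two validator values per feed and pass them in on the next poll. When the body comes back as None, there is nothing new to parse.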

One idea I am tossing around adding is adaptive update frequencies. Generally, if an item is added to a feed, more items will be added soon. This isn't always the case, but it is likely. Additionally, those who update frequently tend to continue to update frequently. Again, not always the case, but very likely. With this knowledge, one could adjust the update interval for each feed based on whether there were new items the last time we checked. For instance, if the interval is currently at 60 minutes and an update is found, cut it in half, making it 30 minutes. Every time the interval elapses and an update is NOT found, increase the interval by 50% or so. Of course, the "reward" and "penalty" amounts would need to be fine-tuned to find what most closely reflects the real world. The reason I suggest this is that, as most users would, I'm sure, I like to have the most recent information possible, knowing INSTANTLY when someone updates, without having to check for RSS updates every 10 seconds. I guess I've been spoiled by the INSTANT nature of LiveJournal.
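The interval rule is simple enough to sketch in a few lines of Python. The halving and the 50% increase come straight from the description above. The function name and the minimum and maximum bounds are my own assumptions, added so a busy feed never gets hammered and a dead feed still gets checked once a day.

```python
def next_interval(current_minutes, found_update,
                  reward=0.5, penalty=1.5,
                  minimum=5.0, maximum=1440.0):
    """Return the next polling interval in minutes.

    reward shrinks the interval after an update is found; penalty grows it
    after a miss. minimum and maximum are assumed bounds, not tuned values.
    """
    factor = reward if found_update else penalty
    return max(minimum, min(maximum, current_minutes * factor))


# A feed at 60 minutes that just updated drops to 30 minutes,
# while one that had nothing new backs off to 90 minutes.
after_hit = next_interval(60, True)
after_miss = next_interval(60, False)
```

The reward and penalty multipliers are parameters on purpose, since those are the numbers that would need tuning against real feeds.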

Comments and thoughts are always appreciated. And stay tuned for a software release, hopefully today or tomorrow.
