revjim.net

April, 2003:

RSS updates: there has to be a better way

RSS Feeds are great and all, but there has to be a better way.

Using RSS as the method of encapsulating information is fine. However, polling a feed every hour, half hour, or thirty seconds is a bit ridiculous. There’s got to be a way to let a particular server know that you’re interested in being notified when changes to the feed have been made. Frontier has the right idea with its Publish and Subscribe methods. Couple this with the regular updates that are pushed to weblogs.com and the system works pretty well. Unfortunately, weblogs.com doesn’t activate this service for the general public to use (at least not that I can tell).

With a system like this in place, each RSS reader will be notified when feeds they monitor change. Additionally, they will receive updates almost instantly, as opposed to having to wait almost an entire refresh period before seeing the new items. The servers which serve the RSS feeds will receive fewer hits and therefore will use less bandwidth. It’s a win-win situation, it seems.

A system like this can be established in two ways, and the two methods are interoperable with one another. In order to support weblogging engines that do not support this feature, a standalone server can be utilized. This server can be dedicated to serving one blog, or many blogs. Or, if the blogging engine itself supports this behavior directly, each server can handle its own update notifications. I will outline the flow of each type of server.

Standalone Server

In a standalone environment, the server must accept update notifications and send out update notifications. Additionally, it must accept subscription requests.

It should accept update notifications via the same interface used by weblogs.com and blo.gs. Most weblogging applications already support sending update notifications in this fashion, which makes the standalone server easy to integrate.

After receiving an update ping from the website, the standalone server would consult the list of subscribers and send an XML-RPC notification to each of them.
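
The flow above can be sketched in a few lines. This is only a rough illustration, not a finished server: it assumes subscriptions are keyed by the same URL that the weblog sends in its ping, and the method names `feedUpdates.subscribe` and `feedUpdates.notify` are made up for the example. Only the `weblogUpdates.ping(name, url)` call shape and its `flerror`/`message` reply follow the weblogs.com interface.

```python
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def send_xmlrpc_notification(callback_url, site_url):
    """Call a hypothetical feedUpdates.notify method on the subscriber."""
    ServerProxy(callback_url).feedUpdates.notify(site_url)


class NotificationHub:
    """Standalone server logic: accept subscriptions, accept pings, fan out."""

    def __init__(self, notifier=send_xmlrpc_notification):
        self.subscribers = {}  # site URL -> set of subscriber callback URLs
        self.notifier = notifier

    def subscribe(self, site_url, callback_url):
        self.subscribers.setdefault(site_url, set()).add(callback_url)
        return True

    def ping(self, weblog_name, weblog_url):
        """Same arguments as weblogUpdates.ping; notifies every subscriber."""
        notified = 0
        for callback_url in sorted(self.subscribers.get(weblog_url, ())):
            try:
                self.notifier(callback_url, weblog_url)
                notified += 1
            except Exception:
                # An unreachable reader should not block the others.
                continue
        return {"flerror": False, "message": "notified %d subscriber(s)" % notified}


def serve(hub, host="localhost", port=8080):
    """Expose the hub over XML-RPC (host and port are arbitrary choices)."""
    server = SimpleXMLRPCServer((host, port))
    server.register_function(hub.ping, "weblogUpdates.ping")
    server.register_function(hub.subscribe, "feedUpdates.subscribe")
    server.serve_forever()
```

Passing the notifier in as a parameter keeps the fan-out logic separate from the network call, so the subscriber list can be exercised without any readers running.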

Additionally, by using information contained within the RSS feed itself, an RSS reader should be able to determine the XML-RPC server and method to call in order to subscribe to update notifications for a particular feed. Therefore, an RSS reader would, initially, download the RSS feed. Upon detection of this additional information, it should send a subscription request to the server specified in the RSS feed and stop polling the RSS feed at regular intervals. As a safety net, an RSS reader might choose to still poll an RSS feed once every 2 or 3 days, just to make certain that the server handling update notifications didn’t fail for some reason. Aside from that, the RSS reader would only request the RSS feed after it received notification that a particular feed has been updated.
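
Here is a rough sketch of the reader's side of that decision. The text above doesn't say where in the feed the server information would live; this example assumes it sits in the `<cloud>` element that RSS 2.0 defines (with `domain`, `port`, `path`, `registerProcedure` and `protocol` attributes). The hourly polling interval and the function names are placeholders.

```python
import xml.etree.ElementTree as ET

NORMAL_POLL_SECONDS = 60 * 60            # assumed interval when no notifier exists
SAFETY_POLL_SECONDS = 3 * 24 * 60 * 60   # the "every 2 or 3 days" safety net


def find_notification_endpoint(rss_text):
    """Return the XML-RPC URL and method named in the feed, or None."""
    channel = ET.fromstring(rss_text).find("channel")
    cloud = channel.find("cloud") if channel is not None else None
    if cloud is None or cloud.get("protocol") != "xml-rpc":
        return None
    url = "http://%s:%s%s" % (
        cloud.get("domain"),
        cloud.get("port", "80"),
        cloud.get("path", "/"),
    )
    return {"url": url, "method": cloud.get("registerProcedure")}


def poll_interval(rss_text):
    """Poll rarely when the feed offers notifications, normally otherwise."""
    if find_notification_endpoint(rss_text) is not None:
        return SAFETY_POLL_SECONDS
    return NORMAL_POLL_SECONDS
```

A reader would call `find_notification_endpoint` after its first download of the feed, send its subscription request to the returned URL and method, and then fall back to the long safety interval.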

Integrated Server

When a blogging engine supports this behavior natively, the need to send an initial update notification is removed.

When the blogging engine receives a new article from its author, it should send out an update notification to its subscribers in the same fashion as the standalone server.

An integrated server should also accept subscription requests like the standalone server does.

A Problem

In order to receive the update notification, the RSS reader must be capable of listening on an external port, which may not be possible for those RSS Readers that run behind firewalls. In this case, it might be advantageous to implement a stored notification method.

The stored notification method can be implemented in several ways. While it won’t help to reduce unneeded network traffic much if integrated servers also supported a stored notification method, implementing stored notifications as a separate standalone server would be advantageous. An RSS reader would merely inform its notifiers of an alternate location to send updates. When the server at that location receives an update, it would store it until requested by the RSS Reader. Periodically (and possibly more frequently) the RSS Reader would use another method to check for stored notifications.
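
A stored notification server boils down to a mailbox per reader. The sketch below assumes each reader has some identifier it hands to its notifiers. The class and method names are invented for illustration, and everything lives in memory.

```python
from collections import defaultdict


class StoredNotifications:
    """Holds update notifications until a firewalled reader collects them."""

    def __init__(self):
        self._pending = defaultdict(list)  # reader id -> list of updated feed URLs

    def notify(self, reader_id, feed_url):
        """Called by a notifier in place of contacting the reader directly."""
        if feed_url not in self._pending[reader_id]:
            self._pending[reader_id].append(feed_url)
        return True

    def collect(self, reader_id):
        """Called by the reader; returns pending feed URLs and clears them."""
        return self._pending.pop(reader_id, [])
```

The reader still polls, but it polls one small mailbox instead of every feed it follows, and it only fetches the feeds the mailbox lists.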

Support

The problem with a concept like this is that it only really works if large numbers of people support it. Fortunately, most blogs update either blo.gs or weblogs.com. This means that, if a standalone server meant for mass usage were to start by polling weblogs.com and blo.gs for updated weblogs, the system could be used almost immediately regardless of whether the author of the RSS feed you intend to read supports this use or not. However, without an RSS reader supporting this operation, it doesn’t do anyone any good at all.

Comments? Questions? Suggestions?

Inklog: more thoughts

My initial thoughts for InkLog could be summarized in a few simple statements. Everything is a node. All nodes are displayed through templates.

The underlying theory is also simple. What makes a “blog” a “blog” isn’t really anything other than a date sorted collection of data. So, the theory behind InkLog was to make everything within a website detectable and sortable by date.

There are already thousands of applications to upload text files and give them a pretty look and feel. And there are thousands of applications that allow one to upload images and put captions on them. But I wanted InkLog to be able to track ALL changes to a website (or even multiple websites) and then, display those changes to the user. Of course, in addition to this, the existing features of these systems need to be duplicated.

The easiest, fastest, cleanest way that I could see to do this was to store EVERYTHING in a site in the same fashion. With this in place, one simple database query would be enough to gather any information anyone could ever need. All of the articles, forms, journal entries, weblog entries, images, polls, quizzes and static pages in an entire site would be regarded the same. These items would be stored with a few pieces of information attached to them: date, category/path, file type, metadata. Once it was decided which pieces of information should be displayed, the system would merely need to call an output function based on the type of each item in order to display it properly.
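
To make the idea concrete, here is a small sketch of "everything is a node" using Python's built-in SQLite module. It is not InkLog's actual schema: the table name, the column names and the two renderers are assumptions that mirror the fields listed above.

```python
import sqlite3

# One output function per node type (both are placeholders).
RENDERERS = {
    "article": lambda node: "<h2>%s</h2>" % node["data"],
    "image": lambda node: '<img src="%s">' % node["data"],
}


def open_store():
    """Create an in-memory store with a single table for every kind of item."""
    db = sqlite3.connect(":memory:")
    db.row_factory = sqlite3.Row
    db.execute(
        "CREATE TABLE nodes (id INTEGER PRIMARY KEY, date INTEGER, "
        "path TEXT, type TEXT, metadata TEXT, data TEXT)"
    )
    return db


def add_node(db, date, path, node_type, data, metadata=""):
    db.execute(
        "INSERT INTO nodes (date, path, type, metadata, data) VALUES (?, ?, ?, ?, ?)",
        (date, path, node_type, metadata, data),
    )


def recent_nodes(db, limit=10, path=None):
    """One query serves the whole site or a single category/path."""
    if path is None:
        rows = db.execute("SELECT * FROM nodes ORDER BY date DESC LIMIT ?", (limit,))
    else:
        rows = db.execute(
            "SELECT * FROM nodes WHERE path = ? ORDER BY date DESC LIMIT ?",
            (path, limit),
        )
    return rows.fetchall()


def render(node):
    """Pick the output function based on the node's type."""
    return RENDERERS[node["type"]](node)
```

Adding a new kind of content then means adding one entry to `RENDERERS` and leaving the query alone.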

However, the more I think about it, the more difficult it becomes. Let’s say, for instance, that I want to add an RSS Reader to the mix. Okay, that’s simple enough. We’ll create a new object type called “rss feed”, the data of which will be serialized PHP data representing the items in the feed. If a new item is detected, the date on the feed’s node will be updated to the current time, thus making that feed appear to be updated. At regular intervals, via cron, the system can look for all items of type “rss feed”, retrieve newer versions of each feed, and update the item accordingly. Like I said… that’s pretty simple. Or is it?

Let’s say I have 10 feeds added into a category/path named “friendslist”. Examining the system’s nodes is fairly simple. Determining which node was recently updated is fairly simple. Creating a view of that node that will display all of the items in that feed is simple. However, in order for the application to function as I believe it should, each item INSIDE the feed should show as an update of its own. I don’t want to know which feed was updated last, I want to know the details of the actual update that was made in the order in which it was made. Now, this can be done by having the cron job actually add (and delete) nodes from the system as it sees new items. It can even use feed information to help categorize the items into more usable subsets. It’s certainly doable, it’s just a lot more complicated.
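
The add-and-delete step that the cron job would perform can be sketched as a small sync function. This assumes every feed item carries a stable identifier (called `guid` here) and that nodes are plain dictionaries. Both are simplifications for the example.

```python
def sync_feed_nodes(existing, feed_items, now):
    """Make the item nodes match the feed's current contents.

    existing   -- dict mapping guid -> node, modified in place
    feed_items -- list of dicts, each with at least a "guid" key
    now        -- timestamp given to newly created nodes
    Returns (added_guids, removed_guids).
    """
    current = {item["guid"]: item for item in feed_items}
    added = [guid for guid in current if guid not in existing]
    removed = [guid for guid in existing if guid not in current]
    for guid in added:
        # Each new feed item becomes a node with its own date.
        existing[guid] = {"type": "rss item", "date": now, "data": current[guid]}
    for guid in removed:
        # Items that dropped out of the feed lose their node.
        del existing[guid]
    return added, removed
```

Because only new items receive the current timestamp, sorting all nodes by date shows individual updates in the order they were seen.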

It still seems like the right way to go. It seems like this method would make the most sense. In the long run, I’m emulating what the webserver does (statically) already. Only, in this case, everything is templatable, file types can be converted from one to another on the fly, and the data is in a database and is automatically searchable and sortable.

I really wish I had another head around here that I could bounce ideas off of.

Brad in the Hood

Brad and Tydel (http://www.livejournal.com/users/tydel/) reminded me
(http://www.livejournal.com/users/brad/1890025.html) of Brad in the Hood,
a parody I wrote what seems like ages ago.

I miss those days.

Jess is amazing

Allow me a few minutes of your time to express to you just how
wonderful my wife is.

Regardless of how grumpy, irrational, stubborn, or
just plain mean I am being, she supports me publicly. And then, with
compassion and sincerity, she counsels me and comforts me in
private.

And that’s the way it should be.

I love you, Jess.

Inklog: getting it in writing

I’ve done a lot of work on Inklog (http://revjim.net/archives/inklog/)
ever since I decided to rewrite it
(http://revjim.net/archives/2003/04/23/9489.php) so that it doesn’t use
the filesystem as the database. These plans
(http://revjim.net/wiki/InkLog) aren’t completely laid out yet, but
they’re stable enough to begin development. As development proceeds,
I’m certain portions of the structure and design will change.

PHP converts periods to underscores

In case you didn’t know it, PHP converts periods (“.”) to underscores (“_”) in the names of all incoming variables. This includes GET, POST and COOKIE variables. A period is not a valid character in a PHP variable name. If you consider the syntax you would have to use in order to access it, you can see why:

$foo.bar = 'hello';
if($foo.bar == 'hello') {
    $foo.baz = $foo.bar;
}

The period is the concatenation operator in PHP. Therefore, in these constructs, PHP is going to attempt to regard everything after the period as another variable, or as a constant, or as something that needs to be evaluated and then concatenated with the value of $foo. I’m sure, if it were really needed, PHP could implement a special construct to access these (${foo.bar} perhaps). But, if you didn’t realize you had to convert the period in the variable name to an underscore, you probably also wouldn’t realize that you had to use a special construct in order to access it, so, even if this construct did exist, it would defeat its own purpose, I think. [Thanks Keith]
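
One side effect worth seeing is that two different field names can land on the same variable. The sketch below is a Python imitation of the renaming, not PHP's actual code. It models only the documented rule that dots and spaces in incoming names become underscores, and the function names are made up.

```python
def php_external_name(name):
    """Imitate PHP's renaming of incoming variable names (dots and spaces only)."""
    return name.replace(".", "_").replace(" ", "_")


def simulate_request_vars(pairs):
    """Build a dict the way a request array would end up keyed; later values win."""
    result = {}
    for name, value in pairs:
        result[php_external_name(name)] = value
    return result
```

With this, a form that submits both `foo.bar` and `foo_bar` ends up with a single `foo_bar` entry holding whichever value arrived last.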

as MovableType is to TypePad, TextPattern is to …

Dean Allen, author of TextPattern (http://www.textpattern.com/),
announces (http://www.textism.com/article/719/) his blog
hosting service, TextBox.

TypePad, LiveJournal, and my Failure

Everyone else is saying it, so I’m sure you’ve heard already. Six Apart, creators of Movable Type, have announced a new service called TypePad that will be released later this year.

The best way to describe the product they intend to offer is this: it will combine the power of Movable Type with a system as easy to use as LiveJournal, plus all the gizmos and gadgets that bloggers have been tacking on to their sites for years, like server stats, referrer tracking, and blog rolls. No installation will be required, and no hosting company will need to be paid. All of this is included in the package, which, if you couldn’t tell, won’t be free.

While I don’t fault LiveJournal for not getting to this point yet (because where they choose to take their product is their choice), this is how I envisioned it to be 3 years ago when I had root access to all of the servers and was playing a fairly active role in its development. When I began realizing that LiveJournal wasn’t headed in the direction that I envisioned, or, wasn’t headed there fast enough, I started to consider doing it myself. I began coding Inklog with the hopes of creating a product that I would be able to offer to the community and provide specialized, centralized versions of for a price.

In just 18 short months, MovableType (regardless of the bad things I say about it) has revolutionized the blogging world. The features it offers are unparalleled and almost every blogging application built after it has used it as a template for what works.

In a way, it makes me sad to see this happen. Don’t get me wrong, I’m happy for Ben and Mena. They’ve done an amazing thing for the blogging community and have increased the usability and the content value of the Internet as a whole with their product. What makes me sad is that it could have just as easily been me. While their software is certainly unique in its operation, it isn’t anything that I couldn’t have created. The difference is that they did and I didn’t.

I still think that, with a better template system, categories, and customizable single entry pages, LiveJournal would beat MovableType hands down in ease of use. With the addition of referrer tracking, image hosting, site statistics, and trackback, it would even beat what I envision TypePad will be. Especially considering their new URL structure, the recently added RSS 2.0 feeds, and the News Aggregator functionality it provides.

In a lot of ways I wish that Brad Fitzpatrick and I hadn’t had differing visions, because I know that, with a partner, I’m much more motivated to get things done. Perhaps that’s how Ben and Mena have done it so quickly.

In the end, if you want to be successful, the quality of the code doesn’t matter. It’s the functionality it provides, and how long it takes you to build it that does.

I think Andre Torrez (creator of FilePile) said it best:

But I’ll tell you what: ideas are fucking worthless. Anyone could do FilePile. I could write MetaFilter in a day. The only thing special about the code is that it was written. The only thing special about the sites are the users.

Even you can do it.

I’ll keep chugging along with Inklog. Maybe some day, it’ll be where it should have been a year ago. And maybe then, it’ll be worth something. As it is right now, it’s nothing but a collection of bits and bytes that doesn’t do anything for anyone.

speed concerns: database vs. filesystem

While writing Inklog, I’ve been debating with myself regarding my use of the filesystem as a datastore. While the filesystem certainly makes creating, updating, visualizing, backing up, and restoring data much easier than it would be in a database, it adds many hardships. First of all, the convenience of SQL is thrown out the window. While it is nice that using the filesystem doesn’t require a database server, not being able to use a database server means that more programming is involved. Additionally, things like searching through all the entries in the system become difficult, not to mention slow. Another drawback is that the filesystem limits the amount of metadata for each entry that can be kept in a simple fashion.

However, the biggest question on my mind was whether or not using a database server would be faster or slower when performing the most commonly requested actions: getting a list of recent items from the entire system, getting a list of items from a category, and getting one item. I decided to write a test case.

I created 6000 empty files in 20 directories. I also created a table in a mysql database that simulated the filesystem: name, mtime, dir, data. I added indexes on dir, name, and mtime. Then I started testing. In each case the test is run 10 times. Then the average is displayed. For the database tests, mysql_connect is called each time.
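
For anyone who wants to try something similar, here is a rough sketch of the first test. It is not the original test script: it uses Python with an in-memory SQLite table instead of PHP and MySQL, so no connection cost is measured, and it sets file modification times explicitly so the ordering is unambiguous. All function names are made up.

```python
import os
import sqlite3
import time


def build_fixture(root, dirs=20, files_per_dir=300):
    """Create empty files under root and a table that mirrors them."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entries (name TEXT, mtime INTEGER, dir TEXT, data TEXT)")
    db.execute("CREATE INDEX idx_mtime ON entries (mtime)")
    stamp = 1000000000
    for d in range(dirs):
        dirname = "dir%02d" % d
        os.mkdir(os.path.join(root, dirname))
        for f in range(files_per_dir):
            name = "file%04d" % f
            path = os.path.join(root, dirname, name)
            open(path, "w").close()
            stamp += 1  # give every file a distinct modification time
            os.utime(path, (stamp, stamp))
            db.execute("INSERT INTO entries VALUES (?, ?, ?, '')", (name, stamp, dirname))
    return db


def recent_from_filesystem(root, limit=10):
    """Stat every file in every directory, then sort by mtime."""
    found = []
    for dirname in os.listdir(root):
        for name in os.listdir(os.path.join(root, dirname)):
            path = os.path.join(root, dirname, name)
            found.append((os.stat(path).st_mtime, dirname, name))
    found.sort(reverse=True)
    return [(dirname, name) for _, dirname, name in found[:limit]]


def recent_from_database(db, limit=10):
    rows = db.execute("SELECT dir, name FROM entries ORDER BY mtime DESC LIMIT ?", (limit,))
    return rows.fetchall()


def average_time(func, runs=10):
    """Run func the given number of times and return the mean duration in seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / runs
```

Calling `average_time(lambda: recent_from_filesystem(root))` and `average_time(lambda: recent_from_database(db))` on the same fixture gives the two numbers to compare.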

Getting the filenames of the 10 most recent entries from the entire system.

FILESYSTEM

TIME: 1.7814919948578
TIME: 1.7425200939178
TIME: 1.8071219921112
TIME: 1.6778069734573
TIME: 1.6711789369583
TIME: 1.7414019107819
TIME: 1.6959699392319
TIME: 1.6531630754471
TIME: 1.7546479701996
TIME: 1.6758890151978
TOT TIME: 17.201191902161
AVG TIME: 1.7201191902161

DATABASE

TIME: 0.0039100646972656
TIME: 0.001039981842041
TIME: 0.00095093250274658
TIME: 0.00096702575683594
TIME: 0.00095295906066895
TIME: 0.00098395347595215
TIME: 0.0009620189666748
TIME: 0.0009760856628418
TIME: 0.00094294548034668
TIME: 0.00095808506011963
TOT TIME: 0.012644052505493
AVG TIME: 0.0012644052505493

Getting the filenames of the 10 most recent files in a single directory.

FILESYSTEM

TIME: 0.055459976196289
TIME: 0.053847074508667
TIME: 0.044721961021423
TIME: 0.043873071670532
TIME: 0.043742060661316
TIME: 0.043787956237793
TIME: 0.043717980384827
TIME: 0.04374098777771
TIME: 0.043833017349243
TIME: 0.04370105266571
TOT TIME: 0.46042513847351
AVG TIME: 0.046042513847351

DATABASE

TIME: 0.0095839500427246
TIME: 0.0055500268936157
TIME: 0.005547046661377
TIME: 0.0055389404296875
TIME: 0.0056079626083374
TIME: 0.00553297996521
TIME: 0.005499005317688
TIME: 0.0055099725723267
TIME: 0.0053470134735107
TIME: 0.0053049325942993
TOT TIME: 0.059021830558777
AVG TIME: 0.0059021830558777

Getting one item.

FILESYSTEM

TIME: 0.00032293796539307
TIME: 0.00021898746490479
TIME: 0.00017297267913818
TIME: 0.00016999244689941
TIME: 0.00027298927307129
TIME: 0.00017201900482178
TIME: 0.00016689300537109
TIME: 0.00016403198242188
TIME: 0.0001760721206665
TIME: 0.00017201900482178
TOT TIME: 0.0020089149475098
AVG TIME: 0.00020089149475098

DATABASE

TIME: 0.0042519569396973
TIME: 0.0011199712753296
TIME: 0.0010420083999634
TIME: 0.0010360479354858
TIME: 0.0010439157485962
TIME: 0.0010349750518799
TIME: 0.001041054725647
TIME: 0.0010310411453247
TIME: 0.0010330677032471
TIME: 0.0064520835876465
TOT TIME: 0.019086122512817
AVG TIME: 0.0019086122512817

The database was 1360 times faster than the filesystem when looking for the 10 most recent items in the entire system. The database was 7.8 times faster when looking for the 10 most recent items in a single directory. However, the filesystem was 9.5 times faster at getting a single file.
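
The ratios can be checked straight from the TOT TIME lines. Both sides of each comparison are divided by the same number of runs, so the totals give the same ratio as the averages would. The labels and the helper name below are just for this example.

```python
# TOT TIME values copied from the output above, in seconds.
totals = {
    "10 most recent, whole system": {
        "filesystem": 17.201191902161,
        "database": 0.012644052505493,
    },
    "10 most recent, one directory": {
        "filesystem": 0.46042513847351,
        "database": 0.059021830558777,
    },
    "one item": {
        "filesystem": 0.0020089149475098,
        "database": 0.019086122512817,
    },
}


def speedup(slower, faster):
    """How many times faster the quicker method was."""
    return slower / faster


def summarize(results):
    """Return (label, winner, ratio) for each test."""
    summary = []
    for label, times in results.items():
        winner = min(times, key=times.get)
        loser = max(times, key=times.get)
        summary.append((label, winner, speedup(times[loser], times[winner])))
    return summary


if __name__ == "__main__":
    for label, winner, ratio in summarize(totals):
        print("%s: %s wins by a factor of %.1f" % (label, winner, ratio))
```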

These numbers skew greater and greater towards the database as the number of items increases. And, in the one place that the filesystem wins, the operation being performed is so un-time-consuming in general, that the increase in the speed of the filesystem doesn’t amount to much.

These tests were performed with the database being on the same server as the running script. Additionally, the server performing these actions was, basically, not performing anything else at the time. If your database server is only accessible over a 2400bps modem link, your results will differ greatly. Additionally, if your database server is heavily loaded, while your web server isn’t, you may also see very different results.

Benchmarks are crap, for the most part. They don’t really mean a whole lot, unless they represent the exact cases in which you will be using the functions being tested. However, in this case, they DO represent exactly what I will be doing.

What does this mean? Inklog will no longer use the file system as its main method of data storage.

Xaraya uses feedParser

Wow. I didn’t know this until now, but Xaraya uses my feedParser code
(http://revjim.net/code/feedParser/) to handle its RSS functionality.
Very cool.