Home | Photography | Flickr | LiveJournal | Get Firefox

screw you and your database

I've spent the last day or so writing, testing, and benchmarking a few different methods for data storage. The horrible part is, I'm not any closer to having any real solution. Well, maybe a little.

Through the process of elimination I'm left with two different database schemas and can't really decide which would be better suited for my purposes since I haven't any real data to use yet. I generate a bunch of random data but, it's just that… random. It doesn't really reflect, at all, how the data might be used in the real world.

Let me explain the two methods. I'm sure you've seen them before.

I'm trying to make a universal storage mechanism. Yes, I know that means I'm getting most of the shitty features and very few of the good features that an RDBMS provides me. Oh well.

The first method is the Modifed Preorder Tree Traversal method. The second method is it's opposite, the Adjacency list method. Both are methods for tracking child -> parent data relationships.

The MPTT method, is fairly complex. The basic version of it is, by adding two additional id numbers to each item in the list, it is possible to predetermine the order (or… preorder) the list items. Every time a new item is added to list it's position is determined and all of the items to the "right" of it are moved over to make a space for it. It's mostly equivilent to giving each item two pieces of paper and lining the papers up in a straight line. Whenever you want to add an item inside of another, the older that item is (the more left on the line it sits) the more papers have to moved over and renumbered to make a space for it. However, the advantage is that, once everyone is lined up, figuring out who belongs to who is easy as pie — if number "2″ is to the left AND to the right of you, then you are a member of it.

The AL method is a bit easier to understand. Every child is told who its parent is. Then, in order to traverse the list, you start at the top and ask, "who belongs to person A". When those are determined you, again, ask for their children to come forward and you continue until everyone's been asked. This means that no reordering is done when a new item is added. You merely tell that item who its parent is and move on. However, when you want to see who belongs to who, you have to ask a lot of questions because noone knows who their grandfather is… they only know their own parent.

In my test case, with 20,000 random rows in the database, I'm able to obtain the entire tree (which happens very rarely) in 2.1568 seconds using the MPTT method. However, it takes a whopping 22.1202 seconds to gather the same tree using the AL method. Almost seems like a no-brainer, right? Not exactly. You have to remember that very rarely (never?) will someone request to know ALL of the children of the ROOT node. That's just too much data to process. If, in cases of small amounts of data, it wasn't too much to process, it also wouldn't take nearly as long. You see, the length of time required to read using the AL method is directory proportional to the amount of recursion in that list. If there were only ONE parent and everyone else was its child, both methods would be equally as fast.

Writing, however, is another story entirely. With the same test data present, a write on the LEFT side of the tree (the more time consuming side) takes 3.6466 seconds using the MPTT method. Using the AL method, however, only 0.0179 seconds are required. However, one might still be inclined to say that, since reads are generally MUCH more frequent than writes, the MPTT method is worth the wait. However, there are more things to consider.

When the MPTT method is writing, the tree is unreadable. That means that for, in this case, 3.6466 seconds, no one can read anything. Additionally, the AL method's read time was a truly worst case situation for it. In MOST cases, the entire tree will not be read. A more average example of a query shows the MPTT method taking 0.0094 seconds, and the AL method requiring 0.0680 seconds for the same read operation. These are much more livable numbers.

The real kicker comes when we analyze an actual USE of the data. For instance, because this is a universal data storage method, it stands to reason that it should be able to hold its own permission data (read, write, etc)… and it can. However, this introduces layers of complexity in the MPTT method that are difficult to overcome. It also means that, in some cases, the MPTT method gives more data than it should requiring longer query times and post-query processing. It's possible that my SQL skills are not up to par and that there is a better way to do this, however. In many cases, once permissions are factored in, the same read is faster using the AL method than it is with the MPTT method simply because of the additional, unneeded, data being returned by the latter.

20,000 rows may seem like a lot of data for the possible uses of a system like this but, in reality, it isn't. Consider using it for a weblog. Every post is an item. Every post item has several permissions, which are each items. Every comment is an item. Comments also have permissions, which are more items. By design (which could possibly be improved), metadata for items are stored as additional items belonging to the items they describe. This means, even more items in the system. Consider a simple blog like mine. I have about 3000 entries. I also have about 3000 comments. That's 6000 items. Plus their related permissions…. adds 6000 "everbody can read" permissions, plus 3000 "everybody can comment" permissions, plus 3000 "jim can edit" permissions. That's 18,000 items, already. They add up quickly, you see. Waiting 22 seconds to get a list of EVERYTHING I have to offer is not too long considering I have 6000 items to offer. However, waiting 3.6466 seconds to post a comment the first post I ever made is starting to look bad.

MPTT has the advantage that, newer items are updated quicker than older items. And, since newer items are most likely the ones being updated or commented on, this makes a lot of sense. However, AL has lighting fast write times and read times that, when coupled with permissions and reality checks, are fast enough that time doesn't really matter.

I've considered placing the permissions in another table. This might increase the write time of the MPTT method as MANY rows would be removed from the system making its writes faster. However, since permissions items don't have children, they aren't very hard to move. Additionally, since items are always added at the END of a branch (right side) instead of at the beginning (left side), the case where a lot of things would need to be moved gets less and less likely. Additionally, since permission items aren't placed on permission items themselves, it really wont save much in terms of the AL method's read time since it'll never return or check an item that it can't find permission to read.

With the MPTT, the worst case would be a root node with two children in which the second (more RIGHT) child has MANY children of its own and an item is added to the first (more LEFT) child. Then all of the children of the second item have to move over. That's time consuming. In this, very limited, case, adding a child to the root node would be the best case as it would only have to alter the root node itself. Second to that would be adding an item to the second child, as it would only have to move the end points on that second child and the root node.

With AL, the time required to perform a query grows slightly with the number of rows being returned, and grows GREATLY with the amount of recursion invovled in the queries. So the only real determining factor for speed is, how many levels deep is the tree?

It should also be noted that the MPTT method lends itself to doing lots of other less common tasks VERY easily. Fining the ancestoral chain for an item is easy as pie, for instance. Doing so in the AL method requires more recursive queries.

So… what do I do? I have no clue.

Comments and criticism are greatly appreciated. If you need more information on how the particular models are being used in order to provide optimization ideas, or alternate methods of operation, please let me know.

Share and Enjoy:
  • Facebook
  • StumbleUpon
  • Digg
  • e-mail
  • del.icio.us
  • Google
  • Reddit
  • Technorati
  • BlinkList
  • blogmarks
  • Blue Dot
  • description
  • Furl
  • Ma.gnolia
  • MisterWong
  • Netvouz
  • PlugIM
  • Propeller
  • Simpy
  • Spurl
  • TailRank