October 17th, 2002:

feedParser v0.3

feedParser v0.3 has been released.

entities will be my death

PHP’s xml_parse_into_struct() function is severely broken, but I can’t tell whose fault it is. It uses the expat library for parsing, which has become a sort of standard amongst UNIX-based SAX parsers. The problem could be there. Or it could be in PHP’s implementation of expat. Or it might not really be broken at all, merely unusable. The problem has to do with entity parsing.

A little XML background, first. For our purposes, we’ll say an entity can be declared in two places: in the DTD referenced by that ugly “DOCTYPE” line you see at the top of many XML and HTML documents, or in the document itself. Here are the symptoms of this problem:

If a DOCTYPE declaration is given, entities declared in the document are substituted with their replacement values. Entities that are not declared are simply removed from the document.

If a default handler is set with xml_set_default_handler(), entities are passed to the handler unsubstituted, whether or not they are defined locally. This means the handler can do its own entity parsing.

If a DOCTYPE declaration is not given (which is the case with almost EVERY RSS feed out there), every entity other than the five predefined ones (&amp;, &lt;, &gt;, &quot; and &apos;) results in a parser error.

This means that, if a DOCTYPE is present, one can either parse ALL entities manually or have the undeclared ones thrown out. It is possible to use the parser to detect DOCTYPE declarations and fetch the DTDs in order to obtain the additional entities. However, the DOCTYPE declaration is only reported through the default handler, and once a default handler is set no entities are parsed, so all of them have to be handled manually. Expat has a provision for setting a default handler while still having entities parsed (XML_SetDefaultHandlerExpand()), but PHP does not expose it. If a DOCTYPE is not present, one can only pray that entities are not present. That is unusable in my book, as MANY, MANY documents do not have a DOCTYPE declaration.
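To make the no-DOCTYPE symptom concrete, here is a minimal sketch of how it can be reproduced. The feed fragment is made up for illustration, and the exact error text and line number depend on which XML library the PHP build is linked against, so treat this as a way to check the behaviour on your own install rather than as a guaranteed result.

```php
<?php
// Hypothetical feed fragment: no DOCTYPE, one predefined entity (&amp;)
// and one HTML-style entity (&nbsp;) that is declared nowhere.
$xml = '<item><title>Fish &amp; chips&nbsp;today</title></item>';

$parser = xml_parser_create();

// $values and $index are filled in by reference when parsing succeeds.
$ok = xml_parse_into_struct($parser, $xml, $values, $index);

if (!$ok) {
    // Ask the parser what went wrong and where.
    $code = xml_get_error_code($parser);
    echo 'Parse failed: ', xml_error_string($code),
         ' (line ', xml_get_current_line_number($parser), ")\n";
} else {
    echo 'Parsed ', count($values), " nodes\n";
}
```

Swapping &nbsp; for its numeric form &#160; in the fragment is a quick way to confirm that the named entity is what the parser objects to.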

Sure, a DOCTYPE is required in order for an XML document to be valid, but this isn’t about valid XML. This is about being able to parse what users are actively creating on a daily (if not hourly) basis. All HTML documents are supposed to have a DOCTYPE in order to be valid, but if your browser REFUSED to parse them when an entity showed up, I’d guess well over 80% of the web would disappear. Even Mark Pilgrim’s and Dave Winer’s feeds don’t have a DOCTYPE declaration. That doesn’t mean their feeds aren’t well-formed, it just means they aren’t valid. The fact that PHP’s implementation of expat (and perhaps expat itself) dies with a parser error on an undefined entity is absurd. Expat is supposed to be non-validating.

I think I’ve devised a less-than-perfect solution. I’ll define a set of entities within feedParser and replace them in the raw text before parsing the document for data. This way I can ensure that no entity in that set will break the parser, and that every one of them gets parsed.
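A rough sketch of what that pre-pass could look like follows. It is not feedParser’s actual code: the entity table is a tiny assumed sample, and expand_named_entities() is a hypothetical helper name. The idea is to rewrite known named entities as numeric character references, which an XML parser accepts without any DTD.

```php
<?php
// Assumed, deliberately tiny entity table; a real one would cover far more names.
$entities = [
    'nbsp'   => '&#160;',
    'copy'   => '&#169;',
    'eacute' => '&#233;',
    'mdash'  => '&#8212;',
];

// Hypothetical helper: replace each known named entity with its numeric
// character reference before the text is handed to the XML parser.
function expand_named_entities(string $xml, array $entities): string
{
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($entities): string {
            // Names missing from the table (including the five predefined
            // XML entities such as &amp;) pass through untouched.
            return $entities[$m[1]] ?? $m[0];
        },
        $xml
    );
}

$raw = '<title>Caf&eacute;&nbsp;&amp; bar &unknown;</title>';
echo expand_named_entities($raw, $entities), "\n";
// prints: <title>Caf&#233;&#160;&amp; bar &unknown;</title>
```

Two limits are worth keeping in mind. An entity that is not in the table, like &unknown; above, is left alone and will still stop the parser. The regular expression also does not know about CDATA sections, where &nbsp; is literal text and arguably should not be rewritten.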

Wish me luck.


My country loves me so much that they protect me from hackers and thieves by forbidding me from obtaining important security information that may help me defend my electronic resources, and the electronic resources of my corporation, from hackers, thieves, and other mischief makers. In order to assure my safety, my country will even arrest innocent people performing actions considered perfectly legal in their own countries to ensure that this security information never falls on my ears, or the ears of any other member of my country. As long as my country is protecting me, I don’t care if the rest of the world takes us off the map.

I’m sure that, since my country is so concerned about me, one of its citizens, my country is also doing something to protect me from being harmed by the malicious acts of citizens of other nations who might have access to this information, which is clearly useless for anything other than illegal activities. I’m also sure that, in the event that my electronic resources, or those of my corporation, are penetrated and/or stolen due to my country forbidding me from obtaining the free and public information I need to protect myself, they will be more than happy to accept all financial, criminal, civil, and moral responsibility in a court of law.

America stands for freedom. Freedom that is handed out by my country as it sees fit and taken away when my country decides it should be.

Isn’t America great?

[via kasia]