PHP's xml_parse_into_struct() function is severely broken, but I can't tell whose fault it is. It uses the expat library for parsing, which has become a sort of standard among UNIX-based SAX parsers. The problem could be there. Or it could be in PHP's implementation of expat. Or it might not really be broken at all, merely unusable. The problem has to do with entity parsing.
A little XML background, first. For our purposes, we'll say an entity can be declared in two places: in the DTD referenced by that ugly "DOCTYPE" line you see at the top of many XML and HTML documents, or in the document itself. Here are the symptoms of this problem:
If a DOCTYPE declaration is given, entities will be substituted with their replacement values if they are declared in the document. If they are not declared, they are simply removed from the document.
If a default handler is set with xml_set_default_handler(), both defined and (locally) undefined entities are left unsubstituted and passed to the handler. This means that the handler can do its own entity parsing.
If a DOCTYPE declaration is not given (which is the case with almost EVERY RSS feed out there), all non-default entities result in a parser error.
This means that, if a DOCTYPE is present, one can either parse ALL entities manually, or have them thrown out. It is possible to use the parser to detect DOCTYPE declarations and fetch the DTDs in order to obtain the additional entities. However, this means that all entities will have to be handled manually, because the DOCTYPE declaration is only delivered through the default handler, and once that handler is set, no entities are parsed. Expat has a provision that allows a default handler to be set while still having entities parsed (XML_SetDefaultHandlerExpand()), but PHP does not expose it. If a DOCTYPE is not present, one can only pray that entities are not present either. That is unusable, in my book, as MANY, MANY documents do not have a DOCTYPE declaration.
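For anyone who wants to check these symptoms on their own install, here is a minimal sketch. The helper name try_parse_into_struct() and the sample strings are made up for illustration. The results depend on the PHP version and on the XML library PHP was built against, so I'm deliberately not listing an expected output.

```php
<?php
// Hypothetical helper (not part of PHP): parse a string with
// xml_parse_into_struct() and report success or the parser's error message.
function try_parse_into_struct(string $xml): array
{
    $parser = xml_parser_create();
    $values = [];
    $index  = [];

    // Returns 1 on success and 0 on failure.
    $ok = xml_parse_into_struct($parser, $xml, $values, $index);
    $error = $ok ? null : xml_error_string(xml_get_error_code($parser));

    return ['ok' => (bool) $ok, 'error' => $error, 'values' => $values];
}

// Illustrative inputs only; none of these come from a real feed.
$samples = [
    'predefined entity, no DOCTYPE'      => '<title>Fish &amp; chips</title>',
    'undeclared entity, no DOCTYPE'      => '<title>Caf&eacute;</title>',
    'entity declared in internal subset' =>
        '<!DOCTYPE title [<!ENTITY eacute "&#233;">]><title>Caf&eacute;</title>',
];

foreach ($samples as $label => $xml) {
    $result = try_parse_into_struct($xml);
    echo $label, ': ',
        $result['ok'] ? 'parsed' : 'error (' . $result['error'] . ')',
        "\n";
}
```

If the behaviour described above holds on your build, the second sample is the one that should come back as an error.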
Sure, a DOCTYPE is required in order for an XML document to be valid, but this isn't about valid XML. This is about being able to parse what users are actively creating on a daily (if not hourly) basis. All HTML documents are supposed to have a DOCTYPE in order to be valid, but if your browser REFUSED to parse them when an entity showed up, I'd guess well over 80% of the web would disappear. Even Mark Pilgrim's and Dave Winer's feeds don't have a DOCTYPE declaration. That doesn't mean their feeds aren't well-formed; it just means they aren't valid. The fact that PHP's implementation of expat (and perhaps expat itself) dies with a parser error on an undefined entity is absurd. Expat is supposed to be non-validating.
I think I've devised a less-than-perfect solution. I'll define a set of entities within feedParser and substitute them in the raw text before parsing the document for data. This way I can ensure that no entities in the XML data will break the parser, and that all entities are parsed.
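A rough sketch of what that pre-pass might look like, assuming a hand-maintained entity table. The function name feedparser_replace_entities(), the table contents, and the sample string are all hypothetical, not actual feedParser code. Known named entities become numeric character references, the five predefined XML entities are left alone, and anything unrecognized has its ampersand escaped so it survives as literal text instead of killing the parser.

```php
<?php
// Hypothetical helper: rewrite named entities before the XML parser sees them.
// $entities maps an entity name to its Unicode code point.
function feedparser_replace_entities(string $xml, array $entities): string
{
    // These five are predefined by XML itself and must be left untouched.
    $predefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($entities, $predefined): string {
            $name = $m[1];
            if (in_array($name, $predefined, true)) {
                return $m[0];                           // keep as-is
            }
            if (isset($entities[$name])) {
                return '&#' . $entities[$name] . ';';   // numeric reference
            }
            return '&amp;' . $name . ';';               // unknown: keep as text
        },
        $xml
    );
}

// A tiny illustrative table; a real one would cover the full HTML entity set.
$entities = ['nbsp' => 160, 'copy' => 169, 'eacute' => 233, 'mdash' => 8212];

$xml   = '<item><title>Caf&eacute; &amp; bar &copy; 2003 &bogus;</title></item>';
$clean = feedparser_replace_entities($xml, $entities);

// The cleaned string contains only references every XML parser understands.
$parser = xml_parser_create();
$ok = xml_parse_into_struct($parser, $clean, $values, $index);
echo $ok ? "parsed\n" : "still broken\n";
```

The "less than perfect" part: a plain regular expression knows nothing about CDATA sections or comments, so it will rewrite entity-looking text in there too, and it ignores any entities the document declares for itself in a DOCTYPE.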
Wish me luck.











