revjim.net

HTML entities in RSS titles and descriptions

I attempted to add my RSS 1.0 feed to Syndic8 and it was rejected because my feed contains a "rogue & in titles". The problem they are referring to comes from my post about the French language, because of the "&" used to make the international entity in the title (&ccedil; makes a ç).

First of all, I disagree with the rejection of my feed. What if "fran&ccedil;ais: partie un" was my desired title? My feed validates as proper RSS/XML, so why is this reviewer judging my feed based on content (aside from my feed simply being full of test posts)? From what I can tell, it is allowable, and in fact the default, to encode RSS items as "text/html". The data should be XML-decoded (therefore changing "fran&amp;ccedil;ais" back into "fran&ccedil;ais") and sent to the client that way. The client should then determine, using the specifications of RSS, how to display the content: as text/plain, as text/html, or in some other encoding altogether.

Am I incorrect? Should all RSS data be designed to decode into "text/plain"? Is HTML not allowed in an article title or description? Is it common practice to use HTML inside those RSS tags? Is this strictly forbidden in the RSS specification in some place I haven't found (which is possible since the specification is a lengthy read, and I haven't gone over all of it)?

If I am incorrect, and RSS data should not be encoded as "text/html", then what is the proper way to include an international character in the title of a post, or in the description of a post? If I am typing in plain text, how would I create any of the international characters? I could enter them XML encoded, which I do anyway, and then tell my blogging engine NOT to encode the data as XML. Then, I would be storing XML in my database, and decoding and reencoding for HTML (easy as pie since the two are basically synonymous) would be done during page generation.
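The two approaches above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how my blogging engine or Syndic8 actually works: the variable names are made up, and the standard library's XML and HTML helpers stand in for whatever a real feed generator or reader would use.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Approach 1: the title is stored as HTML ("text/html") and escaped once for XML.
html_title = "fran&ccedil;ais: partie un"
xml_title = escape(html_title)            # "fran&amp;ccedil;ais: partie un"

# An XML parser removes exactly that one layer and hands the HTML back.
item = ET.fromstring("<item><title>" + xml_title + "</title></item>")
parsed = item.find("title").text          # "fran&ccedil;ais: partie un"

# A client that treats the title as HTML resolves the entity itself.
rendered = html.unescape(parsed)          # "français: partie un"

# Approach 2: the title is stored as plain text and the serializer does the work.
plain = ET.Element("title")
plain.text = "français: partie un"
serialized = ET.tostring(plain, encoding="unicode")  # keeps the raw character
ascii_bytes = ET.tostring(plain)          # ASCII output uses a numeric reference

print(xml_title)
print(rendered)
print(serialized)
print(ascii_bytes)
```

Either way the feed is well-formed XML. The difference is only in what the reader is expected to do with the title after parsing: treat it as HTML (approach 1) or as plain text (approach 2).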

The funniest part is that Syndic8.com doesn't even remain consistent here. On the notes page for my feed, you'll see that the note is encoded into HTML before display, signifying that the note itself should be entered in "text/plain". However, on the action log page you'll see that the same data has not been encoded into HTML, but instead sent straight to the browser, indicating that notes should be entered in "text/html". Well, which is it? It's hard to tell exactly how the note author actually typed the note, and how Syndic8 is processing it. Either way, Syndic8 is not doing the right thing, and it's possible that the note author wasn't doing the right thing either.

On closer inspection it appears that Syndic8 is really screwed up. On the "action log" page, looking at the HTML source shows the note as "Rogue & in titles". However, on the "notes" page the source shows "Rogue &amp;amp; in titles". If we assume that Syndic8 stores the note exactly as it was typed by the author, then no encoding takes place on the "action log" page, and the data is encoded twice on the "notes" page. Either that or the note is decoded on the "action log" page and encoded on the "notes" page.

Update: I went ahead and entered my own "note" to test Syndic8's handling of ampersands in note fields. I typed it as "& &amp;". On the "notes" page, Syndic8 shows that it has indeed encoded what I typed twice. Perhaps it is encoding before it stores it in the database, and then again before it displays it. I attempted to check the "action log" page, but it appears there is a bug in the Syndic8 code, because it is showing the same note twice there. However, based on how the "notes" page is displaying my data, and how the other data was displayed, I can make this fairly safe assumption. Syndic8 encodes the data before it stores it. On the "action log" page, it decodes the data (why, I'm not sure) and then offers it to the browser. On the "notes" page, it encodes the data (again) and then offers it to the browser. Therefore, if my assumptions are correct, my note would display as "& &" on the "action log" page, but only because browsers are not strict. In the HTML source my note would appear as "& &amp;", which is not valid, as the page would then contain a "rogue &", the same thing they are accusing me of when, in actuality, mine does not contain a "rogue &", merely an "&" that the reviewer did not believe should be there. Their output would not validate. Mine does.
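The pipeline I am guessing at can be simulated with Python's `html` module. To be clear, this is a sketch of my assumption (encode on store, decode for the "action log" page, encode again for the "notes" page), not Syndic8's actual code. The sample input, a bare ampersand followed by an already-encoded one, and all the variable names are hypothetical.

```python
import html

typed = "& &amp;"   # sample note text: one bare "&", one pre-encoded "&amp;"

# Assumed step 1: the note is HTML-encoded once before it is stored.
stored = html.escape(typed, quote=False)

# Assumed step 2: the "action log" page decodes the stored value before output.
action_log_source = html.unescape(stored)

# Assumed step 3: the "notes" page encodes the stored value a second time.
notes_source = html.escape(stored, quote=False)

# What a lenient browser would show for each page's HTML source.
action_log_shown = html.unescape(action_log_source)
notes_shown = html.unescape(notes_source)

print("stored:            ", stored)
print("action log source: ", action_log_source)
print("action log shown:  ", action_log_shown)
print("notes source:      ", notes_source)
print("notes shown:       ", notes_shown)
```

Under these assumptions the "action log" source ends up with a bare "&" in it, while the "notes" page shows the once-encoded text to the reader instead of what was typed. Encoding exactly once, at output time, would avoid both problems.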

I think I've spent way too much time on this.
