revjim.net

September 25th, 2002:

that’s MR grump to you, asshole

I’m broke. That alone sucks. As of about 6 hours ago, I’m also out of cigarettes. That sucks worse. In fact, if there were some device that recorded everything I have ever said, ever, one might hear this interesting tidbit coming from my mouth less than a year ago: “If I were poor, I mean really poor, I think I’d rather go hungry than go without a cigarette”.

Wish me (and Jess… sorry, babe) luck. I’ll have money again on Friday.

MovableType and the encoding of special characters

In my frustration with RSS I’ve learned a few things.

MovableType, regardless of your instructions to “encode_xml”, will not detect international characters (entered with 8-bit ASCII) and convert them into HTML entities. This means that if you type an 8-bit ASCII character into your “subject” field, you will have an 8-bit ASCII character sent to the browser, an 8-bit ASCII character in your static files, and an 8-bit ASCII character in your RSS feed. This is not good. By making certain that the proper character set is declared in all the right places at all the right times, it is probably possible to send an occasional international character in a form that the majority of readers can decode. However, I know very little about how to do this and I don’t feel like learning.

RSS is XML. Period. Because of that, proper XML entity encoding can be used. However, proper XML encoding of the letter “e” with an acute accent over it is not “&eacute;”. This is the HTML encoding and is not accepted in XML. Instead, the decimal or hexadecimal value of the character’s code point should be encoded as an entity: “&#233;” and “&#xE9;”, respectively. Doing this, however, does not cause MovableType to handle the situation any more elegantly. The & at the beginning of the entity is still encoded by MovableType (because of the “encode_xml” directive), leaving the data still encoded after decoding.
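A minimal Python sketch of that point: a plain XML parser with no DTD rejects the HTML named entity (&eacute;) but accepts the decimal (&#233;) and hexadecimal (&#xE9;) character references. Python, the helper name parses_as_xml, and the sample word are my choices for illustration only; none of this is MovableType code.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML for a parser with no DTD."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample titles, one per encoding style.
named = "<title>caf&eacute;</title>"        # HTML named entity, undefined in plain XML
decimal = "<title>caf&#233;</title>"        # decimal character reference
hexadecimal = "<title>caf&#xE9;</title>"    # hexadecimal character reference

print(parses_as_xml(named))                 # the named entity is rejected
print(parses_as_xml(decimal))               # the decimal reference is accepted
print(ET.fromstring(hexadecimal).text)      # the hex reference decodes to the accented letter
```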

There is no real way to know if HTML is allowed in “title” or “description” sections of RSS. RSS 0.92 (which is supposed to merely be the documentation and specification of RSS 0.91 as it was being used) states that HTML can exist in either of these fields provided it is entity-escaped. RSS 0.91 specifies that HTML cannot exist, and it isn’t mentioned in RSS 0.9 though we can assume no. RSS 1.0 might allow it, but only with the content module. However, in practice, things are much different. HTML is used in “description” sections. HTML is not used in “title” sections. So, to be safe, don’t include HTML in your “title” or “description” sections if you want to ensure that everyone, everywhere, can read everything you produce. If you don’t care about that, then go do whatever you want and stop reading now.

It seems that using UTF-8 (or some other widely accepted international character set) is the most proper way to handle international characters, even on an occasional basis. Further research is required.

MovableType does not currently handle the use of XML/HTML entities gracefully in any context. This means that, if the user chooses not to use UTF-8, or something similar, the only option is to turn off the “encode_xml” attribute on all fields and ENSURE that ALL data that is intended to be delivered via XML be properly XML encoded by hand. This is possible only because HTML and XML have similar entity encoding methods. If they didn’t, it might very well be impossible.

It is my opinion that MovableType should either perform an XML decode (which includes international character entities) before performing an XML encode (not a good idea, because data can be lost) or that it learn how to properly encode 8-bit ASCII characters. But it really isn’t their fault.
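To make the second option concrete, here is a rough sketch of what “properly encoding 8-bit characters” could look like. It is written in Python purely for illustration (MovableType is not written in Python, and encode_for_xml is a made-up name), and it assumes the input is plain text rather than already-encoded HTML.

```python
from xml.sax.saxutils import escape

def encode_for_xml(text: str) -> str:
    """Escape XML markup characters, then turn non-ASCII characters into numeric references."""
    # Step 1: escape &, < and > while the text is still plain text.
    escaped = escape(text)
    # Step 2: replace every non-ASCII character with a decimal character reference.
    # Doing this after step 1 keeps the & of the new references from being escaped again.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(encode_for_xml("français & friends"))
```

The order of the two steps is the whole trick: escaping after the references are created would produce the double-encoded data described above.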

The whole situation is a mess, because it is unclear which content-type should be used to enter data in the various fields of MovableType. For instance, in the entry field, one generally types (or should type) properly encoded HTML. If MovableType then sends the full content of a post in an RSS feed, it should either perform an HTML decode and then an XML encode, or just send it in a CDATA container and let the client know that it is text/html. MovableType is doing the right thing. The data is being encoded to be safe to transmit via XML.
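The CDATA option can be sketched the same way. This is an illustration only, assuming a hypothetical entry body; it shows that HTML placed in a CDATA container comes out of an XML parser untouched, entities and all, so the client has to be told to treat it as text/html.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry body, typed as properly encoded HTML.
html_body = "<p>Caf&eacute; &amp; bar</p>"

# Wrap the HTML in a CDATA container; this only works if the body never contains "]]>".
item = "<description><![CDATA[" + html_body + "]]></description>"

# The parser hands back the HTML exactly as it was typed.
recovered = ET.fromstring(item).text
print(recovered == html_body)
```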

Take this as an example. Let’s assume that I want the title of a post to read “& and its friend &”. This particular line of text looks very different in various encoding methods. In plain text, it appears as above: “& and its friend &”. In HTML and XML encoding, it would appear differently: “&amp; and its friend &amp;”. If a client isn’t decoding the XML data that is being sent to it, then it is just plain stupid. After decoding the XML representation of the post’s title, it should read as we intended: “& and its friend &”. Unfortunately, it isn’t obvious how this should be entered into MovableType. If we type the data as HTML, then MovableType can display it straight to the client browser with no modification. However, if we type it as plain text, it must first be HTML encoded. Additionally, if we type the data as plain text, it must first be XML encoded before being safe to travel via RSS. However, if we type the data as HTML, then it must first be decoded into plain text, and then reencoded into XML (which, in this case, results in the same string, but it might not always).
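The same example, walked through in a short Python sketch (Python's standard escaping helpers stand in for whatever MovableType does internally, which is an assumption on my part). It also shows what happens when already-encoded data gets encoded a second time.

```python
import html
from xml.sax.saxutils import escape, unescape

plain = "& and its friend &"

# Plain text encoded once, for XML and for HTML. For this string the two agree.
as_xml = escape(plain)
as_html = html.escape(plain)
print(as_xml)                        # &amp; and its friend &amp;
print(as_xml == as_html)

# A client that decodes the XML once gets the intended title back.
print(unescape(as_xml) == plain)

# Encoding data that was already encoded gives double-encoded output,
# and a single decode no longer recovers the plain text.
double = escape(as_xml)
print(double)                        # &amp;amp; and its friend &amp;amp;
print(unescape(double) == as_xml)
```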

Since MovableType doesn’t dictate that ANY field be of any certain content-type, the user is left to choose. Here is what you need to remember, as a user: whatever you do, do it with consistency. The same way, every time, all the time.

If you choose to enter HTML in your fields, then always encode your fields properly. MovableType can’t decode and reencode all characters properly, so we are lucky that HTML and XML encode in similar fashion. The rule is, all XML encoding (except the CDATA container) works in HTML. Not all HTML encoding works in XML. So, use XML encoding and don’t use CDATA containers. Make sure that your templates do not contain “encode_html” or “encode_xml” directives for those fields you choose to enter as HTML or you will end up with double encoded data, which is very bad. Additionally, you should make certain your templates contain the content-type information whenever possible.

If you choose to enter plain text into your fields, then use ONLY plain text. Don’t type any HTML. Don’t use any entity encoding. Let MovableType do it all. This means you must use UTF-8 or some other acceptable international character set, and alter your templates to reflect the use of that character set. Make certain that you are using the “encode_html” and “encode_xml” directives for fields that are being entered as plain text.

You can choose different fields to be entered in different fashions, as long as you always do it the same. You cannot, however, choose different fields to be in different character sets. This means that, if you choose to use plain text for any field, then you should always use a UTF-8 or other international character set for all fields. Either that, or never use ANY special entities in the fields entered as plain text (aside from those that can be typed in plain 7-bit ASCII).

My recommendation is as follows. All titles and category names should be entered as plain text. Special characters should not be used in them unless you intend to use a character set designed for those characters. Because MovableType automatically strips HTML tags from the content of your post if you don’t provide an excerpt, it is best to use plain text when providing an excerpt of your own. This means no HTML can be included, and new lines are useless. The actual body of the post should be entered in HTML as anything else would really defeat the purpose of all this.

My head hurts.

HTML entities in RSS titles and descriptions

I attempted to add my RSS 1.0 feed to Syndic8 and it was rejected because my feed contains a “rogue & in titles”. The problem they are discussing comes from my post about the French language because of the “&” used to make the international entity in the title (&ccedil; makes a ç).

First of all, I disagree with the rejection of my feed. What if “fran&ccedil;ais: partie un” was my desired title? My feed validates as proper RSS/XML, so why is this reviewer judging my feed based on content (aside from my feed simply being full of test posts)? From what I can tell, it is allowable, and in fact the default, to encode RSS items as “text/html”. The data should be unencoded (therefore, changing “fran&ccedil;ais” back into “français”) and sent to the client that way. The client should then determine, using the specifications of RSS, how to display the content; either as text/plain, as text/html, or in some other encoding altogether.
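Roughly, the two decoding steps argued for here look like this in Python. The feed fragment is a hypothetical reconstruction, assuming the title reaches the feed with its & escaped once by “encode_xml”; the actual feed is not shown in this post.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the HTML entity in the title has had its & XML-escaped once.
feed_item = "<item><title>fran&amp;ccedil;ais: partie un</title></item>"

# Step 1: the XML parser undoes the XML encoding. The result is valid text/html.
xml_decoded = ET.fromstring(feed_item).find("title").text
print(xml_decoded)       # fran&ccedil;ais: partie un

# Step 2: a client that treats the title as text/html decodes the HTML entity.
as_html_text = html.unescape(xml_decoded)
print(as_html_text)      # français: partie un
```

A client that treats the title as text/plain would stop after step 1 and show the entity literally, which is exactly the ambiguity in question.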

Am I incorrect? Should all RSS data be designed to decode into “text/plain”? Is HTML not allowed in an article title or description? Is it common practice to use HTML inside those RSS tags? Is this strictly forbidden in the RSS specification in some place I haven’t found (which is possible since the specification is a lengthy read, and I haven’t gone over all of it)?

If I am incorrect, and RSS data should not be encoded as “text/html”, then what is the proper way to include an international character in the title of a post, or in the description of a post? If I am typing in plain text, how would I create any of the international characters? I could enter them XML encoded, which I do anyway, and then tell my blogging engine NOT to encode the data as XML. Then, I would be storing XML in my database, and decoding and reencoding for HTML (easy as pie since the two are basically synonymous) would be done during page generation.

The funniest part is that Syndic8.com doesn’t even remain consistent here. On the notes page for my feed, you’ll see that the note is encoded into HTML before display, signifying that the note itself should be entered in “text/plain”. However, on the action log page you’ll see that that same data has not been encoded into HTML, but instead sent straight to the browser, indicating that notes should be entered in “text/html”. Well, which is it? It’s hard to tell exactly how the note author actually typed the note, and how Syndic8 is processing it. Either way, Syndic8 is not doing the right thing, and it’s possible that the note author wasn’t doing the right thing either.

With closer inspection it appears that Syndic8 is really screwed up. On the “action log” page, looking at the HTML source shows the note as “Rogue & in titles”. However, on the “notes” page it shows “Rogue &amp;amp; in titles”. If we assume that Syndic8 stores the note exactly as it was typed by the author, then no encoding takes place on the “action log” page, and the data is encoded twice on the “notes” page. Either that or the note is decoded on the “action log” page and encoded on the “notes” page.

Update: I went ahead and entered my own “note” to test Syndic8’s handling of ampersands in note fields. I typed it as “& &”. On the “notes” page, Syndic8 shows that it has indeed encoded what I typed twice. Perhaps it is encoding before it stores it in the database, and then again before it displays it. I attempted to check the “action log” page, but it appears there is a bug in the Syndic8 code, because it is showing the same note twice there. However, based on how the “notes” page is displaying my data, and how the other data was displayed, I can make this fairly safe assumption. Syndic8 encodes the data before it stores it. On the “action log” page, it decodes the data (why, I’m not sure) and then offers it to the browser. On the “notes” page, it encodes the data (again) and then offers it to the browser. Therefore, if my assumptions are correct, my note would display as “& &” on the “action log” page, but only because browsers are not strict. In the HTML source my note would display as “& &”, which is not valid as the page would then contain a “rogue &”, the same thing they are accusing me of when, in actuality, mine does not contain a “rogue &”, merely an “&” that the reviewer did not believe should be there. Their output would not validate. Mine does.
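The pipeline I am assuming can be simulated in a few lines of Python. To be clear, this models my guess about Syndic8's behavior (encode on store, decode for the “action log” page, encode again for the “notes” page); the function names are invented and nothing here is taken from Syndic8's actual code.

```python
import html

def store(typed: str) -> str:
    """Assumed behavior: the note is encoded once before it goes into the database."""
    return html.escape(typed, quote=False)

def action_log_source(stored: str) -> str:
    """Assumed behavior: the 'action log' page decodes the stored note before output."""
    return html.unescape(stored)

def notes_source(stored: str) -> str:
    """Assumed behavior: the 'notes' page encodes the stored note a second time."""
    return html.escape(stored, quote=False)

typed = "& &"
stored = store(typed)

print(stored)                      # &amp; &amp;
print(action_log_source(stored))   # & &  (bare ampersands in the page source)
print(notes_source(stored))        # &amp;amp; &amp;amp;  (double encoded)
```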

I think I’ve spent way too much time on this.

subconscious statement

I can’t think of a UNIX command that starts with these letters, or an application that contains these letters in this order in its name. I also cannot remember typing these letters in this order in the past 24 hours or so. Yet somehow, in my command-line bar, they sit there:

numb