In my frustration with RSS I've learned a few things.

MovableType, regardless of your instructions to "encode_xml", will not detect international characters (entered as 8-bit, non-ASCII characters) and convert them into HTML entities. This means that if you type an 8-bit character into your "subject" field, that same raw character is sent to the browser, written to your static files, and placed in your RSS feed. This is not good. If the proper character set were declared in all the right places at all the right times, it should be possible to send an occasional international character in a form that the majority of readers can decode and display. However, I know very little about how to do this and I don't feel like learning.
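To make the problem concrete, here is a small Python sketch (standard library only, and not anything MovableType itself does) of what happens when a single 8-bit byte is read under the wrong character set. The sample character and the choice of Latin-1 as the original encoding are assumptions for illustration.

```python
# One accented character, e with an acute accent (U+00E9),
# stored as a single 8-bit byte. Latin-1 is assumed here.
raw = "\u00e9".encode("latin-1")
print(raw)  # b'\xe9'

# A reader that was told the data is Latin-1 recovers the character.
latin1_ok = raw.decode("latin-1") == "\u00e9"
print(latin1_ok)  # True

# A reader that assumes UTF-8 cannot decode the lone byte at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False

# The same character encoded as UTF-8 takes two bytes.
utf8_bytes = "\u00e9".encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'
```

The byte itself carries no hint about which character set produced it, which is why the declared character set has to match at every step.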

RSS is XML. Period. Because of that, proper XML entity encoding can be used. However, proper XML encoding of the letter "e" with an acute accent over it is not "&eacute;". This is the HTML encoding and is not accepted in XML. Instead, the decimal or hexadecimal value of the character's code point should be encoded as a numeric entity: "&#233;" and "&#xE9;", respectively. Doing this, however, does not cause MovableType to handle the situation any more elegantly. The & at the beginning of the entity is still encoded by MovableType (because of the "encode_xml" directive), so the entity is still sitting there as literal text after the client decodes the XML.
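A short Python sketch can show both halves of this. It uses the standard library's XML parser as a stand-in for a feed reader and `xml.sax.saxutils.escape` as a stand-in for an "encode_xml"-style step; both stand-ins, and the helper name `parses_as_xml`, are assumptions for illustration rather than MovableType's actual code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well-formed inside a <title> element."""
    try:
        ET.fromstring(f"<title>{fragment}</title>")
        return True
    except ET.ParseError:
        return False

named = "caf&eacute;"       # HTML named entity, undefined in plain XML
decimal = "caf&#233;"       # decimal numeric character reference
hexadecimal = "caf&#xE9;"   # hexadecimal numeric character reference

print(parses_as_xml(named))        # False
print(parses_as_xml(decimal))      # True
print(parses_as_xml(hexadecimal))  # True

# Escaping text that already contains an entity encodes its leading "&".
double_encoded = escape(decimal)
print(double_encoded)  # caf&amp;#233;

# After one round of XML decoding the entity is left behind as literal text.
decoded_once = ET.fromstring(f"<title>{double_encoded}</title>").text
print(decoded_once)  # caf&#233;
```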

There is no real way to know if HTML is allowed in "title" or "description" sections of RSS. RSS 0.92 (which is supposed to merely be the documentation and specification of RSS 0.91 as it was being used) states that HTML can exist in either of these fields provided it is entity-escaped. RSS 0.91 specifies that HTML cannot exist, and it isn't mentioned in RSS 0.9 though we can assume no. RSS 1.0 might allow it, but only with the content module. However, in practice, things are much different. HTML is used in "description" sections. HTML is not used in "title" sections. So, to be safe, don't include HTML in your "title" or "description" sections if you want to ensure that everyone, everywhere, can read everything you produce. If you don't care about that, then go do whatever you want and stop reading now.

It seems that using UTF-8 (or some other widely accepted international character set) is the most proper way to handle international characters, even on an occasional basis. Further research is required.
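As a rough illustration of the UTF-8 route, the following Python sketch builds a single feed item from plain text and serialises it with an XML declaration that names the encoding. The item and its title are made up, and this shows only what a UTF-8 feed fragment looks like, not how MovableType generates one.

```python
import xml.etree.ElementTree as ET

# Hypothetical item; the title is plain text with one accented character.
item = ET.Element("item")
ET.SubElement(item, "title").text = "Caf\u00e9 & friends"

# Serialise as UTF-8 bytes with an XML declaration naming the encoding.
# (The xml_declaration argument needs Python 3.8 or later.)
feed_bytes = ET.tostring(item, encoding="utf-8", xml_declaration=True)
print(feed_bytes)

# A parser that honours the declaration recovers the original text.
round_trip = ET.fromstring(feed_bytes).find("title").text
print(round_trip == "Caf\u00e9 & friends")  # True
```

The accented character travels as raw UTF-8 bytes with no entity at all, and only the ampersand needs XML encoding.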

MovableType does not currently handle the use of XML/HTML entities gracefully in any context. This means that, if the user chooses not to use UTF-8, or something similar, the only option is to turn off the "encode_xml" attribute on all fields and ENSURE that ALL data that is intended to be delivered via XML be properly XML encoded by hand. This is possible only because HTML and XML have similar entity encoding methods. If they didn't, it might very well be impossible.

It is my opinion that MovableType should either perform an XML decode (which includes international character entities) before performing an XML encode (not a good idea, because data can be lost) or that it learn how to properly encode 8-bit (non-ASCII) characters. But it really isn't their fault.

The whole situation is a mess, because it is unclear which content-type should be used to enter data in the various fields of MovableType. For instance, in the entry field, one generally types (or should type) properly encoded HTML. If MovableType then sends the full content of a post in an RSS feed, it should either perform an HTML decode and then an XML encode, or just send it in a CDATA container and let the client know that it is text/html. MovableType is doing the right thing. The data is being encoded to be safe to transmit via XML.
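The two options mentioned above can be sketched in Python with the standard library. The function names `html_text_to_xml` and `wrap_in_cdata` are hypothetical, and this is only one way such a step could look, not a description of MovableType's behaviour.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def html_text_to_xml(html_text: str) -> str:
    """Decode HTML entities to plain text, then encode that text for XML."""
    return escape(html.unescape(html_text))

def wrap_in_cdata(markup: str) -> str:
    """Wrap markup in a CDATA section, splitting any literal ']]>' inside it."""
    return "<![CDATA[" + markup.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Option 1: HTML decode, then XML encode.
entered = "caf&eacute; &amp; cr&egrave;me"
print(ascii(html_text_to_xml(entered)))  # 'caf\xe9 &amp; cr\xe8me'

# Option 2: send the HTML untouched inside a CDATA container.
body = "<p>Tom &amp; Jerry</p>"
cdata = wrap_in_cdata(body)
print(cdata)  # <![CDATA[<p>Tom &amp; Jerry</p>]]>

# An XML parser hands the client back exactly the HTML that was wrapped.
parsed = ET.fromstring(f"<description>{cdata}</description>").text
print(parsed == body)  # True
```

Note that the first option decodes entities only. Any tags in the input are escaped along with everything else, so they reach the client as entity-escaped HTML rather than being stripped.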

Take this as an example. Let's assume that I want the title of a post to read "& and its friend &". This particular line of text looks very different in various encoding methods. In plain text, it appears as above: "& and its friend &". In HTML and XML encoding, it would appear differently: "&amp; and its friend &amp;". If a client isn't decoding the XML data that is being sent to it, then it is just plain stupid. After decoding the XML representation of the post's title, it should read as we intended: "& and its friend &". Unfortunately, it isn't obvious how this should be entered into MovableType. If we type the data as HTML, then MovableType can display it straight to the client browser with no modification. However, if we type it as plain text, it must first be HTML encoded. Additionally, if we type the data as plain text, it must first be XML encoded before being safe to travel via RSS. However, if we type the data as HTML, then it must first be decoded into plain text, and then reencoded into XML (which, in this case, results in the same string, but it might not always).
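The same example can be traced in Python using the standard library's `html` and `xml.sax.saxutils` helpers as stand-ins for the HTML and XML encoding steps. The variable names and the second sample string with "&copy;" are assumptions added for illustration.

```python
import html
from xml.sax.saxutils import escape, unescape

title = "& and its friend &"

# Plain text encoded for HTML and for XML: identical in this case.
as_html = html.escape(title, quote=False)
as_xml = escape(title)
print(as_html)  # &amp; and its friend &amp;
print(as_xml)   # &amp; and its friend &amp;

# A client that decodes the XML gets the intended text back.
print(unescape(as_xml))  # & and its friend &

# Typed as HTML instead: decode to plain text first, then encode for XML.
typed_as_html = "&amp; and its friend &amp;"
for_feed = escape(html.unescape(typed_as_html))
print(for_feed == typed_as_html)  # True

# With an HTML-only entity the result is no longer the same string.
other_html = "&copy; &amp; friends"
other_feed = escape(html.unescape(other_html))
print(ascii(other_feed))         # '\xa9 &amp; friends'
print(other_feed == other_html)  # False
```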

Since MovableType doesn't dictate that ANY field be of any certain content-type, the user is left to choose. Here is what you need to remember, as a user: whatever you do, do it with consistency. The same way, every time, all the time.

If you choose to enter HTML in your fields, then always encode your fields properly. MovableType can't decode and reencode all characters properly, so we are lucky that HTML and XML encode in similar fashion. The rule is, all XML encoding (except the CDATA container) works in HTML. Not all HTML encoding works in XML. So, use XML encoding and don't use CDATA containers. Make sure that your templates do not contain "encode_html" or "encode_xml" directives for those fields you choose to enter as HTML or you will end up with double encoded data, which is very bad. Additionally, you should make certain your templates contain the content-type information whenever possible.

If you choose to enter plain text into your fields, then use ONLY plain text. Don't type any HTML. Don't use any entity encoding. Let MovableType do it all. This means you must use UTF-8 or some other acceptable international character set, and alter your templates to reflect the use of that character set. Make certain that you are using the "encode_html" and "encode_xml" directives for fields that are being entered as plain text.
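One way to picture the consistency rule is a table that records how each field is entered and decides, from that alone, whether an encoding step runs. The sketch below is hypothetical Python, not MovableType template code: the `FIELD_INPUT` table, the field names, and the `for_feed` function are all invented for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical record of how each field is always entered.
FIELD_INPUT = {"title": "plain", "body": "html"}

def for_feed(field: str, value: str) -> str:
    """Encode plain-text fields for XML; pass pre-encoded fields through."""
    if FIELD_INPUT[field] == "plain":
        return escape(value)
    return value

# Plain-text field: the encoding step does the work.
print(for_feed("title", "Q & A"))     # Q &amp; A

# Field entered already encoded: no further encoding is applied.
print(for_feed("body", "Q &amp; A"))  # Q &amp; A

# Mixing the two up double encodes the data.
print(escape("Q &amp; A"))            # Q &amp;amp; A
```

The point is only that the decision is made once per field and never per entry, which is the same discipline the templates need.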

You can choose different fields to be entered in different fashions, as long as you always do it the same way. You cannot, however, choose different fields to be in different character sets. This means that, if you choose to use plain text for any field, then you should always use UTF-8 or another international character set for all fields. Either that, or never use ANY special characters in the fields entered as plain text (aside from those that can be typed in plain 7-bit ASCII).

My recommendation is as follows. All titles and category names should be entered as plain text. Special characters should not be used in them unless you intend to use a character set designed for those characters. Because MovableType automatically strips HTML tags from the content of your post if you don't provide an excerpt, it is best to use plain text when providing an excerpt of your own. This means no HTML can be included, and new lines are useless. The actual body of the post should be entered in HTML as anything else would really defeat the purpose of all this.

My head hurts.
