| If you’re creating HTML, SGML, or XML directly, perhaps with a text editor or by writing a program, always use “decimal numeric character references” for curling single and double quote characters (these marks are also called “smart quotes,” “curly quotes,” “curled quotes,” “curling quotes,” or “curved quotes”). In other words, for left and right double quotation marks, use `&#8220;` and `&#8221;` - and for left and right single quotation marks (and apostrophes), use `&#8216;` and `&#8217;` - and you’ll be glad you did. This approach complies with all international standards, and works essentially everywhere. Here’s a table showing what I mean.

| To show | In HTML, SGML, or XML use | Displays on your system as |
| --- | --- | --- |
| Left Double Quotation Mark | `&#8220;` | “ |
| Right Double Quotation Mark | `&#8221;` | ” |
| Left Single Quotation Mark | `&#8216;` | ‘ |
| Right Single Quotation Mark (including English possessives and contractions) | `&#8217;` | ’ |

By doing this, your text will look good on a very wide variety of browsers and viewers, and you can easily cut-and-paste portions of data between HTML, SGML, and XML documents (letting you dynamically query and create new material from existing material, without having to deal with the complexities of translating between character sets).

Rationale

There are many advantages to this particular recommendation. These are the official, standard, vendor-neutral encodings for these characters according to both Unicode and ISO-10646, so you don’t need to worry about them not working in the future. They also work across XML, HTML, and SGML, simplifying data extraction - alternatives such as named character entity references do not easily work across XML and HTML (in particular). Systems that can display curling quotes (with the current fonts) will do so, and practically without exception will gracefully fall back to neutral (vertical) characters if they can’t - even if they’re a somewhat old browser.
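For text generated by a program, the recommendation can be applied mechanically. The following is a minimal sketch, assuming the input is a Python string that already contains the four curly quote characters; the names `CURLY_QUOTE_REFS` and `encode_curly_quotes` are hypothetical and chosen only for illustration.

```python
# Illustrative helper (hypothetical names): replace the four curly quote
# characters with their decimal numeric character references.
CURLY_QUOTE_REFS = {
    "\u201c": "&#8220;",  # left double quotation mark
    "\u201d": "&#8221;",  # right double quotation mark
    "\u2018": "&#8216;",  # left single quotation mark
    "\u2019": "&#8217;",  # right single quotation mark / apostrophe
}

# str.maketrans accepts a dict mapping single characters to replacement strings.
_TRANSLATION_TABLE = str.maketrans(CURLY_QUOTE_REFS)


def encode_curly_quotes(text: str) -> str:
    """Return text with curly quotes written as decimal references."""
    return text.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    # prints: &#8220;It&#8217;s fine,&#8221; she said.
    print(encode_curly_quotes("\u201cIt\u2019s fine,\u201d she said."))
```

All other characters, including neutral (vertical) quotes, pass through unchanged, so the function can be run over text that mixes both styles.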
I’ve tested decimal numeric character references on several versions of Internet Explorer, Netscape (the old 4.5 and 6.X), Mozilla (0.9.9 and 1.0), and lynx (a text browser), on a variety of systems (Windows, Linux, Sun Solaris). The one minor problem is that on some older X Window systems with old fonts, the left single quotation mark may get mapped to a character angled like the right single quotation mark - but it doesn’t look bad, the alternatives look far worse everywhere else, and this solution is “future-proof”.

Do not use the various alternatives:

- Don’t use HTML’s character entity references assigned for this purpose: `&ldquo;`, `&rdquo;`, `&lsquo;`, and `&rsquo;`. Character entity references won’t work in SGML or XML in general, because they aren’t predefined entities in SGML or XML (see the XML specification version 1.0 on predefined entities for more information). They are predefined in modern HTML implementations, and you could define them in both SGML and XML, but this makes it harder to use data fragments - if you take parts of the material, the definitions probably won’t come along. Are you sure your information will never be used again? Indeed, one of the main points of XML is that you can manipulate the resulting data, and using these conveniences interferes with that process. Another problem is that they are not supported by older browsers (such as Netscape 4.5) and tools, and remember, it takes some users a long time to upgrade. Some older text browsers don’t support them - and text browsers are important for accessibility, because they’re the basis of most readers for the blind. It’s also easy to make mistakes with character entity references - earlier versions of this document incorrectly used “lsquot” instead of “lsquo” (note the excess letter t). If you’re sure you’ll never use the text in SGML or XML, you could consider using these symbols in a few years as older browsers are retired, but it’s not worth it. You’re probably much better off following the recommendation above; it will make it easier to combine your data with other data (e.g., to create dynamic results).
- Don’t use HTML hexadecimal numeric character references, such as `&#x201C;`, `&#x201D;`, `&#x2018;`, and `&#x2019;`. Hexadecimal numeric character references are nice because the official documents that define the character standards also use hexadecimal. However, hexadecimal is a recent feature with inconsistent support: older browsers (like Netscape 4.5) don’t support it, and many other SGML and XML processors don’t support it. Indeed, SGML doesn’t include this ability at all. Since they’re rarely used (compared to the decimal versions), there is also a higher risk of hitting a bug with them.
- For now, don’t embed the UTF-8 (or UTF-16) characters directly in the text and depend on setting the charset to UTF-8. This won’t work on some text browsers (e.g., lynx, and thus many readers for the blind that depend on text rendering). It’s possible in XML and HTML to specify that characters should be interpreted according to a particular character set (charset), but requiring a particular charset has many drawbacks. Setting the charset to utf-8 does work in many places, but only if you set it explicitly; failing to set the charset will cause this to fail on many systems. Fundamentally, this makes combining your material with other sources harder, because they’re likely to use other charsets. For example, Microsoft’s non-standard character sets (discussed next) interfere with it, so using the UTF-8 encoding can cause trouble when trying to combine with data from some Microsoft and MacOS tools in some circumstances. In the longer term, hopefully everyone will switch to UTF-8 and UTF-16, and then this would be a reasonable alternative. For now, don’t do it.
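Existing markup that already uses the first two alternatives can be brought in line with the recommendation. The sketch below is one possible way to do that with regular expressions, assuming the markup is held in a Python string; `to_decimal_refs` and the pattern names are hypothetical, and only the four named quote entities discussed above are handled.

```python
import re

# Decimal code points for the four named quote entities (illustrative table).
NAMED_TO_DECIMAL = {"ldquo": 8220, "rdquo": 8221, "lsquo": 8216, "rsquo": 8217}

# Matches hexadecimal references such as &#x201C; (either case of "x").
_HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")
# Matches only the four named quote entities; other entities are left alone.
_NAMED_REF = re.compile(r"&(ldquo|rdquo|lsquo|rsquo);")


def to_decimal_refs(markup: str) -> str:
    """Rewrite hex and named quote references as decimal references."""
    # Convert each hexadecimal value to its decimal equivalent.
    markup = _HEX_REF.sub(lambda m: f"&#{int(m.group(1), 16)};", markup)
    # Look up each named quote entity in the table above.
    markup = _NAMED_REF.sub(lambda m: f"&#{NAMED_TO_DECIMAL[m.group(1)]};", markup)
    return markup


if __name__ == "__main__":
    # prints: &#8220;Hi&#8221; &amp; &#8216;bye&#8217;
    print(to_decimal_refs("&ldquo;Hi&#x201D; &amp; &lsquo;bye&#x2019;"))
```

Note that the hexadecimal pattern converts every hexadecimal reference it finds, not just the quote characters, while predefined entities such as `&amp;` are left untouched.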
|