Saturday, January 16, 2010

Characters and escaping

XML documents consist entirely of characters from the Unicode repertoire. Except for a small number of specifically excluded control characters, any character defined by Unicode may appear within the content of an XML document. The selection of characters which may appear within markup is somewhat more limited but still large.

XML includes facilities for identifying the encoding of the Unicode characters which make up the document, and for expressing characters which, for one reason or another, cannot be used directly.

Encoding detection

The Unicode character set can be encoded into bytes for storage or transmission in a variety of different ways, called "encodings". Unicode itself defines encodings which cover the entire repertoire; well-known ones include UTF-8 and UTF-16.[5] There are many other text encodings which pre-date Unicode, such as ASCII and ISO/IEC 8859; their character repertoires in almost every case are subsets of the Unicode character set.
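As a small illustration of the paragraph above, the following Python sketch encodes the same two characters under several encodings. The sample string and variable names are chosen only for this example and do not come from any XML tool.

```python
# The same characters become different byte sequences under different encodings.
text = "é中"  # U+00E9 followed by U+4E2D

utf8_bytes = text.encode("utf-8")        # variable-width: 2 bytes + 3 bytes
utf16_bytes = text.encode("utf-16-be")   # 2 bytes per character here, no BOM
latin1_bytes = "é".encode("iso-8859-1")  # ISO/IEC 8859-1 covers "é" in one byte

# ASCII's repertoire is a subset of Unicode that does not include "é".
try:
    "é".encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(utf8_bytes.hex(" "))    # c3 a9 e4 b8 ad
print(utf16_bytes.hex(" "))   # 00 e9 4e 2d
print(latin1_bytes.hex(" "))  # e9
print(ascii_can_encode)       # False
```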

XML allows the use of any of the Unicode-defined encodings, and any other encodings whose characters also appear in Unicode. XML also provides a mechanism whereby an XML processor can reliably, without any prior knowledge, determine which encoding is being used.[6] Encodings other than UTF-8 and UTF-16 will not necessarily be recognized by every XML parser.
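The detection mechanism can be sketched roughly as follows: look for a byte order mark, then for the byte pattern of "<?" in UTF-16, and otherwise read the encoding declaration as ASCII-compatible text. This is a simplified sketch, not a conforming implementation; the function name sniff_xml_encoding is hypothetical, and UTF-32, EBCDIC and externally supplied encoding information (for example from a transport protocol) are deliberately ignored.

```python
import re

# Matches an XML declaration that names an encoding, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL = re.compile(
    rb"""<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # 1. A byte order mark identifies the encoding family directly.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # 2. No BOM: "<?" encoded as UTF-16 has a recognizable zero-byte pattern.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # 3. ASCII-compatible start: the declaration itself can be read safely.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. With no BOM and no declaration, this sketch falls back to UTF-8.
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
# iso-8859-1
```

The returned name is only a label; a real parser would still have to check that it supports the named encoding and that the rest of the document decodes cleanly.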

Escaping

There are several reasons why it may be difficult or impossible to include some character directly in an XML document.

  • The characters "<" and "&" are key syntax markers and may never appear literally in content outside a CDATA section.[7]
  • Some character encodings support only a subset of Unicode: for example, it is legal to encode an XML document in ASCII, but ASCII lacks code points for Unicode characters such as "é".
  • It might not be possible to type the character on the author's machine.
  • Some characters have glyphs that cannot be visually distinguished from other characters: examples are the non-breaking space (U+00A0), which looks like an ordinary space, and Cyrillic Capital Letter A (А, U+0410), which looks like the Latin letter "A".

For these reasons, XML provides escape facilities for referencing problematic or unavailable characters. There are five predefined entities: &lt; represents "<", &gt; represents ">", &amp; represents "&", &apos; represents ', and &quot; represents ". All permitted Unicode characters may be represented with a numeric character reference. Consider the Chinese character "中", whose numeric code in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard offered no method for entering this character could still insert it in an XML document encoded either as &#20013; or &#x4e2d;. Similarly, the string "I <3>" could be encoded for inclusion in an XML document as "I &lt;3&gt;".
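The escaping rules can be tried out with Python's standard library. The sketch below is one possible illustration; the element name t and the variable names are invented for the example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Predefined entities: escape() replaces "&", "<" and ">" in character data.
escaped = escape("I <3>")
print(escaped)  # I &lt;3&gt;

# Numeric character references for U+4E2D, in decimal and in hexadecimal.
decimal_ref = f"&#{ord('中')};"
hex_ref = f"&#x{ord('中'):x};"
print(decimal_ref, hex_ref)  # &#20013; &#x4e2d;

# An XML parser turns both references back into the same character.
parsed = ET.fromstring(f"<t>{decimal_ref}{hex_ref}</t>").text
print(parsed)  # 中中

# A document restricted to ASCII can still carry non-ASCII characters:
# the "xmlcharrefreplace" error handler substitutes numeric references.
ascii_doc = "<t>é中</t>".encode("ascii", "xmlcharrefreplace")
print(ascii_doc)  # b'<t>&#233;&#20013;</t>'
roundtrip = ET.fromstring(ascii_doc).text
print(roundtrip)  # é中
```

Note that escape() leaves quotation marks alone by default; for attribute values, the same module's quoteattr() function handles quoting.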
