XML and Unicode

XMLBlueprint XML Editor fully supports the Unicode UTF-8 and UTF-16 standards on all Windows versions, including Windows '98. In addition, XMLBlueprint supports character encodings for many different languages. This means you can edit and validate your XML documents in almost any language.

What is Unicode?

Unicode is an international standard developed by the Unicode Consortium which represents almost all of the written languages of the world, including ancient scripts known only by scholars, and mathematical symbols. Unicode was created to replace existing character encodings. Today, it is considered the most complete character set and one of the largest.

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit):

  • UTF-8
  • UTF-16 (Big Endian / Little Endian)
  • UTF-32

UTF-32 is not widely used and is not supported by XMLBlueprint. As an example, we'll see how the Chinese character for tea is encoded in each of these encoding forms.

茶 (Chinese character for tea, U+8336)

UTF-8

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

The Chinese character for tea is encoded in UTF-8 as the three bytes [E8 8C B6].
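The byte values above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name utf8_hex is made up for this example and is not part of XMLBlueprint.

```python
# The Chinese character for tea (U+8336), written as an escape so that
# this snippet stays plain ASCII.
TEA = "\u8336"


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


print(utf8_hex(TEA))  # E8 8C B6
print(utf8_hex("A"))  # 41 (ASCII characters keep their ASCII byte value)
```

The second call shows the ASCII compatibility mentioned above: the letter A takes one byte with the same value it has in ASCII.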

UTF-16

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-16 has two variants: UTF-16BE (Big Endian) and UTF-16LE (Little Endian). Documents created on a Macintosh or Unix platform are normally encoded as UTF-16BE (Big Endian). Documents created on a Windows platform are normally encoded as UTF-16LE (Little Endian).

  • The Chinese character for tea is encoded in UTF-16BE as [83 36].
  • The Chinese character for tea is encoded in UTF-16LE as [36 83].
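The two byte orders can be compared side by side in Python. The sketch below also encodes a character outside the 16-bit range (U+1D11E, the musical G clef, chosen here purely as an example) to show the pair of 16-bit code units mentioned earlier. The helper name hex_bytes is an assumption of this example.

```python
TEA = "\u8336"          # Chinese character for tea
G_CLEF = "\U0001D11E"   # example of a character that needs two code units


def hex_bytes(data: bytes) -> str:
    """Format raw bytes as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in data)


# Same code unit, opposite byte order.
print(hex_bytes(TEA.encode("utf-16-be")))  # 83 36
print(hex_bytes(TEA.encode("utf-16-le")))  # 36 83

# A character above U+FFFF is stored as a surrogate pair (two code units).
print(hex_bytes(G_CLEF.encode("utf-16-be")))  # D8 34 DD 1E
```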

Why use Unicode?

Unicode is a superset of all other character sets. At the moment, it includes the vast majority of scripts ever used, and will ultimately include them all. This means that almost any character can be encoded in it, and will always be unambiguously represented. What is more, with Unicode you can mix character sets at will: if you wish to produce a text which includes material in Tibetan script along with Chinese logograms and IPA transcription, Unicode will allow you to do so.

So if you are aiming at international markets, especially countries or languages with large character sets (e.g. Japanese, Chinese), it is a good idea to use Unicode.

Please note that XMLBlueprint does not force you to use Unicode. If you want, you can save your documents in ASCII, or you can declare a specific character encoding in the XML document.

XMLBlueprint support for Unicode

XMLBlueprint supports UTF-8, UTF-16BE and UTF-16LE. The correct encoding is automatically detected by means of a Byte Order Mark (BOM), a character at the very start of a document that is invisible to the user. When no BOM is found, the document is interpreted as UTF-8, unless the document declares a specific character encoding.
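BOM-based detection in general can be sketched as follows. This illustrates the technique only and is not XMLBlueprint's actual implementation. The function names detect_by_bom and decode_document are hypothetical, and the sketch ignores any encoding declared inside the document.

```python
import codecs
from typing import Tuple

# BOM byte sequences for the three encodings named above.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def detect_by_bom(data: bytes) -> Tuple[str, int]:
    """Return (encoding name, BOM length); fall back to UTF-8 without a BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_document(data: bytes) -> str:
    """Decode `data` using the detected encoding, dropping the BOM itself."""
    encoding, bom_length = detect_by_bom(data)
    return data[bom_length:].decode(encoding)


print(detect_by_bom(b"\xfe\xff\x83\x36"))  # ('utf-16-be', 2)
print(detect_by_bom(b"<doc/>"))            # ('utf-8', 0)
```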

See also

For more technical information on Unicode, see the Wikipedia articles on Unicode, Byte Order Mark (BOM), UTF-8, and UTF-16.

Character encoding

The XML specification does not force you to use Unicode. For instance, you can specify that an XML document is written in Greek (ISO-8859-7) by including the following XML declaration right at the top of the document:

<?xml version="1.0" encoding="ISO-8859-7"?>

When this XML document is opened, XMLBlueprint will auto-detect the encoding and it will convert the characters in the document into the Unicode numbers it uses internally. If no encoding is specified, then a Unicode encoding is assumed.
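The conversion from a declared encoding to Unicode numbers can be illustrated with a short Python sketch. It assumes the declaration is plain ASCII at the very start of the file and reads it with a simple regular expression. The names declared_encoding and the sample <word> element are invented for this example, and XMLBlueprint's own detection may work differently.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the raw bytes (an assumption: no BOM and no leading whitespace).
_ENCODING = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding name, or `default` if none is declared."""
    match = _ENCODING.match(raw)
    return match.group(1).decode("ascii") if match else default


# Three Greek letters stored as single ISO-8859-7 bytes.
raw = b'<?xml version="1.0" encoding="ISO-8859-7"?>\n<word>\xe1\xe2\xe3</word>'

encoding = declared_encoding(raw)
text = raw.decode(encoding)
greek = text[text.index("<word>") + len("<word>"):text.index("</word>")]

print(encoding)                              # ISO-8859-7
print([f"U+{ord(ch):04X}" for ch in greek])  # ['U+03B1', 'U+03B2', 'U+03B3']
```

Each single byte in the Greek file becomes one Unicode code point (alpha, beta, gamma) once the declared encoding is applied.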

See the list of supported languages to find out which encoding to use for a specific language.

You can easily convert files to a different character encoding. For example, you can change the character encoding of your document to Cyrillic (KOI8-R) before saving the document:

<?xml version="1.0" encoding="koi8-r"?>

If you want to change the character encoding to Unicode, just remove the encoding before saving the document:

<?xml version="1.0"?>
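Outside the editor, the same kind of conversion can be sketched in Python: decode the bytes, rewrite the XML declaration, and encode again. The function name convert_xml is hypothetical. The sketch assumes XML version 1.0, drops any other declaration attributes such as standalone, and expects input without a BOM.

```python
import re
from typing import Optional

# An XML declaration at the very start of the text, plus trailing whitespace.
_DECLARATION = re.compile(r"^<\?xml[^>]*\?>\s*")


def convert_xml(raw: bytes, source: str, target: Optional[str] = None) -> bytes:
    """Re-encode an XML document from `source` to `target`.

    With target=None the encoding is omitted from the declaration and the
    result is written as UTF-8.
    """
    text = raw.decode(source)
    body = _DECLARATION.sub("", text, count=1)
    if target is None:
        return ('<?xml version="1.0"?>\n' + body).encode("utf-8")
    declaration = f'<?xml version="1.0" encoding="{target}"?>\n'
    return (declaration + body).encode(target)


# A small UTF-8 document containing three Cyrillic letters.
original = '<?xml version="1.0" encoding="utf-8"?>\n<t>\u0447\u0430\u0439</t>'.encode("utf-8")

as_koi8 = convert_xml(original, "utf-8", "koi8-r")
as_unicode = convert_xml(as_koi8, "koi8-r")

print(as_koi8.decode("koi8-r").splitlines()[0])    # <?xml version="1.0" encoding="koi8-r"?>
print(as_unicode.decode("utf-8").splitlines()[0])  # <?xml version="1.0"?>
```

Characters that do not exist in the target encoding make the final encode step raise UnicodeEncodeError, which is a useful signal that the chosen encoding cannot hold the document.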

See also