Basics of Japanese multi-byte encodings
It is often said that it is quite hard to figure out how
Japanese text is handled by computers. This is not only because
Japanese characters can only be represented by multibyte
encodings, but also because different encoding standards are
adopted for different purposes and platforms. Moreover, several
character set standards are in use, each slightly different
from the others. These facts have often led developers
into a mess that is hard to avoid.
To create a working web application for use in
the Japanese environment, it is important to use the proper
character encoding and character set for the task at hand.
Storage for a character can be up to six bytes
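As a rough illustration of how storage size varies with the encoding, the following Python sketch counts the bytes needed for one character. The sample character and the helper name byte_lengths are assumptions made for this example only; the codec names are those shipped with Python's standard library.

```python
# Count how many bytes one character occupies in several encodings.
SAMPLE = "あ"  # HIRAGANA LETTER A, chosen only for illustration

def byte_lengths(ch, encodings=("shift_jis", "euc_jp", "iso2022_jp", "utf-8")):
    """Return {encoding name: number of bytes needed to encode ch}."""
    return {name: len(ch.encode(name)) for name in encodings}

for name, size in byte_lengths(SAMPLE).items():
    print(f"{name:12} {size} byte(s)")
```

The ISO-2022-JP figure includes the escape sequences that surround the character, so it is larger than the two bytes of the character code itself.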
Most multibyte characters appear twice as wide
as a single-byte character on display. Those characters
are called "zen-kaku" in Japanese, which means
"full width", and the other (narrower) characters
are called "han-kaku", which means "half width". However,
the graphical properties of the characters depend on the
glyphs of the typefaces used to display or print them.
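The full-width / half-width distinction can be approximated in code. The sketch below uses the East Asian Width property from Python's unicodedata module to estimate how many columns a string occupies; display_width is a hypothetical helper, and the result is only an estimate because, as noted above, the real width depends on the font.

```python
import unicodedata

def display_width(text):
    """Estimate the number of display columns that text occupies.

    Wide (W) and Fullwidth (F) characters count as two columns and
    everything else as one. This is a simplification: ambiguous-width
    and combining characters are also counted as one column.
    """
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

# ASCII letters, full-width kanji, and half-width katakana.
for sample in ("abc", "日本語", "ｱｲｳ"):
    print(sample, display_width(sample))
```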
Some character encodings use shift (escape) sequences defined
in ISO 2022 to switch the code map of a specific code area
(00h to 7fh).
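To make the escape sequences visible, this Python sketch encodes a short string as ISO-2022-JP and lists the sequences it contains. The helper names are hypothetical, and the extraction assumes the basic ISO-2022-JP encoding, in which every escape sequence is ESC followed by two bytes.

```python
ESC = 0x1B  # the escape byte that starts each code-map switch

def to_iso2022jp(text):
    """Encode text with Python's built-in ISO-2022-JP codec."""
    return text.encode("iso2022_jp")

def escape_sequences(data):
    """Return the three-byte escape sequences found in data.

    Assumes basic ISO-2022-JP, where each sequence is ESC plus two bytes.
    """
    return [data[i:i + 3] for i, b in enumerate(data) if b == ESC]

encoded = to_iso2022jp("Aあ")
print(encoded)
print(escape_sequences(encoded))
# Every byte stays inside the 7-bit area (00h to 7fh).
print(all(b < 0x80 for b in encoded))
```

The first sequence switches the code area to the Japanese character set and the last one switches it back to ASCII.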
ISO-2022-JP should be used in SMTP/NNTP, and headers and
entities should be re-encoded as per RFC requirements. Although
these are not strict requirements, it is still a good idea because
several popular user agents cannot recognize any other encoding
method.
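As one possible way to re-encode a header, the sketch below uses Python's standard email.header module to produce an encoded word in ISO-2022-JP and to decode it again. The function names encode_subject and decode_subject are assumptions for this example, and the approach shown is only one of several that satisfy the relevant RFCs.

```python
from email.header import Header, decode_header, make_header

def encode_subject(text):
    """Encode a header value as an ISO-2022-JP encoded word."""
    return Header(text, charset="iso-2022-jp").encode()

def decode_subject(value):
    """Decode an encoded-word header value back to text."""
    return str(make_header(decode_header(value)))

encoded = encode_subject("日本語")
print(encoded)                  # ASCII-only text that is safe in a header
print(decode_subject(encoded))
```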
Webpages created for mobile phone services such as i-mode,
Vodafone live!, or EZweb are supposed to use Shift_JIS.
References
Multibyte character encoding schemes and the related issues
are very complicated. There is too little space here to cover
them in sufficient detail. Please refer to the following URLs
and other resources for further reading.
Unicode materials
http://www.unicode.org/
Japanese/Korean/Chinese character information
http://examples.oreilly.com/cjkvinfo/doc/cjk.inf