LIII. Multibyte String Functions
Introduction
While there are many languages in which every necessary character
can be represented by a one-to-one mapping to an 8-bit value,
there are also several languages that require so many characters
for written communication that they cannot be contained within
the range a single byte can encode. Multibyte character encoding
schemes were developed to express that many (more than 256)
characters in the regular bytewise coding system.
When you manipulate (trim, split, splice, etc.) strings
encoded in a multibyte encoding, you need to use special
functions, since two or more consecutive bytes may represent
a single character in such encoding schemes. If you apply
a non-multibyte-aware string function to such a string, it
will probably fail to detect where a multibyte character
begins or ends, and the result is a corrupted string that
has most likely lost its original meaning.
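The difference can be seen in a minimal sketch, assuming the
mbstring extension is loaded and PHP 7 or later (for the \u{...}
escapes). The variable names are illustrative only. The byte-oriented
strlen() and substr() are compared with mb_strlen() and mb_substr()
on a three-character UTF-8 string.

```php
<?php
// "日本語" written with escapes so the result does not depend on
// the encoding this file is saved in. Each character takes
// three bytes in UTF-8.
$text = "\u{65E5}\u{672C}\u{8A9E}";

$byteLength = strlen($text);             // counts bytes
$charLength = mb_strlen($text, 'UTF-8'); // counts characters

// Cutting at byte 4 splits the second character in half.
$byteCut = substr($text, 0, 4);
// Cutting at character 2 keeps two whole characters.
$charCut = mb_substr($text, 0, 2, 'UTF-8');

var_dump($byteLength);                          // int(9)
var_dump($charLength);                          // int(3)
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump($charCut === "\u{65E5}\u{672C}");      // bool(true)
```

The string returned by substr() is no longer valid UTF-8, which is
the kind of corruption described above.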
mbstring provides multibyte-specific string functions
that help you deal with multibyte encodings in PHP, whose
ordinary string functions assume single-byte encodings.
In addition, mbstring handles character encoding
conversion between the supported encoding pairs.
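As a rough illustration of encoding conversion, the sketch below
uses mb_convert_encoding() to convert a UTF-8 string to EUC-JP and
back. It assumes the mbstring extension is loaded and PHP 7 or
later. The choice of EUC-JP and the variable names are examples
only.

```php
<?php
// "日本語" as UTF-8 (three bytes per character).
$utf8 = "\u{65E5}\u{672C}\u{8A9E}";

// Arguments: string, target encoding, source encoding.
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// Convert back to confirm nothing was lost.
$roundTrip = mb_convert_encoding($eucjp, 'UTF-8', 'EUC-JP');

var_dump(strlen($utf8));        // int(9)
var_dump(strlen($eucjp));       // int(6), two bytes per character
var_dump($roundTrip === $utf8); // bool(true)
```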
Although mbstring was originally developed for use in
Japanese web pages, it is also designed to handle
Unicode-based encodings such as UTF-8 and UCS-2, as well as
many single-byte encodings for convenience (listed below).
PHP Character Encoding Requirements
The following types of encodings can be used safely with PHP.
A single-byte encoding
that has ASCII-compatible (ISO646-compatible) mappings
for the characters in the range 00h to 7fh.
A multibyte encoding
that has ASCII-compatible mappings for the characters in
the range 00h to 7fh,
that does not use ISO2022 escape sequences, and
that does not use a value from 00h to 7fh in any of the
bytes that together represent a single character.
These are examples of character encodings that are unlikely
to work with PHP.
JIS, SJIS, ISO-2022-JP, BIG-5
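The requirement that no byte of a multibyte character falls in
the range 00h to 7fh can be checked for a concrete character. The
sketch below assumes the mbstring extension is loaded and PHP 7 or
later. hasAsciiRangeByte() is a hypothetical helper written for
this example, not part of mbstring. It encodes KATAKANA LETTER SO
in SJIS and in EUC-JP and looks for such a byte.

```php
<?php
// Hypothetical helper: true if any byte lies in 00h-7fh.
function hasAsciiRangeByte(string $bytes): bool
{
    return preg_match('/[\x00-\x7F]/', $bytes) === 1;
}

$so = "\u{30BD}"; // KATAKANA LETTER SO, as UTF-8

$sjis = mb_convert_encoding($so, 'SJIS', 'UTF-8');
$euc  = mb_convert_encoding($so, 'EUC-JP', 'UTF-8');

// In SJIS the second byte is 5Ch, the ASCII backslash.
var_dump(bin2hex($sjis));           // string(4) "835c"
var_dump(hasAsciiRangeByte($sjis)); // bool(true)
var_dump(hasAsciiRangeByte($euc));  // bool(false)
```

A byte-oriented scanner can mistake that 5Ch byte for an escape
character inside a string literal, which is one way such encodings
can break a script.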
Although PHP scripts written in any of those encodings
might not work, especially when encoded strings appear
as identifiers or literals in the script, you can largely
avoid using these encodings by setting up mbstring's
transparent encoding filter function for incoming HTTP queries.
Note: Using SJIS, BIG5, CP936, CP949 and GB18030 as the
internal encoding is strongly discouraged unless you are
familiar with the parser, the scanner and the character
encoding.
Note: If you connect PHP to a database, it is recommended
that you use the same character encoding for both the
database and the internal encoding, for ease of use and
better performance.
If you are using PostgreSQL, the character encoding used
in the database and the one used in PHP may differ, because
PostgreSQL supports automatic character set conversion
between the backend and the frontend.