
UTF-8

All versions of CMSimple_XH since 1.2 are UTF-8 encoded (since CMSimple_XH 1.5 you must not change this setting). If you don't have to upgrade from an older version, don't want to edit any of the files offline with an editor, and don't want to use any ANSI encoded templates or plugins, you can skip the rest of this article.

Why UTF-8?

UTF-8 has virtually become the standard for the Internet. So for best interoperability in the WWW, it was decided to change the encoding for CMSimple_XH to UTF-8.

This should also help with the encoding of the core and the plugin files, particularly in multi-language installations. Of course it's possible to encode all language-specific files in the standard encoding for that language. But unfortunately there are no universally agreed standards in this regard. Consider the so-called western languages. These are often encoded as ISO-8859-1. But this character set doesn't include the “€” character, so ISO-8859-15 was invented. But on Windows the default encoding is CP-1252. These three encodings are very similar, but they're not identical. For Cyrillic languages the situation is even slightly worse: ISO-8859-5, KOI8-R, KOI8-U and CP-1251 are “competing”.
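To illustrate how similar encodings still disagree, the following sketch prints the bytes used for the “€” character in UTF-8, ISO-8859-15 and CP-1252. It assumes that PHP's iconv extension is available and that the encoding names shown are known to the installed iconv library; the variable names are only illustrative.

```php
<?php
// Sketch only: shows how one character is stored in different encodings.
// Assumption: the iconv extension is available.
$euro = "\xE2\x82\xAC"; // the character "€" as UTF-8 bytes

$bytes = array(
    'UTF-8'       => bin2hex($euro),
    'ISO-8859-15' => bin2hex(iconv('UTF-8', 'ISO-8859-15', $euro)),
    'CP1252'      => bin2hex(iconv('UTF-8', 'CP1252', $euro)),
);

// Print one line per encoding with the hexadecimal byte values.
foreach ($bytes as $encoding => $hex) {
    echo $encoding, ': ', $hex, "\n";
}
```

The same character occupies three bytes in UTF-8 and a single, but different, byte in each of the two legacy encodings, so a file read with the wrong assumption shows a wrong or broken character.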

And consider the problems with files that are shared between different languages: these shouldn't contain any non-ASCII characters at all. This might not be a big problem for actual program files (PHP and JS), but it is for plugin data files that are used for all languages.

So in the long run switching to UTF-8 seems to be the best solution for everybody.

What's a BOM?

The following section is quite technical. As a CMSimple_XH user you don't need to grasp all the details. The only rule you should keep in mind:

Never ever encode a PHP file used by CMSimple_XH as UTF-8 with BOM.

BOM is the abbreviation for byte order mark. That's an important concept for platform interoperability regarding many multibyte encodings, e.g. UTF-16 and UTF-32. It is necessary, as different platforms store those encodings in different byte orders (big-endian vs. little-endian).

But for UTF-8 the byte order is fixed for all platforms, so the BOM has lost its original meaning. However, many editors use it to mark a file as being UTF-8 encoded. That's probably not the best idea, and the Unicode Standard does not recommend using a BOM in UTF-8 encoded files. Often the BOM doesn't matter, but for PHP files and files that will be include()d by PHP it causes a problem: the BOM is sent to the browser as soon as the file is processed. As the HTTP response has then already started, HTTP headers can no longer be sent afterwards, which can cause various malfunctions of the script.
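In UTF-8 the BOM consists of the three bytes EF BB BF at the very start of the file. A minimal sketch of how to check for it follows; hasUtf8Bom() is a hypothetical helper and not part of CMSimple_XH.

```php
<?php
// Hypothetical helper (not part of CMSimple_XH): reports whether the given
// file contents start with the UTF-8 byte order mark (bytes EF BB BF).
function hasUtf8Bom($contents)
{
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}
```

It could be called as hasUtf8Bom(file_get_contents('path/to/file.php')) to find out whether a particular file is affected.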

Editing Files Offline

If you want to edit any PHP file offline, it's mandatory to save it as UTF-8 without BOM. Some editors automatically insert the BOM when the character encoding is UTF-8 (e.g. Windows Notepad), so use an editor that is capable of saving UTF-8 without BOM (e.g. Notepad++) and make sure it does so. Otherwise you might experience malfunctions of CMSimple_XH; since CMSimple_XH 1.5.4 the following error message will be shown:

Cannot modify header information - headers already sent (output started at path/to/file.php:1)
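If a file has already been saved with a BOM and no suitable editor is at hand, the three BOM bytes can also be removed by a small script. The following is a sketch under that assumption; stripUtf8Bom() is a hypothetical helper and not part of CMSimple_XH.

```php
<?php
// Hypothetical helper (not part of CMSimple_XH): returns the given contents
// without a leading UTF-8 BOM; contents without a BOM are returned unchanged.
function stripUtf8Bom($contents)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($contents, $bom, 3) === 0) {
        // The cast covers older PHP versions, where substr() may return false.
        return (string) substr($contents, 3);
    }
    return $contents;
}

// Possible usage (work on a backup copy first):
// $file = 'path/to/file.php';
// file_put_contents($file, stripUtf8Bom(file_get_contents($file)));
```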

Upgrading from ANSI encoded Versions

Since CMSimple_XH 1.2 all versions are UTF-8 encoded. So if you want to upgrade from an older version, you have to convert all files that contain non-ASCII characters to UTF-8 without BOM. You can do this manually, with a tool that can process a complete project, or with the Utf8migrator_XH plugin (please note that it's still in beta stage).
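For a manual conversion, the core step can be sketched as follows. This is not how Utf8migrator_XH works; toUtf8() is a hypothetical helper, the source encoding CP1252 is an assumption that has to be adjusted to the actual encoding of the old files, and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical helper (not part of CMSimple_XH): converts file contents to
// UTF-8. Assumption: the old files are encoded as CP1252 unless stated otherwise.
function toUtf8($contents, $sourceEncoding = 'CP1252')
{
    // Pure ASCII and already valid UTF-8 are left untouched. This is only a
    // heuristic: a legacy file may happen to look like valid UTF-8.
    if (preg_match('//u', $contents) === 1) {
        return $contents;
    }
    // iconv() does not add a BOM when the target encoding is 'UTF-8'.
    return iconv($sourceEncoding, 'UTF-8', $contents);
}
```

Checking for valid UTF-8 first avoids converting a file twice, which would otherwise garble all non-ASCII characters.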

If you have special characters in the page headings, the page URLs are likely to have changed after the UTF-8 conversion. To keep them working and inform robots that the links have changed, you can use the following index.php (in the root of the CMSimple_XH installation, and in all second language folders):

<?php

// Replace ISO-8859-1 percent-encoded characters (ä ö ü Ä Ö Ü ß)
// in the query string with their UTF-8 percent-encoded equivalents.
$qs = str_replace(array('%E4',    '%F6',    '%FC',    '%C4',    '%D6',    '%DC',    '%DF'),
                  array('%C3%A4', '%C3%B6', '%C3%BC', '%C3%84', '%C3%96', '%C3%9C', '%C3%9F'),
                  $_SERVER['QUERY_STRING']);

if ($qs != $_SERVER['QUERY_STRING']) {
    // An old URL was requested: redirect permanently to the UTF-8 encoded URL.
    $loc = 'http'
        . (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] != 'off' ? 's' : '')
        . '://' . $_SERVER['SERVER_NAME']
        . ($_SERVER['SERVER_PORT'] < 1024 ? '' : ':' . $_SERVER['SERVER_PORT'])
        . preg_replace('/index\.php$/', '', $_SERVER['SCRIPT_NAME']) . '?' . $qs;
    header("Location: $loc", true, 301);
    exit;
} else {
    unset($qs);
}

// Nothing to redirect: hand over to CMSimple_XH as usual.
include('./cmsimple/cms.php');

?>

You have to adjust the two arrays in the str_replace() call to the characters that are actually used. For the index.php files in second language folders you have to change the include line at the end:

include('../cmsimple/cms.php');

Using ANSI encoded Templates and Plugins

Reusing ANSI encoded templates in UTF-8 encoded CMSimple_XH shouldn't be a problem. Just change the encoding of template.htm and stylesheet.css (and maybe other files in the template folder) to UTF-8 without BOM.

Reusing ANSI encoded plugins might work if you convert the files (particularly the language and data files) to UTF-8 without BOM. But there might be other problems that can't be solved that easily, so it's probably best to contact the plugin's author and ask for a UTF-8 conforming version, or to use an alternative plugin that's already UTF-8 conforming. If that's not possible, you have to find out for yourself whether converting the files to UTF-8 without BOM suffices.

 
Except where otherwise noted, content on this wiki is licensed under the following license: GNU Free Documentation License 1.3