So, embarrassingly, I have been putting up with this for years and never could quite put a finger on how to handle it. The situation is this: you have some sort of content management system (CMS) set up for your client, homebrew or not. You’ve painstakingly put time into security, formatting, ease of use, etc., and you hand them the reins. Two hours later you get a call that they have some ‘weird code’ on their About page. You look and realize they have copied and pasted some special characters from Word. What to do?
Maybe this is the jerk way of handling it, but I always had users copy and paste from Word into a flat text file, then copy and paste from there into the form. Or I would use TinyMCE in the CMS and have its Cleaner plugin handle it. Most of the time that seems to work, but not always. Today I hit the wall on this problem with CakePHP and had to figure it out. Essentially, it all comes down to page and database encoding.
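To make the ‘weird code’ a bit more concrete, here is a minimal sketch of what an encoding mismatch does to a single character. It assumes the mbstring extension is available; the byte values are standard UTF-8 and Windows-1252, but the scenario (one layer guessing Latin-1 while another sends UTF-8) is only an illustration, not a diagnosis of any particular setup.

```php
<?php
// "ü" stored as UTF-8 is two bytes: 0xC3 0xBC.
$utf8 = "\xC3\xBC";

// Simulate a layer that wrongly assumes those bytes are ISO-8859-1 (Latin-1)
// and converts them to UTF-8: each byte becomes its own character.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // c3bc
echo bin2hex($garbled), "\n"; // c383c2bc, which a browser renders as "Ã¼"

// A curly opening quote pasted from Word is often the single byte 0x93
// (Windows-1252). On its own that byte is not valid UTF-8.
$wordQuote = "\x93";
var_dump(mb_check_encoding($wordQuote, 'UTF-8')); // bool(false)
```

The point is only that the same bytes mean different things depending on which encoding each layer assumes, which is why the fix is to make every layer agree.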
The short explanation: Set everything to UTF-8 at the beginning of your project. The long explanation (which is just a rehash of this awesome post at missingfeatures.com):
- Set your MySQL database up with UTF-8 encoding. This includes:
- The database itself
- The tables
- Every field (watch out for this one in phpMyAdmin: if you have already created your table, you’ll need to specifically change the collation for all text, varchar, and char fields)
- In your layouts:
- Put this at the top: <?php header('Content-type: text/html; charset=UTF-8'); ?>
- Put this in the head: <?php echo $html->charset('utf-8'); ?>
- In your database configuration file (app/config/database.php):
- Add this to your default database array: 'encoding' => 'utf8'
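For the last step, here is roughly what the finished file might look like. This is a sketch assuming the CakePHP 1.x layout of app/config/database.php (the `$html->charset()` helper call suggests that version); every value except 'encoding' is a placeholder to swap for your own.

```php
<?php
// app/config/database.php (CakePHP 1.x style).
// All values except 'encoding' are placeholders for illustration.
class DATABASE_CONFIG {
    var $default = array(
        'driver'     => 'mysql',
        'persistent' => false,
        'host'       => 'localhost',
        'login'      => 'db_user',      // placeholder
        'password'   => 'db_password',  // placeholder
        'database'   => 'my_cms',       // placeholder
        'prefix'     => '',
        'encoding'   => 'utf8',         // the line this step adds
    );
}
```

As far as I can tell, this setting makes Cake tell MySQL which character set the connection uses, so the bytes are not reinterpreted on the way in or out.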
This puts everything on the same encoding plane and allows for characters like: ü ß ☠ ☮ ☯ ♠ Ω ♤ ♣ ♧ ♥ ♡ ♦ ♢ ♔ ♕ ♚ ♛ ⚜ ★ ☆ ✮ ✯ ☄ ☾ ☽ ☼ ☀ ☁ ☂ ☃ ☻ ☺ ☹ ۞ ۩
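If the tables already exist, the database step can also be done with plain SQL instead of clicking through phpMyAdmin. Below is a small sketch that just prints the statements; `utf8ConversionSql()` is a hypothetical helper of mine (not part of CakePHP), and 'my_cms', 'pages', and 'posts' are made-up names. The collation utf8_general_ci is an assumption; use whichever UTF-8 collation suits you.

```php
<?php
// Hypothetical helper (not part of CakePHP): builds the SQL statements that
// switch a database and its tables to UTF-8. It only returns strings; run the
// output yourself in a MySQL client.
function utf8ConversionSql($database, array $tables)
{
    $statements = array(
        "ALTER DATABASE `{$database}` CHARACTER SET utf8 COLLATE utf8_general_ci;",
    );
    foreach ($tables as $table) {
        // CONVERT TO also converts the table's text, varchar, and char columns.
        $statements[] = "ALTER TABLE `{$table}` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;";
    }
    return $statements;
}

// Made-up database and table names for illustration.
$sql = utf8ConversionSql('my_cms', array('pages', 'posts'));
echo implode("\n", $sql), "\n";
```

Back up the database before running anything like this: converting a table whose existing rows were stored under the wrong encoding can garble them further.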