Strange Characters in My WordPress Blog

by Bob Plankers on September 14, 2011 · 3 comments

in Site Administration,System Administration

In late August I moved my blog from a host running Red Hat Enterprise Linux 4 to a host running Red Hat Enterprise Linux 6. With it came an upgrade to MySQL, from 4 to 5. The migration itself seemed pretty painless, as I was able to recompile Apache on the new host and just copy everything over intact. I moved the databases by dumping each one and importing them on the other side.

A week ago I was doing some work on the blog and noticed that for every apostrophe there were three other characters now: ’. I use a lot of apostrophes, and it sort of sucks to have every “it’s” become “it’s”. More specifically, it was only the posts I’d written in Windows Live Writer, too. Apparently Live Writer replaces apostrophes with the UTF-8 “acute accent” character. Why? No idea, though you can shut it off in the options (which, now that I know this, I strongly recommend).

Alex King wrote about this problem a few years ago. Older WordPress databases contain UTF-8 data, but have their defaults set to Latin1 encoding. So when I reloaded the data on the new host it converted all the UTF-8 special characters to Latin1, because that’s how the tables were set up. In his case he was creating a test database and could just reload it. In my case I had newer posts I wanted to preserve, which complicated things.

Here’s how I fixed it all.

1. I created a test copy of my blog with the goal of setting test up correctly, then copying it back to production. I used the same web server, set up for virtual hosting, and just used “cp -ax” to copy the whole web site over to a test directory. I set permissions on the host so only my client workstation could access the test copy, to prevent search engines and whatnot from getting to it.

I edited wp-config.php to point the copy of the blog at the test database. I also used the advice in the WordPress Codex on forcing the site URLs via wp-config.php so that I could keep the database settings the same as production, but it wouldn’t redirect me to (or reconfigure) the production site. I forced the site URLs in both test and production, and I also added a test comment to production, on an old, obscure post, that just said “PRODUCTION.”  That way, when things got crossed up I could tell where I was.

I also dumped a copy of the corrupted production database, just in case.

2. I had the original database dump I’d made with mysqldump during the migration. Using the UNIX command ‘less’ I looked at the dump and noted the default charset was Latin1, as expected:

CREATE TABLE `wp_posts` (
...lots of stuff was here...

3. A little sed action to get it to create the databases as UTF-8:

sed s/"DEFAULT CHARSET=latin1"/"DEFAULT CHARSET=utf8"/g original-dump.sql > utf8-dump.sql

4. Now we can import it to the test database:

cat utf8-dump.sql | mysql test_blog_db

You could get ballsy and pipe the output of sed right into the MySQL client, but I like having a copy around so I can see what I imported later, if I need to.

5. I logged into the test instance of WordPress and confirmed I had what looked like apostrophes, and not all those crazy characters. I then proceeded to adjust my plugins and such so that test looked like production again, replicating the changes from the last couple of weeks.

6. WordPress has the ability to export and import post information (via the Tools part of the admin interface), and the post export function can take a date range. I looked at the test blog to see when the last post was and used that to set the date range for the export of posts from the production blog.

It’s okay if it’s not 100% accurate, the import procedure will not recreate identical blog posts. I imported the posts, instructing it to download media and images. After importing the posts to the test database I deleted any duplicates I had. The duplicates all had corrupted characters so it was easy to figure out which one to keep.

7. I’d updated some of my pages, like my “About Me” and “Disclosures” pages. The export-from-prod/import-to-test process didn’t replace them, so I had to delete them in test and then import the newer versions. Worked great.

8. After a final check of things in test I dropped & recreated the production database:

mysql> DROP DATABASE prod_blog_db;
mysql> CREATE DATABASE prod_blog_db;

and then copied the test database over:

mysqldump test_blog_db | mysql prod_blog_db

9. Done! I now have my blog back, without all the spurious characters.


Comments on this entry are closed.

Previous post:

Next post: