In late August I moved my blog from a host running Red Hat Enterprise Linux 4 to a host running Red Hat Enterprise Linux 6. With it came an upgrade to MySQL, from 4 to 5. The migration itself seemed pretty painless, as I was able to recompile Apache on the new host and just copy everything over intact. I moved the databases by dumping each one and importing them on the other side.
A week ago I was doing some work on the blog and noticed that for every apostrophe there were three other characters now: ’. I use a lot of apostrophes, and it sort of sucks to have every “it’s” become “it’s”. More specifically, it was only the posts I’d written in Windows Live Writer, too. Apparently Live Writer replaces apostrophes with the UTF-8 “acute accent” character. Why? No idea, though you can shut it off in the options (which, now that I know this, I strongly recommend).
Alex King wrote about this problem a few years ago. Older WordPress databases contain UTF-8 data, but have their defaults set to Latin1 encoding. So when I reloaded the data on the new host it converted all the UTF-8 special characters to Latin1, because that’s how the tables were set up. In his case he was creating a test database and could just reload it. In my case I had newer posts I wanted to preserve, which complicated things.
Here’s how I fixed it all.
1. I created a test copy of my blog with the goal of setting test up correctly, then copying it back to production. I used the same web server, set up for virtual hosting, and just used “cp -ax” to copy the whole web site over to a test directory. I set permissions on the host so only my client workstation could access the test copy, to prevent search engines and whatnot from getting to it.
I edited wp-config.php to point the copy of the blog at the test database. I also used the advice in the WordPress Codex on forcing the site URLs via wp-config.php so that I could keep the database settings the same as production, but it wouldn’t redirect me to (or reconfigure) the production site. I forced the site URLs in both test and production, and I also added a test comment to production, on an old, obscure post, that just said “PRODUCTION.” That way, when things got crossed up I could tell where I was.
I also dumped a copy of the corrupted production database, just in case.
2. I had the original database dump I’d made with mysqldump during the migration. Using the UNIX command ‘less’ I looked at the dump and noted the default charset was Latin1, as expected:
DROP TABLE IF EXISTS `wp_posts`; CREATE TABLE `wp_posts` ( ...lots of stuff was here... ) ENGINE=MyISAM AUTO_INCREMENT=3654 DEFAULT CHARSET=latin1;
3. A little sed action to get it to create the databases as UTF-8:
sed s/"DEFAULT CHARSET=latin1"/"DEFAULT CHARSET=utf8"/g original-dump.sql > utf8-dump.sql
4. Now we can import it to the test database:
cat utf8-dump.sql | mysql test_blog_db
You could get ballsy and pipe the output of sed right into the MySQL client, but I like having a copy around so I can see what I imported later, if I need to.
5. I logged into the test instance of WordPress and confirmed I had what looked like apostrophes, and not all those crazy characters. I then proceeded to adjust my plugins and such so that test looked like production again, replicating the changes from the last couple of weeks.
6. WordPress has the ability to export and import post information (via the Tools part of the admin interface), and the post export function can take a date range. I looked at the test blog to see when the last post was and used that to set the date range for the export of posts from the production blog.
It’s okay if it’s not 100% accurate, the import procedure will not recreate identical blog posts. I imported the posts, instructing it to download media and images. After importing the posts to the test database I deleted any duplicates I had. The duplicates all had corrupted characters so it was easy to figure out which one to keep.
7. I’d updated some of my pages, like my “About Me” and “Disclosures” pages. The export-from-prod/import-to-test process didn’t replace them, so I had to delete them in test and then import the newer versions. Worked great.
8. After a final check of things in test I dropped & recreated the production database:
mysql> DROP DATABASE prod_blog_db; mysql> CREATE DATABASE prod_blog_db;
and then copied the test database over:
mysqldump test_blog_db | mysql prod_blog_db
9. Done! I now have my blog back, without all the spurious characters.
That’s pretty good Bob…it’s annoying how small things like that can cause so many issues.
I have a permanent “test” site that I use from time to time to try out new themes, upgrades, patches, plug-ins, etc. I had a similar issue when I first started although I needed to modify the author the a few hundred posts so I backed up the database, dumped it… hacked away with sed and the re-imported it back into MySQL.
I’m running CentOS 5.5 on my server. Are you still running PHP 5.1.x? I started using the Atomic repository and updated to PHP 5.3.x and it’s been very stable.
Cheers!
Mike
Though I talked about it like I just set it up, I really do have a permanent test setup, too. I don’t leave it up all the time lest I need to patch it, worry about Google detecting duplicate content, etc. if the security controls on it are messed up. It’s invaluable for plugin testing, theme hacking, etc.
I build my own Apache & PHP, actually. It isn’t hard to do and it gives me much better control over build options than a prepackaged RPM does. Plus, Red Hat doesn’t ever rev the version of PHP, so this way I get newer stuff. Often, software like WordPress won’t even run on the vendor-supplied LAMP stack components anymore. It does suck that I have to keep up with security announcements myself, rather than aggregated through a vendor, but it’s a small price to pay for flexibility.
If you are going from an more or less antique LAMP-Stack to a more or less actual, you could do a sed -e ‘s/ENGINE=MyISAM/ENGINE=InnoDB/’ which is the default since a while.
If you still encounter Latin-1 characters within your dump use
recode -v Latin-1..UTF8
to make sure there are none left.