College Publisher to WordPress conversion script is now open source

Alternate title for this post: Let the exodus continue. The Python conversion script CoPress used to migrate over 50 student publications to the glorious free and open source WordPress is now itself licensed under GPL version 2. It’s optimized for College Publisher 4 and College Publisher 5 databases, but will also work with almost any database you can turn into a flat CSV file. You can fork it on GitHub or download the brand new 1.0 release.

Right off the bat, I’d like to say that the most awesome bit about the conversion script is its ease of use. Granted, you do have to run it on the command line and it does often throw mythical, unintelligible errors if your data is screwy, but it’s about 100 to 1,000 times easier than what Sean Blanda or Brian Schlansky had to go through. Furthermore, it spits out WordPress eXtended RSS files that WordPress imports natively. Depending on the size of your archives, you could even do the entire migration in less than a half hour.

Back up your database using Sequel Pro. This is a critically important step, as you’ll definitely want a clean version to revert to if the import goes awry.

Place the conversion script and your archives in a folder you can access from the command line. Both College Publisher 4 and College Publisher 5 migrants should receive an articles file that will need to be renamed “stories.csv.” Publications migrating from the former will have all of their image references stored in a file that will need to be renamed “media.csv.” Navigate to that directory from your terminal prompt and run “python CoPress-Convert.py.”
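As a sketch, the setup described above looks something like this on a Unix-ish command line (“articles-export.csv” is a hypothetical name for whatever College Publisher sent you; only the renamed targets are real):

```shell
# Put everything in one folder; the script expects these exact names.
mkdir -p migration
printf 'id,title,body\n' > migration/articles-export.csv  # stand-in for the CP export
mv migration/articles-export.csv migration/stories.csv    # articles file, renamed
# CP4 migrations also rename their image-reference export:
#   mv migration/images-export.csv migration/media.csv
# Then, from inside that folder:
#   python CoPress-Convert.py
```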

Once the script is running, you’ll be asked a series of questions to configure the conversion process. Most options are self-explanatory, and all are explained fully in the README file packaged with the script. The most important thing I’d like to note in this post is that, unless you have fewer than 500 authors in your archives, I’d highly, highly recommend importing your authors as custom fields instead of users. WordPress is not optimized to add a large number of new users through its import process. We learned this the hard way migrating CM Life‘s database last summer.
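As a rough sketch of what the custom-field route produces (the `original_author` meta key is a hypothetical name; the script’s actual key may differ), each exported WXR item carries the byline as post meta instead of a `<dc:creator>` that would create a new WordPress user:

```xml
<item>
  <title>Example story</title>
  <!-- Custom-field route: the byline rides along as post meta, so no
       new WordPress user account is created. Meta key is hypothetical. -->
  <wp:postmeta>
    <wp:meta_key>original_author</wp:meta_key>
    <wp:meta_value><![CDATA[Jane Smith]]></wp:meta_value>
  </wp:postmeta>
</item>
```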

When the script is done, you’ll have a series of WordPress eXtended RSS files you can easily upload into WordPress.

Feel free to send along any suggestions for improvement, bug reports, fixes, or general comments. I intend to maintain it for the indefinite future (it’s good Python practice when everything else I’m working on is PHP), but code contributions are always welcome. There is a short list of upgrades under consideration at the top of the script.

Hmm… I seem to be getting an error right from the start. Do you have any clue what it can be? I have all of the images as well as stories.csv in the same directory as the python file. http://pastebin.com/HyPeB2cR

Yeah, it looks like they did. They seem to be using “|||” as the delimiter. Do you know what the second least painful way of doing this is? I got the data into Excel and I’ve been cleaning it up a bit: http://i.imgur.com/aZA8F.png

To be honest, I think the last time I had to deal with this I just asked them to send an actual CSV. If you can open it in Excel, you should be able to then export it again as standard CSV. The technical problem is that Python’s csv library doesn’t support parsing files with multi-character delimiters.

Hey, csvkit won’t handle multi-character delimiters either, but I suspect there are no legitimate “|||” strings in your data, so you can probably just replace them with a single-character delimiter. In vim this would be something like:

:%s/|||/,/g

But you could just as easily do it with Find/Replace in whatever your text editor of choice is. Once you’ve got it down to a single delimiter you should be able to process it with Excel or whatever.
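If you’d rather do the replacement in code, here’s a minimal Python sketch of the same idea, assuming (as above) that no legitimate “|||” strings occur in the data:

```python
import csv
import io

# Python's csv module only accepts single-character delimiters, so
# collapse the "|||" separator to "|" before parsing. This assumes no
# field legitimately contains "|||" (or "|", after the replacement).
raw = "id|||title|||body\n1|||First story|||Hello world\n"
cleaned = raw.replace("|||", "|")
rows = list(csv.reader(io.StringIO(cleaned), delimiter="|"))
print(rows[1])  # ['1', 'First story', 'Hello world']
```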

That being said, who knows what other land mines are in that data. (quoting?) If you can get a clean file, that’s a much better solution.

*Application: vim (the file was too big for anything else)*
First of all, CP totally messed my data up. The file they sent me was 130,000+ lines long. I did “:sort u” in vim (sort and remove duplicates) and it left me with about 30,000+ lines (that’s a lot of duplicates). After that, I would definitely suggest using Excel to clean things up; it makes things MUCH easier.
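For the curious, vim’s “:sort u” sorts the buffer and keeps only unique lines; in Python terms it’s roughly:

```python
# Rough Python equivalent of vim's ":sort u": sort all lines and drop
# exact duplicates. Near-duplicates with scrambled columns survive,
# which is why a manual cleanup pass is still needed afterward.
lines = [
    "2|Second story|Body text",
    "1|First story|Body text",
    "1|First story|Body text",   # exact duplicate: removed
]
deduped = sorted(set(lines))
print(deduped)
```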

*Application: Excel*
I imported it into Excel using the pipe “|” as the delimiter and just told it to ignore consecutive delimiters (I’m willing to lose a few stories that contain a “|” character… worth it). Then I did a find on all ² characters (they were used as quotes in mine) and replaced them with an empty string.

At this point I still had a lot of really messed-up data. Since it was nicely sorted, I noticed there were triple duplicates of many stories, with only one containing valid data (the other two would have the columns all scrambled). I realized unscrambling data wasn’t helping; I just needed to delete the messed-up rows. To do this I filtered certain columns for things that didn’t make sense: for example, the title or body being blank, or a category that didn’t start with a colon. With all that bad data in view I was able to delete thousands of rows at a time.

After lots and lots and lots of filtering and deleting (down to about 20,000+ lines now), I decided to save the CSV and try the script. This part was weird: to get Excel to save a CSV with “|” as the delimiter, I needed to go into Windows Control Panel > Regional and Language Settings > Additional Settings and change the “List separator” to “|” instead of a comma. With that done, I saved it as stories.csv from Excel. Yay, pipe delimited. Now for the script.
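The same filter-and-delete pass can be sketched in plain Python (the column names are hypothetical; the rules mirror the ones just described: blank title or body, or a category that doesn’t start with a colon):

```python
# Keep only rows that pass the sanity checks described above; anything
# failing them is assumed to be a scrambled duplicate and dropped.
rows = [
    {"title": "Good story", "body": "Full text here", "category": ":news"},
    {"title": "", "body": "Orphan body", "category": ":news"},      # blank title
    {"title": "Scrambled", "body": "", "category": ":sports"},      # blank body
    {"title": "Bad cat", "body": "Some text", "category": "news"},  # no colon
]
clean = [
    r for r in rows
    if r["title"] and r["body"] and r["category"].startswith(":")
]
print(len(clean))  # 1
```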

*Application: Your favorite text editor (Notepad++)*
Some edits I had to make to the script included changing the delimiter to “|” and editing the date string. CP saved my dates as “Fri, Feb 10, 2012.” The script didn’t like the weekday in there, so at the beginning of the date-parsing function I did datestring = datestring[5:], which removed the day of the week. There were some other tweaks I made along the way, but I forget the details… sorry.
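The weekday-stripping tweak looks like this in isolation (the “%b %d, %Y” format string is my assumption about what the script parses after the trim):

```python
from datetime import datetime

# CP exported dates like "Fri, Feb 10, 2012". Slicing off the first
# five characters ("Fri, ") removes the weekday the parser chokes on.
datestring = "Fri, Feb 10, 2012"
datestring = datestring[5:]                      # "Feb 10, 2012"
parsed = datetime.strptime(datestring, "%b %d, %Y")
print(parsed.date())  # 2012-02-10
```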

I’m really excited I got this to work; I don’t know what I would have done without this script. Thank you so much!!! If anyone has any questions about what I did, I’d be happy to help. I spent a lot of time Googling and experimenting; hopefully I can save someone else the time and frustration.
