a few bugs (pages, tags, draft posts)

What I found after importing from blogger:
– pages are not imported
– a draft post is imported, but its title is not. Its content is OK, but its title changes to “(no title)”
– tags are imported as categories
– images: I had two images uploaded on Blogger. Both were imported correctly, but two more JPG entries were imported as “(no title)” with no picture. So after the import I had four images in the WordPress media library.
I think the images are duplicated during the import: one copy is imported correctly and the duplicate is not.

Kata, thanks for your feedback.
Pages are not currently supported by the version of the API that the importer uses. Switching to the new API means changing the security protocol from OAuth to OAuth 2, so it’s not a trivial fix.
I’ve imported lots of draft posts and not had any issue with the titles going across, that’s quite strange. Is there anything specific about the titles perhaps?
Blogger does not distinguish between categories and tags in its labels, so yes, the labels are imported as categories. There are tools to swap those over to tags. If they were loaded as tags, I’m not sure there is a tool to move them in the other direction.
There’s not much to go on for your images issue. Do you know how the HTML was represented in the source post?
Cheers,
Andy

If you can give me an email address I can add you to this blog and you can check it yourself.

The draft post title has accents (áéíöüóőúű). I created a new draft post without accents and its title imported correctly. Published posts do not have this issue.

The previous version of the Blogger importer had a bug with accents in image names: if an image name had accents, that image was not imported.
I tested just now and the image is imported, but its name in the media library looks URL-encoded: st-C3-B3ck-p-C3-BAzzl-C3-A9p-C3-AD-C3-A9c-C3-A9as
🙂

Great diagnostics, looks like the accents are the key to the issues. I can add something similar to my test blog, will let you know if I can’t reproduce it on that.
The image code does some filtering on the filenames and replaces potentially problematic characters with a hyphen (-).
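For what it is worth, the name from the media library is consistent with the accented characters being percent-encoded as UTF-8 bytes and the % signs then being swapped for hyphens. That is a guess about the mechanism, not a description of the importer's actual code. A small Python sketch that reverses it (Python only for illustration, the importer itself is PHP, and decode_hyphenated_name is a made-up helper):

```python
import re
from urllib.parse import unquote


def decode_hyphenated_name(name: str) -> str:
    """Hypothetical helper: read every "-XX" (two uppercase hex digits)
    as one percent-encoded UTF-8 byte and decode the result.

    Heuristic only: a genuine hyphen followed by two hex-looking
    characters would be misread as an encoded byte.
    """
    # "-C3-B3" -> "%C3%B3"; hyphens not followed by two hex digits stay as-is
    percent_encoded = re.sub(r"-([0-9A-F]{2})", r"%\1", name)
    return unquote(percent_encoded, encoding="utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_hyphenated_name(
        "st-C3-B3ck-p-C3-BAzzl-C3-A9p-C3-AD-C3-A9c-C3-A9as"
    ))
```

Run on the name above, this gives stóck-púzzlépíécéas, which suggests the bytes themselves are valid UTF-8 and only the display name was mangled.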

I created a new wp blog and imported again from blogger and I have the same error.
Title on blogger: hallihow Télapó
The content is the same as the title and an image.
This post was never published; I saved it as a draft.
My blog language is Hungarian.

I tried that title and that also worked for me. I tried switching Blogger so that its language was Hungarian and that made no difference either.
In your WordPress config file, what is define('WPLANG', ''); set to?
Do you have the php xml module installed for Apache?

Workshopshed: Find out the character set of the underlying database that the data is being imported into. Accents can be weird with different MySQL character sets. You may need to use some iconv trickery here.

UTF-8 is naturally preferred by WordPress, but it doesn’t enforce the character set because it doesn’t know the character set of random data that you give it.

So, if your database tables are set to UTF-8, but the data is not (say it’s ISO-8859-1), then if it has invalid characters, the resulting insertion into the database can lead to bad results, because MySQL rejects the string as non-UTF-8 (or worse yet, truncates it at the first non-UTF-8 character).

Unfortunately, it’s often a guess as to what the data is encoded as. You have to look at the data’s binary representation and sort of figure it out. It’s possible that Blogger is returning ISO-8859-1 data, which will work fine if your table is not UTF-8 but will fail if it is. Or vice versa. Hard to say. This is why the details of the specific case matter.
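As a rough illustration of looking at the binary representation, here is a minimal Python sketch (not the importer's code; looks_like_utf8 and hex_dump are made-up names) that dumps the bytes of a title and checks whether they form valid UTF-8. Note that plain ASCII text is valid in both encodings, so only the accented characters tell you anything.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated hex so accents are easy to spot."""
    return " ".join(f"{byte:02x}" for byte in data)


if __name__ == "__main__":
    title = "hallihow Télapó"
    # Simulate the same title arriving in two different encodings.
    for label, raw in (
        ("UTF-8", title.encode("utf-8")),
        ("ISO-8859-1", title.encode("iso-8859-1")),
    ):
        print(label, "|", hex_dump(raw), "| valid UTF-8:", looks_like_utf8(raw))
```

In UTF-8 the é shows up as the two bytes c3 a9, while in ISO-8859-1 it is the single byte e9, which is not valid UTF-8 when followed by an ordinary letter.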

Ideally, WordPress creates UTF-8 tables. WordPress has used UTF-8 as the default for a very long time, but at one point in time it did not. If it’s a new install, it should be a UTF-8 table. Which means that if the data is not UTF-8, then you need to convert it. This also means that if your particular test install is old and/or not using UTF-8 tables for whatever reason, you might not have the same problems inserting seemingly the same data.
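The conversion step could look something like the Python sketch below. It is only an illustration of the idea, on the assumption that anything that is not valid UTF-8 is ISO-8859-1, and that assumption has to be confirmed for the actual data. In PHP the equivalent would be done with iconv or similar; to_utf8 is a made-up name.

```python
def to_utf8(data: bytes, fallback_encoding: str = "iso-8859-1") -> bytes:
    """Return the data as UTF-8 bytes.

    Assumption: input that is not already valid UTF-8 is encoded in
    `fallback_encoding`. A wrong guess here produces mojibake, not an error.
    """
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        return data.decode(fallback_encoding).encode("utf-8")


if __name__ == "__main__":
    latin1_title = "hallihow Télapó".encode("iso-8859-1")
    print(to_utf8(latin1_title).decode("utf-8"))
```

Checking for valid UTF-8 first avoids double-encoding data that was fine to begin with.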

So, look at your character set on your test bed too. And try to examine the binary form of whatever blogger is sending back as well.

Looking at the Blogger feed, that too is UTF-8:
<?xml version='1.0' encoding='UTF-8'?>
I tried swapping WPLANG and that made no difference either, although you could see which plugins have been translated and which have not.
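The XML declaration only says what the feed claims to be. To confirm the bytes really match, a check along these lines could be run against a saved copy of the feed. This is a Python sketch with made-up helper names, it is not part of the importer, and it falls back to UTF-8 when no encoding is declared.

```python
import re
from typing import Optional

# Matches the encoding attribute of an XML declaration at the start of the data.
_XML_DECL = re.compile(rb"\s*<\?xml[^>]*?encoding=[\"']([A-Za-z0-9._-]+)[\"']")


def declared_xml_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, if there is one."""
    match = _XML_DECL.match(raw)
    return match.group(1).decode("ascii") if match else None


def matches_declaration(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly with the declared encoding."""
    encoding = declared_xml_encoding(raw) or "utf-8"
    try:
        raw.decode(encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


if __name__ == "__main__":
    header = b"<?xml version='1.0' encoding='UTF-8'?>"
    good = header + "<title>hallihow Télapó</title>".encode("utf-8")
    bad = header + "<title>hallihow Télapó</title>".encode("iso-8859-1")
    print(declared_xml_encoding(good), matches_declaration(good), matches_declaration(bad))
```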

You could try turning on logging to file and adding some logging to the import class.

In blogger-importer-blog.php locate the function import_posts.

After line 145:

$blogentry->categories = $item->get_categories();

add this new line:

Blogger_Import::_log($blogentry);

That should dump out the content of the data parsed from blogger into the log. This should tell us if it’s the parser that’s erroring or if it’s at the next stage when it writes it to the DB.