Character set issues in Postgres upgrade

Question

I am moving some databases from Postgres 7.4.8 to 8.4.5, on CentOS 5. In the old database the encoding is UNICODE. So I did a text pg_dump, created my new databases like so:

createdb --template template0 --encoding unicode testdb

and imported the data:

psql testdb < testdb.dmp >& testdb.out

Unfortunately, these errors are sprinkled throughout:

ERROR: invalid byte sequence for encoding "UTF8": 0x8e

I know that iconv is the tool most often recommended to fix this, but I'm not sure what the conversion is. I tried running it from Unicode to UTF8, since that is what Postgres reports for my new database, but that produced a file full of what looked like Chinese characters. I also tried UTF8 to UTF8; that chokes on the first of the offending characters with this:

iconv: illegal input sequence at position 874924

I'd like to know a) why am I having problems when I'm not changing the character set? And b) is there any way of fixing this short of editing the dump file by hand?

Thanks!

score 0 · Answer 1 · answered Apr 14 '11 at 17:29

For posterity... it turned out that (I think) I had a mix of several different character sets in the data. So what I ended up doing was first running

iconv -f macintosh -t utf-8

To fix the accented a characters and others that were causing problems on the import, and then doing manual search and replace to get rid of all the Windows smart characters, which got mangled by the above conversion. After that, everything imported cleanly into Postgres and looked ok. Not the best solution, but the only thing I could figure out that worked.

Character set issues in Postgres upgrade

1 Answers1