[ADMIN] utf8 database not dumping utf8 characters
Matt Williams
2012-04-06 19:55:19 UTC
I have a database that is utf8 and displays utf8 values correctly in psql. When dumped, it displays the utf8 characters incorrectly, e.g. ö turns into Ã¶

In the header of the dump file, I have:

SET client_encoding = 'UTF8';

So I'm not sure where the disconnect is?

Thoughts?

Thanks,

--
Matt Williams
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
Steve Crawford
2012-04-06 20:03:58 UTC
Post by Matt Williams
I have a database that is utf8 and displays utf8 values correctly in
psql. When dumped, it displays the utf8 characters incorrectly, e.g. ö
turns into Ã¶
SET client_encoding = 'UTF8';
So I'm not sure where the disconnect is?
Thoughts?
Thanks,
--
Matt Williams
Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
With what are you viewing the dump file and is everything in the chain
(terminal, less/vi/...) set to interpret/display that data as UTF8? You
can always use a hex-dump program to see the actual bytes in the file
and determine if they are what you expect for UTF8.

Cheers,
Steve
Matt Williams
2012-04-06 20:10:46 UTC
With that same dump file that is displaying incorrectly open in vim, I can paste in the utf8 character I provided as an example and it displays correctly.

--
Matt Williams
Steve Crawford
2012-04-06 22:19:27 UTC
Post by Matt Williams
With that same dump file that is displaying incorrectly open in vim, I
can paste in the utf8 character I provided as an example and it
displays correctly.
I usually find a good first step is to run the file through something
that will give you a hex dump (e.g. xxd or similar) so that I *know* the
actual bytes in the file rather than relying on how they may be
interpreted somewhere else along the chain. Find the hex byte(s) of your
suspect character and look them up.
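
A minimal Python sketch of the same idea, for cases where xxd is not at hand: it lists every run of non-ASCII bytes together with its offset and a few bytes of context. The function name `non_ascii_runs` and the sample bytes are illustrative assumptions, not something from this thread's tooling.

```python
import re

# Matches runs of bytes outside the 7-bit ASCII range.
NON_ASCII_RUN = re.compile(rb"[\x80-\xff]+")


def non_ascii_runs(data: bytes, context: int = 4):
    """Yield (offset, hex string of the run, surrounding bytes) per non-ASCII run."""
    for match in NON_ASCII_RUN.finditer(data):
        start = max(match.start() - context, 0)
        end = match.end() + context
        hex_run = " ".join(f"{byte:02x}" for byte in match.group())
        yield match.start(), hex_run, data[start:end]


if __name__ == "__main__":
    # Hypothetical sample; for a real dump, read the file in binary mode,
    # e.g. data = open(path, "rb").read(), where path is your dump file.
    sample = b"tr\xc3\x83\xc2\xb6mer\t0"
    for offset, hex_run, window in non_ascii_runs(sample):
        print(f"{offset:#010x}: {hex_run}   {window!r}")
```

Reading the file in binary mode matters here: it bypasses whatever encoding the terminal or editor would otherwise apply.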

Since you are in vim, it may be worth checking ":set termencoding",
":set encoding" and ":set fileencoding".

Cheers,
Steve
--
Sent via pgsql-admin mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
Matt Williams
2012-04-07 03:28:09 UTC
I ran it through xxd and the hex-bytes are different than those of the proper utf8 character:

03300a0: 7472 c383 c2b6 6d65 7209 3009 5c4e 0931 tr....mer.0.\N.1 (from the dump file)

ö : c3b6

Ã¶ : c383 c2b6
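
As a cross-check of the two byte sequences above, here is a small Python sketch that decodes both as UTF-8 and lists the non-ASCII code points each one contains. The helper name `non_ascii_chars` is an assumption made for illustration; the bytes are copied from the xxd line.

```python
# Bytes copied from the xxd line of the dump file ("tr....mer").
dump_bytes = bytes.fromhex("7472 c383 c2b6 6d65 72")

# What a correctly stored 'tr\u00f6mer' (o with diaeresis) looks like in UTF-8.
expected_bytes = "tr\u00f6mer".encode("utf-8")


def non_ascii_chars(raw: bytes):
    """Decode raw as UTF-8 and return (character, code point) for non-ASCII characters."""
    return [(ch, f"U+{ord(ch):04X}") for ch in raw.decode("utf-8") if ord(ch) > 0x7F]


print(expected_bytes.hex(), non_ascii_chars(expected_bytes))
print(dump_bytes.hex(), non_ascii_chars(dump_bytes))
```

Both sequences are valid UTF-8. The dump simply contains two characters, U+00C3 and U+00B6, where a single U+00F6 was expected.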

--
Matt Williams
Albe Laurenz
2012-04-10 12:52:45 UTC
Post by Matt Williams
03300a0: 7472 c383 c2b6 6d65 7209 3009 5c4e 0931 tr....mer.0.\N.1 (from the dump file)
ö : c3b6
Ã¶ : c383 c2b6
That looks like your database does not contain what you think it does.

True, there *are* UTF-8 characters in it, but not the ones you want,
even though everything looks OK on the surface.

Imagine this scenario:
- Database server encoding is UTF8
- Database client encoding is LATIN1
- Application feeds UTF-8 into PostgreSQL.

This can easily happen if the locale of the postgres user account
is ISO8859-1 and the application did not set the PGCLIENTENCODING
environment variable.

The application stores 'trömer', i.e. it passes the following bytes to PostgreSQL:
74 72 c3 b6 6d 65 72

PostgreSQL interprets these bytes as LATIN1 (the client encoding), i.e. 'trÃ¶mer'.

This is converted to UTF-8 and stored in the database as
74 72 c3 83 c2 b6 6d 65 72

When the application retrieves the string, it will get back what
it originally stored, and everybody is happy, that is until somebody
looks closer or wonders why full text search isn't working for
German umlauts.
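
The scenario above can be reproduced outside the database. The following Python sketch only imitates the conversion steps with Python's own codecs; the function names are hypothetical, and it assumes that Python's "latin-1" codec behaves like PostgreSQL's LATIN1 for these bytes.

```python
def store_with_wrong_client_encoding(text, actual="utf-8", declared="latin-1"):
    """Imitate an application sending `actual` bytes while client_encoding says `declared`."""
    wire_bytes = text.encode(actual)       # what the application really sends
    misread = wire_bytes.decode(declared)  # how the bytes are interpreted on arrival
    return misread.encode("utf-8")         # what ends up stored in the UTF8 database


def fetch_with_wrong_client_encoding(stored, actual="utf-8", declared="latin-1"):
    """Imitate reading the value back through the same mismatched connection."""
    wire_bytes = stored.decode("utf-8").encode(declared)  # server converts to `declared`
    return wire_bytes.decode(actual)                      # application reads it as `actual`


stored = store_with_wrong_client_encoding("tr\u00f6mer")
print(stored.hex(" ", 1) if hasattr(bytes, "hex") else stored)
print(fetch_with_wrong_client_encoding(stored))
```

The two conversions cancel each other out, which is why the application never notices, while anything that reads the stored bytes directly (a dump, full text search) sees the doubled encoding.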

As to fixing the situation (if the above is actually your problem),
dump the database with -E LATIN1, edit the dump, change LATIN1
to UTF8 in the "SET client_encoding" statement and load it again.
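
Why relabelling the dump works can be sketched in Python: writing the stored text as LATIN1 yields exactly the bytes the application originally sent, and those bytes are valid UTF-8. The function `undo_double_encoding` is a hypothetical helper for checking individual values, not part of pg_dump; it also shows that a value that was stored correctly would not survive the same treatment.

```python
def undo_double_encoding(stored_text: str) -> str:
    """Reverse a UTF-8-read-as-LATIN1 mix-up.

    Raises ValueError if the text does not fit that pattern, for example
    because it was stored correctly in the first place.
    """
    try:
        latin1_bytes = stored_text.encode("latin-1")  # what a LATIN1 dump would write
        return latin1_bytes.decode("utf-8")           # what reloading as UTF8 would read
    except (UnicodeEncodeError, UnicodeDecodeError) as exc:
        raise ValueError(f"not double-encoded: {stored_text!r}") from exc


# 'tr\u00c3\u00b6mer' is the mojibake form found in the dump.
print(undo_double_encoding("tr\u00c3\u00b6mer"))

# A correctly stored value becomes the single byte f6, which is not valid UTF-8.
try:
    undo_double_encoding("tr\u00f6mer")
except ValueError as err:
    print("left alone:", err)
```

If the database holds a mix of correctly stored and double-encoded values, this suggests the whole-dump approach could fail on reload, so it may be worth checking a sample of rows first.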

Yours,
Laurenz Albe
Scott Whitney
2012-04-06 20:04:44 UTC
What version, Matt? I had that problem back in 8.1x

Matt Williams
2012-04-06 20:11:07 UTC
version 9.1

--
Matt Williams