Wednesday, August 24, 2011

Convert a PostgreSQL Database from LATIN1 to UTF8



At first I tried a procedure using iconv.
This procedure is described below and it was based on one found at http:/bryan-murdock.blogspot.com/2008/11/convert-postgresql-database-from-latin1.html
but unfortunately it didn't work for me.
So, I decided to try another alternative, that should be simple and fast, which is described here.

Procedure using editor's encoding and manual settings

Dump the database:
pg_dump -U postgres rental > rental/rental.dbs.out


Edit the file dumped on the prior step using an editor which has encoding features.
My default editor is SciTE and it was the editor that I used to do this.
Note: this editor has Windows and Linux versions.
On the editor, change the original encoding used for the file, switching to UTF-8, and "save as" the document with a new name.
Here, on this procedure I've saved it as "rental/rental.dbs.out.utf8".

After that, change the PostgreSQL's encoding on the same file.
You'll find something like this:
--
-- PostgreSQL database dump
--

SET client_encoding = 'LATIN1';
SET standard_conforming_strings = off;
...


Switch to:
--
-- PostgreSQL database dump
--

SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
...


Save the document and exit.
Remember, save it as UTF-8.

NOTE: if the dump file opens on the editor and shows gibberish and goofy signals instead of the normal data,
you must check your editor encoding or the procedure used to generate the dump file.
The text must appear legible as it should be.

Drop the old latin1 database:
dropdb rental

Log as postgres or another user with the necessary privileges:
psql -U posgres

Create the new utf-8 database:
create database rental with template = template0 owner = postgres encoding = 'utf8';

Connect to the new database created:
\c rental

Restore the converted dumped file - the one which you edited and saved as UTF-8 changing its encoding:
\i rental/rental.dbs.out.utf8

It's done.

Below, you find the final checking topic with the result obtained.


Handling Huge Files

But if the file is huge, for instance bigger than 300MB?
Using editors with huge files become difficult.
The solution is to split the file and when the procedure is finished, join the split files again.
To split you can you use cutf, a simple command line tool.
For help, just type cutf.

To append the files back, you can use:
On linux:
  cat file_to_append >> file_to_join  
On windows:
  type file_to_append >> file_to_join  

That's it.



Final checking

rental=# \c postgres
You are now connected to database "postgres".
postgres=# create database rental with template = template0 owner = postgres encoding = 'utf8';
CREATE DATABASE
postgres=# \c rental
You are now connected to database "rental".


rental=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)


rental=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)


rental=# select * from category;
cod_category | description
---------------+-------------------
1 | ação
2 | aventura
3 | biográficos
4 | comédia
5 | drama
6 | épicos
7 | espionagem
8 | fantasia
9 | ficção científica
10 | ficção histórica
11 | guerra




Attempt using iconv

Dump the database:
pg_dump -U postgres rental > rental/rental.dbs.out

Convert it to utf-8 with iconv:
iconv --from-code latin1 --to-code utf-8 rental/rental.dbs.out > rental/rental.dbs.out.utf8

Drop the old latin1 database:
dropdb rental

Created the new utf-8 database:
psql -U posgres
create database rental with template = template0 owner = postgres encoding = 'utf8';

Restore the converted backup:
psql -U postgres rental < rental/rental.dbs.out.utf8

>Before
postgres=# \l
List of databases
Name | Owner | Encoding
----------------+----------+----------
rental | postgres | LATIN1


rental=# select * from category;
cod_category | description
---------------+----------------------
1 | ação
2 | aventura
3 | biográficos
4 | comédia
5 | drama
6 | épicos
7 | espionagem
8 | fantasia
9 | ficção científica
10 | ficção histórica
11 | guerra

>After
rental=# \l
List of databases
Name | Owner | Encoding
----------------+----------+----------
rental | postgres | UTF8

rental=# select * from category;
cod_category | description
---------------+-------------------------------------------
1 | aÃ\u0083§Ã\u0083£o
2 | aventura
3 | biogrÃ\u0083¡ficos
4 | comÃ\u0083©dia
5 | drama
6 | Ã\u0083©picos
7 | espionagem
8 | fantasia
9 | ficÃ\u0083§Ã\u0083£o cientÃ\u0083­fica
10 | ficÃ\u0083§Ã\u0083£o histÃ\u0083³rica
11 | guerra

No comments:

Post a Comment

eclipse: java: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" or Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder

  >PROBLEM Using Eclipse, you try to run a simple logging test using "org.slf4j.Logger" like the sample below: package Test; im...