渡鴉之丘: 在 PGSQL 裡面把 SQL_ASCII Database 裡面的 Multibytes Encoding 資料轉換到 UTF8 Database

很舊以前，Big5 的資料會放到裡面，不讓資料庫去處理內容。就是所謂的 Garbage in, garbage out. 。這種設置方式是把 PGSQL 資料庫初始化成 LATIN1 的資料庫（也就是 SQL_ASCII 的意思～）。若是 Oracle 的話，則是 iso-8859-1（與 Linux 的 file 指令查看 ASCII 資料所顯示的 MIME 名稱一樣～）
到了現在以 UTF8 為主流編碼的氛圍，可能就會想要把上面這種設置換掉，變成正統 UTF8 編碼，以利於資料交換的使用。
若現在資料已經在 PGSQL 的話，可以怎麼做呢？這邊用一個簡單的例子作練習。

這邊示範的版本是 PGSQL10，initdb 成 UTF8 Instance，在裡面放 SQL_ASCII 與 UTF8 編碼的資料庫（兩個）。

然後，在 SQL_ASCII 的資料庫裡面塞入 BIG5 編碼的資料，接著用 pg_dump 匯出，作兩個調整：

變更匯出 SQL 檔案的編碼：雖然內容是 Big5 的資料，但是檔案本身的標記會是 ISO-8859（也就是 single-byte 編碼）
變更裡面的 client_encoding：pg_dump 匯出時是 sql_ascii，要調整成 big5，才會匹配內容

然後塞到 UTF8 Database 裡面，最後確認沒問題～

a) 這邊準備內容，需要是包含 BIG5 中文字的 DDL 跟 BIG5 的資料內容。

產生 BIG5 編碼的方式，這邊先用 UTF8 處理文字檔，寫好 SQL Script，再用 iconv 處理，轉換成 BIG5 編碼。接著再準備匯入。

由於我這邊用的是 Google Cloud 的 VM，直接使用網頁開啟的 ssh 而不是其他像是 putty 的連線工具，因此沒辦法調整顯示編碼，讓 big5 資料正常呈現，所以使用 hex code 的方式去呈現出來

-bash-4.2$ echo $LC_ALL

-bash-4.2$ echo $LC_CTYPE

-bash-4.2$ echo $LANG
en_US.UTF-8
-bash-4.2$ cat << EOF > ~/original_text_utf8.sql
create table test_big5(
txt TEXT
);
comment on table test_big5 is '把big5中文放在單編碼資料庫的表';
insert into test_big5 values ('測試中文字'), ('聽說以前許功蓋會出事？');
EOF
-bash-4.2$ file -i ~/original_text_utf8.sql 
/var/lib/postgres//original_text_utf8.sql: text/plain; charset=utf-8
-bash-4.2$ iconv -f UTF-8 -t BIG-5 ~/original_text_utf8.sql > ~/original_text.sql
-bash-4.2$ 
-bash-4.2$ cat ~/original_text.sql
create table test_big5(
create table test_big5(
txt TEXT
);
comment on table test_big5 is '��big5�����b��s�X��Ʈw����';
insert into test_big5 values ('��դ���r'), ('ť���H�e�\�\�\�|�X�ơH');
-bash-4.2$ 
-bash-4.2$ 
-bash-4.2$ file -i ~/original_text.sql 
/var/lib/edb//original_text.sql: text/plain; charset=iso-8859-1
-bash-4.2$ 
-bash-4.2$ hexdump -C ~/original_text.sql
00000000  63 72 65 61 74 65 20 74  61 62 6c 65 20 74 65 73  |create table tes|
00000010  74 5f 62 69 67 35 28 0a  74 78 74 20 54 45 58 54  |t_big5(.txt TEXT|
00000020  0a 29 3b 0a 63 6f 6d 6d  65 6e 74 20 6f 6e 20 74  |.);.comment on t|
00000030  61 62 6c 65 20 74 65 73  74 5f 62 69 67 35 20 69  |able test_big5 i|
00000040  73 20 27 a7 e2 62 69 67  35 a4 a4 a4 e5 a9 f1 a6  |s '..big5.......|
00000050  62 b3 e6 bd 73 bd 58 b8  ea ae c6 ae 77 aa ba aa  |b...s.X.....w...|
00000060  ed 27 3b 0a 69 6e 73 65  72 74 20 69 6e 74 6f 20  |.';.insert into |
00000070  74 65 73 74 5f 62 69 67  35 20 76 61 6c 75 65 73  |test_big5 values|
00000080  20 28 27 b4 fa b8 d5 a4  a4 a4 e5 a6 72 27 29 2c  | ('.........r'),|
00000090  20 28 27 c5 a5 bb a1 a5  48 ab 65 b3 5c a5 5c bb  | ('.....H.e.\.\.|
000000a0  5c b7 7c a5 58 a8 c6 a1  48 27 29 3b 0a           |\.|.X...H');.|
000000ad
-bash-4.2$

b) 在同一個 DB Instance 內（initdb 為 UTF8 的系統），同時建立並存的一個 UTF8 與一個 SQL_ASCII 的 database。
預設上，已經有一個 postgres UTF8 database，接著再開一個 SQL_ASCII 的就可以了。這邊要注意的是，我們需要指定 TEMPLATE0 而非保持預設的 TEMPLATE1，這樣才有指定編碼的彈性

postgres=# create database singlebyte encoding sql_ascii template template0;
CREATE DATABASE
postgres=# 
postgres=# \x
Expanded display is on.
postgres=# \l
List of databases
-[ RECORD 1 ]-----+----------------------
Name              | postgres
Owner             | postgres
Encoding          | UTF8
Collate           | en_US.UTF-8
Ctype             | en_US.UTF-8
Access privileges | 
-[ RECORD 2 ]-----+----------------------
Name              | singlebyte
Owner             | postgres
Encoding          | SQL_ASCII
Collate           | en_US.UTF-8
Ctype             | en_US.UTF-8
Access privileges | 
-[ RECORD 3 ]-----+----------------------
Name              | template0
Owner             | postgres
Encoding          | UTF8
Collate           | en_US.UTF-8
Ctype             | en_US.UTF-8
Access privileges | =c/postgres          +
                  | postgres=CTc/postgres
-[ RECORD 4 ]-----+----------------------
Name              | template1
Owner             | postgres
Encoding          | UTF8
Collate           | en_US.UTF-8
Ctype             | en_US.UTF-8
Access privileges | =c/postgres          +
                  | postgres=CTc/postgres

postgres=#

c) 把上面的 SQL 匯入 singlebyte database：這邊需要注意，匯入時要「騙」資料庫，這是 sql_ascii 編碼：這個騙術可以成功，原因是基本的 ASCII 字碼在 Big5、UTF8、sql_ascii 裡面都是重合的，所以 SQL 指令的內容都正常

-bash-4.2$ psql -d test1byte -c 'set client_encoding = sql_ascii;' -f ~/original_text.sql 
SET
CREATE TABLE
COMMENT
INSERT 0 2
-bash-4.2$ 
-bash-4.2$ psql -d test1byte 
test1byte=# SELECT obj_description('public.test_big5'::regclass);
ERROR:  invalid byte sequence for encoding "UTF8": 0xa7
test1byte=# set client_encoding =sql_ascii;
SET
test1byte=# SELECT obj_description('public.test_big5'::regclass);
        obj_description         
--------------------------------
 ��big5�����b��s�X��Ʈw����
(1 row)

test1byte=# 
test1byte=# SELECT obj_description('public.test_big5'::regclass)::bytea;
                        obj_description                         
----------------------------------------------------------------
 \xa7e262696735a4a4a4e5a9f1a662b3e6bd73bd58b8eaaec6ae77aabaaaed
(1 row)

test1byte=#

d) 用 pg_dump 匯出 singlebyte database，確認編碼：內容雖然還是一樣是 BIG5，但是檔案的屬性卻不是

-bash-4.2$ pg_dump -d test1byte -f test1byte_dmp.sql
-bash-4.2$ file -i test1byte_dmp.sql
test1byte_dmp.sql: text/plain; charset=unknown-8bit
-bash-4.2$

e) 變更 dump 裡面的 client_encoding 指令：由於是從 sql_ascii 裡面倒出來的，pg_dump 會指定 client_encoding 為 sql_ascii，這邊要手動調整成 big5

-bash-4.2$ vi ~/test1byte_dmp.sql
-bash-4.2$ ### 編輯完之後
-bash-4.2$ grep 'client_encoding' ~/test1byte_dmp.sql
SET client_encoding = 'BIG-5';
-bash-4.2$

這邊建議一下，如果檔案太大，記得不要用編輯器去打開檔案改掉，盡量拼湊 sed 字串替代指令來處理，不然電腦會卡卡的。。

f) 把調整過的 dump 匯進去 UTF8 的 postgres database

-bash-4.2$ psql -d postgres -f ~/test1byte_dmp.sql

g) 確認匯入後的資料編碼

postgres=# show client_encoding ;
 client_encoding 
-----------------
 UTF8
(1 row)

postgres=# 
postgres=# select * from test_big5 ;
          txt           
------------------------
 測試中文字
 聽說以前許功蓋會出事？
(2 rows)

postgres=# 
postgres=# SELECT obj_description('public.test_big5'::regclass);
        obj_description         
--------------------------------
 把big5中文放在單編碼資料庫的表
(1 row)

postgres=#

上面顯示，資料已經變成 UTF8 的編碼了～

另外，試驗 postgres_fdw 可否跨 sql_ascii 跟 utf8 存取資料呢？
在 postgres_fdw 中，允許使用者指定許多的 Client-Side GUC，但雙方編碼是被固定下來的，因此不允許指定 client_encoding。
接下來的測試中，sql_ascii 的裡面仍然是放著 Big5 的內容，在 utf8 的 database 裡面用外部表存取，自然會有看不到的可能。
這邊希望的處理，是額外用編碼轉換的處理，看看可否成功轉換。

postgres=# -- 建立外部表的連線
postgres=# create extension postgres_fdw;
CREATE EXTENSION
postgres=# create server ascbig5 foreign data wrapper postgres_fdw OPTIONS ( dbname 'test1byte');
CREATE SERVER
postgres=# 
postgres=# create user MAPPING FOR PUBLIC SERVER ascbig5 OPTIONS ( user 'postgres', password '123');
CREATE USER MAPPING
postgres=#

postgres=# -- 測試一.1：直接查詢
postgres=# create foreign table public.test_big5_fdw ( txt TEXT) server ascbig5 options(schema_name 'public', table_name 'test_big5');
CREATE FOREIGN TABLE
postgres=# 
postgres=# select * from public.test_big5_fdw;
ERROR:  invalid byte sequence for encoding "UTF8": 0xb4
CONTEXT:  Remote SQL command: SELECT txt FROM public.test_big5
postgres=# 
postgres=# -- 測試一.2：嘗試指定 Type Casting 變成 bytea 格式
postgres=# select txt::bytea from public.test_big5_fdw;
ERROR:  invalid byte sequence for encoding "UTF8": 0xb4
CONTEXT:  Remote SQL command: SELECT txt FROM public.test_big5
postgres=# 
postgres=# -- 測試一.3：在本地 set client_encoding 看看。。。
postgres=# set client_encoding = 'sql_ascii';
SET
postgres=# select txt::bytea from public.test_big5_fdw;
ERROR:  invalid byte sequence for encoding "UTF8": 0xb4
CONTEXT:  Remote SQL command: SELECT txt FROM public.test_big5
postgres=#

postgres=# -- 測試二：重建表格，嘗試指定 bytea 回傳。。。
postgres=# create foreign table public.test_big5_fdw ( txt bytea) server ascbig5 options(schema_name 'public', table_name 'test_big5');
CREATE FOREIGN TABLE
postgres=# select * from public.test_big5_fdw;
ERROR:  invalid byte sequence for encoding "UTF8": 0xb4
CONTEXT:  Remote SQL command: SELECT txt FROM public.test_big5
postgres=#

由上可知，由於 client_encoding 一開始就被 postgres_fdw 決定，寫死為本地端 DB 的編碼了，因此沒辦法正常拉資料。。。就算嘗試要使用呈現 hex code 的方式處理，由於函數不會從本地 UTF8 端被 pushdown 到外部去執行，所以資料編碼檢查總是會執行，就會被 UTF8 檢查抓包到不合法編碼了（但實為 big5 編碼～）

就結論來說，只要用 pg_dump 直接把 SQL 倒出來，調整 client_encoding 就可以匯入了，不過若是資料量不小，那就需要規劃 DDL 與資料分別處理了。此時資料可能就需要以 COPY 模式匯出，或是匯出成 csv 再行匯入了。至於 Foreign Data Wrapper 可能就愛莫能助了。

最後，如果原先的系統不幸的混雜了不同編碼的資料（利容，同時存放了 big5 跟 utf8 的資料），工程就比較浩大了，需要把兩種資料分別挑出來才行。。。這也是使用 sql_ascii 會造成的潛在問題。

參考資料
- 主要的參考

PostgreSQL: Re: Dumping in LATIN1 and restoring in UTF-8

postgresql - Postgres upgrade and converting latin1 to utf8 - Database Administrators Stack Exchange

PostgreSQL database migration, the SQL_ASCII to UTF8 problem

Cyclopedia Square: Convert a PostgreSQL Database from LATIN1 to UTF8

Thread: ascii to utf-8 : Postgres Professional

Postgres migrating SQL_ASCII to UTF-8 with fix_latin | End Point

PostgreSQL Data Types: Text Encoding - Dimitri Fontaine, PostgreSQL Expert

- 其他的參考

[COSCUP 2017][202] Unicode 不是年年更新嗎，2017 怎麼還在缺字？_issue - YouTube

現代文字工作者必須知道的 UTF-8 | 內容推進實驗室

「許功蓋」的字以及源由 @新精讚

iconv 指令轉換文字檔編碼（Big5 轉 UTF8、UTF8 轉 Big5 ） - G. T. Wang