渡鴉之丘: PGSQL uuid 欄位搭配 hash index 草草筆記

UUID 是一個很像亂碼的東東，有些時候需要用他當作資料識別號。這時就要稍稍了解一下這個在 PGSQL 的加速方式。

1) UUID 以英數字表示時，是一個 36 個字元的無序編號（還是可以用 hex code 方式排序，但沒什麼特殊意義）。若採用 bit 的方式儲存，則是一個 16 byte 長的資料。光從資料長度，就可以看到讀取上節約的差距。
因此可以的話，盡量選擇對應的 datatype，而非用字串代替。

2) 使用 UUID 只會有等號比對，也不太可能有其他比對方式（例如大於小於或是 LIKE 比對等），因此搭配 hash index 是不錯的選項。
當然如果得要求 primary key 在 PGSQL 還是沒得選，只能用預設的 btree index。

3) UUID 有好幾代，通常大家知道的就是隨機號碼（第四代）。
不過 UUID 其實也有 hash 計算的版本（第三、五代），所以可以當作 md5 hash 用。

用 uuid 版本的 md5（嚴格說來應該不完全是 md5，反正就是 v3/v5 的 hash），或是直接把 md5 轉成 uuid。

以下試一下 v3 的使用，注意第一個參數有規則，不能亂塞，只能用指定的內建函數（原理是什麼倒沒仔細去了解。。就名稱而言叫做 UUID namespace）。以下示範相同內容產生的 UUID v3 不會變動

edb=# create table testhash(ky uuid,str text);
CREATE TABLE
edb=# create index on testhash using hash(ky);
CREATE INDEX
edb=# create extension  uuid-ossp;
ERROR:  syntax error at or near "-"
LINE 1: create extension  uuid-ossp;
                              ^
edb=# create extension  "uuid-ossp";
CREATE EXTENSION
edb=# insert into testhash values (uuid_generate_v3 ('xx','http://www.postgresql.org'),'http://www.postgresql.org');
ERROR:  invalid input syntax for type uuid: "xx"
LINE 1: insert into testhash values (uuid_generate_v3 ('xx','http://...
                                                       ^
edb=# 
edb=# insert into testhash values (uuid_generate_v3 (uuid_nil(),'http://www.postgresql.org'),'http://www.postgresql.org');
INSERT 0 1
edb=# insert into testhash values (uuid_generate_v3 (uuid_nil(),'http://www.postgresql.org'),'http://www.postgresql.org');
INSERT 0 1
edb=# select * from testhash ;
                  ky                  |            str            
--------------------------------------+---------------------------
 15615b44-e5e8-3fea-b573-696f6a2b061e | http://www.postgresql.org
 15615b44-e5e8-3fea-b573-696f6a2b061e | http://www.postgresql.org
(2 rows)

edb=#

不過以上的結果，會和 md5 計算值直接以 uuid 格式儲存的結果不一樣（廢話～）。

edb=# insert into testhash values (md5 ('http://www.postgresql.org')::uuid,'http://www.postgresql.org');
INSERT 0 1
edb=# select * from testhash ;
                  ky                  |            str            
--------------------------------------+---------------------------
 15615b44-e5e8-3fea-b573-696f6a2b061e | http://www.postgresql.org
 15615b44-e5e8-3fea-b573-696f6a2b061e | http://www.postgresql.org
 cf03e637-1c82-4c5f-ba23-1c9cc83d629d | http://www.postgresql.org
(3 rows)

edb=#

另外也有機會在程式碼端預先處理 UUID 產生：基本上就是找些對應的 library 處理。

這些方式都可以產生一些可重現的內容。

4) 搭配 hash index 就有機會處理資料比對加速：資料可以拆解，用 join 形式呈現：當然這個 JOIN 通常不太自然，這就看大家程式規劃跟開發技能了。

一個很簡化的構想 — 使用 UUID 同時配合真實資料的比對 — 可以考慮 covering index 配合真實資料來加速：建立真實資料的 index 並帶 UUID 作為 include 內容。

以下模擬，data_id 是通常的查詢條件，ky 是對應的 uuid，相關表格之間以 ky 串起來。

create table table_mst(data_id int, ky uuid);
create table table_dtl(ky uuid, dsc text);
create unique index on table_mst using btree(ky);
create unique index on table_mst using btree(data_id) include (ky);
ALTER TABLE table_dtl ADD CONSTRAINT fk_dtl_mst_ky FOREIGN KEY (ky) REFERENCES table_mst(ky);
create index on table_dtl using hash(ky);
insert into table_mst select g, uuid_generate_v4() from generate_series(1,10000) g;
insert into table_dtl select ky, 'detail data '||data_id from table_mst;
-- 若是單一筆，則可以用 insert returning 取得 ky 值以往下套用，並套在相同 Tx 內。
analyze table_mst;
analyze table_dtl;

edb=# explain select d.dsc from table_mst m
join table_dtl d
on m.ky = d.ky
and m.data_id = 4;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..12.33 rows=1 width=16)
   ->  Index Only Scan using table_mst_data_id_ky_idx on table_mst m  (cost=0.29..4.30 rows=1 width=16)
         Index Cond: (data_id = 4)
   ->  Index Scan using table_dtl_ky_idx on table_dtl d  (cost=0.00..8.02 rows=1 width=32)
         Index Cond: (ky = m.ky)
(5 rows)

edb=# explain analyze select d.dsc from table_mst m
join table_dtl d
on m.ky = d.ky
and m.data_id = 4;
                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..12.33 rows=1 width=16) (actual time=0.043..0.046 rows=1 loops=1)
   ->  Index Only Scan using table_mst_data_id_ky_idx on table_mst m  (cost=0.29..4.30 rows=1 width=16) (actual time=0.014..0.015 rows=1 loops=1)
         Index Cond: (data_id = 4)
         Heap Fetches: 0
   ->  Index Scan using table_dtl_ky_idx on table_dtl d  (cost=0.00..8.02 rows=1 width=32) (actual time=0.025..0.026 rows=1 loops=1)
         Index Cond: (ky = m.ky)
 Planning Time: 0.253 ms
 Execution Time: 0.086 ms
(8 rows)

edb=#

5) UUID 除了用 hex 跟連字號表示的格式，還有以 base64 的格式，在 PGSQL 可以作轉換。

uuid 轉 base64：因為 UUID 還沒支援直接轉成 bytea，有看到一個還沒紀錄到手冊的函數 uuid_send() 搭配著用以下方法完全來自這邊 postgresql - How to short UUIDs with Postgres? - Database Administrators Stack Exchange 的回答，箇中奧秘還不太瞭～

edb=# select encode(uuid_send(uuid_generate_v3 (uuid_nil(), 'http://www.postgresql.org')),'base64');
          encode
--------------------------
 FWFbROXoP+q1c2lvaisGHg==
(1 row)

edb=#

base64 轉 uuid：使用相同範例

edb=# select substring(decode('FWFbROXoP+q1c2lvaisGHg==', 'base64')::text from 3)::uuid;
              substring
--------------------------------------
 15615b44-e5e8-3fea-b573-696f6a2b061e
(1 row)

edb=# -- 記得跟上面 3) 內容比對一下

目前想到&找到的大概是這些～

參考資料
PostgreSQL: Documentation: 72.1. Overview
PostgreSQL: Documentation: F.49. uuid-ossp
How should I index a UUID in Postgres? - Database Administrators Stack Exchange
Generating v5 UUID. What is name and namespace? - Stack Overflow
Re-Introducing Hash Indexes in PostgreSQL | Haki Benita
UUID 原理與實作分析 - 該如何挑選適合的 UUID 版本 | Yuanchieh's Blog
Is there any way to generate the same UUID from a String - Stack Overflow
postgresql - What is the optimal data type for an MD5 field? - Database Administrators Stack Exchange
Are type 3 UUID's collision probablity the same as MD5's - Stack Overflow
UUID GUID Oversimplified – Are They Really Unique? - Scaleyourapp
PostgreSQL UUID type performance - Stack Overflow
UUID, serial or identity columns for PostgreSQL auto-generated primary keys? - Cybertec Blog
Any significant performance disadvantage to using uuid as primary key? : r/PostgreSQL
UUID Benchmark - begriffs/uuid-btree-vs-hash.md · GitHub Gist
PostgreSQL and Databases in general: Hash indexes are faster than Btree indexes? - PostgreSQL and Databases in general
Look It Up: Practical PostgreSQL Indexing
UUID as Primary Key - Ingram Chen
UUID or GUID as Primary Keys? Be Careful! | by Tom Harrison
Why Covering Indexes in Postgres Are Incredibly Helpful - PostgreSQL Blog | Crunchy Data
best index type for uuid in postgres 11 - Database Administrators Stack Exchange
Salts and UUIDs for your PostgreSQL database | by Panos Zafeiropoulos | Nov, 2022 | Dev Genius
Partition Table of PostgreSQL Hash and Partition ID Calculation - Alibaba Cloud Community
How to short UUIDs with Postgres? - Database Administrators Stack Exchange
What format does Elasticsearch use for its ids? - Stack Overflow
Postgresql base64 encode - Stack Overflow
Convert PostgreSQL bytea-stored serialized-java-UUID to postgresql-UUID - Stack Overflow
Generating a UUID in Postgres for Insert statement? - Stack Overflow
Generating URL-Safe Short IDs in Postgresql | &yet blog
Generate unique base64 in database - postgresql
Converting byte arrays to UUIDs in Postgres - defn.io

渡鴉之丘

MathJax

MathJax-2

MathJax-3

Google Code Prettify

置頂入手筆記

EnterproseDB Quickstart — 快速入門筆記

2022年12月13日星期二

PGSQL uuid 欄位搭配 hash index 草草筆記

沒有留言:

張貼留言