Entity-Attribute-Value 型態(縮寫 EAV)的資料蠻常出現在實際應用上,最典型的就是資料分析時很多屬性標籤。
但是。。。PGSQL 其實不擅長對付這種屬性的資料。。。第二個無腦建議總是說:想辦法轉成關聯式結構的表格(第一個建議呢~就是建 GIN Index 之類的ㄅ~)
不過,這種屬性的資料正好可以用另一種主流資料儲存型態代換—JSON 型態。正好現在的 Postgres 也很擅長處理 JSON 資料。
另外值得提到的是,PGSQL 並不像 Cassandra 這種 Wide-Column Store 一樣可以開「很多」欄位,因此若 Attribute 標籤種類繁雜,就不是很適合在 PGSQL 開超多欄位的表格來存放。
因此這篇筆記嘗試把 EAV 轉換成 JSON 儲存的方式(測試版本:EDB PGSQL13,不過用到的功能基本上 PGSQL 9.5 以上都可以)。
另外,當標籤的類型很少,可以轉換成一般欄位的話(也就是關聯式表格),這篇筆記也嘗試演練一下~
如果看到這邊~還是不知道窩在講蝦米~請上網查一下 EAV pattern 與 Relational Model~
create table test_eav_to_json (name text, attr text, value int); insert into test_eav_to_json values ('u1', 'key1', 1), ('u1', 'key2', 2), ('u1', 'key3', 3), ('u2', 'key1', 4), ('u2', 'key2', 5), ('u2', 'key3', 6);
edb=# select * from test_eav_to_json; name | attr | value ------+------+------- u1 | key1 | 1 u1 | key2 | 2 u1 | key3 | 3 u2 | key1 | 4 u2 | key2 | 5 u2 | key3 | 6 (6 rows) Time: 1.232 ms edb=#
- EAV Pattern 並非 Relational Model,先天就直接違背大象本身的存在惹。。無法套用關聯式資料庫的正規劃理論,例如,沒有什麼 foreign key constraint 可以建構
- Attribute 通常是重複值的字串內容,這種欄位建立 BTree Index 沒什麼效,通常得建立比較胖的 GIN 看看有沒有顯著效果
- Value 內容混雜,極度可能發生字串/數值混合存放,這種時候資料庫抽樣(ANALYZE Table)沒辦法有效使用這種資料。例如,若使用 WHERE 條件過濾數值屬性的內容作運算,資料庫內的抽樣就沒辦法在速度上幫上什麼忙
edb=# select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name; name | jsn_kv ------+---------------------------------------- u1 | { "key1" : 1, "key2" : 2, "key3" : 3 } u2 | { "key1" : 4, "key2" : 5, "key3" : 6 } (2 rows) Time: 0.579 ms edb=#
EAV 轉一般表
- 處理目標是僅有少量 attribute 的 EAV 表格(Note:PGSQL 欄位數量限制大約在 1600~2500 之間,取決於資料型態)
- 確認這些資料中,相同 entity 的一個 attribute 只有一筆 value
- 使用 json_object_agg(Attribute, Value) as json_kv group by Entity 的方式,把 EAV 資料轉成「主檔/扁平 json」的形式
- 準備自訂 composite type 與目標存放表格
- 利用 select Entity, json_populate_record(json_kv) 的方式,把扁平 json 攤平成表格
edb=# create type test_eav_flatten_type as ( key1 int, key2 int, key3 int);
CREATE TYPE
Time: 114.50 ms
edb=#
edb=# create type test_eav_flatten_type2 as ( key1 int, key2 int);
CREATE TYPE
Time: 104.650 ms
edb=#
若有型態混用也不成問題,待會的步驟基本上不會有影響
edb=# create type test_eav_flatten_type as ( key1 int, key2 text, key3 int); CREATE TYPE Time: 114.50 ms edb=#
create table test_eav_flatten (name text, key1 int, key2 int, key3 int);
edb=# select name, json_populate_record(null::test_eav_flatten_type, jsn_kv) from ( edb(# select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name); name | json_populate_record ------+---------------------- u1 | (1,2,3) u2 | (4,5,6) (2 rows) Time: 1.159 ms edb=#
edb=# select name, j.* from ( select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name) e, json_populate_record(null::test_eav_flatten_type, jsn_kv) j; name | key1 | key2 | key3 ------+------+------+------ u1 | 1 | 2 | 3 u2 | 4 | 5 | 6 (2 rows) Time: 0.684 ms edb=#
edb=# insert into test_eav_flatten select name, j.* from ( select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name) e, json_populate_record(null::test_eav_flatten_type, jsn_kv) j; INSERT 0 2 Time: 45.378 ms edb=# edb=# select * from test_eav_flatten ; name | key1 | key2 | key3 ------+------+------+------ u1 | 1 | 2 | 3 u2 | 4 | 5 | 6 (2 rows) Time: 0.759 ms edb=#
edb=# select name, j.* from ( select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name), json_populate_record(null::test_eav_flatten_type2, jsn_kv) j; name | key1 | key2 ------+------+------ u1 | 1 | 2 u2 | 4 | 5 (2 rows) Time: 2.118 ms edb=#
重複 Attribute 值的 EAV 轉成 JSON Array
edb=# insert into test_eav_to_json values ('u1', 'key1', -1); INSERT 0 1 Time: 2.646 ms edb=# edb=# select * from test_eav_to_json; name | attr | value ------+------+------- u1 | key1 | 1 u1 | key2 | 2 u1 | key3 | 3 u2 | key1 | 4 u2 | key2 | 5 u2 | key3 | 6 u1 | key1 | -1 (7 rows) Time: 0.464 ms edb=#
edb=# select name, j.* from (
select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name),
json_populate_record(null::test_eav_flatten_type, jsn_kv) j;
name | key1 | key2 | key3
------+------+------+------
u1 | -1 | 2 | 3
u2 | 4 | 5 | 6
(2 rows)
Time: 0.685 ms
edb=#
edb=# select name, json_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name; name | jsn_kv ------+----------------------------------------------------- u1 | { "key1" : 1, "key2" : 2, "key3" : 3, "key1" : -1 } u2 | { "key1" : 4, "key2" : 5, "key3" : 6 } (2 rows) Time: 0.676 ms edb=#
edb=# select name, jsonb_object_agg(attr, value) as jsn_kv from test_eav_to_json group by name; name | jsn_kv ------+------------------------------------ u1 | {"key1": -1, "key2": 2, "key3": 3} u2 | {"key1": 4, "key2": 5, "key3": 6} (2 rows) Time: 0.767 ms edb=#
- 使用 Window function 搭配 count(*) 抓出 EAV 重複 Attribute
- 使用 Window function 把重複 Attribute 的 Value 轉成 JSON Array
- 使用 jsonb_object_agg()(注意,不是 json_object_agg())轉成最終含 JSON Array 的 JSON 資料
edb=# -- 最後成果
edb=# with separate_json as (
select name, attr,
case when count(*) over(partition by name, attr) >1 then
json_agg(to_json (value)) over(partition by name, attr)
else
to_json (value)
end as vlu_array
from test_eav_to_json
)
select name, jsonb_object_agg(attr,vlu_array) from separate_json
group by name ;
name | jsonb_object_agg
------+-----------------------------------------
u1 | {"key1": [-1, 1], "key2": 2, "key3": 3}
u2 | {"key1": 4, "key2": 5, "key3": 6}
(2 rows)
Time: 0.678 ms
edb=#
edb=# -- 錯誤範例 edb=# with separate_json as ( select name, attr, case when count(*) over(partition by name, attr) >1 then json_agg(to_json (value)) over(partition by name, attr) else to_json (value) end as vlu_array from test_eav_to_json ) select name, json_object_agg(attr,vlu_array) from separate_json group by name ; name | json_object_agg ------+---------------------------------------------------------------- u1 | { "key1" : [-1, 1], "key1" : [-1, 1], "key2" : 2, "key3" : 3 } u2 | { "key1" : 4, "key2" : 5, "key3" : 6 } (2 rows) Time: 0.684 ms edb=#
[2022/12/03 補充] 新的 EAV 轉換方式,簡潔非常多~
edb=# with jagg as not materialized ( select name, jsonb_object_agg(attr, value) as jsn from test_eav_to_json group by name) select jagg.name, jrcd.* from jagg cross join jsonb_to_record(jagg.jsn) as jrcd ( key1 int, key2 varchar, key3 int); name | key1 | key2 | key3 ------+------+------+------ u1 | 1 | 2 | 3 u2 | 4 | 5 | 6 (2 rows) edb=#
- 不用像之前筆記的方式寫的那麼複雜了~
- jsonb_to_record() 可以便利的指定要呈現的欄位,不用全數寫出來;欄位寫錯也 OK,只會出現 NULL 而已
- 可以當場轉換 datatype
edb=# explain with jagg as ( select name, jsonb_object_agg(attr, value) as jsn from test_eav_to_json group by name) select jagg.name, jrcd.* from jagg cross join jsonb_to_record(jagg.jsn) as jrcd ( key1 int, key2 varchar, key3 int); QUERY PLAN -------------------------------------------------------------------------------- Nested Loop (cost=22.75..31.25 rows=200 width=72) -> HashAggregate (cost=22.75..25.25 rows=200 width=64) Group Key: test_eav_to_json.name -> Seq Scan on test_eav_to_json (cost=0.00..18.50 rows=850 width=68) -> Function Scan on jsonb_to_record jrcd (cost=0.00..0.01 rows=1 width=40) (5 rows) edb=#
參考資料
EAV 介紹
SQL: Design - Entity Attribute Value Tables (Part 1) - Why? - The Bit Bucket
To EAV, or not to EAV? Choosing your data model - wq Framework
Understanding the EAV data model and when to use it - inviqa
Entity-Attribute-Value Implementation - MariaDB Knowledge Base
Handling EAV-data - Info Support Blog
PGSQL 裡面 EAV 替代建議
Replacing EAV with JSONB in PostgreSQL - coussej · coussej | Jeroen Coussement
postgresql - Postgres 9.4+: JSONB instead of EAV - Database Administrators Stack Exchange
Replacing EAV with JSONB in PostgreSQL - Share Tech Links
The user data is most likely in rows instead of columns. Instead of having ... | Hacker News
Data integrity in JSON(B) when replacing EAV - Javaer101
Faster Operations with the JSONB Data Type in PostgreSQL - Compose Articles
The CRUD of JSON in PostgreSQL | EDB
postgresql - Looking for a right EAV structure based on jsonb - Stack Overflow
How to store schema-less EAV (Entity-Attribute-Value) data using JSON and Hibernate - Vlad Mihalcea
PostgreSQL anti-patterns: Unnecessary json/hstore dynamic columns | 2ndQuadrant
Webinar: JSON & ARRAY – Contemporary PostgreSQL Data Types [Follow Up] | 2ndQuadrant
When to Avoid JSONB in a PostgreSQL Schema | Hacker News
entity-attribute-value design in PostgreSQL - don't do it! - CYBERTEC
JSON 文字檔資料匯入
standards - Does JSON syntax allow duplicate keys in an object? - Stack Overflow
Duplicate Keys in JSON Objects - DZone Web Dev
JSON Duplicate Keys - ECMAScript Discussion Archives
Our Friends CSV and JSON. In the history of the Internet, two… | by Martin Drapeau | Medium
postgresql - Using \COPY to load CSV with JSON fields into Postgres - Stack Overflow
Storing JSON in PostgreSQL: Step by Step Guide | Blendo.co
PGSQL JSON 操作
postgresql - Find Duplicates in Postgres Array, and Save as Different Field - Stack Overflow
sql - Find rows with duplicate values in a column - Stack Overflow
postgresql - Merging Concatenating JSON(B) columns in query - Stack Overflow
How to turn a json array into rows in postgres - Stack Overflow
sql - How to find duplicate records in PostgreSQL - Stack Overflow
postgresql - Couldnt identify equality operator of type json[] when using UNION - Stack Overflow
Multiple rows into json - Database Administrators Stack Exchange
postgresql - Postgres convert json with duplicate IDs - Stack Overflow
postgresql - Multiple rows into json - Database Administrators Stack Exchange
Postgres composite types for tables | Geeky Tidbits
EAV 屬性的測試資料 (不知道能不能套用在這次筆記)
Attribute Datasets - RPI Intelligent Systems Lab
Animals with Attributes 2 - Machine Learning and Computer Vision -- Christoph Lampert -- IST Austria
沒有留言:
張貼留言