中文的部份的全文檢索,目前看到的資訊,只有中國的社群開發的簡體中文相關外掛,但這不夠適合台灣的正體中文使用;這主要是兩國文化不同,適用中國的字典檔在切詞上不完全適用於台灣的關係。
不過,還有其他的方案,主要是由日本社群所採用的外掛,分別有 pg_bigm 以及 PGroonga 也支援 CJK 的全文檢索。
最近剛好看到,PGroonga 針對 PGDG YUM Repo 的版本有提供 RHEL 現成的套件(不好意思,我很懶惰,只想用現成的。。。),在台灣 PGroonga 這個套件有點少人提到,因此這邊就嘗試看看,看看可不可以用來作為正體中文的全文檢索工具。
先紀錄目前收集到 PGSQL vs CJK 的狀況
- 韓國的用戶好像有一個標準的字典檔 textsearch_ko
- 台灣的用戶。。。目前沒有直接相關的資源。。。目前看到的,應該是拿中國的結巴模組,作簡繁轉換以及詞庫的調整改良而成,有看到 Jseg 以及 jieba-zh_TW: 結巴中文斷詞台灣繁體版本這兩款。但整合 PGSQL 的部份目前(2019/9)還沒有看到
- 可以用的字典檔是中研院的專利,中研院中文斷詞系統,不過沒有碰過,不清楚狀況~~但~~在 2019 年八月已經釋出了下一版,而且是開源的函式庫 ckiplab/ckiptagger: CKIP Neural Chinese Word Segmentation, POS Tagging, and NER。有機會來試試
- 自己做字典檔。。。不過也還不清楚怎麼進行(例如,如何用爬蟲收集關鍵字,以及 PostgreSQL 的字典檔格式等等問題),有機會再了解看看
PGroonga 是基於 Groonga 這個專案的 PostgreSQL 套件;Groonga 也支援其他的資料庫,例如 MySQL。詳細的相關介紹可以看這邊~
接著就開始來試試~首先,在安裝好 PGDG 提供的原生 PGSQL 11 的主機上(其實是 LXC Container 裡面的服務~),安裝 Groonga 的 Repo(輸出內容略)
[root@pg1 ~]# rpm -qa | grep postgresql postgresql11-server-11.1-1PGDG.rhel7.x86_64 postgresql11-libs-11.1-1PGDG.rhel7.x86_64 postgresql11-11.1-1PGDG.rhel7.x86_64 postgresql11-contrib-11.1-1PGDG.rhel7.x86_64 [root@pg1 ~]# [root@pg1 ~]# yum install -y https://packages.groonga.org/centos/groonga-release-latest.noarch.rpm
然後裡面就有 PGroonga 了
[root@pg1 ~]# yum search pgroonga Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile epel/x86_64/metalink | 15 kB 00:00:00 * base: centos4.zswap.net * epel: mirror.grid.uchicago.edu * extras: centos4.zswap.net * updates: centos.mirrors.hoobly.com epel | 5.4 kB 00:00:00 groonga-centos | 2.9 kB 00:00:00 (1/4): epel/x86_64/group_gz | 88 kB 00:00:00 (2/4): epel/x86_64/primary_db | 6.8 MB 00:00:00 (3/4): epel/x86_64/updateinfo | 1.0 MB 00:00:00 (4/4): groonga-centos/7/x86_64/primary_db | 612 kB 00:00:01 ================================== N/S matched: pgroonga =================================== postgresql-pgroonga-debuginfo.x86_64 : Debug information for package postgresql-pgroonga postgresql10-pgroonga-debuginfo.x86_64 : Debug information for package postgresql10-pgroonga postgresql11-pgroonga-debuginfo.x86_64 : Debug information for package postgresql11-pgroonga postgresql94-pgroonga-debuginfo.x86_64 : Debug information for package postgresql94-pgroonga postgresql95-pgroonga-debuginfo.x86_64 : Debug information for package postgresql95-pgroonga postgresql96-pgroonga-debuginfo.x86_64 : Debug information for package postgresql96-pgroonga postgresql-pgroonga.x86_64 : Fast full-text search plugin for PostgreSQL based on Groonga postgresql10-pgroonga.x86_64 : Fast full-text search plugin for PostgreSQL based on Groonga postgresql11-pgroonga.x86_64 : Fast full-text search plugin for PostgreSQL based on Groonga postgresql94-pgroonga.x86_64 : Fast full-text search plugin for PostgreSQL based on Groonga postgresql95-pgroonga.x86_64 : Fast full-text search plugin for PostgreSQL based on Groonga postgresql96-pgroonga.x86_64 : Fast full-text search plugin for PostgreSQL based on Groonga Name and summary matches only, use "search all" for everything. [root@pg1 ~]#
由於我這邊裝的是 PGSQL 11,因此要裝相應套件 postgresql11-pgroonga
[root@pg1 ~]# yum install -y postgresql11-pgroonga Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile * base: centos4.zswap.net * epel: mirror.steadfastnet.com * extras: centos4.zswap.net * updates: centos.mirrors.hoobly.com Resolving Dependencies --> Running transaction check ---> Package postgresql11-pgroonga.x86_64 0:2.2.1-1.el7 will be installed --> Processing Dependency: groonga-libs >= 9.0.4 for package: postgresql11-pgroonga-2.2.1-1.el7.x86_64 --> Processing Dependency: msgpack for package: postgresql11-pgroonga-2.2.1-1.el7.x86_64 --> Processing Dependency: libmsgpackc.so.2()(64bit) for package: postgresql11-pgroonga-2.2.1-1.el7.x86_64 --> Processing Dependency: libgroonga.so.0()(64bit) for package: postgresql11-pgroonga-2.2.1-1.el7.x86_64 --> Running transaction check ---> Package groonga-libs.x86_64 0:9.0.7-1.el7 will be installed --> Processing Dependency: libzstd for package: groonga-libs-9.0.7-1.el7.x86_64 --> Processing Dependency: libzstd.so.1()(64bit) for package: groonga-libs-9.0.7-1.el7.x86_64 ---> Package msgpack.x86_64 0:1.4.1-1.el7 will be installed --> Running transaction check ---> Package libzstd.x86_64 0:1.4.2-1.el7 will be installed --> Finished Dependency Resolution Dependencies Resolved ============================================================================================ Package Arch Version Repository Size ============================================================================================ Installing: postgresql11-pgroonga x86_64 2.2.1-1.el7 groonga-centos 420 k Installing for dependencies: groonga-libs x86_64 9.0.7-1.el7 groonga-centos 2.0 M libzstd x86_64 1.4.2-1.el7 epel 252 k msgpack x86_64 1.4.1-1.el7 epel 23 k Transaction Summary ============================================================================================ Install 1 Package (+3 Dependent packages) Total download size: 2.7 M Installed size: 16 M Downloading packages: warning: /var/cache/yum/x86_64/7/epel/packages/libzstd-1.4.2-1.el7.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID 352c64e5: NOKEY Public key for libzstd-1.4.2-1.el7.x86_64.rpm is not installed (1/4): libzstd-1.4.2-1.el7.x86_64.rpm | 252 kB 00:00:00 (2/4): msgpack-1.4.1-1.el7.x86_64.rpm | 23 kB 00:00:00 warning: /var/cache/yum/x86_64/7/groonga-centos/packages/postgresql11-pgroonga-2.2.1-1.el7.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 45499429: NOKEY Public key for postgresql11-pgroonga-2.2.1-1.el7.x86_64.rpm is not installed (3/4): postgresql11-pgroonga-2.2.1-1.el7.x86_64.rpm | 420 kB 00:00:01 (4/4): groonga-libs-9.0.7-1.el7.x86_64.rpm | 2.0 MB 00:00:02 -------------------------------------------------------------------------------------------- Total 1.3 MB/s | 2.7 MB 00:00:02 Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-groonga Importing GPG key 0x45499429: Userid : "groonga Key (groonga Official Signing Key)" Fingerprint: c97e 4649 a205 1d0c ea1a 73f9 72a7 496b 4549 9429 Package : groonga-release-1.5.1-1.noarch (installed) From : /etc/pki/rpm-gpg/RPM-GPG-KEY-groonga Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-groonga-RSA4096 Importing GPG key 0x34839225: Userid : "Groonga Key (Groonga Official Signing Key) " Fingerprint: 2701 f317 cfcc cb97 5cad e9c2 624c f774 3483 9225 Package : groonga-release-1.5.1-1.noarch (installed) From : /etc/pki/rpm-gpg/RPM-GPG-KEY-groonga-RSA4096 Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 Importing GPG key 0x352C64E5: Userid : "Fedora EPEL (7) " Fingerprint: 91e9 7d7c 4a5e 96f1 7f3e 888f 6a2f aea2 352c 64e5 Package : epel-release-7-11.noarch (@extras) From : /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 Running transaction check Running transaction test Transaction test succeeded Running transaction Installing : msgpack-1.4.1-1.el7.x86_64 1/4 Installing : libzstd-1.4.2-1.el7.x86_64 2/4 Installing : groonga-libs-9.0.7-1.el7.x86_64 3/4 Installing : postgresql11-pgroonga-2.2.1-1.el7.x86_64 4/4 Verifying : groonga-libs-9.0.7-1.el7.x86_64 1/4 Verifying : libzstd-1.4.2-1.el7.x86_64 2/4 Verifying : postgresql11-pgroonga-2.2.1-1.el7.x86_64 3/4 Verifying : msgpack-1.4.1-1.el7.x86_64 4/4 Installed: postgresql11-pgroonga.x86_64 0:2.2.1-1.el7 Dependency Installed: groonga-libs.x86_64 0:9.0.7-1.el7 libzstd.x86_64 0:1.4.2-1.el7 msgpack.x86_64 0:1.4.1-1.el7 Complete! [root@pg1 ~]#
然後就可以確認套件了
postgres=# select name from pg_available_extensions where name like 'p%'; name -------------------- plpgsql pg_trgm pgstattuple postgres_fdw pg_visibility pgcrypto pgrowlocks pageinspect pg_buffercache pg_stat_statements pg_freespacemap pg_prewarm pgroonga_database pgroonga (14 rows) postgres=#
接著就啟用外掛,向其他的 PGSQL 外掛一樣~
postgres=# create extension pgroonga; ERROR: could not load library "/usr/pgsql-11/lib/pgroonga.so": /usr/pgsql-11/lib/pgroonga.so: undefined symbol: getmissingattr postgres=#
ㄟ~怎麼失敗惹。。。。搜尋一下網路,其實可以發現,是因為我太久沒更新 PostgreSQL 的小版本的關係 could not load library "/usr/pgsql-11/lib/timescaledb.so": /usr/pgsql-11/lib/timescaledb.so: undefined symbol: getmissingattr · Issue #1161 · timescale/timescaledb~
因此補一下更新,再回頭啟用套件就成功了~
postgres=# \q -bash-4.2$ exit logout [root@pg1 ~]# yum update postgresql11 Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile * base: centos4.zswap.net * epel: mirror.grid.uchicago.edu * extras: centos4.zswap.net * updates: centos.mirrors.hoobly.com Resolving Dependencies --> Running transaction check ---> Package postgresql11.x86_64 0:11.1-1PGDG.rhel7 will be updated --> Processing Dependency: postgresql11(x86-64) = 11.1-1PGDG.rhel7 for package: postgresql11-contrib-11.1-1PGDG.rhel7.x86_64 --> Processing Dependency: postgresql11(x86-64) = 11.1-1PGDG.rhel7 for package: postgresql11-server-11.1-1PGDG.rhel7.x86_64 ---> Package postgresql11.x86_64 0:11.5-1PGDG.rhel7 will be an update --> Processing Dependency: postgresql11-libs(x86-64) = 11.5-1PGDG.rhel7 for package: postgresql11-11.5-1PGDG.rhel7.x86_64 --> Running transaction check ---> Package postgresql11-contrib.x86_64 0:11.1-1PGDG.rhel7 will be updated ---> Package postgresql11-contrib.x86_64 0:11.5-1PGDG.rhel7 will be an update ---> Package postgresql11-libs.x86_64 0:11.1-1PGDG.rhel7 will be updated ---> Package postgresql11-libs.x86_64 0:11.5-1PGDG.rhel7 will be an update ---> Package postgresql11-server.x86_64 0:11.1-1PGDG.rhel7 will be updated ---> Package postgresql11-server.x86_64 0:11.5-1PGDG.rhel7 will be an update --> Finished Dependency Resolution Dependencies Resolved ============================================================================================ Package Arch Version Repository Size ============================================================================================ Updating: postgresql11 x86_64 11.5-1PGDG.rhel7 pgdg11 1.6 M Updating for dependencies: postgresql11-contrib x86_64 11.5-1PGDG.rhel7 pgdg11 617 k postgresql11-libs x86_64 11.5-1PGDG.rhel7 pgdg11 361 k postgresql11-server x86_64 11.5-1PGDG.rhel7 pgdg11 4.7 M Transaction Summary ============================================================================================ Upgrade 1 Package (+3 Dependent packages) Total download size: 7.3 M Is this ok [y/d/N]: y Downloading packages: Delta RPMs disabled because /usr/bin/applydeltarpm not installed. (1/4): postgresql11-contrib-11.5-1PGDG.rhel7.x86_64.rpm | 617 kB 00:00:00 (2/4): postgresql11-11.5-1PGDG.rhel7.x86_64.rpm | 1.6 MB 00:00:00 (3/4): postgresql11-libs-11.5-1PGDG.rhel7.x86_64.rpm | 361 kB 00:00:00 (4/4): postgresql11-server-11.5-1PGDG.rhel7.x86_64.rpm | 4.7 MB 00:00:00 -------------------------------------------------------------------------------------------- Total 8.6 MB/s | 7.3 MB 00:00:00 Running transaction check Running transaction test Transaction test succeeded Running transaction Updating : postgresql11-libs-11.5-1PGDG.rhel7.x86_64 1/8 Updating : postgresql11-11.5-1PGDG.rhel7.x86_64 2/8 Updating : postgresql11-server-11.5-1PGDG.rhel7.x86_64 3/8 Updating : postgresql11-contrib-11.5-1PGDG.rhel7.x86_64 4/8 Cleanup : postgresql11-contrib-11.1-1PGDG.rhel7.x86_64 5/8 Cleanup : postgresql11-server-11.1-1PGDG.rhel7.x86_64 6/8 Cleanup : postgresql11-11.1-1PGDG.rhel7.x86_64 7/8 Cleanup : postgresql11-libs-11.1-1PGDG.rhel7.x86_64 8/8 Verifying : postgresql11-server-11.5-1PGDG.rhel7.x86_64 1/8 Verifying : postgresql11-contrib-11.5-1PGDG.rhel7.x86_64 2/8 Verifying : postgresql11-libs-11.5-1PGDG.rhel7.x86_64 3/8 Verifying : postgresql11-11.5-1PGDG.rhel7.x86_64 4/8 Verifying : postgresql11-contrib-11.1-1PGDG.rhel7.x86_64 5/8 Verifying : postgresql11-server-11.1-1PGDG.rhel7.x86_64 6/8 Verifying : postgresql11-libs-11.1-1PGDG.rhel7.x86_64 7/8 Verifying : postgresql11-11.1-1PGDG.rhel7.x86_64 8/8 Updated: postgresql11.x86_64 0:11.5-1PGDG.rhel7 Dependency Updated: postgresql11-contrib.x86_64 0:11.5-1PGDG.rhel7 postgresql11-libs.x86_64 0:11.5-1PGDG.rhel7 postgresql11-server.x86_64 0:11.5-1PGDG.rhel7 Complete! [root@pg1 ~]# service postgresql-11 restart Redirecting to /bin/systemctl restart postgresql-11.service [root@pg1 ~]# su - postgres Last login: Fri Aug 30 15:35:20 UTC 2019 on lxc/console -bash-4.2$ psql psql (11.5) Type "help" for help. postgres=# create extension pgroonga; CREATE EXTENSION postgres=#
接著就是裝一點測試資料了~
postgres=# CREATE TABLE memos ( id integer, content text ); CREATE TABLE postgres=# insert into memos values (1, '我家門前有小河,'), (2, '後面有山坡'), (3, '山坡上面野花多,'), (4, '野花紅似火'), (5, '小河裡有白鵝,'), (6, '鵝兒戲綠波'), (7, '戲弄綠波鵝兒快樂,'), (8, '昂首唱清歌'); INSERT 0 8 postgres=#
要查詢的話,可以參考手冊教學 Tutorial | PGroonga,裡面有列舉可以用的運算子,基本上是新增 &@ 這個比較運算,跟內建的 Full Text Search 運算子不太一樣。這個運算子在啟用外掛才能使用,因此沒有裝 PGroonga 的資料庫就不會有~~
postgres=# SELECT * FROM memos WHERE content &@ '山坡'; id | content ----+-------------------------- 2 | 後面有山坡 3 | 山坡上面野花多, (2 rows) postgres=#
不過除了直接查詢之外,也可以直接以 pgroonga 的功能,建立 index:這一點跟內建的全文檢索操作,其實是不太一樣的。(內建的全文檢索是針對字串產生 tsvector,裡面會依據字典去切割出字串與出現次數的紀錄)
postgres=# CREATE INDEX ON memos USING pgroonga (content); CREATE INDEX postgres=# postgres=# explain SELECT * FROM memos WHERE content &@ '山坡'; QUERY PLAN -------------------------------------------------------------------------------- Index Scan using memos_content_idx on memos (cost=0.00..4.04 rows=1 width=36) Index Cond: (content &@ '山坡'::text) (2 rows) postgres=#
簡易測試就到這邊~以後有機會再仔細的找真實一點的資料來用用看。
參考資料
只是把查到的都紀錄起來而已
沒有留言:
張貼留言