渡鴉之丘: 在 PostgreSQL 產生次數分配圖表的 SQL 語法

在 Excel 或 Libreoffice Calc 中，可以用資料畫次數分配長條圖（Frequency Distribution Chart）；那在 PostgreSQL 要怎麼用 SQL 語法達成？

目前找的到的資料，約有兩三種方法～
這裡列出比較簡單理解的方式，主要是參考這裡。這裡借用的範例，能夠繪製等區間的次數分配表（還有對應的圖～）；另外還有比較貼近平常在試算表中取間隔的方式。最後，提供一個被打包成 Aggregate Function 的 PL/pgSQL 以及它的用法。

以下範例資料使用 EnterpriseDB 企業版資料庫內建的範例資料進行。資料可以從這裡看到。

1) 使用 width_bucket() 以及 range 相關函數，依照資料極大/極小值範圍切等分進行計數

WITH sal_range AS (
  SELECT min(sal) AS min,
           max(sal) AS max
  FROM emp
)
   , histogram AS (
  SELECT width_bucket(sal, min, max, 9) AS bucket,
         numrange(min(sal), max(sal), '[]') AS range,
         count(*) AS freq
  FROM emp, sal_range 
  GROUP BY bucket
  ORDER BY bucket
)
SELECT bucket, range, freq,
       repeat('■',
               ( freq::float
                 / max(freq) over()
                 * 30
               )::int
        ) AS bar
FROM histogram;

 bucket |       range       | freq |              bar               
--------+-------------------+------+--------------------------------
      1 | [1250.00,1953.13] |    5 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      2 | [2031.25,2500.00] |    3 | ■■■■■■■■■■■■■■■■■■
      4 | [3828.13,3828.13] |    1 | ■■■■■■
      5 | [4453.13,4687.50] |    4 | ■■■■■■■■■■■■■■■■■■■■■■■■
     10 | [7812.50,7812.50] |    1 | ■■■■■■
(5 rows)

上面的 Common Table Expression 有三塊段落，最下面的主要部份是畫指令界面的長條圖的段落；如果只要運算值，只要取前兩區塊即可。

2) 借助 Window function 產生指定間距的 range 型態資料，然後拿來作圖。這比較像一般長條圖取間距的方式

WITH value_range AS(
    SELECT numrange(coord
         , lead(coord) OVER (ORDER BY coord)
         ) AS bucket
    FROM generate_series(0, 10000, 700) AS coord
)
   , histogram AS ( 
   SELECT bucket, count(sal) AS freq 
   FROM emp, value_range 
   WHERE sal <@ bucket 
   GROUP BY bucket 
   ORDER BY bucket 
) 
 SELECT bucket, freq, 
        repeat('■', 
                ( freq::float 
                  / max(freq) OVER() 
                  * 30 
                )::int 
        ) AS bar 
   FROM histogram;

   bucket    | freq |              bar               
-------------+------+--------------------------------
 [700,1400)  |    1 | ■■■■■■
 [1400,2100) |    5 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
 [2100,2800) |    2 | ■■■■■■■■■■■■
 [3500,4200) |    1 | ■■■■■■
 [4200,4900) |    4 | ■■■■■■■■■■■■■■■■■■■■■■■■
 [7700,8400) |    1 | ■■■■■■
(6 rows)

這段 SQL 使用 lead() 這個 Window Function ，在產生的序列上，前後兩筆資料（數值）做成等間隔；並計數落在各間隔範圍的 Salary 資料筆數。值得注意的是，裡面使用了 Postgres 的 range 資料型態以及其運算 sal <@ bucket。

要直接把上式湊成函數，看起來得要用 pl/pgSQL 的 EXEC() 函數（參考這裡），不過倒是可以用 psql 的 Meta-command 達成，包成一個獨立的 Script 來呼叫～

\set tblname emp
\set histocol sal
\set nbuckets 9
\set barscaling 30

WITH col_bndry AS (
    SELECT min(:histocol) AS min,
           max(:histocol) AS max
    FROM :tblname 
)
   , histogram AS (
   SELECT width_bucket(:histocol, min, max, :nbuckets) AS bucket,
          numrange(floor(min(:histocol)), floor(max(:histocol)), '[]') AS range,
          count(*) AS freq
     FROM :tblname, col_bndry 
     GROUP BY bucket
     ORDER BY bucket
)
SELECT bucket, range, freq,
       repeat('■',
              ( freq::float
                / max(freq) over()
                * :barscaling
              )::int
        ) AS bar
FROM histogram;

等間隔的 SQL 做成的 Script

\set tblname emp
\set histocol sal

\set begin_val 0
\set end_val   10000
\set specify_interval 700
\set barscaling 30
WITH value_range AS (
    SELECT numrange(coord
         , lead(coord) over (ORDER BY coord)
         ) AS bucket
    FROM generate_series(:begin_val, :end_val, :specify_interval) AS coord
)
   , histogram AS ( 
   SELECT bucket, count(:histocol) AS freq 
   FROM :tblname, value_range 
   WHERE :histocol <@ bucket 
   GROUP BY bucket 
   ORDER BY bucket 
) 
SELECT bucket, freq, 
       repeat('■', 
               ( freq::float 
                 / max(freq) over() 
                 * :barscaling
               )::int 
        ) AS bar 
FROM histogram;

psql 參數也可以透過指令的選項帶進去

bash-$ source /opt/edb/as9.6/pgplus_env.sh
bash-$ cat freq_dist.sql
\set begin_val 0
\set end_val   10000
\set specify_interval 700
\set barscaling 30

WITH intvl_series AS (
    SELECT numrange(coord
           , lead(coord) over (ORDER BY coord)
           ) AS bucket
    FROM generate_series(:begin_val, :end_val, 
                         :specify_interval) AS coord
)
   , freq_dist AS ( 
   SELECT bucket AS intvl
        , count(:data_col) AS freq 
   FROM :tblname, intvl_series 
   WHERE :data_col <@ bucket 
   GROUP BY bucket 
   ORDER BY bucket 
) 
SELECT intvl, freq, 
       repeat('■', 
               ( freq::float 
                 / max(freq) over() 
                 * :barscaling
               )::int 
        ) AS bar 
FROM freq_dist;

bash-$ psql -v tblname=emp -v data_col=sal -f freq_dist.sql
    intvl    | freq |              bar               
-------------+------+--------------------------------
 [700,1400)  |    1 | ■■■■■■
 [1400,2100) |    5 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
 [2100,2800) |    2 | ■■■■■■■■■■■■
 [3500,4200) |    1 | ■■■■■■
 [4200,4900) |    4 | ■■■■■■■■■■■■■■■■■■■■■■■■
 [7700,8400) |    1 | ■■■■■■
(6 rows)

最後，在幾番找尋之後，在 Postgres Wiki 找到一個回傳長條圖的 Aggregation Function，以下是從該網頁轉貼來的，適用 9.2 以上的 Postgres

CREATE OR REPLACE FUNCTION hist_sfunc (state INTEGER[], val REAL, MIN REAL, MAX REAL, nbuckets INTEGER)
 RETURNS INTEGER[] AS $$
DECLARE
  bucket INTEGER;
  i INTEGER;
BEGIN
  -- width_bucket uses nbuckets + 1 (!) and starts at 1.
  bucket := width_bucket(val, MIN, MAX, nbuckets - 1) - 1;
 
  -- Init the array with the correct number of 0's so the caller doesn't see NULLs
  IF state[0] IS NULL THEN
    FOR i IN SELECT * FROM generate_series(0, nbuckets - 1) LOOP
      state[i] := 0;
    END LOOP;
  END IF;
 
  state[bucket] = state[bucket] + 1;
 
  RETURN state;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
 
-- Tell Postgres how to use the new function
DROP AGGREGATE IF EXISTS histogram (REAL, REAL, REAL, INTEGER);
CREATE AGGREGATE histogram (REAL, REAL, REAL, INTEGER) (
       SFUNC = hist_sfunc,
       STYPE = INTEGER[]
);

可以看出來也是使用 width_bucket() 達成。可惜的是，這個 Aggregate Function 沒有允許多執行緒運算，需要有人幫忙加上去。

該函數的使用範例如下：要注意的是，這裡的取間隔數目和上一個範例不同，而且回傳的是 array 型態

with col_bndry as (
    select min(sal) as min,
           max(sal) as max
      from emp 
)
SELECT  histogram(sal, min, max, 10) as sal_histogram
 FROM emp,col_bndry ;

        sal_histogram        
-----------------------------
 [0:9]={5,3,0,1,4,0,0,0,0,1}

最後，看起來使用 range datatype＋generate_series() 比起使用 width_bucker() 處理還要自然一點～～

此外，擔心 count(*) 會算很久的人，在 Postgres 9.6 裡面已經有 SQL 多執行緒執行的功能支援，「有機會」把表格拆分多塊，用多個程序去進行，不然可以考慮使用 Greenplum，分散到多台節點，也可以算很快～

參考資料

以下兩篇比較主要

Calculating Summaries with Histogram Frequency Distributions · Advanced SQL · SILOTA

PostgreSQL, Aggregates and Histograms - Dimitri Fontaine, PostgreSQL Expert

Aggregate Histogram - PostgreSQL wiki

Binning functions | Axle Project

sql - Postgres width_bucket() not assigning values to buckets correctly - Stack Overflow

Quant Meditations: Creating histograms using the SQL function WIDTH_BUCKET

PostgreSQL 9.5新特性 width_bucket求位置插值，可展示柱状图-博客-云栖社区-阿里云

人分九等，数有阶梯 - PostgreSQL 阶品（颗粒）分析函数width_bucket, kmean应用-博客-云栖社区-阿里云（個人不偏好這篇用的開場白. . .）

Functions to create and draw histograms with PostgreSQL.

How to quickly see data distribution at the psql command line

Postgres width_bucket() not assigning values to buckets correctly | sql

calculating equal frequency buckets, or n-tile ranges in postgresql

8.17.1. Built-in Range Types - PostgreSQL: Documentation: 9.6: Range Types

次數分配與其圖表 | 科學Online – 科技部高瞻自然科學教學資源平台

Statistical SQL Quaries: Calculating Frequencies and Histograms (part 1)

sql - Rounding numbers to the nearest 10 in Postgres - Stack Overflow

postgresql - Round number UP to multiple of 10 - Stack Overflow

How to find data distributions with SQL · Baron Schwartz's Blog

PostgreSQL: Documentation: 9.1: Set Returning Functions

PostgreSQL: Documentation: 9.6: Range Functions and Operators

comparison - Comparing record to previous record in postgresql - Stack Overflow

渡鴉之丘

MathJax

MathJax-2

MathJax-3

Google Code Prettify

置頂入手筆記

EnterproseDB Quickstart — 快速入門筆記

2017年7月10日星期一

在 PostgreSQL 產生次數分配圖表的 SQL 語法

沒有留言:

張貼留言