渡鴉之丘: 在 PostgreSQL 產生累積次數分配圖表的 SQL 語法

一般敘述性統計，常見到兩種圖，次數分佈圖和累計次數分佈圖。繼上次的次數分佈的 SQL 筆記之後，這次換成累計次數分佈（Cumulative Frequency Distribution Chart）了。

要達到目的，只要基於先前筆記的次數分配表的 SQL，增加加總和求比例進行計算就能得到了～

下面範例其實是從 Oracle 借來的指令（來源在這裡），這指令中的語法，原生的 PostgreSQL 都有支援！

with freq_dist as
(select intvl, count(*) freq
  from (select width_bucket(sal,
                (select min(sal) from emp),
                (select max(sal)+1 from emp), 10) intvl
          from emp)
group by intvl
order by intvl),
--
rel_freq_dist as
(select intvl, freq::float/(select count(*) from emp) rel_freq
  from freq_dist)
--
select intvl,
      sum(rel_freq) over (order by intvl
      rows between unbounded preceding and current row) cum_freq
 from rel_freq_dist;

 intvl |     cum_freq      
-------+-------------------
     1 | 0.214285714285714
     2 | 0.571428571428571
     4 | 0.642857142857143
     5 | 0.714285714285714
     6 | 0.928571428571428
    10 |                 1
(6 rows)

另外，調整上面範例，使用 generate_series() 也可以達成一樣的功能，以下做成一個 psql Script

\set tblname  emp
\set data_col sal
\set begin_val 0
\set end_val   10000
\set specify_interval 700
\set barscaling 30

WITH intvl_series AS(
    SELECT numrange( coord
           , lead(coord) OVER (ORDER BY coord)
           ) AS bucket
    FROM generate_series(:begin_val, :end_val, 
                         :specify_interval) AS coord
)
   , freq_dist AS ( 
   SELECT bucket, count(:data_col) AS freq 
   FROM :tblname, intvl_series 
   WHERE :data_col <@ bucket 
   GROUP BY bucket 
   ORDER BY bucket 
) 
   , rel_freq_dist AS ( 
   SELECT bucket AS intvl 
        , freq::float/(SELECT sum(freq) 
                       FROM freq_dist) rel_freq 
   FROM freq_dist 
) 
   , cum_freq_dist AS ( 
   SELECT intvl, 
          sum(rel_freq) over (order by intvl 
                              rows between unbounded preceding and current row) cum_freq 
   from rel_freq_dist 
)
 SELECT intvl, cum_freq, 
        repeat('■', 
                ( cum_freq::float 
                  / max(cum_freq) OVER() 
                  * :barscaling 
                )::int 
        ) AS bar 
   FROM cum_freq_dist;

    intvl    |      cum_freq      |              bar               
-------------+--------------------+--------------------------------
 [700,1400)  | 0.0714285714285714 | ■■
 [1400,2100) |  0.428571428571429 | ■■■■■■■■■■■■■
 [2100,2800) |  0.571428571428571 | ■■■■■■■■■■■■■■■■■
 [3500,4200) |  0.642857142857143 | ■■■■■■■■■■■■■■■■■■■
 [4200,4900) |  0.928571428571428 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■
 [7700,8400) |                  1 | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
(6 rows)

和前一篇筆記一樣，一樣可惜的是，沒有資料的區間沒辦法用零表示. . .

最後，附上一個不畫圖的版本，在前一篇也如法炮製即可～

\set tblname  emp
\set data_col sal
\set begin_val 0
\set end_val   10000
\set specify_interval 700

WITH intvl_series AS(
    SELECT numrange( coord
           , lead(coord) OVER (ORDER BY coord)
           ) AS bucket
    FROM generate_series(:begin_val, :end_val, 
                         :specify_interval) AS coord
)
   , freq_dist AS ( 
   SELECT bucket, count(:data_col) AS freq 
   FROM :tblname, intvl_series 
   WHERE :data_col <@ bucket 
   GROUP BY bucket 
   ORDER BY bucket 
) 
   , rel_freq_dist AS ( 
   SELECT bucket AS intvl 
        , freq::float/(SELECT sum(freq) 
                       FROM freq_dist) rel_freq 
   FROM freq_dist 
)  
   SELECT intvl, 
          sum(rel_freq) over (order by intvl 
                              rows between unbounded preceding and current row) cum_freq 
   from rel_freq_dist 
;

    intvl    |      cum_freq      
-------------+--------------------
 [700,1400)  | 0.0714285714285714
 [1400,2100) |  0.428571428571429
 [2100,2800) |  0.571428571428571
 [3500,4200) |  0.642857142857143
 [4200,4900) |  0.928571428571428
 [7700,8400) |                  1
(6 rows)

參考資料

Calculating Summaries with Histogram Frequency Distributions · Advanced SQL · SILOTA

postgresql - SQL: Fast Cumulative Frequency Query (postgres) - Stack Overflow

postgresql - Calculating cumulative for each month change - Postgres SQL issue - Stack Overflow

sql - Calculating Cumulative Sum in PostgreSQL - Stack Overflow

sql - Count cumulative total in Postgresql - Stack Overflow

Cumulative percentiles in PostgreSQL - journal of recreational computer science, Jan Szumiec

cumulative frequency graph postgres - Google 搜尋

Cumulative Frequency Graph | R Tutorial

width_bucket cumulative histogram - Google 搜尋

Numoracle Recipes: Oracle Descriptive Statistics

4.2.8. Window Function Calls - PostgreSQL: Documentation: 9.6: Value Expressions