渡鴉之丘: 把 iris 資料集放進 PGSQL

iris 資料集，是資料分析的入門範例資料集，很多的入門教學都會用到這份資料作範例。這邊把大統計學家 Ronald Fisher 使用的 iris 資料拿來匯入 PGSQL 裡面，用一點點 PGSQL 的功能讓儲存便簡潔一點點，順便把裡面提到的幾項敘述統計量整理成一個 VIew。以便作為延伸學習的敲門磚～～

首先先下載，網路上太多來源了，而且 R 或 Python 甚至打包成下載 API，這邊隨便找一個下載下來就可以了，這邊找的是 UCI Machine Learning Repository: Iris Data Set。

我這邊手上有四份資料文字檔：Index，iris.names，iris.data，bezdekIris.data。主要使用 iris.data 跟 iris.names：iris.data 是一個 csv 檔，而 iris.names 是一些描述，還包含了一個敘述統計量的表格跟基本的觀察資訊。

首先，建立相關的資料庫物件，共有一個 enum 資料型態，一個表格（方便起見放了一個 Sequence），跟幾個註解（從 iris.data 取得）

CREATE TYPE iris_class 
 AS ENUM ('Iris-setosa', 
          'Iris-versicolor', 
          'Iris-virginica');

CREATE TABLE iris (
 entry_id     SERIAL PRIMARY KEY,
 sepal_length NUMERIC(3,2),
 sepal_width  NUMERIC(3,2),
 petal_length NUMERIC(3,2),
 petal_width  NUMERIC(3,2),
 class        iris_class
);

COMMENT ON TABLE iris IS 
$$
1. Title: Iris Plants Database
 Updated Sept 21 by C.Blake - Added discrepency information

2. Sources:
     (a) Creator: R.A. Fisher
     (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
     (c) Date: July, 1988

＊＊＊中間內容略，請自行補上

Summary Statistics:
          Min  Max   Mean    SD   Class Correlation
   sepal length: 4.3  7.9   5.84  0.83    0.7826   
    sepal width: 2.0  4.4   3.05  0.43   -0.4194
   petal length: 1.0  6.9   3.76  1.76    0.9490  (high!)
    petal width: 0.1  2.5   1.20  0.76    0.9565  (high!)

9. Class Distribution: 33.3% for each of 3 classes.
$$;
COMMENT ON COLUMN iris.sepal_length IS 'sepal length in cm';
COMMENT ON COLUMN iris.sepal_width IS 'sepal width in cm';
COMMENT ON COLUMN iris.petal_length IS 'petal length in cm';
COMMENT ON COLUMN iris.petal_width IS 'petal width in cm';
COMMENT ON TYPE iris_class IS 
$$
class: 
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
$$;

然後就是匯入資料：這邊用 COPY 指令吃 csv 進去表格

COPY iris(sepal_length,
 sepal_width,
 petal_length,
 petal_width,
 class ) FROM '/path/to/iris.data'
 WITH ( FORMAT 'text', 
      DELIMITER ',' ,
      HEADER false);

最後，就是上面有看到的敘述統計資訊，這邊用 Union 作一個 View。此外，這邊利用了 PGSQL 的 enum 型態的特性，內部儲存的資料其實是數值，在查詢時會用 Mapping 的方式與前端互動

CREATE or replace VIEW iris_summary AS 
 SELECT 'sepal_length' as "Feature", 
  MIN(sepal_length), MAX(sepal_length),
  AVG(sepal_length)::numeric(3,2), STDDEV_SAMP(sepal_length)::numeric(3,2) 
  , CORR(sepal_length, enumsortorder)::numeric(5,4) 
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid
UNION ALL 
SELECT 'sepal_width' as "Feature", 
  MIN(sepal_width), MAX(sepal_width),
  AVG(sepal_width)::numeric(3,2), STDDEV_SAMP(sepal_width)::numeric(3,2)
  , CORR(sepal_width, enumsortorder)::numeric(5,4) 
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid
UNION ALL 
SELECT 'petal_length' as "Feature", 
  MIN(petal_length), MAX(petal_length),
  AVG(petal_length)::numeric(3,2), STDDEV_SAMP(petal_length)::numeric(3,2)
  , CORR(petal_length, enumsortorder)::numeric(5,4) 
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid
UNION ALL 
SELECT 'petal_width' as "Feature", 
  MIN(petal_width), MAX(petal_width),
  AVG(petal_width)::numeric(3,2), STDDEV_SAMP(petal_width)::numeric(3,2)
  , CORR(petal_width, enumsortorder)::numeric(5,4) 
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid;

＊Note：這邊好像不容易用 PostgreSQL 樞紐表處理...

把上面匯入後，這樣就可以進行「小數據」整合 PGSQL 的分析練習了～

參考資料
Ex 3: The iris 鳶尾花資料集 · Machine Learning: Python 機器學習：使用Python
UCI Machine Learning Repository: Iris Data Set
Iris Species | Kaggle
Calculating Summary Statistics · Advanced SQL · SILOTA
weka - What is class correlation? - Cross Validated
Using the PostgreSQL Correlation Function
PostgreSQL: Is it possible cast enum to integer? - Stack Overflow
postgresql - SQL query to get all values a enum can have - Stack Overflow

渡鴉之丘

MathJax

MathJax-2

MathJax-3

Google Code Prettify

置頂入手筆記

EnterproseDB Quickstart — 快速入門筆記

2019年1月9日星期三

把 iris 資料集放進 PGSQL

沒有留言:

張貼留言