首先先下載,網路上太多來源了,而且 R 或 Python 甚至打包成下載 API,這邊隨便找一個下載下來就可以了,這邊找的是 UCI Machine Learning Repository: Iris Data Set。
我這邊手上有四份資料文字檔:Index,iris.names,iris.data,bezdekIris.data。主要使用 iris.data 跟 iris.names:iris.data 是一個 csv 檔,而 iris.names 是一些描述,還包含了一個敘述統計量的表格跟基本的觀察資訊。
首先,建立相關的資料庫物件,共有一個 enum 資料型態,一個表格(方便起見放了一個 Sequence),跟幾個註解(從 iris.data 取得)
CREATE TYPE iris_class AS ENUM ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica');
CREATE TABLE iris ( entry_id SERIAL PRIMARY KEY, sepal_length NUMERIC(3,2), sepal_width NUMERIC(3,2), petal_length NUMERIC(3,2), petal_width NUMERIC(3,2), class iris_class );
COMMENT ON TABLE iris IS $$ 1. Title: Iris Plants Database Updated Sept 21 by C.Blake - Added discrepency information 2. Sources: (a) Creator: R.A. Fisher (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) (c) Date: July, 1988 ***中間內容略,請自行補上 Summary Statistics: Min Max Mean SD Class Correlation sepal length: 4.3 7.9 5.84 0.83 0.7826 sepal width: 2.0 4.4 3.05 0.43 -0.4194 petal length: 1.0 6.9 3.76 1.76 0.9490 (high!) petal width: 0.1 2.5 1.20 0.76 0.9565 (high!) 9. Class Distribution: 33.3% for each of 3 classes. $$; COMMENT ON COLUMN iris.sepal_length IS 'sepal length in cm'; COMMENT ON COLUMN iris.sepal_width IS 'sepal width in cm'; COMMENT ON COLUMN iris.petal_length IS 'petal length in cm'; COMMENT ON COLUMN iris.petal_width IS 'petal width in cm'; COMMENT ON TYPE iris_class IS $$ class: -- Iris Setosa -- Iris Versicolour -- Iris Virginica $$;
然後就是匯入資料:這邊用 COPY 指令吃 csv 進去表格
COPY iris(sepal_length,
sepal_width,
petal_length,
petal_width,
class ) FROM '/path/to/iris.data'
WITH ( FORMAT 'text',
DELIMITER ',' ,
HEADER false);
最後,就是上面有看到的敘述統計資訊,這邊用 Union 作一個 View。此外,這邊利用了 PGSQL 的 enum 型態的特性,內部儲存的資料其實是數值,在查詢時會用 Mapping 的方式與前端互動
CREATE or replace VIEW iris_summary AS
SELECT 'sepal_length' as "Feature",
MIN(sepal_length), MAX(sepal_length),
AVG(sepal_length)::numeric(3,2), STDDEV_SAMP(sepal_length)::numeric(3,2)
, CORR(sepal_length, enumsortorder)::numeric(5,4)
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid
UNION ALL
SELECT 'sepal_width' as "Feature",
MIN(sepal_width), MAX(sepal_width),
AVG(sepal_width)::numeric(3,2), STDDEV_SAMP(sepal_width)::numeric(3,2)
, CORR(sepal_width, enumsortorder)::numeric(5,4)
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid
UNION ALL
SELECT 'petal_length' as "Feature",
MIN(petal_length), MAX(petal_length),
AVG(petal_length)::numeric(3,2), STDDEV_SAMP(petal_length)::numeric(3,2)
, CORR(petal_length, enumsortorder)::numeric(5,4)
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid
UNION ALL
SELECT 'petal_width' as "Feature",
MIN(petal_width), MAX(petal_width),
AVG(petal_width)::numeric(3,2), STDDEV_SAMP(petal_width)::numeric(3,2)
, CORR(petal_width, enumsortorder)::numeric(5,4)
FROM iris JOIN pg_enum ON class::TEXT=enumlabel WHERE enumtypid = 'iris_class'::regtype::oid;
*Note:這邊好像不容易用 PostgreSQL 樞紐表處理...把上面匯入後,這樣就可以進行「小數據」整合 PGSQL 的分析練習了~
參考資料
Ex 3: The iris 鳶尾花資料集 · Machine Learning: Python 機器學習:使用Python
UCI Machine Learning Repository: Iris Data Set
Iris Species | Kaggle
Calculating Summary Statistics · Advanced SQL · SILOTA
weka - What is class correlation? - Cross Validated
Using the PostgreSQL Correlation Function
PostgreSQL: Is it possible cast enum to integer? - Stack Overflow
postgresql - SQL query to get all values a enum can have - Stack Overflow
沒有留言:
張貼留言