晶文摘 - [Big Data] Data Preprocessing in Practice


Data Preprocessing in Practice

Author: 夏肇毅

First draft: 20220819

Data preprocessing steps:

Data cleaning (also called data cleansing)

Data integration

Data transformation

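Data cleaning handles two kinds of problem:

Missing values: fields with no content

Outliers: values that are implausible

Handling: fill in the median or random values drawn from the same distribution, or delete the affected records outright.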
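The three handling options above can be sketched on a toy column. This is a minimal sketch, not the author's code: the helper fill_with_random_draws and the ages data are hypothetical, and "random values from the same distribution" is interpreted here as sampling with replacement from the observed values.

```python
import numpy as np
import pandas as pd


def fill_with_random_draws(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing entries with values sampled from the observed entries.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    filled = series.copy()
    missing = filled.isna()
    # Sampling with replacement from the observed values keeps the filled
    # column close to the empirical distribution of the original data.
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled


# Toy data (made up): two of the six values are missing.
ages = pd.Series([23.0, np.nan, 31.0, 45.0, np.nan, 38.0])

median_filled = ages.fillna(ages.median())   # option 1: fill with the median
random_filled = fill_with_random_draws(ages)  # option 2: random draws
dropped = ages.dropna()                       # option 3: delete the records
```

Median filling is deterministic but piles the filled rows onto a single value; random draws avoid that at the cost of reproducibility, which is why the sketch takes a seed.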

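Common ways to handle missing values:

Treat missing values as a category of their own, for example 'None' for categorical columns.

Fill missing values with a summary statistic such as the mean, median, or mode.

Fill missing values with predictions from a fitted function or model.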
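The last option, filling with predictions, can be sketched with a straight-line fit. This is only one possible reading of "function-based prediction": the column names and numbers are made up, and np.polyfit stands in for whatever model one would actually choose.

```python
import numpy as np
import pandas as pd

# Toy data (made up): price is missing for one row.
df = pd.DataFrame({
    "area":  [50.0, 80.0, 100.0, 120.0, 150.0],
    "price": [100.0, 160.0, np.nan, 240.0, 300.0],
})

known = df["price"].notna()

# Fit price = slope * area + intercept on the rows where price is known.
slope, intercept = np.polyfit(df.loc[known, "area"], df.loc[known, "price"], deg=1)

# Predict only the missing rows and write the predictions back.
df.loc[~known, "price"] = slope * df.loc[~known, "area"] + intercept
```

A prediction-based fill uses the relationship with other columns, so it tends to be more plausible than a single global statistic when such a relationship exists.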

Data cleaning methods:

Reference: Pandas DataFrames (W3Schools)

Fill with the string 'None': fillna('None', inplace=True). Reference: Pandas DataFrame fillna() Method (W3Schools)

Fill with 0: fillna(0, inplace=True)

Fill with the mode: fillna(mode_value)

Fill with the median: transform(lambda x: x.fillna(x.median())). Reference: Pandas DataFrame median() Method (W3Schools)

Remove outliers. Reference: Pandas DataFrame dropna() Method (W3Schools)

train_df.drop(train_df[(train_df['GrLivArea']>4000)&(train_df['GrLivArea']<30000)].index,inplace=True)
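The cleaning calls listed above can be tried end to end on a small table. The sketch below is an illustration under assumptions: the toy rows are made up, the column names are borrowed from the House Prices data set, and grouping by Neighborhood before taking the median is an assumption about how the transform() call is meant to be used.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with missing values in three columns.
train_df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "PoolQC":       [np.nan, "Gd", np.nan, np.nan, "Ex"],
    "GarageCars":   [2.0, np.nan, 1.0, 2.0, 2.0],
    "LotFrontage":  [60.0, np.nan, 80.0, np.nan, 100.0],
    "GrLivArea":    [1500, 1800, 4500, 2100, 1200],
})

# Treat "missing" as its own category for a categorical column.
train_df["PoolQC"] = train_df["PoolQC"].fillna("None")

# Fill with the mode; mode() returns a Series, so take its first entry.
mode_value = train_df["GarageCars"].mode()[0]
train_df["GarageCars"] = train_df["GarageCars"].fillna(mode_value)

# Fill with the median of each group (grouping column is an assumption).
train_df["LotFrontage"] = (
    train_df.groupby("Neighborhood")["LotFrontage"]
    .transform(lambda x: x.fillna(x.median()))
)

# Drop the rows selected by a boolean mask as outliers.
outlier_index = train_df[train_df["GrLivArea"] > 4000].index
train_df = train_df.drop(outlier_index)
```

Assigning the result back to the column, rather than calling fillna(..., inplace=True) on a column selection, avoids chained-assignment warnings in recent pandas versions.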

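Data merging methods:

Reference: DataFrames Reference

concat: pandas.concat() concatenates DataFrames along an axis

append: append() adds rows to the end (deprecated in pandas 1.4 and removed in pandas 2.0; use pandas.concat() instead)

merge: merge() combines DataFrame objects with a database-style join

join: join() adds the columns of another DataFrame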
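The differences between these methods are easiest to see on two small tables. The following is a minimal sketch with made-up rows; the Id, GrLivArea and SalePrice columns are used purely for illustration.

```python
import pandas as pd

left = pd.DataFrame({"Id": [1, 2, 3], "GrLivArea": [1500, 1800, 2100]})
right = pd.DataFrame({"Id": [2, 3, 4], "SalePrice": [200000, 250000, 180000]})

# concat: stack the rows of several DataFrames on top of each other.
stacked = pd.concat([left, left], ignore_index=True)

# merge: database-style join on a key column; "inner" keeps only the
# Ids that appear in both tables.
merged = pd.merge(left, right, on="Id", how="inner")

# join: add the columns of another DataFrame, aligned on the index;
# "left" keeps every row of the calling DataFrame.
joined = left.set_index("Id").join(right.set_index("Id"), how="left")
```

In short, concat is the choice when the tables share columns, while merge and join are the choice when they share keys.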

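Data transformation methods:

Feature engineering applies transformations to the features:

For categorical data, LabelEncoder is commonly used to convert each category into a numeric value;

one-hot encoding into a sparse matrix is another option.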
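Both encodings can be sketched with pandas alone. Note the substitution: the text names scikit-learn's LabelEncoder, while this sketch uses pandas category codes as a stand-in that likewise assigns integers to the sorted category labels. The Street column and its values are toy data.

```python
import pandas as pd

df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Label-style encoding: one integer per category, assigned in sorted
# order of the labels (Grvl -> 0, Pave -> 1).
df["Street_code"] = df["Street"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Street"], prefix="Street")

# The same encoding backed by sparse columns, which saves memory when
# there are many categories and most indicators are zero.
sparse_one_hot = pd.get_dummies(df["Street"], prefix="Street", sparse=True)
```

Integer codes imply an ordering between categories, which can mislead linear models; one-hot encoding avoids that at the cost of more columns.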

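For numeric data, try to bring the distribution as close to normal as possible.

Log transform: np.log() (reference: ufunc Logs)

Exponential transform: np.exp()

Power transform: **0.5, **2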
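The effect of these transforms on a right-skewed column can be checked with the skewness statistic. This is a small sketch on made-up prices, not a result from the House Prices data.

```python
import numpy as np
import pandas as pd

# Toy right-skewed data (made up): one value is far larger than the rest.
prices = pd.Series([100.0, 120.0, 150.0, 200.0, 400.0, 1500.0])

log_prices = np.log(prices)     # log transform compresses the long right tail
restored = np.exp(log_prices)   # the exponential transform inverts the log
sqrt_prices = prices ** 0.5     # power transform, milder than the log
squared = prices ** 2           # power transform that stretches the tail

# Skewness near 0 suggests a roughly symmetric distribution.
skew_before = prices.skew()
skew_after = log_prices.skew()
print("Skewness before: %f, after log: %f" % (skew_before, skew_after))
```

np.log requires strictly positive input; for columns that contain zeros, np.log1p (the log of 1 + x) is a common alternative.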

Normality and homoscedasticity

Relationship with numerical variables

Relationship with categorical features

Commonly used modules:

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import numpy as np

from scipy import stats

#read the file with pandas

df_train = pd.read_csv('filename.csv')  #replace with the path to your CSV file

#show the column names

df_train.columns

#descriptive statistics summary

df_train['SalePrice'].describe()

#histogram (sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is its closest replacement)

sns.histplot(df_train['SalePrice'], kde=True);

#skewness and kurtosis

print("Skewness: %f" % df_train['SalePrice'].skew())

print("Kurtosis: %f" % df_train['SalePrice'].kurt())

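#scatter plot grlivarea/saleprice

df_train.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0,800000));

#box plot overallqual/saleprice

fig = sns.boxplot(x='OverallQual', y="SalePrice", data=df_train)

#correlation matrix heatmap (computed over the numeric columns only)

corrmat = df_train.select_dtypes(include='number').corr()

sns.heatmap(corrmat, vmax=.8, square=True);

#scatterplot matrix (cols is an example subset of columns; 'size' was renamed to 'height' in seaborn 0.9)

cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'TotalBsmtSF']

sns.pairplot(df_train[cols], height=2.5)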

#count missing data

total = df_train.isnull().sum().sort_values(ascending=False)

percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

#dealing with missing data: drop every column that has more than one missing value

df_train = df_train.drop(columns=missing_data[missing_data['Total'] > 1].index)

#normal probability plot

res = stats.probplot(df_train['SalePrice'], plot=plt)

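#transform data: log-transform the basement area only where it is positive, since log(0) is undefined

#HasBsmt is a helper flag column, created here so that the snippet runs on its own

df_train['HasBsmt'] = (df_train['TotalBsmtSF'] > 0).astype(int)

df_train['TotalBsmtSF'] = df_train['TotalBsmtSF'].astype(float)

df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'])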

#convert categorical variables into dummy variables

df_train = pd.get_dummies(df_train)
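The missing-data steps in the listing above can be run on a small table to see what each intermediate object holds. This is a sketch on made-up rows; the final dropna() call, which removes the rows that still contain a missing value, is an added assumption about how the remaining gaps would be handled.

```python
import numpy as np
import pandas as pd

# Toy data (made up): PoolQC is mostly missing, Electrical has one gap.
df_train = pd.DataFrame({
    "SalePrice":  [200000, 180000, 250000, 300000],
    "PoolQC":     [np.nan, np.nan, np.nan, "Ex"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "SBrkr"],
})

# Per-column count and fraction of missing values, worst columns first.
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])

# Drop the columns that have more than one missing value.
cols_to_drop = missing_data[missing_data["Total"] > 1].index
df_train = df_train.drop(columns=cols_to_drop)

# Drop the few rows that still contain a missing value (assumption).
df_train = df_train.dropna()
```

Dropping columns first and rows second keeps as many observations as possible: a column that is mostly empty would otherwise force nearly every row to be removed.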

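Example:

Register, then download the notebook: Comprehensive data exploration with Python | Kaggle

https://www.kaggle.com › code › pmarcelino › comprehe…

Download comprehensive-data-exploration-with-python.ipynb

Save it to your own Google Drive

Register, then download the data: Kaggle: House Prices: Advanced Regression Techniques

Save it to your own Google Drive

Modify the notebook to read the data from Google Drive: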

from google.colab import drive

drive.mount('/drive', force_remount=True)

train_df = pd.read_csv('/drive/MyDrive/Colab Notebooks/Training/house-prices-advanced-regression-techniques-train.csv')

