Data Preprocessing in Practice
Author: ®L»F¼Ý
First draft: 20220819
Data preprocessing steps:
Data Cleaning (also called Data Cleansing)
Data Integration
Data Transformation
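Data cleaning steps:
Missing value: a field that has no content.
Outlier: a value that is not plausible.
Handling: fill in the median or random numbers drawn from the same distribution, or delete the rows outright.
Common ways to handle missing values:
Treat missing values as a category of their own, e.g. use 'None' for categorical columns.
Fill missing values with a specific statistic such as the mean, median, or mode.
Fill missing values with a predicted value, e.g. from a fitted function.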
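A minimal sketch of the two less common options listed above: filling with random values drawn from the observed distribution, and filling with a predicted value. The toy DataFrame, the column names `area` and `price`, and the straight-line fit are all assumptions made for illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'area' is complete, 'price' has two gaps.
df = pd.DataFrame({
    "area":  [50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
    "price": [100.0, 120.0, np.nan, 160.0, np.nan, 200.0],
})
missing = df["price"].isna()

# Option A: draw replacements at random from the observed values,
# which keeps the filled column close to the empirical distribution.
rng = np.random.default_rng(0)
observed = df.loc[~missing, "price"].to_numpy()
random_fill = df["price"].copy()
random_fill[missing] = rng.choice(observed, size=missing.sum())

# Option B: predict the gaps from another column with a fitted function
# (here a straight line, price ~ slope * area + intercept).
slope, intercept = np.polyfit(df.loc[~missing, "area"], observed, deg=1)
predicted_fill = df["price"].copy()
predicted_fill[missing] = slope * df.loc[missing, "area"] + intercept
```

Random filling preserves the spread of the column but adds noise to individual rows; predictive filling does the opposite, so the choice depends on what the later model is sensitive to.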
Data cleaning methods:
Fill with 'None': fillna('None',inplace=True) (see: Pandas DataFrame fillna() Method - W3Schools)
Fill with 0: fillna(0,inplace=True)
Fill with the mode: fillna(mode_value)
Replace with the median: transform(lambda x: x.fillna(x.median())) (see: Pandas DataFrame median() Method - W3Schools)
Remove outliers (see: Pandas DataFrame dropna() Method - W3Schools)
train_df.drop(train_df[(train_df['GrLivArea']>4000)&(train_df['GrLivArea']<30000)].index,inplace=True)
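The cleaning calls above can be tried together on a small frame. This is only a sketch: the toy data is made up, the column names are borrowed from the House Prices dataset, and grouping by `Neighborhood` before the median fill is an assumption about how the `transform(...)` call is meant to be used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the House Prices dataset.
train_df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage":  [60.0, np.nan, 80.0, 40.0, 50.0, np.nan],
    "GarageType":   ["Attchd", None, "Attchd", "Detchd", None, "Attchd"],
    "GarageArea":   [400.0, np.nan, 500.0, 300.0, np.nan, 450.0],
    "GrLivArea":    [1500, 1700, 4500, 1200, 1300, 1600],
})

# Categorical column: treat "missing" as its own category.
train_df["GarageType"] = train_df["GarageType"].fillna("None")

# Numeric column where missing means "none present": fill with 0.
train_df["GarageArea"] = train_df["GarageArea"].fillna(0)

# Median fill, computed within each neighborhood group.
train_df["LotFrontage"] = train_df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median())
)

# Drop rows whose living area exceeds the chosen outlier threshold.
train_df = train_df.drop(train_df[train_df["GrLivArea"] > 4000].index)
```

Assigning the result back instead of passing `inplace=True` gives the same outcome and avoids chained-assignment surprises when working on a single column.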
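Data merging methods:
Concat: pandas.concat concatenates DataFrames
Append: append() adds rows to the end (removed in pandas 2.0; use pandas.concat instead)
Merge: merge() merges DataFrame objects
Join: join() adds the columns of another DataFrame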
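A short sketch of how the merging methods differ in practice. The three toy tables and their values are hypothetical; only `pd.concat`, `DataFrame.merge`, and `DataFrame.join` come from the list above.

```python
import pandas as pd

# Hypothetical toy tables.
left = pd.DataFrame({"Id": [1, 2, 3], "GrLivArea": [1500, 1700, 1200]})
right = pd.DataFrame({"Id": [2, 3, 4], "SalePrice": [200000, 150000, 180000]})
extra = pd.DataFrame({"Id": [5], "GrLivArea": [2000]})

# concat: stack rows end to end (this also covers the old append() use case).
stacked = pd.concat([left, extra], ignore_index=True)

# merge: SQL-style join on a key column; the default is an inner join.
merged = left.merge(right, on="Id")

# join: add the other frame's columns, matching on the index (left join by default).
joined = left.set_index("Id").join(right.set_index("Id"))
```

Here `merged` keeps only the Ids present in both tables, while `joined` keeps every row of `left` and leaves `SalePrice` empty where there is no match.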
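Data transformation methods:
Feature engineering: apply feature transformations to the data:
For categorical data, the usual approach is LabelEncoder, which turns each category into a numeric value;
one-hot encoding, which produces a sparse matrix, is another option.
For numeric data, try to bring it as close to a normal distribution as possible.
Log transform: np.log() (see: ufunc Logs)
Exponential transform: np.exp()
Power transform: **0.5, **2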
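To see what the numeric transforms listed above do to a skewed column, one rough check is to compare skewness before and after. The sample below is synthetic (a log-normal draw standing in for a price column), so the parameters and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), standing in for a price column.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=2000))

log_price = np.log(price)        # log transform: compresses the long right tail
sqrt_price = price ** 0.5        # power transform with exponent 0.5: milder effect
restored = np.exp(log_price)     # exponential transform: inverse of the log

print("skew raw :", price.skew())
print("skew sqrt:", sqrt_price.skew())
print("skew log :", log_price.skew())
```

A skewness near zero after the transform suggests the column is closer to symmetric; np.exp() is mainly useful for mapping predictions made on the log scale back to the original units.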
Normality: normality and homoscedasticity
Relationship with numerical variables
Relationship with categorical features
Common modules:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
#read the file with pandas
df_train = pd.read_csv('filename')
#show the columns
df_train.columns
#descriptive statistics summary
df_train['SalePrice'].describe()
#histogram
#distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df_train['SalePrice'], kde=True);
#skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
#scatter plot grlivarea/saleprice
df_train.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0,800000));
#box plot overallqual/saleprice
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=df_train)
#correlation matrix heatmap (numeric columns only)
corrmat = df_train.select_dtypes(include='number').corr()
sns.heatmap(corrmat, vmax=.8, square=True);
#scatterplot matrix; cols is an example selection of columns
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'TotalBsmtSF']
sns.pairplot(df_train[cols], height = 2.5)
#count missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
#dealing with missing data: drop every column with more than one missing value
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
#normal probability plot
res = stats.probplot(df_train['SalePrice'], plot=plt)
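#transform data
#log-transform TotalBsmtSF only where a basement exists, since log(0) is undefined
df_train['HasBsmt'] = (df_train['TotalBsmtSF'] > 0).astype(int)
df_train['TotalBsmtSF'] = df_train['TotalBsmtSF'].astype(float)
df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'])
#convert categorical variable into dummy
df_train = pd.get_dummies(df_train)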
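For comparison with get_dummies, here is a small sketch that puts label encoding and one-hot encoding side by side. The toy column is hypothetical, and the label encoding is done with pandas category codes rather than scikit-learn's LabelEncoder; that substitution is an assumption made to keep the example dependent on pandas only.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Label encoding with pandas only: categories are sorted and numbered from 0,
# which matches how scikit-learn's LabelEncoder assigns its integer labels.
df["Street_label"] = df["Street"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Street"], prefix="Street")
```

Label encoding keeps a single column but implies an ordering between categories; one-hot encoding avoids that at the cost of one extra column per category.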
Example:
Register, then download: Comprehensive data exploration with Python | Kaggle
https://www.kaggle.com › code › pmarcelino › comprehe…
Download comprehensive-data-exploration-with-python.ipynb
Save it to your own Google Drive
Register, then download the data: Kaggle: House Prices: Advanced Regression Techniques
Save it to your own Google Drive
Modify the notebook to read the data from Google Drive:
from google.colab import drive
drive.mount('/drive', force_remount=True)
train_df = pd.read_csv('/drive/MyDrive/Colab Notebooks/Training/house-prices-advanced-regression-techniques-train.csv')
Search: 資料前處理 python (data preprocessing python)
https://ithelp.ithome.com.tw › articles
[Day12] How does a Python program carry out each step of data preprocessing?
https://ithelp.ithome.com.tw › articles
[Data Analysis & Machine Learning] Lecture 2.4: Data preprocessing (Missing data, One ...
https://medium.com › jameslearningnote › 資料分析-機...
Things you must do in data preprocessing - data cleaning and data type adjustment
https://blog.v123582.tw › 2020/12/04 › 資料前處理必...
Search: 資料清理 python (data cleaning python)
[Pandas Tutorial] Essential concepts for data cleaning with the Pandas package (Part 1)
https://www.learncodewithmike.com › 2021/03 › panda...
Cleaning Data in Python: an easy way to get started with data cleansing
https://chriskang028.medium.com › datacamp-cleaning...
Day 25 [Python ML, data cleaning] Handling missing values - iT 邦幫忙
https://ithelp.ithome.com.tw › articles
Search: python 資料整合 (python data integration)
[Day12]Learning Pandas - Merging data - iT 邦幫忙
https://ithelp.ithome.com.tw › articles
Search: 資料轉換 Data Transformation python (data transformation python)
[Data Science] - Normalization and standardization of data
https://aifreeblog.herokuapp.com › data_science_203
Search: 資料轉換 分布 python (data transformation distribution python)
Python data analysis - missing value handling, outliers, normality, homoscedasticity
https://medium.com › python-資料分析-24f7e826fca9
Search: 每日一課 Kaggle 練習講解 (a lesson a day, Kaggle exercise walkthrough)
A Lesson a Day: Kaggle exercise walkthrough - Zhihu
https://zhuanlan.zhihu.com › ...
Search: Kaggle 練習講解 House Prices 上 下 (Kaggle exercise walkthrough, House Prices, parts 1 and 2)
A Lesson a Day: Kaggle exercise walkthrough: House Prices (Part 1)