晶文摘

[Big Data] Data Preprocessing in Practice






Author: 夏肇毅

First draft: 20220819





Data preprocessing steps:

Data cleaning (also called data cleansing)

Data integration

Data transformation


Data cleaning steps:

Missing values: entries that have no content.

Outliers (extreme values): values that are not plausible.

Handling: fill in the median or a random value drawn from the same distribution, or delete the record outright.
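The three handling options above can be sketched as follows. This is a minimal illustration on a made-up column; the data, the variable names, and the use of resampling from the observed values as the "same distribution" are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0], name="value")

# Option 1: fill with the median of the observed values.
filled_median = s.fillna(s.median())

# Option 2: fill with random draws from the observed (empirical) distribution.
rng = np.random.default_rng(0)
observed = s.dropna().to_numpy()
draws = rng.choice(observed, size=int(s.isna().sum()))
filled_random = s.copy()
filled_random[s.isna()] = draws

# Option 3: drop the entries that are missing.
dropped = s.dropna()
```

Resampling observed values keeps the filled column's spread close to the original, whereas a median fill concentrates the filled entries at one point.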


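Common ways to handle missing values:

Treat missing values as a category of their own, for example 'None' for a categorical feature.

Fill missing values with a specific statistic such as the mean, median, or mode.

Fill missing values with a prediction from a fitted function (model).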
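A small sketch of these three strategies, under stated assumptions: the column names and values are invented, and a straight-line fit with `np.polyfit` stands in for whatever prediction function one would actually use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "garage": ["Attached", None, "Detached", "Attached", None],
    "area":   [50.0, 60.0, 70.0, 80.0, 90.0],
    "price":  [100.0, 120.0, np.nan, 160.0, 180.0],
})

# 1) Treat "missing" as its own category.
df["garage_own_class"] = df["garage"].fillna("None")

# 2) Fill with a statistic, here the mode (most frequent value).
mode_value = df["garage"].mode()[0]
df["garage_mode"] = df["garage"].fillna(mode_value)

# 3) Fill with a prediction: fit price ~ area on the complete rows,
#    then use the fitted line where price is missing.
known = df["price"].notna()
slope, intercept = np.polyfit(df.loc[known, "area"], df.loc[known, "price"], deg=1)
df["price_pred"] = df["price"].fillna(slope * df["area"] + intercept)
```

Which strategy fits best depends on why the value is missing: a separate category suits cases where absence itself carries meaning, such as a house with no garage.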


Data cleaning methods:

Pandas DataFrames - W3Schools


Fill with the string 'None': fillna('None',inplace=True) Pandas DataFrame fillna() Method - W3Schools

Fill with 0: fillna(0,inplace=True)

Fill with the mode: fillna(mode_value)

Replace with the median: transform( lambda x: x.fillna(x.median())) Pandas DataFrame median() Method - W3Schools
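The `transform(lambda x: x.fillna(x.median()))` fragment is typically applied per group. The sketch below assumes grouping by a `Neighborhood` column and filling `LotFrontage`; both column names and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage":  [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
})

# Fill each missing LotFrontage with the median of its own neighborhood.
df["LotFrontage"] = (
    df.groupby("Neighborhood")["LotFrontage"]
      .transform(lambda x: x.fillna(x.median()))
)
```

`transform` returns a result aligned with the original index, so it can be assigned straight back to the column.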

Remove outliers: Pandas DataFrame dropna() Method - W3Schools

train_df.drop(train_df[(train_df['GrLivArea']>4000)&(train_df['GrLivArea']<30000)].index,inplace=True)
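The same kind of row removal can be written with a boolean mask that returns a new frame instead of modifying in place. In this sketch the toy data, both thresholds, and the second condition on `SalePrice` are assumptions for illustration; the line above filters on `GrLivArea` only.

```python
import pandas as pd

# Hypothetical toy data using two House Prices column names.
train_df = pd.DataFrame({
    "GrLivArea": [1500, 2000, 4700, 5600, 4300],
    "SalePrice": [180000, 250000, 185000, 160000, 750000],
})

# Rows treated as outliers here: very large living area but a low price.
# Both thresholds are illustrative assumptions.
is_outlier = (train_df["GrLivArea"] > 4000) & (train_df["SalePrice"] < 300000)

# Keep everything else and renumber the rows.
cleaned = train_df[~is_outlier].reset_index(drop=True)
```

Keeping the mask in a named variable makes it easy to inspect which rows will be dropped before they are removed.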



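Data merging methods:

DataFrames Reference

Concat: pandas.concat concatenates objects

Append: append() adds rows to the end (removed in pandas 2.0; use pandas.concat instead)

Merge: merge() merges DataFrame objects

Join: join() joins the columns of another DataFrame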
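A minimal sketch of how these methods differ, using two invented frames that share an `Id` key (the frames and their values are assumptions):

```python
import pandas as pd

# Hypothetical frames sharing an "Id" key.
left = pd.DataFrame({"Id": [1, 2, 3], "GrLivArea": [1500, 2000, 1200]})
right = pd.DataFrame({"Id": [2, 3, 4], "SalePrice": [250000, 140000, 300000]})

# concat: stack rows end to end (this also covers what append() used to do).
stacked = pd.concat([left, left], ignore_index=True)

# merge: SQL-style join on a key column (inner join by default).
merged = left.merge(right, on="Id")

# join: add the other frame's columns, matching on the index (left join by default).
joined = left.set_index("Id").join(right.set_index("Id"))
```

Here `merged` keeps only the Ids present in both frames, while `joined` keeps every row of `left` and leaves `SalePrice` missing where `right` has no match.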


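Data transformation methods:


Feature engineering: apply feature transformations to the data:

For categorical data, the usual approach is LabelEncoder, which turns each category into a numeric value;

one-hot encoding, which produces a sparse matrix, is also an option.

For numeric data, transform it toward a normal distribution as far as possible.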
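Both encodings can be sketched with pandas alone. Note the substitution: category codes stand in here for scikit-learn's LabelEncoder, and the `Street` column and its values are assumptions.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Label encoding: map each category to an integer code.
# Categories are sorted, so "Grvl" -> 0 and "Pave" -> 1.
df["Street_code"] = df["Street"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Street"], prefix="Street")
```

Integer codes imply an ordering between categories that may not exist, which is one reason one-hot columns are often preferred for unordered features.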


Logarithmic transform: np.log() ufunc Logs

Exponential transform: np.exp()

Power transform: **0.5, **2
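To see why a log transform helps with right-skewed data, the sketch below applies the three transforms to a synthetic sample. The log-normal sample is an assumption standing in for a price-like column.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), standing in for a price column.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=1000))

log_price = np.log(price)      # logarithmic transform
restored = np.exp(log_price)   # exponential transform undoes the log
sqrt_price = price ** 0.5      # power transform (square root)

print("skew before:", price.skew())
print("skew after log:", log_price.skew())
```

The log of a log-normal sample is normally distributed, so its skewness should come out close to zero, and `np.exp` recovers the original values when predictions need to be reported on the original scale.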


Normality: normality and homoscedasticity

Relationship with numerical variables

Relationship with categorical features


Commonly used modules:


import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import numpy as np

from scipy import stats


#read a CSV file with pandas

df_train = pd.read_csv('filename')

 

#show the column names

df_train.columns


#descriptive statistics summary

df_train['SalePrice'].describe()


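#histogram

#sns.distplot has been deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot

sns.histplot(df_train['SalePrice'], kde=True);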


#skewness and kurtosis

print("Skewness: %f" % df_train['SalePrice'].skew())

print("Kurtosis: %f" % df_train['SalePrice'].kurt())


#scatter plot grlivarea/saleprice

data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)

data.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0,800000));


#box plot overallqual/saleprice

data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)

fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)


#correlation matrix heatmap

#numeric_only=True (pandas 1.5+) skips the text columns

corrmat = df_train.corr(numeric_only=True)

sns.heatmap(corrmat, vmax=.8, square=True);


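#scatterplot matrix (pair plot)

#cols is an example selection of columns; the 'size' argument was renamed to 'height' in seaborn 0.9

cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'TotalBsmtSF']

sns.pairplot(df_train[cols], height = 2.5)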


#count missing data

total = df_train.isnull().sum().sort_values(ascending=False)

percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

 

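#dealing with missing data

#drop every column with more than one missing value (missing_data is the Total/Percent table of missing counts per column)

df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)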

 

#normal probability plot

res = stats.probplot(df_train['SalePrice'], plot=plt)

 

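#transform data

#flag houses that have a basement, then log-transform only those rows (log(0) is undefined)

df_train['HasBsmt'] = (df_train['TotalBsmtSF'] > 0).astype(int)

df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'])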


#convert categorical variables into dummy variables

df_train = pd.get_dummies(df_train)



Example:


After registering, download the data: Comprehensive data exploration with Python | Kaggle

https://www.kaggle.com › code › pmarcelino › comprehe…

Download comprehensive-data-exploration-with-python.ipynb

Save it to your own Google Drive

After registering, download the data: Kaggle: House Prices: Advanced Regression Techniques

Save it to your own Google Drive

Modify the notebook to read the data from Google Drive:

from google.colab import drive

drive.mount('/drive', force_remount=True)


train_df = pd.read_csv('/drive/MyDrive/Colab Notebooks/Training/house-prices-advanced-regression-techniques-train.csv')






Search:  ¸ê®Æ«e³B²z python

 

¸ê®Æ«e³B²z - iT ¨¹À°¦£

https://ithelp.ithome.com.tw › articles

¸ê®Æ«e³B²z(Äò)

 

[Day12] Pythonµ{¦¡¦p¦ó°µ¨ì¸ê®Æ«e³B²zªº¦U­Ó¨BÆJ¡H

https://ithelp.ithome.com.tw › articles

 

[¸ê®Æ¤ÀªR&¾÷¾¹¾Ç²ß] ²Ä2.4Á¿¡G¸ê®Æ«e³B²z(Missing data, One ...

https://medium.com › jameslearningnote › ¸ê®Æ¤ÀªR-¾÷...


 

¸ê®Æ«e³B²z¥²¶·­n°µªº¨Æ- ¸ê®Æ²M²z»P«¬ºA½Õ¾ã

https://blog.v123582.tw › 2020/12/04 › ¸ê®Æ«e³B²z¥²...

 



Search:  ¸ê®Æ²M²z python



[Pandas±Ð¾Ç]¨Ï¥ÎPandas®M¥ó¹ê§@¸ê®Æ²M²zªº¥²³ÆÆ[©À(¤W)

https://www.learncodewithmike.com › 2021/03 › panda...

 

 

Cleaning Data in Python ¦p¦ó²³æ¤W¤â¸ê®Æ²M¬~

https://chriskang028.medium.com › datacamp-cleaning...

 

 

Day 25 [Python ML¡B¸ê®Æ²M²z] ³B²z¿ò¥¢­È - iT ¨¹À°¦£

https://ithelp.ithome.com.tw › articles

 



Search: python 資料整合 (python data integration)

[Day12]Learning Pandas - 資料合併 (Merging data) - iT 邦幫忙

https://ithelp.ithome.com.tw › articles



 

Search: 資料轉換 Data Transformation python

【資料科學】 - 資料的正規化與標準化 ([Data Science] Normalization and standardization of data)

https://aifreeblog.herokuapp.com › data_science_203



Search: 資料轉換 分布 python (data transformation distribution python)

Python 資料分析- 遺漏值處理、離群值、常態性、均方差性 (Python data analysis: missing-value handling, outliers, normality, homoscedasticity)

https://medium.com › python-資料分析-24f7e826fca9

 


 

Search: 每日一課 Kaggle 練習講解 (a lesson a day: Kaggle exercise walkthrough)

每日一課 Kaggle 練習講解 (A lesson a day: Kaggle exercise walkthrough) - 知乎

https://zhuanlan.zhihu.com › ...

 

 

Search:  Kaggle ½m²ßÁ¿¸Ñ House Prices  ¤W ¤U

 

¨C¤é¤@½ÒKaggle ½m²ßÁ¿¸Ñ¡GHouse Prices(¤W)

https://kknews.cc › µ{¦¡¶}µo

 

¨C¤é¤@½ÒKaggle ½m²ßÁ¿¸Ñ¡GHouse Prices(¤U)

https://kknews.cc › µ{¦¡¶}µo