
Mock Kaggle Competition [EXP_6]

Mo_bi!e 2021. 10. 12. 22:35

My First Kaggle Competition: Just Follow Along

1. Submit First, Then Improve! The Baseline Model

1. Setting Up the Baseline

In [1]:

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

2. Importing Libraries and Loading the Data

Import the required libraries.

In [3]:

import warnings
warnings.filterwarnings("ignore")

import os
from os.path import join

import pandas as pd
import numpy as np

import missingno as msno

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
import xgboost as xgb
import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns

print('얍💢')

얍💢

Set the data paths. Note that our file paths differ from those in the Baseline kernel.

In [4]:

data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

train_data_path = join(data_dir, 'train.csv')
sub_data_path = join(data_dir, 'test.csv')      # test data, i.e. the data used for the submission

print(train_data_path)
print(sub_data_path)

/aiffel/aiffel/kaggle_kakr_housing/data/train.csv
/aiffel/aiffel/kaggle_kakr_housing/data/test.csv

3. Understanding the Data

Load the data.

In [5]:

data = pd.read_csv(train_data_path)
sub = pd.read_csv(sub_data_path)
print('train data dim : {}'.format(data.shape))
print('sub data dim : {}'.format(sub.shape))

train data dim : (15035, 21)
sub data dim : (6468, 20)

Remove the label from the training data.

In [6]:

y = data['price']
del data['price']

print(data.columns)

Index(['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

Concatenate the training and test data.

In [7]:

train_len = len(data)
data = pd.concat((data, sub), axis=0)

print(len(data))

In [8]:

data.head()

Out[8]:

	id	date	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	0	20141013T000000	3	1.00	1180	5650	1.0	3	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	1	20150225T000000	2	1.00	770	10000	1.0	3	6	770	1933	98028	47.7379	-122.233	2720	8062
2	2	20150218T000000	3	2.00	1680	8080	1.0	3	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	3	20140627T000000	3	2.25	1715	6819	2.0	3	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	4	20150115T000000	3	1.50	1060	9711	1.0	3	7	1060	1963	98198	47.4095	-122.315	1650	9711

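Simple preprocessing

We check for missing values, i.e. empty cells, with the missingno library imported above.

If a particular row or column contained NaN, that spot would show up white. If there are no missing values, the whole matrix is drawn black.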

In [9]:

msno.matrix(data)

Out[9]:

<AxesSubplot:>

In [11]:

# 1. Check whether the id column has missing values.
null_check = pd.isnull(data['id'])
print(null_check)

0       False
1       False
2       False
3       False
4       False
        ...  
6463    False
6464    False
6465    False
6466    False
6467    False
Name: id, Length: 21503, dtype: bool

In [12]:

# 2. Select only the rows with missing values.
null_data = data.loc[null_check, 'id']
null_data.head()

Out[12]:

Series([], Name: id, dtype: int64)

In [13]:

# 3. Count the missing values.
print('{}: {}'.format('id', len(null_data.values)))

id: 0

In [15]:

# All in one line!
print('{} : {}'.format('id', len(data.loc[pd.isnull(data['id']), 'id'].values)))

id : 0

This has to be applied to every column, not just id, so a for loop finishes the job.

In [16]:

for c in data.columns:
    print('{} : {}'.format(c, len(data.loc[pd.isnull(data[c]), c].values)))

id : 0
date : 0
bedrooms : 0
bathrooms : 0
sqft_living : 0
sqft_lot : 0
floors : 0
waterfront : 0
view : 0
condition : 0
grade : 0
sqft_above : 0
sqft_basement : 0
yr_built : 0
yr_renovated : 0
zipcode : 0
lat : 0
long : 0
sqft_living15 : 0
sqft_lot15 : 0
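
The same per-column count can also be obtained without an explicit loop. The sketch below uses a small made-up frame (`toy`, not the housing data) to show that `DataFrame.isnull().sum()` produces the same numbers as the loop version.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; not the competition data.
toy = pd.DataFrame({
    "bedrooms": [3, 2, np.nan, 4],
    "sqft_living": [1180, np.nan, np.nan, 1715],
    "grade": [7, 6, 8, 7],
})

# Loop version, same idea as in the cell above.
loop_counts = {c: len(toy.loc[pd.isnull(toy[c]), c].values) for c in toy.columns}

# Vectorized version: isnull() marks NaN cells, sum() counts them per column.
null_counts = toy.isnull().sum()

print(null_counts)
```

For a quick check on a real frame, `data.isnull().sum().sum()` collapses everything into a single total.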

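Cleaning up the id and date variables

Remove the id column, which the model does not need. To prepare for submitting predictions later, store the test-set ids in sub_id before deleting the column.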

In [17]:

sub_id = data['id'][train_len:]
del data['id']

print(data.columns)

Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

The date column is trimmed with apply to the part we need: the first six characters, i.e. year and month (YYYYMM).

In [19]:

data['date'] = data['date'].apply(lambda x : str(x[:6]))

data.head()

Out[19]:

	date	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	201410	3	1.00	1180	5650	1.0	3	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	201502	2	1.00	770	10000	1.0	3	6	770	1933	98028	47.7379	-122.233	2720	8062
2	201502	3	2.00	1680	8080	1.0	3	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	201406	3	2.25	1715	6819	2.0	3	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	201501	3	1.50	1060	9711	1.0	3	7	1060	1963	98198	47.4095	-122.315	1650	9711
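
As a side note, the same slicing can be written with the pandas string accessor instead of `apply`. This is a minimal sketch on two sample strings in the same layout as the date column; the variable names are made up for the example.

```python
import pandas as pd

# Two sample values in the same 'YYYYMMDDT000000' layout as the date column.
dates = pd.Series(["20141013T000000", "20150225T000000"])

# .str[:6] slices every string at once, equivalent to apply(lambda x: x[:6]).
year_month = dates.str[:6]

# Casting to int makes the value usable as a numeric feature.
year_month_int = year_month.astype(int)

print(year_month.tolist())
print(year_month_int.tolist())
```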

Checking the distribution of each variable

In [21]:

fig, ax = plt.subplots(9, 2, figsize=(12, 50))   # if horizontal scrolling makes the plots hard to read, adjust the x value of figsize

# Skip the date column (index 0) and plot the remaining 18 columns.
# The id column has already been deleted above.
count = 1
columns = data.columns

# Nested for loops walk over the 9 x 2 grid of subplots.
for row in range(9):
    for col in range(2):
        # kdeplot draws a smooth curve of the overall distribution,
        # even for discrete data.
        sns.kdeplot(data[columns[count]], ax=ax[row][col])
        ax[row][col].set_title(columns[count], fontsize=15)
        count += 1
        if count == 19:  # all columns have been plotted
            break

Put the skewed columns in the skew_columns list and log-transform all of them with np.log1p().

In [24]:

# bedrooms, sqft_living, sqft_lot, sqft_above, sqft_basement, sqft_living15 and sqft_lot15 are skewed to one side
skew_columns = ['bedrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_lot15', 'sqft_living15']

for c in skew_columns:
    # numpy.log1p() returns the natural logarithm log(1 + x)
    # of each element of the input array.
    data[c] = np.log1p(data[c].values)

print('얍💢')

얍💢

Check the distributions once more after the transformation.

In [25]:

fig, ax = plt.subplots(4, 2, figsize=(12, 24))

count = 0
for row in range(4):
    for col in range(2):
        if count == 7:
            break
        sns.kdeplot(data[skew_columns[count]], ax=ax[row][col])
        ax[row][col].set_title(skew_columns[count], fontsize=15)
        count += 1

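Why does a log transform reduce the skew of a distribution?

The log function maps large values of x into a comparatively narrow range of y, while values of x near zero are spread far apart.

Because of this property, applying a log transform to a distribution that is bunched up on one side spreads it out more evenly.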

In [28]:

xx = np.linspace(0, 10, 500)
yy = np.log(xx)

plt.hlines(0, 0, 10)
plt.vlines(0, -5, 5)
plt.plot(xx, yy, c='r')
plt.show()
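
The effect can also be checked numerically. The following is a rough sketch on synthetic right-skewed data (a log-normal sample, which is an assumption for illustration and not the housing data), comparing the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed data (log-normal); an assumption for illustration.
values = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=10_000))

# Series.skew() is the sample skewness; 0 means a symmetric distribution.
skew_before = values.skew()
skew_after = np.log1p(values).skew()

print(f"skew before log1p: {skew_before:.2f}")
print(f"skew after  log1p: {skew_after:.2f}")
```

A skewness close to zero after the transform is what the KDE plots show visually.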

Let's guess what happens when the distribution of price (stored in y) is log-transformed.

In [30]:

# Original distribution of price

sns.kdeplot(y)
plt.show()

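Log-transforming the distribution above

If we log-transform this distribution, the data bunched up near zero will spread out, and the few very large values will be pulled into a small range of y.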

In [31]:

y_log_transformation = np.log1p(y)

sns.kdeplot(y_log_transformation)
plt.show()

Using train_len as the index, the rows in :train_len are the training data and go into x, while the rows in train_len: are the test data we actually need to predict and go into sub.

In [32]:

sub = data.iloc[train_len:, :]
x = data.iloc[:train_len, :]

print(x.shape)
print(sub.shape)

(15035, 19)
(6468, 19)
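
The concatenate, preprocess, split pattern used above can be sketched on a toy example. The frames and the extra column below are made up for illustration; the point is only that `iloc` with `train_len` recovers the two parts by position.

```python
import pandas as pd

# Toy stand-ins for the train and test features (illustrative values).
train_x = pd.DataFrame({"sqft_living": [1180, 770, 1680]})
test_x = pd.DataFrame({"sqft_living": [1715, 1060]})

train_len = len(train_x)
combined = pd.concat((train_x, test_x), axis=0)

# Any preprocessing applied here affects both parts identically.
combined["sqft_living_sq"] = combined["sqft_living"] ** 2

# Positional slicing with iloc recovers the two parts, regardless of the
# duplicated index labels that concat leaves behind.
x_part = combined.iloc[:train_len, :]
sub_part = combined.iloc[train_len:, :]

print(x_part.shape, sub_part.shape)
```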

4. Designing the Model

Modeling

We use a technique called blending.

Blending means combining the results of several different models instead of relying on a single model. It is also referred to as an ensemble technique.

Average Blending

We use three boosting models: gboost, xgboost, and lightgbm.

In [33]:

gboost = GradientBoostingRegressor(random_state=2019)
xgboost = xgb.XGBRegressor(random_state=2019)
lightgbm = lgb.LGBMRegressor(random_state=2019)

models = [{'model':gboost, 'name':'GradientBoosting'}, {'model':xgboost, 'name':'XGBoost'},
          {'model':lightgbm, 'name':'LightGBM'}]

print('얍💢')

얍💢

Cross Validation

Quickly evaluate the models with cross validation.

In [34]:

def get_cv_score(models):
    # 5-fold CV without shuffling. Recent scikit-learn versions raise an error
    # when random_state is set while shuffle=False, so it is omitted here.
    kfold = KFold(n_splits=5)
    for m in models:
        CV_score = np.mean(cross_val_score(m['model'], X=x.values, y=y, cv=kfold))
        print(f"Model: {m['name']}, CV score:{CV_score:.4f}")
print('얍💢')

얍💢

When given a regression model, cross_val_score() returns the R^2 score by default. The closer R^2 is to 1, the better the model explains the held-out folds.

In [35]:

get_cv_score(models)

Model: GradientBoosting, CV score:0.8598
Model: XGBoost, CV score:0.8860
Model: LightGBM, CV score:0.8819
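
To make the default metric visible, here is a small sketch on synthetic data (an assumption for illustration, with `LinearRegression` as a stand-in model) showing that omitting `scoring` and passing `scoring='r2'` give the same numbers. It also shows a shuffled `KFold`, where `random_state` does take effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration).
X = rng.normal(size=(200, 3))
y_toy = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# shuffle=True is required for random_state to take effect in KFold.
kfold = KFold(n_splits=5, shuffle=True, random_state=2019)

# With no scoring argument, a regressor is scored with its own .score(), i.e. R^2.
default_scores = cross_val_score(LinearRegression(), X, y_toy, cv=kfold)

# Passing scoring='r2' makes the metric explicit and gives the same numbers.
r2_scores = cross_val_score(LinearRegression(), X, y_toy, cv=kfold, scoring="r2")

print(default_scores.mean(), r2_scores.mean())
```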

Make Submission File

AveragingBlending() trains every model in the models list on x and y, collects their predictions in predictions, and returns the average of those predictions.

In [36]:

def AveragingBlending(models, x, y, sub_x):
    for m in models : 
        m['model'].fit(x.values, y)
    
    predictions = np.column_stack([
        m['model'].predict(sub_x.values) for m in models
    ])
    return np.mean(predictions, axis=1)

print('얍💢')

얍💢

In [37]:

y_pred = AveragingBlending(models, x, y, sub)
print(len(y_pred))
y_pred

Out[37]:

array([ 529966.66304912,  430726.21272617, 1361676.91242777, ...,
        452081.69137012,  341572.97685942,  421725.1231835 ])
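
The averaging step itself is plain NumPy. The sketch below uses hypothetical predictions from three models for four houses to show what `np.column_stack` followed by a mean over `axis=1` does.

```python
import numpy as np

# Hypothetical predictions from three models for four houses.
pred_a = np.array([500_000.0, 420_000.0, 1_300_000.0, 310_000.0])
pred_b = np.array([530_000.0, 440_000.0, 1_400_000.0, 330_000.0])
pred_c = np.array([560_000.0, 430_000.0, 1_350_000.0, 320_000.0])

# One column per model, one row per house.
predictions = np.column_stack([pred_a, pred_b, pred_c])

# axis=1 averages across models, leaving one blended value per house.
blended = predictions.mean(axis=1)

print(predictions.shape)
print(blended)
```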

What do we need to do to submit the result to Kaggle? [Example]

In [38]:

data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

submission_path = join(data_dir, 'sample_submission.csv')
submission = pd.read_csv(submission_path)
submission.head()

Out[38]:

	id	price
0	15035	100000
1	15036	100000
2	15037	100000
3	15038	100000
4	15039	100000

In [39]:

result = pd.DataFrame({
    'id' : sub_id, 
    'price' : y_pred
})

result.head()

Out[39]:

	id	price
0	15035	5.299667e+05
1	15036	4.307262e+05
2	15037	1.361677e+06
3	15038	3.338036e+05
4	15039	3.089006e+05

All that is left is to submit. Save the submission.csv file with the following code.

In [40]:

my_submission_path = join(data_dir, 'submission.csv')
result.to_csv(my_submission_path, index=False)

print(my_submission_path)

/aiffel/aiffel/kaggle_kakr_housing/data/submission.csv

Submitting the first result to Kaggle

2. Want to Climb the Rankings?

1. Once More, Preparing the Data My Own Way

In [43]:

# Load the data again
data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

train_data_path = join(data_dir, 'train.csv')
test_data_path = join(data_dir, 'test.csv') 

train = pd.read_csv(train_data_path)
test = pd.read_csv(test_data_path)

print('얍💢')

얍💢

In [44]:

train.head()

Out[44]:

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	0	20141013T000000	221900.0	3	1.00	1180	5650	1.0	...	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	1	20150225T000000	180000.0	2	1.00	770	10000	1.0	...	6	770	1933	98028	47.7379	-122.233	2720	8062
2	2	20150218T000000	510000.0	3	2.00	1680	8080	1.0	...	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	3	20140627T000000	257500.0	3	2.25	1715	6819	2.0	...	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	4	20150115T000000	291850.0	3	1.50	1060	9711	1.0	...	7	1060	1963	98198	47.4095	-122.315	1650	9711

5 rows × 21 columns

This time we convert date to int, i.e. integer data. That way the model can use date as a feature for prediction.

In [45]:

train['date'] = train['date'].apply(lambda i: i[:6]).astype(int)
train.head()

Out[45]:

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	yr_built	zipcode	lat	long	sqft_living15	sqft_lot15
0	0	201410	221900.0	3	1.00	1180	5650	1.0	...	7	1180	1955	98178	47.5112	-122.257	1340	5650
1	1	201502	180000.0	2	1.00	770	10000	1.0	...	6	770	1933	98028	47.7379	-122.233	2720	8062
2	2	201502	510000.0	3	2.00	1680	8080	1.0	...	8	1680	1987	98074	47.6168	-122.045	1800	7503
3	3	201406	257500.0	3	2.25	1715	6819	2.0	...	7	1715	1995	98003	47.3097	-122.327	2238	6819
4	4	201501	291850.0	3	1.50	1060	9711	1.0	...	7	1060	1963	98198	47.4095	-122.315	1650	9711

5 rows × 21 columns

price is the target column. Store price in y and delete it from train.

In [46]:

y = train['price']
del train['price']

print(train.columns)

Index(['id', 'date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

Deleting the id column completes the basic preprocessing.

In [47]:

del train['id']

print(train.columns)

Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

Do the same for the test data.

The test set has no price column, so unlike the training set it needs no price handling.

In [48]:

test['date'] = test['date'].apply(lambda i: i[:6]).astype(int)

del test['id']

print(test.columns)

Index(['date', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [49]:

# check y
y

Out[49]:

0         221900.0
1         180000.0
2         510000.0
3         257500.0
4         291850.0
           ...    
15030     610685.0
15031    1007500.0
15032     360000.0
15033     400000.0
15034     325000.0
Name: price, Length: 15035, dtype: float64

Write the code yourself to check the distribution of the price data.

In [54]:

"""
Use seaborn's `kdeplot` to check the distribution of `y`!
"""

# Write the code
sns.kdeplot(y)
plt.show()

After applying np.log1p, the distribution is transformed nicely into a fairly smooth, roughly normal shape.

In [55]:

y = np.log1p(y)
sns.kdeplot(y)
plt.show()

Check the dtypes of the whole dataset at a glance with info().

In [56]:

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15035 entries, 0 to 15034
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           15035 non-null  int64  
 1   bedrooms       15035 non-null  int64  
 2   bathrooms      15035 non-null  float64
 3   sqft_living    15035 non-null  int64  
 4   sqft_lot       15035 non-null  int64  
 5   floors         15035 non-null  float64
 6   waterfront     15035 non-null  int64  
 7   view           15035 non-null  int64  
 8   condition      15035 non-null  int64  
 9   grade          15035 non-null  int64  
 10  sqft_above     15035 non-null  int64  
 11  sqft_basement  15035 non-null  int64  
 12  yr_built       15035 non-null  int64  
 13  yr_renovated   15035 non-null  int64  
 14  zipcode        15035 non-null  int64  
 15  lat            15035 non-null  float64
 16  long           15035 non-null  float64
 17  sqft_living15  15035 non-null  int64  
 18  sqft_lot15     15035 non-null  int64  
dtypes: float64(4), int64(15)
memory usage: 2.2 MB

2. Wrap Things in Functions for Easier Experiments

Computing RMSE

In [57]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

print('얍💢')

얍💢

y_test and y_pred were transformed with np.log1p() above, so np.expm1() is applied to bring them back to the original price scale.

Taking np.sqrt of the mean_squared_error computed on the back-transformed values gives the RMSE.

In [58]:

def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_error(np.expm1(y_test), np.expm1(y_pred)))

print('얍💢')

얍💢
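
To see why the back-transform matters, the sketch below compares the error measured on the log scale with the error measured after `np.expm1`. The prices are hypothetical values chosen for illustration, and the RMSE is written out with NumPy only so the arithmetic is visible.

```python
import numpy as np

# Hypothetical prices and predictions on the original scale.
price_true = np.array([200_000.0, 500_000.0, 1_000_000.0])
price_pred = np.array([220_000.0, 450_000.0, 1_100_000.0])

# The model works on log1p-transformed targets.
log_true = np.log1p(price_true)
log_pred = np.log1p(price_pred)

# Error measured directly on the log scale (unitless, small numbers).
rmse_log_scale = np.sqrt(np.mean((log_true - log_pred) ** 2))

# Error after undoing the transform with expm1 (in price units).
rmse_price_scale = np.sqrt(np.mean((np.expm1(log_true) - np.expm1(log_pred)) ** 2))

print(rmse_log_scale, rmse_price_scale)
```

The two numbers differ by several orders of magnitude, which is why scores from different scales should not be compared directly.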

We bring in four models: XGBRegressor, LGBMRegressor, GradientBoostingRegressor, and RandomForestRegressor.

In [60]:

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

print('얍💢')

얍💢

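random_state, the random seed used when initializing model parameters or splitting datasets, can either be fixed to a specific value or left unset as None.

With a fixed random_state, training results are reproduced identically as long as the model and the dataset are the same.

Starting from the baseline, we need to verify through experiments whether performance improves. To judge whether a given change actually helped, the uncertainty caused by random variation has to be removed.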

In [61]:

# random_state is the random seed used for model initialization and dataset splitting.
#random_state=None    # This is the default. If None is passed, the model picks a seed internally.
random_state=2020        # Here we fix it to a constant instead.

gboost = GradientBoostingRegressor(random_state=random_state)
xgboost = XGBRegressor(random_state=random_state)
lightgbm = LGBMRegressor(random_state=random_state)
rdforest = RandomForestRegressor(random_state=random_state)

models = [gboost, xgboost, lightgbm, rdforest]

print('얍💢')

얍💢
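
The effect of a fixed seed is easy to observe on the data split alone. This is a minimal sketch with a toy array (`data_toy` is made up for the example): two calls with the same `random_state` return the same test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data_toy = np.arange(20)

# Same seed, same split.
_, test_a = train_test_split(data_toy, test_size=0.2, random_state=2020)
_, test_b = train_test_split(data_toy, test_size=0.2, random_state=2020)

# A different seed generally gives a different split.
_, test_c = train_test_split(data_toy, test_size=0.2, random_state=7)

print(test_a, test_b, test_c)
```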

The name of each model can be obtained from the __name__ attribute of its class, as follows.

In [62]:

gboost.__class__.__name__

Out[62]:

'GradientBoostingRegressor'

Since we can get the names, we can train and predict with each model inside a for loop, as follows.

In [63]:

df = {}

# Split into train and test sets once; random_state is fixed here as well.
X_train, X_test, y_train, y_test = train_test_split(train, y, random_state=random_state, test_size=0.2)

for model in models:
    # get the model name
    model_name = model.__class__.__name__

    # train the model
    model.fit(X_train, y_train)

    # predict
    y_pred = model.predict(X_test)

    # store the RMSE of the predictions
    df[model_name] = rmse(y_test, y_pred)

# collect the scores in a data frame, worst (largest RMSE) first
score_df = pd.DataFrame(df, index=['RMSE']).T.sort_values('RMSE', ascending=False)
    
df

Out[63]:

{'GradientBoostingRegressor': 128360.19649691365,
 'XGBRegressor': 110318.66956616656,
 'LGBMRegressor': 111920.36735892233,
 'RandomForestRegressor': 125487.07102453562}

Let's turn the process above into a get_scores(models, train, y) function.

In [65]:

def get_scores(models, train, y):
    # Collect the RMSE of each model by name.
    df = {}

    # Split into train and test sets once; random_state is fixed here as well.
    X_train, X_test, y_train, y_test = train_test_split(train, y, random_state=random_state, test_size=0.2)

    for model in models:
        # get the model name
        model_name = model.__class__.__name__

        # train the model
        model.fit(X_train, y_train)

        # predict
        y_pred = model.predict(X_test)

        # store the RMSE of the predictions
        df[model_name] = rmse(y_test, y_pred)

    # collect the scores in a data frame, worst (largest RMSE) first
    score_df = pd.DataFrame(df, index=['RMSE']).T.sort_values('RMSE', ascending=False)

    return score_df
    
get_scores(models, train, y)

Out[65]:

	RMSE
GradientBoostingRegressor	128360.196497
RandomForestRegressor	125487.071025
LGBMRegressor	111920.367359
XGBRegressor	110318.669566

3. The Champion of Hyperparameter Tuning: Grid Search

Experimenting with a variety of hyperparameters

For the experiments we use the GridSearchCV class from sklearn.model_selection.

In [66]:

from sklearn.model_selection import GridSearchCV

print('얍💢')

얍💢

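GridSearchCV takes a set of candidate parameter values and tries every possible combination.

It helps to understand both grid search and random search. Both are ways of trying out combinations of hyperparameters.

<Grid search>: a person first fixes the hyperparameter values to try, and every combination that can be formed from those values is searched. This is useful when you want to cover all combinations of specific values.

Because grid search only tries combinations of the values chosen in advance, it can miss the optimal combination.

<Random search>: a person defines only the space of hyperparameters to search, and combinations are drawn at random from within that space.

Because it samples at random, random search always has a chance of hitting the optimal combination. That chance also depends on randomness, though, so there is no guarantee that it finds the optimum.

The search space of grid search is very limited, whereas random search covers a much wider space.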
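
The difference between the two can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler`, which enumerate or sample candidate combinations without fitting anything. The ranges passed to `randint` below are assumptions picked for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values is tried.
grid = list(ParameterGrid({"n_estimators": [50, 100], "max_depth": [1, 10]}))
print(len(grid))
for combo in grid:
    print(combo)

# Random search: only the ranges are given, and n_iter combinations are drawn.
# randint(low, high) samples integers from low up to high - 1.
space = {"n_estimators": randint(50, 101), "max_depth": randint(1, 11)}
samples = list(ParameterSampler(space, n_iter=4, random_state=2020))
for combo in samples:
    print(combo)
```

`RandomizedSearchCV` applies the same sampling idea while also fitting and scoring each sampled combination.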

Prepare param_grid with the hyperparameters to search. These are parameters of the LightGBM model used here.

In [67]:

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [1, 10],
}

The model is LightGBM (lgbm).

In [68]:

model = LGBMRegressor(random_state=random_state)

print('얍💢')

얍💢

Initialize grid_model with GridSearchCV.

Fit it on the train and y data.

Once fitting finishes, every hyperparameter combination in param_grid has been tried.

In [69]:

grid_model = GridSearchCV(
    model,
    param_grid=param_grid,              # param_grid: the parameters to search (given as a dict)
    scoring='neg_mean_squared_error',   # scoring: the metric used to evaluate the model (negated MSE)
    cv=5,                               # cv: number of folds the train set is split into for cross validation
    verbose=1,                          # verbose: amount of progress messages (larger means more)
    n_jobs=5,                           # n_jobs: number of CPUs to use during the search
)
    

grid_model.fit(train, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  20 out of  20 | elapsed:    4.0s finished

[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).

Out[69]:

GridSearchCV(cv=5, estimator=LGBMRegressor(random_state=2020), n_jobs=5,
             param_grid={'max_depth': [1, 10], 'n_estimators': [50, 100]},
             scoring='neg_mean_squared_error', verbose=1)

The results of the experiment are stored in grid_model.cv_results_.

In [70]:

grid_model.cv_results_

Out[70]:

{'mean_fit_time': array([0.2509254 , 0.218397  , 0.49575438, 0.72538347]),
 'std_fit_time': array([0.06987448, 0.0140034 , 0.05982799, 0.08844025]),
 'mean_score_time': array([0.01850719, 0.02320142, 0.04260893, 0.04079165]),
 'std_score_time': array([0.01401909, 0.012861  , 0.0275384 , 0.01500111]),
 'param_max_depth': masked_array(data=[1, 1, 10, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[50, 100, 50, 100],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 1, 'n_estimators': 50},
  {'max_depth': 1, 'n_estimators': 100},
  {'max_depth': 10, 'n_estimators': 50},
  {'max_depth': 10, 'n_estimators': 100}],
 'split0_test_score': array([-0.0756974 , -0.05555652, -0.02885847, -0.02665428]),
 'split1_test_score': array([-0.07666447, -0.057876  , -0.03041465, -0.02795896]),
 'split2_test_score': array([-0.07354904, -0.05546079, -0.03068533, -0.02834112]),
 'split3_test_score': array([-0.07510863, -0.05582109, -0.02987609, -0.02774809]),
 'split4_test_score': array([-0.06595281, -0.05038773, -0.02605217, -0.02443328]),
 'mean_test_score': array([-0.07339447, -0.05502043, -0.02917734, -0.02702714]),
 'std_test_score': array([0.00385583, 0.00247946, 0.00168295, 0.00141292]),
 'rank_test_score': array([4, 3, 2, 1], dtype=int32)}

params holds each parameter combination, and score holds the score for each combination.

In [75]:

params = grid_model.cv_results_['params']
params

Out[75]:

[{'max_depth': 1, 'n_estimators': 50},
 {'max_depth': 1, 'n_estimators': 100},
 {'max_depth': 10, 'n_estimators': 50},
 {'max_depth': 10, 'n_estimators': 100}]

In [76]:

score = grid_model.cv_results_['mean_test_score']
score

Out[76]:

array([-0.07339447, -0.05502043, -0.02917734, -0.02702714])

In [86]:

results = pd.DataFrame(params)
results['score'] = score

results

Out[86]:

	max_depth	n_estimators	score
0	1	50	-0.073394
1	1	100	-0.055020
2	10	50	-0.029177
3	10	100	-0.027027

We are solving a regression problem, so we use a metric suited to it.

The scores are negative because we used neg_mean_squared_error.

To see RMSE, multiply the negated MSE by -1 and take the square root with np.sqrt.

In [94]:

results['RMSE'] = np.sqrt(-1 * results['score'])
results

Out[94]:

	max_depth	n_estimators	score	RMSE
0	1	50	-0.073394	0.270914
1	1	100	-0.055020	0.234564
2	10	50	-0.029177	0.170814
3	10	100	-0.027027	0.164399

Strictly speaking, the values in the data frame above are not RMSE but RMSLE, Root Mean Squared Log Error: the RMSE was computed on log-transformed prices.

In [95]:

results = results.rename(columns={'RMSE': 'RMSLE'})
results

Out[95]:

	max_depth	n_estimators	score	RMSLE
0	1	50	-0.073394	0.270914
1	1	100	-0.055020	0.234564
2	10	50	-0.029177	0.170814
3	10	100	-0.027027	0.164399
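
As a quick arithmetic check of the conversion, the sketch below takes the four `mean_test_score` values reported by `cv_results_` earlier and recomputes the RMSLE column by hand.

```python
import numpy as np

# mean_test_score values copied from cv_results_ above (negated MSE).
score = np.array([-0.07339447, -0.05502043, -0.02917734, -0.02702714])

# Flip the sign to get MSE on the log scale, then take the square root.
rmsle = np.sqrt(-1 * score)

print(np.round(rmsle, 6))
```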

All that is left is to sort by RMSLE in ascending order, using sort_values.

In [101]:

results =results.sort_values('RMSLE')
results

Out[101]:

	max_depth	n_estimators	score	RMSLE
3	10	100	-0.027027	0.164399
2	10	50	-0.029177	0.170814
1	1	100	-0.055020	0.234564
0	1	50	-0.073394	0.270914

In [107]:

"""
Implement `my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5)`, which performs the following steps.

1. Initialize a GridSearchCV model with `model`.
2. Fit the model.
3. Store the result for each combination in params and score.
4. Build a data frame, add the RMSLE values, and return `results` sorted by RMSLE in ascending order (best first).
"""

# Enter the code

def my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5):

    # 1. Initialize a GridSearchCV model with `model`.
    grid_model = GridSearchCV(
        model,
        param_grid=param_grid,              # the parameters to search (given as a dict)
        scoring='neg_mean_squared_error',   # the metric used to evaluate the model
        cv=5,                               # number of cross validation folds
        verbose=verbose,                    # amount of progress messages (larger means more)
        n_jobs=n_jobs,                      # number of CPUs to use during the search
    )

    # 2. Fit the model (the local grid_model, not a global one).
    grid_model.fit(train, y)

    # 3. Store the result for each combination in params and score.
    params = grid_model.cv_results_['params']
    score = grid_model.cv_results_['mean_test_score']

    # 4. Build a data frame, add the RMSLE values, and sort by RMSLE ascending.
    results = pd.DataFrame(params)
    results['score'] = score
    results['RMSLE'] = np.sqrt(-1 * results['score'])
    results = results.sort_values('RMSLE')

    return results



    
    
my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  20 out of  20 | elapsed:    3.9s finished

Out[107]:

	max_depth	n_estimators	score	RMSLE
3	10	100	-0.027027	0.164399
2	10	50	-0.029177	0.170814
1	1	100	-0.055020	0.234564
0	1	50	-0.073394	0.270914

4. Submitting, Quick and Clean!

The submission process can also be wrapped up neatly in a single function.

In [111]:

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [1, 10],
}

model = LGBMRegressor(random_state=random_state)
my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5)

Fitting 5 folds for each of 4 candidates, totalling 20 fits

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
[Parallel(n_jobs=5)]: Done  20 out of  20 | elapsed:    2.0s finished

Out[111]:

	max_depth	n_estimators	score	RMSLE
3	10	100	-0.027027	0.164399
2	10	50	-0.029177	0.170814
1	1	100	-0.055020	0.234564
0	1	50	-0.073394	0.270914

The best combination is max_depth=10, n_estimators=100. We train that model, generate predictions, and create the submission.csv file to submit.
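
The best combination does not have to be read off the table by eye. A fitted `GridSearchCV` exposes `best_params_`, `best_score_`, and `best_estimator_`. The sketch below shows them on synthetic data with a `DecisionTreeRegressor` as a small stand-in model; both are assumptions for illustration, not the setup used in this post.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data and a small stand-in model (assumptions for illustration).
X = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=300)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=2020),
    param_grid={"max_depth": [1, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y_toy)

# best_params_ is the combination with the highest mean_test_score,
# i.e. the lowest MSE under this scoring.
print(search.best_params_)
print(np.sqrt(-search.best_score_))

# With refit=True (the default), best_estimator_ is already refit on all data.
best_model = search.best_estimator_
```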

Prepare a model with those parameters, train it, and generate the predictions.

In [112]:

model = LGBMRegressor(max_depth=10, n_estimators=100, random_state=random_state)
model.fit(train, y)
prediction = model.predict(test)
prediction

Out[112]:

array([13.13580793, 13.08051399, 14.11202067, ..., 13.01592878,
       12.69894979, 12.96297768])

Apply np.expm1() to return to the original scale.

In [113]:

prediction = np.expm1(prediction)
prediction

Out[113]:

array([ 506766.66784595,  479506.10405112, 1345155.15609376, ...,
        449515.92243642,  327402.87855805,  426332.71354302])

Load the sample_submission.csv file.

In [115]:

data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'

submission_path = join(data_dir, 'sample_submission.csv')
submission = pd.read_csv(submission_path)

# Overwriting the price column with our model's predictions completes the submission data
submission['price'] = prediction
submission.head()

Out[115]:

	id	price
0	15035	5.067667e+05
1	15036	4.795061e+05
2	15037	1.345155e+06
3	15038	3.122579e+05
4	15039	3.338645e+05

Putting the model type and the RMSLE value checked above into the file name keeps the submission files tidy.

In [116]:

submission_csv_path = '{}/submission_{}_RMSLE_{}.csv'.format(data_dir, 'lgbm', '0.164399')
submission.to_csv(submission_csv_path, index=False)
print(submission_csv_path)

/aiffel/aiffel/kaggle_kakr_housing/data/submission_lgbm_RMSLE_0.164399.csv
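
One possible way to avoid typing the score as a string is to format it from a number. `submission_filename` below is a hypothetical helper, not part of the original notebook, and the `"data"` directory is a placeholder.

```python
from os.path import join

def submission_filename(data_dir, model_name, rmsle):
    # Hypothetical helper: formats the score to 6 decimals so names line up.
    return join(data_dir, f"submission_{model_name}_RMSLE_{rmsle:.6f}.csv")

path = submission_filename("data", "lgbm", 0.164399)
print(path)
```

Passing the score as a float taken directly from the results table reduces the chance of a mistyped value ending up in the file name.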

Wrap the steps above in a single function as well.

In [125]:

"""
Implement `save_submission(model, train, y, test, model_name, rmsle)`, which performs the following steps.
1. Train the model on `train` and `y`.
2. Predict on `test`.
3. Transform the predictions with `np.expm1` and save a `csv` file named like `submission_model_name_RMSLE_100000.csv`.
"""

# Write the code

def save_submission(model, train, y, test, model_name, rmsle):
    
    # 1. Train the model on `train` and `y`.
    model.fit(train, y)
    
    # 2. Predict on `test`.
    prediction = model.predict(test)
    
    # 3. Transform the predictions with `np.expm1` and save a csv file
    #    named like `submission_model_name_RMSLE_100000.csv`.
    prediction = np.expm1(prediction)
    
    data_dir = os.getenv('HOME')+'/aiffel/kaggle_kakr_housing/data'
    submission_path = join(data_dir, 'sample_submission.csv')
    submission = pd.read_csv(submission_path)

    # Overwriting the price column with our model's predictions completes the submission data
    submission['price'] = prediction
    
    submission_csv_path = '{}/submission_{}_RMSLE_{}.csv'.format(data_dir, model_name, rmsle)
    submission.to_csv(submission_csv_path, index=False)
    
    print('yes! \n {} saved!'.format(submission_csv_path))
    

In [126]:

save_submission(model, train, y, test, 'lgbm', rmsle='0.0168')

[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
yes! 
 /aiffel/aiffel/kaggle_kakr_housing/data/submission_lgbm_RMSLE_0.0168.csv saved!

3. Project: This is your playground! Conquer the Leaderboard!
