Note

์‚ฌ์ดํ‚ท๋Ÿฐ์˜ model_selection ๋ชจ๋“ˆ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ถ„๋ฆฌํ•˜๊ฑฐ๋‚˜ ๊ต์ฐจ ๊ฒ€์ฆ ๋ถ„ํ•  ๋ฐ ํ‰๊ฐ€, ๊ทธ๋ฆฌ๊ณ  Estimator์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํŠœ๋‹ํ•˜๊ธฐ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ํ•จ์ˆ˜์™€ ํด๋ž˜์Šค๋ฅผ ์ œ๊ณต

1. Splitting train/test data sets - train_test_split()

ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŠธ๋กœ๋งŒ ํ•™์Šตํ•˜๊ณ  ์˜ˆ์ธกํ•˜๋ฉด ๋ฌด์—‡์ด ๋ฌธ์ œ์ผ๊นŒ?

  • ๋‹ค์Œ ์˜ˆ์ œ๋Š” ํ•™์Šต๊ณผ ์˜ˆ์ธก์„ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋กœ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
 
iris = load_iris()
dt_clf = DecisionTreeClassifier()
train_data = iris.data
train_label = iris.target
dt_clf.fit(train_data, train_label)
 
# ํ•™์Šต ๋ฐ์ดํ„ฐ ์…‹์œผ๋กœ ์˜ˆ์ธก ์ˆ˜ํ–‰
pred = dt_clf.predict(train_data)
print('์˜ˆ์ธก ์ •ํ™•๋„:',accuracy_score(train_label,pred))
 
>>> ์˜ˆ์ธก ์ •ํ™•๋„: 1.0
  • ์˜ˆ์ธก ์ •ํ™•๋„๊ฐ€ 1.0์ด๋ผ๋Š” ๋œป์€ ์ •ํ™•๋„๊ฐ€ 100%
  • ์ฆ‰, ๋ฌธ์ œ์˜ ์ •๋‹ต์„ ์•Œ๊ณ  ์žˆ๋Š” ์ƒํƒœ์—์„œ ๊ฐ™์€ ๋ฌธ์ œ๋ฅผ ํ…Œ์ŠคํŠธ ํ•œ ๊ฒƒ!
  • ๋”ฐ๋ผ์„œ, ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ์ดํ„ฐ ์„ธํŠธ๋Š” ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•œ ํ•™์Šต์šฉ ๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ์•„๋‹Œ ์ „์šฉ์˜ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—ฌ์•ผ ํ•จ

์‚ฌ์ดํ‚ท๋Ÿฐ์˜ train_test_split()

  • ์›๋ณธ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์‰ฝ๊ฒŒ ๋ถ„๋ฆฌ ๊ฐ€๋Šฅ
  • train_test_split()๋Š” ์ฒซ ๋ฒˆ์งธ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ feature ๋ฐ์ดํ„ฐ ์„ธํŠธ, ๋‘ ๋ฒˆ์งธ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ label ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ž…๋ ฅ๋ฐ›๊ณ , ์„ ํƒ์ ์œผ๋กœ ๋‹ค์Œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ž…๋ ฅ ๋ฐ›์Œ
    • test_size : ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ test ๋ฐ์ดํ„ฐ ์„ธํŠธ ํฌ๊ธฐ๋ฅผ ์–ผ๋งˆ๋กœ ์ƒ˜ํ”Œ๋งํ•  ๊ฒƒ์ธ๊ฐ€๋ฅผ ๊ฒฐ์ • (default : 0.25, ์ฆ‰ 25%)
    • train_size : ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ train ๋ฐ์ดํ„ฐ ์„ธํŠธ ํฌ๊ธฐ๋ฅผ ์–ผ๋งˆ๋กœ ์ƒ˜ํ”Œ๋งํ•  ๊ฒƒ์ธ๊ฐ€๋ฅผ ๊ฒฐ์ • (test_size parameter๋ฅผ ํ†ต์ƒ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— train_size๋Š” ์ž˜ ์‚ฌ์šฉ๋˜์ง€ ์•Š์Œ)
    • shuffle : ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฆฌํ•˜๊ธฐ ์ „์— ๋ฐ์ดํ„ฐ๋ฅผ ๋ฏธ๋ฆฌ ์„ž์„์ง€๋ฅผ ๊ฒฐ์ • (default : True), ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์‚ฐ์‹œ์ผœ์„œ ์ข€ ๋” ํšจ์œจ์ ์ธ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐ ์‚ฌ์šฉ
    • random_state : ํ˜ธ์ถœํ•  ๋•Œ๋งˆ๋‹ค ๋™์ผํ•œ ํ•™์Šต/ํ…Œ์ŠคํŠธ์šฉ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ฃผ์–ด์ง€๋Š” ๋‚œ์ˆ˜ ๊ฐ’ (train_test_split()๋Š” ํ˜ธ์ถœ ์‹œ ๋ฌด์ž‘์œ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฆฌํ•˜๋ฏ€๋กœ random_state๋ฅผ ์ง€์ •ํ•˜์ง€ ์•Š์œผ๋ฉด ์ˆ˜ํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๋‹ค๋ฅธ ํ•™์Šต/ํ…Œ์ŠคํŠธ ์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑ)
    • train_test_split()์˜ ๋ฐ˜ํ™˜๊ฐ’์€ tuple ํ˜•ํƒœ๋กœ, ์ˆœ์ฐจ์ ์œผ๋กœ train-feature, test-feature, train-label, test-label ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋ฐ˜ํ™˜
  • ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ train_test_split()์„ ์ด์šฉํ•ด test ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ „์ฒด์˜ 30%, train ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ 70%๋กœ ๋ถ„๋ฆฌ
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
 
dt_clf = DecisionTreeClassifier()
iris_data = load_iris()
 
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target,
                                                     test_size=0.3, random_state=121)
  • train ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ DecisionTreeClassfier๋ฅผ ํ•™์Šตํ•˜๊ณ  ์˜ˆ์ธก ์ •ํ™•๋„ ์ธก์ •
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
print('์˜ˆ์ธก ์ •ํ™•๋„: {0:.4f}'.format(accuracy_score(y_test,pred)))
 
>>> ์˜ˆ์ธก ์ •ํ™•๋„: 0.9556

๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ๋Š” 150๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ํฌ์ง€ ์•Š์•„ ์ „์ฒด์˜ 30% ์ •๋„์ธ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋Š” 45๊ฐœ ์ •๋„๋ฐ–์— ๋˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํŒ๋‹จํ•˜๊ธฐ์—๋Š” ๊ทธ๋ฆฌ ์ ์ ˆํ•˜์ง€ ์•Š์Œ โ†’ ํ•™์Šต์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์˜ ์–‘์„ ์ผ์ • ์ˆ˜์ค€ ์ด์ƒ์œผ๋กœ ๋ณด์žฅํ•˜๋Š” ๊ฒƒ๋„ ์ค‘์š”ํ•˜์ง€๋งŒ, ํ•™์Šต๋œ ๋ชจ๋ธ์— ๋Œ€ํ•ด ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•ด ๋ณด๋Š” ๊ฒƒ๋„ ๋งค์šฐ ์ค‘์š”!


2. ๊ต์ฐจ ๊ฒ€์ฆ (Cross-Validation, CV)

๊ณผ์ ํ•ฉ (Overfitting)

๋ชจ๋ธ์ด ํ•™์Šต ๋ฐ์ดํ„ฐ์—๋งŒ ๊ณผ๋„ํ•˜๊ฒŒ ์ตœ์ ํ™”๋˜์–ด, ์‹ค์ œ ์˜ˆ์ธก์„ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋กœ ์ˆ˜ํ–‰ํ•  ๊ฒฝ์šฐ์—๋Š” ์˜ˆ์ธก ์„ฑ๋Šฅ์ด ๊ณผ๋„ํ•˜๊ฒŒ ๋–จ์–ด์ง€๋Š” ๊ฒƒ

  • ๊ณ ์ •๋œ ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ ํ‰๊ฐ€๋ฅผ ํ•˜๋‹ค ๋ณด๋ฉด ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์—๋งŒ ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ๋„๋ก ํŽธํ–ฅ๋˜๊ฒŒ ๋ชจ๋ธ์„ ์œ ๋„ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์ƒ๊ธธ ์ˆ˜ ์žˆ์Œ
  • ๊ฒฐ๊ตญ์€ ํ•ด๋‹น ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์—๋งŒ ๊ณผ์ ํ•ฉ๋˜๋Š” ํ•™์Šต ๋ชจ๋ธ์ด ๋งŒ๋“ค์–ด์ ธ ๋‹ค๋ฅธ ํ…Œ์ŠคํŠธ์šฉ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์˜ฌ ๊ฒฝ์šฐ์—๋Š” ์„ฑ๋Šฅ์ด ์ €ํ•˜๋จ โ†’ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ด์šฉํ•ด ๋” ๋‹ค์–‘ํ•œ ํ•™์Šต๊ณผ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰!

๊ต์ฐจ ๊ฒ€์ฆ (Cross-Validation, CV)

์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์™€ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋กœ ๋‚˜๋ˆ„์–ด ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ๋ฒ•

  • ๋ฐ์ดํ„ฐ ํŽธ์ค‘์„ ๋ง‰๊ธฐ ์œ„ํ•ด์„œ ๋ณ„๋„์˜ ์—ฌ๋Ÿฌ ์„ธํŠธ๋กœ ๊ตฌ์„ฑ๋œ ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŠธ์™€ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ํ•™์Šต๊ณผ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ
  • ๊ฐ ์„ธํŠธ์—์„œ ์ˆ˜ํ–‰ํ•œ ํ‰๊ฐ€ ๊ฒฐ๊ณผ์— ๋”ฐ๋ผ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ๋“ฑ์˜ ๋ชจ๋ธ ์ตœ์ ํ™”๋ฅผ ๋”์šฑ ์†์‰ฝ๊ฒŒ ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋Œ€๋ถ€๋ถ„์˜ ML ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ ํ‰๊ฐ€๋Š” ๊ต์ฐจ ๊ฒ€์ฆ ๊ธฐ๋ฐ˜์œผ๋กœ 1์ฐจ ํ‰๊ฐ€๋ฅผ ํ•œ ๋’ค์— ์ตœ์ข…์ ์œผ๋กœ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ์ ์šฉํ•ด ํ‰๊ฐ€ํ•˜๋Š” ํ”„๋กœ์„ธ์Šค

a. k ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ

k ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ (K-Fold Cross-Validation)

k๊ฐœ์˜ ๋ฐ์ดํ„ฐ fold(์กฐ๊ฐ) ์„ธํŠธ๋ฅผ ๋งŒ๋“ค์–ด์„œ k๋ฒˆ๋งŒํผ ๊ฐ fold ์„ธํŠธ์— ํ•™์Šต๊ณผ ๊ฒ€์ฆ ํ‰๊ฐ€๋ฅผ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•

  • ๋‹ค์Œ ๊ทธ๋ฆผ์€ 5 fold ๊ต์ฐจ ๊ฒ€์ฆ ์ˆ˜ํ–‰ (k=5)

    • 5๊ฐœ์˜ fold๋œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ํ•™์Šต๊ณผ ๊ฒ€์ฆ์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋กœ ๋ณ€๊ฒฝํ•˜๋ฉด์„œ 5๋ฒˆ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•œ ๋’ค, ์ด 5๊ฐœ์˜ ํ‰๊ฐ€๋ฅผ ํ‰๊ท ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ง€๊ณ  ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ‰๊ฐ€
    • ์ด๋ ‡๊ฒŒ ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŠธ์™€ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์ ์ง„์ ์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋ฉด์„œ ๋งˆ์ง€๋ง‰ 5๋ฒˆ์งธ(k๋ฒˆ์งธ)๊นŒ์ง€ ํ•™์Šต๊ณผ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ k fold ๊ต์ฐจ ๊ฒ€์ฆ
  • ์‚ฌ์ดํ‚ท๋Ÿฐ์—์„œ๋Š” k fold ๊ต์ฐจ ๊ฒ€์ฆ ํ”„๋กœ์„ธ์Šค๋ฅผ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด KFold์™€ StratifiedKFold๋ฅผ ์ œ๊ณต

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import numpy as np
 
iris = load_iris()
features = iris.data
label = iris.target
dt_clf = DecisionTreeClassifier(random_state=156)
 
# 5๊ฐœ์˜ ํด๋“œ ์„ธํŠธ๋กœ ๋ถ„๋ฆฌํ•˜๋Š” KFold ๊ฐ์ฒด์™€ ํด๋“œ ์„ธํŠธ๋ณ„ ์ •ํ™•๋„๋ฅผ ๋‹ด์„ ๋ฆฌ์ŠคํŠธ ๊ฐ์ฒด ์ƒ์„ฑ.
kfold = KFold(n_splits=5)
cv_accuracy = []
print('๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ ํฌ๊ธฐ:',features.shape[0])
 
>>> ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ ํฌ๊ธฐ: 150
  • KFold(n_splits=5)๋กœ KFold ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ–ˆ์œผ๋‹ˆ, ์ด์ œ ์ƒ์„ฑ๋œ KFold ๊ฐ์ฒด์˜ split()์„ ํ˜ธ์ถœํ•ด ์ „์ฒด ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ๋ฅผ 5๊ฐœ์˜ fold ๋ฐ์ดํ„ฐ ์„ธํŠธ๋กœ ๋ถ„๋ฆฌ
n_iter = 0
 
# KFold๊ฐ์ฒด์˜ split( ) ํ˜ธ์ถœํ•˜๋ฉด ํด๋“œ ๋ณ„ ํ•™์Šต์šฉ, ๊ฒ€์ฆ์šฉ ํ…Œ์ŠคํŠธ์˜ ๋กœ์šฐ ์ธ๋ฑ์Šค๋ฅผ array๋กœ ๋ฐ˜ํ™˜ ย 
for train_index, test_index ย in kfold.split(features):
ย  ย  # kfold.split( )์œผ๋กœ ๋ฐ˜ํ™˜๋œ ์ธ๋ฑ์Šค๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•™์Šต์šฉ, ๊ฒ€์ฆ์šฉ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
ย  ย  X_train, X_test = features[train_index], features[test_index]
ย  ย  y_train, y_test = label[train_index], label[test_index]
 
ย  ย  #ํ•™์Šต ๋ฐ ์˜ˆ์ธก
ย  ย  dt_clf.fit(X_train , y_train) ย  ย 
ย  ย  pred = dt_clf.predict(X_test)
ย  ย  n_iter += 1
 
ย  ย  # ๋ฐ˜๋ณต ์‹œ ๋งˆ๋‹ค ์ •ํ™•๋„ ์ธก์ •
ย  ย  accuracy = np.round(accuracy_score(y_test,pred), 4)
ย  ย  train_size = X_train.shape[0]
ย  ย  test_size = X_test.shape[0]
 
ย  ย  print('\n#{0} ๊ต์ฐจ ๊ฒ€์ฆ ์ •ํ™•๋„ :{1}, ํ•™์Šต ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {2}, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {3}'
ย  ย  ย  ย  ย  .format(n_iter, accuracy, train_size, test_size))
ย  ย  print('#{0} ๊ฒ€์ฆ ์„ธํŠธ ์ธ๋ฑ์Šค:{1}'.format(n_iter,test_index))
ย  ย  cv_accuracy.append(accuracy)
 
# ๊ฐœ๋ณ„ iteration๋ณ„ ์ •ํ™•๋„๋ฅผ ํ•ฉํ•˜์—ฌ ํ‰๊ท  ์ •ํ™•๋„ ๊ณ„์‚ฐ
print('\n## ํ‰๊ท  ๊ฒ€์ฆ ์ •ํ™•๋„:', np.mean(cv_accuracy))

  • 5๋ฒˆ ๊ต์ฐจ ๊ฒ€์ฆ ๊ฒฐ๊ณผ ํ‰๊ท  ๊ฒ€์ฆ ์ •ํ™•๋„๋Š” 0.9์ด๊ณ , ๊ต์ฐจ ๊ฒ€์ฆ ์‹œ๋งˆ๋‹ค ๊ฒ€์ฆ ์„ธํŠธ์˜ ์ธ๋ฑ์Šค๊ฐ€ ๋‹ฌ๋ผ์ง์„ ์•Œ ์ˆ˜ ์žˆ์Œ!

b. The Stratified K-Fold class

Stratified K-Fold

A K-fold method for label (target class) data sets with an imbalanced distribution

  • ๋ถˆ๊ท ํ˜•ํ•œ ๋ถ„ํฌ๋„๋ฅผ ๊ฐ€์ง„ label ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์€ ํŠน์ • label ๊ฐ’์ด ํŠน์ดํ•˜๊ฒŒ ๋งŽ๊ฑฐ๋‚˜ ๋งค์šฐ ์ ์–ด์„œ ๊ฐ’์˜ ๋ถ„ํฌ๊ฐ€ ํ•œ์ชฝ์œผ๋กœ ์น˜์šฐ์น˜๋Š” ๊ฒƒ์„ ์˜๋ฏธ
  • K Fold๊ฐ€ label ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์ด ์›๋ณธ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์˜ label ๋ถ„ํฌ๋ฅผ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ์ œ๋Œ€๋กœ ๋ถ„๋ฐฐํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐ!
    • ์ด๋ฅผ ์œ„ํ•ด ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ label ๋ถ„ํฌ๋ฅผ ๋จผ์ € ๊ณ ๋ คํ•œ ๋’ค ์ด ๋ถ„ํฌ์™€ ๋™์ผํ•˜๊ฒŒ ํ•™์Šต๊ณผ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ถ„๋ฐฐ

๋Œ€์ถœ ์‚ฌ๊ธฐ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค๊ณ  ๊ฐ€์ •!

  • ์ด ๋ฐ์ดํ„ฐ ์„ธํŠธ๋Š” 1์–ต ๊ฑด์ด๊ณ , ์ˆ˜์‹ญ ๊ฐœ์˜ feature์™€ ๋Œ€์ถœ ์‚ฌ๊ธฐ ์—ฌ๋ถ€๋ฅผ ๋œปํ•˜๋Š” label(์‚ฌ๊ธฐ:1, ์ •์ƒ:0)๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Œ
  • ๊ทธ๋Ÿฐ๋ฐ ๋Œ€๋ถ€๋ถ„์˜ ๋ฐ์ดํ„ฐ๋Š” ์ •์ƒ ๋Œ€์ถœ์ผ ๊ฒƒ!
  • ๋Œ€์ถœ ์‚ฌ๊ธฐ๊ฐ€ ์•ฝ 1000๊ฑด์ด ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค๋ฉด ์ „์ฒด์˜ 0.0001%์˜ ์•„์ฃผ ์ž‘์€ ํ™•๋ฅ ๋กœ ๋Œ€์ถœ ์‚ฌ๊ธฐ label์ด ์กด์žฌ
  • ์ด๋ ‡๊ฒŒ ๋œ๋‹ค๋ฉด K Fold๋กœ ๋žœ๋คํ•˜๊ฒŒ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ์ธ๋ฑ์Šค๋ฅผ ๊ณ ๋ฅด๋”๋ผ๊ณ  label ๊ฐ’์ธ 0๊ณผ 1์˜ ๋น„์œจ์„ ์ œ๋Œ€๋กœ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์‰ฝ๊ฒŒ ๋ฐœ์ƒ!
  • ๋”ฐ๋ผ์„œ ์›๋ณธ ๋ฐ์ดํ„ฐ์™€ ์œ ์‚ฌํ•œ ๋Œ€์ถœ ์‚ฌ๊ธฐ ๋ ˆ์ด๋ธ” ๊ฐ’์˜ ๋ถ„ํฌ๋ฅผ ํ•™์Šต/ํ…Œ์ŠคํŠธ ์„ธํŠธ์—๋„ ์œ ์ง€ํ•˜๋Š” ๊ฒŒ ๋งค์šฐ ์ค‘์š”!

Note

Let's first see what problem plain K-fold has, and then improve on it with scikit-learn's StratifiedKFold class!

  1. ๋ถ“๊ฝƒ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๊ฐ„๋‹จํ•˜๊ฒŒ DataFrame์œผ๋กœ ์ƒ์„ฑํ•˜๊ณ  label ๊ฐ’์˜ ๋ถ„ํฌ๋„ ํ™•์ธ
import pandas as pd
 
iris = load_iris()
 
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['label']=iris.target
iris_df['label'].value_counts()

  1. ๊ฐ ๊ต์ฐจ ๊ฒ€์ฆ ์‹œ๋งˆ๋‹ค ์ƒ์„ฑ๋˜๋Š” ํ•™์Šต/๊ฒ€์ฆ label ๋ฐ์ดํ„ฐ ๊ฐ’์˜ ๋ถ„ํฌ๋„ ํ™•์ธ
kfold = KFold(n_splits=3)
# kfold.split(X) returns the train/validation row indices, which change on each of the 3 folds
n_iter = 0
for train_index, test_index in kfold.split(iris_df):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
 
    print('## Cross-validation: {0}'.format(n_iter))
    print('Train label distribution:\n', label_train.value_counts())
    print('Validation label distribution:\n', label_test.value_counts())

  • ๊ต์ฐจ ๊ฒ€์ฆ ์‹œ๋งˆ๋‹ค 3๊ฐœ์˜ fold ์„ธํŠธ๋กœ ๋งŒ๋“ค์–ด์ง€๋Š” ํ•™์Šต label๊ณผ ๊ฒ€์ฆ label์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ๊ฐ’์œผ๋กœ ์ถ”์ถœ๋จ
  • ์ด๋Ÿฐ ์œ ํ˜•์œผ๋กœ ๊ต์ฐจ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ถ„ํ• ํ•˜๋ฉด ๊ฒ€์ฆ ์˜ˆ์ธก ์ •ํ™•๋„๋Š” 0์ด ๋  ์ˆ˜๋ฐ–์— ์—†์Œ
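  • A minimal sketch of that check (it simply trains and evaluates a decision tree on each plain-KFold split of the iris data):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
 
iris = load_iris()
clf = DecisionTreeClassifier(random_state=156)
for n, (tr_idx, te_idx) in enumerate(KFold(n_splits=3).split(iris.data), start=1):
    clf.fit(iris.data[tr_idx], iris.target[tr_idx])
    acc = accuracy_score(iris.target[te_idx], clf.predict(iris.data[te_idx]))
    # Each validation fold holds only the one class that was absent from training, so every accuracy is 0.0
    print('#{0} validation accuracy with plain KFold: {1}'.format(n, acc))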
  1. ๋™์ผํ•œ ๋ฐ์ดํ„ฐ ๋ถ„ํ• ์„ StratifiedKFold๋กœ ์ˆ˜ํ–‰ํ•˜๊ณ  ํ•™์Šต/๊ฒ€์ฆ label ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๋„ ํ™•์ธ

Using StratifiedKFold is almost identical to using KFold; the one big difference is that StratifiedKFold splits the train/validation data according to the label distribution, so its split() method requires the label data set as an argument in addition to the feature data set!

from sklearn.model_selection import StratifiedKFold
 
skf = StratifiedKFold(n_splits=3)
n_iter=0
 
for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
 
    print('## Cross-validation: {0}'.format(n_iter))
    print('Train label distribution:\n', label_train.value_counts())
    print('Validation label distribution:\n', label_test.value_counts())

  • The output shows that the train and validation label values are now distributed almost identically
  • Only when the data is split like this can the model learn all of the label values 0, 1 and 2, and be validated on that basis
  4. Cross-validate the iris data using StratifiedKFold
dt_clf = DecisionTreeClassifier(random_state=156)
 
skfold = StratifiedKFold(n_splits=3)
n_iter=0
cv_accuracy=[]
 
# StratifiedKFold์˜ split( ) ํ˜ธ์ถœ์‹œ ๋ฐ˜๋“œ์‹œ ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ ์…‹๋„ ์ถ”๊ฐ€ ์ž…๋ ฅ ํ•„์š” ย 
for train_index, test_index ย in skfold.split(features, label):
ย  ย  # split( )์œผ๋กœ ๋ฐ˜ํ™˜๋œ ์ธ๋ฑ์Šค๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•™์Šต์šฉ, ๊ฒ€์ฆ์šฉ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
ย  ย  X_train, X_test = features[train_index], features[test_index]
ย  ย  y_train, y_test = label[train_index], label[test_index]
 
ย  ย  #ํ•™์Šต ๋ฐ ์˜ˆ์ธก
ย  ย  dt_clf.fit(X_train, y_train) ย  ย 
ย  ย  pred = dt_clf.predict(X_test)
 
ย  ย  # ๋ฐ˜๋ณต ์‹œ ๋งˆ๋‹ค ์ •ํ™•๋„ ์ธก์ •
ย  ย  n_iter += 1
ย  ย  accuracy = np.round(accuracy_score(y_test,pred), 4)
ย  ย  train_size = X_train.shape[0]
ย  ย  test_size = X_test.shape[0]
 
ย  ย  print('\n#{0} ๊ต์ฐจ ๊ฒ€์ฆ ์ •ํ™•๋„ :{1}, ํ•™์Šต ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {2}, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ: {3}'
ย  ย  ย  ย  ย  .format(n_iter, accuracy, train_size, test_size))
ย  ย  print('#{0} ๊ฒ€์ฆ ์„ธํŠธ ์ธ๋ฑ์Šค:{1}'.format(n_iter,test_index))
ย  ย  cv_accuracy.append(accuracy)
 
# ๊ต์ฐจ ๊ฒ€์ฆ๋ณ„ ์ •ํ™•๋„ ๋ฐ ํ‰๊ท  ์ •ํ™•๋„ ๊ณ„์‚ฐ
print('\n## ๊ต์ฐจ ๊ฒ€์ฆ๋ณ„ ์ •ํ™•๋„:', np.round(cv_accuracy, 4))
print('## ํ‰๊ท  ๊ฒ€์ฆ ์ •ํ™•๋„:', np.round(np.mean(cv_accuracy), 4))

Note

Stratified K Fold ์˜ ๊ฒฝ์šฐ ์›๋ณธ ๋ฐ์ดํ„ฐ์˜ label ๋ถ„ํฌ๋„ ํŠน์„ฑ์„ ๋ฐ˜์˜ํ•œ ํ•™์Šต ๋ฐ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์™œ๊ณก๋œ label ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ๋Š” ๋ฐ˜๋“œ์‹œ Stratified K Fold๋ฅผ ์ด์šฉํ•ด ๊ต์ฐจ ๊ฒ€์ฆํ•ด์•ผ ํ•จ!

c. ๊ต์ฐจ ๊ฒ€์ฆ์„ ๋ณด๋‹ค ๊ฐ„ํŽธํ•˜๊ฒŒ - cross_val_score()

- ์‚ฌ์ดํ‚ท๋Ÿฐ์€ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ข€ ๋” ํŽธ๋ฆฌํ•˜๊ฒŒ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” API ์ œ๊ณต
  • KFold๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•˜๊ณ  ์˜ˆ์ธก ํ•˜๋Š” ์ฝ”๋“œ ์ˆœ์„œ
      1. fold ์„ธํŠธ๋ฅผ ์„ค์ •
      1. for ๋ฃจํ”„์—์„œ ๋ฐ˜๋ณต์œผ๋กœ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์˜ ์ธ๋ฑ์Šค๋ฅผ ์ถ”์ถœ
      1. ๋ฐ˜๋ณต์ ์œผ๋กœ ํ•™์Šต๊ณผ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์˜ˆ์ธก ์„ฑ๋Šฅ ๋ฐ˜ํ™˜
  • cross_val_score()๋Š” ์ด๋Ÿฐ ์ผ๋ จ์˜ ๊ณผ์ •์„ ํ•œ๊บผ๋ฒˆ์— ์ˆ˜ํ–‰ํ•ด์ฃผ๋Š” API

cross_val_score()

cross_val_score(estimator, X, y=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs')

  • Main parameters: estimator, X, y, scoring, cv
    • estimator : a Classifier or Regressor class object
    • X : feature data set
    • y : label data set
    • scoring : prediction performance evaluation metric
    • cv : number of cross-validation folds
      • a KFold or StratifiedKFold object can also be passed
  • The return value is an array of the performance metric values specified by the scoring parameter
  • When a classifier is passed in, the data is split in the Stratified K-fold manner according to the label distribution (regression cannot be split with Stratified K-fold, so plain K-fold splitting is used)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.datasets import load_iris
import numpy as np
 
iris_data = load_iris()
dt_clf = DecisionTreeClassifier(random_state=156)
 
data = iris_data.data
label = iris_data.target
 
# Performance metric: accuracy; number of cross-validation sets: 3
scores = cross_val_score(dt_clf, data, label, scoring='accuracy', cv=3)
print('CV accuracy per fold:', np.round(scores, 4))
print('Mean validation accuracy:', np.round(np.mean(scores), 4))
 
>>> ๊ต์ฐจ ๊ฒ€์ฆ๋ณ„ ์ •ํ™•๋„: [0.98 0.94 0.98]
    ํ‰๊ท  ๊ฒ€์ฆ ์ •ํ™•๋„: 0.9667
  • cross_val_score() API๋Š” ๋‚ด๋ถ€์—์„œ Estimator๋ฅผ ํ•™์Šต(fit), ์˜ˆ์ธก(predict), ํ‰๊ฐ€(evaluation)์‹œ์ผœ์ฃผ๋ฏ€๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ!
  • cv ํŒŒ๋ผ๋ฏธํ„ฐ์— ์ •์ˆ˜๊ฐ’(fold ์ˆ˜)๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ๋‚ด๋ถ€์ ์œผ๋กœ StratifiedKFold๋ฅผ ์ด์šฉ
  • ๋น„์Šทํ•œ API๋กœ cross_validate() ์กด์žฌ
    • ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํ‰๊ฐ€ ์ง€ํ‘œ ๋ฐ˜ํ™˜ ๊ฐ€๋Šฅ
    • ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์„ฑ๋Šฅ ํ‰๊ฐ€ ์ง€ํ‘œ์™€ ์ˆ˜ํ–‰ ์‹œ๊ฐ„๋„ ๊ฐ™์ด ์ œ๊ณต

3. GridSearchCV - cross-validation and optimal hyperparameter tuning in one go

Note

ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ (Hyperparameter)

  • ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๊ธฐ ์ „์— ์‚ฌ์šฉ์ž๊ฐ€ ์ง์ ‘ ์„ค์ •ํ•˜๋Š” ๊ฐ’์œผ๋กœ, ์ด ๊ฐ’์„ ์กฐ์ •ํ•ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์˜ˆ์ธก ์„ฑ๋Šฅ ๊ฐœ์„ 

GridSearchCV

  • ์‚ฌ์ดํ‚ท๋Ÿฐ์—์„œ ์ œ๊ณตํ•˜๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™” API
  • Classifier๋‚˜ Regressor์™€ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์‚ฌ์šฉ๋˜๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ž…๋ ฅํ•˜๋ฉด์„œ ํŽธ๋ฆฌํ•˜๊ฒŒ ์ตœ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„์ถœํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ์•ˆ ์ œ๊ณต
  • Grid๋Š” ๊ฒฉ์ž๋ผ๋Š” ๋œป์œผ๋กœ, ์ด˜์ด˜ํ•˜๊ฒŒ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด์„œ ํ…Œ์ŠคํŠธ๋ฅผ ํ•˜๋Š” ๋ฐฉ์‹
  • ๊ฒฐ์ • ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์—ฌ๋Ÿฌ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋ฉด์„œ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๊ฐ€์ง€๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐํ•ฉ์„ ์ฐพ๊ณ ์ž ํ•œ๋‹ค๋ฉด
  • ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ง‘ํ•ฉ์„ ๋งŒ๋“ค๊ณ  ์ด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ ์šฉํ•˜๋ฉด์„œ ์ตœ์ ํ™” ์ˆ˜ํ–‰ ๊ฐ€๋Šฅ
grid_parameters = {'max_depth': [1, 2, 3],
                   'min_samples_split': [2, 3]}
  • GridSearchCV๋Š” ๊ต์ฐจ ๊ฒ€์ฆ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ตœ์ ๊ฐ’์„ ์ฐพ๊ฒŒ ํ•ด์คŒ!
      1. ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ cross-validation์„ ์œ„ํ•œ ํ•™์Šต/ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ์ž๋™์œผ๋กœ ๋ถ„ํ• 
      1. ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ grid์— ๊ธฐ์ˆ ๋œ ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ ์šฉํ•ด ์ตœ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์คŒ
  • ๋‹จ, ๋™์‹œ์— ์ˆœ์ฐจ์ ์œผ๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ฏ€๋กœ ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ์ƒ๋Œ€์ ์œผ๋กœ ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š” ๋‹จ์  ์กด์žฌ!
    • ์œ„์˜ ๊ฒฝ์šฐ CV๊ฐ€ 3ํšŒ๋ผ๋ฉด CV 3ํšŒ x 6๊ฐœ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐํ•ฉ = 18ํšŒ์˜ ํ•™์Šต/ํ‰๊ฐ€ ์ด๋ฃจ์–ด์ง

GridSearchCV ํด๋ž˜์Šค

  • estimator : a classifier, regressor, or pipeline
  • param_grid : a dictionary of keys and list values (the parameter names and the candidate values to be used for tuning the estimator)
  • scoring : the evaluation method used to measure prediction performance; usually one of scikit-learn's performance metric strings (e.g. 'accuracy'), but a custom scoring function can also be given
  • cv : the number of train/test set splits used for cross-validation
  • refit : default=True; when True, once the optimal hyperparameters are found, the input estimator object is retrained with those hyperparameters (a brief constructor sketch follows below)
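  • A minimal sketch of the constructor using the parameters above (the values are illustrative; the full worked example follows):
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
 
grid_dt = GridSearchCV(estimator=DecisionTreeClassifier(),
                       param_grid={'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]},
                       scoring='accuracy',               # metric used to rank the combinations
                       cv=StratifiedKFold(n_splits=3),   # a CV splitter object can be passed instead of an integer
                       refit=True)                       # retrain the best estimator on the whole training data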

<Example> Use GridSearchCV to apply several candidate optimization parameters of the decision tree algorithm in turn and run a prediction analysis on the iris data

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
 
# Load the data and split it into train and test data
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target,
                                                    test_size=0.2, random_state=121)
dtree = DecisionTreeClassifier()
 
### parameter ๋“ค์„ dictionary ํ˜•ํƒœ๋กœ ์„ค์ •
parameters = {'max_depth':[1,2,3], 'min_samples_split':[2,3]}
  • train_test_split()์„ ์ด์šฉํ•ด ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋จผ์ € ๋ถ„๋ฆฌ
  • ํ…Œ์ŠคํŠธํ•  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ dictionary ํ˜•ํƒœ๋กœ ์„ค์ •
import pandas as pd
 
# param_grid์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ 3๊ฐœ์˜ train, test set fold ๋กœ ๋‚˜๋ˆ„์–ด์„œ ํ…Œ์ŠคํŠธ ์ˆ˜ํ–‰ ์„ค์ •. ย 
### refit=True ๊ฐ€ default ์ž„. True์ด๋ฉด ๊ฐ€์žฅ ์ข‹์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์œผ๋กœ ์žฌ ํ•™์Šต ์‹œํ‚ด. ย 
grid_dtree = GridSearchCV(dtree, param_grid=parameters, cv=3, refit=True)
 
# Train/evaluate the hyperparameters in param_grid one combination at a time on the iris train data
grid_dtree.fit(X_train, y_train)
 
# Extract the GridSearchCV results and convert them to a DataFrame
scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
           'split0_test_score', 'split1_test_score', 'split2_test_score']]
  • ํ•™์Šต ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ GridSearchCV ๊ฐ์ฒด์˜ fit() ๋ฉ”์„œ๋“œ์— ์ธ์ž๋กœ ์ž…๋ ฅ
  • ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ cv์— ๊ธฐ์ˆ ๋œ ํด๋”ฉ ์„ธํŠธ๋กœ ๋ถ„ํ• ํ•ด param_grid์— ๊ธฐ์ˆ ๋œ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋ฉด์„œ ํ•™์Šต/ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ cv_results_ ์†์„ฑ์— ๊ธฐ๋ก

<Results>

  • params ์ปฌ๋Ÿผ์—๋Š” ์ˆ˜ํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ์ ์šฉ๋œ ๊ฐœ๋ณ„ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ’์„ ๋‚˜ํƒ€๋ƒ„
  • rank_test_score๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ณ„๋กœd ์„ฑ๋Šฅ์ด ์ข‹์€ score ์ˆœ์œ„๋ฅผ ๋‚˜ํƒ€๋ƒ„ (1์ด ๊ฐ€์žฅ ๋›ฐ์–ด๋‚œ ์ˆœ์œ„์ด๋ฉฐ ์ด๋•Œ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ตœ์ ์˜ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ)
  • mean_test_score๋Š” ๊ฐœ๋ณ„ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋ณ„๋กœ CV์˜ ํด๋”ฉ ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ๋Œ€ํ•ด ์ด ์ˆ˜ํ–‰ํ•œ ํ‰๊ฐ€ ํ‰๊ท ๊ฐ’
print('GridSearchCV best parameters:', grid_dtree.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dtree.best_score_))

  • GridSearchCV ๊ฐ์ฒด์˜ fit()์„ ์ˆ˜ํ–‰ํ•˜๋ฉด ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฐ’๊ณผ ๊ทธ๋•Œ์˜ ํ‰๊ฐ€ ๊ฒฐ๊ณผ ๊ฐ’์ด ๊ฐ๊ฐ best_params_, best_score_ ์†์„ฑ์— ๊ธฐ๋ก
# GridSearchCV์˜ refit์œผ๋กœ ์ด๋ฏธ ํ•™์Šต์ด ๋œ estimator ๋ฐ˜ํ™˜
estimator = grid_dtree.best_estimator_
 
# GridSearchCV์˜ best_estimator_๋Š” ์ด๋ฏธ ์ตœ์  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํ•™์Šต์ด ๋จ
pred = estimator.predict(X_test)
print('ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ ์ •ํ™•๋„: {0:.4f}'.format(accuracy_score(y_test,pred)))
 
>>> ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ ์ •ํ™•๋„: 0.9667
  • refit=True์ด๋ฉด GridSearchCV๊ฐ€ ์ตœ์  ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ Estimator๋ฅผ ๋‹ค์‹œ ํ•™์Šตํ•ด best_estimator_๋กœ ์ €์žฅ
    • refit=False์ธ ๊ฒฝ์šฐ ์ตœ์ ์˜ ๋ชจ๋ธ(best_estimator_)์„ ์ž๋™์œผ๋กœ ๋‹ค์‹œ ํ•™์Šตํ•˜์ง€ ์•Š์Œ!
    • ์ฆ‰, best_estimator_ ์†์„ฑ์ด ์กด์žฌํ•˜์ง€ ์•Š์œผ๋ฉฐ, ์˜ค์ง ๊ต์ฐจ ๊ฒ€์ฆ ๊ฒฐ๊ณผ(cv_results_)๋งŒ ์ œ๊ณต๋จ
  • ์ด๋ฏธ ํ•™์Šต๋œ best_estimator_๋ฅผ ์ด์šฉํ•ด ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋กœ ์ •ํ™•๋„๋ฅผ ์ธก์ •ํ•œ ๊ฒฐ๊ณผ ์•ฝ 96.67%์˜ ๊ฒฐ๊ณผ ๋„์ถœ

Tip

ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ GridSearchCV๋ฅผ ์ด์šฉํ•ด ์ตœ์  ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ์ˆ˜ํ–‰ํ•œ ๋’ค์— ๋ณ„๋„์˜ ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ ์ด๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ธ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ ์ ์šฉ ๋ฐฉ๋ฒ•