
regression

by 볼록티 2020. 6. 20.
lec04-3-regression-house-price

Predicting house prices: a regression example

  • Predicting a continuous value instead of a discrete label

The Boston Housing Price dataset

  • We want to predict the median price of homes in a given Boston suburb in the mid-1970s, given the crime rate, the local property tax rate, and so on.
  • It has relatively few data points: only 506 (404 training samples and 102 test samples).
  • Each feature in the input has a different scale.
    • For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, and so on.

How should we handle the fact that each attribute has a different scale?

This is a dataset for predicting the median house price.

In [1]:
from tensorflow.keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

About 400 training samples and 102 test samples, each with 13 columns (features).

In [2]:
train_data.shape 
Out[2]:
(404, 13)
In [3]:
train_data[0] # note that the features are on very different scales
Out[3]:
array([  1.23247,   0.     ,   8.14   ,   0.     ,   0.538  ,   6.142  ,
        91.7    ,   3.9769 ,   4.     , 307.     ,  21.     , 396.9    ,
        18.72   ])
In [4]:
test_data.shape
Out[4]:
(102, 13)
In [5]:
train_targets[:30]
Out[5]:
array([15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6, 14.4, 12.1,
       17.9, 23.1, 19.9, 15.7,  8.8, 50. , 22.5, 24.1, 27.5, 10.9, 30.8,
       32.9, 24. , 18.5, 13.3, 22.9, 34.7, 16.6, 17.5])

Preparing the data

  • It would be problematic to feed into a neural network values that all take wildly different ranges.
  • Let's do feature-wise normalization
    • For each feature, we subtract the mean of the feature and divide by the standard deviation.
    • Then, the feature is centered around 0 and has a unit standard deviation.

We standardize (normalize) each feature.

In [6]:
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_data -= mean
train_data /= std

# Never compute these statistics on the test set!! The test set is only for evaluating the model,
# so reuse the mean and standard deviation obtained from the training set.
test_data -= mean # the test data must be normalized as well, or the predictions would be meaningless
test_data /= std
  • Note that the quantities used for normalizing the test data are computed using the training data.
  • NEVER use any quantity computed on the test data, even for something as simple as data normalization.
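To make this rule concrete, the same preprocessing can be expressed with a fitted scaler object. Here is a minimal sketch using scikit-learn's StandardScaler (not part of the original notebook; it assumes scikit-learn is installed and starts from the raw, unnormalized arrays, so the variable names with the _scaled suffix are illustrative):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data)  # mean and std are computed from the training data only
test_data_scaled = scaler.transform(test_data)        # the same training statistics are reused on the test data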

Building the network

In [7]:
from tensorflow.keras import models
from tensorflow.keras import layers

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1)) # a single output node is enough; no activation function here (do not add one)
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae']) # loss function: mse, monitored with mae
    return model
  • This network ends with a single unit and no activation (it is called a linear layer).
  • mse loss
    • mean squared error, the square of the difference between the predictions and the targets
  • mae for monitoring
    • mean absolute error, the absolute value of the difference between the predictions and the targets (a quick numeric check of both metrics follows below)
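To see what these two metrics measure, here is a quick NumPy check with made-up prediction and target values (illustrative only; prices in this dataset are in thousands of dollars):

import numpy as np

preds   = np.array([22.0, 18.5, 31.0])   # hypothetical predictions
targets = np.array([21.1, 20.0, 30.0])   # hypothetical true prices

mse = np.mean((preds - targets) ** 2)    # average of the squared differences
mae = np.mean(np.abs(preds - targets))   # average of the absolute differences

print(mse, mae)  # an MAE of 1.0 means predictions are off by about $1,000 on average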

Validation with k-fold cross validation technique

The dataset is very small, so the measured performance can vary quite a bit depending on how the validation set is drawn.

That is why we use k-fold cross-validation.

(In the k-fold diagram, the white boxes labeled Validation in Fold 2 and Fold 3 should read Training; that label is a typo.)

  • Since we have few data points, the validation set would be very small if we randomly split the data into a training set and a validation set.

    • It means that the validation scores might change a lot depending on which data points we chose for the validation.
    • We can say that the validation scores might have a high variance with regard to the validation split.
  • The best practice in such situations is to use k-fold cross-validation.

    • It consists of splitting the available data into k partitions, instantiating k identical models, and training each one on k-1 partitions while evaluating it on the remaining partition.
    • The validation score for the model is then the average of the k validation scores obtained.

In [8]:
import numpy as np

k = 4   # split the data into 4 folds
num_val_samples = len(train_data) // k
num_epochs = 300
all_mae_histories = []

for i in range(k):
    print('processing fold #', i)
    val_data = train_data[i*num_val_samples: (i+1)*num_val_samples]
    val_targets = train_targets[i*num_val_samples: (i+1)*num_val_samples]

    partial_train_data = np.concatenate([train_data[:i*num_val_samples],
                                       train_data[(i+1)*num_val_samples:]],
                                      axis=0)
    partial_train_targets = np.concatenate([train_targets[:i*num_val_samples],
                                          train_targets[(i+1)*num_val_samples:]],
                                         axis=0)


    model = build_model() # instantiate a fresh, untrained model for this fold

    history = model.fit(partial_train_data,
                        partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs,
                        batch_size=16,
                        verbose=0) # train silently, without per-epoch logging
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)
processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3

Compute the per-epoch mean of the validation MAE values recorded across the folds.

In [10]:
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
  • Plotting validation scores

At first the model's predictions are far off, but the error drops quickly; once it has dropped, the shape of the score curve is hard to make out at this scale.

In [11]:
import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history)+1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

Take a moving (exponential) average so the curve looks smoother.

From a certain point on, the validation MAE starts to trend upward again.

In [12]:
def smooth_curve(points, factor=0.9):
  smoothed_points = []
  for point in points:
    if smoothed_points:
      previous = smoothed_points[-1]
      smoothed_points.append(previous*factor + point*(1-factor))
    else:
      smoothed_points.append(point)
  return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history)+1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

Exercise

  • We found that validation MAE stops improving at some point.
  • Write code to train a final production model on all of the training data and then look at its performance on the test data.

The final model is built by training on all of the training data with the hyperparameters we settled on, and is then used to predict on the test set; a sketch is given below.
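A minimal sketch of the exercise (it assumes the normalized train_data/test_data and the build_model function from above; best_epoch is an illustrative name, chosen here as the epoch where the averaged validation MAE is lowest):

# epoch at which the cross-validated MAE bottomed out
best_epoch = int(np.argmin(average_mae_history)) + 1

# train a fresh model on ALL of the training data for that many epochs
model = build_model()
model.fit(train_data, train_targets, epochs=best_epoch, batch_size=16, verbose=0)

# evaluate the final model on the test set
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets, verbose=0)
print('Test MAE:', test_mae_score)  # still in thousands of dollars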

Sangheum Hwang [deep learning class]

