Predicting house prices: a regression example¶
- Predicting a continuous value instead of a discrete label
The Boston Housing Price dataset¶
- We want to predict the median price of homes in a given Boston suburb in the mid-1970s, given the crime rate, the local property tax rate, and so on.
- It has relatively few data points: only 506 (404 training samples and 102 test samples).
- Each feature in the input has a different scale.
- For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, and so on.
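One quick way to see the scale mismatch is to look at the minimum and maximum of each column. The sketch below uses a small made-up array rather than the real dataset, and `feature_ranges` is a hypothetical helper name, not part of Keras or this notebook.

```python
import numpy as np

def feature_ranges(data):
    """Return the per-column minimum and maximum of a 2-D array."""
    data = np.asarray(data, dtype=float)
    return data.min(axis=0), data.max(axis=0)

# Made-up samples: column 0 is a proportion, column 1 lies between 1 and 12,
# column 2 is in the hundreds.
toy = np.array([[0.10, 1.0, 300.0],
                [0.55, 12.0, 650.0],
                [0.90, 6.0, 420.0]])

lo, hi = feature_ranges(toy)
for j, (a, b) in enumerate(zip(lo, hi)):
    print(f"feature {j}: min={a:g}, max={b:g}")
```

The same call on `train_data` would show the range of each of the 13 features.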
How should we handle the different scale of each attribute?¶
A dataset for predicting the median home price¶
In [1]:
from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
About 400 training samples and 102 test samples, each with 13 columns¶
In [2]:
train_data.shape
Out[2]:
In [3]:
train_data[0]  # Note how much the scales differ between features.
Out[3]:
In [4]:
test_data.shape
Out[4]:
In [5]:
train_targets[:30]
Out[5]:
Preparing the data¶
- It would be problematic to feed into a neural network values that all take wildly different ranges.
- Let's do feature-wise normalization
- For each feature, we subtract the mean of the feature and divide by the standard deviation.
- Then, the feature is centered around 0 and has a unit standard deviation.
Standardizing (normalizing) the features¶
In [6]:
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data -= mean
train_data /= std
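# Never compute these statistics on the test set: it may only be used to evaluate the model. Reuse the mean and standard deviation obtained from the training set.
test_data -= mean  # The test data must be normalized as well; otherwise the predictions will be off.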
test_data /= std
- Note that the quantities used for normalizing the test data are computed using the training data.
- NEVER use any quantity computed on the test data, even for something as simple as data normalization.
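The rule above can be sketched as a small helper that computes the statistics once, on the training array only, and applies them to both arrays. This is an illustration on synthetic data, not the notebook's code: `standardize` and the `eps` guard against constant columns are assumptions added here, and unlike the in-place version above it returns new arrays.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Standardize both arrays using statistics computed on `train` only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Assumption (not in the notebook): avoid dividing by zero
    # when a column is constant in the training data.
    std = np.where(std < eps, 1.0, std)
    return (train - mean) / std, (test - mean) / std

# Synthetic data with two features on very different scales.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[0.5, 300.0], scale=[0.1, 50.0], size=(200, 2))
toy_test = rng.normal(loc=[0.5, 300.0], scale=[0.1, 50.0], size=(50, 2))

train_z, test_z = standardize(toy_train, toy_test)
print("train mean/std:", train_z.mean(axis=0), train_z.std(axis=0))
print("test mean/std: ", test_z.mean(axis=0), test_z.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1. The test columns are only approximately so, which is expected because their own statistics were never used.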
Building the network¶
In [7]:
from tensorflow.keras import models
from tensorflow.keras import layers
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))  # A single output node with no activation function, so the output range is not constrained.
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])  # Loss function: MSE; MAE is used for monitoring.
    return model
- This network ends with a single unit and no activation (it is called a linear layer).
- mse (the loss): mean squared error, the mean of the squared differences between the predictions and the targets.
- mae (for monitoring): mean absolute error, the mean of the absolute differences between the predictions and the targets.
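To make the two quantities concrete, here is a minimal NumPy sketch that computes them by hand on made-up numbers. The function names `mse` and `mae` are chosen here for illustration; Keras computes the same quantities internally when given `loss='mse'` and `metrics=['mae']`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    """Mean of the absolute differences between predictions and targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Made-up targets and predictions, with errors of +2, -1 and -3.
targets = [20.0, 25.0, 30.0]
predictions = [22.0, 24.0, 27.0]

print("MSE:", mse(targets, predictions))  # (4 + 1 + 9) / 3
print("MAE:", mae(targets, predictions))  # (2 + 1 + 3) / 3
```

MAE is in the same units as the targets, which makes it easier to read than MSE when monitoring training.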
Validation with k-fold cross validation technique¶
Since we have few data points, the validation set would be very small if we randomly split the data into a training set and a validation set.
- It means that the validation scores might change a lot depending on which data points we chose for the validation.
- We can say that the validation scores might have a high variance with regard to the validation split.
The best practice in such situations is to use k-fold cross-validation.
- It consists of splitting the available data into k partitions, instantiating k identical models, and training each one on k-1 partitions while evaluating on the remaining partition.
- The validation score for the model used is then the average of the k validation scores obtained.
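The partitioning itself can be sketched without any model. The generator below, under the hypothetical name `kfold_indices`, yields the training and validation row indices for each fold using contiguous, unshuffled slices of equal size. With integer division, any leftover rows would always land in the training part.

```python
import numpy as np

def kfold_indices(num_samples, k):
    """Yield (train_idx, val_idx) index arrays for k contiguous folds."""
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        val_idx = indices[start:stop]
        # Everything outside the validation slice is used for training.
        train_idx = np.concatenate([indices[:start], indices[stop:]])
        yield train_idx, val_idx

# 404 training samples split into 4 folds, as in this post.
folds = list(kfold_indices(404, 4))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
```

Each fold validates on a different quarter of the data, so every sample is used for validation exactly once.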
In [8]:
import numpy as np
k = 4  # Split the training data into 4 folds.
num_val_samples = len(train_data) // k
num_epochs = 300
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Validation data: the i-th partition.
    val_data = train_data[i*num_val_samples: (i+1)*num_val_samples]
    val_targets = train_targets[i*num_val_samples: (i+1)*num_val_samples]
    # Training data: all the other partitions.
    partial_train_data = np.concatenate([train_data[:i*num_val_samples],
                                         train_data[(i+1)*num_val_samples:]],
                                        axis=0)
    partial_train_targets = np.concatenate([train_targets[:i*num_val_samples],
                                            train_targets[(i+1)*num_val_samples:]],
                                           axis=0)
    model = build_model()  # Start from a fresh model for every fold.
    history = model.fit(partial_train_data,
                        partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs,
                        batch_size=16,
                        verbose=0)  # Silent mode: print nothing during training.
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)
Averaging the stored per-epoch results across the folds¶
In [10]:
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
- Plotting validation scores
At first the predictions are far off, but the error drops quickly. After that initial drop, the shape of the curve is hard to make out¶
In [11]:
import matplotlib.pyplot as plt
plt.plot(range(1, len(average_mae_history)+1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
Take a moving average to get a smoother curve¶
From some point on, the error appears to start increasing.¶
In [12]:
def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous*factor + point*(1-factor))
        else:
            smoothed_points.append(point)
    return smoothed_points
smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.plot(range(1, len(smooth_mae_history)+1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
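Instead of reading the turning point off the plot, one could locate the minimum of the smoothed curve programmatically. The sketch below repeats the exponential smoothing from above and adds a hypothetical helper, `best_epoch`, that converts the position of the minimum back to a 1-based epoch number. The curve is synthetic, not a result from this notebook. Because the smoothing lags behind the raw values, the reported epoch tends to fall somewhat later than the true minimum, so treat it as a rough guide.

```python
import numpy as np

def smooth_curve(points, factor=0.9):
    """Exponential moving average, as used for the plot above."""
    smoothed = []
    for point in points:
        if smoothed:
            smoothed.append(smoothed[-1] * factor + point * (1 - factor))
        else:
            smoothed.append(point)
    return smoothed

def best_epoch(mae_history, skip=10, factor=0.9):
    """Return the 1-based epoch with the lowest smoothed validation MAE.

    The first `skip` epochs are dropped before smoothing, mirroring
    average_mae_history[10:] in the plotting cell.
    """
    smoothed = smooth_curve(mae_history[skip:], factor)
    return skip + int(np.argmin(smoothed)) + 1

# Synthetic curve (not real results): falls until epoch 50, then rises.
epochs = np.arange(1, 101)
toy_mae = 2.5 + 0.0004 * (epochs - 50) ** 2
print("epoch with lowest smoothed MAE:", best_epoch(list(toy_mae)))
```

With the notebook's variables, the call would be `best_epoch(average_mae_history)`.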
Exercise¶
- We found that the validation MAE stops improving at some point.
- Write code to train a final production model on all of the training data, and then look at its performance on the test data.
The final model is trained on the entire training set with the hyperparameters chosen above, and is then used to make predictions on the test set.¶
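One possible shape for a solution is sketched below. It is a hedged outline, not the course's reference answer: `train_final_model` is a hypothetical helper, and it assumes `build_fn` returns a compiled Keras-style model whose `evaluate` returns `[loss, mae]`, as `build_model` above does when compiled with `loss='mse'` and `metrics=['mae']`.

```python
def train_final_model(build_fn, train_x, train_y, test_x, test_y,
                      epochs, batch_size=16):
    """Train a fresh model on all training data, then score it on the test set.

    Assumes `build_fn()` returns a compiled model exposing Keras-style
    `fit` and `evaluate` methods, with `evaluate` returning [loss, mae].
    """
    model = build_fn()
    # No validation split here: every training sample is used for fitting.
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=0)
    # The test set is touched only once, for the final evaluation.
    test_mse, test_mae = model.evaluate(test_x, test_y, verbose=0)
    return model, test_mse, test_mae
```

In the notebook this might be called as `model, test_mse, test_mae = train_final_model(build_model, train_data, train_targets, test_data, test_targets, epochs=chosen_epochs)`, where `chosen_epochs` is a placeholder for the epoch count you pick from the smoothed validation curve.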
SangheumHwang[deep learning class]