Classifying movie reviews: a binary classification problem¶
The IMDB dataset¶
- A set of 50,000 highly polarized reviews (positive and negative) from the Internet Movie Database
- 25,000 reviews for training and 25,000 reviews for testing
- Each set consists of 50% negative and 50% positive reviews
Why use separate training and test sets?
Loading the IMDB dataset
- The argument num_words=10000 means that we only keep the top 10,000 most frequently occurring words in the training data.
Classify movie reviews into two groups.¶
Half of the data is used for training and half for testing.¶
import tensorflow
tensorflow.__version__
Load the data, keeping only the 10,000 most frequent words.¶
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
- train_data and test_data are lists of reviews.
  - Each review is a list of word indices.
- train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.
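train_data[0][:10] # the first of the 25,000 reviews; each number corresponds to a word in the vocabulary
train_labels[0] # 1 means this is a positive review
max([max(sequence) for sequence in train_data]) # no index reaches 10,000 because only the top 10,000 words are kept
word_index = imdb.get_word_index() # dictionary mapping each word to its integer index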
# how to decode one of reviews back to English words
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i-3, '?') for i in train_data[0]]) # i-3: indices 0, 1, and 2 are reserved for padding, start of sequence, and unknown
print(decoded_review) # print the decoded review as readable text
reverse_word_index[100] # the word stored under index 100 in word_index
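The offset of 3 used in the decoding step can be illustrated with a minimal sketch. It assumes a made-up four-word vocabulary (toy_word_index) in place of the real imdb.get_word_index() result, and the helper names encode and decode are hypothetical. The reserved indices 0, 1, and 2 follow the defaults of imdb.load_data (padding, start of sequence, unknown word).

```python
# Hypothetical miniature vocabulary standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
RESERVED = 3  # indices 0, 1, 2 are reserved: padding, start of sequence, unknown


def encode(words, word_index):
    """Turn words into indices the way the dataset stores them (shifted by 3)."""
    # 1 marks the start of the sequence; 2 stands for an unknown word
    return [1] + [word_index[w] + RESERVED if w in word_index else 2 for w in words]


def decode(indices, word_index):
    """Map indices back to words; reserved indices become '?'."""
    reverse = {idx: word for word, idx in word_index.items()}
    return " ".join(reverse.get(i - RESERVED, "?") for i in indices)


encoded = encode(["the", "movie", "was", "great"], toy_word_index)
decoded = decode(encoded, toy_word_index)
print(encoded)  # [1, 4, 5, 6, 7]
print(decoded)  # ? the movie was great
```

The leading '?' comes from the start-of-sequence marker, which has no entry in the vocabulary.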
Preparing the data¶
- We have to turn our lists into tensors.
  - Pad the lists so that they all have the same length. --> Turn them into an integer tensor of shape (samples, word_indices).
  - One-hot encode the lists to turn them into vectors of 0s and 1s.
    - For example, the sequence [3, 5] --> a 10,000-dimensional vector that would be all 0s except for indices 3 and 5, which would be 1s.
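The two options above can be compared on a toy input. This is only a sketch: the two short sequences and the dimension of 8 are made up to keep the output readable, and the padding here appends zeros at the end, which is one possible convention.

```python
import numpy as np

# Two made-up sequences of word indices with different lengths
sequences = [[3, 5], [1, 2, 3, 4]]

# Option 1: pad to a common length -> integer tensor of shape (samples, word_indices)
max_len = max(len(seq) for seq in sequences)
padded = np.zeros((len(sequences), max_len), dtype="int64")
for i, seq in enumerate(sequences):
    padded[i, :len(seq)] = seq  # zeros fill the unused positions at the end

# Option 2: one-hot encode -> float tensor of shape (samples, dimension)
dimension = 8  # small stand-in for the 10,000 used in this post
encoded = np.zeros((len(sequences), dimension))
for i, seq in enumerate(sequences):
    encoded[i, seq] = 1.0  # set every index that occurs in the sequence to 1

print(padded)      # [[3 5 0 0], [1 2 3 4]]
print(encoded[0])  # [0. 0. 0. 1. 0. 1. 0. 0.]
```

Note that the one-hot form discards word order and repetition: it only records which words occur.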
Represent the data as tensors using one-hot encoding.¶
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
x_train[0]
x_train.shape # 25,000 reviews x 10,000 words
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
y_train # label values of the training set
Building the network¶
The sigmoid and ReLU functions.¶
- Note that the input data is vectors, and the labels are scalars.
- Consider a simple stack of fully connected (Dense) layers with relu activations.
Dense(16, activation='relu')
- The argument 16 represents the number of hidden units (nodes) of the layer.
  - A hidden unit is a dimension in the representation space of the layer.
  - Recall that output = relu(dot(W, input) + b).
  - Then, what is the shape of the weight matrix W?
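One way to check the shape of W is to write the formula out in NumPy. The sketch below assumes the convention of the formula above, dot(W, input), under which W has shape (16, 10000) for a 10,000-dimensional input. Keras stores the Dense kernel the other way around, with shape (input_dim, units), and multiplies the input from the left. The random values are placeholders, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units = 10000, 16

x = rng.random(input_dim)                       # one vectorized review (placeholder values)
W = rng.normal(size=(units, input_dim)) * 0.01  # weight matrix: one row per hidden unit
b = np.zeros(units)                             # one bias per hidden unit


def relu(z):
    return np.maximum(z, 0.0)


output = relu(np.dot(W, x) + b)
print(W.shape, output.shape)  # (16, 10000) (16,)
```

The layer therefore has 16 * 10000 weights plus 16 biases, and its output lives in a 16-dimensional representation space.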
The dimensionality of the representation space = How much freedom you are allowing the network to have when learning internal representations
There are two key architecture decisions about a stack of Dense layers:
- How many layers to use
- How many hidden units to choose for each layer
In this example, we will use:
- Two intermediate layers with 16 hidden units each
- A third layer that will output the scalar prediction regarding the sentiment of the input review
The intermediate layers will use relu as their activation function, and the final layer will use a sigmoid activation so as to output a probability.
- A relu (rectified linear unit) is a function meant to zero out negative values, $f(x)=x^+=\max(0,x)$
- A sigmoid squashes arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability, $f(x)=1/({1+\exp(-x)})$
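Both activation functions are short enough to write directly from their formulas. This is a minimal NumPy sketch for illustration, not the implementation Keras uses internally; the input values are arbitrary.

```python
import numpy as np


def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(x, 0.0)


def sigmoid(x):
    # 1 / (1 + exp(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))


values = np.array([-2.0, 0.0, 3.0])
relu_out = relu(values)
sigmoid_out = sigmoid(values)
print(relu_out)     # [0. 0. 3.]
print(sigmoid_out)  # the middle entry is exactly 0.5
```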
The final network
The number of layers and the number of nodes are hyperparameters.¶
from tensorflow.keras import models
from tensorflow.keras import layers
model = models.Sequential() # create the model instance
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) # 16 nodes, relu, input declared as 10,000-dimensional
model.add(layers.Dense(16, activation='relu')) # 16 nodes, relu
model.add(layers.Dense(1, activation='sigmoid')) # 1 node, sigmoid
- Why are the activation functions necessary?
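Without an activation, a layer's output is just a linear combination of its inputs; passing it through a nonlinear function lets the network separate the classes.¶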
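The point about linear combinations can be checked numerically: two stacked layers without an activation are equivalent to a single linear layer. The sketch below uses small random matrices as placeholders and leaves out the biases for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)         # placeholder input vector
W1 = rng.normal(size=(4, 5))   # first "layer" without activation
W2 = rng.normal(size=(3, 4))   # second "layer" without activation

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x  # a single matrix does the same job
collapses = bool(np.allclose(two_linear_layers, one_linear_layer))
print(collapses)  # True

# With relu between the layers, the stack is no longer a single matrix product
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
print(with_relu)
```

Adding more layers without activations would therefore not extend what the network can represent.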
Loss function: in binary classification the ground truth y is either 0 or 1.¶
Finally, we need to choose a loss function and an optimizer.
- We are dealing with a binary classification problem and the output of our network is a probability. --> binary_crossentropy loss
binary_crossentropy(y_pred, y) = -(y*log(y_pred) + (1-y)*log(1-y_pred))
def binary_crossentropy(y_pred, y):
    if y == 1:
        return -log(y_pred)
    else:
        return -log(1 - y_pred)
- Note that it is not the only viable choice. For example, mean_squared_error.
- crossentropy measures the distance between probability distributions or, in this case, between the ground-truth distribution and the predictions.
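The formula above can be turned into a small NumPy function and compared with mean squared error. This is a sketch for illustration: the clipping constant eps is an assumption added here to avoid log(0), and the predictions are made-up numbers.

```python
import numpy as np


def binary_crossentropy(y_pred, y, eps=1e-7):
    # Clip so that log() never receives exactly 0 or 1 (eps is an assumed constant)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))))


def mean_squared_error(y_pred, y):
    return float(np.mean((y_pred - y) ** 2))


y = np.array([1.0, 0.0])                # ground truth: one positive, one negative
confident_right = np.array([0.9, 0.1])  # made-up predictions
confident_wrong = np.array([0.1, 0.9])

bce_right = binary_crossentropy(confident_right, y)
bce_wrong = binary_crossentropy(confident_wrong, y)
mse_right = mean_squared_error(confident_right, y)
mse_wrong = mean_squared_error(confident_wrong, y)
print(bce_right, bce_wrong)  # about 0.105 and 2.303
print(mse_right, mse_wrong)  # about 0.01 and 0.81
```

Cross-entropy grows without bound as a confident prediction becomes more wrong, whereas the squared error of a probability can never exceed 1.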
Optimizer: we use rmsprop, a gradient-descent-based optimization method.¶
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy']) # it is a classification problem, so we track accuracy
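To give an idea of what a gradient-descent-based optimizer such as rmsprop does, here is a rough single-parameter sketch of the usual RMSprop update rule: keep a running average of squared gradients and divide each step by its square root. The function name rmsprop_minimize, the toy objective f(w) = w^2, and the hyperparameter values are assumptions chosen for illustration; this is not the Keras implementation.

```python
import math


def rmsprop_minimize(grad_fn, w, learning_rate=0.1, rho=0.9, epsilon=1e-7, steps=500):
    """Minimize a one-dimensional function given its gradient (illustrative only)."""
    v = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        v = rho * v + (1.0 - rho) * g * g
        w = w - learning_rate * g / (math.sqrt(v) + epsilon)
    return w


# Toy objective f(w) = w**2 with gradient 2*w; the minimum is at w = 0
w_final = rmsprop_minimize(lambda w: 2.0 * w, w=5.0)
print(w_final)  # ends up close to 0
```

Dividing by the running gradient magnitude makes the step size depend mostly on learning_rate rather than on the raw size of the gradient.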
- Configuring the optimizer or using custom losses and metrics
from tensorflow.keras import optimizers
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), # the lr argument is deprecated; use learning_rate
              loss='binary_crossentropy',
              metrics=['accuracy'])
from tensorflow.keras import losses
from tensorflow.keras import metrics
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])
Validation¶
Set aside 10,000 samples for validation and train on the rest.¶
- In order to monitor the accuracy of the model during training, we will create a validation set by splitting 10,000 samples from the original training data.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
- Training the model for 20 epochs in mini-batches of 512 samples.
- At the same time, we will monitor loss and accuracy on the 10,000 samples in a validation set.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20, # every sample is seen 20 times
                    batch_size=512, # each parameter update uses the average loss over a mini-batch of 512 randomly drawn samples
                    validation_data=(x_val, y_val))
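The epochs and batch_size settings determine how many parameter updates are performed. The arithmetic below uses the numbers from this post (25,000 training reviews minus 10,000 validation samples) and assumes the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_train_samples = 25000 - 10000  # samples left after the validation split
batch_size = 512
epochs = 20

updates_per_epoch = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, last_batch_size, total_updates)  # 30 152 600
```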
The average loss decreases and the accuracy steadily improves.¶
- Note that the call to model.fit() returns a History object.
  - This has a member history, which is a dictionary containing data about everything that happened during training.
history_dict = history.history
history_dict.keys()
- Using this information, we can plot the training and validation loss (or accuracy).
How the loss changes on the training and validation data. Always check this!¶
import matplotlib.pyplot as plt
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
The validation loss keeps rising while the training loss keeps falling, which shows that the model is overfitting.¶
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
- The training loss decreases with every epoch, and the training accuracy increases with every epoch.
- This is what we expect when running gradient-descent optimization.
- But that isn't the case for the validation loss and accuracy.
- They seem to peak at the fourth epoch.
Important note: a model that performs better on the training data isn't necessarily a model that will do better on data it has never seen before.
- overfitting
- After the second epoch, the network is overoptimized on the training data, and it learned representations that are specific to the training data.
- These representations don't generalize to data outside of the training set.
To prevent overfitting, we could stop training after three epochs.
- We will learn various techniques to mitigate overfitting later.
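Choosing where to stop can also be read off the recorded validation loss rather than from the plot. The sketch below uses made-up loss values (NOT results from the model in this post) and a hypothetical helper named best_epoch; with a real run, history.history would be passed in instead.

```python
# Made-up validation losses for illustration only; not measured results
toy_history = {
    "val_loss": [0.40, 0.31, 0.29, 0.28, 0.30, 0.33, 0.37],
}


def best_epoch(history_dict, key="val_loss"):
    """Return the 1-based epoch with the lowest value under `key`."""
    values = history_dict[key]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1  # epochs are numbered from 1


chosen = best_epoch(toy_history)
print(chosen)  # 4
```

Keras also provides an EarlyStopping callback that monitors a quantity such as val_loss during training, which automates the same idea.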
So about 4 epochs should be enough. Retraining from scratch with that setting:¶
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)
results # [cross-entropy loss, accuracy]
Prediction on new data with a trained network¶
Each output is the probability that the review is positive.¶
model.predict(x_test) # predictions for new data
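Since the network outputs probabilities, a threshold is needed to obtain class labels. The sketch below assumes a threshold of 0.5 and uses made-up probabilities shaped like the output of model.predict, one column per sample; they are not actual predictions.

```python
import numpy as np

# Made-up probabilities with shape (samples, 1); not real model output
toy_probabilities = np.array([[0.98], [0.03], [0.51], [0.49]])

# 1 = positive review, 0 = negative review
predicted_labels = (toy_probabilities > 0.5).astype("int32").ravel()
print(predicted_labels)  # [1 0 1 0]
```

Values near 0.5, like the last two, are predictions the network is not confident about.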
Further experiments¶
- Try using one or three hidden layers.
- Try using layers with more hidden units or fewer hidden units.
- Try using the mse loss function instead of binary_crossentropy
- Try using other activation functions (e.g. tanh, sigmoid) instead of relu
- Try other optimizers instead of rmsprop
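The experiments listed above can be organized as a grid of configurations. This is only a bookkeeping sketch: the search_space values are examples picked from the list, iter_configs is a hypothetical helper, and each resulting dictionary would still have to be passed to model-building and training code like that shown earlier.

```python
from itertools import product

# Example values taken from the suggestions above (an assumed, partial grid)
search_space = {
    "hidden_layers": [1, 3],
    "units": [8, 64],
    "loss": ["mse", "binary_crossentropy"],
    "activation": ["tanh", "relu"],
}


def iter_configs(space):
    """Yield one dictionary per combination of settings."""
    keys = list(space)
    for combo in product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))


configs = list(iter_configs(search_space))
print(len(configs))  # 16
print(configs[0])    # {'hidden_layers': 1, 'units': 8, 'loss': 'mse', 'activation': 'tanh'}
```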
SangheomHwang [deep learning class]