Classifying movie reviews: a binary classification problem¶
The IMDB dataset¶
- A set of 50,000 highly polarized reviews (positive and negative) from the Internet Movie Database
- 25,000 reviews for training and 25,000 reviews for testing
- Each set consists of 50% negative and 50% positive reviews
Why use separate training and test sets?
Loading the IMDB dataset
- The argument num_words=10000 means that we only keep the top 10,000 most frequently occurring words in the training data.
Classify movie reviews into two groups.¶
Half of the data is used for training and half for testing.¶
import tensorflow
Load the data, keeping only the 10,000 most frequent words.¶
from tensorflow.keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
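- train_data and test_data are lists of reviews.
  - Each review is a list of word indices.
- train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.
train_data[0][:10] # the first of the 25,000 training reviews; each number is the index of a word in the vocabulary
train_labels[0] # 1, meaning this review is positive
max([max(sequence) for sequence in train_data]) # no index exceeds 9,999 because the vocabulary is limited to 10,000 words
word_index = imdb.get_word_index() # dictionary mapping each word to its integer index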
# how to decode one of reviews back to English words
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
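decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]]) # i - 3: indices 0, 1 and 2 are reserved for padding, start of sequence and unknown, so real words start at index 3
print(decoded_review) # print the decoded review as readable text
reverse_word_index[100] # the word whose index in word_index is 100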
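The offset of 3 used when decoding can be checked on a toy vocabulary, without downloading anything. The sketch below assumes a made-up three-word `toy_word_index` (hypothetical, not the real IMDB index) and a hypothetical helper `decode`; it follows the same convention as the decoding line above, where encoded indices 0, 1 and 2 are reserved.

```python
# Toy word index, ranked by frequency starting at 1 (hypothetical data).
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Invert the mapping: integer index -> word.
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode(encoded_review, reverse_index):
    """Map encoded indices back to words; reserved or unknown indices become '?'."""
    return " ".join(reverse_index.get(i - 3, "?") for i in encoded_review)

# 1 is a reserved index (start of sequence); 4, 5, 6 map to "the", "movie", "great".
decoded = decode([1, 4, 5, 6], toy_reverse_index)
print(decoded)
```

The reserved start-of-sequence marker shows up as `?`, which is why decoded IMDB reviews typically begin with a question mark.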
Preparing the data¶
- We have to turn our lists into tensors.
  - Pad the lists so that they all have the same length. --> Turn them into an integer tensor of shape (samples, word_indices).
  - One-hot encode the lists to turn them into vectors of 0s and 1s.
    - For example, the sequence [3, 5] --> a 10,000-dimensional vector that would be all 0s except for indices 3 and 5, which would be 1s.
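The padding option is not used in the rest of this post, so here is a minimal sketch of what it could look like. The helper name `pad_sequences_simple` and the choice to pad on the right with zeros are assumptions made for illustration; library utilities may pad on the other side by default.

```python
import numpy as np

def pad_sequences_simple(sequences, maxlen, pad_value=0):
    """Right-pad (or truncate) each list to maxlen and stack them into an integer tensor."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype="int64")
    for row, sequence in enumerate(sequences):
        trimmed = sequence[:maxlen]           # drop anything beyond maxlen
        padded[row, :len(trimmed)] = trimmed  # the rest of the row keeps pad_value
    return padded

# Two toy "reviews" of different lengths (hypothetical data).
toy_reviews = [[3, 5], [7, 2, 9, 4]]
padded = pad_sequences_simple(toy_reviews, maxlen=4)
print(padded.shape)
```

The result has shape (samples, word_indices), which is the integer tensor described in the first option of the list.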
Represent the data as tensors: one-hot encoding!¶
import numpy as np
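def vectorize_sequences(sequences, dimension=10000):
    # all-zero matrix of shape (number of reviews, vocabulary size)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # set the columns of the words that occur in review i to 1
        results[i, sequence] = 1.
    return results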
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
x_train.shape # 25,000 reviews x 10,000 words
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
y_train # labels of the training set
Building the network¶
The sigmoid and ReLU functions.¶
- Note that the input data is vectors, and the labels are scalars.
Consider a simple stack of fully connected (Dense) layers with relu activations: Dense(16, activation='relu')
- The argument 16 represents the number of hidden units (nodes) of the layer.
  - A hidden unit is a dimension in the representation space of the layer.
  - Recall that output = relu(dot(W, input) + b).
  - Then, what is the shape of the weight matrix W?
The dimensionality of the representation space = How much freedom you are allowing the network to have when learning internal representations
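One way to reason about the shape of W is to run the formula output = relu(dot(W, input) + b) in NumPy. This is a sketch under assumptions: the weights are random rather than learned, and the variable names are illustrative. With a 10,000-dimensional input and 16 hidden units, W written this way has shape (16, 10000).

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_units = 10000, 16

# W maps a 10,000-dimensional input to a 16-dimensional representation.
W = rng.normal(scale=0.01, size=(hidden_units, input_dim))
b = np.zeros(hidden_units)

# One encoded review that contains word indices 3 and 5.
x = np.zeros(input_dim)
x[[3, 5]] = 1.0

# output = relu(dot(W, input) + b)
output = np.maximum(0.0, np.dot(W, x) + b)
print(W.shape, output.shape)
```

Keras itself stores the Dense kernel transposed, with shape (input_dim, units), and computes dot(input, kernel); the number of weights is the same either way.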
There are two key architecture decisions about a stack of Dense layers:
- How many layers to use
- How many hidden units to choose for each layer
In this example, we will use:
- Two intermediate layers with 16 hidden units each
- A third layer that will output the scalar prediction regarding the sentiment of the input review
The intermediate layers will use relu as their activation function, and the final layer will use a sigmoid activation so as to output a probability.
- A relu (rectified linear unit) is a function meant to zero out negative values, $f(x)=x^+=\max(0,x)$
- A sigmoid squashes arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability, $f(x)=1/(1+\exp(-x))$
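Both activation functions are short enough to write out directly. The following is a minimal NumPy sketch of the two formulas, not the implementation Keras uses internally.

```python
import numpy as np

def relu(x):
    """Zero out negative values: max(0, x)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Squash arbitrary values into the [0, 1] interval: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

values = np.array([-2.0, 0.0, 3.0])
print(relu(values))
print(sigmoid(values))
```

Negative inputs become 0 under relu, while sigmoid maps 0 to exactly 0.5, the natural decision threshold for a binary classifier.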
The final network
The number of layers and the number of nodes are hyperparameters.¶
from tensorflow.keras import models
from tensorflow.keras import layers
model = models.Sequential() # create a model instance
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) # 16 units, relu, 10,000-dimensional input
model.add(layers.Dense(16, activation='relu')) # 16 units, relu
model.add(layers.Dense(1, activation='sigmoid')) # 1 unit, sigmoid
- Why are the activation functions necessary?
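Without an activation, the output of a layer is always a linear combination of its inputs; a nonlinear function transforms it so that the classes can be separated.¶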
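The need for activation functions can be checked numerically: two stacked layers without an activation compute exactly the same function as a single linear layer. The sketch below uses small random matrices (sizes and names are arbitrary, chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x = rng.normal(size=6)

# Two layers applied one after the other, with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation expressed as one linear layer with merged weights.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

print(np.allclose(two_layers, one_layer))
```

Because the stack collapses into one linear map, adding layers would not enlarge the set of functions the network can represent unless a nonlinearity such as relu sits between them.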
Loss function: in binary classification, the ground truth y is either 0 or 1.¶
Finally, we need to choose a loss function and an optimizer.
We are dealing with a binary classification problem and the output of our network is a probability. --> binary_crossentropy loss
binary_crossentropy(y_pred, y) = -(y*log(y_pred) + (1-y)*log(1-y_pred))
from math import log
def binary_crossentropy(y_pred, y):
    # y is the ground-truth label (0 or 1); y_pred is the predicted probability of class 1
    if y == 1:
        return -log(y_pred)
    else:
        return -log(1 - y_pred)
Note that it is not the only viable choice. For example, mean_squared_error could also be used.
- Crossentropy measures the distance between probability distributions or, in this case, between the ground-truth distribution and the predictions.
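To see how crossentropy behaves as a distance between labels and predictions, the formula can be evaluated on a whole batch. This is a sketch, not the Keras implementation; the function name `binary_crossentropy_mean` and the clipping constant `eps` are assumptions added to avoid taking log(0).

```python
import numpy as np

def binary_crossentropy_mean(y_true, y_pred, eps=1e-7):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over a batch."""
    y_true = np.asarray(y_true, dtype="float64")
    # Keep predictions strictly inside (0, 1) so the logarithms stay finite.
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_sample.mean()

# Same labels, once with confident correct predictions and once with confident wrong ones.
confident_right = binary_crossentropy_mean([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy_mean([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalized far more heavily than confidently right ones, which is the behavior we want from a loss on probabilities.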
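Optimizer: we use rmsprop, a gradient-descent-based optimization method.¶
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy']) # this is a classification problem, so we track accuracy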
- Configuring the optimizer or using custom losses and metrics
from tensorflow.keras import optimizers
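model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), # learning rate (the older lr argument is deprecated)
              loss='binary_crossentropy',
              metrics=['accuracy'])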
from tensorflow.keras import losses
from tensorflow.keras import metrics
Set aside 10,000 samples for validation and train on the rest.¶
- In order to monitor the accuracy of the model during training, we will create a validation set by splitting 10,000 samples from the original training data.
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
- Training the model for 20 epochs in mini-batches of 512 samples.
- At the same time, we will monitor loss and accuracy on the 10,000 samples in a validation set.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20, # every sample is seen 20 times
                    batch_size=512, # each parameter update uses the average loss over a random mini-batch of 512 samples
                    validation_data=(x_val, y_val))
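The epoch and batch-size settings translate into a concrete number of parameter updates. The arithmetic below is a small sketch using the figures from this post: 25,000 training reviews, 10,000 of them held out for validation, mini-batches of 512 and 20 epochs.

```python
import math

num_train_samples = 25000 - 10000  # samples left after the validation split
batch_size = 512
epochs = 20

# One parameter update per mini-batch; the final batch of an epoch may be smaller.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```

So each epoch consists of 30 updates (29 full batches plus one of 152 samples), for 600 updates over the whole run.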
The average loss decreases and the accuracy gradually improves.¶
- Note that the call to model.fit() returns a History object.
  - This has a member history, which is a dictionary containing data about everything that happened during training.
history_dict = history.history
- Using this information, we can plot the training and validation loss (or accuracy).
How the loss changes on the training and validation data. Always check this!¶
import matplotlib.pyplot as plt
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
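plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend() # show the labels passed to plt.plot
plt.show()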
The validation loss keeps rising while the training loss keeps falling, which shows that the model is overfitting.¶
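acc_values = history_dict['accuracy'] # the key is 'acc' in older Keras versions
val_acc_values = history_dict['val_accuracy'] # the key is 'val_acc' in older Keras versions
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()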
- The training loss decreases with every epoch, and the training accuracy increases with every epoch.
- This is what we expect when running gradient-descent optimization.
- But that isn't the case for the validation loss and accuracy.
- They seem to peak at the fourth epoch.
Important note: a model that performs better on the training data isn't necessarily a model that will do better on data it has never seen before.
- overfitting
- After the second epoch, the network is overoptimized on the training data, and it learned representations that are specific to the training data.
- These representations don't generalize to data outside of the training set.
To prevent overfitting, we could stop training after three epochs.
- We will learn various techniques to mitigate overfitting later.
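Choosing when to stop can be read directly off the validation loss recorded during training. The sketch below uses a hypothetical helper `best_epoch` and made-up loss values for illustration only; they are not results from this post.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Hypothetical validation losses: they fall at first, then rise as overfitting sets in.
example_val_loss = [0.40, 0.31, 0.28, 0.29, 0.33, 0.38]
print(best_epoch(example_val_loss))
```

With a real training run, the same helper could be applied to `history.history['val_loss']` to pick the number of epochs for retraining.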
So about 4 epochs should be enough! Retraining from scratch with that setting:¶
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)
results # [cross-entropy loss, accuracy] on the test set
Prediction on new data with a trained network¶
The output is the probability that a review is positive.¶
model.predict(x_test) # predictions on new data
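Since the network outputs probabilities, a threshold is needed to turn them into class labels. The sketch below assumes the common cutoff of 0.5 and uses hypothetical probabilities with the (samples, 1) shape that a single sigmoid output unit produces.

```python
import numpy as np

# Hypothetical sigmoid outputs, for illustration only.
example_probabilities = np.array([[0.92], [0.08], [0.51], [0.49]])

# 1 = positive review, 0 = negative review.
predicted_labels = (example_probabilities > 0.5).astype("int32").ravel()
print(predicted_labels)
```

Values close to 0 or 1 indicate a confident prediction, while values near 0.5 (such as 0.51 and 0.49 here) mean the model is unsure.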
Further experiments¶
- Try using one or three hidden layers.
- Try using layers with more hidden units or fewer hidden units.
- Try using the mse loss function instead of binary_crossentropy.
- Try using other activation functions (e.g. tanh) instead of relu.
- Try other optimizers instead of rmsprop.
SangheomHwang [deep learning class]