Classifying movie reviews: a binary classification problem¶

The IMDB dataset¶

A set of 50,000 highly polarized reviews (positive and negative) from the Internet Movie Database
25,000 reviews for training and 25,000 reviews for testing
Each set consists of 50% negative and 50% positive reviews
Why use separate training and test sets?
Loading the IMDB dataset
- The argument num_words=10000 means that we only keep the top 10,000 most frequently occurring words in the training data.
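To make the effect of a vocabulary cutoff concrete, here is a minimal sketch on a toy corpus. The corpus and the helper names `build_vocab` and `encode` are hypothetical and exist only for illustration; this is not how Keras implements `num_words` internally. Words that fall outside the most frequent ranks are replaced by a single "unknown" index.

```python
from collections import Counter

# Toy corpus standing in for the review texts (hypothetical data).
reviews = [
    "great film great cast",
    "great film bad",
    "film great story bad",
]

def build_vocab(texts, num_words):
    """Map the num_words most frequent words to ranks 1..num_words (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(text, vocab, unknown=0):
    """Replace each word by its rank, or by `unknown` if it was cut from the vocabulary."""
    return [vocab.get(word, unknown) for word in text.split()]

vocab = build_vocab(reviews, num_words=3)
encoded = encode("great story bad ending", vocab)
print(vocab)
print(encoded)
```

Rare words such as "story" and "ending" all collapse to the same index, which keeps the input dimension bounded at the cost of losing those words.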

Classify movie reviews into two groups.¶

Half of the data is for training and half is for testing.¶

import tensorflow
tensorflow.__version__

'2.1.0'

Load the data, keeping only the 10,000 most frequent words.¶

from tensorflow.keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 1s 0us/step

train_data and test_data are lists of reviews
- Each review is a list of word indices.
train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.

train_data[0][:10] # The first of the 25,000 reviews; each number is the index of a word in the vocabulary.

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

train_labels[0] # 1 means this is a positive review.

1

max([max(sequence) for sequence in train_data]) # The largest index is 9999, so no word index exceeds the 10,000-word limit.

9999

word_index = imdb.get_word_index() # Dictionary mapping each word to its integer index.

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step

# how to decode one of reviews back to English words
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) 
decoded_review = ' '.join([reverse_word_index.get(i-3, '?') for i in train_data[0]]) # i-3: indices 0, 1 and 2 are reserved (padding, start of sequence, unknown), so real words start at offset 3.
print(decoded_review) # Print the decoded review as readable text.

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

reverse_word_index[100] # The word whose index is 100.

'after'
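The offset of 3 in the decoding step can be checked on a miniature example. The word index below is hypothetical and stands in for `imdb.get_word_index()`; the helper name `decode` is also an assumption made for this sketch. It assumes the default Keras encoding, in which 1 marks the start of a review and 2 marks an out-of-vocabulary word.

```python
# Hypothetical miniature word index (word -> rank), standing in for imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}

# Invert it so that ranks map back to words.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded_review, reverse_index, offset=3):
    """Turn shifted indices back into words; reserved or unknown indices become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in encoded_review)

# 1 is the start marker and 2 the unknown-word marker; real words start at 1 + 3 = 4.
decoded = decode([1, 4, 5, 2, 6], toy_reverse_index)
print(decoded)
```

This also explains why the decoded review above begins with "?": its first index is the start marker, which has no entry in the word index.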

Preparing the data¶

We have to turn our lists into tensors.
- Pad the lists so that they all have the same length. --> Turn them into an integer tensor of shape (samples, word_indices).
- One-hot encode the lists to turn them into vectors of 0s and 1s.
  - For example, the sequence [3, 5] --> a 10,000-dimensional vector that would be all 0s except for indices 3 and 5, which would be 1s.
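The first option, padding, is not used in the rest of this post, so here is a minimal NumPy sketch of what it would look like. The helper name `pad_to_same_length` is hypothetical, and padding at the end with 0 is a choice made for simplicity here, not a statement about any library's default.

```python
import numpy as np

def pad_to_same_length(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype="int64")
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

toy_sequences = [[3, 5], [1, 2, 3, 4], [7]]
padded = pad_to_same_length(toy_sequences)
print(padded.shape)  # (samples, word_indices)
print(padded)
```

The result is an integer tensor whose rows still contain word indices, so it would normally feed a layer that can handle integer inputs rather than a Dense layer directly.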

Represent the data as tensors using one-hot encoding.¶

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    results[i, sequence] = 1.
  return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

x_train[0]

array([0., 1., 1., ..., 0., 0., 0.])

x_train.shape # 25,000 reviews x 10,000 words

(25000, 10000)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

y_train # Labels of the training set.

array([1., 0., 0., ..., 0., 1., 0.], dtype=float32)

Building the network¶

The sigmoid and relu functions.¶

Note that the input data consists of vectors, and the labels are scalars.
Consider a simple stack of fully connected (Dense) layers with relu activations.
- Dense(16, activation='relu')
- The argument 16 represents the number of hidden units (nodes) of the layer.
- A hidden unit is a dimension in the representation space of the layer.
- Recall that output = relu(dot(W, input) + b).
- Then, what is the shape of the weight matrix W?
The dimensionality of the representation space = How much freedom you are allowing the network to have when learning internal representations
There are two key architecture decisions about a stack of Dense layers:
- How many layers to use
- How many hidden units to choose for each layer
In this example, we will use:
- Two intermediate layers with 16 hidden units each
- A third layer that will output the scalar prediction regarding the sentiment of the input review
The intermediate layers will use relu as their activation function, and the final layer will use a sigmoid activation so as to output a probability.
- A relu (rectified linear unit) is a function meant to zero out negative values, $f(x)=x^+=\max(0,x)$
- A sigmoid squashes arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability, $f(x)=1/(1+\exp(-x))$
The final network

The number of layers and the number of nodes are hyperparameters.¶

from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential() # create a model instance
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) # 16 nodes, relu, input declared as 10,000-dimensional
model.add(layers.Dense(16, activation='relu')) # 16 nodes, relu
model.add(layers.Dense(1, activation='sigmoid')) # 1 node, sigmoid
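To answer the earlier question about the shape of W, the forward pass of this three-layer network can be sketched in plain NumPy. This is an illustration under the convention output = relu(dot(W, input) + b) used above, with random placeholder weights; it is not the trained Keras model, and Keras itself stores each Dense kernel in the transposed layout (input_dim, units).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One encoded review: a 10,000-dimensional vector with ones at the word indices.
x = np.zeros(10000)
x[[3, 5]] = 1.0

# With dot(W, input), each W has shape (units, input_dim). Values are random placeholders.
W1, b1 = rng.normal(scale=0.01, size=(16, 10000)), np.zeros(16)
W2, b2 = rng.normal(scale=0.01, size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.01, size=(1, 16)), np.zeros(1)

h1 = relu(np.dot(W1, x) + b1)     # first hidden representation, shape (16,)
h2 = relu(np.dot(W2, h1) + b2)    # second hidden representation, shape (16,)
p = sigmoid(np.dot(W3, h2) + b3)  # scalar prediction in (0, 1), shape (1,)

n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(W1.shape, h1.shape, h2.shape, p.shape)
print(n_params)
```

Almost all of the parameters sit in the first layer, because it has to map 10,000 input dimensions down to 16.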

Why are the activation functions necessary?

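Without an activation, a layer's output is only a linear combination of its inputs, so a nonlinear function is applied to make the classes separable.¶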
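A small numeric check shows why stacking layers without activations adds nothing. The matrices below are made-up toy values and biases are left out for brevity; the point is only that two linear layers equal one linear layer with the product matrix, while inserting relu breaks that equivalence.

```python
import numpy as np

# Toy weights and input (hypothetical values chosen so the effect is easy to see).
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

# Two linear layers applied one after the other.
stacked = np.dot(W2, np.dot(W1, x))

# A single linear layer whose weight matrix is the product W2 W1.
collapsed = np.dot(np.dot(W2, W1), x)

# The same two layers with a relu in between.
with_relu = np.dot(W2, np.maximum(0.0, np.dot(W1, x)))

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` agree, so the extra linear layer did not enlarge the set of functions the network can represent, whereas `with_relu` gives a different result.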

Loss function. In binary classification, the ground truth y is either 0 or 1.¶

Finally, we need to choose a loss function and an optimizer.
- We are dealing with a binary classification problem and the output of our network is a probability. --> binary_crossentropy loss
  - binary_crossentropy(y_pred, y) = -(y*log(y_pred) + (1-y)*log(1-y_pred))
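```
from math import log

def binary_crossentropy(y_pred, y):
    # y is the ground-truth label (0 or 1); y_pred is the predicted probability of class 1.
    if y == 1:
        return -log(y_pred)
    else:
        return -log(1 - y_pred)
```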
- Note that it is not the only viable choice. For example, mean_squared_error.
- crossentropy measures the distance between probability distributions or, in this case, between the ground-truth distribution and the predictions.
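To compare the two losses mentioned above on concrete numbers, here is a hedged NumPy sketch. The helper names and the clipping constant `eps` are assumptions made for this example (clipping only guards against log(0)); the predictions are invented.

```python
import numpy as np

def mean_binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predicted probabilities."""
    diff = np.asarray(y_true, dtype="float64") - np.asarray(y_pred, dtype="float64")
    return float(np.mean(diff ** 2))

y_true = [1, 0, 1]
confident_right = [0.9, 0.1, 0.9]  # made-up predictions close to the labels
confident_wrong = [0.1, 0.9, 0.1]  # made-up predictions far from the labels

bce_right = mean_binary_crossentropy(y_true, confident_right)
bce_wrong = mean_binary_crossentropy(y_true, confident_wrong)
mse_right = mean_squared_error(y_true, confident_right)
mse_wrong = mean_squared_error(y_true, confident_wrong)
print(bce_right, bce_wrong)
print(mse_right, mse_wrong)
```

Both losses are larger for the wrong predictions, but crossentropy grows without bound as a confident prediction approaches the wrong label, while the squared error can never exceed 1.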

Optimizer: we use rmsprop, a gradient-descent-based optimization method.¶

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy']) # This is a classification problem, so we track accuracy.
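For intuition about what rmsprop does on each update, here is a simplified textbook form of the rule in NumPy. The function name `rmsprop_step` and the toy objective f(w) = w^2 are assumptions for this sketch; the Keras optimizer has further options (such as momentum and centering) and may place the epsilon term differently, none of which is reproduced here.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One simplified RMSprop update: scale the gradient by a running RMS of past gradients."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    w = w - learning_rate * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([1.0])
avg_sq = np.zeros_like(w)

# A single step from w = 1.
w_after_one, avg_after_one = rmsprop_step(w, 2.0 * w, avg_sq, learning_rate=0.01)

# Many steps move w toward the minimum at 0.
for _ in range(200):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq, learning_rate=0.01)

print(w_after_one, w)
```

Because the gradient is divided by its own running magnitude, the step size stays roughly on the order of the learning rate regardless of how large or small the raw gradient is.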

Configuring the optimizer or using custom losses and metrics

from tensorflow.keras import optimizers

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), # learning rate (lr is a deprecated alias)
              loss='binary_crossentropy',
              metrics=['accuracy'])

from tensorflow.keras import losses
from tensorflow.keras import metrics

model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

Validation¶

Set aside 10,000 samples for validation and train on the rest.¶

In order to monitor the accuracy of the model during training, we will create a validation set by splitting 10,000 samples from the original training data.

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

Training the model for 20 epochs in mini-batches of 512 samples.
At the same time, we will monitor loss and accuracy on the 10,000 samples in a validation set.

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20, # Each training sample is seen 20 times.
                    batch_size=512, # Each parameter update uses the average loss over a randomly drawn mini-batch of 512 samples.
                    validation_data=(x_val, y_val))

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
15000/15000 [==============================] - 3s 175us/sample - loss: 0.5061 - acc: 0.7903 - val_loss: 0.3772 - val_acc: 0.8690
Epoch 2/20
15000/15000 [==============================] - 1s 86us/sample - loss: 0.2980 - acc: 0.9057 - val_loss: 0.2975 - val_acc: 0.8904
Epoch 3/20
15000/15000 [==============================] - 1s 86us/sample - loss: 0.2161 - acc: 0.9290 - val_loss: 0.2893 - val_acc: 0.8838
Epoch 4/20
15000/15000 [==============================] - 1s 84us/sample - loss: 0.1692 - acc: 0.9459 - val_loss: 0.2770 - val_acc: 0.8883
Epoch 5/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.1364 - acc: 0.9574 - val_loss: 0.2897 - val_acc: 0.8851
Epoch 6/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.1166 - acc: 0.9640 - val_loss: 0.2985 - val_acc: 0.8851
Epoch 7/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0931 - acc: 0.9726 - val_loss: 0.3124 - val_acc: 0.8841
Epoch 8/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0773 - acc: 0.9785 - val_loss: 0.3967 - val_acc: 0.8698
Epoch 9/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0628 - acc: 0.9835 - val_loss: 0.3710 - val_acc: 0.8778
Epoch 10/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0523 - acc: 0.9870 - val_loss: 0.3978 - val_acc: 0.8770
Epoch 11/20
15000/15000 [==============================] - 1s 82us/sample - loss: 0.0411 - acc: 0.9913 - val_loss: 0.4182 - val_acc: 0.8747
Epoch 12/20
15000/15000 [==============================] - 1s 82us/sample - loss: 0.0342 - acc: 0.9927 - val_loss: 0.4644 - val_acc: 0.8708
Epoch 13/20
15000/15000 [==============================] - 1s 84us/sample - loss: 0.0270 - acc: 0.9953 - val_loss: 0.6036 - val_acc: 0.8542
Epoch 14/20
15000/15000 [==============================] - 1s 84us/sample - loss: 0.0243 - acc: 0.9946 - val_loss: 0.5095 - val_acc: 0.8710
Epoch 15/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0151 - acc: 0.9985 - val_loss: 0.5691 - val_acc: 0.8621
Epoch 16/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0141 - acc: 0.9980 - val_loss: 0.5762 - val_acc: 0.8703
Epoch 17/20
15000/15000 [==============================] - 1s 82us/sample - loss: 0.0106 - acc: 0.9989 - val_loss: 0.6099 - val_acc: 0.8678
Epoch 18/20
15000/15000 [==============================] - 1s 82us/sample - loss: 0.0081 - acc: 0.9993 - val_loss: 0.7055 - val_acc: 0.8649
Epoch 19/20
15000/15000 [==============================] - 1s 86us/sample - loss: 0.0050 - acc: 0.9998 - val_loss: 0.6848 - val_acc: 0.8625
Epoch 20/20
15000/15000 [==============================] - 1s 83us/sample - loss: 0.0093 - acc: 0.9977 - val_loss: 0.7199 - val_acc: 0.8645
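The relationship between epochs, batch size and the number of parameter updates in the run above can be worked out with a few lines of arithmetic. This sketch assumes the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_train_samples = 15000  # size of partial_x_train
batch_size = 512
epochs = 20

# Number of mini-batches (parameter updates) per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(num_train_samples / batch_size)

# The last batch of each epoch holds whatever samples are left over.
last_batch_size = num_train_samples - (steps_per_epoch - 1) * batch_size

total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch_size, total_updates)
```

So the log above corresponds to 30 updates per epoch and 600 updates in total.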

The average training loss keeps decreasing and the training accuracy keeps improving.¶

Note that the call to model.fit() returns a History object.
- This has a member history, which is a dictionary containing data about everything that happened during training.

history_dict = history.history
history_dict.keys()

dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])

Using this information, we can plot the training and validation loss (or accuracy).

How the loss changes on the training data and on the validation data. Always check this!¶

import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

The validation loss keeps rising while the training loss falls, which shows that the model is overfitting.¶

acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc_values, 'bo', label='Training acc') 
plt.plot(epochs, val_acc_values, 'b', label='Validation acc') 
plt.title('Training and validation accuracy') 
plt.xlabel('Epochs') 
plt.ylabel('Accuracy')
plt.legend()

plt.show()

The training loss decreases, and the training accuracy increases, with nearly every epoch.
- This is what we expect when running gradient-descent optimization.
But that isn't the case for the validation loss and accuracy.
- In this run, the validation loss is lowest at the fourth epoch (0.2770) and the validation accuracy is highest at the second (0.8904); both get worse as training continues.
Important note: a model that performs better on the training data isn't necessarily a model that will do better on data it has never seen before.
- This is called overfitting.
- After about the fourth epoch, the network is overoptimized on the training data, and it has learned representations that are specific to the training data.
- These representations don't generalize to data outside of the training set.
To prevent overfitting, we could stop training after four epochs.
- We will learn various techniques to mitigate overfitting later.
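Instead of reading the best epoch off the plot, it can be computed from the recorded metrics. The sketch below copies the validation values from the training log above into plain lists, so it runs without the model; with a live run, the same lists would come from `history.history['val_loss']` and `history.history['val_acc']`.

```python
# Validation metrics per epoch, copied from the 20-epoch training log above.
val_loss = [0.3772, 0.2975, 0.2893, 0.2770, 0.2897, 0.2985, 0.3124, 0.3967, 0.3710, 0.3978,
            0.4182, 0.4644, 0.6036, 0.5095, 0.5691, 0.5762, 0.6099, 0.7055, 0.6848, 0.7199]
val_acc = [0.8690, 0.8904, 0.8838, 0.8883, 0.8851, 0.8851, 0.8841, 0.8698, 0.8778, 0.8770,
           0.8747, 0.8708, 0.8542, 0.8710, 0.8621, 0.8703, 0.8678, 0.8649, 0.8625, 0.8645]

# Epochs are numbered from 1, list positions from 0.
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_acc.index(max(val_acc)) + 1

print(best_loss_epoch, min(val_loss))
print(best_acc_epoch, max(val_acc))
```

The two criteria do not have to agree; here the validation loss bottoms out at epoch 4 while the validation accuracy is highest at epoch 2.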

So about 4 epochs should be enough. Retraining from scratch for 4 epochs gives:¶

model = models.Sequential() 
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) 
model.add(layers.Dense(16, activation='relu')) 
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy', 
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512) 
results = model.evaluate(x_test, y_test)

Train on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 2s 68us/sample - loss: 0.4588 - accuracy: 0.8250
Epoch 2/4
25000/25000 [==============================] - 1s 50us/sample - loss: 0.2598 - accuracy: 0.9092
Epoch 3/4
25000/25000 [==============================] - 1s 51us/sample - loss: 0.1982 - accuracy: 0.9284
Epoch 4/4
25000/25000 [==============================] - 1s 48us/sample - loss: 0.1680 - accuracy: 0.9403
25000/25000 [==============================] - 2s 96us/sample - loss: 0.2969 - accuracy: 0.8826

results # cross-entropy loss, then accuracy

[0.29693335238456725, 0.88256]

Prediction on new data with a trained network¶

Each output is the probability that the review is positive.¶

model.predict(x_test) # Predictions on new data!

array([[0.17033087],
       [0.9994394 ],
       [0.68886703],
       ...,
       [0.13199855],
       [0.05588705],
       [0.5454024 ]], dtype=float32)
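To turn these probabilities into class labels, one common choice is to threshold them. The sketch below uses the six values printed above and a cutoff of 0.5; the cutoff is an assumption for this example, not something the model dictates.

```python
import numpy as np

# The six probabilities shown in the output above, in the same (samples, 1) layout.
probabilities = np.array([[0.17033087], [0.9994394], [0.68886703],
                          [0.13199855], [0.05588705], [0.5454024]], dtype="float32")

# 1 = positive, 0 = negative, using 0.5 as the decision threshold.
predicted_labels = (probabilities > 0.5).astype("int32").ravel()

# Distance from 0.5 is a rough indicator of how sure the network is.
confidence = np.abs(probabilities - 0.5).ravel()
least_confident = int(np.argmin(confidence))

print(predicted_labels)
print(least_confident)
```

Values near 0 or 1 are confident predictions, while a value such as 0.545 is close to a coin flip.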

Further experiments¶

Try using one or three hidden layers.
Try using layers with more hidden units or fewer hidden units.
Try using the mse loss function instead of binary_crossentropy
Try using other activation functions (e.g. tanh, sigmoid) instead of relu
Try other optimizers instead of rmsprop
